toStringz or not toStringz

Wed Jul 13 07:59:27 PDT 2011

Ok, it's clear there has been some confusion over what exactly I am  
suggesting.

I am not suggesting the compiler simply insert calls to the existing  
toStringz function, as it appears that function does not, or cannot, do  
what I am imagining.

I am suggesting the compiler perform a special operation on every char*  
argument passed to an extern (C) function.

The operation is a toStringz like operation which is (more or less) as  
follows:

1. If there is a \0 character inside foo[0..$], do nothing.
2. Otherwise, if the array's allocated memory is > the array length, place  
a \0 at foo[$].
3. Otherwise, reallocate the array memory, update foo, and place a \0 at  
foo[$].
4. Call the C function, passing foo.ptr.
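
To make the steps above concrete, here is a rough sketch in C.  It is only a model of what I have in mind, not how the compiler or toStringz actually works: the Slice struct and slice_to_cstring are names I made up, and I am assuming the array memory came from malloc so that realloc can stand in for a D reallocation.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a D char[]: pointer, length, and the size of
   the allocation behind it. None of these names come from D or Phobos. */
typedef struct {
    char  *ptr;
    size_t length;    /* number of chars in the slice, i.e. $ */
    size_t capacity;  /* bytes actually allocated behind ptr */
} Slice;

/* Returns a pointer that is safe to hand to a C function,
   or NULL if the reallocation fails. */
char *slice_to_cstring(Slice *s)
{
    /* Step 1: a \0 already inside foo[0..$] means there is nothing to do. */
    if (s->length > 0 && memchr(s->ptr, '\0', s->length) != NULL)
        return s->ptr;

    /* Step 3: no spare room, so grow the allocation and update the slice. */
    if (s->capacity <= s->length) {
        size_t new_cap = s->length + 1;
        char *p = realloc(s->ptr, new_cap);
        if (p == NULL)
            return NULL;
        s->ptr = p;
        s->capacity = new_cap;
    }

    /* Step 2 (and the end of step 3): write the terminator at foo[$]. */
    s->ptr[s->length] = '\0';

    /* Step 4: the caller passes this pointer to the C function. */
    return s->ptr;
}
```

Note that the slice length never changes, only the memory behind it, so foo still has the same contents from the D side after the call.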

So, it will handle all the following cases:

char[] foo;
.. code to populate foo ..

ucase(foo);
ucase(foo.ptr);
ucase(toStringz(foo));

The problem cases are the buffer cases I mentioned earlier, and they  
wouldn't be a problem if char was initialised to \0 as I first imagined.
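
A small C sketch of why the initial value matters for the buffer cases.  The helper names are made up for illustration; the only fact assumed from D is that char.init is 0xFF rather than 0.

```c
#include <stddef.h>
#include <string.h>

/* Default initial value of a D char (char.init). */
#define D_CHAR_INIT 0xFF

/* Step 1 of the operation: is there a \0 anywhere in buf[0..len]? */
int has_terminator(const char *buf, size_t len)
{
    return len > 0 && memchr(buf, '\0', len) != NULL;
}

/* Returns 1 when a fresh 100-char buffer, filled with the given value,
   already contains a terminator. */
int fresh_buffer_is_terminated(int fill)
{
    char foo[100];
    memset(foo, fill, sizeof foo);
    return has_terminator(foo, sizeof foo);
}
```

With a fill of 0 the check succeeds and the operation would do nothing; with a fill of D_CHAR_INIT it fails, so a terminator would have to be added past the end of a buffer the C function only wanted to write into.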

Other replies inline below..

On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer  
<schveiguy at yahoo.com> wrote:
> On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan at netmail.co.nz>  
> wrote:
>
>> On Tue, 12 Jul 2011 17:09:04 +0100, Steven Schveighoffer  
>> <schveiguy at yahoo.com> wrote:
>>
>>> On Tue, 12 Jul 2011 11:41:56 -0400, Regan Heath <regan at netmail.co.nz>  
>>> wrote:
>>>
>>>> On Tue, 12 Jul 2011 15:59:58 +0100, Steven Schveighoffer  
>>>> <schveiguy at yahoo.com> wrote:
>>>>
>>>>> On Tue, 12 Jul 2011 10:50:07 -0400, Regan Heath  
>>>>> <regan at netmail.co.nz> wrote:
>>>>
>>>>>>> What if you are expecting the function to write to the  
>>>>>>> buffer, and the compiler just made a copy of it?  Won't that be  
>>>>>>> pretty surprising?
>>>>>>
>>>>>> Assuming a C function in this form:
>>>>>>
>>>>>>    void write_to_buffer(char *buffer, int length);
>>>>>
>>>>> No, assuming a C function in this form:
>>>>>
>>>>> void ucase(char* str);
>>>>>
>>>>> Essentially, a C function which takes a writable  
>>>>> already-null-terminated string, and writes to it.
>>>>
>>>> Ok, that's an even better example for my case.
>>>>
>>>> It would be used/called like...
>>>>
>>>>    char[] foo;
>>>>    .. code which populates foo with something ..
>>>>    ucase(foo);
>>>>
>>>> and in D today this would corrupt memory.  Unless the programmer  
>>>> remembered to write:
>>>
>>> No, it wouldn't compile.  char[] does not cast implicitly to char *.   
>>> (if it does, that needs to change).
>>
>> Replace foo with foo.ptr, it makes no difference to the point I was  
>> making.
>
> Your fix does not help in that case; foo.ptr will be passed as a  
> non-null-terminated string.

No, see above.

> So, your proposal fixes the case:
>
> 1. The user tries to pass a string/char[] to a C function.  Fails to  
> compile.
> 2. Instead of trying to understand the issue, realizes the .ptr member  
> is the right type, and switches to that.
>
> It does not fix or help with cases where:
>
>   * a programmer notices the type of the parameter is char * and uses  
> foo.ptr without trying foo first. (crash)
>   * a programmer calls toStringz without going through the compile/fix  
> cycle above.
>   * a programmer tries to pass string/char[], fails to compile, then  
> looks up how to interface with C and finds toStringz
>
> I think this fix really doesn't solve a very common problem.

See above, my intention was to solve all the cases listed here as I  
suspect the compiler can detect them all, and just 'do the right thing'.

In these cases..

1. If the programmer writes foo.ptr, the compiler detects that, calls  
toStringz on 'foo' (not foo.ptr) and updates foo as required (if  
reallocation occurs).
2. If the programmer calls toStringz, this case is the same as #1 as  
toStringz returns foo.ptr (I assume).
3. If the programmer passes 'foo', the compiler calls toStringz etc.

>>>> I am assuming also that if this idea were implemented it would handle  
>>>> things intelligently, like for example if when toStringz is called  
>>>> the underlying array is out of room and needs to be reallocated, the  
>>>> compiler would update the slice/reference 'foo' in the same way as it  
>>>> already does for an append which triggers a reallocation.
>>>
>>> OK, but what if it's like this:
>>>
>>> char[] foo = new char[100];
>>> auto bar = foo;
>>>
>>> ucase(foo);
>>>
>>> In most cases, bar is also written to, but in some cases only foo is  
>>> written to.
>>>
>>> Granted, we're getting further out on the hypothetical limb here :)   
>>> But my point is, making it require explicit calling of toStringz  
>>> instead of implicit makes the code less confusing, because you  
>>> understand "oh, toStringz may reallocate, so I can't expect bar to  
>>> also get updated" vs. simply calling a function with a buffer.
>>
>> This is not a 'new' problem introduced by the idea, it's a general  
>> problem for D/arrays/slices and the same happens with an append, right?   
>> In which case it's not a reason against the idea.
>
> It's new to the features of the C function being called.  If you look up  
> the man page for such a hypothetical function, it might claim that it  
> alters the data passed in through the argument, but it seems to not be  
> the case!  So there's no way for someone (who arguably is not well  
> versed in C functions if they didn't know to use toStringz) to figure  
> out why the code seems not to do what it says it should.  Such a  
> programmer may blame either the implementation of the C function, or  
> blame the D compiler for not calling the function properly.

None of this is relevant, let me explain..

My idea is for the compiler to detect a char* parameter to an extern (C)  
function and to perform the toStringz-like operation on the argument.   
When it does so it will correctly update the slice/array being passed if  
reallocation occurs.  The C function will then write to the slice/array  
being passed.  So it's not relevant whether another slice referenced the  
array before it was reallocated, because that case is no different to  
calling a D function which does something similar, like appending to the  
passed slice/array.

In short, the end result will ALWAYS be that the passed slice/array will  
contain the output of the C function.
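
Here is a rough C model of the foo/bar case, to show what I mean.  Everything in it is an assumption for illustration: the Slice struct, terminate_by_copy and alias_diverges are made-up names, the body of ucase is my guess at what such a function does, and the old block is kept alive on purpose to imitate what a GC would do.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a D char[]. */
typedef struct { char *ptr; size_t length; size_t capacity; } Slice;

/* Assumed implementation of ucase: upper-cases a null-terminated
   string in place. */
static void ucase(char *str)
{
    for (; *str != '\0'; ++str)
        *str = (char)toupper((unsigned char)*str);
}

/* Models the implicit step for a slice with no spare room: copy into a
   larger block, terminate it, and repoint the slice. The old block is
   left alone, as it would be under a GC while bar still refers to it. */
static int terminate_by_copy(Slice *s)
{
    char *p = malloc(s->length + 1);
    if (p == NULL)
        return -1;
    memcpy(p, s->ptr, s->length);
    p[s->length] = '\0';
    s->ptr = p;
    s->capacity = s->length + 1;
    return 0;
}

/* Returns 1 if foo holds the C function's output while the older alias
   bar still holds the original data, -1 on allocation failure. */
int alias_diverges(void)
{
    char *block = malloc(3);
    if (block == NULL)
        return -1;
    memcpy(block, "abc", 3);

    Slice foo = { block, 3, 3 };
    Slice bar = foo;                 /* auto bar = foo; */

    if (terminate_by_copy(&foo) != 0) {
        free(block);
        return -1;
    }
    ucase(foo.ptr);                  /* ucase(foo); */

    int result = memcmp(foo.ptr, "ABC", 3) == 0
              && memcmp(bar.ptr, "abc", 3) == 0;
    free(foo.ptr);
    free(block);
    return result;
}
```

So bar is left behind exactly as it would be after an append which reallocates, and foo, the slice actually passed, holds the result.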

The goal is to make a call to an extern (C) function "just work" in the  
same way as calling Win32/C functions "just work" from C# .. which also  
has its own string type.

>>>>>> You might initially extern it as:
>>>>>>
>>>>>>    extern (C) void write_to_buffer(char* buffer, int length);
>>>>>>
>>>>>> And, you could call it one of 2 ways (legitimately):
>>>>>>
>>>>>>    char[] foo = new char[100];
>>>>>>    write_to_buffer(foo, foo.length);
>>>>>>
>>>>>> or:
>>>>>>
>>>>>>    char[100] foo;
>>>>>>    write_to_buffer(foo, foo.length);
>>>>>>
>>>>>> and in both cases, toStringz would do nothing as foo is zero  
>>>>>> terminated already (in both cases), or am I wrong about that?
>>>>>
>>>>> In neither case are they required to be null terminated.
>>>>
>>>> True, but I was outlining the worst case scenario for my suggestion,  
>>>> not describing the real C function requirements.
>>>
>>> No, I mean you were wrong, D does not guarantee either of those (stack  
>>> allocated or heap allocated) is null terminated.  So toStringz must  
>>> add a '\0' at the end (which is mildly expensive for heap data, and  
>>> very expensive for stack data).
>>
>> Ah, ok, this was because I had forgotten char is initialised to 0xFF.   
>> If it was initialised to \0 then both arrays would have been full of  
>> null terminators.  The default value of char is the killing blow to the  
>> idea.
>
> toStringz does not currently check for '\0' anywhere in the existing  
> string.  It simply appends '\0' to the end of the passed string.  If you  
> want it to check for '\0', how far should it go?  Doesn't this also add  
> to the overhead (looping over all chars looking for '\0')?
>
> Note also, that toStringz has old code that used to check for "one byte  
> beyond" the array, but this is commented out, because it's unreliable  
> (could cause a segfault).

So, toStringz is not as clever as I imagined.  I thought it would  
intelligently detect cases where a \0 was already present in the slice  
(from 0 to $) and, if not, put one at $ (inside pre-allocated array  
memory).  I was assuming toStringz had access to the underlying array  
allocation size and would know how far it can 'look' without causing a  
segfault.  In the case where the slice length equaled the array's reserved  
memory area, it would reallocate and place the \0 at $ (inside the newly  
allocated memory).
