toStringz or not toStringz

Steven Schveighoffer schveiguy at yahoo.com
Wed Jul 13 09:00:39 PDT 2011


On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan at netmail.co.nz>  
wrote:

>
> I am suggesting the compiler will perform a special operation on all  
> char* parameters passed to extern "C" functions.
>
> The operation is a toStringz like operation which is (more or less) as  
> follows:
>
> 1. If there is a \0 character inside foo[0..$], do nothing.

This is an O(n) operation -- too much overhead.  Especially if you already  
know foo has a 0 in it.  Note that toStringz does not have this overhead.
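
To make the cost concrete, here is a minimal sketch of the step 1 check. hasTerminator is a hypothetical helper written for illustration, not something in Phobos, and the toStringz behaviour shown is only what is described in this thread (it makes sure a \0 follows the data without scanning the contents).

```d
import std.string : toStringz;

// Hypothetical helper: the check described in step 1.
// It has to look at every element until it finds a '\0', so it is O(n).
bool hasTerminator(const(char)[] s)
{
    foreach (c; s)
        if (c == '\0')
            return true;
    return false;
}

void main()
{
    char[] foo = "hello".dup;
    assert(!hasTerminator(foo));        // had to scan all 5 chars
    assert(hasTerminator("hi\0there")); // stops at index 2

    // toStringz does not scan the contents; it only guarantees
    // that a '\0' follows the data it returns.
    auto p = toStringz(foo);
    assert(p[0 .. 5] == "hello");
    assert(p[5] == '\0');
}
```

If the caller already knows foo contains a 0, the scan in hasTerminator is pure overhead on every call.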

> 2. If the array allocated memory is > the array length, place a \0 at  
> foo[$]

Checking whether the array has spare allocated capacity requires a GC lock
and an O(lg n) search for the block info in the GC.

Not that it doesn't already happen in toStringz, but I just want to point  
out that it's not a small cost.
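
As a rough illustration of what step 2 has to ask the runtime, the sketch below uses the built-in capacity property of slices. It only shows the observable result of the lookup; how the GC finds the block info internally is not shown, and the exact capacity value for the heap case depends on the runtime.

```d
void main()
{
    // Heap-allocated array: the runtime can find the GC block behind it,
    // so it can report how far the array may grow in place.
    char[] heap = new char[5];
    assert(heap.capacity >= heap.length);

    // Slice of a stack buffer: there is no GC block to find, so no
    // room for growth in place is reported.
    char[12] stack;
    char[] s = stack[];
    assert(s.capacity == 0);
}
```

Answering that question is the lookup the proposal would perform on every call to an extern "C" function taking a char*.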

> 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
> 4. Call the C function passing foo.ptr
>
> So, it will handle all the following cases:
>
> char[] foo;
> .. code to populate foo ..
>
> ucase(foo);
> ucase(foo.ptr);

From your responses below, I gather this works because you treat
ucase(foo.ptr) as equivalent to ucase(foo)?  That still has the same
problems I listed above.

What about

char * foo;
.. code to populate foo ..
ucase(foo);

Is there still anything special done by the compiler?

> ucase(toStringz(foo));
>
> The problem cases are the buffer cases I mentioned earlier, and they  
> wouldn't be a problem if char was initialised to \0 as I first imagined.

The largest problem I've had with all this is there is a necessary  
overhead of conversion.  Not only that, but due to the way reallocation  
works, there may be a move of data.  I think it's better to require  
explicit calls incurring such overhead vs. hiding the overhead calls from  
the developer.  Especially if the overhead calls are unnecessary.

> Other replies inline below..
>
> On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer  
> <schveiguy at yahoo.com> wrote:
>> On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan at netmail.co.nz>  
>> wrote:
>>> Replace foo with foo.ptr, it makes no difference to the point I was  
>>> making.
>>
>> Your fix does not help in that case, foo.ptr will be passed as a  
>> non-null terminated string.
>
> No, see above.

How does your proposal know that a char * is part of a heap-allocated  
array?  If you are assuming the only case where char * is passed will be  
arr.ptr, then that doesn't cut it.  What if the compiler doesn't know  
where the char * came from?

The inherent problem of zero-terminated strings is that you don't know how  
long it is until you search for a zero.  If it's not properly terminated,  
then you are screwed.  That problem cannot be "solved", even with compiler  
help -- you can get situations where there is no more information other  
than the pointer.
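
A small sketch of that point, using strlen from the C runtime via core.stdc.string. The variable names are illustrative only.

```d
import core.stdc.string : strlen;

void main()
{
    // Properly terminated: the length is only discovered by walking
    // the bytes until the '\0'. D string literals are followed by a '\0'.
    const(char)* ok = "hello".ptr;
    assert(strlen(ok) == 5);

    // A slice into the middle of a larger array carries its length in
    // the slice, but the pointer alone does not.
    char[] data = "hello, world".dup;
    char* p = data[0 .. 5].ptr;
    // From p alone there is no way to recover the intended length of 5.
    // strlen(p) is deliberately not called here: it would run past
    // "hello" and past the end of data looking for a '\0'.
}
```

Once only p is left, neither the callee nor the compiler has anything to check it against.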

>> So, your proposal fixes the case:
>>
>> 1. The user tries to pass a string/char[] to a C function.  Fails to  
>> compile.
>> 2. Instead of trying to understand the issue, realizes the .ptr member  
>> is the right type, and switches to that.
>>
>> It does not fix or help with cases where:
>>
>>   * a programmer notices the type of the parameter is char * and uses  
>> foo.ptr without trying foo first. (crash)
>>   * a programmer calls toStringz without going through the compile/fix  
>> cycle above.
>>   * a programmer tries to pass string/char[], fails to compile, then  
>> looks up how to interface with C and finds toStringz
>>
>> I think this fix really doesn't solve a very common problem.
>
> See above, my intention was to solve all the cases listed here as I  
> suspect the compiler can detect them all, and just 'do the right thing'.
>
> In these cases..
>
> 1. If the programmer writes foo.ptr, the compiler detects that, calls  
> toStringz on 'foo' (not foo.ptr) and updates foo as required (if  
> reallocation occurs).

What if it's not foo.ptr?  What if it's some random char * whose origin  
the compiler isn't aware of?

> 2. If the programmer calls toStringz, this case is the same as #1 as  
> toStringz returns foo.ptr (I assume).

Huh?  Why should it do anything with toStringz?  I'm not getting this one,  
toStringz already has done the work your proposal wants to do.

>>> This is not a 'new' problem introduced by the idea, it's a general  
>>> problem for D/arrays/slices and the same happens with an append,  
>>> right?  In which case it's not a reason against the idea.
>>
>> It's new to the features of the C function being called.  If you look  
>> up the man page for such a hypothetical function, it might claim that  
>> it alters the data passed in through the argument, but it seems to not  
>> be the case!  So there's no way for someone (who arguably is not well  
>> versed in C functions if they didn't know to use toStringz) to figure  
>> out why the code seems not to do what it says it should.  Such a  
>> programmer may blame either the implementation of the C function, or  
>> blame the D compiler for not calling the function properly.
>
> None of this is relevant, let me explain..
>
> My idea is for the compiler to detect a char* parameter to an extern "C"  
> function and to call toStringz.  When it does so it will correctly  
> update the slice/array being passed if reallocation occurs.  The C  
> function will write to the slice/array being passed.  So, it's not  
> relevant if there was another slice referencing the array before it was  
> reallocated, because that case is no different to calling a D function  
> which does something similar, like appending to the passed slice/array.

What about this case?

char[12] buffer;
buffer[] = "hello, world";

ucase(buffer[]); // does nothing to buffer!

I'm saying, the charter of the function is to update a string in place,  
and your proposal is making that not true in some cases.
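
Here is a sketch of why the buffer is left untouched. ucase is a hypothetical stand-in written in D with C linkage, and the explicit append of '\0' imitates the reallocation the proposal would have to do, since a full 12-char stack buffer has no room for a terminator.

```d
import core.stdc.ctype : toupper;

// Hypothetical stand-in for the C function: uppercases in place
// until it reaches the terminating '\0'.
extern (C) void ucase(char* s)
{
    for (; *s != '\0'; ++s)
        *s = cast(char) toupper(*s);
}

void main()
{
    char[12] buffer = "hello, world";

    // No '\0' in the buffer and no spare room after it, so the only
    // way to terminate the data is to copy it somewhere larger.
    char[] copy = buffer[] ~ '\0';
    ucase(copy.ptr);

    assert(copy[0 .. 12] == "HELLO, WORLD"); // the copy was changed
    assert(buffer[] == "hello, world");      // the caller's buffer was not
}
```

The function did what its man page says, just not to the memory the caller was looking at.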

> The goal is to make a call to an extern "C" function "just work" in the  
> same way as calling Win32/C functions "just work" from C# .. which also  
> has its own string type.

This is very different.  C#'s strings are full reference types, so adding  
a '0' at the end affects all references to that string, reallocation or  
not.
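
By contrast, a D slice is just a pointer and a length held by value, so a change made through one slice's length is never seen by another. A minimal sketch:

```d
void main()
{
    char[] a = "abc".dup;
    char[] b = a;     // second slice over the same memory

    a ~= '\0';        // may extend in place or may reallocate

    assert(a.length == 4);
    assert(b.length == 3); // b never sees the appended terminator
}
```

Whether or not the append reallocated, b still describes three chars, which is why "just add the 0" cannot be made transparent for D slices.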

>> toStringz does not currently check for '\0' anywhere in the existing  
>> string.  It simply appends '\0' to the end of the passed string.  If  
>> you want it to check for '\0', how far should it go?  Doesn't this also  
>> add to the overhead (looping over all chars looking for '\0')?
>>
>> Note also, that toStringz has old code that used to check for "one byte  
>> beyond" the array, but this is commented out, because it's unreliable  
>> (could cause a segfault).
>
> So, toStringz is not as clever as I imagined.  I thought it would  
> intelligently detect cases where a \0 was already present in the slice  
> (from 0 to $) and if not, put one at $+1 (inside pre-allocated array  
> memory).  I was assuming toStringz had access to the underlying array  
> allocation size and would know how far it can 'look' without causing a  
> segfault.  In the case where the slice length equaled the array reserved  
> memory area, it would re-allocate and place the \0 at $+1 (inside the  
> newly allocated memory).

s/clever/slow/

The only "intelligent" way to check for a 0 is a linear search.

Without knowing where the data came from, there is no way to look past the  
slice without possibly causing a segfault.  If you know it's a heap  
allocation, you can look at the block information to see if you can look  
past it.  This might be possible to do for toStringz, but the linear check  
for 0 is just unacceptable for a simple function call.  Appending a 0 is  
at least amortized.  One thing though, it could make some smarter  
decisions as to whether to reallocate depending on the type of the array,  
since it is already doing a lookup of block info.

But I still always come back to the fact that I should be able to  
circumvent some auto-intelligent decision that isn't aware of things that  
a developer can be aware of (such as knowing an array already contains a  
0).  The compiler shouldn't be too intrusive here.
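
A sketch of the explicit route, assuming the developer built the array with a terminator already in it. strlen stands in for whatever C function is being called.

```d
import core.stdc.string : strlen;

void main()
{
    // The developer put the '\0' there and knows it.
    char[] foo = "hello\0".dup;

    // Passing foo.ptr directly needs no scan, no GC lookup and no
    // reallocation before the call.
    assert(strlen(foo.ptr) == 5);
}
```

An automatic conversion would repeat work here that the developer has already done by construction.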

-Steve

