toStringz or not toStringz
Steven Schveighoffer
schveiguy at yahoo.com
Wed Jul 13 09:00:39 PDT 2011
On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan at netmail.co.nz>
wrote:
>
> I am suggesting the compiler will perform a special operation on all
> char* parameters passed to extern "C" functions.
>
> The operation is a toStringz like operation which is (more or less) as
> follows:
>
> 1. If there is a \0 character inside foo[0..$], do nothing.
This is an O(n) operation -- too much overhead. Especially if you already
know foo has a 0 in it. Note that toStringz does not have this overhead.
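
A minimal sketch of the difference, assuming a hypothetical helper
hasEmbeddedZero that stands in for step 1 of the proposal (it is not part
of Phobos). The scan has to visit every byte in the worst case, while
toStringz on a char[] terminates a copy without looking at the contents:

```d
import std.string : toStringz;
import core.stdc.string : memchr, strlen;

// Hypothetical helper: the check proposed in step 1. In the worst case it
// has to look at every byte of the slice, so it is O(n).
bool hasEmbeddedZero(const(char)[] s)
{
    return s.length != 0 && memchr(s.ptr, 0, s.length) !is null;
}

void main()
{
    char[] foo = "hello".dup;
    assert(!hasEmbeddedZero(foo));   // scanned all 5 bytes to find out

    // toStringz does no such scan; it returns a pointer to terminated data.
    const(char)* p = toStringz(foo);
    assert(strlen(p) == foo.length);
}
```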
> 2. If the array allocated memory is > the array length, place a \0 at
> foo[$]
Checking whether the array has spare allocated length requires taking the
GC lock and an O(lg n) search for the block info in the GC.
Not that it doesn't already happen in toStringz, but I just want to point
out that it's not a small cost.
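
As a rough illustration, the built-in .capacity property performs this
kind of runtime lookup. The sketch below only assumes the documented
behaviour that capacity is 0 for a slice that cannot be extended in place,
such as one over stack memory:

```d
void main()
{
    char[] foo = "hello".dup;   // GC-allocated array
    // .capacity asks the runtime for the block behind the slice; this is
    // the block-info lookup discussed above.
    if (foo.capacity > foo.length)
    {
        // there would be room to place a \0 at foo[$] without reallocating
    }

    char[5] buf = "hello";
    char[] s = buf[];           // slice of stack memory, not a GC block
    assert(s.capacity == 0);    // cannot be extended in place
}
```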
> 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
> 4. Call the C function passing foo.ptr
>
> So, it will handle all the following cases:
>
> char[] foo;
> .. code to populate foo ..
>
> ucase(foo);
> ucase(foo.ptr);
From your responses below, I gather this is because you are making it
equivalent to ucase(foo)? That still has the same problems I listed above.
What about:

  char * foo;
  .. code to populate foo ..
  ucase(foo);

Is there still anything special done by the compiler?
> ucase(toStringz(foo));
>
> The problem cases are the buffer cases I mentioned earlier, and they
> wouldn't be a problem if char was initialised to \0 as I first imagined.
The largest problem I've had with all this is there is a necessary
overhead of conversion. Not only that, but due to the way reallocation
works, there may be a move of data. I think it's better to require
explicit calls incurring such overhead vs. hiding the overhead calls from
the developer. Especially if the overhead calls are unnecessary.
> Other replies inline below..
>
> On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer
> <schveiguy at yahoo.com> wrote:
>> On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan at netmail.co.nz>
>> wrote:
>>> Replace foo with foo.ptr, it makes no difference to the point I was
>>> making.
>>
>> Your fix does not help in that case, foo.ptr will be passed as a
>> non-null terminated string.
>
> No, see above.
How does your proposal know that a char * is part of a heap-allocated
array? If you are assuming the only case where char * is passed will be
arr.ptr, then that doesn't cut it. What if the compiler doesn't know
where the char * came from?
The inherent problem of zero-terminated strings is that you don't know how
long one is until you search for the zero. If it's not properly terminated,
then you are screwed. That problem cannot be "solved", even with compiler
help -- you can get situations where there is no more information other
than the pointer.
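
To make that concrete, here is a small sketch using getenv from the C
standard library as one example of a char * whose origin the compiler
cannot know. The choice of getenv and of the "PATH" variable is only an
assumption for illustration:

```d
import core.stdc.stdlib : getenv;
import core.stdc.string : strlen;

void main()
{
    // A char* that did not come from any D array: all we have is an
    // address, with no length and no GC block to inspect.
    char* p = getenv("PATH");
    if (p !is null)
    {
        // The only way to learn the length is to walk to the terminator,
        // trusting that the C side put one there.
        size_t n = strlen(p);
        char[] view = p[0 .. n];   // length supplied explicitly by us
        assert(view.length == n);
    }
}
```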
>> So, your proposal fixes the case:
>>
>> 1. The user tries to pass a string/char[] to a C function. Fails to
>> compile.
>> 2. Instead of trying to understand the issue, realizes the .ptr member
>> is the right type, and switches to that.
>>
>> It does not fix or help with cases where:
>>
>> * a programmer notices the type of the parameter is char * and uses
>> foo.ptr without trying foo first. (crash)
>> * a programmer calls toStringz without going through the compile/fix
>> cycle above.
>> * a programmer tries to pass string/char[], fails to compile, then
>> looks up how to interface with C and finds toStringz
>>
>> I think this fix really doesn't solve a very common problem.
>
> See above, my intention was to solve all the cases listed here as I
> suspect the compiler can detect them all, and just 'do the right thing'.
>
> In these cases..
>
> 1. If the programmer writes foo.ptr, the compiler detects that, calls
> toStringz on 'foo' (not foo.ptr) and updates foo as required (if
> reallocation occurs).
What if it's not foo.ptr? What if it's some random char * whose origin
the compiler isn't aware of?
> 2. If the programmer calls toStringz, this case is the same as #1 as
> toStringz returns foo.ptr (I assume).
Huh? Why should it do anything with toStringz? I'm not getting this one,
toStringz already has done the work your proposal wants to do.
>>> This is not a 'new' problem introduced by the idea, it's a general
>>> problem for D/arrays/slices and the same happens with an append,
>>> right? In which case it's not a reason against the idea.
>>
>> It's new to the features of the C function being called. If you look
>> up the man page for such a hypothetical function, it might claim that
>> it alters the data passed in through the argument, but it seems to not
>> be the case! So there's no way for someone (who arguably is not well
>> versed in C functions if they didn't know to use toStringz) to figure
>> out why the code seems not to do what it says it should. Such a
>> programmer may blame either the implementation of the C function, or
>> blame the D compiler for not calling the function properly.
>
> None of this is relevant, let me explain..
>
> My idea is for the compiler to detect a char* parameter to an extern "C"
> function and to call toStringz. When it does so it will correctly
> update the slice/array being passed if reallocation occurs. The C
> function will write to the slice/array being passed. So, it's not
> relevant if there was another slice referencing the array before it was
> reallocated, because that case is no different to calling a D function
> which does something similar, like appending to the passed slice/array.
What about this case?

  char[12] buffer;
  buffer[] = "hello, world";
  ucase(buffer[]); // does nothing to buffer!

I'm saying, the charter of the function is to update a string in place,
and your proposal is making that not true in some cases.
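
A sketch of why the buffer is left untouched, assuming ucase is a C
function that upper-cases a zero-terminated string in place (here it is
written in D as a hypothetical stand-in) and assuming the implicit
conversion has to copy because the buffer has no room for a terminator:

```d
import core.stdc.ctype : toupper;

// Hypothetical stand-in for the C function discussed in the thread:
// upper-cases a zero-terminated string in place.
extern (C) void ucase(char* s)
{
    for (; *s; ++s)
        *s = cast(char) toupper(*s);
}

void main()
{
    char[12] buffer = "hello, world";   // completely full, no room for \0

    // What an implicit conversion would have to do: copy and terminate.
    char[] tmp = buffer[] ~ '\0';
    ucase(tmp.ptr);

    assert(tmp[0 .. 12] == "HELLO, WORLD");   // the copy was modified
    assert(buffer[] == "hello, world");       // the original was not
}
```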
> The goal is to make a call to an extern "C" function "just work" in the
> same way as calling Win32/C functions "just work" from C# .. which also
> has its own string type.
This is very different. C#'s strings are full reference types, so adding
a '0' at the end affects all references to that string, reallocation or
not.
>> toStringz does not currently check for '\0' anywhere in the existing
>> string. It simply appends '\0' to the end of the passed string. If
>> you want it to check for '\0', how far should it go? Doesn't this also
>> add to the overhead (looping over all chars looking for '\0')?
>>
>> Note also, that toStringz has old code that used to check for "one byte
>> beyond" the array, but this is commented out, because it's unreliable
>> (could cause a segfault).
>
> So, toStringz is not as clever as I imagined. I thought it would
> intelligently detect cases where a \0 was already present in the slice
> (from 0 to $) and if not, put one at $+1 (inside pre-allocated array
> memory). I was assuming toStringz had access to the underlying array
> allocation size and would know how far it can 'look' without causing a
> segfault. In the case where the slice length equaled the array reserved
> memory area, it would re-allocate and place the \0 at $+1 (inside the
> newly allocated memory).
s/clever/slow/
The only "intelligent" way to check for a 0 is a linear search.
Without knowing where the data came from, there is no way to look past the
slice without possibly causing a segfault. If you know it's a heap
allocation, you can look at the block information to see if you can look
past it. This might be possible to do for toStringz, but the linear check
for 0 is just unacceptable for a simple function call. Appending a 0 is
at least amortized. One thing though, it could make some smarter
decisions as to whether to reallocate depending on the type of the array,
since it is already doing a lookup of block info.
But I still always come back to the fact that I should be able to
circumvent some auto-intelligent decision that isn't aware of things that
a developer can be aware of (such as knowing an array already contains a
0). The compiler shouldn't be too intrusive here.
-Steve