toStringz or not toStringz

Wed Jul 13 10:16:25 PDT 2011

On 2011-07-13 09:00, Steven Schveighoffer wrote:
> On Wed, 13 Jul 2011 10:59:27 -0400, Regan Heath <regan at netmail.co.nz>
> wrote:
> > I am suggesting the compiler will perform a special operation on all
> > char* parameters passed to extern "C" functions.
> > 
> > The operation is a toStringz like operation which is (more or less) as
> > follows:
> > 
> > 1. If there is a \0 character inside foo[0..$], do nothing.
> 
> This is an O(n) operation -- too much overhead. Especially if you already
> know foo has a 0 in it. Note that toStringz does not have this overhead.
> 
> > 2. If the array allocated memory is > the array length, place a \0 at
> > foo[$]
> 
> The check to see if the array has allocated length requires a GC lock and
> an O(lg n) search for the block info in the GC.
> 
> Not that it doesn't already happen in toStringz, but I just want to point
> out that it's not a small cost.
> 
> > 3. Reallocate the array memory, updating foo, place a \0 at foo[$]
> > 4. Call the C function passing foo.ptr
> > 
> > So, it will handle all the following cases:
> > 
> > char[] foo;
> > .. code to populate foo ..
> > 
> > ucase(foo);
> > ucase(foo.ptr);
> 
> I read in your responses below, this is due to you making this equivalent
> to ucase(foo)? This still has the same problems I listed above.
> 
> What about
> 
> char * foo;
> .. code to populate foo ..
> ucase(foo);
> 
> Is there still anything special done by the compiler?
> 
> > ucase(toStringz(foo));
> > 
> > The problem cases are the buffer cases I mentioned earlier, and they
> > wouldn't be a problem if char was initialised to \0 as I first imagined.
> 
> The largest problem I've had with all this is there is a necessary
> overhead of conversion. Not only that, but due to the way reallocation
> works, there may be a move of data. I think it's better to require
> explicit calls incurring such overhead vs. hiding the overhead calls from
> the developer. Especially if the overhead calls are unnecessary.
> 
> > Other replies inline below..
> > 
> > On Tue, 12 Jul 2011 18:28:56 +0100, Steven Schveighoffer
> > <schveiguy at yahoo.com> wrote:
> >> On Tue, 12 Jul 2011 13:00:41 -0400, Regan Heath <regan at netmail.co.nz>
> >> wrote:
> >>> Replace foo with foo.ptr, it makes no difference to the point I was
> >>> making.
> >> 
> >> Your fix does not help in that case, foo.ptr will be passed as a
> >> non-null terminated string.
> > 
> > No, see above.
> 
> How does your proposal know that a char * is part of a heap-allocated
> array? If you are assuming the only case where char * is passed will be
> arr.ptr, then that doesn't cut it. What if the compiler doesn't know
> where the char * came from?
> 
> The inherent problem of zero-terminated strings is that you don't know how
> long it is until you search for a zero. If it's not properly terminated,
> then you are screwed. That problem cannot be "solved", even with compiler
> help -- you can get situations where there is no more information other
> than the pointer.
> 
> >> So, your proposal fixes the case:
> >> 
> >> 1. The user tries to pass a string/char[] to a C function. Fails to
> >> compile.
> >> 2. Instead of trying to understand the issue, realizes the .ptr member
> >> is the right type, and switches to that.
> >> 
> >> It does not fix or help with cases where:
> >> * a programmer notices the type of the parameter is char * and uses
> >>   foo.ptr without trying foo first. (crash)
> >> * a programmer calls toStringz without going through the compile/fix
> >>   cycle above.
> >> * a programmer tries to pass string/char[], fails to compile, then
> >>   looks up how to interface with C and finds toStringz
> >> 
> >> I think this fix really doesn't solve a very common problem.
> > 
> > See above, my intention was to solve all the cases listed here as I
> > suspect the compiler can detect them all, and just 'do the right thing'.
> > 
> > In these cases..
> > 
> > 1. If the programmer writes foo.ptr, the compiler detects that, calls
> > toStringz on 'foo' (not foo.ptr) and updates foo as required (if
> > reallocation occurs).
> 
> What if it's not foo.ptr? What if it's some random char * whose origin
> the compiler isn't aware of?
> 
> > 2. If the programmer calls toStringz, this case is the same as #1 as
> > toStringz returns foo.ptr (I assume).
> 
> Huh? Why should it do anything with toStringz? I'm not getting this one,
> toStringz already has done the work your proposal wants to do.
> 
> >>> This is not a 'new' problem introduced by the idea, it's a general
> >>> problem for D/arrays/slices and the same happens with an append,
> >>> right? In which case it's not a reason against the idea.
> >> 
> >> It's new to the features of the C function being called. If you look
> >> up the man page for such a hypothetical function, it might claim that
> >> it alters the data passed in through the argument, but it seems to not
> >> be the case! So there's no way for someone (who arguably is not well
> >> versed in C functions if they didn't know to use toStringz) to figure
> >> out why the code seems not to do what it says it should. Such a
> >> programmer may blame either the implementation of the C function, or
> >> blame the D compiler for not calling the function properly.
> > 
> > None of this is relevant, let me explain..
> > 
> > My idea is for the compiler to detect a char* parameter to an extern "C"
> > function and to call toStringz. When it does so it will correctly
> > update the slice/array being passed if reallocation occurs. The C
> > function will write to the slice/array being passed. So, it's not
> > relevant if there was another slice referencing the array before it was
> > reallocated, because that case is no different to calling a D function
> > which does something similar, like appending to the passed slice/array.
> 
> What about this case?
> 
> char buffer[12];
> buffer[] = "hello, world";
> 
> ucase(buffer[]); // does nothing to buffer!
> 
> I'm saying, the charter of the function is to update a string in place,
> and your proposal is making that not true in some cases.
> 
> > The goal is to make a call to an extern "C" function "just work" in the
> > same way as calling Win32/C functions "just work" from C# .. which also
> > has its own string type.
> 
> This is very different. C#'s strings are full reference types, so adding
> a '0' at the end affects all references to that string, reallocation or
> not.
> 
> >> toStringz does not currently check for '\0' anywhere in the existing
> >> string. It simply appends '\0' to the end of the passed string. If
> >> you want it to check for '\0', how far should it go? Doesn't this also
> >> add to the overhead (looping over all chars looking for '\0')?
> >> 
> >> Note also, that toStringz has old code that used to check for "one byte
> >> beyond" the array, but this is commented out, because it's unreliable
> >> (could cause a segfault).
> > 
> > So, toStringz is not as clever as I imagined. I thought it would
> > intelligently detect cases where a \0 was already present in the slice
> > (from 0 to $) and if not, put one at $+1 (inside pre-allocated array
> > memory). I was assuming toStringz had access to the underlying array
> > allocation size and would know how far it can 'look' without causing a
> > segfault. In the case where the slice length equaled the array reserved
> > memory area, it would re-allocate and place the \0 at $+1 (inside the
> > newly allocated memory).
> 
> s/clever/slow/
> 
> The only "intelligent" way to check for a 0 is a linear search.
> 
> Without knowing where the data came from, there is no way to look past the
> slice without possibly causing a segfault. If you know it's a heap
> allocation, you can look at the block information to see if you can look
> past it. This might be possible to do for toStringz, but the linear check
> for 0 is just unacceptable for a simple function call. Appending a 0 is
> at least amortized. One thing though, it could make some smarter
> decisions as to whether to reallocate depending on the type of the array,
> since it is already doing a lookup of block info.
> 
> But I still always come back to the fact that I should be able to
> circumvent some auto-intelligent decision that isn't aware of things that
> a developer can be aware of (such as knowing an array already contains a
> 0). The compiler shouldn't be too intrusive here.

Andrej Mitrovic found a rather annoying issue (fortunately one that is highly
unlikely to come up in practice) with toStringz and toUTFz checking for a
terminating '\0' one past the end of the string (which both functions do under
some circumstances). You might want to have a look at it:

https://github.com/D-Programming-Language/phobos/pull/123

Given what you know about the GC and arrays, your thoughts on the matter would 
be welcome.
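
For illustration only, here is a minimal sketch of the general kind of hazard
that a "peek one past the end" check can run into. It is not the Phobos code
and is not taken from the pull request: peekingToStringz and demo are
hypothetical names, and the sketch assumes the byte after the slice is readable
because the slice sits inside a larger buffer.

```d
import core.stdc.string : strlen;

// Hypothetical helper, NOT the Phobos implementation: skip the copy when the
// byte one past the slice already happens to be '\0'.
const(char)* peekingToStringz(const(char)[] s)
{
    // Assumption: the caller guarantees s.ptr[s.length] is readable memory.
    if (s.ptr[s.length] == '\0')
        return s.ptr;                 // no copy: relies on a byte we do not own

    auto copy = new char[s.length + 1];
    copy[0 .. s.length] = s[];
    copy[s.length] = '\0';
    return copy.ptr;
}

// Returns the C string length seen before and after the neighbouring byte changes.
size_t[2] demo()
{
    char[8] storage = '\0';           // all eight bytes start as '\0'
    storage[0 .. 5] = "hello";
    const(char)[] slice = storage[0 .. 5];

    auto p = peekingToStringz(slice); // storage[5] is '\0', so no copy is made
    immutable before = strlen(p);
    storage[5] = '!';                 // something else writes just past the slice
    immutable after = strlen(p);      // the "terminator" has now moved
    return [before, after];
}

void main()
{
    auto r = demo();
    assert(r[0] == 5);
    assert(r[1] == 6);
}
```

The returned pointer is only valid as a C string for as long as nothing
overwrites the byte after the slice, which the function has no way to
guarantee. Copying unconditionally avoids that at the price of an allocation.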

- Jonathan M Davis