[challenge] can you break wstring's back?

Wed Nov 24 00:46:26 PST 2010

On 24.11.2010 04:08, Steven Schveighoffer wrote:
> I am working on a string implementation that enforces the correct
> restrictions on a string (bi-directional range, etc), and I came across
> what I feel is a bug.
>
> However, I don't know enough about utf to construct a test case to prove
> it wrong.
>
> In std.array, there are separate functions for array.popBack(),
> depending on whether the array is a char[], a wchar[], or any other
> array type. The char[] and wchar[] popBacks are drastically different.
>
> However, there is only one back() function for narrow strings which
> supposedly handles both char[] and wchar[]. It looks like it will parse
> 1, 2, 3, or 4 elements depending on the bit pattern, and it's only
> looking at the least significant 8 bits of the elements to determine
> this. Does this make sense for wstring? I would think the wstring has a
> different way of decoding data than the string, otherwise why the two
> different popBacks?
>
> I don't know how to construct a string which shows there is an issue, is
> there one? If so, can you prove it with a unit test?

Here you go:

import std.array;
import std.conv;

void main() {
     dchar c = cast(dchar) 0x10000;
     auto ws = to!wstring(c);
     assert(ws.length == 2);            // encoded as a surrogate pair
     assert(ws.back == c);              // fails with decoding error
}
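
To see why this input trips the check, it may help to look at the two code units themselves. The sketch below assumes, as described in the quoted text, that back() only inspects the low 8 bits of the last element, in the style of a UTF-8 continuation-byte test; looksLikeUtf8Continuation is a made-up helper for illustration, not a Phobos function.

```d
import std.conv : to;

// Hypothetical helper: the UTF-8 continuation-byte test (10xxxxxx),
// applied to the low bits of a single code unit.
bool looksLikeUtf8Continuation(wchar unit)
{
    return (unit & 0xC0) == 0x80;
}

unittest
{
    auto ws = to!wstring(cast(dchar) 0x10000);
    assert(ws[0] == 0xD800); // lead (high) surrogate
    assert(ws[1] == 0xDC00); // trail (low) surrogate

    // The low byte of the trail surrogate is 0x00, so a UTF-8-style
    // check treats it as a complete one-element character and only
    // ws[1] would be handed to the decoder.
    assert(!looksLikeUtf8Continuation(ws[1]));
}
```

With only the trail surrogate passed on, the decoder has nothing valid to work with, which matches the decoding error above.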

>
> Hint, the bit pattern of the end of the string must 'trick' the function
> into using the wrong number of elements, because ones that happen to
> match the correct number of elements needed will not cause an error
> (after deciding how many elements to decode, the data is passed to the
> decode function, which should do the right thing).
>
> As a bonus, can you write a correct wstring.back function so I can
> include it in my string struct? :)

Use the same logic as in popBack for wstring, i.e. check whether the
last wchar is the low (trailing) half of a surrogate pair, i.e. between
0xDC00 and 0xDFFF inclusive. If so, the last two wchars are needed to
decode the dchar. Otherwise, only one is needed.
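
A minimal sketch of that logic, in case it is useful for the string struct. The name backUtf16 is invented here, and this is not the std.array implementation; it assumes std.utf.decode does the actual decoding and validation once the start index is chosen.

```d
import std.utf : decode;

// Hypothetical free function: returns the last code point of a
// UTF-16 string without modifying it.
dchar backUtf16(const(wchar)[] s)
{
    assert(s.length > 0, "back called on an empty string");

    size_t start = s.length - 1;
    immutable wchar last = s[start];

    // A trail (low) surrogate preceded by a lead (high) surrogate
    // means the code point occupies the last two wchars.
    if (last >= 0xDC00 && last <= 0xDFFF && s.length >= 2)
    {
        immutable wchar prev = s[start - 1];
        if (prev >= 0xD800 && prev <= 0xDBFF)
            --start;
    }

    // Anything malformed (e.g. an unpaired surrogate) is left for
    // decode to validate.
    size_t idx = start;
    return decode(s, idx);
}

unittest
{
    assert(backUtf16("ab"w) == 'b');
    assert(backUtf16("a\U00010000"w) == cast(dchar) 0x10000);
}
```

Checking the preceding wchar as well keeps a stray trail surrogate from silently returning the character before it.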

>
> -Steve