Today's programming challenge - How's your Range-Fu?

Chris via Digitalmars-d digitalmars-d at puremagic.com
Sat Apr 18 07:00:38 PDT 2015


On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
> On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
>> On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
>> >On 2015-04-18 12:27, Walter Bright wrote:
>> >
>> >>That doesn't make sense to me, because the umlauts and the accented
>> >>e all have Unicode code point assignments.
>> >
>> >This code snippet demonstrates the problem:
>> >
>> >import std.stdio;
>> >
>> >void main ()
>> >{
>> >    dstring a = "e\u0301";
>> >    dstring b = "é";
>> >    assert(a != b);
>> >    assert(a.length == 2);
>> >    assert(b.length == 1);
>> >    writeln(a, " ", b); // writeln, not writefln: a is data, not a format string
>> >}
>> >
>> >If you run the above code, all asserts should pass. If your system
>> >correctly supports Unicode (works on OS X 10.10), the two printed
>> >characters should look exactly the same.
>> >
>> >\u0301 is the "combining acute accent" [1].
>> >
>> >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
>> 
>> Yep, this was the cause of some bugs I had in my program. The thing is,
>> you never know whether a text is composed or decomposed, so you have to
>> be prepared for "é" to have length 2 or 1. On OS X these characters are
>> automatically decomposed by default. So if you pipe it through the
>> system, an "é" (length=1) automatically becomes "e\u0301" (length=2).
>> The same goes for file names on OS X. I've had to find a workaround for
>> this more than once.
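
A sketch of the kind of workaround I mean for file names (HFS+ stores
file names in a decomposed form, so I compare both sides in NFD):

import std.uni : normalize, NFD;

// True if a user-supplied name matches a name returned by the system,
// regardless of composed/decomposed spelling.
bool sameFileName(string fromUser, string fromSystem)
{
    return normalize!NFD(fromUser) == normalize!NFD(fromSystem);
}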
>
> Wait, I thought the recommended approach is to normalize first, then do
> string processing later? Normalizing first will eliminate
> inconsistencies of this sort and allow string-processing code to use a
> uniform approach to handling the string. I don't think it's a good idea
> to manually deal with composed/decomposed issues within every
> individual string function.
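
Right, and in code the normalize-first approach looks roughly like this
(a minimal sketch; NFC is just the form I picked for illustration):

import std.uni : normalize, NFC;

void main()
{
    dstring a = "e\u0301"; // decomposed
    dstring b = "é";       // precomposed
    assert(a != b);                               // raw code points differ
    assert(normalize!NFC(a) == normalize!NFC(b)); // equal after NFC
    assert(normalize!NFC(a).length == 1);         // back to one code point
}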
>
> Of course, even after normalization, you still have the issue of
> zero-width characters and combining diacritics, because not every
> language has precomposed characters handy.
>
> Using byGrapheme, within the current state of Phobos, is still the best
> bet for correctly counting the number of printed columns as opposed to
> the number of "characters" (which, in the Unicode definition, does not
> always match the layman's notion of "character"). Unfortunately,
> byGrapheme may allocate, which fails Walter's requirements.
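
To illustrate the difference with a small sketch (the counts below are
for a UTF-8 string):

import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    string s = "noe\u0301l";              // 'e' + combining acute accent
    assert(s.length == 6);                // UTF-8 code units
    assert(s.walkLength == 5);            // code points (auto-decoded)
    assert(s.byGrapheme.walkLength == 4); // user-perceived characters
}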
>
> Well, to be fair, byGrapheme only *occasionally* allocates -- only for
> input with unusually long sequences of combining diacritics -- so for
> normal use cases you'll pretty much never have any allocations. But the
> language can't express the idea of "occasionally allocates"; there is
> only "allocates" or "@nogc", which makes byGrapheme unusable in @nogc
> code.
>
> One possible solution would be to modify std.uni.graphemeStride to not
> allocate, since it shouldn't need to do so just to compute the length
> of the next grapheme.
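
That would be nice. For anyone following along, graphemeStride walks a
string one grapheme at a time without building Grapheme objects; a
sketch of counting graphemes that way, assuming the documented API:

import std.uni : graphemeStride;

size_t countGraphemes(string s)
{
    size_t n, i;
    while (i < s.length)
    {
        i += graphemeStride(s, i); // code units in the grapheme at i
        ++n;
    }
    return n;
}

void main()
{
    assert(countGraphemes("noe\u0301l") == 4);
}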
>
>
> T

This is why on OS X I always normalized strings to composed form.
However, there are always issues with Unicode because, as you said, the
layman's notion of a character is not the same as Unicode's. I wrote a
utility function that uses byGrapheme and byCodePoint. It adds a bit of
overhead, but I always get the correct length and character access
(e.g. if txt.startsWith("é")).
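
Roughly along these lines (a sketch, not my exact code; the helper
names are made up):

import std.algorithm : startsWith;
import std.range : walkLength;
import std.uni : byCodePoint, byGrapheme, normalize, NFC;

// Length in user-perceived characters rather than code units.
size_t graphemeLength(string s)
{
    return s.byGrapheme.walkLength;
}

// startsWith that ignores composed/decomposed differences.
bool startsWithGrapheme(string text, string prefix)
{
    return normalize!NFC(text).byGrapheme.byCodePoint
        .startsWith(normalize!NFC(prefix).byGrapheme.byCodePoint);
}

void main()
{
    assert(graphemeLength("e\u0301") == 1);
    assert(startsWithGrapheme("e\u0301tat", "é"));
}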

