Today's programming challenge - How's your Range-Fu ?

Sat Apr 18 06:27:24 PDT 2015

On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
> On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
> >On 2015-04-18 12:27, Walter Bright wrote:
> >
> >>That doesn't make sense to me, because the umlauts and the accented
> >>e all have Unicode code point assignments.
> >
> >This code snippet demonstrates the problem:
> >
> >import std.stdio;
> >
> >void main ()
> >{
> >    dstring a = "e\u0301";
> >    dstring b = "é";
> >    assert(a != b);
> >    assert(a.length == 2);
> >    assert(b.length == 1);
> >    writeln(a, " ", b);
> >}
> >
> >If you run the above code all asserts should pass. If your system
> >correctly supports Unicode (works on OS X 10.10) the two printed
> >characters should look exactly the same.
> >
> >\u0301 is the "combining acute accent" [1].
> >
> >[1] http://www.fileformat.info/info/unicode/char/0301/index.htm
> 
> Yep, this was the cause of some bugs I had in my program. The thing is
> you never know, if a text is composed or decomposed, so you have to be
> prepared that "é" has length 2 or 1. On OS X these characters are
> automatically decomposed by default. So if you pipe it through the
> system an "é" (length=1) automatically becomes "e\u0301" (length=2).
> Same goes for file names on OS X. I've had to find a workaround for
> this more than once.

Wait, I thought the recommended approach is to normalize first, then do
string processing later? Normalizing first will eliminate
inconsistencies of this sort, and allow string-processing code to use a
uniform approach to handling the string. I don't think it's a good idea
to manually deal with composed/decomposed issues within every individual
string function.
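
A minimal sketch of the normalize-first idea, assuming std.uni.normalize
with the NFC form is acceptable for the data at hand. The helper name
`canonical` is made up for illustration, and note that normalize may
return a newly allocated string when the input is not already normalized:

```d
import std.stdio : writeln;
import std.uni : normalize, NFC;

// Hypothetical helper: bring a string into NFC so that composed and
// decomposed spellings of the same text compare equal.
dstring canonical(dstring s)
{
    return normalize!NFC(s);
}

void main()
{
    dstring decomposed = "e\u0301"; // 'e' + combining acute accent
    dstring composed = "\u00E9";    // precomposed 'é'

    // The raw strings differ in both content and length...
    assert(decomposed != composed);
    assert(decomposed.length == 2 && composed.length == 1);

    // ...but after normalization they are the same single code point.
    assert(canonical(decomposed) == canonical(composed));
    assert(canonical(decomposed).length == 1);

    writeln(canonical(decomposed), " ", canonical(composed));
}
```

Doing this once at the input boundary means the rest of the
string-processing code sees only one spelling.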

Of course, even after normalization, you still have the issue of
zero-width characters and combining diacritics, because not every
combination of base letter and diacritic has a precomposed character.

Given the current state of Phobos, byGrapheme is still the best bet for
correctly counting the number of printed columns, as opposed to the
number of "characters" (which, in the Unicode definition, does not
always match the layman's notion of "character"). Unfortunately,
byGrapheme may allocate, which fails Walter's requirements.

Well, to be fair, byGrapheme only *occasionally* allocates -- only for
input with unusually long sequences of combining diacritics -- so for
normal use cases you'll pretty much never have any allocations. But the
language can't express the idea of "occasionally allocates"; there is
only "allocates" or "@nogc". That makes byGrapheme unusable in @nogc
code.

One possible solution would be to modify std.uni.graphemeStride to not
allocate, since it shouldn't need to do so just to compute the length of
the next grapheme.
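
For illustration, a counting loop built on graphemeStride might look
like the sketch below. The function name `countGraphemes` is made up,
and I have deliberately not marked it @nogc, since that only becomes
possible if graphemeStride itself is guaranteed not to allocate:

```d
import std.stdio : writeln;
import std.uni : graphemeStride;

// Hypothetical helper: walk the string one grapheme at a time, using
// graphemeStride to find how many code units the next grapheme spans.
size_t countGraphemes(string s)
{
    size_t count = 0;
    for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
        ++count;
    return count;
}

void main()
{
    // 'A' + combining ring above forms a single grapheme.
    assert(countGraphemes("A\u030Arhus") == 5);
    assert(countGraphemes("e\u0301") == 1);
    assert(countGraphemes("") == 0);

    writeln(countGraphemes("A\u030Arhus"));
}
```

No Grapheme objects are constructed here; the loop only asks for the
length of the next cluster, which is all that counting needs.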

T

-- 
Just because you survived after you did it, doesn't mean it wasn't stupid!