Today's programming challenge - How's your Range-Fu?
Chris via Digitalmars-d
digitalmars-d at puremagic.com
Sat Apr 18 07:00:38 PDT 2015
On Saturday, 18 April 2015 at 13:30:09 UTC, H. S. Teoh wrote:
> On Sat, Apr 18, 2015 at 11:52:50AM +0000, Chris via Digitalmars-d wrote:
>> On Saturday, 18 April 2015 at 11:35:47 UTC, Jacob Carlborg wrote:
>> >On 2015-04-18 12:27, Walter Bright wrote:
>> >
>> >>That doesn't make sense to me, because the umlauts and the
>> >>accented e all have Unicode code point assignments.
>> >
>> >This code snippet demonstrates the problem:
>> >
>> >import std.stdio;
>> >
>> >void main ()
>> >{
>> > dstring a = "e\u0301";
>> > dstring b = "é";
>> > assert(a != b);
>> > assert(a.length == 2);
>> > assert(b.length == 1);
>> > writeln(a, " ", b);
>> >}
>> >
>> >If you run the above code, all asserts should pass. If your system
>> >correctly supports Unicode (works on OS X 10.10), the two printed
>> >characters should look exactly the same.
>> >
>> >\u0301 is the "combining acute accent" [1].
>> >
>> >[1]
>> >http://www.fileformat.info/info/unicode/char/0301/index.htm
>>
>> Yep, this was the cause of some bugs I had in my program. The thing
>> is you never know whether a text is composed or decomposed, so you
>> have to be prepared for "é" having length 2 or 1. On OS X these
>> characters are automatically decomposed by default. So if you pipe
>> it through the system, an "é" (length=1) automatically becomes
>> "e\u0301" (length=2). Same goes for file names on OS X. I've had to
>> find a workaround for this more than once.
>
> Wait, I thought the recommended approach is to normalize first, then
> do string processing later? Normalizing first will eliminate
> inconsistencies of this sort and allow string-processing code to use
> a uniform approach to handling the string. I don't think it's a good
> idea to manually deal with composed/decomposed issues within every
> individual string function.
>
> Of course, even after normalization you still have the issue of
> zero-width characters and combining diacritics, because not every
> language has precomposed characters handy.
>
> Using byGrapheme, within the current state of Phobos, is still the
> best bet for correctly counting the number of printed columns, as
> opposed to the number of "characters" (which, in the Unicode
> definition, does not always match the layman's notion of
> "character"). Unfortunately, byGrapheme may allocate, which fails
> Walter's requirements.
>
> Well, to be fair, byGrapheme only *occasionally* allocates -- only
> for input with unusually long sequences of combining diacritics --
> so for normal use cases you'll pretty much never have any
> allocations. But the language can't express the idea of
> "occasionally allocates"; there is only "allocates" or "@nogc",
> which makes it unusable in @nogc code.
>
> One possible solution would be to modify std.uni.graphemeStride to
> not allocate, since it shouldn't need to do so just to compute the
> length of the next grapheme.
>
>
> T
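On the graphemeStride point: if all you need is to count graphemes,
you can already walk the string manually with it. A rough sketch (the
helper name is mine; per your caveat, the current implementation can
still allocate for unusually long combining sequences):

import std.uni : graphemeStride;

// Count user-perceived characters by stepping from one grapheme
// boundary to the next. graphemeStride returns the number of code
// units occupied by the grapheme starting at the given index.
size_t countGraphemes(string s)
{
    size_t count, i;
    while (i < s.length)
    {
        i += graphemeStride(s, i);
        ++count;
    }
    return count;
}

unittest
{
    assert(countGraphemes("e\u0301") == 1); // two code points, one grapheme
}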
This is why on OS X I always normalized strings to composed form
(NFC).
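For reference, the normalization step looks roughly like this with
std.uni.normalize (a minimal sketch; the variable names are mine):

import std.uni : normalize, NFC, NFD;

void main()
{
    dstring decomposed = "e\u0301";            // 'e' + U+0301 combining acute
    auto composed = normalize!NFC(decomposed); // one code point, U+00E9

    assert(decomposed.length == 2);
    assert(composed.length == 1);
    assert(normalize!NFD(composed) == decomposed); // round-trips back
}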
However, there are always issues with Unicode, because, as you said,
the layman's notion of a character is not the same as Unicode's. I
wrote a utility function that uses byGrapheme and byCodePoint. It adds
a bit of overhead, but I always get the correct length and character
access (e.g. when checking txt.startsWith("é")).
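The gist of it is roughly this (a sketch, not the actual code; the
helper names are made up):

import std.algorithm : startsWith;
import std.array : array;
import std.range : take, walkLength;
import std.uni : byGrapheme, byCodePoint, normalize, NFC;

// Length in user-perceived characters, composed or decomposed.
size_t charLength(string txt)
{
    return txt.byGrapheme.walkLength;
}

// First n user-perceived characters, flattened back to code points.
dstring firstChars(string txt, size_t n)
{
    return txt.byGrapheme.take(n).byCodePoint.array.idup;
}

// Prefix test that tolerates composed/decomposed differences.
bool startsWithChar(string txt, string prefix)
{
    return normalize!NFC(txt).startsWith(normalize!NFC(prefix));
}

unittest
{
    assert(charLength("e\u0301") == 1);
    assert(firstChars("e\u0301tait", 1) == "e\u0301"d);
    assert(startsWithChar("e\u0301tait", "é")); // decomposed vs composed
}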