Unicode handling comparison

Wed Nov 27 12:39:47 PST 2013

On Wednesday, 27 November 2013 at 17:22:43 UTC, Jakob Ovrum wrote:
>
> i18nString sounds like a range of graphemes to me.
>
Maybe.  If I had called it...say, "normalisedString"?  Would you 
still think that?  That was an off-the-cuff name because my 
morning brain imagined that this sort of thing would be useful 
for user input where you can't make assumptions about its form.

> I would like a convenient function in std.uni to get such a 
> range of graphemes from a range of points, but I wouldn't want 
> to elevate it to any particular status; that would be a 
> knee-jerk reaction. D's granularity when it comes to Unicode is 
> because there is an appropriate level of representation for 
> each domain. Shoe-horning everything into a range of graphemes 
> is something we should avoid.
>
Okay, hold up.  It's a bit late to prevent everyone from diving 
down this rabbit hole, but let me be clear:

This really isn't about graphemes.  Not really.  They may be 
involved, but I think focusing on that obscures the point.

If you recall the original article, I don't think its author is 
being unfair in expecting "noël" to have a length of four no 
matter how it was composed.  I don't think it's unfair to expect 
that "noël".take(3) returns "noë", or that reversing it gives 
"lëon".  All the places where his expectations were defied (and 
more!) are implementation details.
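
For the concrete version, here is a rough sketch of how those implementation details show up. It assumes a Phobos whose std.uni provides normalize and NFC, and it spells "noël" with a combining diaeresis (U+0308) to get the decomposed form. Normalising to NFC only helps where a precomposed character exists, so this illustrates the problem rather than offering a general fix.

```d
import std.algorithm : equal;
import std.range : retro, take, walkLength;
import std.uni : NFC, normalize;

unittest
{
    // "noël" spelled as 'e' followed by U+0308 COMBINING DIAERESIS.
    string decomposed = "noe\u0308l";

    assert(decomposed.length == 6);          // UTF-8 code units
    assert(decomposed.walkLength == 5);      // code points, not the expected 4
    assert(decomposed.take(3).equal("noe")); // the diaeresis is cut off
    // Reversing by code point moves the diaeresis onto the 'l'.
    assert(decomposed.retro.equal("l\u0308eon"));

    // After NFC normalisation, 'e' + U+0308 becomes the single point U+00EB.
    string composed = normalize!NFC(decomposed);
    assert(composed.walkLength == 4);
    assert(composed.take(3).equal("no\u00EB"));
    assert(composed.retro.equal("l\u00EBon"));
}
```

Building with -unittest -main runs the checks.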

While I stated before that I don't necessarily have anything 
against people learning more about Unicode, I also don't 
believe it's something a lot of people _need_ to 
worry about.  I'm not saying the default string in D should 
change or anything crazy like that.  All I'm suggesting is maybe, 
rather than telling people they should read a small book about 
the most arcane stuff imaginable and then explaining which tool 
does what when that doesn't take, we could just tell them "Here, 
use this library type where you need it" with the admonishment 
that it may be too slow if abused.  I think THAT could be useful.
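
A minimal sketch of what such a library type might look like follows. Everything here is hypothetical: the name NormalisedString, its length property, and its codePoints method are invented for illustration and are not part of Phobos. Only normalize and NFC from std.uni are assumed to exist. It normalises once on construction, and its length is an O(n) walk, which is exactly the kind of cost the admonishment is about.

```d
import std.algorithm : equal;
import std.range : retro, take, walkLength;
import std.uni : NFC, normalize;

/// Hypothetical convenience type: stores its text in NFC form.
struct NormalisedString
{
    private string data;

    this(string raw)
    {
        data = normalize!NFC(raw); // pay the normalisation cost once
    }

    /// Number of code points after normalisation; walks the whole string.
    @property size_t length()
    {
        return data.walkLength;
    }

    /// The normalised text, usable with take, retro and friends.
    string codePoints()
    {
        return data;
    }
}

unittest
{
    auto a = NormalisedString("noe\u0308l"); // decomposed input
    auto b = NormalisedString("no\u00EBl");  // precomposed input

    assert(a.length == 4);
    assert(a == b); // both spellings compare equal once normalised
    assert(a.codePoints.take(3).equal("no\u00EB"));
    assert(a.codePoints.retro.equal("l\u00EBon"));
}
```

A type along these lines would still miscount text that has no precomposed form, so a version built on graphemes might be the better fit for truly arbitrary input.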

> In D, we can write code that is both Unicode-correct and highly 
> performant, while still being simple and pleasant to read. To 
> write such code, one must have a modicum of understanding of 
> how Unicode works (in order to choose the right tools from the 
> toolbox), but I think it's a novel compromise.

See, this sways me only a little bit.  The reason is that 
convenience often trumps elegance or performance.  Sure, 
I COULD write something in C to look for obvious bad stuff in my 
syslog, but would I bother when I have a shell with pipes, grep, 
cut, and sed?  None of this is to say I don't LIKE performance and 
elegance; but I live, work, and play on both sides of this 
spectrum, and I'd like to think they can peacefully coexist 
without too much fuss.

-Wyatt