Unicode handling comparison
Wyatt
wyatt.epp at gmail.com
Wed Nov 27 12:39:47 PST 2013
On Wednesday, 27 November 2013 at 17:22:43 UTC, Jakob Ovrum wrote:
>
> i18nString sounds like a range of graphemes to me.
>
Maybe. What if I had called it, say, "normalisedString"? Would you
still think that? That was an off-the-cuff name because my
morning brain imagined that this sort of thing would be useful
for user input where you can't make assumptions about its form.

> I would like a convenient function in std.uni to get such a
> range of graphemes from a range of points, but I wouldn't want
> to elevate it to any particular status; that would be a
> knee-jerk reaction. D's granularity when it comes to Unicode is
> because there is an appropriate level of representation for
> each domain. Shoe-horning everything into a range of graphemes
> is something we should avoid.
>
Okay, hold up. It's a bit late to prevent everyone from diving
down this rabbit hole, but let me be clear:

This really isn't about graphemes. Not really. They may be
involved, but I think focusing on that obscures the point.

If you recall the original article, I don't think he's being
unfair in expecting "noël" to have a length of four no matter
how it was composed. I don't think it's unfair to expect that
"noël".take(3) returns "noë", and I don't think it's unfair to
expect that reversing it gives "lëon". All the places where his
expectations were defied (and more!) are implementation details.
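For concreteness, here is a rough sketch of those three expectations
in D. It assumes a Phobos recent enough to provide std.uni.byGrapheme
and std.uni.byCodePoint; the helper names graphemeCount,
takeGraphemes and reverseGraphemes are made up for illustration and
are not library functions.

```d
import std.array : array;
import std.conv : text;
import std.range : retro, take, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helpers, named here only for illustration.

/// Number of user-perceived characters (grapheme clusters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

/// The first n grapheme clusters, converted back to a string.
string takeGraphemes(string s, size_t n)
{
    return s.byGrapheme.take(n).byCodePoint.text;
}

/// The string reversed cluster by cluster, so that combining
/// marks stay attached to their base character.
string reverseGraphemes(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    // "noël" spelled with "e" followed by U+0308 COMBINING DIAERESIS.
    string s = "noe\u0308l";

    assert(s.length == 6);      // UTF-8 code units
    assert(s.walkLength == 5);  // code points
    assert(graphemeCount(s) == 4);

    assert(takeGraphemes(s, 3) == "noe\u0308");    // "noë"
    assert(reverseGraphemes(s) == "le\u0308on");   // "lëon"
}
```

The point of the sketch is only that the "expected" answers are
reachable; which of the three lengths you want still depends on what
you are doing with the string.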
While I stated before that I don't necessarily have anything
against people learning more about Unicode, neither do I
fundamentally believe that's something a lot of people _need_ to
worry about. I'm not saying the default string in D should
change or anything crazy like that. All I'm suggesting is maybe,
rather than telling people they should read a small book about
the most arcane stuff imaginable and then explaining which tool
does what when that doesn't take, we could just tell them "Here,
use this library type where you need it" with the admonishment
that it may be too slow if abused. I think THAT could be useful.
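As a very rough sketch of what such a library type might look like:
the name NormalisedString and its whole interface are hypothetical
(nothing like it exists in Phobos), and it leans only on
std.uni.normalize and std.uni.byGrapheme.

```d
import std.range : walkLength;
import std.uni : byGrapheme, normalize, NFC;

/// Hypothetical convenience type; not part of Phobos.
struct NormalisedString
{
    private string data; // always held in NFC

    this(string raw)
    {
        // Compose sequences such as "e" + U+0308 into U+00EB, so
        // differently composed input ends up in the same form.
        data = normalize!NFC(raw);
    }

    /// Count of grapheme clusters. This walks the whole string,
    /// which is where the "too slow if abused" caveat comes from.
    @property size_t length()
    {
        return data.byGrapheme.walkLength;
    }

    string toString()
    {
        return data;
    }
}

void main()
{
    auto decomposed = NormalisedString("noe\u0308l");  // e + U+0308
    auto precomposed = NormalisedString("no\u00EBl");  // U+00EB

    assert(decomposed.length == 4);
    assert(precomposed.length == 4);
    assert(decomposed.toString() == precomposed.toString());
}
```

A real type would need slicing, iteration and so on; this only shows
normalising once on the way in and answering length in the unit
people expect.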
> In D, we can write code that is both Unicode-correct and highly
> performant, while still being simple and pleasant to read. To
> write such code, one must have a modicum of understanding of
> how Unicode works (in order to choose the right tools from the
> toolbox), but I think it's a novel compromise.

See, this sways me only a little bit. The reason is that
convenience often greatly trumps elegance or performance. Sure,
I COULD write something in C to look for obvious bad stuff in my
syslog, but would I bother when I have a shell with pipes, grep,
cut, and sed? None of this is to say I don't LIKE performance and
elegance, but I live, work, and play on both sides of this
spectrum, and I'd like to think they can peacefully coexist
without too much fuss.
-Wyatt