How to correctly deal with unicode strings?
monarch_dodra
monarchdodra at gmail.com
Wed Nov 27 06:55:29 PST 2013
Bearophile also linked this article:
http://forum.dlang.org/thread/nieoqqmidngwoqwnktih@forum.dlang.org
On Wednesday, 27 November 2013 at 14:34:15 UTC, Gary Willoughby
wrote:
> I've just been reading this article:
> http://mortoray.com/2013/11/27/the-string-type-is-broken/ and
> wanted to test if D performed in the same way as he describes,
> i.e. unicode strings being 'broken' because they are just
> arrays.
No. While Unicode strings are "just" arrays, that's not why they
are "broken". Unicode strings are *stored* in arrays, as a sequence
of "codeunits", but they are still decoded one entire "codepoint"
at a time, so that's not the issue.
The main issue is that in Unicode, a "character" (if that means
anything), or a "grapheme", can be composed of two codepoints
that mustn't be separated. Currently, D does not know how to deal
with this.
> Although i understand the difference between code units and
> code points it's not entirely clear in D what i need to do to
> avoid the situations he describes. For example:
>
> import std.algorithm;
> import std.stdio;
>
> void main(string[] args)
> {
> char[] x = "noël".dup;
>
> assert(x.length == 6); // Actual
> // assert(x.length == 4); // Expected.
This is a source of confusion: a string is *not* a random access
range. This means that "length" is not actually part of the
"string interface": It is only an "underlying implementation
detail".
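Try this (hasLength and walkLength are in std.range):
    alias String = string;
    static if (hasLength!String)
        assert(x.length == 4);
    else
        assert(x.walkLength == 4);
This will work regardless of string's "width" (char/wchar/dchar),
because it counts codepoints rather than codeunits. Note that it
only gives 4 if "ë" is stored as a single precomposed codepoint;
for the decomposed "e" + combining diaeresis in your example,
walkLength gives 5.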
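To make the codeunit/codepoint distinction concrete, here is a minimal sketch. The helper name codePointCount is made up for illustration, and the two spellings of "noël" are written with \u escapes so it is clear which form is which:

```d
import std.range : walkLength;

// Hypothetical helper: number of codepoints in a string of any width.
// walkLength decodes narrow strings (char/wchar) instead of counting codeunits.
size_t codePointCount(S)(S s)
{
    return s.walkLength;
}

void main()
{
    // Decomposed: 'e' followed by U+0308 COMBINING DIAERESIS.
    string decomposed = "noe\u0308l";
    // Precomposed: U+00EB LATIN SMALL LETTER E WITH DIAERESIS.
    string precomposed = "no\u00EBl";

    // .length counts UTF-8 codeunits.
    assert(decomposed.length == 6);
    assert(precomposed.length == 5);

    // walkLength counts codepoints, whatever the width.
    assert(codePointCount(decomposed) == 5);
    assert(codePointCount(precomposed) == 4);
    assert(codePointCount("noe\u0308l"w) == 5);
    assert(codePointCount("noe\u0308l"d) == 5);
}
```

Neither count is the number of "characters" a reader sees in the decomposed form.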
> assert(x[0 .. 3] == "noe".dup); // Actual.
> // assert(x[0 .. 3] == "noë".dup); // Expected.
Again, don't slice your strings like that: a string is neither
random access nor sliceable. You have no guarantee that your third
character will end at index 3. You want:
assert(equal(x.take(3), "noe"));
Note that "x.take(3)" will not actually give you a slice, but a
lazy range. If you want a slice, you need to walk the string, and
extract the index:
auto index = x.length - x.drop(3).length; // drop is in std.range
assert(x[0 .. index] == "noe");
Note that this is *only* "UTF-correct", but it is still wrong
from a unicode point of view. Again, it's because ë is actually a
single grapheme composed of *two* codepoints.
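Put together, the "walk, then slice" idea could look like the sketch below. The helper name codePointPrefix is hypothetical; it relies only on std.range.drop and take, and it is still only codepoint-correct, not grapheme-correct:

```d
import std.algorithm : equal;
import std.range : drop, take;

// Hypothetical helper: the first n codepoints of s, as a real slice.
// drop(n) pops n decoded codepoints; the codeunits that remain tell us
// where the prefix ends.
S codePointPrefix(S)(S s, size_t n)
{
    immutable index = s.length - s.drop(n).length;
    return s[0 .. index];
}

void main()
{
    // Decomposed "noël": 'e' followed by U+0308.
    char[] x = "noe\u0308l".dup;

    // Lazy version: no slice, just a range over the first 3 codepoints.
    assert(equal(x.take(3), "noe"));

    // Slice version: the combining mark is cut off from its 'e'.
    assert(codePointPrefix(x, 3) == "noe");

    // With a precomposed ë the same call keeps the accent.
    assert(codePointPrefix("no\u00EBl", 3) == "no\u00EB");
}
```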
> x.reverse;
>
> assert(x == "l̈eon".dup); // Actual
> // assert(x == "lëon".dup); // Expected.
> }
>
> Here i understand what is happening but how could i improve
> this example to make the expected asserts true?
AFAIK, we don't have any way of dealing with this (yet).
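For readers on a more recent Phobos: std.uni there provides byGrapheme and byCodePoint, which should cover the three cases above. The following is only a sketch under that assumption (it does not claim these were available in the release current when this was written), and the helper names graphemeCount, graphemePrefix and graphemeReverse are made up for illustration:

```d
import std.array : array;
import std.conv : text;
import std.range : retro, take, walkLength;
import std.uni : byCodePoint, byGrapheme;

// Hypothetical helper: number of graphemes ("visible characters").
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

// Hypothetical helper: the first n graphemes, re-encoded as a string.
string graphemePrefix(string s, size_t n)
{
    return s.byGrapheme.take(n).byCodePoint.text;
}

// Hypothetical helper: reverse by grapheme, so combining marks
// stay attached to their base character.
string graphemeReverse(string s)
{
    return s.byGrapheme.array.retro.byCodePoint.text;
}

void main()
{
    // Decomposed "noël": 'e' followed by U+0308.
    string s = "noe\u0308l";

    assert(graphemeCount(s) == 4);
    assert(graphemePrefix(s, 3) == "noe\u0308");  // "noë"
    assert(graphemeReverse(s) == "le\u0308on");   // "lëon"
}
```

This allocates and decodes on every call, so it is meant to show the idea rather than to be fast.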