Making all strings UTF ranges has some risk of WTF

dsimcha dsimcha at yahoo.com
Wed Feb 3 19:35:54 PST 2010


== Quote from Andrei Alexandrescu (SeeWebsiteForEmail at erdani.org)'s article
> It's no secret that string et al. are not a magic recipe for writing
> correct Unicode code. However, things are pretty good and could be
> further improved by operating the following changes in std.array and
> std.range:
> - make front() and back() for UTF-8 and UTF-16 automatically decode the
> first and last Unicode character
> - make popFront() and popBack() skip one entire Unicode character
> (instead of just one code unit)
> - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
> - change hasLength to return false for UTF-8 and UTF-16 strings
> These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
> with the quirk that you still have a sort of a random-access operator.
> I'm very strongly in favor of this change. Bidirectional strings allow
> beautiful correct algorithms to be written that handle encoded strings
> without any additional effort; with these changes, everything applicable
> of std.algorithm works out of the box (with the appropriate fixes here
> and there), which is really remarkable.
> The remaining WTF is the length property. Traditionally, a range
> offering length also implies the expectation that a range of length n
> allows you to call popFront n times and then assert that the range is
> empty. However, if you check e.g. hasLength!string it will yield false,
> although the string does have an accessible member by that name and of
> the appropriate type.
> Although Phobos always checks its assumptions, people might occasionally
> write code that just uses .length without checking hasLength. Then,
> they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
> (The "real" length of the range is not stored, but can be computed by
> using str.walkLength() in std.range.)
> What can be done about that? I see a number of solutions:
> (a) Do not operate the change at all.
> (b) Operate the change and mention that in range algorithms you should
> check hasLength and only then use "length" under the assumption that it
> really means "elements count".
> (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
> a different name for that. Any other name (codeUnits, codes etc.) would
> do. The entire point is to not make algorithms believe strings have a
> .length property.
> (d) Have std.range define a distinct property called e.g. "count" and
> then specialize it appropriately. Then change all references to .length
> in std.algorithm and elsewhere to .count.
> What would you do? Any ideas are welcome.
> Andrei
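
To make the .length question concrete, here is a minimal sketch of how a string behaves under the quoted proposal. It assumes a Phobos in which front/popFront decode and hasLength/isRandomAccessRange reject narrow strings as described above; the sample text is made up for illustration.

```d
import std.range : front, popFront, hasLength, isBidirectionalRange,
    isRandomAccessRange, walkLength;

void main()
{
    // Assumption: Phobos treats string as a bidirectional range of
    // decoded characters, as the quoted proposal describes.
    static assert(isBidirectionalRange!string);
    static assert(!isRandomAccessRange!string);
    static assert(!hasLength!string);

    string s = "héllo"; // 'é' occupies two UTF-8 code units

    assert(s.length == 6);     // .length counts code units
    assert(s.walkLength == 5); // walkLength counts decoded characters

    s.popFront();              // skips 'h'
    assert(s.front == 'é');    // front decodes a whole character
    s.popFront();              // skips both code units of 'é'
    assert(s.front == 'l');
}
```

Code that calls popFront s.length times on such a string runs past the end, which is the mismatch the options (a) through (d) try to address.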

I personally would find this extremely annoying, because most of the string code
I write is scientific computing code that will never be internationalized, let
alone released to the general public. I basically just use ASCII because it's all
I need, and if a UTF-8 string contains only ASCII characters, every character is
exactly one code unit, so the string can be treated as random-access. I don't
know how many people out there are in similar situations, but I doubt they'll be
too happy.
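
A rough sketch of why the ASCII-only case is safe: when every code unit is below 0x80, the code-unit count and the character count agree, so indexing and slicing by position cannot split a character. The helper name isAsciiOnly and the sample data are hypothetical, not something from Phobos.

```d
import std.range : walkLength;

/// Hypothetical helper: true if every UTF-8 code unit in `s` is plain ASCII.
bool isAsciiOnly(string s)
{
    foreach (char c; s) // iterates code units, no decoding
        if (c >= 0x80)
            return false;
    return true;
}

void main()
{
    string seq = "ACGTTGCA"; // made-up sample data

    assert(isAsciiOnly(seq));
    // One code unit per character, so both counts agree.
    assert(seq.length == seq.walkLength);
    // Indexing and slicing by position are therefore safe.
    assert(seq[3] == 'T');
    assert(seq[2 .. 5] == "GTT");

    // A single non-ASCII character breaks the assumption.
    assert(!isAsciiOnly("naïve"));
}
```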

On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on
top of immutable(ubyte)[] and call it AsciiString. Once alias this is fully
debugged, I could even make it implicitly convert to immutable(char)[].
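
One possible shape for such a wrapper, as a sketch only. It assumes alias this works on a property as it does in later compilers; the member names (data, toStr) and the choice to validate with enforce in the constructor are illustrative assumptions, not a settled design.

```d
import std.exception : enforce;
import std.range : isRandomAccessRange;

/// Hypothetical ASCII-only string stored as immutable(ubyte)[].
struct AsciiString
{
    private immutable(ubyte)[] data;

    /// Rejects any input containing a non-ASCII code unit.
    this(string s)
    {
        foreach (char c; s)
            enforce(c < 0x80, "AsciiString requires ASCII-only input");
        data = cast(immutable(ubyte)[]) s;
    }

    // Random-access range primitives, one byte per character.
    @property bool empty() const { return data.length == 0; }
    @property char front() const { return cast(char) data[0]; }
    @property char back() const { return cast(char) data[$ - 1]; }
    void popFront() { data = data[1 .. $]; }
    void popBack() { data = data[0 .. $ - 1]; }
    @property AsciiString save() { return this; }
    @property size_t length() const { return data.length; }
    alias opDollar = length;
    char opIndex(size_t i) const { return cast(char) data[i]; }

    /// ASCII bytes are valid UTF-8, so the same memory can be viewed as a string.
    @property string toStr() const { return cast(string) data; }
    alias toStr this; // implicit conversion to immutable(char)[]
}

void main()
{
    static assert(isRandomAccessRange!AsciiString);

    auto a = AsciiString("ACGT");
    assert(a.length == 4);
    assert(a[2] == 'G');
    assert(a[$ - 1] == 'T');

    string s = a; // converts through alias this
    assert(s == "ACGT");
}
```

Storing ubyte rather than char keeps the wrapper's own primitives free of UTF decoding, while the conversion lets it be passed to functions that expect an ordinary string.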


