Making all strings UTF ranges has some risk of WTF

Jason House jason.james.house at gmail.com
Thu Feb 4 05:53:20 PST 2010


Andrei Alexandrescu Wrote:

> It's no secret that string et al. are not a magic recipe for writing 
> correct Unicode code. However, things are pretty good and could be 
> further improved by operating the following changes in std.array and 
> std.range:
> 
> These changes effectively make UTF-8 and UTF-16 bidirectional ranges, 
> with the quirk that you still have a sort of a random-access operator.
> 
> I'm very strongly in favor of this change. Bidirectional strings allow 
> beautiful correct algorithms to be written that handle encoded strings 
> without any additional effort; with these changes, everything applicable 
> of std.algorithm works out of the box (with the appropriate fixes here 
> and there), which is really remarkable.
> 
> The remaining WTF is the length property. Traditionally, a range 
> offering length also implies the expectation that a range of length n 
> allows you to call popFront n times and then assert that the range is 
> empty. However, if you check e.g. hasLength!string it will yield false, 
> although the string does have an accessible member by that name and of 
> the appropriate type.
> 
> Although Phobos always checks its assumptions, people might occasionally 
> write code that just uses .length without checking hasLength. Then, 
> they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
> 
> (The "real" length of the range is not stored, but can be computed by 
> using str.walkLength() in std.range.)
> 
> What can be done about that? I see a number of solutions:

The underlying array of code units (bytes for UTF-8, 16-bit units for UTF-16) is an implementation detail. hasLength is a kludge. Follow good OO design and hide the implementation details from the standard interface!

I would use a struct for UTF-8 and UTF-16 strings, and add a method to get the raw array. That allows simple, compiler-enforced usage while still allowing special casing to use the raw data. As an added bonus, this approach generalizes to other ranges with variable-width elements.
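
To make the idea concrete, here is a rough sketch of what such a struct might look like for UTF-8. The names Utf8String and raw are made up for illustration and are not part of Phobos; the sketch assumes a reasonably current std.utf (decode, stride) and std.range (walkLength, hasLength). The wrapper deliberately has no length member, so code that assumes one fails to compile instead of silently counting code units.

```d
import std.range : walkLength, isInputRange, hasLength;
import std.utf : decode, stride;

/// Hypothetical wrapper (not a Phobos type): exposes a range of code
/// points and keeps the code-unit array out of the standard interface.
struct Utf8String
{
    private string data; // UTF-8 code units, an implementation detail

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    /// Decodes the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    /// Skips one whole code point, however many code units it occupies.
    void popFront()
    {
        data = data[stride(data, 0) .. $];
    }

    @property Utf8String save() { return this; }

    /// Explicit escape hatch for code that wants to special-case raw data.
    @property string raw() { return data; }
}

// The wrapper is a range, but it does not advertise a length.
static assert(isInputRange!Utf8String);
static assert(!hasLength!Utf8String);

void main()
{
    auto s = Utf8String("héllo");
    assert(s.raw.length == 6); // code units: 'é' occupies two bytes
    assert(s.walkLength == 5); // code points, computed by iterating
}
```

With this shape, generic code sees only empty/front/popFront, and anything that needs the code units has to ask for raw explicitly. A UTF-16 version would look the same with a wstring inside.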



More information about the Digitalmars-d mailing list