Making all strings UTF ranges has some risk of WTF
Jason House
jason.james.house at gmail.com
Thu Feb 4 05:53:20 PST 2010
Andrei Alexandrescu Wrote:
> It's no secret that string et al. are not a magic recipe for writing
> correct Unicode code. However, things are pretty good and could be
> further improved by operating the following changes in std.array and
> std.range:
>
> These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
> with the quirk that you still have a sort of a random-access operator.
>
> I'm very strongly in favor of this change. Bidirectional strings allow
> beautiful correct algorithms to be written that handle encoded strings
> without any additional effort; with these changes, everything applicable
> of std.algorithm works out of the box (with the appropriate fixes here
> and there), which is really remarkable.
>
> The remaining WTF is the length property. Traditionally, a range
> offering length also implies the expectation that a range of length n
> allows you to call popFront n times and then assert that the range is
> empty. However, if you check e.g. hasLength!string it will yield false,
> although the string does have an accessible member by that name and of
> the appropriate type.
>
> Although Phobos always checks its assumptions, people might occasionally
> write code that just uses .length without checking hasLength. Then,
> they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
>
> (The "real" length of the range is not stored, but can be computed by
> using str.walkLength() in std.range.)
>
> What can be done about that? I see a number of solutions:
The underlying array of code units is an implementation detail, and hasLength is a kludge. Follow good OO design and hide the implementation details behind the standard interface!
I would use a struct for UTF-8 and UTF-16 strings and add a method to get the raw array. That allows simple, compiler-enforced usage while still letting special-cased code work on the raw data. As an added bonus, this approach generalizes to other ranges with variable-width elements.
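
To make the idea concrete, here is a rough sketch of what such a struct might look like for UTF-8. The names Utf8String and raw are made up for illustration and are not part of Phobos; the sketch assumes the decode and strideBack helpers from a current std.utf. The point is that the type has no length member at all, so generic code cannot confuse the code-unit count with the number of elements.

```d
import std.range : walkLength;
import std.range.primitives : hasLength, isBidirectionalRange;
import std.utf : decode, strideBack;

/// Hypothetical wrapper (not a Phobos type): a UTF-8 string that is visible
/// to generic code only as a bidirectional range of code points.
struct Utf8String
{
    private string data; // code units, hidden from the range interface

    this(string s) { data = s; }

    @property bool empty() { return data.length == 0; }

    /// Decodes the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(data, i);
    }

    /// Drops the first code point, however many code units it occupies.
    void popFront()
    {
        size_t i = 0;
        decode(data, i); // advances i past one code point
        data = data[i .. $];
    }

    /// Decodes the last code point without consuming it.
    @property dchar back()
    {
        size_t i = data.length - strideBack(data, data.length);
        return decode(data, i);
    }

    /// Drops the last code point.
    void popBack()
    {
        data = data[0 .. $ - strideBack(data, data.length)];
    }

    @property Utf8String save() { return this; }

    /// Explicit escape hatch for code that really wants the code units.
    @property string raw() { return data; }
}

// The wrapper satisfies the range traits but deliberately offers no length.
static assert(isBidirectionalRange!Utf8String);
static assert(!hasLength!Utf8String);

unittest
{
    auto s = Utf8String("a\u00F1\u00E9");            // 'a', n-tilde, e-acute
    static assert(!__traits(compiles, s.length));    // misuse is a compile error
    assert(s.raw.length == 5);                       // five UTF-8 code units
    assert(walkLength(s) == 3);                      // three code points
    assert(s.back == '\u00E9');
    s.popBack();
    assert(s.back == '\u00F1');
}
```

Code that just writes s.length no longer compiles, while code that needs the bytes has to ask for s.raw explicitly. A UTF-16 version would presumably look the same with wstring as the underlying storage.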