Making all strings UTF ranges has some risk of WTF

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Wed Feb 3 20:41:02 PST 2010


dsimcha wrote:
> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail at erdani.org)'s article
>> It's no secret that string et al. are not a magic recipe for writing
>> correct Unicode code. However, things are pretty good and could be
>> further improved by operating the following changes in std.array and
>> std.range:
>> - make front() and back() for UTF-8 and UTF-16 automatically decode the
>> first and last Unicode character
>> - make popFront() and popBack() skip one entire Unicode character
>> (instead of just one code unit)
>> - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
>> - change hasLength to return false for UTF-8 and UTF-16 strings
>> These changes effectively make UTF-8 and UTF-16 bidirectional ranges,
>> with the quirk that you still have a sort of a random-access operator.
>> I'm very strongly in favor of this change. Bidirectional strings allow
>> beautiful correct algorithms to be written that handle encoded strings
>> without any additional effort; with these changes, everything applicable
>> of std.algorithm works out of the box (with the appropriate fixes here
>> and there), which is really remarkable.
>> The remaining WTF is the length property. Traditionally, a range
>> offering length also implies the expectation that a range of length n
>> allows you to call popFront n times and then assert that the range is
>> empty. However, if you check e.g. hasLength!string it will yield false,
>> although the string does have an accessible member by that name and of
>> the appropriate type.
>> Although Phobos always checks its assumptions, people might occasionally
>> write code that just uses .length without checking hasLength. Then,
>> they'll be annoyed when the code fails with UTF-8 and UTF-16 strings.
>> (The "real" length of the range is not stored, but can be computed by
>> using str.walkLength() in std.range.)
>> What can be done about that? I see a number of solutions:
>> (a) Do not operate the change at all.
>> (b) Operate the change and mention that in range algorithms you should
>> check hasLength and only then use "length" under the assumption that it
>> really means "elements count".
>> (c) Deprecate the name .length for UTF-8 and UTF-16 strings, and define
>> a different name for that. Any other name (codeUnits, codes etc.) would
>> do. The entire point is to not make algorithms believe strings have a
>> .length property.
>> (d) Have std.range define a distinct property called e.g. "count" and
>> then specialize it appropriately. Then change all references to .length
>> in std.algorithm and elsewhere to .count.
>> What would you do? Any ideas are welcome.
>> Andrei
> 
> I personally would find this extremely annoying because most of the code I write
> that involves strings is scientific computing code that will never be
> internationalized, let alone released to the general public.  I basically just use
> ASCII because it's all I need and if your UTF-8 string contains only ASCII
> characters, it can be treated as random-access.  I don't know how many people out
> there are in similar situations, but I doubt they'll be too happy.
> 
> On the other hand, I guess it wouldn't be hard to write a simple wrapper struct on
> top of immutable(ubyte)[] and call it AsciiString.  Once alias this gets fully
> debugged, I could even make it implicitly convert to immutable(char)[].
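
For reference, the wrapper described in the quoted message might look
roughly like the sketch below, written against a present-day compiler
and Phobos. Only the name AsciiString comes from the message. The
validation in the constructor and the toChars helper are assumptions
made for illustration.

```d
import std.algorithm : all;
import std.exception : enforce;
import std.range : hasLength, isRandomAccessRange;
import std.string : representation;

// Hypothetical wrapper; not part of Phobos.
struct AsciiString
{
    immutable(ubyte)[] data;

    // Assumption: reject non-ASCII input up front, so that every
    // code unit is exactly one character.
    this(string s)
    {
        enforce(s.representation.all!(b => b < 0x80), "non-ASCII input");
        data = s.representation;
    }

    // Reinterpret the bytes as characters; valid because all are ASCII.
    string toChars() { return cast(string) data; }

    // Forward indexing, slicing and length to the byte array.
    alias data this;
}

// The underlying byte array keeps random access and a meaningful length.
static assert(isRandomAccessRange!(immutable(ubyte)[]));
static assert(hasLength!(immutable(ubyte)[]));

void main()
{
    auto s = AsciiString("ACGT");
    assert(s.length == 4);
    assert(s[2] == 'G');
    assert(s.toChars == "ACGT");
}
```

Because the data is stored as ubyte rather than char, length here is
both the code unit count and the element count.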

It's definitely going to be easy to use all sensible algorithms with
immutable(ubyte)[]. But even if you go with string, there should be no
problem at all. Remember, telling an ASCII character from a multi-byte
UTF-8 sequence takes one mask and one test, and Walter and I wrote
virtually all related routines to special-case ASCII. In most cases I
don't think you'll notice a decrease in performance.
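
The special case can be sketched as follows. This is a minimal
illustration, not the actual Phobos code: decodeFront is a hypothetical
helper name, and only std.utf.decode is a real library call.

```d
import std.utf : decode;

// Hypothetical helper: returns the code point starting at s[index]
// and advances index past it.
dchar decodeFront(string s, ref size_t index)
{
    immutable char c = s[index];
    if ((c & 0x80) == 0)   // one mask, one test: high bit clear means ASCII
    {
        ++index;
        return c;
    }
    return decode(s, index);   // general path for multi-byte sequences
}

void main()
{
    string s = "a\u00E9";   // 'a' takes one byte, U+00E9 takes two in UTF-8
    size_t i = 0;
    assert(decodeFront(s, i) == 'a' && i == 1);
    assert(decodeFront(s, i) == '\u00E9' && i == 3);
}
```

For pure ASCII input only the first branch ever runs, so the extra cost
per character is the mask and the comparison.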

Andrei


