Making all strings UTF ranges has some risk of WTF

Wed Feb 3 23:09:27 PST 2010

Ali Çehreli wrote:
> Andrei Alexandrescu wrote:
>  > It's no secret that string et al. are not a magic recipe for writing
>  > correct Unicode code. However, things are pretty good and could be
>  > further improved by operating the following changes in std.array and
>  > std.range:
>  >
>  > - make front() and back() for UTF-8 and UTF-16 automatically decode the
>  > first and last Unicode character
> 
> They would yield dchar, right? Wouldn't that cause trouble in templated 
> code?

Yes, dchar. Parts of Phobos took some figuring out to adapt, but the 
gains are well worth it.

The simplifications are enormous. Until now, Phobos didn't hit the nail 
on the head with simple encoding/decoding/transcoding primitives. There 
were many attempts in std.utf, std.encoding, and std.string - all very 
clunky to use. Now I can just write s.front to get the first dchar of 
any string, and s.popFront to drop it. Very simple!
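
A minimal sketch of what that looks like in practice. It assumes the 
range primitives for arrays are imported from std.range.primitives (where 
they live in current Phobos; at the time of this discussion they were in 
std.array), and decodeAll is a hypothetical helper written only for 
illustration:

```d
import std.range.primitives : empty, front, popFront;

// Hypothetical helper: collect every code point of a UTF-8 string.
dchar[] decodeAll(string s)
{
    dchar[] result;
    while (!s.empty)
    {
        result ~= s.front; // front decodes one whole code point
        s.popFront();      // popFront skips all of its code units
    }
    return result;
}

void main()
{
    string s = "ağ";
    assert(s.length == 3);                       // three UTF-8 code units
    static assert(is(typeof(s.front) == dchar)); // front yields dchar
    assert(decodeAll(s) == "ağ"d);               // two code points
}
```

The same loop should work unchanged for wstring and dstring, which is the 
point of the simplification.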

>  > - make popFront() and popBack() skip one entire Unicode character
>  > (instead of just one code unit)
> 
> That's perfectly fine, because the opposite operations do "encode":
> 
>     string s = "ağ";
>     assert(s.length == 3);
>     s ~= 'ş';
>     assert(s.length == 5);
> 
>  > - alter isRandomAccessRange to return false for UTF-8 and UTF-16 strings
> 
> Ok.
> 
>  > - change hasLength to return false for UTF-8 and UTF-16 strings
> 
> I don't understand that one. Strings have lengths. Adding and removing 
> does not alter length by 1 for those types. I don't think it's a big 
> deal. It is already so in the language for those types. dstring does not 
> have that problem and could be used when by-1 change is desired.

hasLength is a trait that range algorithms check to learn whether a 
range's length has one particular meaning: the number of elements. It is 
perfectly fine for strings to fail hasLength and still expose .length; 
for them .length simply has different semantics (the number of code 
units, not the number of characters).
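
To make the distinction concrete, here is a sketch of how a range 
algorithm might consult hasLength before trusting .length. It assumes the 
traits as they are exposed by std.range.primitives in current Phobos; 
elementCount is a hypothetical name, not an existing Phobos function:

```d
import std.range.primitives : hasLength, isInputRange,
                              isRandomAccessRange, walkLength;

// Narrow strings expose .length but do not advertise it as an element count.
static assert(!hasLength!string);
static assert(!isRandomAccessRange!string);
// dstring has one code unit per code point, so both traits hold.
static assert(hasLength!dstring);
static assert(isRandomAccessRange!dstring);

// Hypothetical helper: number of elements, using .length only when
// hasLength says it means "elements count".
size_t elementCount(R)(R r) if (isInputRange!R)
{
    static if (hasLength!R)
        return r.length;      // O(1), trusted
    else
        return walkLength(r); // O(n), walks the range element by element
}

void main()
{
    assert("ağ".length == 3);           // code units
    assert(elementCount("ağ") == 2);    // decoded characters
    assert(elementCount("ağ"d) == 2);   // dstring: length is trusted
    assert(elementCount([1, 2, 3]) == 3);
}
```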

>  > (b) Operate the change and mention that in range algorithms you should
>  > check hasLength and only then use "length" under the assumption that it
>  > really means "elements count".
> 
> The change sounds ok and hasLength should yield true. Or... can it 
> return an enum { no, kind_of, yes } ;)
> 
> Current utf.decode takes the index by reference and modifies it by the 
> amount. Could popFront() do something similar?

I think we could dedicate a special function for that. In fact, I think 
it already exists: it's called stride().
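
A rough sketch of index-based traversal with it, assuming the 
std.utf.stride(str, index) overload that returns the number of code units 
in the sequence starting at index. codePointOffsets is a hypothetical 
helper used only to show the idea:

```d
import std.utf : stride;

// Hypothetical helper: byte offsets at which each code point starts.
size_t[] codePointOffsets(string s)
{
    size_t[] offsets;
    // stride tells us how far to advance without decoding the character.
    for (size_t i = 0; i < s.length; i += stride(s, i))
        offsets ~= i;
    return offsets;
}

void main()
{
    // 'a' takes 1 code unit; 'ğ' and 'ş' take 2 each in UTF-8.
    size_t[] expected = [0, 1, 3];
    assert(codePointOffsets("ağş") == expected);
}
```

This keeps popFront() free of out-parameters while still letting callers 
that track an index know how many code units were consumed.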

Andrei