Narrow string is not a random access range

Wed Oct 24 04:07:46 PDT 2012

On Wednesday, October 24, 2012 12:42:59 mist wrote:
> On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
> > On 2012-10-23, 19:21, mist wrote:
> >> Hm, and all Phobos functions should operate on narrow strings
> >> as if they were not random-accessible? I am thinking about
> >> something like commonPrefix from std.algorithm, which operates
> >> on code points for strings.
> > 
> > Preferably, yes. If there are performance (or other) benefits from
> > operating on code units, and it's just as safe, then operating on
> > code units is ok.
> 
> Probably I don't understand it fully, but the D approach has always
> been "safe first, fast with some additional syntax". Back to
> commonPrefix and take:
> 
> ==========================
> import std.stdio, std.traits, std.algorithm, std.range;
> 
> void main()
> {
> 	auto beer = "Пиво";
> 	auto r1 = beer.take(2);
> 	auto pony = "Пони";
> 	auto r2 = commonPrefix(beer, pony);
> 	writeln(r1);
> 	writeln(r2);
> }
> ==========================
> 
> First one returns 2 symbols. Second one - 3 code units and a
> broken string. There is no way such inconsistency by default in
> the standard library is understandable by a newbie.

We don't really have much choice here. As long as strings are arrays of code 
units, it wouldn't work to treat them as ranges of their elements, because 
that would be a complete disaster for Unicode. You'd be operating on code 
units rather than code points, which is almost always wrong. Pretty much the 
only way to really solve the problem as long as strings are arrays with all of 
the normal array operations is for the std.range traits (hasLength, 
hasSlicing, etc.) and the range functions for arrays in std.array (e.g. front, 
popFront, etc.) to treat strings as ranges of code points (dchar), which is 
what they do. The result _is_ confusing, but as long as strings are arrays of 
code units like they are now, to do anything else would result in incorrect 
behavior. There just isn't a good solution given what strings currently are in 
the language itself.
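
To make the above concrete, here is a minimal sketch of how the traits and
the array range functions see a narrow string. It assumes a Phobos release
in which the range primitives still decode narrow strings to dchar; the
variable name is borrowed from the example quoted above, and the source file
is assumed to be saved as UTF-8.

```d
import std.algorithm.comparison : equal;
import std.exception : assertThrown;
import std.range : take;
import std.range.primitives;
import std.utf : UTFException, validate;

void main()
{
    string beer = "Пиво";

    // The traits deliberately deny narrow strings random access,
    // length and slicing, even though the underlying array has all three.
    static assert(!isRandomAccessRange!string);
    static assert(!hasLength!string);
    static assert(!hasSlicing!string);

    // As a range, the element type is a code point (dchar),
    // while indexing the array still yields a code unit.
    static assert(is(ElementType!string == dchar));
    static assert(is(typeof(beer[0]) == immutable(char)));

    // Array operations count UTF-8 code units...
    assert(beer.length == 8);
    // ...whereas range operations walk code points.
    assert(beer.walkLength == 4);
    assert(beer.front == 'П');
    assert(equal(beer.take(2), "Пи"));

    // Slicing by code unit can cut a code point in half,
    // which leaves invalid UTF-8 behind.
    assertThrown!UTFException(validate(beer[0 .. 3]));
}
```

The last assertion is the kind of breakage that code-unit operations allow:
the slice ends in the middle of the second character.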

Andrei's suggestion would work if Walter could be talked into it, but that 
doesn't look like it's going to happen. And making it so that strings are 
structs which hold arrays of code units could work, but without language 
support, it's likely to have major issues. String literals would have to 
become the struct type, which could cause issues with calling C functions, and 
the code breakage would be _way_ larger than with Andrei's suggestion, since 
arrays of code units would no longer be strings at all. It would be feasible, 
but it gets really messy. What we have is probably about the best that we can 
do without actually changing the language (and Andrei's suggestion is likely 
the best way to do that IMHO), but that's unlikely to happen at this point, 
especially since Walter seems to view Unicode quite differently from your 
average programmer and expects your average programmer to actually understand 
it and handle it correctly (which just isn't going to happen).
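
For illustration only, a rough sketch of what the struct approach might look
like as a library type. Utf8String and codeUnits are hypothetical names, not
anything in Phobos or in any proposal from this thread; the point is just
that such a type is a range of code points by construction, and that the
code units (and therefore C interop) have to be asked for explicitly.

```d
import core.stdc.string : strlen;
import std.range.primitives : ElementType, isForwardRange,
    isRandomAccessRange, walkLength;
import std.string : toStringz;
import std.utf : decode, stride;

// Hypothetical wrapper type: a forward range of code points
// over an array of UTF-8 code units.
struct Utf8String
{
    private string units;

    this(string s) { units = s; }

    @property bool empty() { return units.length == 0; }

    // Decode the first code point without consuming it.
    @property dchar front()
    {
        size_t i = 0;
        return decode(units, i);
    }

    // Skip all the code units of the first code point.
    void popFront() { units = units[stride(units, 0) .. $]; }

    @property Utf8String save() { return this; }

    // Explicit, opt-in access to the underlying code units.
    @property string codeUnits() { return units; }
}

void main()
{
    static assert(isForwardRange!Utf8String);
    static assert(!isRandomAccessRange!Utf8String);
    static assert(is(ElementType!Utf8String == dchar));

    auto s = Utf8String("Пиво");
    assert(s.walkLength == 4);        // code points
    assert(s.codeUnits.length == 8);  // code units

    // Calling C now has to go through the code units explicitly.
    assert(strlen(toStringz(s.codeUnits)) == 8);
}
```

Without language support, a plain string literal would still be an array of
code units, so every literal would need wrapping like this, which is part of
why it gets messy.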

The confusion could be reduced if we not only had an article on dlang.org 
explaining exactly what ranges were and how to use them with Phobos but also 
an article (maybe the same one, maybe another), which explained what this 
means for strings and why. That way, it would become easier to become 
educated. But no one has written (or at least finished writing) such an article 
for dlang.org (I keep meaning to, but I never get around to it). Some stuff has 
been written outside of dlang.org (e.g.
http://www.drdobbs.com/architecture-and-design/component-programming-in-d/240008321
and http://ddili.org/ders/d.en/ranges.html ), but there's nothing on dlang.org, 
and I don't believe that there's really anything online aside from stray 
newsgroup posts or Stack Overflow answers which discusses why strings are the 
way they are with regard to ranges. And there should be.

- Jonathan M Davis