Narrow string is not a random access range
Jonathan M Davis
jmdavisProg at gmx.com
Wed Oct 24 04:07:46 PDT 2012
On Wednesday, October 24, 2012 12:42:59 mist wrote:
> On Tuesday, 23 October 2012 at 17:36:53 UTC, Simen Kjaeraas wrote:
> > On 2012-10-23, 19:21, mist wrote:
> >> Hm, and all phobos functions should operate on narrow strings
> >> as if they were not random-accessible? I am thinking about
> >> something like commonPrefix from std.algorithm, which operates
> >> on code points for strings.
> >
> > Preferably, yes. If there are performance (or other) benefits
> > from
> > operating on code units, and it's just as safe, then operating
> > on code
> > units is ok.
>
> Probably I don't understand it fully, but the D approach has always
> been "safe first, fast with some additional syntax". Back to
> commonPrefix and take:
>
> ==========================
> import std.stdio, std.traits, std.algorithm, std.range;
>
> void main()
> {
>     auto beer = "Пиво";
>     auto r1 = beer.take(2);
>     auto pony = "Пони";
>     auto r2 = commonPrefix(beer, pony);
>     writeln(r1);
>     writeln(r2);
> }
> ==========================
>
> The first one returns 2 symbols. The second one returns 3 code units and
> a broken string. There is no way such an inconsistency by default in
> the standard library is understandable to a newbie.
We don't really have much choice here. As long as strings are arrays of code
units, it wouldn't work to treat them as ranges of their elements, because
that would be a complete disaster for Unicode. You'd be operating on code
units rather than code points, which is almost always wrong. Pretty much the
only way to really solve the problem as long as strings are arrays with all of
the normal array operations is for the std.range traits (hasLength,
hasSlicing, etc.) and the range functions for arrays in std.array (e.g. front,
popFront, etc.) to treat strings as ranges of code points (dchar), which is
what they do. The result _is_ confusing, but as long as strings are arrays of
code units like they are now, to do anything else would result in incorrect
behavior. There just isn't a good solution given what strings currently are in
the language itself.
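
To make that concrete, here is a small sketch of what the std.range traits and
the array range primitives report for a narrow string. It assumes a Phobos
version in which strings are still decoded to dchar by front and popFront (as
described above); the variable name and the particular asserts are only
illustrative.

```d
import std.algorithm.comparison : equal;
import std.range;

// A narrow string is a range of code points (dchar), not of its code units.
static assert(is(ElementType!string == dchar));
static assert(isBidirectionalRange!string);

// The traits deliberately deny the array-like capabilities for narrow strings.
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);
static assert(!hasSlicing!string);

// With UTF-32 one code unit is one code point, so dstring is random access.
static assert(isRandomAccessRange!dstring);

void main()
{
    string beer = "Пиво";

    // The array operations still work on code units (2 per Cyrillic letter).
    assert(beer.length == 8);
    assert(beer[0] == 0xD0);      // first byte of the UTF-8 encoding of 'П'

    // The range operations work on code points.
    assert(beer.walkLength == 4);
    assert(beer.front == 'П');
    assert(equal(beer.take(2), "Пи"d));
}
```

So the same variable has two views: indexing, slicing and .length see code
units, while anything written against the range API sees code points.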
Andrei's suggestion would work if Walter could be talked into it, but that
doesn't look like it's going to happen. And making it so that strings are
structs which hold arrays of code units could work, but without language
support, it's likely to have major issues. String literals would have to
become the struct type, which could cause issues with calling C functions, and
the code breakage would be _way_ larger than with Andrei's suggestion, since
arrays of code units would no longer be strings at all. It would be feasible,
but it gets really messy. What we have is probably about the best that we can
do without actually changing the language (and Andrei's suggestion is likely
the best way to do that IMHO), but that's unlikely to happen at this point,
especially since Walter seems to view Unicode quite differently from your
average programmer and expects your average programmer to actually understand
it and handle it correctly (which just isn't going to happen).
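
For illustration only, a library-level struct holding an array of code units
might look roughly like the sketch below. The name Utf8String and its members
are hypothetical; this is not Andrei's suggestion, not an existing Phobos
type, and it ignores the string literal and C interop problems mentioned
above.

```d
import std.range.primitives : ElementType, isForwardRange, isRandomAccessRange;
import std.utf : decode, stride;

// Hypothetical wrapper: a range of code points over UTF-8 code units.
struct Utf8String
{
    private string units;   // the underlying array of code units

    this(string s) { units = s; }

    @property bool empty() { return units.length == 0; }

    @property dchar front()
    {
        size_t index = 0;
        return decode(units, index);    // decode the first code point
    }

    void popFront()
    {
        // Drop all code units that make up the first code point.
        units = units[stride(units, 0) .. $];
    }

    @property Utf8String save() { return this; }

    // Code units are reachable only through an explicitly named member.
    @property string codeUnits() { return units; }
}

static assert(isForwardRange!Utf8String);
static assert(is(ElementType!Utf8String == dchar));
static assert(!isRandomAccessRange!Utf8String);

void main()
{
    import std.algorithm.comparison : equal;
    import std.range : take;

    auto beer = Utf8String("Пиво");
    assert(equal(beer.take(2), "Пи"d));     // range view: code points
    assert(beer.codeUnits.length == 8);     // explicit view: code units

    // No indexing is offered, so code units cannot be hit by accident.
    static assert(!__traits(compiles, beer[0]));
}
```

With a type like this there is no second, array-shaped view to trip over, but
every string literal and every API that takes char[] would need converting,
which is where the breakage comes from.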
The confusion could be reduced if we not only had an article on dlang.org
explaining exactly what ranges were and how to use them with Phobos but also
an article (maybe the same one, maybe another), which explained what this
means for strings and why. That way, it would become easier to become
educated. But no one has written (or at least finished writing) such an article
for dlang.org (I keep meaning to, but I never get around to it). Some stuff has
been written outside of dlang.org (e.g.
http://www.drdobbs.com/architecture-and-design/component-programming-in-d/240008321
and http://ddili.org/ders/d.en/ranges.html ), but there's nothing on dlang.org,
and I don't believe that there's really anything online aside from stray
newsgroup posts or Stack Overflow answers which discusses why strings are the
way they are with regard to ranges. And there should be.
- Jonathan M Davis