string is rarely useful as a function argument

Michel Fortin michel.fortin at michelf.com
Sat Dec 31 12:44:00 PST 2011


On 2011-12-31 18:56:01 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 12/31/11 10:47 AM, Michel Fortin wrote:
>> It seems to work fine, but it doesn't
>> handle (yet) characters spanning multiple code points.
> 
> That's the job of std.range, not std.algorithm.

As I keep saying, if you handle combining code points at the range 
level you'll have very inefficient code. But I think you get that.

>> To handle this
>> case, you could use a logical glyph range, but that'd be quite
>> inefficient. Or you can improve the algorithm working on code points so
>> that it checks for combining characters on the edges, but then is it
>> still a generic algorithm?
>> 
>> Second, it doesn't work efficiently. Sure you can specialize the
>> algorithm so it does not decode all code units when it's not necessary,
>> but then does it still classify as a generic algorithm?
>> 
>> My point is that *generic* algorithms cannot work *efficiently* with
>> Unicode, not that they can't work at all. And even then, for the
>> inefficient generic algorithm to work correctly with all input, the user
>> needs to choose the correct Unicode representation for the problem at
>> hand, which requires some general knowledge of Unicode.
>> 
>> Which is why I'd just discourage generic algorithms for strings.
> 
> I think you are in a position that is defensible, but not generous and 
> therefore undesirable. The military equivalent would be defending a 
> fortified landfill drained by a sewer. You don't _want_ to be there.

I don't get the analogy.

> Taking your argument to its ultimate conclusion is that we give up on 
> genericity for strings and go home.

That is more or less what I am saying. Genericity for strings leads to 
inefficient algorithms, and you don't want inefficient algorithms, at 
least not without being warned in advance. This is why for instance you 
give a special name to inefficient (linear) operations in 
std.container. In the same way, I think generic operations on strings 
should be disallowed unless you opt in by explicitly saying on which 
representation you want the algorithm to perform its task.


>> This is the kind of "range" I'd use to create algorithms dealing with
>> Unicode properly:
>> 
>> struct UnicodeRange(U)
>> {
>>     U frontUnit() @property;
>>     dchar frontPoint() @property;
>>     immutable(U)[] frontGlyph() @property;
>> 
>>     void popFrontUnit();
>>     void popFrontPoint();
>>     void popFrontGlyph();
>> 
>>     ...
>> }
> 
> We already have most of that. For a string s, s[0] is frontUnit, 
> s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is 
> popFrontPoint. We only need to define the glyph routines.

Indeed. I came up with this concept when writing my XML parser: I defined 
frontUnit and popFrontUnit and used them all over the place (in 
conjunction with slicing). And I rarely needed to decode whole code 
points using front and popFront.
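A minimal sketch of those unit-level primitives as free functions over a string (the names frontUnit/popFrontUnit are from the discussion; their definitions here are my assumption of how the parser used them):

```d
// Unit-level access: read and pop single code units without UTF decoding.
char frontUnit(string s) { return s[0]; }
void popFrontUnit(ref string s) { s = s[1 .. $]; }

void main()
{
    string s = "<a>é</a>";
    // ASCII markup can be scanned one code unit at a time; decoding is
    // only needed if a non-ASCII character must actually be interpreted
    assert(s.frontUnit == '<');
    s.popFrontUnit();
    assert(s.frontUnit == 'a');
}
```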

> But I think you'd be stopping short. You want generic variable-length 
> encoding, not the above.

Really? How'd that work?

> Except for the glyphs implementation, we're already there. You are 
> talking about existing capabilities!
> 
>> The problem with .raw is that it creates a separate range for the units.
> 
> That's the best part about it.

Depends. It should create a *linked* range, not a *separate* one, in 
the sense that if you advance the "raw" range with popFront, it should 
advance the underlying "code point" range too.

>> This means you can't look at the frontUnit and then decide to pop the
>> unit and then look at the next, decide you need to decode using
>> frontPoint, then call popPoint and return to looking at the front unit.
> 
> Of course you can.
> 
> while (condition) {
>    if (s.raw.front == someFrontUnitThatICareAbout) {
>       s.raw.popFront();
>       auto c = s.front;
>       s.popFront();
>    }
> }

But will s.raw.popFront() also pop a single unit from s? "raw" would 
need to be defined as a reinterpret cast of the reference to the char[] 
to do what I want, something like this:

	ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }

The current std.string.representation doesn't do that at all.
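To show what "linked" means here, a small sketch using the reinterpret-cast raw above: popping a unit from the raw view also advances the original string, because both refer to the same storage.

```d
import std.range.primitives : popFront;

// Reinterpret the char[] reference itself, so the raw view aliases s.
ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*) &s; }

void main()
{
    char[] s = "éx".dup;     // 3 code units: 0xC3 0xA9 0x78
    s.raw.popFront();        // pops one *unit* from the raw view...
    assert(s.length == 2);   // ...and s itself advanced by one unit too
}
```

This is exactly the behavior a by-value `representation` cannot give you: popping the copy leaves the original string where it was.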

Also, how does it work with slicing? It can work with raw, but you'll 
have to cast things everywhere because raw is a ubyte[]:

	string s = "éà";
	s = cast(typeof(s))s.raw[0..4];


> Now that I wrote it I'm even more enthralled with the coolness of the 
> scheme. You essentially have access to two separate ranges on top of 
> the same fabric.

Glad you like the concept.


-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


