string is rarely useful as a function argument

Andrei Alexandrescu SeeWebsiteForEmail at erdani.org
Sat Dec 31 10:56:01 PST 2011


On 12/31/11 10:47 AM, Michel Fortin wrote:
> On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu
> <SeeWebsiteForEmail at erdani.org> said:
>
>> On 12/31/11 8:17 CST, Michel Fortin wrote:
>>> At one time Java and other frameworks started treating UTF-16 code
>>> units as if they were characters, and that turned out badly for them.
>>> Now we know that not even code points should be considered characters,
>>> thanks to characters spanning multiple code points. You might call it
>>> perfect, but for that you have made two assumptions:
>>>
>>> 1. treating code points as characters is good enough, and
>>> 2. the performance penalty of decoding everything is tolerable
>>
>> I'm not sure how you concluded I drew such assumptions.
>
> 1: Because treating UTF-8 strings as a range of code points encourages
> people to think so. 2: From things you posted on the newsgroup
> previously. Sorry I don't have the references, but it'd take too long to
> dig them up.

That's sort of difficult to refute. Anyhow, I think it's great that 
algorithms can use types to go down to the representation if needed, and 
stay up at bidirectional range level otherwise.

>>> Ranges of code points might be perfect for you, but it's a tradeoff that
>>> won't work in every situation.
>>
>> Ranges can be defined over logical glyphs that span multiple code
>> points.
>
> I'm talking about the default interpretation, where string ranges are
> ranges of code points, making that tradeoff the default.
>
> And also, I think we can agree that a logical glyph range would be
> terribly inefficient in practice, although it could be a nice teaching
> tool.

Well people who want that could use byGlyph() or something. If you want 
glyphs, you gotta pay the price.
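
A sketch of what paying that price looks like, assuming a grapheme-level 
range in the spirit of byGlyph (std.uni.byGrapheme is the closest real 
counterpart; the example leans on it rather than on anything proposed in 
this thread):

import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    auto s = "noe\u0308l";                 // "noël" with a combining diaeresis
    assert(s.walkLength == 5);             // five code points...
    assert(s.byGrapheme.walkLength == 4);  // ...but four graphemes: n o ë l
}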

>>> The whole concept of generic algorithms working on strings efficiently
>>> doesn't work.
>>
>> Apparently std.algorithm does.
>
> First, it doesn't really work.

Oh yes it does.

> It seems to work fine, but it doesn't
> handle (yet) characters spanning multiple code points.

That's the job of std.range, not std.algorithm.

> To handle this
> case, you could use a logical glyph range, but that'd be quite
> inefficient. Or you can improve the algorithm working on code points so
> that it checks for combining characters on the edges, but then is it
> still a generic algorithm?
>
> Second, it doesn't work efficiently. Sure you can specialize the
> algorithm so it does not decode all code units when it's not necessary,
> but then does it still classify as a generic algorithm?
>
> My point is that *generic* algorithms cannot work *efficiently* with
> Unicode, not that they can't work at all. And even then, for the
> inefficient generic algorithm to work correctly with all input, the user
> needs to choose the correct Unicode representation for the problem at
> hand, which requires some general knowledge of Unicode.
>
> Which is why I'd just discourage generic algorithms for strings.

I think you are in a position that is defensible, but not generous and 
therefore undesirable. The military equivalent would be defending a 
fortified landfill drained by a sewer. You don't _want_ to be there. 
Taken to its ultimate conclusion, your argument is that we give up on 
genericity for strings and go home.

Strings are a variable-length encoding on top of an array. That is a 
relatively easy abstraction to model. Currently we don't have a 
dedicated model for that - we offer the encoded data as a bidirectional 
range and also the underlying array. Algorithms that work with 
bidirectional ranges work out of the box. Those that can use the 
representation gainfully can opportunistically specialize on isSomeString!R.
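
As a sketch of that pattern (countMatches and both of its overloads below 
are made up for illustration; only isInputRange and isSomeString are real 
Phobos names):

import std.range : isInputRange;
import std.traits : isSomeString;

// Generic overload: walks any input range element by element via
// empty/front/popFront.
size_t countMatches(alias pred, R)(R r)
    if (isInputRange!R && !isSomeString!R)
{
    size_t n;
    for (; !r.empty; r.popFront())
        if (pred(r.front)) ++n;
    return n;
}

// Opportunistic specialization: for strings, scan raw code units directly,
// with no decoding involved.
size_t countMatches(alias pred, S)(S s)
    if (isSomeString!S)
{
    size_t n;
    foreach (c; s)          // built-in foreach over a string yields code units
        if (pred(c)) ++n;
    return n;
}

unittest
{
    assert(countMatches!(c => c == ',')("a,b,c") == 2);
}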

You contend that that doesn't "work", and I think you're wrong. But to 
the extent you have a case, an abstraction could be defined for 
variable-length encodings, and algorithms could be defined to work with 
that abstraction. I thought several times about that, but couldn't 
gather enough motivation for the simple reason that the current approach 
_works_.
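
If one did define it, the abstraction could be as thin as a constraint 
requiring both unit-level and decoded access (a sketch; none of these 
names exist in Phobos):

// Hypothetical trait for a variable-length-encoded range: it must offer
// unit-level access (frontUnit/popFrontUnit) next to the usual decoded
// front/popFront.
enum isVariableLengthEncoded(R) = is(typeof(
{
    R r = R.init;
    auto u = r.frontUnit;   // peek at one code unit
    r.popFrontUnit();       // advance by a single code unit
    auto e = r.front;       // decode one full element
    r.popFront();           // advance past that element
}));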

>>> I'm not against making strings more opaque to encourage people to use
>>> the Unicode algorithms from the standard library instead of rolling
>>> their own.
>>
>> I'd say we're discussing making the two kinds of manipulation (encoded
>> sequence of logical characters vs. array of code units) more
>> distinct from each other. That's a Good Thing(tm).
>
> It's a good abstraction for showing the theory of Unicode. But it's not
> the way to go if you want efficiency. For efficiency you need, for each
> element in the string, to use the lowest abstraction required to handle
> that element, so your algorithm needs to know about the various
> abstraction layers.

Correct.

> This is the kind of "range" I'd use to create algorithms dealing with
> Unicode properly:
>
> struct UnicodeRange(U)
> {
>     U frontUnit() @property;
>     dchar frontPoint() @property;
>     immutable(U)[] frontGlyph() @property;
>
>     void popFrontUnit();
>     void popFrontPoint();
>     void popFrontGlyph();
>
>     ...
> }

We already have most of that. For a string s, s[0] is frontUnit, s.front 
is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is 
popFrontPoint. We only need to define the glyph routines.
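
Spelled out (a sketch using nothing beyond today's strings and std.range):

import std.range;

unittest
{
    string s = "héllo";

    assert(s[0] == 'h');      // frontUnit: the first UTF-8 code unit
    assert(s.front == 'h');   // frontPoint: the first decoded code point

    s = s[1 .. $];            // popFrontUnit: drop a single code unit
    assert(s.front == 'é');   // next code point, two code units wide
    s.popFront();             // popFrontPoint: drop that whole code point
    assert(s == "llo");
}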

But I think you'd be stopping short. You want generic variable-length 
encoding, not the above.

> Not really a range per your definition of ranges, but basically it lets
> you intermix working with units, code points, and glyphs. Add a way to
> slice at the unit level and a way to know the length at the unit level
> and it's all I need to make an efficient parser, or any algorithm really.

Except for the glyphs implementation, we're already there. You are 
talking about existing capabilities!

> The problem with .raw is that it creates a separate range for the units.

That's the best part about it.

> This means you can't look at the frontUnit and then decide to pop the
> unit and then look at the next, decide you need to decode using
> frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
   if (s.raw.front == someFrontUnitThatICareAbout) {
      s.raw.popFront();   // pop a single code unit off the raw view
      auto c = s.front;   // then decode the next full code point
      s.popFront();       // and advance past it
   }
}

Now that I wrote it I'm even more enthralled with the coolness of the 
scheme. You essentially have access to two separate ranges on top of the 
same fabric.
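
For comparison, a sketch with what Phobos has today: std.string.representation 
gives a code-unit view over the same memory. Unlike the .raw above, the two 
views here advance independently instead of sharing a cursor, but the 
two-ranges-over-one-buffer idea is the same:

import std.range;
import std.string : representation;

unittest
{
    string s = "déjà vu";
    auto units = s.representation;   // immutable(ubyte)[] over the same bytes

    assert(units.front == 'd');      // code-unit view
    assert(s.front == 'd');          // code-point view of the same data

    s.popFront();                    // advance by one code point
    units.popFront();                // advance by one code unit, independently
    assert(s.front == 'é');
    assert(units.front == 0xC3);     // first byte of UTF-8 'é'
}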


Andrei

