string is rarely useful as a function argument
Michel Fortin
michel.fortin at michelf.com
Sat Dec 31 12:44:00 PST 2011
On 2011-12-31 18:56:01 +0000, Andrei Alexandrescu
<SeeWebsiteForEmail at erdani.org> said:
> On 12/31/11 10:47 AM, Michel Fortin wrote:
>> It seems to work fine, but it doesn't
>> handle (yet) characters spanning multiple code points.
>
> That's the job of std.range, not std.algorithm.
As I keep saying, if you handle combining code points at the range
level you'll have very inefficient code. But I think you get that.
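To make the three levels in this discussion concrete (code units, code points, glyphs), here is a minimal sketch. It assumes a current Phobos, where std.uni.byGrapheme provides the glyph level; that function is newer than this thread, and the helper name `levels` is made up for illustration.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Returns [code units, code points, graphemes] for a UTF-8 string.
/// `levels` is a hypothetical helper name, not a Phobos function.
size_t[3] levels(string s)
{
    return [
        s.length,                 // UTF-8 code units
        s.walkLength,             // code points (decoding range interface)
        s.byGrapheme.walkLength,  // user-perceived characters ("glyphs")
    ];
}
```

For "e\u0301" (an 'e' followed by a combining acute accent) the three counts differ, which is why an algorithm working on code points has to look at the edges of a match.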
>> To handle this
>> case, you could use a logical glyph range, but that'd be quite
>> inefficient. Or you can improve the algorithm working on code points so
>> that it checks for combining characters on the edges, but then is it
>> still a generic algorithm?
>>
>> Second, it doesn't work efficiently. Sure you can specialize the
>> algorithm so it does not decode all code units when it's not necessary,
>> but then does it still classify as a generic algorithm?
>>
>> My point is that *generic* algorithms cannot work *efficiently* with
>> Unicode, not that they can't work at all. And even then, for the
>> inefficient generic algorithm to work correctly with all input, the user
>> needs to choose the correct Unicode representation for the problem at
>> hand, which requires some general knowledge of Unicode.
>>
>> Which is why I'd just discourage generic algorithms for strings.
>
> I think you are in a position that is defensible, but not generous and
> therefore undesirable. The military equivalent would be defending a
> fortified landfill drained by a sewer. You don't _want_ to be there.
I don't get the analogy.
> Taking your argument to its ultimate conclusion is that we give up on
> genericity for strings and go home.
That is more or less what I am saying. Genericity for strings leads to
inefficient algorithms, and you don't want inefficient algorithms, at
least not without being warned in advance. This is why for instance you
give a special name to inefficient (linear) operations in
std.container. In the same way, I think generic operations on strings
should be disallowed unless you opt-in by explicitly saying on which
representation you want the algorithm to perform its task.
>> This is the kind of "range" I'd use to create algorithms dealing with
>> Unicode properly:
>>
>> struct UnicodeRange(U)
>> {
>>     U frontUnit() @property;
>>     dchar frontPoint() @property;
>>     immutable(U)[] frontGlyph() @property;
>>
>>     void popFrontUnit();
>>     void popFrontPoint();
>>     void popFrontGlyph();
>>
>>     ...
>> }
>
> We already have most of that. For a string s, s[0] is frontUnit,
> s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is
> popFrontPoint. We only need to define the glyph routines.
Indeed. I came up with this concept when writing my XML parser: I defined
frontUnit and popFrontUnit and used it all over the place (in
conjunction with slicing). And I rarely needed to decode whole code
points using front and popFront.
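A minimal sketch of that style, not the actual parser code: frontUnit and popFrontUnit are written here as hypothetical free functions over a slice, and firstNonSpace is an illustrative name. ASCII whitespace is skipped one code unit at a time, and a code point is decoded only at the end.

```d
import std.range : front, popFront;

// Hypothetical free functions named after the UnicodeRange members above.
char frontUnit(const(char)[] s) { return s[0]; }
void popFrontUnit(ref const(char)[] s) { s = s[1 .. $]; }

/// Skips ASCII whitespace unit by unit, then decodes a single code point.
/// Returns the first non-space code point, or dchar.init if none is left.
dchar firstNonSpace(const(char)[] s)
{
    // ASCII bytes never occur inside a multi-byte UTF-8 sequence,
    // so comparing raw code units is safe here.
    while (s.length && (s.frontUnit == ' ' || s.frontUnit == '\t' || s.frontUnit == '\n'))
        s.popFrontUnit();
    return s.length ? s.front : dchar.init;  // s.front decodes one code point
}
```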
> But I think you'd be stopping short. You want generic variable-length
> encoding, not the above.
Really? How'd that work?
> Except for the glyph implementation, we're already there. You are
> talking about existing capabilities!
>
>> The problem with .raw is that it creates a separate range for the units.
>
> That's the best part about it.
Depends. It should create a *linked* range, not a *separate* one, in
the sense that if you advance the "raw" range with popFront, it should
advance the underlying "code point" range too.
>> This means you can't look at the frontUnit and then decide to pop the
>> unit and then look at the next, decide you need to decode using
>> frontPoint, then call popFrontPoint and return to looking at the front unit.
>
> Of course you can.
>
> while (condition) {
>     if (s.raw.front == someFrontUnitThatICareAbout) {
>         s.raw.popFront();
>         auto c = s.front;
>         s.popFront();
>     }
> }
But will s.raw.popFront() also pop a single unit from s? "raw" would
need to be defined as a reinterpret cast of the reference to the char[]
to do what I want, something like this:
ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }
The current std.string.representation doesn't do that at all.
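The difference can be sketched as follows. The `raw` function is the reinterpret cast shown above; the two `lengthAfter...` helpers are hypothetical names added only to make the behaviour observable, and the code assumes @system D as compiled by default.

```d
import std.range : front, popFront;
import std.string : representation;

// The reinterpret cast from above: the returned reference aliases the
// slice variable itself, so pointer and length are shared with `s`.
ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }

/// Pops one unit through the linked view; returns the remaining length of s.
size_t lengthAfterLinkedPop(char[] s)
{
    s.raw.popFront();   // advances s itself by one code unit
    return s.length;
}

/// Pops one unit through std.string.representation; s is unaffected.
size_t lengthAfterSeparatePop(char[] s)
{
    auto r = s.representation;  // independent ubyte[] slice over the same memory
    r.popFront();               // advances r only
    return s.length;
}
```

With the linked view, popping a unit and then calling s.front continues from the new position; with representation, the caller has to copy the position back into s by hand.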
Also, how does it work with slicing? It can work with raw, but you'll
have to cast things everywhere because raw is a ubyte[]:
string s = "éà";
s = cast(typeof(s))s.raw[0..4];
> Now that I wrote it I'm even more enthralled with the coolness of the
> scheme. You essentially have access to two separate ranges on top of
> the same fabric.
Glad you like the concept.
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/