string is rarely useful as a function argument

Michel Fortin michel.fortin at michelf.com
Sat Dec 31 08:47:32 PST 2011


On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> On 12/31/11 8:17 CST, Michel Fortin wrote:
>> At one time Java and other frameworks started to use UTF-16 code units
>> as if they were characters, and that turned out badly for them. Now we
>> know that not even code points should be considered characters, thanks
>> to characters spanning multiple code points. You might call it perfect,
>> but for that you have made two assumptions:
>> 
>> 1. treating code points as characters is good enough, and
>> 2. the performance penalty of decoding everything is tolerable
> 
> I'm not sure how you concluded I drew such assumptions.

1: Because treating UTF-8 strings as a range of code points encourages 
people to think so. 2: From things you posted on the newsgroup 
previously. Sorry, I don't have the references, but it'd take too long 
to dig them up.

>> Ranges of code points might be perfect for you, but it's a tradeoff that
>> won't work in every situation.
> 
> Ranges can be defined to span logical glyphs that span multiple code points.

I'm talking about the default interpretation, where string ranges are 
ranges of code points, making that tradeoff the default.
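
To make that concrete (a minimal illustration, assuming Phobos's 
behaviour as it stands, not code from any proposal): a string used as a 
range yields decoded dchar elements, i.e. code points, even though its 
length counts code units.

unittest
{
	import std.range;

	string s = "é";                     // one character, two UTF-8 code units
	static assert(is(ElementType!string == dchar));
	assert(s.length == 2);              // .length counts code units
	assert(s.front == 'é');             // front decodes to one code point
}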

And also, I think we can agree that a logical glyph range would be 
terribly inefficient in practice, although it could be a nice teaching 
tool.

>> The whole concept of generic algorithms working on strings efficiently
>> doesn't work.
> 
> Apparently std.algorithm does.

First, it doesn't really work. It seems to work fine, but it doesn't 
yet handle characters spanning multiple code points. To handle this 
case, you could use a logical glyph range, but that'd be quite 
inefficient. Or you can improve the algorithm that works on code points 
so that it checks for combining characters at the edges (see the sketch 
below), but then is it still a generic algorithm?
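
Just to illustrate what I mean by checking the edges (a rough sketch of 
the idea, not Phobos code; it leans on std.uni.combiningClass and only 
checks the trailing edge of a match):

bool endsOnGraphemeBoundary(string haystack, size_t matchEnd)
{
	import std.uni : combiningClass;
	import std.utf : decode;

	if (matchEnd >= haystack.length)
		return true;                    // match runs to the end of the string
	size_t i = matchEnd;
	dchar next = decode(haystack, i);   // code point right after the match
	return combiningClass(next) == 0;   // a combining mark here would extend
	                                    // the last matched character
}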

Second, it doesn't work efficiently. Sure, you can specialize the 
algorithm so it does not decode code units when it's not necessary 
(something like the sketch below), but then does it still qualify as a 
generic algorithm?
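
For instance, here's the kind of specialization I have in mind (a 
hypothetical helper, not something from std.algorithm): when the needle 
is a single ASCII character, a UTF-8 haystack can be scanned byte by 
byte with no decoding at all, because ASCII bytes never appear inside a 
multi-byte UTF-8 sequence.

size_t findAscii(string haystack, char needle)
{
	import std.string : representation;

	assert(needle < 0x80);                  // only valid for ASCII needles
	foreach (i, ubyte b; haystack.representation)
		if (b == needle)
			return i;                       // index in code units
	return haystack.length;                 // not found
}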

My point is that *generic* algorithms cannot work *efficiently* with 
Unicode, not that they can't work at all. And even then, for the 
inefficient generic algorithm to work correctly with all input, the 
user needs to choose the correct Unicode representation for the problem 
at hand, which requires some general knowledge of Unicode.

Which is why I'd just discourage generic algorithms for strings.


>> I'm not against making strings more opaque to encourage people to use
>> the Unicode algorithms from the standard library instead of rolling
>> their own.
> 
> I'd say we're discussing making the two kinds of manipulation (encoded 
> sequence of logical character vs. array of code units) more 
> distinguished from each other. That's a Good Thing(tm).

It's a good abstraction for showing the theory of Unicode. But it's not 
the way to go if you want efficiency. For efficiency you need, for each 
element in the string, to use the lowest abstraction required to handle 
that element, so your algorithm needs to know about the various 
abstraction layers.

This is the kind of "range" I'd use to create algorithms dealing with 
Unicode properly:

struct UnicodeRange(U)
{
	// Look at the current element at whichever abstraction level fits:
	U frontUnit() @property;               // current code unit
	dchar frontPoint() @property;          // current code point, decoded
	immutable(U)[] frontGlyph() @property; // current glyph, as a slice of units

	// Advance by one element, again at any of the three levels:
	void popFrontUnit();
	void popFrontPoint();
	void popFrontGlyph();

	...
}

Not really a range per your definition of ranges, but basically it lets 
you intermix working with units, code points, and glyphs. Add a way to 
slice at the unit level and a way to know the length at the unit level 
and it's all I need to make an efficient parser, or any algorithm 
really.
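
For instance (a usage sketch only; it assumes the interface above plus 
an empty property among the elided members): a parser can scan ASCII 
delimiters at the unit level and fall back to decoding a code point 
only when it sees a non-ASCII lead unit.

void skipIdentifier(ref UnicodeRange!char r)
{
	import std.ascii : isAlphaNum;
	import std.uni : isAlpha;

	while (!r.empty)                       // assumes an empty property
	{
		char u = r.frontUnit;
		if (u < 0x80)                      // ASCII: decide on the raw unit
		{
			if (!isAlphaNum(u) && u != '_')
				break;
			r.popFrontUnit();
		}
		else                               // non-ASCII: decode just this element
		{
			if (!isAlpha(r.frontPoint))
				break;
			r.popFrontPoint();
		}
	}
}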

The problem with .raw is that it creates a separate range for the 
units. This means you can't look at frontUnit, decide to pop the unit 
and look at the next one, decide you need to decode using frontPoint, 
then call popFrontPoint and return to looking at the front unit.

Also, I'm not sure the "glyph" part of that range is required most of 
the time, because most of the time you don't need to decode glyphs to 
be glyph-aware. But it'd be nice if you wanted to count them, and 
having it there alongside the rest makes users aware of them.
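
(As an aside, and assuming today's std.uni rather than anything 
proposed here, counting glyphs already looks something like this:

unittest
{
	import std.uni : byGrapheme;
	import std.range.primitives : walkLength;

	assert("e\u0301".byGrapheme.walkLength == 1);  // 'e' + combining acute = one glyph
}
)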

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


