std.d.lexer requirements

Walter Bright newshound2 at digitalmars.com
Thu Aug 2 00:29:09 PDT 2012


On 8/1/2012 11:56 PM, Jonathan M Davis wrote:
> Another thing that I should point out is that a range of UTF-8 or UTF-16
> wouldn't work with many range-based functions at all. Most of std.algorithm
> and its ilk would be completely useless. Range-based functions operate on a
> range's elements, so operating on a range of code units would mean operating
> on code units, which is going to be _wrong_ almost all the time. Think about
> what would happen if you used a function like map or filter on a range of
> code units. The resultant range would be completely invalid as far as
> Unicode goes.
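
For concreteness, the failure mode described above looks like this (a
hypothetical snippet, not code from the thread): filtering raw UTF-8 code
units can split a multibyte sequence and leave behind invalid UTF-8.

import std.algorithm : filter;
import std.array : array;
import std.string : representation;
import std.utf : validate;

void main()
{
    string s = "héllo";  // 'é' is the two UTF-8 code units 0xC3 0xA9

    // Dropping the lead byte 0xC3 leaves the stray continuation
    // byte 0xA9 behind, so the result is no longer well-formed UTF-8.
    auto bytes = s.representation.filter!(b => b != 0xC3).array;

    validate(cast(string) bytes);  // throws UTFException
}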

My experience writing fast string-based code that works on UTF-8 and
correctly handles multibyte characters is that such code is entirely
possible and practical, and that it is faster.
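
For illustration, a hypothetical sketch of that byte-level style (not code
from any real lexer): the hot path looks at single code units and decodes
only when the high bit is set, so the common all-ASCII case never calls
decode at all.

import std.uni : isAlpha;
import std.utf : decode;

/// Hypothetical helper: count identifier-start characters in a UTF-8
/// string using a byte-level fast path.
size_t countIdentStarts(string s)
{
    size_t n;
    for (size_t i = 0; i < s.length; )
    {
        immutable char c = s[i];
        if (c < 0x80)  // ASCII fast path: no decoding at all
        {
            if (c == '_' || (c | 0x20) >= 'a' && (c | 0x20) <= 'z')
                ++n;
            ++i;
        }
        else  // high bit set: decode the multibyte sequence properly
        {
            immutable dchar d = decode(s, i);  // advances i past it
            if (isAlpha(d))
                ++n;
        }
    }
    return n;
}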

The lexer MUST MUST MUST be FAST FAST FAST, or it will not be useful. If it
isn't fast, serious users will eschew it and cook up their own, and
std.d.lexer will be a nice, pretty, useless toy.

I think there's some serious underestimation of how critical this is.



> Range-based functions need to operate on _characters_. Technically, not
> even code points get us there, so it's _still_ buggy. It's just a _lot_
> closer to being correct and works 99.99+% of the time.

Multi-code-point characters are quite irrelevant to the correctness of a D
lexer. Every character with structural meaning in D's grammar is ASCII, and
multi-code-point sequences can only occur inside identifiers, string
literals, and comments, where grapheme boundaries never affect where a token
begins or ends.


> If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add
> a concept of variably-length encoded ranges so that it's possible to treat
> them as both their encoding and whatever they represent (e.g. code point or
> grapheme in the case of ranges of code units).

No, this is not necessary.
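
One reason it isn't: a lexer can consume a range of code units as-is,
peeking at single units on the hot path and decoding a code point only when
it actually meets a non-ASCII unit. A hypothetical sketch (skipIdentTail is
illustrative, not a proposed API, and assumes a forward range of char that
yields raw code units rather than auto-decoding):

import std.range;
import std.uni : isAlpha;
import std.utf : decodeFront;

void skipIdentTail(R)(ref R r)
    if (isForwardRange!R && is(ElementType!R : char))
{
    while (!r.empty)
    {
        immutable char c = r.front;
        if (c < 0x80)  // ASCII: decide from the single code unit
        {
            if (!(c == '_' || c >= '0' && c <= '9'
                    || (c | 0x20) >= 'a' && (c | 0x20) <= 'z'))
                return;
            r.popFront();
        }
        else  // multibyte: decode one code point, then decide
        {
            auto lookahead = r.save;
            if (!isAlpha(decodeFront(lookahead)))
                return;
            r = lookahead;  // commit past the decoded code point
        }
    }
}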


