std.d.lexer requirements

Jonathan M Davis jmdavisProg at gmx.com
Wed Aug 1 23:56:59 PDT 2012


On Wednesday, August 01, 2012 21:52:15 Jonathan M Davis wrote:
> And as much as there are potential performance issues with Phobos' choice of
> treating strings as ranges of dchar, if it were to continue to treat them
> as ranges of code units, it's pretty much a guarantee that there would be a
> _ton_ of bugs caused by it. Most programmers have absolutely no clue about
> how unicode works and don't really want to know. They want strings to just
> work. Phobos' approach of defaulting to correct but making it possible to
> make the code faster through specializations is very much in line with D's
> typical approach of making things safe by default but allowing the
> programmer to do unsafe things for optimizations when they know what
> they're doing.

Another thing that I should point out is that a range of UTF-8 or UTF-16 
wouldn't work with many range-based functions at all. Most of std.algorithm 
and its ilk would be completely useless. Range-based functions operate on a 
range's elements, so operating on a range of code units would mean operating 
on code units, which is going to be _wrong_ almost all the time. Think about 
what would happen if you used a function like map or filter on a range of 
code units. The resultant range would be completely invalid as far as Unicode 
goes. Range-based functions need to be operating on _characters_. Technically, 
not even code points get us there, so it's _still_ buggy. It's just a _lot_ 
closer to being correct and works 99.99+% of the time.
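The point above can be demonstrated concretely. The sketch below (in Python, 
since its bytes/str split maps directly onto the code-unit/code-point 
distinction) shows a range operation going wrong at the code-unit level, 
working at the code-point level, and still breaking at the grapheme level; 
the sample strings are my own, not from the original post:

```python
s = "café"                      # U+00E9 encoded as two UTF-8 code units
units = s.encode("utf-8")       # b'caf\xc3\xa9' -- the code-unit "range"

# Operating on code units: reversing them splits the multi-byte sequence
# for 'é', leaving a byte string that is no longer valid UTF-8 at all.
try:
    bytes(reversed(units)).decode("utf-8")
    unit_level_valid = True
except UnicodeDecodeError:
    unit_level_valid = False
print(unit_level_valid)         # False: the result isn't valid Unicode

# Operating on code points works here:
print("".join(reversed(s)))     # 'éfac'

# ...but code points still aren't characters. Spell 'é' as
# 'e' + U+0301 (combining acute) and reversal detaches the accent:
g = "cafe\u0301"
print("".join(reversed(g)))     # combining mark now leads the string
```

The same hierarchy applies to map, filter, and friends: any element-wise 
operation on code units can tear multi-unit sequences apart, while 
code-point-level operations only fail on multi-code-point graphemes.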

If we want to be able to operate on ranges of UTF-8 or UTF-16, we need to add 
a concept of variably-length encoded ranges so that it's possible to treat 
them as both their encoding and whatever they represent (e.g. code point or 
grapheme in the case of ranges of code units).
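One way to picture such a variably-length encoded range: a single object over 
the raw encoding that forces the caller to pick the level it wants to iterate 
at. A minimal Python sketch, with hypothetical names of my own choosing (the 
post doesn't propose a specific API):

```python
class EncodedRange:
    """Wraps UTF-8 code units and exposes both views explicitly."""

    def __init__(self, data: bytes):
        self.data = data          # underlying UTF-8 code units

    def by_code_unit(self):
        # Iterate the raw encoding -- fast, but Unicode-unsafe.
        return iter(self.data)

    def by_code_point(self):
        # Iterate decoded code points -- correct for most operations.
        return iter(self.data.decode("utf-8"))

r = EncodedRange("café".encode("utf-8"))
print(list(r.by_code_unit()))   # [99, 97, 102, 195, 169] -- 5 units
print(list(r.by_code_point()))  # ['c', 'a', 'f', 'é']    -- 4 points
```

A full design would add a grapheme-level view as well; the key property is 
that no algorithm silently gets the wrong element type by default.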

- Jonathan M Davis


More information about the Digitalmars-d mailing list