std.d.lexer requirements

Thu Aug 2 01:44:18 PDT 2012

On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
>> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
>>> It is for ranges in general. In the general case, a range of UTF-8 or
>>> UTF-16 makes no sense whatsoever. Having range-based functions which
>>> understand the encodings and optimize accordingly can be very beneficial
>>> (which happens with strings but can't happen with general ranges without
>>> the concept of a variable-length encoded range like we have with forward
>>> range or random access range), but to actually have a range of UTF-8 or
>>> UTF-16 just wouldn't work. Range-based functions operate on elements, and
>>> doing stuff like filter or map or reduce on code units doesn't make any
>>> sense at all.
>>
>> Yes, it can work.
>
> How?

Keep a 6-character buffer in your consumer. If you read a char with the high bit
set, start filling that buffer, and decode it once the sequence is complete.
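
A rough sketch of that buffering idea, written in C++ only so that it is self-contained (the same shape carries over to a D range consumer). The names `Utf8Consumer`, `put` and `decodeAll` are made up for illustration and are not from Phobos or from any std.string version. It assumes mostly well-formed input: overlong forms and surrogates are not rejected, and current UTF-8 never uses more than 4 code units per character, so the 6-slot buffer is simply headroom.

```cpp
#include <cstddef>
#include <string>

// Hypothetical consumer: fed one UTF-8 code unit at a time, it reports a
// code point whenever a complete sequence has been collected.
struct Utf8Consumer {
    unsigned char buf[6];   // small buffer for the sequence in progress
    std::size_t len = 0;    // code units buffered so far
    std::size_t need = 0;   // total length of the sequence in progress

    // Returns true and sets `out` when a code point is ready.
    bool put(unsigned char c, char32_t& out) {
        if (len == 0) {
            if (c < 0x80) { out = c; return true; }   // high bit clear: ASCII
            if ((c & 0xE0) == 0xC0) need = 2;
            else if ((c & 0xF0) == 0xE0) need = 3;
            else if ((c & 0xF8) == 0xF0) need = 4;
            else { out = 0xFFFD; return true; }       // not a valid lead unit
            buf[len++] = c;                           // start filling the buffer
            return false;
        }
        if ((c & 0xC0) != 0x80) {
            // Expected a continuation unit. Simplification: the offending
            // unit is dropped and one replacement character is reported.
            len = 0;
            out = 0xFFFD;
            return true;
        }
        buf[len++] = c;
        if (len < need) return false;                 // sequence not complete yet

        // Sequence complete: decode the buffered code units.
        static const unsigned char leadMask[] = {0, 0, 0x1F, 0x0F, 0x07};
        char32_t cp = buf[0] & leadMask[need];
        for (std::size_t i = 1; i < need; ++i)
            cp = (cp << 6) | (buf[i] & 0x3F);
        len = 0;
        out = cp;
        return true;
    }
};

// Feed every code unit of `bytes` through the consumer.
std::u32string decodeAll(const std::string& bytes) {
    Utf8Consumer consumer;
    std::u32string result;
    char32_t cp;
    for (unsigned char c : bytes)
        if (consumer.put(c, cp)) result.push_back(cp);
    return result;
}
```

The point of the design is that the upstream range only ever hands over single code units; all knowledge of the encoding lives in the consumer's few bytes of state.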

> Do you really think that it makes sense for a function like map or filter to
> operate on individual code units? Because that's what would end up happening
> with a range of code units. Your average, range-based function only makes
> sense with _characters_, not code units. Functions which can operate on ranges
> of code units without screwing up the encoding are a rarity.

Rare or not, they are certainly possible, and the early versions of std.string 
did just that (although they weren't using ranges, the same techniques apply).
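
As an illustration of what such functions can look like (again in C++, and not taken from std.string; `asciiToUpper` and `findUnits` are hypothetical names), here are two operations that work purely on code units and still leave valid UTF-8 valid. Both lean on two properties of UTF-8: code units below 0x80 never occur inside a multi-unit sequence, and lead units and continuation units use disjoint bit patterns, so a well-formed needle can only match at a character boundary of a well-formed haystack.

```cpp
#include <cstddef>
#include <string>

// Uppercase ASCII letters only. Code units >= 0x80 fail the range check
// and pass through untouched, so multi-unit sequences stay intact.
std::string asciiToUpper(std::string s) {
    for (char& ch : s)
        if (ch >= 'a' && ch <= 'z')
            ch = static_cast<char>(ch - 'a' + 'A');
    return s;
}

// Code-unit-wise substring search. Returns the offset in code units,
// or std::string::npos if the needle does not occur.
std::size_t findUnits(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle);
}
```

For example, `asciiToUpper("caf\xC3\xA9")` changes only the three ASCII letters and leaves the two code units of the final character alone.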