std.d.lexer requirements

Thu Aug 2 15:14:17 PDT 2012

On 8/2/2012 1:26 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
>> On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
>>> On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
>>>> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
>>>>> It is for ranges in general. In the general case, a range of UTF-8 or
>>>>> UTF-16 makes no sense whatsoever. Having range-based functions which
>>>>> understand the encodings and optimize accordingly can be very beneficial
>>>>> (which happens with strings but can't happen with general ranges without
>>>>> the concept of a variable-length encoded range like we have with forward
>>>>> range or random access range), but to actually have a range of UTF-8 or
>>>>> UTF-16 just wouldn't work. Range-based functions operate on elements, and
>>>>> doing stuff like filter or map or reduce on code units doesn't make any
>>>>> sense at all.
>>>>
>>>> Yes, it can work.
>>>
>>> How?
>>
>> Keep a 6 character buffer in your consumer. If you read a char with the high
>> bit set, start filling that buffer and then decode it.
>
> And how on earth is that going to work as a range?

1. read a character (a UTF-8 code unit) from the range
2. if the character is the start of a multibyte character, put the character in
the buffer
3. keep reading from the range until you've got the whole of the multibyte character
4. convert that 6 (or 4) character buffer into a dchar

Remember, it's the consumer doing the decoding, not the input range.
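
The four steps above can be sketched roughly as follows. This is only an illustration, not the proposed std.d.lexer code: decodeOne is a hypothetical name, the buffer is 4 chars because current UTF-8 sequences are at most 4 code units long, and error handling is reduced to plain exceptions. The usage example assumes std.utf.byCodeUnit from current Phobos to get an input range whose element type is char.

```d
import std.range.primitives : isInputRange, ElementType;
import std.traits : Unqual;

// Hypothetical helper (not part of Phobos): the consumer pulls one code
// point out of an input range of UTF-8 code units and advances the range.
dchar decodeOne(R)(ref R r)
    if (isInputRange!R && is(Unqual!(ElementType!R) == char))
{
    assert(!r.empty, "nothing to decode");

    // Step 1: read one character (code unit) from the range.
    immutable char lead = r.front;
    r.popFront();

    // High bit clear: plain ASCII, nothing to buffer.
    if (lead < 0x80)
        return lead;

    // Step 2: a lead byte starts a multibyte character; its top bits
    // give the sequence length. Put it in the buffer.
    uint len;
    if      ((lead & 0xE0) == 0xC0) len = 2;
    else if ((lead & 0xF0) == 0xE0) len = 3;
    else if ((lead & 0xF8) == 0xF0) len = 4;
    else throw new Exception("invalid UTF-8 lead byte");

    char[4] buf;
    buf[0] = lead;

    // Step 3: keep reading until the whole multibyte character is buffered.
    foreach (i; 1 .. len)
    {
        if (r.empty)
            throw new Exception("truncated UTF-8 sequence");
        immutable char c = r.front;
        if ((c & 0xC0) != 0x80)
            throw new Exception("invalid UTF-8 continuation byte");
        buf[i] = c;
        r.popFront();
    }

    // Step 4: convert the buffer into a dchar.
    // (Overlong forms and surrogates are not rejected in this sketch.)
    uint cp = buf[0] & (0xFF >> (len + 1));
    foreach (i; 1 .. len)
        cp = (cp << 6) | (buf[i] & 0x3F);
    return cast(dchar) cp;
}

void main()
{
    import std.utf : byCodeUnit;

    // 'a' (1 code unit), U+00E9 (2 code units), U+20AC (3 code units)
    auto r = "a\u00E9\u20AC".byCodeUnit;
    assert(decodeOne(r) == 'a');
    assert(decodeOne(r) == '\u00E9');
    assert(decodeOne(r) == '\u20AC');
    assert(r.empty);
}
```

The range itself only ever hands out chars; all knowledge of the encoding lives in the consumer.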

> I agree that we should be making string operations more efficient by taking code
> units into account, but I completely disagree that we can do that generically.

The requirement I listed was that the input range present UTF-8 characters, not
any random character type.