std.d.lexer requirements

Dmitry Olshansky dmitry.olsh at gmail.com
Thu Aug 2 08:46:49 PDT 2012


On 02-Aug-12 12:44, Walter Bright wrote:
> On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
>> On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
>>> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
>>>> It is for ranges in general. In the general case, a range of UTF-8 or
>>>> UTF-16 makes no sense whatsoever. Having range-based functions which
>>>> understand the encodings and optimize accordingly can be very
>>>> beneficial (which happens with strings but can't happen with general
>>>> ranges without the concept of a variably-length encoded range like we
>>>> have with forward range or random access range), but to actually have
>>>> a range of UTF-8 or UTF-16 just wouldn't work. Range-based functions
>>>> operate on elements, and doing stuff like filter or map or reduce on
>>>> code units doesn't make any sense at all.
>>>
>>> Yes, it can work.
>>
>> How?
>
> Keep a 6 character buffer in your consumer. If you read a char with the
> high bit set, start filling that buffer and then decode it.
>
4 bytes is enough.

The range of code points is 0...0x10FFFF (the limit of what UTF-16 can
address), and since RFC 3629 (2003) UTF-8 is restricted to that same
range, so any code point encodes in at most 4 bytes of UTF-8.
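
For illustration, here is a rough sketch of the consumer-side buffer idea with a 4-byte buffer. It is in Python rather than D just to keep it short, and the name decode_code_units is made up for this example; it is not anything from Phobos or the proposed std.d.lexer.

```python
def decode_code_units(units):
    """Yield code points from an iterable of UTF-8 code units (ints 0..255).

    Hypothetical helper for illustration only. The buffer never holds
    more than 4 code units, the longest well-formed UTF-8 sequence.
    """
    buf = bytearray()  # code units of the sequence being collected
    need = 0           # total length of that sequence, from the lead byte

    for u in units:
        if not buf:
            if u < 0x80:
                # High bit clear: plain ASCII, no buffering needed.
                yield u
                continue
            # High bit set: the lead byte tells us the sequence length.
            if 0xC2 <= u <= 0xDF:
                need = 2
            elif 0xE0 <= u <= 0xEF:
                need = 3
            elif 0xF0 <= u <= 0xF4:
                need = 4
            else:
                raise ValueError("invalid UTF-8 lead byte 0x%02X" % u)
            buf.append(u)
        else:
            if u & 0xC0 != 0x80:
                raise ValueError("expected continuation byte, got 0x%02X" % u)
            buf.append(u)
            if len(buf) == need:
                # Strict decoding also rejects overlong forms, surrogates
                # and anything above 0x10FFFF.
                yield ord(buf.decode("utf-8"))
                buf.clear()

    if buf:
        raise ValueError("input ended in the middle of a sequence")
```

Feeding it the code units of "é€" (that is, "é€".encode("utf-8")) yields the two code points 0xE9 and 0x20AC, and chr(0x10FFFF).encode("utf-8") is exactly 4 bytes long, so the buffer never has to be larger than that.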


P.S. Looks like I'm too late for this party ;)


-- 
Dmitry Olshansky

