std.d.lexer requirements

Jonathan M Davis jmdavisProg at gmx.com
Thu Aug 2 13:26:21 PDT 2012


On Thursday, August 02, 2012 01:44:18 Walter Bright wrote:
> On 8/2/2012 1:38 AM, Jonathan M Davis wrote:
> > On Thursday, August 02, 2012 01:14:30 Walter Bright wrote:
> >> On 8/2/2012 12:43 AM, Jonathan M Davis wrote:
> >>> It is for ranges in general. In the general case, a range of UTF-8 or
> >>> UTF-16 makes no sense whatsoever. Having range-based functions which
> >>> understand the encodings and optimize accordingly can be very beneficial
> >>> (which happens with strings but can't happen with general ranges without
> >>> the concept of a variably-length encoded range like we have with forward
> >>> range or random access range), but to actually have a range of UTF-8 or
> >>> UTF-16 just wouldn't work. Range-based functions operate on elements,
> >>> and
> >>> doing stuff like filter or map or reduce on code units doesn't make any
> >>> sense at all.
> >> 
> >> Yes, it can work.
> > 
> > How?
> 
> Keep a 6 character buffer in your consumer. If you read a char with the high
> bit set, start filling that buffer and then decode it.
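
For concreteness, that scheme would look roughly like this (a sketch only; 
nextCodePoint is a hypothetical name, and it assumes an input range whose 
front yields raw UTF-8 code units as char, _not_ an auto-decoding string -- 
also note that a code point is at most 4 UTF-8 code units in modern Unicode, 
though the original design allowed up to 6):

import std.utf : decode;

dchar nextCodePoint(R)(ref R r)
{
    char[4] buf;
    buf[0] = r.front;
    r.popFront();
    if (!(buf[0] & 0x80))              // high bit clear: plain ASCII
        return buf[0];
    size_t len = 1;                    // high bit set: buffer the sequence
    while (len < buf.length && !r.empty && (r.front & 0xC0) == 0x80)
    {
        buf[len++] = r.front;          // continuation bytes are 10xxxxxx
        r.popFront();
    }
    size_t index = 0;
    return decode(buf[0 .. len], index);  // then decode the buffer
}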

And how on earth is that going to work as a range? Range-based functions 
operate on elements. They use empty, front, popFront, etc. If front doesn't 
return an element that a range-based function can operate on without caring 
what it is, then that type isn't going to work as a range. If you need the 
consumer to do something special, then that means you need to special-case 
it for that range type. And that's exactly what you're doing when you 
special-case range-based functions for strings.
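
To make that concrete, a generic range-based function only ever sees what 
the range primitives give it (a minimal sketch; countMatches is just an 
illustrative name, not Phobos API):

import std.range : isInputRange, empty, front, popFront;

// A generic consumer can only use empty/front/popFront. Whatever type
// front returns *is* the element it operates on; there is no way for it
// to treat one range type's elements specially without special-casing.
size_t countMatches(R, E)(R r, E needle)
    if (isInputRange!R)
{
    size_t n;
    for (; !r.empty; r.popFront())
    {
        if (r.front == needle)
            ++n;
    }
    return n;
}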

So, because of how front and popFront work, you have to have either a range 
of code units or a range of code points. With a range of code units, the 
element type is a code unit, so any operations that you do will operate on 
individual code units, not code points. With a range of code points, any 
operations being done will operate on code points, which _will_ require 
decoding as long as you're actually using the range API. You only make strings 
more efficient by special-casing the function for them such that it understands 
Unicode and will operate on the string in the most efficient way according to 
how that string's encoding works.
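
That special-casing looks something like this (a sketch, not Phobos code; 
countSpaces is a hypothetical name, and the fast path relies on the fact 
that an ASCII byte such as ' ' never occurs inside a multi-byte UTF-8 
sequence):

import std.range : isInputRange, ElementEncodingType, empty, front, popFront;
import std.traits : isSomeString;

size_t countSpaces(R)(R r)
    if (isInputRange!R)
{
    size_t n;
    static if (isSomeString!R
        && is(immutable ElementEncodingType!R == immutable char))
    {
        // Fast path: compare raw UTF-8 code units, no decoding needed.
        foreach (char c; r)
        {
            if (c == ' ')
                ++n;
        }
    }
    else
    {
        // Generic path: the range API decodes strings to dchar code points.
        for (; !r.empty; r.popFront())
        {
            if (r.front == ' ')
                ++n;
        }
    }
    return n;
}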

You seem to be arguing that we can somehow have a generic range API which 
operates on strings, and yet what you're saying a function using those ranges 
must do (e.g. having a buffer of multiple code units) requires that a range-
based function operate in a non-generic manner for strings. If the consumer 
has to do _anything_ special for strings, then what it's doing is non-generic 
with regards to strings.

I agree that we should be making string operations more efficient by taking 
code units into account, but I completely disagree that we can do that 
generically. At best, we could add the concept of a variable-length encoded 
range so that a range-based function could special-case such ranges and use 
the encoding where appropriate. But all that buys us is the ability to 
special-case variable-length encoded ranges _other_ than strings (since we 
can already special-case strings). And I don't think that it's even possible 
to deal with a variable-length encoded range properly without understanding 
what the encoding is, in which case we'd be dealing with special range types 
which were specifically UTF-8 encoded or UTF-16 encoded, and range-based 
functions would be special-casing _those_ rather than a generic 
variable-length encoded range.
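
If such a concept were added, it might look something like the following -- 
purely hypothetical, none of these names exist in Phobos:

import std.range : isInputRange;

// Hypothetical trait: a variable-length encoded range would be an input
// range of code points that also exposes its underlying code units, so a
// function that understands the encoding could bypass decoding. But using
// codeUnits correctly still requires knowing whether they're UTF-8,
// UTF-16, or something else entirely.
enum isVLERange(R) = isInputRange!R
    && is(R.CodeUnit)                       // code-unit type, e.g. char
    && is(typeof(R.init.codeUnits) : const(R.CodeUnit)[]); // raw view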

In either case, because the consumer must do something other than simply 
operate on front, popFront, empty, etc., you're _not_ dealing with the range 
API but rather working around it.

- Jonathan M Davis

