std.d.lexer requirements

Walter Bright newshound2 at digitalmars.com
Thu Aug 2 19:40:13 PDT 2012


On 8/2/2012 3:38 PM, Jonathan M Davis wrote:
> On Thursday, August 02, 2012 15:14:17 Walter Bright wrote:
>> Remember, it's the consumer doing the decoding, not the input range.
>
> But that's the problem. The consumer has to treat character ranges specially
> to make this work. It's not generic. If it were generic, then it would simply
> be using front, popFront, etc. It's going to have to special case strings to
> do the buffering that you're suggesting. And if you have to special case
> strings, then how is that any different from what we have now?

No, the consumer can do its own buffering. It only needs a 4-character buffer at 
worst, since a UTF-8 sequence is at most 4 code units long.
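For concreteness, a consumer-side decoder along those lines might look like this (a sketch, not any proposed std.d.lexer API; `nextCodePoint` is a hypothetical name, and it does no validation of malformed sequences):

```d
import std.range.primitives : empty, front, popFront;

// Hypothetical sketch: pull one code point out of any input range of
// UTF-8 code units. The worst case is a 4-byte sequence, so the lead
// byte plus at most 3 continuation reads always suffices -- that is the
// whole "4 character buffer". No validation of malformed input is done.
dchar nextCodePoint(R)(ref R r)
{
    immutable ubyte b = cast(ubyte) r.front;
    r.popFront();
    if (b < 0x80)
        return b;                              // ASCII fast path
    // Number of continuation bytes: 1, 2, or 3.
    immutable n = b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
    dchar c = b & (0x3F >> n);                 // payload bits of the lead byte
    foreach (_; 0 .. n)
    {
        c = (c << 6) | (cast(ubyte) r.front & 0x3F);
        r.popFront();
    }
    return c;
}
```

Called in a loop, this gives the consumer code-point-at-a-time access while the input itself stays a plain range of code units.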


> If you're arguing that strings should be treated as ranges of code units, then
> pretty much _every_ range-based function will have to special case strings to
> even work correctly - otherwise it'll be operating on individual code units
> rather than code points (e.g. filtering code units rather than code points,
> which would generate an invalid string). This makes the default behavior
> incorrect, forcing _everyone_ to special case strings _everywhere_ if they
> want correct behavior with ranges which are strings. And efficiency means
> nothing if the result is wrong.
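The quoted failure mode is easy to reproduce (illustrative snippet, not from the original post):

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : filter;
import std.array : array;
import std.exception : assertThrown;
import std.utf : validate;

void main()
{
    string s = "café";

    // Today's default: ranging over a string decodes, so the predicate
    // sees whole code points and the result is a valid "caf".
    assert(equal(s.filter!(c => c != 'é'), "caf"));

    // Filtering the raw code units instead can drop half of a multi-byte
    // sequence: removing the 0xC3 lead byte of 'é' leaves a stray 0xA9
    // continuation byte, i.e. an invalid UTF-8 string.
    auto bytes = cast(immutable(ubyte)[]) s;
    auto bad = cast(string) bytes.filter!(b => b != 0xC3).array;
    assertThrown(validate(bad));
}
```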

No, I'm arguing that the LEXER should accept a UTF-8 input range as its input. I 
am not making a general argument about ranges, characters, or Phobos.


> As it is now, the default behavior of strings with range-based functions is
> correct but inefficient, so at least we get correct code. And if someone wants
> their string processing to be efficient, then they special case strings and do
> things like the buffering that you're suggesting. So, we have correct by
> default with efficiency as an option. The alternative that you seem to be
> suggesting (making strings be treated as ranges of code units) means that it
> would be fast by default but correct as an option, which is completely
> backwards IMHO. Efficiency is important, but it's pointless how efficient
> something is if it's wrong, and expecting that your average programmer is
> going to write Unicode-aware code which functions correctly is completely
> unrealistic.

Efficiency for the *lexer* is of *paramount* importance. I don't anticipate 
std.d.lexer will be implemented by some random newbie, I expect it to be 
carefully implemented and to do Unicode correctly, regardless of how difficult 
or easy that may be.
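One reason a carefully implemented lexer can be both fast and Unicode-correct: virtually every code unit in real D source is ASCII, so the lexer can dispatch on raw bytes and pay for decoding only on the rare lead byte >= 0x80. A sketch (hypothetical helper, not the std.d.lexer API):

```d
// Hypothetical sketch: scan an identifier over raw UTF-8 code units.
// ASCII characters are classified with direct byte tests; only a byte
// >= 0x80 (a possible non-ASCII identifier character) would force real
// decoding, stubbed out here as a bail-out.
size_t scanIdentifier(const(ubyte)[] input)
{
    size_t i = 0;
    while (i < input.length)
    {
        immutable b = input[i];
        if (b == '_' || (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
            || (i > 0 && b >= '0' && b <= '9'))
        {
            ++i;                // ASCII fast path: no decoding at all
        }
        else if (b >= 0x80)
        {
            // Rare path: a real lexer decodes here and consults the
            // Unicode identifier tables. Omitted in this sketch.
            break;
        }
        else
            break;              // punctuation, whitespace, etc.
    }
    return i;                   // length of the identifier in code units
}
```

The hot loop never decodes, yet nothing stops the rare path from being fully Unicode-correct.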

I seem to utterly fail at making this point.

The same point applies to std.regex - efficiency is terribly, terribly important 
for it. Everyone judges regexes by their speed, and nobody cares how hard they 
are to implement to get that speed.

To reiterate another point, since we are in the compiler business, people will 
expect std.d.lexer to be of top quality, not some bag on the side. It needs to 
be usable as a base for writing a professional quality compiler. It's the reason 
why I'm pushing much harder on this than I do for other modules.

