std.d.lexer requirements

Michel Fortin michel.fortin at michelf.ca
Thu Aug 2 11:17:37 PDT 2012


On 2012-08-02 12:28:03 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

> Regarding the problem at hand, it's becoming painfully obvious to me 
> that the lexer MUST do its own decoding internally.

That's not a great surprise to me. I hit the same issues when writing 
my XML parser, which is why I invented functions called frontUnit and 
popFrontUnit. I'm glad you're realizing this.
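
For readers who haven't seen them, here is a minimal sketch of what such 
helpers could look like for built-in strings. Only the names frontUnit and 
popFrontUnit come from this post; the bodies are an assumption for 
illustration, not the code from the XML parser.

```d
import std.array;
import std.range;
import std.traits : isSomeString;

// Hypothetical helper: first code unit of a string, with no UTF decoding.
auto frontUnit(S)(S s) if (isSomeString!S)
{
    return s[0];
}

// Hypothetical helper: advance by exactly one code unit.
void popFrontUnit(S)(ref S s) if (isSomeString!S)
{
    s = s[1 .. $];
}
```

With these, a parser can test frontUnit(s) against ASCII delimiters and 
skip the decoding that front does on a char[].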

> Hence, a very simple thing to do is have the entire lexer only deal 
> with ranges of ubyte. If someone passes a char[], the lexer's front end 
> can simply call s.representation and obtain the underlying ubyte[].

That's ugly, but it could work (assuming s.representation returns the 
cast range by ref). I still prefer my frontUnit and popFrontUnit 
approach though.
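
A small sketch of the suggested front end, assuming 
std.string.representation behaves as I understand it: it reinterprets 
the string's storage as ubytes without copying, and returns the slice by 
value. The function name asBytes is made up for the example.

```d
import std.string : representation;

// Hypothetical front end: view a string's storage as raw bytes, no copy.
immutable(ubyte)[] asBytes(string s)
{
    return s.representation;
}
```

If the slice does come back by value, popping elements from it leaves 
the original string where it was, so the lexer would have to map its 
position back to the char[] itself.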

In fact, any parser for which speed is important will have to bypass 
std.range's clever handling of UTF characters. Dealing simply with 
ubytes isn't enough, since in some cases you'll want to fire up the UTF 
decoder.
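
To make that concrete, here is a sketch of the mixed approach: stay on 
raw code units for ASCII and call the decoder only when a non-ASCII unit 
shows up. The function countCodePoints is invented for the example; 
std.utf.decode is the real Phobos function, used here in its 
(string, ref index) form.

```d
import std.utf : decode;

// Counts code points, decoding only when a non-ASCII code unit is seen.
size_t countCodePoints(string s)
{
    size_t i = 0;
    size_t n = 0;
    while (i < s.length)
    {
        if (s[i] < 0x80)
            ++i;                    // fast path: one ASCII code unit
        else
            cast(void) decode(s, i); // slow path: decode advances i
        ++n;
    }
    return n;
}
```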

The next issue, which I haven't seen discussed here, is that for a 
parser to be efficient it should operate on buffers. You can make it 
work with arbitrary ranges, but if you don't have a buffer you can 
slice when you need to preserve a string, you're going to have to build 
the string character by character, which is not efficient at all. But 
then you can only really return slices if the underlying representation 
is the same as the output representation, and unless your API has a 
templated output type, you're going to special case a lot of things.
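
As an illustration of the slicing case, here is a sketch where the input 
and the output are both string, so a token is just a slice of the buffer 
and nothing is allocated. The function takeWord is hypothetical.

```d
import std.ascii : isAlpha;

// Returns the leading run of ASCII letters as a slice of the input
// buffer and advances the input past it. No allocation, no copying.
string takeWord(ref string input)
{
    size_t i = 0;
    while (i < input.length && isAlpha(input[i]))
        ++i;
    string word = input[0 .. i];
    input = input[i .. $];
    return word;
}
```

With a generic input range there is nothing to slice, so the same 
function would have to append each character to a new string.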

After having attempted an XML parser with ranges, I'm not sure parsing 
using generic ranges can be made very efficient. Automatic conversion 
to UTF-32 is a nuisance for performance, and if the output needs to 
return parts of the input, you'll need to create an inefficient special 
case just to allocate many new strings in the correct format.

I wonder how your call with Walter will turn out.

-- 
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca/


