Status of std.xml (D2/Phobos)

Mon Jun 28 11:46:08 PDT 2010

On 2010-06-28 14:27:13 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

>> Here's the generated documentation:
>> 
>> http://michelf.com/docs/d/mfr/xmltok.html
>> http://michelf.com/docs/d/mfr/xml.html
>> 
>> I'm slowly revamping it to use ranges instead of strings.
> 
> I think a tokenizer should be a higher-order range that is fed an input 
> range of ubyte, char, wchar, or dchar (so that would be a type 
> parameter) and is itself a range of Tokens that include the token type, 
> token value etc.

I've implemented a tokenizer range like the one you describe on top of 
my tokenizer function. Look at the documentation for 
mfr.xmltok.XMLForwardRange. (I should probably rename it to 
XMLTokenRange.)
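
To give a rough idea of the shape of such a range, here is a minimal 
sketch. It is not the mfr.xmltok code or its API: ToyTokenRange, Token 
and TokenType are made-up names, the input is assumed to be a string, 
and only <name>, </name> and character data are recognized (no 
attributes, comments, entities or error handling). It targets a current 
D compiler.

```d
import std.range.primitives : isInputRange;
import std.string : indexOf;

enum TokenType { openTag, closeTag, text }

struct Token
{
    TokenType type;
    string value;   // a slice of the input string, never a copy
}

// Hypothetical input range of Tokens over a string.
struct ToyTokenRange
{
    private string rest;     // input not yet tokenized
    private Token current;   // token returned by front
    private bool done;       // true once the input is exhausted

    this(string input)
    {
        rest = input;
        popFront();          // prime the first token
    }

    @property bool empty() { return done; }
    @property Token front() { return current; }

    void popFront()
    {
        if (rest.length == 0)
        {
            done = true;
            return;
        }
        if (rest[0] == '<')
        {
            auto close = rest.indexOf('>');
            // An unterminated tag is taken to run to the end of input.
            size_t end = close < 0 ? rest.length : cast(size_t) close;
            bool isClose = end > 1 && rest[1] == '/';
            size_t start = isClose ? 2 : 1;
            current = Token(isClose ? TokenType.closeTag
                                    : TokenType.openTag,
                            rest[start .. end]);
            // Skip the '>' if there was one.
            rest = rest[(end < rest.length ? end + 1 : end) .. $];
        }
        else
        {
            auto open = rest.indexOf('<');
            size_t end = open < 0 ? rest.length : cast(size_t) open;
            current = Token(TokenType.text, rest[0 .. end]);
            rest = rest[end .. $];
        }
    }
}

static assert(isInputRange!ToyTokenRange);
```

Since it satisfies isInputRange, it can be used with foreach or passed 
to range algorithms, e.g. foreach (tok; ToyTokenRange("<a>hi</a>")).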

Personally, I prefer the callback approach, which automatically calls 
the right function according to the token type. But what's nice about 
my tokenizer is that you can do both callbacks and pull-style 
tokenization (the latter can be wrapped in a range), and mix the two 
approaches as needed.
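
One way to picture mixing the two styles is a small dispatcher that 
pulls a single token from any token range and calls the matching 
callback. This is only a sketch under assumptions: Handlers and 
dispatchOne are hypothetical names, Token is a simplified stand-in, and 
none of it reflects how mfr.xmltok actually exposes its callbacks.

```d
import std.range.primitives;

enum TokenType { openTag, closeTag, text }

struct Token
{
    TokenType type;
    string value;
}

// Hypothetical handler set; a null delegate means "ignore this
// token type".
struct Handlers
{
    void delegate(string name) onOpenTag;
    void delegate(string name) onCloseTag;
    void delegate(string content) onText;
}

// Pull one token from the range and dispatch it to the matching
// callback. Returns false if the range was already empty.
bool dispatchOne(R)(ref R tokens, Handlers h)
    if (isInputRange!R && is(ElementType!R : Token))
{
    if (tokens.empty)
        return false;

    Token t = tokens.front;
    tokens.popFront();

    final switch (t.type)
    {
        case TokenType.openTag:
            if (h.onOpenTag !is null) h.onOpenTag(t.value);
            break;
        case TokenType.closeTag:
            if (h.onCloseTag !is null) h.onCloseTag(t.value);
            break;
        case TokenType.text:
            if (h.onText !is null) h.onText(t.value);
            break;
    }
    return true;
}
```

Because dispatchOne advances the range by exactly one token, the caller 
can alternate between it and direct front/popFront calls on the same 
range, which is the kind of mixing described above.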

What is missing is taking arbitrary ranges as input (it only deals with 
strings currently). Strings are the optimized case for tokenization 
because you don't have to dynamically allocate anything: a substring is 
just a slice referencing the original string. With arbitrary ranges you 
have to copy the text and tag names to a new string one character at a 
time, which is less efficient. I don't want to write two separate 
parsers for this, so I'm trying to abstract things at the right level 
to maximize code reuse while keeping performance optimized for the 
string-as-input case. How to do that is not so obvious.
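
To illustrate the trade-off, here is one possible shape for such an 
abstraction: a single helper with a slicing fast path for strings and a 
copying path for everything else, selected at compile time with static 
if. This is a sketch, not a solution I'm claiming for the parser: 
readWhile is a hypothetical name, and the string path scans by UTF-8 
code unit, which is only safe when the predicate tests for ASCII 
delimiters such as '<' or '>'.

```d
import std.array : appender;
import std.range.primitives;

// Hypothetical helper: consume the leading run of characters that
// satisfy pred and return it as a string.
string readWhile(alias pred, R)(ref R input)
    if (isInputRange!R)
{
    static if (is(R == string))
    {
        // Fast path: return a slice of the original string,
        // no allocation and no copying.
        size_t i = 0;
        while (i < input.length && pred(input[i]))
            ++i;
        string result = input[0 .. i];
        input = input[i .. $];
        return result;
    }
    else
    {
        // General path: copy into a new string one character
        // at a time.
        auto buf = appender!string();
        while (!input.empty && pred(input.front))
        {
            buf.put(input.front);
            input.popFront();
        }
        return buf.data;
    }
}
```

The caller writes readWhile!(c => c != '>')(input) either way; with a 
string the result aliases the input, with any other character range it 
is a fresh allocation.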

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/