Phobos Proposal: replace std.xml with kxml.

Tue May 4 16:41:39 PDT 2010

Michel Fortin wrote:
> On 2010-05-04 12:09:29 -0400, Andrei Alexandrescu 
> <SeeWebsiteForEmail at erdani.org> said:
> 
>> Graham Fawcett wrote:
>>> By "adapt" do you mean writing a wrapper for an existing library, or 
>>> translating the source code of the library into D?
>>> What constitutes a "generous license" in this context? (For what it's 
>>> worth, libxml2 is under the MIT License.)
>>>
>>> Graham
>>
>> We'd need to modify the code. I haven't looked into available xml 
>> libraries so I don't know which would be eligible.
> 
> I think if you wanted to port an XML library to make use of ranges, the 
> only viable option is probably to find one based on C++ iterators. 
> Otherwise it'll look more like a rewrite than a port, and at this point 
> why not write one from scratch?

Design is also a considerable time expense, though I agree that use of 
ranges may actually improve the design too.

> Anyway, just in case, would you be interested in an XML tokenizer and 
> simple DOM following this model?
> 
>     http://michelf.com/docs/d/mfr/xmltok.html
>     http://michelf.com/docs/d/mfr/xml.html
> 
> At the base is a pull parser and an event parser mixed in the same 
> function template: "tokenize", allowing you to alternate between 
> event-based and pull parsing at will. I'm using it, but its development 
> is on hold at this time; I'm just maintaining it so it compiles on the 
> newest versions of DMD.

Sounds great, but I need to defer XML expertise to others.

> The only thing it doesn't parse at this time is inline DTDs inside the 
> doctype.
> 
> Also, it currently works only with strings, for simplicity and 
> performance. There is one issue about non-string parsing: when parsing a 
> string, it's easy to just slice the string and move it around, but if 
> you're parsing from a generic input range, you basically have to copy 
> characters one by one, which is much less efficient. So ideally the 
> algorithm should use slices whenever it can (when the input is a string).
> 
> I'm not sure yet how to attack this problem, but I'm thinking that 
> perhaps parsing primitives should be "part of" the range interface. I 
> say this in the sense that a range should provide specialized 
> implementations of primitives when it can implement them more 
> efficiently (like by slicing). You wrote a while ago about designing 
> parsing primitives; is this part of Phobos now?
> 
> Anyway, the problem above is probably the one reason we might want to 
> write the parser from scratch: it needs to bind to specializable 
> higher-level parsing functions to take advantage of the performance 
> characteristics of certain ranges, such as those you can slice.

There are a number of issues. One is that you should allow wchar and 
dchar in addition to char as basic character types (probably ubyte too 
for exotic encodings). In essence the char type should be a template 
parameter. The other is that perhaps you could use zero-based 
slices, i.e. s[0 .. i], as opposed to arbitrary slices s[i .. j]. A 
zero-based slice can be supported better than an arbitrary one.
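
A rough sketch of both points, under stated assumptions: takeName and 
isNameChar are made-up names for illustration (neither is in Phobos or 
in Michel's tokenizer), and the "name" rule is deliberately simplified 
to ASCII. The array overload is templated on the character type and 
hands back a zero-based slice s[0 .. i]; the generic overload has to 
copy element by element. ubyte input is left out here.

```d
import std.ascii : isAlphaNum;
import std.range.primitives : isInputRange, ElementType,
                              empty, front, popFront;
import std.traits : isArray, isSomeChar, Unqual;

// Simplified, ASCII-only notion of an XML name character (assumption).
bool isNameChar(dchar c)
{
    return isAlphaNum(c) || c == '_' || c == '-' || c == '.' || c == ':';
}

// Array case: Char may be (immutable) char, wchar or dchar.
// Returns the zero-based slice s[0 .. i] and advances s past it.
// Nothing is copied; indexing works on code units.
Char[] takeName(Char)(ref Char[] s)
    if (isSomeChar!Char)
{
    size_t i = 0;
    while (i < s.length && isNameChar(s[i]))
        ++i;
    auto name = s[0 .. i];
    s = s[i .. $];
    return name;
}

// Generic input range: no slicing available, so copy one by one.
auto takeName(R)(ref R r)
    if (isInputRange!R && !isArray!R && isSomeChar!(ElementType!R))
{
    import std.array : appender;

    auto buf = appender!(Unqual!(ElementType!R)[])();
    while (!r.empty && isNameChar(r.front))
    {
        buf.put(r.front);
        r.popFront();
    }
    return buf.data;
}

unittest
{
    import std.algorithm : filter;

    string s = "item id";
    assert(takeName(s) == "item" && s == " id");  // slice, no copy

    wstring w = "item id"w;
    assert(takeName(w) == "item"w);

    auto r = "item id".filter!(c => true);        // not sliceable
    assert(takeName(r) == "item"d);
}
```

The caller writes takeName(input) either way; only the overload chosen 
at compile time decides whether the result is a slice or a copy.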

Andrei