Phobos Proposal: replace std.xml with kxml.

Tue May 4 18:55:53 PDT 2010

On 2010-05-04 19:41:39 -0400, Andrei Alexandrescu 
<SeeWebsiteForEmail at erdani.org> said:

>> Anyway, just in case, would you be interested in an XML tokenizer and 
>> simple DOM following this model?
>> 
>>     http://michelf.com/docs/d/mfr/xmltok.html
>>     http://michelf.com/docs/d/mfr/xml.html
>> 
>> At the base is a pull parser and an event parser mixed in the same 
>> function template: "tokenize", allowing you to alternate between 
>> even-based and pull-parsing at will. I'm using it, but its development 
>> is on hold at this time, I'm just maintaining it so it compiles on the 
>> newest versions of DMD.
> 
> Sounds great, but I need to defer XML expertise to others.

If someone else wants to use it, I offer it. Otherwise I'll surely 
continue working on it eventually.

>> Anyway, the problem above is probably the one reason we might want to 
>> write the parser from scratch: it needs to bind to specializable 
>> higher-level parsing functions to take advantage of the performance 
>> characteristics of certain ranges, such as those you can slice.
> 
> There are a number of issues. One is that you should allow wchar and 
> dchar in addition to char as basic character types (probably ubyte too 
> for exotic encodings). In essence the char type should be a template 
> parameter.

I totally agree about wchar and dchar... and you also need ubyte with 
an encoding detection system (checking the encoding in the xml prolog) 
to correctly parse XML files in any encoding. I think I have a ubyte 
parser, but it currently just accepts UTF-8 and then branch to the 
"string" version. (UTF-16 would be needed to implement correctly the 
XML spec.)

> The other is that perhaps you could be able to use zero-based slices, 
> i.e. s[0 .. i] as opposed to arbitrary slices s[i .. j]. A zero-based 
> slice can be supported better than an arbitrary one.

Yeah, well, it doesn't really work so well. You have to parse the input 
before slicing. XML doesn't tell you in advance how many characters or 
code units a string will take.

What you need is more like this:

	// Example XML Parser

	bool isAtEndOfAttributeContent(dchar char) {
		if (char == '"') return true;
		if (char == '<') throw new ParseError();
		return false;
	}

	void parseXML(T)(T input) if (IsInputRange!T) {
		[...]
		case '"':
			input.popFront(); // remove leading quote
			string content = readUntil!(isAtEndOfAttributeContent)(input);
			assert(input.front == '"');
			input.popFront(); // remove tailing quote
		[...]
	}

	// Example parsing primitive

	// String version: can slice
	string readUntil(isAtEndPredicate)(ref string input) {
		string savedInput;
		while (!input.empty && isAtEndPredicate(input.front)) {
			input.popFront();
		}
		return savedInput[0..$-input.length];
	}

	// Generic input range version: can't slice, must copy
	immutable(ElementType!T)[] readUntil(isAtEndPredicate, T)(T input) if 
(IsInputRange!T) {
		immutable(ElementType!T)[] copy; // should use appender here
		while (!input.empty) {
			dchar frontChar = input.front;
			if (isAtEndPredicate(frontChar))
				break;
			else
			copy ~= frontChar;
			input.popFront();
		}
		return copy;
	}

It's easy to appreciate the difference in performance between the 
string version and the generic version of readUntil just by looking at 
the code.

-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/