Replacing std.xml

Michel Fortin michel.fortin at michelf.ca
Thu Aug 29 18:31:48 PDT 2013


On 2013-08-29 17:38:23 +0000, "Jonathan M Davis" <jmdavisProg at gmx.com> said:

> Well, as I said, I couldn't remember exactly what the XML standard said about
> encodings, but if it can contain non-ASCII characters, then my first
> inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the
> fact that that's what we support in the language and in Phobos (as I
> understand it, std.encodings is a bit of a joke that needs to be rethought and
> replaced, but regardless, it's the only Phobos module supporting any non-
> Unicode encodings).

The XML standard says that an XML parser MUST support UTF-8 and UTF-16, 
and MAY support other encodings.

Supporting non-UTF-8 encodings is a separate problem from parsing XML, 
and proper code for that would have much broader applications. Keep in 
mind that the more encodings you support, the more bloat you add to the 
executable, so there's a tradeoff to be made. In many cases, UTF-8 is 
enough, while in many others it's not.

(My XML implementation has a function that parses the XML prolog and 
tells you the encoding so you can take the appropriate code path before 
feeding the parser. A higher level API could handle encodings 
automatically based on that.)
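
To make the idea of sniffing the prolog concrete, here is a minimal sketch in Python. It is not the code from the implementation mentioned above: the name `sniff_xml_encoding` and the returned labels are assumptions made for illustration. It checks for a byte-order mark, then for BOM-less UTF-16, then for an `encoding` pseudo-attribute in an ASCII-compatible declaration, and otherwise falls back to UTF-8. EBCDIC and BOM-less UTF-32 are deliberately left out.

```python
import re

# Matches an XML declaration with an encoding pseudo-attribute, e.g.
#   <?xml version="1.0" encoding="ISO-8859-1"?>
# Only meaningful when the document starts in an ASCII-compatible encoding.
_DECL_ENCODING = re.compile(
    rb"""<\?xml[^>]*?\sencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes.

    Hypothetical helper for illustration; returns a lowercase label.
    """
    # UTF-32 BOMs must be tested before UTF-16, since FF FE 00 00
    # also begins with the UTF-16 little-endian BOM.
    if data.startswith((b"\xff\xfe\x00\x00", b"\x00\x00\xfe\xff")):
        return "utf-32"
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # No BOM: "<?" encoded as UTF-16 has a recognizable zero-byte pattern.
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    # ASCII-compatible start: trust the declaration if it names an encoding.
    match = _DECL_ENCODING.match(data)
    if match:
        return match.group(2).decode("ascii").lower()
    # No BOM and no declared encoding: UTF-8 is the default.
    return "utf-8"
```

A caller would run this on the first few dozen bytes, pick a transcoding path (or reject the document) based on the result, and only then hand the data to the parser.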


> However, because all of the XML special symbols should be ASCII, you should
> still be able to avoid decoding characters for the most part. It's only when
> you have to actually look at the content that Unicode would potentially
> matter. So, the performance hit of decoding Unicode characters should mostly
> be able to be avoided.

Just like my XML implementation does. (I made frontUnit/popFrontUnit 
functions I'm using when decoding code points is unnecessary.)
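
For readers unfamiliar with the trick, the following Python sketch shows why working on code units is safe. It is an illustration only, not a port of frontUnit/popFrontUnit, and `split_markup` is a hypothetical name. In UTF-8, every byte of a multi-byte sequence is 0x80 or higher, so an ASCII byte such as `<` or `>` can never occur inside an encoded non-ASCII character. The scanner can therefore compare raw bytes and leave the text content undecoded. The sketch ignores comments, CDATA sections, and `>` inside attribute values.

```python
def split_markup(data: bytes):
    """Yield ('tag', bytes) and ('text', bytes) chunks from UTF-8 input.

    Hypothetical helper for illustration. Nothing is decoded: markup
    delimiters are ASCII, so they are located by comparing code units.
    """
    pos = 0
    end_of_data = len(data)
    while pos < end_of_data:
        if data[pos] == 0x3C:  # '<' starts a tag
            close = data.find(b">", pos)
            if close == -1:
                raise ValueError("unterminated tag")
            yield "tag", data[pos:close + 1]
            pos = close + 1
        else:
            # Text runs until the next '<' or the end of the input.
            nxt = data.find(b"<", pos)
            if nxt == -1:
                nxt = end_of_data
            yield "text", data[pos:nxt]
            pos = nxt
```

Only the caller that actually needs the characters of a text chunk has to decode it, for example with `chunk.decode("utf-8")`.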


-- 
Michel Fortin
michel.fortin at michelf.ca
http://michelf.ca



More information about the Digitalmars-d mailing list