Replacing std.xml

H. S. Teoh hsteoh at quickfur.ath.cx
Thu Aug 29 11:57:33 PDT 2013


On Thu, Aug 29, 2013 at 01:38:23PM -0400, Jonathan M Davis wrote:
[...]
> Well, as I said, I couldn't remember exactly what the XML standard said about 
> encodings, but if it can contain non-ASCII characters, then my first 
> inclination is to say that it has to be UTF-8, UTF-16, or UTF-32 based on the 
> fact that that's what we support in the language and in Phobos

Take a look here:

	http://www.w3schools.com/xml/xml_encoding.asp

XML files can have *any* valid encoding, including nastiness like
windows-1252 and relics like iso-8859-1. Unfortunately, I don't think we
have a way around this, since existing XML files out there probably
already have all of these encodings and more, and std.xml is gonna hafta
support 'em all. Otherwise we're gonna get irate users complaining "why
can't std.xml parse my oddly-encoded-but-standards-compliant XML file?!"
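
To make the problem concrete, here is a rough sketch of how a parser
might read the declared encoding out of the XML declaration before it
decodes anything. The function name declaredEncoding is made up for
illustration and is not part of std.xml. It assumes an ASCII-compatible
encoding (so no UTF-16/UTF-32 and no BOM handling), and it ignores the
optional whitespace that the XML grammar allows around the '=' sign.

```d
import std.algorithm.searching : countUntil, startsWith;
import std.string : representation;

/// Hypothetical helper: returns the name given in encoding="..." in the
/// XML declaration, or null if there is no declaration or no encoding.
string declaredEncoding(const(ubyte)[] data)
{
    // The declaration, if present, must be the very first thing in the file.
    if (!data.startsWith("<?xml".representation))
        return null;

    // Limit the search to the declaration itself, i.e. up to the first '>'.
    auto end = data.countUntil(cast(ubyte) '>');
    if (end < 0)
        return null;
    auto decl = data[0 .. end];

    // Assumption: written as encoding="..." with no spaces around '='.
    auto key = "encoding=".representation;
    auto pos = decl.countUntil(key);
    if (pos < 0)
        return null;

    auto rest = decl[pos + key.length .. $];
    if (rest.length == 0 || (rest[0] != '"' && rest[0] != '\''))
        return null;

    // The value runs up to the matching quote character.
    const ubyte quote = rest[0];
    rest = rest[1 .. $];
    auto close = rest.countUntil(quote);
    if (close < 0)
        return null;

    // Encoding names are plain ASCII, so viewing the bytes as chars is safe.
    return (cast(const(char)[]) rest[0 .. close]).idup;
}
```

Whatever name comes back still has to be mapped to an actual decoder,
which is exactly where the missing encoding support bites.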


> (as I understand it, std.encoding is a bit of a joke that needs to be
> rethought and replaced, but regardless, it's the only Phobos module
> supporting any non-Unicode encodings).

No kidding! I was trying to write a program that navigates a website
automatically using std.net.curl, and I'm running into all sorts of
silly roadblocks, including std.encoding not supporting the iso-8859-*
encodings beyond ISO-8859-1 (Latin1String).

The good news is that on Linux, there's a handy utility called 'recode',
which comes with a library called 'librecode', that supports converting
between a huge number of different encodings -- many more than probably
you or I have imagined existed -- including to/from Unicode.  I know we
don't like including external libraries in Phobos, but I honestly don't
see any justification for reinventing the wheel by writing (and
maintaining!) our own equivalent to librecode, unless licensing issues
prevent us from including librecode in Phobos, nicely wrapped in a
modern range-based D API.


> However, because all of the XML special symbols should be ASCII, you
> should still be able to avoid decoding characters for the most part.
> It's only when you have to actually look at the content that Unicode
> would potentially matter. So, the performance hit of decoding Unicode
> characters should mostly be able to be avoided.
[...]

One way is to write the core code of std.xml in such a way that it
handles all data as ubyte[] (or ushort[]/uint[] for 16-bit/32-bit
encodings) so that it's encoding-independent. Then on top of this core,
write some convenience wrappers that cast/convert to string, wstring,
dstring. As an initial stab, we could support only UTF-8, UTF-16, UTF-32
if the user asks for string/wstring/dstring, and leave XML in other
encodings up to the user to decode manually. This way, at least the user
can get the data out of the file.
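
A minimal sketch of that layering, under some loud assumptions: the
names rawTextBeforeTag and asUtf8 are invented here and are not an
existing or proposed std.xml API, and the byte-level scan is only valid
for encodings in which the byte 0x3C always means '<' (UTF-8 and the
single-byte ASCII supersets). A 16-bit core would do the same thing
over ushort[].

```d
import std.algorithm.searching : countUntil;
import std.utf : validate;

/// Core layer (hypothetical): returns the raw character data that
/// precedes the next '<'. No decoding happens here, only a byte scan.
const(ubyte)[] rawTextBeforeTag(const(ubyte)[] input)
{
    auto i = input.countUntil(cast(ubyte) '<');
    return i < 0 ? input : input[0 .. i];
}

/// Convenience wrapper (hypothetical): turns raw bytes into a string,
/// assuming the document is UTF-8. Other encodings are left to the
/// caller, who still has the raw bytes to decode by hand.
string asUtf8(const(ubyte)[] raw)
{
    auto s = cast(const(char)[]) raw;
    validate(s); // throws UTFException if the bytes are not valid UTF-8
    return s.idup;
}
```

The point of the split is that only asUtf8 pays for validation, and
only on the slices the user actually asks for as a string.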

Later on, once we've gotten our act together with std.encoding, we can
hook it up to std.xml to provide autoconversion.


T

-- 
Almost all proofs have bugs, but almost all theorems are true. -- Paul Pedersen

