Replacing std.xml

Jonathan M Davis jmdavisProg at gmx.com
Thu Aug 29 10:38:23 PDT 2013


On Thursday, August 29, 2013 15:20:39 Jacob Carlborg wrote:
> On 2013-08-29 11:23, Jonathan M Davis wrote:
> > IIRC, everything in XML is
> > ASCII anyway, with stuff like HTML codes to indicate Unicode characters.
> > And if that's the case, avoiding unnecessary decoding is trivial when
> > operating on strings.
> 
> What! I hardly believe that. That might be the case for HTML but I don't
> think it is for XML. There are many file formats that are based on XML.
> I don't think all those use HTML codes.
> 
> This is what W3 Schools says:
> 
> "XML documents can contain non ASCII characters, like Norwegian æ ø å ,
> or French ê è é.
> 
> To avoid errors, specify the XML encoding, or save XML files as Unicode.".

Well, as I said, I couldn't remember exactly what the XML standard says about
encodings. But if XML can contain non-ASCII characters, then my first
inclination is to say that it has to be UTF-8, UTF-16, or UTF-32, since that's
what we support in the language and in Phobos. (As I understand it,
std.encoding is a bit of a joke that needs to be rethought and replaced, but
regardless, it's the only Phobos module supporting any non-Unicode encodings.)

However, because all of the XML special symbols should be ASCII, you should
still be able to avoid decoding characters for the most part. It's only when
you actually have to look at the content that Unicode would potentially
matter. So, it should be possible to avoid most of the performance hit of
decoding Unicode characters.
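
To illustrate the point about skipping decoding, here is a minimal sketch in
Python (not D, and not how std.xml or any replacement actually works). It
assumes the input is UTF-8, where every byte of a multi-byte sequence is 0x80
or higher, so a byte equal to '<', '>', or '&' is always that ASCII character.
The helper name markup_offsets is hypothetical.

```python
# Hypothetical helper for illustration only; not part of any real XML library.
MARKUP = b"<>&"


def markup_offsets(data: bytes) -> list:
    """Return the byte offsets of '<', '>' and '&' in UTF-8 encoded data.

    Assumption: data is valid UTF-8. Bytes inside multi-byte sequences are
    always >= 0x80, so comparing raw bytes against ASCII markup characters
    cannot produce a false match and no decoding is needed.
    """
    return [i for i, byte in enumerate(data) if byte in MARKUP]


# The element content is non-ASCII, but the markup is found on raw bytes.
doc = "<navn>æøå</navn>".encode("utf-8")
offsets = markup_offsets(doc)

# Decode only the content between the first '>' and the next '<'.
content = doc[offsets[1] + 1:offsets[2]].decode("utf-8")
```

Only the slice between the markup bytes is decoded, and only because the
caller actually wants to look at the content.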

- Jonathan M Davis

