[dmd-internals] named character entities, the spec, and dmd
Jonathan M Davis
jmdavisProg at gmx.com
Mon Jul 30 22:51:58 PDT 2012
This is the D spec page on named character entities:
http://dlang.org/entity.html
From what I can tell, it's essentially (if not exactly) what HTML 4 lists:
http://www.htmlhelp.com/reference/html40/entities/
http://www.w3.org/TR/html4/sgml/entities.html
However, what dmd seems to be using is essentially what HTML 5 lists:
http://www.w3.org/TR/html5/named-character-references.html
though it appears to have taken its list from here:
http://www.w3.org/2003/entities/2007/w3centities-f.ent
The problem is (aside from the fact that the D spec and dmd don't match) that
we appear to be dealing with a moving target here, since HTML is a moving
target. Should the D spec and dmd target a specific version of HTML? If so, can
that list change later (e.g. moving from HTML 4 to 5 or 5 to 6 whenever 6
comes along)? And if it's HTML 5, HTML 5 itself isn't finalized yet, so
_that_'s potentially a moving target. Or should the D spec just make its own
list and stick with that (in which case, it would presumably match one of the
HTML specs initially but may not do so in the long run)? I assume that we
_don't_ want to take the approach of letting the implementation define whatever
entities it feels like, even with the caveat that they're supposed to have
come from one of the HTML specs.
From what I can tell, two names were redefined (&lang; and &rang;) in HTML 5,
but other than that, it's purely additive. I ran into this, because I'm
working on a lexer for D, and I created a unit test to check all of the named
entities I had against what dmd did, and those two didn't match. Looking at
entity.c in dmd, it's worse than that: far more entities are defined there
than in the spec. Regardless, it's clearly a potential problem for any
implementation that follows the spec and is expected to match dmd, especially
if the spec is a moving target in this case.
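
The kind of cross-check described above can be sketched as follows. This is only an illustration, not the unit test from the lexer: it assumes Python's bundled `html.entities` tables (`name2codepoint` for HTML 4, `html5` for HTML 5) are reasonable stand-ins for the D spec's list and dmd's entity.c, which may not hold exactly. The helper name `compare_entity_tables` is hypothetical.

```python
from html.entities import html5, name2codepoint


def compare_entity_tables(old, new):
    """Compare two name -> text tables.

    Returns (redefined, missing, added):
      redefined: names present in both tables with different text
      missing:   names in `old` that `new` lacks
      added:     names in `new` that `old` lacks
    """
    redefined = {
        name: (old[name], new[name])
        for name in old
        if name in new and old[name] != new[name]
    }
    missing = sorted(name for name in old if name not in new)
    added = sorted(name for name in new if name not in old)
    return redefined, missing, added


# HTML 4 table: entity name -> the single character it stands for.
html4_table = {name: chr(cp) for name, cp in name2codepoint.items()}

# HTML 5 table: keep only the ';'-terminated spellings and drop the ';'
# so that the keys are comparable with the HTML 4 names.
html5_table = {
    name[:-1]: text for name, text in html5.items() if name.endswith(";")
}

redefined, missing, added = compare_entity_tables(html4_table, html5_table)

# Report every name whose meaning differs between the two tables.
for name, (old_text, new_text) in sorted(redefined.items()):
    old_cps = " ".join(f"U+{ord(c):04X}" for c in old_text)
    new_cps = " ".join(f"U+{ord(c):04X}" for c in new_text)
    print(f"&{name}; HTML 4: {old_cps}  HTML 5: {new_cps}")

print(f"{len(missing)} HTML 4 names missing from HTML 5, {len(added)} names added")
```

Pointing the same comparison at a table extracted from entity.c instead of `html5_table` would show how far dmd has drifted from whichever list the spec ends up naming.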
What _I_ would be tempted to go for is to have the D spec specifically state
that it supports the list of named entities that HTML 5 does, with a link to
the current HTML 5 spec. We would then update that link and dmd (and the
changelog) whenever the HTML 5 spec changes, and leave it at the final draft
of HTML 5 once that comes around. That way, we don't have to list every single entity in
the spec ourselves and it's clearly defined what we currently support. It
_does_ present a slightly moving target for the moment that way, but I suspect
that the named character entities aren't changing much in the HTML 5 spec, and
since dmd is _already_ supporting them, I'm not sure that it's reasonable to
say that we're supporting HTML 4 (which is what the spec currently seems to
match though it doesn't say so).
Thoughts? I'm perfectly willing to go and create whatever pull requests are
necessary for dmd and d-programming-language.org to fix this, but we need a
decision of some kind on how we want to proceed.
- Jonathan M Davis