[dmd-internals] named character entities, the spec, and dmd

Jonathan M Davis jmdavisProg at gmx.com
Mon Jul 30 22:51:58 PDT 2012


This is the D spec page on named character entities:

http://dlang.org/entity.html

From what I can tell, it's essentially (if not exactly) what HTML 4 lists:

http://www.htmlhelp.com/reference/html40/entities/
http://www.w3.org/TR/html4/sgml/entities.html

However, what dmd seems to be using is essentially what HTML 5 lists:

http://www.w3.org/TR/html5/named-character-references.html

though it appears to have taken its list from here:

http://www.w3.org/2003/entities/2007/w3centities-f.ent

The problem is (aside from the fact that the D spec and dmd don't match) that 
we appear to be dealing with a moving target here, since HTML is a moving 
target. Should the D spec and dmd target a specific version of HTML? If so, can 
that list change later (e.g. moving from HTML 4 to 5 or 5 to 6 whenever 6 
comes along)? And if it's HTML 5, HTML 5 itself isn't finalized yet, so 
_that_'s potentially a moving target. Or should the D spec just make its own 
list and stick with that (in which case, it would presumably match one of the 
HTML specs initially but may not do so in the long run)? I assume that we 
_don't_ want to take the approach of letting the implementation define whatever 
entities it feels like, even with the caveat that they're supposed to have 
come from one of the HTML specs.

From what I can tell, two names were redefined (&lang; and &rang;) in HTML 5, 
but other than that, the change is purely additive. I ran into this because I'm 
working on a lexer for D: I created a unit test to check all of the named 
entities I had against what dmd does, and those two didn't match. Looking at 
entity.c in dmd, it's worse than that, since far more entities are defined there 
than in the spec. Regardless, it's clearly a potential implementation 
issue if anything following the spec is supposed to match dmd, especially if the 
spec is a moving target in this case.
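
For anyone who wants to reproduce this kind of check without dmd at hand, here 
is a minimal sketch in Python. It assumes that the standard library's 
html.entities tables (name2codepoint for HTML 4, html5 for HTML 5) are 
acceptable stand-ins for the D spec's list and for dmd's entity.c; they are 
not the actual tables from either. The function name compare_entity_tables is 
made up for illustration.

```python
from html.entities import html5, name2codepoint


def compare_entity_tables():
    """Compare Python's HTML 4 entity table against its HTML 5 table.

    Returns (redefined, missing, added_count):
      redefined   -- {name: (html4_char, html5_text)} where the values differ
      missing     -- HTML 4 names with no ';'-terminated HTML 5 entry
      added_count -- number of ';'-terminated HTML 5 names absent from HTML 4
    """
    redefined = {}
    missing = []
    for name, codepoint in sorted(name2codepoint.items()):
        # Keys in the html5 table carry the trailing ';', e.g. "lang;".
        html5_text = html5.get(name + ";")
        if html5_text is None:
            missing.append(name)
        elif html5_text != chr(codepoint):
            redefined[name] = (chr(codepoint), html5_text)
    html5_names = {key[:-1] for key in html5 if key.endswith(";")}
    added_count = len(html5_names - set(name2codepoint))
    return redefined, missing, added_count


if __name__ == "__main__":
    redefined, missing, added_count = compare_entity_tables()
    for name, (old, new) in redefined.items():
        old_cp = "U+%04X" % ord(old)
        new_cp = " ".join("U+%04X" % ord(ch) for ch in new)
        print("&%s; HTML 4: %s  HTML 5: %s" % (name, old_cp, new_cp))
    print("missing from HTML 5:", missing)
    print("names only in HTML 5:", added_count)
```

A real test for a D lexer would replace both tables with the list from the 
spec page and the one extracted from entity.c, but the comparison itself would 
look much the same.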

What _I_ would be tempted to go for is to have the D spec specifically state 
that it supports the list of named entities that HTML 5 does, giving a link to 
the current HTML 5 spec. We would then update that link, dmd, and the changelog 
whenever the HTML 5 spec changes, and leave it at the final draft of HTML 5 
once that comes around. That way, we don't have to list every single entity in 
the spec ourselves, and it's clearly defined what we currently support. It 
_does_ present a slightly moving target for the moment, but I suspect 
that the named character entities aren't changing much in the HTML 5 spec. And 
since dmd is _already_ supporting them, I'm not sure that it's reasonable to 
say that we're supporting HTML 4 (which is what the spec currently seems to 
match, though it doesn't say so).

Thoughts? I'm perfectly willing to go and create whatever pull requests are 
necessary for dmd and d-programming-language.org to fix this, but we need a 
decision of some kind on how we want to proceed.

- Jonathan M Davis

