[Issue 5221] entity.c: Merge Walter's list with Thomas'

d-bugmail at puremagic.com d-bugmail at puremagic.com
Sat Jan 29 08:55:17 PST 2011


http://d.puremagic.com/issues/show_bug.cgi?id=5221



--- Comment #11 from Aziz Köksal <aziz.koeksal at gmail.com> 2011-01-29 08:53:01 PST ---
Good to see somebody is working on this. :)

Here's some info I could gather on the entities in that file:

Almost all entities look like this:

<!ENTITY AElig            "&#x000C6;" ><!--LATIN CAPITAL LETTER AE -->

The odd ones that stand out are:

<!ENTITY AMP              "&#38;" ><!--AMPERSAND -->
// The name has a dot (you already noticed.)
<!ENTITY b.Delta          "&#x1D6AB;" ><!--MATHEMATICAL BOLD CAPITAL DELTA -->
// Notice the leading space in the value.
<!ENTITY DotDot           " &#x020DC;" ><!--COMBINING FOUR DOTS ABOVE -->
// The entity has two characters in its value.
<!ENTITY bne              "&#x0003D;&#x020E5;" ><!--EQUALS SIGN with reverse slash -->
// Double chars + initial &.
<!ENTITY nvlt             "&#x0003C;&#x020D2;" ><!--LESS-THAN SIGN with vertical line -->
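
To illustrate the value formats listed above (hexadecimal in most entries,
decimal in AMP, a leading space in DotDot), here is a minimal sketch in C of
reading the first character reference out of such a value. The function name
parse_char_ref is hypothetical and is not taken from entity.c; the sketch
assumes the values use only the "&#...;" and "&#x...;" forms shown above.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of entity.c.
   Parses the first character reference in an entity value such as
   "&#x000C6;", "&#38;" or " &#x020DC;".
   Returns the code point, or 0 if the value is not a character reference. */
static unsigned long parse_char_ref(const char *value)
{
    int base = 10;
    char *end;
    unsigned long cp;

    /* Skip leading spaces, as seen in the DotDot entry. */
    while (*value == ' ')
        value++;

    if (value[0] != '&' || value[1] != '#')
        return 0;
    value += 2;

    /* "&#x..." is hexadecimal, "&#..." is decimal (as in AMP). */
    if (*value == 'x' || *value == 'X') {
        base = 16;
        value++;
    }

    cp = strtoul(value, &end, base);
    if (end == value || *end != ';')
        return 0;           /* no digits, or missing terminator */
    return cp;
}
```

With the entries above, this yields 0xC6 for AElig, 38 for AMP and 0x20DC for
DotDot. Only the first reference is read, so bne and nvlt would need separate
handling.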


I was wondering what constitutes a valid name for an HTML entity, and after a
long and hard search I found this page:
http://www.w3.org/TR/REC-xml/#NT-Name

Basically, an entity name can contain many more characters than just
alphanumeric ones. However, I would not recommend adjusting the lexer to
recognise all of them. We should allow only those characters that actually
occur in the names in the list, because that keeps things a lot simpler. Since
the list contains names with a ".", that char should be allowed as well.
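
A minimal sketch in C of what such a restricted name check could look like.
The function names are hypothetical and this is not the actual lexer code; it
assumes the names in the list use only ASCII letters, digits and ".".

```c
#include <stddef.h>

/* Hypothetical helper: accept only the characters assumed to occur in
   the entity names of the list, i.e. ASCII letters, digits and '.'. */
static int is_entity_name_char(char c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '.';
}

/* Scans an entity name starting at s (the character just after '&').
   Returns the length of the name, or 0 if the name is empty or is not
   terminated by ';'. */
static size_t scan_entity_name(const char *s)
{
    size_t n = 0;

    while (is_entity_name_char(s[n]))
        n++;
    return (n > 0 && s[n] == ';') ? n : 0;
}
```

For example, scan_entity_name("b.Delta;") returns 7, while a name containing
any other character from the XML Name production, such as "-", is rejected.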

But what to do about those entities that define two replacement characters?
Again, to keep things simple and efficient, let's just leave them out of the
lookup-table in the compiler.
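
One way to filter those entries when generating the table, as a rough sketch
in C: count the character references in each value and keep only the entities
with exactly one. The function names are hypothetical, and the sketch assumes
every replacement character is written as a "&#...;" reference, as in the
entries quoted above.

```c
#include <string.h>

/* Hypothetical helper: counts the "&#" character references in an
   entity value such as "&#x0003D;&#x020E5;". */
static int count_char_refs(const char *value)
{
    int n = 0;
    const char *p;

    for (p = strstr(value, "&#"); p != NULL; p = strstr(p + 2, "&#"))
        n++;
    return n;
}

/* An entity goes into the lookup table only if its value consists of
   exactly one character reference. */
static int belongs_in_table(const char *value)
{
    return count_char_refs(value) == 1;
}
```

Under this rule bne and nvlt are skipped, while DotDot is kept despite the
leading space in its value.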
