suggestion: clean white space / end of line definition
Thomas Kuehne
thomas-dloop at kuehne.cn
Sat Oct 28 03:00:33 PDT 2006
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Current definition (http://www.digitalmars.com/d/lex.html):
> EndOfLine:
> \u000D
> \u000A
> \u000D \u000A
> EndOfFile
>
> WhiteSpace:
> Space
> Space WhiteSpace
>
> Space:
> \u0020
> \u0009
> \u000B
> \u000C
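For reference, the quoted grammar can be sketched in C roughly as follows. This is only an illustration of the published definition, not DMD's code: the names spec_is_space and spec_end_of_line_len are made up for this example, and preferring the two-character \u000D \u000A form over a lone \u000D (longest match) is an assumption the grammar does not spell out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Space as listed in the quoted grammar: U+0020, U+0009, U+000B, U+000C.
   (Hypothetical helper name, not taken from DMD.) */
static bool spec_is_space(uint32_t c)
{
    return c == 0x0020 || c == 0x0009 || c == 0x000B || c == 0x000C;
}

/* Number of code points consumed by an EndOfLine starting at s[i],
   or 0 if s[i] does not start one. EndOfFile (i >= n) consumes nothing,
   so it also yields 0 and has to be checked by the caller.
   Assumption: CR LF is matched as one EndOfLine (longest match). */
static size_t spec_end_of_line_len(const uint32_t *s, size_t n, size_t i)
{
    if (i >= n)
        return 0;
    if (s[i] == 0x000D)
        return (i + 1 < n && s[i + 1] == 0x000A) ? 2 : 1;
    if (s[i] == 0x000A)
        return 1;
    return 0;
}
```

Under this reading \u000B and \u000C are spaces, while \u2028 and \u2029 are neither spaces nor line ends.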
DMD's frontend, however, doesn't strictly conform to those definitions:
doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
html.c:351: \u000D and \u000A are treated as space too
html.c:683: \u00A0 is treated as space only if it was encountered via an HTML entity
inifile.c:264: \u000D and \u000A are treated as space too
lexer.c:2360: \u000B and \u000C aren't treated as spaces
lexer.c: treats \u2028 and \u2029 as line separators too
The oddest case is entity.c:577:
treat "\&nbsp;" as "\u0020" instead of "\u00A0"
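To make one of these mismatches concrete, here is a small C sketch that compares the Space set from the grammar with the set reported above for doc.c:1395. The arrays only restate what this post says; nothing here is copied from DMD, and the helper names (contains, set_difference) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Space set of the published grammar. */
static const uint32_t spec_space[] = {0x0020, 0x0009, 0x000B, 0x000C};
/* Characters this post reports doc.c:1395 treating as spaces. */
static const uint32_t doc_c_space[] = {0x0020, 0x0009, 0x000A};

/* True if c occurs in set[0..n). */
static bool contains(const uint32_t *set, size_t n, uint32_t c)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == c)
            return true;
    return false;
}

/* Stores in out the members of a that are missing from b, in the order
   they appear in a, and returns how many were stored.
   out must have room for at least na entries. */
static size_t set_difference(const uint32_t *a, size_t na,
                             const uint32_t *b, size_t nb,
                             uint32_t *out)
{
    size_t k = 0;
    for (size_t i = 0; i < na; i++)
        if (!contains(b, nb, a[i]))
            out[k++] = a[i];
    return k;
}
```

Calling set_difference(spec_space, 4, doc_c_space, 3, out) yields \u000B and \u000C (spaces per the grammar that doc.c is reported to ignore); the reverse call yields \u000A (treated as space by doc.c but an EndOfLine per the grammar).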
Suggested definition:
> EndOfLine:
> Unicode(all non-tailorable Line Breaking Classes causing a line break)
> EndOfFile
>
> WhiteSpace:
> Space
> Space WhiteSpace
>
> Space:
> ( Unicode(General_Category == Space_Separator)
> || Unicode(Bidi_Class == Segment_Separator)
> || Unicode(Bidi_Class == Whitespace)
> ) && !EndOfLine
This expands to:
> EndOfLine:
> 000A // LINE FEED
> 000B // LINE TABULATION
> 000C // FORM FEED
> 000D // CARRIAGE RETURN
> 000D 000A // CARRIAGE RETURN followed by LINE FEED
> 0085 // NEXT LINE
> 2028 // LINE SEPARATOR
> 2029 // PARAGRAPH SEPARATOR
>
> Space:
> Unicode(General_Category == Space_Separator) && !EndOfLine
> 0020 // SPACE
> 00A0 // NO-BREAK SPACE
> 1680 // OGHAM SPACE MARK
> 180E // MONGOLIAN VOWEL SEPARATOR
> 2000..200A // EN QUAD..HAIR SPACE
> 202F // NARROW NO-BREAK SPACE
> 205F // MEDIUM MATHEMATICAL SPACE
> 3000 // IDEOGRAPHIC SPACE
>
> Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
> 0009 // CHARACTER TABULATION
> 001F // INFORMATION SEPARATOR ONE
>
> Unicode(Bidi_Class == Whitespace) && !EndOfLine
> <all already part of the Space_Separator listing>
>
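The expanded listing could be turned into two predicates along these lines. This is a minimal C sketch that hard-codes exactly the code points listed above instead of consulting Unicode property tables; the function names are hypothetical, and handling \u000D \u000A as a single line end would still have to happen in the scanner that calls them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Single code points that end a line in the proposed definition.
   (CR LF as a pair is left to the calling scanner.) */
static bool proposed_is_line_break(uint32_t c)
{
    switch (c) {
    case 0x000A: /* LINE FEED */
    case 0x000B: /* LINE TABULATION */
    case 0x000C: /* FORM FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x0085: /* NEXT LINE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
        return true;
    default:
        return false;
    }
}

/* Space in the proposed definition, as expanded in the listing above. */
static bool proposed_is_space(uint32_t c)
{
    if (c >= 0x2000 && c <= 0x200A) /* EN QUAD..HAIR SPACE */
        return true;
    switch (c) {
    case 0x0009: /* CHARACTER TABULATION */
    case 0x001F: /* INFORMATION SEPARATOR ONE */
    case 0x0020: /* SPACE */
    case 0x00A0: /* NO-BREAK SPACE */
    case 0x1680: /* OGHAM SPACE MARK */
    case 0x180E: /* MONGOLIAN VOWEL SEPARATOR */
    case 0x202F: /* NARROW NO-BREAK SPACE */
    case 0x205F: /* MEDIUM MATHEMATICAL SPACE */
    case 0x3000: /* IDEOGRAPHIC SPACE */
        return true;
    default:
        return false;
    }
}
```

With these tables no code point is both a space and a line break, which is what the "&& !EndOfLine" part of the definition asks for.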
Thomas
-----BEGIN PGP SIGNATURE-----
iD8DBQFFQzdILK5blCcjpWoRArgLAJ90xljYG+pNPEit3WU8JtAYlC+3PACfRPTU
J0cixnT2X7yynpjxBQx+rps=
=IDK6
-----END PGP SIGNATURE-----