suggestion: clean white space / end of line definition

Thomas Kuehne thomas-dloop at kuehne.cn
Sat Oct 28 03:00:33 PDT 2006


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Current definition (http://www.digitalmars.com/d/lex.html):
> EndOfLine:
>	\u000D
>	\u000A
>	\u000D \u000A
>	EndOfFile
>
> WhiteSpace:
>	Space
>	Space WhiteSpace
>
> Space:
>	\u0020
>	\u0009
>	\u000B
>	\u000C
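
For illustration, the grammar quoted above can be read as the following C sketch. It assumes input that has already been decoded into code points; the function names are hypothetical and are not taken from DMD. The EndOfFile alternative is left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Space per the quoted grammar: U+0020, U+0009, U+000B, U+000C. */
bool spec_is_space(uint32_t c)
{
    return c == 0x0020 || c == 0x0009 || c == 0x000B || c == 0x000C;
}

/* Number of code points consumed by an EndOfLine starting at s[0],
 * or 0 if none starts there. CR LF is matched as a single line end.
 * (Hypothetical helper; EndOfFile is not modelled here.) */
size_t spec_eol_length(const uint32_t *s, size_t n)
{
    if (n == 0)
        return 0;
    if (s[0] == 0x000D)
        return (n > 1 && s[1] == 0x000A) ? 2 : 1;
    return (s[0] == 0x000A) ? 1 : 0;
}

/* WhiteSpace is one or more Space; returns the length of the run at s[0]. */
size_t spec_whitespace_length(const uint32_t *s, size_t n)
{
    size_t i = 0;
    while (i < n && spec_is_space(s[i]))
        i++;
    return i;
}
```

Under this reading, U+000B and U+000C count as spaces and never end a line.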

DMD's frontend, however, doesn't strictly conform to those definitions:

doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
html.c:351: \u000D and \u000A are treated as spaces too
html.c:683: \u00A0 is treated as a space only if it was encountered via an HTML entity
inifile.c:264: \u000D and \u000A are treated as spaces too
lexer.c:2360: \u000B and \u000C aren't treated as spaces
lexer.c: \u2028 and \u2029 are treated as line separators too

The oddest case is entity.c:577:
it treats "\&nbsp;" as "\u0020" instead of "\u00A0"

Suggested definition:
> EndOfLine:
>	Unicode(all non-tailorable Line Breaking Classes causing a line break)
>	EndOfFile
>
> WhiteSpace:
>	Space
>	Space WhiteSpace
>
> Space:
>	( Unicode(General_Category == Space_Separator)
>		|| Unicode(Bidi_Class == Segment_Separator)
>		|| Unicode(Bidi_Class == Whitespace)
>	) && !EndOfLine

This expands to:
> EndOfLine:
>	000A		// LINE FEED
>	000B		// LINE TABULATION
>	000C		// FORM FEED
>	000D		// CARRIAGE RETURN
>	000D 000A	// CARRIAGE RETURN followed by LINE FEED
>	0085		// NEXT LINE
>	2028		// LINE SEPARATOR
>	2029		// PARAGRAPH SEPARATOR
>
> Space:
>	Unicode(General_Category == Space_Separator) && !EndOfLine
>		0020       // SPACE
>		00A0       // NO-BREAK SPACE
>		1680       // OGHAM SPACE MARK
>		180E       // MONGOLIAN VOWEL SEPARATOR
>		2000..200A // EN QUAD..HAIR SPACE
>		202F       // NARROW NO-BREAK SPACE
>		205F       // MEDIUM MATHEMATICAL SPACE
>		3000       // IDEOGRAPHIC SPACE
>
>	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
>		0009	// CHARACTER TABULATION
>		001F	// INFORMATION SEPARATOR ONE
>
>	Unicode(Bidi_Class == Whitespace) && !EndOfLine
>		<all part of the Space_Separator listing>
>
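
As a rough sketch, the expanded tables above could be hard-coded like this in C. The function names are hypothetical and the code assumes input already decoded into code points. The values are copied from the listing as given; it reflects the Unicode data available in 2006, and later Unicode versions have reclassified some entries (U+180E, for example, is no longer a space separator).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Single code points that end a line in the suggested definition. */
bool proposed_is_eol_char(uint32_t c)
{
    switch (c) {
    case 0x000A: /* LINE FEED */
    case 0x000B: /* LINE TABULATION */
    case 0x000C: /* FORM FEED */
    case 0x000D: /* CARRIAGE RETURN */
    case 0x0085: /* NEXT LINE */
    case 0x2028: /* LINE SEPARATOR */
    case 0x2029: /* PARAGRAPH SEPARATOR */
        return true;
    default:
        return false;
    }
}

/* Code points consumed by an EndOfLine at s[0]; CR LF counts as one
 * line end. Returns 0 if no line end starts at s[0]. */
size_t proposed_eol_length(const uint32_t *s, size_t n)
{
    if (n == 0 || !proposed_is_eol_char(s[0]))
        return 0;
    if (s[0] == 0x000D && n > 1 && s[1] == 0x000A)
        return 2;
    return 1;
}

/* Space in the suggested definition, with EndOfLine already excluded. */
bool proposed_is_space(uint32_t c)
{
    if (c >= 0x2000 && c <= 0x200A) /* EN QUAD..HAIR SPACE */
        return true;
    switch (c) {
    case 0x0009: /* CHARACTER TABULATION (segment separator) */
    case 0x001F: /* INFORMATION SEPARATOR ONE (segment separator) */
    case 0x0020: /* SPACE */
    case 0x00A0: /* NO-BREAK SPACE */
    case 0x1680: /* OGHAM SPACE MARK */
    case 0x180E: /* MONGOLIAN VOWEL SEPARATOR (per the 2006 listing) */
    case 0x202F: /* NARROW NO-BREAK SPACE */
    case 0x205F: /* MEDIUM MATHEMATICAL SPACE */
    case 0x3000: /* IDEOGRAPHIC SPACE */
        return true;
    default:
        return false;
    }
}
```

With these tables, U+000B and U+000C move from Space to EndOfLine, so no code point belongs to both classes.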

Thomas

-----BEGIN PGP SIGNATURE-----

iD8DBQFFQzdILK5blCcjpWoRArgLAJ90xljYG+pNPEit3WU8JtAYlC+3PACfRPTU
J0cixnT2X7yynpjxBQx+rps=
=IDK6
-----END PGP SIGNATURE-----


