suggestion: clean white space / end of line definition

Tue Oct 31 08:56:21 PST 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Walter Bright wrote on 2006-10-31:
> Thomas Kuehne wrote:

<snip>

>> suggested definition:
>>> EndOfLine:
>>> 	Unicode(all non-tailorable Line Breaking Classes causing a line break)
>>> 	EndOfFile
>>>
>>> WhiteSpace:
>>> 	Space
>>> 	Space WhiteSpace
>>>
>>> Space:
>>> 	( Unicode(General_Category == Space_Separator)
>>> 		|| Unicode(Bidi_Class == Segment_Separator)
>>> 		|| Unicode(Bidi_Class == Whitespace)
>>> 	) && !EndOfLine
>> 
>> this expands to:
>>> EndOfLine:
>>> 	000A		// LINE FEED
>>> 	000B		// LINE TABULATION
>>> 	000C		// FORM FEED
>>> 	000D		// CARRIAGE RETURN
>>> 	000D 000A	// CARRIAGE RETURN followed by LINE FEED
>>> 	0085		// NEXT LINE
>>> 	2028		// LINE SEPARATOR
>>> 	2029		// PARAGRAPH SEPARATOR
>>>
>>> Space:
>>> 	Unicode(General_Category == Space_Separator) && !EndOfLine
>>> 		0020       // SPACE
>>> 		00A0       // NO-BREAK SPACE
>>> 		1680       // OGHAM SPACE MARK
>>> 		180E       // MONGOLIAN VOWEL SEPARATOR
>>> 		2000..200A // EN QUAD..HAIR SPACE
>>> 		202F       // NARROW NO-BREAK SPACE
>>> 		205F       // MEDIUM MATHEMATICAL SPACE
>>> 		3000       // IDEOGRAPHIC SPACE
>>>
>>> 	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
>>> 		0009	// CHARACTER TABULATION
>>> 		001F	// INFORMATION SEPARATOR ONE
>>>
>>> 	Unicode(Bidi_Class == Whitespace) && !EndOfLine
>>> 		<all part of the Space_Separator listing>
>
> Is it really worth doing all that?

What is actually changing for EndOfLine?
	000A new
 	000B formerly white space
 	000C formerly white space
 	0085 new
 	2028 implemented but undocumented
 	2029 implemented but undocumented

\v and \f were probably defined as white space due to
C's isspace. Note, however, that isspace recognises \r and \n
too. Implementing 2028 and 2029 seems implied by
the use of UTF encodings.

All the different line endings can be converted to '\n' for
non-UTF-8 D files in Module::parse. UTF-8 encoded HTML sources
can use a similar approach in html.c (GDC currently uses an
isLineSeperator there). UTF-8 encoded D files would require
support at
lexer.c: 303,709,763,835,1113,1301,1375,1457,1520,1520,2258,2272,2386.
The alternative and more robust solution would be a 'new line cleanup'
at module.c:485 and a goto from module.c:523. This way, all the
'\r', LS and PS tests sprinkled around lexer.c and html.c could be
removed.
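
A minimal sketch of what such a 'new line cleanup' pass could look like
for a UTF-8 buffer. The function name normalize_newlines and its
signature are hypothetical and not taken from module.c; the byte
sequences are simply the UTF-8 encodings of the EndOfLine code points
listed above. It works in place because every EndOfLine sequence is at
least as long as the single '\n' that replaces it.

```c
#include <stddef.h>

/* Hypothetical helper, not part of DMD: rewrites every EndOfLine
   sequence in a UTF-8 buffer to a single '\n', in place.
   Returns the new length of the buffer contents. */
size_t normalize_newlines(unsigned char *buf, size_t len)
{
    size_t r = 0; /* read index  */
    size_t w = 0; /* write index, never ahead of r */

    while (r < len) {
        unsigned char c = buf[r];

        if (c == 0x0D) {
            /* CR, or CR LF treated as one line ending */
            r += (r + 1 < len && buf[r + 1] == 0x0A) ? 2 : 1;
            buf[w++] = '\n';
        } else if (c == 0x0A || c == 0x0B || c == 0x0C) {
            /* LF, LINE TABULATION, FORM FEED */
            r += 1;
            buf[w++] = '\n';
        } else if (c == 0xC2 && r + 1 < len && buf[r + 1] == 0x85) {
            /* U+0085 NEXT LINE */
            r += 2;
            buf[w++] = '\n';
        } else if (c == 0xE2 && r + 2 < len && buf[r + 1] == 0x80
                   && (buf[r + 2] == 0xA8 || buf[r + 2] == 0xA9)) {
            /* U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR */
            r += 3;
            buf[w++] = '\n';
        } else {
            buf[w++] = buf[r++];
        }
    }
    return w;
}
```

After such a pass the lexer would only ever need to test for '\n'.
The sketch does not distinguish line endings inside string literals
from those outside; a real cleanup would have to decide how to treat
them.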

In my opinion the EndOfLine change is well worth it.

The SPACE change was prompted by the broken
00A0 (NO-BREAK SPACE) kludges in html.c and entity.c.
The issue isn't that the idea was bad but that the
reasoning wasn't laid out properly. If 00A0 is to be considered
a SPACE, then why 00A0 and not character foo-bar? The 2000..200A
range, at least, will become the same problem 00A0 was originally.
Using the Unicode standard as the reference would direct all further
debates about whether a character is a space to the Unicode
Consortium and leave D out of potentially lengthy debates.

Changes would be required somewhere around
lexer.c:490,1331,2218,2368,2375,2404

Using a function like

// returns NULL or end of white space
char* isUniSpace(char*)

would also clean up white space parsing.
lexer.c currently tests for '\t' 6 times, for ' ' 7 times,
and for '\f' and '\v' only 3 times each.
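
A rough sketch of how isUniSpace might be written, assuming a
NUL-terminated UTF-8 buffer and reading "end of white space" as the
end of the whole run of Space characters starting at the argument.
The helpers is_space_cp and decode are hypothetical names; the code
points are those of the Space listing quoted above. The decoder is
deliberately minimal (sequences of up to 3 bytes, no rejection of
overlong encodings).

```c
#include <stddef.h>

/* Code points of the proposed Space set (see the quoted listing). */
static int is_space_cp(unsigned long cp)
{
    return cp == 0x0009 || cp == 0x001F || cp == 0x0020 || cp == 0x00A0
        || cp == 0x1680 || cp == 0x180E
        || (cp >= 0x2000 && cp <= 0x200A)
        || cp == 0x202F || cp == 0x205F || cp == 0x3000;
}

/* Decodes one UTF-8 sequence of up to 3 bytes, which covers the set
   above. Returns the number of bytes consumed, or 0 if s does not
   start such a sequence. Relies on the terminating NUL: a NUL is
   never a continuation byte, so no read goes past the end. */
static size_t decode(const unsigned char *s, unsigned long *cp)
{
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
        && (s[2] & 0xC0) == 0x80) {
        *cp = ((unsigned long)(s[0] & 0x0F) << 12)
            | ((unsigned long)(s[1] & 0x3F) << 6)
            | (s[2] & 0x3F);
        return 3;
    }
    return 0;
}

/* returns NULL or end of white space */
char *isUniSpace(char *p)
{
    const unsigned char *start = (const unsigned char *)p;
    const unsigned char *s = start;
    unsigned long cp;
    size_t n;

    /* Skip over consecutive Space characters. */
    while ((n = decode(s, &cp)) != 0 && is_space_cp(cp))
        s += n;

    return s == start ? NULL : p + (s - start);
}
```

With a helper of this shape, each of the scattered '\t', ' ', '\f' and
'\v' tests could become a single call.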

Thomas

-----BEGIN PGP SIGNATURE-----

iD8DBQFFR4z+LK5blCcjpWoRAmbAAJoDASDAvpcpZzWcDl2gh7MhCX5mvgCfdvNm
x3IrjxWSgml7rc3R/soHZn0=
=YYyK
-----END PGP SIGNATURE-----