suggestion: clean white space / end of line definition

Mon Oct 30 17:14:54 PST 2006

Thomas Kuehne wrote:
> DMD's frontend however doesn't strictly conform to those definitions.
> 
> doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
> html.c:351: \u000D and \u000A are treated as space too
> html.c:683: \u00A0 is treated as space only if it was encountered via a html entity
> inifile.c:264: \u000D and \u000A are treated as space too
> lexer.c:2360: \u000B and \u000C aren't treated as spaces
> lexer.c: treats \u2028 and \u2029 as line separators too
> 
> The oddest case is entity.c:577:
> treat "\&nbsp;" as "\u0020" instead of "\u00A0"

Thanks, I'll try to get those fixed.

> suggested definition:
>> EndOfLine:
>> 	Unicode(all non-tailorable Line Breaking Classes causing a line break)
>> 	EndOfFile
>>
>> WhiteSpace:
>> 	Space
>> 	Space WhiteSpace
>>
>> Space:
>> 	( Unicode(General_Category == Space_Separator)
>> 		|| Unicode(Bidi_Class == Segment_Separator)
>> 		|| Unicode(Bidi_Class == Whitespace)
>> 	) && !EndOfLine
> 
> this expands to:
>> EndOfLine:
>> 	000A		// LINE FEED
>> 	000B		// LINE TABULATION
>> 	000C		// FORM FEED
>> 	000D		// CARRIAGE RETURN
>> 	000D 000A	// CARRIAGE RETURN followed by LINE FEED
>> 	0085		// NEXT LINE
>> 	2028		// LINE SEPARATOR
>> 	2029		// PARAGRAPH SEPARATOR
>>
>> Space:
>> 	Unicode(General_Category == Space_Separator) && !EndOfLine
>> 		0020       // SPACE
>> 		00A0       // NO-BREAK SPACE
>> 		1680       // OGHAM SPACE MARK
>> 		180E       // MONGOLIAN VOWEL SEPARATOR
>> 		2000..200A // EN QUAD..HAIR SPACE
>> 		202F       // NARROW NO-BREAK SPACE
>> 		205F       // MEDIUM MATHEMATICAL SPACE
>> 		3000       // IDEOGRAPHIC SPACE
>>
>> 	Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
>> 		0009	// CHARACTER TABULATION
>> 		001F	// INFORMATION SEPARATOR ONE
>>
>> 	Unicode(Bidi_Class == Whitespace) && !EndOfLine
>> 		<all part of the Space_Separator listing>

Is it really worth doing all that?
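
For a sense of scale, here is a minimal Python sketch of the expanded tables quoted above. It is only an illustration, not DMD's lexer code: the names END_OF_LINE, SPACE_RANGES, is_end_of_line, is_space and split_lines are made up for this example, and the code points are copied from the quoted listing rather than derived from the Unicode database.

```python
# Code points copied from the expanded EndOfLine list quoted above.
END_OF_LINE = frozenset({0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029})

# Inclusive (first, last) ranges copied from the expanded Space list quoted above.
SPACE_RANGES = (
    (0x0009, 0x0009),  # CHARACTER TABULATION
    (0x001F, 0x001F),  # INFORMATION SEPARATOR ONE
    (0x0020, 0x0020),  # SPACE
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x180E, 0x180E),  # MONGOLIAN VOWEL SEPARATOR
    (0x2000, 0x200A),  # EN QUAD..HAIR SPACE
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
)


def is_end_of_line(ch):
    """True if the single character ch is in the quoted EndOfLine list."""
    return ord(ch) in END_OF_LINE


def is_space(ch):
    """True if the single character ch is in the quoted Space list."""
    cp = ord(ch)
    return any(first <= cp <= last for first, last in SPACE_RANGES)


def split_lines(text):
    """Split text at every EndOfLine, treating CR LF as one terminator."""
    lines = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        if is_end_of_line(text[i]):
            lines.append(text[start:i])
            # CARRIAGE RETURN followed by LINE FEED counts as a single EndOfLine.
            if text[i] == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1
            i += 1
            start = i
        else:
            i += 1
    # The last line is terminated by EndOfFile.
    lines.append(text[start:])
    return lines
```

With these tables, split_lines("a\r\nb\u2028c") yields three lines, and is_space("\u000B") is false because LINE TABULATION is listed under EndOfLine rather than Space.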