suggestion: clean white space / end of line definition
Walter Bright
newshound at digitalmars.com
Mon Oct 30 17:14:54 PST 2006
Thomas Kuehne wrote:
> DMD's frontend however doesn't strictly conform to those definitions.
>
> doc.c:1395: only \u0020, \u0009 and \u000A are treated as spaces
> html.c:351: \u000D and \u000A are treated as space too
> html.c:683: \u00A0 is treated as space only if it was encountered via an HTML entity
> inifile.c:264: \u000D and \u000A are treated as space too
> lexer.c:2360: \u000B and \u000C aren't treated as spaces
> lexer.c: treats \u2028 and \u2029 as line separators too
>
> The oddest case is entity.c:577:
> treat "\&nbsp;" as "\u0020" instead of "\u00A0"
Thanks, I'll try to get those fixed.
> suggested definition:
>> EndOfLine:
>> Unicode(all non-tailorable Line Breaking Classes causing a line break)
>> EndOfFile
>>
>> WhiteSpace:
>> Space
>> Space WhiteSpace
>>
>> Space:
>> ( Unicode(General_Category == Space_Separator)
>> || Unicode(Bidi_Class == Segment_Separator)
>> || Unicode(Bidi_Class == Whitespace)
>> ) && !EndOfLine
>
> this expands to:
>> EndOfLine:
>> 000A // LINE FEED
>> 000B // LINE TABULATION
>> 000C // FORM FEED
>> 000D // CARRIAGE RETURN
>> 000D 000A // CARRIAGE RETURN followed by LINE FEED
>> 0085 // NEXT LINE
>> 2028 // LINE SEPARATOR
>> 2029 // PARAGRAPH SEPARATOR
>>
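As a rough illustration of the expanded EndOfLine list above, here is a minimal C sketch of a matcher that treats CR LF as a single line break. The function name `end_of_line_length` and the choice of decoded `uint32_t` code points as input are assumptions made for this example and are not taken from the DMD sources. The EndOfFile alternative is left to the caller, since an empty input simply reports no match.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not DMD code.
 * Returns how many code points (0, 1 or 2) at the start of s[0..n)
 * form one EndOfLine sequence from the list above. */
size_t end_of_line_length(const uint32_t *s, size_t n)
{
    if (n == 0)
        return 0;               /* end of input: caller handles EndOfFile */

    switch (s[0]) {
    case 0x000D:                /* CARRIAGE RETURN */
        /* CR immediately followed by LF counts as one line break */
        return (n > 1 && s[1] == 0x000A) ? 2 : 1;
    case 0x000A:                /* LINE FEED */
    case 0x000B:                /* LINE TABULATION */
    case 0x000C:                /* FORM FEED */
    case 0x0085:                /* NEXT LINE */
    case 0x2028:                /* LINE SEPARATOR */
    case 0x2029:                /* PARAGRAPH SEPARATOR */
        return 1;
    default:
        return 0;               /* not a line break */
    }
}
```

A lexer could call this at each position and advance by the returned length, bumping its line counter once whenever the result is non-zero.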
>> Space:
>> Unicode(General_Category == Space_Separator) && !EndOfLine
>> 0020 // SPACE
>> 00A0 // NO-BREAK SPACE
>> 1680 // OGHAM SPACE MARK
>> 180E // MONGOLIAN VOWEL SEPARATOR
>> 2000..200A // EN QUAD..HAIR SPACE
>> 202F // NARROW NO-BREAK SPACE
>> 205F // MEDIUM MATHEMATICAL SPACE
>> 3000 // IDEOGRAPHIC SPACE
>>
>> Unicode(Bidi_Class == Segment_Separator) && !EndOfLine
>> 0009 // CHARACTER TABULATION
>> 001F // INFORMATION SEPARATOR ONE
>>
>> Unicode(Bidi_Class == Whitespace) && !EndOfLine
>> <all part of the Space_Separator listing>
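For a sense of scale, the expanded Space listing above fits in one small predicate. The following is a sketch only: `is_space` is a hypothetical name, the input is assumed to be an already decoded code point, and the table mirrors the listing exactly as posted rather than any particular Unicode version (for example, U+180E was moved out of the Space_Separator category in Unicode 6.3).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not DMD code.
 * True if c is in the expanded Space listing above
 * (EndOfLine characters are deliberately excluded). */
bool is_space(uint32_t c)
{
    switch (c) {
    case 0x0009:    /* CHARACTER TABULATION */
    case 0x001F:    /* INFORMATION SEPARATOR ONE */
    case 0x0020:    /* SPACE */
    case 0x00A0:    /* NO-BREAK SPACE */
    case 0x1680:    /* OGHAM SPACE MARK */
    case 0x180E:    /* MONGOLIAN VOWEL SEPARATOR (as listed in the post) */
    case 0x202F:    /* NARROW NO-BREAK SPACE */
    case 0x205F:    /* MEDIUM MATHEMATICAL SPACE */
    case 0x3000:    /* IDEOGRAPHIC SPACE */
        return true;
    default:
        /* EN QUAD .. HAIR SPACE */
        return c >= 0x2000 && c <= 0x200A;
    }
}
```

Routing every whitespace check through a single predicate like this is one way to avoid per-file differences in what counts as a space.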
Is it really worth doing all that?