DMD: invalid UTF character `\U0000d800`

Per Nordlöw per.nordlow at gmail.com
Sat Nov 7 16:12:06 UTC 2020


I'm writing a parser generator for ANTLR grammars and have come 
across the rule

fragment Letter
     : [a-zA-Z$_] // these are below 0x7F
     | ~[\u0000-\u007F\uD800-\uDBFF] // covers all characters above 0x7F which are not a surrogate
     | [\uD800-\uDBFF] [\uDC00-\uDFFF] // covers UTF-16 surrogate pairs encodings for U+10000 to U+10FFFF
     ;

at

https://github.com/antlr/grammars-v4/blob/master/cto/CtoLexer.g4#L158

This rule is converted into

     Match m__Letter()
     {
         return alt(alt(rng('a', 'z'), rng('A', 'Z'), ch('$'), ch('_')),
                    not(alt(rng('\u0000', '\u007F'), rng('\uD800', '\uDBFF'))),
                    seq(rng('\uD800', '\uDBFF'), rng('\uDC00', '\uDFFF')));
     }

given suitable definitions of alt, rng, ch, seq, and not.

This errors as

  CtoLexer_parser.d   665  57 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   665  67 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  28 error           invalid UTF character \U0000d800
  CtoLexer_parser.d   666  38 error           invalid UTF character \U0000dbff
  CtoLexer_parser.d   666  53 error           invalid UTF character \U0000dc00
  CtoLexer_parser.d   666  63 error           invalid UTF character \U0000dfff

Doesn't DMD support these Unicode code points (the UTF-16 surrogates) in 
character literals yet?
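
For reference, here is a minimal sketch of a possible workaround. It 
assumes the input is matched as UTF-16 code units (wchar), which may not 
fit how the generated parser reads its input. The surrogate bounds are 
written as integer values cast to wchar instead of '\u....' escapes, since 
the escapes are what the errors above point at. The names highFirst, 
highLast, lowFirst, lowLast, inRange and startsWithSurrogatePair are 
hypothetical and are not part of the generated code.

```d
// Surrogate bounds given as integer code units rather than '\u....' escapes.
enum wchar highFirst = cast(wchar) 0xD800;
enum wchar highLast  = cast(wchar) 0xDBFF;
enum wchar lowFirst  = cast(wchar) 0xDC00;
enum wchar lowLast   = cast(wchar) 0xDFFF;

// Hypothetical helper: true if `c` lies in the inclusive range [lo, hi].
bool inRange(wchar c, wchar lo, wchar hi) pure nothrow @nogc @safe
{
    return lo <= c && c <= hi;
}

// Hypothetical helper: true if `s` begins with a high surrogate
// followed by a low surrogate, i.e. a UTF-16 surrogate pair.
bool startsWithSurrogatePair(const(wchar)[] s) pure nothrow @nogc @safe
{
    return s.length >= 2
        && inRange(s[0], highFirst, highLast)
        && inRange(s[1], lowFirst, lowLast);
}
```

With this, a call such as startsWithSurrogatePair("\U0001F600"w) checks the 
two code units of a character outside the BMP without any surrogate 
escape appearing in the source.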

