[Issue 3455] Some Unicode characters not allowed in identifiers

Fri Oct 30 11:40:07 PDT 2009

http://d.puremagic.com/issues/show_bug.cgi?id=3455

--- Comment #2 from Andrei Alexandrescu <andrei at metalanguage.com> 2009-10-30 11:40:05 PDT ---
(In reply to comment #1)
> As http://www.digitalmars.com/d/1.0/lex.html#identifier very clearly states,
> the allowed characters in identifiers are those defined in the C99 standard,
> ISO/IEC 9899:1999(E) Annex D. Have a look at it:
> http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf
> 
> ９, code point 0xff19, is not in that list. The maximum one is 0xd7a3, in fact. 
> This is not a bug, this is an enhancement.
> 
> However, rather than an arbitrary and frozen list, I /would/ prefer basing it
> simply on Unicode properties, such as Java's choice: identifiers may start with
> letters or numeric letters, and may contain, in addition to those, connecting
> punctuation, decimal digits, and combining and non-spacing marks. In other
> words:
> 
> Identifiers may start with code points from the general categories Ll, Lm, Lo,
> Lt, Lu, Nl.
> 
> Identifiers may contain code points from the general categories Ll, Lm, Lo, Lt,
> Lu, Mc, Mn, Nd, Nl, No, Pc.
> 
> Java also allows Cc and Cf, of whose usefulness I'm not so convinced. These are
> control characters and things like "soft hyphen", which isn't even supposed to
> be displayed unless the word line-wraps. Too much potential for confusion IMHO.
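
The category rule proposed in the quoted comment can be sketched in a few lines. This is only an illustration in Python using the standard `unicodedata` module, not D's lexer; the names `START_CATEGORIES`, `PART_CATEGORIES` and `is_identifier` are hypothetical. It follows the quoted lists literally, so `_` (category Pc) is accepted inside an identifier but not at the start; a real lexer would presumably special-case that.

```python
import unicodedata

# General categories taken from the quoted proposal (not D's actual rules).
START_CATEGORIES = {"Ll", "Lm", "Lo", "Lt", "Lu", "Nl"}
PART_CATEGORIES = START_CATEGORIES | {"Mc", "Mn", "Nd", "No", "Pc"}


def is_identifier(text):
    """Hypothetical helper: True if text satisfies the proposed category rule."""
    if not text:
        return False
    # The first code point must come from the "start" categories.
    if unicodedata.category(text[0]) not in START_CATEGORIES:
        return False
    # Every remaining code point must come from the "part" categories.
    return all(unicodedata.category(ch) in PART_CATEGORIES for ch in text[1:])
```

Under this rule the fullwidth digit from the report (U+FF19, category Nd) would be allowed anywhere except the first position, while a soft hyphen (category Cf) would be rejected.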

Oh ok. Thanks Matti. I'm leaving this as an enhancement request. Currently the
error message is:

invalid UTF-8 sequence
unsupported char 0x99

This is factually incorrect: the UTF-8 sequence is valid (U+FF19 encodes as the
bytes EF BC 99), and 0x99 matches only its final byte, not the character. I
suggest instead:

Unicode character 0xFF19 not allowed in a symbol
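
A minimal sketch of how the two diagnostics could be kept apart, written in Python rather than taken from the compiler source: decode first, report an encoding error only if decoding really fails, and otherwise name the offending code point. The function `diagnose` and its `is_allowed` parameter are hypothetical names for illustration.

```python
def diagnose(raw, is_allowed):
    """Hypothetical diagnostic: separate encoding errors from disallowed characters."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # Only a sequence that fails to decode is reported as invalid UTF-8.
        return "invalid UTF-8 sequence at byte offset %d" % err.start
    for ch in text:
        if not is_allowed(ch):
            # Report the whole code point, not one byte of its encoding.
            return "Unicode character 0x%04X not allowed in a symbol" % ord(ch)
    return None


# Assumed policy for the example only: ASCII characters are allowed.
ascii_only = lambda ch: ord(ch) < 0x80
message = diagnose("\uff19".encode("utf-8"), ascii_only)
```

With the ASCII-only policy assumed above, the three bytes EF BC 99 decode cleanly and the message names 0xFF19; a lone byte 0x99 is what would trigger the invalid-sequence branch.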

Andrei
