identifiers & "unialpha"

Fri Sep 22 14:30:49 PDT 2006

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Walter Bright wrote on 2006-09-22:
> Thomas Kuehne wrote:
>> 
>> http://www.digitalmars.com/d/lex.html#identifier
>> # Identifiers start with a letter, _, or universal alpha, and are followed
>> # by any number of letters, _, digits, or universal alphas. Universal
>> # alphas are as defined in ISO/IEC 9899:1999(E) Appendix D. (This is the
>> # C99 Standard.)
>> 
>> Why does D reference "ISO/IEC 9899:1999 (E) Appendix D" to define
>> "universal alpha"? "ISO/IEC 9899:1999 (E) Appendix D" doesn't list
>> "universal alphas".
>> 
>> Example:
>> \u00B7 (MIDDLE DOT, Other_Punctuation) isn't a "universal alpha" but
>> is allowed in identifiers by Appendix D.
>> 
>> "ISO/IEC 9899:1999 (E) Appendix D" itself references
>> "ISO/IEC TR 10176:1998" for the character data. I strongly suggest
>> dropping the indirection via "Appendix D" and using
>> "ISO/IEC TR 10176 (current)" instead of the dated
>> "ISO/IEC TR 10176:1998". The 1998 version doesn't include a large
>> number of CJK and math characters that are in the current version.
>
> I'd like to leave things as they are for 1.0. I don't think that 
> anyone's code will be adversely affected by not having the latest alpha 
> character additions to identifiers, and I also don't think math 
> characters should be part of identifiers. What is CJK?

CJK: Chinese, Japanese & Korean
0x20000 .. 0x2A6D6 CJK Ideograph Extension B
0x2F800 .. 0x2FA1D CJK Compatibility Ideographs Supplement
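
To make the two ranges concrete, here is a minimal sketch in Python (not
D) that tests whether a code point falls into either of them and looks
up its general category with the standard unicodedata module. The names
CJK_RANGES and in_cjk_ranges are hypothetical and only serve the
illustration; the range bounds are the ones quoted above.

```python
import unicodedata

# Inclusive code point ranges, exactly as quoted in the post.
CJK_RANGES = [
    (0x20000, 0x2A6D6),  # CJK Ideograph Extension B
    (0x2F800, 0x2FA1D),  # CJK compatibility ideographs above the BMP
]

def in_cjk_ranges(cp: int) -> bool:
    """Return True if cp lies in one of the ranges listed above."""
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

# Print membership and the Unicode general category for a few samples.
for cp in (0x20000, 0x2F800, 0x0041):
    print(f"U+{cp:04X}", in_cjk_ranges(cp), unicodedata.category(chr(cp)))
```

Both range starts are reported as category Lo (Other_Letter), so they
are ordinary letters as far as identifier rules are concerned.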

> As it is now, it matches standard C's definition of identifiers, which 
> is the intent of the reference. I haven't checked, but I think it 
> matches Java's idea of an identifier character, too.

ISO/IEC 9899:1999 (E) Appendix D
# 1) This clause lists the hexadecimal code values that are valid in
# universal character names in identifiers.

Appendix D defines which characters are valid in identifiers, whereas D
uses it as the definition of "universal alpha". As a consequence,
std.uni.isUniAlpha claims that \u00B7 (MIDDLE DOT) is a letter...
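
The mismatch is easy to check against the Unicode character database. A
minimal sketch in Python (not D, and not how std.uni is implemented),
using the standard unicodedata module; the helper is_letter is a
hypothetical name, and "letter" is assumed here to mean a general
category starting with L:

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """True if the general category of ch is a letter category (L*)."""
    return unicodedata.category(ch).startswith("L")

middle_dot = "\u00B7"

print(unicodedata.name(middle_dot))      # MIDDLE DOT
print(unicodedata.category(middle_dot))  # Po (Other_Punctuation)
print(is_letter(middle_dot))             # False
print(is_letter("\u00E9"))               # True (LATIN SMALL LETTER E WITH ACUTE)
```

So a predicate named after "alpha" should reject \u00B7, even though a
predicate for "valid in identifiers per Appendix D" may accept it.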

> P.S. It also bugs me that the unicode people can't seem to make up their 
> minds. Do character sets really need to change every 2 or 3 years?

The task at hand: create a table of all characters used by humans all
over the world while minimizing friction due to political issues
(e.g. characters' names). Except for bug fixes (typos...), the Unicode
people usually only extend previous versions of the standard.

Thomas

-----BEGIN PGP SIGNATURE-----

iD8DBQFFFGNBLK5blCcjpWoRAh+mAJ9k2lTcyhSiNjFsVRtCtiDhbCVdQwCdHiKE
LTtcD8IPwAUsHWoJMMXm+70=
=wNTb
-----END PGP SIGNATURE-----