identifiers & "unialpha"

Fri Sep 22 15:38:31 PDT 2006

Walter Bright wrote:
> Thomas Kuehne wrote:
>>
>> ISO/IEC 9899:1999 (E) Appendix D
>> # 1) This clause lists the hexadecimal code values that are valid in
>> # universal character names in identifiers.
>>
>> Whereas Appendix D defines valid characters in identifiers, D uses it
>> as a source for "universal alpha". As a consequence std.uni.isUniAlpha
>> claims that \u00B7 (MIDDLE DOT) is a letter...
> 
> I guess I don't see why C99 would say · is a valid identifier character 
> if it isn't an alpha. It's all confusing to me, and I think needlessly 
> complicated. Is \u00B7 the only difference?

No, there are other differences as well.  I think C99 was simply 
referring to the latest version of the document available in 1999, and 
it has since been revised (in 2003, apparently).  But I have no idea why 
characters present in the 1999 doc are not present in the 2003 doc.  To 
pass the buck even further, "ISO/IEC TR 10176:2003" Annex A says the 
following:

     This list comprises the letters (combining or not), syllables, and
     ideographs from ISO/IEC 10646-1, together with the modifier letters
     and marks conventionally used as parts of words.

So their list of characters is copied from the Unicode standard (ISO/IEC 
10646).  I can only conclude that the Unicode standard changed between 
1999 and 2003 and that ISO/IEC TR 10176 simply incorporated the new 
list.  But who knows why the list was changed.
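
To make the gap between "allowed in an identifier" and "is a letter" concrete, here is a minimal sketch in Python (not D, and not how std.uni is implemented) that asks the Unicode character database for the general category of U+00B7.  The helper name is_unicode_letter is made up for this example, and the result reflects whatever Unicode version ships with the interpreter, not the 1999 or 2003 lists discussed above.

```python
import unicodedata

MIDDLE_DOT = "\u00b7"

def is_unicode_letter(ch):
    """Hypothetical helper: True if the general category of ch is one
    of the letter categories (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")

# Look up what the Unicode character database says about U+00B7.
name = unicodedata.name(MIDDLE_DOT)          # official character name
category = unicodedata.category(MIDDLE_DOT)  # two-letter general category

print(name, category, is_unicode_letter(MIDDLE_DOT))
```

On the interpreters I would expect anyone to have, the category comes back as punctuation rather than a letter, which fits the Annex A wording: the list includes marks "conventionally used as parts of words" in addition to letters proper.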

This does raise an interesting point, however.  Since the C and C++ 
standards separately refer to ISO/IEC TR 10176 for their character 
lists, the identifiers that a compliant C99 compiler and a compliant 
C++2003 compiler should accept are different.  This seems contrary to 
the usual C++ practice of deferring to the C standard on semantic issues.

>>> P.S. It also bugs me that the unicode people can't seem to make up 
>>> their minds. Do character sets really need to change every 2 or 3 years?
>>
>> Task at hand: Create a table of all characters used by humans all over
>> the world and minimize friction due to political issues
>> (e.g. characters' names). Except for bug fixes (typos...) the unicode
>> people usually only extend previous versions of the standard.
> 
> Chinese, Japanese, and Korean are hardly obscure so I don't see why the 
> character sets for them seem to need large numbers of additions this 
> late in the game.

Me neither.  But then I'm not terribly inclined to read the Unicode 
standards committee minutes to find out :-)

Sean