Why is BOM required to use unicode in tokens?

James Blachly james.blachly at gmail.com
Tue Sep 15 14:18:20 UTC 2020


On 9/15/20 4:36 AM, Dominikus Dittes Scherkl wrote:
> On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote:
>> On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
>>> Identifiers start with a letter, _, or universal alpha, and are 
>>> followed by any number of letters, _, digits, or universal alphas. 
>>> Universal alphas are as defined in ISO/IEC 9899:1999(E) Appendix D of 
>>> the C99 Standard.
>>
>> I was unable to find the definition of a "universal alpha", or whether 
>> that includes non-ascii alphabetic characters.
> 
> ISO/IEC 9899:1999 (E)
> Annex D
> 
> Universal character names for identifiers
> -----------------------------------------
...
> -----------------------
> 
> This is outdated to the brim. Also it doesn't allow for letter-like 
> symbols (which is debatable, but especially the mathematical ones like 
> double-struck letters are intended for such use).
> Instead of some old C-Standard, D should better rely directly on the 
> properties from UnicodeData.txt, which is updated with every new unicode 
> version.
> 

Thanks to Paul, Jon, Dominikus and H.S. for thoughtful responses.
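
For concreteness, here is roughly how I read Dominikus's suggestion, as
a minimal sketch only. It is in Python rather than D because Python's
unicodedata module exposes the general categories from UnicodeData.txt
directly. The function names are hypothetical, and treating "any code
point in a letter category" as a valid identifier character is my
assumption, not what the D spec or the C99 table currently says.

```python
import unicodedata


def is_identifier_start(ch: str) -> bool:
    # Assumption: '_' or any code point whose general category is a
    # letter (Lu, Ll, Lt, Lm, Lo) may start an identifier.
    return ch == "_" or unicodedata.category(ch).startswith("L")


def is_identifier_continue(ch: str) -> bool:
    # Following characters may additionally be ASCII digits,
    # matching the "letters, _, digits" wording quoted above.
    return is_identifier_start(ch) or ch in "0123456789"


def is_identifier(s: str) -> bool:
    # Non-empty, valid first character, valid remaining characters.
    return (
        bool(s)
        and is_identifier_start(s[0])
        and all(is_identifier_continue(c) for c in s[1:])
    )
```

Under this reading, is_identifier("größe") and is_identifier("_x1") hold,
while is_identifier("1x") does not. A real rule would probably also need
to decide which symbol categories, if any, to admit.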

What would it take (i.e., how difficult would it be) to get this fixed?
Would a bug report (plus a PR, though I'm not sure I can tackle that
myself) be enough, or will this require more in-depth discussion with
the compiler maintainers?

James

