Why is BOM required to use unicode in tokens?

Dominikus Dittes Scherkl dominikus at scherkl.de
Tue Sep 15 08:36:33 UTC 2020

On Tuesday, 15 September 2020 at 06:49:08 UTC, Jon Degenhardt wrote:
> On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus 
> wrote:
>> Identifiers start with a letter, _, or universal alpha, and 
>> are followed by any number of letters, _, digits, or universal 
>> alphas. Universal alphas are as defined in ISO/IEC 
>> 9899:1999(E) Appendix D of the C99 Standard.
> I was unable to find the definition of a "universal alpha", or 
> whether that includes non-ascii alphabetic characters.

ISO/IEC 9899:1999 (E)
Annex D

Universal character names for identifiers

Latin: 00AA, 00BA, 00C0-00D6, 00D8-00F6, 00F8-01F5, 01FA-0217, 
0250-02A8, 1E00-1E9B, 1EA0-1EF9, 207F
Greek: 0386, 0388-038A, 038C, 038E-03A1, 03A3-03CE, 03D0-03D6, 
03DA, 03DC, 03DE, 03E0, 03E2-03F3, 1F00-1F15, 1F18-1F1D, 
1F20-1F45, 1F48-1F4D, 1F50-1F57, 1F59, 1F5B, 1F5D, 1F5F-1F7D, 
1F80-1FB4, 1FB6-1FBC, 1FC2-1FC4, 1FC6-1FCC, 1FD0-1FD3, 1FD6-1FDB, 
1FE0-1FEC, 1FF2-1FF4, 1FF6-1FFC
Cyrillic: 0401-040C, 040E-044F, 0451-045C, 045E-0481, 0490-04C4, 
04C7-04C8, 04CB-04CC, 04D0-04EB, 04EE-04F5, 04F8-04F9
Armenian: 0531-0556, 0561-0587
Hebrew: 05B0-05B9, 05BB-05BD, 05BF, 05C1-05C2, 05D0-05EA, 
05F0-05F2
Arabic: 0621-063A, 0640-0652, 0670-06B7, 06BA-06BE, 06C0-06CE, 
06D0-06DC, 06E5-06E8, 06EA-06ED
Devanagari: 0901-0903, 0905-0939, 093E-094D, 0950-0952, 0958-0963
Bengali: 0981-0983, 0985-098C, 098F-0990, 0993-09A8, 09AA-09B0, 
09B2, 09B6-09B9, 09BE-09C4, 09C7-09C8, 09CB-09CD, 09DC-09DD, 
09DF-09E3, 09F0-09F1
Gurmukhi: 0A02, 0A05-0A0A, 0A0F-0A10, 0A13-0A28, 0A2A-0A30, 
0A32-0A33, 0A35-0A36, 0A38-0A39, 0A3E-0A42, 0A47-0A48, 0A4B-0A4D, 
0A59-0A5C, 0A5E, 0A74
Gujarati: 0A81-0A83, 0A85-0A8B, 0A8D, 0A8F-0A91, 0A93-0AA8, 
0AAA-0AB0, 0AB2-0AB3, 0AB5-0AB9, 0ABD-0AC5, 0AC7-0AC9, 0ACB-0ACD, 
0AD0, 0AE0
Oriya: 0B01-0B03, 0B05-0B0C, 0B0F-0B10, 0B13-0B28, 0B2A-0B30, 
0B32-0B33, 0B36-0B39, 0B3E-0B43, 0B47-0B48, 0B4B-0B4D, 0B5C-0B5D, 
0B5F-0B61
Tamil: 0B82-0B83, 0B85-0B8A, 0B8E-0B90, 0B92-0B95, 0B99-0B9A, 
0B9C, 0B9E-0B9F, 0BA3-0BA4, 0BA8-0BAA, 0BAE-0BB5, 0BB7-0BB9, 
0BBE-0BC2, 0BC6-0BC8, 0BCA-0BCD
Telugu: 0C01-0C03, 0C05-0C0C, 0C0E-0C10, 0C12-0C28, 0C2A-0C33, 
0C35-0C39, 0C3E-0C44, 0C46-0C48, 0C4A-0C4D, 0C60-0C61
Kannada: 0C82-0C83, 0C85-0C8C, 0C8E-0C90, 0C92-0CA8, 0CAA-0CB3, 
0CB5-0CB9, 0CBE-0CC4, 0CC6-0CC8, 0CCA-0CCD, 0CDE, 0CE0-0CE1
Malayalam: 0D02-0D03, 0D05-0D0C, 0D0E-0D10, 0D12-0D28, 0D2A-0D39, 
0D3E-0D43, 0D46-0D48, 0D4A-0D4D, 0D60-0D61
Thai: 0E01-0E3A, 0E40-0E5B
Lao: 0E81-0E82, 0E84, 0E87-0E88, 0E8A, 0E8D, 0E94-0E97, 
0E99-0E9F, 0EA1-0EA3, 0EA5, 0EA7, 0EAA-0EAB, 0EAD-0EAE, 
0EB0-0EB9, 0EBB-0EBD, 0EC0-0EC4, 0EC6, 0EC8-0ECD, 0EDC-0EDD
Tibetan: 0F00, 0F18-0F19, 0F35, 0F37, 0F39, 0F3E-0F47, 0F49-0F69, 
0F71-0F84, 0F86-0F8B, 0F90-0F95, 0F97, 0F99-0FAD, 0FB1-0FB7, 0FB9
Georgian: 10A0-10C5, 10D0-10F6
Hiragana: 3041-3093, 309B-309C
Katakana: 30A1-30F6, 30FB-30FC
Bopomofo: 3105-312C
CJK Unified Ideographs: 4E00-9FA5
Hangul: AC00-D7A3
Digits: 0660-0669, 06F0-06F9, 0966-096F, 09E6-09EF, 0A66-0A6F, 
0AE6-0AEF, 0B66-0B6F, 0BE7-0BEF, 0C66-0C6F, 0CE6-0CEF, 0D66-0D6F, 
0E50-0E59, 0ED0-0ED9, 0F20-0F33
Special characters: 00B5, 00B7, 02B0-02B8, 02BB, 02BD-02C1, 
02D0-02D1, 02E0-02E4, 037A, 0559, 093D, 0B3D, 1FBE, 203F-2040, 
2102, 2107, 210A-2113, 2115, 2118-211D, 2124, 2126, 2128, 
212A-2131, 2133-2138, 2160-2182, 3005-3007, 3021-3029
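
To make the rule concrete, here is a rough sketch (in Python, only 
for illustration) of how a lexer could test a character against 
such a range table. The table below is a small excerpt of the 
ranges quoted above, not the full Annex D list, and the names 
RANGES, is_universal_alpha and is_identifier are made up for this 
example; this is not how any D compiler actually implements it.

```python
from bisect import bisect_right

# Excerpt of the Annex D ranges quoted above (NOT the full table).
# Entries are inclusive (first, last) pairs, sorted by first code point.
RANGES = [
    (0x00AA, 0x00AA), (0x00BA, 0x00BA), (0x00C0, 0x00D6),
    (0x00D8, 0x00F6), (0x00F8, 0x01F5),   # Latin (excerpt)
    (0x0531, 0x0556), (0x0561, 0x0587),   # Armenian
    (0x3041, 0x3093), (0x309B, 0x309C),   # Hiragana
    (0x4E00, 0x9FA5),                     # CJK Unified Ideographs
    (0xAC00, 0xD7A3),                     # Hangul
]
STARTS = [first for first, _ in RANGES]


def is_universal_alpha(ch):
    """True if ch falls inside one of the excerpted ranges."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and cp <= RANGES[i][1]


def is_identifier(s):
    """Letter, _ or universal alpha first; then those or ASCII digits."""
    def is_start(c):
        return c == "_" or (c.isascii() and c.isalpha()) or is_universal_alpha(c)

    def is_continue(c):
        return is_start(c) or (c.isascii() and c.isdigit())

    return bool(s) and is_start(s[0]) and all(is_continue(c) for c in s[1:])
```

With this excerpt, is_identifier("größe") and is_identifier("日本") 
are accepted, while "1x" and "a-b" are rejected. The Annex D 
"Digits" ranges are left out of the excerpt for brevity.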


This list is thoroughly outdated. It also doesn't allow for 
letter-like symbols. Whether those belong in identifiers is 
debatable, but the mathematical ones in particular, such as the 
double-struck letters, are intended for exactly this kind of use.
Instead of relying on an old C standard, D would do better to use 
the character properties from UnicodeData.txt directly, since that 
file is updated with every new Unicode version.
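
As a rough sketch of what a property-based check could look like 
(again in Python, whose unicodedata module is generated from the 
Unicode Character Database): the choice of general categories 
below is my own assumption, loosely in the spirit of the usual 
"letters to start, letters/marks/digits to continue" rule, and is 
not taken from the D spec. The names START, CONTINUE and 
is_identifier are hypothetical.

```python
import unicodedata

# Assumed category sets (an illustration, not D's actual rule):
# letters and letter numbers may start an identifier ...
START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# ... and marks, decimal digits and connector punctuation may follow.
CONTINUE = START | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier(s):
    """Classify by Unicode general category instead of a fixed table."""
    if not s:
        return False
    first_ok = s[0] == "_" or unicodedata.category(s[0]) in START
    return first_ok and all(unicodedata.category(c) in CONTINUE for c in s[1:])


# Which Unicode version the property data comes from.
print(unicodedata.unidata_version)
```

With a check like this, a double-struck letter such as U+1D538 
(general category Lu) is accepted without anyone having to extend 
a hard-coded table, and the accepted set follows whatever Unicode 
version the property data was generated from.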
