Updating D beyond Unicode 2.0

Fri Sep 21 16:27:46 UTC 2018

The identifier characters D currently accepts are based on Unicode 
2.0:

* ASCII range values are handled specially.
* Letters and combining marks from Unicode 2.0 are accepted.
* Numbers outside the ASCII range are accepted.
* Eight seemingly arbitrary punctuation marks are accepted.

This follows the C99 standard.

Many languages define their identifiers in terms of the Unicode 
standard explicitly: C#, Go, Java, Python, ECMAScript, just to 
name a few. A small number of languages reject non-ASCII 
characters: Dart, Perl. Some languages are weirdly generous: 
Swift and C11 allow everything outside the Basic Multilingual 
Plane.

I'd like to update D so that it accepts a character as a valid 
identifier character if it's a letter, combining mark, or 
modifier symbol that's present in Unicode 11, or a non-ASCII 
number. This covers the 146 most popular writing systems and a 
lot more characters from those writing systems. On its own, this 
rule *would* reject those eight punctuation marks, so I'll keep 
them in as legacy characters.
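
To make the proposed rule concrete, here is a rough sketch in Python rather than D, purely because Python's `unicodedata` module exposes the general categories directly. It is not the code from my patch. The name `is_ident_char` and the `LEGACY_CHARS` placeholder are made up for illustration, and I'm assuming "combining mark" means every M* category and "modifier symbol" means category Sk. The Unicode version it checks against is whatever your Python ships with, which is not necessarily 11.

```python
import unicodedata

# Hypothetical placeholder: the eight legacy punctuation marks are not
# listed here, so the set is left empty on purpose.
LEGACY_CHARS = frozenset()


def is_ident_char(ch, legacy=LEGACY_CHARS):
    """Sketch of the proposed rule for a single character."""
    # ASCII is handled specially: letters, digits and underscore only.
    if ord(ch) < 0x80:
        return ch == "_" or ch.isalnum()
    cat = unicodedata.category(ch)
    # Letters (L*), marks (M*) and numbers (N*) outside ASCII,
    # plus modifier symbols (Sk) and any legacy characters.
    return cat[0] in "LMN" or cat == "Sk" or ch in legacy
```

This ignores the difference between characters that may start an identifier and characters that may only continue one, so it only describes the character set itself.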

It would mean we don't have to reference the C99 standard when 
enumerating the allowed characters; we just have to refer to the 
Unicode standard, which we already need to talk about in the 
lexical part of the spec.

It might also make the lexer a tiny bit faster, since it reduces 
the number of valid-identifier-character segments to search from 
245 to 134. On the other hand, it changes the segment endpoints 
from wchar to dchar, which means the table takes up marginally 
more memory.
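
For a feel of the tradeoff, here is a small Python sketch of a segment lookup and the table-size arithmetic. It assumes the table is a sorted list of (first, last) code point pairs searched by binary search, and that each endpoint takes 2 bytes as a wchar or 4 bytes as a dchar with no padding. The handful of segments shown is an illustrative subset, not the real table, and the names `in_segments` and `table_bytes` are made up.

```python
from bisect import bisect_right

# Illustrative subset only: sorted, non-overlapping (first, last) pairs.
SEGMENTS = [
    (0x00AA, 0x00AA),
    (0x00C0, 0x00D6),
    (0x00D8, 0x00F6),
    (0x0391, 0x03A1),
    (0x10000, 0x1000B),  # outside the BMP, so it needs a 32-bit endpoint
]
STARTS = [first for first, _ in SEGMENTS]


def in_segments(cp):
    """Binary-search the segment table for code point cp."""
    # Find the last segment whose start is <= cp, then check its end.
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and SEGMENTS[i][0] <= cp <= SEGMENTS[i][1]


def table_bytes(segment_count, bytes_per_code_point):
    """Size of a table storing two endpoints per segment."""
    return segment_count * 2 * bytes_per_code_point


print(table_bytes(245, 2))  # current table under these assumptions: 980
print(table_bytes(134, 4))  # proposed table under these assumptions: 1072
```

Under those assumptions the table grows by under a hundred bytes. A binary search over 134 entries takes about as many steps as one over 245, so any speedup really would be tiny.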

And, of course, it lets you write programs entirely in Linear B, 
and that's a marketing ploy not to be missed.

I've got this coded up and can submit a PR, but I thought I'd get 
feedback here first.

Does anyone see any horrible potential problems here?

Or is there an interestingly better option?

Does this need a DIP?