Updating D beyond Unicode 2.0

Tue Sep 25 19:21:49 UTC 2018

On 2018-09-21 18:27, Neia Neutuladh wrote:
> D's currently accepted identifier characters are based on Unicode 2.0:
> 
> * ASCII range values are handled specially.
> * Letters and combining marks from Unicode 2.0 are accepted.
> * Numbers outside the ASCII range are accepted.
> * Eight random punctuation marks are accepted.
> 
> This follows the C99 standard.
> 
> Many languages use the Unicode standard explicitly: C#, Go, Java, 
> Python, ECMAScript, just to name a few. A small number of languages 
> reject non-ASCII characters: Dart, Perl. Some languages are weirdly 
> generous: Swift and C11 allow everything outside the Basic Multilingual 
> Plane.
> 
> I'd like to update that so that D accepts something as a valid 
> identifier character if it's a letter or combining mark or modifier 
> symbol that's present in Unicode 11, or a non-ASCII number. This allows 
> the 146 most popular writing systems and a lot more characters from 
> those writing systems. This *would* reject those eight random 
> punctuation marks, so I'll keep them in as legacy characters.
> 
> It would mean we don't have to reference the C99 standard when 
> enumerating the allowed characters; we just have to refer to the Unicode 
> standard, which we already need to talk about in the lexical part of the 
> spec.
> 
> It might also make the lexer a tiny bit faster; it reduces the number of 
> valid-ident-char segments to search from 245 to 134. On the other hand, 
> it will change the ident char ranges from wchar to dchar, which means 
> the table takes up marginally more memory.
> 
> And, of course, it lets you write programs entirely in Linear B, and 
> that's a marketing ploy not to be missed.
> 
> I've got this coded up and can submit a PR, but I thought I'd get 
> feedback here first.
> 
> Does anyone see any horrible potential problems here?
> 
> Or is there an interestingly better option?
> 
> Does this need a DIP?
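
For anyone following the thread, the rule proposed in the quoted message can be sketched roughly as below. This is only an illustration in Python, not the actual lexer change. The names `is_ident_char`, `is_identifier` and `LEGACY_CHARS` are made up for the example. Python's `unicodedata` module uses whatever Unicode version ships with the interpreter, which need not be Unicode 11. The eight legacy punctuation marks are not listed in the post, so their set is left empty. The sketch also assumes an identifier may not start with an ASCII digit.

```python
import unicodedata

# Hypothetical placeholder: the post mentions eight legacy punctuation
# marks but does not list them, so this set is left empty here.
LEGACY_CHARS = frozenset()

def is_ident_char(ch):
    """Approximate the proposed rule for a single character."""
    if ord(ch) < 0x80:
        # ASCII is handled specially: letters, digits and underscore.
        return ch == "_" or ch.isalnum()
    if ch in LEGACY_CHARS:
        return True
    cat = unicodedata.category(ch)
    # Letters (L*), combining marks (M*), non-ASCII numbers (N*)
    # and modifier symbols (Sk).
    return cat[0] in "LMN" or cat == "Sk"

def is_identifier(s):
    """Assumption: an identifier is non-empty and does not start with an ASCII digit."""
    if not s or "0" <= s[0] <= "9":
        return False
    return all(is_ident_char(c) for c in s)
```

With this sketch, `is_identifier("naïve")` and a Linear B syllable such as `"\U00010000"` are accepted, while `"1abc"` and `"a-b"` are rejected.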

I'm not a native English speaker, but I write all my public and private 
code in English. I expect anyone I work with to write their code in 
English as well, and I make sure they do. English alone is not enough 
either; it has to be American English.

Despite this, I think that D should support as much of Unicode as 
possible (including Unicode in identifiers). It should not be up 
to the programming language to decide which language the developer 
should write the code in.

-- 
/Jacob Carlborg