Updating D beyond Unicode 2.0

Mon Sep 24 13:26:14 UTC 2018

On 9/24/18 12:23 AM, Neia Neutuladh wrote:
> On Monday, 24 September 2018 at 01:39:43 UTC, Walter Bright wrote:
>> On 9/23/2018 3:23 PM, Neia Neutuladh wrote:
>>> Okay, that's why you previously selected C99 as the standard for what 
>>> characters to allow. Do you want to update to match C11? It's been 
>>> out for the better part of a decade, after all.
>>
>> I wasn't aware it changed in C11.
> 
> http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf page 522 (PDF 
> numbering) or 504 (internal numbering).
> 
> Outside the BMP, almost everything is allowed, including many things 
> that are not currently mapped to any Unicode value. Within the BMP, a 
> heck of a lot of stuff is allowed, including a lot that D doesn't 
> currently allow.
> 
> GCC hasn't even updated to the C99 standard here, as far as I can tell, 
> but clang-5.0 is up to date.

I searched around for the current state of symbol names in C and found 
some really crappy rules, though maybe this site isn't up to date:

https://en.cppreference.com/w/c/language/identifier

What I understand from that is:

1. Yes, you can use any Unicode character you want in C/C++ (seemingly 
since C99).
2. There are no rules about what *encoding* is acceptable; it's 
implementation-defined. So various compilers have different rules as to 
what will be accepted in the actual source code. In fact, I read 
somewhere that not even ASCII is guaranteed to be supported.

The result is that you have to write the identifiers with ASCII escape 
sequences for the code to be actually portable, which to me completely 
defeats the purpose of using such identifiers in the first place.

For example, on that page, they have a line that works in clang, not in 
GCC (tagged as implementation defined):

char *🐱 = "cat";

The portable version looks like this:

char *\U0001f431 = "cat";

Seriously, who wants to use that?
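
To make the mapping concrete, here is a minimal sketch in C of how the 
escaped spelling is derived from the UTF-8 bytes of a character. The 
helper names decode_utf8 and format_ucn are made up for illustration, 
the decoder skips overlong and surrogate checks, and nothing here 
verifies that the code point is one C actually allows in an identifier:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decode the first UTF-8 sequence of a
   NUL-terminated string. Returns the code point, or UINT32_MAX if the
   lead byte or a continuation byte is malformed. */
uint32_t decode_utf8(const unsigned char *s)
{
    uint32_t cp;
    int extra; /* number of continuation bytes expected */

    assert(s != NULL);
    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else return UINT32_MAX;

    for (int i = 1; i <= extra; i++) {
        /* A NUL terminator fails this check, so we never read past it. */
        if ((s[i] & 0xC0) != 0x80)
            return UINT32_MAX;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    return cp;
}

/* Hypothetical helper: write the escaped spelling of cp into out,
   using the 4-digit form inside the BMP and the 8-digit form outside
   it. Returns what snprintf returns (the length of the spelling). */
int format_ucn(uint32_t cp, char *out, size_t n)
{
    assert(out != NULL);
    return cp <= 0xFFFF
        ? snprintf(out, n, "\\u%04x", (unsigned)cp)
        : snprintf(out, n, "\\U%08x", (unsigned)cp);
}
```

Feeding decode_utf8 the four UTF-8 bytes of the cat emoji (F0 9F 90 B1) 
gives code point 0x1F431, which format_ucn spells as \U0001f431, the 
same escape as in the portable line above.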

Now, D can potentially do better (especially when all front-ends are the 
same) and support such things in the spec, but I think the argument 
"because C supports it" is kind of bunk.

Or am I reading it wrong?

In any case, I would expect that symbol name support should be focused 
only on languages which people use, not emojis. If there are words in 
Chinese or Japanese that can't be expressed using D, while other words 
can, it would seem inconsistent to a Chinese- or Japanese-speaking user, 
and I think we should work to fix that. I just have no idea what the 
state of that is.

I also tend to agree that most code is going to be written in English, 
even when the primary language of the user is not. Part of the reason, 
which I haven't read here yet, is that all the keywords are in English. 
Someone has to kind of understand those to get the meaning of some 
constructs, and it's going to read strangely with the non-English words.

One group which I believe hasn't spoken up yet is the group making the 
Hunt framework, who I believe are all Chinese. At least their web site 
is. It would be good to hear from a group like that, which has 
extensive experience writing mature D code (it appears all to be in 
English), about how they feel about the support.

-Steve