Why is BOM required to use unicode in tokens?

Tue Sep 15 16:23:01 UTC 2020

On Tuesday, 15 September 2020 at 14:59:03 UTC, Steven 
Schveighoffer wrote:
> On 9/15/20 10:18 AM, James Blachly wrote:
>> What will it take (i.e. order of difficulty) to get this fixed 
>> -- will merely a bug report (and PR, not sure if I can tackle 
>> or not) do it, or will this require more in-depth discussion 
>> with compiler maintainers?
>
> I'm thinking your issue will not be fixed (just like we don't 
> allow $abc to be an identifier). But the spec can be fixed to 
> refer to the correct standards.

This looks specific to the '∂' character. Non-ASCII alphabetic 
characters generally work.

# The 'Ш' and 'ä' characters are fine.
$ echo $'import std.stdio; void Шä() { writeln("Hello World!"); } 
void main() { Шä(); }' | dmd -run -
Hello World!

# But not '∂'
$ echo $'import std.stdio; void x∂() { writeln("Hello World!"); } 
void main() { x∂(); }' | dmd -run -
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token
__stdin.d(1): Error: char 0x2202 not allowed in identifier
__stdin.d(1): Error: character 0x2202 is not a valid token

The difference is that 'Ш' and 'ä' satisfy the definition of a 
Unicode letter and '∂' does not (using D's current Unicode 
definitions). I'll use tsv-filter (from tsv-utils) to show this 
rather than writing out the full D code; under the hood it uses 
std.regex.matchFirst().

# The input
$ echo $'x\n∂\nШ\nä'
x
∂
Ш
ä

# The input filtered by Unicode letter '\p{L}'
$ echo $'x\n∂\nШ\nä' | tsv-filter --regex 1:'^\p{L}$'
x
Ш
ä
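
For anyone who wants the D version of that filter, here is a 
minimal sketch. The helper name isSingleLetter is made up for 
illustration, and the sketch only tests the '\p{L}' property through 
std.regex. It does not reproduce whatever table the dmd lexer 
consults for identifiers, so treat it as a check on the Unicode 
category and nothing more.

```d
import std.regex : matchFirst, regex;
import std.stdio : writefln;

/// Hypothetical helper: true if `s` is exactly one Unicode letter
/// (general category L), using the same pattern as the tsv-filter call.
bool isSingleLetter(string s)
{
    auto letter = regex(`^\p{L}$`);
    // matchFirst returns an empty Captures when nothing matches.
    return !matchFirst(s, letter).empty;
}

void main()
{
    // x, U+2202 (∂), U+0428 (Ш), U+00E4 (ä)
    foreach (s; ["x", "\u2202", "\u0428", "\u00E4"])
        writefln("%s: %s", s, isSingleLetter(s));
}
```

This should report true for every input except '∂', matching the 
tsv-filter result.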

The spec can be made clearer and more correct. But if a "universal 
alpha" is essentially a Unicode letter, then '∂' is excluded by 
design, and what you might be looking for is a change in the spec 
to allow the symbol you chose.

--Jon