Why is BOM required to use unicode in tokens?

Jon Degenhardt jond at noreply.com
Tue Sep 15 06:49:08 UTC 2020

On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
> On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly 
> wrote:
>> I wish to write a function including ∂x and ∂y (these are 
>> trivial to type with appropriate keyboard shortcuts - alt+d on 
>> Mac), but without a unicode byte order mark at the beginning 
>> of the file, the lexer rejects the tokens.
>> It is not apparently easy to insert such marks (AFAICT no 
>> common tool does this specifically), while other languages 
>> work fine (i.e., accept unicode in their source) without it.
>> Is there a downside to at least presuming UTF-8?
> According to the spec [1] this should Just Work. I'd recommend 
> filing a bug.
> [1] https://dlang.org/spec/lex.html#source_text
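
For anyone who wants to reproduce the with-BOM and without-BOM cases, here is a minimal Python sketch that checks for a UTF-8 byte order mark and prepends one. The helper names has_utf8_bom and add_utf8_bom are made up for illustration, and the sample source line is only a placeholder. It says nothing about what the D lexer does with the result.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    # The UTF-8 BOM is the three bytes EF BB BF (codecs.BOM_UTF8).
    return data.startswith(codecs.BOM_UTF8)

def add_utf8_bom(data: bytes) -> bytes:
    # Prepend the BOM only if it is not already there.
    return data if has_utf8_bom(data) else codecs.BOM_UTF8 + data

# Placeholder source text containing U+2202 (the partial differential sign).
source = "double \u2202x = 1.0;\n".encode("utf-8")
marked = add_utf8_bom(source)
```

Writing marked to a file gives the BOM-prefixed variant and writing source gives the plain UTF-8 variant, so the two can be compiled side by side.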

The identifiers section of the spec 
(https://dlang.org/spec/lex.html#identifiers) describes 
identifiers as:

> Identifiers start with a letter, _, or universal alpha, and are 
> followed by any number of letters, _, digits, or universal 
> alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E) 
> Appendix D of the C99 Standard.

I was unable to find the definition of a "universal alpha", or to 
determine whether it includes non-ASCII alphabetic characters.
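
One related thing that can be checked without the C99 text is how Unicode itself classifies the character in question. The sketch below uses Python's unicodedata module, and the helper name describe is made up for illustration. It does not show what C99 Appendix D lists. It only shows that ∂ is a math symbol rather than a letter in Unicode, so a rule built around alphabetic characters might not cover it.

```python
import unicodedata

def describe(ch: str):
    # Return the code point, the Unicode general category, and whether
    # Python treats the character as alphabetic.
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isalpha())

# "x" is ASCII, "\u00e9" is e with acute accent, "\u2202" is the
# partial differential sign from the original question.
for ch in ("x", "\u00e9", "\u2202"):
    print(describe(ch))
```

Category "Ll" means lowercase letter and "Sm" means math symbol.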
