Why is BOM required to use unicode in tokens?
Jon Degenhardt
jond at noreply.com
Tue Sep 15 06:49:08 UTC 2020
On Tuesday, 15 September 2020 at 02:23:31 UTC, Paul Backus wrote:
> On Tuesday, 15 September 2020 at 01:49:13 UTC, James Blachly
> wrote:
>> I wish to write a function including ∂x and ∂y (these are
>> trivial to type with appropriate keyboard shortcuts - alt+d on
>> Mac), but without a unicode byte order mark at the beginning
>> of the file, the lexer rejects the tokens.
>>
>> It is not apparently easy to insert such marks (AFAICT no
>> common tool does this specifically), while other languages
>> work fine (i.e., accept unicode in their source) without it.
>>
>> Is there a downside to at least presuming UTF-8?
>
> According to the spec [1] this should Just Work. I'd recommend
> filing a bug.
>
> [1] https://dlang.org/spec/lex.html#source_text
The identifiers section
(https://dlang.org/spec/lex.html#identifiers) describes
identifiers as:
> Identifiers start with a letter, _, or universal alpha, and are
> followed by any number of letters, _, digits, or universal
> alphas. Universal alphas are as defined in ISO/IEC 9899:1999(E)
> Appendix D of the C99 Standard.
I was unable to find the definition of a "universal alpha", or to
determine whether it includes non-ASCII alphabetic characters.
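
To make the quoted rule concrete, here is a minimal sketch in Python (not D) of how I read it. Everything in it is an assumption on my part: the names are hypothetical, "letter" is taken to mean an ASCII letter, and the range table is only a small excerpt of ranges as I recall them from C99 Annex D, so it should be checked against the standard itself. It also ignores keywords.

```python
import unicodedata

# ASSUMPTION: a tiny, incomplete excerpt of ranges recalled from
# C99 Annex D. Verify against the standard before relying on it.
UNIVERSAL_ALPHA_EXCERPT = [
    (0x00C0, 0x00D6),  # some Latin letters
    (0x00D8, 0x00F6),  # some Latin letters
    (0x03A3, 0x03CE),  # some Greek letters
]

def is_universal_alpha(ch):
    """True if ch falls in one of the excerpted ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UNIVERSAL_ALPHA_EXCERPT)

def is_ascii_letter(ch):
    # ASSUMPTION: "letter" in the spec text means an ASCII letter.
    return "a" <= ch <= "z" or "A" <= ch <= "Z"

def is_identifier(s):
    """Apply the quoted rule: a letter, _, or universal alpha first,
    then any number of letters, _, digits, or universal alphas."""
    if not s:
        return False

    def start_ok(ch):
        return ch == "_" or is_ascii_letter(ch) or is_universal_alpha(ch)

    def rest_ok(ch):
        return start_ok(ch) or "0" <= ch <= "9"

    return start_ok(s[0]) and all(rest_ok(ch) for ch in s[1:])

def describe(ch):
    """Show the code point, Unicode name, and general category of ch."""
    return "U+%04X %s (%s)" % (
        ord(ch), unicodedata.name(ch), unicodedata.category(ch))

if __name__ == "__main__":
    for candidate in ["_x1", "αβ", "1x", "∂x"]:
        print(candidate, is_identifier(candidate))
    print(describe("∂"))
```

With this excerpt, "∂x" is rejected, but that only shows ∂ is not in the few ranges listed here. What can be said with certainty is that Unicode classifies ∂ (U+2202) as a math symbol (category Sm) rather than a letter. Whether Annex D lists it has to be confirmed against the full table.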