Why is BOM required to use unicode in tokens?
H. S. Teoh
hsteoh at quickfur.ath.cx
Tue Sep 15 05:34:04 UTC 2020
On Mon, Sep 14, 2020 at 09:49:13PM -0400, James Blachly via Digitalmars-d-learn wrote:
> I wish to write a function including ∂x and ∂y (these are trivial to
> type with appropriate keyboard shortcuts - alt+d on Mac), but without
> a unicode byte order mark at the beginning of the file, the lexer
> rejects the tokens.
>
> It is not apparently easy to insert such marks (AFAICT no common tool
> does this specifically), while other languages work fine (i.e., accept
> unicode in their source) without it.
>
> Is there a downside to at least presuming UTF-8?
I tested this locally, with and without a BOM; either way the lexer
rejects ∂ in an identifier. I suspect the reason has nothing to do with
BOMs, but with the fact that ∂ is not classified as alphabetic (see
std.uni.isAlpha, which returns false for ∂). The following code, which
contains a Cyrillic letter, compiles just fine without a BOM
(std.uni.isAlpha('Ш') returns true):
	import std.stdio; // needed for writeln

	void main() {
		int Ш = 1; // Cyrillic capital letter Sha as an identifier
		writeln(Ш);
	}
As the docs for std.uni.isAlpha state, it tests for the general Unicode
category 'Alphabetic'. Identifiers are probably restricted to
characters of this category plus the numerics and '_' (and maybe one or
two others, perhaps '$'? Don't remember now).
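A quick way to check the classification claim is something like the
sketch below. It relies only on std.uni.isAlpha and std.uni.isNumber
from Phobos; looksLikeIdentifierChar is a hypothetical helper that
mirrors the guess above and is not the compiler's actual rule. The
characters are written as \u escapes so the file stays pure ASCII.

```d
import std.stdio : writefln;
import std.uni : isAlpha, isNumber;

// Hypothetical helper (an assumption, not the real lexer rule):
// approximates "may appear in an identifier" as alphabetic,
// numeric, or underscore.
bool looksLikeIdentifierChar(dchar c)
{
    return c == '_' || isAlpha(c) || isNumber(c);
}

void main()
{
    // U+0428 is Cyrillic Sha, U+2202 is the partial differential sign.
    foreach (dchar c; "\u0428\u2202x_")
    {
        writefln("U+%04X isAlpha=%s idChar=%s",
                 cast(uint) c, isAlpha(c), looksLikeIdentifierChar(c));
    }
}
```

If the guess is roughly right, U+2202 should be the only character in
that list for which the helper returns false.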
T
--
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG