Why is BOM required to use unicode in tokens?

H. S. Teoh hsteoh at quickfur.ath.cx
Tue Sep 15 05:34:04 UTC 2020


On Mon, Sep 14, 2020 at 09:49:13PM -0400, James Blachly via Digitalmars-d-learn wrote:
> I wish to write a function including ∂x and ∂y (these are trivial to
> type with appropriate keyboard shortcuts - alt+d on Mac), but without
> a unicode byte order mark at the beginning of the file, the lexer
> rejects the tokens.
> 
> It is apparently not easy to insert such marks (AFAICT no common tool
> does this specifically), while other languages work fine (i.e., they
> accept Unicode in their source) without it.
> 
> Is there a downside to at least presuming UTF-8?

Tested it locally, with and without a BOM; the lexer rejects ∂ either
way.  I suspect the reason has nothing to do with BOMs, but with the
fact that ∂ is not classified as alphabetic (see std.uni.isAlpha,
which returns false for ∂).  The following code, which contains Cyrillic
letters, compiles just fine without a BOM (std.uni.isAlpha('Ш') returns
true):

	import std.stdio;

	void main() {
		int Ш = 1;
		writeln(Ш);
	}

As the docs for std.uni.isAlpha state, it tests for the general Unicode
category 'Alphabetic'.  Identifiers are probably restricted to
characters of this category plus the digits and '_' (and maybe one or
two others, perhaps '$'?  I don't remember now).
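
For a quick sanity check, here's a minimal sketch that queries the same
std.uni.isAlpha classification directly for both characters:

	import std.stdio;
	import std.uni : isAlpha;

	void main() {
		// 'Ш' (Cyrillic capital Sha) is in the Alphabetic category,
		// so the lexer accepts it in an identifier.
		writeln(isAlpha('Ш'));  // true

		// '∂' (U+2202, PARTIAL DIFFERENTIAL) is a math symbol,
		// not Alphabetic, so it's rejected regardless of BOM.
		writeln(isAlpha('∂'));  // false
	}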


T

-- 
People say I'm indecisive, but I'm not sure about that. -- YHL, CONLANG
