Non-ASCII in the future in the lexer

Wed May 31 15:13:14 UTC 2023

On Wed, May 31, 2023 at 06:23:43AM +0000, Cecil Ward via Digitalmars-d wrote:
> What do you think? It occurred to me that as the language develops we
> are occasionally having discussions about new keywords, or even
> changing them, for example: s/body/do/ some while back.
> 
> Unicode has been around for 30 years now and yet it is not getting
> fully used in programming languages for example. We are still stuck in
> our minds with ASCII only. Should we in future start mining the riches
> of unicode when we make changes to the grammar of programming
> languages (and other grammars)?

D already supports Unicode identifiers.  For example, this is valid D
today:

	int функция(int параметр) {
		return (параметр > 0) ? 2*функция(параметр-1) + 1 : 2;
	}

Of course, current language keywords are English- (and ASCII-) only.

> Would it be worthwhile considering wider unicode alternatives for
> keywords that we already have? Examples: comparison operators and
> other operators. We have unicode symbols for
> 
> ≤     less than or equal <=
> ≥    greater than or equal >=
> 
> a proper multiplication sign ‘×’, like an x, as well as the * that we
> have been stuck with since the beginning of time.

This is all great, but as someone else has already said, the input
method could be a problem area.  On my PC, I've set up XKB input with a
compose key such that many of these symbols are relatively easily
accessible; for example, Compose + < + = produces ≤, and Compose + v + /
produces √.  However, some symbols are trickier to input, and some are
not accessible this way at all.  While it's always possible to, e.g.,
use a character map widget to select a particular symbol, that
significantly slows down typing, which hurts productivity.

One dream I've always had is the so-called software-controlled keyboard:
instead of a keyboard with physical keys, you'd have a keyboard that's
actually a touchscreen, with keys that can be replaced from software.
So for example, when writing D + Unicode symbols, you'd switch to
"Unicode D" layout where symbols like ≤, ≥, ×, etc. are easily
accessible.  We already have this on our mobile devices, in fact, to
various degrees of customizability.  It just has to be taken to the next
step of allowing easy remapping of keyboard layouts and switching
between them.  Each future programming language, for example, could come
with its own layout having language-specific symbols easily accessible.

> ± 	plus or minus might come in useful someday, can’t think what for.

In one of my projects, there's a vector calculator program where ±
produces an expression that returns a list of values produced by all
possible combinations of signs where the ± operator appears.  It's very
useful for certain applications, like combinatorial polytopes where ±
appears frequently.
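
To give a rough idea of what such a ± expansion does, here is a minimal
sketch in D.  It is not the code from my calculator; signCombinations is
a made-up helper name for illustration, and it only handles a flat list
of terms rather than a full expression:

````d
import std.stdio : writeln;

/// Hypothetical helper: returns every sign combination of `terms`,
/// i.e. what (±a, ±b, ...) would expand to.
double[][] signCombinations(const double[] terms) {
	double[][] results;
	// Bit i of mask selects the sign of term i: 0 means +, 1 means -.
	foreach (mask; 0 .. size_t(1) << terms.length) {
		auto combo = new double[terms.length];
		foreach (i, t; terms)
			combo[i] = ((mask >> i) & 1) ? -t : t;
		results ~= combo;
	}
	return results;
}

void main() {
	// (±1, ±2) expands to four vectors.
	foreach (v; signCombinations([1.0, 2.0]))
		writeln(v);
}
````

With n terms this yields 2^n vectors, so it is only practical for the
small n that typically shows up in polytope coordinates.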

[...]
> Argument against: would complicate our regexes with a new need for
> multiple alternatives as in  [xyz] rather than just one possible
> character in a search or replace operation. But I think that some
> regex engines are unicode aware and can understand concepts like all
> x-characters where x is some property or defines a subset.

std.regex *is* Unicode-aware, BTW. Check this out:

````d
import std;
string преобразовать(string текст) {
	return текст.replaceAll(regex(`[а-я]`), "X");
}
void main() {
	writeln("blah blah это не правда blah blah".преобразовать);
}
````

Output:

````
blah blah XXX XX XXXXXX blah blah
````

It correctly handles ranges of non-ASCII characters.
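
As for matching "all characters with property x": std.regex documents a
\p{PropertyName} syntax for Unicode property sets.  The sketch below
assumes that the script name Cyrillic is accepted there (std.uni exposes
it as unicode.Cyrillic); maskCyrillic is just an illustrative name.
Unlike the [а-я] range above, a script-based class should also cover
uppercase letters and ё:

````d
import std.regex : regex, replaceAll;
import std.stdio : writeln;

/// Hypothetical helper: replaces every Cyrillic character with 'X'.
string maskCyrillic(string text) {
	// Assumption: \p{Cyrillic} resolves to the Cyrillic script set.
	return text.replaceAll(regex(`\p{Cyrillic}`), "X");
}

void main() {
	writeln(maskCyrillic("blah Это ёж blah"));
}
````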

T

-- 
Real Programmers use "cat > a.out".