Non-ASCII in the future in the lexer

Thu Jun 1 15:47:00 UTC 2023

TL;DR: What you want can be gained using smart fonts or other 
smart UI tools.

---

On Wednesday, 31 May 2023 at 06:23:43 UTC, Cecil Ward wrote:
> Unicode has been around for 30 years now and yet it is not 
> getting fully used in programming languages for example. We are 
> still stuck in our minds with ASCII only. Should we in future 
> start mining the riches of unicode when we make changes to the 
> grammar of programming languages (and other grammars)?

The gain is too small for the cost. In some circumstances the 
gain is even negative, and that will happen at exactly those 
places where it is particularly unfortunate.

> Would it be worthwhile considering wider unicode alternatives 
> for keywords that we already have? Examples: comparison 
> operators and other operators. We have unicode symbols for
>
> ≤     less than or equal <=
> ≥    greater than or equal >=
>
> a proper multiplication sign ‘×’, like an x, as well as the * 
> that we have been stuck with since the beginning of time.
>
> ± 	plus or minus might come in useful someday, can’t think what 
> for.

I can: `±` could be used for in-place negation. Let’s say you 
have:
```d
ref int f(); // is costly or has side-effects
```
To negate the result in-place, you have to do:
```d
int* p = &f();
*p = -*p;
```
or
```d
(ref int x) { x = -x; }(f());
```
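
Both workarounds can be checked in a small self-contained program. This is only a sketch: `storage`, `calls` and the body of `f` are hypothetical stand-ins for whatever the real function does, and `negateInPlace` is a made-up helper name.

```d
import std.stdio : writeln;

int storage = 5; // hypothetical variable that f() hands out by reference
int calls = 0;   // counts how often f() is evaluated

// Stand-in for the costly or side-effecting function.
ref int f()
{
    ++calls;
    return storage;
}

// Hypothetical named helper, equivalent to the inline lambda.
void negateInPlace(ref int x)
{
    x = -x;
}

void main()
{
    // Workaround 1: take the address once, then negate through the pointer.
    int* p = &f();
    *p = -*p;
    assert(storage == -5 && calls == 1);

    // Workaround 2: bind the result to a ref parameter.
    negateInPlace(f());
    assert(storage == 5 && calls == 2);

    writeln(storage, " after ", calls, " calls");
}
```

In both variants `f` is evaluated exactly once per negation, which is the point of the exercise.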

> I have … as one character; would be nice to have that as an 
> alternative to .. (two ASCII fullstops) maybe?
>
> I realise that this issue is hardly about the cure for world 
> peace, but there seems to be little reason to be confined to 
> ASCII forever when there are better suited alternatives and 
> things that might spark the imagination of designers.

The problem is fonts that don’t support certain characters and 
editors defaulting to legacy encodings. One can handle 
`FranÃ§ais`, but `a Ã— b` (UTF-8 read as Windows-1252) is a 
problem because nobody can tell what the character was.
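
The two garbled strings can be reproduced with a short sketch, assuming Phobos’ `std.encoding` is available. `misreadAsWindows1252` is a hypothetical helper name; it reinterprets the UTF-8 bytes of a string as Windows-1252 and converts the result back to UTF-8, which is what a misconfigured editor effectively does. The non-ASCII characters are written as escapes so the example itself survives a bad encoding.

```d
import std.encoding : transcode, Windows1252String;

// Hypothetical helper: treat the UTF-8 bytes of `s` as Windows-1252 text
// and return what a reader would then see, re-encoded as UTF-8.
string misreadAsWindows1252(string s)
{
    auto raw = cast(Windows1252String) s; // same bytes, different label
    string seen;
    transcode(raw, seen);
    return seen;
}

void main()
{
    // U+00E7 (c with cedilla) is C3 A7 in UTF-8, shown as U+00C3 U+00A7.
    assert(misreadAsWindows1252("Fran\u00E7ais") == "Fran\u00C3\u00A7ais");

    // U+00D7 (multiplication sign) is C3 97 in UTF-8, shown as U+00C3
    // followed by U+2014, because byte 97 is a dash in Windows-1252.
    assert(misreadAsWindows1252("a \u00D7 b") == "a \u00C3\u2014 b");
}
```

Mechanically both strings break the same way. The difference is that a human can guess the missing letter in a word, but not the missing operator in an expression.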

It’s not so much that the gain is small; it’s the potential for 
high cost. A lot of people will avoid those like the plague 
because of legacy issues.

> One extreme case or two: Many editors now automatically employ 
> ‘ ’ supposed to be 6-9 quotes, instead of ASCII '', so too with 
> “ ” (6-9 matching pair).

Many document processors do that. Whoever writes code in them is 
doing it wrong.

> When Walter was designing the literal strings lexical items 
> many items needed to be found for all the alternatives. And we 
> have « » which are familiar to French speakers? It would be 
> very nice to to fall over on 6-9 quotes anyway, and just accept 
> them as an alternative.

Accepting them is one possibility. Having an editor that replaces 
“” by "" and ‘’ by '' is another. Any regex replace can easily be 
used for that: `‘([^’]*)’` by `'$1'`.
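
As a rough sketch of that replacement in D itself, using `std.regex`: the function name `asciiQuotes` is made up for illustration, and the pattern for double quotes is the obvious analogue of the one for single quotes.

```d
import std.regex : regex, replaceAll;

// Hypothetical helper: turn paired typographic quotes into ASCII quotes.
string asciiQuotes(string source)
{
    auto singles = regex(`‘([^’]*)’`); // ‘…’ pairs
    auto doubles = regex(`“([^”]*)”`); // “…” pairs
    return source
        .replaceAll(singles, "'$1'")
        .replaceAll(doubles, `"$1"`);
}

void main()
{
    assert(asciiQuotes("writeln(“hi”);") == `writeln("hi");`);
    assert(asciiQuotes("char c = ‘x’;") == "char c = 'x';");
}
```

Only matched pairs are replaced, so a lone ’ used as an apostrophe is left alone, although one that follows an opening ‘ in a comment can still be mistaken for its closing quote.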

> The second case that comes to mind: I was thinking about regex 
> grammars and XML’s grammar, and I think one or both can now 
> handle all kinds of unicode whitespace.

Definitely not regex. It’s not standardized at all.

XML is quite a non-problem because it directly supports 
specifying an encoding.

> That’s the kind of thinking I’m interested in. It would be good 
> to handle all kinds of whitespace, as we do all kinds of 
> newline sequences. We probably already do both well. And no one 
> complains saying ‘we ought not bother with tab’, so handling 
> U+0085 and the various whitespace types such as &nbsp in our 
> lexicon of our grammar is to me a no-brainer.
>
> And what use might we find some day for § and ¶ ? Could be 
> great for some new exotic grammatical structural pattern. Look 
> at the mess that C++ got into with the syntax of templates. 
> They needed something other than < >. Almost anything. They 
> could have done no worse with « ».

As a German, I find «» and ‹› a little irritating, because we’re 
using them like this: »« and ›‹. The Swiss use «content» and the 
French use « content » (with half-spaces).

C++ was wrong on template syntax, but they were right on using 
ASCII. D has good template syntax, and it’s ASCII.

> Another point: These exotics are easy to find in your text 
> editor because they won’t be overused.

Citation needed.

> As for usability, some of our tools now have or could have 
> ‘favourite characters’ or ‘snippet’ text strings in a place in 
> the ui where they are readily accessible. I have a unicode 
> character map app and also a file with my unicode favourite 
> characters in it. So there are things that we can do ourselves. 
> And having a favourites comment block in a starter template 
> file might be another example.

If you employ tooling, the best option is to leave the source 
code as-is and use an OpenType font or other UI-oriented things.

> Argument against: would complicate our regexes with a new need 
> for multiple alternatives as in  [xyz] rather than just one 
> possible character in a search or replace operation. But I 
> think that some regex engines are unicode aware and can 
> understand concepts like all x-characters where x is some 
> property or defines a subset.

Making `grep` harder to use is definitely a deal-breaker.
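
To illustrate with a sketch (the sample source line and the helper name `countMatches` are invented for this example): once two spellings of an operator are legal, a search for the ASCII spelling silently misses half the uses, and every search has to spell out the alternatives.

```d
import std.range : walkLength;
import std.regex : matchAll, regex;

// Hypothetical helper: number of non-overlapping matches of `pattern`.
size_t countMatches(string source, string pattern)
{
    return source.matchAll(regex(pattern)).walkLength;
}

void main()
{
    // Invented line mixing both spellings of "less than or equal".
    string line = "if (a <= b && c ≤ d) {}";

    assert(countMatches(line, `<=`) == 1);   // the habitual search misses one
    assert(countMatches(line, `<=|≤`) == 2); // the search now needed
}
```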

> I have a concern. I love the betterC idea. Something inside my 
> head tells me not to move too far from C. But we have already 
> left the grammar of C behind, for good reason. C doesn’t have 
> .. or … ( :-) ) nor does it have $. So that train has left. But 
> I’m talking about things that C is never going to have.

Unicode has U+2025 ‥ for you as well.

C is overly restrictive. It isn’t based on ASCII, but on a proper 
subset of ASCII that’s compatible with even older standards like 
EBCDIC. These days, ASCII support is quite a safe bet. 
Unicode support isn’t.

> One point of clarification: I am not talking about D runtime. 
> I’m confining myself to D’s lexer and D’s grammar.

It sounds great in theory, but if any tool in your chain has no 
support for that, you’re out. I ran into that on Windows 
recently. Not D related.

I’m a Unicode fan. I created my own keyboard layout which puts a 
lot of nice stuff on AltGr and dead key sequences (e.g. proper 
quotation marks, currency symbols, math symbols, the complete 
Greek alphabet) while leaving anything that is printed on the 
keys where it was. Yet I fail to see the advantage of × over * 
and similar *in code.* There are several fonts that visually 
replace <= by a wider ≤ sign, != by a wide ≠, etc. If you want 
alternatives, use a font. It’s non-intrusive to the source code. 
It’s a million times better than Unicode in source. I don’t use 
those fonts because for some reason, they add a plethora of 
things that make sense in certain languages, e.g. replace `>>` by 
a ligature (think of `»`). That makes sense when it’s an 
operator, but it doesn’t when it’s two closing angle brackets 
(cf. Java or C++).