Non-ASCII in the future in the lexer

Cecil Ward cecil at cecilward.com
Wed May 31 06:23:43 UTC 2023



What do you think? It occurred to me that as the language 
develops we occasionally have discussions about new keywords, 
or even about changing them, for example s/body/do/ a while 
back.

Unicode has been around for 30 years now, and yet it is still 
not fully used in programming languages, for example. In our 
minds we are still stuck with ASCII only. Should we in future 
start mining the riches of Unicode when we make changes to the 
grammar of programming languages (and other grammars)?

Would it be worthwhile considering wider Unicode alternatives 
for tokens that we already have? Examples: comparison operators 
and other operators. We have Unicode symbols for

≤    less than or equal (<=)
≥    greater than or equal (>=)

a proper multiplication sign ‘×’, like an x, as well as the * 
that we have been stuck with since the beginning of time.

± 	plus or minus, which might come in useful someday; can’t 
think what for.

I have … as one character; it would be nice to have that as an 
alternative to .. (two ASCII full stops), maybe?
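Treating these as pure spelling alternatives would be cheap at the lexer level. Here is a minimal, purely illustrative sketch (in Python for brevity; the alias table is my own invention, not anything in D’s actual lexer):

```python
# Illustrative only: a lexer could normalise Unicode spellings to the
# ASCII token text that the parser already understands.
UNICODE_ALIASES = {
    "\u2264": "<=",   # ≤  less than or equal
    "\u2265": ">=",   # ≥  greater than or equal
    "\u00d7": "*",    # ×  multiplication sign
    "\u2026": "..",   # …  horizontal ellipsis as a range token
}

def normalise(ch: str) -> str:
    """Map a Unicode alias to its ASCII token text, else return it unchanged."""
    return UNICODE_ALIASES.get(ch, ch)

print(normalise("\u2264"))  # prints "<="
```

The parser and everything downstream would then be untouched: only the lexer learns the new spellings.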

I realise that this issue is hardly about the cure for world 
peace, but there seems to be little reason to be confined to 
ASCII forever when there are better suited alternatives and 
things that might spark the imagination of designers.

One extreme case or two: many editors now automatically 
substitute ‘ ’ (so-called 6-9 quotes) for ASCII '', and likewise 
“ ” (a matching 6-9 pair) for ASCII "". When Walter was 
designing the string-literal lexical items, many delimiters 
needed to be found for all the alternatives. And we have « », 
which are familiar to French speakers. It would be very nice 
not to fall over on 6-9 quotes, and just accept them as an 
alternative. The second case that comes to mind: I was thinking 
about regex grammars and XML’s grammar, and I think one or both 
can now handle all kinds of Unicode whitespace. That’s the kind 
of thinking I’m interested in. It would be good to handle all 
kinds of whitespace, as we do all kinds of newline sequences. 
We probably already do both well. And no one complains saying 
‘we ought not bother with tab’, so handling U+0085 (NEL) and 
the various whitespace characters such as U+00A0 (no-break 
space) in our grammar’s lexicon is to me a no-brainer.
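To show what ‘all kinds of whitespace’ amounts to in practice, here is a tiny sketch (Python, just to illustrate; its str.isspace() happens to implement Unicode’s wider notion of whitespace already):

```python
def is_unicode_whitespace(ch: str) -> bool:
    # str.isspace() covers ASCII space/tab/newline as well as U+0085 (NEL),
    # U+00A0 (no-break space), U+2028/U+2029 and the other Unicode spaces.
    return ch.isspace()

print(is_unicode_whitespace("\u0085"))  # True: next line (NEL)
print(is_unicode_whitespace("\u00a0"))  # True: no-break space
```

A lexer that tests against a table like this, rather than a hard-coded handful of ASCII characters, gets the wider behaviour almost for free.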

And what use might we find some day for § and ¶ ? Could be great 
for some new exotic grammatical structural pattern. Look at the 
mess that C++ got into with the syntax of templates. They needed 
something other than < >. Almost anything. They could have done 
no worse with « ».

Another point: These exotics are easy to find in your text editor 
because they won’t be overused.

As for usability, some of our tools now have, or could have, 
‘favourite characters’ or ‘snippet’ text strings in a place in 
the UI where they are readily accessible. I have a Unicode 
character-map app and also a file with my favourite Unicode 
characters in it. So there are things that we can do for 
ourselves. And having a favourites comment block in a starter 
template file might be another example.

Argument against: it would complicate our regexes with a new 
need for multiple alternatives, as in [xyz] rather than just 
one possible character in a search or replace operation. But 
some regex engines are Unicode-aware and understand concepts 
like ‘all characters having property x’, where the property 
defines a subset.

I have a concern. I love the betterC idea. Something inside my 
head tells me not to move too far from C. But we have already 
left the grammar of C behind, for good reason. C doesn’t have .. 
or … ( :-) ) nor does it have $. So that train has left. But I’m 
talking about things that C is never going to have.

One point of clarification: I am not talking about the D 
runtime. I’m confining myself to D’s lexer and D’s grammar.


More information about the Digitalmars-d mailing list