Non-ASCII in the future in the lexer
Cecil Ward
cecil at cecilward.com
Wed May 31 06:23:43 UTC 2023
What do you think? It occurred to me that as the language
develops we occasionally have discussions about new keywords, or
even about changing them, for example the s/body/do/ change a
while back.
Unicode has been around for 30 years now, and yet programming
languages, for example, still make little use of it. In our minds
we are still stuck with ASCII only. Should we, in future, start
mining the riches of Unicode when we make changes to the grammar
of programming languages (and other grammars)?
Would it be worthwhile to consider wider Unicode alternatives for
keywords and operators that we already have? Examples: comparison
operators and other operators. We have Unicode symbols for
≤ less than or equal (<=)
≥ greater than or equal (>=)
× a proper multiplication sign, like an x, as well as the * that
we have been stuck with since the beginning of time.
± plus or minus might come in useful some day; I can't think what
for.
We have … as one character; it would be nice to have that as an
alternative to .. (two ASCII full stops), maybe?
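As a sketch only (my own illustration, not D's actual lexer), one way a lexer could accept such spellings is to canonicalise each Unicode operator to the ASCII token it aliases before or during tokenisation; the alias table below is hypothetical:

```python
# Hypothetical sketch: canonicalise Unicode operator spellings to the
# ASCII tokens they alias, so the rest of the lexer is unchanged.
UNICODE_OPERATOR_ALIASES = {
    "\u2264": "<=",   # ≤  LESS-THAN OR EQUAL TO
    "\u2265": ">=",   # ≥  GREATER-THAN OR EQUAL TO
    "\u00d7": "*",    # ×  MULTIPLICATION SIGN
    "\u2026": "..",   # …  HORIZONTAL ELLIPSIS, as a range operator
}

def canonicalise(source: str) -> str:
    """Rewrite Unicode operator aliases to their ASCII equivalents."""
    for alias, ascii_op in UNICODE_OPERATOR_ALIASES.items():
        source = source.replace(alias, ascii_op)
    return source

print(canonicalise("a \u2264 b \u00d7 c"))   # a <= b * c
print(canonicalise("foo[1 \u2026 n]"))       # foo[1 .. n]
```

A real lexer would do this per token rather than as a whole-text rewrite, but the principle is the same: the Unicode forms are pure synonyms, so nothing downstream of the tokeniser changes.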
I realise that this issue is hardly the cure for world peace, but
there seems little reason to be confined to ASCII forever when
there are better-suited alternatives, and things that might spark
the imagination of designers.
One extreme case, or two. Many editors now automatically
substitute ‘ ’ (supposedly a 6-9 quote pair) for ASCII '', and
likewise “ ” (a matching 6-9 pair). When Walter was designing the
string-literal lexical items, many delimiters needed to be found
for all the alternatives. And we have « », which are familiar to
French speakers. It would be very nice not to fall over on 6-9
quotes anyway, and simply to accept them as an alternative. The
second case
that comes to mind: I was thinking about regex grammars and XML's
grammar, and I think one or both can now handle all kinds of
Unicode whitespace. That's the kind of thinking I'm interested
in. It would be good to handle all kinds of whitespace, just as
we handle all kinds of newline sequences. We probably already do
both well. And no one complains that 'we ought not to bother with
tab', so handling U+0085 and the various other whitespace
characters, such as the no-break space (U+00A0), in the lexicon
of our grammar is to me a no-brainer.
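Both ideas, accepting any Unicode whitespace as a separator and accepting 6-9 quote pairs as string delimiters, can be sketched in a toy scanner. This is my own illustration under those assumptions, not how D's lexer actually works:

```python
# Toy sketch, not D's actual lexer: treat any Unicode whitespace as a
# separator, and accept 'smart' 6-9 quote pairs as string delimiters.
QUOTE_PAIRS = {
    '"': '"',            # plain ASCII
    "\u2018": "\u2019",  # ‘ ’  single 6-9 pair
    "\u201c": "\u201d",  # “ ”  double 6-9 pair
    "\u00ab": "\u00bb",  # « »  guillemets
}

def skip_whitespace(src: str, i: int) -> int:
    # str.isspace() is Unicode-aware: it accepts U+0085 (NEL),
    # U+00A0 (no-break space), U+2028, and so on, not just ASCII.
    while i < len(src) and src[i].isspace():
        i += 1
    return i

def read_string(src: str, i: int):
    """Read a string literal opened at src[i]; return (contents, next index)."""
    closing = QUOTE_PAIRS[src[i]]
    end = src.index(closing, i + 1)
    return src[i + 1 : end], end + 1

text = "\u00a0\u0085\u201chello\u201d"   # NBSP, NEL, then “hello”
i = skip_whitespace(text, 0)
print(read_string(text, i)[0])   # hello
```

Requiring the closing quote to match the opening one, as the table does, is what makes the 6-9 pairs unambiguous: a stray ’ inside a “…” literal is just data.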
And what use might we find some day for § and ¶? They could be
great for some new exotic grammatical structural pattern. Look at
the mess that C++ got into with the syntax of templates: they
needed something other than < >, almost anything, and could have
done no worse with « ».
Another point: These exotics are easy to find in your text editor
because they won’t be overused.
As for usability, some of our tools now have, or could have,
'favourite characters' or 'snippet' text strings in a readily
accessible place in the UI. I have a Unicode character-map app
and also a file containing my favourite Unicode characters, so
there are things we can do ourselves. Having a favourites comment
block in a starter template file might be another example.
An argument against: it would complicate our regexes, creating a
new need for multiple alternatives, as in [xyz], rather than a
single possible character in a search or replace operation. But
some regex engines are Unicode-aware and understand concepts like
'all characters with property x', where x defines a subset.
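Python's standard re module is one example of such a Unicode-aware engine: for str patterns, \s already means 'all whitespace characters', so the hand-enumerated class and the single escape match the same things:

```python
import re

# Hand-enumerated alternatives, the complication worried about above:
explicit = re.compile("[ \t\r\n\u0085\u00a0\u2028\u2029]")

# A Unicode-aware engine supplies the whole set with one escape:
unicode_aware = re.compile(r"\s")

# NEL, NBSP and LINE SEPARATOR are matched by both patterns.
for ch in ("\u0085", "\u00a0", "\u2028"):
    assert explicit.fullmatch(ch) and unicode_aware.fullmatch(ch)
print("both classes match NEL, NBSP and LINE SEPARATOR")
```

So the extra alternatives need not leak into every user-written pattern; the property class absorbs them.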
I have a concern. I love the betterC idea, and something inside
my head tells me not to move too far from C. But we have already
left the grammar of C behind, for good reason: C doesn't have ..
or … ( :-) ), nor does it have $. So that train has left. In any
case, I'm talking about things that C is never going to have.
One point of clarification: I am not talking about the D runtime.
I'm confining myself to D's lexer and D's grammar.
More information about the Digitalmars-d mailing list