Arbitrary identifiers - syntax

Tue Jul 4 07:55:51 UTC 2023

This is an intentionally vague post about an idea without a clear 
solution. It is not a concrete proposal; it is intended to 
solicit suggestions and ideas.

In mathematics or physics, you might have variables such as t and 
t′. The second character of the latter is U+2032 (prime), and 
there’s also a similar glyph at U+02B9 (modifier letter prime). I 
posted a while back about the use of Unicode, and in that I was 
thinking about text in various non-English human languages. The 
docs say that the characters in D identifiers such as variable 
names are chosen from a subset of Unicode defined by an appendix 
of C99. This gives a massive list of acceptable characters in 
umpteen writing systems and human languages. How does D deal with 
that in the lexer? An enormous table lookup? I would be 
interested to know, compiler authors.
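
To make the table-lookup question concrete, here is my guess at 
the general shape: a binary search over a sorted table of code 
point ranges. This is only a sketch in Python. The RANGES table is 
a tiny excerpt I made up for illustration, not the real C99 list, 
and in_table is a hypothetical name. I am not claiming this is 
what DMD actually does.

```python
from bisect import bisect_right

# Illustrative excerpt only: sorted, non-overlapping (first, last)
# code point ranges. NOT the real C99 table, NOT DMD's data.
RANGES = [
    (0x0041, 0x005A),  # A-Z
    (0x005F, 0x005F),  # underscore
    (0x0061, 0x007A),  # a-z
    (0x00C0, 0x00D6),  # some Latin-1 letters
    (0x0391, 0x03A1),  # Greek capitals Alpha..Rho
    (0x03B1, 0x03C9),  # Greek small alpha..omega
]
STARTS = [first for first, _ in RANGES]

def in_table(ch):
    """Return True if the single character ch falls in some range."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp, or -1 if none.
    i = bisect_right(STARTS, cp) - 1
    return i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]
```

With a table like this, each lookup costs a handful of 
comparisons even if the full list has hundreds of ranges, so the 
size of the list need not be a problem in itself.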

However, in maths many of the symbols, such as my earlier 
example, contain characters that are not legal in identifiers, 
because Unicode considers them to be punctuation or some similar 
non-identifier category. How can we make D maths-friendly? Yes, 
we can and do write things like t_prime, but it doesn’t look 
great, and it’s long-winded. Yes, I hear you about the ease of 
use of Unicode, but that was discussed before and belongs to the 
earlier thread. Is there a way of allowing (almost) ‘arbitrary’ 
content in identifiers in D’s grammar? Think of the kind of 
syntax that uses, say, "my file.ext"-type double quoting for 
otherwise unacceptable filenames, such as this example one with a 
space in it.
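
As a quick check on how Unicode classifies the two prime 
characters I mentioned, here is a small Python sketch using the 
standard unicodedata module. The describe helper is just a name I 
made up for the example. It shows general categories only; it says 
nothing about which characters D’s own rules accept.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for ch."""
    return (f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.category(ch))

for ch in ("\u2032", "\u02B9"):
    print(describe(ch))
```

U+2032 comes out as category Po (other punctuation), while the 
lookalike U+02B9 is Lm (modifier letter), which is why the two 
are treated so differently by identifier rules based on Unicode 
categories.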

Is it at all possible that a future D might have a mechanism like 
that to accommodate arbitrary identifiers for maths? Maybe even a 
kind of extensible lexer? That is perhaps way too hard, and an 
easier but less attractive solution like the bracketing could be 
found. But whatever is suggested would have to be compact, neat 
and minimal, so that D statements and expressions could clearly 
resemble mathematical equations.

I thought about all the imaginative literal string syntax that we 
already have, where a lot of work was done to make literal 
strings more workable in various use-cases.

I’d be very interested to hear suggestions as to how we do 
special relativity with t, t′, and then t″. It may simply be too 
hard to do it cleverly. I’m thinking about making D the most 
maths-friendly language. Let’s displace Fortran ;-). (We would 
need to make complex numbers friendlier for that though, maybe 
with more of the syntactic sugar brought back, but that’s another 
story.)

I think it would possibly be a good idea to restrict ‘arbitrary’ 
characters to a certain subset rather than allowing absolutely 
any Unicode character: no whitespace, no control characters, no 
existing D tokens such as ‘=’. Maybe disallow all punctuation 
characters that are already ‘taken’ in D, that is, already in use 
in the existing lexer’s grammar, but I’m unsure about that. What 
to do about ‘-’ (hyphen-minus)? It is allowed in names in some 
languages, such as XSLT, and is used there a lot. Perhaps ban it 
because of the confusion with minus for subtraction. I don’t 
know. It doesn’t seem to be used in physics, for that same reason.
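
To make the restriction idea a little more concrete, here is a 
rough Python sketch of the kind of filter I have in mind. 
Everything in it is an assumption for discussion: D_TAKEN is my 
own rough list of ASCII punctuation that D already uses, not 
something taken from the D spec, and allowed and 
is_arbitrary_identifier are hypothetical names. Hyphen-minus is 
banned here, but that is exactly the part I am unsure about.

```python
import unicodedata

# Assumption: rough, illustrative list of ASCII punctuation already
# used by D's lexer. Not taken from the D spec.
D_TAKEN = set("=+-*/%<>!&|^~?:;,.()[]{}@#$\"'`\\")

def allowed(ch):
    """Would this single character be allowed in an 'arbitrary' identifier?"""
    cat = unicodedata.category(ch)
    # Z* = separators (spaces etc.); C* = control, format, unassigned etc.
    # Tab and newline are category Cc, so they are rejected here too.
    if cat[0] in "ZC":
        return False
    return ch not in D_TAKEN

def is_arbitrary_identifier(s):
    """True if s is non-empty and every character passes allowed()."""
    return len(s) > 0 and all(allowed(ch) for ch in s)
```

Under these rules t′ and t″ pass, while anything containing a 
space, ‘=’ or ‘-’ is rejected.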

Thoughts?