Arbitrary identifiers - syntax

Richard (Rikki) Andrew Cattermole richard at cattermole.co.nz
Tue Jul 4 08:05:43 UTC 2023


Ah huh!

This is something that I am very familiar with, as I'm updating dmd to 
use UAX31 identifiers (Unicode 15).

What you want is called a Medial character: one that may only appear 
between other identifier characters, never at the start or the end.

The definition of a UAX31 identifier is:

``<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*``

For the characters that can be used as Medial, see: 
https://unicode.org/reports/tr31/#Table_Optional_Medial

The full report: https://unicode.org/reports/tr31
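
To make the grammar concrete, here is a minimal sketch of a matcher in 
Python. It is only an illustration, not what dmd does. The names 
``MEDIAL``, ``is_start``, ``is_continue`` and ``is_uax31_identifier`` are 
made up for this example, Start and Continue are approximated with 
Python's own ``str.isidentifier()``, and the Medial set is a tiny assumed 
subset (hyphen-minus and apostrophe); check the table linked above for 
the real candidates.

```python
# Illustrative subset only; the real candidates are in the UAX31 table.
MEDIAL = {"-", "'"}


def is_start(ch: str) -> bool:
    # Approximation: a single character that Python accepts as an identifier.
    return ch.isidentifier()


def is_continue(ch: str) -> bool:
    # Approximation: a character that Python accepts after a valid start.
    return ("a" + ch).isidentifier()


def is_uax31_identifier(text: str) -> bool:
    """Match <Start> <Continue>* (<Medial> <Continue>+)*."""
    if not text or not is_start(text[0]):
        return False
    i, n = 1, len(text)
    # <Continue>*
    while i < n and is_continue(text[i]):
        i += 1
    # (<Medial> <Continue>+)*
    while i < n and text[i] in MEDIAL:
        i += 1
        run_start = i
        while i < n and is_continue(text[i]):
            i += 1
        if i == run_start:
            # A Medial must be followed by at least one Continue.
            return False
    return i == n
```

With these assumptions ``foo-bar`` is accepted, while ``foo-``, ``-foo`` 
and ``foo--bar`` are rejected, because a Medial has to sit between other 
identifier characters.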

As for how to represent it... the way that dmd does it currently is with 
a ``wchar[2][]`` table of start + end pairs, looked up with a binary 
search. This of course isn't the standard representation, and it isn't 
the best one either.

The standard solution as per Unicode Demystified (I strongly recommend 
buying it if you are interested in this subject) is to use an inversion 
list. This is a flat sorted list that stores just the start of each run, 
with runs alternating between inside and outside the set, so whether the 
index you land on is odd or even tells you if the code point is in a 
range or not. You would use a search algorithm like binary search to do 
the lookup.
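
A minimal sketch of an inversion list in Python, assuming a small made-up 
set of ranges (the names ``to_inversion_list`` and ``in_inversion_list`` 
are hypothetical, and the binary search here is the standard library's 
``bisect_right``):

```python
from bisect import bisect_right


def to_inversion_list(ranges):
    """Flatten sorted, non-adjacent inclusive (start, end) pairs.

    Each pair contributes its start and the first code point after it,
    so the list holds every point where membership flips.
    """
    inv = []
    for start, end in ranges:
        inv.append(start)
        inv.append(end + 1)
    return inv


# Hypothetical sample data, not a real identifier table.
INVERSION_LIST = to_inversion_list(
    [(0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xAA, 0xAA), (0xC0, 0xD6)]
)


def in_inversion_list(cp: int, inv=INVERSION_LIST) -> bool:
    # bisect_right returns how many boundaries are <= cp. An odd count
    # means the last flip was "into" a range, so cp is inside the set.
    return bisect_right(inv, cp) % 2 == 1
```

Only one comparison per probe is needed, and the parity of the result 
does the rest.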

Should my C23 PR go in, I will be switching dmd over to an inversion 
list + Fibonacci search, to take advantage of the ASCII, BMP, then 
per-plane probabilities. I've been talking about this quite a bit 
recently in the #langdev channel on Discord.

