VLERange: a range in between BidirectionalRange and RandomAccessRange

Fri Jan 14 05:06:29 PST 2011

On Friday 14 January 2011 04:47:59 Steven Schveighoffer wrote:
> On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a at a.a> wrote:
> > "Andrei Alexandrescu" <SeeWebsiteForEmail at erdani.org> wrote in message
> > news:igoqrm$1n5r$1 at digitalmars.com...
> > 
> >> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
> >> [snip]
> >> 
> >>> [ 'f', {u with the umlaut}, 'n', 'f' ]
> >>> 
> >>> Or:
> >>> 
> >>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
> >>> 
> >>> Those *both* get rendered exactly the same, and both represent the same
> >>> four-letter sequence. In the second example, the 'u' and the {umlaut
> >>> combining character} combine to form one grapheme. The f's and n's just
> >>> happen to be single-code-point graphemes.
> >>> 
> >>> Note that while some characters exist in pre-combined form (such as the
> >>> {u with the umlaut} above), legend has it there are others that can only
> >>> be represented using a combining character.
> >>> 
> >>> It's also my understanding, though I'm not certain, that sometimes
> >>> multiple combining characters can be used together on the same "root"
> >>> character.
> >> 
> >> Thanks. One further question is: in the above example with
> >> u-with-umlaut, there is one code point that corresponds to the entire
> >> combination. Are there combinations that do not have a unique code point?
> > 
> > My understanding is "yes". At least that's what I've heard, and I've
> > never heard any claims of "no". I don't know of any specific ones offhand,
> > though. Actually, it might be possible to use any combining character with
> > any old letter or number (like maybe a 7 with an umlaut), though I'm not
> > certain.
> > 
> > FWIW, the Wikipedia article might help, or at least link to other things
> > that might help: http://en.wikipedia.org/wiki/Combining_character
> 
> http://en.wikipedia.org/wiki/Unicode_normalization
> 
> Linked from that page, the normalization process is probably something we
> need to look at.  Using decomposed canonical form would mean we need more
> state than just which code unit we are on, plus it creates more likelihood
> that a match will be found with part of a grapheme (spir or Michel brought
> it up earlier).  So I think the correct choice is to use composed canonical
> form.  This is after just reading that page, so maybe I'm missing
> something.
> 
> Non-composable combinations would be a problem.  The string range is
> formed on the basis that the element type is a dchar.  If there are
> combinations that cannot be composed into a single dchar, then the element
> type has to be a dchar array (or some other type which contains all the
> info).  The other option is to simply leave them decomposed.  Then you
> risk things like partial matches.
> 
> I'm leaning towards a solution like this: While iterating a string, it
> should output dchars in normalized composed form.  But a specialized
> comparison function should be used when doing things like searches or
> regex, because it might not be possible to compose two combining
> characters.
> 
> The drawback to this is that a dchar might not be able to represent a
> grapheme (only if it cannot be composed), but I think it's too much of a
> hit in complexity and performance to make the element type of a string
> larger than a dchar.

Well, there's plenty in std.string that already deals in strings rather than 
dchar, and for the most part, any case where you couldn't fit a grapheme in a 
dchar could be covered by using a string.
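
To make the two cases concrete, here is a minimal sketch. It is in Python
rather than D only because Python's standard unicodedata module already
provides normalization; the helper name nfc and the sample strings are made up
for illustration and are not part of any D library. It shows that the two
spellings of the word differ as code point sequences but agree after composed
canonical normalization (NFC), and that a 7 followed by a combining diaeresis
has no pre-combined form, so it still needs two code points (that is, a string
rather than a single dchar).

```python
import unicodedata

def nfc(s):
    """Return s in composed canonical form (NFC)."""
    return unicodedata.normalize("NFC", s)

# Two spellings of the same four-letter word.
precomposed = "f\u00fcnf"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "fu\u0308nf"   # 'u' followed by U+0308 COMBINING DIAERESIS

# Compared code point by code point, the two spellings differ...
same_raw = precomposed == decomposed              # False
# ...but they agree once both are in composed canonical form.
same_nfc = nfc(precomposed) == nfc(decomposed)    # True

# A digit with a combining diaeresis has no pre-combined code point,
# so NFC leaves it as two code points.
seven_umlaut = nfc("7\u0308")
seven_len = len(seven_umlaut)                     # 2
```

So composing helps for the common cases, but the element for something like
the 7 above cannot be a single code point.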

> Those who wish to work with a more comprehensive string type can use a
> more complex string type such as the one created by spir.
> 
> Does that sound reasonable?

It seems that we really should have something along those lines. From what
little _I_ know, the basic approach that you suggest seems like the correct
one, but perhaps someone more knowledgeable will be able to come up with a
reason why it's not a good idea. Certainly, I think that any solution that I'd
come up with would be similar to what you're suggesting.
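
For what it's worth, the partial-match risk and the "specialized comparison
function" idea can be sketched like this (again in Python, with hypothetical
function names). One assumption to flag loudly: "the next code point is a
combining mark" is used here as a crude stand-in for a real grapheme boundary
test, which is more involved than this.

```python
import unicodedata

def nfc(s):
    """Return s in composed canonical form (NFC)."""
    return unicodedata.normalize("NFC", s)

def naive_contains(haystack, needle):
    """Plain code point search with no normalization."""
    return needle in haystack

def normalized_contains(haystack, needle):
    """Search after putting both sides in composed canonical form."""
    return nfc(needle) in nfc(haystack)

def grapheme_aware_contains(haystack, needle):
    """Normalized search that rejects a match cut off in mid-grapheme.

    Simplification: a match is rejected only when the code point right
    after it is a combining mark. The start of the match is not checked.
    """
    h, n = nfc(haystack), nfc(needle)
    if not n:
        return True
    start = h.find(n)
    while start != -1:
        end = start + len(n)
        # Accept only if the match is not followed by a combining mark.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return True
        start = h.find(n, start + 1)
    return False

decomposed = "fu\u0308nf"      # 'u' + combining diaeresis
non_composable = "a7\u0308b"   # 7 + combining diaeresis, no composed form

partial_hit = naive_contains(decomposed, "u")                  # True
composed_miss = normalized_contains(decomposed, "u")           # False
still_partial = normalized_contains(non_composable, "7")       # True
aware_miss = grapheme_aware_contains(non_composable, "7")      # False
aware_hit = grapheme_aware_contains(non_composable, "7\u0308") # True
```

Normalizing to composed form removes the partial match on the plain u, but the
non-composable 7 still matches in part until the comparison itself looks at
what follows the match.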

- Jonathan M Davis