VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer schveiguy at yahoo.com
Fri Jan 14 04:47:59 PST 2011


On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a at a.a> wrote:

> "Andrei Alexandrescu" <SeeWebsiteForEmail at erdani.org> wrote in message
> news:igoqrm$1n5r$1 at digitalmars.com...
>> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
>> [snip]
>>> [ 'f', {u with the umlaut}, 'n', 'f' ]
>>>
>>> Or:
>>>
>>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>>>
>>> Those *both* get rendered exactly the same, and both represent the same
>>> four-letter sequence. In the second example, the 'u' and the {umlaut
>>> combining character} combine to form one grapheme. The f's and n's just
>>> happen to be single-code-point graphemes.
>>>
>>> Note that while some characters exist in pre-combined form (such as the
>>> {u with the umlaut} above), legend has it there are others that can only
>>> be represented using a combining character.
>>>
>>> It's also my understanding, though I'm not certain, that sometimes
>>> multiple combining characters can be used together on the same "root"
>>> character.
>>
>> Thanks. One further question is: in the above example with u-with-umlaut,
>> there is one code point that corresponds to the entire combination. Are
>> there combinations that do not have a unique code point?
>>
>
> My understanding is "yes". At least that's what I've heard, and I've never
> heard any claims of "no". I don't know of any specific ones offhand, though.
> Actually, it might be possible to use any combining character with any old
> letter or number (like maybe a 7 with an umlaut), though I'm not certain.
>
> FWIW, the Wikipedia article might help, or at least link to other things
> that might help: http://en.wikipedia.org/wiki/Combining_character

http://en.wikipedia.org/wiki/Unicode_normalization

Linked from that page, the normalization process is probably something we
need to look at.  Using decomposed canonical form would mean we need more
state than just which code unit we are on, plus it makes it more likely
that a match will be found with part of a grapheme (spir or Michel brought
it up earlier).  So I think the right choice is to use composed canonical
form.  This is after just reading that page, so maybe I'm missing
something.
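
To make the composed/decomposed distinction concrete, here is a minimal
sketch in Python (not D), using the standard `unicodedata` module.  The
variable and helper names are made up for illustration.  It takes the two
spellings from Nick's example and shows that a naive search for 'u' hits
part of a grapheme only in the decomposed form.

```python
import unicodedata

# The two spellings from the example above (names are illustrative only).
composed = "f\u00fcnf"      # 'f', {u with the umlaut}, 'n', 'f'
decomposed = "fu\u0308nf"   # 'f', 'u', {umlaut combining character}, 'n', 'f'


def nfc(s):
    """Canonical composed form (NFC)."""
    return unicodedata.normalize("NFC", s)


def nfd(s):
    """Canonical decomposed form (NFD)."""
    return unicodedata.normalize("NFD", s)


# Same text on screen, but different code point sequences.
same_code_points = composed == decomposed        # False
lengths = (len(composed), len(decomposed))       # (4, 5)

# Normalizing to either form makes the two spellings compare equal.
equal_after_nfc = nfc(composed) == nfc(decomposed)
equal_after_nfd = nfd(composed) == nfd(decomposed)

# A naive code point search for 'u' matches half of a grapheme in the
# decomposed form, but finds nothing in the composed form.
partial_match_decomposed = "u" in decomposed
partial_match_composed = "u" in composed
```

Composing first removes the partial match in this case, which is the
argument for iterating in composed form.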

Non-composable combinations would be a problem.  The string range is  
formed on the basis that the element type is a dchar.  If there are  
combinations that cannot be composed into a single dchar, then the element  
type has to be a dchar array (or some other type which contains all the  
info).  The other option is to simply leave them decomposed.  Then you  
risk things like partial matches.
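
As a rough check on whether non-composable combinations exist, the sketch
below (again Python with `unicodedata`; the helper name is hypothetical)
normalizes a few base-plus-umlaut sequences to composed form and counts the
code points that remain.  I am assuming here that 'q' and '7' with a
combining diaeresis have no precomposed code point, which matches the
Unicode data I know of.

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"  # the {umlaut combining character}


def nfc_code_points(s):
    """Return the code points of s after canonical composition (NFC)."""
    return [ord(ch) for ch in unicodedata.normalize("NFC", s)]


# 'u' + umlaut has a precomposed form, so it collapses to one code point.
u_umlaut = nfc_code_points("u" + COMBINING_DIAERESIS)

# 'q' + umlaut and '7' + umlaut have none, so they stay as two code points
# even in composed form: one grapheme, but not one dchar.
q_umlaut = nfc_code_points("q" + COMBINING_DIAERESIS)
seven_umlaut = nfc_code_points("7" + COMBINING_DIAERESIS)
```

So composed form narrows the problem but does not remove it: a range whose
element type is a single code point will still split such graphemes.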

I'm leaning towards a solution like this: While iterating a string, it
should output dchars in normalized composed form.  But a specialized
comparison function should be used when doing things like searches or
regex, because some sequences of a base character plus combining
characters cannot be composed into a single dchar.
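
One possible shape for such a specialized search, sketched in Python rather
than D.  The function name `find_whole` is hypothetical, and the boundary
test is only an approximation: it treats any code point with a nonzero
canonical combining class as part of the preceding grapheme, which is
simpler than the full Unicode grapheme cluster rules.  It also assumes the
needle itself starts with a base character.

```python
import unicodedata


def find_whole(haystack, needle):
    """Find needle in haystack without matching part of a grapheme.

    Both strings are normalized to composed form (NFC) first.  A match is
    rejected if it is immediately followed by a combining character,
    because the last matched code point would then be only the base of a
    longer grapheme.  Returns an index into the NFC-normalized haystack,
    or -1 if there is no acceptable match.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    if not n:
        return 0
    start = h.find(n)
    while start != -1:
        end = start + len(n)
        # Accept only if the match ends at a (rough) grapheme boundary.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return start
        start = h.find(n, start + 1)
    return -1
```

For example, `find_whole("q\u0308x", "q")` returns -1 because the 'q' is
followed by a combining umlaut, while `find_whole("q\u0308x", "q\u0308")`
returns 0.  A plain code point search would report a match for both.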

The drawback to this is that a dchar might not be able to represent a  
grapheme (only if it cannot be composed), but I think it's too much of a  
hit in complexity and performance to make the element type of a string  
larger than a dchar.

Those who wish to work with a more comprehensive string type can use a  
more complex string type such as the one created by spir.

Does that sound reasonable?

-Steve

