VLERange: a range in between BidirectionalRange and RandomAccessRange
Steven Schveighoffer
schveiguy at yahoo.com
Fri Jan 14 04:47:59 PST 2011
On Fri, 14 Jan 2011 01:44:19 -0500, Nick Sabalausky <a at a.a> wrote:
> "Andrei Alexandrescu" <SeeWebsiteForEmail at erdani.org> wrote in message
> news:igoqrm$1n5r$1 at digitalmars.com...
>> On 1/13/11 10:26 PM, Nick Sabalausky wrote:
>> [snip]
>>> [ 'f', {u with the umlaut}, 'n', 'f' ]
>>>
>>> Or:
>>>
>>> [ 'f', 'u', {umlaut combining character}, 'n', 'f' ]
>>>
>>> Those *both* get rendered exactly the same, and both represent the same
>>> four-letter sequence. In the second example, the 'u' and the {umlaut
>>> combining character} combine to form one grapheme. The f's and n's just
>>> happen to be single-code-point graphemes.
>>>
>>> Note that while some characters exist in pre-combined form (such as the
>>> {u with the umlaut} above), legend has it there are others that can only
>>> be represented using a combining character.
>>>
>>> It's also my understanding, though I'm not certain, that sometimes
>>> multiple combining characters can be used together on the same "root"
>>> character.
>>
>> Thanks. One further question is: in the above example with u-with-umlaut,
>> there is one code point that corresponds to the entire combination. Are
>> there combinations that do not have a unique code point?
>>
>
> My understanding is "yes". At least that's what I've heard, and I've never
> heard any claims of "no". I don't know of any specific ones offhand, though.
> Actually, it might be possible to use any combining character with any old
> letter or number (like maybe a 7 with an umlaut), though I'm not certain.
>
> FWIW, the Wikipedia article might help, or at least link to other things
> that might help: http://en.wikipedia.org/wiki/Combining_character
http://en.wikipedia.org/wiki/Unicode_normalization
Linked from that page, the normalization process is probably something we
need to look at. Using decomposed canonical form (NFD) would mean we need
more state than just which code unit we are on, and it makes it more likely
that a search will match only part of a grapheme (spir or Michel brought
it up earlier). So I think the right choice is composed canonical form
(NFC). This is after just reading that page, so maybe I'm missing
something.
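
To make the two forms concrete, here is a small sketch. It is in Python
rather than D only because Python's standard library ships a normalizer
(unicodedata.normalize); that choice and the variable names are mine, not a
proposal for Phobos. It takes the "fünf" example from the quoted text in
both spellings and normalizes each.

```python
import unicodedata

# The two spellings of the same four-letter word from the quoted example.
precomposed = "f\u00fcnf"   # 'f', U+00FC (u with umlaut), 'n', 'f'
decomposed = "fu\u0308nf"   # 'f', 'u', U+0308 (combining umlaut), 'n', 'f'

# As raw code point sequences the two strings differ.
raw_equal = precomposed == decomposed

# Composed canonical form (NFC): both collapse to the 4-code-point spelling.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)

# Decomposed canonical form (NFD): both expand to the 5-code-point spelling.
nfd_a = unicodedata.normalize("NFD", precomposed)
nfd_b = unicodedata.normalize("NFD", decomposed)

print(raw_equal, nfc_a == nfc_b, len(nfc_a), len(nfd_a))
```

Under NFC each grapheme in this word is one code point, which is the
property a dchar-based range would want.
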
Non-composable combinations would be a problem. The string range is
built on the assumption that the element type is dchar. If there are
combinations that cannot be composed into a single dchar, then the element
type has to be a dchar array (or some other type which contains all the
info). The other option is to simply leave them decomposed. Then you
risk things like partial matches.
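
A sketch of both problems, again in Python with unicodedata purely for
illustration (the variable names are made up for this example). It uses
Nick's "7 with an umlaut", which has no precomposed code point, and shows a
plain substring search matching part of a grapheme.

```python
import unicodedata

# '7' followed by U+0308 (combining umlaut): no single code point exists
# for this pair, so NFC has nothing to compose it into.
seven_umlaut = "7\u0308"
composed = unicodedata.normalize("NFC", seven_umlaut)
still_two = len(composed)            # remains two code points

# Partial match in decomposed text: searching for a plain 'u' hits the
# base letter of the u + combining umlaut pair.
decomposed_word = "fu\u0308nf"
partial_hit = "u" in decomposed_word

# After NFC the pair is one code point, so the plain 'u' no longer matches.
composed_word = unicodedata.normalize("NFC", decomposed_word)
no_partial_hit = "u" in composed_word

# For the non-composable pair, NFC does not help: '7' still matches.
leftover_hit = "7" in composed

print(still_two, partial_hit, no_partial_hit, leftover_hit)
```

So composing removes the partial match where a composed form exists, and
leaves it in place where none does.
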
I'm leaning towards a solution like this: while iterating a string, it
should output dchars in normalized composed form. But a specialized
comparison function should be used when doing things like searches or
regex, because it might not be possible to compose every combining
sequence into a single dchar.
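
One way such a comparison could look, as a rough sketch only. The function
name find_whole is hypothetical, and it approximates grapheme boundaries by
checking the canonical combining class of the next code point
(unicodedata.combining), which is a simplification of real grapheme
segmentation. It normalizes both sides to NFC and rejects any match that is
immediately followed by a combining character.

```python
import unicodedata

def find_whole(haystack, needle):
    """Return the index (in the NFC-normalized haystack) of the first match
    of needle that is not followed by a combining character, or -1.

    Hypothetical helper: grapheme boundaries are approximated with the
    canonical combining class, not full Unicode segmentation rules.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    if not n:
        return 0
    start = h.find(n)
    while start != -1:
        end = start + len(n)
        # Accept only if the match ends the string or the next code point
        # is not a combining mark attached to the last matched character.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return start
        start = h.find(n, start + 1)
    return -1

# '7' inside "7 + combining umlaut" is skipped; the bare '7' at the end matches.
first = find_whole("a7\u0308b7", "7")
# A lone "7 + combining umlaut" contains no standalone '7'.
none = find_whole("7\u0308", "7")
# A precomposed needle matches a decomposed haystack after normalization.
mixed = find_whole("fu\u0308nf", "\u00fc")
print(first, none, mixed)
```

Note the returned index refers to the normalized text, so a real
implementation would have to map it back to the original string.
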
The drawback to this is that a dchar might not be able to represent a
grapheme (only if it cannot be composed), but I think it's too much of a
hit in complexity and performance to make the element type of a string
larger than a dchar.
Those who wish to work with a more comprehensive string type can use a
more complex string type such as the one created by spir.
Does that sound reasonable?
-Steve