VLERange: a range in between BidirectionalRange and RandomAccessRange

Steven Schveighoffer schveiguy at yahoo.com
Fri Jan 14 06:34:55 PST 2011


On Fri, 14 Jan 2011 08:59:35 -0500, spir <denis.spir at gmail.com> wrote:

> On 01/14/2011 02:37 PM, Steven Schveighoffer wrote:
>>
>> * I don't even know how to make a grapheme that is more than one
>> code-unit, let alone more than one code-point :)  Every time I try, I
>> get 'invalid utf sequence'.
>>
>> I feel significantly ignorant on this issue, and I'm slowly getting
>> enough knowledge to join the discussion, but being a dumb American who
>> only speaks English, I have a hard time grasping how this shit all  
>> works.
>
> 1. See my text at  
> https://bitbucket.org/denispir/denispir-d/src/c572ccaefa33/U%20missing%20level%20of%20abstraction

I can't read that document; it has a black background with  
super-dark-grey text.

> 2.
>      writeln ("A\u0308\u0330");
> <A + umlaut (diaeresis, U+0308) above + tilde (U+0330) below>
> If it does not display properly, either set your terminal to UTF* or use  
> a more unicode-aware font (eg DejaVu series).

OK, I'll have to remember this so I can use it to test my string type ;)
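
For reference, here is a minimal sketch of what that test string contains at each level (code units, code points, one user-perceived character). It is written in Python rather than D, and the helper name `describe` is just something made up for illustration; only the standard `unicodedata` module is assumed.

```python
import unicodedata

# The string from the quoted example: one user-perceived character.
s = "A\u0308\u0330"

def describe(text):
    """Return (code point, Unicode name, combining class) per code point."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
            for ch in text]

code_points = len(s)                           # what a dchar-based range would walk
utf8_units = len(s.encode("utf-8"))            # what a char[] (UTF-8) would hold
utf16_units = len(s.encode("utf-16-le")) // 2  # what a wchar[] (UTF-16) would hold

for row in describe(s):
    print(row)
print(code_points, utf8_units, utf16_units)
```

So a single grapheme here spans three code points and five UTF-8 code units, which is the case a string type would need to handle.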

> The point is not to play with Unicode's flexibility like that. Rather,  
> composite characters are just normal thingies in most languages of the  
> world. Actually, on this point, English is a rare exception (discarding  
> letters imported from foreign languages like French 'à'); to the point  
> of being, I guess, the only western language without any diacritic.

Is it common to have multiple modifiers on a single character?  The  
problem I see with using decomposed canonical form for strings is that we  
would have to return a dchar[] for each 'element', which severely  
complicates code that, for instance, only expects to handle English.

I was hoping to lazily transform a string into its composed canonical  
form, allowing the (hopefully rare) exception when a composed character  
does not exist.  My thinking was that this at least gives a useful string  
representation for 90% of usages, leaving the remaining 10% of usages to  
find a more complex representation (like your Text type).  If we only get  
like 20% or 30% there by making dchar the element type, then we haven't  
made it useful enough.
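
A rough sketch of that idea, in Python rather than D and under loud assumptions: `composed_elements` is a hypothetical name, and the clustering rule (a code point plus any following code points with a non-zero combining class) is a simplification of real grapheme segmentation, not what a library would ship. Each cluster is composed on demand; clusters that stay longer than one code point are the "exceptions".

```python
import unicodedata

def composed_elements(text):
    """Lazily yield one NFC-composed cluster per base character.

    Simplification (an assumption of this sketch): a cluster is a code
    point plus any following code points whose canonical combining class
    is non-zero. Full grapheme segmentation has more rules than this.
    """
    cluster = ""
    for ch in text:
        # A code point with combining class 0 starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield unicodedata.normalize("NFC", cluster)
            cluster = ""
        cluster += ch
    if cluster:
        yield unicodedata.normalize("NFC", cluster)

# Decomposed input: "résumé" with combining acute accents, then spir's example.
elements = list(composed_elements("re\u0301sume\u0301 A\u0308\u0330"))

# Elements that could not be reduced to a single code point.
exceptions = [e for e in elements if len(e) > 1]
print(elements, exceptions)
```

In this sketch the accented e's each compose to a single code point, while the A with two marks stays at two code points because no fully precomposed character exists for it.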

Either way, we need a string type that can be compared canonically for  
things like searches or opEquals.
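
A minimal sketch of such a comparison, again in Python rather than D: the wrapper class `CanonicalString` is hypothetical, and normalizing to NFC up front is just one possible choice (NFD would work as well, and a real type might normalize lazily instead).

```python
import unicodedata

class CanonicalString:
    """Hypothetical wrapper whose equality ignores canonical differences."""

    def __init__(self, text):
        self.text = text
        # Comparison key: the NFC form of the original text.
        self._key = unicodedata.normalize("NFC", text)

    def __eq__(self, other):
        if not isinstance(other, CanonicalString):
            return NotImplemented
        return self._key == other._key

    def __hash__(self):
        # Keep hashing consistent with equality.
        return hash(self._key)

a = CanonicalString("A\u0308\u0330")   # marks in one order
b = CanonicalString("A\u0330\u0308")   # marks in the other order
c = CanonicalString("\u00c4\u0330")    # partly precomposed

print(a == b, b == c, a.text == b.text)
```

The three spellings differ as raw code point sequences but compare equal through the wrapper, which is the behaviour one would want from opEquals.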

-Steve

