VLERange: a range in between BidirectionalRange and RandomAccessRange

Ali Çehreli acehreli at yahoo.com
Wed Jan 12 15:00:48 PST 2011


spir wrote:
 > On 01/12/2011 08:28 PM, Don wrote:
 >> I think the only problem that we really have, is that "char[]",
 >> "dchar[]" implies that code points is always the appropriate level of
 >> abstraction.
 >
 > I'd like to know when it happens that codepoint is the appropriate level
 > of abstraction.

When working on a document that describes code points... :)

 > * If pieces of text are not manipulated, meaning just used in the
 > application, or just transferred via the application as is (from file /
 > input / literal to any kind of output), then any kind of encoding just
 > works. One can even concatenate, provided all pieces use the same
 > encoding. --> _lower_ level than codepoint is OK.
 > * But any kind of manipulation (indexing, slicing, compare,

Compare according to which alphabet's ordering? Surely not Unicode's... 
(I may be alone in this, but ordering is tied to an alphabet (or writing 
system), not to a locale.)

I try to solve that issue with my trileri library:

   http://code.google.com/p/trileri/source/browse/#svn%2Ftrunk%2Ftr

Warning: the code is in Turkish and is not aware of the concept of 
collation at all; it has its own simplistic view of text, where every 
character is an entity that can be lower/upper cased to a single character.
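
To make the "ordering and casing belong to an alphabet" idea concrete, here is a minimal sketch in Python. It is not the trileri code; the names TURKISH_LOWER, TURKISH_UPPER, tr_upper, tr_lower and tr_sort_key are made up for illustration. It takes the same simplistic view described above: every letter maps to exactly one upper or lower case letter, and nothing in it knows about combining characters.

```python
# Hypothetical sketch, not the trileri library: the Turkish alphabet drives
# both case mapping and ordering. Each letter is a single code point.
TURKISH_LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
TURKISH_UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"

# One-to-one case tables: dotless ı <-> I, dotted i <-> İ.
TO_UPPER = str.maketrans(TURKISH_LOWER, TURKISH_UPPER)
TO_LOWER = str.maketrans(TURKISH_UPPER, TURKISH_LOWER)

# Position of each lowercase letter within the alphabet.
RANK = {letter: position for position, letter in enumerate(TURKISH_LOWER)}


def tr_upper(text):
    """Upper-case text using the Turkish table; other characters pass through."""
    return text.translate(TO_UPPER)


def tr_lower(text):
    """Lower-case text using the Turkish table; other characters pass through."""
    return text.translate(TO_LOWER)


def tr_sort_key(word):
    """Sort key that follows alphabet order instead of code point order.

    Characters outside the alphabet sort after all Turkish letters,
    ordered among themselves by code point.
    """
    return [(RANK.get(ch, len(RANK)), ch) for ch in tr_lower(word)]


words = ["zil", "çay", "cam", "ığdır", "iğne", "şal", "su"]
by_alphabet = sorted(words, key=tr_sort_key)
by_code_point = sorted(words)
```

With these tables, by_alphabet places "çay" right after "cam" and "ığdır" before "iğne", while by_code_point pushes every word starting with ç, ı or ş to the end, after "zil".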

 > search, count,
 > replace, not to speak about regex/parsing) requires operating at the
 > _higher_ level of characters (in the common sense).

I don't know this about Unicode: should e and ´ (acute accent) always be 
collated? If so, wouldn't it be impossible to put those two in that 
order, say, in a textbook? (Perhaps Unicode defines a way to stop 
collation.)
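
Reading "collated" above as "combined into one character", the question can be probed with a short Python sketch using the standard unicodedata module. Unicode has two distinct acute accents: the spacing U+00B4 ACUTE ACCENT and the combining U+0301 COMBINING ACUTE ACCENT. Only the combining one attaches to a preceding e under canonical normalization. The variable names are chosen for illustration only.

```python
import unicodedata

e_combining = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
e_spacing = "e\u00b4"    # 'e' followed by U+00B4 ACUTE ACCENT (spacing)

# NFC composes 'e' + combining acute into the single code point U+00E9.
composed = unicodedata.normalize("NFC", e_combining)

# NFC leaves 'e' + spacing acute alone: they stay two separate characters.
unchanged = unicodedata.normalize("NFC", e_spacing)

# NFD goes the other way and splits U+00E9 into 'e' + U+0301.
decomposed = unicodedata.normalize("NFD", "\u00e9")

# The canonical combining class tells the two accents apart:
# non-zero for a combining mark, zero for a spacing character.
combining_class = unicodedata.combining("\u0301")
spacing_class = unicodedata.combining("\u00b4")
```

So a textbook that wants to print an e followed by a free-standing acute accent can use the spacing character, and canonical normalization will not merge the two.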

 > Just like with
 > historic character sets in which codes used to represent characters (not
 > lower-level thingies as in UCS). Else, one reads, compares, changes
 > meaningless bits of text.
 >
 > As I see it now, we need 2 types:

I think we need more than 2 types...

 > * One plain string similar to good old ones (bytestring would do the
 > job, since most unicode is utf8 encoded) for the first kind of use
 > above. With optional validity check when it's supposed to be unicode
 > text.

Agreed. D gives us three UTF encodings, but I am not sure that there is 
only one abstraction above them.
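
As a rough illustration of the level below code points, the following Python sketch counts the code units the same three code points need in UTF-8, UTF-16 and UTF-32. These are the encodings behind D's char, wchar and dchar. The helper name code_unit_counts is made up for this example.

```python
# 'a', U+00E9 (e with acute) and U+1D11E MUSICAL SYMBOL G CLEF.
text = "a\u00e9\U0001d11e"


def code_unit_counts(s):
    """Return how many code units s occupies in each UTF encoding."""
    return {
        "utf8": len(s.encode("utf-8")),            # 1-byte code units
        "utf16": len(s.encode("utf-16-le")) // 2,  # 2-byte code units
        "utf32": len(s.encode("utf-32-le")) // 4,  # 4-byte code units
    }


counts = code_unit_counts(text)
code_points = len(text)  # Python strings are indexed by code point
```

Here counts is {"utf8": 7, "utf16": 4, "utf32": 3} for 3 code points: only UTF-32 has one code unit per code point, and none of the three says anything about how many user-perceived characters the text contains.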

 > * One higher-level type abstracting from codepoint (not code unit)
 > issues, restoring the necessary properties: (1) each character is one
 > element in the sequence (2) each character is always represented the
 > same way.

I think VLERange should solve only the variable-length-encoding issue. 
It should not get into higher abstractions.
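
What "only the variable-length-encoding issue" could mean in practice is sketched below in Python over UTF-8 bytes. This is an assumption about the idea, not the proposed VLERange design, and pop_front, pop_back and is_continuation are hypothetical names. A code point can be decoded from either end of the buffer, but the n-th code point cannot be found without walking there, which is why such a range sits between bidirectional and random access. The sketch assumes the input is valid UTF-8.

```python
def is_continuation(byte):
    """True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def pop_front(data, start):
    """Decode the code point that begins at data[start].

    Returns (character, index just past it). Assumes valid UTF-8.
    """
    end = start + 1
    while end < len(data) and is_continuation(data[end]):
        end += 1
    return data[start:end].decode("utf-8"), end


def pop_back(data, end):
    """Decode the code point that ends just before data[end].

    Returns (character, index where it starts). Assumes valid UTF-8.
    """
    start = end - 1
    # Step back over continuation bytes until the lead byte is reached.
    while start > 0 and is_continuation(data[start]):
        start -= 1
    return data[start:end].decode("utf-8"), start


data = "a\u00e9\U0001d11e".encode("utf-8")  # 1 + 2 + 4 = 7 bytes
first, after_first = pop_front(data, 0)
last, before_last = pop_back(data, len(data))
```

Here first is "a" with after_first equal to 1, and last is U+1D11E with before_last equal to 3. Neither function composes, normalizes, or collates anything; those concerns would belong to a layer above.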

Ali


More information about the Digitalmars-d mailing list