VLERange: a range in between BidirectionalRange and RandomAccessRange
Ali Çehreli
acehreli at yahoo.com
Wed Jan 12 15:00:48 PST 2011
spir wrote:
> On 01/12/2011 08:28 PM, Don wrote:
>> I think the only problem that we really have, is that "char[]",
>> "dchar[]" implies that code points is always the appropriate level of
>> abstraction.
>
> I'd like to know when it happens that codepoint is the appropriate level
> of abstraction.
When working on a document that describes code points... :)
> * If pieces of text are not manipulated, meaning just used in the
> application, or just transferred via the application as is (from file /
> input / literal to any kind of output), then any kind of encoding just
> works. One can even concatenate, provided all pieces use the same
> encoding. --> _lower_ level than codepoint is OK.
> * But any kind of manipulation (indexing, slicing, compare,
Compare according to which alphabet's ordering? Surely not Unicode's...
(I may be alone in this, but ordering is tied to an alphabet (or writing
system), not locale.)
I try to solve that issue with my trileri library:
http://code.google.com/p/trileri/source/browse/#svn%2Ftrunk%2Ftr
Warning: the code is in Turkish and is not aware of the concept of
collation at all; it has its own simplistic view of text, where every
character is an entity that can be lower/upper cased to a single character.
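To make the per-alphabet idea concrete, here is a minimal Python sketch
(Python only for brevity; it is not trileri's actual code). The names
TR_LOWER, TR_UPPER, tr_upper and tr_key are made up for this example, and
the sketch assumes the same simplistic view described above: every letter
maps to exactly one other letter, and there is no real collation.

```python
# Hypothetical tables: the 29 letters of the Turkish alphabet, in order.
TR_LOWER = "abcçdefgğhıijklmnoöprsştuüvyz"
TR_UPPER = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"

# One-to-one case mapping: dotless ı <-> I and dotted i <-> İ.
_TO_UPPER = str.maketrans(TR_LOWER, TR_UPPER)

# Rank of each lowercase letter within the alphabet.
_RANK = {ch: i for i, ch in enumerate(TR_LOWER)}


def tr_upper(word: str) -> str:
    """Uppercase a word using the Turkish one-to-one mapping."""
    return word.translate(_TO_UPPER)


def tr_key(word: str) -> list:
    """Sort key following alphabet order (letters outside it sort last)."""
    return [_RANK.get(ch, len(TR_LOWER)) for ch in word]


words = ["çay", "cam", "dal"]
by_alphabet = sorted(words, key=tr_key)   # ç sorts right after c
by_code_point = sorted(words)             # ç (U+00E7) sorts after d
```

Sorting by code point puts "çay" after "dal", while the alphabet-based key
keeps it next to "cam", which is the ordering a Turkish reader would expect.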
> search, count,
> replace, not to speak about regex/parsing) requires operating at the
> _higher_ level of characters (in the common sense).
I don't know this about Unicode: should e and ´ (acute accent) always be
combined into one character? If so, wouldn't it be impossible to put those
two in that order, say, in a textbook? (Perhaps Unicode defines a way to
stop the combining.)
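A small Python sketch of the distinction, using only the standard
unicodedata module. It assumes the question is about combining rather than
sorting. Unicode has both a combining acute accent (U+0301), which attaches
to the preceding letter, and a spacing acute accent (U+00B4), which stands
on its own. As far as I can tell, the spacing one is what a textbook would
print after an e.

```python
import unicodedata

composed = "\u00e9"      # U+00E9: e with acute as a single code point
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT
spacing = "e\u00b4"      # U+0065 followed by U+00B4 ACUTE ACCENT (spacing)

# The two spellings of the accented letter differ as code point sequences...
code_points_equal = composed == decomposed

# ...but compare equal after canonical composition (NFC).
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed

# The spacing accent is not a combining mark, so NFC leaves the pair alone.
spacing_unchanged = unicodedata.normalize("NFC", spacing) == spacing

# Lengths in UTF-8 code units also differ between the two spellings.
utf8_lengths = (len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

So the same visible character can have more than one code point
representation, which is the "always represented the same way" property
that a higher-level type would have to restore by normalizing.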
> Just like with
> historic character sets in which codes used to represent characters (not
> lower-level thingies as in UCS). Else, one reads, compares, changes
> meaningless bits of text.
>
> As I see it now, we need 2 types:
I think we need more than 2 types...
> * One plain string similar to good old ones (bytestring would do the
> job, since most unicode is utf8 encoded) for the first kind of use
> above. With optional validity check when it's supposed to be unicode
> text.
Agreed. D gives us three UTF encodings, but I am not sure that there is
only one abstraction above that.
> * One higher-level type abstracting from codepoint (not code unit)
> issues, restoring the necessary properties: (1) each character is one
> element in the sequence (2) each character is always represented the
> same way.
I think VLERange should solve only the variable-length-encoding issue.
It should not get into higher abstractions.
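As an illustration of what "only the variable-length-encoding issue" could
mean, here is a minimal Python sketch that walks UTF-8 code units and
yields code points, from the front and from the back. It is not the
proposed VLERange: the function names are made up for this example, and
the input is assumed to be well-formed UTF-8 (no validation is done).

```python
def code_points(data: bytes):
    """Yield code points from well-formed UTF-8 bytes, front to back."""
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte tells how many code units the sequence occupies.
        if lead < 0x80:
            length, cp = 1, lead
        elif lead < 0xE0:
            length, cp = 2, lead & 0x1F
        elif lead < 0xF0:
            length, cp = 3, lead & 0x0F
        else:
            length, cp = 4, lead & 0x07
        # Each continuation byte contributes its low six bits.
        for unit in data[i + 1 : i + length]:
            cp = (cp << 6) | (unit & 0x3F)
        yield cp
        i += length


def code_points_reversed(data: bytes):
    """Yield code points from well-formed UTF-8 bytes, back to front."""
    end = len(data)
    while end > 0:
        start = end - 1
        # Step back over continuation bytes (10xxxxxx) to the lead byte.
        while (data[start] & 0xC0) == 0x80:
            start -= 1
        yield next(code_points(data[start:end]))
        end = start


text = "aé€😀"   # 1-, 2-, 3- and 4-byte sequences in UTF-8
encoded = text.encode("utf-8")
forward = list(code_points(encoded))
backward = list(code_points_reversed(encoded))
```

Both directions work without decoding the whole string first, but jumping
to the n-th code point still requires walking the code units. Nothing here
knows about combining marks or alphabets; those belong to the layers above.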
Ali