VLERange: a range in between BidirectionalRange and RandomAccessRange

Michel Fortin michel.fortin at michelf.com
Tue Jan 11 19:18:29 PST 2011


On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw at acres.com.au> said:

> Sorry if I'm jumping in here without the appropriate background, but I 
> don't understand why jumping through these hoops is necessary.  Please 
> let me know if I'm missing anything.
> 
> Many problems can be solved by another layer of indirection.  Isn't a 
> string essentially a bidirectional range of code points built on top of 
> a random access range of code units?

Actually, displaying a UTF-8/UTF-16 string involves a range of glyphs 
layered over a range of graphemes layered over a range of code points 
layered over a range of code units. Glyphs represent the visual 
characters you can get from a font; they often map one-to-one with 
graphemes, but not always (ligatures, for instance). Graphemes are what 
people generally reason about when they see text (the so-called 
"user-perceived characters"); they often map one-to-one with code 
points, but not always (combining marks, for instance). Code points are 
the standardized codes representing the various elements of a string, 
and code units are what encode the code points in a given encoding 
such as UTF-8 or UTF-16.
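
A minimal sketch of the lower three layers, written in Python for 
brevity. The helper name rough_graphemes is made up for this example, 
and it only attaches combining marks to the preceding base character; 
real grapheme segmentation follows the more involved rules of Unicode 
Standard Annex #29, and glyphs are left out because they depend on a 
font.

```python
import unicodedata

# "noël" spelled with a combining diaeresis (U+0308) after the "e".
text = "noe\u0308l"

# Layer 1: code units, which depend on the encoding.
utf8_units = len(text.encode("utf-8"))            # 6 bytes
utf16_units = len(text.encode("utf-16-le")) // 2  # 5 16-bit units

# Layer 2: code points. A Python str is indexed by code point.
code_points = len(text)                           # 5


def rough_graphemes(s):
    """Group each base character with the combining marks after it.

    Simplifying assumption: a mark is any code point whose canonical
    combining class is non-zero. This is not full UAX #29 segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


# Layer 3: (approximate) graphemes, the user-perceived characters.
graphemes = rough_graphemes(text)                 # 4 clusters

print(utf8_units, utf16_units, code_points, len(graphemes))
```

The same four-character word has a different length at each layer, 
which is why the choice of layer matters for an operation.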

If you're writing a parser for XML, JSON, or anything else, you'll 
probably care about code points. If you're advancing the insertion 
point in a text field or counting the number of user-perceived 
characters, you'll probably want to deal with graphemes. For searching 
for a substring inside a string, or comparing strings, you'll probably 
want to deal with either graphemes or collation elements (collation 
elements are layered on top of code points). To print a string, you'll 
need to map graphemes to the glyphs from a particular font.

Reducing string operations to code point manipulations will only work 
as long as all your graphemes, collation elements, or glyphs map 
one-to-one with code points.
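
To illustrate where the one-to-one assumption breaks, here is a small 
Python sketch comparing and reversing two spellings of "café". It 
uses canonical normalization (NFC) as a simplified stand-in for 
grapheme-aware or collation-aware comparison, which is an assumption 
of this example rather than something proposed in the thread.

```python
import unicodedata

precomposed = "caf\u00e9"    # "é" as a single code point (U+00E9)
decomposed = "cafe\u0301"    # "e" followed by combining acute (U+0301)

# Both render as the same four user-perceived characters, but a
# code-point-by-code-point comparison treats them as different strings.
same_code_points = precomposed == decomposed

# Normalizing to a common form first makes the comparison agree
# with what the reader sees.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

# Reversing by code point detaches the accent from its base letter:
# the combining mark ends up first, with nothing to combine with.
reversed_by_code_point = decomposed[::-1]

print(same_code_points, same_after_nfc, len(precomposed), len(decomposed))
```

Here the code-point comparison reports a mismatch and the lengths 
differ (4 versus 5), even though a user would call the two strings 
identical.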


> It seems to me that each abstraction separately already fits within the 
> existing D range framework and all the difficulties arise as a 
> consequence of trying to lump them into a single abstraction.

It's true that each of these abstractions can fit within the existing 
range framework.
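
As a rough sketch of that layering (in Python, not Phobos code), the 
functions below walk code points forward and backward over a randomly 
indexable sequence of UTF-8 code units. The names lead_width, 
pop_front and pop_back are hypothetical, and the input is assumed to 
be valid UTF-8.

```python
def lead_width(lead):
    """Number of code units in the sequence that starts with `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


def pop_front(units, start):
    """Decode the code point starting at `start`; return it and the next index."""
    end = start + lead_width(units[start])
    return units[start:end].decode("utf-8"), end


def pop_back(units, end):
    """Decode the code point ending just before `end`; return it and its start."""
    start = end - 1
    # Continuation bytes look like 0b10xxxxxx; skip back to the lead byte.
    while units[start] & 0xC0 == 0x80:
        start -= 1
    return units[start:end].decode("utf-8"), start


units = "a\u00e9\u20ac".encode("utf-8")   # 1 + 2 + 3 = 6 code units

first, after_first = pop_front(units, 0)  # forward traversal
last, before_last = pop_back(units, len(units))  # backward traversal
middle_unit = units[4]                    # random access works on code units only
```

Stepping in either direction is cheap, but finding the n-th code 
point still requires walking from one end, which is why the code 
point layer is bidirectional rather than random access.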


> Why not choose which of these abstractions is most appropriate in a 
> given situation instead of trying to shoe-horn both concepts into a 
> single abstraction, and provide for easy conversion between them?  When 
> character representation is the primary requirement then make it a 
> bidirectional range of code points.  When storage representation and 
> random access is required then make it a random access range of code 
> units.

I think you're right. The need for a new concept isn't that great, and 
it gets complicated really fast.


-- 
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/


