VLERange: a range in between BidirectionalRange and RandomAccessRange
Michel Fortin
michel.fortin at michelf.com
Tue Jan 11 19:18:29 PST 2011
On 2011-01-11 20:28:26 -0500, Steven Wawryk <stevenw at acres.com.au> said:
> Sorry if I'm jumping in here without the appropriate background, but I
> don't understand why jumping through these hoops is necessary. Please
> let me know if I'm missing anything.
>
> Many problems can be solved by another layer of indirection. Isn't a
> string essentially a bidirectional range of code points built on top of
> a random access range of code units?
Actually, displaying a UTF-8/UTF-16 string involves a range of glyphs
layered over a range of graphemes layered over a range of code points
layered over a range of code units. Glyphs are the visual shapes you
get from a font; they often map one-to-one with graphemes, but not
always (ligatures, for instance). Graphemes are what people generally
reason about when they see text (the so-called "user-perceived
characters"); they often map one-to-one with code points, but not
always (combining marks, for instance). Code points are standardized
codes representing the various elements of a string, and code units
are what encode the code points in a given encoding such as UTF-8 or
UTF-16.
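
As a rough illustration of the lower three layers, the sketch below
counts the elements of one string at each level. It assumes a recent
Phobos that provides std.utf.byCodeUnit, std.utf.byDchar and
std.uni.byGrapheme; the sample text and the LayerCounts/countLayers
names are made up for the example. Glyphs are left out because they
depend on a font and a text-shaping engine.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit, byDchar;

// Hypothetical helper type: element counts of one string at three layers.
struct LayerCounts
{
    size_t codeUnits;
    size_t codePoints;
    size_t graphemes;
}

LayerCounts countLayers(string s)
{
    return LayerCounts(
        s.byCodeUnit.walkLength,  // UTF-8 code units (same as s.length)
        s.byDchar.walkLength,     // decoded code points
        s.byGrapheme.walkLength); // user-perceived characters
}

void main()
{
    // "noe" + U+0308 COMBINING DIAERESIS + "l": "noël" in decomposed form.
    string s = "noe\u0308l";
    auto c = countLayers(s);
    assert(c.codeUnits == 6);   // U+0308 takes two UTF-8 code units
    assert(c.codePoints == 5);
    assert(c.graphemes == 4);   // 'e' and the diaeresis form one grapheme
}
```

The three counts differ for the same text, which is why the layer you
iterate over has to match the operation you are performing.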
If you're writing a parser for XML, JSON, or some other format, you'll
probably care about code points. If you're advancing the insertion
point in a text field or counting the number of user-perceived
characters, you'll probably want to deal with graphemes. For searching
for a substring inside a string, or comparing strings, you'll probably
want to deal with either graphemes or collation elements (collation
elements are layered on top of code points). To print a string, you'll
need to map graphemes to the glyphs from a particular font.
Reducing string operations to code point manipulations will only work
as long as all your graphemes, collation elements, or glyphs map
one-to-one with code points.
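
A small sketch of where the one-to-one assumption breaks: the same
user-perceived character spelled two ways compares unequal code point
by code point. Canonical normalization via std.uni.normalize is used
here only as a stand-in to show the two spellings are equivalent; it is
not full collation, and the helper names sameCodePoints and
canonicallyEqual are hypothetical.

```d
import std.algorithm.comparison : equal;
import std.uni;

// Compares two strings code point by code point (strings decode to dchar).
bool sameCodePoints(string a, string b)
{
    return equal(a, b);
}

// Compares two strings after bringing both to Normalization Form C.
bool canonicallyEqual(string a, string b)
{
    return normalize!NFC(a) == normalize!NFC(b);
}

void main()
{
    string precomposed = "\u00E9";   // é as a single code point
    string decomposed  = "e\u0301";  // 'e' followed by COMBINING ACUTE ACCENT

    // One code point versus two, so a code point comparison says "different"...
    assert(!sameCodePoints(precomposed, decomposed));

    // ...although both spell the same grapheme.
    assert(canonicallyEqual(precomposed, decomposed));
}
```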
> It seems to me that each abstraction separately already fits within the
> existing D range framework and all the difficulties arise as a
> consequence of trying to lump them into a single abstraction.
It's true that each of these abstractions can fit within the existing
range framework.
> Why not choose which of these abstractions is most appropriate in a
> given situation instead of trying to shoe-horn both concepts into a
> single abstraction, and provide for easy conversion between them? When
> character representation is the primary requirement then make it a
> bidirectional range of code points. When storage representation and
> random access is required then make it a random access range of code
> units.
I think you're right. The need for a new concept isn't that great, and
it gets complicated really fast.
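
For what it's worth, the two views described in the quoted paragraph
can be checked with the existing range traits. This is a sketch that
assumes a recent Phobos with its default auto-decoding of strings and
with std.utf.byCodeUnit available; the sample string is made up.

```d
import std.range.primitives : back, front,
    isBidirectionalRange, isRandomAccessRange;
import std.utf : byCodeUnit;

// A string viewed as code points: bidirectional, but not random access,
// because a code point spans a variable number of code units.
static assert(isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);

// The same string viewed as code units: random access.
static assert(isRandomAccessRange!(typeof("".byCodeUnit)));

void main()
{
    string s = "h\u00E9llo";    // é is one code point, two UTF-8 code units
    auto units = s.byCodeUnit;

    assert(units.length == 6);
    assert(units[1] == 0xC3);   // O(1) indexing lands on the first byte of é

    // The code point view can still be walked from either end.
    assert(s.front == 'h');
    assert(s.back == 'o');
}
```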
--
Michel Fortin
michel.fortin at michelf.com
http://michelf.com/
More information about the Digitalmars-d
mailing list