Unicode handling comparison
Jakob Ovrum
jakobovrum at gmail.com
Wed Nov 27 08:39:27 PST 2013
On Wednesday, 27 November 2013 at 16:22:58 UTC, Wyatt wrote:
> Whoops, overzealous pasting. That is, "e\u0308", which
> composes to "ë". A grapheme cluster seems to represent one
> printed character: "...a horizontally segmentable unit of text,
> consisting of some grapheme base (which may consist of a Korean
> syllable) together with any number of nonspacing marks applied
> to it."
>
> Is that about right?
>
> -Wyatt

Yes.

A grapheme is also sometimes described as the unit that lay
people intuitively think of as a "character".

The difference between a grapheme and a grapheme cluster is just
a matter of perspective, like the difference between a character
and a code point. In each pair, the former term refers to the
decoded result, while the latter refers to the sequence of
encoded parts (code points in the case of a grapheme cluster,
code units in the case of a code point).
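
To make the layers concrete, here is a minimal Python sketch that
counts code units, code points and (approximately) grapheme clusters
for the "e\u0308" example from the quoted message. The helper name
layer_counts is made up for illustration. The cluster count relies on
an assumption: every code point with canonical combining class 0
starts a new cluster. That holds for this example but is not the full
segmentation algorithm from UAX #29.

```python
import unicodedata


def layer_counts(text):
    """Count the units of `text` at each layer (hypothetical helper)."""
    return {
        "utf8_code_units": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),  # a Python 3 str is a sequence of code points
        # Assumption: a code point with canonical combining class 0 starts
        # a new cluster. This only approximates the UAX #29 rules (it
        # mishandles Hangul jamo, ZWJ sequences and similar cases).
        "approx_grapheme_clusters": sum(
            1 for ch in text if unicodedata.combining(ch) == 0
        ),
    }


decomposed = "e\u0308"  # 'e' followed by COMBINING DIAERESIS
composed = unicodedata.normalize("NFC", decomposed)  # the single code point U+00EB

print(layer_counts(decomposed))
print(layer_counts(composed))
```

Both strings come out as one approximate cluster, even though the
decomposed form has two code points and the composed form has one.
The code unit counts differ again depending on the encoding.
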

Yet another example is the UTF-32 code unit: one UTF-32 code unit
is (currently) equal to one Unicode code point, but both terms
are meaningful in the right context.
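
The UTF-32 case can be checked with a short Python sketch. The sample
string and the variable names are chosen here purely for
illustration. It decodes the UTF-32 byte stream back into 32-bit code
units and compares them with the code points of the same string, with
UTF-16 shown for contrast.

```python
# 'e', COMBINING DIAERESIS, and one code point outside the BMP (U+1F600).
text = "e\u0308\U0001F600"

# Encode without a byte order mark and split into 4-byte code units.
utf32_bytes = text.encode("utf-32-le")
utf32_units = [
    int.from_bytes(utf32_bytes[i:i + 4], "little")
    for i in range(0, len(utf32_bytes), 4)
]

# The code points of the same string, as integers.
code_points = [ord(ch) for ch in text]

# Same split for UTF-16, where the unit size is 2 bytes.
utf16_bytes = text.encode("utf-16-le")
utf16_units = [
    int.from_bytes(utf16_bytes[i:i + 2], "little")
    for i in range(0, len(utf16_bytes), 2)
]

print(utf32_units == code_points)  # one UTF-32 code unit per code point
print(len(utf16_units), len(code_points))  # U+1F600 needs a surrogate pair
```

The values are numerically identical in UTF-32, but "code unit" still
describes the encoded form and "code point" the abstract value. UTF-16
shows why the distinction matters, since the last code point takes
two code units there.
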