Unicode handling comparison

Jakob Ovrum jakobovrum at gmail.com
Wed Nov 27 08:39:27 PST 2013


On Wednesday, 27 November 2013 at 16:22:58 UTC, Wyatt wrote:
> Whoops, overzealous pasting.  That is, "e\u0308", which 
> composes to "ë".  A grapheme cluster seems to represent one 
> printed character: "...a horizontally segmentable unit of text, 
> consisting of some grapheme base (which may consist of a Korean 
> syllable) together with any number of nonspacing marks applied 
> to it."
>
> Is that about right?
>
> -Wyatt

Yes.
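
As a small illustration of the composition described in the quote, 
here is a sketch in Python (chosen only for illustration; the names 
`decomposed` and `composed` are mine). It assumes NFC normalization 
is what is meant by "composes":

```python
import unicodedata

# 'e' followed by U+0308 COMBINING DIAERESIS: two code points.
decomposed = "e\u0308"

# NFC normalization combines the pair into the precomposed
# U+00EB LATIN SMALL LETTER E WITH DIAERESIS where one exists.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))        # 2 code points
print(len(composed))          # 1 code point
print(composed == "\u00eb")   # True
```

Both spellings are one grapheme to the reader; only the number of 
code points differs.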

A grapheme is also sometimes explained as being the unit that lay 
people intuitively think of as being a "character".

The difference between a grapheme and a grapheme cluster is just 
a matter of perspective, like the difference between a character 
and a code point: in each pair, the former refers to the decoded 
result, while the latter refers to the sum of its encoded parts 
(the parts being code points for a grapheme cluster, and code 
units for a code point).
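
The two layers can be sketched roughly as follows, again in Python. 
The helpers `grapheme_clusters` and `utf8_code_units` are 
hypothetical names of mine, and the clustering rule is a deliberate 
simplification (attach nonspacing and enclosing marks to the 
preceding base), not the full segmentation algorithm from the 
Unicode standard:

```python
import unicodedata

def grapheme_clusters(text):
    """Group code points into clusters (simplified rule).

    A code point whose general category is Mn or Me (nonspacing or
    enclosing mark) is appended to the previous cluster; anything
    else starts a new cluster. Real segmentation has more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def utf8_code_units(code_point):
    """Return the UTF-8 code units (bytes) encoding one code point."""
    return list(code_point.encode("utf-8"))

text = "e\u0308"
for cluster in grapheme_clusters(text):
    # One cluster, made of two code points.
    print(len(cluster), "code point(s) in cluster")
    for cp in cluster:
        # Each code point is in turn made of one or more code units.
        print(f"U+{ord(cp):04X}", [hex(u) for u in utf8_code_units(cp)])
```

Here the single cluster consists of two code points, and U+0308 in 
turn consists of two UTF-8 code units (0xCC 0x88), while 'e' needs 
only one.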

Yet another example is that of the UTF-32 code unit: one UTF-32 
code unit is (currently) equal to one Unicode code point, but 
both terms are meaningful in the right context.
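
A quick way to see the UTF-32 case, sketched in Python with a 
hypothetical helper `utf32_code_units` of my own naming: decoding 
each 4-byte unit yields exactly the code point values, whereas 
UTF-16 needs two code units for a code point outside the BMP.

```python
def utf32_code_units(text):
    """Return the UTF-32 code units of text as integers.

    The explicit little-endian codec is used so that no byte order
    mark is prepended to the output.
    """
    data = text.encode("utf-32-le")
    return [int.from_bytes(data[i:i + 4], "little")
            for i in range(0, len(data), 4)]

# 'e', U+0308, and U+1F600 (which lies outside the BMP).
text = "e\u0308\U0001F600"

units = utf32_code_units(text)
code_points = [ord(c) for c in text]
utf16_unit_count = len(text.encode("utf-16-le")) // 2

print(units == code_points)   # True: one UTF-32 unit per code point
print(len(units))             # 3
print(utf16_unit_count)       # 4: U+1F600 takes a surrogate pair
```

The values coincide, but "code unit" still describes the encoded 
form and "code point" the abstract value.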

