Why can't D store all UTF-8 code units in char type? (not really understanding explanation)

H. S. Teoh hsteoh at qfbox.info
Fri Dec 2 22:59:02 UTC 2022


On Fri, Dec 02, 2022 at 02:32:47PM -0800, Ali Çehreli via Digitalmars-d-learn wrote:
> On 12/2/22 13:44, rikki cattermole wrote:
> 
> > Yeah you're right, it's code unit, not code point.
> 
> This proves yet again how badly chosen those names are. I must look it
> up every time before using one or the other.
> 
> So they are both "code"? One is a "unit" and the other is a "point"?
> Sheesh!
[...]

Think of Unicode as a vector space.  A code point is a point in this
space, and a code unit is one of the unit vectors; although some points
can be reached with a single unit vector, to get to a general point you
need to combine one or more unit vectors.

Furthermore, the set of unit vectors you have depends on which
coordinate system (i.e., encoding) you're using.  Reencoding a Unicode
string is essentially changing your coordinate system. ;-) (Exercise for
the reader: compute the transformation matrix for reencoding. :-P)
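
To put numbers on the analogy, here is a minimal sketch (not anything from
Phobos itself: `codeUnitCounts` is a made-up helper name, and the sample
characters are my own picks) that counts how many code units the same code
point needs in each of D's three "coordinate systems":

```d
import std.conv : to;
import std.range : walkLength;
import std.stdio : writefln;

// Hypothetical helper: number of code units needed to store `s`
// as UTF-8 (char), UTF-16 (wchar) and UTF-32 (dchar), in that order.
size_t[3] codeUnitCounts(string s)
{
    return [s.length, s.to!wstring.length, s.to!dstring.length];
}

void main()
{
    // U+0041, U+00E9, U+20AC, U+1D11E: one code point each.
    foreach (s; ["A", "\u00E9", "\u20AC", "\U0001D11E"])
    {
        auto n = codeUnitCounts(s);
        // walkLength decodes the string, so it counts code points.
        writefln("%s: %s code point(s); %s char, %s wchar, %s dchar",
                 s, s.walkLength, n[0], n[1], n[2]);
    }
}
```

Every sample is a single point, but only in the dchar system is it always a
single unit. That is also the short answer to the subject line: a char holds
one UTF-8 code unit, not a whole code point.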

Also, a grapheme is a curve through this space (you *graph* the curve,
you see), and as we all know, a curve may consist of more than one
point.
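
Joking aside, the "more than one point" part is literally true. A small
sketch, assuming the usual Phobos tools (`byGrapheme` from std.uni and
`walkLength` from std.range); `graphemeCount` is just an illustrative name:

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of graphemes (user-perceived characters) in `s`.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: displays as one character.
    string s = "e\u0301";

    assert(s.length == 3);         // UTF-8 code units (1 for 'e', 2 for U+0301)
    assert(s.walkLength == 2);     // code points
    assert(graphemeCount(s) == 1); // graphemes
}
```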

:-D

(Exercise for the reader: what's the Hausdorff dimension of the set of
strings over Unicode space? :-P)


T

-- 
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.

