Why can't D store all UTF-8 code units in char type? (not really understanding explanation)
H. S. Teoh
hsteoh at qfbox.info
Fri Dec 2 22:59:02 UTC 2022
On Fri, Dec 02, 2022 at 02:32:47PM -0800, Ali Çehreli via Digitalmars-d-learn wrote:
> On 12/2/22 13:44, rikki cattermole wrote:
>
> > Yeah you're right, it's code unit not code point.
>
> This proves yet again how badly chosen those names are. I must look it
> up every time before using one or the other.
>
> So they are both "code"? One is a "unit" and the other is a "point"?
> Sheesh!
[...]
Think of Unicode as a vector space. A code point is a point in this
space, and a code unit is one of the unit vectors; although some points
can be reached with a single unit vector, to get to a general point you
need to combine one or more unit vectors.
Furthermore, the set of unit vectors you have depends on which
coordinate system (i.e., encoding) you're using. Reencoding a Unicode
string is essentially changing your coordinate system. ;-) (Exercise for
the reader: compute the transformation matrix for reencoding. :-P)
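
To make the analogy concrete, here is a minimal sketch (in Python, purely for illustration) that takes one code point and lists the code units it needs under three encodings. The helper name `code_units` is hypothetical, not a library function. In D terms, the three cases correspond to `char` (UTF-8), `wchar` (UTF-16) and `dchar` (UTF-32) code units.

```python
# Width in bytes of one code unit for each encoding ("coordinate system").
UNIT_WIDTH = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

def code_units(text, encoding):
    """Hypothetical helper: return the code units of `text` as integers."""
    width = UNIT_WIDTH[encoding]
    raw = text.encode(encoding)
    # Slice the encoded bytes into fixed-width, little-endian code units.
    return [int.from_bytes(raw[i:i + width], "little")
            for i in range(0, len(raw), width)]

point = "\U0001F600"  # a single code point, U+1F600

utf8 = code_units(point, "utf-8")       # four 8-bit code units
utf16 = code_units(point, "utf-16-le")  # two 16-bit code units (a surrogate pair)
utf32 = code_units(point, "utf-32-le")  # one 32-bit code unit

for name, units in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(name, [hex(u) for u in units])
```

The same point needs 4, 2, or 1 code units depending on the encoding, which is why a single UTF-8 code unit cannot hold an arbitrary code point.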
Also, a grapheme is a curve through this space (you *graph* the curve,
you see), and as we all know, a curve may consist of more than one
point.
:-D
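
As a rough illustration of a grapheme spanning more than one code point, the sketch below (again Python, using only the standard `unicodedata` module) builds "é" from a base letter plus a combining accent. It does not perform real grapheme segmentation; it only shows that one user-perceived character can be two code points, and that NFC normalization happens to compose this particular pair into one.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
grapheme = "e\u0301"

# The individual code points that make up the grapheme.
code_points = [f"U+{ord(c):04X}" for c in grapheme]

# NFC normalization composes this pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", grapheme)

print(code_points)                    # the two code points
print(len(grapheme), len(composed))   # code point counts before and after NFC
print(len(grapheme.encode("utf-8")))  # UTF-8 code units in the decomposed form
```

Not every grapheme has a precomposed form, so normalization is not a general way to get "one code point per character".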
(Exercise for the reader: what's the Hausdorff dimension of the set of
strings over Unicode space? :-P)
T
--
First Rule of History: History doesn't repeat itself -- historians merely repeat each other.