The length of strings vs. # of chars vs. sizeof

Sun Nov 1 12:12:10 PST 2009

Charles Hixson wrote:
> I've read and re-read the documentation, but I can't decide whether a
> UTF-8 character that takes multiple bytes to express counts as one or
> multiple values in length and sizeof.  Sizeof seems to presume that all
> entries are the same length, but otherwise it seems to be the property I
> need.  (I suppose that I could just enter a string that I know is
> multi-byte chars, but it sure would be better if I could find out from
> the documentation.)  I'm pretty certain that it just counts as one
> character for indexing, so length would almost need to also count the
> number of characters rather than bytes.

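Strings are just arrays of code units (UTF-8 for char, UTF-16 for wchar,
UTF-32 for dchar).  Their length is the number of elements (i.e. code
units) they contain, just like other arrays, and indexing works the same
way, so a multi-byte UTF-8 character counts as several elements for both.
A code point may comprise multiple code units, and a logical character
may comprise multiple code points.  The latter is true even with
dchar/utf-32.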
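
To make the three counts concrete, here is a minimal sketch.  It assumes a
reasonably recent D2 compiler whose Phobos provides std.utf.count and
std.uni.byGrapheme.  The helper names codeUnits, codePoints and graphemes
are made up for this illustration and are not part of any library.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

// Hypothetical helper: number of UTF-8 code units (bytes).
// This is exactly what .length reports for a string.
size_t codeUnits(string s) { return s.length; }

// Hypothetical helper: number of code points, found by decoding the UTF-8.
size_t codePoints(string s) { return count(s); }

// Hypothetical helper: number of graphemes (logical characters).
size_t graphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    string precomposed = "caf\u00E9";  // e-acute as one code point, U+00E9
    string combining = "cafe\u0301";   // 'e' followed by combining acute, U+0301

    // U+00E9 takes two UTF-8 code units, so .length is 5, not 4.
    assert(codeUnits(precomposed) == 5);
    assert(codePoints(precomposed) == 4);
    assert(graphemes(precomposed) == 4);

    // Indexing also works on code units: element 3 is the first byte
    // of the two-byte sequence, not the whole character.
    assert(precomposed[3] == 0xC3);

    // Here one logical character spans two code points.
    assert(codeUnits(combining) == 6);
    assert(codePoints(combining) == 5);
    assert(graphemes(combining) == 4);

    // Even as UTF-32 the combining form is 5 elements for 4 characters.
    assert("cafe\u0301"d.length == 5);

    // .sizeof on a string is the size of the slice itself (pointer plus
    // length), not of the text; the element types have fixed sizes.
    assert(precomposed.sizeof == 2 * size_t.sizeof);
    assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);

    // The storage taken by the text is length times the element size.
    assert(precomposed.length * char.sizeof == 5);
}
```

So for the original question: length and indexing both count code units,
and sizeof says nothing about the contents of the string.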

-- 
Rainer Deyke - rainerd at eldwood.com