The length of strings vs. # of chars vs. sizeof

Tue Nov 3 05:54:35 PST 2009

On Tue, Nov 3, 2009 at 2:47 AM, rmcguire <rjmcguire at gmail.com> wrote:
> Charles Hixson <charleshixsn at earthlink.net> wrote:
>
>> Jesse Phillips wrote:
>>> On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:
>>>
>>>> Does anyone just *know* the answer?  (And if so, could they make the
>>>> documentation explicit?)
>>>
>>> I believe the documentation you are looking for is:
>>>
>>> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>>>
>>> It is more about understanding UTF than it is about learning strings.
>> Thanks, that does appear to be the answer.
>>
>> So if a string is too long, and I shorten it by one character, I'd
>> better test it with std.utf.validate(str).  If it doesn't throw an
>> error, it's ok.  Otherwise shorten it again and retry.
>>
>> I hope I understood this correctly.  (I'm sure there's a more elegant
>> way to do this, but here I'm going for a simple approach, as I should
>> rarely be encountering this problem.)
>>
>>
> As far as I know, if you want to shorten a UTF-8 string you just check the
> first bit of the last byte to see if it's 0. If it's 0, go back further
> until you find a byte that starts with 1, and then remove that byte too.
>
> All characters start with a byte that starts with 1; the number of 1s in
> the first byte of the character tells you how many bytes are in the character.
>
> Hope that helps, but you should find a library that already has a
> "shorten my string" function.

It's explained well in Andrei's book.
0* -- single byte character
11* -- first byte of multi-byte char
10* -- subsequent byte of multi-byte char
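
Given those three patterns, dropping the last character means stepping
back over any 10* bytes until you reach a 0* or 11* byte, then cutting
there. Here is a rough sketch of that, in Python rather than D just to
show the bit tests. The function names are made up for illustration, the
input is assumed to be valid UTF-8 already, and "character" means one
code point. The second function is the validate-and-retry idea from
earlier in the thread, for comparison.

```python
def is_continuation(b: int) -> bool:
    # 10xxxxxx: subsequent byte of a multi-byte character
    return (b & 0xC0) == 0x80


def drop_last_char(data: bytes) -> bytes:
    """Remove the last encoded character (assumes data is valid UTF-8)."""
    if not data:
        return data
    i = len(data) - 1
    # Step back over continuation bytes (10*) to the lead byte (0* or 11*).
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return data[:i]


def drop_last_char_by_retry(data: bytes) -> bytes:
    """Cut one byte, then keep cutting until the result decodes cleanly."""
    data = data[:-1]
    while True:
        try:
            data.decode("utf-8")
            return data
        except UnicodeDecodeError:
            data = data[:-1]


print(drop_last_char("caf\u00e9".encode("utf-8")))  # b'caf'
```

Both should give the same result on valid input. The backward scan only
looks at the tail of the string, while the retry version re-validates the
whole string on each attempt.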

--bb