The length of strings vs. # of chars vs. sizeof

Tue Nov 3 07:15:18 PST 2009

Bill Baxter <wbaxter at gmail.com> wrote:

> On Tue, Nov 3, 2009 at 2:47 AM, rmcguire <rjmcguire at gmail.com> wrote:
>> Charles Hixson <charleshixsn at earthlink.net> wrote:
>>
>>> Jesse Phillips wrote:
>>>> On Sun, 01 Nov 2009 11:36:31 -0800, Charles Hixson wrote:
>>>>
>>>>> Does anyone just *know* the answer?  (And if so, could they make the
>>>>> documentation explicit?)
>>>>
>>>> I believe the documentation you are looking for is:
>>>>
>>>> http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD
>>>>
>>>> It is more about understanding UTF than it is about learning strings.
>>> Thanks, that does appear to be the answer.
>>>
>>> So if a string is too long, and I shorten it by one character, I'd
>>> better test it with std.utf.validate(str).  If it doesn't throw an
>>> error, it's ok.  Otherwise shorten it again and retry.
>>>
>>> I hope I understood this correctly.  (I'm sure there's a more elegant
>>> way to do this, but here I'm going for a simple approach, as I should
>>> rarely be encountering this problem.)
>>>
>>>
>> As far as I know, if you want to shorten a UTF-8 string you just check the
>> first bit of the last byte to see if it's 0. If it's 0, go back further
>> until you find a byte that starts with 1, and then remove that byte too.
>>
>> All characters start with a byte that starts with 1; the number of 1s in
>> the first byte of the character tells you how many bytes are in the character.
>>
>> Hope that helps, but you should find a library that already has a
>> "shorten my string" function.
> 
> It's explained well in Andrei's book.
> 0* -- single byte character
> 11* -- first byte of multi-byte char
> 10* -- subsequent byte of multi-byte char
> 
> --bb
> 
:) I forgot about that; it's been a while since I played with UTF-8.

I made a Hessian serializer in C.
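
For what it's worth, the bit patterns Bill lists can be turned into a shortening routine along these lines. This is only a rough sketch, written in Python rather than D to keep it short. The function names are made up for illustration, and it assumes the input is already valid UTF-8:

```python
def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (2nd or later byte of a character)."""
    return (byte & 0xC0) == 0x80


def utf8_truncate(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a multi-byte character.

    Hypothetical helper; assumes data is valid UTF-8 and max_bytes >= 0.
    """
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte, the cut lands inside a character, so back up
    # until the cut sits on that character's first byte.
    while cut > 0 and is_continuation(data[cut]):
        cut -= 1
    return data[:cut]


def utf8_drop_last_char(data: bytes) -> bytes:
    """Remove the final code point from data (hypothetical helper)."""
    if not data:
        return data
    cut = len(data) - 1
    # Walk back over 10xxxxxx bytes to the first byte of the last character.
    while cut > 0 and is_continuation(data[cut]):
        cut -= 1
    return data[:cut]


text = "naïve€".encode("utf-8")          # 9 bytes: ï takes 2, € takes 3
shortened = utf8_truncate(text, 8)       # 8 would split the €, so it is dropped
print(shortened.decode("utf-8"))
```

Note that this works on code points, not on what a reader sees as one character, so a base letter and a following combining accent can still be separated. Running the result through a validator afterwards, as Charles suggests, is a cheap sanity check.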

-Rory