Character is only first byte of an UTF-8 sequence

Stewart Gordon smjg_1998 at yahoo.com
Mon Sep 3 18:15:58 PDT 2007


"Længlich" <nospam at void.de> wrote in message 
news:fbeldf$1tbn$1 at digitalmars.com...
> Hello!
>
> From what I've read about D I think I will like this language much more
> than C++, Java and the other well-known languages. But now that I'm using
> it the first time, I've got a serious problem with the handling of user
> input.
>
> The input comes from a TextBox from the DFL (D Forms Library) which seems
> to be working fine - except the problem that I cannot sensibly access any
> given string (char[]). Whenever I try to do something with the string
> (e.g. concat it to another one, or use a string function like tolower), I
> get an "Invalid UTF-8 sequence" error.

I'm a bit puzzled.  Concatenating arrays shouldn't care about their content.
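
To illustrate the point, here is a minimal sketch in present-day D (D2 Phobos, which postdates this thread; `isValidUtf8` is a hypothetical helper name). It assumes that `~` simply copies code units, while `std.utf.validate` is what actually decodes and rejects a malformed sequence.

```d
import std.stdio;
import std.utf : validate, UTFException;

// Hypothetical helper: true if s is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);   // throws UTFException on a malformed sequence
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // 0xC3 on its own is a lead byte with no continuation byte after it.
    char[] broken = [cast(char) 0xC3];

    // Concatenation copies the bytes without looking at them.
    char[] joined = broken ~ "abc".dup;

    writeln(joined.length);        // 4
    writeln(isValidUtf8(joined));  // false
}
```

If concatenation itself appears to throw, the error is more likely raised by whatever decodes the result afterwards.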

> When I try to access a character directly (e.g. with a foreach loop over
> the string), I only get the first byte of each character. For example: If
> the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) and I cast it to
> int, the result is 195 - which equals C3. The second byte, A4, seems to be
> lost.

Sounds as though DFL is buggy.  A char is indeed a single byte, but it 
shouldn't be losing the remaining bytes of the character.  Are you sure it's 
actually returning the first UTF-8 byte of each character, and not some 
other encoding like ANSI?

I don't know DFL myself, but meanwhile, please try evaluating
    std.string.format(cast(ubyte[]) text)
on the text retrieved from your TextBox, and then post the result (along 
with what text you typed).  This might help with diagnosing the problem.
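
As a rough sketch of what such a dump could look like in present-day D (D2, not the Phobos of this thread; `hexBytes` is a hypothetical helper), the following prints each byte of the string in hex. For 'ä', UTF-8 text should show two bytes, C3 A4, whereas Windows-1252 ("ANSI") text would show the single byte E4.

```d
import std.stdio;
import std.format : format;

// Hypothetical helper: hex dump of the raw bytes of a string.
string hexBytes(const(char)[] text)
{
    string result;
    foreach (i, ubyte b; cast(const(ubyte)[]) text)
    {
        if (i > 0)
            result ~= " ";
        result ~= format("%02X", b);
    }
    return result;
}

void main()
{
    // "\u00E4" is 'ä'; the compiler stores it as UTF-8 in a string.
    string sample = "\u00E4";
    writeln(hexBytes(sample));  // C3 A4
}
```

In practice `sample` would be replaced by the text read from the TextBox.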

> If it is an ASCII-character, everything works as desired, but with all
> higher characters I have this problem. I tried using dchar instead of
> char, and I tried applying all of the converting functions from std.utf,
> but the problem did not even change.

You can foreach with a dchar loop variable over a char[], and the loop will 
decode the UTF-8 into whole characters for you.  Or have you tried that?
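
A small sketch of the difference, written for present-day D (D2); `codeUnits` and `codePoints` are hypothetical helper names. Iterating with a char loop variable yields the individual UTF-8 bytes, while a dchar loop variable yields decoded code points.

```d
import std.stdio;

// Hypothetical helper: values seen when iterating byte by byte.
uint[] codeUnits(const(char)[] s)
{
    uint[] r;
    foreach (char c; s)
        r ~= cast(uint) c;
    return r;
}

// Hypothetical helper: values seen when foreach decodes to dchar.
uint[] codePoints(const(char)[] s)
{
    uint[] r;
    foreach (dchar c; s)
        r ~= cast(uint) c;
    return r;
}

void main()
{
    string s = "h\u00E4";    // "hä", stored as the bytes 68 C3 A4
    writeln(codeUnits(s));   // [104, 195, 164]
    writeln(codePoints(s));  // [104, 228]
}
```

This only helps if the char[] really holds valid UTF-8; on a malformed sequence the decoding loop will throw.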

<snip>
> * The encoding doesn't matter to me. I just want to be able to compare
> them to other characters without them always being equal to 195.

If you want to compare them _to_ other characters, it would make most sense 
to do so if they are all the same.  If you want to compare them _with_ other 
characters, OTOH....

If different characters are all coming out as 195, with no bytes in between 
to distinguish them, then it's definitely a bug in DFL.

Stewart. 



