Character is only first byte of an UTF-8 sequence
Stewart Gordon
smjg_1998 at yahoo.com
Mon Sep 3 18:15:58 PDT 2007
"Længlich" <nospam at void.de> wrote in message
news:fbeldf$1tbn$1 at digitalmars.com...
> Hello!
>
> From what I've read about D I think I will like this language much more
> than C++, Java and the other well-known languages. But now that I'm using
> it the first time, I've got a serious problem with the handling of user
> input.
>
> The input comes from a TextBox from the DFL (D Forms Library) which seems
> to be working fine - except the problem that I cannot sensibly access any
> given string (char[]). Whenever I try to do something with the string
> (e.g. concat it to another one, or use a string function like tolower), I
> get an "Invalid UTF-8 sequence" error.
I'm a bit puzzled. Concatenating arrays shouldn't care about their content.
> When I try to access a character directly (e.g. with a foreach loop over
> the string), I only get the first byte of each character. For example: If
> the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) and I cast it to
> int, the result is 195 - which equals C3. The second byte, A4, seems to be
> lost.
Sounds as though DFL is buggy. A char is indeed a single byte (one UTF-8
code unit), but it shouldn't be losing the remaining bytes of the character.
Are you sure it's actually returning the first UTF-8 byte of each character,
and not text in some other encoding such as ANSI (a Windows code page)?
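One way to tell the two cases apart is to ask whether the bytes form valid
UTF-8 at all. The following is only a sketch, assuming a current Phobos where
std.utf.validate throws UTFException on malformed input; the helper name
isValidUtf8 and the sample bytes are made up for illustration.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: true if text is well-formed UTF-8.
bool isValidUtf8(const(char)[] text)
{
    try
    {
        validate(text);          // throws on a malformed sequence
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    // 'ä' followed by 'b', properly encoded as UTF-8 (C3 A4 62).
    const(char)[] utf8 = "\u00E4b";

    // The same text as a single-byte code page would store it (E4 62).
    // E4 announces a three-byte sequence that never arrives.
    char[] ansi = [cast(char) 0xE4, 'b'];

    writeln(isValidUtf8(utf8));
    writeln(isValidUtf8(ansi));
}
```

If the TextBox contents fail a check like this, that would point towards
the text not being UTF-8 in the first place.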
I don't know DFL myself, but meanwhile, please try evaluating
std.string.format(cast(ubyte[]) text)
on the text retrieved from your TextBox, and then post the result (along
with what text you typed). This might help with diagnosing the problem.
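If the decimal list is awkward to read, a hex dump does the same job. This is
a rough sketch assuming a current Phobos (std.format with the %( %) range
specifier); hexDump is a name I made up, not anything from DFL or Phobos.

```d
import std.format : format;
import std.stdio : writeln;

/// Hypothetical helper: the raw bytes of text as space-separated hex.
string hexDump(const(char)[] text)
{
    // Reinterpret the code units as plain bytes, as in the cast above.
    auto bytes = cast(const(ubyte)[]) text;
    return format("%(%02X %)", bytes);
}

void main()
{
    // "\u00E4" is 'ä'; in a D string literal it is stored as UTF-8.
    writeln(hexDump("\u00E4"));   // prints "C3 A4"
    writeln(hexDump("abc"));      // prints "61 62 63"
}
```

For comparison, a single-byte Windows code page would give just E4 for 'ä',
so the dump should show fairly quickly which encoding the TextBox is handing
over.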
> If it is an ASCII-character, everything works as desired, but with all
> higher characters I have this problem. I tried using dchar instead of
> char, and I tried applying all of the converting functions from std.utf,
> but the problem did not even change.
You can foreach with dchar over a char[]. Or have you tried that?
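To illustrate what I mean, here is a small sketch (the function names are
mine, and it assumes the string really is valid UTF-8). With a char loop
variable you get one byte per iteration; with a dchar loop variable the
foreach decodes each sequence into a whole code point.

```d
import std.stdio : writefln;

/// Hypothetical helper: the raw UTF-8 code units, one per byte.
ubyte[] codeUnits(const(char)[] text)
{
    ubyte[] result;
    foreach (char c; text)        // no decoding: one byte at a time
        result ~= cast(ubyte) c;
    return result;
}

/// Hypothetical helper: the decoded code points.
dchar[] codePoints(const(char)[] text)
{
    dchar[] result;
    foreach (dchar c; text)       // decodes multi-byte sequences
        result ~= c;
    return result;
}

void main()
{
    auto text = "\u00E4b";        // 'ä' followed by 'b'

    foreach (b; codeUnits(text))
        writefln("byte       %d", b);                 // 195, 164, 98

    foreach (c; codePoints(text))
        writefln("code point U+%04X", cast(uint) c);  // U+00E4, U+0062
}
```

If the dchar loop fails with an invalid UTF-8 error on your TextBox text,
that again suggests the bytes are not UTF-8 to begin with.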
<snip>
> * The encoding doesn't matter to me. I just want to be able to compare
> them to other characters without them always being equal to 195.
If you want to compare them _to_ other characters, it would make most sense
to do so if they are all the same. If you want to compare them _with_ other
characters, OTOH....
If different characters are all coming out as 195, with no bytes in between
to distinguish them, then it's definitely a bug in DFL.
Stewart.