Character is only the first byte of a UTF-8 sequence

Længlich nospam at void.de
Sun Sep 2 08:38:23 PDT 2007


Hello!

From what I've read about D, I think I will like this language much more than
C++, Java, and the other well-known languages. But now that I'm using it for
the first time, I've run into a serious problem with the handling of user input.

The input comes from a TextBox from the DFL (D Forms Library), which seems to
be working fine - except that I cannot meaningfully access any given string
(char[]). Whenever I try to do something with the string (e.g. concatenate it
to another one, or use a string function like tolower), I get an "Invalid
UTF-8 sequence" error. When I try to access a character directly (e.g. with a
foreach loop over the string), I only get the first byte of each character.
For example: if the character is 'ä' (i.e. has the UTF-8 encoding C3 A4) and I
cast it to int, the result is 195, which equals C3. The second byte, A4, seems
to be lost.
If it is an ASCII character, everything works as desired, but with all higher
characters I have this problem. I tried using dchar instead of char, and I
tried applying all of the converting functions from std.utf, but the problem
remained exactly the same.
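For what it's worth, my understanding of std.utf is that toUTF32 should turn the UTF-8 byte sequence into one dchar per character. A minimal sketch of what I expected (assuming a plain DMD/Phobos program, outside DFL):

```d
import std.utf;

void main()
{
    char[] s = "ä".dup;     // stored as two UTF-8 code units: 0xC3 0xA4
    auto d = toUTF32(s);    // decode UTF-8 into a sequence of code points

    assert(s.length == 2);  // two bytes on the UTF-8 side
    assert(d.length == 1);  // but only one real character
    assert(d[0] == 0x00E4); // 'ä', i.e. 228 - not 195
}
```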

So, is there a decoding function which returns the real characters* so that I
can work with them, or do I actually have to work with single bytes (which
would necessarily mean reinventing the square wheel)?
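To make the behaviour concrete, here is a minimal sketch outside DFL (assuming plain DMD; if I read the spec correctly, foreach can be asked to decode UTF-8 by typing the loop variable as dchar):

```d
import std.stdio;

void main()
{
    char[] s = "ä".dup;  // UTF-8 bytes: 0xC3 0xA4

    // Typing the loop variable as char iterates the raw bytes -
    // this is where the lone 195 (0xC3) comes from:
    foreach (char c; s)
        writefln("byte: %d", cast(int) c);  // 195, then 164

    // Typing it as dchar makes foreach decode the UTF-8 sequence:
    foreach (dchar c; s)
        writefln("char: %d", cast(int) c);  // 228, the code point of 'ä'
}
```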

By the way, I'm using MS Windows XP SP2 in German, and my source code is
UTF-8 with a BOM. I'm not sure whether either of these facts matters.

Thank you for any feedback and kindest regards,
Længlich

* The encoding doesn't matter to me. I just want to be able to compare them to
other characters without them always being equal to 195.


