Working with utf
Regan Heath
regan at netmail.co.nz
Thu Jun 14 05:44:38 PDT 2007
Simen Haugen Wrote:
> I hate it!
>
> Say we have a string "øl". When I read this from a text file, it is two
> chars, but since this is not a utf8 string, I have to convert it to utf8 before
> I can do any string operations on it.
> I can easily live with that. Say we have a file with several lines, and it's
> important that all lines are of equal length.
> The string "ol" is two chars, but the string "øl" is 3 chars in utf8.
> Because of this I have to convert it back to latin-1 before checking
> lengths. The same applies to slicing, but even worse.
> For all I care, "ø" is one character, not two. If I slice "ø" to get the
> first character, I only get the first half of the character. Isn't it more
> obvious that all string manipulation works with all utf8 characters as one
> character instead of two for values greater than 127?
>
> I cannot find any nice solutions for this, and have to convert to and from
> latin-1/utf8 all the time.
>
> There must be a better way...
I think what we want for this is a String class which internally stores the data as UTF-8, UTF-16 or UTF-32 (making its own decision or being told which to use) and provides slicing by character as opposed to by code unit.
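A rough sketch of what such a class might look like follows. It is written in Python only to keep it short, and the name Utf8String and its methods are hypothetical, not an existing library API. It assumes UTF-8 storage only and treats one code point as one "character", so combining sequences would still count as several characters.

```python
class Utf8String:
    """Hypothetical string type: stores UTF-8 bytes, indexes by code point."""

    def __init__(self, data: bytes):
        data.decode("utf-8")  # validate; raises UnicodeDecodeError on bad input
        self._data = data
        # Every byte that is not a continuation byte (0b10xxxxxx) starts a
        # new code point, so these are the byte offsets of each character.
        self._starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

    @property
    def raw(self) -> bytes:
        """The underlying code units, for code that needs them directly."""
        return self._data

    def __len__(self) -> int:
        return len(self._starts)  # length in characters, not bytes

    def __getitem__(self, key):
        if isinstance(key, int):
            if key < 0:
                key += len(self)
            if not 0 <= key < len(self):
                raise IndexError("character index out of range")
            key = slice(key, key + 1)
        start, stop, step = key.indices(len(self))
        if step != 1:
            raise ValueError("this sketch only supports contiguous slices")
        stop = max(stop, start)
        # Map character positions to byte positions before slicing.
        bounds = self._starts + [len(self._data)]
        return Utf8String(self._data[bounds[start]:bounds[stop]])

    def __str__(self) -> str:
        return self._data.decode("utf-8")


s = Utf8String("øl".encode("utf-8"))
print(len(s.raw), len(s), str(s[0]))
```

With this, "øl" occupies three bytes internally but reports a length of two, and slicing off the first character returns the whole "ø" rather than half of it.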
Then, all you need is to convert from latin-1 to String, do all your work with String and convert back to latin-1 only if/when you need to write it back to a file or similar.
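To illustrate that workflow with the equal-length-lines case from the original question, here is a minimal sketch. Python's built-in str stands in for the proposed String class (it already indexes by code point), and pad_line is a hypothetical helper invented for this example.

```python
def pad_line(latin1_line: bytes, width: int) -> bytes:
    """Pad one Latin-1 encoded line with spaces to a fixed character width."""
    # Convert once on the way in. Latin-1 maps every byte to one character.
    text = latin1_line.decode("latin-1")
    # All work happens on characters, so "øl" and "ol" are both length 2.
    padded = text.ljust(width)
    # Convert back only when writing out. This raises UnicodeEncodeError if
    # the text has gained a character that Latin-1 cannot represent.
    return padded.encode("latin-1")


lines = [b"\xf8l", b"ol"]  # "øl" and "ol" as they would appear in the file
padded = [pad_line(line, 4) for line in lines]
print([len(p) for p in padded])
```

Both lines come out the same length, with no conversion needed in the middle of the string handling.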
My gut feeling is that this functionality belongs in a class and not the language itself. After all, you may want/need to manipulate UTF-8, UTF-16 or UTF-32 code units directly for some reason.
Regan Heath