reading UTF-8 formatted file into dynamic array
Stewart Gordon
smjg_1998 at yahoo.com
Tue Jul 24 11:35:01 PDT 2007
"thofis" <thofis at gmail.com> wrote in message
news:f8559q$2ld9$1 at digitalmars.com...
<snip>
> But it seems that this casting does not result in proper filling of
> _input_. Any non-ASCII character is split into multiple characters.
> (The BOM is also contained in _input_.)
Welcome to UTF-8.
> So..., how do I get a UTF-8 file properly mapped into a dynamic array?
It is properly mapped. If you examine the file in a hex editor/dumper and
compare with what your program has made of it, you'll see that the file has
been read byte for byte. You'll also notice that input.length matches the
file size reported by the OS.
In UTF-8, every non-ASCII character occupies more than one byte. There is
no splitting into multiple _characters_ - when the file is saved as UTF-8 in
the first place, any character above U+007F is split into multiple _bytes_,
but it is still only one _character_.
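To make the byte/character distinction concrete, here is a minimal sketch. It assumes a current D compiler and Phobos, where std.utf.count returns the number of code points in a string; the sample word is only an illustration, not the original poster's data.

```d
import std.stdio : writeln;
import std.utf : count;

void main()
{
    // U+00E9 (e with acute accent) is above U+007F, so UTF-8 stores it
    // as two bytes: 0xC3 0xA9. The other three letters take one byte each.
    string s = "caf\u00E9";

    writeln(s.length);  // 5: length of a char array counts UTF-8 bytes
    writeln(count(s));  // 4: number of actual characters (code points)
}
```

The array length and the file size agree because both count bytes; only the character count differs.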
If you want to work with the file data in a strict character-by-character
format, you can do any of the following after reading the file:
- convert it to UTF-32, using std.utf.toUTF32
- use std.utf.decode to extract characters from the read-in UTF-8 text
- use foreach with a dchar variable
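The three options above might look roughly like this. It is a sketch under assumptions, not the only way to do it: it targets a current Phobos (the std.utf signatures have changed slightly over the years), the string literal stands in for data already read from the file, and countWithDecode is a hypothetical helper name used only for illustration.

```d
import std.stdio : writeln;
import std.utf : decode, toUTF32;

// Hypothetical helper: walks UTF-8 text with std.utf.decode and
// returns how many characters (code points) it contains.
size_t countWithDecode(string text)
{
    size_t i = 0;   // byte index into text
    size_t n = 0;   // characters seen so far
    while (i < text.length)
    {
        decode(text, i);  // returns the dchar at i and advances i past it
        ++n;
    }
    return n;
}

void main()
{
    // Stand-in for the file contents; in a real program this would be
    // the data read from disk. U+00EF takes two bytes in UTF-8.
    string input = "na\u00EFve";
    assert(input.length == 6);  // six bytes

    // 1. Convert the whole text to UTF-32: one array element per character.
    auto wide = toUTF32(input);
    assert(wide.length == 5);

    // 2. Extract characters one at a time with std.utf.decode.
    assert(countWithDecode(input) == 5);

    // 3. Let foreach do the decoding by asking for dchar elements.
    size_t seen = 0;
    foreach (dchar c; input)
        ++seen;
    assert(seen == 5);

    writeln("all three approaches see ", seen, " characters");
}
```

The UTF-32 conversion costs four bytes per character but gives direct indexing by character; decode and foreach leave the data as UTF-8 and decode on the fly.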
Stewart.