Proper way to fix core.exception.UnicodeException at src\rt\util\utf.d(292): invalid UTF-8 sequence by File.readln()

Jonathan M Davis newsgroup.d at jmdavisprog.com
Fri Apr 6 16:42:01 UTC 2018


On Friday, April 06, 2018 16:10:56 Dr.No via Digitalmars-d-learn wrote:
> I'm reading, line by line, the contents of a CSV file provided by
> the user, which is assumed to be UTF-8. But a user provided an
> ANSI file instead, which resulted in the error:
> >core.exception.UnicodeException at src\rt\util\utf.d(292): invalid
> >UTF-8 sequence
>
> (It happened when the user took the originally UTF-8 encoded file
> generated by another application, made some edits using an editor
> (whose name I don't know), then saved it, unaware that it was
> changing the encoding to ANSI.)
>
> My question is: what's the proper way to solve this? Using toUTF8
> didn't help:
> > while((line = csvFile.readln().toUTF8) !is null) {
>
> I didn't find a way to explicitly set the encoding with
> std.stdio.File to UTF-8, regardless of whether the input is ANSI
> or already UTF-8.
> I don't want to convert the whole file to UTF-8: the CSV file can
> be large, and that might take quite a while. And I'd have to do it
> on a temporary copy of the file (which would make things even
> slower) to avoid touching the user's original file.
>
> I thought about writing my own readLine() with
> std.stdio.File.byChunk, taking bytes until a '\n' byte is
> seen, then treating the result as UTF-8 and returning it.
>
> But I'd like to not reinvent the wheel and use something native,
> if possible. Any ideas?

In general, Phobos pretty much requires that text files be in UTF-8 or that
they be in UTF-16 or UTF-32 with the native endianness. Certainly, char,
wchar, and dchar are assumed by both the language and the library to be
valid UTF-8, UTF-16, and UTF-32 respectively, and if you do anything with a
string as a range, by default, it will decode the code units to code points,
meaning that it will validate the UTF. If you want to read in anything that
is not going to be valid UTF, then you're going to have to read it in as
ubytes rather than as characters. If you're dealing with text that is
supposed to be valid UTF but might contain invalid characters, then you can
use std.utf.byCodeUnit to treat a string as a range of code units where it
replaces invalid UTF with the Unicode replacement character, but you'd still
have to read in the text as bytes (e.g. readText validates that the text is
proper UTF). std.utf.byUTF could be used instead of byCodeUnit to convert to
UTF-8, UTF-16, or UTF-32, with invalid UTF replaced by the replacement
character (so the input doesn't have to be the target encoding, but it does
have to be one of those three encodings and not something else). toUTF8 is
basically the eager version of byUTF!char / byChar that returns a string
rather than a lazy range, which is why it did not work for you: the file is
not UTF-8, UTF-16, or UTF-32.
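As a sketch of the approach described above, one can read the file as raw
bytes and sanitize them lazily with std.utf.byUTF (the file name and the
line-splitting strategy here are assumptions, not from the original post):

```d
import std.algorithm.iteration : splitter;
import std.conv : to;
import std.file : read;
import std.utf : byUTF;

void main()
{
    // Read the raw bytes without any UTF validation; the cast only
    // relabels the bytes as char, it does not decode anything.
    auto raw = cast(char[]) read("input.csv"); // hypothetical file name

    // byUTF!char lazily re-encodes to UTF-8, substituting U+FFFD
    // (the Unicode replacement character) for invalid sequences.
    auto sanitized = raw.byUTF!char.to!string;

    foreach (line; sanitized.splitter('\n'))
    {
        // line is now guaranteed to be valid UTF-8
    }
}
```

Note that this discards the original non-UTF characters (each invalid
sequence becomes U+FFFD), which matters if the ANSI data must survive
intact rather than merely not crash the program.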

If you're dealing with an encoding other than UTF-8, UTF-16, or UTF-32, and
you've read the data in as ubytes, then std.encoding _might_ help, but it's
not a great module and doesn't support a lot of encodings. On the other
hand, if by ANSI, you mean the encoding that Windows uses with the A
versions of its functions, then you can use std.windows.charset.fromMBSz to
convert it to a string, though the input will need to be a null-terminated
immutable(char)* to work with fromMBSz. So, somewhat stupidly, it basically
needs to be a null-terminated string that isn't valid UTF-8, which you pass
using str.ptr or &str[0].
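That Windows-specific route might look like the following sketch (the
helper name readAnsiFile is made up here, and it assumes fromMBSz's
default code page, i.e. the system ANSI code page):

```d
version (Windows)
{
    import std.file : read;
    import std.windows.charset : fromMBSz;

    string readAnsiFile(string path)
    {
        // Read the raw bytes and append the terminating null that
        // fromMBSz requires; the cast does not decode anything.
        auto raw = cast(char[]) read(path) ~ '\0';

        // fromMBSz converts from the current ANSI code page to a
        // UTF-8 string; pass the pointer, not the array itself.
        return fromMBSz(cast(immutable(char)*) raw.ptr);
    }
}
```

Unlike the replacement-character approach, this preserves the actual
characters, but only if the file really is in the system's ANSI code page.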

If the encoding you're dealing with is not one that works with fromMBSz, and
it is not one of the few supported by std.encoding, then you're on your own.

- Jonathan M Davis


