std.file: read, readText and UTF-8 decoding

Jonathan M Davis newsgroup.d at jmdavisprog.com
Fri Sep 22 02:31:20 UTC 2023


On Thursday, September 21, 2023 9:29:17 AM MDT Uranuz via Digitalmars-d-learn 
wrote:
> Hello!
> I have some strange problem. I am trying to parse XML files and
> extract some information from it.
> I use library dxml for it by Jonathan M Davis. But I have a
> probleme that I have multiple  XML files made by different people
> around the world. Some of these files was created with Byte Order
> Mark, but some of them without BOM. dxml expects no BOM at the
> start of the string.
> At first I tried to read file with std.file.readText. Looks like
> it doesn't decode file at any way and doesn't remove BOM, so dxml
> failed to parse it then. This looks strange for me, because I
> expect that "text" function must decode data to UTF-8. Then I
> read that this behavior is documented at least:
> """
> ...However, no width or endian conversions are performed. So, if
> the width or endianness of the characters in the given file
> differ from the width or endianness of the element type of S,
> then validation will fail.
> """
> So that's OK. But I understood that this "readText" function is
> not useful for me.
> So I tried to use plain "read", which returns "void[]". The
> problem is that I still don't understand which method I should
> use to convert this to a string with proper UTF-8 decoding,
> remove the BOM, etc.
> Could you help me, please, and provide some clarification?
> P.S. The readText function looks odd in std.file, because you
> cannot specify an encoding for decoding the file, and the logic
> of how it decodes is unclear...

readText works great as long as you know that you're dealing with files with
a specific encoding and without a BOM (which is very often true when dealing
with text files on *nix systems where they're using UTF-8), but it's not so
great when you're reading files where you have no clue what their encoding
is going to be (and it's worse on Windows where they unfortunately are much
more likely to be UTF-16). Phobos does give you the tools to solve the
problem, but it doesn't currently make it as easy as it arguably should be.
std.encoding has the pieces that you're missing here.

https://dlang.org/phobos/std_encoding.html#BOM
https://dlang.org/phobos/std_encoding.html#getBOM

You'll need to do something like

    import std.encoding : BOM, getBOM;
    import std.file : read;

    auto data = read(file);
    immutable bom = getBOM(cast(ubyte[])data).schema;

to get the BOM. Then you can compare the BOM against BOM.utf8, BOM.utf16le,
etc. so that you know what type to cast the data array to (string, wstring,
etc.). Then you can remove the BOM with something like

    import std.range.primitives;
    import std.traits : isSomeChar;

    R stripBOM(R)(R range)
        if(isForwardRange!R && isSomeChar!(ElementType!R))
    {
        import std.utf : decodeFront, UseReplacementDchar;

        if(range.empty)
            return range;

        // Decode the first code point; if it's the BOM (U+FEFF), return
        // the range with it popped off. Otherwise, return the saved
        // original so that nothing is lost.
        auto orig = range.save;
        immutable c = range.decodeFront!(UseReplacementDchar.yes)();
        return c == '\uFEFF' ? range : orig;
    }
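
For example, purely as an illustration with string literals:

    assert(stripBOM("\uFEFFhello") == "hello");
    assert(stripBOM("hello") == "hello");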

And then you can either operate on the array with its current encoding
type, convert it to the desired string type (e.g. with std.conv.to!string),
or wrap it in a range that converts it as you parse (e.g. std.utf.byChar).
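
Putting those pieces together, a minimal sketch of a helper (readUTFText is
a hypothetical name; it assumes that BOM-less files are UTF-8 and does no
endian conversion, so UTF-16/32 data in the opposite byte order from the
machine's will still come out wrong) might look something like

    import std.conv : to;
    import std.encoding : BOM, getBOM;
    import std.file : read;

    // Hypothetical helper: read a file of unknown UTF encoding and return
    // its contents as UTF-8 with any BOM stripped. Assumes UTF-8 when
    // there is no BOM and does no byte swapping.
    string readUTFText(string file)
    {
        auto data = read(file);
        immutable bom = getBOM(cast(ubyte[]) data).schema;

        switch(bom)
        {
            case BOM.utf16le, BOM.utf16be:
                return (cast(wstring) data).stripBOM().to!string();
            case BOM.utf32le, BOM.utf32be:
                return (cast(dstring) data).stripBOM().to!string();
            default:
                // BOM.utf8, BOM.none, or anything unrecognized: treat the
                // bytes as UTF-8.
                return (cast(string) data).stripBOM();
        }
    }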

Alternatively, you can just read the very beginning of the file and grab the
BOM that way and then call readText with the correct type after you've
figured out the file's encoding.
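
As a rough sketch of that approach (again assuming UTF-8 when there's no
BOM, and with the same caveat that readText does no endian conversion):

    import std.conv : to;
    import std.encoding : BOM, getBOM;
    import std.file : readText;
    import std.stdio : File;

    // Peek at the first four bytes (enough for the UTF-8/16/32 BOMs)
    // instead of reading the whole file just to find the BOM.
    ubyte[4] start;
    immutable bom = getBOM(File(file, "rb").rawRead(start[])).schema;

    // Then instantiate readText with the matching string type and strip
    // the BOM afterwards, since readText leaves it in place.
    string text;
    if(bom == BOM.utf16le || bom == BOM.utf16be)
        text = readText!wstring(file).stripBOM().to!string();
    else if(bom == BOM.utf32le || bom == BOM.utf32be)
        text = readText!dstring(file).stripBOM().to!string();
    else
        text = readText(file).stripBOM();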

readText should currently handle the BOM correctly insofar as it checks (via
validation) whether you made the correct choice when you told it whether you
wanted a string, wstring, etc. However, since it reads in the entire file,
it's not a great plan to try it with each encoding in turn (catching the
exception each time) until you get the right one, and it doesn't strip the
BOM off for you either.

So, Phobos probably should get some new functionality to handle this better,
but it's at least possible to make it work with what's there.

- Jonathan M Davis
