UTF-8 requested. BOM is for UTF-16

Mon Sep 16 22:21:17 UTC 2019

On Monday, September 16, 2019 1:27:52 PM MDT Gregor Mückl via Digitalmars-d 
wrote:
> On Sunday, 15 September 2019 at 23:50:27 UTC, solidstate1991 wrote:
> > This is what I get when I try to run a unittest on one of my
> > projects.
> >
> > Here's the file that generates this error:
> > https://github.com/ZILtoid1991/pixelperfectengine/blob/master/pixelperfectengine/src/PixelPerfectEngine/graphics/extensions.d
>
> Since the Unicode BOM is a set of code points like any other,
> they can be expressed in UTF-8. There are the occasional text
> files that have a BOM even though they are encoded as UTF-8. I
> feel that readText is too strict in that case. It should just
> silently ignore the BOM and continue because it is just
> meaningless. Throwing an exception seems completely wrong to me.

The BOMs for UTF-16 and UTF-32 are not valid UTF-8. For example, this
program prints false for all four of those BOMs:

    import std.encoding;
    import std.stdio;
    import std.utf;

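    void main()
    {
        // Reinterpret each BOM's bytes as UTF-8 code units and validate them.
        foreach(i; [BOM.utf16be, BOM.utf16le, BOM.utf32be, BOM.utf32le])
            writeln(isValidUTF(cast(const char[])bomTable[i].sequence));
    }

    // Returns true if str is valid UTF-8, false if validate throws.
    bool isValidUTF(const(char)[] str)
    {
        try
        {
            validate(str);
            return true;
        }
        catch(UTFException)
            return false;
    }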

The BOM for UTF-8 is valid UTF-8, but the others aren't. And it would be
invalid to have a BOM for UTF-16 or UTF-32 and then treat the text as UTF-8
anyway. A program is free to ignore the BOM and then treat the rest of the
text in whatever encoding it wants, but if the BOM doesn't match the actual
encoding, then the text is invalid Unicode.

If you want to ignore the BOM, it's trivial enough to write a wrapper around
std.file.read which does that.
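
A minimal sketch of such a wrapper, assuming the rest of the file really is
UTF-8. The names skipBOM and readTextIgnoringBOM are made up for this example
and are not part of Phobos; getBOM, read, and validate are the existing
std.encoding, std.file, and std.utf functions.

```d
import std.encoding : getBOM;
import std.file : read;
import std.utf : validate;

/// Hypothetical helper: returns data without its leading BOM, if it has one.
const(ubyte)[] skipBOM(const(ubyte)[] data)
{
    // When no BOM is found, getBOM returns an entry with an empty sequence,
    // so the slice below leaves the data untouched.
    immutable bom = getBOM(data);
    return data[bom.sequence.length .. $];
}

/// Hypothetical wrapper around std.file.read: ignores any BOM and treats
/// whatever follows it as UTF-8.
string readTextIgnoringBOM(string filename)
{
    auto bytes = skipBOM(cast(const(ubyte)[]) read(filename));
    auto text = cast(string) bytes.idup;
    validate(text); // throws UTFException if the remainder is not valid UTF-8
    return text;
}
```

Note that this only helps when the BOM is wrong or redundant. If the file has
a UTF-16 BOM because it actually is UTF-16, validate will most likely still
throw.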

What Phobos is missing is a function akin to readText which looks at the BOM
and then converts the text to the requested encoding instead of throwing,
but that gets a bit more involved, and it probably wouldn't work with more
than UTF-8, UTF-16, and UTF-32. It also becomes problematic if you want it
to handle all of the cases where the text is UTF-16 or UTF-32 but doesn't
have a BOM, since that requires examining the text and guessing (as well as
potentially having to deal with reversing the endianness of the encoding).
Regardless, I definitely wouldn't expect a function like that to ignore the
BOM if it's present.
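
To give an idea of what that could look like, here is a rough sketch that
handles only UTF-8, UTF-16, and UTF-32, and assumes UTF-8 when there is no
BOM (so it does no guessing). decodeByBOM and transcode are hypothetical
names, not Phobos functions, and the sketch works on bytes that have already
been read, e.g. with std.file.read.

```d
import std.encoding : BOM, getBOM;
import std.exception : enforce;
import std.utf : toUTF8, validate;

/// Hypothetical helper: reassembles code units of type C (wchar or dchar)
/// from raw bytes in the given byte order and returns the text as UTF-8.
string transcode(C)(const(ubyte)[] data, bool bigEndian)
{
    enforce(data.length % C.sizeof == 0, "truncated code unit");
    auto units = new C[data.length / C.sizeof];
    foreach (i, ref unit; units)
    {
        uint value = 0;
        foreach (j; 0 .. C.sizeof)
        {
            // Bit offset of byte j within the code unit for this byte order.
            immutable shift = 8 * (bigEndian ? C.sizeof - 1 - j : j);
            value |= cast(uint) data[i * C.sizeof + j] << shift;
        }
        unit = cast(C) value;
    }
    validate(units); // throws UTFException on invalid UTF-16 / UTF-32
    return toUTF8(units);
}

/// Hypothetical helper: picks the decoder based on the BOM, then strips it.
string decodeByBOM(const(ubyte)[] data)
{
    immutable bom = getBOM(data);
    data = data[bom.sequence.length .. $];

    switch (bom.schema)
    {
        case BOM.none: // assumption: no BOM means UTF-8
        case BOM.utf8:
        {
            auto text = cast(string) data.idup;
            validate(text);
            return text;
        }
        case BOM.utf16be: return transcode!wchar(data, true);
        case BOM.utf16le: return transcode!wchar(data, false);
        case BOM.utf32be: return transcode!dchar(data, true);
        case BOM.utf32le: return transcode!dchar(data, false);
        default:
            throw new Exception("unsupported BOM");
    }
}
```

The byte-order handling is done by hand here to keep the example
self-contained; a real implementation would probably want something less
naive, and it would still have to decide what to do about BOM-less UTF-16
and UTF-32.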

- Jonathan M Davis