UTF-8 requested. BOM is for UTF-16
Gregor Mückl
gregormueckl at gmx.de
Tue Sep 17 11:00:28 UTC 2019
On Monday, 16 September 2019 at 22:21:17 UTC, Jonathan M Davis
wrote:
> The BOM for UTF-8 is valid UTF-8, but the others aren't. And it
> would be invalid to have a BOM for UTF-16 or UTF-32 and then
> treat the text as UTF-8 anyway. A program is free to ignore the
> BOM and then treat the rest of the text in whatever encoding it
> wants, but if the BOM doesn't match the actual encoding, then
> the text is invalid Unicode.
Ah, I misunderstood. Apologies for that. My initial mistake was
that I didn't quite understand that readText essentially ends up
checking actual BOM encodings as a sanity check. So, yes, Phobos
isn't breaking any standard here.
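To make the kind of sanity check I mean concrete, here is a minimal sketch in Python. It is not how Phobos implements readText; the names detect_bom and decode_checked are hypothetical and only illustrate rejecting input whose BOM belongs to a different encoding scheme than the one requested.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM starts with the same
# two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (scheme name, BOM length) or None if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


def decode_checked(data: bytes, requested: str = "utf-8") -> str:
    """Decode data as `requested`, refusing a BOM of another scheme."""
    found = detect_bom(data)
    if found is None:
        return data.decode(requested)
    name, length = found
    if name != requested:
        raise ValueError(f"{requested} requested, but BOM is for {name}")
    # The BOM matches the requested scheme: strip it and decode the rest.
    return data[length:].decode(requested)
```

With this sketch, UTF-8 input with or without a UTF-8 encoded BOM decodes normally, while input starting with FF FE raises an error when UTF-8 is requested.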
This is a point where adhering to Unicode terminology prevents
confusion. There is only one BOM. But it is selected specifically
so that it is encoded differently in each encoding scheme. To
make that work, the endianness-reversed decoded versions of the BOM
are defined to be illegal code points. It is entirely legal to
have UTF-8 *encoded* BOM at the start of a UTF-8 string, but that
is a sequence of bytes that is distinct from the UTF-16 little
endian *encoded* BOM, which is also distinct from a UTF-16 big
endian *encoded* BOM etc.
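As a small illustration (Python here, purely as a sketch using its standard codecs), this encodes the single code point U+FEFF in each encoding scheme and shows that the resulting byte sequences all differ. The last lines decode the UTF-16 little endian bytes with the wrong byte order, which yields U+FFFE.

```python
BOM = "\ufeff"  # the one and only byte order mark, U+FEFF

SCHEMES = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")

# Encode the same code point in every scheme.
encoded = {name: BOM.encode(name) for name in SCHEMES}

for name, raw in encoded.items():
    print(f"{name:10} {raw.hex(' ')}")
# utf-8      ef bb bf
# utf-16-le  ff fe
# utf-16-be  fe ff
# utf-32-le  ff fe 00 00
# utf-32-be  00 00 fe ff

# Reading the UTF-16 LE bytes with the opposite byte order gives
# U+FFFE, which Unicode reserves as a noncharacter.
swapped = encoded["utf-16-le"].decode("utf-16-be")
print(f"U+{ord(swapped):04X}")
# U+FFFE
```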