UTF-8 requested. BOM is for UTF-16

Gregor Mückl gregormueckl at gmx.de
Tue Sep 17 11:00:28 UTC 2019


On Monday, 16 September 2019 at 22:21:17 UTC, Jonathan M Davis 
wrote:
> The BOM for UTF-8 is valid UTF-8, but the others aren't. And it 
> would be invalid to have a BOM for UTF-16 or UTF-32 and then 
> treat the text as UTF-8 anyway. A program is free to ignore the 
> BOM and then treat the rest of the text in whatever encoding it 
> wants, but if the BOM doesn't match the actual encoding, then 
> the text is invalid Unicode.

Ah, I misunderstood. Apologies for that. My initial mistake was 
that I didn't quite understand that readText essentially ends up 
checking which encoding the BOM actually indicates as a sanity 
check. So, yes, Phobos isn't breaking any standard here.
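
To make the kind of sanity check I mean concrete, here is a minimal 
sketch in Python. It is not the Phobos implementation. The helper 
name read_as_utf8 is made up for illustration, and stripping a 
leading UTF-8 BOM is my own choice, not something readText is 
claimed to do.

```python
import codecs

# BOM byte sequences of the non-UTF-8 encoding schemes. UTF-32 has to be
# tested before UTF-16, because the UTF-32 LE BOM (FF FE 00 00) begins
# with the UTF-16 LE BOM (FF FE).
_FOREIGN_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def read_as_utf8(data: bytes) -> str:
    """Hypothetical helper: decode `data` as UTF-8, rejecting foreign BOMs."""
    for bom, name in _FOREIGN_BOMS:
        if data.startswith(bom):
            raise ValueError(f"UTF-8 requested, but BOM is for {name}")
    # Assumption for this sketch: a leading UTF-8 encoded BOM is dropped.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")
```

Input that starts with the bytes FF FE is rejected with an error 
naming UTF-16 LE, while input that starts with EF BB BF is decoded 
normally.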

This is a point where adhering to Unicode terminology prevents 
confusion. There is only one BOM, the code point U+FEFF. But it 
is selected specifically so that it is encoded differently in 
each encoding scheme. To make that work, the endianness-reversed 
decoded versions of the BOM are defined not to be valid 
characters (in UTF-16, U+FFFE is a noncharacter). It is entirely 
legal to have a UTF-8 *encoded* BOM at the start of a UTF-8 
string, but that is a sequence of bytes that is distinct from the 
UTF-16 little endian *encoded* BOM, which is also distinct from a 
UTF-16 big endian *encoded* BOM etc.
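
As a small illustration, the following Python sketch encodes the 
single BOM code point U+FEFF in each encoding scheme and then 
decodes the UTF-16 little endian bytes with the wrong byte order. 
The variable names are only for this example.

```python
# The one and only BOM: the code point U+FEFF.
BOM = "\ufeff"

# The same code point encoded in each encoding scheme. Explicit
# endianness is used so that Python does not add a BOM of its own.
encoded = {
    "utf-8": BOM.encode("utf-8"),          # EF BB BF
    "utf-16-le": BOM.encode("utf-16-le"),  # FF FE
    "utf-16-be": BOM.encode("utf-16-be"),  # FE FF
    "utf-32-le": BOM.encode("utf-32-le"),  # FF FE 00 00
    "utf-32-be": BOM.encode("utf-32-be"),  # 00 00 FE FF
}

for name, raw in encoded.items():
    print(f"{name:10} {raw.hex()}")

# Decoding the UTF-16 LE encoded BOM as if it were big endian yields
# U+FFFE, which Unicode defines as a noncharacter. That is why a
# decoder can tell from the first two bytes that the byte order is wrong.
swapped = BOM.encode("utf-16-le").decode("utf-16-be")
print(f"U+{ord(swapped):04X}")
```

All five byte sequences differ, which is why a reader can infer the 
encoding scheme from the encoded BOM alone.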


