First Impressions!

Thu Nov 30 18:20:00 UTC 2017

On Thursday, 30 November 2017 at 17:40:08 UTC, Jonathan M Davis 
wrote:
> English and thus don't as easily hit the cases where their code 
> is wrong. For better or worse, UTF-16 hides it better than 
> UTF-8, but the problem exists in both.
>

To give just one example of what can go wrong with UTF-16: reading 
a file in UTF-16 and converting it to something else like UTF-8 or 
UTF-32. You read block by block and hit an SMP (Supplementary 
Multilingual Plane) code point exactly at the buffer limit, with 
the high surrogate at the end of the first buffer and the low 
surrogate at the start of the next. If you don't think about it 
=> 2 invalid characters instead of your nice poop 💩 emoji 
character (emojis are in the SMP and they are more and more 
frequent).
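
To make the buffer-boundary case concrete, here is a minimal sketch. It is in Python rather than D, purely for illustration; the 4-byte buffer size and the names decode_naive and decode_stateful are made up for the example, not taken from any library.

```python
import codecs

text = "a\U0001F4A9"              # 'a' followed by the poop emoji (U+1F4A9, in the SMP)
data = text.encode("utf-16-le")   # bytes: 61 00 | 3D D8 (high) | A9 DC (low)

# Hypothetical 4-byte buffer: the split falls between the two surrogates.
blocks = [data[:4], data[4:]]

def decode_naive(blocks):
    # Each block is decoded on its own, so a surrogate pair split across
    # two blocks is seen as two separate, invalid halves.
    return "".join(b.decode("utf-16-le", errors="replace") for b in blocks)

def decode_stateful(blocks):
    # The incremental decoder holds on to the dangling high surrogate and
    # completes the pair when the next block arrives.
    dec = codecs.getincrementaldecoder("utf-16-le")(errors="strict")
    out = [dec.decode(b) for b in blocks]
    out.append(dec.decode(b"", final=True))  # flush; raises if the input is truncated
    return "".join(out)

naive = decode_naive(blocks)        # contains U+FFFD replacement characters
stateful = decode_stateful(blocks)  # the original text, emoji intact
```

The only difference between the two functions is whether decoder state survives from one block to the next. The same idea applies in any language: either carry the trailing high surrogate over to the next buffer, or use a decoder that does it for you.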