First Impressions!

Patrick Schluter Patrick.Schluter at bbox.fr
Fri Dec 1 06:07:07 UTC 2017


On Thursday, 30 November 2017 at 19:37:47 UTC, Steven 
Schveighoffer wrote:
> On 11/30/17 1:20 PM, Patrick Schluter wrote:
>> On Thursday, 30 November 2017 at 17:40:08 UTC, Jonathan M 
>> Davis wrote:
>>> English and thus don't as easily hit the cases where their 
>>> code is wrong. For better or worse, UTF-16 hides it better 
>>> than UTF-8, but the problem exists in both.
>>>
>> 
>> To give just an example of what can go wrong with UTF-16. 
>> Reading a file in UTF-16 and converting it to something else 
>> like UTF-8 or UTF-32. Reading block by block and hitting 
>> exactly an SMP codepoint at the buffer limit, high surrogate at 
>> the end of the first buffer, low surrogate at the start of the 
>> next. If you don't think about it => 2 invalid characters 
>> instead of your nice poop 💩 emoji character (emojis are in the 
>> SMP and they are more and more frequent).
>
> iopipe handles this: 
> http://schveiguy.github.io/iopipe/iopipe/textpipe/ensureDecodeable.html
>

It was only meant as an example. With UTF-8, people who implement 
the low-level code generally think about multi-code-unit 
sequences at the buffer boundary. With UTF-16 it's often 
forgotten. UTF-16 also has 2 other common pitfalls, which exist 
in UTF-8 too but are less consciously acknowledged: overlong 
encoding and isolated codepoints. So UTF-16 has the same issues 
as UTF-8, plus some more: endianness and size.
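
To make the buffer-boundary case concrete, here is a minimal Python 
sketch of block-wise UTF-16 decoding that holds back a trailing high 
surrogate (or a stray odd byte) until the next block arrives. The 
function name decode_utf16le_blocks is hypothetical, and the sketch 
assumes little-endian input without a BOM; it is an illustration of 
the idea, not how iopipe or any particular library does it.

```python
def decode_utf16le_blocks(blocks):
    """Yield text from UTF-16LE byte blocks without splitting surrogate pairs.

    Hypothetical helper: assumes little-endian data and no BOM.
    """
    pending = b""  # bytes held back from the previous block
    for block in blocks:
        data = pending + block
        # Only whole 16-bit code units can be decoded.
        usable = len(data) - (len(data) % 2)
        if usable >= 2:
            last = int.from_bytes(data[usable - 2:usable], "little")
            # A high surrogate (0xD800-0xDBFF) needs the low surrogate
            # that starts the next block, so keep it for later.
            if 0xD800 <= last <= 0xDBFF:
                usable -= 2
        pending = data[usable:]
        if usable:
            yield data[:usable].decode("utf-16-le", errors="replace")
    if pending:
        # Truncated input: whatever is left really is invalid.
        yield pending.decode("utf-16-le", errors="replace")


text = "a\U0001F4A9b"              # 'a', the poop emoji (SMP), 'b'
raw = text.encode("utf-16-le")     # 8 bytes; the emoji is a surrogate pair
blocks = [raw[:4], raw[4:]]        # the split lands between the surrogates

# Decoding each block on its own loses the emoji.
naive = "".join(b.decode("utf-16-le", errors="replace") for b in blocks)
# Carrying the high surrogate over keeps it intact.
careful = "".join(decode_utf16le_blocks(blocks))
```

Python's standard library covers the same ground with 
codecs.getincrementaldecoder("utf-16-le"), which keeps this state 
internally; the explicit version above only shows what such a decoder 
has to remember between blocks.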


