First Impressions!

A Guy With a Question aguywithanquestion at gmail.com
Fri Dec 1 12:21:22 UTC 2017


On Friday, 1 December 2017 at 06:07:07 UTC, Patrick Schluter 
wrote:
> On Thursday, 30 November 2017 at 19:37:47 UTC, Steven 
> Schveighoffer wrote:
>> On 11/30/17 1:20 PM, Patrick Schluter wrote:
>>> On Thursday, 30 November 2017 at 17:40:08 UTC, Jonathan M 
>>> Davis wrote:
>>>> English and thus don't as easily hit the cases where their 
>>>> code is wrong. For better or worse, UTF-16 hides it better 
>>>> than UTF-8, but the problem exists in both.
>>>>
>>> 
>>> To give just an example of what can go wrong with UTF-16. 
>>> Reading a file in UTF-16 and converting it to something else 
>>> like UTF-8 or UTF-32. Reading block by block and hitting 
>>> exactly a SMP codepoint at the buffer limit, high surrogate 
>>> at the end of the first buffer, low surrogate at the start of 
>>> the next. If you don't think about it => 2 invalid characters 
>>> instead of your nice poop 💩 emoji character (emojis are in 
>>> the SMP and they are more and more frequent).
>>
>> iopipe handles this: 
>> http://schveiguy.github.io/iopipe/iopipe/textpipe/ensureDecodeable.html
>>
>
> It was only to give an example. With UTF-8 people who implement 
> the low level code in general think about the multiple 
> codeunits at the buffer boundary. With UTF-16 it's often 
> forgotten. In UTF-16 there are also 2 other common pitfalls, 
> that exist also in UTF-8 but are less consciously acknowledged, 
> overlong encoding and isolated codepoints. So UTF-16 has the 
> same issues as UTF-8, plus some more, endianness and size.

Most problems with UTF-16 are applicable to UTF-8 as well. The only 
issue that isn't is that, if you are just dealing with ASCII, 
UTF-16 is a bit of a waste of space.
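
To make the buffer-boundary case quoted above concrete, here is a rough sketch in Python (not D, and not how iopipe does it). It assumes little-endian UTF-16 input without a BOM. The helper names decode_naive and decode_incremental are made up for illustration. The sketch splits an SMP character between two buffers, then compares decoding each buffer on its own with decoding through the standard library's incremental decoder, which holds on to a dangling high surrogate until the next buffer arrives.

```python
import codecs


def decode_naive(chunks):
    """Decode every buffer independently (hypothetical helper).

    A surrogate pair split across two buffers cannot be reassembled,
    so each half is replaced with U+FFFD.
    """
    return "".join(
        chunk.decode("utf-16-le", errors="replace") for chunk in chunks
    )


def decode_incremental(chunks):
    """Decode buffers with state carried across the boundary (hypothetical helper).

    The incremental decoder buffers an incomplete code unit sequence,
    such as a trailing high surrogate, and finishes it with the next chunk.
    """
    decoder = codecs.getincrementaldecoder("utf-16-le")(errors="replace")
    parts = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and reports truly truncated input.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


text = "a\U0001F4A9b"              # 'a', U+1F4A9 (an SMP code point), 'b'
data = text.encode("utf-16-le")    # 2 + 4 + 2 bytes
# Cut exactly between the high and the low surrogate.
chunks = [data[:4], data[4:]]

naive = decode_naive(chunks)
careful = decode_incremental(chunks)
```

Here careful equals the original text, while naive contains replacement characters where the emoji should be. The same pattern applies to UTF-8 by swapping the codec name, which is the point: the boundary problem exists in both encodings.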

