Of possible interest: fast UTF8 validation

Wed May 16 20:11:35 UTC 2018

On 5/16/18 1:18 PM, Joakim wrote:
> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky wrote:
>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei Alexandrescu wrote:
>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/ 
>>>>
>>>
>>> Sigh, this reminds me of the old quote about people spending a bunch 
>>> of time making more efficient what shouldn't be done at all.
>>
>> Validating UTF-8 is super common; most text protocols and files these 
>> days use it, and others have an option to do so.
>>
>> I’d like our validateUtf to be fast, since right now we do validation 
>> every time we decode a string. And THAT is slow. Trying to not validate 
>> on decode means most things should be validated on input...
> 
> I think you know what I'm referring to, which is that UTF-8 is a badly 
> designed format, not that input validation shouldn't be done.

I find this an interesting minority opinion, at least from the 
perspective of the circles I frequent, where UTF8 is unanimously 
heralded as a great design. Only a couple of weeks ago I saw Dylan 
Beattie give a very entertaining talk on exactly this topic: 
https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

If you could share some details on why you think UTF8 is badly designed 
and how you believe it could be/have been better, I'd be in your debt!

Andrei