Of possible interest: fast UTF8 validation

Joakim dlang at joakim.fea.st
Thu May 17 05:01:54 UTC 2018


On Wednesday, 16 May 2018 at 20:11:35 UTC, Andrei Alexandrescu 
wrote:
> On 5/16/18 1:18 PM, Joakim wrote:
>> On Wednesday, 16 May 2018 at 16:48:28 UTC, Dmitry Olshansky 
>> wrote:
>>> On Wednesday, 16 May 2018 at 15:48:09 UTC, Joakim wrote:
>>>> On Wednesday, 16 May 2018 at 11:18:54 UTC, Andrei 
>>>> Alexandrescu wrote:
>>>>> https://www.reddit.com/r/programming/comments/8js69n/validating_utf8_strings_using_as_little_as_07/
>>>>>
>>>>
>>>> Sigh, this reminds me of the old quote about people spending 
>>>> a bunch of time making more efficient what shouldn't be done 
>>>> at all.
>>>
>>> Validating UTF-8 is super common; most text protocols and 
>>> files these days would use it, and others would have an 
>>> option to do so.
>>>
>>> I’d like our validateUtf to be fast, since right now we do 
>>> validation every time we decode a string. And THAT is slow. 
>>> Trying not to validate on decode means most things should be 
>>> validated on input...
>> 
>> I think you know what I'm referring to, which is that UTF-8 is 
>> a badly designed format, not that input validation shouldn't 
>> be done.
>
> I find this an interesting minority opinion, at least from the 
> perspective of the circles I frequent, where UTF8 is 
> unanimously heralded as a great design. Only a couple of weeks 
> ago I saw Dylan Beattie give a very entertaining talk on 
> exactly this topic: 
> https://dotnext-piter.ru/en/2018/spb/talks/2rioyakmuakcak0euk0ww8/

Thanks for the link: I skipped ahead to the part about text 
encodings, and it should be fun to read the rest later.

> If you could share some details on why you think UTF8 is badly 
> designed and how you believe it could be/have been better, I'd 
> be in your debt!

Unicode standardized all the existing code pages and then added 
these new transfer formats, but I have long thought they would 
have been better off going with a header-based format that kept 
most languages in a single-byte scheme, as most already were, the 
obvious exception being the Asian CJK languages. That way, you 
optimize for the common string, ie one that contains a single 
language or at least no CJK, rather than pessimizing every 
non-ASCII language by doubling its character width, as UTF-8 
does. This UTF-8 issue was one of the first topics I raised in 
this forum, but as you noted at the time, nobody agreed, and I 
don't want to dredge all that up again.
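
To make the width point concrete, here is a small D snippet 
comparing byte counts (string.length) with decoded code-point 
counts (Phobos' walkLength); the sample strings are only 
illustrative:

import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // .length counts bytes; walkLength counts decoded code points.
    writeln("hello".length, " ", "hello".walkLength);   // 5 5
    writeln("привет".length, " ", "привет".walkLength); // 12 6: Cyrillic is 2 bytes/char in UTF-8
    writeln("日本語".length, " ", "日本語".walkLength);  // 9 3: BMP CJK is 3 bytes/char in UTF-8
}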

I have been researching this a bit since then, and the stated 
goals for UTF-8 at its inception were that it _could not overlap 
with ASCII anywhere when encoding other languages_, to avoid 
legacy software wrongly processing those languages as ASCII, and 
that it had to allow seeking from an arbitrary location within a 
byte stream:

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
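
As a rough illustration of that seeking property (a minimal 
sketch of my own, not something from that document): UTF-8 
continuation bytes always have the form 0b10xxxxxx, so from any 
byte offset you can scan forward past them to land on the next 
code-point boundary. The helper below is hypothetical, not part 
of Phobos:

// Hypothetical helper: resynchronize to the next code-point
// boundary starting from an arbitrary byte offset.
size_t nextBoundary(const(ubyte)[] buf, size_t i)
{
    // Continuation bytes match 0b10xxxxxx; lead and ASCII bytes do not.
    while (i < buf.length && (buf[i] & 0xC0) == 0x80)
        ++i;
    return i;
}

unittest
{
    auto bytes = cast(const(ubyte)[]) "aβc"; // 'β' encodes as 0xCE 0xB2
    assert(nextBoundary(bytes, 2) == 3); // offset 2 is mid-character; resync to 'c'
    assert(nextBoundary(bytes, 1) == 1); // offset 1 is already a lead byte
}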

I have no dispute with those priorities at the time: they were 
optimizing for the institutional and technical realities of 1992, 
as Dylan also notes, and UTF-8 is actually a nice hack given 
those constraints. What I question is whether those priorities 
are at all relevant today, when billions of smartphone users 
regularly use scripts other than ASCII, and the tech companies 
serving them are the largest private organizations on the planet, 
ie they have the resources to design a new transfer format. I see 
basically no relevance for the streaming requirement today, as I 
noted in this forum years ago, though I can see why it might have 
been considered important in the early '90s, before packet-based 
networking protocols had won.

I think a header-based scheme would be _much_ better today, and 
the reason I know Dmitry is aware of that is that I have 
discussed with him privately over email my plan to prototype such 
a format in D. Even if UTF-8 is already fairly widespread, 
something like that could be useful as a better intermediate 
format for string processing, and maybe someday it could replace 
UTF-8 too.
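
Purely as a speculative sketch of the idea, not the actual 
prototype, and with every name here made up for illustration, 
such a format might pair a script header with a single-byte 
payload:

// Made-up illustration only: a script header plus a single-byte
// payload. A real format would need a table of per-run headers to
// handle mixed-script strings, among many other things.
enum Script : ubyte { ascii, latin1, cyrillic, greek }

struct HeaderString
{
    Script script;               // header: which single-byte table applies
    immutable(ubyte)[] payload;  // one byte per character in that script

    // Character count is O(1), unlike counting code points in UTF-8.
    size_t length() const { return payload.length; }
}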

