Of possible interest: fast UTF8 validation

Thu May 17 17:16:03 UTC 2018

On 5/16/2018 10:01 PM, Joakim wrote:
> Unicode was a standardization of all the existing code pages and then added 
> these new transfer formats, but I have long thought that they'd have been better 
> off going with a header-based format that kept most languages in a single-byte 
> scheme, as they mostly were except for obviously the Asian CJK languages. That 
> way, you optimize for the common string, ie one that contains a single language 
> or at least no CJK, rather than pessimizing every non-ASCII language by doubling 
> its character width, as UTF-8 does. This UTF-8 issue is one of the first topics 
> I raised in this forum, but as you noted at the time nobody agreed and I don't 
> want to dredge that all up again.

It sounds like the main issue is that a header-based encoding would take less space?

If that's correct, then I hypothesize that adding an LZW compression layer on top 
of UTF-8 would achieve the same or better size reduction.
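
One rough way to probe that hypothesis is to compare the raw size of a text in a single-byte code page against the same text stored as UTF-8 and then run through LZW. The sketch below is only an illustration under several assumptions: the LZW routine is a hand-rolled textbook version (not any particular library's), the packed size is an estimate based on a growing code width, cp1251 stands in for the "header-based single-byte" scheme, and the sample text is an arbitrary, deliberately repetitive Russian sentence.

```python
def lzw_compress(data):
    """Textbook LZW over bytes: return the list of emitted dictionary codes."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    codes = []
    current = b""
    for value in data:
        candidate = current + bytes([value])
        if candidate in table:
            current = candidate            # keep extending the current match
        else:
            codes.append(table[current])   # emit the longest known prefix
            table[candidate] = len(table)  # learn the new phrase
            current = bytes([value])
    if current:
        codes.append(table[current])
    return codes


def lzw_size_bytes(data):
    """Rough packed size: code width grows with the dictionary, minimum 9 bits."""
    bits = 0
    entries = 256
    for _ in lzw_compress(data):
        bits += max(9, entries.bit_length())
        entries += 1
    return (bits + 7) // 8


# Hypothetical sample: a Russian pangram repeated, so it is highly compressible.
text = "Съешь же ещё этих мягких французских булок, да выпей чаю. " * 50

utf8 = text.encode("utf-8")          # two bytes per Cyrillic letter
single_byte = text.encode("cp1251")  # one byte per character

print("cp1251 raw:  ", len(single_byte))
print("UTF-8 raw:   ", len(utf8))
print("UTF-8 + LZW: ", lzw_size_bytes(utf8))
print("cp1251 + LZW:", lzw_size_bytes(single_byte))
```

Because the sample repeats one sentence, it flatters any dictionary coder; a fair test would use real documents of varying length, and short strings may well come out larger after compression.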