Of possible interest: fast UTF8 validation

Dmitry Olshansky dmitry.olsh at gmail.com
Thu May 17 17:26:04 UTC 2018


On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages 
>> and then added these new transfer formats, but I have long 
>> thought that they'd have been better off going with a 
>> header-based format that kept most languages in a single-byte 
>> scheme, as they mostly were except for obviously the Asian CJK 
>> languages. That way, you optimize for the common string, ie 
>> one that contains a single language or at least no CJK, rather 
>> than pessimizing every non-ASCII language by doubling its 
>> character width, as UTF-8 does. This UTF-8 issue is one of the 
>> first topics I raised in this forum, but as you noted at the 
>> time nobody agreed and I don't want to dredge that all up 
>> again.
>
> It sounds like the main issue is that a header based encoding 
> would take less size?
>
> If that's correct, then I hypothesize that adding an LZW 
> compression layer would achieve the same or better result.

Indeed, and there are other compression/deduplication options that 
would allow limited random access / slicing (e.g. by decoding a 
single “block” to access an element).
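For instance, a minimal sketch in D of the block idea (DEFLATE via 
std.zlib standing in for LZW, since Phobos has no LZW; the block 
size and names are mine):

    import std.algorithm : min;
    import std.string : representation;
    import std.zlib : compress, uncompress;

    // Compress fixed-size blocks independently, so reading one byte
    // only costs decompressing the block that contains it.
    struct BlockCompressed
    {
        enum blockSize = 4096; // uncompressed bytes per block
        ubyte[][] blocks;      // each block deflated on its own
        size_t length;         // total uncompressed length

        this(const(char)[] text)
        {
            auto raw = text.representation;
            length = raw.length;
            for (size_t i = 0; i < raw.length; i += blockSize)
                blocks ~= compress(raw[i .. min(i + blockSize, raw.length)]);
        }

        // O(blockSize) random access instead of decoding the whole string.
        ubyte opIndex(size_t i)
        {
            auto blk = cast(ubyte[]) uncompress(blocks[i / blockSize], blockSize);
            return blk[i % blockSize];
        }
    }

The catch: blocks split at byte offsets, so a multi-byte sequence 
can straddle two blocks and slicing still has to cope with that.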

Anything that depends on external information and is not 
self-synchronizing is awful for interchange. Internally an 
application can do some smarts, but even then things like interning 
(or partial interning) might be a more valuable approach. Relying 
on TCP being reliable just plain doesn’t cut it: corruption of a 
single bit is very real.
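To make the self-sync point concrete, a quick D sketch (helper 
names are mine): UTF-8 continuation bytes are always 10xxxxxx, so 
after a corrupt byte a decoder can scan to the next lead byte and 
lose at most one code point.

    // A UTF-8 continuation byte always matches the bit pattern 10xxxxxx.
    bool isContinuation(ubyte b)
    {
        return (b & 0xC0) == 0x80;
    }

    // Skip a bad byte plus any orphaned continuation bytes after it,
    // landing on the next plausible sequence start.
    size_t resync(const(ubyte)[] data, size_t i)
    {
        ++i; // step over the offending byte
        while (i < data.length && isContinuation(data[i]))
            ++i;
        return i;
    }

    unittest
    {
        auto s = cast(ubyte[]) "héllo".dup; // 'é' encodes as 0xC3 0xA9
        s[1] = 0xFF;                        // flip the lead byte of 'é'
        assert(resync(s, 1) == 3);          // lands on 'l'; only 'é' is lost
    }

With a header-based or otherwise stateful encoding, a single 
flipped byte in the header can garble everything after it.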


