Of possible interest: fast UTF8 validation

Joakim dlang at joakim.fea.st
Thu May 17 18:34:05 UTC 2018


On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages 
>> and then added these new transfer formats, but I have long 
>> thought that they'd have been better off going with a 
>> header-based format that kept most languages in a single-byte 
>> scheme, as they mostly were except for obviously the Asian CJK 
>> languages. That way, you optimize for the common string, ie 
>> one that contains a single language or at least no CJK, rather 
>> than pessimizing every non-ASCII language by doubling its 
>> character width, as UTF-8 does. This UTF-8 issue is one of the 
>> first topics I raised in this forum, but as you noted at the 
>> time nobody agreed and I don't want to dredge that all up 
>> again.
>
> It sounds like the main issue is that a header based encoding 
> would take less size?

Yes, and be easier to process.
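
To make the size point concrete, here is a toy D sketch, purely my own illustration rather than any proposed standard, comparing the UTF-8 byte count of a short Cyrillic string against a hypothetical scheme where a one-byte header selects a code page and each character then takes a single byte:

import std.stdio : writefln;
import std.string : representation;
import std.utf : count;

void main()
{
    // Russian sample text; every Cyrillic code point takes 2 bytes in UTF-8.
    string s = "привет мир";
    auto utf8Bytes = s.representation.length;

    // Hypothetical header-based scheme (my assumption, not a real spec):
    // a 1-byte tag selects a Cyrillic code page, then 1 byte per character.
    auto headerBytes = 1 + count(s);

    writefln("UTF-8: %s bytes, header scheme: %s bytes", utf8Bytes, headerBytes);
}

For that ten-character string, UTF-8 needs 19 bytes while the hypothetical header scheme needs 11, and the one-byte header cost is amortized over the whole string.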

> If that's correct, then I hypothesize that adding an LZW 
> compression layer would achieve the same or better result.

In general, you would be wrong; a carefully designed binary 
format will usually beat the pants off general-purpose 
compression:

https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results

Of course, that's because you can tailor your binary format for 
specific types of data, text in this case, and take advantage of 
patterns in that subset, just as specialized image compression 
formats do. That said, I haven't compared this scheme to 
general-purpose compression of UTF-8 strings, so I don't know 
which would compress better.
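
I haven't run those numbers, but if anyone wants to test Walter's hypothesis, a rough starting point in D could look like the following; the corpus, and whatever you use as the single-byte baseline, are up to whoever runs the experiment:

import std.stdio : writefln;
import std.string : representation;
import std.zlib : compress;

void main()
{
    // Short UTF-8 text where most code points take 2 bytes.
    string s = "привет мир, как дела";

    auto utf8 = s.representation;   // raw UTF-8 bytes
    auto zipped = compress(utf8);   // general-purpose DEFLATE via std.zlib

    writefln("raw UTF-8: %s bytes, deflated: %s bytes", utf8.length, zipped.length);
    // A real test would also deflate the hypothetical single-byte form and
    // use a realistically sized corpus; tiny strings compress poorly.
}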

However, that would mostly matter for network transmission; 
another big gain of a header-based scheme that doesn't use 
compression is much faster string processing in memory. Yes, the 
average end user doesn't care about this, but giant consumers of 
text data, like search engines, would benefit greatly from it.
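
For example, finding the Nth character of a UTF-8 string means scanning from the start, because code points are one to four bytes wide, whereas a fixed single-byte payload is a plain array index. A rough D illustration, with the ubyte[] merely standing in for the hypothetical single-byte payload:

import std.stdio : writefln;
import std.range : drop, front;

void main()
{
    string s = "привет мир";

    // UTF-8: reaching the Nth code point means walking from the start,
    // since each code point is 1 to 4 bytes wide.
    auto fifth = s.drop(4).front;      // O(n) scan over code points

    // A fixed single-byte payload (these bytes just stand in for the
    // hypothetical header-based encoding) allows O(1) indexing.
    ubyte[] singleByte = [0x70, 0x72, 0x69, 0x76, 0x65, 0x74];  // "privet"
    auto fifthByte = singleByte[4];    // direct array access

    writefln("UTF-8 walk: %s, single-byte index: %s", fifth, cast(char) fifthByte);
}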

On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> Indeed, and some other compression/deduplication options that 
> would allow limited random access / slicing (by decoding a 
> single “block” to access an element for instance).

Possibly competitive on compression, but only for transmission 
over the network; it's unlikely to help for in-memory processing, 
as noted for Walter's idea.

> Anything that depends on external information and is not 
> self-sync is awful for interchange.

You are describing the vast majority of all formats and 
protocols; it's amazing how we got by with them all this time.
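
For anyone unfamiliar with the term, self-synchronization just means a decoder can resynchronize after a bad byte by skipping ahead to the next lead byte, roughly like this:

import std.stdio : writeln;

/// Skip from a (possibly corrupted) position to the start of the next
/// UTF-8 code point, i.e. the next byte that is not a continuation byte
/// (10xxxxxx). This is the "self-sync" property under discussion.
size_t nextCodePoint(const(ubyte)[] buf, size_t i)
{
    do
    {
        ++i;
    } while (i < buf.length && (buf[i] & 0xC0) == 0x80);
    return i;
}

void main()
{
    auto bytes = cast(const(ubyte)[]) "aпb";  // 'п' takes 2 bytes in UTF-8
    writeln(nextCodePoint(bytes, 1));         // skips the continuation byte: 3
}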

> Internally the application can do some smarts though, but even 
> then things like interning (partial interning) might be more 
> valuable approach. TCP being reliable just plain doesn’t cut 
> it. Corruption of single bit is very real.

You seem to have missed my point entirely: UTF-8 will not catch 
most bit flips either; it only detects a flip that happens to 
corrupt certain key bits in a certain way, which is a minority of 
the possibilities. Nobody is arguing that data corruption doesn't 
happen or that some error correction shouldn't be done somewhere.
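
To illustrate, flip a single bit in an ASCII byte and you usually get another perfectly valid UTF-8 string, so validation sails right through; only flips that damage the lead/continuation bit patterns get caught. A quick D check:

import std.stdio : writefln;
import std.utf : validate, UTFException;

void main()
{
    // Flip one bit in an ASCII byte: 'h' (0x68) becomes 'j' (0x6A).
    ubyte[] buf = cast(ubyte[]) "hello".dup;
    buf[0] = cast(ubyte)(buf[0] ^ 0x02);

    auto s = cast(const(char)[]) buf;
    try
    {
        validate(s);   // passes: "jello" is still well-formed UTF-8
        writefln("still valid UTF-8: %s", s);
    }
    catch (UTFException e)
    {
        writefln("corruption detected: %s", e.msg);
    }
}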

The question is whether the extremely limited robustness that 
UTF-8's significant redundancy buys is a good tradeoff. I think 
it's obvious that it isn't, and I posit that anybody who knows 
anything about error-correcting codes would agree with that 
assessment. You would be much better off with a more compact 
header-based transfer format, layering on whatever level of error 
correction you need elsewhere, which as I noted is already done 
at the link and transport layers and in various other parts of 
the system.
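
By layering I mean something like the following sketch, with CRC-32 standing in for whatever check the link, transport, or application layer actually uses over the compact payload:

import std.stdio : writefln;
import std.digest.crc : crc32Of;
import std.digest : toHexString;
import std.string : representation;

void main()
{
    // The payload can be as compact as you like; integrity is handled by an
    // explicit checksum layered on top, not by redundancy baked into the
    // character encoding itself (the same idea TCP, Ethernet, zip, etc. use).
    string payload = "compact single-byte payload";
    auto checksum = crc32Of(payload.representation);

    auto hex = toHexString(checksum);
    writefln("payload: %s bytes, crc32: %s", payload.length, hex[]);
}

The point is that the integrity check is explicit and tunable, rather than an accidental side effect of how characters happen to be encoded.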

If you need more error correction than that, do it right, not in 
the broken way UTF-8 does it. Honestly, error detection/correction 
is the most laughably broken part of UTF-8; it is amazing that 
people even bring it up as a benefit.

