Of possible interest: fast UTF8 validation
Joakim
dlang at joakim.fea.st
Thu May 17 18:34:05 UTC 2018
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages
>> and then added these new transfer formats, but I have long
>> thought that they'd have been better off going with a
>> header-based format that kept most languages in a single-byte
>> scheme, as they mostly were except for obviously the Asian CJK
>> languages. That way, you optimize for the common string, ie
>> one that contains a single language or at least no CJK, rather
>> than pessimizing every non-ASCII language by doubling its
>> character width, as UTF-8 does. This UTF-8 issue is one of the
>> first topics I raised in this forum, but as you noted at the
>> time nobody agreed and I don't want to dredge that all up
>> again.
>
> It sounds like the main issue is that a header based encoding
> would take less size?
Yes, and be easier to process.
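
To make the idea concrete, here is a minimal D sketch of roughly
what I mean by a header-based string: a hypothetical code-page id
in the header, with the payload staying one byte per character
for any single non-CJK language. This is just an illustration,
not a worked-out proposal.

// A minimal sketch, not a spec: the header carries a hypothetical
// code-page/language id, and the payload stays one byte per
// character for any single-byte (non-CJK) language.
struct TaggedString
{
    ushort codePage;            // hypothetical language/code-page id
    immutable(ubyte)[] bytes;   // one byte per character

    // Character count and indexing are O(1), unlike variable-width UTF-8.
    size_t length() const { return bytes.length; }
    ubyte opIndex(size_t i) const { return bytes[i]; }
}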
> If that's correct, then I hypothesize that adding an LZW
> compression layer would achieve the same or better result.
In general, you would be wrong: a carefully designed binary
format will usually beat the pants off general-purpose
compression:
https://www.w3.org/TR/2009/WD-exi-evaluation-20090407/#compactness-results
Of course, that's because you can tailor your binary format for
specific types of data, text in this case, and take advantage of
patterns in that subset, just as specialized image compression
formats do. In this case, though, I haven't compared this scheme
to general compression of UTF-8 strings, so I don't know which
would compress better.
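
For a rough sense of the numbers, here's the kind of small D
comparison I'd run (the Greek sample string is just an
illustration, and on something this short zlib's own header
overhead may well make it larger than the input):

import std.range : walkLength;
import std.stdio : writefln;
import std.zlib : compress;

void main()
{
    // Greek text: every letter takes two bytes in UTF-8, but would
    // take one byte in a hypothetical single-byte Greek code page.
    string s = "καλημέρα κόσμε";
    auto zipped = compress(s);
    writefln("UTF-8: %s bytes, zlib over UTF-8: %s bytes, single-byte estimate: %s bytes",
             s.length, zipped.length, s.walkLength);
}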
However, that would mostly matter for network transmission;
another big gain of a header-based scheme that doesn't use
compression is much faster string processing in memory. Yes, the
average end user doesn't care about this, but giant consumers of
text data, like search engines, would benefit greatly from it.
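
As a concrete illustration of the processing cost: fetching the
n-th character of a UTF-8 string means walking it from the start,
while a fixed single-byte scheme is a direct array index. A D
sketch (the helper name is mine):

import std.stdio : writeln;
import std.utf : decode, stride;

// Hypothetical helper: fetch the index-th code point of a UTF-8
// string. It has to walk the string from the start, O(n), whereas
// a fixed single-byte scheme could just return s[index] in O(1).
dchar nthCodePoint(string s, size_t index)
{
    size_t pos = 0;
    foreach (_; 0 .. index)
        pos += stride(s, pos);   // skip one whole code point
    return decode(s, pos);       // decode the code point at pos
}

void main()
{
    writeln(nthCodePoint("Привет, мир", 3)); // 'в'
}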
On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky wrote:
> Indeed, and some other compression/deduplication options that
> would allow limited random access / slicing (by decoding a
> single “block” to access an element for instance).
Possibly competitive on compression for transmission over the
network, but unlikely to help for processing, as noted above for
Walter's idea.
> Anything that depends on external information and is not
> self-sync is awful for interchange.
You are describing the vast majority of all formats and
protocols; amazing how we got by with them all this time.
> Internally the application can do some smarts though, but even
> then things like interning (partial interning) might be more
> valuable approach. TCP being reliable just plain doesn’t cut
> it. Corruption of single bit is very real.
You seem to have missed my point entirely: UTF-8 will not catch
most bit flips either. It only detects corruption when the flip
happens to hit certain key bits in a certain way, which is a
minority of the possibilities. Nobody is arguing that data
corruption doesn't happen or that error correction shouldn't be
done somewhere.
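
To spell that out with a concrete case (the byte values are just
an example): flip the low bit of the continuation byte of 'é' and
you get another perfectly well-formed UTF-8 sequence, so
validation accepts both.

import std.stdio : writeln;
import std.utf : validate;

void main()
{
    // 'é' is 0xC3 0xA9 in UTF-8; flipping the low bit of the
    // continuation byte gives 0xC3 0xA8, which is 'è': different
    // text, still perfectly well-formed.
    immutable(ubyte)[] originalBytes  = [0xC3, 0xA9];
    immutable(ubyte)[] corruptedBytes = [0xC3, 0xA8];
    validate(cast(string) originalBytes);   // passes
    validate(cast(string) corruptedBytes);  // also passes: the flip goes undetected
    writeln("both sequences validate as well-formed UTF-8");
}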
The question is whether the extremely limited robustness that
UTF-8 gains from its significant redundancy is a good tradeoff. I
think it's obvious that it isn't, and I posit that anybody who
knows anything about error-correcting codes would agree with that
assessment. You would be much better off with a more compact
header-based transfer format, layering on whatever error
correction you need at a different level, which, as I noted, is
already done at the link and transport layers and in various
other parts of the system.
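
As one sketch of what layering it on at a different level can
look like (the checksum choice here is arbitrary, purely for
illustration): a plain CRC32 over the raw bytes catches the
single-bit flip from the example above, the one UTF-8 validation
waves through.

import std.digest.crc : crc32Of;
import std.stdio : writeln;

void main()
{
    // Same byte pair as before: one bit flipped, still valid UTF-8.
    immutable(ubyte)[] payload   = [0xC3, 0xA9]; // 'é'
    immutable(ubyte)[] corrupted = [0xC3, 0xA8]; // 'è'
    // A checksum carried at another layer detects the corruption trivially.
    writeln(crc32Of(payload) == crc32Of(corrupted)); // false
}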
If you need more error correction than that, do it right, not in
the broken way UTF-8 does. Honestly, error detection/correction
is the most laughably broken part of UTF-8; it is amazing that
people even bring it up as a benefit.