Of possible interest: fast UTF8 validation
Dmitry Olshansky
dmitry.olsh at gmail.com
Thu May 17 17:26:04 UTC 2018
On Thursday, 17 May 2018 at 17:16:03 UTC, Walter Bright wrote:
> On 5/16/2018 10:01 PM, Joakim wrote:
>> Unicode was a standardization of all the existing code pages
>> and then added these new transfer formats, but I have long
>> thought that they'd have been better off going with a
>> header-based format that kept most languages in a single-byte
>> scheme, as they mostly were, except of course for the Asian
>> CJK languages. That way, you optimize for the common string,
>> i.e. one that contains a single language or at least no CJK,
>> rather than pessimizing every non-ASCII language by doubling
>> its character width, as UTF-8 does. This UTF-8 issue is one
>> of the first topics I raised in this forum, but as you noted
>> at the time nobody agreed and I don't want to dredge that all
>> up again.
>
> It sounds like the main issue is that a header-based encoding
> would take less space?
>
> If that's correct, then I hypothesize that adding an LZW
> compression layer would achieve the same or better result.
Indeed, and there are other compression/deduplication options
that would allow limited random access / slicing (by decoding a
single “block” to access an element, for instance).
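To make that concrete, here is a rough sketch of block-wise
compression with random access, in D. I'm using std.zlib's
deflate where LZW (or anything else) could sit; the struct, the
names and the block size are purely illustrative:

import std.zlib : compress, uncompress;

enum blockSize = 4096; // uncompressed chunk size, arbitrary

struct BlockedText
{
    ubyte[][] blocks; // each chunk compressed independently
    size_t length;    // total length in code units

    this(const(char)[] text)
    {
        length = text.length;
        for (size_t i = 0; i < text.length; i += blockSize)
        {
            auto end = i + blockSize < text.length
                ? i + blockSize : text.length;
            blocks ~= cast(ubyte[]) compress(text[i .. end]);
        }
    }

    // Random access: inflate only the block that holds index i,
    // never the whole string.
    char opIndex(size_t i)
    {
        auto chunk = cast(char[])
            uncompress(blocks[i / blockSize], blockSize);
        return chunk[i % blockSize];
    }
}

Slicing across a block boundary means stitching two decoded
chunks together, and the win depends on the block-size trade-off,
but at least the result is self-contained, with no out-of-band
header to lose.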
Anything that depends on external information and is not
self-synchronizing is awful for interchange. Internally an
application can do some smarts, but even then things like
interning (or partial interning) might be a more valuable
approach. Relying on TCP being reliable just plain doesn’t cut
it: corruption of a single bit is very real.
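That is exactly what UTF-8's self-synchronization buys you:
continuation bytes always look like 10xxxxxx, so a decoder that
hits a corrupted byte can skip to the next lead byte and resume,
losing at most a code point or two. With a header-based encoding,
one flipped bit in the header can garble everything after it. A
minimal sketch (the function names are mine, not anything from
the standard):

bool isContinuation(ubyte b)
{
    return (b & 0xC0) == 0x80; // continuation bytes: 10xxxxxx
}

size_t resync(const(ubyte)[] data, size_t pos)
{
    // Skip the bad byte, then any continuation bytes, landing on
    // the start of the next code point (or the end of the input).
    ++pos;
    while (pos < data.length && isContinuation(data[pos]))
        ++pos;
    return pos;
}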