Of possible interest: fast UTF8 validation
Walter Bright
newshound2 at digitalmars.com
Thu May 17 17:16:03 UTC 2018
On 5/16/2018 10:01 PM, Joakim wrote:
> Unicode was a standardization of all the existing code pages and then added
> these new transfer formats, but I have long thought that they'd have been better
> off going with a header-based format that kept most languages in a single-byte
> scheme, as they mostly were except for obviously the Asian CJK languages. That
> way, you optimize for the common string, ie one that contains a single language
> or at least no CJK, rather than pessimizing every non-ASCII language by doubling
> its character width, as UTF-8 does. This UTF-8 issue is one of the first topics
> I raised in this forum, but as you noted at the time nobody agreed and I don't
> want to dredge that all up again.
It sounds like the main issue is that a header-based encoding would take less space?
If that's correct, then I hypothesize that adding an LZW compression layer would
achieve the same or better result.
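
One rough way to probe that hypothesis is to encode the same text both ways and compare sizes before and after a textbook LZW pass. The sketch below is only an illustration under stated assumptions: Windows-1251 (`cp1251`) stands in for a header-based single-byte scheme, the sample sentence is repeated, which flatters any dictionary compressor, and the helper names (`lzw_compress`, `lzw_decompress`, `packed_size_bytes`, `compare_sizes`) are made up for this example. The packed size is an estimate based on growing code widths, not a real container format.

```python
def lzw_compress(data: bytes) -> list[int]:
    """Textbook LZW: emit one integer code per longest known phrase."""
    table = {bytes([i]): i for i in range(256)}
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            codes.append(table[current])
            table[candidate] = next_code
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes: list[int]) -> bytes:
    """Inverse of lzw_compress, rebuilding the phrase table on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # Phrase that the encoder defined and used in the same step.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(out)


def packed_size_bytes(codes: list[int]) -> int:
    """Estimated size if each code used just enough bits for the table so far."""
    bits = sum(max(9, (256 + i).bit_length()) for i in range(len(codes)))
    return (bits + 7) // 8


def compare_sizes(text: str, single_byte_codec: str = "cp1251") -> dict[str, int]:
    """Raw and LZW-estimated sizes for UTF-8 and a single-byte code page."""
    utf8 = text.encode("utf-8")
    single = text.encode(single_byte_codec)
    return {
        "utf8_raw": len(utf8),
        "single_raw": len(single),
        "utf8_lzw": packed_size_bytes(lzw_compress(utf8)),
        "single_lzw": packed_size_bytes(lzw_compress(single)),
    }


# Assumed sample: a Russian pangram repeated 50 times (highly repetitive).
SAMPLE = "Съешь же ещё этих мягких французских булок, да выпей чаю. " * 50

if __name__ == "__main__":
    for name, size in compare_sizes(SAMPLE).items():
        print(f"{name}: {size} bytes")
```

On this artificial sample the LZW-packed UTF-8 comes out smaller than the uncompressed single-byte encoding, but that says little about short strings or real prose, where the dictionary has less to work with. Note also that compressing the single-byte form shrinks it too, so the comparison that matters depends on whether both sides are allowed a compression layer.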