Of possible interest: fast UTF8 validation

Joakim dlang at joakim.fea.st
Thu May 17 15:16:19 UTC 2018


On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter wrote:
> This is not practical, sorry. What happens when your message 
> loses the header? Exactly, the rest of the message is garbled.

Why would it lose the header? TCP guarantees delivery and 
checksums the data; that's effective enough at the transport 
layer.

I agree that UTF-8 is a more redundant format, as others have 
mentioned earlier, and is thus more robust to certain types of 
data loss than a header-based scheme. However, I don't consider 
that the job of the text format; it's better done by other 
layers, like transport protocols or filesystems, which guard 
against such losses much more reliably and efficiently.

For example, a random bitflip somewhere in the middle of a UTF-8 
string will not be detectable most of the time. However, more 
robust error-correcting schemes at other layers of the system 
will easily catch that.
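
To make that concrete, here is a small D sketch (the particular 
bytes and the use of CRC32 are just illustrative choices on my 
part): a single bitflip in a continuation byte can still yield 
perfectly valid UTF-8, while a checksum at another layer notices 
immediately.

import std.digest.crc : crc32Of;
import std.stdio;
import std.utf : UTFException, validate;

void main()
{
    // "café" in UTF-8; 0xC3 0xA9 encodes 'é'
    ubyte[] original = [0x63, 0x61, 0x66, 0xC3, 0xA9];
    ubyte[] flipped = original.dup;
    flipped[4] = cast(ubyte)(flipped[4] ^ 0x01); // 0xA9 -> 0xA8 ('è')

    auto text = cast(string) flipped.idup;
    try
    {
        validate(text); // passes: the corrupted bytes are still valid UTF-8
        writeln("still valid UTF-8: ", text); // prints "cafè"
    }
    catch (UTFException)
    {
        writeln("caught by the text format");
    }

    // A checksum at another layer catches the corruption at once.
    writeln(crc32Of(original) == crc32Of(flipped)); // false
}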

> That's exactly what happens with code-page-based texts when 
> you don't know which code page they are encoded in. It has the 
> additional inconvenience that mixing languages becomes 
> impossible, or at least very cumbersome.
> UTF-8 has several properties that are difficult to get with 
> other schemes.
> - It is stateless, meaning any byte in a stream always means 
> the same thing. Its meaning does not depend on external state 
> or on a previous byte.

I realize this was considered important at one time, but I think 
it has proven to be a bad design decision, for HTTP too. There 
are some advantages when building rudimentary systems with crude 
hardware and lots of noise, as was the case back then, but that's 
not the tech world we live in today. That's why almost every HTTP 
request today is part of a stateful session that explicitly keeps 
track of the connection, whether through cookies, HTTPS 
encryption, or HTTP/2.
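
To be clear about what that property buys: lead bytes and 
continuation bytes occupy disjoint ranges, so a decoder can 
classify any byte in isolation and resynchronize from an 
arbitrary offset. A minimal sketch in D, with helper names I 
made up for illustration:

// Continuation bytes always match the bit pattern 0b10xxxxxx, so
// any byte can be classified without looking at its neighbours.
bool isContinuation(ubyte b)
{
    return (b & 0xC0) == 0x80;
}

// Resynchronize from an arbitrary offset by skipping forward to
// the next character boundary.
size_t nextBoundary(const(ubyte)[] data, size_t i)
{
    while (i < data.length && isContinuation(data[i]))
        ++i;
    return i;
}

unittest
{
    auto s = cast(const(ubyte)[]) "héllo"; // 'é' is 0xC3 0xA9
    assert(nextBoundary(s, 2) == 3);       // offset 2 is mid-character
}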

> - It can mix any language in the same stream without 
> acrobatics, and anyone who thinks that mixing languages doesn't 
> happen often should get his head extracted from his rear, 
> because it is very common (check Wikipedia's front page, for 
> example).

I question whether anybody really needs to mix "streams." As 
for messages or files, headers handle mixing multiple languages 
easily, as noted in that earlier thread.
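
Purely to illustrate what I mean by a header, and not anything 
spelled out in that thread, a hypothetical layout might tag each 
run of text with its language; the names and field sizes below 
are made up for this sketch:

// Hypothetical header-based layout, only to illustrate the idea.
struct TextRun
{
    ushort languageTag; // e.g. an ISO 639 code mapped to a number
    ubyte[] bytes;      // single-byte text in that language's table
}

// A message mixing languages is just a sequence of tagged runs.
alias Message = TextRun[];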

> - The multi-byte nature of other alphabets is not as bad as 
> people think, because text on a computer does not live on its 
> own: it is generally embedded inside file formats, which more 
> often than not are extremely bloated (XML, HTML, XLIFF, Akoma 
> Ntoso, RTF, etc.). The few extra bytes in the text do not weigh 
> that much.

Heh, the other parts of the tech stack are much more bloated, so 
this bloat is okay? A unique argument, but I'd argue that's why 
those bloated formats you mention are largely dying off too.

> I'm in charge at the European Commission of the biggest 
> translation memory in the world. It currently handles 30 
> languages, and without UTF-8 and UTF-16 it would be 
> unmanageable. I still remember when I started there in 2002, 
> when we handled only 11 languages, of which only one used 
> another alphabet (Greek). Everything was based on RTF with 
> codepages and it was a braindead mess. My first job in 2003 was 
> to extend the system to handle the 8 newcomer languages, and 
> with ASCII-based encodings it was completely unmanageable, 
> because every document processed mixes languages and alphabets 
> freely (addresses and names are often written in their original 
> form, for instance).

I have no idea what a "translation memory" is. I don't doubt 
that dealing with non-standard codepages or layouts was 
difficult, and that a standard like Unicode made your life 
easier. But the question isn't whether standards would clean 
things up (of course they would); the question is whether a 
hypothetical header-based standard would be better than the 
current continuation-byte standard, UTF-8. I think your life 
would've been even easier with the former, though depending on 
your usage, maybe the main gain for you came simply from 
standardization.

> Two years ago we also implemented support for Chinese. The 
> nice thing was that we didn't have to change much to do that, 
> thanks to Unicode. The second surprise was the file sizes: 
> Chinese documents were generally smaller than their European 
> counterparts. Yes, CJK requires 3 bytes for each ideogram, but 
> generally one ideogram replaces many letters. The ideogram 亿 
> replaces "One hundred million", for example; which of them 
> takes more bytes? So if CJK indeed requires more bytes to 
> encode, it is firstly because it NEEDS many more bits in the 
> first place (there are around 30000 CJK codepoints in the BMP 
> alone; add to them the 60000 that are in the SIP and we need 
> 17 bits just to encode them).

That's not the relevant criterion: nobody cares whether the CJK 
documents were smaller than their European counterparts. What 
they care about is that, given a different transfer format, the 
CJK document could have been significantly smaller still. Almost 
nobody cares which translation is smaller; they care that the 
text they send in Chinese or Korean is as small as it can be.
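
As a quick check on the arithmetic, here is a small D sketch, 
using UTF-16 merely as one example of a denser transfer format 
for BMP ideograms:

import std.stdio;
import std.utf : toUTF16;

void main()
{
    string ideogram = "亿";                // U+4EBF
    string phrase = "One hundred million";

    writeln(ideogram.length);              // 3 bytes as UTF-8
    writeln(phrase.length);                // 19 bytes as ASCII/UTF-8
    writeln(toUTF16(ideogram).length * 2); // 2 bytes as UTF-16
}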

Anyway, I didn't mean to restart this debate, so I'll leave it 
here.

