Of possible interest: fast UTF8 validation

Patrick Schluter Patrick.Schluter at bbox.fr
Thu May 17 19:13:23 UTC 2018


On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
> On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter 
> wrote:
>> This is not practical, sorry. What happens when your message 
>> loses the header? Exactly, the rest of the message is garbled.
>
> Why would it lose the header? TCP guarantees delivery and 
> checksums the data, that's effective enough at the transport 
> layer.

What does TCP/IP have to do with anything in this discussion? 
UTF-8 (or UTF-16 or UTF-32) has nothing to do with network 
protocols; that's completely unrelated. A file encoded on a disk 
may never leave the machine it was written on and may never see 
a wire in its lifetime, and its encoding is still of vital 
importance. That's why a header-based encoding is too restrictive.

>
> I agree that UTF-8 is a more redundant format, as others have 
> mentioned earlier, and is thus more robust to certain types of 
> data loss than a header-based scheme. However, I don't consider 
> that the job of the text format, it's better done by other 
> layers, like transport protocols or filesystems, which will 
> guard against such losses much more reliably and efficiently.

No. A text format cannot depend on a network protocol. It would 
be as if you could only listen to music or watch a video as a 
stream and never save it to a file offline, because nothing in 
that blob of bytes would say what it represents. It doesn't make 
any sense.

>
> For example, a random bitflip somewhere in the middle of a 
> UTF-8 string will not be detectable most of the time. However, 
> more robust error-correcting schemes at other layers of the 
> system will easily catch that.

That's the job of the other layers. Any other file format would 
have the same problem. At least with UTF-8 there will only ever 
be at most 1 codepoint lost or changed; any other encoding would 
fare worse. That said, if a checksum for your document is 
important, you can add it externally anyway.


>
>> That's exactly what happened with code page based texts when 
>> you don't know in which code page it is encoded. It has the 
>> supplemental inconvenience that mixing languages becomes 
>> impossible or at least very cumbersome.
>> UTF-8 has several properties that are difficult to have with 
>> other schemes.
>> - It is stateless, meaning any byte in a stream always means 
>> the same thing. Its meaning does not depend on external state 
>> or on a previous byte.
>
> I realize this was considered important at one time, but I 
> think it has proven to be a bad design decision, for HTTP too. 
> There are some advantages when building rudimentary systems 
> with crude hardware and lots of noise, as was the case back 
> then, but that's not the tech world we live in today. That's 
> why almost every HTTP request today is part of a stateful 
> session that explicitly keeps track of the connection, whether 
> through cookies, https encryption, or HTTP/2.

Again, orthogonal to UTF-8. When I speak above of streams, that 
is not limited to sockets; files are also read as streams. So 
stop equating UTF-8 with the Internet, these are 2 different 
domains. The Internet and its protocols were defined and invented 
long before Unicode, and Unicode is also very useful offline.
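To illustrate what stateless means in practice (again just a 
small sketch of mine, not anything from the fast-validation 
code): every byte classifies itself from its own top bits, so 
you can count codepoints without any decoder state and without 
knowing where the stream started.

import std.stdio;

// Each byte classifies itself (0xxxxxxx ASCII, 110/1110/11110
// lead bytes, 10xxxxxx continuations), so counting codepoints
// needs no decoder state and no header.
size_t countCodePoints(const(ubyte)[] bytes)
{
    size_t n = 0;
    foreach (b; bytes)
        if ((b & 0xC0) != 0x80) // every non-continuation byte starts a codepoint
            ++n;
    return n;
}

void main()
{
    auto s = cast(const(ubyte)[]) "Beschluss der EFTA-Überwachungsbehörde";
    writeln(s.length, " bytes, ", countCodePoints(s), " codepoints"); // 40 bytes, 38 codepoints
}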

>> - It can mix any language in the same stream without 
>> acrobatics, and anyone who thinks that mixing languages 
>> doesn't happen often should get his head extracted from his 
>> rear, because it is very common (check Wikipedia's front page 
>> for example).
>
> I question that almost anybody needs to mix "streams." As for 
> messages or files, headers handle multiple language mixing 
> easily, as noted in that earlier thread.

Ok, show me how you transmit that, I'm curious:

<prop type="Txt::Doc. No.">E2010C0002</prop>
<tuv lang="EN-GB">
<seg>EFTA Surveillance Authority Decision</seg>
</tuv>
<tuv lang="DE-DE">
<seg>Beschluss der EFTA-Überwachungsbehörde</seg>
</tuv>
<tuv lang="DA-01">
<seg>EFTA-Tilsynsmyndighedens beslutning</seg>
</tuv>
<tuv lang="EL-01">
<seg>Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ</seg>
</tuv>
<tuv lang="ES-ES">
<seg>Decisión del Órgano de Vigilancia de la AELC</seg>
</tuv>
<tuv lang="FI-01">
<seg>EFTAn valvontaviranomaisen päätös</seg>
</tuv>
<tuv lang="FR-FR">
<seg>Décision de l'Autorité de surveillance AELE</seg>
</tuv>
<tuv lang="IT-IT">
<seg>Decisione dell’Autorità di vigilanza EFTA</seg>
</tuv>
<tuv lang="NL-NL">
<seg>Besluit van de Toezichthoudende Autoriteit van de EVA</seg>
</tuv>
<tuv lang="PT-PT">
<seg>Decisão do Órgão de Fiscalização da EFTA</seg>
</tuv>
<tuv lang="SV-SE">
<seg>Beslut av Eftas övervakningsmyndighet</seg>
</tuv>
<tuv lang="LV-01">
<seg>EBTA Uzraudzības iestādes Lēmums</seg>
</tuv>
<tuv lang="CS-01">
<seg>Rozhodnutí Kontrolního úřadu ESVO</seg>
</tuv>
<tuv lang="ET-01">
<seg>EFTA järelevalveameti otsus</seg>
</tuv>
<tuv lang="PL-01">
<seg>Decyzja Urzędu Nadzoru EFTA</seg>
</tuv>
<tuv lang="SL-01">
<seg>Odločba Nadzornega organa EFTE</seg>
</tuv>
<tuv lang="LT-01">
<seg>ELPA priežiūros institucijos sprendimas</seg>
</tuv>
<tuv lang="MT-01">
<seg>Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA</seg>
</tuv>
<tuv lang="SK-01">
<seg>Rozhodnutie Dozorného orgánu EZVO</seg>
</tuv>
<tuv lang="BG-01">
<seg>Решение на Надзорния орган на ЕАСТ</seg>
</tuv>
</tu>
<tu>
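
And to show that the mixing itself needs no header at all, a 
small D sketch (illustrative; the strings are segments from the 
snippet above): Latin, Greek and Cyrillic concatenate into one 
plain UTF-8 string that validates as-is.

import std.stdio;
import std.utf : validate;

void main()
{
    // Latin, Greek and Cyrillic segments concatenated into one
    // plain UTF-8 string; no per-segment encoding declaration is
    // needed anywhere.
    string mixed = "EFTA Surveillance Authority Decision | "
                 ~ "Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ | "
                 ~ "Решение на Надзорния орган на ЕАСТ";

    validate(mixed); // std.utf.validate throws if the bytes are not well-formed UTF-8
    writeln(mixed.length, " bytes of plain UTF-8, no header required");
}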


>
>> - The multi byte nature of other alphabets is not as bad as 
>> people think because texts in computer do not live on their 
>> own, meaning that they are generally embedded inside file 
>> formats, which more often than not are extremely bloated (xml, 
>> html, xliff, akoma ntoso, rtf etc.). The few bytes more in the 
>> text do not weigh that much.
>
> Heh, the other parts of the tech stack are much more bloated, 
> so this bloat is okay? A unique argument, but I'd argue that's 
> why those bloated formats you mention are largely dying off too.

They aren't dying off, it's getting worse by the day; that's why 
I mentioned Akoma Ntoso and XLIFF, which will be used more and 
more. The world is not limited to webshit (see n-gate.com for 
the reference).

>
>> I'm in charge at the European Commission of the biggest 
>> translation memory in the world. It handles currently 30 
>> languages and without UTF-8 and UTF-16 it would be 
>> unmanageable. I still remember when I started there in 2002 
>> when we handled only 11 languages of which only 1 was of 
>> another alphabet (Greek). Everything was based on RTF with 
>> codepages and it was a braindead mess. My first job in 2003 
>> was to extend the system to handle the 8 newcomer languages 
>> and with ASCII based encodings it was completely unmanageable 
>> because every document processed mixes languages and alphabets 
>> freely (addresses and names are often written in their 
>> original form for instance).
>
> I have no idea what a "translation memory" is. I don't doubt 
> that dealing with non-standard codepages or layouts was 
> difficult, and that a standard like Unicode made your life 
> easier. But the question isn't whether standards would clean 
> things up, of course they would, the question is whether a 
> hypothetical header-based standard would be better than the 
> current continuation byte standard, UTF-8. I think your life 
> would've been even easier with the former, though depending on 
> your usage, maybe the main gain for you would be just from 
> standardization.

I doubt it, because the issue has nothing to do with network 
protocols, as you seem to imply. It is about the data format, 
i.e. content that may be shuffled over a network, but can also 
stay on a disk, be printed on paper (gasp, such old tech) or be 
used interactively in a GUI.


>
>> 2 years ago we implemented also support for Chinese. The nice 
>> thing was that we didn't have to change much to do that thanks 
>> to Unicode. The second surprise was with the file sizes, 
>> Chinese documents were generally smaller than their European 
>> counterparts. Yes CJK requires 3 bytes for each ideogram, but 
>> generally 1 ideogram replaces many letters. The ideogram 亿 
>> replaces "One hundred million" for example, which of them take 
>> more bytes? So if CJK indeed requires more bytes to encode, it 
>> is firstly because they NEED many more bits in the first place 
>> (there are around 30000 CJK codepoints in the BMP alone, add 
>> to it the 60000 that are in the SIP and we have a need of 17 
>> bits only to encode them.
>
> That's not the relevant criteria: nobody cares if the CJK 
> documents were smaller than their European counterparts. What 
> they care about is that, given a different transfer format, the 
> CJK document could have been significantly smaller still. 
> Because almost nobody cares about which translation version is 
> smaller, they care that the text they sent in Chinese or Korean 
> is as small as it can be.

At most 50% more, but if the size is really that important one 
can use UTF-16, which is the same size as Big-5 or Shift-JIS, 
or, as Walter suggested, simply compress the file in that case.
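A quick D sketch of that size comparison (illustrative only): 
the ideogram 亿 against the English phrase it replaces, in UTF-8 
and UTF-16.

import std.stdio;

void main()
{
    // The BMP ideogram 亿 (U+4EBF) versus the English phrase it replaces.
    string  zh8  = "亿";       // string  = UTF-8 in D
    wstring zh16 = "亿"w;      // wstring = UTF-16
    string  en   = "One hundred million";

    writeln(zh8.length);                  // 3 bytes in UTF-8
    writeln(zh16.length * wchar.sizeof);  // 2 bytes in UTF-16
    writeln(en.length);                   // 19 bytes (plain ASCII)
}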

>
> Anyway, I didn't mean to restart this debate, so I'll leave it 
> here.

The auto-synchronization and the statelessness are big deals.


