Of possible interest: fast UTF8 validation
Patrick Schluter
Patrick.Schluter at bbox.fr
Thu May 17 19:13:23 UTC 2018
On Thursday, 17 May 2018 at 15:16:19 UTC, Joakim wrote:
> On Thursday, 17 May 2018 at 13:14:46 UTC, Patrick Schluter
> wrote:
>> This is not practical, sorry. What happens when your message
>> loses the header? Exactly, the rest of the message is garbled.
>
> Why would it lose the header? TCP guarantees delivery and
> checksums the data, that's effective enough at the transport
> layer.
What does TCP/IP have to do with anything under discussion here?
UTF-8 (or UTF-16 or UTF-32) has nothing to do with network
protocols; that is a completely unrelated layer. A file encoded on
a disk may never leave the machine it was written on and may never
see a wire in its lifetime, yet its encoding is still of vital
importance. That is why a header-based encoding is too restrictive.
>
> I agree that UTF-8 is a more redundant format, as others have
> mentioned earlier, and is thus more robust to certain types of
> data loss than a header-based scheme. However, I don't consider
> that the job of the text format, it's better done by other
> layers, like transport protocols or filesystems, which will
> guard against such losses much more reliably and efficiently.
No. A text format cannot depend on a network protocol. It would
be as if you could only listen to music or watch a video while
streaming and never save it to a file offline, because nothing in
the file itself would say what that blob of bytes represents. It
doesn't make any sense.
>
> For example, a random bitflip somewhere in the middle of a
> UTF-8 string will not be detectable most of the time. However,
> more robust error-correcting schemes at other layers of the
> system will easily catch that.
That's the job of the other layers, and any other file format
would have the same problem. At least with UTF-8 at most one
codepoint is ever lost or changed; no other encoding fares better.
That said, if a checksum for your document is important, you can
always add it externally anyway.
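
To illustrate the point (a quick untested C sketch of mine, not
production code): because a continuation byte is always of the form
10xxxxxx, a decoder that hits a bad byte can resynchronize on the
very next lead byte, so the damage stays confined to that one
codepoint.

#include <stdio.h>

/* A UTF-8 continuation byte always has the form 10xxxxxx. This is a
 * property of the byte itself; it does not depend on anything seen
 * earlier in the stream. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* After an error at index i, skip the bad byte and any orphaned
 * continuation bytes; decoding resumes at the next lead byte, so only
 * the damaged codepoint is affected. */
static size_t resync(const unsigned char *s, size_t i, size_t len)
{
    i++;
    while (i < len && is_continuation(s[i]))
        i++;
    return i;
}

int main(void)
{
    /* "é" (0xC3 0xA9) with its lead byte flipped to an invalid 0xFF. */
    unsigned char buf[] = { 'a', 0xFF, 0xA9, 'b' };
    size_t next = resync(buf, 1, sizeof buf);
    printf("decoding resumes at index %zu ('%c')\n", next, buf[next]);
    return 0;
}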
>
>> That's exactly what happened with code page based texts when
>> you don't know in which code page it is encoded. It has the
>> supplemental inconvenience that mixing languages becomes
>> impossible or at least very cumbersome.
>> UTF-8 has several properties that are difficult to have with
>> other schemes.
>> - It is state-less, means any byte in a stream always means
>> the same thing. Its meaning does not depend on external or a
>> previous byte.
>
> I realize this was considered important at one time, but I
> think it has proven to be a bad design decision, for HTTP too.
> There are some advantages when building rudimentary systems
> with crude hardware and lots of noise, as was the case back
> then, but that's not the tech world we live in today. That's
> why almost every HTTP request today is part of a stateful
> session that explicitly keeps track of the connection, whether
> through cookies, https encryption, or HTTP/2.
Again, orthogonal to UTF-8. When I speak of streams above, I am not
limiting myself to sockets; files are also read as streams. So stop
equating UTF-8 with the Internet, these are two different domains.
The Internet and its protocols were defined and invented long before
Unicode, and Unicode is very useful offline as well.
>> - It can mix any language in the same stream without
>> acrobatics and if one thinks that mixing languages doesn't
>> happen often should get his head extracted from his rear,
>> because it is very common (check wikipedia's front page for
>> example).
>
> I question that almost anybody needs to mix "streams." As for
> messages or files, headers handle multiple language mixing
> easily, as noted in that earlier thread.
Ok, show me how you transmit that, I'm curious:
<prop type="Txt::Doc. No.">E2010C0002</prop>
<tuv lang="EN-GB">
<seg>EFTA Surveillance Authority Decision</seg>
</tuv>
<tuv lang="DE-DE">
<seg>Beschluss der EFTA-Überwachungsbehörde</seg>
</tuv>
<tuv lang="DA-01">
<seg>EFTA-Tilsynsmyndighedens beslutning</seg>
</tuv>
<tuv lang="EL-01">
<seg>Απόφαση της Εποπτεύουσας Αρχής της ΕΖΕΣ</seg>
</tuv>
<tuv lang="ES-ES">
<seg>Decisión del Órgano de Vigilancia de la AELC</seg>
</tuv>
<tuv lang="FI-01">
<seg>EFTAn valvontaviranomaisen päätös</seg>
</tuv>
<tuv lang="FR-FR">
<seg>Décision de l'Autorité de surveillance AELE</seg>
</tuv>
<tuv lang="IT-IT">
<seg>Decisione dell’Autorità di vigilanza EFTA</seg>
</tuv>
<tuv lang="NL-NL">
<seg>Besluit van de Toezichthoudende Autoriteit van de EVA</seg>
</tuv>
<tuv lang="PT-PT">
<seg>Decisão do Órgão de Fiscalização da EFTA</seg>
</tuv>
<tuv lang="SV-SE">
<seg>Beslut av Eftas övervakningsmyndighet</seg>
</tuv>
<tuv lang="LV-01">
<seg>EBTA Uzraudzības iestādes Lēmums</seg>
</tuv>
<tuv lang="CS-01">
<seg>Rozhodnutí Kontrolního úřadu ESVO</seg>
</tuv>
<tuv lang="ET-01">
<seg>EFTA järelevalveameti otsus</seg>
</tuv>
<tuv lang="PL-01">
<seg>Decyzja Urzędu Nadzoru EFTA</seg>
</tuv>
<tuv lang="SL-01">
<seg>Odločba Nadzornega organa EFTE</seg>
</tuv>
<tuv lang="LT-01">
<seg>ELPA priežiūros institucijos sprendimas</seg>
</tuv>
<tuv lang="MT-01">
<seg>Deċiżjoni tal-Awtorità tas-Sorveljanza tal-EFTA</seg>
</tuv>
<tuv lang="SK-01">
<seg>Rozhodnutie Dozorného orgánu EZVO</seg>
</tuv>
<tuv lang="BG-01">
<seg>Решение на Надзорния орган на ЕАСТ</seg>
</tuv>
</tu>
<tu>
>
>> - The multi byte nature of other alphabets is not as bad as
>> people think because texts in computer do not live on their
>> own, meaning that they are generally embedded inside file
>> formats, which more often than not are extremely bloated (xml,
>> html, xliff, akoma ntoso, rtf etc.). The few bytes more in the
>> text do not weigh that much.
>
> Heh, the other parts of the tech stack are much more bloated,
> so this bloat is okay? A unique argument, but I'd argue that's
> why those bloated formats you mention are largely dying off too.
They aren't dying off; it's getting worse by the day. That's why I
mentioned Akoma Ntoso and XLIFF: they will be used more and more.
The world is not limited to webshit (see n-gate.com for the
reference).
>
>> I'm in charge at the European Commission of the biggest
>> translation memory in the world. It handles currently 30
>> languages and without UTF-8 and UTF-16 it would be
>> unmanageable. I still remember when I started there in 2002
>> when we handled only 11 languages of which only 1 was of
>> another alphabet (Greek). Everything was based on RTF with
>> codepages and it was a braindead mess. My first job in 2003
>> was to extend the system to handle the 8 newcomer languages
>> and with ASCII based encodings it was completely unmanageable
>> because every document processed mixes languages and alphabets
>> freely (addresses and names are often written in their
>> original form for instance).
>
> I have no idea what a "translation memory" is. I don't doubt
> that dealing with non-standard codepages or layouts was
> difficult, and that a standard like Unicode made your life
> easier. But the question isn't whether standards would clean
> things up, of course they would, the question is whether a
> hypothetical header-based standard would be better than the
> current continuation byte standard, UTF-8. I think your life
> would've been even easier with the former, though depending on
> your usage, maybe the main gain for you would be just from
> standardization.
I doubt it, because the issue has nothing to do with network
protocols, as you seem to imply it does. It is about the data
format, i.e. the content that may be shuffled over a network but
may also stay on a disk, be printed on paper (gasp, such old tech)
or be used interactively in a GUI.
>
>> 2 years ago we implemented also support for Chinese. The nice
>> thing was that we didn't have to change much to do that thanks
>> to Unicode. The second surprise was with the file sizes,
>> Chinese documents were generally smaller than their European
>> counterparts. Yes CJK requires 3 bytes for each ideogram, but
>> generally 1 ideogram replaces many letters. The ideogram 亿
>> replaces "One hundred million" for example; which of them takes
>> more bytes? So if CJK indeed requires more bytes to encode, it
>> is firstly because they NEED many more bits in the first place
>> (there are around 30000 CJK codepoints in the BMP alone; add to
>> them the 60000 that are in the SIP and we need 17 bits just to
>> encode them).
>
> That's not the relevant criteria: nobody cares if the CJK
> documents were smaller than their European counterparts. What
> they care about is that, given a different transfer format, the
> CJK document could have been significantly smaller still.
> Because almost nobody cares about which translation version is
> smaller, they care that the text they sent in Chinese or Korean
> is as small as it can be.
At most 50% more, and if size is really that important they can use
UTF-16, which is the same size as Big-5 or Shift-JIS, or, as Walter
suggested, they would do better to just compress the file in that
case.
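
To put rough numbers on it (an untested C sketch of mine, using the
亿 example from above):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The ideogram 亿 (U+4EBF, "one hundred million") takes 3 bytes in
     * UTF-8 and 2 bytes in UTF-16; the English phrase it replaces takes
     * 19 bytes in UTF-8/ASCII and 38 bytes in UTF-16. */
    const char *cjk = "\xE4\xBA\xBF";            /* 亿 encoded as UTF-8 */
    const char *en  = "One hundred million";

    printf("UTF-8 : CJK %zu bytes, English %zu bytes\n",
           strlen(cjk), strlen(en));
    printf("UTF-16: CJK 2 bytes, English %zu bytes\n",
           2 * strlen(en));                      /* pure ASCII: 2 bytes/char */
    return 0;
}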
>
> Anyway, I didn't mean to restart this debate, so I'll leave it
> here.
- the auto-synchronization and the statelessness are big deals.
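
And since the subject of the thread is fast UTF-8 validation: the
statelessness is exactly what keeps a validator simple. A deliberately
naive, untested scalar sketch of mine (the fast versions do the same
checks with SIMD, and a complete validator must also reject overlong
forms, surrogates and codepoints above U+10FFFF):

#include <stddef.h>

/* Structural UTF-8 check: every lead byte must be followed by the
 * right number of 10xxxxxx continuation bytes. The decision for each
 * byte depends only on that byte, never on earlier state. */
int utf8_structurally_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = buf[i++];
        size_t n;                             /* continuation bytes expected */
        if      (b < 0x80)            n = 0;  /* 0xxxxxxx: ASCII             */
        else if ((b & 0xE0) == 0xC0)  n = 1;  /* 110xxxxx                    */
        else if ((b & 0xF0) == 0xE0)  n = 2;  /* 1110xxxx                    */
        else if ((b & 0xF8) == 0xF0)  n = 3;  /* 11110xxx                    */
        else return 0;                        /* stray continuation/bad lead */
        if (n > len - i)
            return 0;                         /* sequence truncated          */
        while (n--)
            if ((buf[i++] & 0xC0) != 0x80)    /* must be a continuation byte */
                return 0;
    }
    return 1;
}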