Of possible interest: fast UTF8 validation

Joakim dlang at joakim.fea.st
Fri May 18 08:44:41 UTC 2018


On Thursday, 17 May 2018 at 23:11:22 UTC, Ethan wrote:
> On Thursday, 17 May 2018 at 17:26:04 UTC, Dmitry Olshansky 
> wrote:
>> TCP being reliable just plain doesn’t cut it. Corruption of
>> a single bit is very real.
>
> Quoting to highlight and agree.
>
> TCP is reliable because it resends dropped packets and delivers 
> them in order.
>
> I don't write TCP packets to my long-term storage medium.
>
> UTF as a transportation protocol for Unicode is *far* more 
> useful than just sending across a network.

The point wasn't that TCP is handling all the errors; it was a 
throwaway example of one other layer of the system, the network 
transport layer, which actually has a checksum that will detect a 
single bitflip, something UTF-8 usually will not. I mentioned 
that the filesystem and several other layers have their own such 
error detection, yet you latch on to the TCP example alone.
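To make that concrete, here is a small Python illustration of my own (not code from the thread): flipping a single bit inside a valid two-byte UTF-8 sequence can produce another valid sequence, so strict decoding accepts the corrupted text silently, while even a simple checksum (CRC32 here, standing in for TCP's) catches the change.

```python
import zlib

# A single bitflip can turn one valid UTF-8 sequence into another,
# so strict decoding does not detect the corruption.
original = "café".encode("utf-8")     # ends in 0xC3 0xA9 for 'é'
corrupted = bytearray(original)
corrupted[-1] ^= 0x01                 # flip the low bit of the last byte

# Both byte strings decode without error under strict validation:
assert original.decode("utf-8") == "café"
assert bytes(corrupted).decode("utf-8") == "cafè"   # silently wrong

# A checksum, unlike UTF-8 validation, flags the flipped bit:
assert zlib.crc32(original) != zlib.crc32(bytes(corrupted))
```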

On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via 
> Digitalmars-d wrote: [...]
>> - the auto-synchronization and the statelessness are big deals.
>
> Yes.  Imagine if we standardized on a header-based string 
> encoding, and we wanted to implement a substring function over 
> a string that contains multiple segments of different 
> languages. Instead of a cheap slicing over the string, you'd 
> need to scan the string or otherwise keep track of which 
> segment the start/end of the substring lies in, allocate memory 
> to insert headers so that the segments are properly 
> interpreted, etc. It would be an implementation nightmare, 
> and an unavoidable performance hit (you'd have to copy data 
> every time you take a substring), and the @nogc guys would be 
> up in arms.
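The cheap slicing Teoh describes comes from UTF-8's self-synchronizing design: continuation bytes always match the bit pattern 0b10xxxxxx, so any byte position can be snapped to a code-point boundary with at most a three-byte backward scan, and a substring is just a byte slice with no reencoding. A quick Python sketch of mine, for illustration:

```python
def snap_to_boundary(data: bytes, i: int) -> int:
    """Move i back to the start of a UTF-8 code point.
    Continuation bytes match 0b10xxxxxx, so at most three
    backward steps reach a lead byte -- no header, no state."""
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i

text = "naïve résumé".encode("utf-8")
i = snap_to_boundary(text, 3)     # byte 3 is mid-character ('ï')
sub = text[i:]                    # substring is just a byte slice
assert sub.decode("utf-8") == "ïve résumé"
```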

As we discussed when I first raised this header scheme years ago, 
you're right that slicing could be more expensive, depending on 
whether you allocate a new header for the substring or not. The 
question is whether the optimizations such a header enables, by 
recording where every language segment lies in a multi-language 
string, outweigh the cost of scanning the entire UTF-8 string to 
recover that same information. I think the header's design 
tradeoff fairly obviously beats UTF-8 in all but a few degenerate 
cases, but maybe you don't see it.
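Since the thread never pins down a concrete header format, here is a purely hypothetical Python sketch of my own: suppose each string carries a sorted table of (byte_offset, language_tag) segment headers. Taking a substring then means locating the covering segment and rebuilding a rebased header table, the extra allocation a plain UTF-8 slice avoids.

```python
from bisect import bisect_right

class HeaderString:
    """Hypothetical header-based string: a raw payload plus a sorted
    table of (start_offset, language_tag) segment headers. This
    format is invented for illustration; the thread proposes none."""

    def __init__(self, payload: bytes, segments):
        self.payload = payload
        self.segments = segments      # sorted list of (offset, lang)

    def substring(self, lo: int, hi: int) -> "HeaderString":
        starts = [off for off, _ in self.segments]
        # Find the segment covering lo, then rebuild a header table
        # rebased to the new string -- work a UTF-8 slice never does.
        first = bisect_right(starts, lo) - 1
        segs = [(max(off, lo) - lo, lang)
                for off, lang in self.segments[first:] if off < hi]
        return HeaderString(self.payload[lo:hi], segs)

s = HeaderString(b"hello\x00\x01world", [(0, "en"), (6, "de")])
t = s.substring(3, 9)
assert t.segments == [(0, "en"), (3, "de")]
```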

> And that's assuming we have a sane header-based encoding for 
> strings that contain segments in multiple languages in the 
> first place. Linguistic analysis articles, for example, would 
> easily contain many such segments within a paragraph, or 
> perhaps in the same sentence. How would a header-based encoding 
> work for such documents?

It would bloat the header to some extent, but the result would 
still be smaller than the equivalent UTF-8 string. You may also 
want special header encodings for such edge cases, if you want to 
maintain the same large performance lead over UTF-8 that you'd 
have in the common case.

>Nevermind the recent trend of
> liberally sprinkling emojis all over regular text. If every 
> emoticon embedded in a string requires splitting the string 
> into 3 segments complete with their own headers, I dare not 
> imagine what the code that manipulates such strings would look 
> like.

Personally, I don't consider emoji worth implementing :) as they 
shouldn't be part of Unicode. But since they are, I'm fairly 
certain header-based text messages with emoji would be 
significantly smaller than with UTF-8/16.

I was surprised to see that adding an emoji to a text message I 
sent last year cut my message character quota in half. I googled 
why, and it turns out that when you add an emoji, the 
text-messaging client actually changes your message encoding from 
UTF-8 to UTF-16! I don't know whether this is a limitation of the 
default Android messaging client, my carrier, or SMS itself, but 
I strongly suspect it is widespread.
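Whichever layer does the switch, the halved quota matches the byte arithmetic: ASCII-range text costs one byte per character in UTF-8 but two in UTF-16, so a fixed byte budget holds half as many characters. A quick check (my own illustration):

```python
msg = "Meet at the usual place at 7"
utf8 = msg.encode("utf-8")
utf16 = msg.encode("utf-16-le")   # LE without BOM, code units only

# ASCII-range text: 1 byte/char in UTF-8, 2 bytes/char in UTF-16,
# so the same byte budget fits half as many characters.
assert len(utf8) == len(msg)
assert len(utf16) == 2 * len(msg)
```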

Anyway, I can see the arguments about UTF-8 this time around are 
as bad as the first time I raised it five years back, so I'll 
leave this thread here.
