Of possible interest: fast UTF8 validation

Neia Neutuladh neia at ikeran.org
Fri May 18 16:40:57 UTC 2018


On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via 
> Digitalmars-d wrote: [...]
>> - the auto-synchronization and the statelessness are big deals.
>
> Yes.  Imagine if we standardized on a header-based string 
> encoding, and we wanted to implement a substring function over 
> a string that contains multiple segments of different 
> languages. Instead of a cheap slicing over the string, you'd 
> need to scan the string or otherwise keep track of which 
> segment the start/end of the substring lies in, allocate memory 
> to insert headers so that the segments are properly 
> interpreted, etc. It would be an implementation nightmare, 
> and an unavoidable performance hit (you'd have to copy data 
> every time you take a substring), and the @nogc guys would be 
> up in arms.

You'd have three data structures: Strand, Rope, and Slice.

A Strand is a series of bytes with an encoding. A Rope is a 
series of Strands. A Slice is a pair of location references 
within a Rope. You probably want a small data structure to name 
a location within a Rope: a Strand index, then a byte offset 
within that Strand. That makes a Slice five words instead of two 
to pass around, but taking one needs zero dynamic allocations.

This would be a problem for data locality. However, rope-style 
data structures are handy for some types of string manipulation.
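
Roughly, in D (the names and field layouts below are only my 
guess at what that might look like, not a worked-out design):

enum Encoding : ushort { ascii, latin1, arabic /* ... */ }

// A Strand: a run of bytes plus the encoding they use.
struct Strand
{
    Encoding encoding;
    immutable(ubyte)[] bytes;
}

// A Rope: a series of Strands.
struct Rope
{
    Strand[] strands;
}

// Names a position inside a Rope: which Strand, then where in it.
struct Location
{
    size_t strand;  // index into Rope.strands
    size_t offset;  // byte offset within that Strand
}

// A Slice: the Rope plus a pair of Locations; five words in
// total, but taking a substring allocates nothing.
struct Slice
{
    Rope* rope;
    Location begin;
    Location end;
}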

As an alternative, you might keep a separate table specifying 
which encodings apply to which byte ranges. Slices would then be 
three words long (pointer to the string struct, start offset, end 
offset). Iterating would cost O(log(S) + M), where S is the 
number of encoded segments and M is the number of bytes in the 
slice.
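
A sketch of that variant, again in D and again with names I made 
up; the segment table is sorted by start offset, so finding the 
encoding for a given byte position is a binary search:

enum Encoding : ushort { ascii, latin1, arabic /* ... */ }

// One entry per encoded segment: where it starts, how to decode it.
struct Segment
{
    size_t start;
    Encoding encoding;
}

struct EncodedString
{
    immutable(ubyte)[] bytes;
    Segment[] segments;  // sorted by start; segments[0].start == 0
}

// Three words: pointer to the string struct, start offset, end offset.
struct Slice
{
    EncodedString* str;
    size_t start;
    size_t end;
}

// Which encoding covers a given byte offset? O(log S) by binary
// search; decoding the M bytes of the slice afterwards gives the
// O(log(S) + M) figure above.
Encoding encodingAt(const(EncodedString)* s, size_t offset)
{
    size_t lo = 0, hi = s.segments.length;
    while (hi - lo > 1)
    {
        const mid = lo + (hi - lo) / 2;
        if (s.segments[mid].start <= offset)
            lo = mid;
        else
            hi = mid;
    }
    return s.segments[lo].encoding;
}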

Anyway, you pay either with a more complex data structure or 
with terrible time complexity, though at least not with both at 
once.

> And that's assuming we have a sane header-based encoding for 
> strings that contain segments in multiple languages in the 
> first place. Linguistic analysis articles, for example, would 
> easily contain many such segments within a paragraph, or 
> perhaps in the same sentence. How would a header-based encoding 
> work for such documents?  Nevermind the recent trend of 
> liberally sprinkling emojis all over regular text. If every 
> emoticon embedded in a string requires splitting the string 
> into 3 segments complete with their own headers, I dare not 
> imagine what the code that manipulates such strings would look 
> like.

"Header" implies that all encoding data appears at the start of 
the document, or in a separate metadata segment. (Call it a start 
index and two bytes to specify the encoding; reserve the first 
few bits of the encoding to specify the width.) It also brings to 
mind HTTP, and reminds me that most documents are either mostly 
ASCII or a heavy mix of ASCII and something else (HTML and XML 
being the prime examples).
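
To spell out one possible reading of that (the exact layout is my 
guess, not something anyone has specified), each segment would 
get a record like:

// Four-byte start index plus two bytes naming the encoding, with
// the code-unit width packed into the top two bits of the id.
struct SegmentHeader
{
    uint start;       // byte offset; allows strings up to 2^32 bytes
    ushort encoding;  // encoding id; top two bits = log2(unit width)
}

// 1-, 2-, 4- or 8-byte code units, depending on the reserved bits.
uint codeUnitWidth(ushort encoding)
{
    return 1u << (encoding >> 14);
}

That works out to six bytes of overhead per segment.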

If the encoding succeeded at making most scripts single-byte, 
then, testing with https://ar.wikipedia.org/wiki/Main_Page, you 
might get within 15% of UTF-8's efficiency. But then a simple 
sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as 
long in this encoding as in UTF-8, since it has ten encoded 
segments, each with overhead. (Assuming the header supports 
strings up to 2^32 bytes long.)
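
(Spelling the arithmetic out, and assuming the six-byte 
per-segment header sketched above: the sentence switches between 
plain and accented Latin ten times, so ten headers cost about 60 
bytes on top of 29 single-byte payload bytes, roughly 90 bytes in 
all, while UTF-8 needs only 34 bytes: 29 characters, five of them 
two bytes long. That is where the factor comes from.)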

If it didn't succeed at making Latin and Arabic single-byte 
scripts (and Latin contains over 800 characters in Unicode, while 
Arabic has over three hundred), it would be worse than UTF-16.

