Of possible interest: fast UTF8 validation
Neia Neutuladh
neia at ikeran.org
Fri May 18 16:40:57 UTC 2018
On Thursday, 17 May 2018 at 23:16:03 UTC, H. S. Teoh wrote:
> On Thu, May 17, 2018 at 07:13:23PM +0000, Patrick Schluter via
> Digitalmars-d wrote: [...]
>> - the auto-synchronization and the statelessness are big deals.
>
> Yes. Imagine if we standardized on a header-based string
> encoding, and we wanted to implement a substring function over
> a string that contains multiple segments of different
> languages. Instead of a cheap slicing over the string, you'd
> need to scan the string or otherwise keep track of which
> segment the start/end of the substring lies in, allocate memory
> to insert headers so that the segments are properly
> interpreted, etc.. It would be an implementational nightmare,
> and an unavoidable performance hit (you'd have to copy data
> every time you take a substring), and the @nogc guys would be
> up in arms.
You'd have three data structures: Strand, Rope, and Slice.
A Strand is a series of bytes with an encoding. A Rope is a
series of Strands. A Slice is a pair of location references
within a Rope. You probably want a special data structure to
name a location within a Rope: a Strand offset, then a byte
offset within that Strand. That makes a Slice five words to pass
instead of two, but with zero dynamic allocations.
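The three structures can be sketched briefly; this is in Python for
illustration only, and the names, fields, and `to_str` helper are my
own rendering of the description above, not a concrete proposal:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Strand:
    """A contiguous run of bytes in one encoding."""
    encoding: str   # illustrative labels, e.g. "ascii", "latin-1"
    data: bytes

@dataclass
class Rope:
    """A string represented as a series of Strands."""
    strands: List[Strand]

# A location inside a Rope: (strand index, byte offset in that strand).
Loc = Tuple[int, int]

@dataclass
class Slice:
    """A substring: a Rope reference plus start/end locations.
    Sub-slicing just adjusts the two Loc pairs -- no copying,
    no allocation."""
    rope: Rope
    start: Loc
    end: Loc

    def to_str(self) -> str:
        """Decode the sliced region (demonstration only)."""
        (si, so), (ei, eo) = self.start, self.end
        if si == ei:
            s = self.rope.strands[si]
            return s.data[so:eo].decode(s.encoding)
        first = self.rope.strands[si]
        parts = [first.data[so:].decode(first.encoding)]
        for i in range(si + 1, ei):
            s = self.rope.strands[i]
            parts.append(s.data.decode(s.encoding))
        last = self.rope.strands[ei]
        parts.append(last.data[:eo].decode(last.encoding))
        return "".join(parts)

rope = Rope([Strand("ascii", b"price: "), Strand("latin-1", b"\xe9\xe8")])
sub = Slice(rope, start=(0, 2), end=(1, 1))
```

Note the Slice really is five words: one Rope reference plus two
(strand, byte) pairs.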
This would be a problem for data locality. However, rope-style
data structures are handy for some types of string manipulation.
As an alternative, you might keep a separate table specifying
which encodings apply to which byte ranges. Slices would then be
three words long (a pointer to the string struct, a start offset,
and an end offset). Iterating would cost O(log(S) + M), where S
is the number of encoded segments and M is the number of bytes in
the slice.
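That cost breakdown can be made concrete with a small sketch
(Python for illustration; the flat buffer, the sorted side table,
and both function names are assumptions, not a spec). One binary
search finds the segment covering the start of a slice, and
decoding then walks the M bytes linearly:

```python
import bisect

# Hypothetical side table: sorted segment start offsets plus the
# encoding that applies from each start up to the next one.
starts = [0, 7, 9]                       # byte offsets where a new encoding begins
encodings = ["ascii", "latin-1", "ascii"]
buf = b"price: \xe9\xe8 each"            # one flat byte buffer

def encoding_at(offset: int) -> str:
    """O(log S) lookup of the encoding covering a byte offset."""
    i = bisect.bisect_right(starts, offset) - 1
    return encodings[i]

def decode_slice(start: int, end: int) -> str:
    """Decode buf[start:end]: one O(log S) search for the first
    segment, then a linear walk over the M bytes of the slice."""
    out = []
    i = bisect.bisect_right(starts, start) - 1
    pos = start
    while pos < end:
        seg_end = starts[i + 1] if i + 1 < len(starts) else len(buf)
        stop = min(seg_end, end)
        out.append(buf[pos:stop].decode(encodings[i]))
        pos = stop
        i += 1
    return "".join(out)
```

The slice itself stays three words (buffer pointer, start, end);
the cost of the encoding metadata is paid once per string, not per
slice.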
Anyway, you end up with either a more complex data structure or
terrible time complexity, but at least not both at once.
> And that's assuming we have a sane header-based encoding for
> strings that contain segments in multiple languages in the
> first place. Linguistic analysis articles, for example, would
> easily contain many such segments within a paragraph, or
> perhaps in the same sentence. How would a header-based encoding
> work for such documents? Nevermind the recent trend of
> liberally sprinkling emojis all over regular text. If every
> emoticon embedded in a string requires splitting the string
> into 3 segments complete with their own headers, I dare not
> imagine what the code that manipulates such strings would look
> like.
"Header" implies that all encoding data appears at the start of
the document, or in a separate metadata segment. (Call it a start
index and two bytes to specify the encoding; reserve the first
few bits of the encoding to specify the width.) It also brings to
mind HTTP, and reminds me that most documents are either mostly
ASCII or a heavy mix of ASCII and something else (HTML and XML
being the forerunners).
If the encoding succeeded at making most scripts single-byte,
then, testing with https://ar.wikipedia.org/wiki/Main_Page, you
might get within 15% of UTF-8's efficiency. But then a simple
sentence like "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?" is 2.64 times as
long in this encoding as in UTF-8, since it has ten encoded
segments, each with its own header overhead. (Assuming the header
supports strings up to 2^32 bytes long.)
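The arithmetic is easy to check with a short sketch (Python for
illustration). I'm assuming a 6-byte header here -- a 4-byte start
index for the 2^32 limit plus the two encoding bytes -- and
splitting segments at ASCII/non-ASCII boundaries, which yields the
ten segments mentioned above; under those assumptions the ratio
comes out near 2.6, with the exact figure depending on the header
layout chosen:

```python
import itertools

sentence = "Ĉu ĝi ŝajnas ankaŭ esti ŝafo?"
HEADER = 4 + 2  # assumed: 4-byte start index + 2-byte encoding tag

# Split into runs that could each use one single-byte encoding:
# plain ASCII runs vs. Latin-extended (non-ASCII) runs.
runs = ["".join(g)
        for _, g in itertools.groupby(sentence, key=lambda c: ord(c) < 128)]

# One header per segment, then one byte per character.
header_encoded = len(runs) * HEADER + len(sentence)
utf8 = len(sentence.encode("utf-8"))
ratio = header_encoded / utf8
```

With these assumptions: 10 segments, 89 bytes against 34 bytes of
UTF-8.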
If the encoding didn't succeed at representing Latin and Arabic
as single-byte scripts (Latin has over 800 characters in Unicode,
and Arabic over three hundred), it would be worse than UTF-16.