Narrow string is not a random access range

H. S. Teoh hsteoh at quickfur.ath.cx
Wed Oct 24 12:53:23 PDT 2012


On Wed, Oct 24, 2012 at 12:38:41PM -0700, Jonathan M Davis wrote:
> On Wednesday, October 24, 2012 14:37:33 mist wrote:
> > Wait. So you consider commonPrefix returning malformed string to
> > be fine? I have lost you here. For example, for code sample given
> > above, output is:
> > 
> > ==========
> > Пи
> > П[\D0]
> > ==========
> > 
> > Problem is if you use == on code unit you can match only part of
> > valid symbol.
> 
> Hmmm. Let me think this through for a moment. Every code point starts
> with a code unit that tells you how many code units are in the code
> point, and each code point should have only one sequence of code units
> which represents it, so something like find or startsWith should be
> able to just use code units.  commonPrefix is effectively doing a
> startsWith/find, but it's shortcutted once there's a difference, and
> that _could_ be in the middle of a code point, since you could have a
> > code point with 3 code units where the first 2 match but not the third
> one. So, yes. There is a bug here.
> 
> Now, a full decode still isn't necessary. It just has to keep track of
> how long the code point is and return a slice starting at the end of
> the previous code point if not all of a code point matches, but
> you've definitely found a bug.
[...]

For many algorithms, full decode is not necessary. This is something
that Phobos should take advantage of (at least in theory; I'm not sure
how practical this is with the current codebase).

Actually, in the above case, *no* decode is necessary at all. UTF-8 was
designed specifically for this: if you see a byte whose two highest bits
are 0b10, you're in the middle of a code point. You can scan forwards or
backwards until the first byte whose top two bits aren't 0b10; that byte
is guaranteed to be the start of a code point (provided the original
string is actually well-formed UTF-8). There is no need to keep track of
length at all.
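The scanning trick can be sketched in a few lines. (This is a Python
sketch for brevity, not D; the function names are mine, but the bit test
is identical in D: `(b & 0xC0) == 0x80`.)

```python
def is_continuation(b):
    """A UTF-8 continuation byte has its two highest bits set to 0b10,
    i.e. it matches the pattern 0b10xxxxxx."""
    return (b & 0xC0) == 0x80

def codepoint_start(buf, i):
    """Scan backwards from offset i to the start of the enclosing code
    point. Assumes buf is well-formed UTF-8; no decoding is performed."""
    while i > 0 and is_continuation(buf[i]):
        i -= 1
    return i

s = "Пример".encode("utf-8")  # 'П' encodes as the two bytes 0xD0 0x9F
codepoint_start(s, 1)         # → 0: byte 1 is the continuation byte of 'П'
```

The same backward scan is what a fixed commonPrefix would do after
hitting a mismatch mid-code-point: back up to the nearest boundary and
slice there.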

Many algorithms can be optimized to take advantage of this. Counting the
number of code points is simply counting the bytes whose two highest
bits are not 0b10. Given an arbitrary offset into a char[], you can use
std.range.radial to find the nearest code point boundary (i.e., the
nearest byte whose top two bits are not 0b10).
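Both of those can be written without any decoding (again a Python
sketch with names of my own choosing; `nearest_boundary` searches
outward in both directions, which is the effect std.range.radial gives
you in D):

```python
def count_codepoints(buf):
    """Count code points in well-formed UTF-8 by counting every byte
    that is NOT a continuation byte (top two bits != 0b10)."""
    return sum(1 for b in buf if (b & 0xC0) != 0x80)

def nearest_boundary(buf, i):
    """Find the code point boundary nearest to offset i by checking
    offsets i, i-1, i+1, i-2, i+2, ... until a non-continuation byte
    is found. Assumes buf is non-empty, well-formed UTF-8."""
    for d in range(len(buf)):
        if i - d >= 0 and (buf[i - d] & 0xC0) != 0x80:
            return i - d
        if i + d < len(buf) and (buf[i + d] & 0xC0) != 0x80:
            return i + d
    return 0

s = "Пи".encode("utf-8")  # 4 bytes, 2 code points
count_codepoints(s)       # → 2
nearest_boundary(s, 1)    # → 0: offset 1 is inside 'П'
```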

Given a badly-truncated UTF-8 string (i.e., one that got cut in the
middle of a code point), you can recover the still-valid substring by
deleting the stray continuation bytes (high bits 0b10) at the beginning,
and the incomplete trailing sequence (its 0b10 bytes plus the dangling
lead byte, if any) at the end. You'll lose the truncated code point, but
the rest of the string is still usable.

Etc.


T

-- 
Marketing: the art of convincing people to pay for what they didn't need before which you can't deliver after.


More information about the Digitalmars-d-learn mailing list