Notice/Warning on narrowStrings .length

H. S. Teoh hsteoh at quickfur.ath.cx
Thu Apr 26 14:23:22 PDT 2012


On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
> "James Miller" <james at aatch.net> wrote in message 
> news:qdgacdzxkhmhojqcettj at forum.dlang.org...
> > I'm writing an introduction/tutorial to using strings in D, paying
> > particular attention to the complexities of UTF-8 and 16. I realised
> > that when you want the number of characters, you normally actually
> > want to use walkLength, not length. Is it reasonable for the
> > compiler to pick this up during semantic analysis and point out this
> > situation?
> >
> > It's just a thought because a lot of the time, using length will get
> > the right answer, but for the wrong reasons, resulting in lurking
> > bugs. You can always cast to immutable(ubyte)[] or
> > immutable(ushort)[] if you want to work with the raw code units anyway.
> 
> I find that most of the time I actually *do* want to use length. Don't
> know if that's common, though, or if it's just a reflection of my
> particular use-cases.
> 
> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
> return the number of "characters" (i.e., graphemes), but merely the
> number of code points - which is not the same thing (due to the
> existence of the [confusingly named] "combining characters").
[...]

And don't forget that some code points (notably in the CJK blocks) are
specified as "double-width", so if you're trying to do text layout,
you'll want yet another length (layoutLength?).
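
For illustration, here's a minimal sketch of what such a layoutLength
might look like. The function name is hypothetical, and the wide ranges
checked are a drastic simplification of the East Asian Width data
(UAX #11), assuming the combiningClass lookup in today's std.uni:

    import std.uni : combiningClass;

    // Hypothetical layoutLength: columns a string occupies in a
    // fixed-width display. Real code would consult the full East Asian
    // Width table; these three ranges are illustration only.
    size_t layoutLength(string s)
    {
        size_t cols = 0;
        foreach (dchar c; s)            // foreach decodes UTF-8 to dchar
        {
            if (combiningClass(c) != 0)
                continue;               // combining marks take no column
            bool wide = (c >= 0x1100 && c <= 0x115F)   // Hangul Jamo
                     || (c >= 0x4E00 && c <= 0x9FFF)   // CJK Unified Ideographs
                     || (c >= 0xFF01 && c <= 0xFF60);  // fullwidth forms
            cols += wide ? 2 : 1;
        }
        return cols;
    }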

So we really need all four lengths. Ain't Unicode fun?! :-)

Array length is simple.  walkLength is already implemented. Grapheme
length requires recognizing combining characters (or rather, ignoring
them when counting), and layout length requires distinguishing
zero-width, single-width, and double-width characters.
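
To make the distinctions concrete, a small example (assuming a Phobos
that has std.uni.byGrapheme; the grapheme API landed after this
discussion):

    import std.range : walkLength;
    import std.uni : byGrapheme;

    void main()
    {
        string s = "e\u0301";                 // 'e' plus U+0301 combining acute
        assert(s.length == 3);                // code units (UTF-8 bytes)
        assert(s.walkLength == 2);            // code points
        assert(s.byGrapheme.walkLength == 1); // graphemes ("characters")
        // layout length would also be 1 column; no Phobos primitive for it yet
    }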

I've been thinking about Unicode processing recently. Traditionally, we
have to decode narrow strings into UTF-32 (i.e., dchar) and then do
table lookups and such. But Unicode encodings, character properties,
etc., are static information (at least within a single Unicode
release). So why bother hand-coding the tables and lookups at all?
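
For instance, the traditional decode-then-look-up style looks roughly
like this (a naive sketch; real grapheme segmentation per UAX #29 is
more involved than a combining-class check):

    import std.utf : decode;
    import std.uni : combiningClass;

    // Naive grapheme-ish count: decode each code point, then consult a
    // property table. Every input byte goes through full UTF-8 decoding.
    size_t graphemeLengthNaive(string s)
    {
        size_t i = 0, n = 0;
        while (i < s.length)
        {
            dchar c = decode(s, i);      // decodes and advances i
            if (combiningClass(c) == 0)  // count only starters, skip marks
                ++n;
        }
        return n;
    }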

What we *really* should be doing, especially for commonly-used
functions like computing these various lengths, is to process those
tables automatically and encode the computation in finite-state
machines, which can then be optimized at the FSM level (there are known
algorithms for generating minimal FSMs), codegen'd, and then optimized
again at the assembly level by the compiler. These FSMs would operate
directly on the native narrow-string char type, so there would be no
need for explicit decoding.
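
As a trivial instance of the idea, even code-point counting reduces to
a two-state recognizer over raw UTF-8 bytes, with no decoding step at
all (assumes well-formed input):

    // Counts code points directly on UTF-8 bytes: a byte begins a new
    // code point unless it is a continuation byte (0b10xxxxxx). A
    // generated FSM would extend this same pattern to grapheme and
    // layout lengths.
    size_t codePointLength(string s) pure nothrow @nogc
    {
        size_t n = 0;
        foreach (ubyte b; cast(const(ubyte)[]) s)
            if ((b & 0xC0) != 0x80)
                ++n;
        return n;
    }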

The generation algorithm would then need to run just once per Unicode
release, and everything would Just Work.


T

-- 
Give me some fresh salted fish, please.
