Notice/Warning on narrowStrings .length

Dmitry Olshansky dmitry.olsh at gmail.com
Fri Apr 27 01:20:13 PDT 2012


On 27.04.2012 1:23, H. S. Teoh wrote:
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
>> "James Miller"<james at aatch.net>  wrote in message
>> news:qdgacdzxkhmhojqcettj at forum.dlang.org...
>>> I'm writing an introduction/tutorial to using strings in D, paying
>>> particular attention to the complexities of UTF-8 and 16. I realised
>>> that when you want the number of characters, you normally actually
>>> want to use walkLength, not length. Is it reasonable for the
>>> compiler to pick this up during semantic analysis and point out this
>>> situation?
>>>
>>> It's just a thought because a lot of the time, using length will get
>>> the right answer, but for the wrong reasons, resulting in lurking
>>> bugs. You can always cast to immutable(ubyte)[] or
>>> immutable(ushort)[] if you want to work with the actual code units anyway.
>>
>> I find that most of the time I actually *do* want to use length. Don't
>> know if that's common, though, or if it's just a reflection of my
>> particular use-cases.
>>
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (ie, graphemes), but merely the
>> number of code points - which is not the same thing (due to existence
>> of the [confusingly-named] "combining characters").
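
To make the distinction concrete, a quick sketch (the grapheme count
is left as a comment, since grapheme segmentation isn't in Phobos yet):

    import std.range : walkLength;
    import std.stdio;

    void main()
    {
        // 'e' + U+0301 COMBINING ACUTE ACCENT: displays as one
        // character, but is two code points and three UTF-8 bytes.
        string s = "e\u0301";
        writeln(s.length);      // 3 -- UTF-8 code units
        writeln(s.walkLength);  // 2 -- code points
        // grapheme count: 1, once grapheme-aware walking lands
    }
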
> [...]
>
> And don't forget that some code points (notably from the CJK block) are
> specified as "double-width", so if you're trying to do text layout,
> you'll want yet a different length (layoutLength?).
>
> So we really need all four lengths. Ain't unicode fun?! :-)
>
> Array length is simple.  walkLength is already implemented. Grapheme
> length requires recognition of 'combining characters' (or rather,
> ignoring said characters), and layout length requires recognizing
> widthless, single- and double-width characters.
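
A hypothetical layoutLength could be sketched like so. Heavy hedging
applies: the name is invented, the ranges below are only a crude
subset of the real East Asian Width data, and combining marks would
also have to count as zero-width:

    // Crude approximation of "wide" code points; the real property
    // table has many more ranges.
    bool isWide(dchar c)
    {
        return (c >= 0x1100 && c <= 0x115F)   // Hangul Jamo
            || (c >= 0x3040 && c <= 0x30FF)   // Hiragana, Katakana
            || (c >= 0x4E00 && c <= 0x9FFF)   // CJK Unified Ideographs
            || (c >= 0xFF00 && c <= 0xFF60);  // Fullwidth forms
    }

    size_t layoutLength(string s)
    {
        size_t cols = 0;
        foreach (dchar c; s)    // foreach over dchar auto-decodes
            cols += isWide(c) ? 2 : 1;
        return cols;
    }
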
>
> I've been thinking about unicode processing recently. Traditionally, we
> have to decode narrow strings into UTF-32 (aka dchar) then do table
> lookups and such. But unicode encoding and properties, etc., are static
> information (at least within a single unicode release). So why bother
> with hardcoding tables and stuff at all?

Of course they are generated.
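
And D makes the generation step cheap: tables can even be built at
compile time with CTFE. A toy of the flavor (not the actual Phobos
generator, just the shape of its output):

    // A 256-bit set over Latin-1, filled at compile time.
    immutable uint[8] digitSet = () {
        uint[8] t;
        foreach (c; 0x30 .. 0x3A)      // '0' .. '9'
            t[c >> 5] |= 1u << (c & 31);
        return t;
    }();

    bool isDigitFast(char c)
    {
        return ((digitSet[c >> 5] >> (c & 31)) & 1) != 0;
    }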

>
> What we *really* should be doing, esp. for commonly-used functions like
> computing various lengths, is to automatically process said tables and
> encode the computation in finite-state machines that can then be
> optimized at the FSM level (there are known algos for generating optimal
> FSMs),

FSAs are based on tables, so it all comes around in a circle; only the
layout changes. Yet the speed gains of non-decoding are huge.
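
For instance, the classic non-decoding trick for counting code points:
in well-formed UTF-8 every continuation byte matches the bit pattern
10xxxxxx, so you count everything that doesn't (a minimal sketch, not
Phobos code):

    // Count code points in well-formed UTF-8 without decoding.
    size_t codePoints(const(char)[] s)
    {
        size_t n = 0;
        foreach (char c; s)             // raw code units, no decoding
            n += (c & 0xC0) != 0x80;    // skip continuation bytes
        return n;
    }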

> codegen'd, and then optimized again at the assembly level by the
> compiler. These FSMs will operate at the native narrow string char type
> level, so that there will be no need for explicit decoding.
>
> The generation algo can then be run just once per unicode release, and
> everything will Just Work.
>
This year Unicode in D will receive a nice upgrade.
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#

Anyway, keep me posted if these FSAs ever come to spoil your sleep ;)

-- 
Dmitry Olshansky

