Notice/Warning on narrowStrings .length

Nick Sabalausky SeeWebsiteToContactMe at semitwist.com
Thu Apr 26 15:13:00 PDT 2012


"H. S. Teoh" <hsteoh at quickfur.ath.cx> wrote in message 
news:mailman.2173.1335475413.4860.digitalmars-d at puremagic.com...
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
>> "James Miller" <james at aatch.net> wrote in message
>> news:qdgacdzxkhmhojqcettj at forum.dlang.org...
>> > I'm writing an introduction/tutorial to using strings in D, paying
>> > particular attention to the complexities of UTF-8 and 16. I realised
>> > that when you want the number of characters, you normally actually
>> > want to use walkLength, not length. Is it reasonable for the
>> > compiler to pick this up during semantic analysis and point out this
>> > situation?
>> >
>> > It's just a thought because a lot of the time, using length will get
>> > the right answer, but for the wrong reasons, resulting in lurking
>> > bugs. You can always cast to immutable(ubyte)[] or
>> > immutable(short)[] if you want to work with the actual bytes anyway.
>>
>> I find that most of the time I actually *do* want to use length. Don't
>> know if that's common, though, or if it's just a reflection of my
>> particular use-cases.
>>
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (ie, graphemes), but merely the
>> number of code points - which is not the same thing (due to existence
>> of the [confusingly-named] "combining characters").
> [...]
>
> And don't forget that some code points (notably from the CJK block) are
> specified as "double-width", so if you're trying to do text layout,
> you'll want yet a different length (layoutLength?).
>

Interesting. Kinda makes sense that such a thing exists, though: The CJK 
characters (even the relatively simple Japanese *kana) are detailed enough 
that they need to be larger to achieve the same readability. And that's the 
*non*-double-width ones. So I don't doubt there are ones that need to be 
tagged as "Draw Extra Big!!" :)

For example, I have my font size in Windows Notepad set to a comfortable 
value. But when I want to use hiragana or katakana, I have to go into the 
settings and increase the font size so I can actually read it (Well, to what 
*little* extent I can even read it in the first place ;) ). And those kana 
tend to be among the simplest CJK characters.

(Don't worry - I only use Notepad as a quick-n-dirty scrap space, never for 
real coding/writing).

> So we really need all four lengths. Ain't unicode fun?! :-)
>

No kidding. The *one* thing I really, really hate about Unicode is the fact 
that most (if not all) of its complexity actually *is* necessary.

Unicode *itself* is indisputably necessary, but I sure do miss ASCII.

> Array length is simple.  Walklength is already implemented. Grapheme
> length requires recognition of 'combining characters' (or rather,
> ignoring said characters), and layout length requires recognizing
> widthless, single- and double-width characters.
>

Yup.
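
To make those four concrete, here's a rough sketch. It's in Python rather 
than D, only because Python's standard library exposes the Unicode tables 
needed. All the function names are made up for illustration. The grapheme 
count is just the "ignore combining characters" approximation described 
above, not full grapheme segmentation, and the layout count only looks at 
the East Asian Width property:

```python
import unicodedata

def code_unit_length(s):
    # Number of UTF-8 code units, i.e. what .length gives for a D string.
    return len(s.encode("utf-8"))

def code_point_length(s):
    # Number of code points, i.e. what walkLength gives for a D string.
    return len(s)

def approx_grapheme_length(s):
    # Approximation: count only code points that are not combining marks.
    return sum(1 for c in s if unicodedata.combining(c) == 0)

def approx_layout_length(s):
    # Approximation: combining marks take no columns, Wide/Fullwidth
    # code points take two, everything else takes one.
    width = 0
    for c in s:
        if unicodedata.combining(c) != 0:
            continue
        width += 2 if unicodedata.east_asian_width(c) in ("W", "F") else 1
    return width

# 'e' + COMBINING ACUTE ACCENT, followed by two CJK ideographs
sample = "e\u0301\u65e5\u672c"
print(code_unit_length(sample), code_point_length(sample),
      approx_grapheme_length(sample), approx_layout_length(sample))
```

For that sample string all four numbers come out different, which is 
pretty much the whole problem in a nutshell.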

> I've been thinking about unicode processing recently. Traditionally, we
> have to decode narrow strings into UTF-32 (aka dchar) then do table
> lookups and such. But unicode encoding and properties, etc., are static
> information (at least within a single unicode release). So why bother
> with hardcoding tables and stuff at all?
>
> What we *really* should be doing, esp. for commonly-used functions like
> computing various lengths, is to automatically process said tables and
> encode the computation in finite-state machines that can then be
> optimized at the FSM level (there are known algos for generating optimal
> FSMs), codegen'd, and then optimized again at the assembly level by the
> compiler. These FSMs will operate at the native narrow string char type
> level, so that there will be no need for explicit decoding.
>
> The generation algo can then be run just once per unicode release, and
> everything will Just Work.
>

While I find that very interesting...I'm afraid I don't actually understand 
your suggestion :/ (I do understand FSMs and how they work, though) Could 
you give a little example of what you mean?
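
If I had to guess at the simplest possible case, it'd be something like 
counting code points straight off the UTF-8 bytes, with no decoding step. 
To be clear, this is only my guess at the idea: it's hand-written rather 
than generated from any tables, it's a degenerate one-state machine, the 
function name is made up, and it assumes the input is valid UTF-8 (sketched 
in Python just to keep it short):

```python
def count_code_points_utf8(data):
    # Works directly on the code units (bytes); nothing is decoded.
    count = 0
    for byte in data:
        # Continuation bytes have the form 0b10xxxxxx. Any other byte
        # starts a new code point, so only those get counted.
        if (byte & 0xC0) != 0x80:
            count += 1
    return count

print(count_code_points_utf8("e\u0301\u65e5\u672c".encode("utf-8")))
```

I assume the grapheme and layout versions would need real states and 
generated transition tables, which is the part I can't picture.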



