Notice/Warning on narrowStrings .length
Nick Sabalausky
SeeWebsiteToContactMe at semitwist.com
Thu Apr 26 15:13:00 PDT 2012
"H. S. Teoh" <hsteoh at quickfur.ath.cx> wrote in message
news:mailman.2173.1335475413.4860.digitalmars-d at puremagic.com...
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
>> "James Miller" <james at aatch.net> wrote in message
>> news:qdgacdzxkhmhojqcettj at forum.dlang.org...
>> > I'm writing an introduction/tutorial to using strings in D, paying
>> > particular attention to the complexities of UTF-8 and 16. I realised
>> > that when you want the number of characters, you normally actually
>> > want to use walkLength, not length. Is it reasonable for the
>> > compiler to pick this up during semantic analysis and point out this
>> > situation?
>> >
>> > It's just a thought because a lot of the time, using length will get
>> > the right answer, but for the wrong reasons, resulting in lurking
>> > bugs. You can always cast to immutable(ubyte)[] or
>> > immutable(short)[] if you want to work with the actual bytes anyway.
>>
>> I find that most of the time I actually *do* want to use length. Don't
>> know if that's common, though, or if it's just a reflection of my
>> particular use-cases.
>>
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (ie, graphemes), but merely the
>> number of code points - which is not the same thing (due to existence
>> of the [confusingly-named] "combining characters").
> [...]
>
> And don't forget that some code points (notably from the CJK block) are
> specified as "double-width", so if you're trying to do text layout,
> you'll want yet a different length (layoutLength?).
>
Interesting. Kinda makes sense that such a thing exists, though: The CJK
characters (even the relatively simple Japanese *kanas) are detailed enough
that they need to be larger to achieve the same readability. And that's the
*non*-double-width ones. So I don't doubt there are ones that need to be
tagged as "Draw Extra Big!!" :)

For example, I have my font size in Windows Notepad set to a comfortable
value. But when I want to use hiragana or katakana, I have to go into the
settings and increase the font size so I can actually read it (Well, to what
*little* extent I can even read it in the first place ;) ). And those kana
tend to be among the simplest CJK characters.

(Don't worry - I only use Notepad as a quick-n-dirty scrap space, never for
real coding/writing).

> So we really need all four lengths. Ain't unicode fun?! :-)
>
No kidding. The *one* thing I really, really hate about Unicode is the fact
that most (if not all) of its complexity actually *is* necessary.

Unicode *itself* is indisputably necessary, but I do sure miss ASCII.

> Array length is simple. walkLength is already implemented. Grapheme
> length requires recognition of 'combining characters' (or rather,
> ignoring said characters), and layout length requires recognizing
> widthless, single- and double-width characters.
>
Yup.
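A rough sketch of the four lengths side by side, assuming a Phobos recent
enough to have std.uni.byGrapheme. Note that layoutLength is a made-up
helper (not part of Phobos) and its width table is a deliberately tiny
stand-in for the real Unicode data:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helper, not part of Phobos: a crude display-width estimate.
// Only a few ranges are covered; a real version would be generated from
// the Unicode data files rather than typed in by hand.
size_t layoutLength(string s)
{
    size_t width = 0;
    foreach (dchar c; s) // foreach with dchar decodes the UTF-8 on the fly
    {
        if (c >= 0x0300 && c <= 0x036F)
            continue; // combining diacritical marks: zero width
        immutable wide = (c >= 0x3040 && c <= 0x30FF)  // hiragana, katakana
                      || (c >= 0x4E00 && c <= 0x9FFF)  // CJK unified ideographs
                      || (c >= 0xFF01 && c <= 0xFF60); // fullwidth forms
        width += wide ? 2 : 1;
    }
    return width;
}

void main()
{
    string s = "e\u0301\u65E5"; // 'e' + combining acute accent + U+65E5
    writeln(s.length);                // 6: UTF-8 code units
    writeln(s.walkLength);            // 3: code points
    writeln(s.byGrapheme.walkLength); // 2: graphemes
    writeln(layoutLength(s));         // 3: display columns (rough estimate)
}
```

Four different answers for one three-code-point string, which is why plain
.length is only right when code units are really what you want.
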
> I've been thinking about unicode processing recently. Traditionally, we
> have to decode narrow strings into UTF-32 (aka dchar) then do table
> lookups and such. But unicode encoding and properties, etc., are static
> information (at least within a single unicode release). So why bother
> with hardcoding tables and stuff at all?
>
> What we *really* should be doing, esp. for commonly-used functions like
> computing various lengths, is to automatically process said tables and
> encode the computation in finite-state machines that can then be
> optimized at the FSM level (there are known algos for generating optimal
> FSMs), codegen'd, and then optimized again at the assembly level by the
> compiler. These FSMs will operate at the native narrow string char type
> level, so that there will be no need for explicit decoding.
>
> The generation algo can then be run just once per unicode release, and
> everything will Just Work.
>
While I find that very interesting...I'm afraid I don't actually understand
your suggestion :/ (I do understand FSMs and how they work, though) Could
you give a little example of what you mean?
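
To make the question concrete, here is one guess at what such a machine
might look like. It is only an illustration and not necessarily what is
being proposed: a hand-written state machine that counts code points other
than combining marks by looking at raw UTF-8 code units, with no decoding
to dchar. The name countBaseCodePoints is made up, the input is assumed to
be valid UTF-8, and only the combining block U+0300..U+036F is handled; the
real thing would presumably be generated from the full Unicode tables.

```d
import std.stdio : writeln;

// Hypothetical hand-written state machine. Assumes valid UTF-8 and knows
// only the combining block U+0300..U+036F (lead bytes 0xCC and 0xCD).
size_t countBaseCodePoints(const(char)[] s)
{
    enum State { ground, sawCC, sawCD }
    auto state = State.ground;
    size_t count = 0;
    foreach (char c; s) // iterating as char gives raw code units
    {
        final switch (state)
        {
        case State.ground:
            if (c == 0xCC) state = State.sawCC;
            else if (c == 0xCD) state = State.sawCD;
            else if ((c & 0xC0) != 0x80) ++count; // ASCII or a lead byte
            break;
        case State.sawCC:
            // 0xCC 0x80..0xBF is U+0300..U+033F: always a combining mark
            state = State.ground;
            break;
        case State.sawCD:
            // 0xCD 0x80..0xAF is U+0340..U+036F (combining, skipped);
            // 0xCD 0xB0..0xBF is U+0370..U+037F (counted)
            if (c > 0xAF) ++count;
            state = State.ground;
            break;
        }
    }
    return count;
}

void main()
{
    writeln(countBaseCodePoints("e\u0301"));    // 1
    writeln(countBaseCodePoints("noe\u0308l")); // 4
}
```

Is that roughly the idea, just with the states and transitions produced by
a generator instead of by hand?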