Notice/Warning on narrowStrings .length

H. S. Teoh hsteoh at quickfur.ath.cx
Thu Apr 26 17:26:40 PDT 2012


On Thu, Apr 26, 2012 at 06:13:00PM -0400, Nick Sabalausky wrote:
> "H. S. Teoh" <hsteoh at quickfur.ath.cx> wrote in message 
> news:mailman.2173.1335475413.4860.digitalmars-d at puremagic.com...
[...]
> > And don't forget that some code points (notably from the CJK block)
> > are specified as "double-width", so if you're trying to do text
> > layout, you'll want yet a different length (layoutLength?).
> >

Correction: the official term for this is "full-width" (as opposed to
the "half-width" of the typical European scripts).
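For a quick look at the distinction, here is a minimal sketch using Python's standard unicodedata module, which exposes the East Asian Width property. The helper name layout_width is made up for illustration, and the rule "wide/full-width counts 2, combining marks count 0, everything else counts 1" is a simplifying assumption, not a complete layout algorithm.

```python
import unicodedata

def layout_width(text):
    """Approximate cell count: W/F = 2 cells, combining marks = 0, else 1.

    Hypothetical helper; this is the traditional decode-then-look-up
    approach, working on code points rather than raw bytes.
    """
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks share the cell of their base character
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

# ASCII 'A', FULLWIDTH LATIN CAPITAL LETTER A, a CJK ideograph,
# and HALFWIDTH KATAKANA LETTER A.
samples = ["A", "\uff21", "\u65e5", "\uff71"]
classes = [unicodedata.east_asian_width(ch) for ch in samples]
print(classes)                      # ['Na', 'F', 'W', 'H']
print(layout_width("A\u65e5"))      # 3
print(layout_width("e\u0301"))      # 1
```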


> Interesting. Kinda makes sense that such a thing exists, though: The CJK
> characters (even the relatively simple Japanese *kanas) are detailed
> enough that they need to be larger to achieve the same readability.
> And that's the *non*-double-length ones. So I don't doubt there's ones
> that need to be tagged as "Draw Extra Big!!" :)

Have you seen U+9F98? It's an insanely convoluted glyph composed of
*three copies* of an already extremely complex glyph.

	http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png

(And yes, that huge thing is supposed to fit inside a SINGLE
character... what *were* those ancient Chinese scribes thinking?!)


> For example, I have my font size in Windows Notepad set to a
> comfortable value. But when I want to use hiragana or katakana, I have
> to go into the settings and increase the font size so I can actually
> read it (Well, to what *little* extent I can even read it in the first
> place ;) ). And those kana's tend to be among the simplest CJK
> characters.
> 
> (Don't worry - I only use Notepad as a quick-n-dirty scrap space,
> never for real coding/writing).

LOL... love the fact that you felt obligated to justify your use of
notepad. :-P


> > So we really need all four lengths. Ain't unicode fun?! :-)
> >
> 
> No kidding. The *one* thing I really, really hate about Unicode is the
> fact that most (if not all) of its complexity actually *is* necessary.

We're lucky the more imaginative scribes of the world have either been
dead for centuries or have restricted themselves to writing fictional
languages. :-) The inventions of the dead ones have been codified and
simplified by the unfortunate people who inherited their overly complex
systems (*cough*CJK glyphs*cough*), and the inventions of the living ones
are largely ignored by the world due to the fact that, well, their
scripts are only useful for writing fictional languages. :-)

So despite the fact that there is still some crazy convoluted stuff out
there, such as Arabic or Indic scripts with pair-wise substitution rules
in Unicode, overall things are relatively tame. At least the
subcomponents of CJK glyphs are no longer productive (actively being
used to compose new characters by script users) -- can you imagine the
insanity if Unicode had to support composition by those radicals and
subparts? Or if Unicode had to support a script like this one:

	http://www.arthaey.com/conlang/ashaille/writing/sarapin.html

whose components are graphically composed in, shall we say, entirely
non-trivial ways (see the composed samples at the bottom of the page)?


> Unicode *itself* is undisputably necessary, but I do sure miss ASCII.

In an ideal world, where memory is not an issue and bus width is
indefinitely wide, a Unicode string would simply be a sequence of
integers (of arbitrary size). Things like combining diacritics, etc.,
would have dedicated bits/digits for representing them, so there's no
need for the complexity of UTF-8, UTF-16, etc. Everything fits into a
single character. Every possible combination of diacritics on every
possible character has a unique representation as a single integer.
String length would be equal to glyph count.

In such an ideal world, screens would also be of indefinitely detailed
resolution, so anything can fit inside a single grid cell, so there's no
need for half-width/full-width distinctions.  You could port ancient
ASCII-centric C code just by increasing sizeof(char), and things would
Just Work.

Yeah I know. Totally impossible. But one can dream, right? :-)


[...]
> > I've been thinking about unicode processing recently. Traditionally,
> > we have to decode narrow strings into UTF-32 (aka dchar) then do
> > table lookups and such. But unicode encoding and properties, etc.,
> > are static information (at least within a single unicode release).
> > So why bother with hardcoding tables and stuff at all?
> >
> > What we *really* should be doing, esp. for commonly-used functions
> > like computing various lengths, is to automatically process said
> > tables and encode the computation in finite-state machines that can
> > then be optimized at the FSM level (there are known algos for
> > generating optimal FSMs), codegen'd, and then optimized again at the
> > assembly level by the compiler. These FSMs will operate at the
> > native narrow string char type level, so that there will be no need
> > for explicit decoding.
> >
> > The generation algo can then be run just once per unicode release,
> > and everything will Just Work.
> >
> 
> While I find that very interesting...I'm afraid I don't actually
> understand your suggestion :/ (I do understand FSM's and how they
> work, though) Could you give a little example of what you mean?
[...]

Currently, std.uni code (argh the pun!!) is hand-written with tables of
which character belongs to which class, etc. These hand-coded tables
are error-prone and unnecessary. For example, think of computing the
layout width of a UTF-8 stream. Why waste time decoding into dchar, and
then doing all sorts of table lookups to compute the width? Instead,
treat the stream as a byte stream, with certain sequences of bytes
evaluating to length 2, others to length 1, and yet others to length 0.

A lexer engine is perfectly suited for recognizing these kinds of
sequences with optimal speed. The only difference from a real lexer is
that instead of spitting out tokens, it keeps a running total (layout)
length, which is output at the end.
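To make that concrete, here is a rough hand-written sketch of such a byte-level state machine, in Python for brevity. It is not generated and not complete: it assumes valid UTF-8 input and knows about only two ranges, the combining diacritics U+0300..U+036F (width 0) and the CJK ideographs U+4E00..U+9FFF (width 2), with everything else counted as width 1. The state names and the function name are invented for illustration.

```python
# States of the byte-level machine (names are illustrative only).
START, SKIP1, SKIP2, SKIP3, CD_2ND, E4_2ND = range(6)

def utf8_layout_width(data):
    """Sum layout widths straight from UTF-8 bytes, never building code points.

    Simplified table: U+0300..U+036F -> 0, U+4E00..U+9FFF -> 2, rest -> 1.
    Assumes `data` is valid UTF-8; continuation bytes are not validated.
    """
    state = START
    total = 0
    for b in data:
        if state == START:
            if b < 0x80:                 # ASCII
                total += 1
            elif b == 0xCC:              # U+0300..U+033F: all combining
                state = SKIP1
            elif b == 0xCD:              # U+0340..U+037F: second byte decides
                state = CD_2ND
            elif 0xC2 <= b <= 0xDF:      # other two-byte sequences
                total += 1
                state = SKIP1
            elif b == 0xE4:              # U+4000..U+4FFF: second byte decides
                state = E4_2ND
            elif 0xE5 <= b <= 0xE9:      # U+5000..U+9FFF: all wide
                total += 2
                state = SKIP2
            elif 0xE0 <= b <= 0xEF:      # other three-byte sequences
                total += 1
                state = SKIP2
            elif 0xF0 <= b <= 0xF4:      # four-byte sequences
                total += 1
                state = SKIP3
            else:
                raise ValueError("unexpected byte 0x%02X" % b)
        elif state == CD_2ND:
            total += 0 if b <= 0xAF else 1   # CD 80..AF is U+0340..U+036F
            state = START
        elif state == E4_2ND:
            total += 2 if b >= 0xB8 else 1   # E4 B8 80 is U+4E00
            state = SKIP1
        elif state == SKIP3:
            state = SKIP2
        elif state == SKIP2:
            state = SKIP1
        else:                            # SKIP1: last continuation byte
            state = START
    return total

print(utf8_layout_width("A\u65e5\u672c".encode("utf-8")))   # 5
print(utf8_layout_width("e\u0301".encode("utf-8")))         # 1
```

A generated engine would presumably be table-driven and cover the full property data, but the shape is the same: one state transition per byte and a running total.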

So what we should do is write a tool that processes UnicodeData.txt (the
official table of character properties from the Unicode standard) and
generates lexer engines that compute various Unicode properties
(grapheme count, layout length, etc.) for each of the UTF encodings.

This way, we get optimal speed for these algorithms, and we don't need
to manually maintain tables and stuff: we just run the tool on
UnicodeData.txt each time there's a new Unicode release, and the correct
code will be generated automatically.
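As a very rough sketch of the first stage of such a tool (again in Python), the following parses lines in the "range;property # comment" style used by Unicode data files such as EastAsianWidth.txt, merges adjacent ranges, and emits source text for a table. The function names, the choice of input file, and the sample lines are assumptions for illustration; generating an actual byte-level lexer from the ranges is left out.

```python
def parse_width_ranges(lines, wanted=("W", "F")):
    """Collect merged (lo, hi) code point ranges whose property is in `wanted`."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()      # drop comments and blanks
        if not data:
            continue
        field, prop = (part.strip() for part in data.split(";")[:2])
        if prop not in wanted:
            continue
        lo, _, hi = field.partition("..")         # "4E00..9FFF" or just "3000"
        ranges.append((int(lo, 16), int(hi or lo, 16)))
    ranges.sort()
    merged = []
    for lo, hi in ranges:
        if merged and lo <= merged[-1][1] + 1:    # touches or overlaps previous
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged

def emit_table(name, ranges):
    """Return source text defining `name` as a list of (lo, hi) tuples."""
    rows = ",\n".join("    (0x%04X, 0x%04X)" % (lo, hi) for lo, hi in ranges)
    return "%s = [\n%s,\n]\n" % (name, rows)

# Made-up sample in the style of the real data file.
sample = """\
# sample data, not the real file
3000;F           # IDEOGRAPHIC SPACE
3001..3003;W     # punctuation
0041..005A;Na    # LATIN CAPITAL LETTERS
4E00..9FFF;W     # CJK UNIFIED IDEOGRAPHS
""".splitlines()

wide = parse_width_ranges(sample)
generated = emit_table("WIDE_RANGES", wide)
print(generated)
```

Rerunning this on the data files of a new Unicode release regenerates the table with no hand editing, which is the whole point.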


T

-- 
Public parking: euphemism for paid parking. -- Flora

