Notice/Warning on narrowStrings .length

Nick Sabalausky SeeWebsiteToContactMe at semitwist.com
Thu Apr 26 18:03:59 PDT 2012


"H. S. Teoh" <hsteoh at quickfur.ath.cx> wrote in message 
news:mailman.2179.1335486409.4860.digitalmars-d at puremagic.com...
>
> Have you seen U+9F98? It's an insanely convoluted glyph composed of
> *three copies* of an already extremely complex glyph.
>
> http://upload.wikimedia.org/wikipedia/commons/3/3c/U%2B9F98.png
>
> (And yes, that huge thing is supposed to fit inside a SINGLE
> character... what *were* those ancient Chinese scribes thinking?!)
>

Yikes!

>
>> For example, I have my font size in Windows Notepad set to a
>> comfortable value. But when I want to use hiragana or katakana, I have
>> to go into the settings and increase the font size so I can actually
>> read it (Well, to what *little* extent I can even read it in the first
>> place ;) ). And those kana's tend to be among the simplest CJK
>> characters.
>>
>> (Don't worry - I only use Notepad as a quick-n-dirty scrap space,
>> never for real coding/writing).
>
> LOL... love the fact that you felt obligated to justify your use of
> notepad. :-P
>

Heh, any usage of Notepad *needs* to be justified. For example, it has an 
undo buffer of exactly ONE change. And the stupid thing doesn't even handle 
Unix-style newlines. *Everything* handles Unix-style newlines these days, 
even on Windows. Windows *BATCH* files even accept Unix-style newlines, for 
God's sake! But not Notepad.

It is nice in its leanness and no-nonsense-ness. But it desperately needs 
some updates.

At least it actually supports Unicode, though. (Which I find somewhat 
surprising.)

'Course, this is all XP. For all I know they've finally updated it in MS 
OSX, erm, I mean Vista and Win7...

>
>> > So we really need all four lengths. Ain't unicode fun?! :-)
>> >
>>
>> No kidding. The *one* thing I really, really hate about Unicode is the
>> fact that most (if not all) of its complexity actually *is* necessary.
>
> We're lucky the more imaginative scribes of the world have either been
> dead for centuries or have restricted themselves to writing fictional
> languages. :-) The inventions of the dead ones have been codified and
> simplified by the unfortunate people who inherited their overly complex
> systems (*cough*CJK glyphs*cough), and the inventions of the living ones
> are largely ignored by the world due to the fact that, well, their
> scripts are only useful for writing fictional languages. :-)
>
> So despite the fact that there are still some crazy convoluted stuff out
> there, such as Arabic or Indic scripts with pair-wise substitution rules
> in Unicode, overall things are relatively tame. At least the
> subcomponents of CJK glyphs are no longer productive (actively being
> used to compose new characters by script users) -- can you imagine the
> insanity if Unicode had to support composition by those radicals and
> subparts? Or if Unicode had to support a script like this one:
>
> http://www.arthaey.com/conlang/ashaille/writing/sarapin.html
>
> whose components are graphically composed in, shall we say, entirely
> non-trivial ways (see the composed samples at the bottom of the page)?
>

That's insane!

And yet, very very interesting...
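
BTW, on the "four lengths" point above, here's a quick-n-dirty D 
illustration of three of them for one narrow string, with the fourth left 
as a comment. Treat the byGrapheme call as a placeholder for whatever 
grapheme range std.uni ends up providing; the rest is plain std.range.

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme; // placeholder: assumes std.uni grows grapheme segmentation

void main()
{
    // "noël" written with a combining diaeresis, a space, and U+9F98
    string s = "noe\u0308l \u9F98";

    writeln(s.length);                // 10 -- code units (UTF-8 bytes)
    writeln(s.walkLength);            // 7  -- code points (dchars)
    writeln(s.byGrapheme.walkLength); // 6  -- graphemes ("noël" is 4 of them)
    // The fourth length, layout width, is different yet again: the
    // combining mark takes no column and the CJK glyph typically takes two.
}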

>>
>> While I find that very interesting...I'm afraid I don't actually
>> understand your suggestion :/ (I do understand FSM's and how they
>> work, though) Could you give a little example of what you mean?
> [...]
>
> Currently, std.uni code (argh the pun!!)

Hah! :)

> is hand-written with tables of
> which character belongs to which class, etc.. These hand-coded tables
> are error-prone and unnecessary. For example, think of computing the
> layout width of a UTF-8 stream. Why waste time decoding into dchar, and
> then doing all sorts of table lookups to compute the width? Instead,
> treat the stream as a byte stream, with certain sequences of bytes
> evaluating to length 2, others to length 1, and yet others to length 0.
>
> A lexer engine is perfectly suited for recognizing these kinds of
> sequences with optimal speed. The only difference from a real lexer is
> that instead of spitting out tokens, it keeps a running total (layout)
> length, which is output at the end.
>
> So what we should do is to write a tool that processes Unicode.txt (the
> official table of character properties from the Unicode standard) and
> generates lexer engines that compute various Unicode properties
> (grapheme count, layout length, etc.) for each of the UTF encodings.
>
> This way, we get optimal speed for these algorithms, plus we don't need
> to manually maintain tables and stuff, we just run the tool on
> Unicode.txt each time there's a new Unicode release, and the correct
> code will be generated automatically.
>

I see. I think that's a very good observation, and a great suggestion. In 
fact, I'd imagine it'd be considerably simpler than a typical lexer 
generator; much less of the fancy regexy-ness would be needed. Maybe put 
together a pull request if you get the time...? Something along the lines 
of the rough sketches below, perhaps.
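
On the runtime half, just to make it concrete: a decode-free layout-width 
counter over the raw UTF-8 bytes might look something like the sketch 
below. The byte-range rules in it are a grossly oversimplified stand-in 
for what real generated tables would contain (combining marks, half-width 
forms, etc. are all ignored), so it's purely to illustrate the idea.

import std.stdio;

// Rough sketch: walk the raw UTF-8 bytes and keep a running layout width
// without ever decoding to dchar. The real thing would drive this from
// tables/states generated out of the Unicode data files.
size_t layoutWidth(const(char)[] s)
{
    size_t width = 0;
    foreach (ubyte b; cast(const(ubyte)[]) s)
    {
        if (b < 0x80)                    width += 1; // ASCII: one column
        else if (b < 0xC0)               continue;   // continuation byte: counted at its lead byte
        else if (b >= 0xE4 && b <= 0xE9) width += 2; // lead bytes covering U+4000..U+9FFF: mostly CJK, call it wide
        else                             width += 1; // everything else: pretend single-width
    }
    return width;
}

void main()
{
    writeln(layoutWidth("hello")); // 5
    writeln(layoutWidth("日本語")); // 6
}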
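
And the generator half of the tool could start life as something as dumb 
as this: chew through one of the per-property data files (I'm assuming 
EastAsianWidth.txt and its usual "range;class # comment" layout here) and 
spit out D source. A real version would emit proper state-machine tables 
instead of an if-chain, but the shape of the tool is the same:

import std.stdio, std.string, std.algorithm, std.conv, std.array;

void main(string[] args)
{
    // Emit a D predicate for double-width code points, one test per W/F range.
    writeln("bool isDoubleWidth(dchar c)\n{");
    foreach (line; File(args.length > 1 ? args[1] : "EastAsianWidth.txt").byLine())
    {
        auto stripped = line.findSplitBefore("#")[0].strip;
        if (stripped.length == 0) continue;
        auto fields = stripped.split(";");
        if (fields.length < 2) continue;
        auto cls = fields[1].strip;
        if (cls != "W" && cls != "F") continue; // only Wide/Fullwidth entries
        auto range = fields[0].strip.split("..");
        auto lo = range[0].to!uint(16);
        auto hi = (range.length > 1 ? range[1] : range[0]).to!uint(16);
        writefln("    if (c >= 0x%04X && c <= 0x%04X) return true;", lo, hi);
    }
    writeln("    return false;\n}");
}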



