Notice/Warning on narrowStrings .length

Jonathan M Davis jmdavisProg at gmx.com
Thu Apr 26 18:17:04 PDT 2012


On Thursday, April 26, 2012 17:26:40 H. S. Teoh wrote:
> Currently, std.uni code (argh the pun!!) is hand-written with tables of
> which character belongs to which class, etc. These hand-coded tables
> are error-prone and unnecessary. For example, think of computing the
> layout width of a UTF-8 stream. Why waste time decoding into dchar, and
> then doing all sorts of table lookups to compute the width? Instead,
> treat the stream as a byte stream, with certain sequences of bytes
> evaluating to length 2, others to length 1, and yet others to length 0.
> 
> A lexer engine is perfectly suited for recognizing these kinds of
> sequences with optimal speed. The only difference from a real lexer is
> that instead of spitting out tokens, it keeps a running total (layout)
> length, which is output at the end.
> 
> So what we should do is to write a tool that processes Unicode.txt (the
> official table of character properties from the Unicode standard) and
> generates lexer engines that compute various Unicode properties
> (grapheme count, layout length, etc.) for each of the UTF encodings.
> 
> This way, we get optimal speed for these algorithms, plus we don't need
> to manually maintain tables and stuff; we just run the tool on
> Unicode.txt each time there's a new Unicode release, and the correct
> code will be generated automatically.

That's a fantastic idea! Of course, that leaves the job of implementing it... :)

- Jonathan M Davis
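
To make the quoted idea concrete, here is a minimal sketch of computing a layout width directly from UTF-8 bytes, without first decoding them to code points. It is not std.uni code and not the generated lexer the proposal describes. The function name utf8_layout_width and the width classes are assumptions made up for illustration: only U+0300..U+036F (combining diacritics, width 0) and U+4E00..U+9FFF (CJK ideographs, width 2) are special-cased by hand, and everything else counts as width 1. The input is assumed to be well-formed UTF-8, and continuation bytes are not validated.

```python
def utf8_layout_width(data: bytes) -> int:
    """Sum a toy layout width over UTF-8 bytes without decoding to code points.

    ASSUMPTION: the width classes below are a tiny hand-picked subset chosen
    for illustration, not real Unicode property data. A generated lexer would
    replace these hand-written byte ranges with ranges derived from the
    Unicode data files.
    """
    width = 0
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # The lead byte alone determines the length of the sequence.
        if b0 < 0x80:
            size, w = 1, 1          # ASCII
        elif 0xC2 <= b0 <= 0xDF:
            size, w = 2, 1
        elif 0xE0 <= b0 <= 0xEF:
            size, w = 3, 1
        elif 0xF0 <= b0 <= 0xF4:
            size, w = 4, 1
        else:
            raise ValueError(f"invalid UTF-8 lead byte 0x{b0:02X} at offset {i}")
        if i + size > n:
            raise ValueError("truncated UTF-8 sequence at end of input")

        if size == 2:
            b1 = data[i + 1]
            # U+0300..U+036F encode as the byte range CC 80 .. CD AF.
            if b0 == 0xCC or (b0 == 0xCD and b1 <= 0xAF):
                w = 0
        elif size == 3:
            b1 = data[i + 1]
            # U+4E00..U+9FFF encode as the byte range E4 B8 80 .. E9 BF BF.
            if (b0 == 0xE4 and b1 >= 0xB8) or 0xE5 <= b0 <= 0xE9:
                w = 2

        width += w
        i += size
    return width
```

Under these toy rules, utf8_layout_width("abc".encode("utf-8")) gives 3, the three ideographs in "日本語" give 6, and "e" followed by U+0301 gives 1. The point of the sketch is only that the byte patterns themselves carry enough information to classify each character; a real tool would generate the equivalent of the if-chain as a state machine rather than have anyone maintain it by hand.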

