Today's programming challenge - How's your Range-Fu ?

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Mon Apr 20 11:21:34 PDT 2015


On Mon, Apr 20, 2015 at 06:03:49PM +0000, John Colvin via Digitalmars-d wrote:
> On Monday, 20 April 2015 at 17:48:17 UTC, Panke wrote:
> >To measure the columns needed to print a string, you'll need the
> >number of graphemes. (d|)?string.length gives you the number of code
> >units.
> 
> Even that's not really true. In the end it's up to the font and layout
> engine to decide how much space anything takes up. Unicode doesn't
> play nicely with the idea of text as a grid of rows and fixed-width
> columns of characters, although quite a lot can (and is, see urxvt for
> example) be shoe-horned in.

Yeah, even the grapheme count does not necessarily tell you how wide the
printed string really is. Characters in the CJK blocks are usually
rendered with glyphs that are roughly twice as wide as your typical
Latin/Cyrillic character, so even applications like urxvt that shoehorn
proportional-width fonts into a text grid render CJK characters as two
columns rather than one.

Because of this, I actually wrote a function at one time to determine
the width of a given Unicode character (i.e., zero, one, or two columns)
as displayed in urxvt. Obviously, this is no help if you need to wrap
lines rendered with a proportional font. And it doesn't even attempt to
work correctly with bidi text.
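
A minimal sketch of that kind of per-character width function, in Python
rather than D because Python's standard unicodedata module exposes the
East Asian Width property directly. This is not the function mentioned
above; the names char_columns and string_columns are made up for
illustration, and the rules are an assumption about what a terminal
such as urxvt roughly does. Control characters, bidi, and emoji
sequences are not handled.

```python
import unicodedata

def char_columns(ch: str) -> int:
    """Approximate terminal width of one code point: 0, 1, or 2 columns."""
    # Assumption: combining marks (Mn, Me) and format characters (Cf,
    # e.g. zero-width space or joiner) occupy no column of their own.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide (W) and Fullwidth (F) characters take two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    # Everything else is treated as one column, including control
    # characters, which a real implementation would special-case.
    return 1

def string_columns(s: str) -> int:
    """Sum of the per-code-point widths of s."""
    return sum(char_columns(ch) for ch in s)

if __name__ == "__main__":
    for sample in ("abc", "日本語", "e\u0301"):
        print(repr(sample), len(sample), string_columns(sample))
```

With these rules, three CJK characters count as six columns while "e"
followed by a combining acute accent counts as one, which is exactly
where a plain length-based wrap goes wrong.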

This is why I said at the beginning that wrapping a line of text is a
LOT harder than it sounds. A function that only takes a string as input
does not have the necessary information to do this correctly in all use
cases. The current wrap() function doesn't even do it correctly modulo
the information available: it doesn't handle combining diacritics and
zero-width characters properly. In fact, it doesn't even handle control
characters properly, except perhaps for \t and \n. There are so many
things wrong with the current wrap() function (and many other
string-processing functions in Phobos) that it makes it look like a joke
when we claim that D provides Unicode correctness out-of-the-box.
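
To make the combining-diacritic problem concrete, here is a small
Python sketch (Python only because its standard library makes the
counts easy to show; length_views is a made-up name). It compares the
numbers that D's string.length, wstring.length and dstring.length would
report against a count that skips combining marks. Skipping combining
marks is only a crude stand-in for real grapheme segmentation and is
labeled as such in the code.

```python
import unicodedata

def length_views(s: str) -> dict:
    """Return several different 'lengths' of the same text."""
    return {
        # UTF-8 code units: what string.length reports in D.
        "utf8_units": len(s.encode("utf-8")),
        # UTF-16 code units: what wstring.length reports in D.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        # Code points: what dstring.length reports in D.
        "code_points": len(s),
        # Crude approximation of the grapheme count: code points whose
        # canonical combining class is 0. Not full grapheme segmentation.
        "non_combining": sum(1 for ch in s if unicodedata.combining(ch) == 0),
    }

if __name__ == "__main__":
    # "résumé" written with combining acute accents (U+0301).
    print(length_views("re\u0301sume\u0301"))
```

For that input the four counts are 10, 8, 8 and 6. Only the last one
matches the six columns a terminal would use, so a wrap routine keyed on
any of the first three breaks the line too early.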

The only use case where wrap() gives the correct result is when you
stick with pre-Unicode Latin strings to be displayed on a text console.
As such, I don't really see the general utility of wrap() as it
currently stands, and I question its value in Phobos compared to a
more useful implementation that, for instance, correctly implements
the Unicode line-breaking algorithm.


T

-- 
It said to install Windows 2000 or better, so I installed Linux instead.

