Today's programming challenge - How's your Range-Fu ?

H. S. Teoh via Digitalmars-d digitalmars-d at puremagic.com
Fri Apr 17 09:59:40 PDT 2015


On Fri, Apr 17, 2015 at 02:09:07AM -0700, Walter Bright via Digitalmars-d wrote:
> Challenge level - Moderately easy
>
> Consider the function std.string.wrap:
> 
>   http://dlang.org/phobos/std_string.html#.wrap
> 
> It takes a string as input, and returns a GC allocated string that is
> word-wrapped. It needs to be enhanced to:
> 
> 1. Accept a ForwardRange as input.
> 2. Return a lazy ForwardRange that delivers the characters of the
> wrapped result one by one on demand.
> 3. Not allocate any memory.
> 4. The element encoding type of the returned range must be the same as
> the element encoding type of the input.

This is harder than it looks at first sight, actually. Mostly thanks to
the complexity of Unicode... you need to identify zero-width,
normal-width, and double-width characters, combining diacritics, various
kinds of spaces (e.g. cannot break on non-breaking space) and treat them
accordingly.  Which requires decoding.  (Well, in theory std.uni could
be enhanced to work directly with encoded data, but right now it
doesn't. In any case this is outside the scope of this challenge, I
think.)
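
To make the classification concrete, here is a rough sketch of what a per-code-point width function might look like once the input has been decoded. The names columnWidth and isBreakableSpace are made up for illustration; std.uni has no East Asian Width query that I know of, so the double-width ranges below are a hand-picked, incomplete sample and not the real table.

```d
import std.uni : isMark, isWhite;

/// Rough display width of one decoded code point: 0, 1, or 2 columns.
/// ASSUMPTION: the double-width ranges are a small hand-picked sample,
/// not the full East Asian Width property.
int columnWidth(dchar c)
{
    if (isMark(c) || c == '\u200B')   // combining marks, zero-width space
        return 0;
    if ((c >= 0x1100 && c <= 0x115F)  // Hangul Jamo leading consonants
     || (c >= 0x3040 && c <= 0x30FF)  // Hiragana and Katakana
     || (c >= 0x4E00 && c <= 0x9FFF)  // CJK Unified Ideographs
     || (c >= 0xAC00 && c <= 0xD7A3)  // Hangul syllables
     || (c >= 0xFF01 && c <= 0xFF60)) // fullwidth forms
        return 2;
    return 1;
}

/// True if a line may be broken at this character.
/// isWhite also accepts the no-break spaces, so exclude them by hand.
bool isBreakableSpace(dchar c)
{
    return isWhite(c)
        && c != '\u00A0'   // NO-BREAK SPACE
        && c != '\u2007'   // FIGURE SPACE
        && c != '\u202F';  // NARROW NO-BREAK SPACE
}
```

Both functions take a dchar, which is exactly why the input has to be decoded first.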

Unfortunately, the only reliable way I know of currently that can deal
with the spacing of Unicode characters correctly is to segment the input
with byGrapheme, which currently is GC-dependent. So this fails (3).
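
A small illustration of why the segmentation matters: the same text gives three different "lengths" depending on whether you count code units, code points, or graphemes, and only the last is close to what a column counter wants. This only uses std.uni.byGrapheme and std.range.walkLength; it says nothing about how byGrapheme manages memory internally.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // "noel" with a combining diaeresis (U+0308) after the 'e'
    auto text = "noe\u0308l";

    assert(text.length == 6);                // UTF-8 code units
    assert(text.walkLength == 5);            // decoded code points
    assert(text.byGrapheme.walkLength == 4); // user-perceived characters
}
```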

There's also the question of what to do with bidi markings: how do you
handle counting the columns in that case?

Of course, if you forgo Unicode correctness, then you *could* just
word-wrap on a per-character basis (i.e., every character counts as 1
column), but this also makes the resulting code useless as far as
dealing with general Unicode data is concerned -- it'd only work for
ASCII, and various character ranges inherited from the old 8-bit
European encodings. Not to mention, line-breaking in Chinese text
cannot work as prescribed anyway, because the rules are different (you
can break anywhere at a character boundary except around punctuation --
Chinese writing does not separate words with space characters). The
same applies to Korean/Japanese.

So either you have to throw out all pretenses of Unicode-correctness and
just stick with ASCII-style per-character line-wrapping, or you have to
live with byGrapheme with all the complexity that it entails. The former
is quite easy to write -- I could throw it together in a couple o' hours
max -- but the latter is a pretty big project (cf. the Unicode Line
Breaking Algorithm, UAX #14, which is one of the TR's).
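
For the easy option, a minimal sketch might look like the following. It is not a replacement for std.string.wrap: the names Wrapped and wrapped are invented here, every element counts as exactly one column, the only break opportunity is an ASCII space, and a word longer than the line simply overflows. It yields ElementType!R, so to stay in code units (requirement 4) a string has to be passed through std.utf.byCodeUnit first; a bare string would come out as dchar because of auto-decoding.

```d
import std.algorithm.comparison : equal;
import std.range.primitives;
import std.traits : Unqual;
import std.utf : byCodeUnit;

/// Lazy one-column-per-element word wrap (hypothetical sketch).
struct Wrapped(R) if (isForwardRange!R)
{
    private R src;
    private size_t columns;
    private size_t col;                  // columns used on the current line
    private Unqual!(ElementType!R) cur;  // element currently at the front
    private bool done;

    this(R src, size_t columns)
    {
        this.src = src;
        this.columns = columns;
        prime();
    }

    // Decide what to emit for the current source element.
    private void prime()
    {
        if (src.empty) { done = true; return; }
        cur = src.front;
        if (cur == ' ')
        {
            // Measure the next word on a saved copy; nothing is buffered.
            auto look = src.save;
            look.popFront();
            size_t len = 0;
            for (; !look.empty && look.front != ' '; look.popFront())
                ++len;
            // If the space plus the word would not fit, break here instead.
            if (col + 1 + len > columns)
                cur = '\n';
        }
    }

    @property bool empty() const { return done; }
    @property auto front() const { return cur; }

    void popFront()
    {
        col = (cur == '\n') ? 0 : col + 1;
        src.popFront();
        prime();
    }

    @property auto save()
    {
        auto r = this;
        r.src = src.save;
        return r;
    }
}

/// Convenience constructor (hypothetical name).
auto wrapped(R)(R src, size_t columns) if (isForwardRange!R)
{
    return Wrapped!R(src, columns);
}

unittest
{
    auto r = "the quick brown fox".byCodeUnit.wrapped(10);
    assert(equal(r, "the quick\nbrown fox".byCodeUnit));
}
```

The lookahead re-scans each word once via save, so the extra work is linear in the input, and the state is a handful of fields in the struct. None of the Unicode concerns above are addressed.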


T

-- 
All problems are easy in retrospect.