Using decodeFront with a generalised input range

Jonathan M Davis newsgroup.d at jmdavisprog.com
Fri Nov 9 11:24:42 UTC 2018


On Friday, November 9, 2018 3:45:49 AM MST Vinay Sajip via Digitalmars-d-
learn wrote:
> On Friday, 9 November 2018 at 10:26:46 UTC, Dennis wrote:
> > On Friday, 9 November 2018 at 09:47:32 UTC, Vinay Sajip wrote:
> >> std.utf.decodeFront(Flag useReplacementDchar =
> >> No.useReplacementDchar, S)(ref S str) if (isInputRange!S &&
> >> isSomeChar!(ElementType!S))
> >
> > This is the overload you want, let's check if it matches:
> > ref S str - your InputRange can be passed by reference, but you
> > specified S = dchar. S here is the type of the inputRange, and
> > it is not of type dchar. It's best not to specify S so the
> > compiler will infer it, range types can be very complicated.
> > Once we fix that, let's look at the rest:
> >
> > isInputRange!S - S is an inputRange
> > isSomeChar!(ElementType!S) - ElementType!S is ubyte, but
> > isSomeChar!ubyte is not true.
> >
> > The function wants characters, but you give bytes. A quick fix
> > would be to do:
> > ```
> > import std.algorithm: map;
> > auto mapped = r.map!(x => cast(char) x);
> > mapped.decodeFront!(No.useReplacementDchar)();
> > ```
> >
> > But it may be better for somefn to accept an InputRange!(char)
> > instead.
> >
> > Note that if you directly do:
> > ```
> > r.map!(x => cast(char)
> > x).decodeFront!(No.useReplacementDchar)();
> > ```
> > It still won't work, since it wants `ref S str` and r.map!(...)
> > is a temporary that can't be passed by reference.
> >
> > As you can see, ensuring template constraints can be really
> > difficult. The error messages give little help here, so you
> > have to manually check whether the conditions of the overload
> > you want hold.
>
> Thanks, that's helpful. My confusion seems due to my thinking
> that a decoding operation converts (unsigned) bytes to chars,
> which is not how the writers of std.utf seem to have thought of
> it. As I see it, a ubyte 0x20 could be decoded to an ASCII char '
> ', and likewise to wchar or dchar. It doesn't (to me) make sense
> to decode a char to a wchar or dchar. Anyway, you've shown me how
> decodeFront can be used, so great!

decode and decodeFront are for converting UTF code units to a Unicode code
point. So, you're taking UTF-8 code units (char), UTF-16 code units (wchar),
or UTF-32 code units (dchar) and decoding them. In the case of UTF-32, that's
a no-op, since UTF-32 code units are already code points, but for UTF-8 and
UTF-16, they're not the same at all.
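As a concrete sketch of that (the string literal is just an example),
decodeFront consumes code units from the front of its argument and returns
the decoded code point:

```d
import std.utf : decodeFront;

void main()
{
    string s = "héllo"; // 'é' is one code point but two UTF-8 code units
    auto tmp = s;       // decodeFront takes its argument by ref and pops it

    dchar c = tmp.decodeFront(); // consumes 'h': one code unit
    assert(c == 'h');
    assert(tmp.length == s.length - 1);

    c = tmp.decodeFront();       // consumes 'é': two code units, one code point
    assert(c == 'é');
    assert(tmp.length == s.length - 3);
}
```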

For UTF-8, a code point is encoded as 1 to 4 code units which are 8 bits in
size (char). For UTF-16, a code point is encoded as 1 or 2 code units which
are 16 bits in size (wchar), and for UTF-32, code points are encoded as code
units which are 32 bits in size (dchar). The decoding is doing that
conversion. None of this has anything to do with ASCII or any other encoding
except insofar as ASCII happens to line up with Unicode.
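For instance (using U+1D11E, MUSICAL SYMBOL G CLEF, purely as an
illustration of a code point outside the Basic Multilingual Plane), the same
single code point takes a different number of code units in each encoding:

```d
import std.range : walkLength;

void main()
{
    assert("𝄞".length == 4);     // UTF-8: four code units (char)
    assert("𝄞"w.length == 2);    // UTF-16: a surrogate pair, two wchars
    assert("𝄞"d.length == 1);    // UTF-32: one dchar == one code point
    assert("𝄞".walkLength == 1); // auto-decoded as a range: one code point
}
```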

Code points are then 32-bit integer values (which D represents as dchar).
They are often called Unicode characters, and can be represented
graphically, but many of them represent bits of what you would actually
consider to be a character (e.g. an accent could be a code point on its
own), so in many cases, code points have to be combine to create what's
called a grapheme or grapheme cluster (which unfortunately, means that can
can have to worry about normalizing code points). std.uni provides code for
worrying about that sort of thing. Ultimately, what gets rendered to the
screen by with a font is as grapheme. In the simplest case, with an ASCII
character, a single character is a single code unit, a single code point,
and a single grapheme in all representations, but with more complex
characters (e.g. a Hebrew character or a character with a couple of accents
on it), it could be several code units, one or more code points, and a
single grapheme.
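A minimal sketch of the code point vs. grapheme distinction, using std.uni's
byGrapheme (the combining-accent string here is just an example):

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT: "é" built
    // from two code points rather than the single precomposed one.
    string s = "e\u0301";
    assert(s.walkLength == 2);            // two code points
    assert(s.byGrapheme.walkLength == 1); // but a single grapheme
}
```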

I would advise against doing much with decode or decodeFront without having
a decent understanding of the basics of Unicode.

> Supplementary question: is an operation like r.map!(x =>
> cast(char) x) effectively a run-time no-op and just to keep the
> compiler happy, or does it actually result in code being
> executed? I came across a similar issue with ranges recently
> where the answer was to map immutable(byte) to byte in the same
> way.

That would depend on the optimization flags chosen and the exact code in
question. In general, ldc is more likely to do a good job at optimizing such
code than dmd, though dmd doesn't necessarily do a bad job. I don't know how
good a job dmd does in this particular case. It depends on the code. In
general, dmd compiles very quickly and as such is great for development,
whereas ldc does a better job at generating fast executables. I would expect
ldc to optimize such code properly.
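To tie that back to the original question, the map-and-cast pattern from
earlier in the thread can be exercised end to end like this (the byte
values are illustrative; they happen to be the UTF-8 encoding of "é!"):

```d
import std.algorithm : map;
import std.utf : decodeFront;

void main()
{
    ubyte[] bytes = [0xC3, 0xA9, 0x21]; // UTF-8 for "é!"

    // Reinterpret each ubyte as a char; no copy of the data is made,
    // so any per-element cost is just the cast in the lambda.
    auto chars = bytes.map!(b => cast(char) b);

    // chars is an lvalue, so it can bind to decodeFront's ref parameter.
    assert(chars.decodeFront() == 'é'); // consumes two code units
    assert(chars.decodeFront() == '!'); // consumes one
    assert(chars.empty);
}
```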

- Jonathan M Davis
