Using decodeFront with a generalised input range

Jonathan M Davis newsgroup.d at jmdavisprog.com
Fri Nov 9 13:12:44 UTC 2018


On Friday, November 9, 2018 5:22:27 AM MST Vinay Sajip via Digitalmars-d-learn wrote:
> On Friday, 9 November 2018 at 11:24:42 UTC, Jonathan M Davis wrote:
> > decode and decodeFront are for converting a UTF code unit to a
> > Unicode code point. So, you're taking a UTF-8 code unit (char),
> > UTF-16 code unit (wchar), or a UTF-32 code unit (dchar) and
> > decoding it. In the case of UTF-32, that's a no-op, since
> > UTF-32 code units are already code points, but for UTF-8 and
> > UTF-16, they're not the same at all.
> >
> > I would advise against doing much with decode or decodeFront
> > without having a decent understanding of the basics of Unicode.
>
> I think I understand enough of the basics of Unicode, at least
> for my application; my unfamiliarity is with the D language and
> standard library, to which I am very new.
>
> There are applications where one needs to decode a stream of
> bytes into Unicode text: perhaps it's just semantic quibbling
> distinguishing between "a ubyte" and "a UTF-8 code unit", as
> they're the same at the level of bits and bytes (as I understand
> it - please tell me if you think otherwise). If I open a file
> using mode "rb", I get a sequence of bytes, which may contain
> structured binary data, parts of which are to be interpreted as
> text encoded in UTF-8. Is there something in the D standard
> library which enables incremental decoding of such (parts of) a
> byte stream? Or does one have to resort to the `map!(x =>
> cast(char) x)` method for this?

In principle, a char is assumed to be a UTF-8 code unit, though code can
certainly end up with a char that's not a valid UTF-8 code unit. So, char
is specifically a character type, whereas byte and ubyte are 8-bit integer
types which can contain arbitrary data. D
purposefully has char, wchar, and dchar as separate types from byte, ubyte,
short, ushort, etc. in order to distinguish between character types and
integer types, and in general, the D standard library does not treat byte or
ubyte as having anything to do with characters.
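
To make the distinction concrete, here is a small sketch using the std.traits
templates isSomeChar and isIntegral. It only checks how the types are
classified at compile time; the empty main is there just so that the snippet
builds on its own.

```d
import std.traits : isIntegral, isSomeChar;

// char and ubyte are both one byte wide...
static assert(char.sizeof == 1 && ubyte.sizeof == 1);

// ...but only char is classified as a character type.
static assert(isSomeChar!char && !isIntegral!char);
static assert(!isSomeChar!ubyte && isIntegral!ubyte);

// A char need not hold a valid code unit: char.init is 0xFF,
// a byte value that never occurs in valid UTF-8.
static assert(char.init == 0xFF);

void main() {}
```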

decode and decodeFront operate on ranges of characters, not ranges of
arbitrary integer types. So, if you have a range of byte or ubyte which
contains UTF-8 code units, and you want to use decode or decodeFront, then
you will need to convert that range to a range of char. map would likely be
the most straightforward way to do that.
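
As a minimal sketch of the map approach (the helper name decodeBytes and the
sample bytes are made up for illustration, and it assumes that the bytes
really are UTF-8), something like this should work:

```d
import std.algorithm.iteration : map;
import std.utf : decodeFront;

// Hypothetical helper: decodes a slice of UTF-8 bytes into code points.
dstring decodeBytes(const(ubyte)[] bytes)
{
    // Reinterpret each ubyte as a char so that the range's element
    // type is a character type, which decodeFront requires.
    auto chars = bytes.map!(b => cast(char) b);

    dstring result;
    while (!chars.empty)
    {
        // decodeFront takes the range by ref and pops the code units
        // that make up the decoded code point.
        result ~= chars.decodeFront;
    }
    return result;
}

void main()
{
    // 'h' (1 byte), U+00E9 (2 bytes), and U+20AC (3 bytes) in UTF-8.
    ubyte[] bytes = [0x68, 0xC3, 0xA9, 0xE2, 0x82, 0xAC];
    assert(decodeBytes(bytes) == "h\u00E9\u20AC"d);
}
```

Note that with its default template arguments, decodeFront throws a
UTFException if the input is not valid UTF-8, so a caller working with a
partially binary stream would need to make sure that it only decodes the
parts that are actually text.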

- Jonathan M Davis




