What features of D you would not miss?

H. S. Teoh hsteoh at qfbox.info
Fri Sep 16 21:50:53 UTC 2022


On Fri, Sep 16, 2022 at 09:24:40PM +0000, Luhrel via Digitalmars-d wrote:
[...]
> I would also remove `wchar` and `dchar` (and [`d`/`w`]`string`), and
> go full Unicode for the `char` type, like in Rust.

wchar/wstring is occasionally useful for interfacing with some Windows
APIs.

`dchar` is needed for representing a full Unicode code point (which can
go up to 0x10_ffff). But `dstring` is pretty much useless.
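
As an illustration of the sizes involved, here is a minimal sketch
(the sample code point U+1F600 is just an arbitrary choice of something
above U+FFFF):

```d
void main()
{
    // One code unit of UTF-8, UTF-16 and UTF-32 respectively.
    static assert(char.sizeof == 1);
    static assert(wchar.sizeof == 2);
    static assert(dchar.sizeof == 4);

    // dchar.max is the highest Unicode code point.
    static assert(dchar.max == 0x10_FFFF);

    // A code point above U+FFFF fits in a single dchar, but takes a
    // surrogate pair in UTF-16 and four code units in UTF-8.
    assert("\U0001F600"d.length == 1);
    assert("\U0001F600"w.length == 2);
    assert("\U0001F600".length == 4);
}
```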


> However, this would be a little challenging when slicing :
>  1. Panic when slicing in the "middle" of a character (Rust does it
>  this way)

The problem is that what we think of as a "character" is *not* what the
Unicode standard calls a "character". Well actually, Unicode mostly
works in terms of something called "code points", which people
mistakenly assume is the same as our concept of "character"
(unfortunately, this is not true). A `char` corresponds with a "code
unit" in UTF-8; one or more code units correspond with a single code
point.  However, what we think of as a "character" may consist of
*multiple* code points: for example, the sequence \u0041\u0301 consists
of *two* Unicode code points, but a single displayed character (which
Unicode calls a grapheme).
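
The three levels can be counted separately. A minimal sketch, assuming
a reasonably recent Phobos (std.utf.byDchar and std.uni.byGrapheme):

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byDchar;

void main()
{
    string s = "\u0041\u0301"; // 'A' followed by a combining acute accent

    assert(s.length == 3);                // UTF-8 code units (`char`s)
    assert(s.byDchar.walkLength == 2);    // code points
    assert(s.byGrapheme.walkLength == 1); // graphemes
}
```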

The problem of slicing in the "middle" of a "character" will occur even
if you don't allow breaking apart code units that encode a single code
point. For example, the above sequence "\u0041\u0301" can be legally
split into two separate code points, but that would also break the
grapheme.  The only way to avoid this is to allow slicing only at
grapheme boundaries...

... unfortunately, computing grapheme boundaries is non-trivial in
Unicode and introduces a big performance hit if you overuse it.
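
To make the slicing problem concrete, here is a hedged sketch. It uses
std.utf.validate to show that both halves are still well-formed UTF-8,
and std.uni.graphemeStride as one possible way to find a boundary that
does not split the grapheme:

```d
import std.uni : graphemeStride;
import std.utf : validate;

void main()
{
    string s = "\u0041\u0301";

    // Slicing between the two code points is legal as far as UTF-8 is
    // concerned: neither call to validate() throws.
    string head = s[0 .. 1];
    string tail = s[1 .. $];
    validate(head);
    validate(tail);

    // ... but the grapheme is now broken: a bare 'A' and a dangling
    // combining accent.
    assert(head == "A");
    assert(tail == "\u0301");

    // graphemeStride returns the length, in code units, of the
    // grapheme starting at the given index.
    assert(graphemeStride(s, 0) == 3);
    assert(s[0 .. graphemeStride(s, 0)] == s);
}
```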

Most code actually should *not* care about any of the above; it should
treat strings as opaque binary data and only use Unicode library
functions to manipulate them. In the rare case when you actually need to
parse individual elements in the string, you can iterate over graphemes
with std.uni.byGrapheme. (Or iterate over code points, depending on what
your code is trying to do.)
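
A minimal sketch of grapheme-wise iteration (the sample text is made
up; Grapheme.length gives the number of code points in each grapheme):

```d
import std.algorithm : equal, map;
import std.uni : byGrapheme;

void main()
{
    // "noel" with a combining diaeresis (U+0308) after the 'e'.
    string text = "noe\u0308l";

    // Number of code points in each grapheme.
    auto counts = text.byGrapheme.map!(g => g.length);

    // Four graphemes; the third consists of two code points.
    assert(counts.equal([1, 1, 2, 1]));
}
```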


>  2. Throw an Exception

Bad idea. In fact, we worked really hard to try to get rid of this
behaviour in Phobos, and I'm not sure if we're 100% there yet.
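
As one possible illustration (an assumption about which behaviour is
meant, not something spelled out above): std.utf.validate throws a
UTFException on malformed input, whereas std.utf.byDchar substitutes
U+FFFD for the offending code unit and carries on.

```d
import std.algorithm : equal;
import std.exception : assertThrown;
import std.utf : byDchar, UTFException, validate;

void main()
{
    // 0xFF can never appear in well-formed UTF-8.
    char[] bad = ['a', cast(char) 0xFF, 'b'];

    // Throwing style: the caller has to deal with an exception.
    assertThrown!UTFException(validate(bad));

    // Non-throwing style: the bad code unit decodes to U+FFFD.
    assert(bad.byDchar.equal("a\uFFFDb"d));
}
```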


>  3. Change `char` size to 4 bytes, so "`char` = `dchar`" by default

This would imply extending autodecoding to every string operation on
UTF-8 data.  Autodecoding is something we've been trying to get rid of,
not keep, much less extend. :-P
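
For reference, a small sketch of what autodecoding looks like today,
assuming a Phobos version in which it is still present; byCodeUnit is
the usual way to opt out of it:

```d
import std.range : ElementType, walkLength;
import std.traits : Unqual;
import std.utf : byCodeUnit;

void main()
{
    string s = "\u00e9"; // one code point, two UTF-8 code units

    // As a range, a string yields decoded dchars, not chars.
    static assert(is(ElementType!string == dchar));

    // byCodeUnit yields the raw chars instead.
    static assert(
        is(Unqual!(ElementType!(typeof(s.byCodeUnit()))) == char));

    assert(s.length == 2);                // array length: code units
    assert(s.walkLength == 1);            // range length: decoded
    assert(s.byCodeUnit.walkLength == 2); // no decoding
}
```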


T

-- 
If you're not part of the solution, you're part of the precipitate.

