Counting an initialised array, and segments

Mon Jun 26 22:19:25 UTC 2023

On Monday, June 26, 2023 1:09:24 PM MDT Cecil Ward via Digitalmars-d-learn 
wrote:
> No, point taken, a sloppy example. I don’t in fact do that in the
> real code. I use dchar everywhere appropriate instead of uint. In
> fact I have aliases for dstring and dchar and successfully did an
> alternative build with the aliases renamed to use 16-bit wchar /
> wstring instead of 32-bit and rebuilt and all was well, just to
> test that it is code word size-independent. I would need to do
> something different though if I ever decided to change to use
> 16-bit code words in memory because I would still be wanting to
> manipulate 32-bit values for char code points when they are being
> handled in registers, for efficiency too as well as code
> correctness, as 16-bit ‘partial words’ are bad news for
> performance on x86-64. I perhaps ought to introduce a new alias
> called codepoint, which is always 32-bits, to distinguish dchar
> in registers from words in memory. It turns out that I can get
> away with not caring about utf16, as I’m merely _scanning_ a
> string. I couldn’t ever get away with changing the in-memory code
> word type to be 8-bit chars, and then using utf8 though, as I do
> occasionally deal with non-ASCII characters, and I would have to
> either preconvert the Utf8 to do the decoding, or parse 8-bit
> code words and handle the decoding myself on the fly which would
> be madness. If I have to handle utf8 data I will just preconvert
> it.

Well, I can't really comment on the details of what you're doing, since I
don't know them, but I would point out that a dchar is a code point by
definition. That is its purpose. char is a UTF-8 code unit, wchar is a
UTF-16 code unit, and dchar is both a UTF-32 code unit and a code point,
since UTF-32 code units are code points by definition. It is possible for a
dchar to hold an invalid code point if you give it bad data, but a code point
fits in 32 bits, and dchar is intended to represent that. Actual characters
(graphemes), of course, can be multiple code points, so all of that
Unicode stuff is an annoyingly complicated mess, but D and Phobos
do have a pretty good set of primitives for handling code units and code
points without programmers needing to come up with their own types for
those.
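
To make the code unit / code point / character distinction concrete, here is
a minimal sketch. The helper names countCodePoints and countGraphemes are
made up for this illustration (they are not Phobos functions); walkLength and
byGrapheme are the Phobos pieces doing the work.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// Hypothetical helpers, named for this illustration only.
// walkLength on a string goes through the range primitives, which decode
// to dchar, so it counts code points rather than code units.
size_t countCodePoints(string s) { return s.walkLength; }

// byGrapheme groups code points into user-perceived characters.
size_t countGraphemes(string s) { return s.byGrapheme.walkLength; }

void main()
{
    // 'e' followed by U+0301 (combining acute accent):
    // one visible character, two code points.
    string  s8  = "e\u0301";   // array of char  (UTF-8 code units)
    wstring s16 = "e\u0301"w;  // array of wchar (UTF-16 code units)
    dstring s32 = "e\u0301"d;  // array of dchar (UTF-32 code units)

    assert(s8.length  == 3);   // 'e' is 1 code unit, U+0301 is 2
    assert(s16.length == 2);
    assert(s32.length == 2);   // for dchar, code units are code points

    assert(countCodePoints(s8) == 2);
    assert(countGraphemes(s8)  == 1);

    // A code point outside the BMP needs a surrogate pair in UTF-16
    // but is still a single dchar.
    assert("\U0001F600".length  == 4);
    assert("\U0001F600"w.length == 2);
    assert("\U0001F600"d.length == 1);

    static assert(dchar.sizeof == 4);
}
```

So a separate 32-bit "codepoint" alias would just be another name for dchar.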

The primary mistake in what D has is that strings are all ranges of dchar
with the code units automatically being decoded to dchar by front, popFront,
etc. (at the time, Andrei thought that that would ensure correctness, since
he didn't understand that you could have characters that were multiple code
points). We'd like to get rid of that, but it's difficult to do so without
breaking code. std.utf.byCodeUnit helps work around that, and of course, you
can do so by simply operating on the strings as arrays without using the
range primitives, but the range primitives do decode to dchar,
unfortunately. However, in spite of that quirk, the tools are there to
operate on Unicode correctly in a way that isn't available out of the box in
many languages. So, in general, you shouldn't need to be creating new types
for Unicode primitives. The language already has them.
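
As a rough sketch of the difference between the decoding range primitives and
std.utf.byCodeUnit: the helper countCodeUnit below is a made-up name for this
example, and it assumes you are scanning for an ASCII code unit, which is safe
in UTF-8 because ASCII values never occur inside a multi-byte sequence.

```d
import std.algorithm.searching : count;
import std.range : ElementType, walkLength;
import std.utf : byCodeUnit;

// Hypothetical helper for this example: counts how often one code unit
// occurs, scanning the string without decoding it to dchar.
size_t countCodeUnit(string s, char unit)
{
    return s.byCodeUnit.count(unit);
}

void main()
{
    string s = "d\u00EDa";  // U+00ED takes two UTF-8 code units

    // Indexing and .length treat the string as an array of code units.
    static assert(is(typeof(s[0]) == immutable(char)));
    assert(s.length == 4);

    // The range primitives decode, so the element type is dchar.
    static assert(is(ElementType!string == dchar));
    assert(s.walkLength == 3);

    // byCodeUnit wraps the string in a range that does not decode.
    auto units = s.byCodeUnit;
    static assert(is(ElementType!(typeof(units)) == immutable(char)));
    assert(units.length == 4);

    assert(countCodeUnit("a,b,d\u00EDa", ',') == 2);
}
```

If you need code points while scanning, plain foreach (dchar c; s) also
decodes without any extra types.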

- Jonathan M Davis