No we should not support enum types derived from strings

Jon Degenhardt jond at noreply.com
Sat May 8 03:49:56 UTC 2021


On Saturday, 8 May 2021 at 02:05:42 UTC, Andrei Alexandrescu 
wrote:
> On 5/7/21 6:34 PM, Jon Degenhardt wrote:
>> It'd be very useful to have an approach to utf-8 strings that 
>> enabled switching interpretations easily, without casting.
>
> String s;
> func1(s.bytes);
> func2(s.dchars);

That's not quite what I was getting at, but that's my fault: a 
hastily written message that muddled a couple of concepts. Sorry 
about that; I need to write up a better description. There are 
two underlying thoughts.

One is being able to convert from a random-access byte array to 
a char input range (e.g. via `byUTF`), do something with it 
(e.g. `popFront`), then convert that form back to a 
random-access byte range. This is logically doable because both 
are views on the same physical array. However, once something is 
an input range, it doesn't convert simply back to a 
random-access range.

This first one strikes me as potentially challenging because this 
dual view on the underlying data is not common, so there's not a 
lot of incentive to support it as a general concept.
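A minimal sketch of that first issue, using Phobos' `std.utf` (the comment about what the wrapper exposes is my understanding, not documented fact):

```d
import std.utf : byUTF;

void main()
{
    char[] buf = "héllo".dup;

    // Wrap the array in a decoding view: a forward range of dchar
    // over the same underlying bytes.
    auto r = buf.byUTF!dchar;
    r.popFront();  // consume 'h'

    // At this point there is no simple way to get back a random-access
    // slice of the remaining bytes: as far as I know, the wrapper does
    // not expose its position in the underlying array.
}
```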

The second issue is more about current Phobos algorithms that 
specialize their implementations depending on whether the 
argument is a `char[]` or a `byte[]`. This normally involves 
conditioning on `isSomeString` or `isSomeChar`: `char[]` / 
`char` pass these tests, `byte[]` / `byte` do not. The cases I 
remember are ones where the string form was specialized for 
better performance than the byte form. Look through searching.d 
for `isSomeString` uses to see this.
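Concretely, the trait distinction looks like this (a small illustration, not drawn from Phobos itself):

```d
import std.traits : isSomeChar, isSomeString;

// char[] takes the specialized string paths; ubyte[] does not,
// even when it holds the same UTF-8 bytes.
static assert( isSomeString!(char[]));
static assert( isSomeChar!char);
static assert(!isSomeString!(ubyte[]));
static assert(!isSomeChar!ubyte);

void main() {}
```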

The trouble with this is that at the application level it can be 
necessary to use a byte array when working with a number of 
facilities. This often involves I/O, e.g. reading fixed-size 
blocks from an input stream (`File.byChunk`), which operates on 
`ubyte[]` arrays. Such an array can be cast to a `char[]`, but 
this can run afoul of autodecoding-related routines that expect 
correctly formed UTF-8 characters. When reading fixed-size 
buffers, the start and end of a buffer will often not fall on 
UTF-8 boundaries, so examining the bytes is necessary to handle 
these cases. (And input streams may contain corrupt UTF-8 
sequences.)
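The kind of boundary handling this forces on the application might look like the following (a hypothetical, untested helper; `completeLength` is my name, not a Phobos one):

```d
// Given a chunk read from a stream (e.g. via File.byChunk), return the
// length of the prefix that ends on a UTF-8 sequence boundary. The
// trailing incomplete sequence (at most 3 bytes) would be carried over
// and prepended to the next chunk. Assumes otherwise well-formed input.
size_t completeLength(const(ubyte)[] chunk)
{
    // Scan backwards over at most the last 4 bytes for a sequence start.
    foreach_reverse (i; (chunk.length < 4 ? 0 : chunk.length - 4) .. chunk.length)
    {
        const b = chunk[i];
        if ((b & 0xC0) != 0x80)  // not a continuation byte => sequence start
        {
            // Sequence length implied by the lead byte.
            const size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            return (i + len <= chunk.length) ? chunk.length : i;
        }
    }
    return chunk.length;  // no lead byte found: corrupt input, punt
}
```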

I know the above is still not an adequate description. At some 
point I'll try to write up something more compelling.

--Jon

