No we should not support enum types derived from strings

Jon Degenhardt jond at noreply.com
Sat May 8 18:44:00 UTC 2021


On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:
> On Friday, 7 May 2021 at 22:34:19 UTC, Jon Degenhardt wrote:
>> `byLine` implementations will usually work by iterating 
>> forward, but there are random access use cases as well. For 
>> example, it is perfectly reasonable to divide a utf-8 array 
>> roughly in half using byte offsets, then search for the 
>> nearest utf-8 character boundary. After this, both halves 
>> are treated as utf-8 input ranges, not random access.
>
> In my experience, treating a string as a byte array is almost 
> never a good thing. The person doing it must be very careful 
> and truly understand what they are doing.
> What are those use cases other than `byLine` where this is 
> useful?
> Dividing a utf-8 array and searching for the nearest char may 
> split inside a combining character, which isn't a thing you 
> usually want, especially when a human would read this text.
> Conceptually, a string is a sequence of characters. A range of 
> dchar in D's terms.

Data and log file processing are common cases. Single-byte ASCII 
characters are normally used to delimit structure in such files: 
record delimiters, field delimiters, name-value pair delimiters, 
escape syntax, etc. A common way to operate on such files is to 
identify structural boundaries by finding the requisite 
single-byte ASCII characters and treating the contained data as 
opaque (uninterpreted) sequences of utf-8 bytes.
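
A minimal sketch of that approach, assuming a tab-delimited record. 
The helper name `splitFields` is hypothetical, and the tab delimiter 
is only an example. The split compares bytes and never decodes 
utf-8:

```d
import std.algorithm : map, splitter;
import std.array : array;
import std.stdio : writeln;
import std.string : representation;

/// Hypothetical helper: split a record on a single-byte ASCII
/// delimiter without decoding any utf-8.
string[] splitFields(string line, char delim = '\t')
{
    assert(delim < 0x80, "delimiter must be single-byte ASCII");
    return line.representation              // immutable(ubyte)[] view
               .splitter(cast(ubyte) delim) // byte comparison only
               .map!(f => cast(string) f)   // fields stay opaque utf-8
               .array;
}

void main()
{
    // The fields contain multi-byte characters, but only the tab
    // bytes are inspected when splitting.
    writeln(splitFields("naïve\t3.14\t日本語"));
}
```

Casting each field back to `string` is only reasonable here because 
the split points are ASCII bytes, so a field cannot start or end in 
the middle of a multi-byte sequence.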

The details depend on the file format. But the key part is that 
single-byte ASCII characters can be unambiguously identified 
without interpreting other characters in a utf-8 data stream, 
because every byte of a multi-byte utf-8 sequence has its high 
bit set. Of course, when it comes time to interpret the data 
inside these streams, it is necessary to operate on cohesive 
blocks. Yes, graphemes, but also things like numbers. It's not 
useful to split a number in the middle and then call 
`std.conv.to!double` on it.
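
To illustrate the "unambiguously identified" part, here is a small 
sketch. The function names `countDecoded` and `countRaw` are 
hypothetical. It counts an ASCII delimiter once on decoded code 
points and once on raw bytes, and the two counts agree:

```d
import std.algorithm : count;
import std.string : representation;

/// Count delimiters by decoding code points (strings iterate as dchar).
size_t countDecoded(string s, char delim)
{
    return s.count(delim);
}

/// Count delimiters by comparing raw bytes; no decoding.
/// Only equivalent to countDecoded when delim is ASCII (< 0x80).
size_t countRaw(string s, char delim)
{
    return s.representation.count(cast(ubyte) delim);
}

void main()
{
    string s = "a,é,日本,😀";
    // Bytes of multi-byte sequences are all >= 0x80, so none of
    // them can be mistaken for the ',' byte.
    assert(countDecoded(s, ',') == countRaw(s, ','));
}
```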

Operating on the single-byte structural elements allows deferring 
interpretation of multi-byte Unicode content until it is needed. 
This is why it's useful to switch back and forth between a 
byte-oriented view and a UTF character view. Operating on bytes 
is faster (e.g. `memchr`, no utf-8 decoding), enables 
parallelization (depending on the type of file), and can be used 
with fixed-size buffer reads and writes.
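
As a sketch of switching between the two views, the following 
splits a buffer roughly in half by byte offset, moves the split 
point forward to the next utf-8 character boundary, and then 
treats each half as a string again. The helper `nextBoundary` is a 
hypothetical name, and scanning forward (rather than backward) is 
an arbitrary choice:

```d
import std.stdio : writeln;
import std.string : representation;
import std.utf : validate;

/// Hypothetical helper: advance pos to the next utf-8 character boundary.
size_t nextBoundary(const(ubyte)[] data, size_t pos)
{
    // Continuation bytes match 10xxxxxx; lead bytes and ASCII do not.
    while (pos < data.length && (data[pos] & 0xC0) == 0x80)
        ++pos;
    return pos;
}

void main()
{
    string text = "αβγδε";             // five 2-byte characters
    auto bytes = text.representation;  // byte-oriented view
    auto mid = nextBoundary(bytes, bytes.length / 2);

    // Switch back to the character view for each half.
    auto left = cast(string) bytes[0 .. mid];
    auto right = cast(string) bytes[mid .. $];
    validate(left);                    // throws UTFException if invalid
    validate(right);
    writeln(left, " | ", right);
}
```

This finds a code point boundary, not a grapheme boundary. For 
record-oriented files the same idea would more likely look for the 
next record delimiter (e.g. a newline byte), which keeps both 
graphemes and numbers intact within a record.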

--Jon


More information about the Digitalmars-d mailing list