No we should not support enum types derived from strings
Jon Degenhardt
jond at noreply.com
Sat May 8 18:44:00 UTC 2021
On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:
> On Friday, 7 May 2021 at 22:34:19 UTC, Jon Degenhardt wrote:
>> `byLine` implementations will usually work by iterating
>> forward, but there are random access use cases as well. For
>> example, it is perfectly reasonable to divide a utf-8 array
>> roughly in half using byte offsets, then search for the
>> nearest utf-8 character boundary. After this, both halves
>> are treated as utf-8 input ranges, not random access.
>
> In my experience treating a string as byte array is almost
> never a good thing. A person doing it must be very careful and
> truly understand what they are doing.
> What are those use cases other than `byLine` where this is
> useful?
> Dividing a utf-8 array and searching for the nearest char may
> split inside a combining character, which isn't a thing you
> usually want. Especially when a human would read this text.
> Conceptually a string is a sequence of characters. A range of
> dchar in D's terms.

Data and log file processing are common cases. Single-byte ASCII
characters are normally used to delimit structure in such files:
record delimiters, field delimiters, name-value pair delimiters,
escape syntax, etc. A common way to operate on such files is to
identify structural boundaries by finding the requisite
single-byte ASCII characters and treating the contained data as
opaque (uninterpreted) sequences of UTF-8 bytes.
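
As a rough sketch of that approach, the following splits one record on a single-byte ASCII delimiter and hands back the fields as raw byte slices. The function name `splitFields` and the tab delimiter are illustrative assumptions, not something taken from a particular tool.

```d
import std.algorithm : splitter;
import std.string : representation;

/// Hypothetical helper: returns the fields of one record as opaque byte
/// slices. Nothing is UTF-8 decoded; the delimiter is a single ASCII byte
/// and is compared directly against the raw bytes.
immutable(ubyte)[][] splitFields(string record, char delim = '\t')
{
    assert(delim < 0x80, "delimiter must be a single-byte ASCII character");
    immutable(ubyte)[][] fields;
    foreach (field; record.representation.splitter(cast(ubyte) delim))
        fields ~= field; // each field is a slice of the input, not a copy
    return fields;
}

// Try it with: dmd -unittest -main -run thisfile.d
unittest
{
    // The multi-byte characters inside the fields are never inspected.
    auto fields = splitFields("naïve\t日本語\t42");
    assert(fields.length == 3);
    assert(cast(string) fields[0] == "naïve");
    assert(cast(string) fields[2] == "42");
}
```

The fields stay as `immutable(ubyte)[]` until the caller decides that a particular one needs to be read as text or as a number.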

The details depend on the file format. But the key part is that
single-byte ASCII characters can be unambiguously identified
without interpreting other characters in a UTF-8 data stream,
because every byte of a multi-byte UTF-8 sequence has its high bit
set. Of course, when it comes time to interpret the data inside
these data streams it is necessary to operate on cohesive blocks.
Yes, graphemes, but also things like numbers. It's not useful to
split a number in the middle and then call `std.conv.to!double` on
it. Operating on the single-byte structural elements allows
deferring interpretation of multi-byte Unicode content until it is
needed.
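
A minimal sketch of deferred interpretation, assuming newline-terminated records with tab-separated fields (the function name `sumColumn` and the file layout are assumptions for illustration): structure is located purely by bytes, and `to!double` is applied only to the one complete field that is needed.

```d
import std.algorithm : splitter;
import std.conv : to;
import std.range : drop;
import std.string : representation;

/// Hypothetical helper: sums one numeric column of tab-separated,
/// newline-terminated records.
double sumColumn(string data, size_t column)
{
    double total = 0;
    foreach (record; data.representation.splitter(cast(ubyte) '\n'))
    {
        if (record.length == 0)
            continue; // e.g. the empty slice after a trailing newline
        // Walk to the wanted field by bytes; other fields stay uninterpreted.
        auto fields = record.splitter(cast(ubyte) '\t').drop(column);
        if (fields.empty)
            continue; // record has too few fields
        // Only here is content interpreted, and only as a whole field.
        total += (cast(string) fields.front).to!double;
    }
    return total;
}

unittest
{
    import std.algorithm : all;

    // Bytes of multi-byte UTF-8 sequences are all >= 0x80, so they can
    // never be mistaken for an ASCII delimiter.
    assert("日本語".representation.all!(b => b >= 0x80));

    assert(sumColumn("café\t1.5\nnaïve\t2.25\n", 1) == 3.75);
}
```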

This is why it's useful to switch back and forth between a
byte-oriented view and a UTF character view. Operating on bytes
is faster (e.g. `memchr`, no UTF-8 decoding), enables
parallelization (depending on the type of file), and can be used
with fixed-size buffer reads and writes.
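
For example, a buffer can be divided for parallel processing along these lines. This is only a sketch: it assumes newline-terminated records with no escaped or quoted newlines, and `splitAtRecordBoundary` is a made-up name. It picks a byte offset near the middle and uses `memchr` to move the cut just past the next record delimiter.

```d
import core.stdc.string : memchr;

/// Hypothetical helper: splits `data` into two parts near the midpoint,
/// moving the cut forward to just past the next newline so that no
/// record (and no multi-byte character) is divided.
const(ubyte)[][2] splitAtRecordBoundary(const(ubyte)[] data)
{
    if (data.length == 0)
        return [data, data];
    immutable mid = data.length / 2;
    // memchr scans raw bytes; no UTF-8 decoding is involved.
    auto hit = cast(const(ubyte)*) memchr(data.ptr + mid, '\n', data.length - mid);
    // If no newline follows the midpoint, the second part is empty.
    immutable size_t cut = hit is null ? data.length : (hit - data.ptr) + 1;
    return [data[0 .. cut], data[cut .. $]];
}

unittest
{
    import std.string : representation;

    // The midpoint (byte 5) falls inside the three-byte character '本'.
    auto parts = splitAtRecordBoundary("日本\n語\n".representation);
    assert(cast(const(char)[]) parts[0] == "日本\n");
    assert(cast(const(char)[]) parts[1] == "語\n");
}
```

Each half can then be handed to a separate worker and treated as an ordinary UTF-8 input range from that point on.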
--Jon