No we should not support enum types derived from strings
guai
guai at inbox.ru
Sat May 8 19:33:45 UTC 2021
On Saturday, 8 May 2021 at 18:44:00 UTC, Jon Degenhardt wrote:
> On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:
>> On Friday, 7 May 2021 at 22:34:19 UTC, Jon Degenhardt wrote:
>>> `byLine` implementations will usually work by iterating
>>> forward, but there are random access use cases as well. For
>>> example, it is perfectly reasonable to divide a utf-8 array
>>> roughly in half using byte offsets, then search for the
>>> nearest utf-8 character boundary. After this both halves
>>> are treated as utf-8 input ranges, not random access.
>>
>> In my experience, treating a string as a byte array is almost
>> never a good thing. The person doing it must be very careful
>> and truly understand what they are doing.
>> What are those use cases other than `byLine` where this is
>> useful?
>> Dividing a utf-8 array and searching for the nearest char may
>> split inside a combining character, which isn't a thing you
>> usually want, especially when a human will read this text.
>> Conceptually a string is a sequence of characters. A range of
>> dchar in D's terms.
>
> Data and log file processing are common cases. Single byte
> ascii characters are normally used to delimit structure in such
> files. Record delimiters, field delimiters, name-value pair
> delimiters, escape syntax, etc. A common way to operate on such
> files is to identify structural boundaries by finding the
> requisite single byte ascii characters and treating the
> contained data as opaque (uninterpreted) sequences of utf-8
> bytes.
>
> The details depend on the file format. But the key part is that
> single byte ascii characters can be unambiguously identified
> without interpreting other characters in a utf-8 data stream.
> Of course, when it comes time to interpret the data inside
> these data streams it is necessary to operate on cohesive
> blocks. Yes graphemes, but also things like numbers. It's not
> useful to split a number in the middle and then call
> `std.conv.to!double` on it.
>
> Operating on the single byte structural elements allows
> deferring interpretation of multi-byte unicode content until it
> is needed. This is why it's useful to switch back and forth
> between a byte-oriented view and a UTF character view.
> Operating on bytes is faster (e.g. `memchr`, no utf-8
> decoding), enables parallelization (depending on the type of
> file), and can be used with fixed size buffer reads and writes.
>
> --Jon
When you work with log files, you first pull the data in as a byte
stream and split it into chunks. Then you make a string out of each
chunk. Once you've done that, you process it as a string, with all
the rules of Unicode. For example, you split it into words. And then
you may want to convert a word back to bytes.
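
A minimal sketch of that workflow in D, assuming the chunks are newline-delimited records. The helper name `linesFromBytes` is hypothetical; `splitter`, `validate`, `split` and `representation` are Phobos functions.

```d
import std.algorithm.iteration : splitter;
import std.array : split;
import std.string : representation;
import std.utf : validate;

// Hypothetical helper: split raw bytes on the single-byte ASCII newline,
// then promote each chunk to a string after checking it is valid UTF-8.
string[] linesFromBytes(immutable(ubyte)[] raw)
{
    string[] lines;
    foreach (chunk; raw.splitter(cast(ubyte) '\n'))
    {
        auto line = cast(string) chunk; // reinterpret the slice, no copy
        validate(line);                 // throws UTFException if malformed
        lines ~= line;
    }
    return lines;
}

unittest
{
    immutable(ubyte)[] raw = "привет мир\nhello world".representation;
    auto lines = linesFromBytes(raw);
    assert(lines.length == 2);

    // From here on the chunk is handled as text: split on Unicode whitespace.
    auto words = lines[0].split;
    assert(words == ["привет", "мир"]);

    // And back to bytes: six Cyrillic letters, two bytes each.
    assert(words[0].representation.length == 12);
}
```

The byte-level step only looks for `'\n'`, which cannot occur inside a multi-byte UTF-8 sequence, so each chunk is still well-formed text.
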
But you cannot split a string wherever you want by treating it as
bytes. That most certainly wouldn't work with all the languages out
there.
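
A small sketch of what goes wrong, under the assumption that "split" means slicing at a byte offset. The helpers `isValidUtf8` and `codePointStart` are hypothetical names, not Phobos functions. The first case cuts a code point in half; the second cuts on a valid code point boundary but still separates a base letter from its combining accent.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : UTFException, validate;

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

// Hypothetical helper: move `i` back to the first byte of the code point
// that contains byte `i`. Continuation bytes look like 10xxxxxx.
size_t codePointStart(const(char)[] s, size_t i)
{
    while (i > 0 && i < s.length && (s[i] & 0xC0) == 0x80)
        --i;
    return i;
}

unittest
{
    string s = "жук";                  // 3 letters, 6 bytes
    assert(!isValidUtf8(s[0 .. 3]));   // byte 3 is in the middle of 'у'
    assert(codePointStart(s, 3) == 2); // nearest boundary to the left
    assert(isValidUtf8(s[0 .. 2]));

    string t = "e\u0301x";             // 'e' + combining acute accent + 'x'
    // Both halves are valid UTF-8...
    assert(isValidUtf8(t[0 .. 1]) && isValidUtf8(t[1 .. $]));
    // ...but the cut falls inside one of the two graphemes a reader sees.
    assert(t.byGrapheme.walkLength == 2);
}
```
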
With a string you cannot get a character by index; you must read
the characters sequentially. You can search, you can tokenize, and
maybe rewind and reinterpret.
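
To illustrate in D terms: indexing a `string` yields a UTF-8 code unit, not a character, so reaching the n-th code point means scanning from the start. The sketch below uses `toUTFindex` and `decode` from `std.utf`; the wrapper `codePointAt` is a hypothetical name, and it works on code points, not graphemes.

```d
import std.utf : decode, toUTFindex;

// Hypothetical helper: return the n-th code point of `s`.
// toUTFindex walks the string from the start, so this is O(n), not O(1).
dchar codePointAt(string s, size_t n)
{
    size_t i = toUTFindex(s, n); // byte offset of the n-th code point
    return decode(s, i);         // decodes at `i` and advances `i` past it
}

unittest
{
    string s = "жук";
    assert(s.length == 6);               // length counts bytes, not letters
    assert(cast(ubyte) s[1] == 0xB6);    // s[1] is the second byte of 'ж'
    assert(codePointAt(s, 1) == 'у');    // the second letter needs a scan
}
```
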