No, we should not support enum types derived from strings

guai guai at inbox.ru
Sat May 8 19:33:45 UTC 2021


On Saturday, 8 May 2021 at 18:44:00 UTC, Jon Degenhardt wrote:
> On Saturday, 8 May 2021 at 16:04:24 UTC, guai wrote:
>> On Friday, 7 May 2021 at 22:34:19 UTC, Jon Degenhardt wrote:
>>> `byLine` implementations will usually work by iterating 
>>> forward, but there are random access use cases as well. For 
>>> example, it is perfectly reasonable to divide a utf-8 array 
>>> roughly in half using byte offsets, then search for the 
>>> nearest utf-8 character boundary. After this, both halves 
>>> are treated as utf-8 input ranges, not random access.
>>
>> In my experience, treating a string as a byte array is almost 
>> never a good thing. Anyone doing it must be very careful and 
>> truly understand what they are doing.
>> What are the use cases, other than `byLine`, where this is 
>> useful?
>> Dividing a utf-8 array and searching for the nearest char may 
>> split inside a combining character, which isn't something you 
>> usually want, especially when a human will read the text.
>> Conceptually, a string is a sequence of characters: a range of 
>> dchar in D's terms.
>
> Data and log file processing are common cases. Single byte 
> ascii characters are normally used to delimit structure in such 
> files. Record delimiters, field delimiters, name-value pair 
> delimiters, escape syntax, etc. A common way to operate on such 
> files is to identify structural boundaries by finding the 
> requisite single byte ascii characters and treating the 
> contained data as opaque (uninterpreted) sequences of utf-8 
> bytes.
>
> The details depend on the file format. But the key part is that 
> single byte ascii characters can be unambiguously identified 
> without interpreting other characters in a utf-8 data stream. 
> Of course, when it comes time to interpret the data inside 
> these data streams it is necessary to operate on cohesive 
> blocks. Yes graphemes, but also things like numbers. It's not 
> useful to split a number in the middle and then call 
> `std.conv.to!double` on it.
>
> Operating on the single byte structural elements allows 
> deferring interpretation of multi-byte unicode content until it 
> is needed. This is why it's useful to switch back and forth 
> between a byte-oriented view and a UTF character view. 
> Operating on bytes is faster (e.g. `memchr`, no utf-8 
> decoding), enables parallelization (depending on the type of 
> file), and can be used with fixed size buffer reads and writes.
>
> --Jon
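
The byte-oriented scan described above can be sketched in D roughly like this (a minimal sketch; the data and delimiter are illustrative). It relies on the UTF-8 property that every byte of a multi-byte sequence has its high bit set, so a single-byte ASCII delimiter can never appear inside one:

```d
import std.algorithm : splitter;
import std.stdio : writeln;

void main()
{
    // Tab-delimited record; the non-ASCII fields stay opaque UTF-8.
    auto record = cast(immutable(ubyte)[]) "naïve\tcafé\t日本語";

    // Scan for the single-byte delimiter without decoding anything.
    foreach (field; record.splitter(cast(ubyte) '\t'))
        writeln(cast(string) field); // interpret a field only when needed
}
```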

When you work with log files, you first pull the data in as a 
byte stream and split it into chunks. Then you make a string out 
of each chunk. From that point on you process it as a string, 
with all the rules of Unicode: for example, you split it into 
words. And then you may want to convert a word back into bytes 
again.
But you cannot split a string wherever you want by treating it 
as bytes. That most certainly wouldn't work with all the 
languages out there.
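
Splitting at an arbitrary byte offset can leave invalid UTF-8, which is easy to demonstrate in D (a small sketch):

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    string s = "café";        // 'é' encodes as the two bytes 0xC3 0xA9
    // Slicing at byte offset 4 cuts that two-byte sequence in half,
    // leaving invalid UTF-8:
    string broken = s[0 .. 4];
    assertThrown!UTFException(validate(broken));
}
```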
With a string you cannot get a char by index; you must read the 
characters sequentially. You can search, you can tokenize, and 
perhaps rewind and reinterpret.
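
The sequential-reading point shows up directly in how D treats strings as ranges (a small sketch):

```d
import std.range : walkLength;
import std.utf : byCodeUnit;

void main()
{
    string s = "café";
    // Iterated as a range, a D string decodes to dchar: 4 characters.
    assert(s.walkLength == 4);
    // Viewed as raw code units it is 5 bytes, because 'é' takes two.
    assert(s.byCodeUnit.walkLength == 5);
    // s[3] indexes a byte (code unit), not a character, so there is
    // no O(1) "get the n-th character" without decoding first.
}
```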

