No we should not support enum types derived from strings

Jon Degenhardt jond at noreply.com
Sat May 8 22:38:25 UTC 2021


On Saturday, 8 May 2021 at 21:47:21 UTC, guai wrote:
> On Saturday, 8 May 2021 at 20:22:28 UTC, Jon Degenhardt wrote:
>> If all an algorithm needs to do is split a string roughly in 
>> half, then use the byte offsets to find the halfway point and 
>> then look for a utf-8 character boundary. If the algorithm is 
>> based on some other boundary, say, token boundaries, then find 
>> one of those boundaries.
>
> Those algorithms you're talking about either don't need 
> strings at all, but rather byte/char arrays, or would produce 
> garbage for any input other than ascii.

I don't understand the point you are trying to make. Perhaps you 
could rephrase.

I've implemented any number of these types of algorithms. It's 
very common to mix interpretation as unicode strings with 
interpretation as utf-8 bytes. E.g. maybe it's necessary to do 
case-conversion at some stage of processing. This has to be done 
on unicode characters, not bytes. But needing to do such 
processing at some point does not exclude treating the data as 
utf-8 bytes for other purposes.
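
To make that concrete, here is a minimal sketch of the kind of 
mixing I mean. The function name `upperFirstField` and the 
tab-delimited input are made up for illustration. The delimiter 
search runs over raw utf-8 bytes, and the case conversion runs on 
unicode characters via `std.uni.toUpper`.

```d
import std.algorithm.searching : countUntil;
import std.string : representation;
import std.uni : toUpper;

/// Hypothetical helper: upper-cases the first tab-delimited field.
string upperFirstField(string line)
{
    // Byte-level step: scan the raw utf-8 bytes for a tab. This is
    // safe because every byte of a multi-byte sequence is >= 0x80,
    // so the ascii byte 0x09 can never occur inside one.
    immutable(ubyte)[] bytes = line.representation;
    immutable tab = bytes.countUntil(cast(ubyte) '\t');
    immutable size_t end = tab < 0 ? line.length : cast(size_t) tab;

    // Character-level step: case conversion works on unicode
    // characters, not on individual bytes.
    return line[0 .. end].toUpper ~ line[end .. $];
}

void main()
{
    assert(upperFirstField("café\t42") == "CAFÉ\t42");
    assert(upperFirstField("abc") == "ABC");
}
```

The same `string` is used for both steps. Nothing needs to be 
converted to a separate byte-array type first.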

Also, a `char[]` in D is defined to be utf-8, and a `string` is 
an `immutable(char)[]`. So why would utf-8 data, including 
non-ascii characters, read into a `char[]` produce garbage? The 
answer is that it wouldn't. No, you cannot simply start on an 
arbitrary byte boundary, but nobody has suggested this.
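
For the split-roughly-in-half case, a sketch might look like the 
following. The names `nextBoundary` and `splitNearMiddle` are 
hypothetical, and the code assumes the input is valid utf-8. It 
starts at the byte midpoint and steps forward past any 
continuation bytes to reach a character boundary.

```d
/// Returns the first index at or after `pos` that is not a utf-8
/// continuation byte, i.e. the start of a character (or the end).
/// Assumes `s` holds valid utf-8.
size_t nextBoundary(const(char)[] s, size_t pos) pure nothrow @nogc @safe
{
    // Continuation bytes have the bit pattern 10xxxxxx.
    while (pos < s.length && (s[pos] & 0xC0) == 0x80)
        ++pos;
    return pos;
}

/// Splits `s` near its byte midpoint without cutting a character.
string[2] splitNearMiddle(string s) pure nothrow @nogc @safe
{
    immutable mid = nextBoundary(s, s.length / 2);
    return [s[0 .. mid], s[mid .. $]];
}

void main()
{
    // Each Greek letter is 2 bytes, so the byte midpoint (5) falls
    // inside the third letter and gets moved forward to offset 6.
    auto parts = splitNearMiddle("αβγδε");
    assert(parts[0] == "αβγ");
    assert(parts[1] == "δε");
}
```

Both halves are still valid utf-8 and can be processed as 
strings or as bytes, whichever the next stage needs.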

> Your example with log files mixes binary data with text. 
> A properly done logger will escape delimiters inside text 
> chunks, so it isn't even a string per se, it's some binary data 
> from which you need to extract a string first.

Again, I'm not following the logic. Log files may or may not 
include binary data. But I'm not sure why that matters. I'm 
talking about log files where the text portions are encoded as 
utf-8.

> A lot of bugs are caused by this mixing of text with binary. 
> And I think it is better to distinguish them properly on a type 
> level.

Perhaps it would help if you described what you mean by "binary". 
I tend to think of "binary" as things like image data, binary 
serialization formats, base-64 coding, compressed or encrypted 
text. These are quite different from utf-8 encoded unicode text.


