A D vs. Rust example

Walter Bright newshound2 at digitalmars.com
Fri Oct 28 04:27:25 UTC 2022


On 10/27/2022 4:55 PM, H. S. Teoh wrote:
> You don't have to refuse anything.  Just substitute it with the Unicode
> replacement character in your standard library, and no downstream code
> will need to worry about it anymore.

That's one way to deal with it. But until it has been processed that way, it 
isn't a string, if the string type requires strict UTF-8.


> And should you ever need to process invalid sequences (e.g., in a
> utility to repair broken encodings), just read it as binary and process
> it that way.

Yes, but you can't do it with strings, if strings don't allow invalid sequences.
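To illustrate with Rust, where `str` does require strict UTF-8: a minimal 
sketch of what "read it as binary" looks like in practice. The function name 
`first_invalid_offset` is made up for this example; `std::str::from_utf8` and 
`Utf8Error::valid_up_to` are standard library calls. The input has to stay a 
byte slice, because it cannot be turned into a `&str` until it validates.

```rust
/// Return the byte offset of the first invalid UTF-8 sequence, if any.
/// The input is a byte slice because a strict `&str` cannot hold
/// invalid data in the first place.
fn first_invalid_offset(input: &[u8]) -> Option<usize> {
    // `from_utf8` validates the whole slice; on failure the error
    // reports how many leading bytes were valid.
    std::str::from_utf8(input).err().map(|e| e.valid_up_to())
}
```

A repair utility built on this would work entirely on `&[u8]` and only 
convert to a string type after the broken sequences have been dealt with.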


>> A better approach is to have the string processing be tolerant of
>> invalid UTF-8.
> 
> Which makes string-processing code more fragile and possibly more
> complex.

I've coded a lot of Phobos to be tolerant of invalid UTF-8. It turns out that 
it's *unusual* to need to decode UTF-8 at all. The resulting code is robust, 
not fragile.
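As a rough sketch of why decoding is rarely needed (written in Rust for 
illustration, not taken from Phobos): in UTF-8, bytes below 0x80 never occur 
inside a multi-byte sequence, so splitting on an ASCII delimiter can compare 
raw code units. The helper name `split_on_ascii` is hypothetical. Invalid 
bytes simply pass through untouched.

```rust
/// Split `input` on an ASCII delimiter by comparing raw bytes.
/// No UTF-8 decoding happens, so invalid sequences are carried
/// through unchanged instead of causing an error.
fn split_on_ascii(input: &[u8], delim: u8) -> Vec<&[u8]> {
    // Only ASCII delimiters are safe to match byte-wise: they can
    // never appear as part of a multi-byte UTF-8 sequence.
    assert!(delim.is_ascii(), "delimiter must be ASCII");
    input.split(|&b| b == delim).collect()
}
```

The pieces are borrowed slices of the original buffer, so nothing is copied 
and nothing is validated along the way.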


> Better to let the standard library replace all invalid
> sequences with the replacement character so that downstream code doesn't
> have to worry about it anymore.

Then you have another processing step, and have to make a copy of the string. As 
I wrote, I have some experience with this. Being tolerant of invalid UTF-8 is a 
winning strategy.
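For comparison, a small sketch of the replacement-character approach in Rust, 
using the standard `String::from_utf8_lossy`. The wrapper name `sanitize` is 
invented for this example. The call always scans the whole input, and it has 
to allocate a new string whenever it finds something to replace; only 
already-valid input is returned as a borrow.

```rust
use std::borrow::Cow;

/// Replace invalid UTF-8 with U+FFFD and report whether that
/// required allocating a new string.
fn sanitize(input: &[u8]) -> (Cow<'_, str>, bool) {
    // This is the extra processing step: a full validation pass.
    let text = String::from_utf8_lossy(input);
    // `Cow::Owned` means a copy was made to hold the replacements.
    let copied = matches!(text, Cow::Owned(_));
    (text, copied)
}
```

Whether that cost matters depends on how often the input is actually invalid 
and how large it is; the sketch only shows where the extra pass and the copy 
come from.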



More information about the Digitalmars-d mailing list