[Issue 14519] [Enh] foreach on strings should return replacementDchar rather than throwing

via Digitalmars-d-bugs digitalmars-d-bugs at puremagic.com
Wed Apr 29 04:09:02 PDT 2015


https://issues.dlang.org/show_bug.cgi?id=14519

--- Comment #16 from Vladimir Panteleev <thecybershadow at gmail.com> ---
(In reply to Walter Bright from comment #15)
> It still allocates memory. But it's worth thinking about. Maybe assert()?

Sure.

> > I did not mean Unicode normalization - it was a joke (std.algorithm will
> > "normalize" invalid UTF characters to the replacement character). But since
> > .front on strings autodecodes, feeding a string to any generic range
> > function in std.algorithm will cause auto-decoding (and thus, character
> > substitution).
> 
> That can be fixed as I suggested.

Sorry, I'm not following. Which suggestion here will fix what in what way?

> Global opt-in for foreach is not feasible.

I agree - some libraries will expect one thing, and others another.

> However, one can add an algorithm
> "validate" which throws on invalid UTF, and put that at the start of a
> pipeline, as in:
> 
>     text.validate.A.B.C.D;

This is part of a solution. There also needs to be a way to ensure that
validate was called, which is the hard part.
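For concreteness, a pipeline head like the one Walter describes could be sketched on top of the existing std.utf.validate (which throws a UTFException on malformed input). The wrapper name "validated" below is my own invention for illustration, not a Phobos API:

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.utf : validate;

// Hypothetical pipeline head (the name "validated" is not Phobos API):
// eagerly validate the whole string, then hand it on unchanged so the
// rest of the chain can assume well-formed UTF.
auto validated(S)(S text)
{
    validate(text); // std.utf.validate throws UTFException on invalid UTF
    return text;
}

void main()
{
    auto decoded = "héllo".validated
                          .map!(c => c) // autodecoded dchar elements
                          .array;
    assert(decoded.length == 5);
}
```

This only solves the throwing half of the problem; as noted above, nothing in the type system records that validation happened, so downstream stages cannot statically rely on it.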

> You brought up guessing the encoding of XML text by reading the start of it:
> "what if it was some 8-bit encoding that only LOOKED like valid UTF-8?"

No, that's not what I meant.

UTF-8 and the old 8-bit encodings (ISO 8859-*, Windows-125*) both use bytes with
the high bit set to represent non-ASCII characters. Consider a program that
expects a UTF-8 document but is actually fed one in an 8-bit encoding: it is
possible (although unlikely) for such 8-bit text to pass as a valid UTF-8
stream. Thus, invalid UTF-8 can indicate a problem with the entire document, and
not just with the immediate sequence of bytes.
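As a concrete illustration (the byte values are my own example, not from the bug report): the two bytes 0xC3 0xA9 spell "Ã©" in Windows-1252, yet they are also the valid UTF-8 encoding of "é", so a misencoded document can slip past UTF-8 validation with its meaning changed:

```d
import std.utf : validate;

void main()
{
    // In Windows-1252, 0xC3 0xA9 is the two-character text "Ã©".
    // The same bytes are also the valid UTF-8 encoding of U+00E9 ("é").
    immutable ubyte[] raw = [0xC3, 0xA9];
    string s = cast(string) raw;

    validate(s); // does not throw: the bytes form well-formed UTF-8
    assert(s == "é"); // decoded as UTF-8, the original meaning is lost
}
```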

> If you have a pipeline A.B.C.D, then A throws on invalid UTF, and B.C.D
> never are executed. But if A does not throw, then B.C.D guaranteed to be
> getting valid UTF, but they still pay the penalty of the compiler thinking
> they can allocate memory and throw.

OK, so you're saying that if we know the UTF-8 we're getting is valid, we can
somehow automatically remove the cost of handling invalid UTF-8? I don't see how
this would work, or how it would provide a noticeable benefit in practice. Since
the cost of removing a code path is negligible, I assume you're talking about
exception frames, but I still don't see how this applies. Could you elaborate,
or is this improvement only a theory for now?

Besides, won't A's output be a range of dchar, so that B, C, and D will not
autodecode with or without this change?
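That last point can be checked statically (a sketch; map stands in for stage A, since any range algorithm applied to a string sees the autodecoded elements):

```d
import std.algorithm.iteration : map;
import std.range : ElementType;

void main()
{
    // Stand-in for stage A: a range algorithm applied to a string
    // receives autodecoded dchar elements via .front, so the later
    // stages B, C, D consume a range of dchar and never re-decode.
    auto a = "abc".map!(c => c);
    static assert(is(ElementType!(typeof(a)) == dchar));
}
```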

--

