Handling invalid UTF sequences

Denis Shelomovskij verylonglogin.reg at gmail.com
Fri Mar 21 03:39:44 PDT 2014


On 21.03.2014 12:25, monarch_dodra wrote:
> On Thursday, 20 March 2014 at 23:34:02 UTC, Brad Anderson wrote:
>> I'm a fan of this approach but Timon pointed out when I wrote about it
>> once that it's rather trivial to get an invalid string through slicing
>> mid-code point so now I'm not so sure.
>
> It's just as easy to slice mid-codepoint as it is to access a range out
> of bounds. In both cases, it's a programming error.
>
> The only excuse I see for throwing an exception for slicing
> mid-codepoint, is that
> 1. programmers are less aware of the issue, so it's more forgiving in a
> released program (nobody likes a crash).
> 2. arguably, it's not the *program* state that's bad. It's the *data*.
>
> Well, in regards to "2", you could argue that program state and data
> state is one and the same.
>
>> I think I'm still in favor of it because you've obviously got a logic
>> error if that happens so your program isn't correct anyway (it's not a
>> matter of bad user input).
>
>
> If I remember correctly, with a specially written UTF string, it *was*
> possible to corrupt program state. I think. I need to double check. I
> didn't give it much thought then ("it should virtually never happen"),
> but it could be used as deliberate security vulnerability.

Almost nothing to add here. We already have `-noboundscheck`, which can
dramatically increase performance. Throwing `UTFError` should either use
the same mechanism (`-noutfcheck`?) or simply be stripped in release builds.
Personally I'd choose the latter, as real programs already contain lots of
(sometimes very slow) assertions that are stripped with `-release` and that
guard against the same kind of critical data corruption.
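
To make the comparison concrete, here is a rough sketch of the "stripped in release" option. It assumes current Phobos, where `std.utf.validate` throws `UTFException`; `isValidUtf` and `checkedSlice` are hypothetical names made up for this illustration, and `-noutfcheck` is only the flag floated above, not an existing compiler switch.

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper (not part of Phobos): true if `s` is well-formed UTF-8.
bool isValidUtf(string s)
{
    try
    {
        validate(s);        // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

// Hypothetical wrapper: treats a mid-code-point slice as a logic error,
// in the same way an out-of-bounds index is one.
string checkedSlice(string s, size_t from, size_t to)
{
    auto r = s[from .. to];   // array bounds check (can be disabled with -noboundscheck)
    assert(isValidUtf(r), "slice splits a code point"); // removed by -release
    return r;
}

void main()
{
    string s = "\u00E9!";     // U+00E9 takes two UTF-8 code units: 0xC3 0xA9
    assert(s.length == 3);    // length counts code units, not code points

    // Slicing from index 1 starts in the middle of the code point.
    // Today this only surfaces later, as an exception from decoding/validation.
    assertThrown!UTFException(validate(s[1 .. $]));

    // A slice on code point boundaries passes the assertion.
    assert(checkedSlice(s, 0, 2) == "\u00E9");
}
```

In a build without `-release`, `checkedSlice(s, 1, 3)` would fail the assertion; with `-release` the check disappears entirely, exactly like the other assertions mentioned above.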

-- 
Денис В. Шеломовский
Denis V. Shelomovskij


More information about the Digitalmars-d mailing list