Handling invalid UTF sequences

monarch_dodra monarchdodra at gmail.com
Fri Mar 21 01:25:00 PDT 2014


On Thursday, 20 March 2014 at 23:34:02 UTC, Brad Anderson wrote:
> I'm a fan of this approach but Timon pointed out when I wrote 
> about it once that it's rather trivial to get an invalid string 
> through slicing mid-code point so now I'm not so sure.

It's just as easy to slice mid-codepoint as it is to access a 
range out of bounds. In both cases, it's a programming error.
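
To make the comparison concrete, here is a rough sketch in Python rather than D, so treat it only as an analogy: Python's `bytes` stands in for a UTF-8 `string`, and the helper names (`slice_mid_code_point`, `index_out_of_bounds`, `is_valid_utf8`) are made up for illustration. The out-of-bounds access fails at the point of the mistake, while the bad slice goes through silently and only shows up later when something decodes it.

```python
# "naïve" encodes to 6 bytes in UTF-8: 'ï' takes two bytes (0xC3 0xAF)
# at indices 2 and 3.
text = "naïve"
data = text.encode("utf-8")


def slice_mid_code_point(buf: bytes) -> bytes:
    """Cut between the two bytes of 'ï'. The slice itself succeeds."""
    return buf[:3]


def index_out_of_bounds(buf: bytes) -> int:
    """Index one past the end. This raises IndexError immediately."""
    return buf[len(buf)]


def is_valid_utf8(buf: bytes) -> bool:
    """Report whether buf decodes cleanly as UTF-8."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    chunk = slice_mid_code_point(data)
    print("slice produced:", chunk, "valid:", is_valid_utf8(chunk))
    try:
        index_out_of_bounds(data)
    except IndexError as err:
        print("out-of-bounds access failed:", err)
```

Both mistakes come from a wrong index, which is the point above. The difference is only in when the failure becomes visible.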

The only excuses I see for throwing an exception when slicing 
mid-codepoint are that:
1. programmers are less aware of the issue, so an exception is 
more forgiving in a released program (nobody likes a crash).
2. arguably, it's not the *program* state that's bad. It's the 
*data*.

Well, with regard to "2", you could argue that program state and 
data state are one and the same.

> I think I'm still in favor of it because you've obviously got a 
> logic error if that happens so your program isn't correct 
> anyway (it's not a matter of bad user input).


If I remember correctly, with a specially crafted UTF string, it 
*was* possible to corrupt program state. I think. I need to 
double check. I didn't give it much thought at the time ("it 
should virtually never happen"), but it could be exploited 
deliberately as a security vulnerability.

