dchar undefined behaviour

tsbockman via Digitalmars-d digitalmars-d at puremagic.com
Thu Oct 22 18:31:44 PDT 2015


While working on updating and improving Lionello Lunesu's 
proposed fix for DMD issue #259, I have come across a value 
range propagation (VRP) related issue with the dchar type.

The patch adds VRP-based compile-time evaluation of integer type 
comparisons, where possible. This caused the following issue:

The compiler will now optimize out attempts to handle invalid, 
out-of-range dchar values. For example:

import std.stdio : writeln;

dchar c = cast(dchar) uint.max;
if (c > 0x10FFFF)
     writeln("invalid");
else
     writeln("OK");

With constant folding for integer comparisons, the above will 
print "OK" when it should print "invalid". The predicate 
(c > 0x10FFFF) is simply *assumed* to be false, because the 
starting range.imax for a dchar expression is currently 
dchar.max.
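
To illustrate, here is a minimal, self-contained sketch. The 
folding behaviour is as described above; declaring the value as 
uint is just one possible way to keep the check from being 
assumed away:

import std.stdio : writeln;

void main()
{
    // VRP assumes c <= dchar.max, so with the patch the predicate
    // (c > 0x10FFFF) is folded to false and "OK" is printed.
    dchar c = cast(dchar) uint.max;
    writeln(c > 0x10FFFF ? "invalid" : "OK");

    // A uint starts with the full 0 .. uint.max range, so the same
    // check cannot be assumed false and is actually performed.
    uint u = uint.max;
    writeln(u > 0x10FFFF ? "invalid" : "OK");
}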

So, this leads to the question: is using dchar values greater 
than dchar.max considered undefined behaviour, or not?

1. If it is UB, then there is quite a lot of D code (including 
std.uni) that must be corrected to use uint instead of dchar 
when dealing with values that may fall outside the officially 
supported range (see the sketch after this list).

2. If it is not UB, then the compiler needs to be updated to stop 
assuming that dchar values greater than dchar.max are impossible. 
This basically just means removing some of dchar's special 
treatment, and running it through more of the same code paths as 
uint.
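
For example, under #1 a check like the one above would need to 
take its input as uint rather than dchar (a minimal sketch; the 
function name is purely for illustration):

// Option 1: accept a uint, so that values above dchar.max are
// representable without relying on out-of-range dchar behaviour.
bool isValidCodePoint(uint c)
{
    return c <= 0x10FFFF;
}

Under #2, the dchar version of the same check would keep 
working, because the compiler would no longer assume that 
c <= dchar.max.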

At the moment, I strongly prefer #2, but I suppose #1 could make 
sense if people think code which might have to deal with invalid 
code points can be isolated sufficiently from other Unicode 
processing.

