Behavior of strings with invalid unicode...

Wed Nov 21 10:23:11 PST 2012

On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
> So here are my 2 questions:
> 1. Is there, or does anyone know of, a standardized "behavior to
> follow when decoding utf with invalid codes"?
> 
> 2. Do we even really support invalid UTF after we "leave" the
> std.utf.decode layer? EG: We simply suppose that the string is
> valid?

We don't support invalid Unicode beyond providing ways to check for it and in 
some cases throwing if it's encountered. If you create a string with invalid 
Unicode, then you're shooting yourself in the foot, and you could get weird 
results. Some code checks for validity and will throw when it's given invalid 
Unicode (decode in particular does this), whereas some code will simply ignore 
the fact that it's invalid and move on (generally, because it's not bothering 
to go to the effort of validating it). I believe that at the moment, the idea 
is that when the full decoding of a character occurs, a UTFException will be 
thrown if an invalid code point is encountered, whereas anything which 
partially decodes characters (e.g. just figures out how large a code point is) 
may or may not throw. popFront used to throw but doesn't any longer in an 
effort to make it faster, letting decode be the one to throw (so front would 
still throw, but popFront wouldn't).
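
As a rough illustration of the "check for it / throw on full decode" part, here 
is a minimal sketch using std.utf.validate and std.utf.decode. The names 
isWellFormed and invalidSample are made up for this example and are not part 
of Phobos, and the byte 0xFF is just one convenient way to get an invalid 
sequence. It deliberately says nothing about popFront, since that behavior is 
the part that may vary.

```d
import std.exception : assertThrown;
import std.utf : UTFException, decode, validate;

/// Hypothetical helper (not in Phobos): true if `s` is well-formed UTF-8.
bool isWellFormed(string s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Hypothetical helper: builds a string holding a byte that can never
/// appear in well-formed UTF-8. The cast bypasses any checking.
string invalidSample()
{
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62]; // 'a', invalid byte, 'b'
    return cast(string) raw;
}

void main()
{
    string s = invalidSample();

    // Checking up front: the string as a whole is rejected.
    assert(!isWellFormed(s));
    assert(isWellFormed("abc"));

    // Full decoding: the valid first character decodes normally...
    size_t index = 0;
    assert(decode(s, index) == 'a');
    assert(index == 1);

    // ...and decode throws once it reaches the invalid byte.
    assertThrown!UTFException(decode(s, index));
}
```

So if you can't trust where a string came from, calling validate once at the 
boundary is the safest bet, rather than relying on whatever later code happens 
to do with it.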

I'm not aware of there being any standard way to deal with invalid Unicode, 
but I believe that popFront currently just treats invalid code points as being 
of length 1.

- Jonathan M Davis