Behavior of strings with invalid unicode...
Jonathan M Davis
jmdavisProg at gmx.com
Wed Nov 21 10:23:11 PST 2012
On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
> So here are my 2 questions:
> 1. Is there, or does anyone know of, a standardized "behavior to
> follow when decoding utf with invalid codes"?
>
> 2. Do we even really support invalid UTF after we "leave" the
> std.utf.decode layer? EG: We simply suppose that the string is
> valid?
We don't support invalid Unicode beyond providing ways to check for it and, in
some cases, throwing if it's encountered. If you create a string with invalid
Unicode, then you're shooting yourself in the foot, and you could get weird
results. Some code checks for validity and will throw when it's given invalid
Unicode (decode in particular does this), whereas some code will simply ignore
the fact that it's invalid and move on (generally because it's not bothering
to go to the effort of validating it). I believe that at the moment, the idea
is that when the full decoding of a character occurs, a UTFException will be
thrown if an invalid sequence is encountered, whereas anything which only
partially decodes characters (e.g. just figures out how many code units a code
point occupies) may or may not throw. popFront used to throw but no longer
does, in an effort to make it faster, leaving decode as the one to throw (so
front still throws, but popFront doesn't).
I'm not aware of there being any standard way to deal with invalid Unicode,
but I believe that popFront currently just treats an invalid sequence as being
of length 1 (i.e. it skips a single code unit).
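
As a rough illustration of the split described above, here is a minimal
sketch. It assumes a reasonably recent Phobos, where front and popFront for
strings live in std.range.primitives (the module layout has changed over
time). The helper names isValidUtf8 and countPops are made up for this
example and are not part of Phobos.

```d
import std.exception : assertThrown;
import std.range.primitives : empty, front, popFront;
import std.utf : UTFException, decode, validate;

/// Hypothetical helper: true if `s` is well-formed UTF-8.
/// std.utf.validate throws a UTFException otherwise.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Hypothetical helper: counts how many popFront calls it takes to
/// exhaust `s`. front is never called, so no full decode happens.
size_t countPops(const(char)[] s)
{
    size_t n;
    while (!s.empty)
    {
        s.popFront();
        ++n;
    }
    return n;
}

void main()
{
    // 0x80 is a continuation byte, so it cannot start a UTF-8 sequence.
    const(char)[] bad = [cast(char) 0x80, 'a', 'b'];

    assert(!isValidUtf8(bad));

    // Full decoding throws on the invalid sequence.
    size_t i = 0;
    assertThrown!UTFException(decode(bad, i));

    // front does a full decode, so it throws as well.
    assertThrown!UTFException(bad.front);

    // popFront does not validate: the stray byte is skipped as a single
    // code unit, so three pops exhaust the three-byte string.
    assert(countPops(bad) == 3);
}
```

So code that only ever calls popFront can walk straight past invalid data,
while anything that calls front or decode on the same data will throw. If
you need a guarantee, validate the string up front.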
- Jonathan M Davis