Behavior of strings with invalid unicode...

monarch_dodra monarchdodra at gmail.com
Sun Nov 25 23:47:48 PST 2012


On Wednesday, 21 November 2012 at 18:25:56 UTC, Jonathan M Davis 
wrote:
> On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
>> So here are my 2 questions:
>> 1. Is there, or does anyone know of, a standardized "behavior to
>> follow when decoding utf with invalid codes"?
>> 
>> 2. Do we even really support invalid UTF after we "leave" the
>> std.utf.decode layer? EG: We simply suppose that the string is
>> valid?
>
> We don't support invalid Unicode beyond providing ways to check for it
> and, in some cases, throwing if it's encountered. If you create a string
> with invalid Unicode, then you're shooting yourself in the foot, and you
> could get weird results. Some code checks for validity and will throw
> when it's given invalid Unicode (decode in particular does this), whereas
> some code will simply ignore the fact that it's invalid and move on
> (generally because it's not bothering to go to the effort of validating
> it). I believe that at the moment, the idea is that when the full
> decoding of a character occurs, a UTFException will be thrown if an
> invalid code point is encountered, whereas anything which partially
> decodes characters (e.g. just figures out how large a code point is) may
> or may not throw. popFront used to throw but doesn't any longer, in an
> effort to make it faster, letting decode be the one to throw (so front
> would still throw, but popFront wouldn't).

OK, I guess that makes sense. I kind of wish there were a more clearly
documented "two-level" scheme, but that should be fine.

> I'm not aware of there being any standard way to deal with invalid
> Unicode, but I believe that popFront currently just treats invalid code
> points as being of length 1.
>
> - Jonathan M Davis

Well, popFront pops a single element only if the very first code unit
is invalid. For multi-byte sequences, it will not "see" whether the code
unit at index 2 is invalid.

This gives it something of a double standard, but I guess we have to
draw a line somewhere.
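
To make the double standard concrete, here is a minimal sketch in Python
that models the behavior described in this thread. It is not Phobos code:
stride, pop_front and front are hypothetical stand-ins for the
lead-byte-only length check, the non-validating popFront, and the fully
validating front/decode, and the assumption is that an invalid lead byte
counts as length 1.

```python
def stride(data: bytes, i: int) -> int:
    """Length implied by the lead byte alone (hypothetical helper).

    Continuation bytes are never inspected; an invalid lead byte is
    assumed to count as a sequence of length 1.
    """
    b = data[i]
    if b < 0x80:
        return 1
    if 0xC0 <= b < 0xE0:
        return 2
    if 0xE0 <= b < 0xF0:
        return 3
    if 0xF0 <= b < 0xF8:
        return 4
    return 1  # stray continuation byte or invalid lead byte


def pop_front(data: bytes) -> bytes:
    """Drop one presumed code point without validating it."""
    return data[min(stride(data, 0), len(data)):]


def front(data: bytes) -> str:
    """Fully decode the first code point; raises UnicodeDecodeError if invalid."""
    n = min(stride(data, 0), len(data))
    return data[:n].decode("utf-8")


# Case 1: the first byte is itself invalid, so only one byte is popped.
bad_lead = b"\x80abc"
rest_after_bad_lead = pop_front(bad_lead)

# Case 2: the lead byte 0xE2 announces 3 bytes, but 0x28 is not a
# continuation byte. pop_front skips all 3 bytes without noticing.
bad_tail = b"\xe2\x28\xa1z"
rest_after_bad_tail = pop_front(bad_tail)
```

In this model, pop_front(bad_lead) leaves b"abc", while pop_front(bad_tail)
leaves b"z" and silently swallows the valid "(" at index 1. Calling front
on either input raises, which is the "front throws, popFront doesn't" split
mentioned above.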


More information about the Digitalmars-d mailing list