Behavior of strings with invalid unicode...
monarch_dodra
monarchdodra at gmail.com
Sun Nov 25 23:47:48 PST 2012
On Wednesday, 21 November 2012 at 18:25:56 UTC, Jonathan M Davis
wrote:
> On Wednesday, November 21, 2012 14:25:00 monarch_dodra wrote:
>> So here are my 2 questions:
>> 1. Is there, or does anyone know of, a standardized "behavior to
>> follow when decoding utf with invalid codes"?
>>
>> 2. Do we even really support invalid UTF after we "leave" the
>> std.utf.decode layer? EG: We simply suppose that the string is
>> valid?
>
> We don't support invalid unicode beyond providing ways to check for
> it and in some cases throwing if it's encountered. If you create a
> string with invalid unicode, then you're shooting yourself in the
> foot, and you could get weird results. Some code checks for validity
> and will throw when it's given invalid unicode (decode in particular
> does this), whereas some code will simply ignore the fact that it's
> invalid and move on (generally, because it's not bothering to go to
> the effort of validating it). I believe that at the moment, the idea
> is that when the full decoding of a character occurs, a UTFException
> will be thrown if an invalid code point is encountered, whereas
> anything which partially decodes characters (e.g. just figures out
> how large a code point is) may or may not throw. popFront used to
> throw but doesn't any longer in an effort to make it faster, letting
> decode be the one to throw (so front would still throw, but popFront
> wouldn't).
OK: I guess that makes sense. I kind of wish there were more of a
documented "two-level" scheme, but that should be fine.
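To make the "two-level" idea concrete, here is a minimal sketch of the
checking level, assuming std.utf.validate and std.utf.decode throw
UTFException on invalid input as described above. The helper names
isValidUtf8 and decodeThrows are hypothetical and only serve the
illustration:

```d
import std.utf : decode, validate, UTFException;

/// Hypothetical helper: true if `s` passes std.utf.validate.
bool isValidUtf8(const(char)[] s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// Hypothetical helper: true if fully decoding the code point that
/// starts at index 0 throws a UTFException.
bool decodeThrows(const(char)[] s)
{
    size_t i = 0;
    try
    {
        dchar c = decode(s, i); // full decoding, advances i on success
        return false;
    }
    catch (UTFException)
    {
        return true;
    }
}

void main()
{
    // 0xC3 announces a two-byte sequence, but '(' is not a continuation byte.
    const(char)[] bad = [cast(char) 0xC3, '('];

    assert(!isValidUtf8(bad));   // caught by an up-front check
    assert(decodeThrows(bad));   // caught again at full decoding
    assert(isValidUtf8("héllo"));
}
```

Code that validates once at the boundary could then treat the string as
trusted; anything that skips the check is on its own.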
> I'm not aware of there being any standard way to deal with invalid
> Unicode, but I believe that popFront currently just treats invalid
> code points as being of length 1.
>
> - Jonathan M Davis
Well, popFront pops a single element only if the very first code unit
is invalid as a lead byte. For a multi-byte sequence, it will not
"see" whether the code unit that follows the lead byte is invalid.
This kind of gives it a double-standard behavior, but I guess we
have to draw a line somewhere.
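A small sketch of that double standard, assuming popFront on narrow
strings only inspects the first code unit to decide how much to skip
(the exact behavior may differ between Phobos versions). The helper
unitsPopped is hypothetical, and popFront is imported from
std.range.primitives, where it lives in current Phobos:

```d
import std.range.primitives : popFront;
import std.stdio : writeln;

/// Hypothetical helper: number of code units one popFront call removes.
/// `s` is a local slice, so the caller's array is left untouched.
size_t unitsPopped(const(char)[] s)
{
    immutable before = s.length;
    s.popFront();
    return before - s.length;
}

void main()
{
    // A stray continuation byte in first position: invalid as a lead byte.
    const(char)[] badLead = [cast(char) 0x80, 'a', 'b'];

    // A valid two-byte lead (0xC3) followed by '(', which is not a
    // continuation byte: the error sits in the second code unit.
    const(char)[] badTail = [cast(char) 0xC3, '(', 'b'];

    writeln(unitsPopped(badLead));
    writeln(unitsPopped(badTail));
}
```

If popFront behaves as described in this thread, the first call skips
one code unit and the second skips two, silently swallowing the '('.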