If invalid string should crash (was: string need to be robust)

ZY Zhou rinick at GeeeeMail.com
Sun Mar 13 23:55:58 PDT 2011


Thank you Jussi,

But this is still not part of the standard. U+FFFD is a commonly used approach,
while mapping invalid bytes to U+DC80..U+DCFF is also a common solution
(http://en.wikipedia.org/wiki/Utf8#Invalid_byte_sequences). Different approaches
solve different problems.
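
To make the difference concrete, here is a minimal sketch in Python rather than D
(it says nothing about how std.utf behaves). It assumes Python 3, whose built-in
"replace" and "surrogateescape" error handlers happen to correspond to the two
approaches; the variable names are only for illustration.

```python
# Invalid UTF-8 input: 0x80 is a lone continuation byte.
raw = b"abc\x80def"

# Approach 1: substitute U+FFFD for the malformed sequence.
# This is lossy: the original byte cannot be recovered afterwards.
replaced = raw.decode("utf-8", errors="replace")

# Approach 2: map each undecodable byte 0x80..0xFF to U+DC80..U+DCFF.
# This is lossless: the original bytes can be restored on encoding.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")

# Strict decoding, for comparison, rejects the input outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

The first approach suits display, where a visible marker is enough. The second
suits round-tripping data such as file names, where the original bytes must
survive.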

I think the current problem in D is that the std.utf module is ill-defined. It is not
designed to make developers' lives easier; it just leads developers to ignore
the case where a UTF-8 string can be invalid.

--ZY Zhou

== Quote from Jussi Jumppanen (jussij at zeusedit.com)'s article
> %u Wrote:
> > I agree with a), but not b), Can't find anything in unicode standard says
> > you can use the low surrogate like that
> According to: http://www.cl.cam.ac.uk/~mgk25/
>     According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
>     receiving UTF-8 shall interpret a "malformed sequence in the same way
>     that it interprets a character that is outside the adopted subset" and
>     "characters that are not within the adopted subset shall be indicated
>     to the user" by a receiving device. A quite commonly used approach in
>     UTF-8 decoders is to replace any malformed UTF-8 sequence by a
>     replacement character (U+FFFD), which looks a bit like an inverted
>     question mark, or a similar symbol.
> Refer to this file for the above quote:
> http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt


