string need to be robust

ZY Zhou rinick at GeeMail.com
Sun Mar 13 01:57:12 PST 2011


Hi,

I wrote a small program to read and parse html(charset=UTF-8). It worked great
until some invalid utf8 chars appears in that page.
When the string is invalid, things like foreach or std.string.tolower will
just crash.
this make the string type totally unusable when processing files, since there
is no guarantee that utf8 file doesn't contain invalid utf8 chars.

So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I
convert all invalid utf8 chars to low surrogate code points(0x80~0xFF ->
0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to
know which part of the string is invalid. Besides, after processing the
dchar[] string, I still can convert it back to utf8 char[] without affecting
any of the invalid part.

But it is still too easy to crash program with invalid string.
Is it possible to make this a native feature of string? Or is there any other
recommended method to solve this issue?


Thank you,
--ZY Zhou


More information about the Digitalmars-d mailing list