string need to be robust

Sun Mar 13 03:07:55 PDT 2011

On Sunday 13 March 2011 01:57:12 ZY Zhou wrote:
> Hi,
> 
> I wrote a small program to read and parse HTML (charset=UTF-8). It worked
> well until a page turned up that contained invalid UTF-8 sequences.
> When a string is invalid, things like foreach or std.string.tolower
> just crash.
> This makes the string type unusable for processing files, since there is
> no guarantee that a UTF-8 file contains only valid UTF-8.
> 
> So I wrote a UTF-8 decoder myself to convert char[] to dchar[]. In my
> decoder, I convert every invalid UTF-8 byte to a low surrogate code
> point (0x80~0xFF -> 0xDC80~0xDCFF). Since low surrogates are invalid in
> UTF-32, I can still tell which parts of the string are invalid.
> Besides, after processing the dchar[] string, I can still convert it back
> to a UTF-8 char[] without altering any of the invalid parts.
> 
> But it is still too easy to crash a program with an invalid string.
> Is it possible to make this a native feature of string? Or is there
> another recommended way to solve this issue?

Check out std.utf. It has the functions for validating, decoding, and
encoding Unicode strings.
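
For illustration, the byte-to-low-surrogate mapping described in the quoted
message (0x80~0xFF -> 0xDC80~0xDCFF) can be sketched as follows. This is
Python, not D, and it is not the poster's decoder: it relies on Python's
standard "surrogateescape" error handler, which happens to use the same
mapping. The helper names decode_lossless, encode_lossless and
invalid_positions are hypothetical and exist only for this example.

```python
def decode_lossless(raw: bytes) -> str:
    """Decode UTF-8; each undecodable byte 0x80..0xFF becomes U+DC80..U+DCFF."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Encode to UTF-8; escaped low surrogates turn back into the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def invalid_positions(text: str) -> list:
    """Return the indices of code points that stand for invalid input bytes."""
    return [i for i, ch in enumerate(text) if 0xDC80 <= ord(ch) <= 0xDCFF]


# A valid two-byte sequence (0xC3 0xA9) followed by a stray 0xFF byte.
raw = b"caf\xc3\xa9 \xff ok"
text = decode_lossless(raw)

# The stray byte is still identifiable after decoding.
bad = invalid_positions(text)

# String processing works on the decoded text, and the invalid byte
# survives the round trip back to bytes unchanged.
upper_raw = encode_lossless(text.upper())
```

Here a strict raw.decode("utf-8") would raise UnicodeDecodeError, which
is roughly the failure described in the quoted message. With the
escaping handler the text can be processed and written back with the
invalid byte preserved.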

- Jonathan M Davis