string need to be robust
Jonathan M Davis
jmdavisProg at gmx.com
Sun Mar 13 03:07:55 PDT 2011
On Sunday 13 March 2011 01:57:12 ZY Zhou wrote:
> Hi,
>
> I wrote a small program to read and parse HTML (charset=UTF-8). It worked
> great until some invalid UTF-8 sequences appeared in a page.
> When a string is invalid, things like foreach or std.string.tolower
> just crash.
> This makes the string type totally unusable when processing files, since
> there is no guarantee that a UTF-8 file doesn't contain invalid UTF-8.
>
> So I wrote a UTF-8 decoder myself to convert char[] to dchar[]. In my
> decoder, I convert each invalid UTF-8 byte to a low surrogate code
> point (0x80~0xFF -> 0xDC80~0xDCFF). Since low surrogates are invalid UTF-32
> code units, I can still tell which parts of the string are invalid.
> Besides, after processing the dchar[] string, I can still convert it back
> to a UTF-8 char[] without affecting any of the invalid parts.
>
> But it is still too easy to crash a program with an invalid string.
> Is it possible to make this a native feature of string? Or is there any
> other recommended method to solve this issue?
Check out std.utf. It has the functions for dealing with Unicode, including
validating and decoding UTF-8.
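
For illustration only, here is a minimal sketch of checking and decoding
possibly-invalid input with std.utf. It is not from the original message and
assumes a recent Phobos release (byDchar and replacementDchar were added to
std.utf after this thread). The input bytes are made up for the example.

```d
import std.stdio : writeln;
import std.utf : byDchar, replacementDchar, UTFException, validate;

void main()
{
    // Hypothetical input: "ab", a stray 0xFF byte (never valid in UTF-8), then "c".
    immutable ubyte[] raw = [0x61, 0x62, 0xFF, 0x63];
    auto text = cast(string) raw;

    // validate throws UTFException on malformed input, so bad data can be
    // rejected up front instead of failing somewhere inside a foreach.
    try
    {
        validate(text);
        writeln("input is valid UTF-8");
    }
    catch (UTFException e)
    {
        writeln("invalid UTF-8: ", e.msg);
    }

    // byDchar decodes lazily and yields replacementDchar (U+FFFD) for
    // invalid sequences rather than throwing.
    foreach (dchar c; text.byDchar)
    {
        if (c == replacementDchar)
            writeln("<invalid>");
        else
            writeln(c);
    }
}
```

Note that substituting U+FFFD is lossy: unlike the surrogate mapping described
in the quoted message, the original invalid bytes cannot be recovered when
encoding back to UTF-8.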
- Jonathan M Davis