string need to be robust

Sun Mar 13 05:22:32 PDT 2011

On 03/13/2011 10:57 AM, ZY Zhou wrote:
> Hi,
>
> I wrote a small program to read and parse HTML (charset=UTF-8). It worked
> great until some invalid UTF-8 chars appeared in a page.
> When the string is invalid, things like foreach or std.string.tolower will
> just crash.
> This makes the string type totally unusable when processing files, since
> there is no guarantee that a UTF-8 file doesn't contain invalid UTF-8 chars.
>
> So I made a utf8 decoder myself to convert char[] to dchar[]. In my decoder, I
> convert all invalid utf8 chars to low surrogate code points(0x80~0xFF ->
> 0xDC80~0xDCFF), since low surrogate are invalid utf32 codes, I'm still able to
> know which part of the string is invalid. Besides, after processing the
> dchar[] string, I still can convert it back to utf8 char[] without affecting
> any of the invalid part.
>
> But it is still too easy to crash program with invalid string.
> Is it possible to make this a native feature of string? Or is there any other
> recommended method to solve this issue?

D's native features *must* crash or throw when the source text is invalid.
What do you think?
What should a square root function do when you pass it a negative input?
/You/ may have special requirements for those cases (ignore it, log it, negate
it, replace it with 0 or 1...), but the library must fail anyway. Your
requirements are application-specific needs that /you/ must define yourself.
I hope that's clear.
D offers a UTF-8 checking function (checking UTF-8 is the same work as
converting to UTF-32: it just tries to convert and throws when that fails).
I would call it before processing, and then do what /you/ expect when the
input turns out to be invalid.
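
To make the "check first, then apply your own policy" idea concrete, here is
a minimal sketch. It assumes the checking function meant above is
std.utf.validate, which throws a UTFException on malformed input. The helper
names isWellFormed and decodeLossless are hypothetical, made up for this
illustration. decodeLossless is one possible application-level policy, namely
the scheme from the original post (each undecodable byte 0x80~0xFF becomes
the code point 0xDC80~0xDCFF), and not something Phobos provides.

```d
import std.utf : decode, validate, UTFException;

/// Hypothetical helper: true when `text` is well-formed UTF-8.
bool isWellFormed(const(char)[] text)
{
    try
    {
        validate(text); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

/// Hypothetical helper: decodes UTF-8 to UTF-32, mapping every byte that
/// cannot be decoded to 0xDC00 + byte, as described in the original post.
dchar[] decodeLossless(const(char)[] text)
{
    dchar[] result;
    size_t i = 0;
    while (i < text.length)
    {
        size_t next = i; // work on a copy so a failed decode cannot move `i`
        try
        {
            immutable dchar c = decode(text, next);
            result ~= c;
            i = next;
        }
        catch (UTFException e)
        {
            // A failing byte is never ASCII, so it lands in 0xDC80~0xDCFF.
            result ~= cast(dchar)(0xDC00 + text[i]);
            i += 1;
        }
    }
    return result;
}

unittest
{
    // 'a', a stray 0xFF byte, 'b': not valid UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    auto bad = cast(const(char)[]) raw;

    assert(isWellFormed("abc"));
    assert(!isWellFormed(bad));

    auto decoded = decodeLossless(bad);
    assert(decoded.length == 3);
    assert(decoded[0] == 'a' && decoded[1] == 0xDCFF && decoded[2] == 'b');
}
```

The unittest block runs with something like `rdmd -unittest -main file.d`.
Note that the resulting dchar[] is deliberately not valid UTF-32, so it
should only be handed to code that knows about the escaping; the reverse
conversion back to char[] is left out of this sketch.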

Denis
-- 
_________________
vita es estrany
spir.wikidot.com