string need to be robust
Jacob Carlborg
doob at me.com
Sun Mar 13 11:55:34 PDT 2011
On 2011-03-13 13:22, spir wrote:
> On 03/13/2011 10:57 AM, ZY Zhou wrote:
>> Hi,
>>
>> I wrote a small program to read and parse HTML (charset=UTF-8). It worked
>> great until some invalid UTF-8 chars appeared in a page.
>> When the string is invalid, things like foreach or std.string.tolower will
>> just crash.
>> This makes the string type totally unusable when processing files, since
>> there is no guarantee that a UTF-8 file doesn't contain invalid UTF-8 chars.
>>
>> So I made a UTF-8 decoder myself to convert char[] to dchar[]. In my
>> decoder, I convert all invalid UTF-8 chars to low surrogate code points
>> (0x80~0xFF -> 0xDC80~0xDCFF). Since low surrogates are invalid UTF-32
>> codes, I'm still able to know which part of the string is invalid. Besides,
>> after processing the dchar[] string, I can still convert it back to a
>> UTF-8 char[] without affecting any of the invalid part.
>>
>> But it is still too easy to crash a program with an invalid string.
>> Is it possible to make this a native feature of string? Or is there any
>> other recommended method to solve this issue?
>
> D native features *must* crash or throw when the source text is invalid.
> What do you think?
> What should a square root function do when you pass it negative input?
> /You/ may have special requirements for those cases (ignore it, log it,
> negate it, replace it with 0 or 1...), but the library must crash
> anyway. Your requirements are application-specific needs that /you/ must
> define yourself. Hope I'm clear.
> D offers a UTF-8 checking function (checking UTF-8 being the same as
> converting to UTF-32, it just tries to convert and throws when it fails). I
> would use it before processing to do what /you/ expect.
>
> Denis
I would say that the functions should NOT crash but instead throw an
exception. Then the developer can choose what to do when there's an
invalid Unicode character.
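
As a minimal sketch of that idea, assuming Phobos' std.utf.validate (which
throws a UTFException on malformed input): the caller catches the exception
and picks its own policy. The helper name validOr and the fallback text are
hypothetical, chosen only for illustration.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Hypothetical helper: returns `text` unchanged when it is well-formed
/// UTF-8, otherwise returns the caller-supplied `fallback`.
string validOr(string text, string fallback)
{
    try
    {
        validate(text); // throws UTFException on malformed UTF-8
        return text;
    }
    catch (UTFException e)
    {
        // The policy is the caller's choice: here we substitute a fallback,
        // but logging, skipping, or rethrowing would work the same way.
        return fallback;
    }
}

void main()
{
    // The byte 0xFF never occurs in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];

    writeln(validOr("hello", "<invalid UTF-8>"));
    writeln(validOr(cast(string) raw, "<invalid UTF-8>"));
}
```

Replacing the fallback with a lossless mapping, such as the surrogate scheme
described in the original post, would be another application-level choice.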
--
/Jacob Carlborg