suggestion of type: ustring

spir denis.spir at gmail.com
Sun Mar 20 11:22:10 PDT 2011


On 03/20/2011 05:12 PM, Jesse Phillips wrote:
> ZY Zhou Wrote:
>
>>> It would be prohibitively expensive to be constantly validating strings.
>>
>> No, it would be much much cheaper, since there are only 2 cases the validating is
>> needed
>>
>> 1) when you convert char[] to ustring, in this case, the validating is necessary
>> 2) when you use split on ustring. but since ustring is guaranteed to be valid, the
>> validating only need to check 2 bytes of data (start and end), much cheaper than
>> validating the entire string.
>>
>> after that, all the other functions will no longer need to worry about invalid
>> utf8 string, as long as the parameter is ustring, no validating is needed.
>
> Honestly, so far the only time I had problems processing utf has been when someone stuck a stupid BOM[1] at the beginning of the file.
>
> Question, what is so hard about inserting validity checks[2] into your code just as you have described? This way you don't have to put them in contracts of all your functions.
>
> 1. http://en.wikipedia.org/wiki/Byte_order_mark
> 2. http://digitalmars.com/d/2.0/phobos/std_utf.html#validate

Anyway, std.utf.validate just tries to *decode*! (See the function source below;
decode itself throws an exception when it hits invalid UTF.)
So it would be at least as efficient to just decode once at the start (which in
D amounts to converting to dstring) and work with strings of code points
throughout your process. In addition to validating only once, further operations
can be much faster whenever you need to operate at the level of code points.

void validate(S)(in S s) if (isSomeString!S)
{
     immutable len = s.length;
     for (size_t i = 0; i < len; )
     {
         // decode advances i past one code point; it throws on invalid input
         decode(s, i);
     }
}
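
To make the "decode once at the start" idea concrete, here is a minimal sketch. The helper name toCodePoints is hypothetical (it is not part of Phobos); it only assumes that std.utf.decode advances the index it is given and throws UTFException on invalid input, as validate above relies on.

```d
import std.utf : decode, UTFException;

// Hypothetical helper (not part of Phobos): decode UTF-8 into code points
// once, so that validation happens as a side effect of decoding.
dstring toCodePoints(const(char)[] s)
{
    dchar[] points;
    points.reserve(s.length); // at most one code point per code unit
    for (size_t i = 0; i < s.length; )
        points ~= decode(s, i); // advances i; throws UTFException on invalid UTF-8
    return points.idup;
}

void main()
{
    // "naïve": 6 UTF-8 code units, 5 code points.
    dstring text = toCodePoints("na\u00EFve");
    assert(text.length == 5);
    assert(text[2] == '\u00EF'); // direct indexing by code point

    // An invalid sequence is rejected at the conversion boundary.
    char[] bad = [167, 133, 175];
    bool rejected = false;
    try
    {
        toCodePoints(bad);
    }
    catch (UTFException e)
    {
        rejected = true;
    }
    assert(rejected);
}
```

After the conversion, everything downstream takes a dstring and needs no further validity checks. The price is memory: four bytes per code point.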

Denis
-- 
_________________
vita es estrany
spir.wikidot.com


