Ceci n'est pas une char
Walter Bright
newshound at digitalmars.com
Thu Apr 6 22:53:21 PDT 2006
Sean Kelly wrote:
> Walter Bright wrote:
>> Sean Kelly wrote:
>>> Walter Bright wrote:
>>>> Thomas Kuehne wrote:
>>>>> Challenge:
>>>>> Provide a D implementation that first converts to UTF-32 and has
>>>>> shorter runtime than the code below:
>>>>
>>>> I don't know about that, but the code below isn't optimal <g>.
>>>> Replace the sar's with a lookup of the 'stride' of the UTF-8
>>>> character (see std.utf.UTF8stride[]). An implementation is
>>>> std.utf.toUTFindex().
>>>
>>> I've been wondering about this. Will 'stride' be accurate for any
>>> arbitrary string position or input data? I would assume so, but
>>> don't know enough about how UTF-8 is structured to be sure.
>>
>> UTF8stride[] will give 0xFF for values that are not at the beginning
>> of a valid UTF-8 sequence.
>
> Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make
> sure an odd combination of bytes couldn't be mistaken as a valid
> character, as stride seems the best fit for an "is valid UTF-8 char"
> type function. I've been giving the 0xFF choice some thought however,
> and while it would avoid stalling loops, the alternative is an access
> violation when evaluating short strings and just weird behavior for
> large strings. If I had to track down a program bug I'd almost prefer
> it be a tight endless loop.

Take a look at std.utf.toUTFindex(), which takes care of the problem (by
throwing an exception).
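
For readers following along, here is a minimal sketch of the idea discussed above: look up the stride from the lead byte, treat 0xFF as "not the start of a sequence", and throw instead of adding that sentinel to the index. It is an illustration only, not the Phobos source. The names strideOf and indexOfChar are hypothetical, the stride is computed with comparisons rather than read from the UTF8stride[] table, and this version treats lead bytes 0xF8 and above as invalid, which may differ from the actual table. It inspects lead bytes only and does not validate continuation bytes.

```d
// Hypothetical helper (not Phobos code): length in bytes of the UTF-8
// sequence whose first byte is b, or 0xFF if b cannot start a sequence.
ubyte strideOf(ubyte b) pure nothrow @nogc @safe
{
    if (b < 0x80) return 1;      // 0xxxxxxx: single-byte (ASCII)
    if (b < 0xC0) return 0xFF;   // 10xxxxxx: continuation byte, not a start
    if (b < 0xE0) return 2;      // 110xxxxx
    if (b < 0xF0) return 3;      // 1110xxxx
    if (b < 0xF8) return 4;      // 11110xxx
    return 0xFF;                 // assumption: longer forms are rejected here
}

// Hypothetical helper: byte index of the n-th character of s.
// Throws rather than stepping by the 0xFF sentinel, which would jump
// 255 bytes ahead and run off the end of a short string.
size_t indexOfChar(const(char)[] s, size_t n)
{
    size_t i = 0;
    for (size_t k = 0; k < n; ++k)
    {
        if (i >= s.length)
            throw new Exception("character index past end of string");
        immutable step = strideOf(cast(ubyte) s[i]);
        if (step == 0xFF)
            throw new Exception("invalid UTF-8 sequence");
        if (i + step > s.length)
            throw new Exception("truncated UTF-8 sequence");
        i += step;
    }
    return i;
}

unittest
{
    string s = "a\u00E9\u20AC";  // 1-, 2-, and 3-byte sequences
    assert(s.length == 6);
    assert(indexOfChar(s, 0) == 0);
    assert(indexOfChar(s, 1) == 1);
    assert(indexOfChar(s, 2) == 3);
    assert(indexOfChar(s, 3) == 6);

    // A slice that starts in the middle of a sequence is rejected.
    bool threw = false;
    try
    {
        auto unused = indexOfChar(s[2 .. $], 1);
    }
    catch (Exception e)
    {
        threw = true;
    }
    assert(threw);
}
```

Assuming the file is saved as stride_sketch.d (a made-up name), compiling with -unittest -main runs the checks. Throwing trades the tight endless loop Sean mentions for an error raised at the offending byte.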