Ceci n'est pas une char
Sean Kelly
sean at f4.ca
Thu Apr 6 21:53:20 PDT 2006
Walter Bright wrote:
> Sean Kelly wrote:
>> Walter Bright wrote:
>>> Thomas Kuehne wrote:
>>>> Challenge:
>>>> Provide a D implementation that first converts to UTF-32 and has
>>>> shorter runtime than the code below:
>>>
>>> I don't know about that, but the code below isn't optimal <g>.
>>> Replace the sar's with a lookup of the 'stride' of the UTF-8
>>> character (see std.utf.UTF8stride[]). An implementation is
>>> std.utf.toUTFindex().
>>
>> I've been wondering about this. Will 'stride' be accurate for any
>> arbitrary string position or input data? I would assume so, but don't
>> know enough about how UTF-8 is structured to be sure.
>
> UTF8stride[] will give 0xFF for values that are not at the beginning of
> a valid UTF-8 sequence.
Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make
sure an odd combination of bytes couldn't be mistaken for a valid
character, since stride seems the best fit for an "is valid UTF-8 char"
kind of function. I've been giving the choice of 0xFF some thought,
however. While it would avoid stalling loops, the result is an access
violation when evaluating short strings and just weird behavior for
large strings. If I had to track down a program bug, I'd almost prefer
it be a tight endless loop.
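
To make the concern concrete, here is a minimal sketch in C. It is not
the actual std.utf code. The helper names (utf8_stride, naive_next,
count_checked) and the choice to accept only 1- to 4-byte lead bytes are
assumptions made for this example, and the real UTF8stride[] table may
differ. It shows that a loop which blindly adds a 0xFF stride jumps 255
bytes ahead, while a loop that checks the stride first can stop cleanly.
Note that a stride lookup inspects only the first byte, so on its own it
does not prove that the following bytes form a valid sequence.

```c
#include <stddef.h>

#define BAD_STRIDE 0xFFu  /* marker for "not the start of a sequence" */

/* Hypothetical stand-in for a stride table lookup: the number of bytes
   in the sequence that begins with lead byte b, or BAD_STRIDE. */
static unsigned utf8_stride(unsigned char b)
{
    if (b < 0x80) return 1;           /* ASCII */
    if (b < 0xC0) return BAD_STRIDE;  /* continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return BAD_STRIDE;                /* 0xF8..0xFF: rejected in this sketch */
}

/* Index an unchecked loop would move to after one step from position i.
   Only computes the index; it never reads past s[i]. */
static size_t naive_next(const unsigned char *s, size_t i)
{
    return i + utf8_stride(s[i]);
}

/* Count sequences by lead byte only. Returns -1 on a bad lead byte or
   on a sequence that would run past the end of the buffer. */
static long count_checked(const unsigned char *s, size_t len)
{
    long n = 0;
    size_t i = 0;
    while (i < len) {
        unsigned st = utf8_stride(s[i]);
        if (st == BAD_STRIDE || st > len - i)
            return -1;
        i += st;
        n++;
    }
    return n;
}
```

With the two-byte buffer { 0xA9, 'b' }, which starts in the middle of a
sequence, naive_next(buf, 0) yields index 255, far past the end of the
buffer, whereas count_checked(buf, 2) returns -1 without reading out of
bounds. A stride of 0 in place of 0xFF would instead leave the unchecked
loop stuck at index 0.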
Sean