Ceci n'est pas une char

Thu Apr 6 21:53:20 PDT 2006

Walter Bright wrote:
> Sean Kelly wrote:
>> Walter Bright wrote:
>>> Thomas Kuehne wrote:
>>>> Challenge:
>>>> Provide a D implementation that first converts to UTF-32 and has
>>>> shorter runtime than the code below:
>>>
>>> I don't know about that, but the code below isn't optimal <g>. 
>>> Replace the sar's with a lookup of the 'stride' of the UTF-8 
>>> character (see std.utf.UTF8stride[]). An implementation is 
>>> std.utf.toUTFindex().
>>
>> I've been wondering about this.  Will 'stride' be accurate for any 
>> arbitrary string position or input data?  I would assume so, but don't 
>> know enough about how UTF-8 is structured to be sure.
> 
> UTF8stride[] will give 0xFF for values that are not at the beginning of 
> a valid UTF-8 sequence.

Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make 
sure an odd combination of bytes couldn't be mistaken for a valid 
character, as stride seems the best fit for an "is valid UTF-8 char" 
type of function.  I've been giving the 0xFF choice some thought, 
however.  While it avoids stalling loops that advance by the stride, 
the alternative is an access violation when evaluating short strings 
and just weird behavior for large strings, since the index jumps ahead 
255 bytes.  If I had to track down a program bug I'd almost prefer it 
be a tight endless loop.
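
To make the concern concrete, here is a rough sketch in C rather than D, 
and not the actual std.utf source.  The names utf8_stride, 
unchecked_next and count_chars are made up for illustration, and the 
lookup assumes the modern four-byte limit, so it may not match the 
UTF8stride[] table entry for entry.  It only shows how a loop might 
treat 0xFF as "not a lead byte" instead of adding it to the index:

```c
#include <stddef.h>

/* Hypothetical stand-in for a stride table lookup: returns the sequence
   length implied by a lead byte, or 0xFF if the byte cannot start a
   sequence.  Assumes sequences of at most 4 bytes. */
static unsigned char utf8_stride(unsigned char b)
{
    if (b < 0x80) return 1;     /* 0xxxxxxx: ASCII */
    if (b < 0xC0) return 0xFF;  /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;     /* 110xxxxx */
    if (b < 0xF0) return 3;     /* 1110xxxx */
    if (b < 0xF8) return 4;     /* 11110xxx */
    return 0xFF;                /* 0xF8..0xFF: treated as invalid here */
}

/* The hazard: blindly adding the stride to the index.  On a bad byte
   this returns i + 255, which is past the end of any short string. */
static size_t unchecked_next(const unsigned char *s, size_t i)
{
    return i + utf8_stride(s[i]);
}

/* Counts sequences in s[0..len).  Returns -1 on a byte that cannot
   start a sequence, a truncated sequence, or a missing continuation
   byte.  This is a structural check only: it does not reject overlong
   encodings or surrogates. */
static long count_chars(const unsigned char *s, size_t len)
{
    size_t i = 0;
    long n = 0;

    while (i < len) {
        unsigned char st = utf8_stride(s[i]);
        size_t k;

        if (st == 0xFF || st > len - i)
            return -1;
        for (k = 1; k < st; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return -1;
        }
        i += st;
        n++;
    }
    return n;
}
```

With the check in place a bad byte is reported rather than skipped, so 
the loop neither stalls nor walks off the end of the buffer.  Whether 
an error return, an exception or something else is the right response 
is a separate design choice.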

Sean