Ceci n'est pas une char

Thu Apr 6 22:53:21 PDT 2006

Sean Kelly wrote:
> Walter Bright wrote:
>> Sean Kelly wrote:
>>> Walter Bright wrote:
>>>> Thomas Kuehne wrote:
>>>>> Challenge:
>>>>> Provide a D implementation that first converts to UTF-32 and has
>>>>> shorter runtime than the code below:
>>>>
>>>> I don't know about that, but the code below isn't optimal <g>. 
>>>> Replace the sar's with a lookup of the 'stride' of the UTF-8 
>>>> character (see std.utf.UTF8stride[]). An implementation is 
>>>> std.utf.toUTFindex().
>>>
>>> I've been wondering about this.  Will 'stride' be accurate for any 
>>> arbitrary string position or input data?  I would assume so, but 
>>> don't know enough about how UTF-8 is structured to be sure.
>>
>> UTF8stride[] will give 0xFF for values that are not at the beginning 
>> of a valid UTF-8 sequence.
> 
> Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make 
> sure an odd combination of bytes couldn't be mistaken as a valid 
> character, as stride seems the best fit for an "is valid UTF-8 char" 
> type function.  I've been giving the 0xFF choice some thought however, 
> and while it would avoid stalling loops, the alternative is an access 
> violation when evaluating short strings and just weird behavior for 
> large strings.  If I had to track down a program bug I'd almost prefer 
> it be a tight endless loop.

Take a look at std.utf.toUTFindex(), which takes care of the problem (by 
throwing an exception).
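
To make the trade-off concrete, here is a minimal sketch in C of a stride lookup plus a checked index conversion. It is not the Phobos code: utf8_stride, utf8_index and STRIDE_INVALID are hypothetical names, the stride is computed from the lead byte rather than read from a table, and a -1 return stands in for the exception. It also assumes the modern 4-byte limit on UTF-8 sequences and does not reject overlong encodings.

```c
#include <stddef.h>
#include <stdint.h>

/* Marker for "this byte cannot start a UTF-8 sequence" (assumed name). */
#define STRIDE_INVALID 0xFF

/* Hypothetical stand-in for a stride table lookup: the sequence length
   implied by a lead byte, or STRIDE_INVALID for a continuation byte or
   a byte that never starts a sequence. */
static uint8_t utf8_stride(unsigned char b)
{
    if (b < 0x80) return 1;               /* 0xxxxxxx: ASCII */
    if (b < 0xC0) return STRIDE_INVALID;  /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;               /* 110xxxxx */
    if (b < 0xF0) return 3;               /* 1110xxxx */
    if (b < 0xF8) return 4;               /* 11110xxx */
    return STRIDE_INVALID;                /* 0xF8..0xFF: not a lead byte here */
}

/* Convert a character index n into a byte index, checking every step.
   Returns 0 and stores the byte index in *out on success. Returns -1 if
   a lead byte is invalid or a sequence would run past the end of the
   buffer, instead of adding 0xFF to the index and reading out of bounds. */
static int utf8_index(const unsigned char *s, size_t len, size_t n, size_t *out)
{
    size_t i = 0;
    while (n > 0) {
        if (i >= len) return -1;          /* fewer than n characters */
        uint8_t st = utf8_stride(s[i]);
        if (st == STRIDE_INVALID || st > len - i) return -1;
        i += st;
        --n;
    }
    *out = i;
    return 0;
}
```

The point of the check is that the 0xFF marker is never used as an increment: a caller that blindly did i += stride would jump 255 bytes ahead on bad input, which is the access violation or "weird behavior" described above.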