Ceci n'est pas une char

Thu Apr 6 23:07:49 PDT 2006

Sean Kelly wrote:
> Walter Bright wrote:
>> Sean Kelly wrote:
>>> Walter Bright wrote:
>>>> Thomas Kuehne wrote:
>>>>
>>>>> Challenge:
>>>>> Provide a D implementation that first converts to UTF-32 and has
>>>>> shorter runtime than the code below:
>>>>
>>>> I don't know about that, but the code below isn't optimal <g>. 
>>>> Replace the sar's with a lookup of the 'stride' of the UTF-8 
>>>> character (see std.utf.UTF8stride[]). An implementation is 
>>>> std.utf.toUTFindex().
>>>
>>> I've been wondering about this.  Will 'stride' be accurate for any 
>>> arbitrary string position or input data?  I would assume so, but 
>>> don't know enough about how UTF-8 is structured to be sure.
>>
>> UTF8stride[] will give 0xFF for values that are not at the beginning 
>> of a valid UTF-8 sequence.
> 
> Thanks.  I saw the 0xFF entries in UTF8stride and mostly wanted to make 
> sure an odd combination of bytes couldn't be mistaken as a valid 
> character, 

No fear. Any UTF-8 byte that belongs to a multi-byte sequence (a 
"stride") is clearly marked as such in its most significant bits. Thus, 
you can enter a byte[] at any place and immediately know whether the 
byte is (1) a single-byte character (0xxxxxxx), (2) the first byte of a 
stride (11xxxxxx), or (3) a continuation byte within a stride 
(10xxxxxx), without looking at any of the other bytes.
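
To make the three cases concrete, here is a minimal sketch in C. The 
names utf8_kind and utf8_classify are made up for illustration and are 
not part of std.utf. It inspects only the marker bits, so it says where 
a byte sits in a sequence, not whether the whole sequence is valid 
(0xC0, 0xC1 and 0xF5..0xFF carry lead-byte marker bits but never occur 
in well-formed UTF-8).

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum { UTF8_SINGLE, UTF8_LEAD, UTF8_CONTINUATION } utf8_kind;

/* Classify one byte from its top two bits alone.
   This checks position markers only, not full UTF-8 validity. */
static utf8_kind utf8_classify(uint8_t b)
{
    if ((b & 0x80) == 0x00) return UTF8_SINGLE;        /* 0xxxxxxx */
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;  /* 10xxxxxx */
    return UTF8_LEAD;                                  /* 11xxxxxx */
}
```

For example, in the two-byte encoding of U+00E9 (0xC3 0xA9), 0xC3 
classifies as a lead byte and 0xA9 as a continuation byte.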

> as stride seems the best fit for an "is valid UTF-8 char" 
> type function.  I've been giving the 0xFF choice some thought however, 
> and while it would avoid stalling loops, the alternative is an access 
> violation when evaluating short strings and just weird behavior for 
> large strings.  If I had to track down a program bug I'd almost prefer 
> it be a tight endless loop.

UTF-8 is designed precisely so that it can be processed in very tight 
ASM loops that don't need a lookup table.
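
As a rough illustration of a table-free loop, here is a sketch in C 
rather than ASM. The names utf8_count and utf8_next are hypothetical, 
and neither function validates its input. Both rely only on the fact 
that continuation bytes match 10xxxxxx. utf8_next always moves forward 
by at least one byte and never reads past len, so on malformed input it 
neither stalls nor runs off the end of a short string. Whether this 
beats a stride table on a given CPU is something I have not measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Count characters: every byte that is NOT a continuation byte
   (10xxxxxx) starts a new character. No lookup table needed. */
static size_t utf8_count(const uint8_t *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        n += (s[i] & 0xC0) != 0x80;
    return n;
}

/* Return the index of the next character start after index i.
   Advances at least one byte (if i < len) and stops at len. */
static size_t utf8_next(const uint8_t *s, size_t len, size_t i)
{
    if (i < len)
        i++;
    while (i < len && (s[i] & 0xC0) == 0x80)
        i++;
    return i;
}
```

Starting utf8_next in the middle of a stride simply skips the remaining 
continuation bytes, which is the resynchronisation property described 
above.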