Ceci n'est pas une char
Georg Wrede
georg.wrede at nospam.org
Thu Apr 6 23:07:49 PDT 2006
Sean Kelly wrote:
> Walter Bright wrote:
>> Sean Kelly wrote:
>>> Walter Bright wrote:
>>>> Thomas Kuehne wrote:
>>>>
>>>>> Challenge:
>>>>> Provide a D implementation that first converts to UTF-32 and has
>>>>> shorter runtime than the code below:
>>>>
>>>> I don't know about that, but the code below isn't optimal <g>.
>>>> Replace the sar's with a lookup of the 'stride' of the UTF-8
>>>> character (see std.utf.UTF8stride[]). An implementation is
>>>> std.utf.toUTFindex().
>>>
>>> I've been wondering about this. Will 'stride' be accurate for any
>>> arbitrary string position or input data? I would assume so, but
>>> don't know enough about how UTF-8 is structured to be sure.
>>
>> UTF8stride[] will give 0xFF for values that are not at the beginning
>> of a valid UTF-8 sequence.
>
> Thanks. I saw the 0xFF entries in UTF8stride and mostly wanted to make
> sure an odd combination of bytes couldn't be mistaken as a valid
> character,
No fear. Every byte that belongs to a stride is clearly marked as such
in its most significant bits. Thus you can enter a byte[] at any place
and immediately know whether the byte is (1) a single-byte character
(0xxxxxxx), (2) the first byte of a stride (11xxxxxx), or (3) a byte
within a stride (10xxxxxx), without looking at any of the other bytes.
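
As a rough illustration of the three cases above, here is a minimal sketch in C. The names utf8_kind and utf8_classify are made up for this example and are not part of std.utf. It only inspects the top bits of one byte, so it says nothing about whether the surrounding sequence is complete or well-formed.

```c
/* Hypothetical names for illustration only; not taken from std.utf. */
enum utf8_kind {
    UTF8_SINGLE, /* 0xxxxxxx: a complete single-byte character   */
    UTF8_LEAD,   /* 11xxxxxx: first byte of a multi-byte stride  */
    UTF8_CONT    /* 10xxxxxx: byte inside a multi-byte stride    */
};

/* Classify one byte by its most significant bits alone.
 * Note: a few bytes in the 11xxxxxx range (0xC0, 0xC1, 0xF5..0xFF)
 * never occur in valid UTF-8; this sketch still reports them as
 * UTF8_LEAD because it looks only at the top two bits. */
static enum utf8_kind utf8_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00)
        return UTF8_SINGLE;
    if ((b & 0xC0) == 0x80)
        return UTF8_CONT;
    return UTF8_LEAD;
}
```

For example, the UTF-8 encoding of "é" is the two bytes 0xC3 0xA9: the first classifies as UTF8_LEAD and the second as UTF8_CONT, while an ASCII byte such as 0x61 classifies as UTF8_SINGLE.
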
> as stride seems the best fit for an "is valid UTF-8 char"
> type function. I've been giving the 0xFF choice some thought however,
> and while it would avoid stalling loops, the alternative is an access
> violation when evaluating short strings and just weird behavior for
> large strings. If I had to track down a program bug I'd almost prefer
> it be a tight endless loop.
UTF-8 is designed precisely so that it can be handled in very tight ASM
loops that don't need a lookup table.
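
To give an idea of what such a table-free loop can look like, here is a small sketch in C rather than ASM. The function name utf8_count is hypothetical, and the sketch assumes the buffer already holds valid UTF-8: it simply counts every byte that is not a continuation byte (10xxxxxx), which is the number of characters.

```c
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in a buffer
 * that is assumed to contain valid UTF-8. No lookup table is used;
 * each byte is tested with a single mask and compare. */
static size_t utf8_count(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        /* Continuation bytes look like 10xxxxxx; every other byte
         * starts a new character. */
        n += (s[i] & 0xC0) != 0x80;
    }
    return n;
}
```

Because the test is one mask and one compare per byte, it maps directly onto a handful of machine instructions. It does no validation, so malformed input gives a count without any error being reported.
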
More information about the Digitalmars-d
mailing list