Why do you decode? (Seriously)

Thu Aug 2 13:07:40 PDT 2012

On 02-Aug-12 22:42, Andrei Alexandrescu wrote:
> On 8/2/12 12:47 PM, Dmitry Olshansky wrote:
>> char[] input = ...;
>> size_t idx = ...;
>> size_t len = stride(input, idx);
>> uint u8word = *cast(uint*)(input.ptr+idx);
>> //u8word contains full UTF-8 sequence
>> u8word &= cast(uint)((1UL<<(8*len)) - 1); //mask out extra bytes (1UL so that len == 4 works)
>> //now u8word is a complete UTF-8 sequence in one uint
>>
>>
>> Barring its hacky nature, I claim that the number obtained is in no way
>> worse than a distilled codepoint. It is a number that maps 1:1 to any
>> codepoint in range [0..0x10FFFF]. Let me call it UTF-8 word.
>
> I really like this idea of a "minimally decoded" character that's
> isomorphic with UTF-32 but much cheaper to extract. (We could use ulong
> if they add 5- and 6-byte characters).

The good news is that there *used to be* 5- and 6-byte sequences. Now the
maximum is 4 bytes. That's probably why this technique has not been widely
deployed yet. I don't think such a decision is easy to roll back.
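
With at most 4 bytes per sequence, the whole thing fits in a uint. Here is a
minimal sketch of a more cautious variant of the idea, assuming valid UTF-8
input. The name utf8Word is hypothetical, made up for this example. It builds
the word byte by byte (first byte in the low bits), so it never reads past the
end of the array and does not depend on the machine's endianness. It is
probably slower than the single unaligned load, so take it as an illustration
only.

```d
import std.utf : stride;

/// Hypothetical helper: packs the UTF-8 sequence that starts at idx
/// into one uint, first byte in the lowest 8 bits.
/// Assumes input holds valid UTF-8 and idx points at a sequence start.
uint utf8Word(const(char)[] input, size_t idx)
{
    immutable len = stride(input, idx); // 1..4 for valid UTF-8
    uint word = 0;
    // Byte-by-byte assembly: bounds-checked, endianness-independent.
    foreach (i; 0 .. len)
        word |= cast(uint) input[idx + i] << (8 * i);
    return word;
}

unittest
{
    assert(utf8Word("a", 0) == 0x61);                 // 1 byte: 61
    assert(utf8Word("\u00E9", 0) == 0xA9C3);          // 2 bytes: C3 A9
    assert(utf8Word("\u20AC", 0) == 0xAC82E2);        // 3 bytes: E2 82 AC
    assert(utf8Word("\U0001F600", 0) == 0x80989FF0);  // 4 bytes: F0 9F 98 80
    assert(utf8Word("a\u00E9", 1) == 0xA9C3);         // sequence at a nonzero index
}
```

Distinct codepoints give distinct words because valid UTF-8 has exactly one
encoding per codepoint. The checks can be run with "dmd -unittest -main".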

> I wonder if people came up with
> this and gave it a name. If not, I'd say we call such a number an "olsh".
>
Cool, though it'd better be olsh8 so that we can use olsh16 for UTF-16 :)

-- 
Dmitry Olshansky