Why do you decode ? (Seriously)

Thu Aug 2 11:42:07 PDT 2012

On 8/2/12 12:47 PM, Dmitry Olshansky wrote:
> char[] input = ...;
> size_t idx = ...;
> size_t len = stride(input, idx);
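> uint u8word = *cast(uint*)(input.ptr+idx); //assumes little-endian and 4 readable bytes at idx
> //u8word contains full UTF-8 sequence (plus any trailing bytes)
> u8word &= cast(uint)((1UL<<(8*len)) -1); //mask out extra bytes; 1UL keeps the shift defined when len == 4
> //now u8word is a complete UTF-8 sequence in one uint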
>
>
> Barring its hacky nature, I claim that the number obtained is in no way
> worse than a distilled code point. It is a number that maps 1:1 to any
> code point in range [0..0x10FFFF]. Let me call it UTF-8 word.

I like this idea of a "minimally decoded" character a lot: it is
isomorphic with UTF-32 but much cheaper to extract. (We could use ulong
if they add 5- and 6-byte characters.) I wonder if people have come up
with this before and given it a name. If not, I'd say we call such a
number an "olsh".
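
For concreteness, here is a minimal sketch of how such a number could be extracted. The function name utf8Word is hypothetical, not an existing Phobos symbol. The sketch assumes valid UTF-8 input and builds the word one byte at a time, so unlike the pointer cast in the quoted snippet it reads only the bytes of the sequence and does not depend on the machine's byte order. It is likely slower than the single 4-byte load.

```d
import std.utf : stride;

/// Hypothetical helper (the name is an assumption, not a Phobos API):
/// packs the UTF-8 sequence starting at input[idx] into one uint,
/// with the lead byte in the low 8 bits.
uint utf8Word(const(char)[] input, size_t idx)
{
    // stride returns the length in bytes of the sequence at idx
    // (1 to 4 for valid UTF-8); it throws on an invalid lead byte.
    immutable len = stride(input, idx);

    uint word = 0;
    // Assemble byte by byte: only the bytes of this sequence are read,
    // and the layout matches a little-endian load masked to len bytes.
    foreach (i; 0 .. len)
        word |= cast(uint) input[idx + i] << (8 * i);
    return word;
}
```

Because valid UTF-8 gives every code point exactly one byte sequence, two such words are equal exactly when the code points are equal, which is what makes the value usable in place of a decoded dchar for comparisons and table lookups.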

Andrei