Why do you decode? (Seriously)

Thu Aug 2 09:47:01 PDT 2012

Intrigued by a familiar topic in std.lexer. I've split it out.
It's not as easy a question as it seems.

Before you start the usual "because a codepoint has semantic meaning, 
a codeunit is just bytes, ya-da, ya-da", let me explain something.

A codepoint is indeed a complete piece of symbolic information, represented 
as a number in the [0, 0x10FFFF] range.
A few such pieces make up a user-perceived character; not that many people 
bother with this, as the "few" awfully often equals 1.
So far nothing new.

My point is - people decode UTF-8 to dchar only to be able to:
a) compare it directly with compiler's built-in '<someunicodechar>'
b) call one of isAlpha, isSpace, ... that take dchar

In other words:
	Decoding should be required only when one wants to store it in a new 
form. Otherwise if used for direct consumption it's pointless extra work.

Now take a look at this snippet:

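import std.utf : stride;

char[] input = ...;
size_t idx = ...;
size_t len = stride(input, idx);
//assumes little-endian and at least 4 readable bytes at input.ptr+idx
uint u8word = *cast(uint*)(input.ptr+idx);
//u8word contains full UTF-8 sequence (plus possibly trailing bytes)
//1UL, because 1<<32 would overflow a 32-bit int when len == 4
u8word &= cast(uint)((1UL<<(8*len)) - 1); //mask out extra bytes
//now u8word is a complete UTF-8 sequence in one uint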

Barring its hacky nature, I claim that the number obtained is in no way 
worse than a distilled codepoint. It is a number that maps 1:1 to any 
codepoint in the range [0..0x10FFFF]. Let me call it a UTF-8 word.
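
For anyone who wants to try this without the pointer cast, here is a minimal 
sketch that builds the same number byte by byte. The helper name utf8WordAt is 
made up for this example; it assumes valid UTF-8 input and puts the lead byte 
in the low 8 bits, which is what the cast gives on a little-endian machine.

```d
import std.utf : stride;

// Hypothetical helper (not in Phobos): packs the UTF-8 sequence that starts
// at idx into a uint, lead byte in the low 8 bits. It never reads past the
// end of the sequence, so it is safe near the end of the array.
uint utf8WordAt(const(char)[] input, size_t idx)
{
    immutable len = stride(input, idx); // sequence length in code units, 1 to 4
    uint word = 0;
    foreach (i; 0 .. len)
        word |= cast(uint) input[idx + i] << (8 * i);
    return word;
}

void main()
{
    string s = "aЫ€";
    assert(utf8WordAt(s, 0) == 0x61);     // 'a' is the single byte 61
    assert(utf8WordAt(s, 1) == 0xABD0);   // 'Ы' is D0 AB
    assert(utf8WordAt(s, 3) == 0xAC82E2); // '€' is E2 82 AC
}
```

This is slower than the single load, of course; it is only meant to pin down 
what the number looks like.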

So why do we use dchar and not a UTF-8 word, as it's as good as dchar and 
faster to obtain? The reasons, as above, are:
a) the compiler doesn't generate UTF-8 words in any built-in way (and thus 
there is no special type)
b) there are no functions that will do isAlpha on this beast.

Because of the above, using UTF-8 words currently requires some manual work 
instead of compiler magic.

As a reminder, I'm (no big wonder) doing the "Improve Unicode support 
for D" GSoC project, so I think I can easily help with point b. The 
solution is flexible enough to do the same with a UTF-16 word (not 
that it's relevant).
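
To make point b concrete, here is a rough sketch of a classification function 
that works on the UTF-8 word directly. Both names (toUtf8Word, isSpaceWord) are 
hypothetical, the whitespace list is a small hand-picked subset rather than the 
full Unicode set, and the constants are obtained by running std.utf.encode at 
compile time. It is not what the GSoC code does, just an illustration.

```d
import std.utf : encode;

// Hypothetical helper: packs the UTF-8 encoding of ch into a uint,
// lead byte in the low 8 bits. Meant to be usable through CTFE.
uint toUtf8Word(dchar ch)
{
    char[4] buf;
    immutable len = encode(buf, ch); // number of code units written
    uint w = 0;
    foreach (i; 0 .. len)
        w |= cast(uint) buf[i] << (8 * i);
    return w;
}

// Hypothetical: tests a UTF-8 word against a few whitespace characters
// only; a real isSpace would need the complete table.
bool isSpaceWord(uint w)
{
    enum space   = toUtf8Word(' ');
    enum tab     = toUtf8Word('\t');
    enum newline = toUtf8Word('\n');
    enum nbsp    = toUtf8Word('\u00A0');
    enum lineSep = toUtf8Word('\u2028');
    return w == space || w == tab || w == newline
        || w == nbsp || w == lineSep;
}

void main()
{
    assert(isSpaceWord(0x20));
    assert(isSpaceWord(0xA0C2));   // U+00A0 is C2 A0
    assert(isSpaceWord(0xA880E2)); // U+2028 is E2 80 A8
    assert(!isSpaceWord('a'));
}
```

No decoding happens on the hot path: the input word is compared against 
constants that were encoded once, at compile time.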

Now just throw in a template:

template utf8Word(dchar ch)
{
	enum utf8Word = genUtf8(ch);
}

//sketch; bytes are packed with the lead byte in the low 8 bits
uint genUtf8(dchar c)
{
    if (c <= 0x7F)
        return c;
    if (c <= 0x7FF)
        return 0xC0 | (c >> 6) | ((0x80 | (c & 0x3F)) << 8);
    if (c <= 0xFFFF)
    {
        //surrogates are not valid codepoints
        assert(!(0xD800 <= c && c <= 0xDFFF));
        return 0xE0 | (c >> 12) | ((0x80 | ((c >> 6) & 0x3F)) << 8)
            | ((0x80 | (c & 0x3F)) << 16);
    }
    if (c <= 0x10FFFF)
    {
        uint r = 0x80 | (c & 0x3F); //going backwards ;)
        r <<= 8;
        r |= 0x80 | ((c >> 6) & 0x3F);
        r <<= 8;
        r |= 0x80 | ((c >> 12) & 0x3F);
        r <<= 8;
        r |= 0xF0 | (c >> 18);
        return r;
    }
    assert(0, "codepoint out of range");
}

And zup-puff! Stuff like the following works:

switch(u8word)
{
case utf8Word!'Ы':
	...
}
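
For completeness, a self-contained version that can be pasted into a file and 
compiled. The encoder here is a compacted variant of the sketch (same byte 
layout, lead byte in the low 8 bits), and classify is a made-up function that 
exists only to exercise the switch.

```d
// Compact variant of the genUtf8 sketch; evaluated at compile time
// when used through the utf8Word template.
uint genUtf8(dchar c)
{
    assert(c <= 0x10FFFF && !(0xD800 <= c && c <= 0xDFFF));
    if (c <= 0x7F)
        return c;
    if (c <= 0x7FF)
        return 0xC0 | (c >> 6) | ((0x80 | (c & 0x3F)) << 8);
    if (c <= 0xFFFF)
        return 0xE0 | (c >> 12) | ((0x80 | ((c >> 6) & 0x3F)) << 8)
            | ((0x80 | (c & 0x3F)) << 16);
    return 0xF0 | (c >> 18) | ((0x80 | ((c >> 12) & 0x3F)) << 8)
        | ((0x80 | ((c >> 6) & 0x3F)) << 16)
        | ((0x80 | (c & 0x3F)) << 24);
}

template utf8Word(dchar ch)
{
    enum utf8Word = genUtf8(ch);
}

// Hypothetical classifier, only here to show the case labels at work.
string classify(uint u8word)
{
    switch (u8word)
    {
        case utf8Word!'Ы': return "Cyrillic yeru";
        case utf8Word!'€': return "euro sign";
        case utf8Word!'a': return "latin a";
        default:           return "other";
    }
}

void main()
{
    static assert(utf8Word!'Ы' == 0xABD0); // D0 AB, lead byte lowest
    assert(genUtf8(0x1F600) == 0x80989FF0); // F0 9F 98 80
    assert(classify(0xABD0) == "Cyrillic yeru");
    assert(classify(0xAC82E2) == "euro sign");
    assert(classify(0x62) == "other");
}
```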

And the only thing lacking is a special type, so that you can't mistake 
it for just some arbitrary number.
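
Until the language grows such a type, a thin struct gets most of the way 
there. What follows is only a sketch: the name Utf8Word and its members are my 
assumptions, nothing like it exists in Phobos, and it trusts the caller to 
hand it a well-formed sequence.

```d
// Hypothetical wrapper type for illustration only.
struct Utf8Word
{
    uint raw; // packed UTF-8 sequence, lead byte in the low 8 bits

    // Number of code units in the sequence, derived from the lead byte.
    size_t length() const
    {
        immutable lead = raw & 0xFF;
        if (lead < 0x80)           return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }
}

// No implicit conversion in either direction, so a plain number
// cannot be passed where a UTF-8 word is expected, or vice versa.
static assert(!is(uint : Utf8Word));
static assert(!is(Utf8Word : uint));

void main()
{
    auto yeru = Utf8Word(0xABD0);           // 'Ы' is D0 AB
    assert(yeru.length == 2);
    assert(yeru == Utf8Word(0xABD0));       // comparison is a plain uint compare
    assert(Utf8Word(0x61).length == 1);     // 'a'
    assert(Utf8Word(0xAC82E2).length == 3); // '€' is E2 82 AC
}
```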

-- 
Dmitry Olshansky