Why do you decode? (Seriously)
Dmitry Olshansky
dmitry.olsh at gmail.com
Thu Aug 2 09:47:01 PDT 2012
Intrigued by a familiar topic in the std.lexer discussion, I've split it out.
It's not as easy a question as it seems.
Before you start the usual "because a code point has semantic meaning,
a code unit is just bytes, ya-da, ya-da", let me explain something.
A code point is indeed a complete piece of symbolic information,
represented as a number in the [0, 0x10FFFF] range.
A few such pieces make up a user-perceived character, though not many
people bother with this, as the "few" is awfully often equal to 1.
So far nothing new.
My point is: people decode UTF-8 to dchar only to be able to:
a) compare it directly with the compiler's built-in '<someunicodechar>'
b) call one of isAlpha, isSpace, ... that take a dchar
In other words, decoding should be required only when one wants to store
the result in a new form. Otherwise, if it's used for direct consumption,
it's pointless extra work.
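To make the point concrete, here's a sketch in C (the helper name
find_utf8_pair and the inputs are made up for this post): matching a
specific character in a UTF-8 buffer needs no decoding at all, just a
raw byte comparison.

```c
#include <string.h>

/* Hypothetical helper, purely for illustration: find the byte offset of
 * the first occurrence of a 2-byte UTF-8 sequence in a NUL-terminated
 * buffer, or return -1. No decoding to a code point ever happens; the
 * raw UTF-8 bytes are compared in place. */
int find_utf8_pair(const char *input, const char *pattern)
{
    for (size_t i = 0; input[i] && input[i + 1]; i++)
        if (memcmp(input + i, pattern, 2) == 0)
            return (int)i;
    return -1;
}
```

For example, find_utf8_pair("a\xD0\xAB!", "\xD0\xAB") locates the
Cyrillic 'Ы' (bytes 0xD0 0xAB) at offset 1 without touching a single
code point.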
Now take a look at this snippet:
char[] input = ...;
size_t idx = ...;
size_t len = stride(input, idx); // length of the sequence at idx, in bytes
// assumes a little-endian host and at least 4 readable bytes at idx
uint u8word = *cast(uint*)(input.ptr + idx);
// u8word contains the full UTF-8 sequence (plus trailing garbage)
u8word &= len == 4 ? uint.max : (1u << (8 * len)) - 1; // mask out extra bytes
// now u8word is a complete UTF-8 sequence in one uint
Barring its hacky nature, I claim that the number obtained is in no way
worse than a distilled code point. It is a number that maps 1:1 to any
code point in the range [0, 0x10FFFF]. Let me call it a UTF-8 word.
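For the curious, the same read-and-mask trick can be sketched in C
(assuming a little-endian host and at least 4 readable bytes at the
pointer; the helper names are made up for this post):

```c
#include <stdint.h>
#include <string.h>

/* Sequence length in bytes, derived from the lead byte's top nibble.
 * 0x0-0x7 -> ASCII, 0xC-0xD -> 2 bytes, 0xE -> 3, 0xF -> 4;
 * 0x8-0xB are continuation bytes, invalid as a lead (0 here). */
int utf8_stride(unsigned char lead)
{
    static const int table[16] = {1,1,1,1,1,1,1,1,0,0,0,0,2,2,3,4};
    return table[lead >> 4];
}

/* Pack the whole UTF-8 sequence at p into one 32-bit word. */
uint32_t utf8_word(const char *p)
{
    uint32_t w;
    memcpy(&w, p, 4);                 /* grab 4 bytes at once */
    int len = utf8_stride((unsigned char)*p);
    if (len < 4)
        w &= (1u << (8 * len)) - 1;   /* mask out the extra bytes */
    return w;
}
```

For example, utf8_word("\xD0\xAB!!") yields 0xABD0: the two bytes of
'Ы' packed into one uint, with the trailing garbage masked away.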
So why do we use dchar and not the UTF-8 word, if it's as good as dchar
and faster to obtain? The reasons, as above, are:
a) the compiler doesn't generate UTF-8 words in any built-in way (and
thus there is no special type for them)
b) there are no functions that will do isAlpha on this beast.
Because of the above, this currently requires some manual work instead
of compiler magic.
As a reminder, I'm (no big wonder) doing the "Improve Unicode support
for D" GSoC project, and I think I can easily help with point b). To
that end the solution is flexible enough to do the same with a UTF-16
word (not that it's relevant).
Now just throw in a template:
template utf8Word(dchar ch)
{
    enum utf8Word = genUtf8(ch);
}
// sketch
uint genUtf8(dchar ch)
{
    if (ch <= 0x7F)
        return ch;
    if (ch <= 0x7FF)
        return 0xC0 | (ch >> 6) | ((0x80 | (ch & 0x3F)) << 8);
    if (ch <= 0xFFFF)
    {
        assert(!(0xD800 <= ch && ch <= 0xDFFF)); // no surrogates
        return 0xE0 | (ch >> 12) | ((0x80 | ((ch >> 6) & 0x3F)) << 8)
            | ((0x80 | (ch & 0x3F)) << 16);
    }
    if (ch <= 0x10FFFF)
    {
        uint r = 0x80 | (ch & 0x3F); // last byte first, going backwards ;)
        r <<= 8;
        r |= 0x80 | ((ch >> 6) & 0x3F);
        r <<= 8;
        r |= 0x80 | ((ch >> 12) & 0x3F);
        r <<= 8;
        r |= 0xF0 | (ch >> 18);
        return r;
    }
    assert(0, "invalid code point");
}
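As a sanity check, the same sketch can be rendered in C (illustrative
port, not the D template); the expected words are just the UTF-8 bytes
packed little-endian:

```c
#include <stdint.h>

/* Illustrative C port of the genUtf8 sketch: pack the UTF-8 encoding
 * of a code point (<= 0x10FFFF, not a surrogate) into one
 * little-endian 32-bit word. */
uint32_t gen_utf8(uint32_t c)
{
    if (c <= 0x7F)
        return c;
    if (c <= 0x7FF)
        return 0xC0 | (c >> 6) | ((0x80 | (c & 0x3F)) << 8);
    if (c <= 0xFFFF)
        return 0xE0 | (c >> 12)
             | ((0x80 | ((c >> 6) & 0x3F)) << 8)
             | ((0x80 | (c & 0x3F)) << 16);
    uint32_t r = 0x80 | (c & 0x3F);   /* last byte first, going backwards */
    r = (r << 8) | 0x80 | ((c >> 6) & 0x3F);
    r = (r << 8) | 0x80 | ((c >> 12) & 0x3F);
    r = (r << 8) | 0xF0 | (c >> 18);
    return r;
}
```

For instance, gen_utf8(0x42B) ('Ы') gives 0xABD0, i.e. the bytes
0xD0 0xAB packed little-endian, exactly what the read-and-mask snippet
earlier would load from a UTF-8 buffer.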
And presto! Stuff like the following works:
switch(u8word)
{
case utf8Word!'Ы':
...
}
And the only thing lacking is a special type, so that you can't mistake
it for just some arbitrary number.
--
Dmitry Olshansky