toUTFz and WinAPI GetTextExtentPoint32W

Wed Sep 21 07:47:18 PDT 2011

> Actually, I don't buy it. I guess the reason it's faster is that it 
> doesn't check if the codepoint is valid.

Why should it? The documentation of std.utf.count says the string must 
be validly encoded, not that it will enforce that it is.
Checking that a string is valid every time you use it would be very 
expensive.

Actually, std.range.walkLength does not check that the sequence is valid 
either. See this test:

import std.range;
import std.stdio;

void main()
{
  string text = "aléluyah";
  char[] text2 = text.dup;
  text2[3] = 'a';             // overwrites the second byte of 'é'
  writeln(walkLength(text2)); // outputs: 8
  writeln(text2);             // outputs: al\303aluyah
}

There is probably a way to check that a UTF sequence is valid with an 
unrollable loop.
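
For reference, Phobos already has an explicit check that is separate from 
counting: std.utf.validate. Here is a minimal sketch on the same string as 
in the test above. It assumes a current Phobos, where the exception is 
spelled UTFException, and it is not the unrollable loop mentioned above, 
just the existing library call:

```d
import std.exception : assertThrown;
import std.utf : UTFException, validate;

void main()
{
    string text = "aléluyah";
    char[] text2 = text.dup;

    validate(text);   // well-formed: returns normally

    text2[3] = 'a';   // overwrite the continuation byte of 'é'

    // The lead byte 0xC3 is now followed by 'a', so validation fails.
    assertThrown!UTFException(validate(text2));
}
```

So the check exists, but it is a separate pass that the caller has to ask 
for.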

> In fact you can easily get ridiculous overflowed "negative" lengths. 
> Maybe we can put it here as unsafe and fast version though.

Unless I am mistaken, the minimum length myCount can return is 0 even 
if the string is invalid.

> Also check std.utf.stride to see if you can get it better, it's the 
> beast behind narrow string popFront.

stride does not do much checking. It can even return 5 or 6, which is 
not possible for a valid UTF-8 string!

The stride equivalent of myCount would be:

size_t myStride(char c)
{
    // optional:
    // if ( (((c>>7)+1)>>1) - (((c>>6)+1)>>2) + (((c>>3)+1)>>5))
    //     throw new UtfException("Not the start of the UTF-8 sequence");
    return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
}

I compared that to:

size_t utfLikeStride(char c)
{
  // optional:
  // immutable result = UTF8stride[c];
  // if (result == 0xFF)
  // throw new UtfException("Not the start of the UTF-8 sequence");
  // return result;
  return UTF8stride[c];
}

In myStride, the table lookup is replaced by some byte arithmetic.
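
To see why the arithmetic works: each term (((c>>k)+1)>>(8-k)) is 1 only 
when the top 8-k bits of c are all set, so the three terms switch on at 
0xC0, 0xE0 and 0xF0. A small sketch that checks this over all 256 byte 
values; refStride is a hypothetical helper written for this comparison, 
not something from Phobos:

```d
import std.stdio : writeln;

// myStride as given above, without the optional check.
size_t myStride(char c)
{
    return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
}

// Hypothetical reference using plain comparisons.
size_t refStride(char c)
{
    if (c < 0xC0) return 1; // ASCII, and continuation bytes (nothing is checked)
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    return 4;               // 0xF0..0xFF, including bytes that are never valid
}

void main()
{
    foreach (b; 0 .. 256)
    {
        immutable c = cast(char) b;
        assert(myStride(c) == refStride(c));
    }

    writeln(myStride('a'), " ", myStride(cast(char) 0xC3), " ",
            myStride(cast(char) 0xE2), " ", myStride(cast(char) 0xF0));
    // prints: 1 2 3 4
}
```

Note that without the optional check, a continuation byte simply gets a 
stride of 1.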

I also took only one char as input, since stride only looks at the i-th 
character. Actually, if the signature of stride is kept as 
"uint stride(char[] s, int i)", I did not find any difference with -O3.

Average times for "a lot" of calls:
(compiled with gcc, tested with -O3 and a homogeneous distribution of 
"valid" characters from '\x00'..'\x7F' and '\xC2'..'\xF4')

myStride no throws:      1112ms.
utfLikeStride no throws: 1433ms.
utfLikeStride throws:    1868ms. (the current implementation).
myStride throws:         8269ms.

Removing throws from utfLikeStride makes it about 25% faster.
Removing throws from myStride makes it about 7 times faster.

With -O0, myStride is less than 10% slower than utfLikeStride (no throws).

In conclusion, the fastest implementation is myStride without throws, 
and it beats the current implementation by about 40%. Changing 
std.utf.stride may be desirable. As I said earlier, the throws do 
not enforce the validity of the string. Really checking the validity of 
the string would cost much more, which may not be desirable, so why 
bother checking at all? A more serious benchmark could justify changing 
std.utf.stride. The improvement could be even better in real situations, 
because the lookup table of utfLikeStride may not always be at hand; 
this really depends on what the compiler does.

In any case, this may not improve walkLength by more than a few 
percent.
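
If someone wants to repeat the comparison in D, a rough harness could look 
like the sketch below. It is only a starting point under several 
assumptions: strideTable is a hypothetical stand-in for UTF8stride (whose 
contents are not shown here), the round count is arbitrary, and an 
optimizer may transform the loops enough to make the timings meaningless:

```d
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;

size_t myStride(char c)
{
    return 1 + (((c>>6)+1)>>2) + (((c>>5)+1)>>3) + (((c>>4)+1)>>4);
}

// Hypothetical table, built at compile time; stands in for UTF8stride.
immutable ubyte[256] strideTable = () {
    ubyte[256] t;
    foreach (i; 0 .. 256)
        t[i] = cast(ubyte)(i < 0xC0 ? 1 : i < 0xE0 ? 2 : i < 0xF0 ? 3 : 4);
    return t;
}();

void main()
{
    // Lead bytes from the ranges used above: 0x00..0x7F and 0xC2..0xF4.
    char[] input;
    foreach (b; 0 .. 0x80) input ~= cast(char) b;
    foreach (b; 0xC2 .. 0xF5) input ~= cast(char) b;

    enum rounds = 100_000; // arbitrary
    size_t sinkA, sinkB;   // accumulate results so the calls are not dead code

    auto sw = StopWatch(AutoStart.yes);
    foreach (_; 0 .. rounds)
        foreach (c; input) sinkA += myStride(c);
    immutable tArith = sw.peek;

    sw.reset();
    foreach (_; 0 .. rounds)
        foreach (c; input) sinkB += strideTable[c];
    immutable tTable = sw.peek;

    assert(sinkA == sinkB); // both variants must agree on every byte
    writefln("arithmetic: %s, table: %s", tArith, tTable);
}
```

The throwing variants are left out here; they would need the same loop 
with the optional checks enabled.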

-- 
Christophe

now I'll go back to my real work...