Compiler benchmarks for an alternative to std.uni.asLowerCase.

Sun May 8 16:38:31 PDT 2016

I did a performance study on speeding up case conversion in 
std.uni.asLowerCase. Specifics for asLowerCase have been added to 
issue https://issues.dlang.org/show_bug.cgi?id=11229. Publishing 
here as some of the more general observations may be of wider 
interest.

Background - Case conversion can generally be sped up by checking 
if a character is ASCII before invoking a full Unicode case 
conversion. The single-character std.uni.toLower does this 
optimization, but std.uni.asLowerCase, which does a lazy conversion 
of a range, does not. For the test, I created a replacement for 
asLowerCase, called mapAsLowerCase below, which uses map and toLower. 
In essence, `map!(x => x.toLower)` or `map!(x => x.byDchar.toLower)`.
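
For readers who don't want to open the paste, here is a minimal sketch of the idea. It is not the code that was timed: the body of mapAsLowerCase and the helper toLowerAsciiFirst are illustrative assumptions, and only std.uni, std.ascii, std.utf and std.algorithm calls from Phobos are used.

```d
import std.algorithm.comparison : equal;
import std.algorithm.iteration : map;
import std.ascii : asciiToLower = toLower;
import std.uni : asLowerCase, uniToLower = toLower;
import std.utf : byDchar;

// Illustrative helper (not from Phobos): the ASCII-first pattern
// described above, written out explicitly.
dchar toLowerAsciiFirst(dchar c)
{
    if (c < 0x80)
        return asciiToLower(c); // ASCII: simple range check
    return uniToLower(c);       // everything else: full Unicode conversion
}

// Hypothetical stand-in for mapAsLowerCase: decode to dchar, then
// lazily apply the single-character toLower to each element.
auto mapAsLowerCase(Range)(Range r)
{
    return r.byDchar.map!(c => c.uniToLower);
}

void main()
{
    auto text = "Hello WORLD, Straße, ÄÖÜ";

    // Both variants agree with asLowerCase on this sample.
    assert(text.mapAsLowerCase.equal(text.asLowerCase));
    assert(text.byDchar.map!toLowerAsciiFirst.equal(text.asLowerCase));
}
```

One caveat on the sketch: a per-character map can only produce one output character per input character, so it is not a drop-in replacement for inputs whose lowercase form is more than one code point.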

Testing was with DMD (2.071) and LDC 1.0.0-beta1 (Phobos 2.070) 
on OSX. Compiler settings were `-release -O -boundscheck=off`. 
DMD was tested with and without `-inline`. LDC turns on inlining 
(-enable-inlining=1) by default with -O, but DMD does not. Texts 
tried were in Japanese, Chinese, Finnish, English, German, and 
Spanish. Timing was done both including and excluding decoding 
from utf-8 to dchar.
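
A rough sketch of how the two timing modes can be set up, for anyone wanting to reproduce this. It is not the actual timing program: the consume helper and the sample string are placeholders, and it uses std.datetime.stopwatch from current Phobos rather than the std.datetime.StopWatch that shipped with 2.070/2.071.

```d
import std.algorithm.iteration : map;
import std.conv : to;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writefln;
import std.uni : asLowerCase, toLower;
import std.utf : byDchar;

// Hypothetical helper: walk the whole lazy range so the conversion
// actually runs, and return a checksum so the work is not optimized away.
size_t consume(R)(R r)
{
    size_t sum = 0;
    foreach (dchar c; r)
        sum += c;
    return sum;
}

void main()
{
    // Placeholder input; the real runs used full book texts.
    string text = "Some Mixed-Case SAMPLE Text, ÄÖÜ, 日本語";
    dstring decoded = text.to!dstring; // decoded up front, outside the timer

    auto sw = StopWatch(AutoStart.yes);

    // Including decoding: start from the UTF-8 string.
    auto a1 = consume(text.asLowerCase);
    auto tAsLower = sw.peek;
    sw.reset();
    auto a2 = consume(text.byDchar.map!(c => c.toLower));
    auto tMap = sw.peek;
    sw.reset();

    // Excluding decoding: start from the pre-decoded dchar array.
    auto b1 = consume(decoded.asLowerCase);
    auto tAsLowerNoDecode = sw.peek;
    sw.reset();
    auto b2 = consume(decoded.map!(c => c.toLower));
    auto tMapNoDecode = sw.peek;

    assert(a1 == a2 && b1 == b2 && a1 == b1);
    writefln("with decoding:    asLowerCase %s, map/toLower %s", tAsLower, tMap);
    writefln("without decoding: asLowerCase %s, map/toLower %s",
             tAsLowerNoDecode, tMapNoDecode);
}
```

A single pass over a short string like this is far too small to give meaningful numbers; a real run needs large inputs and repeated iterations.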

Performance delta including decoding to dchar:
   | Language group  | Pct Ascii | LDC gain   | DMD gain  | DMD no inline |
   |-----------------+-----------+------------+-----------+---------------|
   | Latin           |    95-99% | 64% (2.7x) | 93% (14x) | 48% (1.9x)    |
   | Asian (Jpn/Chn) |  2.4-3.7% | 36% (1.6x) | 80% (5x)  | -1%           |

Performance delta excluding decoding to dchar:
   | Language group  | Pct Ascii | LDC gain   | DMD gain  | DMD no inline |
   |-----------------+-----------+------------+-----------+---------------|
   | Latin           |    95-99% | 60% (2.5x) | 95% (20x) | 60% (2.5x)    |
   | Asian (Jpn/Chn) |  2.4-3.7% | 50% (2x)   | 95% (20x) | -2%           |
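
A note on reading the tables. The post does not define "gain", so this is an assumption: if the percentage $g$ is taken as the fractional reduction in run time of mapAsLowerCase relative to asLowerCase, then the multiplier in parentheses is consistent with

$$
g = 1 - \frac{t_{\text{mapAsLowerCase}}}{t_{\text{asLowerCase}}},
\qquad
\text{speedup} = \frac{t_{\text{asLowerCase}}}{t_{\text{mapAsLowerCase}}} = \frac{1}{1 - g}
$$

For example, $g = 0.60$ gives $1/0.40 = 2.5$x, $g = 0.80$ gives $5$x, and $g = 0.95$ gives $20$x. A negative gain such as $-1\%$ means the replacement was marginally slower.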

Observations:
* mapAsLowerCase was faster than asLowerCase across the board. 
That it was better for Asian texts suggests the improvement 
involved more than just the ASCII check optimization.
* Performance varied widely between compilers, and for DMD, 
whether the -inline flag was included. The performance delta 
between asLowerCase and the mapAsLowerCase replacement was very 
dependent on these choices. Similarly, the delta between 
inclusion and exclusion of auto-decoding was highly dependent on 
these selections.
* DMD improvement by using -inline: 30% for asLowerCase (1.5x), 
90% for mapAsLowerCase (10x).
* DMD (-inline) vs LDC: For asLowerCase, LDC was 65-85% faster. 
For mapAsLowerCase, DMD was 10-40% faster. There were changes to 
the map implementation in 2.071, so these were not equivalent, 
but still, it's interesting that DMD beat LDC in this case.

Thoughts:
* The large variances between compiler settings call for extra 
diligence when performance tuning at the source code level, 
especially for code intended for multiple compilers.
* Perhaps DMD -O should also turn on -inline. This would present 
a better performance picture to new users. It's also helpful when 
the different compilers agree on the rough meaning of compiler 
switches.
* Auto-decoding is an oft-discussed concern. It doesn't show up 
in the tables above, but the data I looked at suggests the 
cost/penalty may vary quite a bit depending on usage context and 
compiler/settings. I wasn't studying this aspect explicitly. It may 
be worth its own analysis.

Other details:
* Code for mapAsLowerCase and the timing program is at: 
https://dpaste.dzfl.pl/a0e2fa1c71fd
* Texts used for timing were books in several languages from the 
Project Gutenberg site (http://www.gutenberg.org/), with 
boilerplate text removed.

--Jon