Reducing the cost of autodecoding

Tue Oct 25 09:57:34 PDT 2016

On 10/12/16 3:53 PM, Andrei Alexandrescu wrote:
> So we've had a good run with making popFront smaller. In ASCII
> microbenchmarks with ldc, the speed is indistinguishable from s = s[1 ..
> $]. Smaller functions make sure that the impact on instruction cache in
> larger applications is not high.
[snip]
> Your mission, should you choose to accept it, is to define a combination
> front/popFront that reduces the gap.
>

I'm sooo late to the party. Still a neat trick with UTF lookup table is 
to cram it into a single word with bit packing. Since we only ever 
consider top 5 bits of leading byte of which the top one can be safely 
discarded with a quick < 0x80 check. This leaves us at 4bits -> 16 entry 
table, fits nicely in a 64-bit word with 4bits per entry.

// construct in-word lookup table
ulong bitTable()
{
         import std.utf;
	ulong table;
	for(size_t i = 128; i<256; i+= 8)
	{
		char[1] c = [ cast(char)i ];
		try{
			ulong s = std.utf.stride(c[0..1], 0);
			table |= s<<(4*((i-128)>>3));
		}
		catch(Exception){
			// keep zeros
		}
	}
	return table;
}

uint myStride()(char c)
{
	enum table = bitTable;
	return (table >> ((c & 0x7f) >> 3)) & 0xF;
}

void myPopFront()(ref char[] s)
{
	char c = s[0];
	if(c < 0x80)
		s = s[1..$];
	else {
		uint step = myStride(c);
		if(!step) throw new Exception("blah");
		s = s[step..$];
	}
}

Accordingly myFront could be modified to use the same technique.
Can't check perf with LDC at the moment sadly.

>
> Andrei