Sometimes 100 lines of code can fix your ctfe perf

Tue Aug 24 16:06:10 UTC 2021

On Tuesday, 24 August 2021 at 14:07:45 UTC, Stefan Koch wrote:
> Good Morning,
>
> I just saw the need to concatenate a long-ish string at ctfe.
>
> TLDR; writing your own specialized chunking appender is way 
> faster than naive CTFE
> using the phobos provided appender seems to be a bad idea ;) at 
> least at CTFE.

I should have tested the code more thoroughly!
The reason why the string builder had so a flat performance curve 
was A BUG.
It just did not concatenate anything to the buffer after the 
initial was full.

This issue is now fixed and had the pleasant side effect of 
removing the `goto`s from the code

the fixed version looks like this:
```d
struct StringBuilder
{
     enum CHUNK_SIZE = 4096;
     char[CHUNK_SIZE] fixedBuffer;
     size_t fixedBufferUsed;

     char[CHUNK_SIZE][] chunks;
     size_t[] chunk_lengths;

     ushort last_chunk_used;

     void addToNewChunk(const(char)[] s)
     {
             if (s.length <= CHUNK_SIZE)
             {
                 chunks.length = last_chunk_used + 1;
                 chunk_lengths.length = last_chunk_used + 1;
                 last_chunk_used += 1;
                 appendToLastChunk(s);
             }
             else
             {
                 assert(0,
                         "string to be appended is longer than "
                         ~ CHUNK_SIZE.stringof ~ " this is not 
supported currently");
             }
     }

     void appendToLastChunk(const(char)[] s)
     {
             auto chunkBegin = chunk_lengths[last_chunk_used - 1];
             auto chunkEnd = chunkBegin + s.length;
             if (chunkEnd < CHUNK_SIZE)
             {
                 chunks[last_chunk_used - 1][chunkBegin .. 
chunkEnd] = s;
                 chunk_lengths[last_chunk_used - 1] = chunkEnd;
             }
             else
                 addToNewChunk(s);

     }

     void appendString(const(char)[] s)
     {

         // can we used our preallocated fixed buffer?
         if (!last_chunk_used)
         {
             auto fixedBufferEnd = fixedBufferUsed + s.length;
             if (fixedBufferEnd < CHUNK_SIZE)
             {
                 fixedBuffer[fixedBufferUsed .. fixedBufferEnd] = 
s;
                 fixedBufferUsed = fixedBufferEnd;
             }
             else
             {
                 addToNewChunk(s);
             }
         }
         else
             appendToLastChunk(s);
     }

     string releaseBuffer()
     {
         size_t result_length = fixedBufferUsed;
         foreach (i; 0 .. last_chunk_used)
         {
             result_length += chunk_lengths[i];
         }

         char[] result;
         result.length = result_length;
         result[0 .. fixedBufferUsed] = fixedBuffer[0 .. 
fixedBufferUsed];
         size_t result_begin_pos = fixedBufferUsed;

         foreach (i; 0 .. last_chunk_used - 1)
         {
             const chunk_used = chunk_lengths[i];
             result[result_begin_pos .. result_begin_pos + 
chunk_used] = chunks[i][0 .. chunk_used];
             result_begin_pos += chunk_used;
         }

         return cast(string) result;
     }

}
```

And now the resulting table is somewhat different as well:

```
|   N   |  stringBuilder  | naive   | Appender                    
|
|-------|-----------------|---------|-----------------------------|
| 32    | 27.8            | 27.8    | 53                          
|
| 128   | 29.8            | 27.2    | 54.1                        
|
| 512   | 33.7            | 29.8    | 121                         
|
| 1024  | 39.8            | 34.4    | 313.9                       
|
| 2048  | 52.0            | 52.9    | 1047                        
|
| 4096  | 77.0            | 123.9   | 4350                        
|
| 8192  | 127.3           | 378.2   | 18146                       
|
| 10000 | 151.1           | 548.6   | 26796                       
|
| 16384 | 233.7           | 1386    | inf (it ran out of memory)  
|
```

I inserted another row which is the N the appender can barely do 
before I run out of memory.
or memeory which is special memory to be used in memes.

We can see now the string build behaves linear rather than almost 
constant which is much more what I would have expected.

Now that were results for CTFE;
Let's now look at how it looks at regular runtime.
We compile with `ldc -O3` we also only care for the table 
starting at 10000
we also add N == 0 to account for compile time as we are 
measuring compiletime and runtime together
I will write out only the difference from N0

```
|   N   |  stringBuilder  | naive   | Appender |
|-------|-----------------|---------|----------|
| 0     |-147.0           |-100.0   |-228.7    |
| 16384 | 12.5            | 6.2     | 3.7      |
| 32768 | 23.5            | 7.9     | 5.0      |
| 65536 | 19.6            | 7.6     | 5.6      |
| 131072| 37.0            | 15.2    | 14.7     |
| 262144| 44.4            | 24.3    | 16.3     |
| 524288| 108.8           | 30.6    | 21.1     |
|1048576| 236.5           | 55.4    | 43.7     |
|2097152| 369.4           | 101.1   | 72.3     |
|4194304| 823.2           | 198.0   | 150.1    |
```

We can see that the runtime performance of the string builder is 
quite bad as we scale.
and that the naive version is quite good even under extreme 
stress.
The Appender is somewhat faster at runtime than the naive version 
but not a whole lot.
And it takes twice the compile time.

So for code that is supposed to be at runtime a quick `~=` in the 
code is not bad at all.
If it's very heavily used the Appender can make a difference,  at 
runtime as well.
But it should absolutely be avoided if you are going to use the 
same code at CTFE.

The real lesson though is to be very skeptical of extermely good 
results and verify behavior as well as performance! :)

Cheers,

Stefan