Reserving capacity in associative arrays

Jon D via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue Feb 16 11:34:07 PST 2016


On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
> On 2/14/16 10:22 PM, Jon D wrote:
>> Is there a way to reserve capacity in associative arrays?
>> [snip]
>> The underlying implementation of associative arrays
>> appears to take an initial number of buckets, and there's
>> a private resize() method, but it's not clear if there's a
>> public way to use these.
>
> There is not a public way to access these methods unfortunately.
>
> It would be a good addition to druntime I believe.
>
> Recently, I added a clear method to the AA, which does not 
> reduce capacity. So if you frequently build large AAs, and then 
> throw them away, you could instead reuse the memory.
>
My programs build AAs lasting the lifetime of the program.
>
> I would caution to be sure of this cause, however, before 
> thinking it would solve the problem. The AA not only uses an 
> array for buckets, but allocates a memory location for each 
> element as well. I'm often wrong when I assume what the problem 
> is when it comes to GC issues...
>
Completely agree. After posting I decided to take a more 
methodical look. Not finished yet, but I can share part of it. 
The key thing so far is a noticeable step function in GC costs 
as AA size grows (likely not a surprise).

My programs work with large data sets. Size is open-ended; what 
I'm trying to do is get an idea of the data set sizes they will 
handle reasonably. For purposes of illustration, word-count is a 
reasonable proxy for what I'm doing. It was in this context that 
I saw a significant performance drop-off after 'size_t[string]' 
AAs reached about 10 million entries.

I've started measuring with a simple program. Basically:

     import std.datetime.stopwatch : StopWatch;
     import std.random : uniform;

     StopWatch sw;
     sw.start;
     size_t[size_t] counts;
     foreach (i; 0 .. iterations)
         counts[uniform(0, uniqMax)]++;   // auto-inserts missing keys at zero
     sw.stop;

Same thing with string as key ('size_t[string]') AAs. 
'iterations' and 'uniqMax' are varied between runs. GC stats are 
printed (via "--DRT-gcopt=profile:1"), plus timing and AA size. 
(Runs use LDC 17, release-mode compiles, on a fast 16GB MacBook.)
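For concreteness, a minimal sketch of the string-key variant I 
mean (the exact key construction here is my illustration, using 
'to!string' as a simple proxy; each iteration builds a fresh 
string key):

     import std.conv : to;
     import std.random : uniform;

     size_t[string] counts;
     foreach (i; 0 .. iterations)
         counts[uniform(0, uniqMax).to!string]++;  // constructs a new string key each iteration
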

For the integer-as-key case ('size_t[size_t]'), there are notable 
jumps in GC total time and GC max pause time as AA size crosses 
specific size thresholds. This makes sense, as the AA needs to 
grow. Approximate steps:

     | entries | gc_total (ms) | gc_max_pause (ms) |
     |---------+---------------+-------------------|
     | 2M      |            30 |                60 |
     | 4M      |           200 |               100 |
     | 12M     |           650 |               330 |
     | 22M     |          1650 |               750 |
     | 44M     |          5300 |              3200 |

The iteration count didn't matter, and GC total time and GC max 
pause time were largely flat between these jumps.

This suggests AA resize is the likely driver, and that 
preallocating a large size might address it.
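Lacking a public reserve, one workaround sketch (based on the 
note above that clear() keeps capacity; 'expectedEntries' is an 
illustrative name, and I haven't verified the retained capacity 
myself) is to pre-grow the AA once and then clear it:

     size_t[size_t] counts;
     foreach (i; 0 .. expectedEntries)
         counts[i] = 0;   // force the bucket array to grow
     counts.clear();      // removes entries but should not reduce capacity
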

To the point about being sure about cause - my programs use 
strings as keys, not integers. The performance drop-off with 
strings was quite a bit more significant than with integers. That 
analysis seems a bit trickier, and I'm not done with it yet. 
Strings involve different memory allocation, and there may be 
effects from creating short-lived, temporary strings to test AA 
membership. It could easily be that string use, or the combination 
of AAs with strings as keys, is the larger effect.

The other thing that jumps out from the table is that the GC max 
pause time reaches multiple seconds. Not an issue for my tools, 
which aren't interactive at those points, but it would be a 
significant issue for many interactive apps.

--Jon

