Reserving capacity in associative arrays

H. S. Teoh via Digitalmars-d-learn digitalmars-d-learn at puremagic.com
Tue Feb 16 11:49:55 PST 2016


On Tue, Feb 16, 2016 at 07:34:07PM +0000, Jon D via Digitalmars-d-learn wrote:
> On Tuesday, 16 February 2016 at 16:37:07 UTC, Steven Schveighoffer wrote:
> >On 2/14/16 10:22 PM, Jon D wrote:
> >>Is there a way to reserve capacity in associative arrays?
> >>[snip]
> >>The underlying implementation of associative arrays appears to take
> >>an initial number of buckets, and there's a private resize() method,
> >>but it's not clear if there's a public way to use these.

Rehashing (aa.rehash) resizes the number of buckets, but if you don't
already have the requisite number of keys in the AA, it won't help.
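
(As a minimal sketch of how rehash fits in -- it can only reorganize
buckets for keys that are already present, so it has to come after the
table is populated; the million-key loop below is just an illustrative
way to fill it:)

	import std.conv : to;

	void main() {
		size_t[string] aa;

		// Populate first; rehash only reorganizes buckets for
		// keys that are already in the table.
		foreach (i; 0 .. 1_000_000)
			aa[i.to!string] = i;

		// Reorganize the bucket array in place so that
		// subsequent lookups are more efficient.
		aa.rehash;
	}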


[...]
> >I would caution you to be sure of this cause, however, before
> >concluding it would solve the problem. The AA not only uses an array
> >for buckets, but allocates a memory location for each element as
> >well. I'm often wrong when I assume what the problem is when it
> >comes to GC issues...
> >
> Completely agree. After posting I decided to take a more methodical
> look. Not finished yet, but I can share part of it. The key finding
> so far is a noticeable step function in GC costs as the AA size grows
> (likely not a surprise).
> 
> My programs work with large data sets. Size is open-ended; what I'm
> trying to do is get an idea of the data set sizes they will handle
> reasonably. For purposes of illustration, word-count is a reasonable
> proxy for what I'm doing. It was in this context that I saw a
> significant performance drop-off after 'size_t[string]' AAs reached
> about 10 million entries.
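
(For concreteness, the word-count proxy mentioned above is the usual
pattern -- a minimal sketch, reading from standard input:)

	import std.algorithm : splitter;
	import std.stdio;

	void main() {
		size_t[string] counts;

		// Count occurrences of each whitespace-separated word.
		foreach (line; stdin.byLine)
			foreach (word; line.splitter)
				// .idup: byLine reuses its buffer, so AA
				// keys must be copied out of it.
				++counts[word.idup];

		writeln(counts.length, " distinct words");
	}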

I also have a program that builds very large AAs (also up to millions
of keys, sometimes more) that essentially last for the entire duration
of the program, and I've found that a lot of the performance degradation
can be attributed to overly-frequent GC collection runs.  As a
workaround, I did something like this:

	import core.memory : GC;

	size_t counter = 0;

	void main() {
		GC.disable();	// suppress automatic collections
		doWork();
		GC.enable();	// optional
	}

	void doWork() {
		Data[Key] aa = ...;

		mainloop: while(...) {
			... // lots of stuff, including adding things to the AA

			// Manually schedule GC collections.
			// The 10_000 is just an example number; tweak
			// it as you see fit.
			if (++counter % 10_000 == 0)
				GC.collect();
		}
	}

Basically, I run GC collections on my own schedule, with the frequency
controlled by tweaking the counter check. (Obviously, how you implement
this depends on whether there are regular or semi-regular units of work
that your program does, and whether they are a good basis for scheduling
collections.)

By tweaking the frequency of the manual GC collections, I've been able
to obtain 30-40% speedups in my program (in some cases, it could get as
high as 50%). YMMV.

Of course, as with all performance-related questions, the only way to
be sure that the GC (or anything else) is the cause of your problem is
to profile.  More often than not, I've found that the profiler has
proven me wrong about where the real bottleneck is, so I try not to
assume anything until I see the profile data.
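
(For the GC specifically, druntime can print GC profiling statistics
at program exit, which is a cheap first check before reaching for a
full profiler. A minimal sketch -- the rt_options declaration below is
the documented druntime hook, equivalent to running the program with
--DRT-gcopt=profile:1:)

	// Ask druntime to report GC statistics (number of
	// collections, total pause time, etc.) on program exit.
	extern(C) __gshared string[] rt_options = [ "gcopt=profile:1" ];

	void main() {
		// ... program as usual ...
	}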


T

-- 
If Java had true garbage collection, most programs would delete
themselves upon execution. -- Robert Sewell

