Locality and Custom Allocations

Sun May 14 00:45:40 PDT 2006

John Demme wrote:

> Walter (and other compiler geniuses)- I've got two questions for you at
> the bottom of the post, if you want to skip the bulk of it.
> 
> -------
> 
> In order to decrease page faults (and thus increase performance) of my
> LinkedList, I implemented a block allocate scheme using the below snippet
> in a node struct:
> struct Node
> {
>         static void[] block;
>         /*
>           If we allocate large blocks in advance for the nodes, then
>           we increase our locality and thus cut down on page-faults
>           and increase our cache-hit percentage
>          */
>         new (size_t sz, int length)
>         {
>                 if (block.length < sz)
>                         block = new void[sz*(length+1)];
>                         
>                 void* p = block.ptr;
>                 block = block[sz..$];
>                 return p;
>         }
> ...
> }
> 
> Then, I used some variants of the following benchmark code to test
> ArrayList, LinkedList with block allocation and LinkedList without block
> allocations.  You can find ArrayList and LinkedList (w/o block
> allocations)
> in the mango.containers SVN.  ArrayList is backed by a D dynamic array,
> and LinkedList uses struct nodes.
> 
> void main()
> {
>         MutableList!(int) list = new LinkedList!(int)();
> 
>         for (int i=0; i<1_000_000; i++)
>         {
>                 list ~= i;
>         }
> 
>         for (int j=0; j<1; j++)
>         {
>                 foreach (int index, int i; list)
>                 {
>                         if (i != index)
>                                 Stdout("Error: "c)(i)(" != "c)(index)(CR);
>                 }
>         }
> }
> 
> I found the results quite puzzling.  I used the "time" command in unix to
> get the results. Here they are:
> 
> LinkedList, block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1_000 read loops
> real    0m35.673s
> user    0m34.274s
> sys     0m0.240s
> 
> LinkedList, no block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1_000 read loops
> real    1m2.546s
> user    0m57.824s
> sys     0m0.412s
> 
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1_000 read loops
> real    0m48.569s
> user    0m46.559s
> sys     0m0.288s
> 
> LinkedList, block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1 read loop
> real    0m2.597s
> user    0m2.500s
> sys     0m0.016s
> 
> LinkedList, no block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1 read loop
> real    0m0.988s
> user    0m0.944s
> sys     0m0.028s
> 
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1 read loop
> real    0m0.203s
> user    0m0.148s
> sys     0m0.028s
> 
> 
> Interpretation:
> Clearly, the block allocations are very helpful in the reads.  And
> surprise:
> the LinkedList performs better than the ArrayList!!!  How?  I suspect that
> it has to do with the method of iteration.  mango.container Containers are
> iterated through using Iterator objects (the opApplys use them as well)
> and the generic ListIterator calls .get(index) on the array in addition to
> the .next() call to the iterator.  The LinkedListIterator, however, is
> specific to LinkedList and uses a reference to the current node to avoid
> the obvious inefficiency of .get(index) on a LinkedList.  Assuming that
> the extra call to .get(index) on the ArrayList was causing some strain, I
> built an ArrayListIterator to avoid it, and came up with the following
> results:
> 
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1_000 read loops, using ArrayListIterator
> real    0m30.966s
> user    0m29.774s
> sys     0m0.180s
> 
> OK, looks like I was right.  What does this tell us?  That virtual
> function
> calls suck, but we knew this.  The LinkedList with block allocations only
> runs about 15% longer than the ArrayList- that's not too shabby.
> 
> The bulk of my surpise came when only one read loop was done, so the
> allocation part of the test program has far more weight.  The LinkedList
> without block allocations does better by an order of magnitude!!!  This
> blew me away.  I figured that in terms of allocations, the block
> allocations would do no harm, and probably do better.  In fact, I expected
> them to beat out the ArrayList-- the whole point of a LinkedList!  Here is
> what I surmise from this result:
> 
> Custom Allocators are Slow.
> 
> Is my reasoning faulty, or am I right?  If I'm right, here's my question
> to
> Walter:  why?  Are calls to these allocators not inlined?  If not, can
> they
> be?  In addition, to what extent can virtual function calls be sped up via
> optimization?  Under what circumstances can calls to virtual methods be
> inlined?
> 
> Sorry for the long post.... I may just be compensating for not posting
> much lately.
> 
> ~John Demme

Sorry to reply to myself, but I though I kinda sounded like an ass in the
post.  I'm not at all upset with DMD's performance- I'd rather see the
debugging stuff we're getting now over optimizations... I'm just wondering
what we might see in the future, and how to improve my code now.

~John Demme (again)