Locality and Custom Allocations
John Demme
me at teqdruid.com
Sun May 14 00:45:40 PDT 2006
John Demme wrote:
> Walter (and other compiler geniuses)- I've got two questions for you at
> the bottom of the post, if you want to skip the bulk of it.
>
> -------
>
> In order to decrease page faults (and thus increase performance) of my
> LinkedList, I implemented a block allocate scheme using the below snippet
> in a node struct:
> struct Node
> {
> static void[] block;
> /*
> If we allocate large blocks in advance for the nodes, then
> we increase our locality and thus cut down on page-faults
> and increase our cache-hit percentage
> */
> new (size_t sz, int length)
> {
> if (block.length < sz)
> block = new void[sz*(length+1)];
>
> void* p = block.ptr;
> block = block[sz..$];
> return p;
> }
> ...
> }
>
> Then, I used some variants of the following benchmark code to test
> ArrayList, LinkedList with block allocation and LinkedList without block
> allocations. You can find ArrayList and LinkedList (w/o block
> allocations)
> in the mango.containers SVN. ArrayList is backed by a D dynamic array,
> and LinkedList uses struct nodes.
>
> void main()
> {
> MutableList!(int) list = new LinkedList!(int)();
>
> for (int i=0; i<1_000_000; i++)
> {
> list ~= i;
> }
>
> for (int j=0; j<1; j++)
> {
> foreach (int index, int i; list)
> {
> if (i != index)
> Stdout("Error: "c)(i)(" != "c)(index)(CR);
> }
> }
> }
>
> I found the results quite puzzling. I used the "time" command in unix to
> get the results. Here they are:
>
> LinkedList, block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1_000 read loops
> real 0m35.673s
> user 0m34.274s
> sys 0m0.240s
>
> LinkedList, no block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1_000 read loops
> real 1m2.546s
> user 0m57.824s
> sys 0m0.412s
>
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1_000 read loops
> real 0m48.569s
> user 0m46.559s
> sys 0m0.288s
>
> LinkedList, block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1 read loop
> real 0m2.597s
> user 0m2.500s
> sys 0m0.016s
>
> LinkedList, no block allocations, compiler optimizations
> (-O -release -inline) , 1_000_000 elements, 1 read loop
> real 0m0.988s
> user 0m0.944s
> sys 0m0.028s
>
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1 read loop
> real 0m0.203s
> user 0m0.148s
> sys 0m0.028s
>
>
> Interpretation:
> Clearly, the block allocations are very helpful in the reads. And
> surprise:
> the LinkedList performs better than the ArrayList!!! How? I suspect that
> it has to do with the method of iteration. mango.container Containers are
> iterated through using Iterator objects (the opApplys use them as well)
> and the generic ListIterator calls .get(index) on the array in addition to
> the .next() call to the iterator. The LinkedListIterator, however, is
> specific to LinkedList and uses a reference to the current node to avoid
> the obvious inefficiency of .get(index) on a LinkedList. Assuming that
> the extra call to .get(index) on the ArrayList was causing some strain, I
> built an ArrayListIterator to avoid it, and came up with the following
> results:
>
> ArrayList, compiler optimizations (-O -release -inline) , 1_000_000
> elements, 1_000 read loops, using ArrayListIterator
> real 0m30.966s
> user 0m29.774s
> sys 0m0.180s
>
> OK, looks like I was right. What does this tell us? That virtual
> function
> calls suck, but we knew this. The LinkedList with block allocations only
> runs about 15% longer than the ArrayList- that's not too shabby.
>
> The bulk of my surpise came when only one read loop was done, so the
> allocation part of the test program has far more weight. The LinkedList
> without block allocations does better by an order of magnitude!!! This
> blew me away. I figured that in terms of allocations, the block
> allocations would do no harm, and probably do better. In fact, I expected
> them to beat out the ArrayList-- the whole point of a LinkedList! Here is
> what I surmise from this result:
>
> Custom Allocators are Slow.
>
> Is my reasoning faulty, or am I right? If I'm right, here's my question
> to
> Walter: why? Are calls to these allocators not inlined? If not, can
> they
> be? In addition, to what extent can virtual function calls be sped up via
> optimization? Under what circumstances can calls to virtual methods be
> inlined?
>
> Sorry for the long post.... I may just be compensating for not posting
> much lately.
>
> ~John Demme
Sorry to reply to myself, but I though I kinda sounded like an ass in the
post. I'm not at all upset with DMD's performance- I'd rather see the
debugging stuff we're getting now over optimizations... I'm just wondering
what we might see in the future, and how to improve my code now.
~John Demme (again)
More information about the Digitalmars-d
mailing list