Code speed (and back to the memory leaks...)
bearophile
bearophileHUGS at lycos.com
Thu Apr 15 09:10:45 PDT 2010
Joseph Wakeling:
>My guess for the difference is that you ran less than the full 100 iterations of the main for loop.<
You are right, sorry.
>So I think it's probably just compiler difference that's to blame for speed differences<
This can be true, but such differences are not magic: they can be tracked down, and eventually they can even be turned into precise enhancement requests for the LLVM developers, who are open to such requests.
>After all, D in general and DMD in particular is in development.<
DMD has a rather old back-end that is not in active development. Maybe it will eventually gain 64-bit support.
>I _am_ surprised that in your profiling the random number generator took up so much time, as if you cut out the lines,
aa(ratings, reputationUser, reputationObject);
yzlm(ratings, reputationUser, reputationObject);
... and recompile, the program screams through in no time at all -- which would lead me to think that it's the yzlm opCall and the functions it calls that take up the majority of time. Or is there some lazy evaluation going on here ... ?<
I've done a few more tests, and I've seen that my profiling with DMD was wrong: I had disabled inlining (to get a less aggregated result), but that produced misleading numbers, because in the real optimized run inlining removes most of the overhead of the Kiss generator. So the true profile is as you say:
Time       Func
17703021   Yzlm.objectReputationUpdate
12709135   Yzlm.userReputationUpdate
 1385194   main
In this program I have seen that a good percentage of the speed difference between ldc/dmd is caused by foreach loops. When you iterate over structs larger than size_t, use ref (or better, in D2, ref const(...)):
foreach (Rating r; ratings)
==>
foreach (ref const(Rating) r; ratings)
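For instance, a minimal sketch of the difference (with a hypothetical Rating layout, not necessarily the exact one in the program; the two loops compute the same thing and are only shown side by side):

struct Rating {
    uint user;
    uint object;
    double value;    // 16 bytes in total, larger than size_t on 32 bit
}

double sumValues(Rating[] ratings) {
    double total = 0.0;

    // by value: copies each Rating into r on every iteration
    foreach (Rating r; ratings)
        total += r.value;

    // by reference: no per-element copy (D2 syntax; plain ref in D1)
    foreach (ref const(Rating) r; ratings)
        total += r.value;

    return total;
}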
There is nothing algorithmically strange going on here, but to be sure you can take a look at the produced asm.
These are the middle parts (the two loops) of Yzlm.objectReputationUpdate compiled with LDC; this is one of the two main hot spots of the program, and you can compare this asm with the asm of the same part of the C++ version:
...
.LBB6_2:
movl (%eax), %edx
movl 52(%esp), %edi
movl 4(%edi), %ebx
movsd (%ebx,%edx,8), %xmm0
mulsd 8(%eax), %xmm0
movl 4(%eax), %edx
movl 48(%esp), %ebx
movl 4(%ebx), %ebx
addsd (%ebx,%edx,8), %xmm0
movsd %xmm0, (%ebx,%edx,8)
movl 4(%eax), %edx
movl 36(%esi), %ebx
movsd (%ebx,%edx,8), %xmm0
movl (%eax), %ebp
movl 4(%edi), %edi
addsd (%edi,%ebp,8), %xmm0
movsd %xmm0, (%ebx,%edx,8)
addl $16, %eax
decl %ecx
jne .LBB6_2
.LBB6_3:
movl 48(%esp), %eax
movl (%eax), %eax
testl %eax, %eax
je .LBB6_9
movl 48(%esp), %ecx
movl 4(%ecx), %ecx
pxor %xmm0, %xmm0
xorl %edx, %edx
pxor %xmm1, %xmm1
.align 16
.LBB6_5:
movl 36(%esi), %edi
movsd (%edi,%edx,8), %xmm2
ucomisd %xmm1, %xmm2
ja .LBB6_7
movsd .LCPI6_0, %xmm2
.LBB6_7:
movsd (%ecx,%edx,8), %xmm3
divsd %xmm2, %xmm3
movsd %xmm3, (%ecx,%edx,8)
movl 44(%esi), %edi
subsd (%edi,%edx,8), %xmm3
mulsd %xmm3, %xmm3
addsd %xmm3, %xmm0
incl %edx
cmpl %eax, %edx
jne .LBB6_5
...
I've also seen that the program gets a little faster with some unrolling, done like this:
const int UNROLL = 5;
assert(!(ratings.length % UNROLL));
assert(!(reputationUser.length % UNROLL));

for (int i; i < ratings.length; i += UNROLL) {
    foreach (k; Range!(UNROLL)) {
        auto r = &(ratings[i + k]);
        auto aux = (r.value - reputationObject[r.object]);
        reputationUser[r.user] += aux * aux;
    }
}
Where:
template Tuple(T...) {
    alias T Tuple;
}

template Range(int stop) {
    static if (stop <= 0)
        alias Tuple!() Range;
    else
        alias Tuple!(Range!(stop-1), stop-1) Range;
}
But to make it work with arrays of arbitrary length you need a little more code, using a strategy similar to:
sum = 0;
for (i = 0; i < n - 3; i += 4)
    sum += A[i] + A[i+1] + A[i+2] + A[i+3];
for (; i < n; i++)
    sum += A[i];
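For example, a hedged sketch (assuming the same ratings/reputationObject/reputationUser arrays and the Range template above) of the unrolled loop followed by a scalar loop for the leftover elements:

size_t i = 0;
// unrolled main loop, as before
for (; i + UNROLL <= ratings.length; i += UNROLL) {
    foreach (k; Range!(UNROLL)) {
        auto r = &(ratings[i + k]);
        auto aux = r.value - reputationObject[r.object];
        reputationUser[r.user] += aux * aux;
    }
}
// scalar loop for the remaining ratings.length % UNROLL elements
for (; i < ratings.length; i++) {
    auto r = &(ratings[i]);
    auto aux = r.value - reputationObject[r.object];
    reputationUser[r.user] += aux * aux;
}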
>I don't think it's the algorithm in any case (although it might be Phobos' implementation) as working with Tango and LDC, changing between the Twister or Kiss random number generators doesn't make for any meaningful performance difference.<
Those Phobos and Tango implementations are probably different: when you use uniform(), Phobos calls enforce() and nextafter(), which are probably not used in that Tango code.
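Very roughly (this is not the actual Phobos or Tango source, only a sketch of the kind of extra per-call work such checks add; the generator step is a toy stand-in for the real Kiss):

import std.math : nextafter;

// with argument checking and interval adjustment in every call
double uniformChecked(double a, double b, ref uint state) {
    assert(a < b);                           // stands in for enforce()
    double bOpen = nextafter(b, a);          // shrink the upper bound by one ulp
    state = state * 1664525 + 1013904223;    // toy LCG step, illustration only
    return a + (bOpen - a) * (state * (1.0 / 4294967296.0));
}

// the bare mapping a tight inner loop would prefer
double uniformRaw(double a, double b, ref uint state) {
    state = state * 1664525 + 1013904223;
    return a + (b - a) * (state * (1.0 / 4294967296.0));
}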
>I noticed that when compiled with LDC the memory usage stays constant, but in DMD the memory use blows up. Is this a D1/D2 difference (with D2 having the more advanced features that require preservation of memory) or is it a bug in one or the other of the compilers ... ?<
D2 dynamic arrays have been changed recently, while D1 dynamic arrays have not changed, and LDC is still D1. So this is just a D1/D2 difference. The purpose of this change between D1 and D2 was to avoid bugs caused by "memory stomping" through array slices.
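A hedged illustration of the stomping problem that the D2 change addresses (the D1 outcome depends on the capacity of the underlying allocation, so this shows the failure mode rather than a guaranteed result):

int[] a = [1, 2, 3, 4];
int[] b = a[0 .. 2];   // a slice sharing a's memory
b ~= 99;               // appending to the slice:
                       // - D1 may reuse the spare capacity and overwrite a[2] ("stomping");
                       // - D2 tracks the used length and reallocates b, leaving a intact
assert(a[2] == 3);     // holds in D2; may fail in D1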
Bye,
bearophile