D speed compared to C++

Georg Wrede georg at nospam.org
Tue Mar 25 19:54:48 PDT 2008


Saaa wrote:
>> If you really want to improve performance on C or C++, do it by
>> profiling your program, and optimize parts where it matters how
>> fast you go.

To Saaa: it's pretty late here now, but I'll try to address some of this
briefly.

In general (I have to start with this: this post, like every post in a
NG, should be for a broader audience than the one nominally replied
to), improving performance only means one thing: having the computer do
less to achieve the goal.

>> - simplify

Any time you look at your code from two weeks ago, you'll probably find
that the same thing could be done more simply, with fewer lines of code
and less work for the computer.

>> - remove unnecessary loops

For example, summing the integers from 1 to 100 doesn't mean you have
to do the obvious loop. Thinking about it a little gives Sum = (1+100) +
(2+99) + ..., and refining that further gives 50 * (1+100) = 5050.
(Courtesy of Gauss, http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss,
ca 1785, at about 7 years old!)

>> - hoist stuff out of loops as much as possible

     for(int i=0; i<100; i++)
     {
         if (i%3 == 0) sum += i;
     }

converted to

     for (int i = 0; i <= 100/3; i++)
     {
         sum += 3*i;  // the multiples of 3 directly, no test in the loop
     }

// (Strictly this is strength reduction rather than hoisting, but the
// point stands: the i%3 test above ran 100 times for nothing.)

>> - iterate or recurse in ways that ease cache miss penalties

Modern processors have caches ranging in size from hundreds of kilobytes
to several megabytes, usually with separate caches for instructions and
for data. Program code, for example, is fetched into the cache from the
memory around the current instruction. Thus the instructions following
the current one are usually already in the cache, which the processor
can access much faster than main memory.

Now, if a loop you're executing contains subroutine or function calls
that actually reside far apart in memory, the processor may not be able
to keep all of them in the instruction cache at once. One way to ease
this is to see to it that the needed functions reside close to each
other in RAM. One way of trying to do this is to have them next to each
other in the source code. (The compiler and linker have the final say
here, so take this as the general idea rather than a guarantee.)

The same goes for data. Suppose you're Chloe [in "24", the TV series],
and the boss gives you 25 minutes to filter out which of our 2 million
suspects have made any of 100 million recent cellular calls.

You'd have to write a D(!!!!) program to get it done, right? Matching
2 million in an array against 100 million in a stream (yes, you'd do it
against a stream) would be the first tack. But 25 minutes wouldn't be
enough. So you write your program so that, first, the 2 million suspects
are sorted in order of "suspectability". The extra time spent on this is
more than made up for when the actual matching runs, because now most of
the hot "suspects" appear regularly in the comparisons, and therefore
"stay in the cache".

(Again, it's late here, but you get the idea.)

>> - iterate instead of recurse as much as possible

Any textbook on programming and recursion shows you the Fibonacci
numbers as an example. Just write the code both recursively and
iteratively (both straight from the textbook) and time the results. The
difference is appalling.

>> - reduce if/else if/else || && as much as possible
> 
> Is this because of the obvious `less calculations is better`, or
> something else?

Well, any single calculation that you do _within_ the loop is done as
many times as the loop runs. In _most_ cases, what you'd put inside the
loop at first thought can fairly easily be moved outside it. This is
especially true in D, where compile-time things get calculated
automatically, but it holds in other languages too: much of the stuff
you'd initially have inside the loop can either be calculated once
beforehand, or the whole thing can be converted to calculating other
(simpler) things, with some function "rectifying" the end result before
or after the loop. (See the for-loop example above.)

>> - multiply by inverse instead of divide where possible

Division is poison for computers. It's also poison for math
coprocessors. (Those of us old enough to have had to do long division
of floating point numbers with pencil and paper at school know
intuitively that multiplication is a piece of cake by comparison. Who
here would venture to do 123455.2525 / 4.7110211 with only pencil and
paper??)

Now, if instead one could choose to do 123455.2525 * 0.212268206568
(whether with pencil and paper, or with the math coprocessor), the
result would be obtained _much_ faster. And with a _lot_ less headache.

>> - reduce calls to the OS where it's sensible

(This is ridiculous, but I hope it gets the idea across.) Suppose an
idiot, Ivan, has to write a program that sums up the sizes of all files
on a hard disk. He might first create a function that lists all the
files in a directory with separate OS queries, then edit it to be
recursive, then make a list of all the files found with this method, and
then one by one ask the operating system for the size of each file.

Some operating systems have a single call that returns a list of the
file names and (among other things) the file sizes and types. Wouldn't
it be faster to use that, sum up the sizes, and recurse only when an
entry turns out to be a directory?

(Comment about reducing calls to the OS: it's not as clear-cut as the
above example might suggest. Not all of us use Windows, where "_any_" OS
call is a "waste of time". So one should know, or seek wisdom in the OS
docs (or even the source code, where available), before passing
judgement on the efficacy of particular tactics.)

> - don't allocate and then delete is you need a ~same amount of memory
>  afterwards.

(Ignoring the typo.)

New-ing and delete-ing are expensive operations, time-wise. Especially
if you know up front that you will need to do a lot of both, it might be
smart to first consider (or have your app estimate, or even guess) how
many of these you might need at most at the same time. Then it would be
smart to allocate, once, an array big enough to hold that many of
whatever it is you need to allocate (be it integers, structs, objects,
or whatever).

Then you could write a function myNew(object_or_whatever) that, instead
of allocating with new, just looks at the array and finds the first
free slot in it. Similarly, instead of delete, you could have a function
that marks the particular slot in this array free again. While this may
not sound much faster than using new and delete, it replaces the heap's
general-purpose bookkeeping with a trivial lookup, and it is a tried and
proven method of making such programs run _much_ faster.

---

>> If you really want to improve performance on C or C++, do it by
>> profiling your program, and optimize parts where it matters how
>> fast you go.

(Quoted again, from the top of this post.) I'd rewrite the above quote to:

     If you really want to improve performance, do it by profiling.
     And, improve only where the profiling shows you're slow.

In any language.



More information about the Digitalmars-d mailing list