Slow performance compared to C++, ideas?

Sun Jun 2 11:17:23 PDT 2013

On 06/02/2013 04:34 PM, Manu wrote:
> Well this is another classic point actually. I've been asked by my friends at
> Cambridge to give their code a once-over for them on many occasions, and while I
> may not understand exactly what their code does, I can often spot boat-loads of
> simple functional errors. Like basic programming bugs; out-by-ones, pointer
> logic fails, clear lack of understanding of floating point, or logical structure
> that will clearly lead to incorrect/unexpected edge cases.
> And it blows my mind that they then run this code on their big sets of data,
> write some big analysis/conclusions, and present this statistical data in some
> journal somewhere, and are generally accepted as an authority and taken seriously!
> 
> *brain asplode*

Yes, I can imagine.  I've seen more than enough researcher-written code that
made my own brain explode, and I don't consider myself in any way expert in
program design.  You have to hope that there were sufficient checks against
empirical or theoretical results, that at least any error was minimized ...

What bothers me more than the "trust" issues about the code is that very often,
the code is never made available for review.  It's astonishing how timid
journals and funding organizations are about trying to resolve this.

> I can tell you I usually offer more in the way of fixing basic logical errors
> than actually making it run faster ;)
> And they don't come to me with everything, just the occasional thing that they
> have a hunch should probably be faster than it is.

I certainly went through a phase of extreme speed-related paranoia in my C
programming past, and I think it's a common trait.  Speed is the thing you worry
about because it's the most observable problem.  And of course C tends to bias
you towards daft micro-optimization rather than basic things like getting the
algorithms right.

> I hope my experience there isn't too common, but I actually get the feeling it's
> more common than you'd like to think!
> This is a crowd I'd actually love to promote D to! But the tools they need
> aren't all there yet...

I think it's probably very common, exacerbated by the fact that most researchers
(not just in maths and physics) are amateur, self-taught programmers with
limited time to study the art of programming for itself, and limited
opportunities for formal training, who are under great pressure to produce
research results quickly and continuously.

I do what I can to promote D to colleagues, and I've had one or two people come
up to me spontaneously and ask about it (because they've seen my name on the
mailing lists), so I think the interest is growing and it will get there.  90%
of the worries I have with it concern 3rd-party libraries -- obviously C/C++
gives you access to a much wider range of stuff without having to write bindings
(which, to me at least, is a scary prospect, not so much for the technical side
as for the hassle and time needed to write and maintain them).  The
other 10% of the worries are about potential issues with the standard library
that might result in incorrect results (actually, I'm pretty confident here
really, but there are a number of known issues in std.random which I make sure
to work around).

> Yeah, this is an interesting point. These friends of mine all write C code, not
> even C++.
> Why is that?
> I guess it's promoted, because they're supposed to be into the whole 'HPC'
> thing, but C is really not a good language for doing maths!

Well, I can't speak for proper HPC because it's not my field (I work in
complexity science, which does involve a lot of intensive computation but has
developed somewhat independently of traditional HPC fields).  However, my guess
would be that it's a mix of what people are first trained in, together with a
measure of conservatism and "What's the lowest common denominator?".  I also
don't think I can stress enough how true it is that mathematicians, physicists
and other researchers tend to be trained to program _in C_, or in C++, or in
FORTRAN, rather than _how to program_.

Speed concerns might be a factor, as C++ offers you rather more ways to shoot
yourself in the foot speed-wise than C -- there might be some prejudice about C
being the one to use for really heavy-duty computation, though I imagine that
says more about the programmer's skill than the reality of the languages.

In my own field the norm these days seems to be a hybrid of C/C++ and Python --
the former for the high-intensity stuff, the latter for convenience or to have a
friendly surface from which to call the high-intensity routines, although
libraries like NumPy seem to be challenging the C dominance for some intense
calculations -- I'm seeing an increasing number of Python libraries being
written and used.

That said, I don't think language lock-in is unique to mathematicians and
physicists.  Many of the computer scientists I've worked with have been wedded
to Java with a strength that is astonishing given that you'd expect them to be
trained well enough to really appreciate the variety of choices available.  (In
my experience, mathematicians and physicists tend to be far more comfortable
with the command line and in programming without an IDE.  That may of course
explain some of the errors, too:-)

> I see stuff like this:
> float ***cubicMatrix = (float***)malloc(sizeof(float**)*depth);
> for(int z=0; z<depth; z++)
> {
>   cubicMatrix[z] = (float**)malloc(sizeof(float*)*height);
>   for(int y=0; y<height; y++)
>   {
>     cubicMatrix[z][y] = (float*)malloc(sizeof(float)*width);
>   }
> }
> 
> Seriously, float***. Each 1d row is an individual allocation!
> And then somewhere later on they want to iterate a column rather than a row, and
> get confused about the pointer arithmetic (well, maybe not precisely that, but
> you get the idea).

Oh yes, I've written code like that. :-P

I can only say that it's the way that I was shown how to create matrices in C.
I can't remember the context; possibly I read it in a book, or possibly it was
by browsing other code, possibly it was in lecture notes.

That said, it's the _obvious_ way to create a matrix (if you think of a matrix
as being an entity whose values are accessed by indices [x][y][z]), and if
you're not trained in program design, the obvious way tends to be the thing you
pick, and it seems to work, so ...

I mean, I guess (I've never had call to do it, so never looked into the detail)
the way to _really_ build an effective matrix design is to have a single array
and wrap it with functions that translate x, y, z indices to appropriate array
locations.  But you wouldn't think of that as a novice programmer, and unless
someone teaches you, the multi-level alloc solution probably seems to work well
enough that you never think to question it.  (For the avoidance of doubt: I'm at
least experienced enough to have questioned it before this email exchange:-)
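
For what it's worth, a minimal sketch of that single-array idea in C might look
something like the following.  It assumes a row-major layout with x varying
fastest; the Grid3 type and the grid3_* names are invented purely for
illustration, and it makes no attempt to check for overflow when computing the
total size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical type: a depth x height x width grid stored in ONE block. */
typedef struct {
    size_t depth, height, width;
    float *data;   /* depth * height * width floats, zero-initialised */
} Grid3;

/* Allocate the whole grid with a single calloc.
   Returns 0 on success, -1 if the allocation failed.
   Assumption: depth * height * width does not overflow size_t. */
int grid3_init(Grid3 *g, size_t depth, size_t height, size_t width)
{
    g->depth = depth;
    g->height = height;
    g->width = width;
    g->data = calloc(depth * height * width, sizeof *g->data);
    return g->data ? 0 : -1;
}

/* Translate (z, y, x) into an offset in the flat array.
   x varies fastest, so each row of `width` floats is contiguous. */
size_t grid3_index(const Grid3 *g, size_t z, size_t y, size_t x)
{
    assert(z < g->depth && y < g->height && x < g->width);
    return (z * g->height + y) * g->width + x;
}

/* Pointer to one element, for reading or writing. */
float *grid3_at(Grid3 *g, size_t z, size_t y, size_t x)
{
    return &g->data[grid3_index(g, z, y, x)];
}

/* Sum a column (fixed z and x, varying y).  Successive elements are
   `width` floats apart, but no pointer arithmetic leaks to the caller. */
float grid3_column_sum(const Grid3 *g, size_t z, size_t x)
{
    float sum = 0.0f;
    for (size_t y = 0; y < g->height; y++)
        sum += g->data[grid3_index(g, z, y, x)];
    return sum;
}

/* One free() releases everything, instead of one per row. */
void grid3_free(Grid3 *g)
{
    free(g->data);
    g->data = NULL;
}
```

With this layout, element (z, y, x) is *grid3_at(&g, z, y, x), and walking a
column is just a different loop over the same index function, so the row/column
confusion Manu describes has nowhere to hide.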

That kind of "works well enough" probably explains 99% of the programming faults
made by researchers, together with the fact that they very rarely encounter
anyone with the experience to question their approach -- and of course, their
own sense of the problems within their code can (as you've experienced) be very
different from the problems an experienced developer will focus on.

I have to say that it'd be very tempting to try and organize an annual event
(and maybe also an online space) where researchers using computation are brought
together with genuinely expert developers for lectures, brainstorming and
collaboration, with the specific aim of getting away from these kinds of
habitual errors.