Keywords: openmpi performance
Description: Firstly, I know this [type of] question is frequently asked, so let me preface this by saying I've read as much as I can, and I still don't know what the deal is. I've parallelized a massive
Firstly, I know this [type of] question is frequently asked, so let me preface this by saying I've read as much as I can, and I still don't know what the deal is.
I've parallelized a massive outer for loop. Number of loop iterations varies, typically between 20-150, but the loop body does a huge amount of work, calling on a lot of local intensive linear algebra routines (as in, the code is part of the source and not an external dependency). Within the loop body there is 1000+ calls to these routines, but they're all totally independent of one another, so I figured it would be a prime candidate for parallelism. The loop code is C++, but it calls a lot of subroutines written in C.
There shouldn't, I believe, be any synchronization at all - the only time the threads access a shared variable are the shared_arrays. and they access unique points in those arrays, indexed by tst .
Thing is, when I up the number of threads (on a multicore cluster!) The speeds we're seeing (where we invoke this loop 5 times) are as follows;
Things which may be noticeable are the massive jump in System time between 2 and 4 threads, and the fact the elapsed time doubles as we move from 2 to 4, and then slowly decreases.
I've tried with a huge range of OMP_SCHEDULE parameters but no luck. Is this related to the fact each thread is using malloc/new and free/delete a lot? This has consistently been run with 8GBs memory - but I'm guessing that's not a problem. Frankly, the huge rise in system time makes it look like the threads might be blocking, but I have no idea why that would happen.
UPDATE 1 I really thought the false sharing was going to be the problem, so re-wrote the code so that the loops store their calculated values in thread-local arrays, and then copy these arrays across to the shared array at the end. Sadly this didn't have any impact, though I almost don't believe it myself.
Following @cmeerw's advice, I ran strace -f, and after all the initialization there are just millions of lines of
Anyone have any ideas what means? Looks like the threads are context switching way too often, or just blocking and unblocking? When I strace the same implementation with OMP_NUM_THREADS set to 0 I get none of this at all. For some comparison, the logfile generated when 1 thread is used is 486 KB. and the log file generated when 4 threads are used is 266 MB.
As suggested by Tom, I tried binding threads to specific processors to no avail. We're in OpenMP 3.1, so I set the environment variable using export OMP_PROC_BIND=true. Same size logfile and same timeframe.
The plot thickens. Having only profiled on the cluster so far, I installed GNU GCC 4.7 via Macports and compiled (with openMP) on my Macbook for the first time (Apple's GCC-4.2.1 throws up a compiler bug when OpenMP is enabled, which is why I hadn't compiled and run it in parallel locally until now). On the Macbook, you see basically the trend you'd expect
We see dimishing returns towards the ends, though this is hardly surprising as a couple of the data sets we're iterating over on this test data have <16 members (so, we're generating 16 threads for, say a for-loop with 7 iterations).
So, now the question remains - WHY does the cluster's performance degrade so badly. I'm going to try on a different quadcore linuxbox tonight. The cluster compiles with GNU-GCC 4.6.3, but I can't believe that in itself is going to make such a difference?
Neither ltrace nor GDB are installed on the cluster (and I can't get them on for various reasons). If my linuxbox gives cluster-like performance I'll run the corresponding ltrace analysis there.
Oh my. I duel booted my Macbook Pro into Ubuntu (12.04) and re-ran the code. It all runs (which is somewhat reassuring) but I see the same, weird bad-performance behavior I see on the clusters, and the same run of millions of futex calls. Given the only difference between my local machine in Ubuntu and in OSX is software (and I'm using the same compiler and libraries - presumably there aren't different glibc implementations for OSX and Ubuntu!) I'm now wondering if this is something to do with how Linux schedules/distributes threads. Any case, being on my local machine makes everything a million times easier, so I'm going to go ahead and ltrace -f it and see what I can find. I wrote a work around for the clusters which forks() off a separate process, and gives a perfect 1/2 in the runtime, so it's definitely possible to get the parallelism going.