I'm building a circuit simulator, and I want to evaluate the component graph as fast as possible. Each "time-step" of the simulation requires two passes (update, then propagate) over an array of components, both of which parallelize well. My goal is to get the most time-steps per second while fully utilizing all my threads, so that large circuits (with many state changes per step) can be evaluated rapidly.
The way I first did this was with condvars (I'm using C11 threads.h): the main thread would break down the problem, put the slices in a shared array, and signal the threads that got a job to wake up and do it. The main thread would itself go to sleep after distribution, being woken each time a worker finishes so it can check whether all of them are done.
Using that approach, I got around 5000 nanoseconds per step, which is 200 kHz. Not great: the callgrind output showed that the vast majority of the time was spent waiting on condvars.
So, I turned to a lock-free approach where all workers (and the main thread) spin-wait on atomics instead of communicating through the scheduler. If I use all 12 threads on my machine, each step now takes millions of nanoseconds - a vast regression! But if I limit the worker threads to just one, I achieve as low as 500 nanoseconds per step, which is incredible.
I understand that there is a contention effect going on where spinning worker threads steal CPU time from the main thread (which needs to merge and prepare their jobs), but I'm not sure how to keep these very fast times without resorting to ugly tricks like pinning the main thread to a CPU core (not to mention, this would waste a whole core).
A worker function (spawned at program start) looks like this:
static usf_compatibility_int startworker(void *n) {
    /* Evaluates graph slices handed out by the main thread */
    u64 nproc = (u64) n;
    for (;;) {
        while (usf_atmflagtry(&jobs[nproc].wait, MEMORDER_ACQ_REL)); //Spin until main clears the flag
        switch (jobs[nproc].flag) {
        case WAIT: unreachable(); break;
        case UPDATE: update(nproc); break;
        case PROPAGATE: propagate(nproc); break;
        case TERMINATE: return 0;
        }
        jobs[nproc].flag = WAIT;
        usf_atmsubi(&running_, 1, MEMORDER_ACQ_REL); //At 0, main thread unblocks
    }
}
A Slice struct looks like this:
typedef struct Slice {
    u64 len;
    usf_hashentry *array;
    atomic_flag wait; //Set = worker spins; cleared by main to release it
    Threadflag flag;
} Slice;
And finally, the function to distribute work:
static void distribute(usf_hashentry *array, u64 len, Threadflag threadflag) {
    u64 chunklen, nproc, slicelen;
    for (chunklen = len / NPROCS, nproc = 0; nproc < NPROCS; nproc++) {
        slicelen = (nproc == NPROCS - 1) ? chunklen + len % NPROCS : chunklen; //Last slice takes the remainder
        if (slicelen == 0) continue;
        usf_hashentry *subarray = array + nproc * chunklen;
        jobs[nproc].len = slicelen;
        jobs[nproc].array = subarray;
        jobs[nproc].flag = threadflag;
        usf_atmaddi(&running_, 1, MEMORDER_ACQ_REL);
        usf_atmflagclr(&jobs[nproc].wait, MEMORDER_RELEASE); //Release the worker
    }
    while (usf_atmmld(&running_, MEMORDER_ACQUIRE)); //Spin until all workers are done
}