How to signal a multithreaded worker pool at very high throughput?
01:58 07 Mar 2026

I'm building a circuit simulator, and I want to evaluate the component graph as fast as possible. Each "time-step" of the simulation requires two passes (update, propagate) over an array of components, and both passes are highly parallelizable. My goal is to maximize time-steps per second while fully utilizing all my threads, so that large circuits (with many state changes per step) can be evaluated quickly.

My first approach used condvars (I am using C11 threads.h): the main thread would break the problem into slices, put them in a shared array, and signal the workers that got a job to wake up and process it. After distributing, the main thread would go to sleep, to be woken each time a worker finished so it could check whether all of them were done.

Using that approach, I got around 5000 nanoseconds per step, i.e. 200 kHz. Not great: the callgrind output showed that the vast majority of the time was spent waiting on condvars.

So, I turned to a lock-free approach where all workers (and the main thread) spin on atomics instead of communicating through the scheduler. If I use all 12 threads on my machine, each step now takes millions of nanoseconds - a vast regression! But if I limit the pool to a single worker thread, I achieve as low as 500 nanoseconds per step, which is incredible.

I understand that there is a feedback effect going on where spinning worker threads steal CPU time from the main thread (which needs to merge and prepare their jobs), but I'm not sure how to keep my very fast times without resorting to ugly tricks like pinning the main thread to a CPU core (which would also waste a whole core).

A worker function (spawned at program start) looks like this:

static usf_compatibility_int startworker(void *n) {
    /* Worker loop: evaluates graph slices assigned by the main thread */

    u64 nproc = (u64) n;

    for (;;) {
        while (usf_atmflagtry(&jobs[nproc].wait, MEMORDER_ACQ_REL)); //Spin until main clears the flag

        switch (jobs[nproc].flag) {
            case WAIT: unreachable(); break;
            case UPDATE: update(nproc); break;
            case PROPAGATE: propagate(nproc); break;
            case TERMINATE: return 0;
        }

        jobs[nproc].flag = WAIT;
        usf_atmsubi(&running_, 1, MEMORDER_ACQ_REL); //At 0, the main thread unblocks
    }
}

A Slice struct looks like this:

typedef struct Slice {
    u64 len;
    usf_hashentry *array;
    atomic_flag wait;
    Threadflag flag;
} Slice;

And finally, the function to distribute work:

static void distribute(usf_hashentry *array, u64 len, Threadflag threadflag) {
    u64 chunklen, nproc, slicelen;
    for (chunklen = len / NPROCS, nproc = 0; nproc < NPROCS; nproc++) {
        slicelen = (nproc == NPROCS - 1) ? chunklen + len % NPROCS : chunklen;
        if (slicelen == 0) continue;

        usf_hashentry *subarray = array + nproc * chunklen;
        jobs[nproc].len = slicelen;
        jobs[nproc].array = subarray;
        jobs[nproc].flag = threadflag;
        usf_atmaddi(&running_, 1, MEMORDER_ACQ_REL);
        usf_atmflagclr(&jobs[nproc].wait, MEMORDER_RELEASE);
    } 
    while (usf_atmmld(&running_, MEMORDER_ACQUIRE)); //Spinlock
}
c multithreading performance threadpool contention