I am building a telemetry ingestion engine in C11 that processes multi-gigabyte JSON execution logs using mmap() and pthreads.
The current design uses a shared work queue between worker threads to hand out chunks for parsing. It works at low volumes, but thread contention rises sharply as the worker pool grows and severely limits processing throughput.
What are the standard architectural approaches for minimizing synchronization overhead and lock contention in this scenario?
I am specifically interested in:
• Lock-free queue patterns or ring-buffer topologies suitable for SPSC/MPSC boundaries.
• Work-stealing pool architectures to keep workers balanced without constant global locking.
• Practical performance trade-offs of these patterns vs. a highly optimized mutex-protected queue on Linux.
Current Environment:
• Linux (Kernel 6.x)
• Pure C11 (GCC -O3)
• POSIX Threads (pthreads)
• Target Hardware: AVX2-capable x86_64 systems
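
To make the SPSC bullet concrete, here is a minimal sketch of the kind of ring buffer this question is about. It is not code from the actual engine. It assumes a power-of-two capacity, a 64-byte cache line, and one producer thread paired with exactly one consumer thread. The names `chunk_t`, `spsc_ring_t`, `spsc_init`, `spsc_push`, and `spsc_pop` are hypothetical, chosen only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAPACITY 1024u   /* assumed size; must be a power of two */
#define CACHE_LINE    64      /* assumed x86_64 cache-line size */

/* Hypothetical unit of work: a byte range inside the mmap()ed log. */
typedef struct {
    size_t offset;
    size_t length;
} chunk_t;

/* head and tail sit on separate cache lines to avoid false sharing. */
typedef struct {
    _Alignas(CACHE_LINE) atomic_size_t head;  /* advanced only by the consumer */
    _Alignas(CACHE_LINE) atomic_size_t tail;  /* advanced only by the producer */
    _Alignas(CACHE_LINE) chunk_t slots[RING_CAPACITY];
} spsc_ring_t;

static void spsc_init(spsc_ring_t *r) {
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side. Returns false when the ring is full. */
static bool spsc_push(spsc_ring_t *r, chunk_t c) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_CAPACITY)
        return false;
    r->slots[tail & (RING_CAPACITY - 1)] = c;
    /* Release: the slot write above becomes visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side. Returns false when the ring is empty. */
static bool spsc_pop(spsc_ring_t *r, chunk_t *out) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = r->slots[head & (RING_CAPACITY - 1)];
    /* Release: the slot read above completes before the slot is handed back. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Both indices only ever increase, and unsigned wraparound keeps `tail - head` correct. This sketch is not safe with more than one producer; an MPSC variant would need extra coordination between producers, for example a compare-and-swap on `tail`.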