Parallel algorithms are at the heart of modern high‑performance computing. By dividing work across multiple processing elements, they enable solutions that scale with the ever‑growing number of cores in CPUs, GPUs, and distributed clusters. This tutorial walks you through the fundamentals, design principles, common paradigms, concrete code examples, performance analysis, and best practices for building robust parallel algorithms.
1. What Are Parallel Algorithms?
A parallel algorithm solves a problem by performing several operations simultaneously. Unlike sequential algorithms, which execute one step after another, parallel algorithms exploit concurrency to reduce overall execution time, often expressed as speedup relative to the best sequential baseline.
Key concepts to keep in mind:
- Concurrency vs. Parallelism: concurrency structures a program as independently progressing tasks, while parallelism actually executes work simultaneously on multiple processing elements.
- Speedup and Efficiency: speedup S = Tseq/Tpar compares the parallel run against the sequential baseline; efficiency E = S/p normalizes it by the number of processors p.
- Amdahl’s Law: for a fixed problem size, the maximum speedup is limited by the serial portion of the algorithm.
- Gustafson’s Law: when the problem size grows with the number of processors, the parallel share of the work grows with it, so higher utilization becomes achievable.
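To get a feel for the difference between the two laws, here is a small illustrative sketch (the 5% serial fraction and the processor counts are made-up values, not measurements) that tabulates the speedup each law predicts:
#include <cstdio>

// Illustrative only: predicted speedups for an assumed serial fraction f.
// Amdahl (fixed problem size):   S = 1 / (f + (1 - f) / p)
// Gustafson (scaled problem):    S = f + (1 - f) * p
int main() {
    const double f = 0.05;                      // assumed 5% serial fraction
    const int procs[] = {2, 4, 16, 64, 256};
    for (int p : procs) {
        double amdahl    = 1.0 / (f + (1.0 - f) / p);
        double gustafson = f + (1.0 - f) * p;
        std::printf("p=%3d  Amdahl=%6.2f  Gustafson=%7.2f\n", p, amdahl, gustafson);
    }
    return 0;
}
Even with only 5% serial work, Amdahl’s prediction saturates near 20x, while Gustafson’s keeps growing because the parallel portion of the (larger) problem grows with p.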
2. Design Principles for Parallel Algorithms
- Identify independent tasks (embarrassingly parallel work).
- Minimize communication and synchronization overhead.
- Balance load across processing elements.
- Avoid false sharing and contention on shared resources (see the padding sketch after this list).
- Design for scalability: aim for strong and weak scaling.
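To illustrate the false-sharing bullet above, here is a minimal sketch (C++17 and OpenMP assumed; the workload size is arbitrary): per-thread accumulators packed next to each other share cache lines, whereas padding each accumulator to its own 64-byte line keeps threads from invalidating one another's caches. Depending on compiler optimizations the timing gap may be small, but the padding pattern itself is the point.
#include <omp.h>
#include <vector>
#include <cstdio>

// One accumulator per thread. In the unpadded layout adjacent slots share a
// cache line, so writes by different threads keep invalidating each other's
// caches (false sharing). alignas(64) gives each slot its own line (C++17).
struct alignas(64) Padded { long v = 0; };

int main() {
    const int nthreads = omp_get_max_threads();
    const long n = 1 << 24;                         // placeholder workload size
    std::vector<int> data(n, 1);
    std::vector<long> unpadded(nthreads, 0);        // adjacent slots: false sharing
    std::vector<Padded> padded(nthreads);           // one cache line per slot

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        const int id = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; ++i) unpadded[id] += data[i];
    }
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        const int id = omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; ++i) padded[id].v += data[i];
    }
    double t2 = omp_get_wtime();

    long s1 = 0, s2 = 0;
    for (int t = 0; t < nthreads; ++t) { s1 += unpadded[t]; s2 += padded[t].v; }
    std::printf("sums %ld/%ld  unpadded %.3fs  padded %.3fs\n", s1, s2, t1 - t0, t2 - t1);
    return 0;
}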
2.1. Decomposition Strategies
- Domain decomposition (e.g., spatial partitioning).
- Functional decomposition (splitting algorithmic phases).
- Data parallelism (applying the same operation to multiple data items; see the sketch below).
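The sketch below (OpenMP assumed; the grid size and update rule are placeholders) combines the first and third strategies: the index range is decomposed into one contiguous block per thread, and every block applies the same element-wise operation.
#include <omp.h>
#include <vector>
#include <algorithm>
#include <cstdio>

// Domain decomposition by hand: the index range [0, n) is split into one
// contiguous block per thread, and every block applies the same element-wise
// update (data parallelism). The update rule (scale by 2) is a placeholder.
int main() {
    const int n = 1'000'000;
    std::vector<double> grid(n, 1.0);

    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int tid      = omp_get_thread_num();
        const int chunk    = (n + nthreads - 1) / nthreads;   // ceiling division
        const int begin    = tid * chunk;
        const int end      = std::min(n, begin + chunk);
        for (int i = begin; i < end; ++i) {
            grid[i] *= 2.0;                                    // same operation on every cell
        }
    }

    std::printf("grid[0]=%.1f grid[n-1]=%.1f\n", grid[0], grid[n - 1]);
    return 0;
}
In practice a single #pragma omp parallel for achieves the same partitioning; the explicit block arithmetic is shown only to make the decomposition visible.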
3. Common Parallel Paradigms
Different hardware architectures favor different parallel models. Below is a comparison of the most widely used paradigms.
| Paradigm | Typical Hardware | Programming Model | Strengths | Weaknesses |
|---|---|---|---|---|
| Shared‑Memory (e.g., OpenMP) | Multi‑core CPUs | Fork‑join, pragma directives | Low latency, easy to program | Limited scalability beyond a single node |
| Distributed‑Memory (MPI) | Clusters, Supercomputers | Message passing | Scales to thousands of nodes | Higher latency, explicit communication |
| GPU Computing (CUDA, OpenCL) | GPUs, Accelerators | Massively data‑parallel kernels | Very high throughput for data‑parallel work | Complex memory hierarchy, limited branching |
| Task‑Based (Cilk, TBB, ForkJoin) | Multi‑core CPUs, many‑core | Dynamic task graph | Load balancing, work stealing | Overhead for fine‑grained tasks |
| Dataflow / Reactive (RxJava, Spark) | Clusters, Cloud | Declarative data pipelines | Fault tolerance, high‑level abstraction | Latency for streaming may be higher |
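To make the distributed-memory row of the table concrete, here is a minimal MPI sketch (it assumes an MPI installation; compile with mpic++ and launch with mpirun): each rank owns its own block of data, computes a local partial sum, and MPI_Reduce combines the partial results on rank 0.
#include <mpi.h>
#include <vector>
#include <cstdio>

// Each rank owns a private block of data (distributed memory), computes a
// local partial sum, and the partial sums are combined on rank 0 via MPI_Reduce.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Placeholder local data: 1000 elements, each equal to this rank's id.
    std::vector<long> local(1000, rank);
    long local_sum = 0;
    for (long v : local) local_sum += v;

    long global_sum = 0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        std::printf("global sum across %d ranks = %ld\n", size, global_sum);
    }
    MPI_Finalize();
    return 0;
}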
4. Parallel Algorithm Examples
4.1. Parallel Prefix Sum (Scan)
The prefix sum computes the running total of an array. It is a building block for many algorithms such as sorting, histogram computation, and dynamic programming.
#include <omp.h>
#include <vector>
#include <iostream>
// Hillis–Steele inclusive scan: in round d, each element adds the value d
// positions back. A snapshot of the previous round is taken so that the
// parallel reads and writes do not race.
void parallel_prefix_sum(std::vector<int>& a) {
    int n = static_cast<int>(a.size());
    for (int d = 1; d < n; d <<= 1) {
        std::vector<int> prev = a;          // values from the previous round
        #pragma omp parallel for
        for (int i = d; i < n; i++) {
            a[i] = prev[i] + prev[i - d];
        }
    }
    // An exclusive scan can be obtained by shifting the result right by one
    // position and inserting the identity (0) at the front.
}
int main() {
    std::vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8};
    parallel_prefix_sum(data);
    for (int v : data) std::cout << v << " ";   // prints 1 3 6 10 15 21 28 36
    std::cout << "\n";
    return 0;
}
The same idea in Python splits the array into one chunk per process: each worker scans its chunk sequentially, and the chunk totals then serve as offsets when the pieces are stitched back together.
import multiprocessing as mp

def scan_chunk(chunk):
    # Sequential inclusive scan of one chunk; also returns the chunk's total.
    out = []
    total = 0
    for x in chunk:
        total += x
        out.append(total)
    return out, total

def parallel_prefix_sum(arr):
    n = len(arr)
    cpu = mp.cpu_count()
    size = (n + cpu - 1) // cpu                       # ceiling division
    chunks = [arr[i * size:(i + 1) * size] for i in range(cpu)]
    with mp.Pool(cpu) as pool:
        results = pool.map(scan_chunk, chunks)
    # Combine: shift each chunk's scan by the sum of all preceding chunks.
    prefix = []
    offset = 0
    for out, total in results:
        prefix.extend(x + offset for x in out)
        offset += total
    return prefix

if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6, 7, 8]
    print(parallel_prefix_sum(data))                  # [1, 3, 6, 10, 15, 21, 28, 36]
4.2. Parallel QuickSort (Divide‑and‑Conquer)
QuickSort can be parallelized by sorting the left and right partitions concurrently; once a partition falls below a size threshold, it is cheaper to sort it sequentially than to spawn more tasks.
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.ForkJoinPool;
public class ParallelQuickSort extends RecursiveAction {
private final int[] arr;
private final int low, high;
private static final int THRESHOLD = 1_000;
public ParallelQuickSort(int[] arr, int low, int high) {
this.arr = arr; this.low = low; this.high = high;
}
@Override
protected void compute() {
if (high - low < THRESHOLD) {
// sequential sort for small partitions
java.util.Arrays.sort(arr, low, high + 1);
return;
}
int pivot = partition(arr, low, high);
ParallelQuickSort left = new ParallelQuickSort(arr, low, pivot - 1);
ParallelQuickSort right = new ParallelQuickSort(arr, pivot + 1, high);
invokeAll(left, right);
}
private int partition(int[] a, int l, int h) {
int pivot = a[h];
int i = l - 1;
for (int j = l; j < h; j++) {
if (a[j] <= pivot) {
i++;
int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
}
}
int tmp = a[i + 1]; a[i + 1] = a[h]; a[h] = tmp;
return i + 1;
}
public static void main(String[] args) {
int[] data = {9,4,6,2,7,5,3,8,1};
ForkJoinPool pool = new ForkJoinPool();
pool.invoke(new ParallelQuickSort(data, 0, data.length - 1));
for (int v : data) System.out.print(v + " ");
}
}
// Go example using goroutines and sync.WaitGroup
package main

import (
	"fmt"
	"sync"
)

// parallelQuickSort partitions arr around a pivot and sorts the two halves
// in separate goroutines, then reassembles the result in place.
func parallelQuickSort(arr []int, wg *sync.WaitGroup) {
	defer wg.Done()
	if len(arr) <= 1 {
		return
	}
	pivot := arr[len(arr)/2]
	var left, middle, right []int
	for _, v := range arr {
		switch {
		case v < pivot:
			left = append(left, v)
		case v > pivot:
			right = append(right, v)
		default:
			middle = append(middle, v) // keep duplicates of the pivot
		}
	}
	var wgInner sync.WaitGroup
	wgInner.Add(2)
	go parallelQuickSort(left, &wgInner)
	go parallelQuickSort(right, &wgInner)
	wgInner.Wait()
	// Reassemble the sorted pieces back into the original slice.
	copy(arr, append(append(left, middle...), right...))
}

func main() {
	data := []int{9, 4, 6, 2, 7, 5, 3, 8, 1}
	var wg sync.WaitGroup
	wg.Add(1)
	go parallelQuickSort(data, &wg)
	wg.Wait()
	fmt.Println(data)
}
5. Performance Analysis and Profiling
Measuring the effectiveness of a parallel algorithm requires more than just wall‑clock time. Consider the following metrics:
- Speedup (S = Tseq/Tpar)
- Efficiency (E = S / p, where p is the number of processors)
- Scalability (strong vs. weak scaling curves)
- Memory bandwidth utilization
- Synchronization overhead (lock contention, barrier wait time)
Tools such as perf, VTune, nvprof, and language‑specific profilers (e.g., gprof, cProfile) help isolate bottlenecks.
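A minimal way to obtain the first two metrics yourself is to time the same kernel with and without parallelism. The sketch below (OpenMP assumed; the vector update is just a stand-in for real work and is memory-bandwidth bound, so the reported speedup will usually sit well below the core count) prints Tseq, Tpar, S, and E.
#include <omp.h>
#include <vector>
#include <cstdio>

// Runs a placeholder kernel once sequentially and once in parallel, then
// reports speedup S = Tseq / Tpar and efficiency E = S / p.
static void kernel(std::vector<double>& y, const std::vector<double>& x, bool parallel) {
    const int n = static_cast<int>(y.size());
    #pragma omp parallel for if(parallel)
    for (int i = 0; i < n; ++i) {
        y[i] = 2.5 * x[i] + y[i];              // stand-in for real work
    }
}

int main() {
    const int n = 10'000'000;
    std::vector<double> x(n, 1.0), y(n, 0.0);

    double t0   = omp_get_wtime();
    kernel(y, x, /*parallel=*/false);
    double tseq = omp_get_wtime() - t0;

    t0          = omp_get_wtime();
    kernel(y, x, /*parallel=*/true);
    double tpar = omp_get_wtime() - t0;

    const int p = omp_get_max_threads();
    const double S = tseq / tpar;
    std::printf("Tseq=%.3fs Tpar=%.3fs S=%.2f E=%.2f (p=%d)\n", tseq, tpar, S, S / p, p);
    return 0;
}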
6. Best Practices & Common Pitfalls
- Start with a correct sequential version before parallelizing.
- Use high‑level abstractions (e.g., parallel loops, task libraries) to reduce bugs.
- Prefer immutable data or thread‑local storage to avoid data races (see the reduction sketch after this list).
- Profile early so you spend effort on the real bottlenecks; optimizing before the code is correct and measured wastes time and can mask bugs.
- Validate results with deterministic unit tests that run both sequential and parallel versions.
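As a sketch of the data-race advice above (OpenMP assumed; the array contents are arbitrary): the first loop races on a shared accumulator and produces an unreliable result, while the reduction clause gives every thread a private copy that is combined deterministically at the end.
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    std::vector<int> data(1'000'000, 1);
    const int n = static_cast<int>(data.size());

    // Data race: every thread updates the shared accumulator without
    // synchronization, so the printed value is unreliable.
    long racy_sum = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        racy_sum += data[i];                   // DON'T: unsynchronized shared write
    }

    // Fix: reduction gives each thread a private partial sum and combines
    // the partials once at the end of the loop.
    long safe_sum = 0;
    #pragma omp parallel for reduction(+:safe_sum)
    for (int i = 0; i < n; ++i) {
        safe_sum += data[i];
    }

    std::printf("racy=%ld (unreliable)  safe=%ld (expected %d)\n", racy_sum, safe_sum, n);
    return 0;
}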
7. Tools, Libraries, and Frameworks
- OpenMP (C/C++/Fortran) – pragma‑based shared‑memory parallelism.
- MPI – message‑passing interface for distributed systems.
- CUDA / ROCm – low‑level GPU programming.
- Intel Threading Building Blocks (TBB) – task‑based parallelism.
- Cilk Plus / OpenCilk – keyword‑driven task parallelism.
- Apache Spark – data‑parallel processing for big data.
8. Frequently Asked Questions
Q: Can any algorithm be parallelized?
A: No. Algorithms with strong data dependencies or inherently sequential steps (e.g., certain recursive DP formulations) may have limited or no parallel speedup. Identifying the parallelizable portions is the first step.
Q: Is more cores always better?
A: Only if the algorithm scales. After a certain point, overhead from synchronization, communication, and memory contention outweighs the benefits, leading to diminishing returns.
Q: How do I choose between OpenMP and MPI?
A: Use OpenMP for shared‑memory machines (single node, up to a few hundred cores). Use MPI when you need to run across multiple nodes or when the memory model is distributed.
9. Quick Quiz
Q. Which law predicts the theoretical maximum speedup for a fixed problem size given a serial fraction f?
- Gustafson’s Law
- Amdahl’s Law
- Brent’s Theorem
- Little’s Law
Answer: Amdahl’s Law
Amdahl’s Law states S ≤ 1 / (f + (1 − f)/p), where f is the serial fraction and p is the number of processors.
Q. In OpenMP, which clause allows part of a parallel loop body to execute in the original sequential iteration order?
- schedule(static)
- schedule(dynamic)
- ordered
- nowait
Answer: ordered
The ordered clause, combined with an #pragma omp ordered region inside the loop body, forces the marked region to execute in the original iteration order while the rest of each iteration still runs in parallel.
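A minimal sketch of ordered in action (the squaring is placeholder work): the computation before the ordered region runs in parallel under a dynamic schedule, but the prints inside the region come out in iteration order.
#include <omp.h>
#include <cstdio>

int main() {
    // The ordered clause on the loop plus the ordered construct inside it make
    // only the marked block run in the original iteration order; the work
    // before it still executes in parallel.
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < 8; ++i) {
        const int result = i * i;              // computed in parallel (placeholder work)
        #pragma omp ordered
        {
            std::printf("iteration %d -> %d\n", i, result);   // printed in order 0..7
        }
    }
    return 0;
}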