I'm trying to run calculations on multiple cores in Python across multiple platforms (Linux, macOS, Windows). I need to pass a large CustomClass object and a dict (both read-only) to all workers. So far I have used multiprocessing with Pool. On platforms using fork (Linux, macOS) this speeds up the computation. On Windows (spawn) multiprocessing is much slower and memory-inefficient: the worker processes are spawned sequentially, and the startup takes much longer than the actual computation.
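For reference, the start method can be checked with a minimal, standalone snippet like this (not part of my code, just to confirm which method is active on each platform):

import multiprocessing as mp

if __name__ == "__main__":
    # "fork" on Linux, "spawn" on Windows; on macOS the default depends on the Python version
    print(mp.get_start_method())
    # Forcing "spawn" on Linux/macOS reproduces the slow behaviour I see on Windows:
    # mp.set_start_method("spawn", force=True)

My current (simplified) implementation: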
import multiprocessing as mp

import numpy as np
from tqdm import tqdm


class CustomClass:
    def __init__(self, data):
        # assigns multiple huge vectors and objects
        ...


class ParallelComputation:
    ...

    def compute(self, n_jobs=-1):
        custom_obj = CustomClass(data)  # ~1 GB of memory
        cfg = dict(some_config=True)
        inputs = [(0, np.random.rand(1, 300)), (1, np.random.rand(1, 300)), ...]  # list of (index, row) tuples
        n_rows = len(inputs)
        out = np.empty((n_rows, 300))
        worker = worker_task
        processes = None if n_jobs == -1 else n_jobs  # None lets Pool use all cores
        with mp.Pool(processes=processes, initializer=_set_globals,
                     initargs=(custom_obj, cfg)) as pool:
            for result in tqdm(pool.imap(worker, inputs, chunksize=8), total=n_rows):
                i, row = result
                out[i, :] = row
        ...


def worker_task(args):
    i, row = args
    lib = _G_LIB  # the large read-only object set by the initializer
    cfg = _G_CFG
    # some computation on row using lib and cfg
    return i, row


def _set_globals(custom_obj, cfg):
    # Pool initializer: runs once in every worker process and stores the
    # read-only objects as module-level globals.
    global _G_LIB, _G_CFG
    _G_LIB = custom_obj
    _G_CFG = cfg
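Under spawn the initargs are pickled and sent to every worker, which I assume is where the memory and startup cost comes from. A rough way to estimate the per-worker payload (run with custom_obj and cfg from compute above in scope):

import pickle

# Rough estimate of what each spawned worker receives through the initializer:
payload = pickle.dumps((custom_obj, cfg), protocol=pickle.HIGHEST_PROTOCOL)
print(f"initargs payload: {len(payload) / 1e6:.1f} MB per worker")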
To run the computation on Windows I also have to guard the entry point:
if __name__ == "__main__":
    parallel = ParallelComputation(...)
    parallel.compute()
I also tried joblib, which does spawn its worker processes in parallel, but it consumes a lot of memory as well (a sketch of that attempt is at the end of this post). Since I never modify custom_obj or cfg, SharedMemory or a multiprocessing.Manager do not seem like the right fit. How can I use multiprocessing with a large CustomClass object efficiently on Windows?
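For reference, the joblib attempt mentioned above was roughly the following (worker_task_joblib is a hypothetical per-item function; the large object is passed as an argument, so it still gets pickled for the workers):

from joblib import Parallel, delayed

def worker_task_joblib(lib, cfg, item):
    i, row = item
    # same computation as worker_task, with the objects passed explicitly
    return i, row

# custom_obj, cfg and inputs as in compute() above
results = Parallel(n_jobs=-1)(
    delayed(worker_task_joblib)(custom_obj, cfg, item) for item in inputs
)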