vLLM-Omni fails with "NVMLError_InvalidArgument" and "V1 LLMEngine" errors in spawned worker processes
22:29 27 Jan 2026

Problem

I'm trying to run vLLM-Omni (v0.11.0rc1) with the Qwen2.5-Omni-7B model on an NVIDIA A100 GPU, but initialization fails with two critical errors in spawned worker processes:

  1. NVML Invalid Argument Error in one worker:

    vllm.third_party.pynvml.NVMLError_InvalidArgument: Invalid Argument

    This occurs at:

    handle = pynvml.nvmlDeviceGetHandleByIndex(physical_device_id)

  2. V1 Engine Mismatch Error in the other workers:

    ValueError: Using V1 LLMEngine, but envs.VLLM_USE_V1=False.
    This should not happen. As a workaround, try using LLMEngine.from_vllm_config(...)
    or explicitly set VLLM_USE_V1=0 or 1 and report this issue on Github.

All 3 spawned processes fail, causing the orchestrator to time out:

WARNING: [Orchestrator] Initialization timeout: only 0/3 stages are ready; not ready: [0, 1, 2]
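
To separate the NVML failure from vLLM-Omni itself, here is a minimal probe that mirrors what the failing worker does: it spawns a child process and asks NVML for the handle of physical device 0. This is only a sketch; it uses the standalone pynvml package, which I'm assuming exposes the same nvml* calls as the vendored vllm.third_party.pynvml.

# nvml_spawn_check.py - NVML handle lookup from a spawn-launched child,
# independent of vLLM-Omni. Assumes the standalone `pynvml` package is installed.
import multiprocessing as mp
import os


def probe(device_index: int) -> None:
    import pynvml  # imported in the child, as a spawned worker would do

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        name = pynvml.nvmlDeviceGetName(handle)
        print(f"child pid={os.getpid()}, "
              f"CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES')}: {name}")
    finally:
        pynvml.nvmlShutdown()


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    p = ctx.Process(target=probe, args=(0,))  # index 0, as in the failing call
    p.start()
    p.join()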

Environment

  • GPU: NVIDIA A100-SXM4-40GB

  • vLLM: 0.11.0

  • vLLM-Omni: 0.11.0rc1

  • Python: 3.10

  • PyTorch: CUDA available in main process

  • Multiprocessing: spawn method

  • Environment variables set in bash:

    • CUDA_VISIBLE_DEVICES=0

    • VLLM_USE_V1=0

    • VLLM_WORKER_MULTIPROC_METHOD=spawn

Code

import os
import soundfile as sf
import torch

def main():
    from vllm_omni.entrypoints.omni_llm import OmniLLM
    from vllm.sampling_params import SamplingParams
    
    print("=== Starting vLLM-Omni Test ===")
    print(f"Environment: VLLM_USE_V1={os.environ.get('VLLM_USE_V1', 'NOT SET')}")
    print(f"PyTorch CUDA: {torch.cuda.is_available()}, Devices: {torch.cuda.device_count()}")
    
    audio_path = "/scratch/users/ntu/es0001an/dataset_generated/001_input.wav"
    os.makedirs(os.path.dirname(audio_path), exist_ok=True)
    
    if not os.path.exists(audio_path):
        sf.write(audio_path, torch.zeros(16000).numpy(), 16000)
        print(f"Created dummy audio at {audio_path}")

    print("\n=== Initializing OmniLLM ===")
    
    engine = OmniLLM(
        model="Qwen/Qwen2.5-Omni-7B",
        trust_remote_code=True,
        dtype="bfloat16",
        runtime={"devices": [[0], [0], [0]]},
        init_sleep_seconds=180, 
        max_model_len=2048,
        disable_custom_all_reduce=True,
        enforce_eager=True,
    )

    prompt = {
        "prompt": (
            "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
            "<|im_start|>user\n<|audio_bos|><|AUDIO|><|audio_eos|>\n"
            "Describe this audio in detail.<|im_end|>\n<|im_start|>assistant\n"
        ),
        "multi_modal_data": {"audio": [audio_path]}
    }

    sampling_params = SamplingParams(temperature=0.7, max_tokens=512)
    sampling_params_list = [sampling_params, sampling_params, sampling_params]

    print("\n=== Generating Response ===")
    try:
        results = engine.generate([prompt], sampling_params_list)
        
        if results and len(results) > 0:
            result = results[0]
            print(f"\n{'='*60}")
            print("SUCCESS!")
            print(f"{'='*60}")
            print(result)
            
            if hasattr(result, 'outputs') and result.outputs:
                for idx, output in enumerate(result.outputs):
                    if hasattr(output, 'text') and output.text:
                        print(f"\nText: {output.text}")
                    if hasattr(output, 'audio') and output.audio is not None:
                        audio_file = f'output_{idx}.wav'
                        sf.write(audio_file, output.audio, 24000)
                        print(f"Audio saved to: {audio_file}")
        else:
            print("No results returned")
            
    except Exception as e:
        print(f"Error: {e}")
        import traceback
        traceback.print_exc()

if __name__ == '__main__':
    main()

What I've Tried

  1. Setting VLLM_USE_V1=0 in the bash launch script (not in Python); it still fails

  2. Using a single GPU, with all three stages pinned to device 0 via runtime={"devices": [[0], [0], [0]]}

  3. Verified that PyTorch can access the GPU in the main process

  4. Added enforce_eager=True and disable_custom_all_reduce=True

  5. Setting environment variables in Python with os.environ; they don't seem to propagate to the spawned children (see the propagation check after this list)
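
Regarding item 5, a minimal probe like the one below should show whether variables set via os.environ in the parent are actually visible in a spawn-launched child. It is only a sketch; the variable names are the ones from my environment.

# env_spawn_check.py - does a spawn-launched child inherit the parent's env vars?
import multiprocessing as mp
import os


def report() -> None:
    # Runs in the child; prints what the spawned process actually sees.
    for key in ("VLLM_USE_V1", "CUDA_VISIBLE_DEVICES", "VLLM_WORKER_MULTIPROC_METHOD"):
        print(f"child pid={os.getpid()}: {key}={os.environ.get(key, 'NOT SET')}")


if __name__ == "__main__":
    # Set only in the parent, before spawning, so the child can only see
    # these values if it inherits the parent's environment.
    os.environ.setdefault("VLLM_USE_V1", "0")
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")

    ctx = mp.get_context("spawn")
    p = ctx.Process(target=report)
    p.start()
    p.join()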

Questions

  1. Why does NVML fail to get a GPU handle in the spawned processes when CUDA_VISIBLE_DEVICES=0 is set and the main process can access the GPU fine?

  2. Why does vLLM-Omni use the V1 LLMEngine despite VLLM_USE_V1=0 being explicitly set in the shell environment?

  3. Is this a known bug in vLLM-Omni 0.11.0rc1, or is there a correct way to configure multi-stage initialization?

  4. Should I try (a rough sketch of both toggles is below):

    • Setting VLLM_USE_V1=1 instead?

    • Using fork instead of spawn?
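
For reference, this is roughly how I would apply the two toggles from question 4. It is only a sketch (not verified to fix anything) and assumes both variables are read when vLLM is first imported, which is why they are set before any vllm / vllm_omni import.

# toggle_sketch.py - rough sketch for question 4.
import os

os.environ["VLLM_USE_V1"] = "1"                      # try the V1 engine explicitly
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "fork"  # try fork instead of spawn

# Caveat with fork: CUDA must not be initialized in the parent before the
# workers start, so the torch.cuda.is_available()/device_count() calls in my
# script would have to go.
from vllm_omni.entrypoints.omni_llm import OmniLLM  # noqa: E402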

Any insights on resolving these multiprocessing/GPU initialization issues would be greatly appreciated!

python vllm