How can I mitigate KV cache memory fragmentation in vLLM when handling highly interleaved requests?
**Problem Summary:** I am deploying a Llama-3-70B model with vLLM and noticing that as the number of concurrent requests increases, GPU memory utilization stays high but throughput (tokens/sec) drops much earlier than baseline benchmarks suggest it should. I suspect KV cache fragmentation in PagedAttention when requests with very different input lengths are interleaved (long-context and short-context queries in the same batch).

**Environment:**
- Model: Llama-3-70B (AWQ quantized)
- Hardware: 2x NVIDIA A100 80GB
- Framework: vLLM v0.4.0
- Serving: FastAPI with asynchronous background tasks

**What I've tried:**
- Raised `gpu_memory_utilization` from 0.90 to 0.95, which increased KV cache capacity but did not resolve the latency spikes.
- Experimented with `max_num_seqs`: setting it too low underutilizes the GPU, while setting it too high causes OOM (out of memory) errors or severe context-switching overhead from request preemption.
- Profiled with the PyTorch Profiler; it shows high kernel execution time for `paged_attention_v2` during peak load.

**The Question:** How can I programmatically tune or monitor block allocation efficiency in vLLM to minimize this fragmentation? Specifically, is there a way to prioritize request batching based on prefix sharing (prefix caching) so that redundant KV cache blocks are not allocated in a production environment?
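For context, here is a minimal sketch of the kind of engine configuration I am experimenting with. The parameter values are illustrative rather than my exact production settings, and `enable_prefix_caching` is the automatic prefix caching option in the vLLM engine arguments; I have not yet confirmed how it behaves together with AWQ on v0.4.0.

```python
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

# Illustrative engine configuration; values are placeholders, not tuned production settings.
engine_args = AsyncEngineArgs(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # served from an AWQ checkpoint in my case
    quantization="awq",
    tensor_parallel_size=2,            # 2x A100 80GB
    gpu_memory_utilization=0.90,
    max_num_seqs=64,                   # the knob I have been trying to tune
    max_num_batched_tokens=8192,       # caps the number of tokens batched per scheduler step
    block_size=16,                     # KV cache block granularity used by PagedAttention
    enable_prefix_caching=True,        # automatic prefix caching: shared-prefix blocks are reused
    disable_log_stats=False,           # keep the periodic "GPU KV cache usage" log line
)

engine = AsyncLLMEngine.from_engine_args(engine_args)

async def generate(prompt: str, request_id: str) -> str:
    """Called from a FastAPI background task; collects the final output text."""
    params = SamplingParams(max_tokens=512, temperature=0.7)
    final_text = ""
    async for request_output in engine.generate(prompt, params, request_id):
        final_text = request_output.outputs[0].text
    return final_text
```

My hope is that with prefix caching enabled, requests sharing a long system prompt reuse the same physical blocks instead of each allocating their own, which should reduce both fragmentation pressure and redundant prefill. What I cannot tell is whether the scheduler can be steered to batch shared-prefix requests together.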
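On the monitoring side, the only signals I have found so far are the engine's periodic stats log line and the Prometheus metrics exposed on the OpenAI-compatible server's `/metrics` endpoint. Below is a minimal sketch of what I poll; the metric names are taken from vLLM's metrics module as of v0.4.0 and are worth double-checking on your build, and the URL is a hypothetical local deployment.

```python
import time

import requests  # assumes the OpenAI-compatible server is running and exposes /metrics

METRICS_URL = "http://localhost:8000/metrics"  # hypothetical local deployment
WATCHED = (
    "vllm:gpu_cache_usage_perc",   # fraction of GPU KV cache blocks currently in use
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:num_requests_swapped",   # non-zero means the scheduler is preempting to CPU swap space
)

def sample_kv_cache_stats() -> dict:
    """Scrape the Prometheus text endpoint and keep only the gauges of interest."""
    stats = {}
    for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
        if line.startswith("#"):
            continue
        for name in WATCHED:
            if line.startswith(name):
                # Prometheus text format: "<name>{labels} <value>"
                stats[name] = float(line.rsplit(" ", 1)[-1])
    return stats

if __name__ == "__main__":
    while True:
        print(sample_kv_cache_stats())
        time.sleep(10)
```

The idea is to correlate the latency spikes with cache usage and preemption rather than with raw GPU memory, since `gpu_memory_utilization` only controls the size of the reserved pool, not how efficiently the blocks inside it are used. Is there any finer-grained, programmatic view into block allocation (e.g. allocated vs. shared blocks) that I am missing?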