I am currently developing a behavioral transformation protocol (Breath Realm) that uses MediaPipe for real-world habit verification. To preserve privacy and reduce thermal throttling on mobile devices, we have moved our behavioral logic onto an on-device INT4-quantized LLM.
However, I am hitting a bottleneck: when the MediaPipe vision pipeline and the quantized LLM inference run concurrently on mid-range Android/iOS devices, the combined memory footprint triggers aggressive background-process reclamation (e.g., Android's low-memory killer or iOS jetsam), and the app gets killed.
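One mitigation I'm prototyping is unloading the LLM's weights and KV cache when the OS signals memory pressure, then lazily reloading on the next request. This is a pure-Python sketch; `LlmSession`, `loader`, and `on_trim_memory` are placeholder names of my own, not APIs from MediaPipe or any runtime (the real hooks would be Android's `ComponentCallbacks2.onTrimMemory` or iOS's `didReceiveMemoryWarning`):

```python
class LlmSession:
    """Hypothetical wrapper around an on-device LLM runtime.

    Drops the heavyweight model state when the OS reports memory
    pressure, so the process shrinks instead of being killed, and
    transparently reloads the model on the next generate() call.
    """

    def __init__(self, loader):
        self._loader = loader  # callable returning a loaded model handle
        self._model = None     # None means weights/KV cache are unloaded

    def _ensure_loaded(self):
        if self._model is None:
            self._model = self._loader()  # lazy (re)load
        return self._model

    def generate(self, prompt):
        return self._ensure_loaded()(prompt)

    def on_trim_memory(self):
        # Called from the platform memory-pressure callback: release
        # the model handle so its memory can be reclaimed.
        self._model = None
```

The trade-off is a reload hiccup on the first inference after a trim, which for our use case is preferable to an outright process kill.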
Question: Are there recommended strategies for dividing NPU/GPU resources between a synchronized vision task and INT4-quantized text inference, so that we can sustain a 30 FPS verification rate without overheating?
Context: Our goal is a low-latency, privacy-first Edge AI environment for habit formation in young users.
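For reference, here is a minimal sketch of the frame-pacing approach I'm experimenting with: the vision path runs every frame, and LLM decode steps are time-sliced into whatever slack remains inside the ~33 ms frame budget, backing off entirely while the device reports thermal throttling. This is a pure-Python simulation with placeholder timings, not real MediaPipe or LLM-runtime calls:

```python
FRAME_BUDGET_S = 1.0 / 30  # ~33 ms per frame to hold 30 FPS

class InterleavedScheduler:
    """Time-slices LLM decode steps into the slack left after each
    vision frame, so the vision path never misses its 30 FPS deadline.

    vision_cost_s / llm_step_cost_s are assumed, pre-measured per-frame
    and per-token costs; a real implementation would measure them.
    """

    def __init__(self, vision_cost_s, llm_step_cost_s):
        self.vision_cost_s = vision_cost_s
        self.llm_step_cost_s = llm_step_cost_s
        self.pending_llm_steps = 0
        self.completed_llm_steps = 0
        self.frames_rendered = 0

    def submit_llm_job(self, n_steps):
        """Queue n_steps decode steps (e.g., tokens) of LLM work."""
        self.pending_llm_steps += n_steps

    def run_frame(self, thermal_throttled=False):
        """Simulate one frame; returns the time spent, <= FRAME_BUDGET_S."""
        elapsed = self.vision_cost_s  # vision always runs first
        self.frames_rendered += 1
        # Spend the remaining slack on LLM decode steps, unless the
        # device is throttling, in which case yield the slack to cool.
        if not thermal_throttled:
            while (self.pending_llm_steps > 0
                   and elapsed + self.llm_step_cost_s <= FRAME_BUDGET_S):
                elapsed += self.llm_step_cost_s
                self.pending_llm_steps -= 1
                self.completed_llm_steps += 1
        return elapsed
```

The throttling flag would come from something like Android's thermal status listener; the point is that text inference degrades gracefully (slower responses) while the 30 FPS vision deadline is never sacrificed. Is this interleaving pattern sensible, or is there a better-supported way to partition NPU/GPU time between the two workloads?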