Question Body
I am working on an embedded Linux project (i.MX8 platform) where I need to share raw camera video across multiple processes using a producer/consumer architecture.
Architecture
I have:
Producer process
Captures the camera using v4l2src
Sends raw NV12 1080p30 frames to shared memory using shmsink
Multiple consumer processes
One records to file
One performs AI inference
One (or more) provides RTSP streaming
All consume raw frames via shmsrc
Example producer pipeline:
gst-launch-1.0 \
v4l2src device=/dev/video3 io-mode=dmabuf ! \
video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
queue ! \
shmsink socket-path=/tmp/cam.sock wait-for-connection=false sync=false
Example consumer (RTSP branch):
shmsrc socket-path=/tmp/cam.sock is-live=true do-timestamp=true ! \
video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
videoscale ! videorate ! \
v4l2h264enc ! h264parse ! rtph264pay
Problem
CPU usage is extremely high.
The producer alone consumes ~100% of one core
Producer + 3 RTSP branches consume ~360% of the 400% total (quad-core system)
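For reference, a per-process breakdown (sketch; assumes the sysstat package provides pidstat, and the pgrep pattern is illustrative) shows where the cycles go:

```shell
# Sample per-process CPU usage once per second, five times, for every
# gst-launch process (adjust the pattern to the actual process names).
pidstat -u -p "$(pgrep -d, -f gst-launch)" 1 5
```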
This is unexpected because:
Encoding is hardware accelerated (v4l2h264enc)
Capture uses io-mode=dmabuf
No software decoding is involved
However, sharing raw frames across processes appears very expensive.
Observations
Each branch reads raw 1080p frames from shared memory.
Each branch performs scaling and framerate conversion independently.
Shared memory causes one memory copy per branch.
Total memory bandwidth becomes very high:
1920 × 1080 × 1.5 bytes ≈ 3 MB per NV12 frame
3 MB × 30 fps ≈ 90 MB/s per branch, multiplied across branches
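The arithmetic can be checked directly with shell arithmetic (the branch count of 3 matches the three RTSP branches above):

```shell
# NV12 is 12 bits/pixel, i.e. 1.5 bytes/pixel.
FRAME_BYTES=$((1920 * 1080 * 3 / 2))   # bytes per 1080p NV12 frame
PER_BRANCH=$((FRAME_BYTES * 30))       # bytes/s per consumer at 30 fps
BRANCHES=3                             # e.g. the three RTSP branches
TOTAL=$((PER_BRANCH * BRANCHES))
echo "frame:      ${FRAME_BYTES} B (~3.1 MB)"
echo "per branch: ${PER_BRANCH} B/s (~93 MB/s)"
echo "total:      ${TOTAL} B/s (~280 MB/s)"
```

And that total counts only the reads; the shmsink write and any intermediate copies add to it.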
Using tee inside a single process reduces CPU usage significantly, but:
Static tee does not meet my requirements.
I need dynamic branch creation and removal.
I need true inter-process separation (producer/consumer model).
appsrc introduces heavy copying and does not share buffers across processes.
I want to strictly avoid re-encoding and re-decoding between processes.
UDP/RTP transport is not preferred because it requires encoding.
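For completeness, here is an untested sketch of the "encode once, share the compressed stream" variant I am asking about below (the /tmp/cam264.sock path and the config-interval choice are mine, not from a working setup; shmsrc caps must be set by hand because shm does not carry caps):

```shell
# Hypothetical encode-once producer: hardware-encode a single time,
# then fan the compressed stream out over shared memory.
gst-launch-1.0 \
  v4l2src device=/dev/video3 io-mode=dmabuf ! \
  video/x-raw,format=NV12,width=1920,height=1080,framerate=30/1 ! \
  v4l2h264enc ! h264parse config-interval=-1 ! \
  shmsink socket-path=/tmp/cam264.sock wait-for-connection=false sync=false

# Hypothetical RTSP consumer: no re-encode, only parse + payload.
gst-launch-1.0 \
  shmsrc socket-path=/tmp/cam264.sock is-live=true do-timestamp=true ! \
  video/x-h264,stream-format=byte-stream,alignment=au ! \
  h264parse ! rtph264pay
```

This would cut the shared bandwidth from raw frames to an H.264 elementary stream, but the recording and AI branches would then receive compressed rather than raw video.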
Question
What is the best architecture to:
Share raw camera video across multiple processes
Avoid excessive CPU usage
Avoid repeated encoding/decoding
Allow dynamic branch creation
Maintain process isolation
Specifically:
Is shmsink/shmsrc inherently copy-heavy and memory-bandwidth limited?
Is there a way to share DMABUF across processes without copying?
Would encoding once and sharing compressed stream be the only scalable solution?
Are there NXP/i.MX-specific mechanisms (DMA-BUF export, V4L2 memory sharing, imx plugins) better suited for this?
Is there a recommended design pattern for this use case on embedded systems?
Goal
My goal is to:
Avoid multiple encode/decode cycles
Avoid unnecessary copies
Keep CPU usage minimal
Support dynamic consumer processes
Any architectural guidance or NXP-specific recommendations would be highly appreciated.