I am integrating Twilio Media Streams with a real-time speech generation and transcription service to handle incoming phone calls (e.g., a bakery ordering system).
The main challenge I am facing is reconstructing the correct conversation history, specifically what the caller actually heard, when interruptions (barge-in) occur.
Issue:
In a real-time call, the speech service continuously emits:
Audio output chunks
Transcript deltas for that audio
However, transcript deltas are emitted even if the audio is never fully played, or if playback is interrupted and cleared on the Twilio side.
When the caller interrupts, I detect this using a speech-start signal (e.g., input_audio_buffer.speech_started) and immediately:
Truncate the currently playing output
Clear the Twilio audio stream
Resume listening to the caller
Despite this, the transcript stream continues to emit deltas for content that was generated but never heard.
As a result, the transcript history becomes incorrect and no longer reflects what the caller actually experienced.
Example
Conversation flow:
System: Yes, I can help you place your order. What would you like?
Customer: I want bagels and a soda.
System: That will be $9.05. Cash or credit?
(Customer interrupts here)
(Customer never hears the rest of the sentence)
Customer: Sorry for interrupting — do you also have coffee?
System: Sure. What would you like to order?
What actually gets generated internally (undesired)
System: That will be $9.05. Cash or credit? We also have a store discount you can apply now.
The final sentence is generated and transcribed, but never played to the caller because the stream was cleared.
What I want to actually get:
I am not trying to collect all generated transcript deltas.
I specifically need to reconstruct only the text corresponding to audio that Twilio actually played to the caller.
This includes handling cases where:
Output is partially played
Output is truncated mid-sentence
The caller interrupts while audio is playing
Current Code
Audio output is sent to Twilio like this:
# ---------------------------------------------
# AGENT AUDIO → TWILIO
# ---------------------------------------------
if rtype in ("response.output_audio.delta", "response.audio.delta"):
    agent_cutoff = False
    delta = response.get("delta")
    if not delta or not stream_sid:
        continue
    await websocket.send_json({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": delta},
    })
    last_audio_ts = time.monotonic()
    if response_start_timestamp_twilio is None:
        response_start_timestamp_twilio = latest_media_timestamp
    item_id = response.get("item_id")
    if item_id:
        last_assistant_item = item_id
    await send_mark(websocket, stream_sid)
    continue
Each audio chunk sent to Twilio is followed by a mark so playback can be acknowledged later.
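A minimal sketch of the mark bookkeeping this relies on, assuming every mark gets a unique name. The body of send_mark is not shown above, so the MarkTracker class and the chunk-N naming scheme are hypothetical. The message shape follows Twilio's documented mark message, which Twilio echoes back with the same name once the audio queued before it has finished playing.

```python
import itertools
from collections import deque


class MarkTracker:
    """Hypothetical helper: tracks marks sent to Twilio and marks echoed back."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.pending = deque()  # mark names sent but not yet echoed back
        self.played = []        # mark names Twilio has echoed back, in order

    def next_mark_message(self, stream_sid):
        """Build the 'mark' message to send right after an audio chunk."""
        name = f"chunk-{next(self._counter)}"  # assumed naming scheme
        self.pending.append(name)
        return {"event": "mark", "streamSid": stream_sid, "mark": {"name": name}}

    def on_twilio_event(self, data):
        """Record an echoed mark. Returns the mark name, or None otherwise."""
        if data.get("event") != "mark":
            return None
        name = data.get("mark", {}).get("name")
        if name in self.pending:
            self.pending.remove(name)
            self.played.append(name)
            return name
        return None
```

With unique names, each echoed mark identifies exactly one audio chunk, which is what makes per-chunk acknowledgement possible.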
Interruption Handling
When the caller starts speaking, I detect it and immediately truncate output and clear Twilio playback:
await openai_ws.send(json.dumps({
    "type": "conversation.item.truncate",
    "item_id": last_assistant_item,
    "content_index": 0,
    "audio_end_ms": elapsed_ms,
}))
await websocket.send_json({
    "event": "clear",
    "streamSid": stream_sid,
})
elapsed_ms is computed using Twilio media timestamps.
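For reference, a sketch of how that value can be derived, assuming the same approach as the Twilio sample: the difference between the timestamp of the latest inbound media message and the timestamp recorded when the first audio delta of the current response was forwarded. The function name compute_elapsed_ms is hypothetical; the two arguments correspond to the variables in the snippet above.

```python
def compute_elapsed_ms(latest_media_timestamp, response_start_timestamp_twilio):
    """Milliseconds of the current response that could have played so far.

    Twilio reports media timestamps as milliseconds since the stream started
    (as strings in the JSON payload), so both values are coerced to int.
    Returns None when no response audio has been forwarded yet.
    """
    if response_start_timestamp_twilio is None:
        return None
    elapsed = int(latest_media_timestamp) - int(response_start_timestamp_twilio)
    return max(0, elapsed)  # guard against out-of-order timestamps
```

This is an estimate of playback position rather than a measurement, since it assumes playback began as soon as the first chunk was sent.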
Transcript Handling
Transcript deltas arrive continuously, even if audio is later truncated or cleared.
To avoid logging text that was never played, I currently attach transcript deltas to the most recent unacknowledged mark:
# ---------------------------------------------
# OUTPUT TRANSCRIPT DELTA
# ---------------------------------------------
if rtype == "response.output_audio_transcript.delta":
    if agent_cutoff or not mark_queue:
        continue
    delta = response.get("delta")
    if not delta:
        continue
    active_mark_id = mark_queue[-1]
    pending_agent_transcripts.setdefault(active_mark_id, []).append(delta)
    continue
Despite this, transcript deltas continue to arrive for audio that was never played.
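One possible shape for the bookkeeping, as a sketch under stated assumptions rather than a confirmed fix: text is committed only when its mark is echoed back, and everything still pending is dropped on interruption. The PlayedTranscript class and its method names are hypothetical. Two caveats apply. First, as far as I understand the Twilio documentation, a clear message causes the remaining marks to be sent back as well, so marks echoed after an interruption cannot be treated as played. Second, transcript deltas are not guaranteed to line up with audio chunks, so this is accurate at chunk granularity at best.

```python
class PlayedTranscript:
    """Hypothetical ledger: commits text only for marks acknowledged as played."""

    def __init__(self):
        self.pending = {}        # mark_id -> list of transcript deltas
        self.committed = []      # text whose mark was echoed before any interrupt
        self.interrupted = False

    def on_delta(self, mark_id, text):
        """Buffer a transcript delta under the mark that was active when it arrived."""
        if self.interrupted:
            return  # late deltas from a truncated response are ignored
        self.pending.setdefault(mark_id, []).append(text)

    def on_mark_played(self, mark_id):
        """Commit buffered text for a mark Twilio echoed back."""
        parts = self.pending.pop(mark_id, None)
        if parts and not self.interrupted:
            self.committed.append("".join(parts))

    def on_interrupt(self):
        """Call on barge-in, before sending 'clear' to Twilio."""
        self.interrupted = True
        self.pending.clear()  # marks echoed because of the clear will find nothing

    def on_new_response(self):
        """Call when the next assistant response starts."""
        self.interrupted = False

    def text(self):
        return "".join(self.committed)
```

Using the bakery example, text attached to marks echoed before the interruption ("That will be $9.05. Cash or credit?") would be committed, while the store-discount sentence would be discarded along with the other pending entries.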
Question
What is the correct or recommended way to reconstruct a reliable conversation transcript in this scenario? How can I reliably determine which transcript segments correspond to audio actually played?
I am following an official Twilio example that uses playback markers and interruption detection, but the issue persists. The example is Pull Request #13 in twilio-samples/speech-assistant-openai-realtime-api-python ("Initial add AI interruption/conversation truncation. AI talks first." by pkamp3).
Thank you so so much