Realtime OpenAI and Twilio: How to reconstruct the correct conversation transcript when speech output is interrupted
09:17 14 Jan 2026

I am integrating Twilio Media Streams with a real-time speech generation and transcription service to handle incoming phone calls (e.g., a bakery ordering system).

The main challenge I am facing is reconstructing the correct conversation history, specifically what the caller actually heard, when interruptions (barge-in) occur.

Issue:

In a real-time call, the speech service continuously emits:

  • Audio output chunks

  • Transcript deltas for that audio

However, transcript deltas are emitted even if the audio is never fully played, or if playback is interrupted and cleared on the Twilio side.

When the caller interrupts, I detect this using a speech-start signal (e.g., input_audio_buffer.speech_started) and immediately:

  1. Truncate the currently playing output

  2. Clear the Twilio audio stream

  3. Resume listening to the caller

Despite this, the transcript stream continues to emit deltas for content that was generated but never heard.

As a result, the transcript history becomes incorrect and no longer reflects what the caller actually experienced.

Example

Conversation flow:

System: Yes, I can help you place your order. What would you like?
Customer: I want bagels and a soda.
System: That will be $9.05. Cash or credit?
(Customer interrupts here)
(Customer never hears the rest of the sentence)
Customer: Sorry for interrupting — do you also have coffee?
System: Sure. What would you like to order?

What actually gets generated internally (undesired)

System: That will be $9.05. Cash or credit? We also have a store discount you can apply now.

The final sentence is generated and transcribed, but never played to the caller because the stream was cleared.

What I actually want:

I am not trying to collect all generated transcript deltas.

I specifically need to reconstruct only the text corresponding to audio that Twilio actually played to the caller.

This includes handling cases where:

  • Output is partially played

  • Output is truncated mid-sentence

  • The caller interrupts while audio is playing
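
For the partial-playback cases, one rough fallback is to estimate the heard text from the fraction of the response's audio that played. The sketch below is a heuristic, not something either API provides. It assumes 8 kHz G.711 mu-law audio (one byte per sample, the format Twilio Media Streams carries) and a roughly uniform speech rate across the response. The helper names `ulaw_duration_ms` and `estimate_heard_text` are hypothetical.

```python
import base64


def ulaw_duration_ms(payload_b64: str) -> float:
    """Duration of a base64 mu-law chunk, assuming 8000 one-byte samples per second."""
    return len(base64.b64decode(payload_b64)) / 8.0  # 8 bytes per millisecond


def estimate_heard_text(full_text: str, played_ms: float, total_ms: float) -> str:
    """Heuristic: keep the share of the text that matches the share of audio played."""
    if total_ms <= 0 or played_ms <= 0:
        return ""
    if played_ms >= total_ms:
        return full_text
    cut = int(len(full_text) * played_ms / total_ms)
    # If the cut lands inside a word, back up to the previous space.
    if not full_text[cut].isspace():
        cut = max(full_text.rfind(" ", 0, cut), 0)
    return full_text[:cut].rstrip()
```

Summing `ulaw_duration_ms` over the chunks sent for a response gives `total_ms`; `played_ms` would come from whatever playback estimate is available at the moment of interruption. The result is only as good as the uniform-rate assumption.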

Current Code

Sending audio to Twilio

Audio output is sent to Twilio like this:

# ---------------------------------------------
# AGENT AUDIO → TWILIO
# ---------------------------------------------
if rtype in ("response.output_audio.delta", "response.audio.delta"):
    agent_cutoff = False

    delta = response.get("delta")
    if not delta or not stream_sid:
        continue

    # Forward the base64 audio chunk to Twilio unchanged.
    await websocket.send_json({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": delta},
    })

    last_audio_ts = time.monotonic()

    # Remember the Twilio timestamp at which this response started.
    if response_start_timestamp_twilio is None:
        response_start_timestamp_twilio = latest_media_timestamp

    item_id = response.get("item_id")
    if item_id:
        last_assistant_item = item_id

    # Queue a mark after the chunk so playback can be acknowledged.
    await send_mark(websocket, stream_sid)
    continue


Each audio chunk sent to Twilio is followed by a mark so playback can be acknowledged later.
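
A helper like `send_mark` and the matching acknowledgement handler could look roughly like the sketch below. The `mark` message shape follows Twilio's Media Streams format; the module-level `mark_queue`, the counter-based mark names, and `handle_twilio_mark` are assumptions for illustration. Twilio echoes a mark once the audio queued before it has finished playing, but it also sends back any outstanding marks when the buffer is emptied with `clear`, so an echo received after an interruption does not by itself prove the audio was heard.

```python
import itertools
from collections import deque
from typing import Optional

mark_queue = deque()             # names of marks sent but not yet echoed by Twilio
_mark_ids = itertools.count(1)   # hypothetical naming scheme: chunk-1, chunk-2, ...


async def send_mark(websocket, stream_sid: str) -> str:
    """Send a uniquely named mark after an audio chunk and remember it."""
    name = f"chunk-{next(_mark_ids)}"
    await websocket.send_json({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })
    mark_queue.append(name)
    return name


def handle_twilio_mark(message: dict) -> Optional[str]:
    """Handle a mark event from Twilio; return the acknowledged name if it was pending."""
    name = message.get("mark", {}).get("name")
    if name in mark_queue:
        mark_queue.remove(name)
        return name
    return None
```

Unique names make it possible to tell which chunk an echo refers to, which a fixed mark name would not.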

Interruption Handling

When the caller starts speaking, I detect it and immediately truncate output and clear Twilio playback:

await openai_ws.send(json.dumps({
    "type": "conversation.item.truncate",
    "item_id": last_assistant_item,
    "content_index": 0,
    "audio_end_ms": elapsed_ms,
}))

await websocket.send_json({
    "event": "clear",
    "streamSid": stream_sid,
})


elapsed_ms is computed using Twilio media timestamps.
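
A minimal sketch of that computation follows. It assumes `latest_media_timestamp` holds the `timestamp` of the most recent inbound Twilio `media` message (milliseconds since the stream started) and that `response_start_timestamp_twilio` was captured when the first audio delta of the response was forwarded. The function name `compute_elapsed_ms` is hypothetical.

```python
from typing import Optional


def compute_elapsed_ms(latest_media_timestamp: int,
                       response_start_timestamp_twilio: Optional[int]) -> Optional[int]:
    """Milliseconds since the current response started, or None if none is active."""
    if response_start_timestamp_twilio is None:
        return None
    # Clamp at zero in case timestamps arrive out of order.
    return max(0, latest_media_timestamp - response_start_timestamp_twilio)
```

This measures time since the first chunk was sent, so it matches the playback position only if Twilio starts playing audio as soon as it arrives.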

Transcript Handling

Transcript deltas arrive continuously, even if audio is later truncated or cleared.
To avoid logging text that was never played, I currently attach transcript deltas to the most recent unacknowledged mark:



# ---------------------------------------------
# OUTPUT TRANSCRIPT DELTA
# ---------------------------------------------
if rtype == "response.output_audio_transcript.delta":
    # Skip text once the agent has been cut off or when no mark is outstanding.
    if agent_cutoff or not mark_queue:
        continue

    delta = response.get("delta")
    if not delta:
        continue

    # Attach the text to the most recently sent, still unacknowledged mark.
    active_mark_id = mark_queue[-1]
    pending_agent_transcripts.setdefault(active_mark_id, []).append(delta)
    continue




Despite this, transcript deltas continue to arrive for audio that was never played.
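
The mark-based bookkeeping described above can be sketched as a small tracker that commits text only when its mark is acknowledged before any interruption. This is a sketch under assumptions, not a confirmed fix. The class and method names are hypothetical, and it assumes each transcript delta arrives close to the audio chunk it describes, which the service does not guarantee. Because Twilio also echoes outstanding marks after a `clear`, the sketch drops everything still pending at interruption time rather than trusting later echoes.

```python
from collections import OrderedDict


class PlayedTranscript:
    """Keeps transcript text only for marks acknowledged before an interruption."""

    def __init__(self) -> None:
        self.pending = OrderedDict()   # mark name -> list of transcript deltas
        self.played = []               # text whose audio was acknowledged as played

    def on_mark_sent(self, mark: str) -> None:
        self.pending[mark] = []

    def on_transcript_delta(self, delta: str) -> None:
        # Attach the delta to the most recent unacknowledged mark, if any.
        if self.pending:
            self.pending[next(reversed(self.pending))].append(delta)

    def on_mark_ack(self, mark: str) -> None:
        # Marks dropped by on_interrupt are unknown here, so late echoes are ignored.
        deltas = self.pending.pop(mark, None)
        if deltas:
            self.played.append("".join(deltas))

    def on_interrupt(self) -> None:
        # Discard text for audio that was queued but never confirmed as played.
        self.pending.clear()

    def text(self) -> str:
        return "".join(self.played)
```

Deltas from a cancelled response that arrive after a new response has sent its first mark would still be misattributed, so filtering deltas by response or item ID may also be needed.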

Question

What is the correct or recommended way to reconstruct a reliable conversation transcript in this scenario? How can I reliably determine which transcript segments correspond to audio actually played?

I am following an official Twilio example that uses playback markers and interruption detection, but the issue persists. The example is this PR: "Initial add AI interruption/conversation truncation. AI talks first." by pkamp3 (Pull Request #13, twilio-samples/speech-assistant-openai-realtime-api-python on GitHub).

Thank you so much.

python twilio real-time openai-api