
Real-time Streaming API

Stream live audio for real-time speech-to-text transcription via WebSocket.

What is Real-time Transcription?

Stream audio in real-time and receive transcripts as speech occurs:

  1. Initialize WebSocket session
  2. Stream audio chunks continuously
  3. Receive transcripts in real-time

Best for: Live captions, voice assistants, call centers, live events

Quick Start

1. Initialize Session

curl -X POST https://api.scriptix.io/api/v4/realtime/session \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "language": "en",
    "sample_rate": 16000
  }'

Response:

{
  "session_id": "session_abc123",
  "websocket_url": "wss://api.scriptix.io/realtime/session_abc123",
  "expires_at": "2025-01-17T14:00:00Z"
}
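Before connecting, it can help to parse this response and confirm the session has not already expired. A minimal sketch (field names come from the response above; `parseSession` is an illustrative helper, not part of the API):

```javascript
// Hypothetical copy of the session-init response shown above.
const sessionJson = `{
  "session_id": "session_abc123",
  "websocket_url": "wss://api.scriptix.io/realtime/session_abc123",
  "expires_at": "2025-01-17T14:00:00Z"
}`;

// Parse the response and flag whether the session has expired.
function parseSession(json, now = new Date()) {
  const session = JSON.parse(json);
  session.expired = new Date(session.expires_at) <= now;
  return session;
}
```

If `expired` is true, request a new session before opening the WebSocket.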

2. Connect WebSocket

const WebSocket = require('ws');

const ws = new WebSocket('wss://api.scriptix.io/realtime/session_abc123');

ws.on('open', () => {
  console.log('Connected!');
  // Start streaming audio
});

ws.on('message', (data) => {
  const message = JSON.parse(data);
  if (message.type === 'partial' || message.type === 'final') {
    console.log(message.text);
  }
});

3. Stream Audio

// Send audio chunks (binary)
function streamAudio(audioBuffer) {
  ws.send(audioBuffer);
}

// Or send base64-encoded
function streamAudioBase64(audioData) {
  ws.send(JSON.stringify({
    type: 'audio',
    data: audioData // base64 string
  }));
}
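If you stream base64-encoded audio, the JSON message can be built directly from a raw PCM buffer. A small sketch (`buildAudioMessage` is an illustrative helper, not part of the API; the `type` and `data` fields match the message protocol below):

```javascript
// Wrap a raw PCM buffer in the base64 JSON audio message.
function buildAudioMessage(pcmBuffer) {
  return JSON.stringify({
    type: 'audio',
    data: pcmBuffer.toString('base64'),
  });
}
```

Binary frames avoid the ~33% base64 size overhead, so prefer them when your client supports binary WebSocket messages.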

4. Receive Transcripts

ws.on('message', (data) => {
  const message = JSON.parse(data);

  if (message.type === 'partial') {
    // Interim result (may change)
    console.log('Partial:', message.text);
  } else if (message.type === 'final') {
    // Final result (won't change)
    console.log('Final:', message.text);
  }
});
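The parse-and-branch logic above can be factored into a pure function, which is easier to unit-test than a socket callback. A sketch under the message types documented below (`classifyMessage` is an illustrative helper, not part of the API):

```javascript
// Classify a raw WebSocket message into partial / final / other.
function classifyMessage(raw) {
  const message = JSON.parse(raw);
  if (message.type === 'partial') return { kind: 'partial', text: message.text };
  if (message.type === 'final') return { kind: 'final', text: message.text };
  return { kind: 'other', message };
}
```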

Supported Audio Formats

Required Specifications

  Parameter    Value
  Sample Rate  8000, 16000, 22050, 44100, 48000 Hz
  Bit Depth    16-bit
  Channels     Mono (1 channel)
  Encoding     PCM, μ-law, A-law
  Format       Raw audio (PCM) or base64-encoded
Recommended settings:

  • Sample Rate: 16000 Hz
  • Bit Depth: 16-bit
  • Encoding: PCM
  • Chunk Size: 20-50ms of audio
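The chunk size in bytes follows directly from these specifications: sample rate × bytes per sample × chunk duration, for mono audio. A sketch (`chunkBytes` is an illustrative helper, not part of the API):

```javascript
// Bytes per mono chunk = sampleRate * (bitDepth / 8) * chunkDuration.
function chunkBytes(sampleRate, bitDepth, chunkMs) {
  return sampleRate * (bitDepth / 8) * (chunkMs / 1000);
}
```

At the recommended 16000 Hz / 16-bit / 20 ms settings, each chunk is 640 bytes.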

Message Protocol

Client → Server

1. Audio Data (Binary)

Send raw audio bytes:

ws.send(audioBuffer);  // ArrayBuffer or Buffer

2. Audio Data (JSON)

{
  "type": "audio",
  "data": "base64_encoded_audio_data"
}

3. Control Messages

{
  "type": "configure",
  "sample_rate": 16000,
  "language": "en"
}

{
  "type": "end_of_speech"
}

Server → Client

1. Partial Transcript

{
  "type": "partial",
  "text": "Hello how are",
  "confidence": 0.85,
  "timestamp": 1642089600.5
}

2. Final Transcript

{
  "type": "final",
  "text": "Hello, how are you doing today?",
  "confidence": 0.95,
  "start": 0.0,
  "end": 2.5,
  "timestamp": 1642089603.0
}

3. Error

{
  "type": "error",
  "error": "Invalid audio format",
  "error_code": "INVALID_AUDIO_FORMAT"
}
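A client typically wants to decide whether an error is worth a retry. The sketch below treats `INVALID_AUDIO_FORMAT` (the only code shown above) as fatal; treating other codes as retryable is an assumption, not documented behavior, and `shouldRetry` is an illustrative helper:

```javascript
// Decide whether to retry after a server error message.
// Assumption: codes other than INVALID_AUDIO_FORMAT may be transient.
function shouldRetry(errorMessage) {
  const fatal = new Set(['INVALID_AUDIO_FORMAT']);
  return !fatal.has(errorMessage.error_code);
}
```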

Complete Example

Python with websocket-client

import json
import threading

import pyaudio
import requests
import websocket

# 1. Initialize session
response = requests.post(
    'https://api.scriptix.io/api/v4/realtime/session',
    headers={
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
    },
    json={'language': 'en', 'sample_rate': 16000}
)

session = response.json()
ws_url = session['websocket_url']

# 2. Handle incoming transcripts
def on_message(ws, message):
    data = json.loads(message)
    if data['type'] == 'final':
        print(f"Final: {data['text']}")
    elif data['type'] == 'partial':
        print(f"Partial: {data['text']}")

def on_error(ws, error):
    print(f"Error: {error}")

# 3. Stream audio from the microphone
def stream_microphone(ws):
    p = pyaudio.PyAudio()
    stream = p.open(
        format=pyaudio.paInt16,
        channels=1,
        rate=16000,
        input=True,
        frames_per_buffer=1024
    )

    while True:
        data = stream.read(1024)
        ws.send(data, opcode=websocket.ABNF.OPCODE_BINARY)

# Start streaming only once the socket is actually open
def on_open(ws):
    threading.Thread(target=stream_microphone, args=(ws,), daemon=True).start()

ws = websocket.WebSocketApp(
    ws_url,
    on_open=on_open,
    on_message=on_message,
    on_error=on_error
)

ws.run_forever()

JavaScript/Node.js

const WebSocket = require('ws');
const mic = require('mic');

async function main() {
  // 1. Initialize session
  const response = await fetch(
    'https://api.scriptix.io/api/v4/realtime/session',
    {
      method: 'POST',
      headers: {
        'Authorization': 'Bearer YOUR_API_KEY',
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        language: 'en',
        sample_rate: 16000
      })
    }
  );

  const session = await response.json();

  // 2. Connect WebSocket
  const ws = new WebSocket(session.websocket_url);

  ws.on('open', () => {
    console.log('Connected!');

    // 3. Stream microphone audio
    const micInstance = mic({
      rate: '16000',
      channels: '1',
      encoding: 'signed-integer'
    });

    const micInputStream = micInstance.getAudioStream();

    micInputStream.on('data', (data) => {
      ws.send(data);
    });

    micInstance.start();
  });

  // 4. Receive transcripts
  ws.on('message', (data) => {
    const message = JSON.parse(data);

    if (message.type === 'final') {
      console.log('Final:', message.text);
    } else if (message.type === 'partial') {
      process.stdout.write(`\rPartial: ${message.text}`);
    }
  });
}

main();

Features

Interim Results

Get partial transcripts while speaking:

  • Partial: May change as more audio received
  • Final: Won't change, confirmed result
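Because partials are superseded by later messages, a running transcript should append finals permanently and let each new partial overwrite the previous one. A sketch of that bookkeeping (`TranscriptBuffer` is an illustrative helper, not part of the API):

```javascript
// Accumulate a running transcript: finals are appended permanently,
// each new partial replaces the previous partial.
class TranscriptBuffer {
  constructor() {
    this.finals = [];
    this.partial = '';
  }
  push(message) {
    if (message.type === 'final') {
      this.finals.push(message.text);
      this.partial = '';  // the final supersedes any pending partial
    } else if (message.type === 'partial') {
      this.partial = message.text;
    }
  }
  get text() {
    return [...this.finals, this.partial].filter(Boolean).join(' ');
  }
}
```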

Low Latency

Typical latency: 200-500ms from speech to transcript

Continuous Sessions

Keep session open for up to 4 hours of continuous streaming

Session Management

Session Timeout

  • Idle timeout: 30 seconds of silence closes the session
  • Max duration: 4 hours
  • Reconnection: Create new session if disconnected

Close Session

// Graceful close
ws.send(JSON.stringify({ type: 'end_of_session' }));
ws.close();

Pricing

Real-time transcription is charged per audio minute:

  Plan        Price per Minute  Concurrent Sessions
  Free        -                 0 (not available)
  Bronze      $0.012            2
  Silver      $0.010            5
  Gold        $0.008            10
  Enterprise  Custom            Custom

Note: Real-time costs 2x batch transcription due to compute requirements.
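Estimating a monthly bill from the table is a one-line multiplication. A sketch using the per-minute rates above (`estimateCost` and the `RATE_PER_MINUTE` lookup are illustrative helpers, not part of the API):

```javascript
// Real-time per-minute rates from the pricing table above.
const RATE_PER_MINUTE = { bronze: 0.012, silver: 0.010, gold: 0.008 };

// Estimate the cost of `minutes` of real-time transcription.
function estimateCost(plan, minutes) {
  const rate = RATE_PER_MINUTE[plan];
  if (rate === undefined) throw new Error(`No per-minute rate for plan: ${plan}`);
  return rate * minutes;
}
```

For example, one hour of streaming on Bronze costs about $0.72.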

Best Practices

1. Optimal Audio Chunks

Send 20-50ms chunks for best latency/quality balance:

# 16kHz, 16-bit, mono
chunk_size = 16000 * 2 * 0.02 # 20ms chunks = 640 bytes

2. Handle Network Issues

Implement reconnection logic:

let reconnectAttempts = 0;

function connect() {
  const ws = new WebSocket(websocketUrl);

  ws.on('open', () => {
    reconnectAttempts = 0; // reset backoff after a successful connection
  });

  ws.on('close', () => {
    if (reconnectAttempts < 5) {
      reconnectAttempts++;
      setTimeout(connect, 1000 * reconnectAttempts);
    }
  });
}

3. Buffer Management

Don't buffer too much audio client-side:

// ❌ Large buffer causes delay
let buffer = [];
setInterval(() => {
  ws.send(Buffer.concat(buffer));
  buffer = [];
}, 5000); // 5 second delay!

// ✅ Send immediately
micInputStream.on('data', (chunk) => {
  ws.send(chunk);
});

Limitations

  • Max session: 4 hours
  • Idle timeout: 30 seconds
  • Audio format: PCM only (no MP3/AAC)
  • Sample rates: Fixed set (8, 16, 22.05, 44.1, 48 kHz)
  • Channels: Mono only

Next Steps


Ready to stream? Start with Initialize Session.