Real-time Protocol
This page outlines the messaging protocol used in Scriptix's real-time speech-to-text API over WebSocket. All control messages are exchanged using JSON, while audio is streamed as binary data.
🧭 Client Messages
The following messages must be sent from the client over the WebSocket connection.
▶️ Start Session
{ "action": "start" }
🛠 Optional Properties
You can include additional properties in your `{ "action": "start" }` message to configure the behavior of your real-time transcription session, as shown in the sketch after the table below.
| Property | Type | Default | Description |
|---|---|---|---|
| `key` | string | `null` | Provide the API key in the payload instead of the header. Not recommended for production. |
| `partial` | boolean | `true` | Enables partial result messages. |
| `actual_numbers` | boolean | `false` | If `true`, the engine will output numeric values instead of spelled-out words (e.g., `"3"` instead of `"three"`). |
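As a concrete illustration, here is a minimal sketch of opening a session with these optional properties, assuming the Python `websockets` package; the endpoint URL is a placeholder, not the real Scriptix address.

```python
import asyncio
import json

import websockets

WS_URL = "wss://your-scriptix-endpoint.example/realtime"  # placeholder, not the real endpoint

async def start_session() -> None:
    async with websockets.connect(WS_URL) as ws:
        # Optional properties ride along in the same start message.
        await ws.send(json.dumps({
            "action": "start",
            "partial": True,          # receive intermediate partial results
            "actual_numbers": True,   # "3" instead of "three"
        }))
        print(await ws.recv())        # expect {"state": "listening"}

asyncio.run(start_session())
```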
⏹ Stop Session
{ "action": "stop" }
Ends the active session. The engine transcribes any remaining buffered audio, returns it as a final result, and responds with:
{ "state": "stopped" }
A session cannot be restarted after being stopped. You must close and reopen the WebSocket to begin a new session.
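Below is a sketch of a clean shutdown, assuming `ws` is an already-open `websockets` connection with an active session: send the stop action, keep reading until the `stopped` state arrives, and treat any message carrying a `result` as the final flush.

```python
import json

async def stop_session(ws) -> None:
    # Ask the engine to finalize, then drain messages until it confirms.
    await ws.send(json.dumps({"action": "stop"}))
    async for raw in ws:
        msg = json.loads(raw)
        if "result" in msg:
            print("final flush:", msg["text"])     # last buffered transcription
        if msg.get("state") == "stopped":
            break                                  # reconnect for a new session
```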
🔊 Send Audio Stream
Audio should be streamed directly as raw binary chunks over the WebSocket connection.
<binary PCM data>
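As an illustration, the sketch below streams raw PCM from a local file in small binary frames; the chunk size and pacing are assumptions for 16 kHz, 16-bit mono audio and should be adjusted to your actual format.

```python
import asyncio

CHUNK_BYTES = 3200  # ~100 ms of 16 kHz, 16-bit mono PCM (an assumption)

async def stream_audio(ws, path: str) -> None:
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_BYTES):
            await ws.send(chunk)       # binary frame, not JSON or base64
            await asyncio.sleep(0.1)   # rough real-time pacing
```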
📥 Server Messages
This section describes the messages sent by the Scriptix real-time API server over an active WebSocket connection.
🔁 State Transitions
| State | Message | Description |
|---|---|---|
| Listening | `{ "state": "listening" }` | Indicates that the server is ready to process incoming audio. |
| Stopped | `{ "state": "stopped" }` | Indicates that the session has been terminated. No further audio is accepted. |
| Shutting Down | `{ "state": "shutting_down", "at": 1234567890 }` | Sent ~1 hour before system shutdown. Useful for long-lived services. |
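A small sketch of dispatching on these state messages follows; only the field names come from the table above, the handler bodies are placeholders.

```python
import json

def handle_state(raw: str) -> None:
    msg = json.loads(raw)
    state = msg.get("state")
    if state == "listening":
        print("server ready: safe to start streaming audio")
    elif state == "stopped":
        print("session ended: no further audio will be accepted")
    elif state == "shutting_down":
        print(f"server shutting down around {msg['at']}, plan a reconnect")
```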
📝 Transcription Results
Partial Results
Returned when partial transcription is enabled and new speech is detected:
{ "partial": "Hello, this is a test" }
Partial results are intermediate and subject to change.
If no speech is detected, no partial result is returned.
Final Results
When the engine is confident, it sends a finalized transcription result:
{
"result": [
[ "Hello", 1000, 1200, 0.99 ],
[ "world", 1200, 1450, 0.98 ]
],
"text": "Hello world"
}
Each entry in the result array has the following structure:
[ "word", start_time_ms, end_time_ms, confidence ]
start_time_ms and end_time_ms represent time boundaries of the word in milliseconds from the beginning of the session.
confidence is a float between 0 and 1.
The text field contains the full recognized sentence.
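To make the tuple layout concrete, here is a sketch that unpacks a final result into typed word entries; the `Word` class is an illustrative helper, not part of the API.

```python
import json
from typing import NamedTuple

class Word(NamedTuple):
    text: str
    start_ms: int
    end_ms: int
    confidence: float

def parse_final(raw: str) -> list[Word]:
    msg = json.loads(raw)
    return [Word(*entry) for entry in msg.get("result", [])]

words = parse_final(
    '{"result": [["Hello", 1000, 1200, 0.99], ["world", 1200, 1450, 0.98]],'
    ' "text": "Hello world"}'
)
print(words[0])  # Word(text='Hello', start_ms=1000, end_ms=1200, confidence=0.99)
```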
❌ Error Messages
Below are the common error messages returned by the real-time engine and their meanings:
| Message | Description |
|---|---|
| `{"error": "Session not started"}` | Audio was sent before starting a session. |
| `{"error": "backend Client tried to start a new session while there is already listening"}` | Attempted to start a second session without disconnecting. |
| `{"error": "restarting of sessions is not supported"}` | A session was started again after being stopped. Not allowed. |
| `{"error": "unable to start backend"}` | The engine couldn’t initialize a backend. Contact support if persistent. |
| `{"error": "engine_not_responding"}` | No response from the backend transcription engine. Contact support. |
✅ Summary
- Always start with `{ "action": "start" }` and wait for a `listening` state before sending audio.
- Audio must be streamed as binary, not JSON or base64.
- End the session with `{ "action": "stop" }` to finalize and flush any buffered transcription.
- Use error messages and state transitions to monitor and manage session flow reliably.
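Putting the summary together, here is a minimal end-to-end sketch under the same assumptions as the snippets above (Python `websockets`, placeholder URL, API key passed via the `key` property, raw PCM on disk).

```python
import asyncio
import json

import websockets

WS_URL = "wss://your-scriptix-endpoint.example/realtime"  # placeholder
API_KEY = "YOUR_API_KEY"                                   # placeholder

async def transcribe(path: str) -> None:
    async with websockets.connect(WS_URL) as ws:
        # 1. Start the session; key in the payload is fine for a quick test.
        await ws.send(json.dumps({"action": "start", "key": API_KEY, "partial": True}))

        # 2. Wait for the "listening" state before sending any audio.
        while json.loads(await ws.recv()).get("state") != "listening":
            pass

        # 3. Stream raw PCM as binary frames, then ask the engine to finalize.
        with open(path, "rb") as f:
            while chunk := f.read(3200):
                await ws.send(chunk)
        await ws.send(json.dumps({"action": "stop"}))

        # 4. Drain results until the session reports "stopped".
        async for raw in ws:
            msg = json.loads(raw)
            if "error" in msg:
                raise RuntimeError(msg["error"])
            if "text" in msg:
                print(msg["text"])            # finalized transcription
            if msg.get("state") == "stopped":
                break

asyncio.run(transcribe("speech.raw"))
```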