Real-time Protocol
This page outlines the messaging protocol used in Scriptix's real-time speech-to-text API over WebSocket. All control messages are exchanged using JSON, while audio is streamed as binary data.
Client Messages
The client sends the following messages over the WebSocket connection.
Start Session
{ "action": "start" }
Optional Properties
You can include additional properties in your { "action": "start" } message to configure the behavior of your real-time transcription session.
| Property | Type | Default | Description |
|---|---|---|---|
| key | string | null | Provide the API key in the payload instead of the header. Not recommended for production. |
| partial | boolean | true | Enables partial result messages. |
| actual_numbers | boolean | false | If true, the engine will output numeric values instead of spelled-out words (e.g., "3" instead of "three"). |
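As an illustration of the options above, the following is a minimal sketch of a helper that serializes a start message. The function name `build_start_message` is hypothetical and not part of the Scriptix API; only the property names and defaults come from the table.

```python
import json
from typing import Optional


def build_start_message(key: Optional[str] = None,
                        partial: bool = True,
                        actual_numbers: bool = False) -> str:
    """Serialize a start message, including only non-default options."""
    message: dict = {"action": "start"}
    if key is not None:
        # Sending the key in the payload is not recommended for production.
        message["key"] = key
    if not partial:
        message["partial"] = False
    if actual_numbers:
        message["actual_numbers"] = True
    return json.dumps(message)
```

For example, `build_start_message(actual_numbers=True)` produces a JSON text frame that asks for digits instead of spelled-out numbers while leaving partial results enabled.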
Stop Session
{ "action": "stop" }
Ends the active session. The engine transcribes any remaining buffered audio, returns it as a final result, and then responds with:
{ "state": "stopped" }
A session cannot be restarted after being stopped. You must close and reopen the WebSocket to begin a new session.
Send Audio Stream
Audio should be streamed directly as raw binary chunks over the WebSocket connection.
<binary PCM data>
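One way to prepare audio for streaming is to slice a PCM buffer into fixed-size binary chunks and send each one as a binary WebSocket frame. The sketch below assumes a chunk size of 3200 bytes, which is an arbitrary illustrative value; this page does not specify a required chunk size or sample format, and `iter_audio_chunks` is a hypothetical helper.

```python
from typing import Iterator


def iter_audio_chunks(pcm: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Yield consecutive slices of raw PCM data, each at most chunk_size bytes.

    The default chunk_size is an assumption for illustration only.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(pcm), chunk_size):
        yield pcm[offset:offset + chunk_size]
```

Each yielded chunk is sent as-is in a binary frame, without JSON wrapping or base64 encoding.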
Server Messages
This section describes the messages sent by the Scriptix real-time API server over an active WebSocket connection.
State Transitions
| State | Message | Description |
|---|---|---|
| Listening | { "state": "listening" } | Indicates that the server is ready to process incoming audio. |
| Stopped | { "state": "stopped" } | Indicates that the session has been terminated. No further audio is accepted. |
| Shutting Down | { "state": "shutting_down", "at": 1234567890 } | Sent ~1 hour before system shutdown. Useful for long-lived services. |
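The state messages in the table can be tracked on the client with a small state holder. This is a sketch under assumptions: the class name `SessionState` and the local `"idle"` label are hypothetical, the `at` value is stored without interpretation, and the sketch assumes that a `shutting_down` notice does not by itself end the session.

```python
import json
from typing import Any, Optional


class SessionState:
    """Track the last known session state from server state messages."""

    def __init__(self) -> None:
        self.state: str = "idle"  # local label used before any server message
        self.shutdown_at: Optional[Any] = None

    def update(self, raw: str) -> None:
        """Apply one JSON text message received from the server."""
        message = json.loads(raw)
        state = message.get("state")
        if state == "shutting_down":
            # Assumption: this is an advance notice, so the state is unchanged.
            self.shutdown_at = message.get("at")
        elif state in ("listening", "stopped"):
            self.state = state

    @property
    def can_send_audio(self) -> bool:
        """Audio should only be sent while the server is listening."""
        return self.state == "listening"
```

A long-lived service could check `shutdown_at` after each update and plan a reconnect before the announced shutdown.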
Transcription Results
Partial Results
Returned when partial transcription is enabled and new speech is detected:
{ "partial": "Hello, this is a test" }
Partial results are intermediate and subject to change.
If no audio is detected, no partial will be returned.
Final Results
When the engine is confident, it sends a finalized transcription result:
{
"result": [
[ "Hello", 1000, 1200, 0.99 ],
[ "world", 1200, 1450, 0.98 ]
],
"text": "Hello world"
}
Each entry in the result array has the following structure:
[ "word", start_time_ms, end_time_ms, confidence ]
start_time_ms and end_time_ms represent time boundaries of the word in milliseconds from the beginning of the session.
confidence is a float between 0 and 1.
The text field contains the full recognized sentence.
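The final result layout described above can be unpacked into named fields. The following is a minimal sketch; the `Word` dataclass and `parse_final_result` function are hypothetical names, and only the field order comes from this page.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start_ms: int      # milliseconds from the beginning of the session
    end_ms: int        # milliseconds from the beginning of the session
    confidence: float  # between 0 and 1


def parse_final_result(raw: str) -> Tuple[str, List[Word]]:
    """Return the full text and the per-word entries of a final result."""
    message = json.loads(raw)
    # Each entry is [word, start_time_ms, end_time_ms, confidence].
    words = [Word(*entry) for entry in message["result"]]
    return message["text"], words
```

With the example message above, the first returned word has `start_ms` 1000 and `end_ms` 1200, so its duration is 200 ms.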
Error Messages
Below are the common error messages returned by the real-time engine and their meanings:
| Message | Description |
|---|---|
| {"error": "Session not started"} | Audio was sent before starting a session. |
| {"error": "backend Client tried to start a new session while there is already listening"} | Attempted to start a second session without disconnecting. |
| {"error": "restarting of sessions is not supported"} | A session was started again after being stopped. Not allowed. |
| {"error": "unable to start backend"} | The engine couldn't initialize a backend. Contact support if persistent. |
| {"error": "engine_not_responding"} | No response from the backend transcription engine. Contact support. |
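Because each server message shown on this page carries a distinct top-level key (`error`, `state`, `partial`, or `result`), a client can route messages by checking for those keys. The sketch below assumes that this holds for every message; `classify_message` and its return labels are hypothetical.

```python
import json
from typing import Any, Dict, Tuple


def classify_message(raw: str) -> Tuple[str, Dict[str, Any]]:
    """Return a (kind, message) pair for one JSON text frame from the server."""
    message = json.loads(raw)
    if "error" in message:
        return "error", message
    if "state" in message:
        return "state", message
    if "partial" in message:
        return "partial", message
    if "result" in message:
        return "final", message
    # Anything not described on this page is passed through unlabeled.
    return "unknown", message
```

Checking for `error` first means a failure is surfaced before any other handling runs.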
Summary
- Always start with { "action": "start" } and wait for a listening state before sending audio.
- Audio must be streamed as binary, not JSON or base64.
- End the session with { "action": "stop" } to finalize and flush any buffered transcription.
- Use error messages and state transitions to monitor and manage session flow reliably.
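The flow summarized above can be sketched independently of any particular WebSocket library. In this sketch, `send` and `recv` are hypothetical callables supplied by the caller (for example, thin wrappers around a WebSocket client), and `run_session` is an illustrative name. For simplicity it reads results only after sending stop, whereas a real client would usually read messages concurrently while streaming.

```python
import json
from typing import Callable, Iterable, List, Union

Frame = Union[str, bytes]


def run_session(send: Callable[[Frame], None],
                recv: Callable[[], str],
                chunks: Iterable[bytes]) -> List[str]:
    """Start a session, stream audio, stop, and collect final texts."""
    send(json.dumps({"action": "start"}))

    # Wait for the listening state before sending any audio.
    while True:
        message = json.loads(recv())
        if "error" in message:
            raise RuntimeError(message["error"])
        if message.get("state") == "listening":
            break

    # Audio goes out as raw binary frames, not JSON or base64.
    for chunk in chunks:
        send(chunk)

    send(json.dumps({"action": "stop"}))

    # Collect final results until the server confirms the stop.
    texts: List[str] = []
    while True:
        message = json.loads(recv())
        if "error" in message:
            raise RuntimeError(message["error"])
        if "text" in message:
            texts.append(message["text"])
        if message.get("state") == "stopped":
            return texts
```

Passing the transport in as callables keeps the protocol logic testable without a network connection; the same function can be driven by a fake transport in unit tests.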