Real-time Protocol

This page outlines the messaging protocol used in Scriptix's real-time speech-to-text API over WebSocket. All control messages are exchanged using JSON, while audio is streamed as binary data.

Client Messages

The client sends the following messages over the WebSocket connection.

Start Session

{ "action": "start" }

Optional Properties

You can include additional properties in your { "action": "start" } message to configure the behavior of your real-time transcription session.

Property	Type	Default	Description
`key`	string	`null`	Provide the API key in the payload instead of the header. Not recommended for production.
`partial`	boolean	`true`	Enables partial result messages.
`actual_numbers`	boolean	`false`	If `true`, the engine will output numeric values instead of spelled-out words (e.g., `"3"` instead of `"three"`).
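As a minimal sketch of how the optional properties combine with the start message, the helper below serializes a start payload. The function name `build_start_message` is hypothetical, and omitting properties that are left at their documented defaults is a design choice of this sketch, not a requirement of the API.

```python
import json
from typing import Optional


def build_start_message(
    key: Optional[str] = None,
    partial: bool = True,
    actual_numbers: bool = False,
) -> str:
    """Serialize a start message, adding only options that differ from the defaults."""
    message = {"action": "start"}
    if key is not None:
        # Per the table above, sending the key in the payload is not
        # recommended for production.
        message["key"] = key
    if not partial:
        message["partial"] = False
    if actual_numbers:
        message["actual_numbers"] = True
    return json.dumps(message)
```

Calling `build_start_message(actual_numbers=True)` yields a JSON string with `action` and `actual_numbers` set, ready to be sent as a text frame.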

Stop Session

{ "action": "stop" }

Ends the active session. The engine transcribes any remaining buffered audio, returns it as a final result, and then responds with:

{ "state": "stopped" }

A session cannot be restarted after being stopped. You must close and reopen the WebSocket to begin a new session.

Send Audio Stream

Audio should be streamed directly as raw binary chunks over the WebSocket connection.

<binary PCM data>
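One way to produce the binary chunks is to slice a PCM buffer into fixed-size pieces and send each piece as its own binary frame. The sketch below shows only the slicing. The name `iter_audio_chunks` and the default chunk size of 3200 bytes are assumptions made for illustration; this page does not specify a required chunk size or sample format.

```python
from typing import Iterator


def iter_audio_chunks(pcm: bytes, chunk_size: int = 3200) -> Iterator[bytes]:
    """Yield consecutive slices of raw PCM data; the last slice may be shorter."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for offset in range(0, len(pcm), chunk_size):
        # Each slice is sent as-is in a binary frame, not as JSON or base64.
        yield pcm[offset:offset + chunk_size]
```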

Server Messages

This section describes the messages sent by the Scriptix real-time API server over an active WebSocket connection.

State Transitions

State	Message	Description
Listening	`{ "state": "listening" }`	Indicates that the server is ready to process incoming audio.
Stopped	`{ "state": "stopped" }`	Indicates that the session has been terminated. No further audio is accepted.
Shutting Down	`{ "state": "shutting_down", "at": 1234567890 }`	Sent ~1 hour before system shutdown. Useful for long-lived services.
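A client typically needs to tell state messages apart from other server messages. The following sketch, with the hypothetical helper `parse_state`, returns the state name and the `at` value when present. It passes `at` through unchanged, because this page does not define its unit or meaning beyond the example value.

```python
import json
from typing import Optional, Tuple

# State names taken from the table above.
KNOWN_STATES = {"listening", "stopped", "shutting_down"}


def parse_state(raw: str) -> Optional[Tuple[str, Optional[int]]]:
    """Return (state, at) for a state message, or None for any other message."""
    message = json.loads(raw)
    if not isinstance(message, dict):
        return None
    state = message.get("state")
    if state not in KNOWN_STATES:
        return None
    # "at" only appears on shutting_down messages; it is None otherwise.
    return state, message.get("at")
```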

Transcription Results

Partial Results

Returned when partial transcription is enabled and new speech is detected:

{ "partial": "Hello, this is a test" }

Partial results are intermediate and subject to change.

If no audio is detected, no partial will be returned.

Final Results

When the engine is confident, it sends a finalized transcription result:

{
  "result": [
    [ "Hello", 1000, 1200, 0.99 ],
    [ "world", 1200, 1450, 0.98 ]
  ],
  "text": "Hello world"
}

Each entry in the result array has the following structure:

[ "word", start_time_ms, end_time_ms, confidence ]

start_time_ms and end_time_ms represent time boundaries of the word in milliseconds from the beginning of the session.

confidence is a float between 0 and 1.

The text field contains the full recognized sentence.
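Because each word arrives as a positional array, it can help to unpack the entries into named fields. This is a minimal sketch; the `Word` dataclass and `parse_final_result` function are hypothetical names, not part of the Scriptix API.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    start_ms: int      # milliseconds from the beginning of the session
    end_ms: int        # milliseconds from the beginning of the session
    confidence: float  # between 0 and 1


def parse_final_result(raw: str) -> Tuple[str, List[Word]]:
    """Return the full text and the per-word entries of a final result message."""
    message = json.loads(raw)
    # Each entry is [word, start_time_ms, end_time_ms, confidence].
    words = [Word(*entry) for entry in message["result"]]
    return message["text"], words
```

With the example message above, the second entry unpacks to the word `world` with `start_ms` 1200 and `end_ms` 1450, so its duration is 250 ms.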

Error Messages

Below are the common error messages returned by the real-time engine and their meanings:

Message	Description
`{"error": "Session not started"}`	Audio was sent before starting a session.
`{"error": "backend Client tried to start a new session while there is already listening"}`	Attempted to start a second session without disconnecting.
`{"error": "restarting of sessions is not supported"}`	A session was started again after being stopped. Not allowed.
`{"error": "unable to start backend"}`	The engine couldn’t initialize a backend. Contact support if persistent.
`{"error": "engine_not_responding"}`	No response from the backend transcription engine. Contact support.
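Error messages can be turned into exceptions so that they interrupt the normal message flow. The sketch below is one possible approach; `RealtimeError`, `raise_for_error`, and the `contact_support` flag are hypothetical names, and the flag simply mirrors the two rows in the table above that mention contacting support.

```python
import json

# Error strings whose descriptions above mention contacting support.
SUPPORT_ERRORS = {"unable to start backend", "engine_not_responding"}


class RealtimeError(Exception):
    """Raised when the server sends an {"error": ...} message."""

    def __init__(self, message: str):
        super().__init__(message)
        self.contact_support = message in SUPPORT_ERRORS


def raise_for_error(raw: str) -> dict:
    """Decode a server message; raise RealtimeError if it carries an error."""
    message = json.loads(raw)
    if isinstance(message, dict) and "error" in message:
        raise RealtimeError(message["error"])
    return message
```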

Summary

Always start with { "action": "start" } and wait for a listening state before sending audio.
Audio must be streamed as binary, not JSON or base64.
End the session with { "action": "stop" } to finalize and flush any buffered transcription.
Use error messages and state transitions to monitor and manage session flow reliably.
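The points above can be sketched as a single session loop. The example is written against a hypothetical `Connection` interface with blocking `send()` and `recv()` methods, so that it stays independent of any particular WebSocket library; the `run_session` name is also an assumption. For simplicity it sends all audio before reading results, whereas a real client would usually read server messages concurrently while streaming.

```python
import json
from typing import Iterable, List, Protocol, Union


class Connection(Protocol):
    """Hypothetical transport: any object with blocking send() and recv()."""

    def send(self, data: Union[str, bytes]) -> None: ...
    def recv(self) -> Union[str, bytes]: ...


def _next_message(conn: Connection) -> dict:
    """Read one JSON message and raise if it is an error message."""
    message = json.loads(conn.recv())
    if "error" in message:
        raise RuntimeError(message["error"])
    return message


def run_session(conn: Connection, chunks: Iterable[bytes]) -> List[str]:
    """Start a session, stream audio, stop, and return the final texts."""
    conn.send(json.dumps({"action": "start"}))
    # Wait for the listening state before sending any audio.
    while _next_message(conn).get("state") != "listening":
        pass
    for chunk in chunks:
        conn.send(chunk)  # binary frame, not JSON or base64
    # Stopping flushes buffered audio as a final result.
    conn.send(json.dumps({"action": "stop"}))
    texts: List[str] = []
    while True:
        message = _next_message(conn)
        if "text" in message:
            texts.append(message["text"])
        if message.get("state") == "stopped":
            return texts
```

Partial results are ignored in this sketch because they are intermediate and subject to change; only messages carrying a `text` field are collected.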
