
Real-time Protocol

This page outlines the messaging protocol used in Scriptix's real-time speech-to-text API over WebSocket. All control messages are exchanged using JSON, while audio is streamed as binary data.

🧭 Client Messages

The following messages must be sent from the client over the WebSocket connection.


▶️ Start Session

{ "action": "start" }

🛠 Optional Properties

You can include additional properties in your { "action": "start" } message to configure the behavior of your real-time transcription session.

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `key` | string | `null` | Provide the API key in the payload instead of the header. Not recommended for production. |
| `partial` | boolean | `true` | Enables partial result messages. |
| `actual_numbers` | boolean | `false` | If `true`, the engine outputs numeric values instead of spelled-out words (e.g., "3" instead of "three"). |
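As a sketch of how the optional properties above could be assembled into a start message (the helper name and its defaults are illustrative, not part of the API):

```python
import json

def build_start_message(partial=True, actual_numbers=False, key=None):
    """Assemble the JSON "start" control message.

    Optional properties mirror the table above; values matching the
    documented defaults are omitted to keep the payload minimal.
    """
    msg = {"action": "start"}
    if not partial:
        msg["partial"] = False
    if actual_numbers:
        msg["actual_numbers"] = True
    if key is not None:
        msg["key"] = key  # sending the API key in the payload is not recommended for production
    return json.dumps(msg)
```

The returned string would then be sent as a text frame over the open WebSocket connection.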

⏹ Stop Session

{ "action": "stop" }

Ends the active session. The engine transcribes any remaining buffered audio, returns it as a final result, and then responds with:

{ "state": "stopped" }

A session cannot be restarted after being stopped. You must close and reopen the WebSocket to begin a new session.
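After sending `{ "action": "stop" }`, a client should keep reading until `{ "state": "stopped" }` arrives, since a flushed final result may still be in flight. A minimal sketch of that drain loop, written against any iterable of received JSON text frames (so it also works in tests without a live socket):

```python
import json

def drain_after_stop(messages):
    """Collect final-result texts that arrive between sending "stop"
    and receiving {"state": "stopped"}.

    `messages` is any iterable of JSON text frames from the socket;
    partial results are ignored at this stage.
    """
    finals = []
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("state") == "stopped":
            break  # session is over; no further audio is accepted
        if "text" in msg:
            finals.append(msg["text"])
    return finals
```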


🔊 Send Audio Stream

Audio should be streamed directly as raw binary chunks over the WebSocket connection.

<binary PCM data>
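Clients typically send the audio in fixed-size binary frames rather than one large message. The chunk size below (3200 bytes, roughly 100 ms of 16 kHz, 16-bit mono PCM) is an assumption for illustration, not a documented requirement:

```python
def pcm_chunks(pcm_bytes, chunk_size=3200):
    """Split raw PCM audio into fixed-size binary chunks.

    Each chunk would be sent as one binary WebSocket frame; the final
    chunk may be shorter than chunk_size.
    """
    for i in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[i:i + chunk_size]
```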

📥 Server Messages

This section describes the messages sent by the Scriptix real-time API server over an active WebSocket connection.


🔁 State Transitions

| State | Message | Description |
| --- | --- | --- |
| Listening | `{ "state": "listening" }` | Indicates that the server is ready to process incoming audio. |
| Stopped | `{ "state": "stopped" }` | Indicates that the session has been terminated. No further audio is accepted. |
| Shutting Down | `{ "state": "shutting_down", "at": 1234567890 }` | Sent ~1 hour before system shutdown. Useful for long-lived services. |
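Because state, result, and error messages all arrive on the same connection, a receive loop usually starts by classifying each JSON frame. A small dispatcher along these lines (the tuple shape is an illustrative choice):

```python
import json

def classify_server_message(raw):
    """Route an incoming JSON text frame to one of the message
    families described on this page."""
    msg = json.loads(raw)
    if "error" in msg:
        return ("error", msg["error"])
    if "state" in msg:
        return ("state", msg["state"])
    if "partial" in msg:
        return ("partial", msg["partial"])
    if "result" in msg:
        return ("final", msg["text"])
    return ("unknown", msg)
```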

📝 Transcription Results

Partial Results

Returned when partial transcription is enabled and new speech is detected:

{ "partial": "Hello, this is a test" }

Partial results are intermediate and subject to change.

If no audio is detected, no partial will be returned.

Final Results

When the engine is confident, it sends a finalized transcription result:

{
  "result": [
    [ "Hello", 1000, 1200, 0.99 ],
    [ "world", 1200, 1450, 0.98 ]
  ],
  "text": "Hello world"
}

Each entry in the result array has the following structure:

[ "word", start_time_ms, end_time_ms, confidence ]

start_time_ms and end_time_ms represent time boundaries of the word in milliseconds from the beginning of the session.

confidence is a float between 0 and 1.

The text field contains the full recognized sentence.
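The positional entries can be unpacked into named fields for easier downstream use; this helper simply mirrors the `[ word, start_time_ms, end_time_ms, confidence ]` layout described above:

```python
def parse_result(payload):
    """Unpack final-result entries [word, start_ms, end_ms, confidence]
    into dicts keyed by field name."""
    return [
        {"word": w, "start_ms": s, "end_ms": e, "confidence": c}
        for w, s, e, c in payload["result"]
    ]
```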


❌ Error Messages

Below are the common error messages returned by the real-time engine and their meanings:

| Message | Description |
| --- | --- |
| `{"error": "Session not started"}` | Audio was sent before starting a session. |
| `{"error": "backend Client tried to start a new session while there is already listening"}` | Attempted to start a second session without disconnecting. |
| `{"error": "restarting of sessions is not supported"}` | A session was started again after being stopped. Not allowed. |
| `{"error": "unable to start backend"}` | The engine couldn't initialize a backend. Contact support if persistent. |
| `{"error": "engine_not_responding"}` | No response from the backend transcription engine. Contact support. |
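One way to act on these errors is to group them by recovery strategy. The grouping below is an assumption drawn from the descriptions above, not a documented policy:

```python
# Session-flow errors: the session state is unrecoverable on this
# connection, so close and reopen the WebSocket.
SESSION_FLOW_ERRORS = {
    "Session not started",
    "restarting of sessions is not supported",
    "backend Client tried to start a new session while there is already listening",
}

# Backend errors: the descriptions suggest retrying and contacting
# support if the problem persists.
BACKEND_ERRORS = {"unable to start backend", "engine_not_responding"}

def recovery_hint(error_message):
    """Return a coarse recovery hint for an error payload's message."""
    if error_message in SESSION_FLOW_ERRORS:
        return "reconnect"
    if error_message in BACKEND_ERRORS:
        return "retry_or_contact_support"
    return "unknown"
```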

✅ Summary

  • Always start with { "action": "start" } and wait for a listening state before sending audio.
  • Audio must be streamed as binary, not JSON or base64.
  • End the session with { "action": "stop" } to finalize and flush any buffered transcription.
  • Use error messages and state transitions to monitor and manage session flow reliably.
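The lifecycle rules in the summary can be sketched as a tiny state machine; the state and event names here are illustrative bookkeeping on the client side, not protocol fields:

```python
def next_state(state, event):
    """Advance the client-side session lifecycle:
    connected -> starting -> listening -> stopping -> stopped.

    Raises ValueError for moves the protocol forbids, such as
    restarting a stopped session on the same connection.
    """
    transitions = {
        ("connected", "sent_start"): "starting",
        ("starting", "state_listening"): "listening",
        ("listening", "sent_stop"): "stopping",
        ("stopping", "state_stopped"): "stopped",
    }
    key = (state, event)
    if key not in transitions:
        raise ValueError(f"invalid transition: {key}")
    return transitions[key]
```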