
Real-time Protocol

This page outlines the messaging protocol used in Scriptix's real-time speech-to-text API over WebSocket. All control messages are exchanged using JSON, while audio is streamed as binary data.

🧭 Client Messages

The following messages must be sent from the client over the WebSocket connection.


▶️ Start Session

{ "action": "start" }

🛠 Optional Properties

You can include additional properties in your { "action": "start" } message to configure the behavior of your real-time transcription session.

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| `key` | string | `null` | Provide the API key in the payload instead of the header. Not recommended for production. |
| `partial` | boolean | `true` | Enables partial result messages. |
| `actual_numbers` | boolean | `false` | If `true`, the engine outputs numeric values instead of spelled-out words (e.g., "3" instead of "three"). |
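As a sketch of how the optional properties above could be assembled into a start message (the helper name and its defaults are illustrative, not part of the API):

```python
import json

def build_start_message(partial=True, actual_numbers=False, key=None):
    """Assemble the JSON "start" control message.

    Optional properties mirror the table above; values matching the
    documented defaults are omitted to keep the payload minimal.
    """
    msg = {"action": "start"}
    if not partial:
        msg["partial"] = False
    if actual_numbers:
        msg["actual_numbers"] = True
    if key is not None:
        msg["key"] = key  # sending the API key in the payload is not recommended for production
    return json.dumps(msg)
```

The returned string would then be sent as a text frame over the open WebSocket connection.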

⏹ Stop Session

{ "action": "stop" }

Ends the active session. The engine transcribes any remaining buffered audio, returns it as a final result, and then responds with:

{ "state": "stopped" }

A session cannot be restarted after being stopped. You must close and reopen the WebSocket to begin a new session.
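After sending `{ "action": "stop" }`, a client should keep reading until `{ "state": "stopped" }` arrives, since a flushed final result may still be in flight. A minimal sketch of that drain loop, written against any iterable of received JSON text frames (so it also works in tests without a live socket):

```python
import json

def drain_after_stop(messages):
    """Collect final-result texts that arrive between sending "stop"
    and receiving {"state": "stopped"}.

    `messages` is any iterable of JSON text frames from the socket;
    partial results are ignored at this stage.
    """
    finals = []
    for raw in messages:
        msg = json.loads(raw)
        if msg.get("state") == "stopped":
            break  # session is over; no further audio is accepted
        if "text" in msg:
            finals.append(msg["text"])
    return finals
```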


🔊 Send Audio Stream

Audio should be streamed directly as raw binary chunks over the WebSocket connection.

<binary PCM data>
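Clients typically send the audio in fixed-size binary frames rather than one large message. The chunk size below (3200 bytes, roughly 100 ms of 16 kHz, 16-bit mono PCM) is an assumption for illustration, not a documented requirement:

```python
def pcm_chunks(pcm_bytes, chunk_size=3200):
    """Split raw PCM audio into fixed-size binary chunks.

    Each chunk would be sent as one binary WebSocket frame; the final
    chunk may be shorter than chunk_size.
    """
    for i in range(0, len(pcm_bytes), chunk_size):
        yield pcm_bytes[i:i + chunk_size]
```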

📥 Server Messages

This section describes the messages sent by the Scriptix real-time API server over an active WebSocket connection.


🔁 State Transitions

| State | Message | Description |
| --- | --- | --- |
| Listening | `{ "state": "listening" }` | Indicates that the server is ready to process incoming audio. |
| Stopped | `{ "state": "stopped" }` | Indicates that the session has been terminated. No further audio is accepted. |
| Shutting Down | `{ "state": "shutting_down", "at": 1234567890 }` | Sent ~1 hour before system shutdown. Useful for long-lived services. |
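Because state, result, and error messages all arrive on the same connection, a receive loop usually starts by classifying each JSON frame. A small dispatcher along these lines (the tuple shape is an illustrative choice):

```python
import json

def classify_server_message(raw):
    """Route an incoming JSON text frame to one of the message
    families described on this page."""
    msg = json.loads(raw)
    if "error" in msg:
        return ("error", msg["error"])
    if "state" in msg:
        return ("state", msg["state"])
    if "partial" in msg:
        return ("partial", msg["partial"])
    if "result" in msg:
        return ("final", msg["text"])
    return ("unknown", msg)
```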

📝 Transcription Results

Partial Results

Returned when partial transcription is enabled and new speech is detected:

{ "partial": "Hello, this is a test" }

Partial results are intermediate and subject to change.

If no audio is detected, no partial will be returned.

Final Results

When the engine is confident, it sends a finalized transcription result:

{
  "result": [
    [ "Hello", 1000, 1200, 0.99 ],
    [ "world", 1200, 1450, 0.98 ]
  ],
  "text": "Hello world"
}

Each entry in the result array has the following structure:

[ "word", start_time_ms, end_time_ms, confidence ]

start_time_ms and end_time_ms represent time boundaries of the word in milliseconds from the beginning of the session.

confidence is a float between 0 and 1.

The text field contains the full recognized sentence.
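The positional entries can be unpacked into named fields for easier downstream use; this helper simply mirrors the `[ word, start_time_ms, end_time_ms, confidence ]` layout described above:

```python
def parse_result(payload):
    """Unpack final-result entries [word, start_ms, end_ms, confidence]
    into dicts keyed by field name."""
    return [
        {"word": w, "start_ms": s, "end_ms": e, "confidence": c}
        for w, s, e, c in payload["result"]
    ]
```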


❌ Error Messages

Below are the common error messages returned by the real-time engine and their meanings:

| Message | Description |
| --- | --- |
| `{"error": "Session not started"}` | Audio was sent before starting a session. |
| `{"error": "backend Client tried to start a new session while there is already listening"}` | Attempted to start a second session without disconnecting. |
| `{"error": "restarting of sessions is not supported"}` | A session was started again after being stopped. Not allowed. |
| `{"error": "unable to start backend"}` | The engine couldn't initialize a backend. Contact support if persistent. |
| `{"error": "engine_not_responding"}` | No response from the backend transcription engine. Contact support. |
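One way to act on these errors is to group them by recovery strategy. The grouping below is an assumption drawn from the descriptions above, not a documented policy:

```python
# Session-flow errors: the session state is unrecoverable on this
# connection, so close and reopen the WebSocket.
SESSION_FLOW_ERRORS = {
    "Session not started",
    "restarting of sessions is not supported",
    "backend Client tried to start a new session while there is already listening",
}

# Backend errors: the descriptions suggest retrying and contacting
# support if the problem persists.
BACKEND_ERRORS = {"unable to start backend", "engine_not_responding"}

def recovery_hint(error_message):
    """Return a coarse recovery hint for an error payload's message."""
    if error_message in SESSION_FLOW_ERRORS:
        return "reconnect"
    if error_message in BACKEND_ERRORS:
        return "retry_or_contact_support"
    return "unknown"
```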

✅ Summary

  • Always start with { "action": "start" } and wait for a listening state before sending audio.
  • Audio must be streamed as binary, not JSON or base64.
  • End the session with { "action": "stop" } to finalize and flush any buffered transcription.
  • Use error messages and state transitions to monitor and manage session flow reliably.
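The lifecycle rules in the summary can be sketched as a tiny state machine; the state and event names here are illustrative bookkeeping on the client side, not protocol fields:

```python
def next_state(state, event):
    """Advance the client-side session lifecycle:
    connected -> starting -> listening -> stopping -> stopped.

    Raises ValueError for moves the protocol forbids, such as
    restarting a stopped session on the same connection.
    """
    transitions = {
        ("connected", "sent_start"): "starting",
        ("starting", "state_listening"): "listening",
        ("listening", "sent_stop"): "stopping",
        ("stopping", "state_stopped"): "stopped",
    }
    key = (state, event)
    if key not in transitions:
        raise ValueError(f"invalid transition: {key}")
    return transitions[key]
```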