Speaker Separation (Diarization)
Speaker diarization is the automatic identification and separation of different speakers in your audio. This guide explains how speaker separation works, when to use it, and how to work with the results.
What is Speaker Diarization?
Diarization (pronounced "die-uh-rih-ZAY-shun") is the process of automatically:
- Detecting when different speakers are talking
- Separating speech into distinct speaker turns
- Grouping utterances by the same speaker
- Labeling speakers with default identifiers
How It Works
During Transcription Processing:
- Audio Analysis - System analyzes voice characteristics
  - Pitch and tone
  - Speaking rate and rhythm
  - Vocal timbre and quality
- Speaker Clustering - Groups similar voices together
  - Identifies unique voice signatures
  - Assigns each signature a speaker ID
  - Determines speaker change points
- Utterance Attribution - Attributes each speech segment to a speaker
  - Creates separate utterances for each speaker turn
  - Assigns speaker labels
  - Maintains timestamp information
- Output - Generates structured transcript
  - Each utterance has a `speaker` field
  - Chronological order preserved
  - Ready for renaming and editing
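To make the output stage more concrete, here is a minimal sketch of what a diarized transcript might look like in code and how consecutive segments from the same speaker could be merged into turns. Only the `speaker` field is named in this guide; the `start`, `end`, and `text` fields, the `Utterance` type, and the `toSpeakerTurns` helper are assumptions made for illustration, not the product's actual schema.

```typescript
// Hypothetical utterance shape. Only `speaker` comes from this guide;
// the remaining fields are assumed for the sake of the example.
interface Utterance {
  speaker: string; // default identifier assigned by diarization
  start: number;   // start time in seconds (assumed unit)
  end: number;     // end time in seconds (assumed unit)
  text: string;
}

// Merge consecutive segments attributed to the same speaker into one turn,
// preserving chronological order and the outer timestamps of each turn.
function toSpeakerTurns(segments: Utterance[]): Utterance[] {
  const sorted = [...segments].sort((a, b) => a.start - b.start);
  const turns: Utterance[] = [];
  for (const seg of sorted) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === seg.speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.end = Math.max(last.end, seg.end);
      last.text = `${last.text} ${seg.text}`;
    } else {
      // Speaker change point: start a new turn (copy to avoid mutating input).
      turns.push({ ...seg });
    }
  }
  return turns;
}
```

Copying each segment before pushing it keeps the input array untouched, which is convenient if the raw segments are still needed for verifying the separation later.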
With vs Without Speaker Separation
| Aspect | With Diarization | Without Diarization |
|---|---|---|
| Processing | Automatic speaker detection | All text in one continuous block |
| Output | Separated by speaker turns | No speaker attribution |
| Initial Labels | Speaker identifiers assigned | No speaker labels |
| Editing Required | Rename speakers, verify separation | Manually split all speaker turns |
| Best For | Multi-speaker content | Single speaker content |
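The "Editing Required" row mentions renaming speakers after diarization. As a rough illustration, renaming can be thought of as a lookup from default identifiers to real names. The `LabeledUtterance` type, the `renameSpeakers` and `formatTranscript` helpers, and the label format used in the usage note are all hypothetical; the guide does not specify them.

```typescript
// Hypothetical minimal utterance: a speaker label plus the spoken text.
interface LabeledUtterance {
  speaker: string;
  text: string;
}

// Replace default speaker identifiers with chosen names.
// Labels without an entry in `names` are left unchanged.
function renameSpeakers(
  utterances: LabeledUtterance[],
  names: Record<string, string>,
): LabeledUtterance[] {
  return utterances.map((u) => ({ ...u, speaker: names[u.speaker] ?? u.speaker }));
}

// Render one line per speaker turn, prefixed with the speaker label.
function formatTranscript(utterances: LabeledUtterance[]): string {
  return utterances.map((u) => `${u.speaker}: ${u.text}`).join("\n");
}
```

For example, calling `renameSpeakers(utterances, { "Speaker 1": "Interviewer" })` would relabel only the first speaker and leave every other label as assigned.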
When to Enable Speaker Separation
Recommended For
Multi-Speaker Content:
- Interviews - Interviewer and interviewee(s)
- Meetings - Team discussions, conference calls
- Podcasts - Multiple hosts or guests
- Panel Discussions - Multiple participants
- Debates - Two or more speakers
- Conferences - Q&A sessions, presentations with discussions
- Customer Calls - Agent and customer conversations
- Focus Groups - Moderator and participants
Optimal Scenarios:
- 2-6 speakers (best accuracy)
- Clear speaker changes
- Distinct voices (different genders, accents, or pitches)
- Limited speaker overlap
Not Recommended For
Single Speaker Content:
- Individual presentations or lectures
- Solo podcasts or monologues
- Audiobooks or narration
- Voice notes or memos
- Solo video tutorials
Challenging Scenarios:
- Very similar-sounding speakers
- More than 10 speakers
- Extensive speaker overlap (everyone talking at once)
- Very short speaker turns (rapid back-and-forth)
- Background conversations
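The speaker-count guidance above can be summarized as a small decision helper. This is only a sketch of the rules of thumb in this section: the `diarizationFit` function and its return labels are hypothetical, and the "supported" band for 7 to 10 speakers is an assumption, since the guide only states that 2-6 speakers give the best accuracy and that more than 10 is challenging.

```typescript
type DiarizationFit = "not-recommended" | "optimal" | "supported" | "challenging";

// Classify a recording by its expected number of speakers.
function diarizationFit(expectedSpeakers: number): DiarizationFit {
  if (expectedSpeakers <= 1) return "not-recommended"; // single-speaker content
  if (expectedSpeakers <= 6) return "optimal";         // 2-6 speakers: best accuracy
  if (expectedSpeakers <= 10) return "supported";      // assumed middle band (not stated in the guide)
  return "challenging";                                // more than 10 speakers
}
```

Speaker count is only one factor; overlap, very short turns, and similar-sounding voices can make separation difficult even within the optimal range.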
Enabling Speaker Separation
During Upload
Step 1: Start Upload
- Click "Upload" button in workspace
- Upload modal opens
- Select your media file(s)
Step 2: Configure Settings
- In the upload modal, find the "Separate speakers" toggle
- Enable the toggle to activate speaker separation
- Toggle appears as a Switch component
Step 3: Complete Upload
- Configure other settings (language, custom model if needed)
- Click "Upload" button to start
- Processing begins with diarization enabled
UI Location:
Upload Modal
├── File Selection
├── Language Selection
├── Custom Model (optional)
├── Separate speakers ← Toggle switch
└── Upload Button
Default Behavior
Organization Default:
- Organizations can set a default value for `diarization`
- New uploads automatically use organization default
- Check the toggle state during upload to verify
- Toggle can be changed per-upload
Setting Location:
- Organization settings: `organization.diarization` (boolean)
- Default loads on modal open via `useEffect`
Per-Session Configuration:
- Setting applies to current upload only
- Each new upload can have different settings
- Previous sessions are unaffected
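The interaction between the organization default and the per-upload toggle can be sketched as follows. The `organization.diarization` boolean comes from this guide; the `Organization` type, both helper functions, and the fallback to `false` when no organization default is set are assumptions for illustration. In the actual modal this initialization reportedly happens inside a `useEffect`, which is omitted here to keep the logic framework-independent.

```typescript
// Hypothetical slice of the organization settings object.
interface Organization {
  diarization?: boolean;
}

// Initial toggle state when the upload modal opens.
// Falling back to `false` when no default exists is an assumption.
function initialDiarization(org: Organization | undefined): boolean {
  return org?.diarization ?? false;
}

// Effective value for one upload: the user's toggle choice wins,
// otherwise the organization default applies.
function resolveUploadDiarization(
  org: Organization | undefined,
  toggleOverride?: boolean,
): boolean {
  return toggleOverride ?? initialDiarization(org);
}
```

Because the override is passed per call and nothing is written back to the organization object, each upload resolves its own value and earlier sessions stay unaffected.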