
Speaker Separation (Diarization)

Speaker diarization is the automatic identification and separation of different speakers in your audio. This guide explains how speaker separation works, when to use it, and how to work with the results.

What is Speaker Diarization?

Diarization (pronounced "die-uh-rih-ZAY-shun") is the process of automatically:

  • Detecting when different speakers are talking
  • Separating speech into distinct speaker turns
  • Grouping utterances by the same speaker
  • Labeling speakers with default identifiers

How It Works

During Transcription Processing:

  1. Audio Analysis - System analyzes voice characteristics

    • Pitch and tone
    • Speaking rate and rhythm
    • Vocal timbre and quality
  2. Speaker Clustering - Groups similar voices together

    • Identifies unique voice signatures
    • Assigns each signature a speaker ID
    • Determines speaker change points
  3. Utterance Attribution - Attributes each speech segment to a speaker

    • Creates separate utterances for each speaker turn
    • Assigns speaker labels
    • Maintains timestamp information
  4. Output - Generates structured transcript

    • Each utterance has a speaker field
    • Chronological order preserved
    • Ready for renaming and editing
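
The exact models behind these steps are internal to the transcription backend, but the clustering idea in step 2 can be illustrated with a deliberately simplified sketch. Everything below is hypothetical and for illustration only: it assumes each speech segment already carries a numeric voice embedding, and it groups segments greedily by cosine similarity.

// Simplified, hypothetical illustration of speaker clustering.
// Assumes each segment already carries a voice embedding (a numeric
// fingerprint of pitch, timbre, rhythm, etc.). Not the actual backend code.

interface VoiceSegment {
  start: number;        // seconds
  end: number;          // seconds
  embedding: number[];  // voice signature vector
  speaker?: string;     // filled in by clustering
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, v, i) => sum + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

// Greedy clustering: a segment joins the first existing speaker whose
// centroid is similar enough, otherwise it starts a new speaker.
function clusterSpeakers(segments: VoiceSegment[], threshold = 0.8): VoiceSegment[] {
  const centroids: { speaker: string; embedding: number[] }[] = [];
  for (const segment of segments) {
    const match = centroids.find(
      (c) => cosineSimilarity(c.embedding, segment.embedding) >= threshold
    );
    if (match) {
      segment.speaker = match.speaker;
    } else {
      const speaker = `Speaker ${String.fromCharCode(65 + centroids.length)}`; // A, B, C...
      centroids.push({ speaker, embedding: segment.embedding });
      segment.speaker = speaker;
    }
  }
  return segments;
}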

With vs Without Speaker Separation

Aspect              With Diarization                      Without Diarization
Processing          Automatic speaker detection           All text in one continuous block
Output              Separated by speaker turns            No speaker attribution
Initial Labels      Speaker identifiers assigned          No speaker labels
Editing Required    Rename speakers, verify separation    Manually split all speaker turns
Best For            Multi-speaker content                 Single speaker content

When to Enable Speaker Separation

Multi-Speaker Content:

  • Interviews - Interviewer and interviewee(s)
  • Meetings - Team discussions, conference calls
  • Podcasts - Multiple hosts or guests
  • Panel Discussions - Multiple participants
  • Debates - Two or more speakers
  • Conferences - Q&A sessions, presentations with discussions
  • Customer Calls - Agent and customer conversations
  • Focus Groups - Moderator and participants

Optimal Scenarios:

  • 2-6 speakers (best accuracy)
  • Clear speaker changes
  • Distinct voices (different genders, accents, or pitches)
  • Limited speaker overlap

Single Speaker Content:

  • Individual presentations or lectures
  • Solo podcasts or monologues
  • Audiobooks or narration
  • Voice notes or memos
  • Solo video tutorials

Challenging Scenarios:

  • Very similar-sounding speakers
  • More than 10 speakers
  • Extensive speaker overlap (everyone talking at once)
  • Very short speaker turns (rapid back-and-forth)
  • Background conversations

Enabling Speaker Separation

During Upload

Step 1: Start Upload

  • Click "Upload" button in workspace
  • Upload modal opens
  • Select your media file(s)

Step 2: Configure Settings

  • In the upload modal, find the "Separate speakers" toggle
  • Enable the toggle to activate speaker separation
  • Toggle appears as a Switch component

Step 3: Complete Upload

  • Configure other settings (language, custom model if needed)
  • Click "Upload" button to start
  • Processing begins with diarization enabled

UI Location:

Upload Modal
├── File Selection
├── Language Selection
├── Custom Model (optional)
├── Separate speakers ← Toggle switch
└── Upload Button
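
The exact upload request is not documented here, but conceptually the choices made in this modal travel together as one settings object. A hedged TypeScript sketch of what that might look like follows; every field name is an assumption except the diarization flag, which mirrors the "Separate speakers" toggle.

// Hypothetical shape of the settings gathered from the upload modal.
// Field names are illustrative; the "diarization" flag corresponds to
// the "Separate speakers" toggle described above.
interface UploadSettings {
  files: File[];
  language: string;        // e.g. "en"
  customModelId?: string;  // optional custom model
  diarization: boolean;    // "Separate speakers" toggle state
}

const settings: UploadSettings = {
  files: [],               // populated from the file selection step
  language: "en",
  diarization: true,       // enable speaker separation for this upload
};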

Default Behavior

Organization Default:

  • Organizations can set a default value for diarization
  • New uploads automatically use organization default
  • Check the toggle state during upload to verify
  • Toggle can be changed per-upload

Setting Location:

  • Organization settings: organization.diarization (boolean)
  • Default loads on modal open via useEffect
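
A minimal React sketch of that default-loading pattern might look like the following. This is not the actual component; the Organization shape and hook name are assumptions, with only the diarization boolean taken from the setting named above.

// Minimal sketch (not the actual component): seed the "Separate speakers"
// toggle from the organization default when the upload modal opens.
import { useEffect, useState } from "react";

interface Organization {
  diarization: boolean; // organization-wide default
}

function useDiarizationToggle(organization: Organization) {
  const [diarization, setDiarization] = useState(false);

  // Load the organization default whenever the modal (re)opens.
  useEffect(() => {
    setDiarization(organization.diarization);
  }, [organization.diarization]);

  return { diarization, setDiarization };
}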

Per-Session Configuration:

  • Setting applies to current upload only
  • Each new upload can have different settings
  • Previous sessions are unaffected

Diarization Output

Structured Transcript

With Speaker Separation Enabled:

Each utterance has a speaker field:

[00:00:05] Speaker A
Hello everyone, thank you for joining today's meeting.
I'd like to start by discussing our quarterly results.

[00:00:25] Speaker B
Thanks for having us. I'm excited to share our progress
on the new product launch.

[00:00:40] Speaker A
Great! Let's dive into the details.

Without Speaker Separation:

All text in continuous utterances without speaker attribution:

[00:00:05]
Hello everyone, thank you for joining today's meeting.
I'd like to start by discussing our quarterly results.
Thanks for having us. I'm excited to share our progress
on the new product launch. Great! Let's dive into the details.

Speaker Labels

Data Structure:

  • Each utterance has a speaker?: string field
  • Speakers identified by the diarization algorithm
  • Labels can be renamed in the editor
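
Putting those points together, a sketch of the utterance shape in TypeScript might look like this. Only the optional speaker field is documented in this guide; the other field names are assumptions used for illustration.

// Sketch of an utterance as implied by this guide. Only the optional
// "speaker" field is documented above; the other field names are assumptions.
interface Utterance {
  start: number;     // start time in seconds (assumed)
  end: number;       // end time in seconds (assumed)
  text: string;      // transcribed speech (assumed)
  speaker?: string;  // present when diarization is enabled
}

const example: Utterance[] = [
  { start: 5,  end: 25, text: "Hello everyone, thank you for joining today's meeting.", speaker: "Speaker A" },
  { start: 25, end: 40, text: "Thanks for having us.", speaker: "Speaker B" },
];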

Initial Naming:

  • System assigns speaker identifiers
  • Naming convention depends on server implementation
  • Typically sequential or based on detection order

Working with Speakers in the Editor

Viewing Speakers

Transcript Editor:

  • Each utterance displays its speaker label
  • Speaker field shown before utterance text
  • Visual distinction between different speakers
  • Click to edit speaker name

Speaker Field:

  • Located at start of each utterance
  • Editable inline
  • Changes save with document

Editing Speakers

Rename Speakers:

  • Click on speaker label to edit
  • Type new name
  • Press Enter or click away to save

Speaker Management Features:

  • Add new speakers
  • Merge speakers (combine multiple IDs into one)
  • Split speakers (separate incorrectly grouped utterances)
  • Change speaker for individual utterances

See Label Speakers and Merge & Split Speakers for detailed editing instructions.

Common Diarization Scenarios

Scenario 1: Over-Segmentation

Issue: One speaker split into multiple speaker IDs

Example:

[00:00:05] Speaker A
Hello everyone.

[00:00:08] Speaker C ← Same person as Speaker A
Let's begin the meeting.

[00:00:15] Speaker A
First on the agenda...

Cause: Voice changes (volume, emotion), pauses, background noise

Fix: Merge Speaker C into Speaker A using speaker merge tool
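
In data terms, a merge is just a relabel: every utterance attributed to the duplicate ID gets the surviving ID. A hedged sketch, reusing the hypothetical Utterance shape from earlier:

// Relabel every utterance from one speaker ID to another.
// Uses the hypothetical Utterance shape sketched earlier in this guide.
function mergeSpeakers(utterances: Utterance[], from: string, into: string): Utterance[] {
  return utterances.map((u) =>
    u.speaker === from ? { ...u, speaker: into } : u
  );
}

// Example: fold the spurious "Speaker C" back into "Speaker A".
const fixed = mergeSpeakers(
  [
    { start: 5,  end: 8,  text: "Hello everyone.",          speaker: "Speaker A" },
    { start: 8,  end: 15, text: "Let's begin the meeting.", speaker: "Speaker C" },
    { start: 15, end: 20, text: "First on the agenda...",   speaker: "Speaker A" },
  ],
  "Speaker C",
  "Speaker A"
);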

Scenario 2: Under-Segmentation

Issue: Two speakers combined in one utterance

Example:

[00:00:05] Speaker A
Hello everyone. [Speaker B should start here] Thanks for joining.

Cause: No pause between speakers, very short turns

Fix: Split utterance at correct point, assign second part to Speaker B
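
A split goes the other way: one utterance is cut at the hand-off point and the tail is reassigned. The sketch below is again hypothetical; it divides the timestamps proportionally, which is only a rough approximation, so real splits are best verified against the audio.

// Split one utterance at a character offset and assign the tail to
// another speaker. Timestamps are divided proportionally as a rough
// approximation; real splits should be checked against the audio.
function splitUtterance(u: Utterance, charOffset: number, newSpeaker: string): [Utterance, Utterance] {
  const ratio = charOffset / u.text.length;
  const splitTime = u.start + (u.end - u.start) * ratio;
  return [
    { ...u, end: splitTime, text: u.text.slice(0, charOffset).trim() },
    { ...u, start: splitTime, text: u.text.slice(charOffset).trim(), speaker: newSpeaker },
  ];
}

// Example from Scenario 2: "Thanks for joining." actually belongs to Speaker B.
const [first, second] = splitUtterance(
  { start: 5, end: 10, text: "Hello everyone. Thanks for joining.", speaker: "Speaker A" },
  "Hello everyone.".length,
  "Speaker B"
);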

Scenario 3: Speaker Confusion

Issue: Two speakers incorrectly assigned same ID

Example:

[00:00:05] Speaker A  ← Actually John
Hello, I'm John.

[00:00:25] Speaker A ← Actually Sarah
Hi, this is Sarah.

Cause: Similar voices, algorithm couldn't distinguish

Fix: Rename speakers, split if needed, reassign utterances

Processing Impact

Processing Time

With Diarization:

  • Requires additional processing time for speaker analysis
  • Exact impact depends on audio length and complexity
  • The extra time is typically outweighed by the manual editing it saves

Without Diarization:

  • Faster processing
  • Standard transcription time only

Cost Considerations

Pricing:

  • Diarization may affect processing cost
  • Check with your organization administrator
  • Cost varies by subscription plan

Best Practices

Recording Best Practices

  1. Clear Turn-Taking - Pause between speakers when possible
  2. Minimize Overlap - Avoid talking over each other
  3. Consistent Audio - Keep microphone distance and volume stable
  4. Reduce Noise - Minimize background sounds
  5. Individual Microphones - Each speaker on separate mic if possible

Post-Processing Best Practices

  1. Review Immediately - Check speaker attribution early
  2. Rename Speakers - Replace generic labels with actual names
  3. Use Audio Verification - Listen to confirm speaker identity
  4. Systematic Correction - Fix speaker issues before editing text
  5. Document Patterns - Note recurring errors for future uploads

Workflow Integration

End-to-End Process

Step 1: Enable Diarization During Upload

  • Toggle "Separate speakers" ON
  • Select appropriate language
  • Start transcription

Step 2: Wait for Processing

  • Diarization runs automatically
  • No intervention required
  • Receive notification when complete

Step 3: Review Initial Output

  • Open transcript in editor
  • Skim through speaker labels
  • Note overall accuracy

Step 4: Rename Speakers

  • Replace generic labels with actual names
  • Do this early to make editing easier
  • See Label Speakers

Step 5: Correct Attribution

  • Merge over-segmented speakers
  • Split under-segmented speakers
  • Reassign misattributed utterances
  • See Merge & Split Speakers

Step 6: Edit Transcript Text

  • With speakers correctly labeled, edit text content
  • Verify accuracy with audio playback
  • Finalize transcript

Troubleshooting

All Speech Attributed to One Speaker

Issue: Only one speaker ID appears, but multiple speakers exist

Possible Causes:

  • Diarization not enabled during upload
  • Very similar voices (system couldn't distinguish)
  • Poor audio quality

Solutions:

  1. Verify "Separate speakers" was toggled ON
  2. If not: Re-transcribe with diarization enabled
  3. If yes: Manually add speaker changes using the editor
  4. Improve audio quality for future uploads

Too Many Speakers Detected

Issue: More speaker IDs than actual speakers

Possible Causes:

  • Over-segmentation (one speaker split into multiple IDs)
  • Background noise interpreted as speakers
  • Speaker voice changes

Solutions:

  1. Merge duplicate speakers
  2. Review sample utterances to identify true speakers
  3. Reassign incorrectly labeled utterances

Speakers Frequently Mixed Up

Issue: Same two speakers repeatedly confused

Possible Causes:

  • Very similar voices
  • Overlapping speech
  • Same gender with similar accents

Solutions:

  1. Manually correct each instance
  2. Use audio playback to verify
  3. For future: Encourage distinct speaking styles
  4. Consider using individual microphones

Toggle Not Visible

Issue: "Separate speakers" toggle doesn't appear

Possible Cause:

  • Using Align mode (toggle hidden)

Solution:

  • Diarization is only available in Upload and Decode modes
  • Use a standard upload for speaker separation

Frequently Asked Questions

Can I enable diarization after transcription?

No, diarization must be enabled during the initial upload. You would need to re-transcribe the file with "Separate speakers" turned on.

How many speakers can diarization handle?

The system can technically handle many speakers, but accuracy decreases as the speaker count grows. Best results are achieved with 2-6 speakers.

Does diarization identify speakers by name?

No, it assigns generic speaker identifiers. You must manually rename them with actual names in the editor.

Does diarization cost extra?

This varies by organization and subscription plan. Check with your administrator for specific pricing details.

What if the toggle is already on by default?

Your organization has set diarization as the default. You can toggle it OFF if you have single-speaker content.

Can I change the speaker separation after processing?

You can edit speaker labels and manually split/merge speakers in the editor, but you cannot re-run the automatic diarization. You would need to re-transcribe the file.

Does it work for speakers in different languages?

Diarization is based on voice characteristics rather than language, so it can separate speakers even if they speak different languages.

Next Steps

After diarization completes and you review the results, continue with Label Speakers and Merge & Split Speakers to correct and finalize speaker attribution.


Enable speaker separation for multi-speaker content! Toggle "Separate speakers" during upload for automatic speaker detection.