TTS Provider Guide
Script to Speech supports multiple Text-to-Speech providers. This guide covers configuration, capabilities, and provider-specific considerations.
Supported TTS Providers
OpenAI
- Requirements: API key required
- Voice Options: Preview available at openai.fm
- Concurrent Downloads: 7 threads
- Rate Limits: Standard API rate limits apply
- Considerations
- Pros
- Cheap (up to 10x cheaper compared to ElevenLabs)
- High-quality, realistic-sounding voices
- Fast generation
- Cons
- Limited number of voices
- Short clips are sometimes output as silent audio
- Best For
- Characters with lots of lines (due to affordability)
- Characters that don’t have special accent / age / etc. considerations
- “default” narrator character (given above considerations)
ElevenLabs
- Requirements:
- API key required
- “Creator” plan or higher required (other plans to be supported in future releases)
- Voice Library: Uses “public” library voices for configuration
- Voice Limit: 30 voice limit in “my voices” library
- Voice Management: Automatic voice addition/removal within 30 voice limit
- Monthly Limits: Voice adds/removes have monthly quotas imposed by ElevenLabs
- Concurrent Downloads: 5 threads
- Considerations
- Pros
- Reliable generation: no issues with silent or otherwise mis-generated audio
- Wide variety of voices, across ages / accents / ethnicities / style
- High-quality, realistic-sounding voices
- Fast generation
- Cons
- Expensive
- Some voices in the public library are low quality
- Best For
- Characters where accent / age / style is important
- Filling out the wider world of side characters
Cartesia
- Concurrent Downloads: 2 threads
- Customization: Language options and speaking rate (experimental)
- Considerations
- Pros
- Free plan gives 25 minutes of generations a month
- Voice audio quality fairly high
- Fast generation
- Features a few dozen voices
- $5 / month plan gets 125 minutes of audio
- Cons
- Voice cadence / delivery at times inconsistent
- Voice less life-like than OpenAI or ElevenLabs providers
- Inconsistent delivery of ALL UPPERCASE text
- Best For
- Side characters
- Testing
Minimax
- Requirements: API key and Group ID required
- Voice Options: 60+ system voices with voice mixing capabilities
- Concurrent Downloads: 1 thread (multi-threading is supported, but the rate limit is generally hit)
- Customization: Voice mixing, speed, volume, pitch, emotion, language boost
- Considerations
- Pros
- Good number of high-quality voices, with a few different accents, and a number of configuration options available
- Voice mixing allows blending multiple voices with different weights
- Emotion control / pitch control for expressive speech
- Extensive non-English support
- Fast generation (but aggressive rate-limiting negates most benefit)
- Cheap
- Cons
- Some voices lack life-like expressiveness, despite being high-quality otherwise
- Some small quirks make for distracting dialogue
- Issues with reading numbers at times (e.g. “In the year 1972” -> “In the year one-nine-seven-two”)
- Seems to pick the wrong heteronym more often than providers like ElevenLabs / OpenAI (e.g. “close up” -> “cloz up” instead of “cloce up”; “we’re going live” -> “we’re going liv” instead of “we’re going live”)
- Some strange pronunciation for English words at times
- Best For
- Main and supporting characters (though maybe not narrators)
- Emotional dialogue with varied expressions or ones requiring a voice blend
- Non-English characters
Zyphra Zonos (API version)
- Requirements: API key required; free plan okay
- Voice Options: Configurable voice from 9 options
- Concurrent Downloads: 5 threads
- Customization: Speaking rate and language options
- Considerations
- Pros
- Free plan gives 100 minutes of generations a month
- Cons
- Few voices offered
- Generation comparatively slow
- Reliability: coherence struggles with longer dialogues
- Voice less life-like than other providers
- Best For
- One-off side characters
- Testing
Dummy (Testing Only)
- Purpose: Testing without API calls
- Types: dummy_stateful and dummy_stateless
- Use Case: Development and testing
Environment Variables
Required environment variables by provider:
# OpenAI
export OPENAI_API_KEY="your-api-key"
# ElevenLabs
export ELEVEN_API_KEY="your-api-key"
# Cartesia
export CARTESIA_API_KEY="your-api-key"
# Minimax
export MINIMAX_API_KEY="your-api-key"
export MINIMAX_GROUP_ID="your-group-id"
# Zonos
export ZONOS_API_KEY="your-api-key"
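Before starting a long generation run, it can help to confirm that the variables above are actually set. The following is a minimal sketch, not part of Script to Speech: the variable names come from the list above, while `REQUIRED_ENV_VARS` and `missing_env_vars` are hypothetical names introduced for illustration.

```python
import os

# Variable names are taken from the list above; the helper is illustrative only.
REQUIRED_ENV_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "elevenlabs": ["ELEVEN_API_KEY"],
    "cartesia": ["CARTESIA_API_KEY"],
    "minimax": ["MINIMAX_API_KEY", "MINIMAX_GROUP_ID"],
    "zonos": ["ZONOS_API_KEY"],
}


def missing_env_vars(providers, environ=None):
    """Return {provider: [missing variable names]} for the given providers."""
    env = os.environ if environ is None else environ
    missing = {}
    for provider in providers:
        # A variable counts as missing when it is unset or empty.
        absent = [
            name for name in REQUIRED_ENV_VARS.get(provider, [])
            if not env.get(name)
        ]
        if absent:
            missing[provider] = absent
    return missing


if __name__ == "__main__":
    for provider, names in missing_env_vars(REQUIRED_ENV_VARS).items():
        print(f"{provider}: missing {', '.join(names)}")
```

Passing only the providers used in a given configuration avoids warnings about keys that are not needed for that run.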
Configuration Structure
Provider Assignment
Each speaker in your configuration must have a provider assigned and must supply all required fields for that provider. When a TTS provider configuration is generated, entries for the required fields are created by default; optional fields can be added manually. Multiple providers can be combined in a single TTS provider configuration.
default:
  provider: openai
  voice: onyx
HARRY:
  provider: elevenlabs
  voice_id: ErXwobaYiN019PkySvjV
LUNA:
  provider: zonos
  default_voice_name: american_male
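The rule that every speaker needs a provider plus that provider's required fields can be sketched as a small check. This is an illustration only and not how `sts-tts-provider-yaml validate` is implemented: the required fields are the ones listed in this guide, the configuration is written as a Python dict mirroring the YAML, and `find_config_problems` is a hypothetical helper. Minimax is left out because it accepts either `voice_id` or `voice_mix`.

```python
# Required fields as listed in this guide. Minimax is omitted on purpose,
# since it accepts either voice_id or voice_mix.
REQUIRED_FIELDS = {
    "openai": ["voice"],
    "elevenlabs": ["voice_id"],
    "cartesia": ["voice_id"],
    "zonos": ["default_voice_name"],
}


def find_config_problems(config):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    for speaker, fields in config.items():
        fields = fields or {}
        provider = fields.get("provider")
        if not provider:
            problems.append(f"{speaker}: no provider assigned")
            continue
        if provider not in REQUIRED_FIELDS:
            problems.append(f"{speaker}: provider '{provider}' is not covered by this sketch")
            continue
        for name in REQUIRED_FIELDS[provider]:
            if not fields.get(name):
                problems.append(f"{speaker}: missing '{name}' for provider '{provider}'")
    return problems


config = {
    "default": {"provider": "openai", "voice": "onyx"},
    "HARRY": {"provider": "elevenlabs", "voice_id": "ErXwobaYiN019PkySvjV"},
    "LUNA": {"provider": "zonos"},  # default_voice_name deliberately left out
}

print(find_config_problems(config))
# ["LUNA: missing 'default_voice_name' for provider 'zonos'"]
```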
Generated Configuration (Single Provider Workflow)
The sts-tts-provider-yaml generate [screenplay].json --tts-provider [provider] command creates a template with:
- An entry for each speaker
- Pre-populated provider field
- Empty entries for each required provider field
- Speaker statistics to aid in casting each character
- (optional) Use the --include-optional-fields flag to also create empty entries for each optional field
# default: 1556 lines - Used for all non-dialogue pieces
# Total characters: 104244, Longest dialogue: 2082 characters
default:
  provider: openai
  voice:
# HARRY: 283 lines
# Total characters: 12181, Longest dialogue: 365 characters
HARRY:
  provider: openai
  voice:
Multi-Provider Workflow
Step 1: Generate Base Configuration
uv run sts-tts-provider-yaml generate input/[screenplay]/[screenplay].json
Step 2: Assign TTS Providers
Edit the generated YAML to assign providers to each speaker:
default:
  provider: openai
HARRY:
  provider: elevenlabs
LUNA:
  provider: openai
Step 3: Populate Provider Fields
uv run sts-tts-provider-yaml populate input/[screenplay]/[screenplay].json \
input/[screenplay]/[screenplay]_voice_config.yaml
This creates [screenplay]_voice_config_populated.yaml with provider-specific fields grouped:
# OpenAI Configuration
default:
  provider: openai
  voice:
LUNA:
  provider: openai
  voice:
# ElevenLabs Configuration
HARRY:
  provider: elevenlabs
  voice_id:
Step 4: Fill in Provider Details
Complete the populated configuration with specific values:
# OpenAI Configuration
default:
  provider: openai
  voice: onyx
LUNA:
  provider: openai
  voice: alloy
# ElevenLabs Configuration
HARRY:
  provider: elevenlabs
  voice_id: ErXwobaYiN019PkySvjV
Step 5: Validate Configuration
# Basic validation (checks for missing/extra/duplicate speakers)
uv run sts-tts-provider-yaml validate input/[screenplay]/[screenplay].json \
input/[screenplay]/[screenplay]_voice_config.yaml
# Strict validation (also validates provider-specific fields)
uv run sts-tts-provider-yaml validate input/[screenplay]/[screenplay].json \
input/[screenplay]/[screenplay]_voice_config.yaml --strict
Provider-Specific Configuration
OpenAI Configuration
Required fields:
voice: Voice identifier
Available voices:
- alloy
- ash
- coral
- echo
- fable
- onyx
- nova
- sage
- shimmer
Example:
default:
  provider: openai
  voice: onyx
NARRATOR:
  provider: openai
  voice: alloy
ElevenLabs Configuration
Required fields:
voice_id: Public library voice ID
Important notes:
- Voice IDs must be from the public voice library, not the my voices library
- Provider manages the 30 voice limit automatically by removing voices from “my voices” library when limit is reached
- Monthly add/remove operations are limited
Example:
MARY:
  provider: elevenlabs
  voice_id: IKne3meq5aSn9XLyUdCD # Public library ID
JOHN:
  provider: elevenlabs
  voice_id: ErXwobaYiN019PkySvjV # Public library ID
Cartesia Configuration
Required fields:
voice_id: ID of a Cartesia voice. Voices and their IDs can be found at the Cartesia Playground
Optional fields
language: One of [en,fr,de,es,pt,zh,ja,hi,it,ko,nl,pl,ru,sv,tr]
speed: One of [“slow”, “normal”, “fast”]
- Note: this is an experimental feature that doesn’t work for all voices
Example:
BECCA:
  provider: cartesia
  voice_id: bf0a246a-8642-498a-9950-80c35e9276b5
  speed: fast
  language: fr
TOM:
  provider: cartesia
  voice_id: 4df027cb-2920-4a1f-8c34-f21529d5c3fe
Minimax Configuration
Required fields:
voice_id: One of the available system voices (listed below)
Available voices:
- English_expressive_narrator
- English_radiant_girl
- English_magnetic_voiced_man
- English_compelling_lady1
- English_Aussie_Bloke
- English_captivating_female1
- English_Upbeat_Woman
- English_Trustworth_Man
- English_CalmWoman
- English_UpsetGirl
- English_Gentle-voiced_man
- English_Whispering_girl_v3
- English_Diligent_Man
- English_Graceful_Lady
- English_ReservedYoungMan
- English_PlayfulGirl
- English_ManWithDeepVoice
- English_GentleTeacher
- English_MaturePartner
- English_FriendlyPerson
- English_MatureBoss
- English_Debator
- English_Abbess
- English_LovelyGirl
- English_Steadymentor
- English_Deep-VoicedGentleman
- English_DeterminedMan
- English_Wiselady
- English_CaptivatingStoryteller
- English_AttractiveGirl
- English_DecentYoungMan
- English_SentimentalLady
- English_ImposingManner
- English_SadTeen
- English_ThoughtfulMan
- English_PassionateWarrior
- English_DecentBoy
- English_WiseScholar
- English_Soft-spokenGirl
- English_SereneWoman
- English_ConfidentWoman
- English_patient_man_v1
- English_Comedian
- English_GorgeousLady
- English_BossyLeader
- English_LovelyLady
- English_Strong-WilledBoy
- English_Deep-tonedMan
- English_StressedLady
- English_AssertiveQueen
- English_AnimeCharacter
- English_Jovialman
- English_WhimsicalGirl
- English_CharmingQueen
- English_Kind-heartedGirl
- English_FriendlyNeighbor
- English_Sweet_Female_4
- English_Magnetic_Male_2
- English_Lively_Male_11
- English_Friendly_Female_3
- English_Steady_Female_1
- English_Lively_Male_10
- English_Magnetic_Male_12
- English_Steady_Female_5
Optional fields:
voice_mix: List of voice blends (1-4 items), each with:
- voice_id: One of the available system voices
- weight: Integer between 1-100
- Note: If provided, takes precedence over the top-level voice_id. Only one of voice_id or voice_mix can be supplied
speed: (default: 1.0) Float between 0.5-2.0
volume: (default: 1.0) Float between >0.0-10.0
pitch: (default: 0) Integer between -12 to 12
emotion: One of [“happy”, “sad”, “angry”, “fear”, “disgust”, “neutral”, “surprise”]
english_normalization: (default: true) Boolean (true/false)
language_boost: (default: “English”) One of [“Chinese”, “English”, “Japanese”, “Korean”, “French”, “Spanish”, “German”]
Example (with voice_id):
DAVID:
  provider: minimax
  voice_id: Calm_Woman
  speed: 1.2
  volume: 8.0
  pitch: 2
  emotion: happy
  english_normalization: false
  language_boost: Spanish
Example (with voice_mix):
MARIA:
  provider: minimax
  voice_mix:
    - voice_id: Patient_Man
      weight: 70
    - voice_id: Young_Knight
      weight: 30
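The ranges listed for the Minimax optional fields can be checked before any API call is made. The sketch below is an assumption-laden illustration, not the provider's actual `validate_speaker_config`: `check_minimax_speaker` is a hypothetical name, the limits are copied from the field list above, and it reads the `voice_id` / `voice_mix` rule as "exactly one of the two".

```python
ALLOWED_EMOTIONS = {"happy", "sad", "angry", "fear", "disgust", "neutral", "surprise"}


def check_minimax_speaker(cfg):
    """Return a list of range/shape errors for one Minimax speaker entry."""
    errors = []
    has_id = "voice_id" in cfg
    mix = cfg.get("voice_mix")

    # Assumption: exactly one of voice_id or voice_mix must be present.
    if has_id and mix is not None:
        errors.append("supply only one of voice_id or voice_mix")
    if not has_id and mix is None:
        errors.append("one of voice_id or voice_mix is needed")

    if mix is not None:
        if not 1 <= len(mix) <= 4:
            errors.append("voice_mix must contain 1-4 items")
        for item in mix:
            weight = item.get("weight")
            if not isinstance(weight, int) or not 1 <= weight <= 100:
                errors.append(f"weight {weight!r} must be an integer in 1-100")

    # Defaults mirror the documented ones, so absent fields always pass.
    if not 0.5 <= cfg.get("speed", 1.0) <= 2.0:
        errors.append("speed must be between 0.5 and 2.0")
    if not 0.0 < cfg.get("volume", 1.0) <= 10.0:
        errors.append("volume must be greater than 0.0 and at most 10.0")
    pitch = cfg.get("pitch", 0)
    if not isinstance(pitch, int) or not -12 <= pitch <= 12:
        errors.append("pitch must be an integer between -12 and 12")
    emotion = cfg.get("emotion")
    if emotion is not None and emotion not in ALLOWED_EMOTIONS:
        errors.append(f"emotion {emotion!r} is not a listed option")
    return errors


david = {"voice_id": "Calm_Woman", "speed": 1.2, "volume": 8.0, "pitch": 2, "emotion": "happy"}
print(check_minimax_speaker(david))  # []
```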
Zonos Configuration
Required fields:
default_voice_name: one of 9 available voices
Available voices:
- american_female
- american_male
- anime_girl
- british_female
- british_male
- energetic_boy
- energetic_girl
- japanese_female
- japanese_male
Optional fields:
speaking_rate: Float between 5 and 35
language_iso_code: One of [en-us, fr-fr, de, ja, ko, cmn]
Example:
ROBOT:
  provider: zonos
  default_voice_name: american_female
  speaking_rate: 20
  language_iso_code: en-us
ALIEN:
  provider: zonos
  default_voice_name: american_male
Rate Limiting
Automatic Handling
The system automatically handles rate limits with:
- Exponential backoff
- Provider-specific retry logic
- Queue management
When rate limited, the system will:
- Pause requests for that provider
- Continue with other TTS providers
- Retry after backoff period
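The pause-and-retry behavior described above can be pictured with a generic exponential backoff loop. This is a sketch of the general technique only, not the project's retry code: `RateLimitError`, `call_with_backoff`, and the delay values are all assumptions chosen for illustration.

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for whatever error a provider client raises when rate limited."""


def call_with_backoff(request, max_retries=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call request(); on RateLimitError wait 1s, 2s, 4s, ... and try again."""
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted, surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_request():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise RateLimitError()
        return b"audio-bytes"

    # sleep is replaced so the demo finishes instantly.
    print(call_with_backoff(flaky_request, sleep=lambda seconds: None))
```

Because the wait applies per call, requests to other providers running in other threads are unaffected while one provider is backing off.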
Provider Architecture
The Script to Speech system supports two types of TTS providers: stateless and stateful. Understanding the difference is important for creating custom providers.
Stateless vs. Stateful TTS Providers
Stateless TTS Providers
- Definition: Providers that don’t maintain state between API calls
- Implementation: Use class methods to generate audio
- Examples: OpenAI, Zonos
- Advantages:
- Simpler to implement
- Thread-safe without additional code
- More predictable behavior
- Easier to debug
- When to use: Default choice for most providers
Stateful TTS Providers
- Definition: Providers that maintain state between API calls
- Implementation: Same class methods as a stateless provider, except for instance-based __init__ and generate_audio methods
- Examples: ElevenLabs (for voice registry management)
- Advantages:
- Can cache / configure information between calls
- Can implement complex state machines
- When to use: Only when required by the API or when managing complex resources
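The difference between the two styles can be shown with two toy classes. These are not the project's base classes, and the method signatures are assumptions made for illustration; only the idea (class methods with no retained state versus an instance that remembers things between calls) comes from the description above.

```python
def fake_client(voice, text):
    """Stand-in for a real API client; returns bytes like a TTS call would."""
    return f"{voice}:{text}".encode()


class StatelessSketchProvider:
    """Everything needed arrives as arguments; nothing is remembered."""

    @classmethod
    def generate_audio(cls, client, speaker_config, text):
        return client(speaker_config["voice"], text)


class StatefulSketchProvider:
    """Keeps a registry between calls, loosely like a voice library manager."""

    def __init__(self):
        self._registered = {}  # public voice ID -> locally registered ID

    def generate_audio(self, client, speaker_config, text):
        voice_id = speaker_config["voice_id"]
        if voice_id not in self._registered:
            # The first use of a voice registers it; later calls reuse the entry.
            self._registered[voice_id] = f"local-{len(self._registered)}"
        return client(self._registered[voice_id], text)


print(StatelessSketchProvider.generate_audio(fake_client, {"voice": "onyx"}, "hi"))
provider = StatefulSketchProvider()
print(provider.generate_audio(fake_client, {"voice_id": "abc"}, "hi"))
```

The stateful version needs extra care under multi-threading, since two threads could try to register the same voice at once; the stateless version has no shared data to protect.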
Provider Management
The TTSProviderManager handles:
- Lazy Initialization: TTS providers are only initialized when needed
- Thread Safety: Thread locks protect provider initialization and state
- Client Caching: API clients are reused across calls
- Multi-threading: Each provider has its own download concurrency settings
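Lazy initialization, locking, and client caching fit together in a few lines. The class below is a simplified sketch of that pattern and not the actual TTSProviderManager: `LazyClientRegistry` and its factory-based constructor are hypothetical.

```python
import threading


class LazyClientRegistry:
    """Creates each provider's client on first use and reuses it afterwards."""

    def __init__(self, factories):
        self._factories = factories      # provider name -> zero-argument factory
        self._clients = {}               # provider name -> cached client
        self._lock = threading.Lock()    # guards creation and the cache

    def get_client(self, provider):
        with self._lock:
            if provider not in self._clients:
                # Only the first caller pays the construction cost.
                self._clients[provider] = self._factories[provider]()
            return self._clients[provider]


if __name__ == "__main__":
    created = []

    def make_client():
        created.append("client")
        return object()

    registry = LazyClientRegistry({"openai": make_client})
    workers = [
        threading.Thread(target=registry.get_client, args=("openai",))
        for _ in range(8)
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    print(len(created))  # 1
```

A single lock is the simplest choice; a per-provider lock would let unrelated providers initialize in parallel at the cost of more bookkeeping.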
Creating Custom TTS Providers
Base Classes
All TTS providers implement one of two base classes:
- StatelessTTSProviderBase: For TTS providers without state
  - Uses class methods
- StatefulTTSProviderBase: For TTS providers with state
  - Uses instance methods and __init__
Both inherit from TTSProviderCommonMixin which defines common requirements.
Adding a New Provider
- Create a directory in src/script_to_speech/tts_providers/ with your provider name
- Create a tts_provider.py file in that directory
- Implement the appropriate base class
- Return the correct provider identifier via get_provider_identifier()
The provider will be automatically discovered and available in configurations.
Examples
- For an example of a stateless provider, see the OpenAI TTS Provider
- For an example of a stateful provider, see the ElevenLabs TTS Provider and accompanying ElevenLabs Voice Registry Manager
Provider Requirements
Required methods:
get_provider_identifier(): Unique identifier for this provider; will be used in YAML configuration and whenever this provider is being called on the command line
get_speaker_identifier(): Unique identifier for a given speaker, given a speaker configuration. This is used in caching, so changing any configuration option for a speaker (e.g. optional fields) should also change the returned speaker_identifier
instantiate_client(): Create the API client that will be passed to the generate_audio method by the TTSProviderManager
generate_audio(): Request the audio from the TTS Provider API; return bytes representing the audio
get_required_fields(): Required configuration fields
validate_speaker_config(): Logic to validate configuration (checking for required fields, that they’re the right type, etc.)
get_yaml_instructions(): Configuration help text outlining required / optional fields, best practices, etc.
Optional methods:
get_optional_fields(): Optional configuration fields
get_max_download_threads(): Concurrent thread limit; defaults to 1
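Because get_speaker_identifier() feeds the cache, one way to satisfy the "any configuration change alters the identifier" requirement is to hash the whole speaker configuration. The helper below is one possible approach under that assumption, not the scheme the bundled providers use; `speaker_identifier` is a hypothetical standalone function.

```python
import hashlib
import json


def speaker_identifier(speaker_config):
    """Derive a stable cache key that depends on every field in the config."""
    # sort_keys makes the result independent of the order keys were written in.
    canonical = json.dumps(speaker_config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    # A readable prefix makes cached file names easier to recognize.
    prefix = speaker_config.get("voice_id") or speaker_config.get("voice") or "speaker"
    return f"{prefix}_{digest}"


base = {"provider": "minimax", "voice_id": "Calm_Woman", "speed": 1.2}
reordered = {"speed": 1.2, "voice_id": "Calm_Woman", "provider": "minimax"}
faster = {"provider": "minimax", "voice_id": "Calm_Woman", "speed": 1.5}

print(speaker_identifier(base) == speaker_identifier(reordered))  # True
print(speaker_identifier(base) == speaker_identifier(faster))     # False
```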
Best Practices
Configuration Management
- Use the generate → assign → populate workflow
- Keep backup configurations for different voice setups
- Consider character type when assigning TTS providers
- Validate configurations with sts-tts-provider-yaml validate
Multi-Provider Benefits
- Speed: Parallel processing across TTS providers
- Cost: Optimize per-provider pricing
- Quality: Match voice types to character needs
Troubleshooting
Common Issues
- API Key Errors
# Check environment variables
echo $OPENAI_API_KEY
echo $ELEVEN_API_KEY
echo $CARTESIA_API_KEY
echo $MINIMAX_API_KEY
echo $MINIMAX_GROUP_ID
echo $ZONOS_API_KEY
- Voice Configuration Validation
# Check for missing/extra/duplicate speakers
uv run sts-tts-provider-yaml validate script.json config.yaml
# Strict validation including provider field validation
uv run sts-tts-provider-yaml validate script.json config.yaml --strict
- Voice Not Found
- ElevenLabs: Ensure voice ID is from public library
- OpenAI: Verify voice name matches available options
- Minimax: Verify voice_id is one of the specified system voices
- Minimax: Check voice_mix structure if using voice mixing
- Zonos: Check default_voice_name is a valid voice
- Rate Limiting
- Check provider-specific rate limits
- Limit global concurrent downloads with the --max-workers run mode modifier
- Distribute voices across TTS providers
- Monitor monthly quotas for voice adds / removes (ElevenLabs)
- Quality Issues
- OpenAI: Try different voices for different character types
- ElevenLabs: Use “narrative”/“conversational” tagged voices
- Minimax: Experiment with voice_mix for unique character voices
- Minimax: Adjust emotion, speed, and pitch parameters for expressiveness
- Zonos: Adjust speaking rate parameter
Debugging
# Test single line of dialogue
uv run sts-generate-standalone-speech openai --voice echo "Test text"
# Validate voice configuration against script
uv run sts-tts-provider-yaml validate input/script.json config.yaml
# Strict validation including provider field validation
uv run sts-tts-provider-yaml validate input/script.json config.yaml --strict