668 lines
18 KiB
Markdown
668 lines
18 KiB
Markdown
# Local Text-to-Speech Setup
|
|
|
|
Learn how to run text-to-speech models completely locally using OpenAI-compatible TTS servers, giving you full privacy control and zero ongoing costs for podcast and audio generation.
|
|
|
|
This guide uses **Speaches** as an example implementation, but the principles apply to any OpenAI-compatible TTS server.
|
|
|
|
## Why Local Text-to-Speech?
|
|
|
|
Running text-to-speech locally offers significant advantages:
|
|
|
|
- **🔒 Complete Privacy**: Your content never leaves your machine
|
|
- **💰 Zero Ongoing Costs**: No per-character or per-minute charges
|
|
- **⚡ No Rate Limits**: Generate unlimited audio without restrictions
|
|
- **🌐 Offline Capability**: Works without internet connection
|
|
- **🎯 Full Control**: Choose and customize your voice models
|
|
- **📈 Predictable Costs**: One-time setup, no surprises
|
|
|
|
## Available Local TTS Solutions
|
|
|
|
Open Notebook supports any OpenAI-compatible text-to-speech server. This guide uses **Speaches** as an example because it's:
|
|
|
|
- Open-source and actively maintained
|
|
- Easy to set up with Docker
|
|
- Compatible with OpenAI's TTS API specification
|
|
- Supports multiple high-quality voice models
|
|
|
|
### About Speaches
|
|
|
|
[Speaches](https://github.com/speaches-ai/speaches) is an open-source, OpenAI-compatible text-to-speech server that runs locally on your machine. It provides:
|
|
|
|
- **OpenAI API Compatibility**: Works seamlessly with Open Notebook's OpenAI-compatible provider
|
|
- **High-Quality Voices**: Support for multiple neural TTS models
|
|
- **Easy Model Management**: Simple CLI for downloading and managing voice models
|
|
- **Docker Support**: Run in containers for easy deployment
|
|
- **Multiple Voice Options**: Various voices and languages available
|
|
- **Customizable Speed**: Adjust speech rate to your preference
|
|
|
|
> **Note**: If you're using a different OpenAI-compatible TTS server, the configuration steps will be similar - just adjust the endpoints and model names accordingly.
|
|
|
|
## Quick Start with Speaches
|
|
|
|
This section demonstrates setup using Speaches as an example. If you're using a different local TTS solution, adapt the steps accordingly.
|
|
|
|
### Prerequisites
|
|
|
|
- **Docker** installed on your system
|
|
- At least **2GB RAM** available
|
|
- **5GB disk space** for models
|
|
|
|
### Basic Setup
|
|
|
|
The fastest way to get started is using our example setup:
|
|
|
|
**1. Create a project directory:**
|
|
```bash
|
|
mkdir speaches-setup
|
|
cd speaches-setup
|
|
```
|
|
|
|
**2. Create a `docker-compose.yml` file:**
|
|
```yaml
|
|
services:
|
|
speaches:
|
|
image: ghcr.io/speaches-ai/speaches:latest-cpu
|
|
container_name: speaches
|
|
ports:
|
|
- "8969:8000"
|
|
volumes:
|
|
- hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
|
|
restart: unless-stopped
|
|
|
|
volumes:
|
|
hf-hub-cache:
|
|
```
|
|
|
|
**3. Start the Speaches server:**
|
|
```bash
|
|
docker compose up -d
|
|
```
|
|
|
|
**4. Download a TTS model:**
|
|
```bash
|
|
# Wait a few seconds for the container to start
|
|
sleep 10
|
|
|
|
# Download the recommended Kokoro model
|
|
docker compose exec speaches uv tool run speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
|
|
```
|
|
|
|
**5. Test the setup:**
|
|
```bash
|
|
curl "http://localhost:8969/v1/audio/speech" -s -H "Content-Type: application/json" \
|
|
--output test.mp3 \
|
|
--data '{
|
|
"input": "Hello! This is a test of local text to speech.",
|
|
"model": "speaches-ai/Kokoro-82M-v1.0-ONNX",
|
|
"voice": "af_bella",
|
|
"speed": 1.0
|
|
}'
|
|
```
|
|
|
|
If successful, you'll have a `test.mp3` file with the generated speech!
|
|
|
|
### Configure Open Notebook
|
|
|
|
Now that Speaches is running, configure Open Notebook to use it:
|
|
|
|
**1. Set the environment variable:**
|
|
|
|
For Docker deployments:
|
|
```bash
|
|
docker run -d \
|
|
--name open-notebook \
|
|
-p 8502:8502 -p 5055:5055 \
|
|
-v ./notebook_data:/app/data \
|
|
-v ./surreal_data:/mydata \
|
|
-e OPENAI_COMPATIBLE_BASE_URL_TTS=http://host.docker.internal:8969/v1 \
|
|
lfnovo/open_notebook:v1-latest-single
|
|
```
|
|
|
|
For local development:
|
|
```bash
|
|
export OPENAI_COMPATIBLE_BASE_URL_TTS=http://localhost:8969/v1
|
|
```
|
|
|
|
**2. Add the model in Open Notebook:**
|
|
|
|
1. Go to **Settings** → **Models** page
|
|
2. Click **Add Model** in the Text-to-Speech section
|
|
3. Configure the model:
|
|
- **Provider**: `openai_compatible`
|
|
- **Model Name**: `speaches-ai/Kokoro-82M-v1.0-ONNX`
|
|
- **Display Name**: `Kokoro Local TTS` (or your preference)
|
|
4. Click **Save**
|
|
|
|
**3. Set as default (optional):**
|
|
- In Settings, set this model as your default Text-to-Speech model
|
|
- Now all podcast generation will use your local TTS
|
|
|
|
## Available Voice Models
|
|
|
|
Speaches supports various TTS models from Hugging Face. Here are some recommended options:
|
|
|
|
### Kokoro (Recommended)
|
|
- **Model ID**: `speaches-ai/Kokoro-82M-v1.0-ONNX`
|
|
- **Size**: ~500MB
|
|
- **Quality**: High
|
|
- **Speed**: Fast
|
|
- **Languages**: English
|
|
- **Voices**: `af_bella`, `af_sarah`, `am_adam`, `am_michael`, and more
|
|
|
|
### Other Models
|
|
You can use any compatible ONNX TTS model from Hugging Face. Check the [Speaches documentation](https://github.com/speaches-ai/speaches) for a complete list.
|
|
|
|
## Available Voices
|
|
|
|
The Kokoro model includes multiple voices with different characteristics:
|
|
|
|
**Female Voices:**
|
|
- `af_bella` - Clear, professional
|
|
- `af_sarah` - Warm, friendly
|
|
- `af_nicole` - Energetic, expressive
|
|
|
|
**Male Voices:**
|
|
- `am_adam` - Deep, authoritative
|
|
- `am_michael` - Friendly, conversational
|
|
- `bf_emma` - British accent, professional
|
|
- `bm_george` - British accent, formal
|
|
|
|
**Testing Voices:**
|
|
```bash
|
|
# Try different voices to find your favorite
|
|
for voice in af_bella af_sarah am_adam am_michael; do
|
|
curl "http://localhost:8969/v1/audio/speech" -s \
|
|
-H "Content-Type: application/json" \
|
|
--output "test_${voice}.mp3" \
|
|
--data "{
|
|
\"input\": \"Hello! This is a test of the ${voice} voice.\",
|
|
\"model\": \"speaches-ai/Kokoro-82M-v1.0-ONNX\",
|
|
\"voice\": \"${voice}\"
|
|
}"
|
|
done
|
|
```
|
|
|
|
## Advanced Configuration
|
|
|
|
### GPU Acceleration
|
|
|
|
For faster processing with NVIDIA GPUs:
|
|
|
|
```yaml
|
|
services:
|
|
speaches:
|
|
image: ghcr.io/speaches-ai/speaches:latest-cuda # GPU-enabled image
|
|
container_name: speaches
|
|
ports:
|
|
- "8969:8000"
|
|
volumes:
|
|
- hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
|
|
restart: unless-stopped
|
|
deploy:
|
|
resources:
|
|
reservations:
|
|
devices:
|
|
- driver: nvidia
|
|
count: 1
|
|
capabilities: [gpu]
|
|
|
|
volumes:
|
|
hf-hub-cache:
|
|
```
|
|
|
|
### Custom Port
|
|
|
|
If port 8969 is already in use, change it in docker-compose.yml:
|
|
|
|
```yaml
|
|
ports:
|
|
- "9000:8000" # Use port 9000 instead
|
|
```
|
|
|
|
Then update your environment variable:
|
|
```bash
|
|
export OPENAI_COMPATIBLE_BASE_URL_TTS=http://localhost:9000/v1
|
|
```
|
|
|
|
### Multiple Models
|
|
|
|
Download and use multiple models for different purposes:
|
|
|
|
```bash
|
|
# Download additional models
|
|
docker compose exec speaches uv tool run speaches-cli model download model-name-1
|
|
docker compose exec speaches uv tool run speaches-cli model download model-name-2
|
|
|
|
# List downloaded models
|
|
docker compose exec speaches uv tool run speaches-cli model list
|
|
```
|
|
|
|
In Open Notebook, add each model separately and choose which to use for different podcasts.
|
|
|
|
## Network Configuration
|
|
|
|
### Docker Networking
|
|
|
|
When Open Notebook runs in Docker and needs to reach Speaches:
|
|
|
|
**On macOS/Windows:**
|
|
```bash
|
|
export OPENAI_COMPATIBLE_BASE_URL_TTS=http://host.docker.internal:8969/v1
|
|
```
|
|
|
|
**On Linux:**
|
|
```bash
|
|
# Option 1: Use Docker bridge IP
|
|
export OPENAI_COMPATIBLE_BASE_URL_TTS=http://172.17.0.1:8969/v1
|
|
|
|
# Option 2: Use host networking
|
|
docker run --network host ...
|
|
```
|
|
|
|
### Remote Speaches Server
|
|
|
|
Run Speaches on a different machine for distributed processing:
|
|
|
|
```bash
|
|
# On the server machine
|
|
docker compose up -d
|
|
|
|
# Allow external connections (be careful with firewall settings)
|
|
# Update docker-compose.yml to bind to 0.0.0.0:8969
|
|
```
|
|
|
|
Then configure Open Notebook:
|
|
```bash
|
|
export OPENAI_COMPATIBLE_BASE_URL_TTS=http://server-ip:8969/v1
|
|
```
|
|
|
|
**Security Warning:** Only expose Speaches on trusted networks or use proper authentication/firewall rules.
|
|
|
|
## Podcast Generation
|
|
|
|
### Creating Podcasts with Local TTS
|
|
|
|
Once configured, use Speaches for podcast generation:
|
|
|
|
1. **Go to Podcasts page** in Open Notebook
|
|
2. **Create or edit an Episode Profile**
|
|
3. **Configure speakers:**
|
|
- For each speaker, select your Speaches model
|
|
- Choose different voices (e.g., `af_bella` for host, `am_adam` for guest)
|
|
4. **Generate podcast**
|
|
5. **Audio is generated locally** using your Speaches server
|
|
|
|
### Multi-Speaker Setup
|
|
|
|
Create natural-sounding conversations with different voices:
|
|
|
|
```
|
|
Speaker 1 (Host):
|
|
- Model: speaches-ai/Kokoro-82M-v1.0-ONNX
|
|
- Voice: af_bella
|
|
|
|
Speaker 2 (Guest):
|
|
- Model: speaches-ai/Kokoro-82M-v1.0-ONNX
|
|
- Voice: am_adam
|
|
|
|
Speaker 3 (Narrator):
|
|
- Model: speaches-ai/Kokoro-82M-v1.0-ONNX
|
|
- Voice: bf_emma
|
|
```
|
|
|
|
## Performance Optimization
|
|
|
|
### CPU Performance
|
|
|
|
**Recommended Specs:**
|
|
- 4+ CPU cores
|
|
- 4GB+ RAM
|
|
- SSD storage
|
|
|
|
**Tips:**
|
|
- Close unnecessary applications
|
|
- Use quantized models when available
|
|
- Adjust speech speed for faster generation
|
|
|
|
### Memory Management
|
|
|
|
Monitor Docker memory usage:
|
|
```bash
|
|
docker stats speaches
|
|
```
|
|
|
|
Allocate more memory if needed:
|
|
```yaml
|
|
services:
|
|
speaches:
|
|
# ... other config ...
|
|
mem_limit: 4g # Adjust based on your system
|
|
```
|
|
|
|
### Batch Processing
|
|
|
|
For generating multiple audio files, Speaches handles concurrent requests efficiently. Open Notebook automatically manages this during podcast generation.
|
|
|
|
## Troubleshooting
|
|
|
|
### Service Won't Start
|
|
|
|
**Symptom:** Container exits immediately
|
|
|
|
**Solutions:**
|
|
```bash
|
|
# Check logs
|
|
docker compose logs speaches
|
|
|
|
# Verify Docker is running
|
|
docker ps
|
|
|
|
# Check port availability
|
|
lsof -i :8969 # macOS/Linux
|
|
netstat -ano | findstr :8969 # Windows
|
|
```
|
|
|
|
---
|
|
|
|
### Connection Refused
|
|
|
|
**Symptom:** Open Notebook can't reach Speaches
|
|
|
|
**Solutions:**
|
|
1. **Verify Speaches is running:**
|
|
```bash
|
|
curl http://localhost:8969/v1/models
|
|
```
|
|
|
|
2. **Check Docker networking:**
|
|
- Use `host.docker.internal` instead of `localhost` when Open Notebook is in Docker
|
|
- Verify firewall settings
|
|
|
|
3. **Test from inside Open Notebook container:**
|
|
```bash
|
|
docker exec -it open-notebook curl http://host.docker.internal:8969/v1/models
|
|
```
|
|
|
|
---
|
|
|
|
### Model Not Found
|
|
|
|
**Symptom:** Error about missing model during generation
|
|
|
|
**Solutions:**
|
|
1. **Verify model is downloaded:**
|
|
```bash
|
|
docker compose exec speaches uv tool run speaches-cli model list
|
|
```
|
|
|
|
2. **Download the model:**
|
|
```bash
|
|
docker compose exec speaches uv tool run speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
|
|
```
|
|
|
|
3. **Check model name matches** what you configured in Open Notebook
|
|
|
|
---
|
|
|
|
### Poor Audio Quality
|
|
|
|
**Symptom:** Generated speech sounds robotic or unclear
|
|
|
|
**Solutions:**
|
|
- Try different voices
|
|
- Adjust speech speed (1.0 is normal, try 0.9-1.2)
|
|
- Use higher-quality models if available
|
|
- Check that model downloaded completely
|
|
|
|
---
|
|
|
|
### Slow Generation
|
|
|
|
**Symptom:** Audio generation takes a long time
|
|
|
|
**Solutions:**
|
|
- **Enable GPU acceleration** if you have an NVIDIA GPU
|
|
- **Use faster models** (smaller models = faster generation)
|
|
- **Adjust speech speed** to 1.5-2.0 for quicker output
|
|
- **Allocate more CPU cores** in Docker settings
|
|
- **Use SSD storage** instead of HDD
|
|
|
|
---
|
|
|
|
### Out of Memory
|
|
|
|
**Symptom:** Container crashes or system freezes
|
|
|
|
**Solutions:**
|
|
1. **Increase Docker memory limit:**
|
|
```yaml
|
|
services:
|
|
speaches:
|
|
mem_limit: 4g # Increase this value
|
|
```
|
|
|
|
2. **Use smaller models**
|
|
3. **Close other applications**
|
|
4. **Monitor with** `docker stats`
|
|
|
|
---
|
|
|
|
### Voice Not Available
|
|
|
|
**Symptom:** Requested voice doesn't work
|
|
|
|
**Solutions:**
|
|
- Check available voices for your model
|
|
- Use one of the documented voices (af_bella, am_adam, etc.)
|
|
- Verify voice name spelling (case-sensitive)
|
|
|
|
## Comparison: Local vs Cloud TTS
|
|
|
|
| Aspect | Local (Speaches) | Cloud (OpenAI/ElevenLabs) |
|
|
|--------|------------------|---------------------------|
|
|
| **Cost** | Free after setup | $15-50 per 1M characters |
|
|
| **Privacy** | Complete | Data sent to provider |
|
|
| **Speed** | Depends on hardware | Usually faster |
|
|
| **Quality** | Good (improving) | Excellent |
|
|
| **Setup** | Moderate complexity | Simple API key |
|
|
| **Offline** | Yes | No |
|
|
| **Rate Limits** | None | Yes |
|
|
| **Voices** | Limited selection | Many options |
|
|
| **Languages** | Limited | 50+ languages |
|
|
|
|
**Recommendation:**
|
|
- **Use Local** for: Privacy-sensitive content, high-volume generation, development
|
|
- **Use Cloud** for: Production podcasts, multiple languages, premium quality needs
|
|
|
|
## Best Practices
|
|
|
|
### 1. Model Management
|
|
|
|
**Download Models Ahead of Time:**
|
|
```bash
|
|
# Don't wait until generation time
|
|
docker compose exec speaches uv tool run speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
|
|
```
|
|
|
|
**Keep Models Updated:**
|
|
```bash
|
|
# Periodically check for model updates
|
|
# Remove old models to save space
|
|
docker compose exec speaches uv tool run speaches-cli model list
|
|
```
|
|
|
|
### 2. Voice Selection
|
|
|
|
**Test Before Production:**
|
|
- Generate test audio with different voices
|
|
- Choose voices that match your podcast style
|
|
- Use consistent voices for recurring speakers
|
|
|
|
**Voice Characteristics:**
|
|
- Clear pronunciation for educational content
|
|
- Expressive voices for storytelling
|
|
- Professional voices for business content
|
|
|
|
### 3. Resource Management
|
|
|
|
**Monitor System Resources:**
|
|
```bash
|
|
# Check Docker resource usage
|
|
docker stats speaches
|
|
|
|
# Monitor disk space for models
|
|
docker compose exec speaches df -h
|
|
```
|
|
|
|
**Optimize Docker:**
|
|
```yaml
|
|
# Set appropriate limits
|
|
services:
|
|
speaches:
|
|
mem_limit: 4g
|
|
cpus: 2
|
|
```
|
|
|
|
### 4. Backup Strategy
|
|
|
|
**Persist Model Cache:**
|
|
The `hf-hub-cache` volume stores downloaded models. To backup:
|
|
```bash
|
|
# List volumes
|
|
docker volume ls
|
|
|
|
# Backup volume
|
|
docker run --rm -v hf-hub-cache:/data -v $(pwd):/backup ubuntu tar czf /backup/speaches-models-backup.tar.gz /data
|
|
```
|
|
|
|
**Restore if needed:**
|
|
```bash
|
|
docker run --rm -v hf-hub-cache:/data -v $(pwd):/backup ubuntu tar xzf /backup/speaches-models-backup.tar.gz -C /
|
|
```
|
|
|
|
### 5. Testing
|
|
|
|
**Always Test First:**
|
|
```bash
|
|
# Test with short text before generating long podcasts
|
|
curl "http://localhost:8969/v1/audio/speech" -s \
|
|
-H "Content-Type: application/json" \
|
|
--output test.mp3 \
|
|
--data '{
|
|
"input": "Test",
|
|
"model": "speaches-ai/Kokoro-82M-v1.0-ONNX",
|
|
"voice": "af_bella"
|
|
}'
|
|
```
|
|
|
|
## Complete Setup Script
|
|
|
|
For quick setup, save this as `setup-speaches.sh`:
|
|
|
|
```bash
|
|
#!/bin/bash
|
|
set -e
|
|
|
|
echo "Creating Speaches setup directory..."
|
|
mkdir -p speaches-setup
|
|
cd speaches-setup
|
|
|
|
echo "Creating docker-compose.yml..."
|
|
cat > docker-compose.yml << 'EOF'
|
|
services:
|
|
speaches:
|
|
image: ghcr.io/speaches-ai/speaches:latest-cpu
|
|
container_name: speaches
|
|
ports:
|
|
- "8969:8000"
|
|
volumes:
|
|
- hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
|
|
restart: unless-stopped
|
|
|
|
volumes:
|
|
hf-hub-cache:
|
|
EOF
|
|
|
|
echo "Starting Speaches container..."
|
|
docker compose up -d
|
|
|
|
echo "Waiting for service to be ready..."
|
|
sleep 10
|
|
|
|
echo "Downloading TTS model..."
|
|
docker compose exec speaches uv tool run speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
|
|
|
|
echo "Testing speech generation..."
|
|
curl "http://localhost:8969/v1/audio/speech" -s -H "Content-Type: application/json" \
|
|
--output test-audio.mp3 \
|
|
--data '{
|
|
"input": "Hello! Speaches is now configured and ready to use with Open Notebook.",
|
|
"model": "speaches-ai/Kokoro-82M-v1.0-ONNX",
|
|
"voice": "af_bella",
|
|
"speed": 1.0
|
|
}'
|
|
|
|
echo ""
|
|
echo "✅ Setup complete!"
|
|
echo ""
|
|
echo "Next steps:"
|
|
echo "1. Test the audio file: test-audio.mp3"
|
|
echo "2. Set environment variable: export OPENAI_COMPATIBLE_BASE_URL_TTS=http://localhost:8969/v1"
|
|
echo "3. Configure in Open Notebook Settings → Models"
|
|
echo ""
|
|
echo "To stop Speaches: docker compose down"
|
|
echo "To restart: docker compose up -d"
|
|
```
|
|
|
|
Make it executable and run:
|
|
```bash
|
|
chmod +x setup-speaches.sh
|
|
./setup-speaches.sh
|
|
```
|
|
|
|
## Using Other Local TTS Servers
|
|
|
|
The principles in this guide apply to any OpenAI-compatible TTS server. When using a different solution:
|
|
|
|
1. **Start your TTS server** following its documentation
|
|
2. **Set the environment variable** to point to your server:
|
|
```bash
|
|
export OPENAI_COMPATIBLE_BASE_URL_TTS=http://your-server-url:port/v1
|
|
```
|
|
3. **Add the model in Open Notebook** using provider `openai_compatible`
|
|
4. **Use the model name** as specified by your TTS server
|
|
|
|
The key requirement is OpenAI API compatibility - specifically, the `/v1/audio/speech` endpoint.
|
|
|
|
## Getting Help
|
|
|
|
**Resources:**
|
|
- **Open Notebook Discord**: [https://discord.gg/37XJPXfz2w](https://discord.gg/37XJPXfz2w) - Get help with Open Notebook integration
|
|
- **Open Notebook Issues**: Report integration issues to Open Notebook
|
|
- **Speaches GitHub**: [https://github.com/speaches-ai/speaches](https://github.com/speaches-ai/speaches) - For Speaches-specific questions
|
|
- **Your TTS Server Documentation**: Consult the docs for your chosen TTS solution
|
|
|
|
**Common Questions:**
|
|
|
|
**Q: Can I use Speaches with multiple Open Notebook instances?**
|
|
A: Yes! Just point each instance to the same Speaches server URL.
|
|
|
|
**Q: How much disk space do I need?**
|
|
A: Each model is 300-800MB. Start with 5GB and add more as you download models.
|
|
|
|
**Q: Can I use this for commercial podcasts?**
|
|
A: Check the model's license on Hugging Face. Most open models allow commercial use.
|
|
|
|
**Q: How does quality compare to ElevenLabs or OpenAI?**
|
|
A: Local models are improving rapidly. For most use cases, quality is very good. Premium services still have an edge for the highest quality needs.
|
|
|
|
## Related Documentation
|
|
|
|
- **[OpenAI-Compatible Setup](openai-compatible.md)** - General OpenAI-compatible provider configuration
|
|
- **[AI Models Guide](ai-models.md)** - Complete AI model configuration
|
|
- **[Podcast Generation](podcasts.md)** - Learn about creating podcasts
|
|
- **[Ollama Setup](ollama.md)** - Another local AI option for language models
|
|
|
|
---
|
|
|
|
This guide should get you up and running with local text-to-speech in Open Notebook. Enjoy complete privacy and unlimited audio generation! 🎙️
|