Local Text-to-Speech Setup

Learn how to run text-to-speech models completely locally using OpenAI-compatible TTS servers, giving you full privacy control and zero ongoing costs for podcast and audio generation.

This guide uses Speaches as an example implementation, but the principles apply to any OpenAI-compatible TTS server.

Why Local Text-to-Speech?

Running text-to-speech locally offers significant advantages:

  • 🔒 Complete Privacy: Your content never leaves your machine
  • 💰 Zero Ongoing Costs: No per-character or per-minute charges
  • No Rate Limits: Generate unlimited audio without restrictions
  • 🌐 Offline Capability: Works without internet connection
  • 🎯 Full Control: Choose and customize your voice models
  • 📈 Predictable Costs: One-time setup, no surprises

Available Local TTS Solutions

Open Notebook supports any OpenAI-compatible text-to-speech server. This guide uses Speaches as an example because it's:

  • Open-source and actively maintained
  • Easy to set up with Docker
  • Compatible with OpenAI's TTS API specification
  • Able to serve multiple high-quality voice models

About Speaches

Speaches is an open-source, OpenAI-compatible text-to-speech server that runs locally on your machine. It provides:

  • OpenAI API Compatibility: Works seamlessly with Open Notebook's OpenAI-compatible provider
  • High-Quality Voices: Support for multiple neural TTS models
  • Easy Model Management: Simple CLI for downloading and managing voice models
  • Docker Support: Run in containers for easy deployment
  • Multiple Voice Options: Various voices and languages available
  • Customizable Speed: Adjust speech rate to your preference

Note: If you're using a different OpenAI-compatible TTS server, the configuration steps will be similar; just adjust the endpoints and model names accordingly.

Quick Start with Speaches

This section demonstrates setup using Speaches as an example. If you're using a different local TTS solution, adapt the steps accordingly.

Prerequisites

  • Docker installed on your system
  • At least 2GB RAM available
  • 5GB disk space for models

Basic Setup

The fastest way to get started is using our example setup:

1. Create a project directory:

mkdir speaches-setup
cd speaches-setup

2. Create a docker-compose.yml file:

services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cpu
    container_name: speaches
    ports:
      - "8969:8000"
    volumes:
      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
    restart: unless-stopped

volumes:
  hf-hub-cache:

3. Start the Speaches server:

docker compose up -d

4. Download a TTS model:

# Wait a few seconds for the container to start
sleep 10

# Download the recommended Kokoro model
docker compose exec speaches uv tool run speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
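
A fixed sleep can be too short on a slow machine or a first start. As a hedged alternative, the sketch below polls the server until it answers. wait_for_url is a hypothetical helper name, not part of Speaches, and the sketch assumes that the /v1/models endpoint (used in the troubleshooting section of this guide) responds successfully once the server is ready.

```shell
# wait_for_url URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL with curl until it returns a successful HTTP status.
# Returns 0 once the URL responds, 1 if every attempt fails.
wait_for_url() {
  local url attempts delay i
  url=$1
  attempts=${2:-30}
  delay=${3:-2}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s: silent, -f: treat HTTP errors as failures, -o: discard the body
    if curl -sf -o /dev/null --max-time 5 "$url"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (in place of the fixed sleep):
#   wait_for_url "http://localhost:8969/v1/models" || echo "Speaches did not come up in time"
```

With the defaults this waits up to roughly a minute before giving up, which leaves room for the image to finish starting.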

5. Test the setup:

curl "http://localhost:8969/v1/audio/speech" -s -H "Content-Type: application/json" \
  --output test.mp3 \
  --data '{
    "input": "Hello! This is a test of local text to speech.",
    "model": "speaches-ai/Kokoro-82M-v1.0-ONNX",
    "voice": "af_bella",
    "speed": 1.0
  }'

If successful, you'll have a test.mp3 file with the generated speech!
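
Because curl runs with -s and writes whatever the server returns into test.mp3, a failed request can leave an error message in the file instead of audio. The sketch below is one way to catch that. looks_like_error_body is a hypothetical helper, and it assumes the server reports errors as a JSON object (a body starting with "{"), which is common for OpenAI-compatible servers but not guaranteed.

```shell
# looks_like_error_body FILE
# Returns 0 when FILE is empty or starts with "{", meaning it probably holds
# a JSON error body rather than audio. Returns 1 otherwise.
looks_like_error_body() {
  [ -s "$1" ] || return 0
  case "$(head -c 1 "$1")" in
    '{') return 0 ;;
    *)   return 1 ;;
  esac
}

# Example, after running the curl command above:
#   if looks_like_error_body test.mp3; then
#     echo "Request failed:"; cat test.mp3
#   fi
```

Adding curl's --fail flag to the request is a complementary check: curl then exits with a non-zero status on HTTP errors.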

Configure Open Notebook

Now that Speaches is running, configure Open Notebook to use it:

1. Set the environment variable:

For Docker deployments:

docker run -d \
  --name open-notebook \
  -p 8502:8502 -p 5055:5055 \
  -v ./notebook_data:/app/data \
  -v ./surreal_data:/mydata \
  -e OPENAI_COMPATIBLE_BASE_URL_TTS=http://host.docker.internal:8969/v1 \
  lfnovo/open_notebook:v1-latest-single

For local development:

export OPENAI_COMPATIBLE_BASE_URL_TTS=http://localhost:8969/v1

2. Add the model in Open Notebook:

  1. Go to the Settings → Models page
  2. Click Add Model in the Text-to-Speech section
  3. Configure the model:
    • Provider: openai_compatible
    • Model Name: speaches-ai/Kokoro-82M-v1.0-ONNX
    • Display Name: Kokoro Local TTS (or your preference)
  4. Click Save

3. Set as default (optional):

  • In Settings, set this model as your default Text-to-Speech model
  • Now all podcast generation will use your local TTS

Available Voice Models

Speaches supports various TTS models from Hugging Face. The recommended option is Kokoro:

  • Model ID: speaches-ai/Kokoro-82M-v1.0-ONNX
  • Size: ~500MB
  • Quality: High
  • Speed: Fast
  • Languages: English
  • Voices: af_bella, af_sarah, am_adam, am_michael, and more

Other Models

You can use any compatible ONNX TTS model from Hugging Face. Check the Speaches documentation for a complete list.

Available Voices

The Kokoro model includes multiple voices with different characteristics:

Female Voices:

  • af_bella - Clear, professional
  • af_sarah - Warm, friendly
  • af_nicole - Energetic, expressive

Male Voices:

  • am_adam - Deep, authoritative
  • am_michael - Friendly, conversational

British Voices:

  • bf_emma - Female, British accent, professional
  • bm_george - Male, British accent, formal

Testing Voices:

# Try different voices to find your favorite
for voice in af_bella af_sarah am_adam am_michael; do
  curl "http://localhost:8969/v1/audio/speech" -s \
    -H "Content-Type: application/json" \
    --output "test_${voice}.mp3" \
    --data "{
      \"input\": \"Hello! This is a test of the ${voice} voice.\",
      \"model\": \"speaches-ai/Kokoro-82M-v1.0-ONNX\",
      \"voice\": \"${voice}\"
    }"
done
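
Hand-escaping the JSON body works for fixed strings but breaks as soon as the text contains a double quote. The sketch below builds the request body with a small escaping step. json_escape and tts_payload are hypothetical helper names, and the escaping covers only backslashes and double quotes, so it assumes single-line input without control characters.

```shell
TTS_MODEL="speaches-ai/Kokoro-82M-v1.0-ONNX"

# Escape backslashes and double quotes so a string can sit inside JSON quotes.
# Control characters such as newlines are NOT handled; keep input single-line.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# tts_payload TEXT VOICE [SPEED] -> prints the JSON request body
tts_payload() {
  printf '{"input":"%s","model":"%s","voice":"%s","speed":%s}' \
    "$(json_escape "$1")" "$TTS_MODEL" "$(json_escape "$2")" "${3:-1.0}"
}

# Example:
#   curl "http://localhost:8969/v1/audio/speech" -s \
#     -H "Content-Type: application/json" \
#     --output quoted.mp3 \
#     --data "$(tts_payload 'She said "hello" twice.' af_bella)"
```

For arbitrary text, a real JSON tool such as jq is the more robust choice; this version only avoids the extra dependency.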

Advanced Configuration

GPU Acceleration

For faster processing with NVIDIA GPUs:

services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cuda  # GPU-enabled image
    container_name: speaches
    ports:
      - "8969:8000"
    volumes:
      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  hf-hub-cache:

Custom Port

If port 8969 is already in use, change it in docker-compose.yml:

ports:
  - "9000:8000"  # Use port 9000 instead

Then update your environment variable:

export OPENAI_COMPATIBLE_BASE_URL_TTS=http://localhost:9000/v1

Multiple Models

Download and use multiple models for different purposes:

# Download additional models
docker compose exec speaches uv tool run speaches-cli model download model-name-1
docker compose exec speaches uv tool run speaches-cli model download model-name-2

# List downloaded models
docker compose exec speaches uv tool run speaches-cli model list

In Open Notebook, add each model separately and choose which to use for different podcasts.

Network Configuration

Docker Networking

When Open Notebook runs in Docker and needs to reach Speaches:

On macOS/Windows:

export OPENAI_COMPATIBLE_BASE_URL_TTS=http://host.docker.internal:8969/v1

On Linux:

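# Option 1: Use the Docker bridge IP (172.17.0.1 is the default; confirm with: ip addr show docker0)
export OPENAI_COMPATIBLE_BASE_URL_TTS=http://172.17.0.1:8969/v1

# Option 2: Use host networking (the container then reaches Speaches at http://localhost:8969/v1)
docker run --network host ...

# Option 3: Keep host.docker.internal by mapping it to the host gateway (Docker 20.10+)
docker run --add-host=host.docker.internal:host-gateway ...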
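
If the same startup script runs on several platforms, the host name can be chosen from the operating system name. The sketch below is one way to do that. tts_host_for_os is a hypothetical helper, and it assumes the default Docker bridge address on Linux and Docker Desktop elsewhere; environments such as WSL with Docker Desktop report Linux and may need the other value.

```shell
# tts_host_for_os OS_NAME -> prints the host name a container can use
# to reach a Speaches server running on the Docker host.
tts_host_for_os() {
  case "$1" in
    Linux) printf '%s\n' "172.17.0.1" ;;            # default docker0 bridge address
    *)     printf '%s\n' "host.docker.internal" ;;  # Docker Desktop (macOS/Windows)
  esac
}

# Example:
#   export OPENAI_COMPATIBLE_BASE_URL_TTS="http://$(tts_host_for_os "$(uname -s)"):8969/v1"
```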

Remote Speaches Server

Run Speaches on a different machine for distributed processing:

# On the server machine
docker compose up -d

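# Allow external connections (be careful with firewall settings)
# The "8969:8000" port mapping already publishes on all interfaces (0.0.0.0) by default,
# so it is usually enough to open port 8969 in the server's firewall for trusted clients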

Then configure Open Notebook:

export OPENAI_COMPATIBLE_BASE_URL_TTS=http://server-ip:8969/v1

Security Warning: Only expose Speaches on trusted networks or use proper authentication/firewall rules.

Podcast Generation

Creating Podcasts with Local TTS

Once configured, use Speaches for podcast generation:

  1. Go to Podcasts page in Open Notebook
  2. Create or edit an Episode Profile
  3. Configure speakers:
    • For each speaker, select your Speaches model
    • Choose different voices (e.g., af_bella for host, am_adam for guest)
  4. Generate podcast
  5. Audio is generated locally using your Speaches server

Multi-Speaker Setup

Create natural-sounding conversations with different voices:

Speaker 1 (Host):
- Model: speaches-ai/Kokoro-82M-v1.0-ONNX
- Voice: af_bella

Speaker 2 (Guest):
- Model: speaches-ai/Kokoro-82M-v1.0-ONNX
- Voice: am_adam

Speaker 3 (Narrator):
- Model: speaches-ai/Kokoro-82M-v1.0-ONNX
- Voice: bf_emma

Performance Optimization

CPU Performance

Recommended Specs:

  • 4+ CPU cores
  • 4GB+ RAM
  • SSD storage

Tips:

  • Close unnecessary applications
  • Use quantized models when available
  • Adjust speech speed for faster generation

Memory Management

Monitor Docker memory usage:

docker stats speaches

Allocate more memory if needed:

services:
  speaches:
    # ... other config ...
    mem_limit: 4g  # Adjust based on your system

Batch Processing

For generating multiple audio files, Speaches handles concurrent requests efficiently. Open Notebook automatically manages this during podcast generation.
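
To illustrate what concurrent requests look like outside Open Notebook, the sketch below starts one background job per text segment and waits for all of them. generate_segments and speaches_synth are hypothetical helpers, the output naming scheme is an assumption, and this is not how Open Notebook itself schedules its requests.

```shell
# generate_segments SYNTH_CMD OUTDIR TEXT...
# Runs SYNTH_CMD "TEXT" "OUTDIR/segment_NNN.mp3" in the background for each
# text, then waits for every job. Returns 1 if any job failed.
generate_segments() {
  local synth outdir i pids pid rc text
  synth=$1
  outdir=$2
  shift 2
  i=0
  pids=""
  rc=0
  for text in "$@"; do
    i=$((i + 1))
    "$synth" "$text" "$(printf '%s/segment_%03d.mp3' "$outdir" "$i")" &
    pids="$pids $!"
  done
  for pid in $pids; do
    wait "$pid" || rc=1
  done
  return "$rc"
}

# Hypothetical synthesizer that calls the endpoint used throughout this guide.
# The text is embedded in the JSON as-is, so it must not contain quotes or backslashes.
speaches_synth() {
  curl -sf "http://localhost:8969/v1/audio/speech" \
    -H "Content-Type: application/json" \
    --output "$2" \
    --data "{\"input\": \"$1\", \"model\": \"speaches-ai/Kokoro-82M-v1.0-ONNX\", \"voice\": \"af_bella\"}"
}

# Example:
#   mkdir -p out && generate_segments speaches_synth out "First line." "Second line."
```

Passing the synthesizer as an argument keeps the job handling separate from the HTTP call, so the same loop can be tried with a stub before pointing it at a real server.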

Troubleshooting

Service Won't Start

Symptom: Container exits immediately

Solutions:

# Check logs
docker compose logs speaches

# Verify Docker is running
docker ps

# Check port availability
lsof -i :8969  # macOS/Linux
netstat -ano | findstr :8969  # Windows

Connection Refused

Symptom: Open Notebook can't reach Speaches

Solutions:

  1. Verify Speaches is running:

    curl http://localhost:8969/v1/models
    
  2. Check Docker networking:

    • Use host.docker.internal instead of localhost when Open Notebook is in Docker
    • Verify firewall settings
  3. Test from inside Open Notebook container:

    docker exec -it open-notebook curl http://host.docker.internal:8969/v1/models
    

Model Not Found

Symptom: Error about missing model during generation

Solutions:

  1. Verify model is downloaded:

    docker compose exec speaches uv tool run speaches-cli model list
    
  2. Download the model:

    docker compose exec speaches uv tool run speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX
    
  3. Check model name matches what you configured in Open Notebook
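
The name check in step 3 can also be scripted. The sketch below looks for a model ID in the server's model listing. model_listed is a hypothetical helper, and it assumes that /v1/models returns an OpenAI-style JSON list in which each model ID appears as a quoted string; it is a plain text match, not a JSON parser.

```shell
# model_listed MODEL_ID
# Reads a /v1/models response on stdin and returns 0 if MODEL_ID appears
# in it as a quoted string, 1 otherwise.
model_listed() {
  grep -Fq "\"$1\""
}

# Example:
#   curl -s http://localhost:8969/v1/models | model_listed "speaches-ai/Kokoro-82M-v1.0-ONNX" \
#     || echo "Model name does not match anything the server reports"
```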


Poor Audio Quality

Symptom: Generated speech sounds robotic or unclear

Solutions:

  • Try different voices
  • Adjust speech speed (1.0 is normal, try 0.9-1.2)
  • Use higher-quality models if available
  • Check that model downloaded completely

Slow Generation

Symptom: Audio generation takes a long time

Solutions:

  • Enable GPU acceleration if you have an NVIDIA GPU
  • Use faster models (smaller models = faster generation)
  • Raise speech speed to 1.5-2.0 for shorter audio and quicker output (this noticeably changes how the speech sounds)
  • Allocate more CPU cores in Docker settings
  • Use SSD storage instead of HDD

Out of Memory

Symptom: Container crashes or system freezes

Solutions:

  1. Increase Docker memory limit:

    services:
      speaches:
        mem_limit: 4g  # Increase this value
    
  2. Use smaller models

  3. Close other applications

  4. Monitor with docker stats


Voice Not Available

Symptom: Requested voice doesn't work

Solutions:

  • Check available voices for your model
  • Use one of the documented voices (af_bella, am_adam, etc.)
  • Verify voice name spelling (case-sensitive)

Comparison: Local vs Cloud TTS

| Aspect | Local (Speaches) | Cloud (OpenAI/ElevenLabs) |
|---|---|---|
| Cost | Free after setup | $15-50 per 1M characters |
| Privacy | Complete | Data sent to provider |
| Speed | Depends on hardware | Usually faster |
| Quality | Good (improving) | Excellent |
| Setup | Moderate complexity | Simple API key |
| Offline | Yes | No |
| Rate Limits | None | Yes |
| Voices | Limited selection | Many options |
| Languages | Limited | 50+ languages |

Recommendation:

  • Use Local for: Privacy-sensitive content, high-volume generation, development
  • Use Cloud for: Production podcasts, multiple languages, premium quality needs

Best Practices

1. Model Management

Download Models Ahead of Time:

# Don't wait until generation time
docker compose exec speaches uv tool run speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX

Keep Models Updated:

# Periodically check for model updates
# Remove old models to save space
docker compose exec speaches uv tool run speaches-cli model list

2. Voice Selection

Test Before Production:

  • Generate test audio with different voices
  • Choose voices that match your podcast style
  • Use consistent voices for recurring speakers

Voice Characteristics:

  • Clear pronunciation for educational content
  • Expressive voices for storytelling
  • Professional voices for business content

3. Resource Management

Monitor System Resources:

# Check Docker resource usage
docker stats speaches

# Monitor disk space for models
docker compose exec speaches df -h
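
For unattended runs, the memory check can be turned into a simple threshold test. The sketch below parses the percentage that docker stats prints through its MemPerc format field. mem_over_threshold is a hypothetical helper and the 90 percent limit in the example is an arbitrary choice.

```shell
# mem_over_threshold PERCENT_STRING LIMIT
# PERCENT_STRING looks like "85.32%" (the MemPerc column of docker stats).
# Returns 0 when the integer part is at or above LIMIT, 1 otherwise.
mem_over_threshold() {
  local pct
  pct=${1%"%"}    # drop the trailing percent sign
  pct=${pct%%.*}  # keep the integer part
  [ "$pct" -ge "$2" ]
}

# Example:
#   usage=$(docker stats --no-stream --format '{{.MemPerc}}' speaches)
#   if mem_over_threshold "$usage" 90; then
#     echo "Speaches is close to its memory limit"
#   fi
```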

Optimize Docker:

# Set appropriate limits
services:
  speaches:
    mem_limit: 4g
    cpus: 2

4. Backup Strategy

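Persist Model Cache: The hf-hub-cache volume stores downloaded models. Docker Compose prefixes volume names with the project name (by default the directory name), so with the setup in this guide the volume is called speaches-setup_hf-hub-cache. To back it up:

# List volumes and confirm the exact name
docker volume ls

# Back up the volume
docker run --rm -v speaches-setup_hf-hub-cache:/data -v "$(pwd)":/backup ubuntu tar czf /backup/speaches-models-backup.tar.gz /data

Restore if needed:

docker run --rm -v speaches-setup_hf-hub-cache:/data -v "$(pwd)":/backup ubuntu tar xzf /backup/speaches-models-backup.tar.gz -C /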

5. Testing

Always Test First:

# Test with short text before generating long podcasts
curl "http://localhost:8969/v1/audio/speech" -s \
  -H "Content-Type: application/json" \
  --output test.mp3 \
  --data '{
    "input": "Test",
    "model": "speaches-ai/Kokoro-82M-v1.0-ONNX",
    "voice": "af_bella"
  }'

Complete Setup Script

For quick setup, save this as setup-speaches.sh:

#!/bin/bash
set -e

echo "Creating Speaches setup directory..."
mkdir -p speaches-setup
cd speaches-setup

echo "Creating docker-compose.yml..."
cat > docker-compose.yml << 'EOF'
services:
  speaches:
    image: ghcr.io/speaches-ai/speaches:latest-cpu
    container_name: speaches
    ports:
      - "8969:8000"
    volumes:
      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
    restart: unless-stopped

volumes:
  hf-hub-cache:
EOF

echo "Starting Speaches container..."
docker compose up -d

echo "Waiting for service to be ready..."
sleep 10

echo "Downloading TTS model..."
docker compose exec speaches uv tool run speaches-cli model download speaches-ai/Kokoro-82M-v1.0-ONNX

echo "Testing speech generation..."
curl "http://localhost:8969/v1/audio/speech" -s -H "Content-Type: application/json" \
  --output test-audio.mp3 \
  --data '{
    "input": "Hello! Speaches is now configured and ready to use with Open Notebook.",
    "model": "speaches-ai/Kokoro-82M-v1.0-ONNX",
    "voice": "af_bella",
    "speed": 1.0
  }'

echo ""
echo "✅ Setup complete!"
echo ""
echo "Next steps:"
echo "1. Test the audio file: test-audio.mp3"
echo "2. Set environment variable: export OPENAI_COMPATIBLE_BASE_URL_TTS=http://localhost:8969/v1"
echo "3. Configure in Open Notebook Settings → Models"
echo ""
echo "To stop Speaches: docker compose down"
echo "To restart: docker compose up -d"

Make it executable and run:

chmod +x setup-speaches.sh
./setup-speaches.sh

Using Other Local TTS Servers

The principles in this guide apply to any OpenAI-compatible TTS server. When using a different solution:

  1. Start your TTS server following its documentation
  2. Set the environment variable to point to your server:
    export OPENAI_COMPATIBLE_BASE_URL_TTS=http://your-server-url:port/v1
    
  3. Add the model in Open Notebook using provider openai_compatible
  4. Use the model name as specified by your TTS server

The key requirement is OpenAI API compatibility - specifically, the /v1/audio/speech endpoint.
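
A quick way to check a candidate server is to send one request to that endpoint and look at the HTTP status. The sketch below derives the endpoint from the base URL. speech_endpoint is a hypothetical helper, and the model and voice in the example are placeholders to replace with values your server actually provides.

```shell
# speech_endpoint BASE_URL -> prints BASE_URL with /audio/speech appended,
# tolerating a trailing slash on the base URL.
speech_endpoint() {
  printf '%s/audio/speech\n' "${1%/}"
}

# Example probe (prints only the HTTP status code):
#   curl -s -o /dev/null -w '%{http_code}\n' \
#     "$(speech_endpoint "$OPENAI_COMPATIBLE_BASE_URL_TTS")" \
#     -H "Content-Type: application/json" \
#     --data '{"input": "Test", "model": "your-model", "voice": "your-voice"}'
```

Note that the base URL already ends in /v1, so only /audio/speech is appended.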

Getting Help

Common Questions:

Q: Can I use Speaches with multiple Open Notebook instances?
A: Yes! Just point each instance to the same Speaches server URL.

Q: How much disk space do I need?
A: Each model is 300-800MB. Start with 5GB and add more as you download models.

Q: Can I use this for commercial podcasts?
A: Check the model's license on Hugging Face. Most open models allow commercial use.

Q: How does quality compare to ElevenLabs or OpenAI?
A: Local models are improving rapidly. For most use cases, quality is very good. Premium services still have an edge for the highest quality needs.


This guide should get you up and running with local text-to-speech in Open Notebook. Enjoy complete privacy and unlimited audio generation! 🎙️