From 213775810335b839f6a91c7880eafdba57937496 Mon Sep 17 00:00:00 2001
From: Nikita
Date: Mon, 2 Feb 2026 00:54:49 +0300
Subject: [PATCH] docs: add local STT setup guide (#521)

Add comprehensive documentation for setting up local speech-to-text
using Speaches with faster-whisper. Includes model recommendations, GPU
acceleration, Docker networking, and troubleshooting.

Also updates related docs with cross-references to the new guide.

Co-authored-by: Claude Opus 4.5
---
 docs/5-CONFIGURATION/ai-providers.md      |   4 +-
 docs/5-CONFIGURATION/index.md             |   6 +
 docs/5-CONFIGURATION/local-stt.md         | 368 ++++++++++++++++++++++
 docs/5-CONFIGURATION/local-tts.md         |   1 +
 docs/5-CONFIGURATION/openai-compatible.md |   1 +
 5 files changed, 379 insertions(+), 1 deletion(-)
 create mode 100644 docs/5-CONFIGURATION/local-stt.md

diff --git a/docs/5-CONFIGURATION/ai-providers.md b/docs/5-CONFIGURATION/ai-providers.md
index 9ca7aa3..e94443b 100644
--- a/docs/5-CONFIGURATION/ai-providers.md
+++ b/docs/5-CONFIGURATION/ai-providers.md
@@ -460,7 +460,7 @@ Use OpenAI
 - Ollama: Free, but local
 - OpenRouter: many open source models very accessible
 
-**For privacy-first:** Ollama or LM Studio and [Speaches](local-tts.md)
+**For privacy-first:** Ollama or LM Studio and Speaches ([TTS](local-tts.md), [STT](local-stt.md))
 - Everything stays local
 - Works offline
 - No API keys sent anywhere
@@ -491,4 +491,6 @@ Done!
 - **[Advanced Configuration](advanced.md)** - Timeouts, SSL, performance tuning
 - **[Ollama Setup](ollama.md)** - Detailed Ollama configuration guide
 - **[OpenAI-Compatible](openai-compatible.md)** - LM Studio and other compatible providers
+- **[Local TTS Setup](local-tts.md)** - Text-to-speech with Speaches
+- **[Local STT Setup](local-stt.md)** - Speech-to-text with Speaches
 - **[Troubleshooting](../6-TROUBLESHOOTING/quick-fixes.md)** - Common issues and fixes
diff --git a/docs/5-CONFIGURATION/index.md b/docs/5-CONFIGURATION/index.md
index 38f7eaa..561e133 100644
--- a/docs/5-CONFIGURATION/index.md
+++ b/docs/5-CONFIGURATION/index.md
@@ -214,6 +214,12 @@ AZURE_OPENAI_API_VERSION=2024-12-01-preview
 - Voice options
 - Docker networking
 
+### [Local STT](local-stt.md)
+- Speaches setup for local speech-to-text
+- Whisper model options
+- GPU acceleration
+- Docker networking
+
 ### [Ollama](ollama.md)
 - Setting up and pointing to an Ollama server
 - Downloading models
diff --git a/docs/5-CONFIGURATION/local-stt.md b/docs/5-CONFIGURATION/local-stt.md
new file mode 100644
index 0000000..3e6d313
--- /dev/null
+++ b/docs/5-CONFIGURATION/local-stt.md
@@ -0,0 +1,368 @@
+# Local Speech-to-Text Setup
+
+Run speech-to-text locally for free, private audio/video transcription using OpenAI-compatible STT servers.
+
+---
+
+## Why Local STT?
+
+| Benefit | Description |
+|---------|-------------|
+| **Free** | No per-minute costs after setup |
+| **Private** | Audio never leaves your machine |
+| **Unlimited** | No rate limits or quotas |
+| **Offline** | Works without internet |
+
+---
+
+## Quick Start with Speaches
+
+[Speaches](https://github.com/speaches-ai/speaches) is an open-source, OpenAI-compatible server that supports both TTS and STT. It uses [faster-whisper](https://github.com/SYSTRAN/faster-whisper) for transcription.
+
+### Step 1: Create Docker Compose File
+
+Create a folder and add `docker-compose.yml`:
+
+```yaml
+services:
+  speaches:
+    image: ghcr.io/speaches-ai/speaches:latest-cpu
+    container_name: speaches
+    ports:
+      - "8969:8000"
+    volumes:
+      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
+    restart: unless-stopped
+
+volumes:
+  hf-hub-cache:
+```
+
+### Step 2: Start and Download Model
+
+```bash
+# Start Speaches
+docker compose up -d
+
+# Wait for startup
+sleep 10
+
+# Download Whisper model (~500MB for small)
+docker compose exec speaches uv tool run speaches-cli model download Systran/faster-whisper-small
+```
+
+Models can also be downloaded automatically on first use, but pre-downloading avoids delays.
+
+### Step 3: Test
+
+```bash
+# Create a test audio file (or use your own)
+# Then transcribe it:
+curl "http://localhost:8969/v1/audio/transcriptions" \
+  -F "file=@test.mp3" \
+  -F "model=Systran/faster-whisper-small"
+```
+
+You should see the transcribed text in the response.
+
+### Step 4: Configure Open Notebook
+
+**Docker deployment:**
+```yaml
+# In your Open Notebook docker-compose.yml
+environment:
+  - OPENAI_COMPATIBLE_BASE_URL_STT=http://host.docker.internal:8969/v1
+```
+
+**Local development:**
+```bash
+export OPENAI_COMPATIBLE_BASE_URL_STT=http://localhost:8969/v1
+```
+
+### Step 5: Add Model in Open Notebook
+
+1. Go to **Settings** → **Models**
+2. Click **Add Model** in Speech-to-Text section
+3. Configure:
+   - **Provider**: `openai_compatible`
+   - **Model Name**: `Systran/faster-whisper-small`
+   - **Display Name**: `Local Whisper`
+4. Click **Save**
+5. Set as default if desired
+
+---
+
+## Available Models
+
+Speaches supports various Whisper model sizes.
Larger models are more accurate but slower:
+
+| Model | Size | Speed | Accuracy | VRAM (GPU) |
+|-------|------|-------|----------|------------|
+| `Systran/faster-whisper-tiny` | ~75 MB | Fastest | Basic | ~1 GB |
+| `Systran/faster-whisper-base` | ~150 MB | Fast | Good | ~1 GB |
+| `Systran/faster-whisper-small` | ~500 MB | Medium | Better | ~2 GB |
+| `Systran/faster-whisper-medium` | ~1.5 GB | Slow | Great | ~5 GB |
+| `Systran/faster-whisper-large-v3` | ~3 GB | Slowest | Best | ~10 GB |
+| `Systran/faster-distil-whisper-small.en` | ~400 MB | Fast | Good (English only) | ~2 GB |
+
+### List Available Models
+
+```bash
+docker compose exec speaches uv tool run speaches-cli registry ls --task automatic-speech-recognition
+```
+
+### Recommended Models
+
+- **For speed**: `Systran/faster-whisper-tiny` or `Systran/faster-whisper-base`
+- **For balance**: `Systran/faster-whisper-small` (recommended)
+- **For accuracy**: `Systran/faster-whisper-large-v3`
+
+---
+
+## GPU Acceleration
+
+For faster transcription with NVIDIA GPUs:
+
+```yaml
+services:
+  speaches:
+    image: ghcr.io/speaches-ai/speaches:latest-cuda
+    container_name: speaches
+    ports:
+      - "8969:8000"
+    volumes:
+      - hf-hub-cache:/home/ubuntu/.cache/huggingface/hub
+    environment:
+      - WHISPER__TTL=-1  # Keep model in VRAM (recommended if you have enough memory)
+    restart: unless-stopped
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: 1
+              capabilities: [gpu]
+
+volumes:
+  hf-hub-cache:
+```
+
+### Keep Model in Memory
+
+By default, Speaches unloads models after some time. To keep the Whisper model loaded for instant transcription:
+
+```yaml
+environment:
+  - WHISPER__TTL=-1  # Never unload
+```
+
+This is recommended if you have enough RAM/VRAM, as loading the model can take a few seconds.
+
+---
+
+## Docker Networking
+
+### Open Notebook in Docker (macOS/Windows)
+
+```bash
+OPENAI_COMPATIBLE_BASE_URL_STT=http://host.docker.internal:8969/v1
+```
+
+### Open Notebook in Docker (Linux)
+
+```bash
+# Option 1: Docker bridge IP
+OPENAI_COMPATIBLE_BASE_URL_STT=http://172.17.0.1:8969/v1
+
+# Option 2: Host networking
+docker run --network host ...
+```
+
+### Remote Server
+
+Run Speaches on a different machine:
+
+```bash
+# On server, bind to all interfaces
+# Then in Open Notebook:
+OPENAI_COMPATIBLE_BASE_URL_STT=http://server-ip:8969/v1
+```
+
+---
+
+## Language Support
+
+Whisper supports 99+ languages. Specify the language for better accuracy:
+
+```bash
+curl "http://localhost:8969/v1/audio/transcriptions" \
+  -F "file=@audio.mp3" \
+  -F "model=Systran/faster-whisper-small" \
+  -F "language=ru"
+```
+
+Common language codes:
+- `en` - English
+- `ru` - Russian
+- `es` - Spanish
+- `fr` - French
+- `de` - German
+- `zh` - Chinese
+- `ja` - Japanese
+
+---
+
+## Troubleshooting
+
+### Service Won't Start
+
+```bash
+# Check logs
+docker compose logs speaches
+
+# Verify port available
+lsof -i :8969
+
+# Restart
+docker compose down && docker compose up -d
+```
+
+### Connection Refused
+
+```bash
+# Test Speaches is running
+curl http://localhost:8969/v1/models
+
+# From inside Open Notebook container
+docker exec -it open-notebook curl http://host.docker.internal:8969/v1/models
+```
+
+### Model Download Fails
+
+Models are downloaded automatically on first use.
If download fails:
+
+```bash
+# Check available disk space
+df -h
+
+# Check Docker logs for errors
+docker compose logs speaches
+
+# Restart and try again
+docker compose restart speaches
+```
+
+### Poor Transcription Quality
+
+- Use a larger model (`faster-whisper-medium` or `large-v3`)
+- Specify the correct language
+- Ensure audio quality is good (clear speech, minimal background noise)
+- Try different audio formats (WAV often works better than MP3)
+
+### Slow Transcription
+
+| Solution | How |
+|----------|-----|
+| Use GPU | Switch to `latest-cuda` image |
+| Smaller model | Use `faster-whisper-tiny` or `base` |
+| More CPU | Allocate more cores in Docker |
+| SSD storage | Move Docker volumes to SSD |
+
+---
+
+## Performance Tips
+
+### Recommended Specs
+
+| Component | Minimum | Recommended |
+|-----------|---------|-------------|
+| CPU | 2 cores | 4+ cores |
+| RAM | 2 GB | 8+ GB |
+| Storage | 5 GB | 10 GB (for multiple models) |
+| GPU | None | NVIDIA (optional, much faster) |
+
+### Resource Limits
+
+```yaml
+services:
+  speaches:
+    # ... other config
+    mem_limit: 4g
+    cpus: 2
+```
+
+### Monitor Usage
+
+```bash
+docker stats speaches
+```
+
+---
+
+## Comparison: Local vs Cloud
+
+| Aspect | Local (Speaches) | Cloud (OpenAI Whisper) |
+|--------|------------------|------------------------|
+| **Cost** | Free | $0.006/min |
+| **Privacy** | Complete | Data sent to provider |
+| **Speed** | Depends on hardware | Usually faster |
+| **Quality** | Excellent (same Whisper) | Excellent |
+| **Setup** | Moderate | Simple API key |
+| **Offline** | Yes | No |
+| **Languages** | 99+ | 99+ |
+
+### When to Use Local
+
+- Privacy-sensitive content
+- High-volume transcription
+- Development/testing
+- Offline environments
+- Cost control
+
+### When to Use Cloud
+
+- Limited hardware
+- Time-sensitive projects
+- No GPU available
+- Simple setup preferred
+
+---
+
+## Using Both TTS and STT
+
+Speaches supports both TTS and STT in one server.
Configure both:
+
+```bash
+# Same server for both
+OPENAI_COMPATIBLE_BASE_URL_TTS=http://localhost:8969/v1
+OPENAI_COMPATIBLE_BASE_URL_STT=http://localhost:8969/v1
+```
+
+See **[Local TTS Setup](local-tts.md)** for TTS configuration.
+
+---
+
+## Other Local STT Options
+
+Any OpenAI-compatible STT server works:
+
+| Server | Description |
+|--------|-------------|
+| [Speaches](https://github.com/speaches-ai/speaches) | TTS + STT in one (recommended) |
+| [faster-whisper-server](https://github.com/fedirz/faster-whisper-server) | Lightweight STT only |
+| [whisper.cpp](https://github.com/ggerganov/whisper.cpp) | C++ implementation with server mode |
+| [LocalAI](https://github.com/mudler/LocalAI) | Multi-model local AI server |
+
+The key requirements:
+
+1. Server implements `/v1/audio/transcriptions` endpoint
+2. Set `OPENAI_COMPATIBLE_BASE_URL_STT` to server URL
+3. Add model with provider `openai_compatible`
+
+---
+
+## Related
+
+- **[Local TTS Setup](local-tts.md)** - Text-to-speech with Speaches
+- **[OpenAI-Compatible Providers](openai-compatible.md)** - General compatible provider setup
+- **[AI Providers](ai-providers.md)** - All provider configuration
\ No newline at end of file
diff --git a/docs/5-CONFIGURATION/local-tts.md b/docs/5-CONFIGURATION/local-tts.md
index 8c39cfc..20547b7 100644
--- a/docs/5-CONFIGURATION/local-tts.md
+++ b/docs/5-CONFIGURATION/local-tts.md
@@ -334,6 +334,7 @@ Any OpenAI-compatible TTS server works.
The key is:
 
 ## Related
 
+- **[Local STT Setup](local-stt.md)** - Speech-to-text with Speaches
 - **[OpenAI-Compatible Providers](openai-compatible.md)** - General compatible provider setup
 - **[AI Providers](ai-providers.md)** - All provider configuration
 - **[Creating Podcasts](../3-USER-GUIDE/creating-podcasts.md)** - Using TTS for podcasts
diff --git a/docs/5-CONFIGURATION/openai-compatible.md b/docs/5-CONFIGURATION/openai-compatible.md
index 1789c36..e16f803 100644
--- a/docs/5-CONFIGURATION/openai-compatible.md
+++ b/docs/5-CONFIGURATION/openai-compatible.md
@@ -392,5 +392,6 @@ Use OpenAI-compatible when:
 ## Related
 
 - **[Local TTS Setup](local-tts.md)** - Text-to-speech with Speaches
+- **[Local STT Setup](local-stt.md)** - Speech-to-text with Speaches
 - **[AI Providers](ai-providers.md)** - All provider options
 - **[Ollama Setup](ollama.md)** - Native Ollama integration