diff --git a/BENCHMARK.md b/BENCHMARK.md index df1293f..3fcb30a 100644 --- a/BENCHMARK.md +++ b/BENCHMARK.md @@ -1,7 +1,7 @@ # WhisperLiveKit Benchmark Report -Benchmark comparing all supported ASR backends and streaming policies on Apple Silicon, -using the full AudioProcessor pipeline (the same path audio takes in production via WebSocket). +Benchmark comparing all supported ASR backends, streaming policies, and model sizes on Apple Silicon. +All tests run through the full AudioProcessor pipeline (same code path as production WebSocket). ## Test Environment @@ -12,9 +12,8 @@ using the full AudioProcessor pipeline (the same path audio takes in production | Python | 3.13 | | faster-whisper | 1.2.1 | | mlx-whisper | installed (via mlx) | -| Voxtral (HF) | transformers-based | | Voxtral MLX | native MLX backend | -| Model size | `base` (default for whisper backends) | +| Voxtral (HF) | transformers-based | | VAC (Silero VAD) | enabled unless noted | | Chunk size | 100 ms | | Pacing | no-realtime (as fast as possible) | @@ -25,50 +24,80 @@ using the full AudioProcessor pipeline (the same path audio takes in production |------|----------|----------|----------|-------------| | `00_00_07_english_1_speaker.wav` | 7.2 s | English | 1 | Short dictation with pauses | | `00_00_16_french_1_speaker.wav` | 16.3 s | French | 1 | French speech with intentional silence gaps | -| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation about transcription | +| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation | -All files have hand-verified ground truth transcripts (`.transcript.json`) with per-word timestamps. +Ground truth transcripts (`.transcript.json`) with per-word timestamps are hand-verified. --- -## Results Overview +## Results -### English - Short (7.2 s, 1 speaker) +### English -- Short (7.2 s, 1 speaker) -| Backend | Policy | RTF | WER | Timestamp MAE | -|---------|--------|-----|-----|---------------| -| faster-whisper | LocalAgreement | 0.20x | 21.1% | 0.080 s | -| faster-whisper | SimulStreaming | 0.14x | 0.0% | 0.239 s | -| mlx-whisper | LocalAgreement | 0.05x | 21.1% | 0.080 s | -| mlx-whisper | SimulStreaming | 0.14x | 10.5% | 0.245 s | -| voxtral-mlx | voxtral | 0.32x | 0.0% | 0.254 s | -| voxtral (HF) | voxtral | 1.29x | 0.0% | 1.876 s | +| Backend | Policy | Model | RTF | WER | Timestamp MAE | +|---------|--------|-------|-----|-----|---------------| +| faster-whisper | LocalAgreement | base | 0.20x | 21.1% | 0.080 s | +| faster-whisper | SimulStreaming | base | 0.14x | 0.0% | 0.239 s | +| faster-whisper | LocalAgreement | small | 0.59x | 21.1% | 0.089 s | +| faster-whisper | SimulStreaming | small | 0.39x | 0.0% | 0.221 s | +| mlx-whisper | LocalAgreement | base | 0.05x | 21.1% | 0.080 s | +| mlx-whisper | SimulStreaming | base | 0.14x | 10.5% | 0.245 s | +| mlx-whisper | LocalAgreement | small | 0.16x | 21.1% | 0.089 s | +| mlx-whisper | SimulStreaming | small | 0.20x | 10.5% | 0.226 s | +| voxtral-mlx | voxtral | 4B | 0.32x | 0.0% | 0.254 s | +| voxtral (HF) | voxtral | 4B | 1.29x | 0.0% | 1.876 s | -### French (16.3 s, 1 speaker) +### English -- Multi-speaker (30.0 s, 3 speakers) -| Backend | Policy | RTF | WER | Timestamp MAE | -|---------|--------|-----|-----|---------------| -| faster-whisper | LocalAgreement | 0.20x | 120.0% | 0.540 s | -| faster-whisper | SimulStreaming | 0.10x | 100.0% | 0.120 s | -| mlx-whisper | LocalAgreement | 0.31x | 1737.1% | 0.060 s | -| mlx-whisper | SimulStreaming | 0.08x | 94.3% | 0.120 s | -| voxtral-mlx | voxtral | 0.18x | 37.1% | 3.422 s | -| voxtral (HF) | voxtral | 0.63x | 28.6% | 4.040 s | +| Backend | Policy | Model | RTF | WER | Timestamp MAE | +|---------|--------|-------|-----|-----|---------------| +| faster-whisper | LocalAgreement | base | 0.24x | 44.7% | 0.235 s | +| faster-whisper | SimulStreaming | base | 0.10x | 5.3% | 0.398 s | +| faster-whisper | LocalAgreement | small | 0.59x | 25.0% | 0.226 s | +| faster-whisper | SimulStreaming | small | 0.26x | 5.3% | 0.387 s | +| mlx-whisper | LocalAgreement | base | 0.06x | 23.7% | 0.237 s | +| mlx-whisper | SimulStreaming | base | 0.11x | 5.3% | 0.395 s | +| mlx-whisper | LocalAgreement | small | 0.13x | 25.0% | 0.226 s | +| mlx-whisper | SimulStreaming | small | 0.20x | 5.3% | 0.394 s | +| voxtral-mlx | voxtral | 4B | 0.31x | 9.2% | 0.176 s | +| voxtral (HF) | voxtral | 4B | 1.00x | 32.9% | 1.034 s | -Note: The whisper-based backends were run with `--lan en`, so they attempted to transcribe French -audio in English. This is expected to produce high WER. For a fair comparison, the whisper backends -should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect language. +

+Benchmark comparison on 30s English +

-### English - Multi-speaker (30.0 s, 3 speakers) +### French (16.3 s, 1 speaker, `--language fr`) -| Backend | Policy | RTF | WER | Timestamp MAE | -|---------|--------|-----|-----|---------------| -| faster-whisper | LocalAgreement | 0.24x | 44.7% | 0.235 s | -| faster-whisper | SimulStreaming | 0.10x | 5.3% | 0.398 s | -| mlx-whisper | LocalAgreement | 0.06x | 23.7% | 0.237 s | -| mlx-whisper | SimulStreaming | 0.11x | 5.3% | 0.395 s | -| voxtral-mlx | voxtral | 0.31x | 9.2% | 0.176 s | -| voxtral (HF) | voxtral | 1.00x | 32.9% | 1.034 s | +| Backend | Policy | Model | RTF | WER | Timestamp MAE | +|---------|--------|-------|-----|-----|---------------| +| faster-whisper | LocalAgreement | base | 0.22x | 25.7% | 3.460 s | +| faster-whisper | SimulStreaming | base | 0.10x | 31.4% | 3.660 s | +| faster-whisper | LocalAgreement | small | 0.76x | 42.9% | 0.051 s | +| faster-whisper | SimulStreaming | small | 0.29x | 25.7% | 0.219 s | +| mlx-whisper | LocalAgreement | base | 0.09x | ~45%* | ~5.0 s* | +| mlx-whisper | SimulStreaming | base | 0.09x | 40.0% | 3.540 s | +| mlx-whisper | LocalAgreement | small | 0.14x | 25.7% | 0.083 s | +| mlx-whisper | SimulStreaming | small | 0.17x | 31.4% | 0.203 s | +| voxtral-mlx | voxtral | 4B | 0.18x | 37.1% | 3.422 s | +| voxtral (HF) | voxtral | 4B | 0.63x | 28.6% | 4.040 s | + +\* mlx-whisper + LocalAgreement + base is unstable on this French file (WER fluctuates 34-1037% across runs due to hallucination loops). The `small` model does not have this problem. + +**Timestamp note:** The base model produces very high timestamp MAE (3.4-3.7s) on this French file because it misaligns words around the silence gaps. The small model handles this much better (0.05-0.22s MAE). Voxtral also drifts on the silence gaps. + +--- + +## Model Size Comparison (base vs small) + +| | base | small | Observation | +|--|------|-------|-------------| +| **RTF** | 0.05-0.24x | 0.13-0.76x | small is 2-3x slower | +| **English WER (SS)** | 0-5.3% | 0-5.3% | No improvement: SimulStreaming already saturates on base | +| **English WER (LA)** | 21-44.7% | 21-25% | small reduces LA errors on longer audio | +| **French WER** | 25-40% | 25-43% | Mixed: depends on backend/policy combo | +| **French timestamps** | 3.4-5.0s MAE | 0.05-0.22s MAE | small is dramatically better for French timestamps | + +In short: **base + SimulStreaming** gives the best speed/accuracy tradeoff for English. The small model only helps if you need LocalAgreement (for subtitle-grade timestamps) or non-English languages. --- @@ -76,37 +105,25 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect ### Speed (RTF = processing time / audio duration, lower is better) -1. **mlx-whisper + LocalAgreement** is the fastest combo on Apple Silicon, reaching 0.05-0.06x RTF - on English audio. 30 seconds of audio processed in under 2 seconds. -2. For **faster-whisper**, SimulStreaming is consistently faster than LocalAgreement. - For **mlx-whisper**, it is the opposite: LocalAgreement (0.05-0.06x) is faster than SimulStreaming (0.11-0.14x). -3. **voxtral-mlx** runs at 0.18-0.32x RTF, roughly 3-5x slower than mlx-whisper but well within - real-time requirements. -4. **voxtral (HF transformers)** is the slowest at 1.0-1.3x RTF. On longer audio it risks - falling behind real-time. On Apple Silicon, the MLX variant is strongly preferred. +1. **mlx-whisper + LocalAgreement + base** is the fastest combo on Apple Silicon: 0.05-0.06x RTF on English. 30 seconds of audio in under 2 seconds. +2. For **faster-whisper**, SimulStreaming is faster than LocalAgreement. For **mlx-whisper**, it is the opposite: LocalAgreement (0.05-0.06x) outperforms SimulStreaming (0.11-0.14x) on speed. +3. **voxtral-mlx** runs at 0.18-0.32x RTF -- 3-5x slower than mlx-whisper base, but well within real-time. +4. **voxtral (HF transformers)** hits 1.0-1.3x RTF. At the real-time boundary on Apple Silicon. Use the MLX variant instead. +5. The **small** model is 2-3x slower than base across all backends. ### Accuracy (WER = Word Error Rate, lower is better) -1. **SimulStreaming** produces significantly better WER than LocalAgreement for whisper backends. - On the 30s English file: 5.3% vs 23.7-44.7%. -2. **voxtral-mlx** has good accuracy (0% on short English, 9.2% on multi-speaker). - Whisper also supports `--language auto`, but Voxtral's language detection is more - reliable and does not bias towards English the way Whisper's auto mode tends to. -3. **LocalAgreement** tends to duplicate the last sentence, inflating WER. This is a known - artifact of the LCP (Longest Common Prefix) commit strategy at end-of-stream. -4. **Voxtral** backends handle French natively with 28-37% WER, while whisper backends - were run with `--lan en` here (not a fair comparison for French). +1. **SimulStreaming** gives dramatically lower WER than LocalAgreement on the whisper backends. On the 30s English file: 5.3% vs 23-44%. +2. **voxtral-mlx** hits 0% on short English and 9.2% on multi-speaker. It auto-detects language natively. Whisper also supports `--language auto`, but tends to bias towards English on short segments. +3. **LocalAgreement** tends to repeat the last sentence at end-of-stream (a known LCP artifact), inflating WER. This is visible in the 21% WER on the 7s file -- the same 4 extra words appear in every LA run. +4. On **French** with the correct `--language fr`, whisper base achieves 25-40% WER -- comparable to Voxtral's 28-37%. The small model does not consistently improve French WER. -### Timestamp Accuracy (MAE = Mean Absolute Error on word start times, lower is better) +### Timestamps (MAE = Mean Absolute Error on word start times) -1. **LocalAgreement** produces the most accurate timestamps (0.08s MAE on English), since it - processes overlapping audio windows and validates via prefix matching. -2. **SimulStreaming** timestamps are slightly less precise (0.24-0.40s MAE) but still usable - for most applications. -3. **voxtral-mlx** has good timestamp accuracy on English (0.18-0.25s MAE) but drifts on - audio with long silence gaps (3.4s MAE on the French file with 4-second pauses). -4. **voxtral (HF)** has the worst timestamp accuracy (1.0-4.0s MAE). This is likely related to - differences in the transformers-based decoding pipeline rather than model quality. +1. **LocalAgreement** gives the best timestamps on English (0.08-0.09s MAE). +2. **SimulStreaming** is less precise (0.22-0.40s MAE) but good enough for most applications. +3. On French with silence gaps, **base model timestamps are unreliable** (3.4-5s MAE). The **small model fixes this** (0.05-0.22s MAE). This is the strongest argument for using `small` over `base`. +4. **voxtral-mlx** has good timestamps on English (0.18-0.25s MAE) but drifts on audio with long silence gaps (3.4s MAE on the French file). ### VAC (Voice Activity Classification) Impact @@ -117,23 +134,29 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect | voxtral-mlx | voxtral | on | 0.0% | 9.2% | | voxtral-mlx | voxtral | off | 0.0% | 9.2% | -- **Whisper backends require VAC** to function in streaming mode. Without it, the entire audio - is buffered as a single chunk and the LocalAgreement/SimulStreaming buffer logic breaks down. -- **Voxtral backends are VAC-independent** because they handle their own internal chunking and - produce identical results with or without VAC. VAC still reduces wasted compute on silence. +- **Whisper backends need VAC** to work in streaming mode. Without it the buffer logic breaks down and you get empty or garbage output. +- **Voxtral is unaffected by VAC** since it handles its own internal chunking. Identical results with or without. VAC still saves compute on silent segments. --- ## Recommendations -| Use Case | Recommended Backend | Policy | Notes | -|----------|-------------------|--------|-------| -| Fastest English transcription (Apple Silicon) | mlx-whisper | SimulStreaming | 0.08-0.14x RTF, 5-10% WER | -| Fastest English transcription (Linux/GPU) | faster-whisper | SimulStreaming | 0.10-0.14x RTF, 0-5% WER | -| Multilingual / auto-detect (Apple Silicon) | voxtral-mlx | voxtral | Handles 100+ languages, 0.18-0.32x RTF | -| Multilingual / auto-detect (Linux/GPU) | voxtral (HF) | voxtral | Same model, slower on CPU, needs GPU | -| Best timestamp accuracy | faster-whisper | LocalAgreement | 0.08s MAE, good for subtitle alignment | -| Low latency, low memory | mlx-whisper (tiny) | SimulStreaming | Smallest footprint, fastest response | +| Use Case | Backend | Policy | Model | Notes | +|----------|---------|--------|-------|-------| +| Fastest English (Apple Silicon) | mlx-whisper | SimulStreaming | base | 0.11x RTF, 5.3% WER | +| Fastest English (Linux/GPU) | faster-whisper | SimulStreaming | base | 0.10x RTF, 5.3% WER | +| Best accuracy, English | faster-whisper | SimulStreaming | small | 0.26x RTF, 5.3% WER, still fast | +| Multilingual / auto-detect | voxtral-mlx | voxtral | 4B | 100+ languages, 0.18-0.32x RTF | +| Best timestamps | any | LocalAgreement | small | 0.05-0.09s MAE, good for subtitles | +| Low memory / embedded | mlx-whisper | SimulStreaming | base | Smallest footprint, fastest response | + +--- + +## Caveats + +- **3 test files, ~53 seconds total.** Results give relative rankings between backends but should not be taken as definitive WER numbers. Run on your own data for production decisions. +- **RTF varies between runs** (up to +/-30%) depending on thermal state, background processes, and model caching. The numbers above are single sequential runs on a warm machine. +- **Only base and small tested.** Medium and large-v3 would likely improve WER at the cost of higher RTF. We did not test them here because they are slow on Apple Silicon without GPU. --- @@ -144,15 +167,18 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect pip install -e ".[test]" # Single backend test -python test_backend_offline.py --backend faster-whisper --policy simulstreaming --no-realtime +python test_backend_offline.py --backend faster-whisper --policy simulstreaming --model base --no-realtime + +# With a specific language +python test_backend_offline.py --backend mlx-whisper --policy simulstreaming --model small --lan fr --no-realtime # Multi-backend auto-detect benchmark python test_backend_offline.py --benchmark --no-realtime -# Export to JSON for programmatic analysis +# Export to JSON python test_backend_offline.py --benchmark --no-realtime --json results.json -# Test with custom audio +# Test with your own audio python test_backend_offline.py --backend voxtral-mlx --audio your_file.wav --no-realtime ``` diff --git a/benchmark_chart.png b/benchmark_chart.png new file mode 100644 index 0000000..20123bd Binary files /dev/null and b/benchmark_chart.png differ