diff --git a/BENCHMARK.md b/BENCHMARK.md
index df1293f..3fcb30a 100644
--- a/BENCHMARK.md
+++ b/BENCHMARK.md
@@ -1,7 +1,7 @@
# WhisperLiveKit Benchmark Report
-Benchmark comparing all supported ASR backends and streaming policies on Apple Silicon,
-using the full AudioProcessor pipeline (the same path audio takes in production via WebSocket).
+Benchmark comparing all supported ASR backends, streaming policies, and model sizes on Apple Silicon.
+All tests run through the full AudioProcessor pipeline (same code path as production WebSocket).
## Test Environment
@@ -12,9 +12,8 @@ using the full AudioProcessor pipeline (the same path audio takes in production
| Python | 3.13 |
| faster-whisper | 1.2.1 |
| mlx-whisper | installed (via mlx) |
-| Voxtral (HF) | transformers-based |
| Voxtral MLX | native MLX backend |
-| Model size | `base` (default for whisper backends) |
+| Voxtral (HF) | transformers-based |
| VAC (Silero VAD) | enabled unless noted |
| Chunk size | 100 ms |
| Pacing | no-realtime (as fast as possible) |
@@ -25,50 +24,80 @@ using the full AudioProcessor pipeline (the same path audio takes in production
|------|----------|----------|----------|-------------|
| `00_00_07_english_1_speaker.wav` | 7.2 s | English | 1 | Short dictation with pauses |
| `00_00_16_french_1_speaker.wav` | 16.3 s | French | 1 | French speech with intentional silence gaps |
-| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation about transcription |
+| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation |
-All files have hand-verified ground truth transcripts (`.transcript.json`) with per-word timestamps.
+Ground truth transcripts (`.transcript.json`) with per-word timestamps are hand-verified.
---
-## Results Overview
+## Results
-### English - Short (7.2 s, 1 speaker)
+### English -- Short (7.2 s, 1 speaker)
-| Backend | Policy | RTF | WER | Timestamp MAE |
-|---------|--------|-----|-----|---------------|
-| faster-whisper | LocalAgreement | 0.20x | 21.1% | 0.080 s |
-| faster-whisper | SimulStreaming | 0.14x | 0.0% | 0.239 s |
-| mlx-whisper | LocalAgreement | 0.05x | 21.1% | 0.080 s |
-| mlx-whisper | SimulStreaming | 0.14x | 10.5% | 0.245 s |
-| voxtral-mlx | voxtral | 0.32x | 0.0% | 0.254 s |
-| voxtral (HF) | voxtral | 1.29x | 0.0% | 1.876 s |
+| Backend | Policy | Model | RTF | WER | Timestamp MAE |
+|---------|--------|-------|-----|-----|---------------|
+| faster-whisper | LocalAgreement | base | 0.20x | 21.1% | 0.080 s |
+| faster-whisper | SimulStreaming | base | 0.14x | 0.0% | 0.239 s |
+| faster-whisper | LocalAgreement | small | 0.59x | 21.1% | 0.089 s |
+| faster-whisper | SimulStreaming | small | 0.39x | 0.0% | 0.221 s |
+| mlx-whisper | LocalAgreement | base | 0.05x | 21.1% | 0.080 s |
+| mlx-whisper | SimulStreaming | base | 0.14x | 10.5% | 0.245 s |
+| mlx-whisper | LocalAgreement | small | 0.16x | 21.1% | 0.089 s |
+| mlx-whisper | SimulStreaming | small | 0.20x | 10.5% | 0.226 s |
+| voxtral-mlx | voxtral | 4B | 0.32x | 0.0% | 0.254 s |
+| voxtral (HF) | voxtral | 4B | 1.29x | 0.0% | 1.876 s |
-### French (16.3 s, 1 speaker)
+### English -- Multi-speaker (30.0 s, 3 speakers)
-| Backend | Policy | RTF | WER | Timestamp MAE |
-|---------|--------|-----|-----|---------------|
-| faster-whisper | LocalAgreement | 0.20x | 120.0% | 0.540 s |
-| faster-whisper | SimulStreaming | 0.10x | 100.0% | 0.120 s |
-| mlx-whisper | LocalAgreement | 0.31x | 1737.1% | 0.060 s |
-| mlx-whisper | SimulStreaming | 0.08x | 94.3% | 0.120 s |
-| voxtral-mlx | voxtral | 0.18x | 37.1% | 3.422 s |
-| voxtral (HF) | voxtral | 0.63x | 28.6% | 4.040 s |
+| Backend | Policy | Model | RTF | WER | Timestamp MAE |
+|---------|--------|-------|-----|-----|---------------|
+| faster-whisper | LocalAgreement | base | 0.24x | 44.7% | 0.235 s |
+| faster-whisper | SimulStreaming | base | 0.10x | 5.3% | 0.398 s |
+| faster-whisper | LocalAgreement | small | 0.59x | 25.0% | 0.226 s |
+| faster-whisper | SimulStreaming | small | 0.26x | 5.3% | 0.387 s |
+| mlx-whisper | LocalAgreement | base | 0.06x | 23.7% | 0.237 s |
+| mlx-whisper | SimulStreaming | base | 0.11x | 5.3% | 0.395 s |
+| mlx-whisper | LocalAgreement | small | 0.13x | 25.0% | 0.226 s |
+| mlx-whisper | SimulStreaming | small | 0.20x | 5.3% | 0.394 s |
+| voxtral-mlx | voxtral | 4B | 0.31x | 9.2% | 0.176 s |
+| voxtral (HF) | voxtral | 4B | 1.00x | 32.9% | 1.034 s |
-Note: The whisper-based backends were run with `--lan en`, so they attempted to transcribe French
-audio in English. This is expected to produce high WER. For a fair comparison, the whisper backends
-should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect language.
+
+
+
-### English - Multi-speaker (30.0 s, 3 speakers)
+### French (16.3 s, 1 speaker, `--language fr`)
-| Backend | Policy | RTF | WER | Timestamp MAE |
-|---------|--------|-----|-----|---------------|
-| faster-whisper | LocalAgreement | 0.24x | 44.7% | 0.235 s |
-| faster-whisper | SimulStreaming | 0.10x | 5.3% | 0.398 s |
-| mlx-whisper | LocalAgreement | 0.06x | 23.7% | 0.237 s |
-| mlx-whisper | SimulStreaming | 0.11x | 5.3% | 0.395 s |
-| voxtral-mlx | voxtral | 0.31x | 9.2% | 0.176 s |
-| voxtral (HF) | voxtral | 1.00x | 32.9% | 1.034 s |
+| Backend | Policy | Model | RTF | WER | Timestamp MAE |
+|---------|--------|-------|-----|-----|---------------|
+| faster-whisper | LocalAgreement | base | 0.22x | 25.7% | 3.460 s |
+| faster-whisper | SimulStreaming | base | 0.10x | 31.4% | 3.660 s |
+| faster-whisper | LocalAgreement | small | 0.76x | 42.9% | 0.051 s |
+| faster-whisper | SimulStreaming | small | 0.29x | 25.7% | 0.219 s |
+| mlx-whisper | LocalAgreement | base | 0.09x | ~45%* | ~5.0 s* |
+| mlx-whisper | SimulStreaming | base | 0.09x | 40.0% | 3.540 s |
+| mlx-whisper | LocalAgreement | small | 0.14x | 25.7% | 0.083 s |
+| mlx-whisper | SimulStreaming | small | 0.17x | 31.4% | 0.203 s |
+| voxtral-mlx | voxtral | 4B | 0.18x | 37.1% | 3.422 s |
+| voxtral (HF) | voxtral | 4B | 0.63x | 28.6% | 4.040 s |
+
+\* mlx-whisper + LocalAgreement + base is unstable on this French file (WER fluctuates 34-1037% across runs due to hallucination loops). The `small` model does not have this problem.
+
+**Timestamp note:** The base model produces very high timestamp MAE (3.4-3.7s) on this French file because it misaligns words around the silence gaps. The small model handles this much better (0.05-0.22s MAE). Voxtral also drifts on the silence gaps.
+
+---
+
+## Model Size Comparison (base vs small)
+
+| | base | small | Observation |
+|--|------|-------|-------------|
+| **RTF** | 0.05-0.24x | 0.13-0.76x | small is 2-3x slower |
+| **English WER (SS)** | 0-5.3% | 0-5.3% | No improvement: SimulStreaming already saturates on base |
+| **English WER (LA)** | 21-44.7% | 21-25% | small reduces LA errors on longer audio |
+| **French WER** | 25-40% | 25-43% | Mixed: depends on backend/policy combo |
+| **French timestamps** | 3.4-5.0s MAE | 0.05-0.22s MAE | small is dramatically better for French timestamps |
+
+In short: **base + SimulStreaming** gives the best speed/accuracy tradeoff for English. The small model only helps if you need LocalAgreement (for subtitle-grade timestamps) or non-English languages.
---
@@ -76,37 +105,25 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect
### Speed (RTF = processing time / audio duration, lower is better)
-1. **mlx-whisper + LocalAgreement** is the fastest combo on Apple Silicon, reaching 0.05-0.06x RTF
- on English audio. 30 seconds of audio processed in under 2 seconds.
-2. For **faster-whisper**, SimulStreaming is consistently faster than LocalAgreement.
- For **mlx-whisper**, it is the opposite: LocalAgreement (0.05-0.06x) is faster than SimulStreaming (0.11-0.14x).
-3. **voxtral-mlx** runs at 0.18-0.32x RTF, roughly 3-5x slower than mlx-whisper but well within
- real-time requirements.
-4. **voxtral (HF transformers)** is the slowest at 1.0-1.3x RTF. On longer audio it risks
- falling behind real-time. On Apple Silicon, the MLX variant is strongly preferred.
+1. **mlx-whisper + LocalAgreement + base** is the fastest combo on Apple Silicon: 0.05-0.06x RTF on English. 30 seconds of audio in under 2 seconds.
+2. For **faster-whisper**, SimulStreaming is faster than LocalAgreement. For **mlx-whisper**, it is the opposite: LocalAgreement (0.05-0.06x) outperforms SimulStreaming (0.11-0.14x) on speed.
+3. **voxtral-mlx** runs at 0.18-0.32x RTF -- 3-5x slower than mlx-whisper base, but well within real-time.
+4. **voxtral (HF transformers)** hits 1.0-1.3x RTF. At the real-time boundary on Apple Silicon. Use the MLX variant instead.
+5. The **small** model is 2-3x slower than base across all backends.
### Accuracy (WER = Word Error Rate, lower is better)
-1. **SimulStreaming** produces significantly better WER than LocalAgreement for whisper backends.
- On the 30s English file: 5.3% vs 23.7-44.7%.
-2. **voxtral-mlx** has good accuracy (0% on short English, 9.2% on multi-speaker).
- Whisper also supports `--language auto`, but Voxtral's language detection is more
- reliable and does not bias towards English the way Whisper's auto mode tends to.
-3. **LocalAgreement** tends to duplicate the last sentence, inflating WER. This is a known
- artifact of the LCP (Longest Common Prefix) commit strategy at end-of-stream.
-4. **Voxtral** backends handle French natively with 28-37% WER, while whisper backends
- were run with `--lan en` here (not a fair comparison for French).
+1. **SimulStreaming** gives dramatically lower WER than LocalAgreement on the whisper backends. On the 30s English file: 5.3% vs 23-44%.
+2. **voxtral-mlx** hits 0% on short English and 9.2% on multi-speaker. It auto-detects language natively. Whisper also supports `--language auto`, but tends to bias towards English on short segments.
+3. **LocalAgreement** tends to repeat the last sentence at end-of-stream (a known LCP artifact), inflating WER. This is visible in the 21% WER on the 7s file -- the same 4 extra words appear in every LA run.
+4. On **French** with the correct `--language fr`, whisper base achieves 25-40% WER -- comparable to Voxtral's 28-37%. The small model does not consistently improve French WER.
-### Timestamp Accuracy (MAE = Mean Absolute Error on word start times, lower is better)
+### Timestamps (MAE = Mean Absolute Error on word start times)
-1. **LocalAgreement** produces the most accurate timestamps (0.08s MAE on English), since it
- processes overlapping audio windows and validates via prefix matching.
-2. **SimulStreaming** timestamps are slightly less precise (0.24-0.40s MAE) but still usable
- for most applications.
-3. **voxtral-mlx** has good timestamp accuracy on English (0.18-0.25s MAE) but drifts on
- audio with long silence gaps (3.4s MAE on the French file with 4-second pauses).
-4. **voxtral (HF)** has the worst timestamp accuracy (1.0-4.0s MAE). This is likely related to
- differences in the transformers-based decoding pipeline rather than model quality.
+1. **LocalAgreement** gives the best timestamps on English (0.08-0.09s MAE).
+2. **SimulStreaming** is less precise (0.22-0.40s MAE) but good enough for most applications.
+3. On French with silence gaps, **base model timestamps are unreliable** (3.4-5s MAE). The **small model fixes this** (0.05-0.22s MAE). This is the strongest argument for using `small` over `base`.
+4. **voxtral-mlx** has good timestamps on English (0.18-0.25s MAE) but drifts on audio with long silence gaps (3.4s MAE on the French file).
### VAC (Voice Activity Classification) Impact
@@ -117,23 +134,29 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect
| voxtral-mlx | voxtral | on | 0.0% | 9.2% |
| voxtral-mlx | voxtral | off | 0.0% | 9.2% |
-- **Whisper backends require VAC** to function in streaming mode. Without it, the entire audio
- is buffered as a single chunk and the LocalAgreement/SimulStreaming buffer logic breaks down.
-- **Voxtral backends are VAC-independent** because they handle their own internal chunking and
- produce identical results with or without VAC. VAC still reduces wasted compute on silence.
+- **Whisper backends need VAC** to work in streaming mode. Without it the buffer logic breaks down and you get empty or garbage output.
+- **Voxtral is unaffected by VAC** since it handles its own internal chunking. Identical results with or without. VAC still saves compute on silent segments.
---
## Recommendations
-| Use Case | Recommended Backend | Policy | Notes |
-|----------|-------------------|--------|-------|
-| Fastest English transcription (Apple Silicon) | mlx-whisper | SimulStreaming | 0.08-0.14x RTF, 5-10% WER |
-| Fastest English transcription (Linux/GPU) | faster-whisper | SimulStreaming | 0.10-0.14x RTF, 0-5% WER |
-| Multilingual / auto-detect (Apple Silicon) | voxtral-mlx | voxtral | Handles 100+ languages, 0.18-0.32x RTF |
-| Multilingual / auto-detect (Linux/GPU) | voxtral (HF) | voxtral | Same model, slower on CPU, needs GPU |
-| Best timestamp accuracy | faster-whisper | LocalAgreement | 0.08s MAE, good for subtitle alignment |
-| Low latency, low memory | mlx-whisper (tiny) | SimulStreaming | Smallest footprint, fastest response |
+| Use Case | Backend | Policy | Model | Notes |
+|----------|---------|--------|-------|-------|
+| Fastest English (Apple Silicon) | mlx-whisper | SimulStreaming | base | 0.11x RTF, 5.3% WER |
+| Fastest English (Linux/GPU) | faster-whisper | SimulStreaming | base | 0.10x RTF, 5.3% WER |
+| Best accuracy, English | faster-whisper | SimulStreaming | small | 0.26x RTF, 5.3% WER, still fast |
+| Multilingual / auto-detect | voxtral-mlx | voxtral | 4B | 100+ languages, 0.18-0.32x RTF |
+| Best timestamps | any | LocalAgreement | small | 0.05-0.09s MAE, good for subtitles |
+| Low memory / embedded | mlx-whisper | SimulStreaming | base | Smallest footprint, fastest response |
+
+---
+
+## Caveats
+
+- **3 test files, ~53 seconds total.** Results give relative rankings between backends but should not be taken as definitive WER numbers. Run on your own data for production decisions.
+- **RTF varies between runs** (up to +/-30%) depending on thermal state, background processes, and model caching. The numbers above are single sequential runs on a warm machine.
+- **Only base and small tested.** Medium and large-v3 would likely improve WER at the cost of higher RTF. We did not test them here because they are slow on Apple Silicon without GPU.
---
@@ -144,15 +167,18 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect
pip install -e ".[test]"
# Single backend test
-python test_backend_offline.py --backend faster-whisper --policy simulstreaming --no-realtime
+python test_backend_offline.py --backend faster-whisper --policy simulstreaming --model base --no-realtime
+
+# With a specific language
+python test_backend_offline.py --backend mlx-whisper --policy simulstreaming --model small --lan fr --no-realtime
# Multi-backend auto-detect benchmark
python test_backend_offline.py --benchmark --no-realtime
-# Export to JSON for programmatic analysis
+# Export to JSON
python test_backend_offline.py --benchmark --no-realtime --json results.json
-# Test with custom audio
+# Test with your own audio
python test_backend_offline.py --backend voxtral-mlx --audio your_file.wav --no-realtime
```
diff --git a/benchmark_chart.png b/benchmark_chart.png
new file mode 100644
index 0000000..20123bd
Binary files /dev/null and b/benchmark_chart.png differ