This fix addresses a critical bug in the Whisper tokenizer that crashes the transcription server with `IndexError: string index out of range` when streaming audio in languages that use multi-byte UTF-8 characters (e.g., Cantonese, Japanese, Mandarin). When a 3-byte character is cut off at the boundary of an audio chunk, the incomplete bytes decode to a single Unicode replacement character (`\ufffd`). The decoded string is then shorter than expected, which breaks the offset mapping that `split_tokens_on_unicode` assumes. This change ports the upstream fix from SYSTRAN/faster-whisper (PR #111): a strict bounds check before the string is indexed, so that incomplete bytes are caught safely and handled in the next chunk.
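
The failure mode and the guard can be illustrated with a minimal, self-contained sketch. It is not the project's actual tokenizer code: the `decode` helper, the use of raw `bytes` objects as tokens, and the `bounds_check` flag are all assumptions made for illustration. Only the general shape (accumulate tokens, look for `\ufffd`, compare against the fully decoded string at an offset) follows the description above.

```python
REPLACEMENT_CHAR = "\ufffd"


def decode(tokens):
    # Hypothetical stand-in for the tokenizer's decode step: tokens are raw
    # byte fragments, and invalid or incomplete UTF-8 becomes U+FFFD.
    return b"".join(tokens).decode("utf-8", errors="replace")


def split_on_unicode(tokens, bounds_check=True):
    """Group byte-fragment tokens into units that decode to valid text.

    With bounds_check=False this mimics the unguarded indexing that can
    raise IndexError; with bounds_check=True the index is validated first.
    """
    decoded_full = decode(tokens)
    words, word_tokens, current = [], [], []
    unicode_offset = 0

    for token in tokens:
        current.append(token)
        decoded = decode(current)

        if REPLACEMENT_CHAR in decoded:
            idx = unicode_offset + decoded.index(REPLACEMENT_CHAR)
            # The guard: an incomplete trailing character can make the full
            # string shorter than the running offset, so check before indexing.
            if bounds_check and idx >= len(decoded_full):
                continue  # keep accumulating; bytes are still incomplete
            if decoded_full[idx] != REPLACEMENT_CHAR:
                continue  # partial character; wait for the remaining bytes

        words.append(decoded)
        word_tokens.append(current)
        current = []
        unicode_offset += len(decoded)

    return words, word_tokens


if __name__ == "__main__":
    # One complete 3-byte character, then a second character whose first two
    # bytes arrive as separate tokens and whose third byte is missing.
    truncated = [b"\xe4\xbd\xa0", b"\xe5", b"\xa5"]
    print(split_on_unicode(truncated, bounds_check=True))
```

In this sketch the trailing fragment that fails the bounds check is simply left out of the result. Carrying those bytes over to the next chunk, as the fix description mentions, would be the caller's responsibility and is not shown here.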

| Name |
|---|
| diarization |
| local_agreement |
| silero_vad_models |
| simul_whisper |
| voxtral_mlx |
| web |
| whisper |
| __init__.py |
| audio_processor.py |
| backend_support.py |
| basic_server.py |
| config.py |
| core.py |
| ffmpeg_manager.py |
| metrics.py |
| metrics_collector.py |
| model_mapping.py |
| model_paths.py |
| parse_args.py |
| silero_vad_iterator.py |
| thread_safety.py |
| timed_objects.py |
| tokens_alignment.py |
| voxtral_hf_streaming.py |
| voxtral_mlx_asr.py |
| warmup.py |