fixes #283, fixes #275
- accumulated_cross_attns grew without bound during the decoding loop,
using up to ~5GB in repetition loops. now capped to a rolling window of 16
- max_tokens_per_chunk was computed from TOKENS_PER_SECOND (mel frame
rate = 50) instead of the actual text token rate (~15/s), allowing 10-40x
too many decoding steps
- removed unused torch.cat on early return path
- removed dead self.committed/last_result_tokens lists (never read)
- same fixes applied to mlx variant
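
For reviewers, a minimal sketch of the two main ideas above (a bounded rolling window and a token budget based on the text token rate). This is not the code in this PR: `make_attn_buffer`, `max_tokens_per_chunk`, `WINDOW`, and `TEXT_TOKENS_PER_SECOND` are hypothetical names, and the real buffer holds attention tensors rather than integers.

```python
import math
from collections import deque

WINDOW = 16                  # rolling-window size from the description above
TEXT_TOKENS_PER_SECOND = 15  # approximate text token rate from the description above


def make_attn_buffer(window=WINDOW):
    """Return a buffer that keeps only the most recent `window` entries.

    A deque with maxlen drops the oldest entry on append, so memory stays
    bounded even if decoding falls into a repetition loop.
    """
    return deque(maxlen=window)


def max_tokens_per_chunk(chunk_seconds, rate=TEXT_TOKENS_PER_SECOND):
    """Cap decoding steps using the text token rate, not the audio frame rate."""
    return max(1, math.ceil(chunk_seconds * rate))


# Hypothetical usage: integers stand in for per-step cross-attention tensors.
buf = make_attn_buffer()
for step in range(100):
    buf.append(step)
```

With a fixed `maxlen`, the buffer never holds more than 16 entries regardless of how many decoding steps run, and the per-chunk budget scales with the audio duration at the text token rate.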