Problem: AI articles scored MORE human (avg 26.2) than actual human
articles (avg 44.0), the opposite of 朱雀's judgment. The AI was gaming
the linear scoring by over-optimizing broken sentences, self-correction,
paragraph variance, etc.
Fix: Two calibration layers added after raw scoring:
1. Bell-curve scoring for 5 over-optimizable dimensions (broken_sentences,
self_correction, sentence_length_range, paragraph_length_variance,
banned_words). Score peaks at human article average, penalizes both
too-low AND too-high values.
2. Over-optimization penalty: 15% global penalty when 60%+ of checks
score above 0.8, indicating suspiciously "perfect" articles.
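A minimal sketch of the two layers, for illustration only. The Gaussian
shape, the width parameter, the multiplicative form of the penalty, and
the names bell_score and apply_overoptimization_penalty are assumptions;
only the 0.8 threshold, the 60% share, and the 15% penalty come from the
description above.

```python
import math


def bell_score(value, baseline, width):
    """Score in [0, 1] that peaks at the human baseline (assumed Gaussian)."""
    if width <= 0:
        raise ValueError("width must be positive")
    z = (value - baseline) / width
    # Symmetric falloff: too-low and too-high values are penalized equally.
    return math.exp(-0.5 * z * z)


def apply_overoptimization_penalty(total, check_scores,
                                   high=0.8, share=0.6, penalty=0.15):
    """Cut total by `penalty` when at least `share` of checks exceed `high`."""
    if not check_scores:
        return total
    n_high = sum(1 for s in check_scores if s > high)
    if n_high / len(check_scores) >= share:
        # Assumed multiplicative: a 15% penalty keeps 85% of the total.
        return total * (1.0 - penalty)
    return total
```

With this shape, a value one width below the baseline scores the same as
a value one width above it, so pushing a dimension past the human average
no longer pays off.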
Results:
Before: Human avg=44.0, AI avg=26.2 (WRONG direction)
After: Human avg=42.5, AI avg=44.0 (CORRECT direction)
A/B test now agrees with 朱雀 (exemplar version scores better)
Baselines derived from 15 human articles tested on 2026-03-30.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- 11 checks across 2 tiers (6 statistical + 5 pattern), up from 6
- Continuous 0-1 scores instead of pass/fail booleans
- Each check maps to a writing-config parameter via param field
- New checks: negative emotion ratio, adverb density, vocabulary richness,
sentence length range, self-correction patterns
- New --tier3 flag lets the agent pass in an LLM structural analysis score
- param_scores in JSON output: flat param→score map for optimization
- Standalone mode redistributes weights (T1=62.5%, T2=37.5%)
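
A sketch of how the standalone redistribution could work: drop the
missing tier and renormalize the rest. The full-mode weights below
(T1=50, T2=30, T3=20) are hypothetical, chosen only because they
reproduce the 62.5% / 37.5% split; redistribute_weights is an
illustrative name, not the script's actual function.

```python
def redistribute_weights(weights, available):
    """Renormalize the weights of the available tiers so they sum to 1."""
    kept = {tier: w for tier, w in weights.items() if tier in available}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("no positive weight among available tiers")
    return {tier: w / total for tier, w in kept.items()}


# Hypothetical full-mode weights in percent (an assumption, not from the source).
FULL_WEIGHTS = {"T1": 50, "T2": 30, "T3": 20}

# Standalone mode: no --tier3 score was passed, so only T1 and T2 count.
standalone = redistribute_weights(FULL_WEIGHTS, {"T1", "T2"})
```

Under these assumed weights, standalone comes out as T1=0.625 and
T2=0.375, matching the split stated above.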
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>