chore: rebuild dist/openclaw from source

This commit is contained in:
github-actions[bot] 2026-03-30 16:55:01 +00:00
parent a92d453856
commit f9722fb93b
13 changed files with 683 additions and 17 deletions

View file

@ -43,7 +43,7 @@ description: |
2. 如果有 fail 项 → 直接报告,建议修复 2. 如果有 fail 项 → 直接报告,建议修复
3. 如果全 pass 或仅 warn → 继续 LLM 深度分析: 3. 如果全 pass 或仅 warn → 继续 LLM 深度分析:
- 读取 `style.yaml` 的 tone/voice 与 writing_persona判断是否矛盾 - 读取 `style.yaml` 的 tone/voice 与 writing_persona判断是否矛盾
- 读取 `writing-config.yaml`(如存在),检查是否有 AI 特征参数emotional_arc: flat、paragraph_rhythm: structured、closing_style: summary - 读取 `writing-config.yaml`(如存在),检查是否有 AI 特征参数emotional_arc: flat、paragraph_rhythm: structured、closing_tendency: summary
- 读取 `history.yaml` 最近 5 篇,检查 persona 使用和 web_search 降级情况 - 读取 `history.yaml` 最近 5 篇,检查 persona 使用和 web_search 降级情况
4. 综合输出自然语言报告 + 按优先级排序的改进建议 4. 综合输出自然语言报告 + 按优先级排序的改进建议
- 用户说"优化写作参数"/"优化参数"/"跑优化" → 执行以下流程: - 用户说"优化写作参数"/"优化参数"/"跑优化" → 执行以下流程:
@ -97,6 +97,7 @@ python3 -c "import markdown, bs4, cssutils, requests, yaml, pygments, PIL" 2>&1
| Python 依赖 | 静默 | 提供 `pip install -r requirements.txt` | | Python 依赖 | 静默 | 提供 `pip install -r requirements.txt` |
| `wechat.appid` + `secret` | 静默 | 设 `skip_publish = true` | | `wechat.appid` + `secret` | 静默 | 设 `skip_publish = true` |
| `image.api_key` | 静默 | 设 `skip_image_gen = true` | | `image.api_key` | 静默 | 设 `skip_image_gen = true` |
| `references/exemplars/index.yaml` | 静默 | 提示:"范文库为空。如果你有已发布的文章markdown可以说**'导入范文'**建立风格库,写出来的文章会更像你。没有也不影响使用。" |
**1.2 版本检查**(静默通过或提醒): **1.2 版本检查**(静默通过或提醒):
@ -189,6 +190,7 @@ web_search: "{选题关键词} 数据 报告 2025 2026"
读取: {baseDir}/playbook.md如果存在按 confidence 分级执行) 读取: {baseDir}/playbook.md如果存在按 confidence 分级执行)
读取: {baseDir}/writing-config.yaml如果存在作为写作参数 读取: {baseDir}/writing-config.yaml如果存在作为写作参数
读取: {baseDir}/history.yaml最近 3 篇的 dimensions 字段) 读取: {baseDir}/history.yaml最近 3 篇的 dimensions 字段)
读取: {baseDir}/references/exemplars/index.yaml如果存在
``` ```
**4.1 历史最佳参数参考**(有 history.yaml 且包含 composite_score 时执行): **4.1 历史最佳参数参考**(有 history.yaml 且包含 composite_score 时执行):
@ -208,18 +210,75 @@ web_search: "{选题关键词} 数据 报告 2025 2026"
人格文件定义了:语气浓度、数据呈现方式、情绪弧线、段落节奏、不确定性表达模板等。作为 4.4 的硬性约束执行。 人格文件定义了:语气浓度、数据呈现方式、情绪弧线、段落节奏、不确定性表达模板等。作为 4.4 的硬性约束执行。
**优先级**playbook.mdconfidence ≥ 5 的规则)> persona > writing-guide.md。writing-guide 是底线禁用词等persona 在此基础上特化风格参数playbook 中高置信度规则是用户个性化的最终覆盖。playbook 中 confidence < 5 的规则作为软性参考 **优先级**playbook.mdconfidence ≥ 5 的规则)> persona > 范文风格 > writing-guide.md。writing-guide 是底线(禁用词等),范文提供风格示范(句长节奏、情绪表达方式),persona 在此基础上特化风格参数(语气浓度、数据呈现)playbook 中高置信度规则是用户个性化的最终覆盖。playbook 中 confidence < 5 的规则作为软性参考
**4.4 写文章** **4.4 范文风格注入**(有 `references/exemplars/index.yaml` 时执行):
从 index.yaml 筛选 category 匹配当前框架类型的范文,按 humanness_score 升序(越低越人类)取 top 3。读取对应 .md 文件的片段内容。
在写作 prompt 中注入:
> 以下是该公众号风格的真实段落示例,模仿其句长节奏、情绪强度和口语化程度:
>
> 【开头风格】
> {exemplar_1 的开头钩子段}
>
> 【情绪段风格】
> {exemplar_2 的情绪高峰段}
>
> 【转折风格】
> {exemplar_2 或 exemplar_3 的转折/自纠段(如有)}
>
> 【收尾风格】
> {exemplar_3 的收尾段}
Category 映射规则:
| 框架类型 | exemplar category |
|----------|-------------------|
| 痛点型/深度解读 | tech-opinion |
| 故事型 | story-emotional |
| 清单型/对比型 | list-practical |
| 热点解读型 | hot-take |
| 其他 | general |
如果匹配到的范文不足 3 篇,用 general category 补足。
**Fallback范文库为空时**:读取 `{baseDir}/references/exemplar-seeds.yaml`,从每个段落类型中随机选 1 个注入 prompt。种子段落只示范人类写作的结构模式句长方差、情绪锐度、自我纠正、非总结式收尾不携带特定风格。注入时使用
> 以下是人类写作的结构模式示例,注意模仿其句长节奏和情绪表达方式(不要模仿具体内容或风格):
>
> 【开头模式】{seeds.opening_hooks 随机 1 个}
>
> 【情绪段模式】{seeds.emotional_peaks 随机 1 个}
>
> 【转折模式】{seeds.transitions 随机 1 个}
>
> 【收尾模式】{seeds.closings 随机 1 个}
建库命令:`python3 {baseDir}/scripts/extract_exemplar.py article.md`
**4.5 写文章**
- H1 标题20-28 字) + H2 结构1500-2500 字 - H1 标题20-28 字) + H2 结构1500-2500 字
- 真实素材锚定Step 3.2 的素材分散嵌入各 H2 段落 - 真实素材锚定Step 3.2 的素材分散嵌入各 H2 段落
- **写作人格**:按 4.3 加载的人格参数写作(数据呈现方式、个人声音浓度、不确定性表达等) - **写作人格**:按 4.3 加载的人格参数写作(数据呈现方式、个人声音浓度、不确定性表达等)
- **收尾方式**persona 的 `closing_tendency` 仅作为倾向参考。根据文章内容和情绪弧线自行判断最自然的收尾方式(参见 writing-guide.md 收尾多样性表)。如果 history.yaml 中最近 3 篇有 `closing_type` 字段,避免使用相同的收尾类型
- 3 层反检测规则(统计/语言/内容)在初稿阶段全部生效 - 3 层反检测规则(统计/语言/内容)在初稿阶段全部生效
- 2-3 个编辑锚点:`<!-- ✏️ 编辑建议:在这里加一句你自己的经历/看法 -->` - 2-3 个编辑锚点:`<!-- ✏️ 编辑建议:在这里加一句你自己的经历/看法 -->`
- 可选容器语法:`:::dialogue`、`:::timeline`、`:::callout`、`:::quote` - 可选容器语法:`:::dialogue`、`:::timeline`、`:::callout`、`:::quote`
保存到 `{baseDir}/output/{date}-{slug}.md` 保存到 `{baseDir}/output/{date}-{slug}.md`
**4.6 快速自检**(写完后立即执行,减少 Step 5 重写概率):
对初稿做 3 项最易不达标的快速扫描,**当场修复**,不留到 Step 5
1. **禁用词扫描**:检查 writing-guide.md 2.1 的禁用词列表,命中的直接替换(最常见的问题,修复成本最低)
2. **句长方差检查**:粗略扫描是否有连续 3 句以上长度接近的段落,有则拆句或加短句
3. **负面情绪检查**:全文是否有 ≥ 2 处真实负面表达,不够则在编辑锚点附近补充
这 3 项检查不需要调用脚本LLM 自行完成即可。目标是让初稿在进入 Step 5 前已经消除最明显的问题。
--- ---
### Step 5: SEO + 验证 ### Step 5: SEO + 验证
@ -249,7 +308,7 @@ web_search: "{选题关键词} 数据 报告 2025 2026"
| 内容 | 密度波浪 | 高密度段后跟低密度段 | 3.3 | | 内容 | 密度波浪 | 高密度段后跟低密度段 | 3.3 |
| 内容 | 维度贯穿 | 激活维度全文可见 | 3.4 | | 内容 | 维度贯穿 | 激活维度全文可见 | 3.4 |
不通过 → 定向重写该段落。3 次仍不过 → 标注跳过 不通过 → **定向修复**:只替换不达标的具体句子/段落,不动已通过的部分。每轮最多改 3 处改完立即重新检查该项。2 轮仍不过 → 标注跳过,继续下一项
**5.3 脚本验证**(补充逐项检查): **5.3 脚本验证**(补充逐项检查):
@ -261,8 +320,8 @@ python3 {baseDir}/scripts/humanness_score.py {article_path} --json --tier3 {agen
解读 JSON 中 `composite_score` 解读 JSON 中 `composite_score`
- < 30 通过继续 Step 6 - < 30 通过继续 Step 6
- 30-50 → 查看 `param_scores` 中最低分项,定向重写对应段落 - 30-50 → 查看 `param_scores` 中最低分的 1-2 项只修复对应的具体句子不重写整段改完重新打分。1 轮即可
- \> 50 → 重大问题,逐个低分项修复,最多 3 轮 - \> 50 → `param_scores` 最低的 2-3 项,逐项定向修复(每项只改最相关的 1-2 处),最多 2 轮。仍 > 50 则标记 DONE_WITH_CONCERNS 继续
--- ---
@ -332,6 +391,7 @@ python3 {baseDir}/toolkit/cli.py preview {markdown} --theme {theme} --no-open -o
writing_persona: "{人格名}" writing_persona: "{人格名}"
dimensions: dimensions:
- "{维度}: {选项}" - "{维度}: {选项}"
closing_type: "{收尾类型}" # trailing_off/unanswered/scene_revert/abrupt_stop/anti_conclusion/image
composite_score: {Step 5.3 的 composite_score} # 0=人类, 100=AI composite_score: {Step 5.3 的 composite_score} # 0=人类, 100=AI
writing_config_snapshot: # 本次使用的关键参数(从 writing-config.yaml 提取) writing_config_snapshot: # 本次使用的关键参数(从 writing-config.yaml 提取)
sentence_variance: {值} sentence_variance: {值}
@ -366,6 +426,8 @@ python3 {baseDir}/toolkit/cli.py preview {markdown} --theme {theme} --no-open -o
| 做一个小绿书/图片帖 | `python3 {baseDir}/toolkit/cli.py image-post img1.jpg img2.jpg -t "标题"` | | 做一个小绿书/图片帖 | `python3 {baseDir}/toolkit/cli.py image-post img1.jpg img2.jpg -t "标题"` |
| 诊断配置 / 检查反AI / 为什么AI检测没过 | `python3 {baseDir}/scripts/diagnose.py --json` + LLM 交叉分析 | | 诊断配置 / 检查反AI / 为什么AI检测没过 | `python3 {baseDir}/scripts/diagnose.py --json` + LLM 交叉分析 |
| 优化写作参数 / 优化参数 | 迭代循环:写测试短文 → 打分 → 调参(见辅助功能) | | 优化写作参数 / 优化参数 | 迭代循环:写测试短文 → 打分 → 调参(见辅助功能) |
| 导入范文 / 建范文库 | `python3 {baseDir}/scripts/extract_exemplar.py article.md` |
| 查看范文库 | `python3 {baseDir}/scripts/extract_exemplar.py --list` |
--- ---
@ -380,7 +442,8 @@ python3 {baseDir}/toolkit/cli.py preview {markdown} --theme {theme} --no-open -o
| 素材采集web_search | LLM 训练数据中可验证的公开信息 | | 素材采集web_search | LLM 训练数据中可验证的公开信息 |
| 维度随机化 | history 空时跳过去重 | | 维度随机化 | history 空时跳过去重 |
| Persona 文件不存在 | 回退到 midnight-friend默认 | | Persona 文件不存在 | 回退到 midnight-friend默认 |
| 去 AI 验证 | 3 次重写不过则跳过该项 | | 范文库为空 | Fallback 到 exemplar-seeds.yaml通用模式 |
| 去 AI 验证 | 2 轮定向修复不过则跳过该项 |
| 生图失败 | 输出提示词 | | 生图失败 | 输出提示词 |
| 推送失败 | 本地 HTML | | 推送失败 | 本地 HTML |
| 历史写入 | 警告不阻断 | | 历史写入 | 警告不阻断 |

View file

@ -1 +1 @@
1.3.0 1.3.3

View file

@ -17,7 +17,7 @@ single_sentence_paragraph_rate: 0.08 # 少用单句段落,保持专业感
emotional_arc: "flat_with_insight" # 整体平稳,在关键洞察处提升强度 emotional_arc: "flat_with_insight" # 整体平稳,在关键洞察处提升强度
opening_style: "thesis" # 开头直接亮核心论点 opening_style: "thesis" # 开头直接亮核心论点
closing_style: "implications" # 以"这意味着什么"收束 closing_tendency: "implications" # 倾向于以"这意味着什么"收束,但根据文章内容自行判断最合适的收尾方式
data_intro_pattern: "framework → data → implication → caveat" data_intro_pattern: "framework → data → implication → caveat"
# 示例: # 示例:

View file

@ -16,7 +16,7 @@ single_sentence_paragraph_rate: 0.10
emotional_arc: "steady_with_spikes" # 整体平稳1-2 处锐利判断 emotional_arc: "steady_with_spikes" # 整体平稳1-2 处锐利判断
opening_style: "news_hook" # 以一个行业事件/数据切入 opening_style: "news_hook" # 以一个行业事件/数据切入
closing_style: "open_question" # 留一个没答案的问题 closing_tendency: "open_question" # 倾向于留一个没答案的问题,但根据文章内容自行判断最合适的收尾方式
data_intro_pattern: "context → data → contrast → judgment" data_intro_pattern: "context → data → contrast → judgment"
# 示例: # 示例:

View file

@ -18,7 +18,7 @@ single_sentence_paragraph_rate: 0.25 # 25% 的段落只有 1 句
# 情绪 # 情绪
emotional_arc: "restrained_to_burst" emotional_arc: "restrained_to_burst"
opening_style: "personal_moment" # 以一个私人时刻开头("凌晨一点多…" opening_style: "personal_moment" # 以一个私人时刻开头("凌晨一点多…"
closing_style: "trailing_off" # 不收束,像聊天自然结尾("我先睡了"/"真的看不清楚" closing_tendency: "trailing_off" # 倾向于不收束、像聊天自然结尾,但根据文章内容自行判断最合适的收尾方式
# 数据呈现 # 数据呈现
data_intro_pattern: "scene → reaction → data → interpretation" data_intro_pattern: "scene → reaction → data → interpretation"

View file

@ -17,7 +17,7 @@ single_sentence_paragraph_rate: 0.20 # 多用短句成段制造节奏
emotional_arc: "cold_open_to_sharp_close" emotional_arc: "cold_open_to_sharp_close"
opening_style: "cold_open" # 直接切入核心矛盾,不铺垫 opening_style: "cold_open" # 直接切入核心矛盾,不铺垫
closing_style: "sharp_statement" # 一句定性收束 closing_tendency: "sharp_statement" # 倾向于一句定性收束,但根据文章内容自行判断最合适的收尾方式
data_intro_pattern: "claim → evidence → twist" data_intro_pattern: "claim → evidence → twist"
# 示例: # 示例:

View file

@ -16,7 +16,7 @@ single_sentence_paragraph_rate: 0.15
emotional_arc: "gentle_build" # 缓慢升温,情绪在中后段到达高点 emotional_arc: "gentle_build" # 缓慢升温,情绪在中后段到达高点
opening_style: "scene" # 以一个温暖的场景开头 opening_style: "scene" # 以一个温暖的场景开头
closing_style: "image" # 用一个画面收束 closing_tendency: "image" # 倾向于用一个画面收束,但根据文章内容自行判断最合适的收尾方式
data_intro_pattern: "story → embed data → feeling" data_intro_pattern: "story → embed data → feeling"
# 示例: # 示例:

View file

@ -0,0 +1,96 @@
# 通用人类写作模式种子
#
# 用途没有范文库的用户Step 4.4 用这些段落作为 few-shot 注入,
# 教 LLM "人类写作的结构模式长什么样"。
#
# 设计原则:
# - 只示范结构模式(句长方差、情绪锐度、自我纠正、非总结式收尾)
# - 不携带特定风格/人格(任何 persona 都能兼容)
# - 每个段落标注了它示范的反AI模式
#
# 有用户自己的范文库时,这个文件不会被使用。
opening_hooks:
- text: |
好多年没有坐公交了,上次去太子湾,由于景区限行,只能把车停在外面,坐景区免费接驳车过去。
前面座位看到一个小女孩一直在刷那种 AI 生成的短视频,画面非常粗糙,内容也很假,滑到下一个居然还是差不多的东西,看得津津有味。
当时看到这一幕我甚至有点伤心。
pattern: "日常观察切入 → 意外情绪反应。不总结、不预告、不铺垫。"
- text: |
**本硕八年毕业,单程通勤两个半小时,**月薪2690。 这是市场给我贴的标签。
**裸辞。创业年收超7位数。** 这是我自己撕掉那个标签之后,重新定义的自己。
一路走来,中间发生了什么?我讲给你听。
pattern: "标签→撕裂对比开头。两组加粗短句制造视觉和语义落差。句长标准差 45.7(数据最高)。"
- text: |
29号我和小伙伴在深圳搞活动。
活动结束之后,我想顺道拜访一个多年没见的老朋友,发消息过去。
他回:在三亚。
我问:度假?
他说:带孩子。
我盯着手机屏幕,愣了整整三秒。
pattern: "对话碎片制造节奏。2-4字短句紧邻20+字长句。物理反应替代心理描写。"
- text: |
我信了这套话很多年。
"要有长期主义。要相信复利。时间是最好的朋友。"
最惨的一次在一个方向扎进去3年回头一看什么都没留下来。
pattern: "先认同再推翻。引用常见正确的话→用个人惨痛经历否定。开头即高潮。"
emotional_peaks:
- text: |
我信了这套话很多年。
最惨的一次在一个方向扎进去3年回头一看什么都没留下来。
这不是失败——失败还有个明确的结果。
是你信错了一件事。
pattern: "用'最惨'而非'有挑战'。否定委婉说法('这不是失败'),给出更痛的定义。"
- text: |
很多人在温水煮青蛙的过程中得过且过,过着看似满意、实则内心有很多不满的生活,然后说一句,算了吧,现在这样也还行,但这样反而错失了挖掘自己最大潜力的机会。
pattern: "用'温水煮青蛙'具象化停滞感。'算了吧'是内心独白式引用。来自得分最低32.8)的文章。"
- text: |
讲真,我每次看到这种争论,都觉得……怎么说呢……挺无语的。
不是说这些人蠢。
是他们在纠结一个根本不存在的问题。
什么叫"AI味道"?你能定义吗?你能量化吗?你能验证吗?
不能。
那你在纠结什么?
pattern: "填充词('怎么说呢'+ 连续反问不给答案 + 单字段落('不能。')。"
transitions:
- text: |
我第一反应是"孩子这时候不应该在学校吗",第二反应是想把这话发过去,第三反应是我把那句话吞回去了——因为我在那三秒里想清楚了一件事。
pattern: "思维过程外化(三个反应)。破折号打断 → 时间锚点('三秒')→ 悬念。"
- text: |
不过,到了之后我发现,什么作息啊,学习强度啊,都不是最难熬的,人才是。
pattern: "列举预期困难再一句否定。转折词 + 真实困难揭示。"
- text: |
不过话又说回来。知道自己在局里,这件事本身,就已经是出局的开始了。
pattern: "'不过话又说回来'——自我推翻后重新定位。制造思维的非线性感。"
closings:
- text: |
时间是你唯一不可再生的资源。
把它投进一个真实存在的锚点,才叫复利。
投进一个"我相信它会好"的希望,叫做漫长的等死。
pattern: "重新定义核心概念收尾。'等死'替代励志结论。来自得分 36.7 的文章。"
- text: |
有了 AI 之后,很多事都更容易了,但也正因为更容易了,什么东西真的值得做、值得花很多年去换,反而变得更难想清楚。要做什么可能比怎么更快做出一个东西更加重要了。
pattern: "结尾是未完成的思考,不是结论。'可能'留有余地。没有升华。"
- text: |
我苦哈哈的在电脑前,写这篇文章,想着我的女儿。
差距是真实的。
机会也是真实的。
时钟在走,窗口在收窄。
pattern: "回到写作现场。重复句式('是真实的')制造执念感。来自得分 33.0 的文章。"
- text: |
不要在那个愣住的感觉里待太久。
那个感觉,待久了,就成了借口。
pattern: "回扣开头意象。两句话收束,不解释。草率感本身就是风格。"

View file

View file

@ -101,6 +101,18 @@ AI 段落长度趋于均匀。人类段落忽长忽短。
- 每段末尾都用反问句(变成另一种模式化) - 每段末尾都用反问句(变成另一种模式化)
- 口语词匀速分布(不要每 200 字准时出现一个"讲真" - 口语词匀速分布(不要每 200 字准时出现一个"讲真"
- 总结性收尾("让我们拭目以待"/"未来可期" - 总结性收尾("让我们拭目以待"/"未来可期"
- 连续文章使用相同收尾结构(收尾方式应由文章内容决定,不是由人格模板决定)
**收尾多样性**persona 的 `closing_tendency` 是倾向而非硬规则。根据文章走到结尾时的内容和情绪自行判断最自然的收尾方式。以下是常见的人类收尾模式,每篇文章选最贴合内容的一种:
| 模式 | 特征 | 适合场景 |
|------|------|---------|
| 自然断流 | 像聊天说到一半停了("我先睡了"/"就这样吧" | 深夜风格、随笔 |
| 未答之问 | 以问题结尾,不给答案 | 争议话题、引发思考 |
| 场景回扣 | 回到开头的意象/场景 | 叙事类、故事驱动 |
| 硬切 | 最后一个论点说完直接结束,无收束语 | 评论、观点类 |
| 反结论 | 明确拒绝给结论("我也不知道"/"答案可能不存在" | 复杂议题、探索性 |
| 画面定格 | 用一个视觉画面收束 | 情感类、人物类 |
**writing-config 参数**`emotional_arc`flat/gradual/restrained_to_burst/volatile **writing-config 参数**`emotional_arc`flat/gradual/restrained_to_burst/volatile

View file

@ -0,0 +1,374 @@
#!/usr/bin/env python3
"""
Extract style exemplars from human-written articles for SICO-style few-shot injection.
Takes a markdown article, analyzes it for style fingerprints, extracts key
segments (opening hook, emotional peak, transition/self-correction, closing),
and saves structured exemplar files to references/exemplars/.
Usage:
python3 scripts/extract_exemplar.py article.md
python3 scripts/extract_exemplar.py article.md --category tech-opinion --source "公众号名"
python3 scripts/extract_exemplar.py article1.md article2.md article3.md # batch
python3 scripts/extract_exemplar.py --list # list all exemplars
"""
import argparse
import json
import re
import sys
from datetime import datetime
from pathlib import Path
import yaml
# Reuse analysis functions from humanness_score
sys.path.insert(0, str(Path(__file__).parent))
import humanness_score as hs
SKILL_DIR = Path(__file__).parent.parent
EXEMPLARS_DIR = SKILL_DIR / "references" / "exemplars"
INDEX_FILE = EXEMPLARS_DIR / "index.yaml"
CATEGORIES = ["tech-opinion", "story-emotional", "list-practical", "hot-take", "general"]
# Category detection markers
STORY_MARKERS = [
"", "我们", "那天", "那年", "记得", "后来", "当时",
"第一次", "最后", "突然", "终于",
]
# ============================================================
# Segment Extraction
# ============================================================
def extract_headings(text):
"""Extract H2 headings from markdown."""
return re.findall(r'^##\s+(.+)$', text, re.MULTILINE)
def extract_title(text):
"""Extract H1 title from markdown."""
m = re.search(r'^#\s+(.+)$', text, re.MULTILINE)
return m.group(1).strip() if m else ""
def extract_opening(paragraphs, max_chars=250):
"""Extract opening hook — first non-empty paragraph(s) up to max_chars."""
result = []
total = 0
for p in paragraphs:
if total + len(p) > max_chars and result:
break
result.append(p)
total += len(p)
return "\n\n".join(result)
def extract_emotional_peak(paragraphs):
"""Find paragraph with highest negative emotion density."""
best_para, best_density = "", -1.0
for p in paragraphs:
if len(p) < 20:
continue
count = sum(1 for m in hs.NEGATIVE_MARKERS if m in p)
density = count / len(p) * 100
if density > best_density:
best_density = density
best_para = p
return best_para if best_density > 0 else ""
def extract_transition(paragraphs):
"""Find paragraph with most self-correction / transition patterns."""
transition_words = [
"但是", "不过", "然而", "话说回来", "换个角度",
"说回来", "但话又说回来", "不对", "算了",
]
best_para, best_count = "", 0
for p in paragraphs:
if len(p) < 20:
continue
count = sum(len(re.findall(pat, p)) for pat in hs.SELF_CORRECTION_PATTERNS)
count += sum(p.count(w) for w in transition_words)
if count > best_count:
best_count = count
best_para = p
return best_para if best_count > 0 else ""
def extract_closing(paragraphs, max_chars=250):
"""Extract closing paragraph(s), reading backwards."""
result = []
total = 0
for p in reversed(paragraphs):
if total + len(p) > max_chars and result:
break
result.insert(0, p)
total += len(p)
return "\n\n".join(result)
# ============================================================
# Category Detection
# ============================================================
def detect_category(text, paragraphs, headings):
"""Auto-detect article category from content features."""
data_count = sum(len(re.findall(p, text)) for p in hs.REAL_SOURCE_PATTERNS)
story_count = sum(text.count(m) for m in STORY_MARKERS)
h2_count = len(headings)
neg_count = sum(1 for m in hs.NEGATIVE_MARKERS if m in text)
scores = {
"tech-opinion": data_count * 2,
"story-emotional": story_count * 1.5,
"list-practical": h2_count * 3 if h2_count >= 5 else 0,
"hot-take": neg_count * 2 + data_count if len(text) < 2000 else 0,
"general": 5,
}
return max(scores, key=scores.get)
# ============================================================
# Statistical Fingerprint
# ============================================================
def compute_vocab_temperature(text):
"""Compute vocabulary temperature band distribution."""
counts = {
"cold": sum(text.count(w) for w in hs.COLD_WORDS),
"warm": sum(text.count(w) for w in hs.WARM_WORDS),
"hot": sum(text.count(w) for w in hs.HOT_WORDS),
"wild": sum(text.count(w) for w in hs.WILD_WORDS),
}
total = sum(counts.values())
if total == 0:
return {k: 0.25 for k in counts}
return {k: round(v / total, 2) for k, v in counts.items()}
def compute_paragraph_cv(paragraphs):
"""Coefficient of variation for paragraph lengths."""
if len(paragraphs) < 3:
return 0.0
lengths = [len(p) for p in paragraphs]
mean = sum(lengths) / len(lengths)
if mean == 0:
return 0.0
variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
return round((variance ** 0.5) / mean, 2)
def count_short_paragraphs(text):
"""Count single-sentence short paragraphs (1-10 chars, non-heading)."""
return sum(1 for l in text.split('\n')
if l.strip() and 1 <= len(l.strip()) <= 10
and not l.strip().startswith('#'))
# ============================================================
# Main Extraction
# ============================================================
def extract_exemplar(text, category=None, source=None):
"""Analyze article and return structured exemplar dict."""
clean = re.sub(r'^#+\s+.*$', '', text, flags=re.MULTILINE).strip()
paragraphs = hs._split_paragraphs(text)
sentences = hs._split_sentences(clean)
headings = extract_headings(text)
title = extract_title(text) or source or ""
if not category:
category = detect_category(clean, paragraphs, headings)
score_result = hs.score_article(text)
# Sentence length stats
lengths = [len(s) for s in sentences]
if len(lengths) >= 2:
mean = sum(lengths) / len(lengths)
variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
sentence_stddev = round(variance ** 0.5, 1)
else:
sentence_stddev = 0.0
neg_count = sum(1 for s in sentences if any(m in s for m in hs.NEGATIVE_MARKERS))
negative_ratio = round(neg_count / len(sentences), 2) if sentences else 0.0
return {
"title": title,
"source": source or title,
"category": category,
"humanness_score": score_result["composite_score"],
"fingerprint": {
"sentence_stddev": sentence_stddev,
"vocab_temperature": compute_vocab_temperature(clean),
"negative_ratio": negative_ratio,
"paragraph_cv": compute_paragraph_cv(paragraphs),
"short_paragraphs": count_short_paragraphs(text),
},
"segments": {
"opening": extract_opening(paragraphs),
"emotional_peak": extract_emotional_peak(paragraphs),
"transition": extract_transition(paragraphs),
"closing": extract_closing(paragraphs),
},
"extracted_at": datetime.now().strftime("%Y-%m-%d"),
"char_count": len(clean),
}
# ============================================================
# Persistence
# ============================================================
def save_exemplar(exemplar):
"""Save exemplar to markdown file and update index.yaml. Returns filepath."""
EXEMPLARS_DIR.mkdir(parents=True, exist_ok=True)
category = exemplar["category"]
num = 1
while (EXEMPLARS_DIR / f"{category}-{num:03d}.md").exists():
num += 1
filename = f"{category}-{num:03d}.md"
filepath = EXEMPLARS_DIR / filename
fp = exemplar["fingerprint"]
seg = exemplar["segments"]
frontmatter = {
"source": exemplar["source"],
"category": category,
"humanness_score": exemplar["humanness_score"],
"sentence_stddev": fp["sentence_stddev"],
"vocab_temperature": fp["vocab_temperature"],
"negative_ratio": fp["negative_ratio"],
"paragraph_cv": fp["paragraph_cv"],
"short_paragraphs": fp["short_paragraphs"],
"extracted_at": exemplar["extracted_at"],
}
content = "---\n"
content += yaml.dump(frontmatter, allow_unicode=True, default_flow_style=False)
content += "---\n\n"
section_map = [
("opening", "开头钩子"),
("emotional_peak", "情绪高峰"),
("transition", "转折/自纠"),
("closing", "收尾"),
]
for key, label in section_map:
if seg.get(key):
content += f"## {label}\n\n{seg[key]}\n\n"
filepath.write_text(content, encoding="utf-8")
_update_index(filename, exemplar)
return filepath
def _update_index(filename, exemplar):
"""Add or update entry in index.yaml."""
index = []
if INDEX_FILE.exists():
with open(INDEX_FILE, "r", encoding="utf-8") as f:
index = yaml.safe_load(f) or []
entry = {
"file": filename,
"source": exemplar["source"],
"category": exemplar["category"],
"humanness_score": exemplar["humanness_score"],
"extracted_at": exemplar["extracted_at"],
}
index = [e for e in index if e.get("file") != filename]
index.append(entry)
index.sort(key=lambda x: (x["category"], x["humanness_score"]))
with open(INDEX_FILE, "w", encoding="utf-8") as f:
yaml.dump(index, f, allow_unicode=True, default_flow_style=False)
# ============================================================
# List / CLI
# ============================================================
def list_exemplars():
"""Print all exemplars in the library."""
if not INDEX_FILE.exists():
print("范文库为空。用法: python3 scripts/extract_exemplar.py article.md")
return
with open(INDEX_FILE, "r", encoding="utf-8") as f:
index = yaml.safe_load(f) or []
if not index:
print("范文库为空。")
return
print(f"\n{'=' * 60}")
print(f"范文库 ({len(index)} 篇)")
print(f"{'=' * 60}")
by_cat = {}
for e in index:
by_cat.setdefault(e["category"], []).append(e)
for cat, entries in sorted(by_cat.items()):
print(f"\n [{cat}] ({len(entries)} 篇)")
for e in entries:
score = e["humanness_score"]
bar = "" * int((100 - score) / 10) + "" * (10 - int((100 - score) / 10))
print(f" {bar} {score:5.1f} {e['source'][:40]}")
def main():
parser = argparse.ArgumentParser(description="Extract style exemplars from articles")
parser.add_argument("inputs", nargs="*", help="Markdown article file(s)")
parser.add_argument("--category", "-c", choices=CATEGORIES,
help="Article category (auto-detected if omitted)")
parser.add_argument("--source", "-s", help="Source name (e.g. account name)")
parser.add_argument("--list", "-l", action="store_true", help="List all exemplars")
parser.add_argument("--json", action="store_true", help="JSON output")
args = parser.parse_args()
if args.list:
list_exemplars()
return
if not args.inputs:
parser.print_help()
sys.exit(1)
for input_path in args.inputs:
path = Path(input_path)
if not path.exists():
print(f"Error: {input_path} not found", file=sys.stderr)
continue
text = path.read_text(encoding="utf-8")
source = args.source or path.stem # fallback to filename without extension
exemplar = extract_exemplar(text, category=args.category, source=source)
filepath = save_exemplar(exemplar)
if args.json:
print(json.dumps(exemplar, ensure_ascii=False, indent=2))
else:
print(f"{path.name}")
print(f" Category: {exemplar['category']}")
print(f" Score: {exemplar['humanness_score']:.1f}/100")
print(f" Segments: {sum(1 for v in exemplar['segments'].values() if v)}/4")
fp = exemplar["fingerprint"]
print(f" Stddev: {fp['sentence_stddev']}")
print(f" Neg ratio: {fp['negative_ratio']:.0%}")
print(f" Para CV: {fp['paragraph_cv']}")
temp = fp["vocab_temperature"]
print(f" Temp: cold={temp['cold']} warm={temp['warm']} hot={temp['hot']} wild={temp['wild']}")
print(f" Saved to: {filepath}")
print()
if __name__ == "__main__":
main()

View file

@ -55,10 +55,23 @@ REAL_SOURCE_PATTERNS = [
] ]
NEGATIVE_MARKERS = [ NEGATIVE_MARKERS = [
# 直接负面情绪
"失望", "糟糕", "", "", "", "差劲", "崩溃", "吐槽", "", "失望", "糟糕", "", "", "", "差劲", "崩溃", "吐槽", "",
"", "", "焦虑", "担忧", "不满", "恶心", "可怕", "可悲", "可笑", "", "", "焦虑", "担忧", "不满", "恶心", "可怕", "可悲", "可笑",
"离谱", "尴尬", "无语", "", "", "", "", "离谱", "尴尬", "无语", "", "", "", "",
# 绝望/迷茫
"绝望", "迷茫", "心累", "", "后悔", "后怕", "心寒",
# 欺骗/操控(隐性负面)
"", "忽悠", "割韭菜", "套路", "画大饼", "洗脑",
# 失败/徒劳
"白费", "白搭", "没戏", "黄了", "凉了", "废了",
# 自嘲/自贬
"", "天真", "吃亏", "自嗨", "打脸",
# 讽刺/反语
"呵呵", "好吧", "行吧", "真服了",
# 短语
"太扯了", "说实话我很失望", "搞什么", "不靠谱", "受不了", "太扯了", "说实话我很失望", "搞什么", "不靠谱", "受不了",
"受够了", "想哭", "伤心", "苦哈哈", "得过且过",
] ]
COMMON_ADVERBS = [ COMMON_ADVERBS = [
@ -69,10 +82,27 @@ COMMON_ADVERBS = [
"竟然", "简直", "几乎", "完全", "绝对", "必然", "竟然", "简直", "几乎", "完全", "绝对", "必然",
] ]
COLD_WORDS = ["边际", "认知负荷", "信息不对称", "路径依赖", "商业模式", "生态系统", "增量"] COLD_WORDS = [
WARM_WORDS = ["说白了", "其实吧", "讲真", "说实话", "坦白讲", "懂的都懂", "怎么说呢"] "边际", "认知负荷", "信息不对称", "路径依赖", "商业模式", "生态系统", "增量",
HOT_WORDS = ["DNA动了", "格局打开", "遥遥<EFBFBD><EFBFBD>", "", "内卷", "炸了", "杀疯了", "吃灰"] "技术栈", "标准化", "结构性", "规模化", "护城河", "飞轮", "闭环",
WILD_WORDS = ["整挺好", "不靠谱", "瞎折腾", "搁这儿", "糊弄", "", ""] "赛道", "壁垒", "方法论", "底层逻辑", "第一性原理", "杠杆", "复利",
"ROI", "PMF", "代运营", "供给侧", "需求侧",
]
WARM_WORDS = [
"说白了", "其实吧", "讲真", "说实话", "坦白讲", "懂的都懂", "怎么说呢",
"老实说", "这么说吧", "你想啊", "别急", "慢慢来",
"有意思的是", "好玩的是", "巧的是", "说来话长", "话说回来",
]
HOT_WORDS = [
"DNA动了", "格局打开", "遥遥领先", "", "内卷", "炸了", "杀疯了", "吃灰",
"凡尔赛", "标题党", "躺平", "摆烂", "破防", "上头", "内耗",
"蒸发", "出圈", "降维打击", "弯道超车",
]
WILD_WORDS = [
"整挺好", "不靠谱", "瞎折腾", "搁这儿", "糊弄", "", "",
"苦哈哈", "傻乎乎", "稀里糊涂", "得了吧", "算了吧",
"摔了跤", "交学费", "踩坑", "翻车", "栽了",
]
SELF_CORRECTION_PATTERNS = [ SELF_CORRECTION_PATTERNS = [
r'不对[,]', r'准确说', r'算了', r'说错了', r'不对[,]', r'准确说', r'算了', r'说错了',
@ -314,6 +344,81 @@ def run_tier(checks, text):
return results return results
# ============================================================
# Calibration (bell-curve + over-optimization penalty)
# ============================================================
# Human article baselines (from 15 example articles, 2026-03-30)
# Dimensions where AI over-optimizes: bell-curve scoring penalizes
# both "too low" AND "too high" relative to human average.
_BELL_CURVE_CHECKS = {
"broken_sentences": 0.39,
"self_correction": 0.20,
"sentence_length_range": 0.71,
"paragraph_length_variance": 0.52,
"banned_words": 0.73,
}
def _bell_curve(raw_score, center):
"""Score peaks at center (human avg), penalizes over-optimization.
Below center: linear rise (as before).
Above center: quadratic penalty too much is suspicious.
"""
if center <= 0:
return raw_score
if raw_score <= center:
return raw_score / center
else:
overshoot = (raw_score - center) / (1.0 - center) if center < 1 else 0
return max(0.0, 1.0 - overshoot * overshoot)
def calibrate_tiers(tier1, tier2):
"""Apply bell-curve calibration and over-optimization penalty in-place."""
# 1. Bell-curve adjustment for over-optimizable dimensions
for tier in [tier1, tier2]:
for name, data in tier.items():
if name.startswith("_"):
continue
if name in _BELL_CURVE_CHECKS:
raw = data["score"]
center = _BELL_CURVE_CHECKS[name]
calibrated = round(max(0.0, min(1.0, _bell_curve(raw, center))), 4)
data["raw_score"] = raw
data["score"] = calibrated
data["detail"] += f" [calibrated from {raw:.2f}, center={center}]"
# 2. Over-optimization penalty: if 60%+ of checks score > 0.8,
# the article is suspiciously "perfect" — apply global penalty.
all_scores = []
for tier in [tier1, tier2]:
for name, data in tier.items():
if not name.startswith("_"):
all_scores.append(data["score"])
high_count = sum(1 for s in all_scores if s > 0.8)
over_opt_ratio = high_count / len(all_scores) if all_scores else 0
penalty = 1.0
if over_opt_ratio >= 0.6:
penalty = 0.85 # 15% penalty for suspiciously perfect articles
if penalty < 1.0:
for tier in [tier1, tier2]:
for name, data in tier.items():
if not name.startswith("_"):
data["score"] = round(data["score"] * penalty, 4)
# 3. Recalculate tier summaries
for tier in [tier1, tier2]:
scores = [data["score"] for name, data in tier.items() if not name.startswith("_")]
tier["_summary"]["mean_score"] = round(sum(scores) / len(scores), 4) if scores else 0
tier["_summary"]["scores"] = [round(s, 4) for s in scores]
return penalty
# ============================================================ # ============================================================
# Composite Score # Composite Score
# ============================================================ # ============================================================
@ -364,6 +469,7 @@ def score_article(text, verbose=False, tier3_score=None):
tier1 = run_tier(TIER1_CHECKS, clean) tier1 = run_tier(TIER1_CHECKS, clean)
tier2 = run_tier(TIER2_CHECKS, clean) tier2 = run_tier(TIER2_CHECKS, clean)
over_opt_penalty = calibrate_tiers(tier1, tier2)
composite, weights = compute_composite(tier1, tier2, tier3_score) composite, weights = compute_composite(tier1, tier2, tier3_score)
param_scores = build_param_scores(tier1, tier2) param_scores = build_param_scores(tier1, tier2)
@ -377,6 +483,7 @@ def score_article(text, verbose=False, tier3_score=None):
}, },
"weights": weights, "weights": weights,
"param_scores": param_scores, "param_scores": param_scores,
"over_optimization_penalty": over_opt_penalty,
"char_count": len(clean), "char_count": len(clean),
} }

View file

@ -325,6 +325,20 @@ def main():
lesson_file = save_lesson(diff_result, args.draft, args.final) lesson_file = save_lesson(diff_result, args.draft, args.final)
print(f"\nLesson saved to: {lesson_file}") print(f"\nLesson saved to: {lesson_file}")
# Auto-grow exemplar library from edited finals
final_title = extract_title(final)
try:
import extract_exemplar
exemplar = extract_exemplar.extract_exemplar(final, source=final_title or "user-edited")
if exemplar["humanness_score"] <= 50:
exemplar_path = extract_exemplar.save_exemplar(exemplar)
print(f"\n✓ 终稿已加入范文库: {exemplar_path}")
print(f" Score: {exemplar['humanness_score']:.1f}/100, Category: {exemplar['category']}")
else:
print(f"\n⚠ 终稿 humanness_score={exemplar['humanness_score']:.1f} > 50未加入范文库")
except Exception as e:
print(f"\n⚠ 范文提取跳过: {e}")
lesson_count = len(load_all_lessons()) lesson_count = len(load_all_lessons())
print(f"Total lessons: {lesson_count}") print(f"Total lessons: {lesson_count}")