feat: add article content extraction with anti-scraping fallback

- New `scripts/fetch_article.py`: extract WeChat article content as Markdown
  with three-level fetch strategy (requests → Playwright → manual HTML)
- Refactor `learn_theme.py` to reuse `fetch_article.fetch_html()`, removing
  duplicate fetch logic
- Update SKILL.md: add the "学习这篇文章/导入范文" ("learn this article" / "import exemplar") auxiliary command
- Update README.md: add article extraction to feature table and directory tree

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
wangzhuc 2026-04-02 00:34:02 +08:00 committed by github-actions[bot]
parent 530e65c41c
commit 25d6a44082
10 changed files with 1407 additions and 222 deletions


@@ -33,6 +33,7 @@
 | Exemplar style library | SICO-style few-shot: extracts a style fingerprint from your articles and injects it at writing time | `scripts/extract_exemplar.py` |
 | Style flywheel | Learns from your edits; the more you use it, the more it sounds like you | `references/learn-edits.md` |
 | Layout learning | Extracts a layout theme from any WeChat Official Account article URL | `scripts/learn_theme.py` |
+| Article capture | Extracts the body of an Official Account article URL as Markdown, ready to import into the exemplar library | `scripts/fetch_article.py` |
 ## Writing personas
@@ -183,6 +184,7 @@ wewrite/
 │ ├── humanness_score.py # Article quality scoring (11 checks, used by the self-check and Step 5)
 │ ├── extract_exemplar.py # Exemplar style extraction (SICO-style few-shot library building)
 │ ├── learn_theme.py # Extract a layout theme from an Official Account article URL
+│ ├── fetch_article.py # Extract the body of an Official Account article URL as Markdown
 │ ├── diagnose.py # Configuration completeness check
 │ └── build_openclaw.py # SKILL.md → OpenClaw format conversion
@@ -192,7 +194,7 @@ wewrite/
 │ ├── theme.py # YAML theme engine
 │ ├── publisher.py # WeChat draft box API + 小绿书 image posts
 │ ├── wechat_api.py # access_token / image upload
-│ ├── image_gen.py # AI image generation (doubao / OpenAI / Gemini)
+│ ├── image_gen.py # AI image generation (9 providers, automatic fallback)
 │ └── themes/ # 16+ layout themes (including dark mode; new ones can be learned from articles)
 ├── personas/ # 5 preset writing personas (with measured 朱雀 data)


@@ -49,6 +49,7 @@ allowed-tools:
 - **Local edits** (default): the user edits the markdown files in `output/`
 - **WeChat draft box sync**: `python3 {skill_dir}/scripts/learn_edits.py --from-wechat` pulls the latest content back from the draft box automatically and runs a plain-text diff against the local original
 - User says "学习排版"/"学排版" ("learn layout") → `python3 {skill_dir}/scripts/learn_theme.py <url> --name <name>`; the user must supply an Official Account article URL and a theme name. After extraction, prompt the user to set the `theme` field in `style.yaml`.
+- User says "学习这篇文章"/"导入范文" ("learn this article" / "import exemplar") + URL → `python3 {skill_dir}/scripts/fetch_article.py <url> -o /tmp/article.md && python3 {skill_dir}/scripts/extract_exemplar.py /tmp/article.md -s <account_name>`; extracts the body from the Official Account article URL and imports it into the exemplar library. Supports a three-level fallback (requests → Playwright → manual HTML).
 - User says "看看文章数据" ("show article stats") → `Read: {skill_dir}/references/effect-review.md`
 - User says "检查一下"/"自检"/"这篇文章怎么样" ("check it" / "self-check" / "how is this article") → run a self-check on the most recently generated article (or the one the user specifies) and output the resulting report:


@@ -40,6 +40,7 @@ description: |
 - **Local edits** (default): the user edits the markdown files in `output/`
 - **WeChat draft box sync**: `python3 {baseDir}/scripts/learn_edits.py --from-wechat` pulls the latest content back from the draft box automatically and runs a plain-text diff against the local original
 - User says "学习排版"/"学排版" ("learn layout") → `python3 {baseDir}/scripts/learn_theme.py <url> --name <name>`; the user must supply an Official Account article URL and a theme name. After extraction, prompt the user to set the `theme` field in `style.yaml`.
+- User says "学习这篇文章"/"导入范文" ("learn this article" / "import exemplar") + URL → `python3 {baseDir}/scripts/fetch_article.py <url> -o /tmp/article.md && python3 {baseDir}/scripts/extract_exemplar.py /tmp/article.md -s <account_name>`; extracts the body from the Official Account article URL and imports it into the exemplar library. Supports a three-level fallback (requests → Playwright → manual HTML).
 - User says "看看文章数据" ("show article stats") → `Read: {baseDir}/references/effect-review.md`
 - User says "检查一下"/"自检"/"这篇文章怎么样" ("check it" / "self-check" / "how is this article") → run a self-check on the most recently generated article (or the one the user specifies) and output the resulting report:
@@ -98,7 +99,7 @@ python3 -c "import markdown, bs4, cssutils, requests, yaml, pygments, PIL" 2>&1
 | `config.yaml` exists | Silent | Guide the user through creating it, or set `skip_publish = true` |
 | Python dependencies | Silent | Offer `pip install -r requirements.txt` |
 | `wechat.appid` + `secret` | Silent | Set `skip_publish = true` |
-| `image.api_key` | Silent | Set `skip_image_gen = true` |
+| `image.api_key` / `image.providers`: at least one valid | Silent | Set `skip_image_gen = true` |
 | `references/exemplars/index.yaml` | Silent | Prompt: "The exemplar library is empty. If you have published articles (markdown), say **'导入范文'** ('import exemplar') to build a style library, and generated articles will sound more like you. Not having any does not affect usage." |
 **1.2 Version check** (pass silently or remind):
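@@ -377,9 +378,11 @@ python3 {baseDir}/scripts/humanness_score.py {article_path} --json --tier3 {agen
 - **Interactive mode**: show the cover and ask the user "How does the cover look?". If the user is happy, continue; if not, adjust the prompt and regenerate.
 - **Fully automatic mode**: the agent self-checks whether the entities in the prompt are recognizable in the image description. If the prompt is too generic (only abstract words such as "科技感" (tech feel) or "未来感" (futuristic feel), with no concrete entities), retry once with a different set of prompts.
-**6.4 In-article images**: analyze the article structure and generate 3-6 in-article image prompts (per visual-prompts.md). Style, palette, and art style follow the cover to keep the visuals consistent. Call image_gen.py in batch and replace the Markdown placeholders.
-**Fallback**: image generation fails → output the prompts + alternative stock-photo keywords, then continue.
+**6.3b Style anchoring**: once the cover is confirmed, extract the visual anchors (palette hex values, style keywords, image tone). Every subsequent in-article image prompt must reference these anchors to keep the whole article visually consistent.
+**6.4 In-article images**: analyze the article structure, choose an image type for each paragraph that needs an image (infographic/scene/flowchart/comparison/framework/timeline), and use the matching structured prompt template to generate 3-6 image prompts (per visual-prompts.md). Call image_gen.py in batch and replace the Markdown placeholders.
+**Fallback**: image_gen.py supports automatic multi-provider fallback (tried in the order of the providers list in config.yaml). If all of them fail → output the prompts + alternative stock-photo keywords, then continue.
 ---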


@@ -8,27 +8,63 @@ wechat:
   author: ""  # Default byline (optional)

 # AI image generation
+# Supports 9 providers: configure one and it works; configure several for automatic fallback.
+#
+# ┌──────────────┬──────────────────────────────────────────┬──────────────────────────────────────┐
+# │ Provider     │ Where to get the API key                 │ Notes                                │
+# ├──────────────┼──────────────────────────────────────────┼──────────────────────────────────────┤
+# │ doubao       │ https://console.volcengine.com/ark       │ best for Chinese prompts             │
+# │ dashscope    │ https://dashscope.console.aliyun.com/    │ Alibaba Tongyi Wanxiang              │
+# │ jimeng       │ https://console.volcengine.com/iam       │ ByteDance Jimeng, strong in Chinese  │
+# │ minimax      │ https://platform.minimaxi.com/           │ China-based provider                 │
+# │ openai       │ https://platform.openai.com/api-keys     │ DALL-E, general purpose              │
+# │ azure_openai │ Azure Portal                             │ OpenAI reachable from mainland China │
+# │ gemini       │ https://aistudio.google.com/apikey       │ generous free quota                  │
+# │ openrouter   │ https://openrouter.ai/settings/keys      │ multi-model proxy                    │
+# │ replicate    │ https://replicate.com/account/api-tokens │ many open-source models              │
+# └──────────────┴──────────────────────────────────────────┴──────────────────────────────────────┘
+#
+# Two configuration styles are supported:
+# Style 1: single provider (simple usage, fill in just one)
 image:
-  # Available providers: doubao | openai | gemini
-  provider: "doubao"
+  provider: "doubao"  # see the Provider column in the table above
   api_key: "your_api_key"
+  # model: "doubao-seedream-5-0-260128"  # optional; each provider has a default
+  # base_url: "https://ark.cn-beijing.volces.com/api/v3"  # optional

-  # doubao-seedream (default)
-  # Get an API key: https://console.volcengine.com/ark
-  # model: "doubao-seedream-5-0-260128"
-  # base_url: "https://ark.cn-beijing.volces.com/api/v3"
-
-  # OpenAI DALL-E 3
-  # provider: "openai"
-  # api_key: "sk-..."
-  # model: "dall-e-3"
-  # base_url: "https://api.openai.com/v1"
-
-  # Google Gemini Imagen
-  # provider: "gemini"
-  # api_key: "AIza..."
-  # Get an API key: https://aistudio.google.com/apikey
-  # model: "gemini-3.1-flash-image-preview"
+# Style 2: multiple providers with automatic fallback (recommended)
+# Providers are tried in order; if one fails the next is used. You do not need to fill in all of them.
+# image:
+#   providers:
+#     - provider: doubao
+#       api_key: "your_volcengine_key"
+#     - provider: dashscope
+#       api_key: "your_dashscope_key"
+#       # model: "qwen-image-2.0-pro"
+#     - provider: jimeng
+#       api_key: "your_access_key_id"  # jimeng needs access_key_id + secret_key
+#       secret_key: "your_secret_access_key"
+#       # model: "jimeng_t2i_v40"
+#     - provider: minimax
+#       api_key: "your_minimax_key"
+#       # model: "image-01"
+#     - provider: openai
+#       api_key: "sk-..."
+#       # model: "dall-e-3"
+#     - provider: azure_openai
+#       api_key: "your_azure_key"
+#       base_url: "https://YOUR-RESOURCE.openai.azure.com/openai"  # required
+#       # deployment: "dall-e-3"
+#     - provider: gemini
+#       api_key: "AIza..."
+#       # model: "gemini-3.1-flash-image-preview"
+#     - provider: openrouter
+#       api_key: "sk-or-..."
+#       # model: "google/gemini-3.1-flash-image-preview"
+#     - provider: replicate
+#       api_key: "r8_..."
+#       # model: "google/nano-banana-pro"

 # Default layout theme
 theme: "professional-clean"
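
Both configuration styles could be normalized into one ordered list before any fallback loop runs. The sketch below assumes the parsed `image` section arrives as a plain dict; `normalize_image_config` is a hypothetical helper for illustration and is not taken from `image_gen.py`.

```python
def normalize_image_config(image_cfg: dict) -> list[dict]:
    """Return provider entries in the order they should be tried."""
    providers = image_cfg.get("providers")
    if providers:
        # Multi-provider style: keep only entries that name a provider and carry a key.
        return [p for p in providers if p.get("provider") and p.get("api_key")]
    if image_cfg.get("provider") and image_cfg.get("api_key"):
        # Single-provider style: wrap it as a one-element list.
        return [{k: v for k, v in image_cfg.items() if k != "providers"}]
    return []


single = normalize_image_config({"provider": "doubao", "api_key": "k1"})
multi = normalize_image_config(
    {"providers": [{"provider": "doubao", "api_key": "k1"},
                   {"provider": "gemini", "api_key": ""}]}
)
print([p["provider"] for p in single], [p["provider"] for p in multi])
```

An empty result corresponds to the "set `skip_image_gen = true`" branch of the environment check.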


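@@ -73,6 +73,24 @@
 ---

+## Style anchoring
+
+Once the cover is confirmed, **extract the visual anchors immediately**; every subsequent in-article image must reuse them:
+
+```
+Visual anchors:
+- Palette: {the cover's primary hex + secondary hex, e.g. #2563EB + #F97316}
+- Style keywords: {the cover's style description, e.g. "flat illustration, minimalist, bold outlines"}
+- Image tone: {cool / warm / neutral}
+```
+
+**Rules**:
+- Every in-article image prompt must end with the palette and style keywords from the visual anchors
+- If the cover is warm-toned, in-article images must not suddenly switch to a cool-toned tech style (and vice versa)
+- The visual anchors stay the same across all images in the article
+
+---
+
 ## 2. In-article images (3-6 images)

 ### Analysis flow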
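
The style-anchoring rule in the hunk above (append the palette and style keywords to the end of every in-article prompt) could be enforced mechanically. A rough sketch, in which `VisualAnchor` and `anchor_prompt` are hypothetical names and the example prompt is invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VisualAnchor:
    palette: tuple[str, ...]  # hex colours taken from the confirmed cover
    style_keywords: str       # e.g. "flat illustration, minimalist, bold outlines"
    tone: str                 # "warm", "cool" or "neutral"


def anchor_prompt(prompt: str, anchor: VisualAnchor) -> str:
    """Append the anchor's palette and style keywords to the end of a prompt."""
    palette = " + ".join(anchor.palette)
    return f"{prompt.rstrip()}\nColors: {palette}\nStyle: {anchor.style_keywords}"


anchor = VisualAnchor(("#2563EB", "#F97316"),
                      "flat illustration, minimalist, bold outlines", "cool")
anchored = anchor_prompt("Focal Point: a delivery robot on a rainy street", anchor)
print(anchored)
```

Freezing the dataclass mirrors the rule that the anchors do not change once the cover is confirmed.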
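@@ -94,7 +112,20 @@
 | Turning point / climax → visual impact | Right after another image (less than 300 characters apart) |
 | After a long passage (>400 characters without an image) → pacing | The closing CTA paragraph |

-**Step 3: decide the position**
+**Step 3: decide the image type**
+
+Based on the paragraph content, choose the best-matching type for each image:
+
+| Type | Suitable content | Core composition |
+|------|---------|---------|
+| infographic | Data, statistics, metric comparisons | Zoned blocks + labels |
+| scene | Narrative scenes, mood setting, personal stories | Focal subject + atmospheric lighting |
+| flowchart | Processes, steps, workflows | Step nodes + connecting arrows |
+| comparison | Two options or viewpoints side by side | Left/right columns + divider |
+| framework | Conceptual models, architecture relationships | Layered nodes + relationship lines |
+| timeline | Timelines, development history | Time axis + milestone markers |
+
+**Step 4: decide the position**

 - Insert the image **after** the corresponding paragraph (not before)
 - Be specific: "after paragraph N under H2 XX"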
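@@ -104,24 +135,132 @@
 - Do not place an image before the first paragraph of the article
 - Do not place an image in the closing CTA paragraph

-### Prompt format
+### Structured prompt templates

-Output for each image:
+Use the structured template that matches the image type to generate the prompt. **Free-text descriptions are forbidden**: every field of the template must be filled in.
+
+#### infographic

 ```
 ### Image {index}: after paragraph {N} of "{H2 title}"
-- Purpose: {information reinforcement / scene recreation / pacing}
-- Corresponding content: {what this passage says, summarized in one sentence}
-- Image description: {concrete image content, 80-120 characters}
-- AI image prompt:
-  "{Chinese prompt, for doubao-seedream}"
+- Type: infographic
+- Corresponding content: {one-sentence summary}
+
+Layout: {grid / radial / hierarchical}
+Zones:
+  - Zone 1: {a specific data point, using real numbers from the article}
+  - Zone 2: {a comparison or trend, using real numbers from the article}
+  - Zone 3: {conclusion / key takeaway}
+Labels: {real numbers, terms, and metric names from the article}
+Colors: {visual anchor palette}
+Style: {visual anchor style keywords}, clean infographic, no text
+Aspect: 16:9
+
 - Fallback: {Unsplash/Pexels search keywords}
 ```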
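-### Special requirements for in-article images
+#### scene
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: scene
+- Corresponding content: {one-sentence summary}
+
+Focal Point: {main subject; must be an entity from the article}
+Atmosphere: {lighting, environment, time of day}
+Mood: {emotional tone}
+Color Temperature: {warm / cool / neutral, consistent with the visual anchors}
+Style: {visual anchor style keywords}, no text no letters
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+#### flowchart
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: flowchart
+- Corresponding content: {one-sentence summary}
+
+Layout: {left-right / top-down / circular}
+Steps:
+  1. {step name}: {short description}
+  2. {step name}: {short description}
+  3. {step name}: {short description}
+Connections: {arrow directions, decision branches}
+Colors: {visual anchor palette}
+Style: {visual anchor style keywords}, clean diagram, no text
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+#### comparison
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: comparison
+- Corresponding content: {one-sentence summary}
+
+Left Side: {option A name}
+  - {point 1}
+  - {point 2}
+Right Side: {option B name}
+  - {point 1}
+  - {point 2}
+Divider: {divider style}
+Colors: {visual anchor palette, one primary colour per side}
+Style: {visual anchor style keywords}, split layout, no text
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+#### framework
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: framework
+- Corresponding content: {one-sentence summary}
+
+Structure: {hierarchical / network / matrix}
+Nodes:
+  - {concept 1}: {role}
+  - {concept 2}: {role}
+  - {concept 3}: {role}
+Relationships: {how the nodes connect}
+Colors: {visual anchor palette}
+Style: {visual anchor style keywords}, clean diagram, no text
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+#### timeline
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: timeline
+- Corresponding content: {one-sentence summary}
+
+Direction: {horizontal / vertical}
+Events:
+  - {time point 1}: {milestone}
+  - {time point 2}: {milestone}
+  - {time point 3}: {milestone}
+Markers: {visual marker style}
+Colors: {visual anchor palette}
+Style: {visual anchor style keywords}, clean timeline, no text
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```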
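+### General requirements for in-article images

 - Uniform size: **16:9 landscape** (image_gen.py --size article)
-- **Style consistency**: follow the palette, art style, and visual language set by the cover. Explicitly reuse the cover's style description in every prompt (e.g. "flat illustration, blue-orange palette, minimalist")
+- **Visual anchoring**: the Colors and Style fields of every prompt must reference the visual anchors extracted from the cover
 - The entity anchoring rule is the same as for the cover: every prompt contains at least 2 entities from the article
 - Keep it simple: on a phone screen, a simple image works better than a complex one
 - Write prompts in Chinese (seedream understands Chinese well)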

dist/openclaw/scripts/fetch_article.py (new file, 323 lines, vendored)

@@ -0,0 +1,323 @@
#!/usr/bin/env python3
"""fetch_article.py: extract WeChat article content as Markdown.

Three-level fetching strategy:
  Level 1: requests (fast, zero overhead, works for most articles)
  Level 2: Playwright headless Chrome (bypasses anti-scraping JS checks)
  Level 3: Prompt user to save HTML manually and pass via --file

Usage:
  python3 scripts/fetch_article.py <url>                  # auto fetch
  python3 scripts/fetch_article.py <url> -o article.md    # save to file
  python3 scripts/fetch_article.py --file saved.html      # from local HTML
  python3 scripts/fetch_article.py <url> --json           # JSON output for agent
"""

import argparse
import json
import re
import sys
from pathlib import Path

import requests
from bs4 import BeautifulSoup, NavigableString

_BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

# ---------------------------------------------------------------------------
# Fetching: three-level strategy
# ---------------------------------------------------------------------------

def _fetch_requests(url: str, timeout: int = 20) -> str | None:
    """Level 1: plain requests. Returns HTML string or None on failure."""
    try:
        resp = requests.get(url, headers={"User-Agent": _BROWSER_UA}, timeout=timeout)
        resp.raise_for_status()
        resp.encoding = "utf-8"
        return resp.text
    except requests.exceptions.RequestException:
        return None


def _fetch_playwright(url: str, timeout: int = 30000) -> str | None:
    """Level 2: Playwright headless Chrome. Returns HTML or None."""
    try:
        from playwright.sync_api import sync_playwright
    except ImportError:
        # Playwright is optional; skip this level when it is not installed
        return None
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page(user_agent=_BROWSER_UA)
            page.goto(url, wait_until="networkidle", timeout=timeout)
            # Wait for WeChat content to render
            page.wait_for_selector("#js_content", timeout=10000)
            html = page.content()
            browser.close()
            return html
    except Exception:
        return None


def fetch_html(url: str) -> str:
    """Fetch article HTML with automatic fallback.

    Returns HTML string. Exits with error if all levels fail.
    """
    # Level 1
    html = _fetch_requests(url)
    if html and _has_content(html):
        return html
    # Level 2
    print("requests did not return the article body; trying Playwright...", file=sys.stderr)
    html = _fetch_playwright(url)
    if html and _has_content(html):
        return html
    # Level 3
    print(
        "Error: could not fetch the article content. Open the article in a browser, "
        "save the page as HTML, then pass it with --file.",
        file=sys.stderr,
    )
    sys.exit(1)


def _has_content(html: str) -> bool:
    """Check if HTML contains non-empty #js_content."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="js_content")
    if content is None:
        return False
    text = content.get_text(strip=True)
    return len(text) > 50  # must have real content, not just whitespace

# ---------------------------------------------------------------------------
# HTML → Markdown conversion
# ---------------------------------------------------------------------------

def _extract_metadata(soup: BeautifulSoup) -> dict:
    """Extract article metadata from WeChat page."""
    title_tag = soup.find("h1", class_="rich_media_title") or soup.find(
        "h1", id="activity-name"
    )
    title = title_tag.get_text(strip=True) if title_tag else ""
    author_tag = soup.find("a", id="js_name") or soup.find(
        "span", class_="rich_media_meta_nickname"
    )
    author = author_tag.get_text(strip=True) if author_tag else ""
    # Publish time
    pub_tag = soup.find("em", id="publish_time")
    pub_time = pub_tag.get_text(strip=True) if pub_tag else ""
    return {"title": title, "author": author, "publish_time": pub_time}

def _elem_to_md(elem, depth: int = 0) -> str:
    """Convert a single HTML element to Markdown."""
    tag = elem.name if hasattr(elem, "name") else None
    if isinstance(elem, NavigableString):
        text = str(elem).strip()
        return text if text else ""
    if tag is None:
        return ""
    # Skip hidden elements
    style = elem.get("style", "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        return ""
    # Void elements have no children, so they must be handled before the
    # empty-content check below; otherwise they would always be dropped.
    if tag == "br":
        return "\n"
    if tag == "hr":
        return "\n\n---\n\n"
    if tag == "img":
        src = elem.get("data-src") or elem.get("src") or ""
        alt = elem.get("alt", "")
        return f"\n\n![{alt}]({src})\n\n" if src else ""
    # Get inner content recursively
    inner = ""
    for child in elem.children:
        inner += _elem_to_md(child, depth + 1)
    inner = inner.strip()
    if not inner:
        return ""
    # Headings
    if tag in ("h1", "h2", "h3", "h4"):
        level = int(tag[1])
        return f"\n\n{'#' * level} {inner}\n\n"
    # Paragraphs
    if tag == "p":
        return f"\n\n{inner}\n\n"
    # Bold
    if tag in ("strong", "b"):
        return f"**{inner}**"
    # Italic
    if tag in ("em", "i"):
        return f"*{inner}*"
    # Links
    if tag == "a":
        href = elem.get("href", "")
        if href and not href.startswith("javascript:"):
            return f"[{inner}]({href})"
        return inner
    # Blockquotes
    if tag == "blockquote":
        lines = inner.split("\n")
        quoted = "\n".join(f"> {line}" for line in lines if line.strip())
        return f"\n\n{quoted}\n\n"
    # Lists
    if tag in ("ul", "ol"):
        return f"\n\n{inner}\n\n"
    if tag == "li":
        parent = elem.parent
        if parent and parent.name == "ol":
            # Every item is emitted as "1."; Markdown renderers renumber ordered lists
            return f"1. {inner}\n"
        return f"- {inner}\n"
    # Code
    if tag == "code":
        if elem.parent and elem.parent.name == "pre":
            return inner
        return f"`{inner}`"
    if tag == "pre":
        return f"\n\n```\n{inner}\n```\n\n"
    # Table cells
    if tag in ("td", "th"):
        return f" {inner} |"
    # Section / div / span and any other tag: pass the inner content through
    return inner

def html_to_markdown(soup: BeautifulSoup) -> str:
    """Convert WeChat article HTML to clean Markdown."""
    content = soup.find(id="js_content")
    if content is None:
        return ""
    raw = _elem_to_md(content)
    # Clean up excessive whitespace
    md = re.sub(r"\n{3,}", "\n\n", raw)
    md = md.strip()
    return md


# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------

def fetch_article(url: str | None = None, file_path: str | None = None) -> dict:
    """Fetch and parse a WeChat article.

    Args:
        url: WeChat article URL.
        file_path: Path to saved HTML file (alternative to URL).

    Returns:
        dict with keys: title, author, publish_time, markdown, url
    """
    if file_path:
        html = Path(file_path).read_text(encoding="utf-8")
    elif url:
        html = fetch_html(url)
    else:
        raise ValueError("Either url or file_path must be provided")
    soup = BeautifulSoup(html, "html.parser")
    meta = _extract_metadata(soup)
    md = html_to_markdown(soup)
    return {
        "title": meta["title"],
        "author": meta["author"],
        "publish_time": meta["publish_time"],
        "markdown": md,
        "url": url or "",
    }

# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main():
    ap = argparse.ArgumentParser(
        description="Extract WeChat article content as Markdown."
    )
    ap.add_argument("url", nargs="?", help="WeChat article URL")
    ap.add_argument("--file", dest="file_path",
                    help="Local HTML file instead of URL")
    ap.add_argument("-o", "--output", help="Save Markdown to file")
    ap.add_argument("--json", dest="as_json", action="store_true",
                    help="Output as JSON (for agent use)")
    args = ap.parse_args()
    if not args.url and not args.file_path:
        ap.error("Provide a URL or --file path")
    result = fetch_article(url=args.url, file_path=args.file_path)
    if args.as_json:
        print(json.dumps(result, ensure_ascii=False, indent=2))
    elif args.output:
        # Write Markdown with YAML frontmatter
        out = Path(args.output)
        frontmatter = f"---\ntitle: \"{result['title']}\"\nauthor: \"{result['author']}\"\n"
        if result["publish_time"]:
            frontmatter += f"date: \"{result['publish_time']}\"\n"
        if result["url"]:
            frontmatter += f"source: \"{result['url']}\"\n"
        frontmatter += "---\n\n"
        out.write_text(frontmatter + result["markdown"], encoding="utf-8")
        print(f"Saved: {out}")
    else:
        if result["title"]:
            print(f"# {result['title']}\n")
        if result["author"]:
            print(f"> {result['author']}")
        if result["publish_time"]:
            print(f"> {result['publish_time']}")
        if result["author"] or result["publish_time"]:
            print()
        print(result["markdown"])


if __name__ == "__main__":
    main()
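
The frontmatter in `main()` is assembled with f-strings, so a title that contains a double quote or a backslash would produce invalid YAML. One possible hardening is sketched below. It relies on the assumption that a JSON string literal is also an acceptable YAML double-quoted scalar, and `build_frontmatter` is a hypothetical helper that is not part of the commit.

```python
import json


def build_frontmatter(result: dict) -> str:
    """Build YAML frontmatter, quoting every value via json.dumps."""
    fields = [("title", result.get("title", "")),
              ("author", result.get("author", ""))]
    if result.get("publish_time"):
        fields.append(("date", result["publish_time"]))
    if result.get("url"):
        fields.append(("source", result["url"]))
    # json.dumps escapes quotes and backslashes; ensure_ascii=False keeps Chinese text readable.
    lines = [f"{key}: {json.dumps(value, ensure_ascii=False)}" for key, value in fields]
    return "---\n" + "\n".join(lines) + "\n---\n\n"


frontmatter = build_frontmatter({"title": 'He said "hi"', "author": "wangzhuc", "url": ""})
print(frontmatter)
```

The optional `date` and `source` keys are emitted only when present, matching the behaviour of the original f-string version.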


@@ -12,7 +12,6 @@ import sys
 from collections import Counter
 from pathlib import Path

-import requests
 import yaml
 from bs4 import BeautifulSoup

@@ -154,12 +153,6 @@ _TARGET_TAGS = {
     "blockquote", "code", "pre", "img", "a",
 }

-_BROWSER_UA = (
-    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-    "AppleWebKit/537.36 (KHTML, like Gecko) "
-    "Chrome/124.0.0.0 Safari/537.36"
-)
-
 TEMPLATE_THEME = "professional-clean"
 THEMES_DIR = Path(__file__).resolve().parent.parent / "toolkit" / "themes"

@@ -175,26 +168,20 @@ def _attach_title(soup, content) -> None:
 def fetch_article(url: str, timeout: int = 20) -> "BeautifulSoup tag":
     """Fetch a WeChat article, return the ``#js_content`` element.

-    The article title is attached as ``content._wewrite_title`` (empty string
-    if not found). Exits with code 1 on network errors or missing content.
-
-    Parameters
-    ----------
-    url: WeChat article URL (mp.weixin.qq.com/)
-    timeout: HTTP request timeout in seconds (default 20).
+    Delegates to fetch_article.fetch_html() for three-level fetching
+    (requests → Playwright → manual fallback).
+
+    The article title is attached as ``content._wewrite_title`` (empty string
+    if not found).
     """
-    try:
-        resp = requests.get(url, headers={"User-Agent": _BROWSER_UA}, timeout=timeout)
-        resp.raise_for_status()
-    except requests.exceptions.RequestException as exc:
-        print(f"Error: failed to fetch URL: {exc}", file=sys.stderr)
-        sys.exit(1)
-
-    resp.encoding = "utf-8"
-    soup = BeautifulSoup(resp.text, "html.parser")
+    from scripts.fetch_article import fetch_html
+
+    html = fetch_html(url)
+    soup = BeautifulSoup(html, "html.parser")
     content = soup.find(id="js_content")
     if content is None:
-        print("Error: #js_content not found - the page may require verification.", file=sys.stderr)
+        print("Error: #js_content not found.", file=sys.stderr)
         sys.exit(1)

     _attach_title(soup, content)


@@ -6,6 +6,12 @@ Supports multiple providers via a simple abstraction:
 - doubao-seedream (Volcengine Ark): default, good for Chinese prompts
 - openai (DALL-E 3): broad availability
 - gemini (Google Gemini Imagen): multimodal image generation
+- dashscope (Alibaba Tongyi Wanxiang): good for Chinese prompts
+- minimax: Chinese provider
+- replicate: open-source models
+- azure_openai: Azure-hosted DALL-E
+- openrouter: multi-model proxy
+- jimeng (ByteDance): good for Chinese prompts
 - Custom providers via ImageProvider base class

 Usage as CLI:
@@ -21,8 +27,12 @@ Usage as module:
 import abc
 import argparse
 import base64
+import hashlib
+import hmac
 import json
 import sys
+import time
+from datetime import datetime, timezone
 from pathlib import Path

 import requests
@@ -51,11 +61,31 @@ def _load_config() -> dict:
 # Cover: 2.35:1, WeChat cover ratio
 # Article: 16:9, landscape in-article image
 # Vertical: 9:16, portrait
+_DEFAULT = "1792x1024"
+_DEFAULT_V = "1024x1792"
+_DEFAULT_SQ = "1024x1024"
+
 SIZE_PRESETS = {
-    "cover": {"doubao": "2952x1256", "openai": "1792x1024", "gemini": "1792x1024"},
-    "article": {"doubao": "2560x1440", "openai": "1792x1024", "gemini": "1792x1024"},
-    "vertical": {"doubao": "1088x2560", "openai": "1024x1792", "gemini": "1024x1792"},
-    "square": {"doubao": "2048x2048", "openai": "1024x1024", "gemini": "1024x1024"},
+    "cover": {
+        "doubao": "2952x1256", "openai": _DEFAULT, "gemini": _DEFAULT,
+        "dashscope": _DEFAULT, "minimax": _DEFAULT, "replicate": _DEFAULT,
+        "azure_openai": _DEFAULT, "openrouter": _DEFAULT, "jimeng": _DEFAULT,
+    },
+    "article": {
+        "doubao": "2560x1440", "openai": _DEFAULT, "gemini": _DEFAULT,
+        "dashscope": _DEFAULT, "minimax": _DEFAULT, "replicate": _DEFAULT,
+        "azure_openai": _DEFAULT, "openrouter": _DEFAULT, "jimeng": _DEFAULT,
+    },
+    "vertical": {
+        "doubao": "1088x2560", "openai": _DEFAULT_V, "gemini": _DEFAULT_V,
+        "dashscope": _DEFAULT_V, "minimax": _DEFAULT_V, "replicate": _DEFAULT_V,
+        "azure_openai": _DEFAULT_V, "openrouter": _DEFAULT_V, "jimeng": _DEFAULT_V,
+    },
+    "square": {
+        "doubao": "2048x2048", "openai": _DEFAULT_SQ, "gemini": _DEFAULT_SQ,
+        "dashscope": _DEFAULT_SQ, "minimax": _DEFAULT_SQ, "replicate": _DEFAULT_SQ,
+        "azure_openai": _DEFAULT_SQ, "openrouter": _DEFAULT_SQ, "jimeng": _DEFAULT_SQ,
+    },
 }

 MAX_FILE_SIZE = 5 * 1024 * 1024  # 5MB
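
Since only `doubao` deviates from the per-preset default in the table above, the same mapping could be derived from a smaller specification. This is an alternative sketch, not what the commit does; `PROVIDERS` and `_SPEC` are hypothetical names, while the sizes are copied from the hunk above.

```python
PROVIDERS = ["doubao", "openai", "gemini", "dashscope", "minimax",
             "replicate", "azure_openai", "openrouter", "jimeng"]

# preset -> (default size, {provider: override})
_SPEC = {
    "cover": ("1792x1024", {"doubao": "2952x1256"}),
    "article": ("1792x1024", {"doubao": "2560x1440"}),
    "vertical": ("1024x1792", {"doubao": "1088x2560"}),
    "square": ("1024x1024", {"doubao": "2048x2048"}),
}

# Expand the spec so every provider has an explicit entry for every preset.
SIZE_PRESETS = {
    preset: {p: overrides.get(p, default) for p in PROVIDERS}
    for preset, (default, overrides) in _SPEC.items()
}

print(SIZE_PRESETS["cover"]["doubao"], SIZE_PRESETS["cover"]["jimeng"])
```

Adding a tenth provider would then mean appending one name to `PROVIDERS` instead of editing four dictionaries.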
@@ -79,6 +109,29 @@ def _compress_image(raw_bytes: bytes, max_size: int) -> bytes:
     return buf.getvalue()


+def _size_to_aspect(size: str) -> str:
+    """Convert 'WxH' to a standard aspect ratio string (first match within tolerance)."""
+    if ":" in size:
+        return size
+    try:
+        w, h = (int(x) for x in size.split("x", 1))
+        ratio = w / h
+    except (ValueError, ZeroDivisionError):
+        # Malformed size or zero height: fall back to landscape
+        return "16:9"
+    for ar, val in [("1:1", 1.0), ("16:9", 16/9), ("9:16", 9/16),
+                    ("4:3", 4/3), ("3:4", 3/4), ("3:2", 3/2), ("2:3", 2/3)]:
+        if abs(ratio - val) < 0.15:
+            return ar
+    return "16:9"
+
+
+def _download_image(url: str) -> bytes:
+    """Download image bytes from URL."""
+    resp = requests.get(url, timeout=60)
+    resp.raise_for_status()
+    return resp.content
+
+
 # --- Provider abstraction ---

 class ImageProvider(abc.ABC):
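
`_size_to_aspect` returns the first ratio within a fixed tolerance, so a size such as `1450x1000` maps to `4:3` even though `3:2` is closer. If a true nearest match is wanted, one option is sketched below; `nearest_aspect` and `_ASPECTS` are hypothetical names, and the distance is a plain absolute difference of ratios.

```python
_ASPECTS = {"1:1": 1.0, "16:9": 16 / 9, "9:16": 9 / 16,
            "4:3": 4 / 3, "3:4": 3 / 4, "3:2": 3 / 2, "2:3": 2 / 3}


def nearest_aspect(size: str, default: str = "16:9") -> str:
    """Map 'WxH' to the closest ratio in _ASPECTS; return `default` on bad input."""
    if ":" in size:
        return size  # already an aspect ratio string
    try:
        w, h = (int(x) for x in size.split("x", 1))
        ratio = w / h
    except (ValueError, ZeroDivisionError):
        return default
    # Pick the named ratio with the smallest absolute distance.
    return min(_ASPECTS, key=lambda name: abs(_ASPECTS[name] - ratio))


print(nearest_aspect("1450x1000"), nearest_aspect("1088x2560"))
```

Very wide sizes such as the 2.35:1 cover still resolve to `16:9`, because that is the widest ratio in the table.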
@@ -86,15 +139,7 @@ class ImageProvider(abc.ABC):
     @abc.abstractmethod
     def generate(self, prompt: str, size: str) -> bytes:
-        """Generate an image and return raw bytes.
-
-        Args:
-            prompt: Image description (Chinese or English).
-            size: Resolved size string (e.g. "1792x1024").
-
-        Returns:
-            Raw image bytes.
-        """
+        """Generate an image and return raw bytes."""
         ...

     def resolve_size(self, preset: str) -> str:
@@ -102,63 +147,45 @@ class ImageProvider(abc.ABC):
         provider_key = self.provider_key
         if preset in SIZE_PRESETS:
             return SIZE_PRESETS[preset].get(provider_key, list(SIZE_PRESETS[preset].values())[0])
-        return preset  # assume explicit WxH
+        return preset

     @property
     @abc.abstractmethod
     def provider_key(self) -> str:
-        """Short identifier used for size preset lookup."""
         ...


+# --- Providers ---
+
 class DoubaoProvider(ImageProvider):
     """doubao-seedream via Volcengine Ark API."""

     provider_key = "doubao"

     def __init__(self, api_key: str, model: str = "doubao-seedream-5-0-260128",
-                 base_url: str = "https://ark.cn-beijing.volces.com/api/v3"):
+                 base_url: str = "https://ark.cn-beijing.volces.com/api/v3", **_kw):
         self._api_key = api_key
         self._model = model
         self._base_url = base_url

     def generate(self, prompt: str, size: str) -> bytes:
-        body = {
-            "model": self._model,
-            "prompt": prompt,
-            "response_format": "url",
-            "size": size,
-            "stream": False,
-            "watermark": False,
-        }
-
         resp = requests.post(
             f"{self._base_url}/images/generations",
-            headers={
-                "Content-Type": "application/json",
-                "Authorization": f"Bearer {self._api_key}",
-            },
-            json=body,
+            headers={"Content-Type": "application/json",
+                     "Authorization": f"Bearer {self._api_key}"},
+            json={"model": self._model, "prompt": prompt,
+                  "response_format": "url", "size": size,
+                  "stream": False, "watermark": False},
             timeout=120,
         )

         data = resp.json()
         if resp.status_code != 200:
-            error = data.get("error", {})
-            msg = error.get("message", json.dumps(data, ensure_ascii=False))
-            raise ValueError(f"Doubao API error ({resp.status_code}): {msg}")
-
-        image_data = data.get("data", [])
-        if not image_data:
-            raise ValueError(f"No image returned: {json.dumps(data, ensure_ascii=False)}")
-
-        image_url = image_data[0].get("url")
-        if not image_url:
-            raise ValueError(f"No image URL in response: {json.dumps(data, ensure_ascii=False)}")
-
-        img_resp = requests.get(image_url, timeout=60)
-        img_resp.raise_for_status()
-        return img_resp.content
+            raise ValueError(f"Doubao error ({resp.status_code}): "
+                             f"{data.get('error', {}).get('message', str(data))}")
+        # "or [{}]" also covers an empty "data" list, which would otherwise raise IndexError
+        url = (data.get("data") or [{}])[0].get("url")
+        if not url:
+            raise ValueError(f"No image URL: {data}")
+        return _download_image(url)


 class OpenAIProvider(ImageProvider):
@@ -167,50 +194,28 @@ class OpenAIProvider(ImageProvider):
     provider_key = "openai"

     def __init__(self, api_key: str, model: str = "dall-e-3",
-                 base_url: str = "https://api.openai.com/v1"):
+                 base_url: str = "https://api.openai.com/v1", **_kw):
         self._api_key = api_key
         self._model = model
         self._base_url = base_url

     def generate(self, prompt: str, size: str) -> bytes:
-        # DALL-E 3 expects size as "WxH" format
-        dall_e_size = size.replace("x", "x")  # normalize
-
-        body = {
-            "model": self._model,
-            "prompt": prompt,
-            "n": 1,
-            "size": dall_e_size,
-            "response_format": "url",
-        }
-
         resp = requests.post(
             f"{self._base_url}/images/generations",
-            headers={
-                "Content-Type": "application/json",
-                "Authorization": f"Bearer {self._api_key}",
-            },
-            json=body,
+            headers={"Content-Type": "application/json",
+                     "Authorization": f"Bearer {self._api_key}"},
+            json={"model": self._model, "prompt": prompt,
+                  "n": 1, "size": size, "response_format": "url"},
             timeout=120,
         )

         data = resp.json()
         if resp.status_code != 200:
-            error = data.get("error", {})
-            msg = error.get("message", json.dumps(data, ensure_ascii=False))
-            raise ValueError(f"OpenAI API error ({resp.status_code}): {msg}")
-
-        image_data = data.get("data", [])
-        if not image_data:
-            raise ValueError(f"No image returned: {json.dumps(data, ensure_ascii=False)}")
-
-        image_url = image_data[0].get("url")
-        if not image_url:
-            raise ValueError(f"No image URL in response: {json.dumps(data, ensure_ascii=False)}")
-
-        img_resp = requests.get(image_url, timeout=60)
-        img_resp.raise_for_status()
-        return img_resp.content
+            raise ValueError(f"OpenAI error ({resp.status_code}): "
+                             f"{data.get('error', {}).get('message', str(data))}")
+        # "or [{}]" also covers an empty "data" list, which would otherwise raise IndexError
+        url = (data.get("data") or [{}])[0].get("url")
+        if not url:
+            raise ValueError(f"No image URL: {data}")
+        return _download_image(url)


 class GeminiProvider(ImageProvider):
@ -219,47 +224,371 @@ class GeminiProvider(ImageProvider):
provider_key = "gemini" provider_key = "gemini"
def __init__(self, api_key: str, model: str = "gemini-3.1-flash-image-preview", def __init__(self, api_key: str, model: str = "gemini-3.1-flash-image-preview",
base_url: str = "https://generativelanguage.googleapis.com/v1beta"): base_url: str = "https://generativelanguage.googleapis.com/v1beta", **_kw):
self._api_key = api_key self._api_key = api_key
self._model = model self._model = model
self._base_url = base_url self._base_url = base_url
def generate(self, prompt: str, size: str) -> bytes: def generate(self, prompt: str, size: str) -> bytes:
# Append size instruction to prompt (Gemini doesn't have a native size param)
if "x" in size: if "x" in size:
w, h = size.split("x", 1) w, h = size.split("x", 1)
prompt = f"{prompt}\n\nGenerate this image at {w}x{h} resolution." prompt = f"{prompt}\n\nGenerate this image at {w}x{h} resolution."
body = {
"contents": [{"parts": [{"text": prompt}]}],
"generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
}
resp = requests.post( resp = requests.post(
f"{self._base_url}/models/{self._model}:generateContent", f"{self._base_url}/models/{self._model}:generateContent",
headers={ headers={"Content-Type": "application/json",
"Content-Type": "application/json", "x-goog-api-key": self._api_key},
"x-goog-api-key": self._api_key, json={"contents": [{"parts": [{"text": prompt}]}],
}, "generationConfig": {"responseModalities": ["TEXT", "IMAGE"]}},
json=body,
timeout=120, timeout=120,
) )
if resp.status_code != 200: if resp.status_code != 200:
try:
error = resp.json().get("error", {})
msg = error.get("message", resp.text[:200])
except (ValueError, KeyError):
msg = resp.text[:200] msg = resp.text[:200]
raise ValueError(f"Gemini API error ({resp.status_code}): {msg}") try:
msg = resp.json().get("error", {}).get("message", msg)
except Exception:
pass
raise ValueError(f"Gemini error ({resp.status_code}): {msg}")
for part in resp.json().get("candidates", [{}])[0].get("content", {}).get("parts", []):
inline = part.get("inlineData")
if inline and inline.get("mimeType", "").startswith("image/"):
return base64.b64decode(inline["data"])
raise ValueError("No image in Gemini response")
class DashScopeProvider(ImageProvider):
"""Alibaba Tongyi Wanxiang (通义万相) via DashScope API."""
provider_key = "dashscope"
def __init__(self, api_key: str, model: str = "qwen-image-2.0-pro",
base_url: str = "https://dashscope.aliyuncs.com/api/v1", **_kw):
self._api_key = api_key
self._model = model
self._base_url = base_url
def generate(self, prompt: str, size: str) -> bytes:
ds_size = size.replace("x", "*") # DashScope uses "W*H"
resp = requests.post(
f"{self._base_url}/services/aigc/multimodal-generation/generation",
headers={"Content-Type": "application/json",
"Authorization": f"Bearer {self._api_key}"},
json={
"model": self._model,
"input": {"messages": [{"role": "user", "content": [{"text": prompt}]}]},
"parameters": {"prompt_extend": False, "size": ds_size, "watermark": False},
},
timeout=120,
)
data = resp.json() data = resp.json()
candidates = data.get("candidates", []) if resp.status_code != 200:
if not candidates: raise ValueError(f"DashScope error ({resp.status_code}): "
raise ValueError("No candidates in Gemini response") f"{data.get('message', str(data))}")
parts = candidates[0].get("content", {}).get("parts", []) # Try output.result_image first, then output.choices
for part in parts: output = data.get("output", {})
inline_data = part.get("inlineData") img = output.get("result_image")
if inline_data and inline_data.get("mimeType", "").startswith("image/"): if not img:
return base64.b64decode(inline_data["data"]) choices = output.get("choices", [])
raise ValueError("No image found in Gemini response parts") if choices:
for c in choices[0].get("message", {}).get("content", []):
if "image" in c:
img = c["image"]
break
if not img:
raise ValueError(f"No image in DashScope response: {data}")
if img.startswith("http"):
return _download_image(img)
return base64.b64decode(img)
class MiniMaxProvider(ImageProvider):
"""MiniMax image generation."""
provider_key = "minimax"
def __init__(self, api_key: str, model: str = "image-01",
base_url: str = "https://api.minimax.io/v1", **_kw):
self._api_key = api_key
self._model = model
self._base_url = base_url
def generate(self, prompt: str, size: str) -> bytes:
w, h = 1024, 1024
try:
w, h = (int(x) for x in size.split("x", 1))
except ValueError:
pass
resp = requests.post(
f"{self._base_url}/image_generation",
headers={"Content-Type": "application/json",
"Authorization": f"Bearer {self._api_key}"},
json={"model": self._model, "prompt": prompt,
"response_format": "base64",
"width": w, "height": h, "n": 1},
timeout=120,
)
data = resp.json()
if resp.status_code != 200:
raise ValueError(f"MiniMax error ({resp.status_code}): {data}")
b64_list = data.get("data", {}).get("image_base64", [])
if not b64_list:
raise ValueError(f"No image in MiniMax response: {data}")
return base64.b64decode(b64_list[0])
class ReplicateProvider(ImageProvider):
"""Replicate API — supports many open-source image models."""
provider_key = "replicate"
_POLL_INTERVAL = 2
_POLL_TIMEOUT = 300
def __init__(self, api_key: str, model: str = "google/nano-banana-pro",
base_url: str = "https://api.replicate.com/v1", **_kw):
self._api_key = api_key
self._model = model
self._base_url = base_url
def generate(self, prompt: str, size: str) -> bytes:
aspect = _size_to_aspect(size)
headers = {"Content-Type": "application/json",
"Authorization": f"Bearer {self._api_key}",
"Prefer": "wait=60"}
resp = requests.post(
f"{self._base_url}/models/{self._model}/predictions",
headers=headers,
json={"input": {"prompt": prompt, "aspect_ratio": aspect,
"number_of_images": 1, "output_format": "png"}},
timeout=120,
)
data = resp.json()
if resp.status_code not in (200, 201):
raise ValueError(f"Replicate error ({resp.status_code}): {data}")
# Poll if not completed yet
poll_url = data.get("urls", {}).get("get")
deadline = time.monotonic() + self._POLL_TIMEOUT
while data.get("status") not in ("succeeded", "failed", "canceled"):
if time.monotonic() > deadline:
raise ValueError("Replicate polling timeout")
time.sleep(self._POLL_INTERVAL)
data = requests.get(poll_url, headers=headers, timeout=30).json()
if data.get("status") != "succeeded":
raise ValueError(f"Replicate failed: {data.get('error')}")
output = data.get("output")
if isinstance(output, list):
output = output[0]
if isinstance(output, dict):
output = output.get("url", output.get("uri"))
if not output or not isinstance(output, str):
raise ValueError(f"No image URL in Replicate output: {data}")
return _download_image(output)
class AzureOpenAIProvider(ImageProvider):
"""Azure-hosted OpenAI DALL-E."""
provider_key = "azure_openai"
def __init__(self, api_key: str, model: str = "dall-e-3",
base_url: str = "", deployment: str = "", **_kw):
self._api_key = api_key
self._deployment = deployment or model
self._base_url = base_url.rstrip("/")
def generate(self, prompt: str, size: str) -> bytes:
if not self._base_url:
raise ValueError("Azure OpenAI requires base_url "
"(e.g. https://YOUR-RESOURCE.openai.azure.com/openai)")
resp = requests.post(
f"{self._base_url}/deployments/{self._deployment}"
f"/images/generations?api-version=2025-04-01-preview",
headers={"Content-Type": "application/json",
"api-key": self._api_key},
json={"prompt": prompt, "size": size, "n": 1, "quality": "medium"},
timeout=120,
)
data = resp.json()
if resp.status_code != 200:
raise ValueError(f"Azure OpenAI error ({resp.status_code}): {data}")
item = data.get("data", [{}])[0]
if item.get("url"):
return _download_image(item["url"])
if item.get("b64_json"):
return base64.b64decode(item["b64_json"])
raise ValueError(f"No image in Azure response: {data}")
class OpenRouterProvider(ImageProvider):
"""OpenRouter — multi-model proxy using chat completions format."""
provider_key = "openrouter"
def __init__(self, api_key: str, model: str = "google/gemini-3.1-flash-image-preview",
base_url: str = "https://openrouter.ai/api/v1", **_kw):
self._api_key = api_key
self._model = model
self._base_url = base_url
def generate(self, prompt: str, size: str) -> bytes:
aspect = _size_to_aspect(size)
resp = requests.post(
f"{self._base_url}/chat/completions",
headers={"Content-Type": "application/json",
"Authorization": f"Bearer {self._api_key}"},
json={
"model": self._model,
"messages": [{"role": "user", "content": prompt}],
"modalities": ["image"],
"stream": False,
"image_config": {"aspect_ratio": aspect},
"provider": {"require_parameters": True},
},
timeout=120,
)
data = resp.json()
if resp.status_code != 200:
raise ValueError(f"OpenRouter error ({resp.status_code}): {data}")
# Extract image from multiple possible locations
choice = data.get("choices", [{}])[0].get("message", {})
# Path 1: images array
images = choice.get("images", [])
if images:
img = images[0]
if img.startswith("http"):
return _download_image(img)
if img.startswith("data:"):
_, b64 = img.split(",", 1)
return base64.b64decode(b64)
# Path 2: content array with image items
content = choice.get("content", [])
if isinstance(content, list):
for item in content:
if isinstance(item, dict) and item.get("type") == "image":
url = item.get("url") or item.get("image_url", {}).get("url")
if url:
if url.startswith("data:"):
_, b64 = url.split(",", 1)
return base64.b64decode(b64)
return _download_image(url)
raise ValueError(f"No image in OpenRouter response: {data}")
class JimengProvider(ImageProvider):
"""ByteDance Jimeng (即梦) — async submit + poll with HMAC-SHA256 auth."""
provider_key = "jimeng"
_POLL_INTERVAL = 2
_POLL_MAX_ATTEMPTS = 60
def __init__(self, api_key: str, secret_key: str = "",
model: str = "jimeng_t2i_v40",
base_url: str = "https://visual.volcengineapi.com", **_kw):
self._access_key = api_key
self._secret_key = secret_key
self._model = model
self._base_url = base_url
def _sign(self, method: str, path: str, query: str,
headers: dict, payload: bytes) -> dict:
"""Generate Volcengine HMAC-SHA256 signed headers."""
now = datetime.now(timezone.utc)
date_stamp = now.strftime("%Y%m%d")
amz_date = now.strftime("%Y%m%dT%H%M%SZ")
signed_headers_list = sorted(k.lower() for k in headers)
signed_headers_str = ";".join(signed_headers_list)
canonical = "\n".join([
method, path, query,
"".join(f"{k.lower()}:{headers[k]}\n" for k in sorted(headers)),
signed_headers_str,
hashlib.sha256(payload).hexdigest(),
])
region = "cn-north-1"
service = "cv"
scope = f"{date_stamp}/{region}/{service}/request"
string_to_sign = "\n".join([
"HMAC-SHA256", amz_date, scope,
hashlib.sha256(canonical.encode()).hexdigest(),
])
def _hmac(key: bytes, msg: str) -> bytes:
return hmac.new(key, msg.encode(), hashlib.sha256).digest()
k_date = _hmac(self._secret_key.encode(), date_stamp)
k_region = _hmac(k_date, region)
k_service = _hmac(k_region, service)
k_signing = _hmac(k_service, "request")
signature = hmac.new(k_signing, string_to_sign.encode(),
hashlib.sha256).hexdigest()
auth = (f"HMAC-SHA256 Credential={self._access_key}/{scope}, "
f"SignedHeaders={signed_headers_str}, Signature={signature}")
return {**headers, "Authorization": auth, "X-Date": amz_date}
def _request(self, action: str, body: dict) -> dict:
payload = json.dumps(body).encode()
path = "/"
query = f"Action={action}&Version=2022-08-31"
headers = {
"Content-Type": "application/json",
"Host": self._base_url.replace("https://", "").replace("http://", ""),
}
signed = self._sign("POST", path, query, headers, payload)
resp = requests.post(
f"{self._base_url}/?{query}",
headers=signed, data=payload, timeout=120,
)
data = resp.json()
if resp.status_code != 200:
raise ValueError(f"Jimeng error ({resp.status_code}): {data}")
return data
def generate(self, prompt: str, size: str) -> bytes:
if not self._secret_key:
raise ValueError("Jimeng requires both api_key (access_key_id) "
"and secret_key (secret_access_key)")
w, h = 1024, 1024
try:
w, h = (int(x) for x in size.split("x", 1))
except ValueError:
pass
# Submit task
submit = self._request("CVSync2AsyncSubmitTask", {
"req_key": self._model, "prompt": prompt,
"width": w, "height": h,
})
task_id = submit.get("data", {}).get("task_id")
if not task_id:
raise ValueError(f"No task_id from Jimeng: {submit}")
# Poll for result
for _ in range(self._POLL_MAX_ATTEMPTS):
time.sleep(self._POLL_INTERVAL)
result = self._request("CVSync2AsyncGetResult", {
"req_key": self._model, "task_id": task_id,
})
code = result.get("code")
if code == 10000:
data = result.get("data", {})
b64_list = data.get("binary_data_base64", [])
if b64_list:
return base64.b64decode(b64_list[0])
urls = data.get("image_urls", [])
if urls:
return _download_image(urls[0])
raise ValueError(f"No image data in Jimeng result: {result}")
if code and code != 10000:
status = result.get("data", {}).get("status")
if status in ("failed", "canceled"):
raise ValueError(f"Jimeng task failed: {result}")
raise ValueError("Jimeng polling timeout")
# --- Provider registry --- # --- Provider registry ---
@ -268,37 +597,82 @@ PROVIDERS = {
"doubao": DoubaoProvider, "doubao": DoubaoProvider,
"openai": OpenAIProvider, "openai": OpenAIProvider,
"gemini": GeminiProvider, "gemini": GeminiProvider,
"dashscope": DashScopeProvider,
"minimax": MiniMaxProvider,
"replicate": ReplicateProvider,
"azure_openai": AzureOpenAIProvider,
"openrouter": OpenRouterProvider,
"jimeng": JimengProvider,
} }
def _build_provider(config: dict) -> ImageProvider: def _build_provider_from_entry(entry: dict) -> ImageProvider:
"""Build an ImageProvider from config.yaml's image section.""" """Build a single ImageProvider from a provider config entry."""
img_cfg = config.get("image", {}) provider_name = entry.get("provider", "doubao")
provider_name = img_cfg.get("provider", "doubao") api_key = entry.get("api_key")
api_key = img_cfg.get("api_key")
if not api_key: if not api_key:
raise ValueError( raise ValueError(f"No api_key for provider '{provider_name}'")
f"image.api_key not set in config.yaml. "
f"Configure your {provider_name} API key to enable image generation."
)
provider_cls = PROVIDERS.get(provider_name) provider_cls = PROVIDERS.get(provider_name)
if not provider_cls: if not provider_cls:
raise ValueError( raise ValueError(
f"Unknown image provider: '{provider_name}'. " f"Unknown provider: '{provider_name}'. "
f"Available: {', '.join(PROVIDERS.keys())}" f"Available: {', '.join(PROVIDERS.keys())}"
) )
kwargs = {"api_key": api_key} kwargs = {"api_key": api_key}
if img_cfg.get("model"): if entry.get("model"):
kwargs["model"] = img_cfg["model"] kwargs["model"] = entry["model"]
if img_cfg.get("base_url"): if entry.get("base_url"):
kwargs["base_url"] = img_cfg["base_url"] kwargs["base_url"] = entry["base_url"]
if entry.get("secret_key"):
kwargs["secret_key"] = entry["secret_key"]
if entry.get("deployment"):
kwargs["deployment"] = entry["deployment"]
return provider_cls(**kwargs) return provider_cls(**kwargs)
def _build_provider_chain(config: dict) -> list[ImageProvider]:
"""Build an ordered list of providers to try.
Supports two config formats:
- Legacy: image.provider + image.api_key (single provider)
- New: image.providers (list, tried in order with auto-fallback)
"""
img_cfg = config.get("image", {})
providers_list = img_cfg.get("providers")
if providers_list and isinstance(providers_list, list):
chain = []
for entry in providers_list:
try:
chain.append(_build_provider_from_entry(entry))
except ValueError:
continue # skip misconfigured entries
if not chain:
raise ValueError(
"No valid providers in image.providers list. "
"Each entry needs 'provider' and 'api_key'."
)
return chain
# Legacy single-provider format
api_key = img_cfg.get("api_key")
if not api_key:
raise ValueError(
"image.api_key not set in config.yaml. "
"Configure your API key to enable image generation."
)
return [_build_provider_from_entry(img_cfg)]
def _build_provider(config: dict) -> ImageProvider:
"""Build an ImageProvider from config.yaml (backward-compatible entry point)."""
return _build_provider_chain(config)[0]
# --- Public API --- # --- Public API ---
def generate_image( def generate_image(
@ -308,7 +682,10 @@ def generate_image(
config: dict = None, config: dict = None,
) -> str: ) -> str:
""" """
Generate an image using the configured provider. Generate an image using configured providers with auto-fallback.
Tries each provider in order. If one fails, falls back to the next.
Supports both single-provider (legacy) and multi-provider config.
Args: Args:
prompt: Image generation prompt (Chinese or English). prompt: Image generation prompt (Chinese or English).
@ -322,10 +699,21 @@ def generate_image(
if config is None: if config is None:
config = _load_config() config = _load_config()
provider = _build_provider(config) chain = _build_provider_chain(config)
resolved_size = provider.resolve_size(size) last_error = None
for provider in chain:
resolved_size = provider.resolve_size(size)
try:
raw_bytes = provider.generate(prompt, resolved_size) raw_bytes = provider.generate(prompt, resolved_size)
except Exception as e:
last_error = e
print(
f"Provider '{provider.provider_key}' failed: {e}. "
f"Trying next...",
file=sys.stderr,
)
continue
# Compress if over 5MB (WeChat upload limit) # Compress if over 5MB (WeChat upload limit)
if len(raw_bytes) > MAX_FILE_SIZE: if len(raw_bytes) > MAX_FILE_SIZE:
@ -336,24 +724,20 @@ def generate_image(
output.write_bytes(raw_bytes) output.write_bytes(raw_bytes)
return str(output) return str(output)
raise ValueError(
f"All providers failed. Last error: {last_error}"
)
def main(): def main():
parser = argparse.ArgumentParser( ap = argparse.ArgumentParser(description="Generate images using AI")
description="Generate images using AI (doubao-seedream, OpenAI DALL-E, Gemini Imagen, etc.)" ap.add_argument("--prompt", required=True, help="Image generation prompt")
) ap.add_argument("--output", required=True, help="Output file path")
parser.add_argument("--prompt", required=True, help="Image generation prompt") ap.add_argument("--size", default="cover",
parser.add_argument("--output", required=True, help="Output file path") help="Size: cover, article, vertical, square, or WxH")
parser.add_argument( ap.add_argument("--provider", default=None,
"--size", help=f"Override provider ({', '.join(PROVIDERS)})")
default="cover", args = ap.parse_args()
help="Size: cover, article, vertical, square, or WxH",
)
parser.add_argument(
"--provider",
default=None,
help="Override provider (doubao, openai, gemini). Default: from config.yaml",
)
args = parser.parse_args()
try: try:
config = _load_config() config = _load_config()
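
Review note: the `_build_provider_chain` docstring above names two config shapes, a legacy single provider and an ordered `providers` list. A minimal sketch of that normalization follows. The helper name `normalize_image_config` and the sample keys are hypothetical and are not part of `image_gen.py`. Unlike the diff, the sketch does not check provider names against `PROVIDERS`.

```python
def normalize_image_config(img_cfg: dict) -> list[dict]:
    """Return provider entries in try-order (hypothetical helper).

    Accepts either a `providers` list or the legacy top-level
    `provider` + `api_key` pair. Entries without an api_key are skipped.
    """
    providers = img_cfg.get("providers")
    if isinstance(providers, list) and providers:
        # New format: keep list order, drop entries that cannot authenticate.
        entries = [e for e in providers if e.get("api_key")]
    elif img_cfg.get("api_key"):
        # Legacy format: the image section itself is the single entry.
        entries = [img_cfg]
    else:
        entries = []
    if not entries:
        raise ValueError("no usable image provider configured")
    return entries


legacy = normalize_image_config({"provider": "doubao", "api_key": "k1"})
multi = normalize_image_config({
    "providers": [
        {"provider": "openai", "api_key": "k2"},
        {"provider": "gemini"},  # no api_key, so it is skipped
        {"provider": "dashscope", "api_key": "k3"},
    ]
})
print([e["provider"] for e in multi])
```

Keeping the legacy shape as a one-element chain means the fallback loop in `generate_image` needs no special case for old configs.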

scripts/fetch_article.py Normal file

@ -0,0 +1,323 @@
#!/usr/bin/env python3
"""fetch_article.py — extract WeChat article content as Markdown.
Three-level fetching strategy:
  Level 1: requests (fast, zero overhead, works for most articles)
  Level 2: Playwright headless Chrome (bypasses anti-scraping JS checks)
  Level 3: Prompt user to save HTML manually and pass via --file

Usage:
  python3 scripts/fetch_article.py <url>                    # auto fetch
  python3 scripts/fetch_article.py <url> -o article.md      # save to file
  python3 scripts/fetch_article.py --file saved.html        # from local HTML
  python3 scripts/fetch_article.py <url> --json             # JSON output for agent
"""

import argparse
import json
import re
import sys
from pathlib import Path

import requests
from bs4 import BeautifulSoup, NavigableString

_BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

# ---------------------------------------------------------------------------
# Fetching: three-level strategy
# ---------------------------------------------------------------------------

def _fetch_requests(url: str, timeout: int = 20) -> str | None:
    """Level 1: plain requests. Returns HTML string or None on failure."""
    try:
        resp = requests.get(url, headers={"User-Agent": _BROWSER_UA}, timeout=timeout)
        resp.raise_for_status()
        resp.encoding = "utf-8"
        return resp.text
    except requests.exceptions.RequestException:
        return None


def _fetch_playwright(url: str, timeout: int = 30000) -> str | None:
    """Level 2: Playwright headless Chrome. Returns HTML or None."""
    try:
        from playwright.sync_api import sync_playwright
    except ImportError:
        return None

    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page(user_agent=_BROWSER_UA)
            page.goto(url, wait_until="networkidle", timeout=timeout)
            # Wait for WeChat content to render
            page.wait_for_selector("#js_content", timeout=10000)
            html = page.content()
            browser.close()
            return html
    except Exception:
        return None


def fetch_html(url: str) -> str:
    """Fetch article HTML with automatic fallback.

    Returns HTML string. Exits with error if all levels fail.
    """
    # Level 1
    html = _fetch_requests(url)
    if html and _has_content(html):
        return html

    # Level 2
    # Message: "requests did not return the article body, trying Playwright..."
    print("requests 未获取到正文,尝试 Playwright...", file=sys.stderr)
    html = _fetch_playwright(url)
    if html and _has_content(html):
        return html

    # Level 3
    # Message: "Could not fetch the article. Open it in a browser, save the
    # page as HTML, then pass the file via --file."
    print(
        "Error: 无法获取文章内容。请在浏览器中打开文章 → 右键另存为 HTML → 使用 --file 参数传入。",
        file=sys.stderr,
    )
    sys.exit(1)


def _has_content(html: str) -> bool:
    """Check if HTML contains non-empty #js_content."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="js_content")
    if content is None:
        return False
    text = content.get_text(strip=True)
    return len(text) > 50  # must have real content, not just whitespace

# ---------------------------------------------------------------------------
# HTML → Markdown conversion
# ---------------------------------------------------------------------------

def _extract_metadata(soup: BeautifulSoup) -> dict:
    """Extract article metadata from WeChat page."""
    title_tag = soup.find("h1", class_="rich_media_title") or soup.find(
        "h1", id="activity-name"
    )
    title = title_tag.get_text(strip=True) if title_tag else ""

    author_tag = soup.find("a", id="js_name") or soup.find(
        "span", class_="rich_media_meta_nickname"
    )
    author = author_tag.get_text(strip=True) if author_tag else ""

    # Publish time
    pub_tag = soup.find("em", id="publish_time")
    pub_time = pub_tag.get_text(strip=True) if pub_tag else ""

    return {"title": title, "author": author, "publish_time": pub_time}

def _elem_to_md(elem, depth: int = 0) -> str:
    """Convert a single HTML element to Markdown."""
    tag = elem.name if hasattr(elem, "name") else None

    if isinstance(elem, NavigableString):
        text = str(elem).strip()
        return text if text else ""

    if tag is None:
        return ""

    # Skip hidden elements
    style = elem.get("style", "").replace(" ", "").lower()
    if "display:none" in style or "visibility:hidden" in style:
        return ""

    # Void elements never have children, so handle them before the
    # empty-content check below; otherwise they would always be dropped.
    if tag == "br":
        return "\n"
    if tag == "hr":
        return "\n\n---\n\n"
    if tag == "img":
        src = elem.get("data-src") or elem.get("src") or ""
        alt = elem.get("alt", "")
        return f"\n\n![{alt}]({src})\n\n" if src else ""

    # Get inner content recursively
    inner = "".join(_elem_to_md(child, depth + 1) for child in elem.children)
    inner = inner.strip()
    if not inner:
        return ""

    # Headings
    if tag in ("h1", "h2", "h3", "h4"):
        level = int(tag[1])
        return f"\n\n{'#' * level} {inner}\n\n"

    # Paragraphs
    if tag == "p":
        return f"\n\n{inner}\n\n"

    # Bold
    if tag in ("strong", "b"):
        return f"**{inner}**"

    # Italic
    if tag in ("em", "i"):
        return f"*{inner}*"

    # Links
    if tag == "a":
        href = elem.get("href", "")
        if href and not href.startswith("javascript:"):
            return f"[{inner}]({href})"
        return inner

    # Blockquotes
    if tag == "blockquote":
        lines = inner.split("\n")
        quoted = "\n".join(f"> {line}" for line in lines if line.strip())
        return f"\n\n{quoted}\n\n"

    # Lists
    if tag in ("ul", "ol"):
        return f"\n\n{inner}\n\n"
    if tag == "li":
        parent = elem.parent
        if parent and parent.name == "ol":
            # Ordered list: every item is emitted as "1." and Markdown
            # renderers renumber; position tracking is imperfect but functional
            return f"1. {inner}\n"
        return f"- {inner}\n"

    # Code
    if tag == "code":
        if elem.parent and elem.parent.name == "pre":
            return inner
        return f"`{inner}`"
    if tag == "pre":
        return f"\n\n```\n{inner}\n```\n\n"

    # Section / div / span and other containers: pass through
    if tag in ("section", "div", "span", "article", "main", "figure",
               "figcaption", "table", "thead", "tbody", "tr"):
        return inner

    # Table cells
    if tag in ("td", "th"):
        return f" {inner} |"

    return inner


def html_to_markdown(soup: BeautifulSoup) -> str:
    """Convert WeChat article HTML to clean Markdown."""
    content = soup.find(id="js_content")
    if content is None:
        return ""

    raw = _elem_to_md(content)

    # Clean up excessive whitespace
    md = re.sub(r"\n{3,}", "\n\n", raw)
    md = md.strip()
    return md

# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------

def fetch_article(url: str = None, file_path: str = None) -> dict:
    """Fetch and parse a WeChat article.

    Args:
        url: WeChat article URL.
        file_path: Path to saved HTML file (alternative to URL).

    Returns:
        dict with keys: title, author, publish_time, markdown, url
    """
    if file_path:
        html = Path(file_path).read_text(encoding="utf-8")
    elif url:
        html = fetch_html(url)
    else:
        raise ValueError("Either url or file_path must be provided")

    soup = BeautifulSoup(html, "html.parser")
    meta = _extract_metadata(soup)
    md = html_to_markdown(soup)

    return {
        "title": meta["title"],
        "author": meta["author"],
        "publish_time": meta["publish_time"],
        "markdown": md,
        "url": url or "",
    }

# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main():
    ap = argparse.ArgumentParser(
        description="Extract WeChat article content as Markdown."
    )
    ap.add_argument("url", nargs="?", help="WeChat article URL")
    ap.add_argument("--file", dest="file_path",
                    help="Local HTML file instead of URL")
    ap.add_argument("-o", "--output", help="Save Markdown to file")
    ap.add_argument("--json", dest="as_json", action="store_true",
                    help="Output as JSON (for agent use)")
    args = ap.parse_args()

    if not args.url and not args.file_path:
        ap.error("Provide a URL or --file path")

    result = fetch_article(url=args.url, file_path=args.file_path)

    if args.as_json:
        print(json.dumps(result, ensure_ascii=False, indent=2))
    elif args.output:
        # Write Markdown with YAML frontmatter. json.dumps produces
        # double-quoted, escaped strings, which are also valid YAML scalars,
        # so titles containing quotes or backslashes do not break the header.
        def quote(value: str) -> str:
            return json.dumps(value, ensure_ascii=False)

        out = Path(args.output)
        frontmatter = (
            f"---\ntitle: {quote(result['title'])}\n"
            f"author: {quote(result['author'])}\n"
        )
        if result["publish_time"]:
            frontmatter += f"date: {quote(result['publish_time'])}\n"
        if result["url"]:
            frontmatter += f"source: {quote(result['url'])}\n"
        frontmatter += "---\n\n"
        out.write_text(frontmatter + result["markdown"], encoding="utf-8")
        print(f"Saved: {out}")
    else:
        if result["title"]:
            print(f"# {result['title']}\n")
        if result["author"]:
            print(f"> {result['author']}")
        if result["publish_time"]:
            print(f"> {result['publish_time']}")
        if result["author"] or result["publish_time"]:
            print()
        print(result["markdown"])


if __name__ == "__main__":
    main()
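
Review note: the three-level strategy in `fetch_html()` treats a level as failed both when it returns nothing and when the returned page lacks a real article body. The sketch below shows that control flow with stub fetchers, using only the standard library. `fetch_with_fallback`, the stub functions, and the string-based `looks_like_article` check are hypothetical stand-ins; the actual file uses `requests`, Playwright, and a BeautifulSoup lookup of `#js_content`.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str,
                        levels: list[tuple[str, Fetcher]],
                        is_valid: Callable[[str], bool]) -> tuple[str, str]:
    """Return (level_name, html) from the first level whose HTML passes is_valid."""
    for name, fetch in levels:
        html = fetch(url)  # a level signals failure by returning None
        if html and is_valid(html):
            return name, html
    # Last resort in the real script: ask the user to save the page by hand.
    raise RuntimeError("all fetch levels failed; save the page and pass it via --file")


def looks_like_article(html: str) -> bool:
    # Simplified stand-in for the BeautifulSoup-based _has_content() check.
    return 'id="js_content"' in html and len(html) > 50


def blocked_page(url: str) -> Optional[str]:
    # Stub: simulates a verification page that has no article body.
    return "<html><body>verify</body></html>"


def rendered_page(url: str) -> Optional[str]:
    # Stub: simulates a fully rendered article.
    return '<div id="js_content">' + "body text " * 10 + "</div>"


level, html = fetch_with_fallback(
    "https://example.invalid/article",
    [("requests", blocked_page), ("playwright", rendered_page)],
    looks_like_article,
)
print(level)
```

Validating the content at every level, instead of trusting an HTTP 200, is what lets a verification page from the first level fall through to the browser-based level.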

scripts/learn_theme.py

@ -12,7 +12,6 @@ import sys
from collections import Counter from collections import Counter
from pathlib import Path from pathlib import Path
import requests
import yaml import yaml
from bs4 import BeautifulSoup from bs4 import BeautifulSoup
@ -154,12 +153,6 @@ _TARGET_TAGS = {
"blockquote", "code", "pre", "img", "a", "blockquote", "code", "pre", "img", "a",
} }
_BROWSER_UA = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/124.0.0.0 Safari/537.36"
)
TEMPLATE_THEME = "professional-clean" TEMPLATE_THEME = "professional-clean"
THEMES_DIR = Path(__file__).resolve().parent.parent / "toolkit" / "themes" THEMES_DIR = Path(__file__).resolve().parent.parent / "toolkit" / "themes"
@ -175,26 +168,20 @@ def _attach_title(soup, content) -> None:
def fetch_article(url: str, timeout: int = 20) -> "BeautifulSoup tag": def fetch_article(url: str, timeout: int = 20) -> "BeautifulSoup tag":
"""Fetch a WeChat article, return the ``#js_content`` element. """Fetch a WeChat article, return the ``#js_content`` element.
The article title is attached as ``content._wewrite_title`` (empty string Delegates to fetch_article.fetch_html() for three-level fetching
if not found). Exits with code 1 on network errors or missing content. (requests → Playwright → manual fallback).
Parameters The article title is attached as ``content._wewrite_title`` (empty string
---------- if not found).
url: WeChat article URL (mp.weixin.qq.com/)
timeout: HTTP request timeout in seconds (default 20).
""" """
try: from scripts.fetch_article import fetch_html
resp = requests.get(url, headers={"User-Agent": _BROWSER_UA}, timeout=timeout)
resp.raise_for_status() html = fetch_html(url)
except requests.exceptions.RequestException as exc: soup = BeautifulSoup(html, "html.parser")
print(f"Error: failed to fetch URL: {exc}", file=sys.stderr)
sys.exit(1)
resp.encoding = "utf-8"
soup = BeautifulSoup(resp.text, "html.parser")
content = soup.find(id="js_content") content = soup.find(id="js_content")
if content is None: if content is None:
print("Error: #js_content not found — the page may require verification.", file=sys.stderr) print("Error: #js_content not found.", file=sys.stderr)
sys.exit(1) sys.exit(1)
_attach_title(soup, content) _attach_title(soup, content)