feat: add article content extraction with anti-scraping fallback

- New `scripts/fetch_article.py`: extract WeChat article content as Markdown with a three-level fetch strategy (requests → Playwright → manual HTML)
- Refactor `learn_theme.py` to reuse `fetch_article.fetch_html()`, removing duplicate fetch logic
- Update SKILL.md: add the "学习这篇文章/导入范文" ("learn this article" / "import exemplar") auxiliary function
- Update README.md: add article extraction to the feature table and directory tree

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 530e65c41c
commit 25d6a44082
10 changed files with 1407 additions and 222 deletions
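
Reviewer note: the three-level strategy named in the commit message amounts to an ordered fallback chain. The sketch below shows that pattern in isolation. The names `fetch_with_fallback`, `fetchers`, and `is_usable` are hypothetical and are not taken from the commit; the real implementation is `fetch_html()` in `scripts/fetch_article.py`.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fetcher takes a URL and returns HTML, or None when it could not get any.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str,
    fetchers: Sequence[Tuple[str, Fetcher]],
    is_usable: Callable[[str], bool],
) -> Tuple[str, str]:
    """Try each fetcher in order; return (level_name, html) for the first usable result."""
    for name, fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            # Treat any failure at this level as "no result" and move on.
            html = None
        if html and is_usable(html):
            return name, html
    # Nothing worked automatically: the last level is a manual step for the user.
    raise RuntimeError("all automatic levels failed; save the page and pass it via --file")
```

Keeping the usability check separate from the fetchers matters here because an anti-scraping page can return HTTP 200 with no article body, so a successful request alone is not enough to stop the chain.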
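
@@ -33,6 +33,7 @@
 | Exemplar style library | SICO-style few-shot: extracts a style fingerprint from your articles and injects it at writing time | `scripts/extract_exemplar.py` |
 | Style flywheel | Learns from your edits, so the output sounds more like you the more you use it | `references/learn-edits.md` |
 | Layout learning | Extracts a layout theme from any WeChat Official Account article URL | `scripts/learn_theme.py` |
+| Article capture | Extracts the body of an Official Account article URL as Markdown, ready to import into the exemplar library | `scripts/fetch_article.py` |
 
 ## Writing personas
 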
@@ -183,6 +184,7 @@ wewrite/
 │ ├── humanness_score.py # Article quality scoring (11 checks; used by the self-check and Step 5)
 │ ├── extract_exemplar.py # Exemplar style extraction (SICO-style few-shot library building)
 │ ├── learn_theme.py # Extract a layout theme from an Official Account article URL
+│ ├── fetch_article.py # Extract an Official Account article's body as Markdown
 │ ├── diagnose.py # Configuration completeness check
 │ └── build_openclaw.py # SKILL.md → OpenClaw format conversion
 │
@@ -192,7 +194,7 @@ wewrite/
 │ ├── theme.py # YAML theme engine
 │ ├── publisher.py # WeChat draft box API + 小绿书 image posts
 │ ├── wechat_api.py # access_token / image upload
-│ ├── image_gen.py # AI image generation (doubao / OpenAI / Gemini)
+│ ├── image_gen.py # AI image generation (9 providers, automatic fallback)
 │ └── themes/ # 16+ layout themes (including dark mode; new ones can be learned from articles)
 │
 ├── personas/ # 5 writing persona presets (with measured Zhuque (朱雀) results)
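
SKILL.md (1 line changed)
@@ -49,6 +49,7 @@ allowed-tools:
 - **Local edits** (default): the user edits the markdown files in `output/`
 - **WeChat draft box sync**: `python3 {skill_dir}/scripts/learn_edits.py --from-wechat` automatically pulls the latest content back from the draft box and runs a plain-text diff against the local original
 - User says "学习排版"/"学排版" ("learn the layout") → `python3 {skill_dir}/scripts/learn_theme.py <url> --name <name>`; the user must supply an Official Account article URL and a theme name. After extraction, prompt the user to set the `theme` field in `style.yaml`.
+- User says "学习这篇文章"/"导入范文" ("learn this article" / "import exemplar") + URL → `python3 {skill_dir}/scripts/fetch_article.py <url> -o /tmp/article.md && python3 {skill_dir}/scripts/extract_exemplar.py /tmp/article.md -s <account name>`; extracts the body from the Official Account article URL and imports it into the exemplar library. Supports a three-level fallback (requests → Playwright → manual HTML).
 - User says "看看文章数据" ("show me the article stats") → `Read: {skill_dir}/references/effect-review.md`
 - User says "检查一下"/"自检"/"这篇文章怎么样" ("check it" / "self-check" / "how is this article") → run a self-check on the most recently generated article (or one the user specifies) and output a report:
 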
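
dist/openclaw/SKILL.md (vendored, 9 lines changed)
@@ -40,6 +40,7 @@ description: |
 - **Local edits** (default): the user edits the markdown files in `output/`
 - **WeChat draft box sync**: `python3 {baseDir}/scripts/learn_edits.py --from-wechat` automatically pulls the latest content back from the draft box and runs a plain-text diff against the local original
 - User says "学习排版"/"学排版" ("learn the layout") → `python3 {baseDir}/scripts/learn_theme.py <url> --name <name>`; the user must supply an Official Account article URL and a theme name. After extraction, prompt the user to set the `theme` field in `style.yaml`.
+- User says "学习这篇文章"/"导入范文" ("learn this article" / "import exemplar") + URL → `python3 {baseDir}/scripts/fetch_article.py <url> -o /tmp/article.md && python3 {baseDir}/scripts/extract_exemplar.py /tmp/article.md -s <account name>`; extracts the body from the Official Account article URL and imports it into the exemplar library. Supports a three-level fallback (requests → Playwright → manual HTML).
 - User says "看看文章数据" ("show me the article stats") → `Read: {baseDir}/references/effect-review.md`
 - User says "检查一下"/"自检"/"这篇文章怎么样" ("check it" / "self-check" / "how is this article") → run a self-check on the most recently generated article (or one the user specifies) and output a report:
 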
@@ -98,7 +99,7 @@ python3 -c "import markdown, bs4, cssutils, requests, yaml, pygments, PIL" 2>&1
 | `config.yaml` exists | Silent | Guide the user to create it, or set `skip_publish = true` |
 | Python dependencies | Silent | Offer `pip install -r requirements.txt` |
 | `wechat.appid` + `secret` | Silent | Set `skip_publish = true` |
-| `image.api_key` | Silent | Set `skip_image_gen = true` |
+| At least one of `image.api_key` or `image.providers` is valid | Silent | Set `skip_image_gen = true` |
 | `references/exemplars/index.yaml` | Silent | Prompt: "The exemplar library is empty. If you have published articles (markdown), say **'导入范文'** ('import exemplar') to build a style library so generated articles sound more like you. Everything still works without it." |
 
 **1.2 Version check** (passes silently or shows a reminder):
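@@ -377,9 +378,11 @@ python3 {baseDir}/scripts/humanness_score.py {article_path} --json --tier3 {agen
 - **Interactive mode**: show the cover and ask the user "How does the cover look?". If the user is happy, continue; if not, adjust the prompt and regenerate.
 - **Fully automatic mode**: the agent self-checks whether the entities in the prompt are recognizable in the image description. If the prompt is too generic (only abstract words such as "科技感" (tech feel) or "未来感" (futuristic), with no concrete entities), retry once with a different set of prompts.
 
-**6.4 In-article images**: analyze the article structure and generate prompts for 3-6 in-article images (per visual-prompts.md). Style, palette, and drawing style follow the cover to keep the visuals consistent. Call image_gen.py in batch and replace the Markdown placeholders.
+**6.3b Style anchoring**: once the cover is confirmed, extract the visual anchors (palette hex values, style keywords, overall tone); every subsequent in-article image prompt must reference these anchors so the visuals stay consistent across the whole article.
 
-**Fallback**: image generation fails → output the prompts + backup stock-image keywords, then continue.
+**6.4 In-article images**: analyze the article structure, pick an image type (infographic/scene/flowchart/comparison/framework/timeline) for each paragraph that needs an image, and use the matching structured prompt template to generate prompts for 3-6 images (per visual-prompts.md). Call image_gen.py in batch and replace the Markdown placeholders.
+
+**Fallback**: image_gen.py supports automatic multi-provider fallback (providers are tried in the order listed under `providers` in config.yaml). If all of them fail → output the prompts + backup stock-image keywords, then continue.
 
 ---
 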

dist/openclaw/config.example.yaml (vendored, 68 lines changed)
@@ -8,27 +8,63 @@ wechat:
   author: "" # default byline (optional)
 
 # AI image generation
+# Supports 9 providers; configure one to get started, or several for automatic fallback.
+#
+# ┌─────────────────┬────────────────────────────────────────────────┬──────────────────────────────────┐
+# │ Provider        │ Get an API key                                 │ Notes                            │
+# ├─────────────────┼────────────────────────────────────────────────┼──────────────────────────────────┤
+# │ doubao          │ https://console.volcengine.com/ark             │ Best for Chinese prompts         │
+# │ dashscope       │ https://dashscope.console.aliyun.com/          │ Alibaba Tongyi Wanxiang          │
+# │ jimeng          │ https://console.volcengine.com/iam             │ ByteDance Jimeng, strong Chinese │
+# │ minimax         │ https://platform.minimaxi.com/                 │ Domestic (China) provider        │
+# │ openai          │ https://platform.openai.com/api-keys           │ DALL-E, general purpose          │
+# │ azure_openai    │ Azure Portal                                   │ OpenAI reachable from China      │
+# │ gemini          │ https://aistudio.google.com/apikey             │ Generous free tier               │
+# │ openrouter      │ https://openrouter.ai/settings/keys            │ Multi-model proxy                │
+# │ replicate       │ https://replicate.com/account/api-tokens       │ Many open-source models          │
+# └─────────────────┴────────────────────────────────────────────────┴──────────────────────────────────┘
+#
+# Two configuration styles are supported:
+
+# Style 1: single provider (simple; fill in just one)
 image:
-  # Available providers: doubao | openai | gemini
-  provider: "doubao"
+  provider: "doubao" # see the Provider column in the table above
   api_key: "your_api_key"
+  # model: "doubao-seedream-5-0-260128" # optional; each provider has a default
+  # base_url: "https://ark.cn-beijing.volces.com/api/v3" # optional
 
-  # doubao-seedream (default)
-  # Get an API key: https://console.volcengine.com/ark
-  # model: "doubao-seedream-5-0-260128"
-  # base_url: "https://ark.cn-beijing.volces.com/api/v3"
-
-  # OpenAI DALL-E 3
-  # provider: "openai"
+# Style 2: multi-provider automatic fallback (recommended)
+# Providers are tried in order; when one fails the next is used automatically. You do not need to fill in all of them.
+# image:
+#   providers:
+#   - provider: doubao
+#     api_key: "your_volcengine_key"
+#   - provider: dashscope
+#     api_key: "your_dashscope_key"
+#     # model: "qwen-image-2.0-pro"
+#   - provider: jimeng
+#     api_key: "your_access_key_id"   # Jimeng needs access_key_id + secret_key
+#     secret_key: "your_secret_access_key"
+#     # model: "jimeng_t2i_v40"
+#   - provider: minimax
+#     api_key: "your_minimax_key"
+#     # model: "image-01"
+#   - provider: openai
   # api_key: "sk-..."
-  # model: "dall-e-3"
-  # base_url: "https://api.openai.com/v1"
-
-  # Google Gemini Imagen
-  # provider: "gemini"
+#     # model: "dall-e-3"
+#   - provider: azure_openai
+#     api_key: "your_azure_key"
+#     base_url: "https://YOUR-RESOURCE.openai.azure.com/openai"  # required
+#     # deployment: "dall-e-3"
+#   - provider: gemini
   # api_key: "AIza..."
-  # Get an API key: https://aistudio.google.com/apikey
-  # model: "gemini-3.1-flash-image-preview"
+#     # model: "gemini-3.1-flash-image-preview"
+#   - provider: openrouter
+#     api_key: "sk-or-..."
+#     # model: "google/gemini-3.1-flash-image-preview"
+#   - provider: replicate
+#     api_key: "r8_..."
+#     # model: "google/nano-banana-pro"
 
 # Default layout theme
 theme: "professional-clean"
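
Reviewer note: the config above describes two styles (a single `provider` + `api_key`, or an ordered `providers` list with automatic fallback). The sketch below shows one way both styles could be normalized into a single ordered chain and tried in sequence. It is an illustration only: `provider_chain`, `generate_with_fallback`, and the `generate` callback are hypothetical names, and the actual logic in `toolkit/image_gen.py` is not fully shown in this diff and may differ.

```python
from typing import Any, Callable, Dict, List, Tuple

ProviderEntry = Dict[str, Any]


def provider_chain(image_cfg: ProviderEntry) -> List[ProviderEntry]:
    """Normalize both config styles into one ordered list of provider entries."""
    providers = image_cfg.get("providers")
    if providers:
        # Style 2: keep the listed order, skipping entries without a key.
        return [dict(p) for p in providers if p.get("api_key")]
    if image_cfg.get("api_key"):
        # Style 1: a single provider becomes a one-element chain.
        return [dict(image_cfg)]
    return []


def generate_with_fallback(
    chain: List[ProviderEntry],
    generate: Callable[[ProviderEntry], bytes],
) -> Tuple[str, bytes]:
    """Call `generate` for each entry in order and return the first success."""
    errors = []
    for entry in chain:
        name = entry.get("provider", "unknown")
        try:
            return name, generate(entry)
        except Exception as exc:
            # Remember why this provider failed, then try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Collecting the per-provider errors keeps the final failure message useful when every provider in the list is misconfigured.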

dist/openclaw/references/visual-prompts.md (vendored, 159 lines changed)
@@ -73,6 +73,24 @@
 
 ---
 
+## Style anchoring
+
+Once the cover is confirmed, **extract the visual anchors immediately**; every subsequent in-article image must reuse them:
+
+```
+Visual anchors:
+- Palette: {the cover's primary hex + secondary hex, e.g. #2563EB + #F97316}
+- Style keywords: {the cover's style description, e.g. "flat illustration, minimalist, bold outlines"}
+- Tone: {cool / warm / neutral}
+```
+
+**Rules**:
+- Every in-article image prompt must end with the palette and style keywords from the visual anchors
+- If the cover is warm-toned, in-article images must not suddenly switch to a cool-toned tech style (and vice versa)
+- The visual anchors stay the same across every image in the article
+
+---
+
 ## 2. In-article images (3-6)
 
 ### Analysis workflow
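@@ -94,7 +112,20 @@
 | Turning point / climax → visual impact | Immediately after another image (less than 300 characters apart) |
 | After a long stretch (>400 characters without an image) → pacing | The closing CTA paragraph |
 
-**Step 3: Decide the position**
+**Step 3: Decide the image type**
+
+Based on the paragraph's content, choose the best-matching type for each image:
+
+| Type | Suited to | Core composition |
+|------|---------|---------|
+| infographic | Data, statistics, metric comparisons | Zoned blocks + labels |
+| scene | Narrative scenes, mood, personal stories | Focal subject + atmospheric lighting |
+| flowchart | Processes, steps, workflows | Step nodes + connecting arrows |
+| comparison | Comparing two options or viewpoints | Left/right columns + divider |
+| framework | Conceptual models, architecture relationships | Hierarchical nodes + relationship lines |
+| timeline | Timelines, development history | Time axis + milestone markers |
+
+**Step 4: Decide the position**
 - Insert the image **after** the corresponding paragraph (not before)
 - Be specific, e.g. "after paragraph N under H2 XX"
 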
@@ -104,24 +135,132 @@
 - Do not place an image before the article's first paragraph
 - Do not place an image in the closing CTA paragraph
 
-### Prompt format
+### Structured prompt templates
 
-Output for each image:
+Generate prompts with the structured template that matches the image type. **Free-text descriptions are not allowed**: every field of the template must be filled in.
 
+#### infographic
+
 ```
 ### Image {index}: after paragraph {N} of "{H2 title}"
-- Purpose: {reinforce information / recreate a scene / adjust pacing}
-- Corresponding content: {what this paragraph says, in one sentence}
-- Image description: {concrete image content, 80-120 characters}
-- AI image prompt:
-"{Chinese prompt, for doubao-seedream}"
+- Type: infographic
+- Corresponding content: {one-sentence summary}
+
+Layout: {grid / radial / hierarchical}
+Zones:
+- Zone 1: {specific data point, using real numbers from the article}
+- Zone 2: {comparison/trend, using real numbers from the article}
+- Zone 3: {conclusion/key point}
+Labels: {real numbers, terms, and metric names from the article}
+Colors: {visual-anchor palette}
+Style: {visual-anchor style keywords}, clean infographic, no text
+Aspect: 16:9
+
 - Fallback: {Unsplash/Pexels search keywords}
 ```
 
-### Special requirements for in-article images
+#### scene
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: scene
+- Corresponding content: {one-sentence summary}
+
+Focal Point: {main subject; must be an entity from the article}
+Atmosphere: {lighting, environment, time}
+Mood: {emotional tone}
+Color Temperature: {warm / cool / neutral, consistent with the visual anchors}
+Style: {visual-anchor style keywords}, no text no letters
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+#### flowchart
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: flowchart
+- Corresponding content: {one-sentence summary}
+
+Layout: {left-right / top-down / circular}
+Steps:
+1. {step name} — {short description}
+2. {step name} — {short description}
+3. {step name} — {short description}
+Connections: {arrow direction, decision branches}
+Colors: {visual-anchor palette}
+Style: {visual-anchor style keywords}, clean diagram, no text
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+#### comparison
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: comparison
+- Corresponding content: {one-sentence summary}
+
+Left Side — {name of option A}:
+- {point 1}
+- {point 2}
+Right Side — {name of option B}:
+- {point 1}
+- {point 2}
+Divider: {divider style}
+Colors: {visual-anchor palette, one primary color per side}
+Style: {visual-anchor style keywords}, split layout, no text
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+#### framework (architecture diagram)
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: framework
+- Corresponding content: {one-sentence summary}
+
+Structure: {hierarchical / network / matrix}
+Nodes:
+- {concept 1} — {role}
+- {concept 2} — {role}
+- {concept 3} — {role}
+Relationships: {how the nodes connect}
+Colors: {visual-anchor palette}
+Style: {visual-anchor style keywords}, clean diagram, no text
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+#### timeline
+
+```
+### Image {index}: after paragraph {N} of "{H2 title}"
+- Type: timeline
+- Corresponding content: {one-sentence summary}
+
+Direction: {horizontal / vertical}
+Events:
+- {time point 1}: {milestone}
+- {time point 2}: {milestone}
+- {time point 3}: {milestone}
+Markers: {visual marker style}
+Colors: {visual-anchor palette}
+Style: {visual-anchor style keywords}, clean timeline, no text
+Aspect: 16:9
+
+- Fallback: {Unsplash/Pexels search keywords}
+```
+
+### General requirements for in-article images
 
 - Use a uniform **16:9 landscape** size (image_gen.py --size article)
-- **Style consistency**: follow the palette, drawing style, and visual language set by the cover. Explicitly reuse the cover's style description in every prompt (e.g. "flat illustration, blue-orange palette, minimalist")
+- **Visual anchoring**: the Colors and Style fields of every prompt must reference the visual anchors extracted from the cover
 - Entity anchoring follows the same rule as the cover: every prompt includes at least 2 entities from the article
 - Keep it simple: on a phone screen, a simple image works better than a complex one
 - Write prompts in Chinese (seedream understands Chinese well)
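
Reviewer note: the templates above are rigid enough to be filled mechanically. The sketch below renders the infographic template's field block from a set of visual anchors, so that Colors and Style always come from the cover. `VisualAnchor` and `infographic_prompt` are hypothetical names invented for this illustration; the skill itself describes the agent filling these templates by hand, and nothing like this helper appears in the commit.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class VisualAnchor:
    """Visual anchors extracted from the confirmed cover (hypothetical container)."""
    palette: str  # e.g. "#2563EB + #F97316"
    style: str    # e.g. "flat illustration, minimalist, bold outlines"
    tone: str     # "warm", "cool", or "neutral"


def infographic_prompt(
    layout: str,
    zones: Sequence[str],
    labels: Sequence[str],
    anchor: VisualAnchor,
) -> str:
    """Render the field block of the infographic template with anchored Colors/Style."""
    lines = [f"Layout: {layout}", "Zones:"]
    lines += [f"- Zone {i}: {zone}" for i, zone in enumerate(zones, start=1)]
    lines += [
        f"Labels: {', '.join(labels)}",
        f"Colors: {anchor.palette}",
        f"Style: {anchor.style}, clean infographic, no text",
        "Aspect: 16:9",
    ]
    return "\n".join(lines)
```

Because the anchor is passed in rather than retyped per image, a prompt cannot drift away from the cover's palette or style keywords.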

dist/openclaw/scripts/fetch_article.py (vendored, normal file, 323 lines changed)
@@ -0,0 +1,323 @@
+#!/usr/bin/env python3
+"""fetch_article.py — extract WeChat article content as Markdown.
+
+Three-level fetching strategy:
+  Level 1: requests (fast, zero overhead, works for most articles)
+  Level 2: Playwright headless Chrome (bypasses anti-scraping JS checks)
+  Level 3: Prompt user to save HTML manually and pass via --file
+
+Usage:
+    python3 scripts/fetch_article.py <url>                # auto fetch
+    python3 scripts/fetch_article.py <url> -o article.md  # save to file
+    python3 scripts/fetch_article.py --file saved.html    # from local HTML
+    python3 scripts/fetch_article.py <url> --json         # JSON output for agent
+"""
+
+import argparse
+import json
+import re
+import sys
+from pathlib import Path
+
+import requests
+from bs4 import BeautifulSoup, NavigableString
+
+_BROWSER_UA = (
+    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
+    "AppleWebKit/537.36 (KHTML, like Gecko) "
+    "Chrome/124.0.0.0 Safari/537.36"
+)
+
+
+# ---------------------------------------------------------------------------
+# Fetching: three-level strategy
+# ---------------------------------------------------------------------------
+
+def _fetch_requests(url: str, timeout: int = 20) -> str | None:
+    """Level 1: plain requests. Returns HTML string or None on failure."""
+    try:
+        resp = requests.get(url, headers={"User-Agent": _BROWSER_UA}, timeout=timeout)
+        resp.raise_for_status()
+        resp.encoding = "utf-8"
+        return resp.text
+    except requests.exceptions.RequestException:
+        return None
+
+
+def _fetch_playwright(url: str, timeout: int = 30000) -> str | None:
+    """Level 2: Playwright headless Chrome. Returns HTML or None."""
+    try:
+        from playwright.sync_api import sync_playwright
+    except ImportError:
+        return None
+
+    try:
+        with sync_playwright() as p:
+            browser = p.chromium.launch(headless=True)
+            page = browser.new_page(user_agent=_BROWSER_UA)
+            page.goto(url, wait_until="networkidle", timeout=timeout)
+            # Wait for WeChat content to render
+            page.wait_for_selector("#js_content", timeout=10000)
+            html = page.content()
+            browser.close()
+            return html
+    except Exception:
+        return None
+
+
+def fetch_html(url: str) -> str:
+    """Fetch article HTML with automatic fallback.
+
+    Returns HTML string. Exits with error if all levels fail.
+    """
+    # Level 1
+    html = _fetch_requests(url)
+    if html and _has_content(html):
+        return html
+
+    # Level 2
+    print("requests 未获取到正文,尝试 Playwright...", file=sys.stderr)
+    html = _fetch_playwright(url)
+    if html and _has_content(html):
+        return html
+
+    # Level 3
+    print(
+        "Error: 无法获取文章内容。请在浏览器中打开文章 → 右键另存为 HTML → 使用 --file 参数传入。",
+        file=sys.stderr,
+    )
+    sys.exit(1)
+
+
+def _has_content(html: str) -> bool:
+    """Check if HTML contains non-empty #js_content."""
+    soup = BeautifulSoup(html, "html.parser")
+    content = soup.find(id="js_content")
+    if content is None:
+        return False
+    text = content.get_text(strip=True)
+    return len(text) > 50  # must have real content, not just whitespace
+
+
+# ---------------------------------------------------------------------------
+# HTML → Markdown conversion
+# ---------------------------------------------------------------------------
+
+def _extract_metadata(soup: BeautifulSoup) -> dict:
+    """Extract article metadata from WeChat page."""
+    title_tag = soup.find("h1", class_="rich_media_title") or soup.find(
+        "h1", id="activity-name"
+    )
+    title = title_tag.get_text(strip=True) if title_tag else ""
+
+    author_tag = soup.find("a", id="js_name") or soup.find(
+        "span", class_="rich_media_meta_nickname"
+    )
+    author = author_tag.get_text(strip=True) if author_tag else ""
+
+    # Publish time
+    pub_tag = soup.find("em", id="publish_time")
+    pub_time = pub_tag.get_text(strip=True) if pub_tag else ""
+
+    return {"title": title, "author": author, "publish_time": pub_time}
+
+
+def _elem_to_md(elem, depth: int = 0) -> str:
+    """Convert a single HTML element to Markdown."""
+    tag = elem.name if hasattr(elem, "name") else None
+
+    if isinstance(elem, NavigableString):
+        text = str(elem).strip()
+        return text if text else ""
+
+    if tag is None:
+        return ""
+
+    # Skip hidden/empty elements
+    style = elem.get("style", "")
+    if "display:none" in style.replace(" ", "").lower():
+        return ""
+    if "visibility:hidden" in style.replace(" ", "").lower():
+        return ""
+
+    # Get inner content recursively
+    inner = ""
+    for child in elem.children:
+        inner += _elem_to_md(child, depth + 1)
+
+    inner = inner.strip()
+    if not inner:
+        return ""
+
+    # Headings
+    if tag in ("h1", "h2", "h3", "h4"):
+        level = int(tag[1])
+        return f"\n\n{'#' * level} {inner}\n\n"
+
+    # Paragraphs
+    if tag == "p":
+        return f"\n\n{inner}\n\n"
+
+    # Line breaks
+    if tag == "br":
+        return "\n"
+
+    # Bold
+    if tag in ("strong", "b"):
+        return f"**{inner}**"
+
+    # Italic
+    if tag in ("em", "i"):
+        return f"*{inner}*"
+
+    # Links
+    if tag == "a":
+        href = elem.get("href", "")
+        if href and not href.startswith("javascript:"):
+            return f"[{inner}]({href})"
+        return inner
+
+    # Images
+    if tag == "img":
+        src = elem.get("data-src") or elem.get("src") or ""
+        alt = elem.get("alt", "")
+        if src:
+            return f"\n\n![{alt}]({src})\n\n"
+        return ""
+
+    # Blockquotes
+    if tag == "blockquote":
+        lines = inner.split("\n")
+        quoted = "\n".join(f"> {line}" for line in lines if line.strip())
+        return f"\n\n{quoted}\n\n"
+
+    # Lists
+    if tag in ("ul", "ol"):
+        return f"\n\n{inner}\n\n"
+    if tag == "li":
+        parent = elem.parent
+        if parent and parent.name == "ol":
+            # Ordered list — position tracking is imperfect but functional
+            return f"1. {inner}\n"
+        return f"- {inner}\n"
+
+    # Code
+    if tag == "code":
+        if elem.parent and elem.parent.name == "pre":
+            return inner
+        return f"`{inner}`"
+    if tag == "pre":
+        return f"\n\n```\n{inner}\n```\n\n"
+
+    # Horizontal rule
+    if tag == "hr":
+        return "\n\n---\n\n"
+
+    # Section / div / span — pass through
+    if tag in ("section", "div", "span", "article", "main", "figure",
+               "figcaption", "table", "thead", "tbody", "tr"):
+        return inner
+
+    # Table cells
+    if tag in ("td", "th"):
+        return f" {inner} |"
+
+    return inner
+
+
+def html_to_markdown(soup: BeautifulSoup) -> str:
+    """Convert WeChat article HTML to clean Markdown."""
+    content = soup.find(id="js_content")
+    if content is None:
+        return ""
+
+    raw = _elem_to_md(content)
+
+    # Clean up excessive whitespace
+    md = re.sub(r"\n{3,}", "\n\n", raw)
+    md = md.strip()
+    return md
+
+
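
Reviewer note: as `_elem_to_md` reads in this hunk, the early `if not inner: return ""` runs before the `br`, `img`, and `hr` branches. Those elements have no children, so `inner` is always empty for them and their branches appear unreachable, which would drop images and rules from the output. The sketch below shows one possible ordering that handles void elements first. It uses a hypothetical `Node` stand-in instead of a real bs4 tag so it runs without dependencies, and it is not the commit's code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Node:
    """Hypothetical stand-in for a bs4 Tag: a name, attributes, and children."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Node", str]] = field(default_factory=list)


def to_md(node: Union[Node, str]) -> str:
    if isinstance(node, str):
        return node.strip()

    # Void elements never have children, so convert them BEFORE the
    # empty-inner check; otherwise they are silently discarded.
    if node.name == "br":
        return "\n"
    if node.name == "hr":
        return "\n\n---\n\n"
    if node.name == "img":
        src = node.attrs.get("data-src") or node.attrs.get("src") or ""
        alt = node.attrs.get("alt", "")
        return f"\n\n![{alt}]({src})\n\n" if src else ""

    inner = "".join(to_md(child) for child in node.children).strip()
    if not inner:
        return ""
    if node.name == "p":
        return f"\n\n{inner}\n\n"
    return inner
```

Only the ordering is the point here; the remaining tag branches would follow the empty-inner check exactly as in the original function.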
+# ---------------------------------------------------------------------------
+# Public API
+# ---------------------------------------------------------------------------
+
+def fetch_article(url: str = None, file_path: str = None) -> dict:
+    """Fetch and parse a WeChat article.
+
+    Args:
+        url: WeChat article URL.
+        file_path: Path to saved HTML file (alternative to URL).
+
+    Returns:
+        dict with keys: title, author, publish_time, markdown, url
+    """
+    if file_path:
+        html = Path(file_path).read_text(encoding="utf-8")
+    elif url:
+        html = fetch_html(url)
+    else:
+        raise ValueError("Either url or file_path must be provided")
+
+    soup = BeautifulSoup(html, "html.parser")
+    meta = _extract_metadata(soup)
+    md = html_to_markdown(soup)
+
+    return {
+        "title": meta["title"],
+        "author": meta["author"],
+        "publish_time": meta["publish_time"],
+        "markdown": md,
+        "url": url or "",
+    }
+
+
+# ---------------------------------------------------------------------------
+# CLI
+# ---------------------------------------------------------------------------
+
+def main():
+    ap = argparse.ArgumentParser(
+        description="Extract WeChat article content as Markdown."
+    )
+    ap.add_argument("url", nargs="?", help="WeChat article URL")
+    ap.add_argument("--file", dest="file_path",
+                    help="Local HTML file instead of URL")
+    ap.add_argument("-o", "--output", help="Save Markdown to file")
+    ap.add_argument("--json", dest="as_json", action="store_true",
+                    help="Output as JSON (for agent use)")
+    args = ap.parse_args()
+
+    if not args.url and not args.file_path:
+        ap.error("Provide a URL or --file path")
+
+    result = fetch_article(url=args.url, file_path=args.file_path)
+
+    if args.as_json:
+        print(json.dumps(result, ensure_ascii=False, indent=2))
+    elif args.output:
+        # Write Markdown with YAML frontmatter
+        out = Path(args.output)
+        frontmatter = f"---\ntitle: \"{result['title']}\"\nauthor: \"{result['author']}\"\n"
+        if result["publish_time"]:
+            frontmatter += f"date: \"{result['publish_time']}\"\n"
+        if result["url"]:
+            frontmatter += f"source: \"{result['url']}\"\n"
+        frontmatter += "---\n\n"
+        out.write_text(frontmatter + result["markdown"], encoding="utf-8")
+        print(f"Saved: {out}")
+    else:
+        if result["title"]:
+            print(f"# {result['title']}\n")
+        if result["author"]:
+            print(f"> {result['author']}")
+        if result["publish_time"]:
+            print(f"> {result['publish_time']}")
+        if result["author"] or result["publish_time"]:
+            print()
+        print(result["markdown"])
+
+
+if __name__ == "__main__":
+    main()
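
Reviewer note: the `-o` branch of `main()` builds YAML frontmatter by placing values between double quotes, so a title that itself contains `"` or a backslash would produce an invalid scalar. One possible hardening is sketched below. It assumes that JSON string literals are accepted as YAML double-quoted scalars by the parser in use (generally the case, since YAML 1.2 is designed as a superset of JSON). The helper name `frontmatter` is hypothetical and not part of the commit.

```python
import json


def frontmatter(result: dict) -> str:
    """Build YAML frontmatter with values escaped as JSON-style double-quoted strings."""
    # title and author are always written, mirroring the original CLI.
    fields = [
        ("title", result.get("title", "")),
        ("author", result.get("author", "")),
    ]
    # date and source are written only when present, as in the original.
    if result.get("publish_time"):
        fields.append(("date", result["publish_time"]))
    if result.get("url"):
        fields.append(("source", result["url"]))
    body = "".join(
        f"{key}: {json.dumps(value, ensure_ascii=False)}\n" for key, value in fields
    )
    return f"---\n{body}---\n\n"
```

`ensure_ascii=False` keeps Chinese titles readable in the output file instead of turning them into `\uXXXX` escapes.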

dist/openclaw/scripts/learn_theme.py (vendored, 31 lines changed)
@@ -12,7 +12,6 @@ import sys
 from collections import Counter
 from pathlib import Path
 
-import requests
 import yaml
 from bs4 import BeautifulSoup
 
@@ -154,12 +153,6 @@ _TARGET_TAGS = {
     "blockquote", "code", "pre", "img", "a",
 }
 
-_BROWSER_UA = (
-    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-    "AppleWebKit/537.36 (KHTML, like Gecko) "
-    "Chrome/124.0.0.0 Safari/537.36"
-)
-
 TEMPLATE_THEME = "professional-clean"
 THEMES_DIR = Path(__file__).resolve().parent.parent / "toolkit" / "themes"
 
@@ -175,26 +168,20 @@ def _attach_title(soup, content) -> None:
 def fetch_article(url: str, timeout: int = 20) -> "BeautifulSoup tag":
     """Fetch a WeChat article, return the ``#js_content`` element.
 
-    The article title is attached as ``content._wewrite_title`` (empty string
-    if not found). Exits with code 1 on network errors or missing content.
+    Delegates to fetch_article.fetch_html() for three-level fetching
+    (requests → Playwright → manual fallback).
 
-    Parameters
-    ----------
-    url: WeChat article URL (mp.weixin.qq.com/…)
-    timeout: HTTP request timeout in seconds (default 20).
+    The article title is attached as ``content._wewrite_title`` (empty string
+    if not found).
     """
-    try:
-        resp = requests.get(url, headers={"User-Agent": _BROWSER_UA}, timeout=timeout)
-        resp.raise_for_status()
-    except requests.exceptions.RequestException as exc:
-        print(f"Error: failed to fetch URL: {exc}", file=sys.stderr)
-        sys.exit(1)
-    resp.encoding = "utf-8"
-    soup = BeautifulSoup(resp.text, "html.parser")
+    from scripts.fetch_article import fetch_html
+
+    html = fetch_html(url)
+    soup = BeautifulSoup(html, "html.parser")
 
     content = soup.find(id="js_content")
     if content is None:
-        print("Error: #js_content not found — the page may require verification.", file=sys.stderr)
+        print("Error: #js_content not found.", file=sys.stderr)
         sys.exit(1)
 
     _attach_title(soup, content)

dist/openclaw/toolkit/image_gen.py (vendored, 660 lines changed)
@@ -6,6 +6,12 @@ Supports multiple providers via a simple abstraction:
   - doubao-seedream (Volcengine Ark) — default, good for Chinese prompts
   - openai (DALL-E 3) — broad availability
   - gemini (Google Gemini Imagen) — multimodal image generation
+  - dashscope (Alibaba Tongyi Wanxiang) — good for Chinese prompts
+  - minimax — Chinese provider
+  - replicate — open-source models
+  - azure_openai — Azure-hosted DALL-E
+  - openrouter — multi-model proxy
+  - jimeng (ByteDance) — good for Chinese prompts
   - Custom providers via ImageProvider base class
 
 Usage as CLI:
@@ -21,8 +27,12 @@ Usage as module:
 import abc
 import argparse
 import base64
+import hashlib
+import hmac
 import json
 import sys
+import time
+from datetime import datetime, timezone
 from pathlib import Path
 
 import requests
@@ -51,11 +61,31 @@ def _load_config() -> dict:
 # Cover: 2.35:1 微信封面比例
 # Article: 16:9 横版内文配图
 # Vertical: 9:16 竖版
+_DEFAULT = "1792x1024"
+_DEFAULT_V = "1024x1792"
+_DEFAULT_SQ = "1024x1024"
+
 SIZE_PRESETS = {
-    "cover": {"doubao": "2952x1256", "openai": "1792x1024", "gemini": "1792x1024"},
-    "article": {"doubao": "2560x1440", "openai": "1792x1024", "gemini": "1792x1024"},
-    "vertical": {"doubao": "1088x2560", "openai": "1024x1792", "gemini": "1024x1792"},
-    "square": {"doubao": "2048x2048", "openai": "1024x1024", "gemini": "1024x1024"},
+    "cover": {
+        "doubao": "2952x1256", "openai": _DEFAULT, "gemini": _DEFAULT,
+        "dashscope": _DEFAULT, "minimax": _DEFAULT, "replicate": _DEFAULT,
+        "azure_openai": _DEFAULT, "openrouter": _DEFAULT, "jimeng": _DEFAULT,
+    },
+    "article": {
+        "doubao": "2560x1440", "openai": _DEFAULT, "gemini": _DEFAULT,
+        "dashscope": _DEFAULT, "minimax": _DEFAULT, "replicate": _DEFAULT,
+        "azure_openai": _DEFAULT, "openrouter": _DEFAULT, "jimeng": _DEFAULT,
+    },
+    "vertical": {
+        "doubao": "1088x2560", "openai": _DEFAULT_V, "gemini": _DEFAULT_V,
+        "dashscope": _DEFAULT_V, "minimax": _DEFAULT_V, "replicate": _DEFAULT_V,
+        "azure_openai": _DEFAULT_V, "openrouter": _DEFAULT_V, "jimeng": _DEFAULT_V,
+    },
+    "square": {
+        "doubao": "2048x2048", "openai": _DEFAULT_SQ, "gemini": _DEFAULT_SQ,
+        "dashscope": _DEFAULT_SQ, "minimax": _DEFAULT_SQ, "replicate": _DEFAULT_SQ,
+        "azure_openai": _DEFAULT_SQ, "openrouter": _DEFAULT_SQ, "jimeng": _DEFAULT_SQ,
+    },
 }
 
 MAX_FILE_SIZE = 5 * 1024 * 1024  # 5MB
@@ -79,6 +109,29 @@ def _compress_image(raw_bytes: bytes, max_size: int) -> bytes:
     return buf.getvalue()
 
 
+def _size_to_aspect(size: str) -> str:
+    """Convert 'WxH' to nearest standard aspect ratio string."""
+    if ":" in size:
+        return size
+    try:
+        w, h = (int(x) for x in size.split("x", 1))
+    except ValueError:
+        return "16:9"
+    ratio = w / h
+    for ar, val in [("1:1", 1.0), ("16:9", 16/9), ("9:16", 9/16),
+                    ("4:3", 4/3), ("3:4", 3/4), ("3:2", 3/2), ("2:3", 2/3)]:
+        if abs(ratio - val) < 0.15:
+            return ar
+    return "16:9"
+
+
+def _download_image(url: str) -> bytes:
+    """Download image bytes from URL."""
+    resp = requests.get(url, timeout=60)
+    resp.raise_for_status()
+    return resp.content
+
+
 # --- Provider abstraction ---
 
 class ImageProvider(abc.ABC):
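
Reviewer note: `_size_to_aspect` returns the first candidate within a tolerance of 0.15, in list order, which is not always the nearest ratio its docstring promises. For example, a 2:3 size such as `1024x1536` (ratio about 0.667) is within 0.15 of 9:16 (0.5625), and 9:16 is checked first. The sketch below reproduces the first-match logic next to a hypothetical nearest-match variant (`nearest_aspect`, not part of the commit) so the difference can be checked directly.

```python
ASPECTS = [("1:1", 1.0), ("16:9", 16 / 9), ("9:16", 9 / 16),
           ("4:3", 4 / 3), ("3:4", 3 / 4), ("3:2", 3 / 2), ("2:3", 2 / 3)]


def _parse_ratio(size: str):
    """Return width/height for a 'WxH' string, or None if it cannot be parsed."""
    try:
        w, h = (int(x) for x in size.split("x", 1))
    except ValueError:
        return None
    return w / h


def first_match_aspect(size: str, tol: float = 0.15) -> str:
    """Same selection rule as the hunk above: first candidate within `tol`."""
    if ":" in size:
        return size
    ratio = _parse_ratio(size)
    if ratio is None:
        return "16:9"
    for name, val in ASPECTS:
        if abs(ratio - val) < tol:
            return name
    return "16:9"


def nearest_aspect(size: str) -> str:
    """Hypothetical variant: pick the candidate with the smallest distance."""
    if ":" in size:
        return size
    ratio = _parse_ratio(size)
    if ratio is None:
        return "16:9"
    return min(ASPECTS, key=lambda item: abs(ratio - item[1]))[0]
```

For the sizes in `SIZE_PRESETS` the two functions agree, so the difference only matters if a caller passes an explicit portrait size such as 2:3.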
@@ -86,15 +139,7 @@ class ImageProvider(abc.ABC):
 
     @abc.abstractmethod
     def generate(self, prompt: str, size: str) -> bytes:
-        """Generate an image and return raw bytes.
-
-        Args:
-            prompt: Image description (Chinese or English).
-            size: Resolved size string (e.g. "1792x1024").
-
-        Returns:
-            Raw image bytes.
-        """
+        """Generate an image and return raw bytes."""
         ...
 
     def resolve_size(self, preset: str) -> str:
@@ -102,63 +147,45 @@ class ImageProvider(abc.ABC):
         provider_key = self.provider_key
         if preset in SIZE_PRESETS:
             return SIZE_PRESETS[preset].get(provider_key, list(SIZE_PRESETS[preset].values())[0])
-        return preset  # assume explicit WxH
+        return preset
 
     @property
     @abc.abstractmethod
     def provider_key(self) -> str:
-        """Short identifier used for size preset lookup."""
         ...
 
 
+# --- Providers ---
+
 class DoubaoProvider(ImageProvider):
     """doubao-seedream via Volcengine Ark API."""
 
     provider_key = "doubao"
 
     def __init__(self, api_key: str, model: str = "doubao-seedream-5-0-260128",
-                 base_url: str = "https://ark.cn-beijing.volces.com/api/v3"):
+                 base_url: str = "https://ark.cn-beijing.volces.com/api/v3", **_kw):
         self._api_key = api_key
         self._model = model
         self._base_url = base_url
 
     def generate(self, prompt: str, size: str) -> bytes:
-        body = {
-            "model": self._model,
-            "prompt": prompt,
-            "response_format": "url",
-            "size": size,
-            "stream": False,
-            "watermark": False,
-        }
-
         resp = requests.post(
             f"{self._base_url}/images/generations",
-            headers={
-                "Content-Type": "application/json",
-                "Authorization": f"Bearer {self._api_key}",
-            },
-            json=body,
+            headers={"Content-Type": "application/json",
+                     "Authorization": f"Bearer {self._api_key}"},
+            json={"model": self._model, "prompt": prompt,
+                  "response_format": "url", "size": size,
+                  "stream": False, "watermark": False},
             timeout=120,
         )
-
         data = resp.json()
         if resp.status_code != 200:
-            error = data.get("error", {})
-            msg = error.get("message", json.dumps(data, ensure_ascii=False))
-            raise ValueError(f"Doubao API error ({resp.status_code}): {msg}")
-
-        image_data = data.get("data", [])
-        if not image_data:
-            raise ValueError(f"No image returned: {json.dumps(data, ensure_ascii=False)}")
-
-        image_url = image_data[0].get("url")
-        if not image_url:
-            raise ValueError(f"No image URL in response: {json.dumps(data, ensure_ascii=False)}")
-
-        img_resp = requests.get(image_url, timeout=60)
-        img_resp.raise_for_status()
-        return img_resp.content
+            raise ValueError(f"Doubao error ({resp.status_code}): "
+                             f"{data.get('error', {}).get('message', str(data))}")
+        url = data.get("data", [{}])[0].get("url")
+        if not url:
+            raise ValueError(f"No image URL: {data}")
+        return _download_image(url)
 
 
 class OpenAIProvider(ImageProvider):
|
|
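Review note: the `resolve_size` logic in the hunk above can be sketched in isolation. This is a minimal sketch, not the module's code: the `SIZE_PRESETS` values below are hypothetical placeholders (the real table is not shown in this hunk), and `resolve_size` is written as a free function rather than a method.

```python
# Hypothetical preset table; the real SIZE_PRESETS values are not shown in this hunk.
SIZE_PRESETS = {
    "cover": {"doubao": "2560x1080", "openai": "1792x1024"},
    "square": {"doubao": "1024x1024", "openai": "1024x1024"},
}


def resolve_size(preset: str, provider_key: str) -> str:
    """Map a preset name to a provider-specific "WxH" string."""
    if preset in SIZE_PRESETS:
        sizes = SIZE_PRESETS[preset]
        # Fall back to the first listed size when the provider has no entry.
        return sizes.get(provider_key, next(iter(sizes.values())))
    # Not a preset name: assume the caller passed an explicit "WxH".
    return preset
```

A provider without its own entry silently gets the first provider's size, which may not be a size that provider accepts.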
@ -167,50 +194,28 @@ class OpenAIProvider(ImageProvider):
|
|||
provider_key = "openai"
|
||||
|
||||
def __init__(self, api_key: str, model: str = "dall-e-3",
|
||||
base_url: str = "https://api.openai.com/v1"):
|
||||
base_url: str = "https://api.openai.com/v1", **_kw):
|
||||
self._api_key = api_key
|
||||
self._model = model
|
||||
self._base_url = base_url
|
||||
|
||||
def generate(self, prompt: str, size: str) -> bytes:
|
||||
# DALL-E 3 expects size as "WxH" format
|
||||
dall_e_size = size.replace("x", "x") # normalize
|
||||
|
||||
body = {
|
||||
"model": self._model,
|
||||
"prompt": prompt,
|
||||
"n": 1,
|
||||
"size": dall_e_size,
|
||||
"response_format": "url",
|
||||
}
|
||||
|
||||
resp = requests.post(
|
||||
f"{self._base_url}/images/generations",
|
||||
headers={
|
||||
"Content-Type": "application/json",
|
||||
"Authorization": f"Bearer {self._api_key}",
|
||||
},
|
||||
json=body,
|
||||
headers={"Content-Type": "application/json",
|
||||
"Authorization": f"Bearer {self._api_key}"},
|
||||
json={"model": self._model, "prompt": prompt,
|
||||
"n": 1, "size": size, "response_format": "url"},
|
||||
timeout=120,
|
||||
)
|
||||
|
||||
data = resp.json()
|
||||
if resp.status_code != 200:
|
||||
error = data.get("error", {})
|
||||
msg = error.get("message", json.dumps(data, ensure_ascii=False))
|
||||
raise ValueError(f"OpenAI API error ({resp.status_code}): {msg}")
|
||||
|
||||
image_data = data.get("data", [])
|
||||
if not image_data:
|
||||
raise ValueError(f"No image returned: {json.dumps(data, ensure_ascii=False)}")
|
||||
|
||||
image_url = image_data[0].get("url")
|
||||
if not image_url:
|
||||
raise ValueError(f"No image URL in response: {json.dumps(data, ensure_ascii=False)}")
|
||||
|
||||
img_resp = requests.get(image_url, timeout=60)
|
||||
img_resp.raise_for_status()
|
||||
return img_resp.content
|
||||
raise ValueError(f"OpenAI error ({resp.status_code}): "
|
||||
f"{data.get('error', {}).get('message', str(data))}")
|
||||
url = data.get("data", [{}])[0].get("url")
|
||||
if not url:
|
||||
raise ValueError(f"No image URL: {data}")
|
||||
return _download_image(url)
|
||||
|
||||
|
||||
class GeminiProvider(ImageProvider):
|
||||
|
|
@ -219,47 +224,371 @@ class GeminiProvider(ImageProvider):
|
|||
provider_key = "gemini"
|
||||
|
||||
def __init__(self, api_key: str, model: str = "gemini-3.1-flash-image-preview",
|
||||
base_url: str = "https://generativelanguage.googleapis.com/v1beta"):
|
||||
base_url: str = "https://generativelanguage.googleapis.com/v1beta", **_kw):
|
||||
self._api_key = api_key
|
||||
self._model = model
|
||||
self._base_url = base_url
|
||||
|
||||
def generate(self, prompt: str, size: str) -> bytes:
|
||||
# Append size instruction to prompt (Gemini doesn't have a native size param)
|
||||
if "x" in size:
|
||||
w, h = size.split("x", 1)
|
||||
prompt = f"{prompt}\n\nGenerate this image at {w}x{h} resolution."
|
||||
|
||||
body = {
|
||||
"contents": [{"parts": [{"text": prompt}]}],
|
||||
"generationConfig": {"responseModalities": ["TEXT", "IMAGE"]},
|
||||
}
|
||||
resp = requests.post(
|
||||
f"{self._base_url}/models/{self._model}:generateContent",
|
||||
headers={
|
||||
"Content-Type": "application/json",
|
||||
"x-goog-api-key": self._api_key,
|
||||
},
|
||||
json=body,
|
||||
headers={"Content-Type": "application/json",
|
||||
"x-goog-api-key": self._api_key},
|
||||
json={"contents": [{"parts": [{"text": prompt}]}],
|
||||
"generationConfig": {"responseModalities": ["TEXT", "IMAGE"]}},
|
||||
timeout=120,
|
||||
)
|
||||
if resp.status_code != 200:
|
||||
try:
|
||||
error = resp.json().get("error", {})
|
||||
msg = error.get("message", resp.text[:200])
|
||||
except (ValueError, KeyError):
|
||||
msg = resp.text[:200]
|
||||
raise ValueError(f"Gemini API error ({resp.status_code}): {msg}")
|
||||
try:
|
||||
msg = resp.json().get("error", {}).get("message", msg)
|
||||
except Exception:
|
||||
pass
|
||||
raise ValueError(f"Gemini error ({resp.status_code}): {msg}")
|
||||
for part in resp.json().get("candidates", [{}])[0].get("content", {}).get("parts", []):
|
||||
inline = part.get("inlineData")
|
||||
if inline and inline.get("mimeType", "").startswith("image/"):
|
||||
return base64.b64decode(inline["data"])
|
||||
raise ValueError("No image in Gemini response")
|
||||
|
||||
|
||||
class DashScopeProvider(ImageProvider):
|
||||
"""Alibaba Tongyi Wanxiang (通义万相) via DashScope API."""
|
||||
|
||||
provider_key = "dashscope"
|
||||
|
||||
def __init__(self, api_key: str, model: str = "qwen-image-2.0-pro",
|
||||
base_url: str = "https://dashscope.aliyuncs.com/api/v1", **_kw):
|
||||
self._api_key = api_key
|
||||
self._model = model
|
||||
self._base_url = base_url
|
||||
|
||||
def generate(self, prompt: str, size: str) -> bytes:
|
||||
ds_size = size.replace("x", "*") # DashScope uses "W*H"
|
||||
resp = requests.post(
|
||||
f"{self._base_url}/services/aigc/multimodal-generation/generation",
|
||||
headers={"Content-Type": "application/json",
|
||||
"Authorization": f"Bearer {self._api_key}"},
|
||||
json={
|
||||
"model": self._model,
|
||||
"input": {"messages": [{"role": "user", "content": [{"text": prompt}]}]},
|
||||
"parameters": {"prompt_extend": False, "size": ds_size, "watermark": False},
|
||||
},
|
||||
timeout=120,
|
||||
)
|
||||
data = resp.json()
|
||||
candidates = data.get("candidates", [])
|
||||
if not candidates:
|
||||
raise ValueError("No candidates in Gemini response")
|
||||
parts = candidates[0].get("content", {}).get("parts", [])
|
||||
for part in parts:
|
||||
inline_data = part.get("inlineData")
|
||||
if inline_data and inline_data.get("mimeType", "").startswith("image/"):
|
||||
return base64.b64decode(inline_data["data"])
|
||||
raise ValueError("No image found in Gemini response parts")
|
||||
if resp.status_code != 200:
|
||||
raise ValueError(f"DashScope error ({resp.status_code}): "
|
||||
f"{data.get('message', str(data))}")
|
||||
# Try output.result_image first, then output.choices
|
||||
output = data.get("output", {})
|
||||
img = output.get("result_image")
|
||||
if not img:
|
||||
choices = output.get("choices", [])
|
||||
if choices:
|
||||
for c in choices[0].get("message", {}).get("content", []):
|
||||
if "image" in c:
|
||||
img = c["image"]
|
||||
break
|
||||
if not img:
|
||||
raise ValueError(f"No image in DashScope response: {data}")
|
||||
if img.startswith("http"):
|
||||
return _download_image(img)
|
||||
return base64.b64decode(img)
|
||||
|
||||
|
||||
class MiniMaxProvider(ImageProvider):
|
||||
"""MiniMax image generation."""
|
||||
|
||||
provider_key = "minimax"
|
||||
|
||||
def __init__(self, api_key: str, model: str = "image-01",
|
||||
base_url: str = "https://api.minimax.io/v1", **_kw):
|
||||
self._api_key = api_key
|
||||
self._model = model
|
||||
self._base_url = base_url
|
||||
|
||||
def generate(self, prompt: str, size: str) -> bytes:
|
||||
w, h = 1024, 1024
|
||||
try:
|
||||
w, h = (int(x) for x in size.split("x", 1))
|
||||
except ValueError:
|
||||
pass
|
||||
resp = requests.post(
|
||||
f"{self._base_url}/image_generation",
|
||||
headers={"Content-Type": "application/json",
|
||||
"Authorization": f"Bearer {self._api_key}"},
|
||||
json={"model": self._model, "prompt": prompt,
|
||||
"response_format": "base64",
|
||||
"width": w, "height": h, "n": 1},
|
||||
timeout=120,
|
||||
)
|
||||
data = resp.json()
|
||||
if resp.status_code != 200:
|
||||
raise ValueError(f"MiniMax error ({resp.status_code}): {data}")
|
||||
b64_list = data.get("data", {}).get("image_base64", [])
|
||||
if not b64_list:
|
||||
raise ValueError(f"No image in MiniMax response: {data}")
|
||||
return base64.b64decode(b64_list[0])
|
||||
|
||||
|
||||
class ReplicateProvider(ImageProvider):
|
||||
"""Replicate API — supports many open-source image models."""
|
||||
|
||||
provider_key = "replicate"
|
||||
_POLL_INTERVAL = 2
|
||||
_POLL_TIMEOUT = 300
|
||||
|
||||
def __init__(self, api_key: str, model: str = "google/nano-banana-pro",
|
||||
base_url: str = "https://api.replicate.com/v1", **_kw):
|
||||
self._api_key = api_key
|
||||
self._model = model
|
||||
self._base_url = base_url
|
||||
|
||||
def generate(self, prompt: str, size: str) -> bytes:
|
||||
aspect = _size_to_aspect(size)
|
||||
headers = {"Content-Type": "application/json",
|
||||
"Authorization": f"Bearer {self._api_key}",
|
||||
"Prefer": "wait=60"}
|
||||
resp = requests.post(
|
||||
f"{self._base_url}/models/{self._model}/predictions",
|
||||
headers=headers,
|
||||
json={"input": {"prompt": prompt, "aspect_ratio": aspect,
|
||||
"number_of_images": 1, "output_format": "png"}},
|
||||
timeout=120,
|
||||
)
|
||||
data = resp.json()
|
||||
if resp.status_code not in (200, 201):
|
||||
raise ValueError(f"Replicate error ({resp.status_code}): {data}")
|
||||
|
||||
# Poll if not completed yet
|
||||
poll_url = data.get("urls", {}).get("get")
|
||||
deadline = time.monotonic() + self._POLL_TIMEOUT
|
||||
while data.get("status") not in ("succeeded", "failed", "canceled"):
|
||||
if time.monotonic() > deadline:
|
||||
raise ValueError("Replicate polling timeout")
|
||||
time.sleep(self._POLL_INTERVAL)
|
||||
data = requests.get(poll_url, headers=headers, timeout=30).json()
|
||||
|
||||
if data.get("status") != "succeeded":
|
||||
raise ValueError(f"Replicate failed: {data.get('error')}")
|
||||
|
||||
output = data.get("output")
|
||||
if isinstance(output, list):
|
||||
output = output[0]
|
||||
if isinstance(output, dict):
|
||||
output = output.get("url", output.get("uri"))
|
||||
if not output or not isinstance(output, str):
|
||||
raise ValueError(f"No image URL in Replicate output: {data}")
|
||||
return _download_image(output)
|
||||
|
||||
|
||||
class AzureOpenAIProvider(ImageProvider):
|
||||
"""Azure-hosted OpenAI DALL-E."""
|
||||
|
||||
provider_key = "azure_openai"
|
||||
|
||||
def __init__(self, api_key: str, model: str = "dall-e-3",
|
||||
base_url: str = "", deployment: str = "", **_kw):
|
||||
self._api_key = api_key
|
||||
self._deployment = deployment or model
|
||||
self._base_url = base_url.rstrip("/")
|
||||
|
||||
def generate(self, prompt: str, size: str) -> bytes:
|
||||
if not self._base_url:
|
||||
raise ValueError("Azure OpenAI requires base_url "
|
||||
"(e.g. https://YOUR-RESOURCE.openai.azure.com/openai)")
|
||||
resp = requests.post(
|
||||
f"{self._base_url}/deployments/{self._deployment}"
|
||||
f"/images/generations?api-version=2025-04-01-preview",
|
||||
headers={"Content-Type": "application/json",
|
||||
"api-key": self._api_key},
|
||||
json={"prompt": prompt, "size": size, "n": 1, "quality": "medium"},
|
||||
timeout=120,
|
||||
)
|
||||
data = resp.json()
|
||||
if resp.status_code != 200:
|
||||
raise ValueError(f"Azure OpenAI error ({resp.status_code}): {data}")
|
||||
item = data.get("data", [{}])[0]
|
||||
if item.get("url"):
|
||||
return _download_image(item["url"])
|
||||
if item.get("b64_json"):
|
||||
return base64.b64decode(item["b64_json"])
|
||||
raise ValueError(f"No image in Azure response: {data}")
|
||||
|
||||
|
||||
class OpenRouterProvider(ImageProvider):
|
||||
"""OpenRouter — multi-model proxy using chat completions format."""
|
||||
|
||||
provider_key = "openrouter"
|
||||
|
||||
def __init__(self, api_key: str, model: str = "google/gemini-3.1-flash-image-preview",
|
||||
base_url: str = "https://openrouter.ai/api/v1", **_kw):
|
||||
self._api_key = api_key
|
||||
self._model = model
|
||||
self._base_url = base_url
|
||||
|
||||
def generate(self, prompt: str, size: str) -> bytes:
|
||||
aspect = _size_to_aspect(size)
|
||||
resp = requests.post(
|
||||
f"{self._base_url}/chat/completions",
|
||||
headers={"Content-Type": "application/json",
|
||||
"Authorization": f"Bearer {self._api_key}"},
|
||||
json={
|
||||
"model": self._model,
|
||||
"messages": [{"role": "user", "content": prompt}],
|
||||
"modalities": ["image"],
|
||||
"stream": False,
|
||||
"image_config": {"aspect_ratio": aspect},
|
||||
"provider": {"require_parameters": True},
|
||||
},
|
||||
timeout=120,
|
||||
)
|
||||
data = resp.json()
|
||||
if resp.status_code != 200:
|
||||
raise ValueError(f"OpenRouter error ({resp.status_code}): {data}")
|
||||
|
||||
# Extract image from multiple possible locations
|
||||
choice = data.get("choices", [{}])[0].get("message", {})
|
||||
# Path 1: images array
|
||||
images = choice.get("images", [])
|
||||
if images:
|
||||
img = images[0]
|
||||
if img.startswith("http"):
|
||||
return _download_image(img)
|
||||
if img.startswith("data:"):
|
||||
_, b64 = img.split(",", 1)
|
||||
return base64.b64decode(b64)
|
||||
# Path 2: content array with image items
|
||||
content = choice.get("content", [])
|
||||
if isinstance(content, list):
|
||||
for item in content:
|
||||
if isinstance(item, dict) and item.get("type") == "image":
|
||||
url = item.get("url") or item.get("image_url", {}).get("url")
|
||||
if url:
|
||||
if url.startswith("data:"):
|
||||
_, b64 = url.split(",", 1)
|
||||
return base64.b64decode(b64)
|
||||
return _download_image(url)
|
||||
raise ValueError(f"No image in OpenRouter response: {data}")
|
||||
|
||||
|
||||
class JimengProvider(ImageProvider):
|
||||
"""ByteDance Jimeng (即梦) — async submit + poll with HMAC-SHA256 auth."""
|
||||
|
||||
provider_key = "jimeng"
|
||||
_POLL_INTERVAL = 2
|
||||
_POLL_MAX_ATTEMPTS = 60
|
||||
|
||||
def __init__(self, api_key: str, secret_key: str = "",
|
||||
model: str = "jimeng_t2i_v40",
|
||||
base_url: str = "https://visual.volcengineapi.com", **_kw):
|
||||
self._access_key = api_key
|
||||
self._secret_key = secret_key
|
||||
self._model = model
|
||||
self._base_url = base_url
|
||||
|
||||
def _sign(self, method: str, path: str, query: str,
|
||||
headers: dict, payload: bytes) -> dict:
|
||||
"""Generate Volcengine HMAC-SHA256 signed headers."""
|
||||
now = datetime.now(timezone.utc)
|
||||
date_stamp = now.strftime("%Y%m%d")
|
||||
amz_date = now.strftime("%Y%m%dT%H%M%SZ")
|
||||
|
||||
signed_headers_list = sorted(k.lower() for k in headers)
|
||||
signed_headers_str = ";".join(signed_headers_list)
|
||||
|
||||
canonical = "\n".join([
|
||||
method, path, query,
|
||||
"".join(f"{k.lower()}:{headers[k]}\n" for k in sorted(headers)),
|
||||
signed_headers_str,
|
||||
hashlib.sha256(payload).hexdigest(),
|
||||
])
|
||||
|
||||
region = "cn-north-1"
|
||||
service = "cv"
|
||||
scope = f"{date_stamp}/{region}/{service}/request"
|
||||
string_to_sign = "\n".join([
|
||||
"HMAC-SHA256", amz_date, scope,
|
||||
hashlib.sha256(canonical.encode()).hexdigest(),
|
||||
])
|
||||
|
||||
def _hmac(key: bytes, msg: str) -> bytes:
|
||||
return hmac.new(key, msg.encode(), hashlib.sha256).digest()
|
||||
|
||||
k_date = _hmac(self._secret_key.encode(), date_stamp)
|
||||
k_region = _hmac(k_date, region)
|
||||
k_service = _hmac(k_region, service)
|
||||
k_signing = _hmac(k_service, "request")
|
||||
signature = hmac.new(k_signing, string_to_sign.encode(),
|
||||
hashlib.sha256).hexdigest()
|
||||
|
||||
auth = (f"HMAC-SHA256 Credential={self._access_key}/{scope}, "
|
||||
f"SignedHeaders={signed_headers_str}, Signature={signature}")
|
||||
return {**headers, "Authorization": auth, "X-Date": amz_date}
|
||||
|
||||
def _request(self, action: str, body: dict) -> dict:
|
||||
payload = json.dumps(body).encode()
|
||||
path = "/"
|
||||
query = f"Action={action}&Version=2022-08-31"
|
||||
headers = {
|
||||
"Content-Type": "application/json",
|
||||
"Host": self._base_url.replace("https://", "").replace("http://", ""),
|
||||
}
|
||||
signed = self._sign("POST", path, query, headers, payload)
|
||||
resp = requests.post(
|
||||
f"{self._base_url}/?{query}",
|
||||
headers=signed, data=payload, timeout=120,
|
||||
)
|
||||
data = resp.json()
|
||||
if resp.status_code != 200:
|
||||
raise ValueError(f"Jimeng error ({resp.status_code}): {data}")
|
||||
return data
|
||||
|
||||
def generate(self, prompt: str, size: str) -> bytes:
|
||||
if not self._secret_key:
|
||||
raise ValueError("Jimeng requires both api_key (access_key_id) "
|
||||
"and secret_key (secret_access_key)")
|
||||
w, h = 1024, 1024
|
||||
try:
|
||||
w, h = (int(x) for x in size.split("x", 1))
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
# Submit task
|
||||
submit = self._request("CVSync2AsyncSubmitTask", {
|
||||
"req_key": self._model, "prompt": prompt,
|
||||
"width": w, "height": h,
|
||||
})
|
||||
task_id = submit.get("data", {}).get("task_id")
|
||||
if not task_id:
|
||||
raise ValueError(f"No task_id from Jimeng: {submit}")
|
||||
|
||||
# Poll for result
|
||||
for _ in range(self._POLL_MAX_ATTEMPTS):
|
||||
time.sleep(self._POLL_INTERVAL)
|
||||
result = self._request("CVSync2AsyncGetResult", {
|
||||
"req_key": self._model, "task_id": task_id,
|
||||
})
|
||||
code = result.get("code")
|
||||
if code == 10000:
|
||||
data = result.get("data", {})
|
||||
b64_list = data.get("binary_data_base64", [])
|
||||
if b64_list:
|
||||
return base64.b64decode(b64_list[0])
|
||||
urls = data.get("image_urls", [])
|
||||
if urls:
|
||||
return _download_image(urls[0])
|
||||
raise ValueError(f"No image data in Jimeng result: {result}")
|
||||
if code and code != 10000:
|
||||
status = result.get("data", {}).get("status")
|
||||
if status in ("failed", "canceled"):
|
||||
raise ValueError(f"Jimeng task failed: {result}")
|
||||
|
||||
raise ValueError("Jimeng polling timeout")
|
||||
|
||||
|
||||
# --- Provider registry ---
|
||||
|
|
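Review note: the key derivation inside `JimengProvider._sign` above is a chain of HMAC-SHA256 calls. The sketch below isolates that chain using only the standard library. The function names `derive_signing_key` and `sign` are illustrative and do not appear in the diff, and the sketch leaves out the canonical-request construction.

```python
import hashlib
import hmac


def derive_signing_key(secret_key: str, date_stamp: str,
                       region: str, service: str) -> bytes:
    """Chain HMAC-SHA256 over date, region, service, and the literal "request"."""
    key = secret_key.encode()
    for part in (date_stamp, region, service, "request"):
        # Each step uses the previous digest as the key for the next part.
        key = hmac.new(key, part.encode(), hashlib.sha256).digest()
    return key


def sign(secret_key: str, date_stamp: str, region: str,
         service: str, string_to_sign: str) -> str:
    """Return the hex signature of string_to_sign under the derived key."""
    signing_key = derive_signing_key(secret_key, date_stamp, region, service)
    return hmac.new(signing_key, string_to_sign.encode(),
                    hashlib.sha256).hexdigest()
```

Because the date stamp is part of the key, a signature computed on one UTC day will not verify against a key derived for another day.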
@ -268,37 +597,82 @@ PROVIDERS = {
|
|||
"doubao": DoubaoProvider,
|
||||
"openai": OpenAIProvider,
|
||||
"gemini": GeminiProvider,
|
||||
"dashscope": DashScopeProvider,
|
||||
"minimax": MiniMaxProvider,
|
||||
"replicate": ReplicateProvider,
|
||||
"azure_openai": AzureOpenAIProvider,
|
||||
"openrouter": OpenRouterProvider,
|
||||
"jimeng": JimengProvider,
|
||||
}
|
||||
|
||||
|
||||
def _build_provider(config: dict) -> ImageProvider:
|
||||
"""Build an ImageProvider from config.yaml's image section."""
|
||||
img_cfg = config.get("image", {})
|
||||
provider_name = img_cfg.get("provider", "doubao")
|
||||
api_key = img_cfg.get("api_key")
|
||||
def _build_provider_from_entry(entry: dict) -> ImageProvider:
|
||||
"""Build a single ImageProvider from a provider config entry."""
|
||||
provider_name = entry.get("provider", "doubao")
|
||||
api_key = entry.get("api_key")
|
||||
|
||||
if not api_key:
|
||||
raise ValueError(
|
||||
f"image.api_key not set in config.yaml. "
|
||||
f"Configure your {provider_name} API key to enable image generation."
|
||||
)
|
||||
raise ValueError(f"No api_key for provider '{provider_name}'")
|
||||
|
||||
provider_cls = PROVIDERS.get(provider_name)
|
||||
if not provider_cls:
|
||||
raise ValueError(
|
||||
f"Unknown image provider: '{provider_name}'. "
|
||||
f"Unknown provider: '{provider_name}'. "
|
||||
f"Available: {', '.join(PROVIDERS.keys())}"
|
||||
)
|
||||
|
||||
kwargs = {"api_key": api_key}
|
||||
if img_cfg.get("model"):
|
||||
kwargs["model"] = img_cfg["model"]
|
||||
if img_cfg.get("base_url"):
|
||||
kwargs["base_url"] = img_cfg["base_url"]
|
||||
if entry.get("model"):
|
||||
kwargs["model"] = entry["model"]
|
||||
if entry.get("base_url"):
|
||||
kwargs["base_url"] = entry["base_url"]
|
||||
if entry.get("secret_key"):
|
||||
kwargs["secret_key"] = entry["secret_key"]
|
||||
if entry.get("deployment"):
|
||||
kwargs["deployment"] = entry["deployment"]
|
||||
|
||||
return provider_cls(**kwargs)
|
||||
|
||||
|
||||
def _build_provider_chain(config: dict) -> list[ImageProvider]:
|
||||
"""Build an ordered list of providers to try.
|
||||
|
||||
Supports two config formats:
|
||||
- Legacy: image.provider + image.api_key (single provider)
|
||||
- New: image.providers (list, tried in order with auto-fallback)
|
||||
"""
|
||||
img_cfg = config.get("image", {})
|
||||
providers_list = img_cfg.get("providers")
|
||||
|
||||
if providers_list and isinstance(providers_list, list):
|
||||
chain = []
|
||||
for entry in providers_list:
|
||||
try:
|
||||
chain.append(_build_provider_from_entry(entry))
|
||||
except ValueError:
|
||||
continue # skip misconfigured entries
|
||||
if not chain:
|
||||
raise ValueError(
|
||||
"No valid providers in image.providers list. "
|
||||
"Each entry needs 'provider' and 'api_key'."
|
||||
)
|
||||
return chain
|
||||
|
||||
# Legacy single-provider format
|
||||
api_key = img_cfg.get("api_key")
|
||||
if not api_key:
|
||||
raise ValueError(
|
||||
"image.api_key not set in config.yaml. "
|
||||
"Configure your API key to enable image generation."
|
||||
)
|
||||
return [_build_provider_from_entry(img_cfg)]
|
||||
|
||||
|
||||
def _build_provider(config: dict) -> ImageProvider:
|
||||
"""Build an ImageProvider from config.yaml (backward-compatible entry point)."""
|
||||
return _build_provider_chain(config)[0]
|
||||
|
||||
|
||||
# --- Public API ---
|
||||
|
||||
def generate_image(
|
||||
|
|
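Review note: the two config shapes that `_build_provider_chain` accepts can be illustrated with plain dicts. This is a simplified sketch, assuming the parsed `config.yaml` is a nested dict. `provider_entries` is a hypothetical helper that only normalizes entries; unlike the code above, it returns an empty list instead of raising when nothing is usable.

```python
def provider_entries(config: dict) -> list[dict]:
    """Return provider config entries in the order they should be tried."""
    img_cfg = config.get("image", {})
    providers = img_cfg.get("providers")
    if providers and isinstance(providers, list):
        # New format: keep only entries that carry an api_key.
        return [e for e in providers
                if isinstance(e, dict) and e.get("api_key")]
    if img_cfg.get("api_key"):
        # Legacy format: the image section itself is the single entry.
        return [img_cfg]
    return []
```

With the legacy shape `{"image": {"provider": "doubao", "api_key": "..."}}` this yields one entry. With `{"image": {"providers": [...]}}` it yields the list in order, minus entries that lack a key.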
@ -308,7 +682,10 @@ def generate_image(
|
|||
config: dict = None,
|
||||
) -> str:
|
||||
"""
|
||||
Generate an image using the configured provider.
|
||||
Generate an image using configured providers with auto-fallback.
|
||||
|
||||
Tries each provider in order. If one fails, falls back to the next.
|
||||
Supports both single-provider (legacy) and multi-provider config.
|
||||
|
||||
Args:
|
||||
prompt: Image generation prompt (Chinese or English).
|
||||
|
|
@ -322,10 +699,21 @@ def generate_image(
|
|||
if config is None:
|
||||
config = _load_config()
|
||||
|
||||
provider = _build_provider(config)
|
||||
resolved_size = provider.resolve_size(size)
|
||||
chain = _build_provider_chain(config)
|
||||
last_error = None
|
||||
|
||||
for provider in chain:
|
||||
resolved_size = provider.resolve_size(size)
|
||||
try:
|
||||
raw_bytes = provider.generate(prompt, resolved_size)
|
||||
except Exception as e:
|
||||
last_error = e
|
||||
print(
|
||||
f"Provider '{provider.provider_key}' failed: {e}. "
|
||||
f"Trying next...",
|
||||
file=sys.stderr,
|
||||
)
|
||||
continue
|
||||
|
||||
# Compress if over 5MB (WeChat upload limit)
|
||||
if len(raw_bytes) > MAX_FILE_SIZE:
|
||||
|
|
@ -336,24 +724,20 @@ def generate_image(
|
|||
output.write_bytes(raw_bytes)
|
||||
return str(output)
|
||||
|
||||
raise ValueError(
|
||||
f"All providers failed. Last error: {last_error}"
|
||||
)
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Generate images using AI (doubao-seedream, OpenAI DALL-E, Gemini Imagen, etc.)"
|
||||
)
|
||||
parser.add_argument("--prompt", required=True, help="Image generation prompt")
|
||||
parser.add_argument("--output", required=True, help="Output file path")
|
||||
parser.add_argument(
|
||||
"--size",
|
||||
default="cover",
|
||||
help="Size: cover, article, vertical, square, or WxH",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--provider",
|
||||
default=None,
|
||||
help="Override provider (doubao, openai, gemini). Default: from config.yaml",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
ap = argparse.ArgumentParser(description="Generate images using AI")
|
||||
ap.add_argument("--prompt", required=True, help="Image generation prompt")
|
||||
ap.add_argument("--output", required=True, help="Output file path")
|
||||
ap.add_argument("--size", default="cover",
|
||||
help="Size: cover, article, vertical, square, or WxH")
|
||||
ap.add_argument("--provider", default=None,
|
||||
help=f"Override provider ({', '.join(PROVIDERS)})")
|
||||
args = ap.parse_args()
|
||||
|
||||
try:
|
||||
config = _load_config()
|
||||
|
|
|
|||
323
scripts/fetch_article.py
Normal file
|
|
@ -0,0 +1,323 @@
|
|||
#!/usr/bin/env python3
|
||||
"""fetch_article.py — extract WeChat article content as Markdown.
|
||||
|
||||
Three-level fetching strategy:
|
||||
Level 1: requests (fast, zero overhead, works for most articles)
|
||||
Level 2: Playwright headless Chrome (bypasses anti-scraping JS checks)
|
||||
Level 3: Prompt user to save HTML manually and pass via --file
|
||||
|
||||
Usage:
|
||||
python3 scripts/fetch_article.py <url> # auto fetch
|
||||
python3 scripts/fetch_article.py <url> -o article.md # save to file
|
||||
python3 scripts/fetch_article.py --file saved.html # from local HTML
|
||||
python3 scripts/fetch_article.py <url> --json # JSON output for agent
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
from bs4 import BeautifulSoup, NavigableString
|
||||
|
||||
_BROWSER_UA = (
|
||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
|
||||
"AppleWebKit/537.36 (KHTML, like Gecko) "
|
||||
"Chrome/124.0.0.0 Safari/537.36"
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Fetching: three-level strategy
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _fetch_requests(url: str, timeout: int = 20) -> str | None:
|
||||
"""Level 1: plain requests. Returns HTML string or None on failure."""
|
||||
try:
|
||||
resp = requests.get(url, headers={"User-Agent": _BROWSER_UA}, timeout=timeout)
|
||||
resp.raise_for_status()
|
||||
resp.encoding = "utf-8"
|
||||
return resp.text
|
||||
except requests.exceptions.RequestException:
|
||||
return None
|
||||
|
||||
|
||||
def _fetch_playwright(url: str, timeout: int = 30000) -> str | None:
|
||||
"""Level 2: Playwright headless Chrome. Returns HTML or None."""
|
||||
try:
|
||||
from playwright.sync_api import sync_playwright
|
||||
except ImportError:
|
||||
return None
|
||||
|
||||
try:
|
||||
with sync_playwright() as p:
|
||||
browser = p.chromium.launch(headless=True)
|
||||
page = browser.new_page(user_agent=_BROWSER_UA)
|
||||
page.goto(url, wait_until="networkidle", timeout=timeout)
|
||||
# Wait for WeChat content to render
|
||||
page.wait_for_selector("#js_content", timeout=10000)
|
||||
html = page.content()
|
||||
browser.close()
|
||||
return html
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def fetch_html(url: str) -> str:
|
||||
"""Fetch article HTML with automatic fallback.
|
||||
|
||||
Returns HTML string. Exits with error if all levels fail.
|
||||
"""
|
||||
# Level 1
|
||||
html = _fetch_requests(url)
|
||||
if html and _has_content(html):
|
||||
return html
|
||||
|
||||
# Level 2
|
||||
print("requests did not return the article body; trying Playwright...", file=sys.stderr)
|
||||
html = _fetch_playwright(url)
|
||||
if html and _has_content(html):
|
||||
return html
|
||||
|
||||
# Level 3
|
||||
print(
|
||||
"Error: could not fetch the article content. Open the article in a browser → right-click and save as HTML → pass it in with --file.",
|
||||
file=sys.stderr,
|
||||
)
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def _has_content(html: str) -> bool:
|
||||
"""Check if HTML contains non-empty #js_content."""
|
||||
soup = BeautifulSoup(html, "html.parser")
|
||||
content = soup.find(id="js_content")
|
||||
if content is None:
|
||||
return False
|
||||
text = content.get_text(strip=True)
|
||||
return len(text) > 50 # must have real content, not just whitespace
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# HTML → Markdown conversion
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _extract_metadata(soup: BeautifulSoup) -> dict:
|
||||
"""Extract article metadata from WeChat page."""
|
||||
title_tag = soup.find("h1", class_="rich_media_title") or soup.find(
|
||||
"h1", id="activity-name"
|
||||
)
|
||||
title = title_tag.get_text(strip=True) if title_tag else ""
|
||||
|
||||
author_tag = soup.find("a", id="js_name") or soup.find(
|
||||
"span", class_="rich_media_meta_nickname"
|
||||
)
|
||||
author = author_tag.get_text(strip=True) if author_tag else ""
|
||||
|
||||
# Publish time
|
||||
pub_tag = soup.find("em", id="publish_time")
|
||||
pub_time = pub_tag.get_text(strip=True) if pub_tag else ""
|
||||
|
||||
return {"title": title, "author": author, "publish_time": pub_time}
|
||||
|
||||
|
||||
def _elem_to_md(elem, depth: int = 0) -> str:
|
||||
"""Convert a single HTML element to Markdown."""
|
||||
tag = elem.name if hasattr(elem, "name") else None
|
||||
|
||||
if isinstance(elem, NavigableString):
|
||||
text = str(elem).strip()
|
||||
return text if text else ""
|
||||
|
||||
if tag is None:
|
||||
return ""
|
||||
|
||||
# Skip hidden/empty elements
|
||||
style = elem.get("style", "")
|
||||
if "display:none" in style.replace(" ", "").lower():
|
||||
return ""
|
||||
if "visibility:hidden" in style.replace(" ", "").lower():
|
||||
return ""
|
||||
|
||||
# Get inner content recursively
|
||||
inner = ""
|
||||
for child in elem.children:
|
||||
inner += _elem_to_md(child, depth + 1)
|
||||
|
||||
inner = inner.strip()
|
||||
if not inner:
|
||||
return ""
|
||||
|
||||
# Headings
|
||||
if tag in ("h1", "h2", "h3", "h4"):
|
||||
level = int(tag[1])
|
||||
return f"\n\n{'#' * level} {inner}\n\n"
|
||||
|
||||
# Paragraphs
|
||||
if tag == "p":
|
||||
return f"\n\n{inner}\n\n"
|
||||
|
||||
# Line breaks
|
||||
if tag == "br":
|
||||
return "\n"
|
||||
|
||||
# Bold
|
||||
if tag in ("strong", "b"):
|
||||
return f"**{inner}**"
|
||||
|
||||
# Italic
|
||||
if tag in ("em", "i"):
|
||||
return f"*{inner}*"
|
||||
|
||||
# Links
|
||||
if tag == "a":
|
||||
href = elem.get("href", "")
|
||||
if href and not href.startswith("javascript:"):
|
||||
return f"[{inner}]({href})"
|
||||
return inner
|
||||
|
||||
# Images
|
||||
if tag == "img":
|
||||
src = elem.get("data-src") or elem.get("src") or ""
|
||||
alt = elem.get("alt", "")
|
||||
if src:
|
||||
return f"\n\n![{alt}]({src})\n\n"
|
||||
return ""
|
||||
|
||||
# Blockquotes
|
||||
if tag == "blockquote":
|
||||
lines = inner.split("\n")
|
||||
quoted = "\n".join(f"> {line}" for line in lines if line.strip())
|
||||
return f"\n\n{quoted}\n\n"
|
||||
|
||||
# Lists
|
||||
if tag in ("ul", "ol"):
|
||||
return f"\n\n{inner}\n\n"
|
||||
if tag == "li":
|
||||
parent = elem.parent
|
||||
if parent and parent.name == "ol":
|
||||
# Ordered list — position tracking is imperfect but functional
|
||||
return f"1. {inner}\n"
|
||||
return f"- {inner}\n"
|
||||
|
||||
# Code
|
||||
if tag == "code":
|
||||
if elem.parent and elem.parent.name == "pre":
|
||||
return inner
|
||||
return f"`{inner}`"
|
||||
if tag == "pre":
|
||||
return f"\n\n```\n{inner}\n```\n\n"
|
||||
|
||||
# Horizontal rule
|
||||
if tag == "hr":
|
||||
return "\n\n---\n\n"
|
||||
|
||||
# Section / div / span — pass through
|
||||
if tag in ("section", "div", "span", "article", "main", "figure",
|
||||
"figcaption", "table", "thead", "tbody", "tr"):
|
||||
return inner
|
||||
|
||||
# Table cells
|
||||
if tag in ("td", "th"):
|
||||
return f" {inner} |"
|
||||
|
||||
return inner
|
||||
|
||||
|
||||
def html_to_markdown(soup: BeautifulSoup) -> str:
|
||||
"""Convert WeChat article HTML to clean Markdown."""
|
||||
content = soup.find(id="js_content")
|
||||
if content is None:
|
||||
return ""
|
||||
|
||||
raw = _elem_to_md(content)
|
||||
|
||||
# Clean up excessive whitespace
|
||||
md = re.sub(r"\n{3,}", "\n\n", raw)
|
||||
md = md.strip()
|
||||
return md
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# Public API
# ---------------------------------------------------------------------------

def fetch_article(url: str = None, file_path: str = None) -> dict:
    """Fetch and parse a WeChat article.

    Args:
        url: WeChat article URL.
        file_path: Path to saved HTML file (alternative to URL).

    Returns:
        dict with keys: title, author, publish_time, markdown, url
    """
    if file_path:
        html = Path(file_path).read_text(encoding="utf-8")
    elif url:
        html = fetch_html(url)
    else:
        raise ValueError("Either url or file_path must be provided")

    soup = BeautifulSoup(html, "html.parser")
    meta = _extract_metadata(soup)
    md = html_to_markdown(soup)

    return {
        "title": meta["title"],
        "author": meta["author"],
        "publish_time": meta["publish_time"],
        "markdown": md,
        "url": url or "",
    }


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def main():
    ap = argparse.ArgumentParser(
        description="Extract WeChat article content as Markdown."
    )
    ap.add_argument("url", nargs="?", help="WeChat article URL")
    ap.add_argument("--file", dest="file_path",
                    help="Local HTML file instead of URL")
    ap.add_argument("-o", "--output", help="Save Markdown to file")
    ap.add_argument("--json", dest="as_json", action="store_true",
                    help="Output as JSON (for agent use)")
    args = ap.parse_args()

    if not args.url and not args.file_path:
        ap.error("Provide a URL or --file path")

    result = fetch_article(url=args.url, file_path=args.file_path)

    if args.as_json:
        print(json.dumps(result, ensure_ascii=False, indent=2))
    elif args.output:
        # Write Markdown with YAML frontmatter. Values go through json.dumps()
        # so embedded quotes and backslashes are escaped; a JSON string is
        # also a valid YAML double-quoted scalar.
        out = Path(args.output)
        frontmatter = (
            "---\n"
            f"title: {json.dumps(result['title'], ensure_ascii=False)}\n"
            f"author: {json.dumps(result['author'], ensure_ascii=False)}\n"
        )
        if result["publish_time"]:
            frontmatter += f"date: {json.dumps(result['publish_time'], ensure_ascii=False)}\n"
        if result["url"]:
            frontmatter += f"source: {json.dumps(result['url'], ensure_ascii=False)}\n"
        frontmatter += "---\n\n"
        out.write_text(frontmatter + result["markdown"], encoding="utf-8")
        print(f"Saved: {out}")
    else:
        if result["title"]:
            print(f"# {result['title']}\n")
        if result["author"]:
            print(f"> {result['author']}")
        if result["publish_time"]:
            print(f"> {result['publish_time']}")
        if result["author"] or result["publish_time"]:
            print()
        print(result["markdown"])


if __name__ == "__main__":
    main()

@@ -12,7 +12,6 @@ import sys
 from collections import Counter
 from pathlib import Path
 
-import requests
 import yaml
 from bs4 import BeautifulSoup
 
@@ -154,12 +153,6 @@ _TARGET_TAGS = {
     "blockquote", "code", "pre", "img", "a",
 }
 
-_BROWSER_UA = (
-    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
-    "AppleWebKit/537.36 (KHTML, like Gecko) "
-    "Chrome/124.0.0.0 Safari/537.36"
-)
-
 TEMPLATE_THEME = "professional-clean"
 THEMES_DIR = Path(__file__).resolve().parent.parent / "toolkit" / "themes"
 
@@ -175,26 +168,20 @@ def _attach_title(soup, content) -> None:
 def fetch_article(url: str, timeout: int = 20) -> "BeautifulSoup tag":
     """Fetch a WeChat article, return the ``#js_content`` element.
 
-    The article title is attached as ``content._wewrite_title`` (empty string
-    if not found). Exits with code 1 on network errors or missing content.
+    Delegates to fetch_article.fetch_html() for three-level fetching
+    (requests → Playwright → manual fallback).
 
-    Parameters
-    ----------
-    url: WeChat article URL (mp.weixin.qq.com/…)
-    timeout: HTTP request timeout in seconds (default 20).
+    The article title is attached as ``content._wewrite_title`` (empty string
+    if not found).
     """
-    try:
-        resp = requests.get(url, headers={"User-Agent": _BROWSER_UA}, timeout=timeout)
-        resp.raise_for_status()
-    except requests.exceptions.RequestException as exc:
-        print(f"Error: failed to fetch URL: {exc}", file=sys.stderr)
-        sys.exit(1)
-    resp.encoding = "utf-8"
-    soup = BeautifulSoup(resp.text, "html.parser")
+    from scripts.fetch_article import fetch_html
+
+    html = fetch_html(url)
+    soup = BeautifulSoup(html, "html.parser")
 
     content = soup.find(id="js_content")
     if content is None:
-        print("Error: #js_content not found — the page may require verification.", file=sys.stderr)
+        print("Error: #js_content not found.", file=sys.stderr)
         sys.exit(1)
 
     _attach_title(soup, content)