fix: normalize hotspot scores across platforms for fair sorting
Hotspots were previously sorted by their raw hot values, but platforms report these on very different scales (Toutiao ~10M, Weibo ~1M, Baidu ~100K). Toutiao entries therefore dominated the merged list, while Weibo and Baidu entries were always truncated. Each source is now ranked independently and given a rank-based score on a 0-100 scale before merging, so every platform's top stories carry equal weight in the cross-platform sort.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent eef748f9c3
commit 039a6caa9d
1 changed file with 15 additions and 2 deletions
@@ -144,8 +144,21 @@ def main():
             sources_fail.append(name)
 
     all_items = deduplicate(all_items)
-    # Normalize hot values for sorting (different scales across sources)
-    all_items.sort(key=lambda x: int(x.get("hot", 0) or 0), reverse=True)
+
+    # Normalize hot values across platforms (different scales: toutiao ~10M, weibo ~1M, baidu ~100K)
+    # Strategy: within each source, rank-based score 0-100, so cross-platform sorting is fair
+    by_source: dict[str, list[dict]] = {}
+    for item in all_items:
+        by_source.setdefault(item["source"], []).append(item)
+
+    for source, items in by_source.items():
+        items.sort(key=lambda x: int(x.get("hot", 0) or 0), reverse=True)
+        n = len(items)
+        for rank, item in enumerate(items):
+            # Top item = 100, linear decay to ~1 for last item
+            item["hot_normalized"] = round(100 * (n - rank) / n, 1) if n > 0 else 0
+
+    all_items.sort(key=lambda x: x.get("hot_normalized", 0), reverse=True)
     all_items = all_items[:args.limit]
 
     tz = timezone(timedelta(hours=8))
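To see the effect of the change in isolation, here is a minimal standalone sketch of the same rank-based normalization. The helper names `hot_value` and `normalize_by_source` and the sample items are hypothetical and not part of the commit; only the `source`, `hot`, and `hot_normalized` keys and the scoring formula come from the diff. It assumes every item carries a `source` key, as the diff does.

```python
from collections import defaultdict


def hot_value(item: dict) -> int:
    """Read the raw hot value as the diff does; a missing or None value counts as 0."""
    return int(item.get("hot", 0) or 0)


def normalize_by_source(items: list[dict]) -> list[dict]:
    """Attach a per-source, rank-based 'hot_normalized' score and return a merged ordering."""
    by_source: dict[str, list[dict]] = defaultdict(list)
    for item in items:
        by_source[item["source"]].append(item)

    for group in by_source.values():
        group.sort(key=hot_value, reverse=True)
        n = len(group)
        for rank, item in enumerate(group):
            # rank 0 scores 100.0; the last rank (n - 1) scores 100 / n
            item["hot_normalized"] = round(100 * (n - rank) / n, 1)

    # sorted() is stable, so equal scores keep their input order
    return sorted(items, key=lambda x: x["hot_normalized"], reverse=True)


# Hypothetical sample data, scaled roughly as the commit message describes
sample = [
    {"source": "toutiao", "title": "T1", "hot": 12_000_000},
    {"source": "toutiao", "title": "T2", "hot": 9_000_000},
    {"source": "weibo", "title": "W1", "hot": 1_500_000},
    {"source": "weibo", "title": "W2", "hot": 800_000},
    {"source": "baidu", "title": "B1", "hot": 120_000},
    {"source": "baidu", "title": "B2", "hot": 90_000},
]

raw_top3 = sorted(sample, key=hot_value, reverse=True)[:3]
normalized_top3 = normalize_by_source(sample)[:3]

print([i["title"] for i in raw_top3])         # ['T1', 'T2', 'W1']
print([i["title"] for i in normalized_top3])  # ['T1', 'W1', 'B1']
```

With a limit of 3, the raw sort drops Baidu entirely, while the normalized sort keeps the top story from each platform. Two details are worth noting. The lowest-ranked item in a source of `n` items scores `100 / n`, so it only approaches 1 when a source contributes about 100 items. Items with equal normalized scores are ordered by their position in the input list, because Python's sort is stable.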