See it in action: This pipeline powers Gen AI Spotlight on Telegram — a fully automated AI news channel. Join to see what the output looks like in production.
A complete, automated AI news scanning pipeline for OpenClaw. Scans 5 data sources every 2 hours, scores and deduplicates results with a persistent SQLite database, enriches top articles with full text, and uses a 3-tier LLM failover chain (Gemini Flash Lite → Grok via OpenRouter → Gemini Flash) to curate the best stories for your channel.
Pipeline cost: ~$5/month (Gemini Flash Lite API + Tavily free tier)
This pipeline is designed to run as an OpenClaw cron job. Here's how it integrates:
OpenClaw Gateway
├── Cron scheduler fires every 2 hours
│   └── Runs news_scan_deduped.sh (the orchestrator)
│       ├── Calls 5 data source scripts (RSS, Reddit, Twitter, GitHub, Tavily)
│       ├── Scores + deduplicates via quality_score.py + dedup_db.py
│       ├── Enriches top articles via enrich_top_articles.py
│       └── Curates via llm_editor.py (3-tier LLM failover)
│
├── Agent receives the pipeline output
│   └── Formats and delivers to your channel (Telegram, Slack, etc.)
│
├── Nightly cron (optional)
│   └── Runs update_editorial_profile.py to learn from your approvals/rejections
│
└── memory/ directory
    ├── news_dedup.db               ← SQLite dedup database (cross-scan)
    ├── editorial_profile.md        ← LLM editor reads this for guidance
    ├── editorial_decisions.md      ← Your approval/rejection log
    ├── scanner_presented.md        ← Auto-logged: what was presented
    ├── news_log.md                 ← Your posted stories (for dedup)
    ├── last_scan_candidates.txt    ← Persistent for "next 10" requests
    └── github_trending_state.json  ← Star velocity tracking
Key integration points:
- Scripts live in ~/.openclaw/workspace/scripts/, OpenClaw's standard location for agent-callable scripts
- Memory files live in ~/.openclaw/workspace/memory/ and persist across sessions
- The cron job uses sessionTarget: "isolated" so each scan gets a clean session (no context contamination)
- The agent model orchestrates the pipeline. Use openai-codex/gpt-5.3-codex or anthropic/claude-sonnet-4-6; the model must support tool execution in isolated sessions. The actual AI curation uses a 3-tier LLM failover chain (Gemini Flash Lite → Grok → Gemini Flash) via direct API calls.
- Delivery is handled by OpenClaw's channel system (Telegram, Slack, etc.). Use delivery.mode: announce with your channel and destination configured
Not using OpenClaw? The scripts work standalone too — just run ./news_scan_deduped.sh from a regular cron job or shell. The only OpenClaw-specific parts are the cron job setup and channel delivery.
┌─────────────────────────────────────────────────────────────────┐
│ news_scan_deduped.sh │
│ (Main Orchestrator) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ [1] RSS Feeds ──→ inline AI keyword filter (25 feeds) │
│ [2] Reddit JSON API ──→ fetch_reddit_news.py (13 subs) │
│ [3] Twitter/X ──→ scan_twitter_ai.sh (bird CLI) │
│ ──→ fetch_twitter_api.py (API search) │
│ [4] GitHub ──→ github_trending.py (trending+rel) │
│ [5] Tavily Web Search ──→ fetch_web_news.py (5 queries) │
│ │
│ All sources are best-effort — failures don't kill the pipeline │
│ │
├─────────────────────────────────────────────────────────────────┤
│ │
│ dedup_db.py → SQLite cross-scan dedup (URL + title) │
│ Persistent memory across all runs │
│ │
│ quality_score.py → Score + within-batch dedup (80%) │
│ + cross-scan dedup via SQLite │
│ Output: top 50 scored candidates │
│ │
│ enrich_top_articles.py → Fetch full text for top 8 articles │
│ CF Markdown preferred, HTML fallback │
│ │
│ llm_editor.py → 3-tier LLM failover chain │
│ Flash Lite → Grok (OpenRouter) → Flash │
│ Reads editorial_profile.md for guidance │
│ SQLite pre-filter before LLM call │
│ Output: up to 7 ranked picks (JSON) │
│ │
└─────────────────────────────────────────────────────────────────┘
- OpenClaw (v2026.2.23+) — the AI agent platform that runs the cron job
- Python 3.9+ — all scripts use stdlib only (no pip packages)
- blogwatcher: RSS feed scanner (brew install blogwatcher or equivalent)
| Key | Required? | Purpose | Free Tier |
|---|---|---|---|
| GEMINI_API_KEY | Yes | Gemini Flash Lite / Flash for LLM curation | Google AI Studio (generous free tier) |
| OPENROUTER_API_KEY | Recommended | Grok 4.1 Fast failover (via OpenRouter) | Pay-per-token (cheap) |
| GH_TOKEN | Recommended | GitHub API (5000 req/h vs 60/h unauthenticated) | GitHub personal access token (free) |
| TAVILY_API_KEY | Optional | Tavily web search for breaking news | 1000 queries/month free |
| TWITTERAPI_IO_KEY | Optional | twitterapi.io keyword search supplement | Paid (small monthly fee) |
- bird: Twitter/X CLI tool (for scan_twitter_ai.sh). Install: npm install -g @steipete/bird or brew install steipete/tap/bird (see bird.fast). If not installed, the Twitter bird CLI source is skipped gracefully.
Copy all scripts from the scripts/ directory to your OpenClaw workspace:
cp scripts/*.sh scripts/*.py ~/.openclaw/workspace/scripts/
chmod +x ~/.openclaw/workspace/scripts/news_scan_deduped.sh
chmod +x ~/.openclaw/workspace/scripts/filter_ai_news.sh
chmod +x ~/.openclaw/workspace/scripts/scan_twitter_ai.sh
Install blogwatcher and add your RSS feeds. Here's a recommended starter set:
# Wire services (Tier 1 — highest trust)
blogwatcher add "Reuters Tech" "https://www.reuters.com/technology/rss"
blogwatcher add "Axios AI" "https://api.axios.com/feed/top/technology"
# Tech press (Tier 2)
blogwatcher add "TechCrunch AI" "https://techcrunch.com/category/artificial-intelligence/feed/"
blogwatcher add "The Verge" "https://www.theverge.com/rss/ai-artificial-intelligence/index.xml"
blogwatcher add "THE DECODER" "https://the-decoder.com/feed/"
blogwatcher add "Ars Technica" "https://feeds.arstechnica.com/arstechnica/technology-lab"
blogwatcher add "VentureBeat AI" "https://venturebeat.com/category/ai/feed/"
blogwatcher add "Wired AI" "https://www.wired.com/feed/tag/ai/latest/rss"
blogwatcher add "MIT Tech Review" "https://www.technologyreview.com/feed/"
# AI company blogs (Tier 1-2)
blogwatcher add "OpenAI Blog" "https://openai.com/blog/rss.xml"
blogwatcher add "Google AI Blog" "https://blog.google/technology/ai/rss/"
blogwatcher add "Hugging Face Blog" "https://huggingface.co/blog/feed.xml"
# Bloggers & newsletters (Tier 2-3)
blogwatcher add "Simon Willison" "https://simonwillison.net/atom/everything/"
blogwatcher add "Bens Bites" "https://www.bensbites.com/feed"
Adjust the SOURCE_TIERS dictionary in filter_ai_news.sh to match your feed names exactly.
Copy and customize the editorial profile template:
mkdir -p ~/.openclaw/workspace/memory
cp config/editorial_profile_template.md ~/.openclaw/workspace/memory/editorial_profile.md
Edit ~/.openclaw/workspace/memory/editorial_profile.md to reflect your channel's editorial voice:
- What topics you always pick
- What you usually skip
- Your source trust ranking
- Story selection rules
This profile is read by the LLM editor on every scan and directly influences story selection.
Import your existing post history so the dedup system has context from day one:
cd ~/.openclaw/workspace/scripts
python3 dedup_db.py --seed
python3 dedup_db.py --stats
If this is a fresh install with no history, skip this step. The database will populate automatically as the pipeline runs.
Add API keys to your OpenClaw LaunchAgent plist (macOS):
# Add to ~/Library/LaunchAgents/ai.openclaw.gateway.plist under EnvironmentVariables:
# <key>GEMINI_API_KEY</key>
# <string>your-gemini-api-key</string>
# <key>OPENROUTER_API_KEY</key>
# <string>your-openrouter-key</string>
# <key>GH_TOKEN</key>
# <string>your-github-token</string>
# <key>TAVILY_API_KEY</key>
# <string>your-tavily-key</string>
# <key>TWITTERAPI_IO_KEY</key>
# <string>your-twitterapi-key</string>
# Then restart the gateway:
launchctl kickstart -k gui/$(id -u)/ai.openclaw.gateway
Or export them in your shell for testing:
export GEMINI_API_KEY="your-key"
export OPENROUTER_API_KEY="your-key"
export GH_TOKEN="your-token"
export TAVILY_API_KEY="your-key"
Add the news scan as an OpenClaw cron job:
openclaw cron add \
--name "Bi-Hourly News Scan" \
--cron "40 9,11,13,15,17,19,21 * * *" \
--message "Run the Gen AI news scanner: bash ~/.openclaw/workspace/scripts/news_scan_deduped.sh" \
--agent main \
--model "openai-codex/gpt-5.3-codex" \
--announce \
--channel telegram \
--to "<your-telegram-chat-id>" \
--timeout-seconds 400 \
--tz "America/New_York"
Schedule breakdown: Runs at :40 past the hour at 9am, 11am, 1pm, 3pm, 5pm, 7pm, 9pm. Adjust the hours and timezone to match your audience.
Model choice: The cron job model must support tool execution in isolated sessions — it needs to actually run the bash script, not just talk about it. Use openai-codex/gpt-5.3-codex or anthropic/claude-sonnet-4-6. Avoid Kimi K2.5 — it has a tool schema incompatibility in isolated sessions that silently prevents bash execution, causing it to hallucinate stories from training data instead of running the pipeline.
Output format: Cap your cron payload at 5 stories with summaries under 100 chars each. Telegram's Bot API has a hard 4096 character limit per message — exceeding it causes delivery to fail silently with no visible error to the user.
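A minimal sketch of how the output cap could be enforced before delivery. The function name and the story fields (title, summary, url) are assumptions for illustration and are not taken from the pipeline's code.

```python
TELEGRAM_LIMIT = 4096  # Bot API cap on the length of a text message


def build_message(stories, max_stories=5, max_summary=100, limit=TELEGRAM_LIMIT):
    """Format stories as a numbered list that stays under the Telegram limit."""
    lines = []
    for i, story in enumerate(stories[:max_stories], start=1):
        summary = story["summary"]
        if len(summary) > max_summary:
            # Truncate and mark the cut with an ellipsis, staying within max_summary
            summary = summary[: max_summary - 1].rstrip() + "…"
        lines.append(f"{i}. {story['title']}\n{summary}\n{story['url']}")
    message = "\n\n".join(lines)
    # Safety net: drop trailing stories until the whole message fits
    while len(message) > limit and lines:
        lines.pop()
        message = "\n\n".join(lines)
    return message
```

Dropping whole stories from the end, rather than cutting the message mid-story, keeps every delivered item readable.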
Run a manual test:
cd ~/.openclaw/workspace/scripts
./news_scan_deduped.sh --top 5
You should see output like:
═══════════════════════════════════════════════════════════
📡 [YOUR_CHANNEL_NAME] — News Scanner v2 (top 5)
═══════════════════════════════════════════════════════════
📰 [1/5] Scanning RSS feeds...
✅ Extracted 12 new RSS articles
🔴 [2/5] Scanning Reddit (JSON API)...
✅ Found 45 Reddit posts (score-filtered)
...
news_scan_deduped.sh is the master script that calls everything else in sequence. It collects articles from all 5 sources, pipes them through scoring, enrichment, and the LLM editor, and formats the output. All sources are best-effort: if one fails, the pipeline continues with what it has.
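The best-effort pattern can be sketched as follows. The real orchestrator is a bash script; this Python version only illustrates the idea, and run_source and collect are hypothetical names.

```python
import subprocess
import sys


def run_source(name, cmd, timeout=60):
    """Run one source command; return its non-empty stdout lines, or [] on any failure."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, permission problem, or timeout: skip this source
        print(f"WARN {name} skipped: {exc}", file=sys.stderr)
        return []
    if result.returncode != 0:
        print(f"WARN {name} exited with {result.returncode}", file=sys.stderr)
        return []
    return [line for line in result.stdout.splitlines() if line.strip()]


def collect(sources):
    """Gather output from every source, continuing past failures."""
    articles = []
    for name, cmd in sources:
        articles.extend(run_source(name, cmd))
    return articles
```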
filter_ai_news.sh reads articles from blogwatcher, filters by AI-related keywords (with word-boundary matching for short keywords like "AI" to avoid false positives), assigns source tiers, and filters out Reddit noise (questions, rants, memes).
Note: As of v2, the main orchestrator (news_scan_deduped.sh) handles AI keyword filtering inline during RSS extraction. This script still exists for standalone use or debugging, but is no longer called by the pipeline.
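To illustrate why word-boundary matching matters for short keywords, here is a small sketch. The keyword lists are made-up examples, SHORT_KEYWORDS is a hypothetical name, and matching short keywords case-sensitively is an assumption of this sketch rather than something the pipeline is known to do.

```python
import re

SHORT_KEYWORDS = ["AI", "LLM", "GPT"]  # hypothetical examples
LONG_KEYWORDS = ["machine learning", "openai", "anthropic"]  # hypothetical examples

# Short keywords must appear as whole words, so "AI" does not match inside "OpenAI"
SHORT_RE = re.compile(r"\b(?:" + "|".join(map(re.escape, SHORT_KEYWORDS)) + r")\b")


def is_ai_related(title):
    """Return True if the title contains a short keyword as a word or any long keyword."""
    if SHORT_RE.search(title):
        return True
    lowered = title.lower()
    return any(keyword in lowered for keyword in LONG_KEYWORDS)
```

A plain substring test for "ai" would match words such as "maintains" or "campaign", which is the class of false positive the word-boundary check avoids.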
fetch_reddit_news.py fetches posts from 13 AI-related subreddits using Reddit's public JSON API (no auth needed). Features:
- Per-subreddit score thresholds (30-50 upvotes minimum)
- Flair filtering for noisy subs (e.g., only "News" flair from r/technology)
- Noise filter (skips questions, rants, short titles)
- Concurrent fetching (3 workers)
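The filters above could be combined roughly like this. The config keys mirror the SUBREDDITS entry format, and title, score, and link_flair_text are fields of Reddit's listing JSON; the question prefixes and the 20-character minimum are assumptions chosen for illustration.

```python
QUESTION_STARTS = ("how ", "why ", "what ", "is ", "can ", "should ")  # hypothetical
MIN_TITLE_LENGTH = 20  # hypothetical cutoff for "short titles"


def keep_post(post, config):
    """Apply the score threshold, optional flair filter, and noise filter to one post."""
    title = post.get("title", "").strip()
    if post.get("score", 0) < config["min_score"]:
        return False
    flairs = config.get("flairs")  # flairs are optional
    if flairs and post.get("link_flair_text") not in flairs:
        return False
    if len(title) < MIN_TITLE_LENGTH:
        return False
    if title.endswith("?") or title.lower().startswith(QUESTION_STARTS):
        return False
    return True
```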
scan_twitter_ai.sh scans official AI company accounts, tech reporters/leakers, and CEO accounts using the bird CLI tool. It uses a three-tier account system:
- Tier 1: Official accounts (OpenAI, Anthropic, Google, etc.)
- Tier 2: Reporters and leakers (break news first)
- Tier 3: CEOs (context, not breaking news)
fetch_twitter_api.py supplements the bird CLI with keyword-based search via twitterapi.io. It uses engagement filtering (50+ likes or 5000+ followers) to cut noise and tags tweet-only stories (those with no external article URL).
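A sketch of the engagement filter and tweet-only tagging, assuming each tweet has already been flattened into a dict. The field names (likes, followers, urls, tweet_url) are hypothetical and do not reflect the twitterapi.io response format.

```python
from urllib.parse import urlparse

TWITTER_HOSTS = {"twitter.com", "x.com", "t.co"}


def passes_engagement(tweet, min_likes=50, min_followers=5000):
    """Keep tweets with enough likes OR from accounts with enough followers."""
    return tweet.get("likes", 0) >= min_likes or tweet.get("followers", 0) >= min_followers


def external_links(tweet):
    """Return links that point outside Twitter/X."""
    links = []
    for url in tweet.get("urls", []):
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host not in TWITTER_HOSTS:
            links.append(url)
    return links


def classify(tweet):
    """Return a candidate dict, or None if the tweet fails the engagement filter."""
    if not passes_engagement(tweet):
        return None
    links = external_links(tweet)
    return {
        "text": tweet["text"],
        "url": links[0] if links else tweet["tweet_url"],
        "tweet_only": not links,  # no external article to enrich later
    }
```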
github_trending.py uses three strategies:
- Emerging: Repos created in the last 7 days with 50+ stars
- Velocity: Established repos (1000+ stars) gaining traction fast
- Releases: New releases from 16 key AI repos (Anthropic SDK, OpenAI SDK, Ollama, etc.)
Maintains state between runs to calculate star velocity.
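Star velocity needs a previous observation to compare against, which is why state is kept between runs. The sketch below assumes a simple JSON state shape ({repo: {"stars", "ts"}}); the actual layout of github_trending_state.json may differ.

```python
import json
import time
from pathlib import Path


def load_state(path):
    """Load the previous observations, or start fresh if the file is missing or corrupt."""
    try:
        return json.loads(Path(path).read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def save_state(path, state):
    Path(path).write_text(json.dumps(state, indent=2))


def star_velocity(repo, stars, state, now=None):
    """Return stars gained per day since the last run, or None on first sight."""
    now = time.time() if now is None else now
    previous = state.get(repo)
    state[repo] = {"stars": stars, "ts": now}  # remember this observation for next run
    if previous is None:
        return None
    elapsed_days = (now - previous["ts"]) / 86400
    if elapsed_days <= 0:
        return None
    return (stars - previous["stars"]) / elapsed_days
```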
fetch_web_news.py catches breaking news that RSS feeds miss, using Tavily web search. It runs 5 focused queries with a 2-day freshness filter, skips domains already covered by RSS (Reddit, Twitter, GitHub, YouTube, arxiv), and filters out homepage URLs.
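The domain and homepage filters might look like the following sketch. The SKIP_DOMAINS set is inferred from the domains listed above, and keep_result is a hypothetical name.

```python
from urllib.parse import urlparse

SKIP_DOMAINS = {"reddit.com", "twitter.com", "x.com", "github.com", "youtube.com", "arxiv.org"}


def keep_result(url):
    """Drop results from skipped domains (including subdomains) and bare homepages."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if not host:
        return False
    if any(host == domain or host.endswith("." + domain) for domain in SKIP_DOMAINS):
        return False
    if parsed.path.strip("/") == "":
        return False  # homepage URL: no article path
    return True
```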
dedup_db.py provides persistent dedup memory shared across all pipeline runs. It stores normalized URLs and titles from every scan in ~/.openclaw/workspace/memory/news_dedup.db. Features:
- URL normalization (strips query params, fragments, www prefix, trailing punctuation)
- Title similarity matching (75% threshold via SequenceMatcher, 2-day window)
- Bulk check API for efficient pre-filtering
- CLI for seeding from historical logs, checking URLs/titles, and viewing stats
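A compact sketch of the two checks described above: exact match on a normalized URL, then fuzzy title match within a 2-day window. The table schema and function names are assumptions, as is the choice to keep the URL scheme during normalization; SequenceMatcher comes from Python's difflib.

```python
import re
import sqlite3
import time
from difflib import SequenceMatcher
from urllib.parse import urlsplit


def normalize_url(url):
    """Strip trailing punctuation, query, fragment, www prefix, and trailing slash."""
    parts = urlsplit(url.strip().rstrip(".,;:!?)\"'"))
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return f"{parts.scheme}://{host}{parts.path.rstrip('/')}"


def normalize_title(title):
    return re.sub(r"[^a-z0-9 ]+", "", title.lower()).strip()


def is_similar(a, b, threshold=0.75):
    return SequenceMatcher(None, normalize_title(a), normalize_title(b)).ratio() >= threshold


def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY, title TEXT, ts REAL)")
    return db


def record(db, url, title, now=None):
    now = time.time() if now is None else now
    db.execute("INSERT OR IGNORE INTO seen VALUES (?, ?, ?)", (normalize_url(url), title, now))
    db.commit()


def is_duplicate(db, url, title, now=None, window_days=2):
    """True if the URL was seen before, or a similar title was seen within the window."""
    now = time.time() if now is None else now
    if db.execute("SELECT 1 FROM seen WHERE url = ?", (normalize_url(url),)).fetchone():
        return True
    cutoff = now - window_days * 86400
    rows = db.execute("SELECT title FROM seen WHERE ts >= ?", (cutoff,))
    return any(is_similar(title, seen_title) for (seen_title,) in rows)
```

Limiting the title comparison to a recent window keeps the fuzzy check cheap and avoids suppressing a genuinely new story that happens to resemble an old headline.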
quality_score.py scores every article based on:
- Source tier (wire services get +5, tech press +3, etc.)
- High-value keywords (acquisitions, billion, launch, security, etc.)
- Breaking news signals (exclusive, confirmed, first look, etc.)
- Title quality (length heuristic)
Two-stage dedup: within-batch similarity (80% threshold) followed by cross-scan dedup against the SQLite database. Outputs top 50.
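The scoring and within-batch dedup stage can be sketched as below. Only the +5 and +3 tier bonuses and the 80% threshold come from the description above; the tier 3 bonus, the keyword weights, and the title-length range are assumptions, and the cross-scan SQLite stage is omitted.

```python
from difflib import SequenceMatcher

TIER_BONUS = {1: 5, 2: 3, 3: 1}  # tier 3 value is an assumption
HIGH_VALUE = ("acquisition", "billion", "launch", "security")
BREAKING = ("exclusive", "confirmed", "first look")


def score(article):
    """Sum tier bonus, keyword hits, and a simple title-length heuristic."""
    title = article["title"].lower()
    points = TIER_BONUS.get(article.get("tier", 3), 0)
    points += 2 * sum(keyword in title for keyword in HIGH_VALUE)  # assumed weight
    points += 3 * sum(keyword in title for keyword in BREAKING)  # assumed weight
    if 40 <= len(title) <= 120:  # assumed "good headline" length range
        points += 1
    return points


def dedup_batch(articles, threshold=0.80):
    """Keep the highest-scoring article from each group of near-identical titles."""
    kept = []
    for article in sorted(articles, key=score, reverse=True):
        title = article["title"].lower()
        if all(SequenceMatcher(None, title, k["title"].lower()).ratio() < threshold for k in kept):
            kept.append(article)
    return kept


def top_candidates(articles, limit=50):
    return dedup_batch(articles)[:limit]
```

Sorting by score before deduplicating means that when two sources report the same story, the copy from the more trusted source is the one that survives.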
enrich_top_articles.py fetches full article text for the top 8 scored articles. It tries Cloudflare Markdown for Agents first (clean markdown) and falls back to HTML extraction. Paywalled sites are skipped, and each article is capped at 1200 characters.
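A rough sketch of the markdown-first, HTML-fallback fetch. It assumes that Markdown for Agents is negotiated by sending an Accept: text/markdown header and checking the response content type; the function names, the User-Agent string, and the list of skipped tags are illustrative, and paywall detection is left out.

```python
import urllib.request
from html.parser import HTMLParser

MAX_CHARS = 1200


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring script, style, and page chrome."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


def fetch_article_text(url, timeout=10):
    """Ask for markdown first; fall back to extracting text from HTML."""
    request = urllib.request.Request(url, headers={
        "Accept": "text/markdown, text/html;q=0.8",
        "User-Agent": "news-scan-sketch/0.1",  # hypothetical UA string
    })
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get_content_type()
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
    text = body if content_type == "text/markdown" else html_to_text(body)
    return text[:MAX_CHARS]
```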
llm_editor.py is the AI brain of the pipeline. It sends all scored candidates, the editorial profile, and recent post history to a 3-tier LLM failover chain. The LLM selects up to 7 stories, ranks them, assigns categories, and writes 1-sentence summaries.
Features:
- 3-tier failover chain: Gemini 3.1 Flash Lite → Grok 4.1 Fast (OpenRouter) → Gemini 3 Flash Preview. The chain alternates providers (Google, OpenRouter, Google) so that a single provider outage does not take out two consecutive tiers.
- SQLite pre-filter (skips already-seen URLs and similar titles before calling the LLM)
- Editorial profile integration (learns your preferences over time)
- Structured JSON output with validation and robust parsing (handles markdown fences, dict wrappers, etc.)
- Records all picks to the SQLite dedup database after selection
- Logs all presented stories to scanner_presented.md
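The failover and robust-parsing behaviour listed above can be sketched together. The chain entries use placeholder model names (not real model IDs), the callers mapping stands in for the actual Gemini and OpenRouter API calls, and treating an unparseable reply as a failure that moves on to the next tier is an assumption of this sketch.

```python
import json
import os
import re

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this example readable

# Placeholder entries: model names are illustrative only
FAILOVER_CHAIN = [
    {"model": "primary-flash-lite", "api": "gemini", "key_env": "GEMINI_API_KEY", "timeout": 30},
    {"model": "fallback-grok", "api": "openrouter", "key_env": "OPENROUTER_API_KEY", "timeout": 45},
    {"model": "last-resort-flash", "api": "gemini", "key_env": "GEMINI_API_KEY", "timeout": 60},
]


def parse_picks(raw, max_picks=7):
    """Extract a list of pick dicts from a raw LLM reply."""
    text = raw.strip()
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()  # reply was wrapped in a markdown fence
    data = json.loads(text)  # raises ValueError on invalid JSON
    if isinstance(data, dict):
        # Unwrap replies shaped like {"picks": [...]}
        data = next((v for v in data.values() if isinstance(v, list)), None)
    if not isinstance(data, list):
        raise ValueError("reply does not contain a list of picks")
    picks = [p for p in data if isinstance(p, dict) and p.get("title") and p.get("url")]
    return picks[:max_picks]


def curate(prompt, callers, chain=FAILOVER_CHAIN, env=None):
    """Try each provider in order; the first one that returns parseable picks wins."""
    env = os.environ if env is None else env
    errors = []
    for entry in chain:
        key = env.get(entry["key_env"])
        if not key:
            errors.append(f"{entry['model']}: {entry['key_env']} not set")
            continue
        try:
            raw = callers[entry["api"]](entry["model"], key, prompt, entry["timeout"])
            return parse_picks(raw)
        except Exception as exc:  # network error, timeout, or unparseable reply
            errors.append(f"{entry['model']}: {exc}")
    raise RuntimeError("All LLM providers failed: " + "; ".join(errors))
```

Each callable in callers takes (model, api_key, prompt, timeout) and returns the raw reply text, which keeps the failover logic independent of any one provider's HTTP details.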
update_editorial_profile.py runs nightly. It analyzes your approval/rejection patterns and updates the editorial profile's stats section. It also identifies "blind spots": topics you manually seek out but the scanner doesn't catch.
- Add the feed to blogwatcher: blogwatcher add "Feed Name" "https://feed-url/rss"
- Add the feed name to SOURCE_TIERS in filter_ai_news.sh with the appropriate tier (1-3)
- Add any new keywords to the LONG_KEYWORDS list if needed
Edit the SUBREDDITS list in fetch_reddit_news.py:
{"sub": "YourSubreddit", "sort": "hot", "limit": 25, "min_score": 30,
 "flairs": ["News", "Discussion"]}, # flairs are optional
Edit the account arrays in scan_twitter_ai.sh:
- OFFICIAL_ACCOUNTS: for company accounts
- REPORTER_ACCOUNTS: for journalists and leakers
- CEO_ACCOUNTS: for thought leaders
Add to the RELEASE_REPOS list in github_trending.py:
"owner/repo-name",
Edit the FAILOVER_CHAIN list in llm_editor.py. Each entry specifies a model name, API type (gemini or openrouter), environment variable for the API key, and timeout. The chain is tried in order, and the first provider that responds wins.
Edit the cron expression:
openclaw cron edit <job-id> --cron "0 */3 * * *" # every 3 hours
openclaw-news-scan/
├── README.md                          # This file
├── CHANGELOG.md                       # Version history and migration guide
├── scripts/
│   ├── news_scan_deduped.sh           # Main orchestrator (inline AI filter)
│   ├── dedup_db.py                    # SQLite cross-scan dedup database
│   ├── quality_score.py               # Scoring + two-stage dedup
│   ├── enrich_top_articles.py         # Full text fetcher
│   ├── llm_editor.py                  # LLM curation (3-tier failover)
│   ├── filter_ai_news.sh              # RSS keyword filter (standalone)
│   ├── fetch_reddit_news.py           # Reddit JSON API
│   ├── scan_twitter_ai.sh             # Twitter bird CLI
│   ├── fetch_twitter_api.py           # twitterapi.io search
│   ├── github_trending.py             # GitHub trending + releases
│   ├── fetch_web_news.py              # Tavily web search
│   ├── update_editorial_profile.py    # Editorial profile updater
│   └── test_components.py             # Unit tests (68 tests)
└── config/
    └── editorial_profile_template.md  # Template: customize for your channel
RSS (25 feeds) ─────────┐                                                       ┌─ Gemini Flash Lite
Reddit (13 subs) ───────┤    AI keyword     quality_score.py     enrich_top     │
Twitter (bird + API) ───┤──→ pre-filter ──→ + dedup_db.py    ──→ articles   ──→ ├─ Grok (OpenRouter) ──→ Output
GitHub (trending+rel) ──┤    (inline)       (score + dedup)      (max 8)        │  (failover chain)
Tavily (5 queries) ─────┘                   (max 50)                            └─ Gemini Flash Preview
Typical run: ~100 raw → ~50 after AI filter → 50 scored → 8 enriched → 3-7 curated picks
| Component | Monthly Cost | Notes |
|---|---|---|
| Gemini Flash Lite API | ~$1-2/month | Primary LLM — ~7 calls/day, ~30K tokens each |
| OpenRouter (Grok failover) | ~$0-1/month | Only used when Gemini fails |
| Tavily API | Free | 1000 queries/month free tier covers it |
| GitHub API | Free | Personal access token, 5000 req/h |
| twitterapi.io | ~$10/month | Optional — bird CLI is free |
| OpenClaw cron model | Varies | Depends on your model choice |
| Total | ~$5/month | Without twitterapi.io |
| Issue | Fix |
|---|---|
| "GEMINI_API_KEY not set" | Add to LaunchAgent plist or export in shell. Pipeline warns but continues (failover may use OpenRouter). |
| Reddit 429 (rate limit) | Normal with 2h spacing. Reduce subreddits or increase --hours |
| Reddit 404 on a sub | Sub may be private/quarantined. Remove from config. |
| bird CLI not found | Install bird or remove scan_twitter_ai.sh call |
| "No new stories found" | RSS feeds may all be read. Wait for new articles. |
| All LLM providers failed | Check that GEMINI_API_KEY and/or OPENROUTER_API_KEY are set. The pipeline saves candidates to a file for manual re-run. |
| LLM editor timeout | Increase timeout values in the FAILOVER_CHAIN in llm_editor.py |
| Pipeline takes too long | Increase cron timeout: openclaw cron edit <id> --timeout-seconds 400 |
| Cron delivers raw API errors or hallucinated stories | Your cron model doesn't support tool execution in isolated sessions. Switch to openai-codex/gpt-5.3-codex or anthropic/claude-sonnet-4-6. Kimi K2.5 has a known tool schema incompatibility in isolated sessions. |
| Telegram "message is too long" / delivery silently fails | Story list exceeds Telegram's 4096 char limit. Cap output at 5 stories with summaries under 100 chars each. |
| --force flag invalid (OpenClaw v2026.3.2+) | The flag was removed. openclaw cron run <id> now runs immediately by default, so no flag is needed. |
| Stories are old / from years ago | The pipeline script didn't run — the model hallucinated from training data. Check your cron model (see above) and verify the script path is correct. |
| GitHub rate limit | Set GH_TOKEN env var for 5000 req/h (vs 60/h) |
| Duplicate stories | SQLite dedup handles this automatically. Run python3 dedup_db.py --seed to import historical posts. Check DB status: python3 dedup_db.py --stats |
| Non-AI articles leaking | The inline AI keyword filter should catch these. Check the keyword patterns in news_scan_deduped.sh and add missing terms. |
The system learns from your editorial decisions:
- During the day: The scanner presents picks. You approve or skip them.
- At night: update_editorial_profile.py analyzes your patterns.
To log decisions, create ~/.openclaw/workspace/memory/editorial_decisions.md:
[2026-03-01T10:00:00-05:00] APPROVED | Story Title Here | https://url | category
[2026-03-01T10:00:00-05:00] SKIPPED | Another Story | https://url | category
[2026-03-01T14:00:00-05:00] MANUAL_DRAFT | Story I Found Myself | https://url | category
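For reference, a log in this format can be parsed with a short script like the one below. This is a sketch of one way to read the file, not the logic of update_editorial_profile.py; parse_decisions and summarize are hypothetical names.

```python
import re
from collections import Counter
from datetime import datetime

LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<action>[A-Z_]+)\s+\|\s+(?P<title>.*?)"
    r"\s+\|\s+(?P<url>\S+)\s+\|\s+(?P<category>.+?)\s*$"
)


def parse_decisions(text):
    """Parse log lines into dicts, silently skipping lines that do not match."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        row = match.groupdict()
        row["ts"] = datetime.fromisoformat(row["ts"])  # keeps the UTC offset
        rows.append(row)
    return rows


def summarize(rows):
    """Count actions per category, e.g. to estimate approval rates."""
    stats = {}
    for row in rows:
        stats.setdefault(row["category"], Counter())[row["action"]] += 1
    return stats
```

MANUAL_DRAFT entries are counted separately from APPROVED and SKIPPED, since they mark stories the scanner did not surface.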
Built by Jacob Ben David with OpenClaw, Gemini Flash, and a collection of free/low-cost APIs.
Inspired by the tech-news-digest ClawHub skill (v3.14.0 by dinstein).
MIT — use it however you want. If you build something cool with it, let me know!
