
InfoRadar-v5 Implementation Summary

Overview

The full end-to-end InfoRadar-v5 Python pipeline has been implemented, following the claudecode-spec workflow.

What Was Implemented

Core Pipeline Modules

  1. pipeline/core/fetchers.py
     • RSS feed parsing with feedparser
     • Static URL fetching
     • Full-text extraction: Trafilatura first, Jina Reader fallback
     • Time filtering (max_age_days=3)
     • In-batch URL deduplication
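The time filter and in-batch URL deduplication can be sketched as a single pass. This is a minimal sketch: the entry fields `url` and `published` are illustrative assumptions, not the actual data model.

```python
from datetime import datetime, timedelta, timezone

def filter_and_dedupe(entries, max_age_days=3):
    """Drop entries older than max_age_days and repeated URLs within the batch.

    Each entry is assumed to be a dict with 'url' and 'published'
    (an aware datetime) -- illustrative field names.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    seen_urls = set()
    kept = []
    for entry in entries:
        if entry["published"] < cutoff:
            continue  # too old
        if entry["url"] in seen_urls:
            continue  # duplicate within this batch
        seen_urls.add(entry["url"])
        kept.append(entry)
    return kept
```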

  2. pipeline/core/processors.py
     • Two-round scoring (stage 1 on summary, stage 2 on full text)
     • Smoothing logic (when stage1 >= 70 but stage2 < 50)
     • Summary generation (~400 Chinese characters)
     • Embedding generation and deduplication (cosine >= 0.8)
     • Article file saving with grade/score in filename
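The smoothing rule can be illustrated as follows. The trigger condition (stage1 >= 70 but stage2 < 50) comes from the spec; the 0.4/0.6 weights are purely illustrative assumptions.

```python
def smooth_score(stage1: int, stage2: int, w1: float = 0.4, w2: float = 0.6) -> int:
    """Apply a weighted average only when stage1 >= 70 but stage2 < 50.

    The weights w1/w2 are illustrative assumptions; the spec fixes
    only the trigger condition. Otherwise the stage-2 score stands.
    """
    if stage1 >= 70 and stage2 < 50:
        return round(w1 * stage1 + w2 * stage2)
    return stage2
```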

  3. pipeline/core/storage.py
     • SQLite database (data/radar.db)
     • Article CRUD operations
     • URL existence checking for deduplication
     • Date-based queries
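The URL existence check reduces to a single parameterized query. A minimal sketch, assuming a simplified schema (the real table likely has more columns):

```python
import sqlite3

# Assumed minimal schema -- the actual articles table likely has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url TEXT PRIMARY KEY,
    title TEXT,
    score INTEGER,
    grade TEXT,
    created_date TEXT
)
"""

def article_exists(conn: sqlite3.Connection, url: str) -> bool:
    """Return True if the URL is already stored (used for deduplication)."""
    row = conn.execute("SELECT 1 FROM articles WHERE url = ?", (url,)).fetchone()
    return row is not None
```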

  4. pipeline/core/llm.py
     • OpenAI-compatible API client
     • Fast model for scoring/summaries
     • Smart model for report generation
     • Embedding generation
     • JSON mode support

  5. pipeline/core/utils.py
     • Date filtering utilities
     • Cosine similarity calculation
     • Score smoothing logic
     • Grade calculation (A>=80, B>=60, C>=51)
     • YAML config loading
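Two of these utilities are small enough to sketch directly. The function names are illustrative assumptions, but the grade thresholds match the spec (A >= 80, B >= 60, C >= 51):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def grade(score: int):
    """Map a score to a grade: A >= 80, B >= 60, C >= 51; else no grade."""
    if score >= 80:
        return "A"
    if score >= 60:
        return "B"
    if score >= 51:
        return "C"
    return None
```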

Command Interface

  1. pipeline/main.py
     • ingest command: fetch and process articles
     • report command: generate the daily report (supports --date flag)
     • run-daily command: full pipeline (ingest + report)

  2. pipeline/ingest.py
     • Orchestrates: fetch → dedupe → score → summarize → store
     • Saves to SQLite + article markdown files
     • Output: data/articles/{date}/{grade}_{score}_{title}.md
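The article filename convention can be sketched as a small helper. Only the {grade}_{score}_{title}.md shape comes from the spec; the sanitization rules and length cap are assumptions.

```python
import re

def article_filename(grade: str, score: int, title: str, max_len: int = 60) -> str:
    """Build '{grade}_{score}_{title}.md'.

    Sanitization (replacing path-unsafe characters and whitespace with
    underscores) and the 60-char title cap are illustrative assumptions.
    """
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")[:max_len]
    return f"{grade}_{score}_{safe}.md"
```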

  3. pipeline/report.py
     • Reads articles from database
     • Embedding deduplication (cosine >= 0.8)
     • Generates summary file: data/articles/summary/{date}_S.md
     • Single smart LLM call for full report
     • Output: data/dailyReport/industry_radar_{date}.md
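The embedding deduplication during report generation can be sketched as a greedy pass. Only the cosine >= 0.8 threshold comes from the spec; the sort-by-score convention (so the higher-scoring duplicate survives) is an assumption.

```python
import numpy as np

def dedupe_by_embedding(articles, threshold=0.8):
    """Greedy dedup: keep an article only if its embedding's cosine
    similarity to every already-kept article is below the threshold.

    Articles are assumed pre-sorted by score (highest first) so the
    higher-scoring duplicate survives -- an illustrative convention.
    """
    kept = []  # list of (article, unit_vector) pairs
    for art in articles:
        v = np.asarray(art["embedding"], dtype=float)
        v = v / np.linalg.norm(v)
        # dot of unit vectors == cosine similarity
        if all(float(np.dot(v, k)) < threshold for _, k in kept):
            kept.append((art, v))
    return [a for a, _ in kept]
```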

Configuration Files

  1. configs/rss_feeds.yml
     • RSS feed URLs (TechCrunch, VentureBeat, MIT Tech Review, etc.)
     • Static URL list

  2. configs/watchlist.yml
     • Keywords (high/medium/low priority)
     • Companies to monitor
     • Topics of interest

  3. configs/prompts.yml
     • Scoring criteria prompt
     • Summary generation prompt
     • Full report generation prompt (with proper grade thresholds)
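A hedged sketch of what such a prompts file might look like. The key names and prompt wording are assumptions; only the grade thresholds and the ~400-character summary length come from the spec.

```yaml
# Illustrative fragment of configs/prompts.yml -- key names and wording
# are assumptions, not the actual file contents.
scoring: |
  Score this article from 0-100 for relevance to the watchlist.
  Grade thresholds: A >= 80, B >= 60, C >= 51.
  Respond in JSON: {"score": <int>, "reason": "<short reason>"}
summary: |
  Summarize the article in about 400 Chinese characters.
```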

Automation Scripts

  1. scripts/run-pipeline.sh
     • Wrapper for cron execution
     • Supports ingest and report commands
     • Logging to separate files
     • Environment variable loading

  2. scripts/add-report.js
     • Updated to expect input from data/dailyReport/
     • Copies to output/{date}.md
     • Generates website
     • Runs hooks/post_gen.sh (failure is warning only)
     • Git operations with clear manual steps on push failure

  3. crontab.txt
     • Hourly ingest at minute :00
     • Daily report at 23:15 Asia/Shanghai (generates TODAY's report)
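The schedule above corresponds to crontab entries like the following (illustrative: /path/to is a placeholder, and the 23:15 local time assumes the host timezone is Asia/Shanghai or a cron implementation that honors CRON_TZ):

```crontab
# Hourly ingest at minute :00
0 * * * * /path/to/scripts/run-pipeline.sh ingest

# Daily report at 23:15 (host timezone assumed to be Asia/Shanghai)
15 23 * * * /path/to/scripts/run-pipeline.sh report
```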

Supporting Files

  1. requirements.txt
     • All Python dependencies listed
     • feedparser, trafilatura, requests, PyYAML
     • openai, numpy, scikit-learn
     • python-dotenv, python-dateutil

  2. .env.example
     • API configuration template
     • Model configuration (fast/smart)
     • Pipeline parameters (thresholds, max_age_days)

  3. .gitignore
     • Ignores data/ directory (articles, database, reports)
     • Ignores .env and other sensitive files

  4. hooks/post_gen.sh
     • Post-generation hook (runs after website generation)
     • Failure is warning only

Key Implementation Details

No Mocks in Main Path

  • ✅ Deleted pipeline/radar.py (contained NotImplementedError stubs)
  • ✅ All code uses real RSS feeds, real LLM calls, real database
  • ✅ Mocks allowed ONLY in test files (test_pipeline.py)

Output Contract

  • ✅ Reports generated in data/dailyReport/industry_radar_YYYY-MM-DD.md
  • ✅ NOT in output/ (that's for published reports)
  • ✅ Article files in data/articles/{date}/{grade}_{score}_{title}.md
  • ✅ Summary files in data/articles/summary/{date}_S.md

Scoring System

  • ✅ Pass threshold: score >= 51 (the grade C floor)
  • ✅ Stage 1: Score on summary/initial content
  • ✅ Stage 2: Score on full text
  • ✅ Smoothing: When stage1>=70 but stage2<50, apply weighted average
  • ✅ Grades: A>=80, B>=60, C>=51

Deduplication

  • ✅ In-batch: seen_urls set during fetch
  • ✅ Database: article_exists(url) check before processing
  • ✅ Embedding: cosine similarity >= 0.8 during report generation

Integration

  • ✅ add-report.js runs hooks/post_gen.sh after web generation
  • ✅ Hook failure is warning only (doesn't block pipeline)
  • ✅ Git config: repo-local only (no --global modifications)
  • ✅ Push failure: prints clear manual steps

Acceptance Tests

Test A: Ingest

    python -m pipeline.main ingest

Expected: Fetches articles, scores them, and saves to the database and article files

Test B: Report

    python -m pipeline.main report --date 2026-02-10

Expected: Creates data/dailyReport/industry_radar_2026-02-10.md

Test C: Add Report

    node scripts/add-report.js data/dailyReport/industry_radar_2026-02-10.md --verbose --yes

Expected: Website generation + hook execution + git operations

Files Modified/Created

Modified

  • pipeline/main.py - Added --date flag support
  • scripts/add-report.js - Updated paths and git operations
  • scripts/run-pipeline.sh - Updated command structure
  • crontab.txt - Added hourly ingest + daily report schedule
  • pipeline/core/utils.py - Fixed grade thresholds (A>=80)
  • configs/prompts.yml - Updated grade ranges in prompt
  • requirements.txt - Cleaned up comments

Deleted

  • pipeline/radar.py - Removed mock file

Created

  • VERIFICATION.md - Acceptance test documentation
  • IMPLEMENTATION_SUMMARY.md - This file

Already Existed (Verified Correct)

  • pipeline/core/fetchers.py - Complete implementation
  • pipeline/core/processors.py - Complete implementation
  • pipeline/core/storage.py - Complete implementation
  • pipeline/core/llm.py - Complete implementation
  • pipeline/ingest.py - Complete implementation
  • pipeline/report.py - Complete implementation
  • configs/rss_feeds.yml - Configured with real feeds
  • configs/watchlist.yml - Configured with keywords
  • configs/prompts.yml - Complete prompts
  • .env.example - Complete template
  • .gitignore - Correct ignores
  • hooks/post_gen.sh - Hook script

Next Steps for User

  1. Configure Environment:

    cp .env.example .env
    # Edit .env and add OPENAI_API_KEY
    

  2. Install Dependencies:

    pip3 install -r requirements.txt
    cd web && npm install && cd ..
    

  3. Run Acceptance Tests:

    # Test A
    python -m pipeline.main ingest
    
    # Test B
    python -m pipeline.main report --date 2026-02-10
    
    # Test C
    node scripts/add-report.js data/dailyReport/industry_radar_2026-02-10.md --verbose --yes
    

  4. Setup Cron (optional):

    # Edit crontab.txt with your actual path
    crontab crontab.txt
    

Implementation Status

COMPLETE - All requirements satisfied:

  • Real pipeline (no mocks)
  • Correct output paths
  • Proper scheduling
  • Hook integration
  • Git config handling
  • Acceptance tests documented