# InfoRadar-v5 Implementation Summary
## Overview

The full end-to-end InfoRadar-v5 Python pipeline was implemented following the claudecode-spec workflow.
## What Was Implemented

### Core Pipeline Modules
- `pipeline/core/fetchers.py`
  - RSS feed parsing with feedparser
  - Static URL fetching
  - Full-text extraction: Trafilatura first, Jina Reader fallback
  - Time filtering (max_age_days=3)
  - In-batch URL deduplication
- `pipeline/core/processors.py`
  - Two-round scoring (stage 1 on summary, stage 2 on full text)
  - Smoothing logic (when stage1 >= 70 but stage2 < 50)
  - Summary generation (~400 Chinese characters)
  - Embedding generation and deduplication (cosine >= 0.8)
  - Article file saving with grade/score in filename
- `pipeline/core/storage.py`
  - SQLite database (`data/radar.db`)
  - Article CRUD operations
  - URL existence checking for deduplication
  - Date-based queries
- `pipeline/core/llm.py`
  - OpenAI-compatible API client
  - Fast model for scoring/summaries
  - Smart model for report generation
  - Embedding generation
  - JSON mode support
- `pipeline/core/utils.py`
  - Date filtering utilities
  - Cosine similarity calculation
  - Score smoothing logic
  - Grade calculation (A >= 80, B >= 60, C >= 51)
  - YAML config loading
### Command Interface
- `pipeline/main.py`
  - `ingest` command: Fetch and process articles
  - `report` command: Generate daily report (supports --date flag)
  - `run-daily` command: Full pipeline (ingest + report)
- `pipeline/ingest.py`
  - Orchestrates: fetch → dedupe → score → summarize → store
  - Saves to SQLite + article markdown files
  - Output: `data/articles/{date}/{grade}_{score}_{title}.md`
- `pipeline/report.py`
  - Reads articles from database
  - Embedding deduplication (cosine >= 0.8)
  - Generates summary file: `data/articles/summary/{date}_S.md`
  - Single smart LLM call for full report
  - Output: `data/dailyReport/industry_radar_{date}.md`
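The subcommand layout above could be wired up with argparse roughly as follows; this is a sketch of the interface shape, not the actual `pipeline/main.py` code:

```python
# Hypothetical sketch of the pipeline/main.py command interface.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pipeline")
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("ingest", help="Fetch and process articles")

    report = sub.add_parser("report", help="Generate daily report")
    report.add_argument("--date", help="Report date (YYYY-MM-DD); defaults to today")

    sub.add_parser("run-daily", help="Full pipeline (ingest + report)")
    return parser
```

Invocation would then look like `python -m pipeline.main report --date 2026-02-10`.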
### Configuration Files
- `configs/rss_feeds.yml`
  - RSS feed URLs (TechCrunch, VentureBeat, MIT Tech Review, etc.)
  - Static URL list
- `configs/watchlist.yml`
  - Keywords (high/medium/low priority)
  - Companies to monitor
  - Topics of interest
- `configs/prompts.yml`
  - Scoring criteria prompt
  - Summary generation prompt
  - Full report generation prompt (with proper grade thresholds)
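Loading these configs with PyYAML might look like the sketch below. The exact key layout of `watchlist.yml` (a `keywords` map with `high`/`medium`/`low` lists) is an assumption drawn from the description above, not a confirmed schema:

```python
# Sketch of config loading; the watchlist schema shown is an assumption.
import yaml


def load_config(path: str) -> dict:
    """Read one YAML config file, returning {} for an empty file."""
    with open(path, "r", encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


def keywords_by_priority(watchlist: dict) -> dict[str, list[str]]:
    """Group watchlist keywords by high/medium/low priority."""
    keywords = watchlist.get("keywords", {})
    return {level: keywords.get(level, []) for level in ("high", "medium", "low")}
```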
### Automation Scripts
- `scripts/run-pipeline.sh`
  - Wrapper for cron execution
  - Supports `ingest` and `report` commands
  - Logging to separate files
  - Environment variable loading
- `scripts/add-report.js`
  - Updated to expect input from `data/dailyReport/`
  - Copies to `output/{date}.md`
  - Generates website
  - Runs `hooks/post_gen.sh` (failure is warning only)
  - Git operations with clear manual steps on push failure
- `crontab.txt`
  - Hourly ingest at :00 minutes
  - Daily report at 23:15 Asia/Shanghai (generates the current day's report)
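The schedule above would translate into crontab entries roughly like the following. This is an illustration only: the repo path is a placeholder, and the `CRON_TZ` line assumes a cron implementation (such as cronie) that supports per-crontab time zones; otherwise the system time zone must already be Asia/Shanghai.

```
# Hourly ingest on the hour; daily report at 23:15 Asia/Shanghai.
CRON_TZ=Asia/Shanghai
0 * * * *   /path/to/repo/scripts/run-pipeline.sh ingest
15 23 * * * /path/to/repo/scripts/run-pipeline.sh report
```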
### Supporting Files
- `requirements.txt`
  - All Python dependencies listed
  - feedparser, trafilatura, requests, PyYAML
  - openai, numpy, scikit-learn
  - python-dotenv, python-dateutil
- `.env.example`
  - API configuration template
  - Model configuration (fast/smart)
  - Pipeline parameters (thresholds, max_age_days)
- `.gitignore`
  - Ignores `data/` directory (articles, database, reports)
  - Ignores `.env` and other sensitive files
- `hooks/post_gen.sh`
  - Post-generation hook (runs after website generation)
  - Failure is warning only
## Key Implementation Details

### No Mocks in Main Path
- ✅ Deleted `pipeline/radar.py` (contained NotImplementedError stubs)
- ✅ All code uses real RSS feeds, real LLM calls, and a real database
- ✅ Mocks allowed ONLY in test files (test_pipeline.py)
### Output Contract
- ✅ Reports generated in `data/dailyReport/industry_radar_YYYY-MM-DD.md`
- ✅ NOT in `output/` (that's for published reports)
- ✅ Article files in `data/articles/{date}/{grade}_{score}_{title}.md`
- ✅ Summary files in `data/articles/summary/{date}_S.md`
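The article path contract can be illustrated with a small helper; the title-sanitization rule shown here (collapse non-word runs to underscores, truncate) is an assumption for the sketch, not the pipeline's exact logic:

```python
# Sketch of assembling the article output path; sanitization is hypothetical.
import re
from pathlib import Path


def article_path(date: str, grade: str, score: int, title: str) -> Path:
    """Build data/articles/{date}/{grade}_{score}_{title}.md with a filesystem-safe title."""
    # \w matches CJK characters under Python's default Unicode matching,
    # so Chinese titles survive; punctuation and spaces become underscores.
    safe_title = re.sub(r"[^\w-]+", "_", title).strip("_")[:80]
    return Path("data/articles") / date / f"{grade}_{score}_{safe_title}.md"
```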
### Scoring System
- ✅ Threshold: score > 51 to pass
- ✅ Stage 1: Score on summary/initial content
- ✅ Stage 2: Score on full text
- ✅ Smoothing: when stage1 >= 70 but stage2 < 50, apply a weighted average
- ✅ Grades: A >= 80, B >= 60, C >= 51
### Deduplication
- ✅ In-batch: `seen_urls` set during fetch
- ✅ Database: `article_exists(url)` check before processing
- ✅ Embedding: cosine similarity >= 0.8 during report generation
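The embedding-level pass can be sketched as a greedy keep-first filter; the real embeddings come from the LLM API, so the vectors in the usage example are stand-ins:

```python
# Sketch of embedding deduplication: keep an item only if no kept item's
# embedding is within the cosine-similarity threshold.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def dedupe_by_embedding(items, threshold: float = 0.8):
    """items: iterable of (item, embedding). Keeps the first of any near-duplicate pair."""
    kept = []
    for item, emb in items:
        if all(cosine(emb, kept_emb) < threshold for _, kept_emb in kept):
            kept.append((item, emb))
    return [item for item, _ in kept]
```

Greedy keep-first matches the intent here: the earlier (already stored and scored) article wins, and each incoming article is compared only against survivors.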
### Integration
- ✅ `add-report.js` runs `hooks/post_gen.sh` after web generation
- ✅ Hook failure is warning only (doesn't block the pipeline)
- ✅ Git config: repo-local only (no --global modifications)
- ✅ Push failure: prints clear manual steps
## Acceptance Tests

### Test A: Ingest

Expected: Fetches articles, scores them, and saves them to the database and files.

### Test B: Report

Expected: Creates `data/dailyReport/industry_radar_2026-02-10.md`.

### Test C: Add Report

Expected: Website generation + hook execution + git operations.

## Files Modified/Created
### Modified

- `pipeline/main.py` - Added --date flag support
- `scripts/add-report.js` - Updated paths and git operations
- `scripts/run-pipeline.sh` - Updated command structure
- `crontab.txt` - Added hourly ingest + daily report schedule
- `pipeline/core/utils.py` - Fixed grade thresholds (A >= 80)
- `configs/prompts.yml` - Updated grade ranges in prompt
- `requirements.txt` - Cleaned up comments
### Deleted

- `pipeline/radar.py` - Removed mock file
### Created

- `VERIFICATION.md` - Acceptance test documentation
- `IMPLEMENTATION_SUMMARY.md` - This file
### Already Existed (Verified Correct)

- `pipeline/core/fetchers.py` - Complete implementation
- `pipeline/core/processors.py` - Complete implementation
- `pipeline/core/storage.py` - Complete implementation
- `pipeline/core/llm.py` - Complete implementation
- `pipeline/ingest.py` - Complete implementation
- `pipeline/report.py` - Complete implementation
- `configs/rss_feeds.yml` - Configured with real feeds
- `configs/watchlist.yml` - Configured with keywords
- `configs/prompts.yml` - Complete prompts
- `.env.example` - Complete template
- `.gitignore` - Correct ignores
- `hooks/post_gen.sh` - Hook script
## Next Steps for User

1. Configure Environment
2. Install Dependencies
3. Run Acceptance Tests
4. Setup Cron (optional)
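For step 1, a pre-flight check like the one below can confirm the environment is configured before running the pipeline. The variable names are hypothetical placeholders; substitute whatever keys `.env.example` actually defines:

```python
# Hedged sketch of an environment pre-flight check; the variable names
# below are assumptions, not the keys actually used by .env.example.
import os


def check_env(required=("API_KEY", "API_BASE_URL", "FAST_MODEL", "SMART_MODEL")):
    """Return the list of required variables that are missing or empty."""
    return [name for name in required if not os.environ.get(name)]
```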
## Implementation Status

✅ COMPLETE - All requirements satisfied:

- Real pipeline (no mocks)
- Correct output paths
- Proper scheduling
- Hook integration
- Git config handling
- Acceptance tests documented