
InfoRadar-v5 Implementation Summary

Overview

The full end-to-end InfoRadar-v5 Python pipeline has been implemented, following the claudecode-spec workflow.

What Was Implemented

Core Pipeline Modules

  1. pipeline/core/fetchers.py
     • RSS feed parsing with feedparser
     • Static URL fetching
     • Full-text extraction: Trafilatura first, Jina Reader fallback
     • Time filtering (max_age_days=3)
     • In-batch URL deduplication
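The time filter and in-batch URL deduplication can be sketched as a single pass. This is a minimal sketch: the entry fields `url` and `published` are illustrative assumptions, not the actual data model.

```python
from datetime import datetime, timedelta, timezone

def filter_and_dedupe(entries, max_age_days=3):
    """Drop entries older than max_age_days and repeated URLs within the batch.

    Each entry is assumed to be a dict with 'url' and 'published'
    (an aware datetime) -- illustrative field names.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    seen_urls = set()
    kept = []
    for entry in entries:
        if entry["published"] < cutoff:
            continue  # too old
        if entry["url"] in seen_urls:
            continue  # duplicate within this batch
        seen_urls.add(entry["url"])
        kept.append(entry)
    return kept
```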

  2. pipeline/core/processors.py
     • Two-round scoring (stage 1 on summary, stage 2 on full text)
     • Smoothing logic (when stage1 >= 70 but stage2 < 50)
     • Summary generation (~400 Chinese characters)
     • Embedding generation and deduplication (cosine >= 0.8)
     • Article file saving with grade/score in filename
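The smoothing rule can be illustrated as follows. The trigger condition (stage1 >= 70 but stage2 < 50) comes from the spec; the 0.4/0.6 weights are purely illustrative assumptions.

```python
def smooth_score(stage1: int, stage2: int, w1: float = 0.4, w2: float = 0.6) -> int:
    """Apply a weighted average only when stage1 >= 70 but stage2 < 50.

    The weights w1/w2 are illustrative assumptions; the spec fixes
    only the trigger condition. Otherwise the stage-2 score stands.
    """
    if stage1 >= 70 and stage2 < 50:
        return round(w1 * stage1 + w2 * stage2)
    return stage2
```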

  3. pipeline/core/storage.py
     • SQLite database (data/radar.db)
     • Article CRUD operations
     • URL existence checking for deduplication
     • Date-based queries
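The URL existence check reduces to a single parameterized query. A minimal sketch, assuming a simplified schema (the real table likely has more columns):

```python
import sqlite3

# Assumed minimal schema -- the actual articles table likely has more columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url TEXT PRIMARY KEY,
    title TEXT,
    score INTEGER,
    grade TEXT,
    created_date TEXT
)
"""

def article_exists(conn: sqlite3.Connection, url: str) -> bool:
    """Return True if the URL is already stored (used for deduplication)."""
    row = conn.execute("SELECT 1 FROM articles WHERE url = ?", (url,)).fetchone()
    return row is not None
```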

  4. pipeline/core/llm.py
     • OpenAI-compatible API client
     • Fast model for scoring/summaries
     • Smart model for report generation
     • Embedding generation
     • JSON mode support

  5. pipeline/core/utils.py
     • Date filtering utilities
     • Cosine similarity calculation
     • Score smoothing logic
     • Grade calculation (A>=80, B>=60, C>=51)
     • YAML config loading
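Two of these utilities are small enough to sketch directly. The function names are illustrative assumptions, but the grade thresholds match the spec (A >= 80, B >= 60, C >= 51):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def grade(score: int):
    """Map a score to a grade: A >= 80, B >= 60, C >= 51; else no grade."""
    if score >= 80:
        return "A"
    if score >= 60:
        return "B"
    if score >= 51:
        return "C"
    return None
```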

Command Interface

  1. pipeline/main.py
     • ingest command: fetch and process articles
     • report command: generate the daily report (supports --date flag)
     • run-daily command: full pipeline (ingest + report)

  2. pipeline/ingest.py
     • Orchestrates: fetch → dedupe → score → summarize → store
     • Saves to SQLite + article markdown files
     • Output: data/articles/{date}/{grade}_{score}_{title}.md
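The article filename convention can be sketched as a small helper. Only the {grade}_{score}_{title}.md shape comes from the spec; the sanitization rules and length cap are assumptions.

```python
import re

def article_filename(grade: str, score: int, title: str, max_len: int = 60) -> str:
    """Build '{grade}_{score}_{title}.md'.

    Sanitization (replacing path-unsafe characters and whitespace with
    underscores) and the 60-char title cap are illustrative assumptions.
    """
    safe = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")[:max_len]
    return f"{grade}_{score}_{safe}.md"
```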

  3. pipeline/report.py
     • Reads articles from database
     • Embedding deduplication (cosine >= 0.8)
     • Generates summary file: data/articles/summary/{date}_S.md
     • Single smart LLM call for full report
     • Output: data/dailyReport/industry_radar_{date}.md
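The embedding deduplication during report generation can be sketched as a greedy pass. Only the cosine >= 0.8 threshold comes from the spec; the sort-by-score convention (so the higher-scoring duplicate survives) is an assumption.

```python
import numpy as np

def dedupe_by_embedding(articles, threshold=0.8):
    """Greedy dedup: keep an article only if its embedding's cosine
    similarity to every already-kept article is below the threshold.

    Articles are assumed pre-sorted by score (highest first) so the
    higher-scoring duplicate survives -- an illustrative convention.
    """
    kept = []  # list of (article, unit_vector) pairs
    for art in articles:
        v = np.asarray(art["embedding"], dtype=float)
        v = v / np.linalg.norm(v)
        # dot of unit vectors == cosine similarity
        if all(float(np.dot(v, k)) < threshold for _, k in kept):
            kept.append((art, v))
    return [a for a, _ in kept]
```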

Configuration Files

  1. configs/rss_feeds.yml
     • RSS feed URLs (TechCrunch, VentureBeat, MIT Tech Review, etc.)
     • Static URL list

  2. configs/watchlist.yml
     • Keywords (high/medium/low priority)
     • Companies to monitor
     • Topics of interest

  3. configs/prompts.yml
     • Scoring criteria prompt
     • Summary generation prompt
     • Full report generation prompt (with proper grade thresholds)
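A hedged sketch of what such a prompts file might look like. The key names and prompt wording are assumptions; only the grade thresholds and the ~400-character summary length come from the spec.

```yaml
# Illustrative fragment of configs/prompts.yml -- key names and wording
# are assumptions, not the actual file contents.
scoring: |
  Score this article from 0-100 for relevance to the watchlist.
  Grade thresholds: A >= 80, B >= 60, C >= 51.
  Respond in JSON: {"score": <int>, "reason": "<short reason>"}
summary: |
  Summarize the article in about 400 Chinese characters.
```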

Automation Scripts

  1. scripts/run-pipeline.sh
     • Wrapper for cron execution
     • Supports ingest and report commands
     • Logging to separate files
     • Environment variable loading

  2. scripts/add-report.js
     • Updated to expect input from data/dailyReport/
     • Copies to output/{date}.md
     • Generates website
     • Runs hooks/post_gen.sh (failure is warning only)
     • Git operations with clear manual steps on push failure

  3. crontab.txt
     • Hourly ingest at minute :00
     • Daily report at 23:15 Asia/Shanghai (generates TODAY's report)
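The schedule above corresponds to crontab entries like the following (illustrative: /path/to is a placeholder, and the 23:15 local time assumes the host timezone is Asia/Shanghai or a cron implementation that honors CRON_TZ):

```crontab
# Hourly ingest at minute :00
0 * * * * /path/to/scripts/run-pipeline.sh ingest

# Daily report at 23:15 (host timezone assumed to be Asia/Shanghai)
15 23 * * * /path/to/scripts/run-pipeline.sh report
```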

Supporting Files

  1. requirements.txt
     • All Python dependencies listed
     • feedparser, trafilatura, requests, PyYAML
     • openai, numpy, scikit-learn
     • python-dotenv, python-dateutil

  2. .env.example
     • API configuration template
     • Model configuration (fast/smart)
     • Pipeline parameters (thresholds, max_age_days)

  3. .gitignore
     • Ignores data/ directory (articles, database, reports)
     • Ignores .env and other sensitive files

  4. hooks/post_gen.sh
     • Post-generation hook (runs after website generation)
     • Failure is warning only

Key Implementation Details

No Mocks in Main Path

  • ✅ Deleted pipeline/radar.py (contained NotImplementedError stubs)
  • ✅ All code uses real RSS feeds, real LLM calls, real database
  • ✅ Mocks allowed ONLY in test files (test_pipeline.py)

Output Contract

  • ✅ Reports generated in data/dailyReport/industry_radar_YYYY-MM-DD.md
  • ✅ NOT in output/ (that's for published reports)
  • ✅ Article files in data/articles/{date}/{grade}_{score}_{title}.md
  • ✅ Summary files in data/articles/summary/{date}_S.md

Scoring System

  • ✅ Pass threshold: score >= 51 (the grade C floor)
  • ✅ Stage 1: Score on summary/initial content
  • ✅ Stage 2: Score on full text
  • ✅ Smoothing: When stage1>=70 but stage2<50, apply weighted average
  • ✅ Grades: A>=80, B>=60, C>=51

Deduplication

  • ✅ In-batch: seen_urls set during fetch
  • ✅ Database: article_exists(url) check before processing
  • ✅ Embedding: cosine similarity >= 0.8 during report generation

Integration

  • ✅ add-report.js runs hooks/post_gen.sh after web generation
  • ✅ Hook failure is warning only (doesn't block pipeline)
  • ✅ Git config: repo-local only (no --global modifications)
  • ✅ Push failure: prints clear manual steps

Acceptance Tests

Test A: Ingest

    python -m pipeline.main ingest

Expected: Fetches articles, scores them, and saves to the database and article files

Test B: Report

    python -m pipeline.main report --date 2026-02-10

Expected: Creates data/dailyReport/industry_radar_2026-02-10.md

Test C: Add Report

    node scripts/add-report.js data/dailyReport/industry_radar_2026-02-10.md --verbose --yes

Expected: Website generation + hook execution + git operations

Files Modified/Created

Modified

  • pipeline/main.py - Added --date flag support
  • scripts/add-report.js - Updated paths and git operations
  • scripts/run-pipeline.sh - Updated command structure
  • crontab.txt - Added hourly ingest + daily report schedule
  • pipeline/core/utils.py - Fixed grade thresholds (A>=80)
  • configs/prompts.yml - Updated grade ranges in prompt
  • requirements.txt - Cleaned up comments

Deleted

  • pipeline/radar.py - Removed mock file

Created

  • VERIFICATION.md - Acceptance test documentation
  • IMPLEMENTATION_SUMMARY.md - This file

Already Existed (Verified Correct)

  • pipeline/core/fetchers.py - Complete implementation
  • pipeline/core/processors.py - Complete implementation
  • pipeline/core/storage.py - Complete implementation
  • pipeline/core/llm.py - Complete implementation
  • pipeline/ingest.py - Complete implementation
  • pipeline/report.py - Complete implementation
  • configs/rss_feeds.yml - Configured with real feeds
  • configs/watchlist.yml - Configured with keywords
  • configs/prompts.yml - Complete prompts
  • .env.example - Complete template
  • .gitignore - Correct ignores
  • hooks/post_gen.sh - Hook script

Next Steps for User

  1. Configure Environment:

    cp .env.example .env
    # Edit .env and add OPENAI_API_KEY
    

  2. Install Dependencies:

    pip3 install -r requirements.txt
    cd web && npm install && cd ..
    

  3. Run Acceptance Tests:

    # Test A
    python -m pipeline.main ingest
    
    # Test B
    python -m pipeline.main report --date 2026-02-10
    
    # Test C
    node scripts/add-report.js data/dailyReport/industry_radar_2026-02-10.md --verbose --yes
    

  4. Setup Cron (optional):

    # Edit crontab.txt with your actual path
    crontab crontab.txt
    

Implementation Status

COMPLETE - All requirements satisfied:

  • Real pipeline (no mocks)
  • Correct output paths
  • Proper scheduling
  • Hook integration
  • Git config handling
  • Acceptance tests documented