This project implements an automated pipeline using Python and the Google Gemini API to create a personalized daily AI news digest and learning resource for a technically focused user (e.g., a CTO).
- Fetches content from a configurable list of RSS feeds.
- **Aggressive Pre-filtering:** Filters items by publish date, keywords, and per-feed limits before sending them to the AI, reducing cost and noise (configurable via `config.yaml`).
- **Strategic Model Usage:** Uses different Gemini models for specific tasks (e.g., `gemini-2.0-flash-lite` for filtering/basic summaries, `gemini-2.0-flash` for analysis/tutorial generation) to balance cost and quality (configurable via `config.yaml`).
- Uses the Gemini API to filter, tag, and prioritize content based on user preferences (Google AI, LLMs, MLOps, etc.).
- **Two-Step Summarization:** Generates a concise summary (lite model), then a deeper analysis including technical insights and actionable ideas (reasoning model).
- Generates a custom, pedagogical tutorial on a rotating AI topic using the Gemini API.
- Assembles a structured Markdown digest with sections for headlines, tutorials, Google news, market pulse, and actionable ideas, enhanced with emojis.
- **Token Tracking & Cost Estimation:** Tracks Gemini API token usage per run and logs totals; optionally estimates cost based on configurable pricing.
- Delivers the digest daily via email (supports SMTP and SendGrid; SendGrid requires uncommenting code and installing the library).
- Scheduled daily execution via the `schedule` library, or a single run.
- Configuration managed via a `config.yaml` file.
- Includes basic unit tests for email sending (`tests/test_email.py`).
- **URL Deduplication:** Prevents duplicate content by tracking processed URLs across runs, with a configurable time window.
- **Project Context Integration:** Imports details about ongoing projects to generate more relevant, actionable insights.
- **Resend Feature:** Command-line option to resend the last generated digest without regenerating content.
- **Directory Structure Management:** Automatically creates required directories (`logs`, `outputs`, `data`) if they don't exist.
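The URL-deduplication feature could be sketched roughly as below. This is a minimal illustration, not the project's actual implementation: the `processed_urls.json` file name matches the project layout, but the storage format (URL mapped to a Unix timestamp) and the helper names are assumptions.

```python
import json
import time
from pathlib import Path

STORE = Path("processed_urls.json")  # file name from the project layout; format is assumed
WINDOW_HOURS = 72  # hypothetical value for the "configurable time window"

def load_seen(store=STORE, window_hours=WINDOW_HOURS, now=None):
    """Load previously processed URLs, dropping entries older than the window."""
    now = now if now is not None else time.time()
    try:
        seen = json.loads(store.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        seen = {}
    cutoff = now - window_hours * 3600
    return {url: ts for url, ts in seen.items() if ts >= cutoff}

def filter_new(items, seen, now=None):
    """Keep only items whose 'link' has not been seen; record new links in `seen`."""
    now = now if now is not None else time.time()
    fresh = []
    for item in items:
        url = item.get("link")
        if url and url not in seen:
            seen[url] = now
            fresh.append(item)
    return fresh
```

After a run, the updated `seen` dictionary would be written back with `store.write_text(json.dumps(seen))`, so stale entries age out on the next load.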
- Python 3.10 or higher
- Git (for cloning, if applicable)
- Access to the Google Gemini API (either an API key or configured Google Cloud Application Default Credentials). Ensure the specified models (e.g., `gemini-2.0-flash`, `gemini-2.0-flash-lite`) are available to your project/API key.
- Email account credentials (for SMTP) or a SendGrid account/API key.
- **Clone the Repository** (if applicable):

  ```bash
  git clone <repository-url>
  cd <repository-directory>
  ```

  (If you received the code directly, just navigate to the project directory.)

- **Create and Activate a Virtual Environment** (recommended):

  ```bash
  # Windows
  python -m venv venv
  .\venv\Scripts\activate

  # macOS/Linux
  python3 -m venv venv
  source venv/bin/activate
  ```

- **Install Dependencies:** Ensure you have `pip` installed, then install the required libraries:

  ```bash
  pip install -r requirements.txt
  ```

  (`requirements.txt` includes `google-generativeai`, `feedparser`, `schedule`, `markdown`, `PyYAML`, `python-dotenv`, `beautifulsoup4`, `lxml`, and `pymdown-extensions`.)
- **Configure `config.yaml`:**
  - Copy the example configuration file: `cp config.example.yaml config.yaml` (or `copy config.example.yaml config.yaml` on Windows).
  - Edit `config.yaml` and fill in the required values:
    - **Gemini API:**
      - Provide your `gemini_api_key`, OR
      - ensure `google_application_credentials` points to your service account key file path if using ADC.
    - **RSS Feeds:** Customize the `rss_feeds` list.
    - **Ingestion Settings:**
      - Review and adjust `ingestion -> max_hours_since_published`.
      - Configure `ingestion -> feed_limits` (set `default` and add specific limits for high-volume feeds by URL).
      - Refine `ingestion -> required_keywords`, or leave it empty to disable keyword filtering.
      - Add problematic feed URLs to `ingestion -> skip_feeds`.
    - **Gemini Models:** Verify or change the model names under `gemini_models` (e.g., `FILTERING_MODEL`, `SUMMARIZATION_LITE_MODEL`, `ANALYSIS_MODEL`, `TUTORIAL_MODEL`).
    - **(Optional) Cost Estimation:** Update `gemini_pricing` with current prices per million tokens (input/output) for the models used if you want cost-estimation logging.
    - **Processing:** Adjust `num_news_items_to_summarize`, `num_feed_tutorials_to_include`, and `initial_tutorial_topics` if desired.
    - **Email:**
      - Set `email_provider` (`smtp` or `sendgrid`).
      - Enter the `recipient_email` and `sender_email`.
      - If using `smtp`, provide `smtp_server`, `smtp_port`, and `smtp_password`. Important: for Gmail, use an App Password instead of your regular password.
      - If using `sendgrid`, provide your `sendgrid_api_key` (and uncomment the SendGrid code in `src/email_utils.py`; you may also need to run `pip install sendgrid`).
    - **Scheduling:** Configure `run_mode` (`schedule` or `once`), `schedule_time`, and `schedule_initial_run`.
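For orientation, a trimmed `config.yaml` might look like the sketch below. The key names follow the options listed above, but the nesting, the feed URL, and all values shown are illustrative assumptions; `config.example.yaml` is the authoritative layout.

```yaml
gemini_api_key: "YOUR_API_KEY"        # or set google_application_credentials instead
rss_feeds:
  - "https://blog.google/technology/ai/rss/"   # illustrative feed
ingestion:
  max_hours_since_published: 24
  feed_limits:
    default: 10
  required_keywords: []               # empty disables keyword filtering
  skip_feeds: []
gemini_models:
  FILTERING_MODEL: "gemini-2.0-flash-lite"
  SUMMARIZATION_LITE_MODEL: "gemini-2.0-flash-lite"
  ANALYSIS_MODEL: "gemini-2.0-flash"
  TUTORIAL_MODEL: "gemini-2.0-flash"
email_provider: "smtp"
recipient_email: "you@example.com"
sender_email: "digest@example.com"
run_mode: "schedule"
schedule_time: "07:00"
schedule_initial_run: true
```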
- **Set Up Project Context:**
  - Edit the `project_context.md` file with information about your current projects. This information is used to generate more relevant actionable insights.
  - If the file doesn't exist, a placeholder is created on the first run.
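Since `project_context.md` feeds free-form text into the prompts, it can be as simple as the example below; the headings and projects shown here are purely illustrative.

```markdown
# Current Projects

## Internal RAG Chatbot
- Stack: Python, vector database, Gemini API
- Status: prototyping retrieval quality

## MLOps Pipeline Migration
- Moving batch training jobs to a managed scheduler
- Pain points: cost visibility, flaky data ingestion
```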
- **Run the Agent:**

  ```bash
  python main.py
  ```

  - Based on `run_mode` in `config.yaml`, the script will either run the pipeline once or start the scheduler.
  - If scheduling, the first run may happen immediately (depending on `schedule_initial_run`), and subsequent runs occur daily at `schedule_time`.
  - Logs are written to the `logs/` directory, including total token counts and estimated cost (if enabled).
  - To resend the last generated digest without regenerating content:

    ```bash
    python main.py --resend
    ```
- **Run Tests** (optional):

  ```bash
  python -m unittest discover tests
  ```
```
.
├── config.yaml               # Your configuration file
├── main.py                   # Main script: orchestration and scheduling
├── requirements.txt          # Python dependencies
├── project_context.md        # Information about current projects for tailored insights
├── processed_urls.json       # Tracks previously processed URLs for deduplication
├── .env                      # Environment variables (alternative to config.yaml)
├── .env.example              # Example environment variables
├── data/                     # Directory for data storage
├── logs/                     # Log files and token usage statistics
├── outputs/                  # Saved digests and other outputs
├── src/                      # Source code modules
│   ├── __init__.py
│   ├── assembly.py           # Assembles the final Markdown digest
│   ├── config_loader.py      # Loads configuration from config.yaml
│   ├── email_utils.py        # Handles sending email (SMTP/SendGrid)
│   ├── ingestion.py          # Fetches, pre-filters, and parses RSS feeds
│   ├── processing.py         # Filters/tags items using Gemini API, tracks tokens
│   ├── summarization.py      # Summarizes/analyzes items using Gemini API (2 steps)
│   └── tutorial_generator.py # Generates custom tutorials using Gemini API
└── tests/                    # Unit tests
    ├── __init__.py
    └── test_email.py         # Tests for email sending logic
```
- **Configuration:** Most parameters (feeds, limits, models, schedule, emails, etc.) are controlled via `config.yaml`.
- **Prompts:** Edit the prompt templates directly within `processing.py`, `summarization.py`, and `tutorial_generator.py` to tailor the AI's behavior.
- **Digest Structure:** Modify `src/assembly.py` to change the layout or sections of the Markdown digest.
- **Project Context:** Update `project_context.md` with details about your current projects to receive more relevant actionable insights.
Running this pipeline incurs costs based on the usage of the Google Gemini API. Costs are calculated based on the number of input and output tokens processed by the configured models.
API Calls Per Run (Typical):
- **Filtering/Tagging** (`filter_and_tag_items`): 1 call using `FILTERING_MODEL` (e.g., `gemini-2.0-flash-lite`).
  - Input: base prompt + text from pre-filtered RSS items (the item count is already significantly reduced by ingestion filtering).
  - Output: JSON list of ~15-20 prioritized items.
- **Basic Summarization** (`summarize_and_analyze`): N calls using `SUMMARIZATION_LITE_MODEL` (e.g., `gemini-2.0-flash-lite`), where N = `num_news` + `num_tutorials`.
  - Input (per call): summary prompt + title/link/snippet of one filtered item.
  - Output (per call): basic summary text.
- **Deeper Analysis** (`summarize_and_analyze`): N calls using `ANALYSIS_MODEL` (e.g., `gemini-2.0-flash`), where N = `num_news` + `num_tutorials`.
  - Input (per call): analysis prompt + title/link/snippet + basic summary.
  - Output (per call): Markdown analysis (Insight, Market, Actionable Idea).
- **Tutorial Generation** (`generate_tutorial`): 1 call (if a topic is available) using `TUTORIAL_MODEL` (e.g., `gemini-2.0-flash`).
  - Input: tutorial prompt + selected tutorial topic.
  - Output: full Markdown tutorial.
Estimating Costs:
- **Pre-filtering:** The most significant cost optimization; it dramatically reduces the number of items sent to the Filtering/Tagging model.
- **Model Choice:** Using `gemini-2.0-flash-lite` for high-volume, simpler tasks (filtering, basic summaries) reduces cost compared to using `gemini-2.0-flash` for everything.
- **Token Tracking:** The script logs the total `prompt_tokens`, `candidates_tokens`, and `total_tokens` used per run. You can combine these numbers with official pricing for an accurate cost calculation.
- **Optional Cost Estimation:** If `gemini_pricing` is configured in `config.yaml`, the script logs a rough estimated cost (based on a simplified calculation).
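A simplified per-run estimate could look like the sketch below. This is an illustration, not the project's actual code: the pricing-dictionary shape (per-million-token input/output prices keyed by model name) mirrors the `gemini_pricing` option described above, but the exact keys are assumptions and the prices are placeholders, not real Gemini prices.

```python
# Placeholder prices per million tokens (NOT real Gemini prices).
GEMINI_PRICING = {
    "gemini-2.0-flash-lite": {"input": 0.10, "output": 0.40},
    "gemini-2.0-flash": {"input": 0.15, "output": 0.60},
}

def estimate_cost(usage, pricing=GEMINI_PRICING):
    """Rough cost estimate in dollars from per-model token counts.

    `usage` maps model name -> {"prompt_tokens": int, "candidates_tokens": int},
    matching the token totals the script logs per run.
    """
    total = 0.0
    for model, counts in usage.items():
        prices = pricing.get(model)
        if prices is None:
            continue  # unknown model: skip rather than guess a price
        total += counts["prompt_tokens"] / 1_000_000 * prices["input"]
        total += counts["candidates_tokens"] / 1_000_000 * prices["output"]
    return total
```

For example, `estimate_cost({"gemini-2.0-flash": {"prompt_tokens": 1_000_000, "candidates_tokens": 500_000}})` yields 0.45 under these placeholder prices.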
Example Token Flow (Highly Variable):
- Ingestion fetches 1000+ items, pre-filters down to < 100 items.
- Filtering (Lite Model): Input ~500 (Prompt) + ~80 items * ~80 tokens/item = ~6900 tokens. Output ~2000 tokens (JSON).
- Basic Summaries (Lite Model): 12 calls * (~150 Prompt + ~100 Item Snippet) = ~3000 input tokens. Output: 12 * ~70 tokens/summary = ~840 tokens.
- Deeper Analysis (Flash Model): 12 calls * (~200 Prompt + ~100 Item + ~70 Basic Summary) = ~4440 input tokens. Output: 12 * ~180 tokens/analysis = ~2160 tokens.
- Tutorial (Flash Model): Input ~500 tokens. Output ~2000 tokens.
- Total Rough Estimate: Input ~14840 tokens, Output ~7000 tokens. (Actuals depend heavily on content and model responses).
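The rough totals above can be reproduced with a quick back-of-the-envelope calculation; the per-call token counts are the same guesses used in the list, not measurements.

```python
calls = 12  # num_news + num_tutorials in this example

input_tokens = (
    500 + 80 * 80               # filtering: prompt + ~80 items * ~80 tokens/item
    + calls * (150 + 100)       # basic summaries: prompt + snippet per call
    + calls * (200 + 100 + 70)  # analysis: prompt + item + basic summary per call
    + 500                       # tutorial prompt
)
output_tokens = 2000 + calls * 70 + calls * 180 + 2000

print(input_tokens, output_tokens)  # 14840 7000
```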
Recommendations:
- **Check Official Pricing:** Refer to the official Google AI pricing page for `gemini-2.0-flash-lite` and `gemini-2.0-flash`.
- **Monitor Usage:** Check your Google Cloud Console for actual API usage and costs.
- **Tune `config.yaml`:** Adjust `ingestion` filters (hours, limits, keywords), the number of summaries (`num_news`, `num_tutorials`), and `rss_feeds` to balance cost and content.
- **Review Logs:** Check the application logs for token counts per run.
- **Further Optimization:** Consider fetching full article text only for the top 1-2 items. This would require adding scraping logic (e.g., using `requests` and `BeautifulSoup4`) and would increase analysis cost, but could improve quality.