Complete guide to setting up automated daily arXiv paper tracking with GitHub Actions.
- Quick Start
- Configuration
- GitHub Actions Setup
- API Keys Setup
- Testing Locally
- Customization
- Troubleshooting
- Go to your repository: https://github.com/YOUR_USERNAME/my-starred-repos
- Click on Actions tab
- Enable GitHub Actions if not already enabled
- Find the workflow: "arXiv Daily Paper Crawler"
- Click "Run workflow" to test manually
Edit arxiv-config.yml:
research_interests:
  ai_ml:
    - "large language model"
    - "GPT"
    - "transformer"
    # Add more keywords for your specific interests
For AI-powered summaries, add API keys as GitHub Secrets:
- Go to Settings → Secrets and variables → Actions
- Click "New repository secret"
- Add one of these:
  - OPENAI_API_KEY - for OpenAI (GPT-4, GPT-3.5)
  - ANTHROPIC_API_KEY - for Anthropic (Claude)
# Research Areas - Customize these!
research_interests:
  ai_ml:
    - "large language model"
    - "LLM"
    - "GPT"
    - "transformer"
  reinforcement_learning:
    - "reinforcement learning"
    - "RLHF"
    - "policy gradient"
  mlops:
    - "MLOps"
    - "model serving"
    - "model deployment"
# arXiv categories to monitor
arxiv_categories:
  - "cs.AI"  # Artificial Intelligence
  - "cs.CL"  # Computation and Language
  - "cs.CV"  # Computer Vision
  - "cs.LG"  # Machine Learning
# Filtering
filters:
  max_papers_per_category: 100
  min_relevance_score: 0.5
  days_to_look_back: 1
# AI Summarization
summarization:
  enabled: true
  provider: "openai"  # or "anthropic"
  model: "gpt-4o-mini"
  max_summary_length: 200
# Notifications
notifications:
  github_issue:
    enabled: true
    create_daily_issue: true
  email:
    enabled: false
    recipients:
      - "your-email@example.com"
# Output formats
output:
  generate_markdown: true
  generate_json: true
  group_by: "category"  # or "date", "relevance"
The workflow runs automatically:
- Daily at 9 AM UTC (4 AM EST)
- Can also be triggered manually
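To see what the 9 AM UTC run time means in another time zone, the conversion can be sketched in Python. This is a minimal illustration using only the standard library; the function name is hypothetical, and the fixed UTC-5 offset stands in for EST and does not account for daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Eastern Standard Time (assumption: no daylight saving).
EST = timezone(timedelta(hours=-5), name="EST")


def cron_hour_to_local(hour_utc: int, tz: timezone = EST) -> str:
    """Convert the UTC hour of a cron expression to a local wall-clock time."""
    # The calendar date is arbitrary; only the hour matters for a fixed offset.
    run_utc = datetime(2025, 1, 17, hour_utc, 0, tzinfo=timezone.utc)
    return run_utc.astimezone(tz).strftime("%H:%M %Z")


print(cron_hour_to_local(9))  # 04:00 EST
```

GitHub Actions evaluates cron schedules in UTC, so a different local run time means changing the hour field of the cron expression rather than a time zone setting.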
To change the schedule, edit .github/workflows/arxiv-daily-crawler.yml:
on:
  schedule:
    - cron: '0 9 * * *'  # Change this
  workflow_dispatch:
The workflow needs these permissions (already configured):
- contents: write - to commit new papers
- issues: write - to create daily digest issues
- pages: write - for GitHub Pages (optional)
Cost: ~$0.01-0.10 per day (using GPT-4o-mini)
- Get API key: https://platform.openai.com/api-keys
- Add to GitHub Secrets as OPENAI_API_KEY
- Configure in arxiv-config.yml:
summarization:
  enabled: true
  provider: "openai"
  model: "gpt-4o-mini"  # Cheap and fast
Cost: ~$0.01-0.05 per day (using Claude Haiku)
- Get API key: https://console.anthropic.com/
- Add to GitHub Secrets as ANTHROPIC_API_KEY
- Configure in arxiv-config.yml:
summarization:
  enabled: true
  provider: "anthropic"
  model: "claude-3-haiku-20240307"
Disable AI summaries and use paper abstracts:
summarization:
  enabled: false
cd my-starred-repos
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements-arxiv.txt
# Export API key
export OPENAI_API_KEY="your-key-here"
# or
export ANTHROPIC_API_KEY="your-key-here"
# Fetch papers
python scripts/fetch_arxiv_papers.py
# Generate summary
python scripts/generate_summary.py
# Check results
ls arxiv-papers/
ls daily-summaries/
cat README_ARXIV_DAILY.md
You should see:
arxiv-papers/
├── papers-2025-01-17.json
└── papers-latest.json
daily-summaries/
├── summary-2025-01-17.md
└── summary-latest.md
README_ARXIV_DAILY.md
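Once those files exist, a quick sanity check on the JSON output can confirm that the run produced usable records. The sketch below is not part of the repository's scripts; `summarize_papers` is a hypothetical helper, and it assumes each record carries the `relevance_score` and `ai_summary` fields shown in this guide's JSON output format.

```python
import json
from pathlib import Path


def summarize_papers(path: Path) -> dict:
    """Return a small report on a papers JSON file (a list of paper records)."""
    papers = json.loads(path.read_text(encoding="utf-8"))
    scores = [p.get("relevance_score", 0.0) for p in papers]
    return {
        "count": len(papers),
        "max_score": max(scores, default=0.0),
        # Records without an AI summary (e.g. summarization disabled or failed).
        "missing_summary": sum(1 for p in papers if not p.get("ai_summary")),
    }


if __name__ == "__main__":
    latest = Path("arxiv-papers/papers-latest.json")
    if latest.exists():
        print(summarize_papers(latest))
    else:
        print(f"{latest} not found; run scripts/fetch_arxiv_papers.py first")
```

A count of zero, or a `missing_summary` value equal to the count, points to the keyword and API key checks in the Troubleshooting section.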
Edit arxiv-config.yml:
research_interests:
  my_custom_topic:
    - "keyword 1"
    - "keyword 2"
    - "keyword 3"
Full list: https://arxiv.org/category_taxonomy
arxiv_categories:
  - "cs.AI"    # Artificial Intelligence
  - "cs.DB"    # Databases
  - "cs.DC"    # Distributed Computing
  - "cs.SE"    # Software Engineering
  - "math.OC"  # Optimization
filters:
  exclude_keywords:
    - "medical imaging"
    - "drug discovery"
    - "protein folding"
filters:
  min_relevance_score: 0.7  # Higher = more selective
filters:
  max_papers_per_category: 50  # Reduce for fewer papers
  days_to_look_back: 2  # Look back 2 days
Add to .github/workflows/arxiv-daily-crawler.yml:
- name: Send Email
  uses: dawidd6/action-send-mail@v3
  with:
    server_address: smtp.gmail.com
    server_port: 465
    username: ${{ secrets.EMAIL_USERNAME }}
    password: ${{ secrets.EMAIL_PASSWORD }}
    subject: arXiv Daily Digest - ${{ env.DATE }}
    to: your-email@example.com
    from: GitHub Actions
    body: file://daily-summaries/summary-latest.md
Add secrets:
- EMAIL_USERNAME
- EMAIL_PASSWORD
- Go to Settings → Pages
- Source: Deploy from a branch
- Branch: master or gh-pages
- Folder: / (root)
- Create index.html or use Jekyll theme
The papers will be accessible at:
https://YOUR_USERNAME.github.io/my-starred-repos/
my-starred-repos/
├── arxiv-papers/
│   ├── papers-2025-01-17.json    # Daily papers (JSON)
│   └── papers-latest.json        # Latest papers
├── daily-summaries/
│   ├── summary-2025-01-17.md     # Daily summary (Markdown)
│   └── summary-latest.md         # Latest summary
├── README_ARXIV_DAILY.md         # Main dashboard
└── arxiv-config.yml              # Configuration
[
  {
    "id": "2401.12345",
    "title": "Paper Title",
    "authors": ["Author 1", "Author 2"],
    "abstract": "Full abstract...",
    "published": "2025-01-17T12:00:00",
    "categories": ["cs.AI", "cs.LG"],
    "pdf_url": "https://arxiv.org/pdf/2401.12345",
    "arxiv_url": "https://arxiv.org/abs/2401.12345",
    "relevance_score": 0.85,
    "ai_summary": "AI-generated summary..."
  }
]
Solution:
- Check Actions tab is enabled
- Verify workflow file is in .github/workflows/
- Check branch name (must be master or main)
- Manually trigger: Actions → workflow → Run workflow
Solution:
- Check arxiv-config.yml syntax
- Broaden search keywords
- Lower min_relevance_score
- Increase days_to_look_back
- Check arXiv categories are valid
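A typo in a category ID is easy to miss in the config. The following is a rough sketch of a format check, not part of the crawler: `malformed_categories` is a hypothetical helper, and the pattern only tests that an entry has the general shape of an arXiv category ID (archive name, optional dot and subject class). It does not confirm that the category actually exists in the arXiv taxonomy.

```python
import re

# Shape of an arXiv category ID: an archive name, optionally followed by
# a dot and a subject class (e.g. "cs.AI", "math.OC", "hep-th").
CATEGORY_RE = re.compile(r"^[a-z]+(-[a-z]+)*(\.[A-Za-z]+(-[A-Za-z]+)*)?$")


def malformed_categories(categories: list[str]) -> list[str]:
    """Return the entries that do not look like arXiv category IDs."""
    return [c for c in categories if not CATEGORY_RE.match(c.strip())]


print(malformed_categories(["cs.AI", "cs.LG", "CS.AI", "cs AI", "math.OC"]))
# ['CS.AI', 'cs AI']
```

Entries that pass this check can still be wrong (for example, a subject class that does not exist), so the taxonomy page remains the authoritative list.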
Solution:
- Verify API key is set in GitHub Secrets
- Check secret name: OPENAI_API_KEY or ANTHROPIC_API_KEY
- Verify API key has credits/quota
- Check workflow logs for errors
- Test locally with export OPENAI_API_KEY=...
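Before digging into workflow logs, it can help to confirm which key the current environment actually exposes. This is a small sketch, assuming the two secret names used in this guide; `available_providers` is a hypothetical helper, not a function from the repository's scripts.

```python
import os
from typing import Mapping

# Secret names documented in this guide, mapped to the provider they enable.
KEY_TO_PROVIDER = {
    "OPENAI_API_KEY": "openai",
    "ANTHROPIC_API_KEY": "anthropic",
}


def available_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """List providers whose API key is set to a non-empty value."""
    return [
        provider
        for key, provider in KEY_TO_PROVIDER.items()
        if env.get(key, "").strip()
    ]


if __name__ == "__main__":
    found = available_providers()
    print("Providers with a key set:", found or "none")
```

If the provider named in `summarization.provider` is not in the returned list, the key is missing or empty in that environment. The check says nothing about whether the key is valid or has quota.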
Solution:
- Reduce max_papers_per_category
- Increase rate_limit_delay in config
- Reduce number of search queries
- Use days_to_look_back: 1 instead of 7
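The effect of a delay setting such as `rate_limit_delay` can be pictured as a pause between consecutive requests. The sketch below is an assumption about how such a setting is typically applied, not the crawler's actual code; `fetch_with_delay`, its `fetch` callback, and the 3-second default are all hypothetical.

```python
import time
from typing import Callable, Iterable


def fetch_with_delay(
    queries: Iterable[str],
    fetch: Callable[[str], list],
    rate_limit_delay: float = 3.0,  # assumed default, in seconds
) -> list:
    """Run fetch(query) for each query, sleeping between consecutive calls."""
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            time.sleep(rate_limit_delay)  # pause before every call but the first
        results.extend(fetch(query))
    return results


if __name__ == "__main__":
    # Stub fetcher standing in for a real arXiv request.
    demo = fetch_with_delay(["cat:cs.AI", "cat:cs.LG"], lambda q: [q], 0.1)
    print(demo)
```

Total run time grows with the number of queries times the delay, which is why reducing the number of search queries and raising the delay both ease rate limiting.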
Solution:
- Check workflow permissions: Settings → Actions → Workflow permissions
- Enable "Read and write permissions"
- Verify git config in workflow
Solution:
- Increase min_relevance_score (e.g., 0.7)
- Reduce max_papers_per_category (e.g., 20)
- Add more specific keywords
- Use exclude_keywords to filter out unwanted topics
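To make the combined effect of these settings concrete, here is a minimal sketch of threshold and exclusion filtering over records in the JSON output format. It is an illustration only: `filter_papers` is hypothetical, and whether the crawler compares with `>=`, or matches excluded keywords against the title and abstract, is an assumption.

```python
def filter_papers(papers, min_relevance_score=0.7, exclude_keywords=()):
    """Keep papers at or above the threshold that mention no excluded keyword."""
    excluded = [k.lower() for k in exclude_keywords]
    kept = []
    for paper in papers:
        if paper.get("relevance_score", 0.0) < min_relevance_score:
            continue  # below the relevance threshold
        text = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
        if any(k in text for k in excluded):
            continue  # mentions an excluded topic
        kept.append(paper)
    return kept


papers = [
    {"title": "Transformer scaling", "abstract": "", "relevance_score": 0.85},
    {"title": "Protein folding with transformers", "abstract": "", "relevance_score": 0.9},
    {"title": "A survey of methods", "abstract": "", "relevance_score": 0.55},
]
print([p["title"] for p in filter_papers(papers, 0.7, ["protein folding"])])
# ['Transformer scaling']
```

The second paper scores highest but is dropped by the exclusion list, which shows why `exclude_keywords` helps when a topic shares vocabulary with your interests.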
filters:
  max_papers_per_category: 20
  min_relevance_score: 0.7
  days_to_look_back: 1
Bad: "machine learning"
Good: "large language model", "transformer architecture"
- OpenAI GPT-4o-mini: ~$0.01-0.05/day
- Anthropic Claude Haiku: ~$0.01-0.03/day
- Set budget alerts in API dashboards
- Check weekly digests
- Adjust keywords based on results
- Update relevance thresholds
- Archive old papers monthly
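Monthly archiving can be scripted. The sketch below is one possible approach, assuming the `papers-YYYY-MM-DD.json` naming shown in this guide; the `archive/YYYY-MM/` layout, the 30-day retention, and the function name are hypothetical choices, not something the repository prescribes.

```python
import re
import shutil
from datetime import date, timedelta
from pathlib import Path

DATED_FILE = re.compile(r"^papers-(\d{4})-(\d{2})-(\d{2})\.json$")


def archive_old_papers(papers_dir: Path, today: date, keep_days: int = 30) -> list[Path]:
    """Move dated paper files older than keep_days into archive/YYYY-MM/."""
    cutoff = today - timedelta(days=keep_days)
    moved = []
    for path in sorted(papers_dir.glob("papers-*.json")):
        match = DATED_FILE.match(path.name)
        if not match:
            continue  # skips papers-latest.json and anything unexpected
        file_date = date(*map(int, match.groups()))
        if file_date < cutoff:
            target_dir = papers_dir / "archive" / f"{file_date:%Y-%m}"
            target_dir.mkdir(parents=True, exist_ok=True)
            target = target_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved
```

A call such as `archive_old_papers(Path("arxiv-papers"), date.today())` would leave recent files and `papers-latest.json` in place and move everything older into per-month folders.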
research_interests:
  llm_core:
    - "large language model"
    - "GPT"
    - "transformer"
  llm_training:
    - "RLHF"
    - "instruction tuning"
    - "alignment"
  llm_applications:
    - "agent"
    - "chain of thought"
    - "prompt engineering"
filters:
  min_relevance_score: 0.6
  max_papers_per_category: 30
research_interests:
  mlops:
    - "MLOps"
    - "model serving"
    - "feature store"
  production_ml:
    - "model monitoring"
    - "drift detection"
    - "A/B testing"
arxiv_categories:
  - "cs.SE"
  - "cs.LG"
  - "cs.DC"
research_interests:
  vision:
    - "object detection"
    - "segmentation"
    - "diffusion model"
  multimodal:
    - "vision language"
    - "CLIP"
    - "image captioning"
arxiv_categories:
  - "cs.CV"
  - "cs.AI"
If you encounter issues:
- Check Troubleshooting section
- Review workflow logs in Actions tab
- Test locally first
- Check GitHub Actions status: https://www.githubstatus.com/
Happy Researching!
Last updated: 2025-01-17