🚀 arXiv Daily Crawler Setup Guide

Complete guide to setting up automated daily arXiv paper tracking with GitHub Actions.

📋 Table of Contents

Quick Start
Configuration
GitHub Actions Setup
API Keys Setup
Testing Locally
Customization
Troubleshooting

🎯 Quick Start

Step 1: Enable GitHub Actions

Go to your repository: https://github.com/YOUR_USERNAME/my-starred-repos
Click the Actions tab
Enable GitHub Actions if it is not already enabled
Find the workflow: "arXiv Daily Paper Crawler"
Click "Run workflow" to test manually

Step 2: Configure Your Research Interests

Edit arxiv-config.yml:

research_interests:
  ai_ml:
    - "large language model"
    - "GPT"
    - "transformer"
  # Add more keywords for your specific interests

Step 3: Set Up API Keys (Optional but Recommended)

For AI-powered summaries, add API keys as GitHub Secrets:

Go to Settings → Secrets and variables → Actions
Click "New repository secret"
Add one of these:
- OPENAI_API_KEY - for OpenAI (GPT-4, GPT-3.5)
- ANTHROPIC_API_KEY - for Anthropic (Claude)

⚙️ Configuration

Basic Configuration (`arxiv-config.yml`)

# Research Areas - Customize these!
research_interests:
  ai_ml:
    - "large language model"
    - "LLM"
    - "GPT"
    - "transformer"

  reinforcement_learning:
    - "reinforcement learning"
    - "RLHF"
    - "policy gradient"

  mlops:
    - "MLOps"
    - "model serving"
    - "model deployment"

# arXiv categories to monitor
arxiv_categories:
  - "cs.AI"   # Artificial Intelligence
  - "cs.CL"   # Computation and Language
  - "cs.CV"   # Computer Vision
  - "cs.LG"   # Machine Learning

# Filtering
filters:
  max_papers_per_category: 100
  min_relevance_score: 0.5
  days_to_look_back: 1

# AI Summarization
summarization:
  enabled: true
  provider: "openai"  # or "anthropic"
  model: "gpt-4o-mini"
  max_summary_length: 200
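
The guide does not show how `scripts/fetch_arxiv_papers.py` turns `arxiv_categories` and `max_papers_per_category` into requests, so the following is only a sketch of one plausible approach: one query per category against the public arXiv API, newest submissions first. The function name `build_query_url` is hypothetical; the endpoint and the `search_query`, `sortBy`, `sortOrder`, `start`, and `max_results` parameters come from the arXiv API.

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category, max_results=100):
    """Build an arXiv API URL listing the newest submissions in one category.

    `category` is an entry from arxiv_categories (for example "cs.AI") and
    `max_results` plays the role of max_papers_per_category. How the real
    crawler builds its queries is an assumption here.
    """
    params = {
        "search_query": f"cat:{category}",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": 0,
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


# One URL per configured category.
for cat in ["cs.AI", "cs.CL", "cs.CV", "cs.LG"]:
    print(build_query_url(cat, max_results=100))
```

Opening one of the printed URLs in a browser is a quick way to confirm that a category code is valid before adding it to the config.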

Advanced Settings

# Notifications
notifications:
  github_issue:
    enabled: true
    create_daily_issue: true

  email:
    enabled: false
    recipients:
      - "your-email@example.com"

# Output formats
output:
  generate_markdown: true
  generate_json: true
  group_by: "category"  # or "date", "relevance"

🔧 GitHub Actions Setup

Workflow Schedule

The workflow runs automatically:

Daily at 9:00 AM UTC (4:00 AM EST, 5:00 AM EDT)
Can also be triggered manually from the Actions tab

To change the schedule, edit .github/workflows/arxiv-daily-crawler.yml:

on:
  schedule:
    - cron: '0 9 * * *'  # minute hour day-of-month month day-of-week, always in UTC
  workflow_dispatch:
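
Because the cron expression is evaluated in UTC, picking a new time means converting from local time first. A minimal sketch of that arithmetic, assuming a whole-hour UTC offset and a once-a-day schedule (the helper name `daily_cron_for_local_hour` is made up for this example):

```python
def daily_cron_for_local_hour(local_hour, utc_offset_hours):
    """Return a daily cron expression (UTC) for a desired local hour.

    `utc_offset_hours` is the local zone's offset from UTC, for example
    -5 for EST or +9 for JST. Half-hour offsets and daylight saving
    changes are not handled in this sketch.
    """
    if not 0 <= local_hour <= 23:
        raise ValueError("local_hour must be between 0 and 23")
    utc_hour = (local_hour - utc_offset_hours) % 24
    return f"0 {utc_hour} * * *"


print(daily_cron_for_local_hour(4, -5))  # the guide's default: 4 AM EST
print(daily_cron_for_local_hour(8, -5))  # 8 AM EST
```

The result goes into the `cron:` line of the workflow file. Since GitHub does not adjust for daylight saving time, the local run time shifts by an hour when the clocks change.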

Workflow Permissions

The workflow needs these permissions (already configured):

✅ contents: write - to commit new papers
✅ issues: write - to create daily digest issues
✅ pages: write - for GitHub Pages (optional)

🔑 API Keys Setup

Option 1: OpenAI (Recommended)

Cost: ~$0.01-0.10 per day (using GPT-4o-mini)

Get API key: https://platform.openai.com/api-keys
Add to GitHub Secrets as OPENAI_API_KEY
Configure in arxiv-config.yml:

summarization:
  enabled: true
  provider: "openai"
  model: "gpt-4o-mini"  # Cheap and fast

Option 2: Anthropic Claude

Cost: ~$0.01-0.05 per day (using Claude Haiku)

Get API key: https://console.anthropic.com/
Add to GitHub Secrets as ANTHROPIC_API_KEY
Configure in arxiv-config.yml:

summarization:
  enabled: true
  provider: "anthropic"
  model: "claude-3-haiku-20240307"

Option 3: No AI (Free)

Disable AI summaries and use paper abstracts:

summarization:
  enabled: false

🧪 Testing Locally

Install Dependencies

cd my-starred-repos

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements-arxiv.txt

Set API Keys (if using AI)

# Export API key
export OPENAI_API_KEY="your-key-here"
# or
export ANTHROPIC_API_KEY="your-key-here"

Run Crawler

# Fetch papers
python scripts/fetch_arxiv_papers.py

# Generate summary
python scripts/generate_summary.py

# Check results
ls arxiv-papers/
ls daily-summaries/
cat README_ARXIV_DAILY.md

Test Output

You should see:

arxiv-papers/
├── papers-2025-01-17.json
└── papers-latest.json

daily-summaries/
├── summary-2025-01-17.md
└── summary-latest.md

README_ARXIV_DAILY.md

🎨 Customization

1. Change Research Topics

Edit arxiv-config.yml:

research_interests:
  my_custom_topic:
    - "keyword 1"
    - "keyword 2"
    - "keyword 3"

2. Add More arXiv Categories

Full list: https://arxiv.org/category_taxonomy

arxiv_categories:
  - "cs.AI"   # Artificial Intelligence
  - "cs.DB"   # Databases
  - "cs.DC"   # Distributed Computing
  - "cs.SE"   # Software Engineering
  - "math.OC" # Optimization

3. Exclude Papers by Keyword

filters:
  exclude_keywords:
    - "medical imaging"
    - "drug discovery"
    - "protein folding"
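
How the crawler applies `exclude_keywords` is not shown in this guide. One simple interpretation, sketched here as an assumption, is a case-insensitive substring check against the title and abstract (the function name `is_excluded` is hypothetical):

```python
def is_excluded(paper, exclude_keywords):
    """Return True if any excluded keyword appears in the title or abstract.

    `paper` is a dict with "title" and "abstract" keys, as in the JSON
    output described in this guide. Matching is case-insensitive and
    substring-based, which is an assumption about the real script.
    """
    text = f"{paper['title']} {paper['abstract']}".lower()
    return any(keyword.lower() in text for keyword in exclude_keywords)


papers = [
    {"title": "Diffusion Models for Medical Imaging", "abstract": "We segment scans."},
    {"title": "Scaling Laws for Transformers", "abstract": "We study model size."},
]
kept = [p for p in papers if not is_excluded(p, ["medical imaging", "drug discovery"])]
print([p["title"] for p in kept])
```

With substring matching, a short keyword can also match inside longer words, so multi-word phrases such as the ones above tend to be safer exclusions.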

4. Adjust Relevance Threshold

filters:
  min_relevance_score: 0.7  # Higher = more selective
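
To build intuition for what a threshold such as 0.7 means, here is one possible scoring scheme. It is an illustration only: the guide does not document how the real script computes `relevance_score`. In this sketch the score is the best fraction of keywords matched within any single interest group, so it always falls between 0 and 1.

```python
def relevance_score(paper, research_interests):
    """Score a paper between 0.0 and 1.0 against keyword groups.

    For each group in `research_interests`, compute the fraction of its
    keywords found in the title or abstract, then return the highest
    fraction. This formula is an assumption made for illustration.
    """
    text = f"{paper['title']} {paper['abstract']}".lower()
    best = 0.0
    for keywords in research_interests.values():
        if not keywords:
            continue
        hits = sum(1 for keyword in keywords if keyword.lower() in text)
        best = max(best, hits / len(keywords))
    return best


interests = {"ai_ml": ["large language model", "LLM", "GPT", "transformer"]}
paper = {
    "title": "A Transformer-Based Large Language Model",
    "abstract": "We study scaling behaviour.",
}
score = relevance_score(paper, interests)
print(score, score >= 0.7)
```

Under this scheme a paper matching two of four keywords scores 0.5, so it passes the default threshold but not the stricter 0.7.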

5. Change Paper Limit

filters:
  max_papers_per_category: 50  # Reduce for fewer papers
  days_to_look_back: 2  # Look back 2 days
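
The effect of `days_to_look_back` can be pictured as a cutoff on each paper's `published` timestamp. The sketch below assumes timestamps in the ISO format shown under JSON Structure (no timezone suffix, treated as UTC); the function name `is_recent` and the exact cutoff rule are assumptions, not the crawler's documented behavior.

```python
from datetime import datetime, timedelta


def is_recent(published, now, days_to_look_back):
    """Return True if `published` falls within the look-back window.

    `published` is an ISO 8601 string such as "2025-01-17T12:00:00" and
    `now` is a naive datetime in the same (assumed UTC) frame.
    """
    published_at = datetime.fromisoformat(published)
    return now - published_at <= timedelta(days=days_to_look_back)


now = datetime(2025, 1, 18, 9, 0)  # a 9:00 UTC run on the following day
print(is_recent("2025-01-17T12:00:00", now, days_to_look_back=1))
print(is_recent("2025-01-16T08:00:00", now, days_to_look_back=1))
```

Widening the window to 2 or 3 days helps after weekends, when arXiv does not announce new papers.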

📧 Enable Email Notifications

Using GitHub Actions Email

Add this step to .github/workflows/arxiv-daily-crawler.yml (the subject line assumes an earlier step has set DATE in the job environment):

- name: Send Email
  uses: dawidd6/action-send-mail@v3
  with:
    server_address: smtp.gmail.com
    server_port: 465
    username: ${{ secrets.EMAIL_USERNAME }}
    password: ${{ secrets.EMAIL_PASSWORD }}
    subject: 📚 arXiv Daily Digest - ${{ env.DATE }}
    to: your-email@example.com
    from: GitHub Actions
    body: file://daily-summaries/summary-latest.md

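Add these repository secrets:

EMAIL_USERNAME (the sending address)
EMAIL_PASSWORD (for Gmail, an app password rather than the account password)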

🌐 Enable GitHub Pages

Serve Papers as Website

Go to Settings → Pages
Source: Deploy from a branch
Branch: master or gh-pages
Folder: / (root)
Create index.html or use Jekyll theme

The papers will be accessible at: https://YOUR_USERNAME.github.io/my-starred-repos/

📊 Output Files

Generated Files

my-starred-repos/
├── arxiv-papers/
│   ├── papers-2025-01-17.json   # Daily papers (JSON)
│   └── papers-latest.json       # Latest papers
├── daily-summaries/
│   ├── summary-2025-01-17.md    # Daily summary (Markdown)
│   └── summary-latest.md        # Latest summary
├── README_ARXIV_DAILY.md        # Main dashboard
└── arxiv-config.yml             # Configuration

JSON Structure

[
  {
    "id": "2501.12345",
    "title": "Paper Title",
    "authors": ["Author 1", "Author 2"],
    "abstract": "Full abstract...",
    "published": "2025-01-17T12:00:00",
    "categories": ["cs.AI", "cs.LG"],
    "pdf_url": "https://arxiv.org/pdf/2501.12345",
    "arxiv_url": "https://arxiv.org/abs/2501.12345",
    "relevance_score": 0.85,
    "ai_summary": "AI-generated summary..."
  }
]
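
Because the output is plain JSON, it is easy to post-process. As a small example, the sketch below loads a papers file and picks the highest-scoring entries. The helper names `load_papers` and `top_papers` are hypothetical; only the field names and the default path come from this guide.

```python
import json


def load_papers(path="arxiv-papers/papers-latest.json"):
    """Read the list of paper records written by the crawler."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def top_papers(papers, min_score=0.5, limit=5):
    """Return up to `limit` papers with relevance_score >= min_score, best first."""
    selected = [p for p in papers if p.get("relevance_score", 0.0) >= min_score]
    selected.sort(key=lambda p: p.get("relevance_score", 0.0), reverse=True)
    return selected[:limit]


# Inline sample in the same shape as the file, so the sketch runs without it.
sample = json.loads("""
[
  {"id": "a", "title": "First",  "relevance_score": 0.55},
  {"id": "b", "title": "Second", "relevance_score": 0.85},
  {"id": "c", "title": "Third",  "relevance_score": 0.30}
]
""")
for paper in top_papers(sample, min_score=0.5):
    print(f"{paper['relevance_score']:.2f}  {paper['title']}")
```

After a local run, `top_papers(load_papers())` applies the same selection to the real output.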

🐛 Troubleshooting

Issue: Workflow Not Running

Solution:

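Check that GitHub Actions is enabled for the repository
Verify the workflow file is in .github/workflows/
Check that the workflow file is on the default branch (usually main or master); scheduled workflows only run from the default branch
Manually trigger: Actions → workflow → Run workflow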

Issue: No Papers Found

Solution:

Check arxiv-config.yml syntax
Broaden search keywords
Lower min_relevance_score
Increase days_to_look_back
Check arXiv categories are valid

Issue: AI Summaries Not Working

Solution:

Verify API key is set in GitHub Secrets
Check secret name: OPENAI_API_KEY or ANTHROPIC_API_KEY
Verify API key has credits/quota
Check workflow logs for errors
Test locally with export OPENAI_API_KEY=...

Issue: Rate Limiting

Solution:

Reduce max_papers_per_category
Increase rate_limit_delay in config
Reduce number of search queries
Use days_to_look_back: 1 instead of 7
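
The idea behind a delay setting can be sketched as a loop that pauses between requests. This is an illustration under assumptions: `fetch_all` and its parameters are hypothetical names, and the default of 3 seconds reflects the pause that arXiv's API guidance asks clients to leave between calls, not a value taken from this project's config.

```python
import time


def fetch_all(urls, fetch, delay_seconds=3.0, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing between consecutive requests.

    `fetch` is any callable that performs the request. `sleep` is
    injectable so the pacing can be tested without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first request
        results.append(fetch(url))
    return results


# Dry run with a fake fetcher and a recorder in place of real sleeping.
pauses = []
pages = fetch_all(["u1", "u2", "u3"], fetch=str.upper, sleep=pauses.append)
print(pages, pauses)
```

Fewer categories and a smaller `max_papers_per_category` reduce the number of requests, which is why those settings also help with rate limiting.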

Issue: Workflow Fails to Commit

Solution:

Check workflow permissions: Settings → Actions → Workflow permissions
Enable "Read and write permissions"
Verify git config in workflow

Issue: Too Many Papers

Solution:

Increase min_relevance_score (e.g., 0.7)
Reduce max_papers_per_category (e.g., 20)
Add more specific keywords
Use exclude_keywords to filter out unwanted topics

📈 Best Practices

1. Start Conservative

filters:
  max_papers_per_category: 20
  min_relevance_score: 0.7
  days_to_look_back: 1

2. Use Specific Keywords

❌ Bad: "machine learning"
✅ Good: "large language model", "transformer architecture"

3. Monitor Costs

OpenAI GPT-4o-mini: ~$0.01-0.10/day
Anthropic Claude Haiku: ~$0.01-0.05/day
Set budget alerts in API dashboards
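
A rough daily cost can be estimated from the number of papers summarized and the tokens used per paper. Every number in the sketch below is a placeholder: the per-token prices must be taken from the provider's current pricing page, and the token counts per paper are guesses (this guide does not say whether `max_summary_length` is measured in tokens or words).

```python
def estimate_daily_cost(
    papers_per_day,
    input_tokens_per_paper,
    output_tokens_per_paper,
    input_price_per_million,
    output_price_per_million,
):
    """Estimate the daily summarization cost in USD.

    Prices are expressed per one million tokens, which is how LLM
    providers commonly quote them.
    """
    input_cost = papers_per_day * input_tokens_per_paper * input_price_per_million / 1_000_000
    output_cost = papers_per_day * output_tokens_per_paper * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Placeholder figures: 100 papers, ~400 tokens of abstract in, ~200 tokens of summary out.
cost = estimate_daily_cost(100, 400, 200, input_price_per_million=0.15, output_price_per_million=0.60)
print(f"${cost:.3f} per day")
```

Halving `max_papers_per_category` roughly halves the estimate, which makes it the most direct cost lever.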

4. Regular Review

Check weekly digests
Adjust keywords based on results
Update relevance thresholds
Archive old papers monthly

🎯 Example Use Cases

1. LLM Researcher

research_interests:
  llm_core:
    - "large language model"
    - "GPT"
    - "transformer"
  llm_training:
    - "RLHF"
    - "instruction tuning"
    - "alignment"
  llm_applications:
    - "agent"
    - "chain of thought"
    - "prompt engineering"

filters:
  min_relevance_score: 0.6
  max_papers_per_category: 30

2. MLOps Engineer

research_interests:
  mlops:
    - "MLOps"
    - "model serving"
    - "feature store"
  production_ml:
    - "model monitoring"
    - "drift detection"
    - "A/B testing"

arxiv_categories:
  - "cs.SE"
  - "cs.LG"
  - "cs.DC"

3. Computer Vision Researcher

research_interests:
  vision:
    - "object detection"
    - "segmentation"
    - "diffusion model"
  multimodal:
    - "vision language"
    - "CLIP"
    - "image captioning"

arxiv_categories:
  - "cs.CV"
  - "cs.AI"

📚 Additional Resources

🤝 Support

If you encounter issues:

Check the Troubleshooting section above
Review the workflow logs in the Actions tab
Test locally first
Check GitHub Actions status: https://www.githubstatus.com/

Happy Researching! 🚀📚

Last updated: 2025-01-17
