MyArXiv: arXiv astro-ph Daily Digest

A Python tool that automatically checks new arXiv astro-ph preprints, identifies papers relevant to your research interests, highlights papers that cite your work, and sends you a daily email digest.

Features

  • Fetch new papers: Retrieves today's new astro-ph submissions from arXiv RSS
  • Citation detection: Identifies papers that cite any paper in your ADS library
  • Keyword matching: Scores papers based on keyword matches in title and abstract
  • Semantic similarity: Uses sentence-transformers to find semantically related papers
  • Smart ranking: Combines keyword, semantic, and citation signals into a unified score
  • Email digest: Sends a nicely formatted HTML email with paper summaries
  • Deduplication: Tracks emailed papers to avoid duplicates
  • Caching: Caches your ADS library bibcodes to reduce API calls

Installation

Prerequisites

  • Python 3.11+
  • NASA ADS API token (available from your ADS account settings)
  • A public ADS library containing your papers
  • SMTP email credentials (e.g., Gmail with app password)

Setup

  1. Clone or download this repository:

     ```sh
     cd /path/to/myarxiv
     ```

  2. Create a virtual environment (recommended):

     ```sh
     python3 -m venv venv
     source venv/bin/activate  # On Windows: venv\Scripts\activate
     ```

  3. Install dependencies:

     ```sh
     pip install -r requirements.txt
     ```

     Note: The sentence-transformers package will download a model (~90MB) on first run.

  4. Configure environment variables:

     ```sh
     cp .env.example .env
     # Edit .env with your settings
     ```

Configuration

All configuration is done via environment variables. You can set them in a .env file in the project directory.

Required Settings

| Variable | Description |
|---|---|
| KEYWORDS | Comma-separated keywords/phrases to match |
| INTEREST_PROFILE | Paragraph describing your research interests |
| ADS_API_TOKEN | Your NASA ADS API token |
| ADS_LIBRARY_URL | URL to your public ADS library (or use ADS_LIBRARY_ID) |
| RECIPIENT_EMAIL | Email address to receive the digest |
| SMTP_HOST | SMTP server hostname |
| SMTP_PORT | SMTP server port (usually 587 for TLS) |
| SMTP_USER | SMTP username |
| SMTP_PASS | SMTP password (use an app-specific password for Gmail) |
| SMTP_FROM | From address for emails |

Optional Settings

| Variable | Default | Description |
|---|---|---|
| TIMEZONE | America/New_York | Timezone used to determine "today" |
| MAX_RESULTS_PER_DAY | 200 | Maximum papers to fetch |
| W_KEYWORD | 0.6 | Weight for the keyword score |
| W_SEMANTIC | 0.4 | Weight for the semantic score |
| W_CITATION_BOOST | 0.25 | Additive boost for citing papers |
| MIN_FINAL_SCORE | 0.35 | Minimum score threshold |
| SEMANTIC_MODE | local | local or off |
| STATE_PATH | ./state.json | Path to the state file |
| LIBRARY_REFRESH_DAYS | 7 | Days between library cache refreshes |
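For reference, a complete .env might look like the following. All values here are placeholders; substitute your own keywords, token, library URL, and mail settings:

```
KEYWORDS=stellar evolution,magnetic fields,asteroseismology
INTEREST_PROFILE=I study the structure and evolution of stars, with a focus on magnetism and convection.
ADS_API_TOKEN=your-ads-api-token
ADS_LIBRARY_URL=https://ui.adsabs.harvard.edu/public-libraries/your-library-id
RECIPIENT_EMAIL=you@example.com
SMTP_HOST=smtp.gmail.com
SMTP_PORT=587
SMTP_USER=you@gmail.com
SMTP_PASS=your-app-password
SMTP_FROM=you@gmail.com
```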

Gmail Setup

For Gmail, you need to use an App Password:

  1. Enable 2-factor authentication on your Google account
  2. Go to Google App Passwords
  3. Create a new app password for "Mail"
  4. Use that password as SMTP_PASS
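Under the hood, sending with these credentials follows the standard smtplib STARTTLS flow. A minimal sketch of that flow is shown below; the function and variable names are illustrative, not the tool's actual API:

```python
import smtplib
from email.message import EmailMessage


def build_digest(sender, recipient, html_body):
    """Build an HTML digest message with a plain-text fallback."""
    msg = EmailMessage()
    msg["Subject"] = "arXiv astro-ph digest"
    msg["From"] = sender
    msg["To"] = recipient
    # Plain-text part first, then the HTML alternative
    msg.set_content("Your mail client does not support HTML.")
    msg.add_alternative(html_body, subtype="html")
    return msg


def send_digest(msg, host, port, user, password):
    """Send the message over STARTTLS, authenticating with the app password."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)
```

With Gmail, `host` would be smtp.gmail.com, `port` 587, and `password` the app password created above.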

Usage

Basic Usage

```sh
# Run the digest (fetches papers, scores, sends email)
python main.py

# Dry run (prints to stdout instead of emailing)
python main.py --dry-run

# Verbose output
python main.py --dry-run --verbose
```

Command Line Options

```text
Usage: main.py [OPTIONS]

Options:
  --dry-run              Print digest instead of sending email
  --refresh-library      Force refresh of ADS library cache
  --since YYYY-MM-DD     Process papers since date (limited by arXiv)
  -v, --verbose          Enable verbose/debug logging
  --env-file PATH        Path to .env file
  --help                 Show this message and exit.
```

Examples

```sh
# First run - test with dry-run
python main.py --dry-run --verbose

# Force refresh your library cache
python main.py --refresh-library --dry-run

# Use a different .env file
python main.py --env-file /path/to/production.env
```

Scheduling

Using cron (Linux/macOS)

Add to your crontab (crontab -e):

```sh
# Run daily at 8 AM
0 8 * * * cd /path/to/myarxiv && /path/to/venv/bin/python main.py >> /var/log/arxiv-digest.log 2>&1
```

Using systemd timer (Linux)

Create /etc/systemd/system/arxiv-digest.service:

```ini
[Unit]
Description=arXiv Digest
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=youruser
WorkingDirectory=/path/to/myarxiv
Environment=PATH=/path/to/myarxiv/venv/bin
ExecStart=/path/to/myarxiv/venv/bin/python main.py
StandardOutput=journal
StandardError=journal
```

Create /etc/systemd/system/arxiv-digest.timer:

```ini
[Unit]
Description=Run arXiv Digest daily

[Timer]
OnCalendar=*-*-* 08:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable and start:

```sh
sudo systemctl daemon-reload
sudo systemctl enable arxiv-digest.timer
sudo systemctl start arxiv-digest.timer

# Check status
systemctl list-timers arxiv-digest.timer
journalctl -u arxiv-digest.service
```

Using launchd (macOS)

Create ~/Library/LaunchAgents/com.arxiv.digest.plist:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.arxiv.digest</string>
    <key>ProgramArguments</key>
    <array>
        <string>/path/to/myarxiv/venv/bin/python</string>
        <string>/path/to/myarxiv/main.py</string>
    </array>
    <key>WorkingDirectory</key>
    <string>/path/to/myarxiv</string>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>8</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
    <key>StandardOutPath</key>
    <string>/tmp/arxiv-digest.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/arxiv-digest.err</string>
</dict>
</plist>
```

Load the agent:

```sh
launchctl load ~/Library/LaunchAgents/com.arxiv.digest.plist
```

How Scoring Works

Keyword Score (default 60% weight)

  • Matches keywords/phrases in title and abstract
  • Title matches weighted 2x vs abstract
  • Multiple matches of same keyword capped
  • Uses IDF weighting (rare keywords score higher)
  • Diversity bonus for matching multiple different keywords
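The mechanics above can be sketched in a few lines. This is a simplified illustration under the stated rules (title 2x, one capped hit per keyword, IDF weighting, diversity bonus); the actual weights, cap, and normalization in scoring.py may differ:

```python
import math


def keyword_score(title, abstract, keywords, doc_freq, n_docs):
    """Simplified keyword score: IDF-weighted matches with a diversity bonus."""
    title_l, abstract_l = title.lower(), abstract.lower()
    score, matched = 0.0, 0
    for kw in keywords:
        kw_l = kw.lower()
        # Title matches weighted 2x; repeat matches of one keyword count once (capped)
        hit = 2.0 if kw_l in title_l else (1.0 if kw_l in abstract_l else 0.0)
        if hit:
            # Rare keywords (low document frequency) score higher
            idf = math.log((n_docs + 1) / (doc_freq.get(kw_l, 0) + 1)) + 1.0
            score += hit * idf
            matched += 1
    if matched > 1:
        score *= 1.0 + 0.1 * (matched - 1)  # diversity bonus
    return min(score / (4.0 * len(keywords)), 1.0)  # rough normalization to [0, 1]
```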

Semantic Score (default 40% weight)

  • Uses all-MiniLM-L6-v2 sentence transformer model
  • Computes cosine similarity between paper and your interest profile
  • Can be disabled with SEMANTIC_MODE=off
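Concretely, the semantic score reduces to a cosine similarity between two embedding vectors. A minimal sketch follows, with toy 3-d vectors standing in for the 384-dimensional all-MiniLM-L6-v2 embeddings that model.encode() would produce:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors; in practice both come from the sentence-transformer model
profile_vec = [0.2, 0.7, 0.1]
paper_vec = [0.25, 0.65, 0.05]
similarity = cosine_similarity(profile_vec, paper_vec)  # close to 1 for similar texts
```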

Citation Boost (default +0.25)

  • Additive boost applied if paper cites any paper in your ADS library
  • Papers citing your work are always included (even below score threshold)

Final Score

```python
final_score = W_KEYWORD * keyword_score + W_SEMANTIC * semantic_score
if cites_my_work:
    final_score += W_CITATION_BOOST
final_score = min(final_score, 1.0)
```
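As a worked example with the default weights, a paper with keyword score 0.5 and semantic score 0.7 that also cites your work gets 0.6 × 0.5 + 0.4 × 0.7 + 0.25 = 0.83:

```python
W_KEYWORD, W_SEMANTIC, W_CITATION_BOOST = 0.6, 0.4, 0.25  # defaults


def final_score(keyword_score, semantic_score, cites_my_work):
    """Combine keyword and semantic scores, add the citation boost, cap at 1.0."""
    score = W_KEYWORD * keyword_score + W_SEMANTIC * semantic_score
    if cites_my_work:
        score += W_CITATION_BOOST
    return min(score, 1.0)


print(round(final_score(0.5, 0.7, True), 2))  # 0.83
```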

State Management

The state.json file tracks:

  • arXiv IDs of papers already emailed (prevents duplicates)
  • Cached ADS library bibcodes (reduces API calls)
  • Last refresh date for library cache

The state file is automatically pruned to keep the last 10,000 paper IDs.
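The pruning step amounts to keeping only the most recent 10,000 IDs. A sketch is shown below; the emailed_ids field name is illustrative, and the actual state.json layout may differ:

```python
import json


def prune_state(path, keep=10_000):
    """Keep only the most recent `keep` emailed paper IDs in the state file."""
    with open(path) as f:
        state = json.load(f)
    # Python's negative slicing keeps the last `keep` entries
    state["emailed_ids"] = state["emailed_ids"][-keep:]
    with open(path, "w") as f:
        json.dump(state, f, indent=2)
```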

Testing

Run the tests:

```sh
pytest tests/ -v
```

Troubleshooting

No papers found

  • arXiv does not announce new submissions on weekends or some holidays
  • Your MIN_FINAL_SCORE threshold may be too high for your keywords
  • Check that TIMEZONE is set correctly

ADS API errors

  • Verify your API token is valid
  • Check rate limits (5000 requests/day)
  • Ensure your library is public

Email not sending

  • Verify SMTP credentials
  • For Gmail, ensure you're using an App Password
  • Check spam folder

Semantic scoring slow

  • First run downloads the model (~90MB)
  • Use SEMANTIC_MODE=off to disable if not needed

Memory issues

  • Reduce MAX_RESULTS_PER_DAY
  • Disable semantic mode

Project Structure

```text
myarxiv/
├── main.py              # CLI entry point
├── src/
│   ├── __init__.py
│   ├── config.py        # Configuration management
│   ├── arxiv_fetcher.py # arXiv RSS/API fetching
│   ├── ads_client.py    # NASA ADS API client
│   ├── scoring.py       # Keyword and semantic scoring
│   ├── email_sender.py  # Email formatting and sending
│   └── state.py         # State persistence
├── tests/
│   ├── __init__.py
│   └── test_scoring.py  # Scoring tests
├── requirements.txt
├── .env.example
└── README.md
```

License

MIT License - feel free to use and modify.
