# myarxiv

A Python tool that automatically checks new arXiv astro-ph preprints, identifies papers relevant to your research interests, highlights papers that cite your work, and sends you a daily email digest.
## Features

- Fetch new papers: Retrieves today's new astro-ph submissions from the arXiv RSS feed
- Citation detection: Identifies papers that cite any paper in your ADS library
- Keyword matching: Scores papers based on keyword matches in title and abstract
- Semantic similarity: Uses sentence-transformers to find semantically related papers
- Smart ranking: Combines keyword, semantic, and citation signals into a unified score
- Email digest: Sends a nicely formatted HTML email with paper summaries
- Deduplication: Tracks emailed papers to avoid duplicates
- Caching: Caches your ADS library bibcodes to reduce API calls
## Requirements

- Python 3.11+
- NASA ADS API token (create one in your ADS account settings)
- A public ADS library containing your papers
- SMTP email credentials (e.g., Gmail with an app password)
## Installation

- Clone or download this repository:

  ```bash
  cd /path/to/myarxiv
  ```

- Create a virtual environment (recommended):

  ```bash
  python3 -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

  Note: The sentence-transformers package will download a model (~90MB) on first run.

- Configure environment variables:

  ```bash
  cp .env.example .env
  # Edit .env with your settings
  ```

## Configuration

All configuration is done via environment variables. You can set them in a `.env` file in the project directory.
Required variables:

| Variable | Description |
|---|---|
| `KEYWORDS` | Comma-separated keywords/phrases to match |
| `INTEREST_PROFILE` | Paragraph describing your research interests |
| `ADS_API_TOKEN` | Your NASA ADS API token |
| `ADS_LIBRARY_URL` | URL to your public ADS library (or use `ADS_LIBRARY_ID`) |
| `RECIPIENT_EMAIL` | Email address to receive the digest |
| `SMTP_HOST` | SMTP server hostname |
| `SMTP_PORT` | SMTP server port (usually 587 for TLS) |
| `SMTP_USER` | SMTP username |
| `SMTP_PASS` | SMTP password (use an app-specific password for Gmail) |
| `SMTP_FROM` | From address for emails |
Optional variables:

| Variable | Default | Description |
|---|---|---|
| `TIMEZONE` | `America/New_York` | Timezone for "today" |
| `MAX_RESULTS_PER_DAY` | `200` | Max papers to fetch |
| `W_KEYWORD` | `0.6` | Weight for keyword score |
| `W_SEMANTIC` | `0.4` | Weight for semantic score |
| `W_CITATION_BOOST` | `0.25` | Additive boost for citing papers |
| `MIN_FINAL_SCORE` | `0.35` | Minimum score threshold |
| `SEMANTIC_MODE` | `local` | `local` or `off` |
| `STATE_PATH` | `./state.json` | Path to state file |
| `LIBRARY_REFRESH_DAYS` | `7` | Days between library cache refreshes |
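For illustration, the weight variables above might be read like this (a hypothetical helper shown here for clarity; the actual loader lives in `src/config.py`):

```python
import os

def load_weights():
    """Read scoring weights from environment variables, falling back
    to the documented defaults (illustrative helper; the real loader
    is in src/config.py)."""
    return {
        "w_keyword": float(os.getenv("W_KEYWORD", "0.6")),
        "w_semantic": float(os.getenv("W_SEMANTIC", "0.4")),
        "w_citation_boost": float(os.getenv("W_CITATION_BOOST", "0.25")),
        "min_final_score": float(os.getenv("MIN_FINAL_SCORE", "0.35")),
    }

weights = load_weights()
print(weights["w_keyword"])  # 0.6 unless W_KEYWORD is set in the environment
```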
### Gmail App Passwords

For Gmail, you need to use an App Password:

- Enable 2-factor authentication on your Google account
- Go to Google App Passwords
- Create a new app password for "Mail"
- Use that password as `SMTP_PASS`
## Usage

```bash
# Run the digest (fetches papers, scores, sends email)
python main.py

# Dry run (prints to stdout instead of emailing)
python main.py --dry-run

# Verbose output
python main.py --dry-run --verbose
```

```text
Usage: main.py [OPTIONS]

Options:
  --dry-run            Print digest instead of sending email
  --refresh-library    Force refresh of ADS library cache
  --since YYYY-MM-DD   Process papers since date (limited by arXiv)
  -v, --verbose        Enable verbose/debug logging
  --env-file PATH      Path to .env file
  --help               Show this message and exit.
```
```bash
# First run - test with dry-run
python main.py --dry-run --verbose

# Force refresh your library cache
python main.py --refresh-library --dry-run

# Use a different .env file
python main.py --env-file /path/to/production.env
```

## Scheduling

### cron

Add to your crontab (`crontab -e`):

```bash
# Run daily at 8 AM
0 8 * * * cd /path/to/myarxiv && /path/to/venv/bin/python main.py >> /var/log/arxiv-digest.log 2>&1
```

### systemd

Create `/etc/systemd/system/arxiv-digest.service`:
```ini
[Unit]
Description=arXiv Digest
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=youruser
WorkingDirectory=/path/to/myarxiv
Environment=PATH=/path/to/myarxiv/venv/bin
ExecStart=/path/to/myarxiv/venv/bin/python main.py
StandardOutput=journal
StandardError=journal
```

Create `/etc/systemd/system/arxiv-digest.timer`:
```ini
[Unit]
Description=Run arXiv Digest daily

[Timer]
OnCalendar=*-*-* 08:00:00
Persistent=true

[Install]
WantedBy=timers.target
```

Enable and start:
```bash
sudo systemctl daemon-reload
sudo systemctl enable arxiv-digest.timer
sudo systemctl start arxiv-digest.timer

# Check status
systemctl list-timers arxiv-digest.timer
journalctl -u arxiv-digest.service
```

### launchd (macOS)

Create `~/Library/LaunchAgents/com.arxiv.digest.plist`:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.arxiv.digest</string>
    <key>ProgramArguments</key>
    <array>
        <string>/path/to/myarxiv/venv/bin/python</string>
        <string>/path/to/myarxiv/main.py</string>
    </array>
    <key>WorkingDirectory</key>
    <string>/path/to/myarxiv</string>
    <key>StartCalendarInterval</key>
    <dict>
        <key>Hour</key>
        <integer>8</integer>
        <key>Minute</key>
        <integer>0</integer>
    </dict>
    <key>StandardOutPath</key>
    <string>/tmp/arxiv-digest.log</string>
    <key>StandardErrorPath</key>
    <string>/tmp/arxiv-digest.err</string>
</dict>
</plist>
```

Load the agent:
```bash
launchctl load ~/Library/LaunchAgents/com.arxiv.digest.plist
```

## How Scoring Works

### Keyword Scoring

- Matches keywords/phrases in title and abstract
- Title matches are weighted 2x relative to abstract matches
- Multiple matches of the same keyword are capped
- Uses IDF weighting (rare keywords score higher)
- Diversity bonus for matching multiple different keywords
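A toy sketch of those ideas (the real implementation lives in `src/scoring.py`; the function name, cap, and bonus values here are illustrative, not the project's actual constants):

```python
import math

def keyword_score(title, abstract, keywords, doc_freq, n_docs,
                  title_weight=2.0, per_kw_cap=3, diversity_bonus=0.1):
    """Illustrative keyword scorer combining the ideas listed above:
    title hits count title_weight times as much as abstract hits,
    hits per keyword are capped, each keyword is scaled by a smoothed
    IDF so rare terms score higher, and a small bonus rewards matching
    several distinct keywords."""
    title_l, abstract_l = title.lower(), abstract.lower()
    score, matched = 0.0, 0
    for kw in keywords:
        kw_l = kw.lower()
        hits = title_weight * title_l.count(kw_l) + abstract_l.count(kw_l)
        if hits == 0:
            continue
        matched += 1
        # Smoothed IDF; unseen keywords are treated as rare (df = 1).
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(kw, 1)))
        score += min(hits, per_kw_cap) * idf
    if matched > 1:
        score *= 1 + diversity_bonus * (matched - 1)
    return score
```

With the same number of hits, a keyword appearing in 5 of 1000 indexed abstracts outscores one appearing in 500, and a paper matching no keyword scores exactly 0.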
### Semantic Scoring

- Uses the `all-MiniLM-L6-v2` sentence-transformer model
- Computes cosine similarity between the paper and your interest profile
- Can be disabled with `SEMANTIC_MODE=off`
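The similarity computation itself is just a cosine between two embedding vectors. A dependency-free sketch (in the real pipeline the vectors come from `all-MiniLM-L6-v2` via sentence-transformers, which is only shown in comments here):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# With sentence-transformers (downloads the model on first use):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# profile_vec, paper_vec = model.encode([interest_profile, title + abstract])

# Parallel vectors score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0
```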
### Citation Boost

- Additive boost applied if a paper cites any paper in your ADS library
- Papers citing your work are always included (even below the score threshold)
### Final Score

```python
final_score = W_KEYWORD * keyword_score + W_SEMANTIC * semantic_score
if cites_my_work:
    final_score += W_CITATION_BOOST
final_score = min(final_score, 1.0)
```
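A worked example with the default weights (the component scores 0.5 and 0.7 are hypothetical inputs, chosen only to show the arithmetic):

```python
W_KEYWORD, W_SEMANTIC, W_CITATION_BOOST = 0.6, 0.4, 0.25  # documented defaults

def final_score(keyword_score, semantic_score, cites_my_work):
    """Combine the signals exactly as in the formula above."""
    score = W_KEYWORD * keyword_score + W_SEMANTIC * semantic_score
    if cites_my_work:
        score += W_CITATION_BOOST
    return min(score, 1.0)

# keyword_score=0.5, semantic_score=0.7, and the paper cites your work:
# 0.6*0.5 + 0.4*0.7 + 0.25 = 0.30 + 0.28 + 0.25 = 0.83
print(round(final_score(0.5, 0.7, True), 2))  # 0.83
```

With the default `MIN_FINAL_SCORE=0.35`, this paper would clear the threshold even without the citation boost (0.58).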
## State

The `state.json` file tracks:
- arXiv IDs of papers already emailed (prevents duplicates)
- Cached ADS library bibcodes (reduces API calls)
- Last refresh date for library cache
The state file is automatically pruned to keep the last 10,000 paper IDs.
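The dedupe-and-prune step can be sketched like this (an illustrative helper; the real logic lives in `src/state.py` and its exact file schema may differ):

```python
import json
from pathlib import Path

MAX_IDS = 10_000  # keep only the most recent 10,000 paper IDs

def mark_emailed(state_path, new_ids):
    """Append newly emailed arXiv IDs to the state file, skipping IDs
    already recorded, and prune the list to the newest MAX_IDS entries
    (illustrative sketch of the behavior described above)."""
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {"emailed_ids": []}
    seen = state["emailed_ids"]
    seen.extend(i for i in new_ids if i not in seen)  # dedupe on insert
    state["emailed_ids"] = seen[-MAX_IDS:]            # drop the oldest IDs
    path.write_text(json.dumps(state))
    return state["emailed_ids"]
```

An ID that was already emailed is never re-added, so a paper appearing in two consecutive RSS pulls is only digested once.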
## Testing

Run the tests:

```bash
pytest tests/ -v
```

## Troubleshooting

**No papers fetched:**

- Check that arXiv RSS is accessible: https://rss.arxiv.org/rss/astro-ph
- New papers are announced around 8 PM ET / midnight UTC
**ADS errors:**

- Verify your API token is valid
- Check rate limits (5000 requests/day)
- Ensure your library is public
**Email not arriving:**

- Verify SMTP credentials
- For Gmail, ensure you're using an App Password
- Check your spam folder
**Model download:**

- First run downloads the model (~90MB)
- Use `SEMANTIC_MODE=off` to disable if not needed

**Slow runs:**

- Reduce `MAX_RESULTS_PER_DAY`
- Disable semantic mode
## Project Structure

```
myarxiv/
├── main.py               # CLI entry point
├── src/
│   ├── __init__.py
│   ├── config.py         # Configuration management
│   ├── arxiv_fetcher.py  # arXiv RSS/API fetching
│   ├── ads_client.py     # NASA ADS API client
│   ├── scoring.py        # Keyword and semantic scoring
│   ├── email_sender.py   # Email formatting and sending
│   └── state.py          # State persistence
├── tests/
│   ├── __init__.py
│   └── test_scoring.py   # Scoring tests
├── requirements.txt
├── .env.example
└── README.md
```
## License

MIT License - feel free to use and modify.
## Acknowledgments

- arXiv for providing open access to preprints
- NASA ADS for the bibliography API
- Sentence Transformers for semantic embeddings