A comprehensive command-line interface (CLI) tool and API server for downloading, processing, and managing openRxiv MECA (Manuscript Exchange Common Approach) files from AWS S3. This project bridges the gap between bioRxiv/medRxiv's S3 storage and researchers who need programmatic access to preprint data.
bioRxiv and medRxiv provide MECA files in S3 buckets, but there's no official API to:
- Look up where a specific paper is stored in S3 given its DOI
- Download individual papers without traversing the entire bucket
- Access metadata without downloading the full MECA file (6-10 MB for the archive vs 2-3 KB for the XML)
This project provides:
- CLI Tool - Command-line access to bioRxiv/medRxiv data
- Metadata API - DOI → S3 location lookups
- Batch Processing - Efficient bulk data extraction and upload
Get detailed information about preprints without downloading files.
# Basic paper information
openrxiv summary "10.1101/2024.05.08.593085"
# Full abstract and details
openrxiv summary -m "10.1101/2024.05.08.593085"
# Try medRxiv if not found on bioRxiv
openrxiv summary -s medrxiv "10.1101/2020.03.19.20039131"
Download specific MECA files by DOI using the metadata API.
# Download by DOI (requires API lookup)
openrxiv --requester-pays download "10.1101/2024.05.08.593085"
# Custom output directory
openrxiv --requester-pays download "10.1101/2024.05.08.593085" --output "./papers"
Why it exists: Researchers need individual papers, not entire months of data. The API integration means you can download a specific paper without knowing its S3 location.
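The lookup-then-download flow can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CLI's own implementation: the helper names `build_get_object_params` and `download_meca` are hypothetical, the record fields `s3Bucket` and `s3Key` are taken from the example API response in this README, and the download step assumes the third-party `boto3` library with AWS credentials already configured.

```python
from pathlib import Path


def build_get_object_params(version: dict) -> dict:
    """Build S3 GetObject parameters from one version record of the metadata API."""
    return {
        "Bucket": version["s3Bucket"],
        "Key": version["s3Key"],
        # Mirrors the CLI's --requester-pays flag: the caller accepts the transfer cost.
        "RequestPayer": "requester",
    }


def download_meca(version: dict, output_dir: str = ".") -> Path:
    """Download one MECA file into output_dir (hypothetical helper, needs boto3)."""
    import boto3  # third-party; imported lazily so the helper above works without it

    params = build_get_object_params(version)
    target = Path(output_dir) / Path(params["Key"]).name
    target.parent.mkdir(parents=True, exist_ok=True)
    response = boto3.client("s3").get_object(**params)
    with open(target, "wb") as fh:
        # Stream the body in chunks instead of holding the whole archive in memory.
        for chunk in response["Body"].iter_chunks():
            fh.write(chunk)
    return target
```

Keeping the parameter construction separate from the network call makes the requester-pays setting easy to inspect and test without touching S3.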
Explore what's available in the S3 buckets with intelligent filtering.
# See recent content
openrxiv list
# Filter by month
openrxiv list --month "2024-01"
# Filter by batch (for historical data)
openrxiv list --batch "Batch_01"
# Explore medRxiv content
openrxiv list --server medrxiv --limit 100
Why it exists: Researchers need to understand what data is available before planning downloads. This provides a window into the S3 bucket structure without full traversal.
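As a rough illustration of how a month filter could map onto the bucket layout, the sketch below converts a `YYYY-MM` value into a folder prefix. It assumes every monthly folder follows the `Current_Content/<MonthName>_<Year>/` pattern seen in the example `s3Key` in this README; the function name `month_to_prefix` is hypothetical, and historical batch folders are not covered.

```python
MONTH_NAMES = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def month_to_prefix(month: str) -> str:
    """Turn 'YYYY-MM' into a prefix such as 'Current_Content/January_2024/'."""
    year_text, month_text = month.split("-")  # ValueError if the shape is wrong
    year, month_number = int(year_text), int(month_text)
    if not 1 <= month_number <= 12:
        raise ValueError(f"invalid month: {month!r}")
    return f"Current_Content/{MONTH_NAMES[month_number - 1]}_{year}/"


print(month_to_prefix("2024-01"))  # Current_Content/January_2024/
```

A prefix like this can be passed to an S3 listing call so that only one folder is enumerated rather than the whole bucket.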
The project includes a lightweight API server that serves as the bridge between DOIs and S3 locations. An instance of the API server is at:
Purpose: Look up paper metadata and S3 location by DOI
Example Response:
{
"doi": "10.1101/2024.01.25.577295",
"versions": [
{
"id": "cmedr9nx800i0ii04o4nk4bdy",
"doi": "10.1101/2024.01.25.577295",
"version": 1,
"title": "Spyglass: a data analysis framework for reproducible and shareable neuroscience research",
"receivedDate": "2024-01-25T00:00:00.000Z",
"acceptedDate": "2024-01-26T00:00:00.000Z",
"server": "biorxiv",
"s3Bucket": "biorxiv-src-monthly",
"s3Key": "Current_Content/January_2024/a765f23d-6f3e-1014-a187-cd164f93e87a.meca",
"fileSize": 6147995,
"links": {
"self": "https://openrxiv.csf.now/v1/works/10.1101/2024.01.25.577295v1",
"html": "https://www.biorxiv.org/content/10.1101/2024.01.25.577295v1.full",
"pdf": "https://www.biorxiv.org/content/10.1101/2024.01.25.577295v1.full.pdf"
}
}
]
}
Why this exists: This maps preprint versions to S3 locations, enabling direct access to specific papers without bucket traversal.
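A client consuming this response might pick the newest version and build an S3 URI from it. The following is a small sketch based only on the fields shown in the example response; the helper names `latest_version` and `s3_uri` are hypothetical, and the embedded JSON is a trimmed copy of the example.

```python
import json


def latest_version(work: dict) -> dict:
    """Return the version record with the highest version number."""
    versions = work.get("versions", [])
    if not versions:
        raise LookupError(f"no versions listed for {work.get('doi')}")
    return max(versions, key=lambda v: v["version"])


def s3_uri(version: dict) -> str:
    """Combine bucket and key into an s3:// URI."""
    return f"s3://{version['s3Bucket']}/{version['s3Key']}"


# Trimmed copy of the example response above.
response_text = """
{
  "doi": "10.1101/2024.01.25.577295",
  "versions": [
    {
      "version": 1,
      "server": "biorxiv",
      "s3Bucket": "biorxiv-src-monthly",
      "s3Key": "Current_Content/January_2024/a765f23d-6f3e-1014-a187-cd164f93e87a.meca",
      "fileSize": 6147995
    }
  ]
}
"""

work = json.loads(response_text)
print(s3_uri(latest_version(work)))
# s3://biorxiv-src-monthly/Current_Content/January_2024/a765f23d-6f3e-1014-a187-cd164f93e87a.meca
```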
Purpose: Upload paper metadata during batch processing
Why this exists: Batch processing extracts metadata from thousands of MECA files and needs to store it efficiently. This endpoint populates the database that powers the DOI lookups.
Purpose: Remove papers from the database
Why this exists: Papers can be updated, retracted, or moved. This endpoint maintains data integrity.
- GET /health - API health check
- GET / - API information and available endpoints
S3 Buckets → CLI Batch Processing → API Database → CLI Commands
- S3 Buckets: bioRxiv and medRxiv store MECA files in organized folders
- Batch Processing: CLI downloads and processes MECA files, extracting metadata
- API Database: Metadata is stored with S3 location information
- CLI Commands: Use the API to look up papers and download them efficiently
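The metadata-extraction step in this flow can be sketched as follows. MECA packages are ZIP archives; the sketch assumes the article's JATS XML is a `.xml` file under a `content/` folder inside the archive, which may not hold for every file. The function names and the shape of the returned record are hypothetical, loosely modeled on the fields in the example API response.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def find_article_xml(archive: zipfile.ZipFile) -> str:
    """Return the first .xml member under content/ (assumed MECA layout)."""
    for name in archive.namelist():
        if name.startswith("content/") and name.endswith(".xml"):
            return name
    raise FileNotFoundError("no article XML found under content/")


def extract_metadata(meca_bytes: bytes, s3_bucket: str, s3_key: str) -> dict:
    """Read the article XML from a MECA archive and pair it with its S3 location."""
    with zipfile.ZipFile(io.BytesIO(meca_bytes)) as archive:
        root = ET.fromstring(archive.read(find_article_xml(archive)))
    doi = root.findtext(".//article-id[@pub-id-type='doi']")
    title_el = root.find(".//article-title")
    # itertext() keeps text nested inside inline markup such as <italic>.
    title = "".join(title_el.itertext()).strip() if title_el is not None else None
    return {
        "doi": doi,
        "title": title,
        "s3Bucket": s3_bucket,
        "s3Key": s3_key,
        "fileSize": len(meca_bytes),
    }
```

Records of this shape are what a batch run would upload, so that later lookups only need the small metadata record rather than the full archive.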
# Discover a paper
openrxiv summary "10.1101/2024.05.08.593085"
# Download it for analysis
openrxiv --requester-pays download "10.1101/2024.05.08.593085"
# See what's available this month
openrxiv list --month "2024-01" --limit 100
# Process all papers from January
openrxiv batch-process --month "2024-01" --concurrency 10
# Explore historical data
openrxiv list --batch "1-53" --server medrxiv
# Batch process multiple months
openrxiv batch-process --month "2024-01,2024-02,2024-03" --concurrency 20
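To make the effect of a concurrency limit concrete, here is a generic Python sketch of bounded parallel processing over a comma-separated month list. It is not how the CLI is implemented; `parse_months` and `process_all` are hypothetical names, and the worker is a stand-in for whatever per-item processing is needed.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_months(spec: str) -> list[str]:
    """Split a comma-separated month list such as '2024-01,2024-02'."""
    return [part.strip() for part in spec.split(",") if part.strip()]


def process_all(items, worker, concurrency: int = 10) -> list:
    """Apply worker to every item with at most `concurrency` calls in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(worker, items))  # results keep the input order


months = parse_months("2024-01,2024-02,2024-03")
print(process_all(months, lambda m: f"processed {m}", concurrency=2))
```

A higher limit shortens wall-clock time for I/O-bound work such as S3 downloads, at the cost of more simultaneous connections and memory.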
- Commands: summary, download, list, batch-info, batch-process
- AWS Integration: S3 access with requester-pays support
- API Client: Integration with the metadata API
- Processing: MECA file extraction and XML parsing
- Database: Prisma with PostgreSQL
- Endpoints: Work lookup, creation, and deletion
- Authentication: API key-based access control
- Validation: Comprehensive input validation
- DOI Parsing: Handle bioRxiv's complex DOI format
- Folder Structure: Navigate S3 bucket organization
- XML Processing: Robust handling of bioRxiv XML files
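As an illustration of the DOI parsing concern, the sketch below handles only the date-style DOIs that appear in this README (for example `10.1101/2024.05.08.593085`), plus an optional `vN` version suffix as seen in the example API links. The pattern, the `ParsedDoi` type, and `parse_doi` are hypothetical; other DOI shapes the project may support are deliberately rejected here.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape: 10.1101/YYYY.MM.DD.<digits>, optionally followed by v<version>.
DOI_PATTERN = re.compile(
    r"^(?P<prefix>10\.1101)/"
    r"(?P<date>\d{4}\.\d{2}\.\d{2})\."
    r"(?P<suffix>\d+)"
    r"(?:v(?P<version>\d+))?$"
)


class ParsedDoi(NamedTuple):
    doi: str                # DOI without the version suffix
    date: str               # date component, rewritten as YYYY-MM-DD
    suffix: str             # trailing numeric identifier
    version: Optional[int]  # None when no vN suffix was given


def parse_doi(text: str) -> ParsedDoi:
    match = DOI_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized DOI: {text!r}")
    doi = f"{match['prefix']}/{match['date']}.{match['suffix']}"
    version = int(match["version"]) if match["version"] else None
    return ParsedDoi(doi, match["date"].replace(".", "-"), match["suffix"], version)


print(parse_doi("10.1101/2024.01.25.577295v1"))
```

Separating the version suffix from the base DOI lets the same lookup key serve both versioned and unversioned requests.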
As of August 2025:
- bioRxiv: 398,744 individual works
- medRxiv: 88,358 individual works
- Total: 487,102 works (~487,000) across both servers
- Node.js 18+
- AWS credentials (for S3 access)
- API key for batch processing
npm install -g openrxiv
git clone https://github.com/continuous-foundation/openrxiv
cd openrxiv
npm install
npm run build
- npm run build - Build all packages
- npm run test - Run tests
- npm run lint - Lint code
- npm run changeset - Manage versioning
- CLI Reference - Complete command documentation
- Batch Processing - Bulk data processing guide
- DOI Structure - Understanding bioRxiv DOIs
- Processing Errors - Common issues and solutions
- Download Locations - S3 bucket organization
This project is maintained by the Continuous Science Foundation. We welcome contributions for:
- Bug fixes and improvements
- Additional CLI commands
- Enhanced API endpoints
- Documentation improvements
MIT License - see LICENSE file for details.
This tool is designed to comply with bioRxiv's and medRxiv's fair use policies:
- No content redistribution
- Proper attribution to original sources
- Intended for legitimate research and data mining purposes
- Respect for author copyright and licensing
Why This Matters: By providing efficient access to bioRxiv and medRxiv data, this project enables researchers to focus on science rather than data logistics. The combination of CLI tools and API endpoints creates a bridge between the raw S3 storage and the research community's needs.