openRxiv MECA Downloader

A comprehensive command-line interface (CLI) tool and API server for downloading, processing, and managing openRxiv MECA (Manuscript Exchange Common Approach) files from AWS S3. This project bridges the gap between bioRxiv/medRxiv's S3 storage and researchers who need programmatic access to preprint data.

🎯 Why This Project Exists

The Problem

bioRxiv and medRxiv provide MECA files in S3 buckets, but there's no official API to:

  • Look up where a specific paper is stored in S3 given its DOI
  • Download individual papers without traversing the entire bucket
  • Access metadata without downloading the full MECA file (a 6-10 MB archive vs. 2-3 KB of XML)

The Solution

This project provides:

  1. CLI Tool - Command-line access to bioRxiv/medRxiv data
  2. Metadata API - DOI → S3 location lookups
  3. Batch Processing - Efficient bulk data extraction and upload

🚀 Core Commands

1. Summary - Research Discovery

Get detailed information about preprints without downloading files.

# Basic paper information
openrxiv summary "10.1101/2024.05.08.593085"

# Full abstract and details
openrxiv summary -m "10.1101/2024.05.08.593085"

# Try medRxiv if not found on bioRxiv
openrxiv summary -s medrxiv "10.1101/2020.03.19.20039131"

2. Download - Individual Paper Access

Download specific MECA files by DOI using the metadata API.

# Download by DOI (requires API lookup)
openrxiv --requester-pays download "10.1101/2024.05.08.593085"

# Custom output directory
openrxiv --requester-pays download "10.1101/2024.05.08.593085" --output "./papers"

Why it exists: Researchers need individual papers, not entire months of data. The API integration means you can download a specific paper without knowing its S3 location.
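
If you already know a paper's bucket and key (for example, from the metadata API response shown in the API section below), you can also fetch the MECA file directly with the AWS CLI; `--request-payer requester` is the AWS-side equivalent of the `--requester-pays` flag above. A minimal sketch, using the bucket and key from the example API response later in this README:

# Direct S3 download with the AWS CLI, given a known bucket and key
aws s3 cp \
  "s3://biorxiv-src-monthly/Current_Content/January_2024/a765f23d-6f3e-1014-a187-cd164f93e87a.meca" \
  ./papers/ \
  --request-payer requester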

3. List - Content Exploration

Explore what's available in the S3 buckets, filtering by server, month, or batch.

# See recent content
openrxiv list

# Filter by month
openrxiv list --month "2024-01"

# Filter by batch (for historical data)
openrxiv list --batch "Batch_01"

# Explore medRxiv content
openrxiv list --server medrxiv --limit 100

Why it exists: Researchers need to understand what data is available before planning downloads. This provides a window into the S3 bucket structure without full traversal.
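
For comparison, the same bucket can be inspected directly with the AWS CLI; a minimal sketch, assuming configured AWS credentials (as a requester-pays bucket, you pay the transfer costs):

# Peek at the monthly folder layout of the bioRxiv bucket directly
aws s3 ls "s3://biorxiv-src-monthly/Current_Content/" --request-payer requester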

🔌 API Endpoints

The project includes a lightweight API server that serves as the bridge between DOIs and S3 locations. An instance of the API server is available at:

https://openrxiv.csf.now

Core Endpoints

GET /v1/works/{doiPrefix}/{doiSuffix}

Purpose: Look up paper metadata and S3 location by DOI

Example Response:

{
  "doi": "10.1101/2024.01.25.577295",
  "versions": [
    {
      "id": "cmedr9nx800i0ii04o4nk4bdy",
      "doi": "10.1101/2024.01.25.577295",
      "version": 1,
      "title": "Spyglass: a data analysis framework for reproducible and shareable neuroscience research",
      "receivedDate": "2024-01-25T00:00:00.000Z",
      "acceptedDate": "2024-01-26T00:00:00.000Z",
      "server": "biorxiv",
      "s3Bucket": "biorxiv-src-monthly",
      "s3Key": "Current_Content/January_2024/a765f23d-6f3e-1014-a187-cd164f93e87a.meca",
      "fileSize": 6147995,
      "links": {
        "self": "https://openrxiv.csf.now/v1/works/10.1101/2024.01.25.577295v1",
        "html": "https://www.biorxiv.org/content/10.1101/2024.01.25.577295v1.full",
        "pdf": "https://www.biorxiv.org/content/10.1101/2024.01.25.577295v1.full.pdf"
      }
    }
  ]
}

Why this exists: This maps preprint versions to S3 locations, enabling direct access to specific papers without bucket traversal.
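
The lookup can be reproduced from the shell; a minimal sketch against the public instance, assuming jq is available for extracting fields:

# Resolve a DOI and pull out the S3 key of the first version
curl -s "https://openrxiv.csf.now/v1/works/10.1101/2024.01.25.577295" \
  | jq -r '.versions[0].s3Key'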

POST /v1/works

Purpose: Upload paper metadata during batch processing

Why this exists: Batch processing extracts metadata from thousands of MECA files and needs to store it efficiently. This endpoint populates the database that powers the DOI lookups.
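
The exact request shape is not documented in this README, so the following is a rough sketch only: the body reuses field names from the GET response above, and the Authorization header is an assumption, not confirmed API behavior.

# Hypothetical upload; field names inferred from the GET response,
# and the Authorization header is an assumption
curl -X POST "https://openrxiv.csf.now/v1/works" \
  -H "Authorization: Bearer $OPENRXIV_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "doi": "10.1101/2024.01.25.577295",
    "version": 1,
    "server": "biorxiv",
    "s3Bucket": "biorxiv-src-monthly",
    "s3Key": "Current_Content/January_2024/a765f23d-6f3e-1014-a187-cd164f93e87a.meca"
  }'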

DELETE /v1/works

Purpose: Remove papers from the database

Why this exists: Papers can be updated, retracted, or moved. This endpoint maintains data integrity.

Health & Status

  • GET /health - API health check (example below)
  • GET / - API information and available endpoints
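
Checking a running instance from the shell is a one-liner:

# Quick health check against the public instance
curl -s "https://openrxiv.csf.now/health"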

🔄 How It All Works Together

1. Data Flow

S3 Buckets → CLI Batch Processing → API Database → CLI Commands

  1. S3 Buckets: bioRxiv and medRxiv store MECA files in organized folders
  2. Batch Processing: CLI downloads and processes MECA files, extracting metadata
  3. API Database: Metadata is stored with S3 location information
  4. CLI Commands: Use the API to look up papers and download them efficiently (sketched below)
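
Steps 3 and 4 can be sketched end to end from the shell; a minimal example, assuming jq and configured AWS credentials:

# Resolve a DOI through the API, then fetch the MECA file it points to
RESP=$(curl -s "https://openrxiv.csf.now/v1/works/10.1101/2024.01.25.577295")
BUCKET=$(echo "$RESP" | jq -r '.versions[0].s3Bucket')
KEY=$(echo "$RESP" | jq -r '.versions[0].s3Key')
aws s3 cp "s3://$BUCKET/$KEY" . --request-payer requester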

2. Use Case Examples

Individual Researcher

# Discover a paper
openrxiv summary "10.1101/2024.05.08.593085"

# Download it for analysis
openrxiv --requester-pays download "10.1101/2024.05.08.593085"

Data Scientist

# See what's available this month
openrxiv list --month "2024-01" --limit 100

# Process all papers from January
openrxiv batch-process --month "2024-01" --concurrency 10

Research Team

# Explore historical data
openrxiv list --batch "1-53" --server medrxiv

# Batch process multiple months
openrxiv batch-process --month "2024-01,2024-02,2024-03" --concurrency 20

πŸ—οΈ Architecture

CLI Tool (packages/cli)

  • Commands: summary, download, list, batch-info, batch-process
  • AWS Integration: S3 access with requester-pays support
  • API Client: Integration with the metadata API
  • Processing: MECA file extraction and XML parsing (see the sketch below)
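
MECA packages are ordinary zip archives containing the article XML alongside a manifest, so the extraction step can be reproduced by hand; a minimal sketch, using the file name from the example API response above:

# List the contents of a downloaded MECA package (a standard zip archive)
unzip -l a765f23d-6f3e-1014-a187-cd164f93e87a.meca

# Extract everything, including the article XML, for further processing
unzip -o a765f23d-6f3e-1014-a187-cd164f93e87a.meca -d extracted/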

API Server (apps/api)

  • Database: Prisma with PostgreSQL
  • Endpoints: Work lookup, creation, and deletion
  • Authentication: API key-based access control
  • Validation: Comprehensive input validation

Utilities (packages/utils)

  • DOI Parsing: Handle bioRxiv's complex DOI format
  • Folder Structure: Navigate S3 bucket organization (see the sketch below)
  • XML Processing: Robust handling of bioRxiv XML files
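
As an illustration of the folder-structure logic, the monthly prefix in the example S3 key above (Current_Content/January_2024/) lines up with the date embedded in that paper's DOI suffix. A hypothetical sketch, assuming GNU date and a date-stamped suffix (older papers use other DOI shapes and live under numbered batch folders instead):

# Hypothetical helper: map a date-stamped DOI suffix to a monthly S3 prefix
suffix="2024.01.25.577295"
year=$(echo "$suffix" | cut -d. -f1)
month=$(echo "$suffix" | cut -d. -f2)
echo "Current_Content/$(date -d "$year-$month-01" +%B)_$year/"
# -> Current_Content/January_2024/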

📊 Data Scale

As of August 2025:

  • bioRxiv: 398,744 individual works
  • medRxiv: 88,358 individual works
  • Total: ~487,000 papers across both servers

🚀 Getting Started

Prerequisites

  • Node.js 18+
  • AWS credentials (for S3 access)
  • API key for batch processing

Installation

npm install -g openrxiv

🔧 Development

Local Setup

git clone https://github.com/continuous-foundation/openrxiv
cd openrxiv
npm install
npm run build

Available Scripts

  • npm run build - Build all packages
  • npm run test - Run tests
  • npm run lint - Lint code
  • npm run changeset - Manage versioning

📚 Documentation

🤝 Contributing

This project is maintained by the Continuous Science Foundation. We welcome contributions for:

  • Bug fixes and improvements
  • Additional CLI commands
  • Enhanced API endpoints
  • Documentation improvements

📄 License

MIT License - see LICENSE file for details.

🔒 Compliance

This tool is designed to comply with bioRxiv's and medRxiv's fair use policies:

  • No content redistribution
  • Proper attribution to original sources
  • Intended for legitimate research and data mining purposes
  • Respect for author copyright and licensing

Why This Matters: By providing efficient access to bioRxiv and medRxiv data, this project enables researchers to focus on science rather than data logistics. The combination of CLI tools and API endpoints creates a bridge between the raw S3 storage and the research community's needs.
