
Project Summary: Dataverse Uploader Python Port

Overview

This document summarizes the Python port of the DVUploader (Dataverse bulk uploader) from Java to Python, including architectural decisions, key features, and next steps.

Java Repository Analysis

Original Repository: dataverse-uploader (Java)

  • Purpose: Command-line bulk uploader for Dataverse and SEAD/Clowder repositories
  • Main Components:
    • AbstractUploader: Base class with upload workflow
    • DVUploader: Dataverse-specific implementation
    • SEADUploader: SEAD/Clowder implementation
    • Resource abstraction layer for files, directories, and BagIt bags
    • HTTP client utilities with Apache HttpComponents
    • Metadata processing (OAI-ORE, JSON-LD)

Key Features:

  • Bulk file uploads (handles 1000+ files)
  • Direct upload to S3 storage
  • Checksum verification (MD5, SHA-1, SHA-256)
  • Resume capability (skip existing files)
  • Multipart uploads for large files
  • BagIt bag support
  • OAuth authentication
  • Retry logic

Python Architecture

Technology Stack

| Category      | Technology        | Rationale                                   |
|---------------|-------------------|---------------------------------------------|
| HTTP Client   | httpx             | Modern, async-capable, connection pooling   |
| Configuration | Pydantic          | Type-safe, validation, env var support      |
| CLI           | Typer             | Beautiful CLI with minimal code             |
| Terminal UI   | Rich              | Progress bars, colored output               |
| JSON          | orjson            | Among the fastest JSON libraries for Python |
| Retry Logic   | tenacity          | Declarative retry with exponential backoff  |
| Testing       | pytest            | Industry standard with excellent ecosystem  |
| Code Quality  | black, ruff, mypy | Formatting, linting, type checking          |

Core Backend Components

1. Configuration Management (core/config.py)

from typing import Optional

from pydantic_settings import BaseSettings  # Pydantic v2; in v1: from pydantic import BaseSettings

class UploaderConfig(BaseSettings):
    """Type-safe configuration with validation"""
    server_url: str
    api_key: Optional[str] = None
    dataset_pid: Optional[str] = None
    verify_checksums: bool = False
    # ... 20+ other settings

Features:

  • Pydantic-based validation
  • Environment variable support
  • .env file loading
  • Type hints throughout
  • Field validators

2. Abstract Uploader (core/abstract_uploader.py)

class AbstractUploader(ABC):
    """Template method pattern for upload workflow"""
    
    def process_requests(self, paths):
        # Template method - defines the fixed workflow
        self.validate_configuration()
        for path in paths:
            resource = ResourceFactory.create(path)  # file, directory, or bag
            self._process_resource(resource, "/")
    
    @abstractmethod
    def upload_file(self, file, parent_path):
        """Implemented by subclasses"""
        pass

Responsibilities:

  • Resource processing workflow
  • Statistics tracking
  • Recursive directory handling
  • Extension points for specific repositories
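The template-method pattern described above can be shown in miniature. This is a toy illustration, not the real API (`MiniUploader`, `process`, and `DemoUploader` are hypothetical names):

```python
from abc import ABC, abstractmethod

class MiniUploader(ABC):
    """Toy template method: fixed workflow, subclass-defined upload step."""

    def __init__(self):
        self.uploaded = 0   # statistics tracked by the base class
        self.skipped = 0

    def process(self, names):
        # The workflow (iterate, dispatch, record stats) lives in the base
        # class; only upload_file varies per repository type.
        for name in names:
            if self.upload_file(name):
                self.uploaded += 1
            else:
                self.skipped += 1

    @abstractmethod
    def upload_file(self, name: str) -> bool:
        """Return True if the file was uploaded, False if skipped."""

class DemoUploader(MiniUploader):
    def upload_file(self, name: str) -> bool:
        return not name.startswith(".")   # skip hidden files

u = DemoUploader()
u.process(["a.csv", ".hidden", "b.csv"])
print(u.uploaded, u.skipped)  # 2 1
```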

3. Resource Abstraction (resources/)

class Resource(ABC):
    """Abstract interface for all resource types"""
    
    @abstractmethod
    def get_name(self) -> str: pass
    
    @abstractmethod
    def get_input_stream(self) -> IO[bytes]: pass
    
    @abstractmethod
    def get_hash(self, algorithm: str) -> str: pass

Implementations:

  • FileResource: Local filesystem files
  • PublishedResource: Remote published content
  • BagResource: BagIt bags
  • Extensible for new types
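A minimal file-backed implementation of the `Resource` interface might look like the sketch below (illustrative only; the real `FileResource` lives in `resources/file_resource.py` and may differ):

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import IO

class MiniFileResource:
    """Illustrative FileResource-style wrapper (not the real class)."""

    def __init__(self, path: str):
        self.path = Path(path)

    def get_name(self) -> str:
        return self.path.name

    def get_input_stream(self) -> IO[bytes]:
        return self.path.open("rb")

    def get_hash(self, algorithm: str) -> str:
        h = hashlib.new(algorithm)              # "md5", "sha1", "sha256", ...
        with self.get_input_stream() as stream:
            for chunk in iter(lambda: stream.read(65536), b""):
                h.update(chunk)                 # stream in chunks, never the whole file
        return h.hexdigest()

# Quick demonstration against a known digest
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello")
tmp.close()
digest = MiniFileResource(tmp.name).get_hash("sha256")
os.unlink(tmp.name)
print(digest)  # 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```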

4. HTTP Client (utils/http_client.py)

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class HTTPClient:
    """HTTP client with pooling and retry logic"""
    
    @retry(
        retry=retry_if_exception_type(NetworkError),
        wait=wait_exponential(multiplier=1, min=2, max=30),
        stop=stop_after_attempt(3),
    )
    def post(self, url, **kwargs):
        # Automatic retry with exponential backoff
        pass

Features:

  • Connection pooling (configurable size)
  • Automatic retries (network errors)
  • Exponential backoff
  • Timeout handling
  • Streaming support
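The retry-with-backoff behavior that tenacity supplies declaratively can be spelled out in plain Python. A minimal sketch (`retry_with_backoff` is a hypothetical helper, not part of the project; the real client uses the tenacity decorator shown above):

```python
import time

def retry_with_backoff(fn, attempts=3, base=2.0, max_wait=30.0, sleep=time.sleep):
    """Call fn(), retrying failures with exponential backoff (like tenacity)."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise                                        # out of attempts
            sleep(min(base * 2 ** (attempt - 1), max_wait))  # 2s, 4s, 8s, ... capped

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"

result = retry_with_backoff(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
print(result, calls["n"])  # ok 3
```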

5. Dataverse Uploader (uploaders/dataverse.py)

class DataverseUploader(AbstractUploader):
    """Dataverse-specific implementation"""
    
    def upload_file(self, file, parent_path):
        if self.config.direct_upload:
            return self._upload_file_direct(file, parent_path)
        else:
            return self._upload_file_traditional(file, parent_path)

Features:

  • Direct S3 upload support
  • Traditional API upload
  • Dataset metadata loading
  • Existing file detection
  • Checksum verification
  • Dataset lock handling
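The checksum-verification step amounts to comparing a locally computed digest against the one reported by the server. A small sketch (the function name is hypothetical; in the real uploader the expected digest comes from Dataverse's file metadata):

```python
import hashlib

def checksum_matches(data: bytes, expected: str, algorithm: str = "md5") -> bool:
    """Compare a locally computed digest with a server-reported one.

    The expected value is passed in directly here for illustration;
    digests are compared case-insensitively.
    """
    return hashlib.new(algorithm, data).hexdigest() == expected.lower()

print(checksum_matches(b"hello", "5D41402ABC4B2A76B9719D911017C592"))    # True
print(checksum_matches(b"tampered", "5d41402abc4b2a76b9719d911017c592")) # False
```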

Project Structure

dataverse_uploader_python/
├── pyproject.toml              # Project metadata & dependencies
├── requirements.txt            # Direct dependencies
├── README.md                   # User documentation
├── ARCHITECTURE.md             # Technical architecture
├── .env.example               # Configuration template
│
├── dataverse_uploader/
│   ├── __init__.py
│   ├── cli.py                 # Command-line interface
│   │
│   ├── core/
│   │   ├── __init__.py
│   │   ├── abstract_uploader.py    # Base uploader class
│   │   ├── config.py               # Configuration management
│   │   └── exceptions.py           # Exception hierarchy
│   │
│   ├── resources/
│   │   ├── __init__.py
│   │   ├── base.py                 # Abstract Resource
│   │   ├── file_resource.py        # File implementation
│   │   ├── published_resource.py   # Published content
│   │   ├── bag_resource.py         # BagIt support
│   │   └── resource_factory.py     # Factory pattern
│   │
│   ├── uploaders/
│   │   ├── __init__.py
│   │   ├── dataverse.py            # Dataverse implementation
│   │   └── sead.py                 # SEAD implementation
│   │
│   ├── auth/
│   │   ├── __init__.py
│   │   ├── api_key.py
│   │   └── oauth.py
│   │
│   └── utils/
│       ├── __init__.py
│       ├── http_client.py          # HTTP utilities
│       ├── hashing.py              # File hashing
│       ├── metadata.py             # Metadata processing
│       └── progress.py             # Progress tracking
│
├── tests/
│   ├── __init__.py
│   ├── test_resources.py
│   ├── test_uploaders.py
│   └── fixtures/
│
└── examples/
    ├── simple_upload.py
    └── batch_upload.py

Implementation Status

✅ Completed (Core Backend)

  1. Configuration System

    • Pydantic-based settings
    • Environment variable support
    • Validation and type safety
  2. Exception Hierarchy

    • Custom exception types
    • Clear error messages
    • Proper inheritance
  3. Resource Abstraction

    • Base Resource ABC
    • FileResource implementation
    • Stream handling
    • Hash calculation
  4. HTTP Client

    • Connection pooling
    • Retry logic with exponential backoff
    • Timeout handling
    • Authentication
  5. Abstract Uploader

    • Template method pattern
    • Resource processing workflow
    • Statistics tracking
    • Extension points
  6. Dataverse Uploader

    • Configuration validation
    • Dataset metadata loading
    • File existence checking
    • Traditional upload
    • Direct S3 upload
    • File registration
    • Checksum verification
  7. CLI Interface

    • Typer-based command structure
    • Rich terminal output
    • Argument parsing
    • Help documentation
  8. Documentation

    • Comprehensive README
    • Architecture documentation
    • Code examples
    • Configuration guide

🚧 To Be Implemented

  1. Additional Resource Types

    • PublishedResource (for OAI-ORE)
    • BagResource (for BagIt bags)
    • ResourceFactory (factory pattern)
  2. SEAD Uploader

    • SEAD API implementation
    • OAuth authentication
    • Collection/dataset creation
  3. Authentication Modules

    • OAuth flow implementation
    • Token management
    • Session handling
  4. Metadata Processing

    • OAI-ORE parsing
    • JSON-LD processing
    • Metadata filtering
    • Gray/black list support
  5. Progress Tracking

    • Real-time progress bars
    • Upload speed calculation
    • ETA estimation
  6. Hashing Utilities

    • Additional algorithms
    • Streaming hash calculation
    • Hash caching
  7. Testing

    • Unit tests for all components
    • Integration tests
    • Mock Dataverse server
    • Fixture data
  8. Advanced Features

    • Multipart upload with parallel parts
    • Resume from interruption
    • Batch file registration
    • Directory structure preservation
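Item 5 above names upload-speed calculation and ETA estimation; the arithmetic can be sketched deterministically with an injectable clock (class and method names are hypothetical, not the planned `utils/progress.py` API):

```python
import time

class ProgressTracker:
    """Minimal sketch of upload-speed and ETA estimation."""

    def __init__(self, total_bytes: int, now=time.monotonic):
        self.total = total_bytes
        self.sent = 0
        self.now = now            # injectable clock, for deterministic tests
        self.start = now()

    def update(self, n: int) -> None:
        self.sent += n

    def speed(self) -> float:
        """Average bytes per second since the start."""
        elapsed = self.now() - self.start
        return self.sent / elapsed if elapsed > 0 else 0.0

    def eta(self) -> float:
        """Estimated seconds remaining at the current average speed."""
        s = self.speed()
        return (self.total - self.sent) / s if s > 0 else float("inf")

# Deterministic demo with a fake clock: 250 of 1000 bytes sent in 10 seconds
clock = iter([0.0, 10.0, 10.0, 10.0]).__next__
t = ProgressTracker(total_bytes=1000, now=clock)
t.update(250)
spd = t.speed()       # 250 bytes / 10 s = 25.0 bytes/s
remaining = t.eta()   # 750 bytes / 25 bytes/s = 30.0 s
print(spd, remaining)
```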

Advantages of Python Implementation

1. Modern Language Features

  • Type hints: Better IDE support and error detection
  • f-strings: Cleaner string formatting
  • Context managers: Automatic resource cleanup
  • Decorators: Clean separation of concerns (e.g., retry logic)

2. Better Developer Experience

  • No compilation: Faster iteration cycle
  • REPL: Interactive testing and debugging
  • Rich ecosystem: Thousands of quality libraries
  • Better tooling: Black, Ruff, MyPy for quality

3. Simpler Deployment

  • No JVM: Smaller footprint
  • pip/Poetry: Simple dependency management
  • Virtual environments: Isolated dependencies
  • Docker: Easy containerization

4. Type Safety Without Verbosity

// Java
public String uploadFile(Resource file, String parentPath)
    throws UploaderException {
    // Implementation
}

# Python (with type hints)
def upload_file(self, file: Resource, parent_path: str) -> str | None:
    # Implementation

5. Better Configuration Management

# Pydantic automatically:
# - Loads from environment variables
# - Validates types
# - Converts strings to proper types
# - Provides helpful error messages

config = UploaderConfig()  # That's it!

Migration Path from Java

For Users

Java Command:

java -jar DVUploader-v1.3.0.jar \
  -server=https://dataverse.org \
  -key=API_KEY \
  -did=doi:10.xxx/yyy \
  file.csv

Python Equivalent:

dv-upload file.csv \
  --server https://dataverse.org \
  --key API_KEY \
  --dataset doi:10.xxx/yyy

For Developers

Extending Java Version:

public class MyUploader extends AbstractUploader {
    @Override
    protected String uploadFile(Resource file, String parentPath) {
        // Implementation
    }
}

Extending Python Version:

class MyUploader(AbstractUploader):
    def upload_file(self, file: Resource, parent_path: str) -> str | None:
        # Implementation

Next Steps

Phase 1: Core Completion (Current)

  • ✅ Backend architecture
  • ✅ Core uploaders
  • ✅ Resource abstraction
  • ✅ HTTP client
  • ✅ Configuration

Phase 2: Feature Parity

  • ⬜ BagIt support
  • ⬜ SEAD uploader
  • ⬜ OAuth authentication
  • ⬜ Metadata processing
  • ⬜ Progress tracking

Phase 3: Testing & Quality

  • ⬜ Unit tests (80%+ coverage)
  • ⬜ Integration tests
  • ⬜ End-to-end tests
  • ⬜ Performance benchmarks
  • ⬜ Documentation review

Phase 4: Release

  • ⬜ Package for PyPI
  • ⬜ Docker image
  • ⬜ GitHub Actions CI/CD
  • ⬜ Release documentation
  • ⬜ Community announcement

Getting Started

Installation

cd dataverse_uploader_python
pip install -e .

Basic Usage

# Set environment variables
export DV_SERVER_URL=https://demo.dataverse.org
export DV_API_KEY=your-api-key
export DV_DATASET_PID=doi:10.5072/FK2/EXAMPLE

# Upload files
dv-upload data/ --recurse --verify

Programmatic Usage

from dataverse_uploader.core.config import UploaderConfig
from dataverse_uploader.uploaders.dataverse import DataverseUploader

config = UploaderConfig(
    server_url="https://demo.dataverse.org",
    api_key="your-api-key",
    dataset_pid="doi:10.5072/FK2/EXAMPLE",
    recurse_directories=True,
    verify_checksums=True,
)

with DataverseUploader(config) as uploader:
    uploader.process_requests(["data/"])

Contributing

See README.md for contribution guidelines.

License

Apache License 2.0 (same as original Java version)

Acknowledgments

Python implementation based on the original Java DVUploader developed by:

  • Texas Digital Library
  • Global Dataverse Community Consortium
  • University of Michigan
  • NCSA/SEAD Project