
Project Summary: Dataverse Uploader Python Port

Overview

This document summarizes the Python port of the DVUploader (Dataverse bulk uploader) from Java to Python, including architectural decisions, key features, and next steps.

Java Repository Analysis

Original Repository: dataverse-uploader (Java)

  • Purpose: Command-line bulk uploader for Dataverse and SEAD/Clowder repositories
  • Main Components:
    • AbstractUploader: Base class with upload workflow
    • DVUploader: Dataverse-specific implementation
    • SEADUploader: SEAD/Clowder implementation
    • Resource abstraction layer for files, directories, and BagIt bags
    • HTTP client utilities with Apache HttpComponents
    • Metadata processing (OAI-ORE, JSON-LD)

Key Features:

  • Bulk file uploads (handles 1000+ files)
  • Direct upload to S3 storage
  • Checksum verification (MD5, SHA-1, SHA-256)
  • Resume capability (skip existing files)
  • Multipart uploads for large files
  • BagIt bag support
  • OAuth authentication
  • Retry logic

Python Architecture

Technology Stack

| Category      | Technology        | Rationale                                   |
|---------------|-------------------|---------------------------------------------|
| HTTP Client   | httpx             | Modern, async-capable, connection pooling   |
| Configuration | Pydantic          | Type-safe, validation, env var support      |
| CLI           | Typer             | Beautiful CLI with minimal code             |
| Terminal UI   | Rich              | Progress bars, colored output               |
| JSON          | orjson            | Among the fastest JSON libraries for Python |
| Retry Logic   | tenacity          | Declarative retry with exponential backoff  |
| Testing       | pytest            | Industry standard with excellent ecosystem  |
| Code Quality  | black, ruff, mypy | Formatting, linting, type checking          |

Core Backend Components

1. Configuration Management (core/config.py)

from typing import Optional

from pydantic_settings import BaseSettings  # Pydantic v2; in v1: from pydantic import BaseSettings

class UploaderConfig(BaseSettings):
    """Type-safe configuration with validation"""
    server_url: str
    api_key: Optional[str] = None
    dataset_pid: Optional[str] = None
    verify_checksums: bool = False
    # ... 20+ other settings

Features:

  • Pydantic-based validation
  • Environment variable support
  • .env file loading
  • Type hints throughout
  • Field validators

2. Abstract Uploader (core/abstract_uploader.py)

class AbstractUploader(ABC):
    """Template method pattern for upload workflow"""
    
    def process_requests(self, paths):
        # Template method - defines the fixed workflow
        self.validate_configuration()
        for path in paths:
            resource = ResourceFactory.create(path)  # file, directory, or bag
            self._process_resource(resource, "/")
    
    @abstractmethod
    def upload_file(self, file, parent_path):
        """Implemented by subclasses"""
        pass

Responsibilities:

  • Resource processing workflow
  • Statistics tracking
  • Recursive directory handling
  • Extension points for specific repositories
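The template-method pattern described above can be shown in miniature. This is a toy illustration, not the real API (`MiniUploader`, `process`, and `DemoUploader` are hypothetical names):

```python
from abc import ABC, abstractmethod

class MiniUploader(ABC):
    """Toy template method: fixed workflow, subclass-defined upload step."""

    def __init__(self):
        self.uploaded = 0   # statistics tracked by the base class
        self.skipped = 0

    def process(self, names):
        # The workflow (iterate, dispatch, record stats) lives in the base
        # class; only upload_file varies per repository type.
        for name in names:
            if self.upload_file(name):
                self.uploaded += 1
            else:
                self.skipped += 1

    @abstractmethod
    def upload_file(self, name: str) -> bool:
        """Return True if the file was uploaded, False if skipped."""

class DemoUploader(MiniUploader):
    def upload_file(self, name: str) -> bool:
        return not name.startswith(".")   # skip hidden files

u = DemoUploader()
u.process(["a.csv", ".hidden", "b.csv"])
print(u.uploaded, u.skipped)  # 2 1
```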

3. Resource Abstraction (resources/)

class Resource(ABC):
    """Abstract interface for all resource types"""
    
    @abstractmethod
    def get_name(self) -> str: pass
    
    @abstractmethod
    def get_input_stream(self) -> IO[bytes]: pass
    
    @abstractmethod
    def get_hash(self, algorithm: str) -> str: pass

Implementations:

  • FileResource: Local filesystem files
  • PublishedResource: Remote published content
  • BagResource: BagIt bags
  • Extensible for new types
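A minimal file-backed implementation of the `Resource` interface might look like the sketch below (illustrative only; the real `FileResource` lives in `resources/file_resource.py` and may differ):

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import IO

class MiniFileResource:
    """Illustrative FileResource-style wrapper (not the real class)."""

    def __init__(self, path: str):
        self.path = Path(path)

    def get_name(self) -> str:
        return self.path.name

    def get_input_stream(self) -> IO[bytes]:
        return self.path.open("rb")

    def get_hash(self, algorithm: str) -> str:
        h = hashlib.new(algorithm)              # "md5", "sha1", "sha256", ...
        with self.get_input_stream() as stream:
            for chunk in iter(lambda: stream.read(65536), b""):
                h.update(chunk)                 # stream in chunks, never the whole file
        return h.hexdigest()

# Quick demonstration against a known digest
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(b"hello")
tmp.close()
digest = MiniFileResource(tmp.name).get_hash("sha256")
os.unlink(tmp.name)
print(digest)  # 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```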

4. HTTP Client (utils/http_client.py)

from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class HTTPClient:
    """HTTP client with pooling and retry logic"""
    
    @retry(
        retry=retry_if_exception_type(NetworkError),
        wait=wait_exponential(multiplier=1, min=2, max=30),
        stop=stop_after_attempt(3),
    )
    def post(self, url, **kwargs):
        # Automatic retry with exponential backoff
        pass

Features:

  • Connection pooling (configurable size)
  • Automatic retries (network errors)
  • Exponential backoff
  • Timeout handling
  • Streaming support
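The retry-with-backoff behavior that tenacity supplies declaratively can be spelled out in plain Python. A minimal sketch (`retry_with_backoff` is a hypothetical helper, not part of the project; the real client uses the tenacity decorator shown above):

```python
import time

def retry_with_backoff(fn, attempts=3, base=2.0, max_wait=30.0, sleep=time.sleep):
    """Call fn(), retrying failures with exponential backoff (like tenacity)."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise                                        # out of attempts
            sleep(min(base * 2 ** (attempt - 1), max_wait))  # 2s, 4s, 8s, ... capped

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient network error")
    return "ok"

result = retry_with_backoff(flaky, sleep=lambda s: None)  # skip real sleeping in the demo
print(result, calls["n"])  # ok 3
```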

5. Dataverse Uploader (uploaders/dataverse.py)

class DataverseUploader(AbstractUploader):
    """Dataverse-specific implementation"""
    
    def upload_file(self, file, parent_path):
        if self.config.direct_upload:
            return self._upload_file_direct(file, parent_path)
        else:
            return self._upload_file_traditional(file, parent_path)

Features:

  • Direct S3 upload support
  • Traditional API upload
  • Dataset metadata loading
  • Existing file detection
  • Checksum verification
  • Dataset lock handling
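The checksum-verification step amounts to comparing a locally computed digest against the one reported by the server. A small sketch (the function name is hypothetical; in the real uploader the expected digest comes from Dataverse's file metadata):

```python
import hashlib

def checksum_matches(data: bytes, expected: str, algorithm: str = "md5") -> bool:
    """Compare a locally computed digest with a server-reported one.

    The expected value is passed in directly here for illustration;
    digests are compared case-insensitively.
    """
    return hashlib.new(algorithm, data).hexdigest() == expected.lower()

print(checksum_matches(b"hello", "5D41402ABC4B2A76B9719D911017C592"))    # True
print(checksum_matches(b"tampered", "5d41402abc4b2a76b9719d911017c592")) # False
```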

Project Structure

dataverse_uploader_python/
├── pyproject.toml              # Project metadata & dependencies
├── requirements.txt            # Direct dependencies
├── README.md                   # User documentation
├── ARCHITECTURE.md             # Technical architecture
├── .env.example               # Configuration template
│
├── dataverse_uploader/
│   ├── __init__.py
│   ├── cli.py                 # Command-line interface
│   │
│   ├── core/
│   │   ├── __init__.py
│   │   ├── abstract_uploader.py    # Base uploader class
│   │   ├── config.py               # Configuration management
│   │   └── exceptions.py           # Exception hierarchy
│   │
│   ├── resources/
│   │   ├── __init__.py
│   │   ├── base.py                 # Abstract Resource
│   │   ├── file_resource.py        # File implementation
│   │   ├── published_resource.py   # Published content
│   │   ├── bag_resource.py         # BagIt support
│   │   └── resource_factory.py     # Factory pattern
│   │
│   ├── uploaders/
│   │   ├── __init__.py
│   │   ├── dataverse.py            # Dataverse implementation
│   │   └── sead.py                 # SEAD implementation
│   │
│   ├── auth/
│   │   ├── __init__.py
│   │   ├── api_key.py
│   │   └── oauth.py
│   │
│   └── utils/
│       ├── __init__.py
│       ├── http_client.py          # HTTP utilities
│       ├── hashing.py              # File hashing
│       ├── metadata.py             # Metadata processing
│       └── progress.py             # Progress tracking
│
├── tests/
│   ├── __init__.py
│   ├── test_resources.py
│   ├── test_uploaders.py
│   └── fixtures/
│
└── examples/
    ├── simple_upload.py
    └── batch_upload.py

Implementation Status

✅ Completed (Core Backend)

  1. Configuration System

    • Pydantic-based settings
    • Environment variable support
    • Validation and type safety
  2. Exception Hierarchy

    • Custom exception types
    • Clear error messages
    • Proper inheritance
  3. Resource Abstraction

    • Base Resource ABC
    • FileResource implementation
    • Stream handling
    • Hash calculation
  4. HTTP Client

    • Connection pooling
    • Retry logic with exponential backoff
    • Timeout handling
    • Authentication
  5. Abstract Uploader

    • Template method pattern
    • Resource processing workflow
    • Statistics tracking
    • Extension points
  6. Dataverse Uploader

    • Configuration validation
    • Dataset metadata loading
    • File existence checking
    • Traditional upload
    • Direct S3 upload
    • File registration
    • Checksum verification
  7. CLI Interface

    • Typer-based command structure
    • Rich terminal output
    • Argument parsing
    • Help documentation
  8. Documentation

    • Comprehensive README
    • Architecture documentation
    • Code examples
    • Configuration guide

🚧 To Be Implemented

  1. Additional Resource Types

    • PublishedResource (for OAI-ORE)
    • BagResource (for BagIt bags)
    • ResourceFactory (factory pattern)
  2. SEAD Uploader

    • SEAD API implementation
    • OAuth authentication
    • Collection/dataset creation
  3. Authentication Modules

    • OAuth flow implementation
    • Token management
    • Session handling
  4. Metadata Processing

    • OAI-ORE parsing
    • JSON-LD processing
    • Metadata filtering
    • Gray/black list support
  5. Progress Tracking

    • Real-time progress bars
    • Upload speed calculation
    • ETA estimation
  6. Hashing Utilities

    • Additional algorithms
    • Streaming hash calculation
    • Hash caching
  7. Testing

    • Unit tests for all components
    • Integration tests
    • Mock Dataverse server
    • Fixture data
  8. Advanced Features

    • Multipart upload with parallel parts
    • Resume from interruption
    • Batch file registration
    • Directory structure preservation
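Item 5 above names upload-speed calculation and ETA estimation; the arithmetic can be sketched deterministically with an injectable clock (class and method names are hypothetical, not the planned `utils/progress.py` API):

```python
import time

class ProgressTracker:
    """Minimal sketch of upload-speed and ETA estimation."""

    def __init__(self, total_bytes: int, now=time.monotonic):
        self.total = total_bytes
        self.sent = 0
        self.now = now            # injectable clock, for deterministic tests
        self.start = now()

    def update(self, n: int) -> None:
        self.sent += n

    def speed(self) -> float:
        """Average bytes per second since the start."""
        elapsed = self.now() - self.start
        return self.sent / elapsed if elapsed > 0 else 0.0

    def eta(self) -> float:
        """Estimated seconds remaining at the current average speed."""
        s = self.speed()
        return (self.total - self.sent) / s if s > 0 else float("inf")

# Deterministic demo with a fake clock: 250 of 1000 bytes sent in 10 seconds
clock = iter([0.0, 10.0, 10.0, 10.0]).__next__
t = ProgressTracker(total_bytes=1000, now=clock)
t.update(250)
spd = t.speed()       # 250 bytes / 10 s = 25.0 bytes/s
remaining = t.eta()   # 750 bytes / 25 bytes/s = 30.0 s
print(spd, remaining)
```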

Advantages of Python Implementation

1. Modern Language Features

  • Type hints: Better IDE support and error detection
  • f-strings: Cleaner string formatting
  • Context managers: Automatic resource cleanup
  • Decorators: Clean separation of concerns (e.g., retry logic)

2. Better Developer Experience

  • No compilation: Faster iteration cycle
  • REPL: Interactive testing and debugging
  • Rich ecosystem: Thousands of quality libraries
  • Better tooling: Black, Ruff, MyPy for quality

3. Simpler Deployment

  • No JVM: Smaller footprint
  • pip/Poetry: Simple dependency management
  • Virtual environments: Isolated dependencies
  • Docker: Easy containerization

4. Type Safety Without Verbosity

// Java
public String uploadFile(Resource file, String parentPath)
    throws UploaderException {
    // Implementation
}

# Python (with type hints)
def upload_file(self, file: Resource, parent_path: str) -> str | None:
    # Implementation

5. Better Configuration Management

# Pydantic automatically:
# - Loads from environment variables
# - Validates types
# - Converts strings to proper types
# - Provides helpful error messages

config = UploaderConfig()  # That's it!

Migration Path from Java

For Users

Java Command:

java -jar DVUploader-v1.3.0.jar \
  -server=https://dataverse.org \
  -key=API_KEY \
  -did=doi:10.xxx/yyy \
  file.csv

Python Equivalent:

dv-upload file.csv \
  --server https://dataverse.org \
  --key API_KEY \
  --dataset doi:10.xxx/yyy

For Developers

Extending Java Version:

public class MyUploader extends AbstractUploader {
    @Override
    protected String uploadFile(Resource file, String parentPath) {
        // Implementation
    }
}

Extending Python Version:

class MyUploader(AbstractUploader):
    def upload_file(self, file: Resource, parent_path: str) -> str | None:
        # Implementation

Next Steps

Phase 1: Core Completion (Current)

  • ✅ Backend architecture
  • ✅ Core uploaders
  • ✅ Resource abstraction
  • ✅ HTTP client
  • ✅ Configuration

Phase 2: Feature Parity

  • ⬜ BagIt support
  • ⬜ SEAD uploader
  • ⬜ OAuth authentication
  • ⬜ Metadata processing
  • ⬜ Progress tracking

Phase 3: Testing & Quality

  • ⬜ Unit tests (80%+ coverage)
  • ⬜ Integration tests
  • ⬜ End-to-end tests
  • ⬜ Performance benchmarks
  • ⬜ Documentation review

Phase 4: Release

  • ⬜ Package for PyPI
  • ⬜ Docker image
  • ⬜ GitHub Actions CI/CD
  • ⬜ Release documentation
  • ⬜ Community announcement

Getting Started

Installation

cd dataverse_uploader_python
pip install -e .

Basic Usage

# Set environment variables
export DV_SERVER_URL=https://demo.dataverse.org
export DV_API_KEY=your-api-key
export DV_DATASET_PID=doi:10.5072/FK2/EXAMPLE

# Upload files
dv-upload data/ --recurse --verify

Programmatic Usage

from dataverse_uploader.core.config import UploaderConfig
from dataverse_uploader.uploaders.dataverse import DataverseUploader

config = UploaderConfig(
    server_url="https://demo.dataverse.org",
    api_key="your-api-key",
    dataset_pid="doi:10.5072/FK2/EXAMPLE",
    recurse_directories=True,
    verify_checksums=True,
)

with DataverseUploader(config) as uploader:
    uploader.process_requests(["data/"])

Contributing

See README.md for contribution guidelines.

License

Apache License 2.0 (same as original Java version)

Acknowledgments

Python implementation based on the original Java DVUploader developed by:

  • Texas Digital Library
  • Global Dataverse Community Consortium
  • University of Michigan
  • NCSA/SEAD Project