This document summarizes the Python port of the DVUploader (Dataverse bulk uploader) from Java to Python, including architectural decisions, key features, and next steps.
Original Repository: dataverse-uploader (Java)
- Purpose: Command-line bulk uploader for Dataverse and SEAD/Clowder repositories
- Main Components:
  - AbstractUploader: Base class with upload workflow
  - DVUploader: Dataverse-specific implementation
  - SEADUploader: SEAD/Clowder implementation
- Resource abstraction layer for files, directories, and BagIt bags
- HTTP client utilities with Apache HttpComponents
- Metadata processing (OAI-ORE, JSON-LD)
Key Features:
- Bulk file uploads (handles 1000+ files)
- Direct upload to S3 storage
- Checksum verification (MD5, SHA-1, SHA-256)
- Resume capability (skip existing files)
- Multipart uploads for large files
- BagIt bag support
- OAuth authentication
- Retry logic
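The checksum verification above (MD5, SHA-1, SHA-256) maps directly onto Python's hashlib. A minimal sketch of how streaming verification can work in the Python port, hashing in fixed-size chunks so even multi-gigabyte files use constant memory (the helper name is illustrative, not the port's actual API):

```python
import hashlib

def compute_checksum(path: str, algorithm: str = "md5", chunk_size: int = 1024 * 1024) -> str:
    """Stream a file through hashlib in chunks so memory use stays flat.

    Illustrative helper mirroring the checksum step (MD5/SHA-1/SHA-256)
    described above; `hashlib.new` accepts any algorithm name hashlib supports.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```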
| Category | Technology | Rationale |
|---|---|---|
| HTTP Client | httpx | Modern, async-capable, connection pooling |
| Configuration | Pydantic | Type-safe, validation, env var support |
| CLI | Typer | Beautiful CLI with minimal code |
| Terminal UI | Rich | Progress bars, colored output |
| JSON | orjson | Fastest JSON library for Python |
| Retry Logic | tenacity | Declarative retry with exponential backoff |
| Testing | pytest | Industry standard with excellent ecosystem |
| Code Quality | black, ruff, mypy | Formatting, linting, type checking |
```python
from typing import Optional

from pydantic_settings import BaseSettings  # `from pydantic import BaseSettings` on pydantic v1

class UploaderConfig(BaseSettings):
    """Type-safe configuration with validation"""
    server_url: str
    api_key: Optional[str] = None
    dataset_pid: Optional[str] = None
    verify_checksums: bool = False
    # ... 20+ other settings
```

Features:
- Pydantic-based validation
- Environment variable support
- .env file loading
- Type hints throughout
- Field validators
```python
from abc import ABC, abstractmethod

class AbstractUploader(ABC):
    """Template method pattern for upload workflow"""

    def process_requests(self, paths):
        # Template method - defines the workflow
        self.validate_configuration()
        for path in paths:
            resource = ResourceFactory.create(path)  # resolve path to a Resource (method name illustrative)
            self._process_resource(resource, "/")

    @abstractmethod
    def upload_file(self, file, parent_path):
        """Implemented by subclasses"""
        ...
```

Responsibilities:
- Resource processing workflow
- Statistics tracking
- Recursive directory handling
- Extension points for specific repositories
```python
from abc import ABC, abstractmethod
from typing import IO

class Resource(ABC):
    """Abstract interface for all resource types"""
    @abstractmethod
    def get_name(self) -> str: ...
    @abstractmethod
    def get_input_stream(self) -> IO[bytes]: ...
    @abstractmethod
    def get_hash(self, algorithm: str) -> str: ...
```

Implementations:
- FileResource: Local filesystem files
- PublishedResource: Remote published content
- BagResource: BagIt bags
- Extensible for new types
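A concrete FileResource satisfying this interface might look like the sketch below (the Resource ABC is repeated here so the example is self-contained; the details are illustrative, not the port's exact code):

```python
import hashlib
from abc import ABC, abstractmethod
from pathlib import Path
from typing import IO

class Resource(ABC):
    """Abstract interface for all resource types (repeated for self-containment)."""
    @abstractmethod
    def get_name(self) -> str: ...
    @abstractmethod
    def get_input_stream(self) -> IO[bytes]: ...
    @abstractmethod
    def get_hash(self, algorithm: str) -> str: ...

class FileResource(Resource):
    """Local filesystem implementation of the Resource interface (sketch)."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def get_name(self) -> str:
        return self.path.name

    def get_input_stream(self) -> IO[bytes]:
        return self.path.open("rb")

    def get_hash(self, algorithm: str) -> str:
        # Stream the file through hashlib so large files hash in constant memory
        hasher = hashlib.new(algorithm)
        with self.get_input_stream() as stream:
            for chunk in iter(lambda: stream.read(65536), b""):
                hasher.update(chunk)
        return hasher.hexdigest()
```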
```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class HTTPClient:
    """HTTP client with pooling and retry logic"""
    @retry(
        retry=retry_if_exception_type(NetworkError),  # NetworkError from the exception hierarchy
        wait=wait_exponential(multiplier=1, min=2, max=30),
        stop=stop_after_attempt(3),
    )
    def post(self, url, **kwargs):
        # Automatic retry with exponential backoff
        ...
```

Features:
- Connection pooling (configurable size)
- Automatic retries (network errors)
- Exponential backoff
- Timeout handling
- Streaming support
```python
class DataverseUploader(AbstractUploader):
    """Dataverse-specific implementation"""
    def upload_file(self, file, parent_path):
        if self.config.direct_upload:
            return self._upload_file_direct(file, parent_path)
        else:
            return self._upload_file_traditional(file, parent_path)
```

Features:
- Direct S3 upload support
- Traditional API upload
- Dataset metadata loading
- Existing file detection
- Checksum verification
- Dataset lock handling
```text
dataverse_uploader_python/
├── pyproject.toml               # Project metadata & dependencies
├── requirements.txt             # Direct dependencies
├── README.md                    # User documentation
├── ARCHITECTURE.md              # Technical architecture
├── .env.example                 # Configuration template
│
├── dataverse_uploader/
│   ├── __init__.py
│   ├── cli.py                   # Command-line interface
│   │
│   ├── core/
│   │   ├── __init__.py
│   │   ├── abstract_uploader.py # Base uploader class
│   │   ├── config.py            # Configuration management
│   │   └── exceptions.py        # Exception hierarchy
│   │
│   ├── resources/
│   │   ├── __init__.py
│   │   ├── base.py                # Abstract Resource
│   │   ├── file_resource.py       # File implementation
│   │   ├── published_resource.py  # Published content
│   │   ├── bag_resource.py        # BagIt support
│   │   └── resource_factory.py    # Factory pattern
│   │
│   ├── uploaders/
│   │   ├── __init__.py
│   │   ├── dataverse.py         # Dataverse implementation
│   │   └── sead.py              # SEAD implementation
│   │
│   ├── auth/
│   │   ├── __init__.py
│   │   ├── api_key.py
│   │   └── oauth.py
│   │
│   └── utils/
│       ├── __init__.py
│       ├── http_client.py       # HTTP utilities
│       ├── hashing.py           # File hashing
│       ├── metadata.py          # Metadata processing
│       └── progress.py          # Progress tracking
│
├── tests/
│   ├── __init__.py
│   ├── test_resources.py
│   ├── test_uploaders.py
│   └── fixtures/
│
└── examples/
    ├── simple_upload.py
    └── batch_upload.py
```
- **Configuration System**
  - Pydantic-based settings
  - Environment variable support
  - Validation and type safety
- **Exception Hierarchy**
  - Custom exception types
  - Clear error messages
  - Proper inheritance
- **Resource Abstraction**
  - Base Resource ABC
  - FileResource implementation
  - Stream handling
  - Hash calculation
- **HTTP Client**
  - Connection pooling
  - Retry logic with exponential backoff
  - Timeout handling
  - Authentication
- **Abstract Uploader**
  - Template method pattern
  - Resource processing workflow
  - Statistics tracking
  - Extension points
- **Dataverse Uploader**
  - Configuration validation
  - Dataset metadata loading
  - File existence checking
  - Traditional upload
  - Direct S3 upload
  - File registration
  - Checksum verification
- **CLI Interface**
  - Typer-based command structure
  - Rich terminal output
  - Argument parsing
  - Help documentation
- **Documentation**
  - Comprehensive README
  - Architecture documentation
  - Code examples
  - Configuration guide
- **Additional Resource Types**
  - PublishedResource (for OAI-ORE)
  - BagResource (for BagIt bags)
  - ResourceFactory (factory pattern)
- **SEAD Uploader**
  - SEAD API implementation
  - OAuth authentication
  - Collection/dataset creation
- **Authentication Modules**
  - OAuth flow implementation
  - Token management
  - Session handling
- **Metadata Processing**
  - OAI-ORE parsing
  - JSON-LD processing
  - Metadata filtering
  - Gray/black list support
- **Progress Tracking**
  - Real-time progress bars
  - Upload speed calculation
  - ETA estimation
- **Hashing Utilities**
  - Additional algorithms
  - Streaming hash calculation
  - Hash caching
- **Testing**
  - Unit tests for all components
  - Integration tests
  - Mock Dataverse server
  - Fixture data
- **Advanced Features**
  - Multipart upload with parallel parts
  - Resume from interruption
  - Batch file registration
  - Directory structure preservation
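The Typer-based CLI interface can be sketched roughly as below. The option names mirror the dv-upload command shown elsewhere in this document, but the wiring is illustrative, not the contents of cli.py:

```python
from typing import List

import typer
from rich.console import Console

app = typer.Typer(help="Bulk upload files to a Dataverse dataset.")
console = Console()

@app.command()
def upload(
    paths: List[str] = typer.Argument(..., help="Files or directories to upload"),
    server: str = typer.Option(..., help="Dataverse server URL"),
    key: str = typer.Option(..., help="API key"),
    dataset: str = typer.Option(..., help="Dataset persistent identifier (DOI)"),
    recurse: bool = typer.Option(False, help="Recurse into directories"),
    verify: bool = typer.Option(False, help="Verify checksums after upload"),
) -> None:
    # Illustrative body -- the real command would build an UploaderConfig
    # and hand it to DataverseUploader.
    console.print(f"Uploading {len(paths)} path(s) to {dataset} on {server}")

if __name__ == "__main__":
    app()
```

Typer derives the `--server`, `--key`, `--dataset` option names from the parameter names and generates `--help` output automatically, which is why the CLI needs so little code.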
- Type hints: Better IDE support and error detection
- f-strings: Cleaner string formatting
- Context managers: Automatic resource cleanup
- Decorators: Clean separation of concerns (e.g., retry logic)
- No compilation: Faster iteration cycle
- REPL: Interactive testing and debugging
- Rich ecosystem: Thousands of quality libraries
- Better tooling: Black, Ruff, MyPy for quality
- No JVM: Smaller footprint
- pip/Poetry: Simple dependency management
- Virtual environments: Isolated dependencies
- Docker: Easy containerization
```java
// Java
public String uploadFile(Resource file, String parentPath)
        throws UploaderException {
    // Implementation
}
```

```python
# Python (with type hints)
def upload_file(self, file: Resource, parent_path: str) -> str | None:
    # Implementation
    ...
```

```python
# Pydantic automatically:
# - Loads from environment variables
# - Validates types
# - Converts strings to proper types
# - Provides helpful error messages
config = UploaderConfig()  # That's it!
```

Java Command:

```shell
java -jar DVUploader-v1.3.0.jar \
    -server=https://dataverse.org \
    -key=API_KEY \
    -did=doi:10.xxx/yyy \
    file.csv
```

Python Equivalent:

```shell
dv-upload file.csv \
    --server https://dataverse.org \
    --key API_KEY \
    --dataset doi:10.xxx/yyy
```

Extending the Java Version:

```java
public class MyUploader extends AbstractUploader {
    @Override
    protected String uploadFile(Resource file, String parentPath) {
        // Implementation
    }
}
```

Extending the Python Version:

```python
class MyUploader(AbstractUploader):
    def upload_file(self, file: Resource, parent_path: str) -> str | None:
        # Implementation
        ...
```

- ✅ Backend architecture
- ✅ Core uploaders
- ✅ Resource abstraction
- ✅ HTTP client
- ✅ Configuration
- ⬜ BagIt support
- ⬜ SEAD uploader
- ⬜ OAuth authentication
- ⬜ Metadata processing
- ⬜ Progress tracking
- ⬜ Unit tests (80%+ coverage)
- ⬜ Integration tests
- ⬜ End-to-end tests
- ⬜ Performance benchmarks
- ⬜ Documentation review
- ⬜ Package for PyPI
- ⬜ Docker image
- ⬜ GitHub Actions CI/CD
- ⬜ Release documentation
- ⬜ Community announcement
```shell
cd dataverse_uploader_python
pip install -e .

# Set environment variables
export DV_SERVER_URL=https://demo.dataverse.org
export DV_API_KEY=your-api-key
export DV_DATASET_PID=doi:10.5072/FK2/EXAMPLE

# Upload files
dv-upload data/ --recurse --verify
```

```python
from dataverse_uploader.core.config import UploaderConfig
from dataverse_uploader.uploaders.dataverse import DataverseUploader

config = UploaderConfig(
    server_url="https://demo.dataverse.org",
    api_key="your-api-key",
    dataset_pid="doi:10.5072/FK2/EXAMPLE",
    recurse_directories=True,
    verify_checksums=True,
)

with DataverseUploader(config) as uploader:
    uploader.process_requests(["data/"])
```

See README.md for contribution guidelines.
Apache License 2.0 (same as original Java version)
Python implementation based on the original Java DVUploader developed by:
- Texas Digital Library
- Global Dataverse Community Consortium
- University of Michigan
- NCSA/SEAD Project