┌─────────────────────────────────────────────────────────────────┐
│ CLI / Python API │
│ (dataverse_uploader.cli) │
└────────────────────────┬─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Configuration Layer │
│ (UploaderConfig / Pydantic) │
│ - Environment variables - .env files - Validation │
└────────────────────────┬─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Abstract Uploader Layer │
│ (AbstractUploader ABC) │
│ - Template method pattern │
│ - Resource processing workflow │
│ - Statistics tracking │
└────┬──────────────────┬──────────────────┬───────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────┐ ┌──────────────┐ ┌─────────────┐
│Dataverse │ │ SEAD/Clowder │ │ Custom │
│ Uploader │ │ Uploader │ │ Uploaders │
└────┬─────┘ └──────┬───────┘ └──────┬──────┘
│ │ │
└─────────────────┴────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Resource Abstraction │
│ (Resource ABC) │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ │
│ │ File │ │ Published │ │ Bag │ │
│ │ Resource │ │ Resource │ │ Resource │ │
│ └────────────┘ └────────────┘ └────────────┘ │
└────────────────────────┬─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ Utility Layer │
│ ┌───────────┐ ┌──────────┐ ┌────────┐ ┌────────────┐ │
│ │ HTTP │ │ Hashing │ │Metadata│ │ Progress │ │
│ │ Client │ │ Utilities│ │ Parser │ │ Tracking │ │
│ └───────────┘ └──────────┘ └────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────────┘
- Purpose: User interface via command line
- Technologies: Typer, Rich
- Responsibilities:
- Parse command-line arguments
- Display progress and results
- Handle errors and provide user feedback
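The CLI responsibilities above can be sketched with Typer. The command and option names here are illustrative only, not the actual `dataverse_uploader.cli` surface:

```python
import typer

app = typer.Typer()

# Hypothetical command: the real CLI defines its own commands and options.
@app.command()
def upload(
    path: str,
    server_url: str = typer.Option(..., "--server-url", envvar="DATAVERSE_SERVER_URL"),
):
    """Upload PATH to the repository at SERVER_URL."""
    typer.echo(f"Uploading {path} to {server_url}")
```

With a single registered command, Typer exposes it as the app's entry point, and Rich (when installed) styles the generated `--help` output.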
- Purpose: Centralized configuration management
- Technologies: Pydantic Settings
- Responsibilities:
- Load config from env vars, .env files, or arguments
- Validate configuration values
- Type-safe configuration access
- Purpose: Template for all uploader implementations
- Pattern: Template Method, Strategy
- Responsibilities:
- Define upload workflow
- Track statistics
- Process resources recursively
- Provide extension points for specific implementations
- Responsibilities:
- Dataverse API interaction
- Direct S3 upload support
- Multipart upload handling
- Dataset lock management
- File registration
- Responsibilities:
- SEAD/Clowder API interaction
- OAuth authentication
- Collection/dataset creation
- Metadata mapping
- Purpose: Unified interface for different resource types
- Pattern: Abstract Factory
- Components:
  - Resource (ABC): Base interface
  - FileResource: Local filesystem files
  - PublishedResource: Remote published content
  - BagResource: BagIt bag support
  - ResourceFactory: Factory for creating resources
- Purpose: Robust HTTP communication
- Features:
- Connection pooling
- Automatic retries with exponential backoff
- Timeout handling
- Streaming support
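A sketch of such a client built on `requests` with `urllib3`'s `Retry`; the pool size, backoff factor, and status list are illustrative defaults, not the utility layer's actual values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(pool_size: int = 10, retries: int = 3) -> requests.Session:
    """Build a pooled session that retries transient failures."""
    retry = Retry(
        total=retries,
        backoff_factor=0.5,  # exponential backoff between retries
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset({"GET", "PUT", "POST"}),
    )
    adapter = HTTPAdapter(
        pool_connections=pool_size, pool_maxsize=pool_size, max_retries=retry
    )
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Streaming is then a matter of passing `stream=True` (downloads) or a file object (uploads) to the session's request methods.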
- Hashing: File integrity verification
- Metadata: OAI-ORE, JSON-LD processing
- Progress: Upload progress tracking
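For example, the hashing utility's chunked approach can be sketched as follows (the function name is illustrative):

```python
import hashlib

def hash_file(path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.new(algorithm)  # algorithm name: "md5", "sha256", ...
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```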
1. CLI Entry
↓
2. Load Configuration
↓
3. Create Uploader Instance
↓
4. Validate Configuration
↓
5. Load Dataset Metadata
↓
6. For each path:
├── Create Resource
├── Check if exists
├── If directory:
│ ├── Create directory (virtual)
│ └── Recurse into children
└── If file:
├── Check checksum (if verify enabled)
├── Upload file (direct or traditional)
└── Track statistics
↓
7. Post-process (e.g., register batch)
↓
8. Display summary
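Step 6's recursion can be sketched as follows, with `upload_file`, `create_directory`, and `stats` standing in for the real uploader hooks and statistics tracking:

```python
from pathlib import Path

def process_path(path: Path, upload_file, create_directory, stats):
    """Recurse into directories; upload files and count them."""
    if path.is_dir():
        create_directory(path)  # virtual directory on the server
        for child in sorted(path.iterdir()):
            process_path(child, upload_file, create_directory, stats)
    else:
        upload_file(path)
        stats["uploaded"] = stats.get("uploaded", 0) + 1
```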
1. Request upload URLs from Dataverse
↓
2. Receive S3 presigned URL + storage identifier
↓
3. Upload file directly to S3
↓
4. Calculate file checksum
↓
5. Register file with Dataverse
├── Send storage identifier
├── Send checksum
└── Send metadata
↓
6. Return file ID
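Step 5's registration message can be sketched like this; the JSON field names are assumptions modeled on the flow above rather than the exact Dataverse API:

```python
import hashlib
import json

def registration_payload(storage_identifier: str, data: bytes, filename: str) -> str:
    """Build the step-5 registration body from the step-2 storage
    identifier and the step-4 checksum of the bytes sent to S3 in step 3."""
    return json.dumps({
        "storageIdentifier": storage_identifier,
        "fileName": filename,
        "md5Hash": hashlib.md5(data).hexdigest(),
    })
```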
Intent: Create families of related objects (files, directories, bags) without specifying concrete classes.

```python
class ResourceFactory:
    @staticmethod
    def create(path: Path) -> Resource:
        if path.is_file():
            return FileResource(path)
        elif path.is_dir():
            return DirectoryResource(path)
        # ... other types
```

Intent: Define the skeleton of an algorithm; subclasses override specific steps.
```python
class AbstractUploader(ABC):
    def process_requests(self, paths):
        # Template method - defines the workflow
        self.validate_configuration()
        for path in paths:
            resource = self.create_resource(path)
            self.upload_resource(resource)  # Hook method

    @abstractmethod
    def upload_file(self, file):
        # Hook method - implemented by subclasses
        pass
```

Intent: Define a family of algorithms (Dataverse, SEAD) and make them interchangeable.
```python
# Client code
if repo_type == "dataverse":
    uploader = DataverseUploader(config)
elif repo_type == "sead":
    uploader = SEADUploader(config)
uploader.upload(files)  # Same interface, different strategy
```

Intent: Convert the interface of a class into another interface that clients expect.
```python
class BagResource(Resource):
    """Adapts the BagIt format to the Resource interface."""
    def __init__(self, bag_path):
        self.path = bag_path
        self.bag = bagit.Bag(bag_path)

    def get_input_stream(self):
        # Adapt the BagIt interface to the Resource interface
        return self.bag.get_file(self.path)
```

Exception Hierarchy:
UploaderException (base)
├── AuthenticationError
├── UploadError
├── ResourceError
├── MetadataError
├── ValidationError
├── HashMismatchError
├── DatasetLockedError
└── NetworkError
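Transcribed into Python, the hierarchy above is simply:

```python
class UploaderException(Exception):
    """Base class for all uploader errors."""

class AuthenticationError(UploaderException): ...
class UploadError(UploaderException): ...
class ResourceError(UploaderException): ...
class MetadataError(UploaderException): ...
class ValidationError(UploaderException): ...
class HashMismatchError(UploaderException): ...
class DatasetLockedError(UploaderException): ...
class NetworkError(UploaderException): ...
```

Because everything derives from `UploaderException`, callers can catch the base class for blanket handling or a specific subclass for targeted recovery.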
Error Handling Approach:
- Network errors: Automatic retry with exponential backoff
- Authentication errors: Immediate failure with clear message
- Validation errors: Early detection before upload starts
- Upload errors: Log and continue with remaining files
- Critical errors: Stop processing and report
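The "log and continue" policy for upload errors can be sketched as follows (a local `UploadError` stands in for the one from the exception hierarchy):

```python
class UploadError(Exception):
    """Stand-in for the uploader's UploadError."""

def upload_all(files, upload_one, log):
    """Upload every file; log per-file failures and keep going."""
    failed = []
    for item in files:
        try:
            upload_one(item)
        except UploadError as exc:  # non-fatal: record and continue
            log(f"upload failed for {item}: {exc}")
            failed.append(item)
    return failed
```

Fatal categories (authentication, validation) would instead propagate out of the loop and stop processing.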
- Test each component in isolation
- Mock external dependencies (HTTP, filesystem)
- Focus on business logic
- Test interaction between components
- Use test doubles for external services
- Verify configuration loading
- Test against demo Dataverse instance
- Verify actual uploads and checksums
- Test error recovery
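A unit test in this style might mock the HTTP session; `upload_via` is a made-up stand-in for a real uploader method, not the project's actual API:

```python
from unittest import mock

def upload_via(session, url, payload):
    """The unit under test: a thin wrapper over session.post."""
    response = session.post(url, json=payload)
    response.raise_for_status()
    return response.json()["data"]["id"]

def test_upload_via_posts_payload():
    # Mock the external dependency (HTTP) and check the business logic.
    session = mock.Mock()
    session.post.return_value.json.return_value = {"data": {"id": 42}}
    file_id = upload_via(session, "https://demo.invalid/api/add", {"fileName": "a.txt"})
    assert file_id == 42
    session.post.assert_called_once_with(
        "https://demo.invalid/api/add", json={"fileName": "a.txt"}
    )
```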
- Maintain pool of HTTP connections
- Reuse connections for multiple uploads
- Configurable pool size
- Support multipart uploads with parallel parts
- Configurable concurrency level
- Thread-safe resource sharing
- Stream large files (don't load entirely in memory)
- Chunked reading for hash calculation
- Limited reader for partial uploads
- Cache calculated hashes
- Cache dataset metadata
- Avoid redundant API calls
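The streaming and partial-read points above can be sketched as a chunked reader plus a limited reader for multipart parts (names are illustrative):

```python
def iter_chunks(fileobj, chunk_size: int = 1 << 20):
    """Yield chunks so large files are never held fully in memory."""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            return
        yield chunk

def read_limited(fileobj, limit: int, chunk_size: int = 1 << 20) -> bytes:
    """Read at most `limit` bytes - one part of a multipart upload."""
    parts, remaining = [], limit
    while remaining > 0:
        chunk = fileobj.read(min(chunk_size, remaining))
        if not chunk:
            break
        parts.append(chunk)
        remaining -= len(chunk)
    return b"".join(parts)
```

The same chunk iterator can feed the hash calculation, so a file is read once per pass regardless of size.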
- API Key Storage:
  - Never log API keys
  - Support environment variables
  - Mask in error messages
- SSL/TLS:
  - Verify certificates by default
  - Option to disable (testing only)
  - Use modern TLS versions
- Input Validation:
  - Validate all configuration
  - Sanitize file paths
  - Check file sizes
- Checksum Verification:
  - Calculate before upload
  - Verify after upload
  - Multiple algorithm support
- Extend AbstractUploader
  - Implement abstract methods
  - Add repository-specific logic
  - Register in CLI or use programmatically
- Extend Resource ABC
  - Implement required methods
  - Add to ResourceFactory if needed
  - Use in uploader
- Extend MetadataProcessor
  - Add custom mapping logic
  - Configure in uploader
- Extend Authenticator base
  - Implement authentication flow
  - Use in HTTP client