The Large File Generator is a Python script designed to create test files of various sizes and formats for testing the Dataverse Uploader. It generates realistic data files that can be used to test upload performance, chunking, error handling, and large file management.
- Features
- Installation
- Quick Start
- Usage Guide
- File Types
- Size Categories
- Custom Generation
- Advanced Usage
- Testing Scenarios
- Performance Notes
- Troubleshooting
- Examples
- ✅ Multiple File Types: CSV, JSON, text, binary, and log files
- ✅ Flexible Sizing: From 1 MB to 1+ GB files
- ✅ Realistic Data: Synthetic but representative data patterns
- ✅ Progress Tracking: Real-time progress indicators
- ✅ Memory Efficient: Streams large files without loading them into memory
- ✅ Nested Structures: Creates complex directory hierarchies
- ✅ Interactive Menu: User-friendly command-line interface
- ✅ Summary Reports: Detailed generation statistics
| Format | Extension | Use Case |
|---|---|---|
| CSV | `.csv` | Tabular data, datasets |
| JSON | `.json` | Structured data, API responses |
| Text | `.txt` | Documents, logs, plain text |
| Binary | `.bin` | Binary data, images, media |
| Log | `.log` | Server logs, application logs |
- Python 3.9 or higher
- No external dependencies (uses only Python standard library)
- Download the script:

  ```bash
  cd examples
  # Copy generate_large_files.py to this directory
  ```

- Make it executable (optional, Unix/Linux):

  ```bash
  chmod +x generate_large_files.py
  ```

- Verify installation:

  ```bash
  python generate_large_files.py
  ```
```bash
python generate_large_files.py
# Select option 0 (Quick demo)
```

This creates three small files (~15 MB total) in the `data/` directory:

- `demo_data.csv` - 10,000 rows
- `demo_logs.json` - 5,000 records
- `demo_text.txt` - 5 MB

```bash
cd ..
dv-upload data/ --recurse --verify --list-only
```

Run the script and follow the prompts:

```bash
python generate_large_files.py
```

Menu Options:
```
Select files to generate:

Size Categories:
1. Small files (1-10 MB) - Quick tests
2. Medium files (10-100 MB) - Standard tests
3. Large files (100-500 MB) - Stress tests
4. Extra large files (500+ MB) - Maximum stress
5. All sizes - Complete test suite
6. Custom - Specify your own
0. Quick demo (small files only)

Enter choice (0-6):
```
- Launch the script:

  ```bash
  python generate_large_files.py
  ```

- Select an option (0-6).

- Wait for generation:
  - Progress indicators show completion percentage
  - Files are created in the `data/` directory

- Review the summary:
  - Total files created
  - Total size
  - List of all files
Purpose: Test tabular data uploads, Dataverse's tabular file ingestion.
Characteristics:
- Configurable rows and columns
- Mixed data types (integers, floats, text, dates)
- Proper CSV formatting with headers
- Comma-separated values
Generation Parameters:

```python
create_large_csv_file(
    filename="data.csv",
    num_rows=100000,   # Number of data rows
    num_columns=10     # Number of columns
)
```

Sample Output:

```
column_0,column_1,column_2,column_3,column_4,...
0,text_1234,456.789,12345,2024-03-15 10:23:45,...
1,text_5678,789.012,67890,2024-06-20 14:56:12,...
```

Use Cases:
- Testing Dataverse tabular ingestion
- Verifying CSV → TAB conversion
- Testing with different row counts
- Performance benchmarking
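For a sense of how rows like the sample above can be produced, here is a minimal sketch of mixed-type batch row generation (the column layout, value ranges, and function name are illustrative assumptions, not the script's actual implementation):

```python
import csv
import random
from datetime import datetime, timedelta

def write_csv_batch(path, num_rows=10_000, num_columns=5):
    """Stream mixed-type rows (int, text, float, int, timestamp) to disk."""
    start = datetime(2024, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([f"column_{i}" for i in range(num_columns)])
        for row_id in range(num_rows):
            timestamp = start + timedelta(seconds=random.randint(0, 31_536_000))
            row = [
                row_id,                                 # integer id
                f"text_{random.randint(1000, 9999)}",   # short text value
                round(random.uniform(0, 1000), 3),      # float
                random.randint(10_000, 99_999),         # integer
                timestamp.strftime("%Y-%m-%d %H:%M:%S"),
            ]
            writer.writerow(row[:num_columns])

write_csv_batch("sample.csv")
```

Writing row by row through the `csv` module keeps memory use flat even for very large row counts.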
Purpose: Test structured data uploads, API response formats.
Characteristics:
- Valid JSON array of objects
- Nested structures
- Mixed data types
- Pretty-printed formatting
Generation Parameters:

```python
create_large_json_file(
    filename="logs.json",
    num_records=10000   # Number of JSON objects
)
```

Sample Output:

```json
[
  {
    "id": 0,
    "timestamp": "2024-11-06T10:30:45.123456",
    "user": "user_1234",
    "value": 456.789,
    "status": "active",
    "metadata": {
      "key1": 42,
      "key2": "value_789",
      "key3": [0.123, 0.456, 0.789, 0.012, 0.345]
    },
    "description": "Random text data..."
  },
  ...
]
```

Use Cases:
- Testing JSON file uploads
- API response simulation
- Nested data structures
- Metadata testing
Purpose: Test plain text uploads, document processing.
Characteristics:
- Random text content
- Mixed characters (letters, numbers, punctuation)
- Natural line breaks
- UTF-8 encoding
Generation Parameters:

```python
create_large_text_file(
    filename="document.txt",
    size_mb=50   # Target size in megabytes
)
```

Use Cases:
- Testing large document uploads
- Text file processing
- Encoding verification
- Chunked upload testing
Purpose: Test non-text file uploads, binary data handling.
Characteristics:
- Random binary data
- No text encoding
- Exact byte size control
- OS-level random data
Generation Parameters:

```python
create_binary_file(
    filename="data.bin",
    size_mb=100   # Target size in megabytes
)
```

Use Cases:
- Testing binary file uploads
- Simulating images/media
- Checksum verification
- Direct upload testing
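Because the binary generator streams OS-level random data, memory usage stays flat regardless of target size. A minimal sketch of that chunked-write idea (the chunk size and function name are assumptions, not the script's internals):

```python
import os

def write_random_binary(path, size_mb, chunk_mb=1):
    """Write os.urandom() output in fixed-size chunks until the target size is reached."""
    chunk_bytes = chunk_mb * 1024 * 1024
    remaining = size_mb * 1024 * 1024
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            f.write(chunk)
            remaining -= len(chunk)

write_random_binary("sample.bin", size_mb=10)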
Purpose: Test server log uploads, line-based file processing.
Characteristics:
- Structured log format
- Timestamp for each line
- Log levels (DEBUG, INFO, WARN, ERROR, CRITICAL)
- Service names and request IDs
Generation Parameters:

```python
create_log_file(
    filename="server.log",
    num_lines=100000   # Number of log lines
)
```

Sample Output:

```
[2024-11-06T10:30:45.123456] INFO api | Operation 0 completed with status 200 - request_id=12345
[2024-11-06T10:30:46.234567] DEBUG database | Operation 1 completed with status 201 - request_id=23456
[2024-11-06T10:30:47.345678] ERROR cache | Operation 2 completed with status 500 - request_id=34567
```
Use Cases:
- Testing log file ingestion
- Line-by-line processing
- Large text file uploads
- Timestamp handling
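A log line in the format shown above could be assembled roughly like this (the level, service, and status pools are illustrative assumptions):

```python
import random
from datetime import datetime, timedelta

LEVELS = ["DEBUG", "INFO", "WARN", "ERROR", "CRITICAL"]
SERVICES = ["api", "database", "cache", "auth", "worker"]
STATUSES = [200, 201, 400, 404, 500]

def write_log_lines(path, num_lines=1_000):
    """Write structured log lines with incrementing timestamps."""
    ts = datetime.now()
    with open(path, "w") as f:
        for i in range(num_lines):
            ts += timedelta(seconds=1)
            f.write(
                f"[{ts.isoformat()}] {random.choice(LEVELS)} {random.choice(SERVICES)} | "
                f"Operation {i} completed with status {random.choice(STATUSES)} - "
                f"request_id={random.randint(10_000, 99_999)}\n"
            )

write_log_lines("sample.log")
```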
Purpose: Test recursive directory uploads, folder hierarchy preservation.
Characteristics:
- Multiple directory levels
- Files at each level
- Configurable depth and density
- Automatic structure creation
Generation Parameters:

```python
create_nested_directory_structure(
    base_name="nested_files",
    depth=3,           # Number of nested levels
    files_per_dir=5,   # Files in each directory
    file_size_kb=500   # Size of each file
)
```

Sample Structure:

```
data/nested_files/
├── file_1_0.txt
├── file_1_1.txt
├── file_1_2.txt
├── level_2/
│   ├── file_2_0.txt
│   ├── file_2_1.txt
│   └── level_3/
│       ├── file_3_0.txt
│       └── file_3_1.txt
```
Use Cases:

- Testing the `--recurse` flag
- Directory structure preservation
- Path handling
- Batch uploads
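Conceptually, the nested structure is one directory per level with a few filler files at each step. A minimal sketch of that idea (file contents and naming are illustrative assumptions, not the script's actual output):

```python
import os

def make_nested_structure(base_dir, depth=3, files_per_dir=5, file_size_kb=500):
    """Create `depth` nested directory levels, each holding `files_per_dir` text files."""
    current = base_dir
    for level in range(1, depth + 1):
        os.makedirs(current, exist_ok=True)
        for n in range(files_per_dir):
            path = os.path.join(current, f"file_{level}_{n}.txt")
            with open(path, "w") as f:
                f.write("x" * (file_size_kb * 1024))   # filler content at the target size
        current = os.path.join(current, f"level_{level + 1}")

make_nested_structure("nested_files")
```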
Purpose: Fast verification that uploader works.
Files Generated:

- `demo_data.csv` - 10,000 rows → ~2 MB
- `demo_logs.json` - 5,000 records → ~8 MB
- `demo_text.txt` - 5 MB
Generation Time: ~10 seconds
Use Case:

```bash
# Quick smoke test
python generate_large_files.py   # Select 0
dv-upload data/ --recurse --verify
```

Purpose: Quick tests, development iteration.
Files Generated:

- `small_data.csv` - 50,000 rows → ~10 MB
- `small_logs.json` - 10,000 records → ~5 MB
- `small_document.txt` - 5 MB
- `small_binary.bin` - 3 MB
- `small_server.log` - 100,000 lines → ~8 MB
Generation Time: ~30 seconds
Use Cases:
- Feature development
- Quick testing cycles
- CI/CD pipeline tests
- Basic functionality verification
Purpose: Standard testing, realistic file sizes.
Files Generated:

- `medium_data.csv` - 500,000 rows → ~50 MB
- `medium_logs.json` - 100,000 records → ~40 MB
- `medium_document.txt` - 50 MB
- `medium_binary.bin` - 30 MB
- `medium_server.log` - 1,000,000 lines → ~75 MB
Generation Time: ~3-5 minutes
Use Cases:
- Standard testing
- Performance benchmarking
- Chunked upload verification
- Direct vs traditional upload comparison
Purpose: Stress testing, performance evaluation.
Files Generated:

- `large_data.csv` - 2,000,000 rows → ~200 MB
- `large_logs.json` - 500,000 records → ~180 MB
- `large_document.txt` - 200 MB
- `large_binary.bin` - 150 MB
- `large_server.log` - 5,000,000 lines → ~350 MB
Generation Time: ~10-15 minutes
Use Cases:
- Stress testing
- Multipart upload testing
- Timeout handling
- Memory management verification
- S3 direct upload testing
Purpose: Maximum stress testing, edge case handling.
Files Generated:

- `xlarge_data.csv` - 5,000,000 rows → ~500 MB
- `xlarge_logs.json` - 1,000,000 records → ~400 MB
- `xlarge_document.txt` - 500 MB
- `xlarge_binary.bin` - 600 MB
- `xlarge_server.log` - 10,000,000 lines → ~700 MB
Generation Time: ~20-30 minutes
Use Cases:
- Maximum capacity testing
- Long-running upload tests
- Network resilience testing
- Dataset lock handling
- Server performance limits
Purpose: Comprehensive testing across all file sizes.
Files Generated:

- Small: `small_data.csv`, `small_document.txt`
- Medium: `medium_data.csv`, `medium_logs.json`
- Large: `large_data.csv`, `large_binary.bin`
- Extra Large: `xlarge_document.txt`, `xlarge_server.log`
- Nested: `nested_files/` directory structure
Generation Time: ~30-45 minutes
Use Cases:
- Pre-release testing
- Full regression testing
- Performance benchmarking suite
- Documentation examples
Purpose: Generate files with specific parameters for targeted testing.
```bash
python generate_large_files.py
# Select: 6
# File type: csv
# Filename: custom_data.csv
# Number of rows: 1000000
# Number of columns: 20
```

Use Cases:
- Test specific row counts
- Test wide tables (many columns)
- Replicate production data patterns
- Edge case testing
```bash
python generate_large_files.py
# Select: 6
# File type: json
# Filename: custom_api_response.json
# Number of records: 50000
```

Use Cases:
- API response simulation
- Specific record count testing
- JSON structure validation
```bash
python generate_large_files.py
# Select: 6
# File type: text
# Filename: custom_document.txt
# Size in MB: 250
```

Use Cases:
- Specific size requirements
- Documentation file testing
- Text processing benchmarks
```bash
python generate_large_files.py
# Select: 6
# File type: binary
# Filename: custom_image.bin
# Size in MB: 500
```

Use Cases:
- Simulate image/video files
- Test binary data handling
- Checksum verification
```bash
python generate_large_files.py
# Select: 6
# File type: log
# Filename: custom_application.log
# Number of lines: 5000000
```

Use Cases:
- Application log simulation
- Line-based processing
- Timestamp handling
You can import and use the generator in your own scripts:
```python
from generate_large_files import LargeFileGenerator

# Create generator
generator = LargeFileGenerator(output_dir="test_data")

# Generate specific files
generator.create_large_csv_file("dataset.csv", num_rows=100000, num_columns=15)
generator.create_large_json_file("api_logs.json", num_records=50000)
generator.create_large_text_file("document.txt", size_mb=100)

# Create nested structure
generator.create_nested_directory_structure(
    base_name="complex_structure",
    depth=5,
    files_per_dir=10,
    file_size_kb=1024
)
```

Create multiple file sets programmatically:
```python
from generate_large_files import LargeFileGenerator

generator = LargeFileGenerator(output_dir="batch_data")

# Generate multiple datasets
for i in range(5):
    generator.create_large_csv_file(
        f"dataset_{i}.csv",
        num_rows=100000 * (i + 1),
        num_columns=10
    )

# Generate time-series data
for day in range(7):
    generator.create_log_file(
        f"logs_day_{day}.log",
        num_lines=100000
    )
```

The same approach works inside a pytest fixture:

```python
import pytest

from generate_large_files import LargeFileGenerator
from dataverse_uploader.uploaders.dataverse import DataverseUploader


@pytest.fixture
def test_files(tmp_path):
    """Generate test files for each test."""
    generator = LargeFileGenerator(output_dir=str(tmp_path))
    generator.create_large_csv_file("test.csv", num_rows=1000, num_columns=5)
    return tmp_path


def test_csv_upload(test_files, uploader):
    """Test CSV file upload."""
    csv_file = test_files / "test.csv"
    assert csv_file.exists()

    result = uploader.upload_file(csv_file, "/")
    assert result is not None
```

Goal: Verify basic upload works
```bash
# Generate small files
python generate_large_files.py   # Select 0

# Test list mode
dv-upload data/ --list-only --recurse

# Upload
dv-upload data/ --recurse

# Verify
dv-upload data/ --recurse --verify   # Should skip all
```

Expected Result: All files uploaded successfully, second run skips everything.
Goal: Test MD5 hash verification
```bash
# Generate medium files
python generate_large_files.py   # Select 2

# Upload with verification
dv-upload data/ --recurse --verify

# Try uploading again
dv-upload data/ --recurse --verify
```

Expected Result:
- First upload: Files uploaded with checksum calculation
- Second upload: All files skipped (checksum matches)
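To cross-check the verification step independently, you can compute a local MD5 for any generated file using only the standard library (the filename below is just an example path):

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Compute an MD5 checksum without loading the whole file into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(md5_of_file("data/medium_binary.bin"))
```

Comparing this value with the checksum reported in Dataverse confirms the upload was not corrupted in transit.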
Goal: Compare upload methods
```bash
# Generate large files
python generate_large_files.py   # Select 3

# Test direct upload (S3)
time dv-upload data/large_binary.bin --verify

# Delete file from dataset
# ...

# Test traditional upload
time dv-upload data/large_binary.bin --verify --traditional
```

Expected Result: Compare upload times and methods.
Goal: Test chunked uploads for large files
```bash
# Generate extra large files
python generate_large_files.py   # Select 4

# Upload with default chunk size
dv-upload data/xlarge_binary.bin --verify

# Monitor logs for multipart behavior
```

Expected Result: File uploaded in multiple parts (check logs).
Goal: Test directory structure preservation
```bash
# Generate nested structure
python generate_large_files.py   # Select 5

# Upload nested directory
dv-upload data/nested_files/ --recurse

# Verify structure preserved in Dataverse
```

Expected Result: Directory hierarchy maintained in Dataverse.
Goal: Test upload resume capability
```bash
# Generate all sizes
python generate_large_files.py   # Select 5

# Start upload (interrupt after a few files)
dv-upload data/ --recurse --verify
# Press Ctrl+C

# Resume upload
dv-upload data/ --recurse --verify
```

Expected Result: Already-uploaded files skipped, new files uploaded.
Goal: Test behavior when dataset is locked
```bash
# Generate medium files
python generate_large_files.py   # Select 2

# Start first upload (don't wait for completion)
dv-upload data/ --recurse &

# Start second upload immediately
dv-upload data/ --recurse
```

Expected Result: Second upload waits for the lock or handles it gracefully.
Goal: Test retry logic and error handling
```bash
# Generate large files
python generate_large_files.py   # Select 3

# Upload with network issues
# (Simulate by disconnecting/reconnecting network during upload)
dv-upload data/ --recurse --verify
```

Expected Result: Automatic retries succeed after the network is restored.
| Option | Total Size | Generation Time | Files Created |
|---|---|---|---|
| 0 (Demo) | ~15 MB | 10 seconds | 3 |
| 1 (Small) | ~30 MB | 30 seconds | 5 |
| 2 (Medium) | ~250 MB | 3-5 minutes | 5 |
| 3 (Large) | ~1 GB | 10-15 minutes | 5 |
| 4 (XL) | ~2 GB | 20-30 minutes | 5 |
| 5 (All) | ~3 GB | 30-45 minutes | 10+ |
- CSV/JSON Generation: ~50-100 MB RAM (batch processing)
- Text Generation: ~20 MB RAM (streaming)
- Binary Generation: ~10 MB RAM (streaming)
- Log Generation: ~30 MB RAM (batch processing)
Always ensure sufficient disk space:
```bash
# Check available space
df -h .

# For "All Sizes" option: Need at least 4 GB free
# For custom large files: Add 20% overhead
```

- Use SSD: Faster write speeds improve generation time
- Close Other Apps: Reduce I/O contention
- Batch Mode: Generate overnight for large datasets
- Custom Sizes: Start small, increase as needed
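If generation is scripted, a quick pre-flight free-space check can avoid partially written files. A minimal sketch (the 4 GB threshold simply mirrors the guidance above):

```python
import shutil

def ensure_free_space(path=".", required_gb=4):
    """Raise early if the target filesystem has less free space than required_gb."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    if free_gb < required_gb:
        raise RuntimeError(
            f"Only {free_gb:.1f} GB free at {path!r}; need at least {required_gb} GB"
        )

ensure_free_space(".")
```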
Symptoms:
- Takes much longer than expected
- Progress stalls
Solutions:
```bash
# Check disk space
df -h .

# Check disk I/O
iostat -x 1

# Try smaller batch size (edit script):
batch_size = 1000   # Reduce from 10000

# Use faster disk (SSD if available)
```

Symptoms:
```
MemoryError: Unable to allocate array
```

Solutions:

```bash
# Reduce batch size in script
# For CSV: batch_size = 1000
# For JSON: batch_size = 500

# Close other applications

# Generate smaller files first
```

Symptoms:
```
PermissionError: [Errno 13] Permission denied: 'data/'
```

Solutions:

```bash
# Create directory manually
mkdir data

# Check permissions
ls -la data/

# Change permissions (Unix/Linux)
chmod 755 data/
```

Symptoms:
```
Invalid choice: x
```

Solutions:

```bash
# Ensure you enter a number 0-6
# No letters or special characters
# Press Enter after typing the number
```

Symptoms:
- Script completes but no files appear in `data/`

Solutions:

```bash
# Check current directory
pwd

# Look for data directory
ls -la | grep data

# Check script output for errors
python generate_large_files.py 2>&1 | tee output.log
```

Symptoms:
```
'python' is not recognized as an internal or external command
```

Solutions:

```bash
# Try python3
python3 generate_large_files.py

# Or use full path
/usr/bin/python3 generate_large_files.py

# Windows: Use py
py generate_large_files.py
```
```bash
# Generate demo files
python generate_large_files.py
# Select: 0

# Verify uploader works
dv-upload data/ --list-only --recurse

# Clean deployment test
dv-upload data/ --recurse --verify

# Cleanup
rm -rf data/
```
```bash
# Generate large files
python generate_large_files.py
# Select: 3

# Benchmark direct upload
time dv-upload data/ --recurse --verify > upload_direct.log 2>&1

# Delete files from Dataverse

# Benchmark traditional upload
time dv-upload data/ --recurse --verify --traditional > upload_trad.log 2>&1

# Compare results
diff upload_direct.log upload_trad.log
```
```yaml
# .github/workflows/test.yml
name: Test Upload

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Generate test files
        run: |
          cd examples
          python generate_large_files.py <<EOF
          1
          EOF

      - name: Test upload
        env:
          DV_SERVER_URL: ${{ secrets.DV_SERVER_URL }}
          DV_API_KEY: ${{ secrets.DV_API_KEY }}
          DV_DATASET_PID: ${{ secrets.DV_DATASET_PID }}
        run: |
          dv-upload data/ --recurse --verify --list-only
```
```bash
# Generate all file types for documentation
python generate_large_files.py
# Select: 5

# Upload and capture output
dv-upload data/ --recurse --verify | tee upload_output.txt

# Extract statistics for documentation
grep "Files uploaded:" upload_output.txt
grep "Total bytes:" upload_output.txt
```
```bash
#!/bin/bash
# regression_test.sh

echo "Generating test files..."
python examples/generate_large_files.py <<EOF
2
EOF

echo "Testing upload..."
dv-upload data/ --recurse --verify

if [ $? -eq 0 ]; then
    echo "✓ Upload successful"

    echo "Testing duplicate detection..."
    dv-upload data/ --recurse --verify | grep "Files skipped: 5"

    if [ $? -eq 0 ]; then
        echo "✓ Duplicate detection works"
    else
        echo "✗ Duplicate detection failed"
        exit 1
    fi
else
    echo "✗ Upload failed"
    exit 1
fi

echo "Cleaning up..."
rm -rf data/

echo "✓ All tests passed!"
```
```python
# custom_dataset.py
from generate_large_files import LargeFileGenerator

# Create specialized dataset
generator = LargeFileGenerator(output_dir="custom_data")

# Generate time-series data
print("Generating time-series data...")
for month in range(1, 13):
    filename = f"sales_2024_{month:02d}.csv"
    generator.create_large_csv_file(
        filename,
        num_rows=30000 * month,   # More data each month
        num_columns=8
    )

# Generate metadata
print("\nGenerating metadata files...")
generator.create_large_json_file("metadata.json", num_records=12)

# Generate documentation
print("\nGenerating documentation...")
generator.create_large_text_file("README.txt", size_mb=1)

print("\n✓ Custom dataset generated!")
print("Upload with: dv-upload custom_data/ --recurse --verify")
```
```bash
# Generate multiple datasets in parallel
python generate_large_files.py &   # Select 1
DATASET1_PID=$!

python generate_large_files.py &   # Select 2
DATASET2_PID=$!

# Wait for completion
wait $DATASET1_PID
wait $DATASET2_PID

echo "All datasets generated!"
```

Always start with Option 0 (demo) to verify everything works:
```bash
python generate_large_files.py   # Select 0
dv-upload data/ --recurse --list-only
```

Remove generated files after testing:
```bash
rm -rf data/
```

Or selectively remove:
```bash
# Keep CSVs, remove others
rm data/*.json data/*.txt data/*.bin data/*.log
```

Add to `.gitignore`:
```
# Generated test files
data/
examples/data/
*.csv
*.json
*.bin
*.log
test_data/
custom_data/
```

Create a test plan:
```markdown
## Test Plan

### Scenario 1: Small Files
- Generate: Option 1
- Upload: `dv-upload data/ --recurse`
- Expected: 5 files uploaded, ~30 MB

### Scenario 2: Large Files
- Generate: Option 3
- Upload: `dv-upload data/ --recurse --verify`
- Expected: 5 files uploaded, ~1 GB, checksums verified
```

During generation:
```bash
# Monitor in another terminal
watch -n 1 'du -sh data/ && df -h .'
```

Create shell scripts for common scenarios:
```bash
#!/bin/bash
# test_upload.sh

echo "Generating files..."
python examples/generate_large_files.py <<EOF
1
EOF

echo "Uploading files..."
dv-upload data/ --recurse --verify

echo "Verifying duplicate detection..."
dv-upload data/ --recurse --verify

echo "Cleaning up..."
rm -rf data/

echo "✓ Test complete!"
```

The Large File Generator is a powerful tool for:
- ✅ Testing: Generate realistic test data quickly
- ✅ Benchmarking: Measure upload performance
- ✅ Development: Iterate on features with real data
- ✅ CI/CD: Automate testing in pipelines
- ✅ Documentation: Create examples and tutorials
- ✅ Debugging: Reproduce issues with specific file types/sizes
Quick Reference:
```bash
# Demo
python generate_large_files.py   # Select 0

# Small
python generate_large_files.py   # Select 1

# Medium
python generate_large_files.py   # Select 2

# Large
python generate_large_files.py   # Select 3

# Upload generated files
dv-upload data/ --recurse --verify
```

For more information, see:
- README.md - Main documentation
- ARCHITECTURE.md - Technical details
- examples/ - Usage examples
Questions or Issues?
- GitHub Issues: Report an issue
- Documentation: View docs
- Community: Dataverse Community