Convert Transkribus ZIP files to HuggingFace datasets with ease.
transkribus-hf is a Python package that converts Transkribus export ZIP files into HuggingFace datasets. It supports multiple export formats and can automatically upload datasets to the HuggingFace Hub.
- Multiple Export Modes: Convert your Transkribus data to different dataset formats
- Automatic Upload: Direct integration with HuggingFace Hub
- Region & Line Extraction: Extract individual text regions and lines as separate images
- Windowed Extraction: Create sliding windows of multiple lines for data augmentation
- Preserves Metadata: Maintains reading order, region types, and other important metadata
- Command Line Interface: Easy-to-use CLI for batch processing
pip install transkribus-hf
Or install from source:
git clone https://github.com/wjbmattingly/transkribus-hf.git
cd transkribus-hf
pip install -e .
The raw_xml mode exports the original image with the complete PAGE XML content.
Fields:
- image: Original page image
- xml: Complete PAGE XML content
- filename: Original image filename
- project: Project name
The text mode exports the image with the concatenated text from all regions.
Fields:
- image: Original page image
- text: Full text content (all regions combined)
- filename: Original image filename
- project: Project name
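As a rough illustration of what "concatenated text from all regions" can look like, the sketch below reads a PAGE XML string with the standard library and joins the line text region by region. It is not the package's own code: the function name `page_text` is hypothetical, and the sketch assumes lines are taken in document order rather than from reading-order metadata.

```python
import xml.etree.ElementTree as ET

def _local(tag):
    """Strip the XML namespace so the sketch works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]

def page_text(xml_string):
    """Join the text of every TextLine, region by region, in document order."""
    root = ET.fromstring(xml_string)
    regions = []
    for region in root.iter():
        if _local(region.tag) != "TextRegion":
            continue
        lines = []
        for line in region:
            if _local(line.tag) != "TextLine":
                continue
            # Only the line-level TextEquiv is read, not word-level ones.
            for equiv in line:
                if _local(equiv.tag) != "TextEquiv":
                    continue
                for uni in equiv:
                    if _local(uni.tag) == "Unicode" and uni.text:
                        lines.append(uni.text)
        if lines:
            regions.append("\n".join(lines))
    return "\n".join(regions)

SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="image1.jpg">
    <TextRegion id="r1">
      <TextLine id="r1l1"><TextEquiv><Unicode>First line</Unicode></TextEquiv></TextLine>
      <TextLine id="r1l2"><TextEquiv><Unicode>Second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
    <TextRegion id="r2">
      <TextLine id="r2l1"><TextEquiv><Unicode>Third line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

print(page_text(SAMPLE))
```

The package itself may order regions differently or use a different separator, so treat this only as a way to inspect an export by hand.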
The region mode exports each text region as a separate cropped image.
Fields:
- image: Cropped region image
- text: Region text content
- region_type: Type of region (e.g., "paragraph")
- region_id: Unique region identifier
- reading_order: Reading order of the region
- filename: Original image filename
- project: Project name
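PAGE XML describes each region and line with a `Coords` element whose `points` attribute is a space-separated list of `x,y` pairs. A minimal sketch of turning that polygon into a crop box follows; the helper name `bounding_box` is hypothetical, and the package may crop differently (for example by masking the polygon rather than using its bounding rectangle).

```python
def bounding_box(points):
    """Return (left, top, right, bottom) for a PAGE XML `points` string."""
    xs, ys = [], []
    for pair in points.split():
        x, y = pair.split(",")
        # Coordinates are normally integers; float() tolerates "12.0" too.
        xs.append(int(float(x)))
        ys.append(int(float(y)))
    if not xs:
        raise ValueError("empty points string")
    return min(xs), min(ys), max(xs), max(ys)

box = bounding_box("100,200 400,200 400,350 100,350")
print(box)  # (100, 200, 400, 350)
```

The resulting tuple has the (left, upper, right, lower) layout that Pillow's `Image.crop` accepts, so `page_image.crop(box)` would yield the region image.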
The line mode exports each text line as a separate cropped image.
Fields:
- image: Cropped line image
- text: Line text content
- line_id: Unique line identifier
- line_reading_order: Reading order within the region
- region_id: Parent region identifier
- region_reading_order: Reading order of parent region
- region_type: Type of parent region
- filename: Original image filename
- project: Project name
The window mode exports sliding windows of multiple text lines, which suits data augmentation and multi-line text recognition training.
Configuration:
- window_size: Number of lines per window (1, 2, 3, 4, etc.)
- overlap: Number of lines shared between consecutive windows (0 = no overlap)
Fields:
- image: Cropped window image (bounding box of all lines in window)
- text: Combined text from all lines in window (newline separated)
- window_size: Actual number of lines in this window
- window_index: Index of this window within the region
- line_ids: Comma-separated list of line IDs in this window
- line_reading_orders: Comma-separated list of line reading orders
- region_id: Parent region identifier
- region_reading_order: Reading order of parent region
- region_type: Type of parent region
- filename: Original image filename
- project: Project name
Examples:
- window_size=1, overlap=0: Same as line mode
- window_size=2, overlap=0: Non-overlapping pairs of lines
- window_size=3, overlap=1: 3-line windows with 1-line overlap (lines 1-3, 3-5, 5-7, etc.)
- window_size=4, overlap=2: 4-line windows with 2-line overlap (lines 1-4, 3-6, 5-8, etc.)
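The windowing arithmetic can be sketched as follows. This is an assumption about the behaviour, not the package's implementation: it takes the step between window starts to be `window_size - overlap` and lets the last window in a region be shorter than `window_size`, which would explain why the `window_size` field records the actual number of lines. The function name `window_ranges` is hypothetical.

```python
def window_ranges(n_lines, window_size, overlap=0):
    """Return (start, end) index pairs, end exclusive, for sliding windows."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be between 0 and window_size - 1")
    step = window_size - overlap  # how far each window start advances
    ranges = []
    start = 0
    while start < n_lines:
        end = min(start + window_size, n_lines)
        ranges.append((start, end))
        if end == n_lines:
            break  # the region's last line is covered
        start += step
    return ranges

# Eight lines, 4-line windows, 2-line overlap (0-based, end exclusive).
print(window_ranges(8, 4, 2))  # [(0, 4), (2, 6), (4, 8)]
```

In 1-based terms the printed result corresponds to lines 1-4, 3-6, and 5-8. The length of the returned list also gives a quick estimate of how many examples a region would produce.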
# Basic usage - convert and upload to HuggingFace Hub
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name
# Specify export mode
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --mode region
# Window mode with 3 lines per window, 1 line overlap
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --mode window --window-size 3 --overlap 1
# Convert to local directory only
transkribus-hf path/to/your/transkribus.zip --local-only --output-dir ./my_dataset
# View statistics only (including window estimates)
transkribus-hf path/to/your/transkribus.zip --stats-only --mode window --window-size 2
# Create private repository
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --private
# Use custom HuggingFace token
transkribus-hf path/to/your/transkribus.zip --repo-id username/dataset-name --token your_token_here
from transkribus_hf import TranskribusConverter
# Initialize converter
converter = TranskribusConverter("path/to/your/transkribus.zip")
# Get statistics
stats = converter.get_stats()
print(f"Total pages: {stats['total_pages']}")
print(f"Total regions: {stats['total_regions']}")
print(f"Total lines: {stats['total_lines']}")
# Convert to dataset (text mode)
dataset = converter.convert(export_mode='text')
print(f"Created dataset with {len(dataset)} examples")
# Convert to different modes
region_dataset = converter.convert(export_mode='region')
line_dataset = converter.convert(export_mode='line')
xml_dataset = converter.convert(export_mode='raw_xml')
# NEW: Window mode with different configurations
window_2_dataset = converter.convert(export_mode='window', window_size=2, overlap=0)
window_3_overlap_dataset = converter.convert(export_mode='window', window_size=3, overlap=1)
window_4_dataset = converter.convert(export_mode='window', window_size=4, overlap=2)
print(f"2-line windows: {len(window_2_dataset)} examples")
print(f"3-line windows (1 overlap): {len(window_3_overlap_dataset)} examples")
print(f"4-line windows (2 overlap): {len(window_4_dataset)} examples")
# Upload to HuggingFace Hub
repo_url = converter.upload_to_hub(
    dataset=window_3_overlap_dataset,
    repo_id="wjbmattingly/my-transkribus-windows",
    private=False
)
print(f"Dataset uploaded: {repo_url}")
# Convert and upload in one step
repo_url = converter.convert_and_upload(
    repo_id="wjbmattingly/my-transkribus-dataset",
    export_mode="window",
    window_size=2,
    overlap=1,
    private=False
)
The package expects Transkribus ZIP files with the following structure:
transkribus_export.zip
├── project1/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── page/
│       ├── image1.xml
│       └── image2.xml
├── project2/
│   ├── image3.jpg
│   └── page/
│       └── image3.xml
└── ...
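Before converting, it can help to check that an archive follows this layout. The sketch below uses only the standard library to pair each image with the XML file of the same name in the sibling page/ folder. The helper `pair_pages` and the list of image suffixes are assumptions made for illustration, not part of the package.

```python
import io
import zipfile
from pathlib import PurePosixPath

# Assumed set of image extensions; adjust to match your export.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}

def pair_pages(zf):
    """Map each image path in the archive to its PAGE XML path, if present."""
    names = set(zf.namelist())
    pairs = {}
    for name in sorted(names):
        path = PurePosixPath(name)
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        xml_path = path.parent / "page" / (path.stem + ".xml")
        if str(xml_path) in names:
            pairs[name] = str(xml_path)
    return pairs

# Build a tiny in-memory archive that mirrors the layout above.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for member in ("project1/image1.jpg", "project1/page/image1.xml",
                   "project2/image3.jpg", "project2/page/image3.xml"):
        zf.writestr(member, b"")

buffer.seek(0)
with zipfile.ZipFile(buffer) as zf:
    pairs = pair_pages(zf)
    for image, xml in pairs.items():
        print(image, "->", xml)
```

With a real export, replace the in-memory buffer with `zipfile.ZipFile("path/to/your/transkribus.zip")`. Images missing from the result have no matching XML file in their page/ folder.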
The window mode is particularly useful for:
- Data Augmentation: Generate more training examples from existing data
- Multi-line Text Recognition: Train models to recognize multiple lines at once
- Reading Order Training: Train models to understand line sequences
- Flexible Context: Adjust context size (1-4+ lines) based on your needs
- Overlapping Context: Create overlapping examples for better generalization
To upload datasets to HuggingFace Hub, you need to authenticate:
- Set the environment variable: export HF_TOKEN=your_token_here
- Or pass the token directly: --token your_token_here
- Or use huggingface-cli login
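When scripting uploads, a small helper can make the token source explicit. The sketch below is hypothetical (`resolve_token` is not part of the package) and the precedence it shows, an explicit token first and then the HF_TOKEN environment variable, is an assumption. Returning None is meant to signal that the credentials stored by huggingface-cli login should be used instead.

```python
import os

def resolve_token(explicit_token=None, environ=None):
    """Pick a token: explicit argument first, then HF_TOKEN, else None."""
    env = os.environ if environ is None else environ
    # Empty strings count as "not set" so they fall through to the next source.
    return explicit_token or env.get("HF_TOKEN") or None

print(resolve_token("abc", {"HF_TOKEN": "xyz"}))  # abc
print(resolve_token(None, {"HF_TOKEN": "xyz"}))   # xyz
print(resolve_token(None, {}))                    # None
```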
Requirements:
- Python ≥ 3.8
- datasets ≥ 2.0.0
- huggingface_hub ≥ 0.15.0
- Pillow ≥ 9.0.0
- lxml ≥ 4.6.0
- numpy ≥ 1.21.0
- tqdm ≥ 4.62.0
MIT License - see LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.