Deep learning models for detecting ink on ancient Vesuvius scroll fragments.
This project implements state-of-the-art computer vision models to identify ink writing on 2,000-year-old papyrus scrolls from Herculaneum, buried by the eruption of Mount Vesuvius in 79 AD. The models process 3D volumetric data from CT scans, treating stacked depth layers as temporal sequences to perform semantic segmentation of ink locations.
- Multiple Model Architectures: SWIN Transformer, VideoMAE, TimeSformer, 3D ResNet
- Self-Supervised Pretraining: VideoMAE for learning from unlabeled scroll data
- Experiment Tracking: Full integration with Weights & Biases
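As a sketch of the volume-as-video idea described above, the snippet below stacks consecutive depth layers into the (batch, channel, depth, height, width) tensor that 3D CNNs and video transformers consume. The shapes and random data are illustrative assumptions, not the project's actual loader.

```python
import numpy as np

# Hypothetical example: 16 consecutive CT depth layers of a 224x224 tile.
# In the real pipeline these would be read from layers/22.tif, 23.tif, ...
layers = [np.random.rand(224, 224).astype(np.float32) for _ in range(16)]

# Stack the depth layers like frames of a short "video":
# (depth, height, width) -> add batch and channel dims -> (N, C, D, H, W),
# the input shape 3D CNNs and video transformers expect.
volume = np.stack(layers, axis=0)             # (16, 224, 224)
video_batch = volume[np.newaxis, np.newaxis]  # (1, 1, 16, 224, 224)

print(video_batch.shape)
```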
vesuvius_ink_detection/
├── models/ # Model architectures
│ ├── swin.py # SWIN Transformer (primary model)
│ ├── vmae.py # VideoMAE
│ ├── timesformer_hug.py # TimeSformer (HuggingFace)
│ ├── resnetall.py # 3D ResNet variants
│ ├── i3dallnl.py # I3D with non-local blocks
│ └── unetr.py # UNETR segmentation
│
├── pretraining/ # Self-supervised pretraining
│ ├── mae.py # VideoMAE pretraining
│ ├── mae_swin.py # MAE for SWIN
│ ├── prepare_data.py # Tile extraction for pretraining
│ └── download.sh # Download pretraining segments
│
├── train_scripts/ # Training utilities
│ ├── vmae_train.py # VideoMAE training wrapper
│ └── utils.py # Helper functions
│
├── Training Scripts (root):
│ ├── swin_train.py # SWIN Transformer training
│ ├── timesformer_hug_train.py # TimeSformer training
│ ├── train_resnet3d.py # 3D ResNet training
│ ├── z_cv.py # Cross-validation experiments
│ └── utils.py # Shared utilities
│
├── train_scrolls/ # Training data (per-fragment)
│ ├── frag5/ # Fragment 5 (primary)
│ │ ├── layers/ # CT scan layers (22.tif, 23.tif, ...)
│ │ ├── frag5_inklabels.png # Ground truth annotations
│ │ └── frag5_mask.png # Fragment boundary mask
│ └── [other fragments]
│
├── checkpoints/ # Saved model weights
├── outputs/ # Predictions and results
├── notebooks/ # Exploratory analysis
- Python 3.8+
- CUDA-capable GPU (recommended: 16GB+ VRAM)
- PyTorch with CUDA support
```bash
# Clone the repository
git clone <repository-url>
cd vesuvius_ink_detection

# Install dependencies
pip install -r requirements.txt
```

The project includes an automated download script (download.sh) to fetch Vesuvius Challenge data from the official repository.
```bash
# Make the script executable
chmod +x download.sh

# Run the download script
./download.sh
```

The script downloads two types of data:
1. Fragment Data (smaller pieces with known ink labels):
   - Fragment 1 (`Frag1`): PHercParis2Fr47, scanned at 54 keV with 3.24 µm resolution
   - Fragment 5 (`Frag5`): PHerc1667Cr1Fr3, scanned at 70 keV with 3.24 µm resolution
2. Full Scroll Data (larger intact scrolls):
   - Scroll 4 (`20231210132040`): PHerc1667 segment from the full scroll
For each dataset, the script downloads:
- Layer files: CT scan slices (layers 15-45) in TIF/PNG format
- Auxiliary files:
  - `*_mask.png` - Fragment boundary masks
  - `*_inklabels.png` - Ground truth ink annotations
The download script uses two main functions:
`download_layers`: Downloads a range of numbered layer files
- Tries multiple file extensions (tif, png, jpg) until it finds the correct format
- Checks file existence before downloading to avoid errors
- Downloads layers 15-45 by default (configurable)

`download_aux_files`: Downloads mask and inklabels files
- Searches directory listings for files ending in `mask` or `inklabels`
- Handles unknown filename prefixes automatically
- Renames files to a standardized format: `{fragment_id}_{suffix}.{ext}`
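The extension-probing behavior of download_layers can be sketched in Python; `candidate_urls` is a hypothetical helper for illustration (the real script is bash and additionally checks that each URL exists before downloading).

```python
# Hypothetical helper mirroring download_layers' strategy: for each layer
# number, build candidate URLs for every extension, to be tried in order
# until one exists on the server (the existence check is omitted here).
def candidate_urls(base_url, layer, extensions=("tif", "png", "jpg")):
    # Layer files are plain numbers, e.g. layers/15.tif
    return [f"{base_url}/{layer}.{ext}" for ext in extensions]

urls = candidate_urls("https://example.org/Frag5/layers", 15)
print(urls[0])  # https://example.org/Frag5/layers/15.tif
```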
Downloaded data is organized in train_scrolls/:
train_scrolls/
├── Frag1/
│ ├── layers/
│ │ ├── 15.tif
│ │ ├── 16.tif
│ │ └── ... (through 45.tif)
│ ├── Frag1_mask.png
│ └── Frag1_inklabels.png
├── Frag5/
│ ├── layers/
│ │ ├── 15.tif
│ │ └── ...
│ ├── Frag5_mask.png
│ └── Frag5_inklabels.png
└── 20231210132040/
├── layers/
│ ├── 15.tif
│ └── ...
├── 20231210132040_mask.png
└── 20231210132040_inklabels.png
To download additional fragments, edit download.sh:
```bash
# Add new fragment
fragments=("Frag1" "Frag2" "Frag3")  # Add to the array

# Change layer range
download_layers "$layers_url" "$out_dir" 10 50 extensions1[@]  # Layers 10-50

# Change file extensions to try
extensions=(tif png jpg jpeg)
```

The script uses default public credentials for the Vesuvius Challenge data repository:
- Username: ...
- Password: ...

These are publicly available credentials for accessing the competition data.
Each fragment directory follows this structure:
fragment_id/
├── layers/ # Volumetric CT scan data
│ ├── 15.tif # Individual depth layers
│ ├── 16.tif
│ └── ... (15-45 or more layers)
├── {fragment_id}_inklabels.png # Ground truth ink labels (binary mask)
└── {fragment_id}_mask.png # Fragment boundary mask
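A minimal sketch of reading this layout, using Pillow and a synthetic 8×8 fragment; `load_volume` and the toy sizes are illustrative assumptions, not the project's loader (real fragments use large TIF layers).

```python
import os
import tempfile

import numpy as np
from PIL import Image

# Build a tiny synthetic fragment directory in the layout described above.
root = tempfile.mkdtemp()
frag = os.path.join(root, "FragX")
os.makedirs(os.path.join(frag, "layers"))
for i in range(15, 19):  # real fragments use layers 15-45
    Image.fromarray(np.zeros((8, 8), dtype=np.uint8)).save(
        os.path.join(frag, "layers", f"{i}.png"))
Image.fromarray(np.full((8, 8), 255, dtype=np.uint8)).save(
    os.path.join(frag, "FragX_mask.png"))

def load_volume(frag_dir, start, end, ext="png"):
    """Stack layers [start, end) into a (depth, height, width) array."""
    slices = [
        np.asarray(Image.open(os.path.join(frag_dir, "layers", f"{i}.{ext}")))
        for i in range(start, end)
    ]
    return np.stack(slices, axis=0)

vol = load_volume(frag, 15, 19)
print(vol.shape)  # (4, 8, 8)
```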
The project includes a unified training script (train.py) with command-line argument support for easy experimentation. Use the provided run.sh script to launch training:
```bash
# Make the script executable
chmod +x run.sh

# Run training with default configuration
./run.sh
```

The run.sh script trains a SWIN Transformer model with the following default configuration:
```bash
python train.py \
    --model swin \
    --segment_path ./train_scrolls/ \
    --segments Frag5 s4 \
    --valid_id Frag5 \
    --start_idx 24 \
    --in_chans 16 \
    --valid_chans 16 \
    --size 224 \
    --tile_size 224 \
    --stride_divisor 8 \
    --train_batch_size 2 \
    --valid_batch_size 2 \
    --lr 5e-5 \
    --epochs 40 \
    --scheduler cosine \
    --weight_decay 1e-6 \
    --warmup_factor 10 \
    --norm true \
    --aug fourth \
    --num_workers 8 \
    --seed 0 \
    --max_grad_norm 1.0 \
    --comp_name vesuvius \
    --wandb_project vesuvius \
    --save_top_k -1 \
    --devices -1 \
    --strategy ddp_find_unused_parameters_true
```

You can customize training by modifying run.sh or calling train.py directly:
```bash
# Example: Train VideoMAE on different fragments
python train.py \
    --model vmae \
    --segments Frag1 Frag5 \
    --valid_id Frag1 \
    --in_chans 24 \
    --size 64 \
    --epochs 50 \
    --lr 1e-4

# Example: Train with higher resolution
python train.py \
    --model swin \
    --size 448 \
    --tile_size 448 \
    --train_batch_size 1
```

Key command-line arguments for train.py:
- Model: `--model` (choices: swin, vmae, timesformer_hug, resnet)
- Data: `--segment_path` (path to training scrolls), `--segments` (training fragments), `--valid_id` (validation fragment)
- Input: `--start_idx` (first layer), `--in_chans` (number of channels), `--valid_chans` (validation channels)
- Resolution: `--size` (input size), `--tile_size` (tile size), `--stride_divisor` (stride calculation)
- Training: `--train_batch_size`, `--valid_batch_size`, `--lr`, `--min_lr`, `--epochs`, `--scheduler`, `--weight_decay`, `--warmup_factor`
- Augmentation: `--aug` (choices: none, shift, fourth, None), `--norm` (apply normalization)
- Distributed: `--devices` (GPU count), `--strategy` (DDP strategy), `--precision` (training precision)
- Output: `--comp_name` (competition name), `--wandb_project` (W&B project name)
- Checkpoint: `--checkpoint_path` (resume from checkpoint), `--save_top_k` (save top k models)
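One plausible reading of the resolution flags: tiles of `--tile_size` are slid over a fragment with a stride of `tile_size // stride_divisor`. The helper below is an assumption for illustration; the exact tiling logic lives in train.py.

```python
# Sketch of how --tile_size and --stride_divisor plausibly interact: cover
# the fragment with overlapping tiles spaced tile_size // stride_divisor
# pixels apart. Illustrative only; train.py may differ in detail.
def tile_origins(height, width, tile_size=224, stride_divisor=8):
    stride = tile_size // stride_divisor  # 224 // 8 = 28 px between tiles
    ys = range(0, max(height - tile_size, 0) + 1, stride)
    xs = range(0, max(width - tile_size, 0) + 1, stride)
    return [(y, x) for y in ys for x in xs]

origins = tile_origins(448, 448)
print(len(origins))  # 9 x 9 = 81 overlapping tiles
```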
Individual training scripts are still available in train_scripts/:
```bash
# SWIN Transformer (legacy)
python train_scripts/swin_train.py

# TimeSformer (legacy)
python train_scripts/timesformer_hug_train.py

# 3D ResNet (legacy)
python train_scripts/train_resnet3d.py

# VideoMAE training
python train_scripts/vmae_train.py
```
1. SWIN Transformer (models/swin.py)
Primary model - Shifted Window Vision Transformer adapted for volumetric data.
- Input: 224×224 spatial, 16-24 depth channels
- Output: Binary segmentation mask (ink vs. no-ink)
- Features:
- Hierarchical shifted window attention
- Variable input channels (8-54)
- Combined loss: DiceLoss + SoftBCEWithLogitsLoss
- Training: swin_train.py
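The combined DiceLoss + SoftBCEWithLogitsLoss objective can be illustrated with a small NumPy sketch; the project uses library implementations, and the equal weighting here is an assumption.

```python
import numpy as np

# Illustrative NumPy version of the combined segmentation objective
# (Dice loss + binary cross-entropy on raw logits). Not the library code.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dice_loss(logits, targets, eps=1.0):
    # Dice measures region overlap; eps smooths the empty-mask case.
    probs = sigmoid(logits)
    inter = (probs * targets).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + targets.sum() + eps)

def bce_with_logits(logits, targets):
    # Numerically stable binary cross-entropy computed from logits.
    return np.mean(np.maximum(logits, 0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))

def combined_loss(logits, targets, w_dice=0.5, w_bce=0.5):
    # Assumed equal weighting of the two terms.
    return (w_dice * dice_loss(logits, targets)
            + w_bce * bce_with_logits(logits, targets))

logits = np.array([[5.0, -5.0], [-5.0, 5.0]])   # confident predictions
targets = np.array([[1.0, 0.0], [0.0, 1.0]])    # binary ink labels
loss = combined_loss(logits, targets)
```

Dice handles the heavy class imbalance (most pixels are non-ink), while BCE gives smooth per-pixel gradients.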
2. VideoMAE (models/vmae.py)
Video Masked Autoencoder for self-supervised pretraining.
- Input: 64×64 or 224×224, 16-24 frames
- Pretraining: 75-90% mask ratio, pixel reconstruction
- Fine-tuning: Linear classifier head
- Training: pretraining/mae.py
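The high-ratio masking can be sketched as follows; the token count and seed are toy values, and the real VideoMAE implementation masks spacetime tubes rather than a flat token list.

```python
import numpy as np

# Sketch of VideoMAE-style random masking: hide a fixed fraction of the
# patch tokens; only the masked tokens are reconstructed during pretraining.
def random_mask(num_tokens, mask_ratio=0.9, seed=0):
    rng = np.random.default_rng(seed)
    num_masked = int(num_tokens * mask_ratio)
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    return mask  # True = masked (to be reconstructed)

mask = random_mask(196, mask_ratio=0.75)
print(mask.sum())  # 147 of 196 tokens masked
```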
3. TimeSformer (models/timesformer_hug.py)
Transformer designed for video/temporal understanding.
- Variants: HuggingFace and Facebook implementations
- Features: Divided space-time attention
- Training: timesformer_hug_train.py
4. 3D ResNet (models/resnetall.py)
ResNet extended to 3D convolutions.
- Depths: 10, 18, 34, 50, 101, 152, 200 layers
- Pretrained: r3d101_KM_200ep.pth (Kinetics-400)
- Training: train_resnet3d.py
Pretraining on unlabeled scroll data improves downstream performance.
```bash
cd pretraining
python mae.py
```

- Method: Mask 75-90% of patches, reconstruct pixel values
- Configuration: 16-channel input, 16-24 frames
- Loss: L1 pixel-reconstruction loss
- Checkpoints: `videomae_epoch=063_val_loss=0.3684.ckpt`
All training scripts use PyTorch Lightning with:
- Distributed Training: DDP (multi-GPU)
- Mixed Precision: FP16 for memory efficiency
- Gradient Clipping: Max norm 1.0
- Learning Rate Scheduling: Cosine with warmup
- Checkpointing: Automatic saves with encoded metadata
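As a hedged sketch, the bullets above map roughly onto a `pytorch_lightning.Trainer` like the one below; the actual Trainer construction lives in train.py and the values simply mirror the defaults shown earlier.

```python
# Sketch only: how the training setup above maps onto a Lightning Trainer.
import pytorch_lightning as pl

trainer = pl.Trainer(
    devices=-1,                                  # all visible GPUs
    strategy="ddp_find_unused_parameters_true",  # DDP multi-GPU training
    precision="16-mixed",                        # FP16 mixed precision
    gradient_clip_val=1.0,                       # max grad norm 1.0
    max_epochs=40,                               # cosine schedule over epochs
)
```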
Experiment tracking and hyperparameter logging:
```python
wandb.init(project='vesuvius', name='experiment_name')
```

View runs at wandb.ai (requires login).
Checkpoints encode complete hyperparameter information:
`{MODEL}_{FRAGMENTS}_valid={VALID_ID}_size={SIZE}_lr={LR}_in_chans={CHANS}_norm={NORM}_fourth={AUG}_epoch={EPOCH}.ckpt`

Example:
`SWIN_['frag5','s4']_valid=frag5_size=224_lr=2e-05_in_chans=16_norm=True_epoch=7.ckpt`
This enables:
- Easy checkpoint identification
- Reproducible experiment tracking
- Automated checkpoint selection
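For example, a small parser can recover these fields from a checkpoint name; `parse_ckpt_name` is a hypothetical helper, not part of the codebase.

```python
import re

# Hypothetical parser for the checkpoint naming scheme above: pull the
# key=value pairs out of the filename so runs can be selected automatically.
def parse_ckpt_name(name):
    stem = name[:-5] if name.endswith(".ckpt") else name
    # Keys start with a letter; values run until the next underscore.
    fields = dict(re.findall(r"([A-Za-z]\w*)=([^_]+)", stem))
    model = name.split("_", 1)[0]
    return model, fields

model, fields = parse_ckpt_name(
    "SWIN_['frag5','s4']_valid=frag5_size=224_lr=2e-05"
    "_in_chans=16_norm=True_epoch=7.ckpt")
print(model, fields["valid"], fields["epoch"])  # SWIN frag5 7
```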
Train on multiple fragments simultaneously:

```python
segments = ['frag5', 's4', 'rect5']  # Training fragments
valid_id = 'frag5'                   # Held out for validation
```

Experiment with different depth ranges:

```python
start_idx = 22    # First layer to use
in_chans = 18     # Total channels (layers 22-39 inclusive)
valid_chans = 16  # Subset for validation (center crop)
```

Different fragments may require different scaling:

```python
frags_ratio1 = ['frag', 're']  # Scale by ratio1
frags_ratio2 = ['s4', '202']   # Scale by ratio2
ratio1 = 2  # Divide by 2
ratio2 = 1  # No scaling
```

Outputs are written to:

outputs/models/
└── SWIN_['frag5','s4']_valid=frag5_size=224_lr=2e-05_in_chans=16_norm=True_epoch=7.ckpt

wandb/run/files/media/images
└── mask.png
Core Functions (utils.py)
```python
# Load volumetric data with mask
read_image_mask(fragment_id, s=22, e=38, rotate=0)

# Split data by fragment
get_train_valid_dataset(segments, valid_id)

# Initialize configuration
cfg_init(CFG, mode='train')
```

If training runs out of GPU memory:

```python
# Reduce batch size
train_batch_size = 3  # Instead of 5

# Reduce spatial resolution
size = 64  # Instead of 224

# Enable gradient checkpointing (in model code)
```

Increase the PIL image-size limit (already done in the training scripts):

```python
import PIL.Image
PIL.Image.MAX_IMAGE_PIXELS = 933120000
```

Log in to Weights & Biases:

```bash
wandb login
```

Or disable W&B logging:

```python
wandb.init(mode='disabled')
```

[Specify license]
This repository is based on and adapted from the villa ink-detection repository, the First Place Grand Prize submission to the Vesuvius Challenge 2023 by Youssef Nader, Luke Farritor, and Julian Schilliger.
- Vesuvius Challenge organizers.
- Youssef Nader, Luke Farritor, and Julian Schilliger for their groundbreaking work.
- AWS resources were provided by the National Infrastructures for Research and Technology GRNET and funded by the EU Recovery and Resiliency Facility.
For questions or issues, please open a GitHub issue or contact voulgarakisdion@gmail.com.
Project Status: Active Development
Last Updated: 2025
Contributors: Voulgarakis Dionysios, Pavlopoulos John