Summary
Design and implement a unified, versioned metadata schema for MSI data that follows FAIR principles (Findable, Accessible, Interoperable, Reusable). This schema will standardize metadata handling across all supported formats (Rapiflex, timsTOF, ImzML) and ensure complete metadata preservation during conversion.
Motivation
Current metadata handling has several gaps:
- Inconsistent fields across formats (e.g., instrument_type only in Rapiflex)
- Missing essential fields in Zarr output (coordinate_bounds, pixel_size, n_spectra, total_peaks)
- No versioning - schema changes could break downstream tools
- No standardization - each format uses different field names for similar concepts
FAIR Principles Alignment
| Principle | Implementation |
|---|---|
| Findable | Unique schema version, rich searchable metadata |
| Accessible | JSON in Zarr .zattrs, human-readable |
| Interoperable | Standardized field names, optional ontology references |
| Reusable | Clear provenance, processing history |
Proposed Schema (v1.0.0)
Schema Versioning
# Semantic versioning for the schema itself
# MAJOR: Breaking changes (renamed/removed required fields)
# MINOR: New optional fields added
# PATCH: Documentation/description changes only
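As a minimal sketch of how a reader might apply these rules, the check below treats a file as readable whenever its MAJOR version matches the version the reader was written for. The function names and the `SUPPORTED_SCHEMA_VERSION` constant are hypothetical and not part of the proposal.

```python
from typing import Tuple

# Hypothetical: the schema version this reader was written against.
SUPPORTED_SCHEMA_VERSION = "1.0.0"


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_readable(file_version: str, supported: str = SUPPORTED_SCHEMA_VERSION) -> bool:
    """Return True if a reader built for `supported` can read `file_version`.

    Only a MAJOR bump renames or removes required fields, so matching MAJOR
    versions is sufficient. MINOR bumps add optional fields the reader can
    ignore, and PATCH bumps change documentation only.
    """
    return parse_version(file_version)[0] == parse_version(supported)[0]
```

With these rules a 1.0.0 reader accepts a 1.3.2 file and rejects a 2.0.0 file.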
Complete Schema Structure
{
  "thyra_metadata_version": "1.0.0",
  "identity": {
    "source_format": "rapiflex | timstof | imzml",
    "source_path": "/path/to/original/data",
    "conversion_timestamp": "2025-12-15T12:00:00Z",
    "converter_version": "0.5.0"
  },
  "spatial": {
    "dimensions": [100, 100, 1],
    "pixel_size_um": [20.0, 20.0],
    "coordinate_system": "grid_0based",
    "n_spectra": 38744,
    "coordinate_bounds": [0.0, 99.0, 0.0, 99.0],
    "coordinate_offsets": [0, 0, 0]
  },
  "spectral": {
    "mass_range": [100.0, 2000.0],
    "spectrum_type": "centroid | profile",
    "n_mass_bins": 29100,
    "total_peaks": 209456000,
    "polarity": "positive | negative | unknown"
  },
  "instrument": {
    "type": "MALDI-TOF | timsTOF | Orbitrap | FTICR | unknown",
    "manufacturer": "Bruker | Thermo | Waters | unknown",
    "model": "rapifleX | timsTOF fleX | null",
    "serial_number": "ABC123 | null"
  },
  "acquisition": {
    "datetime": "2025-01-15T10:30:00Z",
    "method": "MALDI_imaging.par",
    "laser_power": 50.0,
    "laser_frequency_hz": 10000,
    "shots_per_pixel": 200,
    "raster_step_um": [20.0, 20.0]
  },
  "alignment": {
    "has_optical_registration": true,
    "teaching_points": [
      {"image": [100, 200], "stage": [1000.0, 2000.0]}
    ],
    "areas": [
      {"name": "Region1", "p1": [0, 0], "p2": [100, 100]}
    ],
    "optical_image_path": "optical_0000.tif"
  },
  "calibration": {
    "calibration_datetime": "2025-01-14T08:00:00Z",
    "calibration_mode": "external",
    "recalibrated": false
  },
  "processing": {
    "resampling": {
      "enabled": true,
      "method": "nearest_neighbor | tic_preserving",
      "axis_type": "reflector_tof | linear_tof | constant",
      "original_mass_range": [100.0, 2000.0],
      "target_bins": 29100
    },
    "normalization": null,
    "baseline_correction": null
  },
  "format_specific": {
    "// Preserved verbatim from original format": "for fields that don't fit above"
  }
}
Field Requirements by Category
| Category | Required | Notes |
|---|---|---|
| identity | YES | All fields required |
| spatial | YES | dimensions, pixel_size_um, n_spectra required |
| spectral | YES | mass_range, spectrum_type required |
| instrument | NO | Recommended, all fields optional |
| acquisition | NO | All fields optional |
| alignment | NO | Rapiflex-specific, all fields optional |
| calibration | NO | timsTOF-specific, all fields optional |
| processing | YES | Documents what transformations were applied |
| format_specific | NO | Catch-all for format-specific data |
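The requirements table could be checked mechanically. The sketch below is one possible reading of it, not a settled design: `REQUIRED_FIELDS` and `find_missing` are hypothetical names, "all fields required" for identity is taken to mean the four fields shown in the schema example, and processing is only required to be present as a section.

```python
from typing import Any, Dict, List

# Required fields per required category, transcribed from the table above.
# Assumption: identity's "all fields" means the four fields in the schema example.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "identity": ["source_format", "source_path", "conversion_timestamp", "converter_version"],
    "spatial": ["dimensions", "pixel_size_um", "n_spectra"],
    "spectral": ["mass_range", "spectrum_type"],
    "processing": [],  # the section must exist; no individual field is listed as required
}


def find_missing(metadata: Dict[str, Any]) -> List[str]:
    """Return dotted paths of required sections or fields absent from `metadata`."""
    missing: List[str] = []
    for section, fields in REQUIRED_FIELDS.items():
        if section not in metadata:
            missing.append(section)
            continue
        for name in fields:
            # A field set to None counts as missing.
            if metadata[section].get(name) is None:
                missing.append(f"{section}.{name}")
    return missing
```

An empty list means every required section and field is present. What to do with a non-empty list is the strictness question listed under Open Questions.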
Current vs Proposed Metadata Mapping
Rapiflex
| Current Field | Proposed Location |
|---|---|
| format: "Rapiflex" | identity.source_format |
| raster_x/y | spatial.pixel_size_um |
| teaching_points | alignment.teaching_points |
| areas | alignment.areas |
| shots_per_spot | acquisition.shots_per_pixel |
| laser_power | acquisition.laser_power |
| serial_number | instrument.serial_number |
timsTOF
| Current Field | Proposed Location |
|---|---|
| bruker_format | identity.source_format |
| BeamScanSizeX/Y | spatial.pixel_size_um |
| calibration | calibration.* |
| laser_power | acquisition.laser_power |
| laser_frequency | acquisition.laser_frequency_hz |
| instrument_name | instrument.model |
ImzML
| Current Field | Proposed Location |
|---|---|
| file_mode | format_specific.file_mode |
| spectrum_type (from cvParam) | spectral.spectrum_type |
| pixel size x/y | spatial.pixel_size_um |
| scan_direction | acquisition.scan_direction |
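To illustrate how such a mapping table might translate into code, here is a sketch for the Rapiflex rows. It assumes the current metadata is a flat dict whose keys are exactly `raster_x`, `raster_y` and the other names in the table, and that raster values are already in micrometres. `RAPIFLEX_RENAMES` and `map_rapiflex` are hypothetical names.

```python
from typing import Any, Dict, Tuple

# One-to-one renames from the Rapiflex table: legacy key -> (section, field).
RAPIFLEX_RENAMES: Dict[str, Tuple[str, str]] = {
    "teaching_points": ("alignment", "teaching_points"),
    "areas": ("alignment", "areas"),
    "shots_per_spot": ("acquisition", "shots_per_pixel"),
    "laser_power": ("acquisition", "laser_power"),
    "serial_number": ("instrument", "serial_number"),
}


def map_rapiflex(legacy: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Map a flat legacy Rapiflex metadata dict onto the unified sections."""
    unified: Dict[str, Dict[str, Any]] = {}
    remaining = dict(legacy)  # work on a copy so the caller's dict is untouched

    for old_key, (section, name) in RAPIFLEX_RENAMES.items():
        if old_key in remaining:
            unified.setdefault(section, {})[name] = remaining.pop(old_key)

    # raster_x and raster_y collapse into a single [x, y] pair.
    if "raster_x" in remaining and "raster_y" in remaining:
        unified.setdefault("spatial", {})["pixel_size_um"] = [
            float(remaining.pop("raster_x")),
            float(remaining.pop("raster_y")),
        ]

    # format: "Rapiflex" becomes identity.source_format, lower-cased to match the schema example.
    if "format" in remaining:
        unified.setdefault("identity", {})["source_format"] = str(remaining.pop("format")).lower()

    # Anything without a mapping is preserved verbatim under format_specific.
    if remaining:
        unified["format_specific"] = remaining
    return unified
```

Routing unmapped keys into format_specific means no metadata is dropped during conversion, even before every field has a home in the schema.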
Implementation Plan
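Phase 1: Quick Fix (Option A)
- Add the missing essential fields to the uns output
Phase 2: Schema Implementation
- Add a ThyraMetadataSchema dataclass in thyra/metadata/schema.py
Phase 3: Extractor Migration
- Migrate BrukerMetadataExtractor to populate unified schema
- Migrate ImzMLMetadataExtractor to populate unified schema
- Migrate RapiflexMetadataExtractor to populate unified schema
Phase 4: Converter Integration
- Update BaseSpatialDataConverter to write unified schema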
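For Phase 2, a dataclass-based schema might start out roughly as follows. This is only a sketch: the class name `ThyraMetadataSchema` comes from the plan, while the sub-dataclasses `Identity`, `Spatial` and `Spectral`, the choice to keep optional sections as plain dicts, and the `to_dict`/`to_json` helpers are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Identity:
    source_format: str
    source_path: str
    conversion_timestamp: str
    converter_version: str


@dataclass
class Spatial:
    dimensions: List[int]
    pixel_size_um: List[float]
    n_spectra: int
    coordinate_system: Optional[str] = None
    coordinate_bounds: Optional[List[float]] = None
    coordinate_offsets: Optional[List[int]] = None


@dataclass
class Spectral:
    mass_range: List[float]
    spectrum_type: str
    n_mass_bins: Optional[int] = None
    total_peaks: Optional[int] = None
    polarity: str = "unknown"


@dataclass
class ThyraMetadataSchema:
    # Required sections have no default, so omitting one fails at construction.
    identity: Identity
    spatial: Spatial
    spectral: Spectral
    processing: Dict[str, Any]
    # Optional sections are kept as plain dicts in this sketch.
    instrument: Optional[Dict[str, Any]] = None
    acquisition: Optional[Dict[str, Any]] = None
    alignment: Optional[Dict[str, Any]] = None
    calibration: Optional[Dict[str, Any]] = None
    format_specific: Dict[str, Any] = field(default_factory=dict)
    thyra_metadata_version: str = "1.0.0"

    def to_dict(self) -> Dict[str, Any]:
        """Convert the schema, including nested dataclasses, to plain dicts."""
        return asdict(self)

    def to_json(self) -> str:
        """Serialize to a JSON string suitable for storing as attributes."""
        return json.dumps(self.to_dict(), indent=2)
```

Giving required fields no default pushes part of the validation into object construction, so an extractor that forgets n_spectra fails immediately with a TypeError rather than writing incomplete metadata.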
Open Questions
- Ontology references - Should we include MS ontology accessions (e.g., MS:1000127 for centroid)?
- Units convention - Explicit ({"value": 20, "unit": "um"}) or by naming convention (pixel_size_um)?
- Strictness - Should conversion fail if required fields are missing, or warn and continue?
- Raw metadata - Keep full raw_metadata dump alongside unified schema?
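The strictness question does not have to be an either/or choice. As a sketch, a single switch could select between the two behaviours. `report_missing`, `MissingMetadataError` and the `strict` flag are hypothetical names, and the default of warning rather than failing is an assumption, not a decision.

```python
import warnings
from typing import List


class MissingMetadataError(ValueError):
    """Raised in strict mode when required metadata fields are absent."""


def report_missing(missing: List[str], strict: bool = False) -> None:
    """Fail or warn about missing required fields, depending on `strict`."""
    if not missing:
        return
    message = "Missing required metadata fields: " + ", ".join(missing)
    if strict:
        # Strict mode: abort the conversion.
        raise MissingMetadataError(message)
    # Lenient mode: surface the problem but let the conversion continue.
    warnings.warn(message, UserWarning, stacklevel=2)
```

A batch conversion could then run in lenient mode and collect warnings, while a validation step or CI job runs with `strict=True`.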
Related Issues
- Issue on total_peaks tracking
- Issue on the alignment section
Labels
enhancement
metadata
FAIR