RFC: Unified MSI Metadata Schema (FAIR-compliant) #67

@Tomatokeftes

Summary

Design and implement a unified, versioned metadata schema for MSI data that follows FAIR principles (Findable, Accessible, Interoperable, Reusable). This schema will standardize metadata handling across all supported formats (Rapiflex, timsTOF, ImzML) and ensure complete metadata preservation during conversion.

Motivation

Current metadata handling has several gaps:

  • Inconsistent fields across formats (e.g., instrument_type only in Rapiflex)
  • Missing essential fields in Zarr output (coordinate_bounds, pixel_size, n_spectra, total_peaks)
  • No versioning - schema changes could break downstream tools
  • No standardization - each format uses different field names for similar concepts

FAIR Principles Alignment

| Principle | Implementation |
|---|---|
| Findable | Unique schema version, rich searchable metadata |
| Accessible | JSON in Zarr .zattrs, human-readable |
| Interoperable | Standardized field names, optional ontology references |
| Reusable | Clear provenance, processing history |

Proposed Schema (v1.0.0)

Schema Versioning

# Semantic versioning for the schema itself
# MAJOR: Breaking changes (renamed/removed required fields)
# MINOR: New optional fields added
# PATCH: Documentation/description changes only
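
As a rough illustration of how a reader could act on these rules, here is a minimal sketch of a compatibility check. The function names and the policy itself (same MAJOR means readable, newer MINOR means there may be optional fields the reader does not know) are assumptions for discussion, not existing thyra code.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split a 'MAJOR.MINOR.PATCH' string into three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_readable(file_version: str, reader_version: str) -> bool:
    """Assumed policy: a reader can load any file with the same MAJOR version."""
    return parse_version(file_version)[0] == parse_version(reader_version)[0]


def may_contain_unknown_fields(file_version: str, reader_version: str) -> bool:
    """True if the file was written with a newer MINOR version of the same MAJOR.

    Such a file may carry optional fields the reader has never seen.
    """
    file_major, file_minor, _ = parse_version(file_version)
    reader_major, reader_minor, _ = parse_version(reader_version)
    return file_major == reader_major and file_minor > reader_minor
```

Under this policy a tool built against 1.0.0 would still open a 1.2.0 file and simply ignore the fields it does not recognize.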

Complete Schema Structure

{
  "thyra_metadata_version": "1.0.0",
  
  "identity": {
    "source_format": "rapiflex | timstof | imzml",
    "source_path": "/path/to/original/data",
    "conversion_timestamp": "2025-12-15T12:00:00Z",
    "converter_version": "0.5.0"
  },
  
  "spatial": {
    "dimensions": [100, 100, 1],
    "pixel_size_um": [20.0, 20.0],
    "coordinate_system": "grid_0based",
    "n_spectra": 10000,
    "coordinate_bounds": [0.0, 99.0, 0.0, 99.0],
    "coordinate_offsets": [0, 0, 0]
  },
  
  "spectral": {
    "mass_range": [100.0, 2000.0],
    "spectrum_type": "centroid | profile",
    "n_mass_bins": 29100,
    "total_peaks": 209456000,
    "polarity": "positive | negative | unknown"
  },
  
  "instrument": {
    "type": "MALDI-TOF | timsTOF | Orbitrap | FTICR | unknown",
    "manufacturer": "Bruker | Thermo | Waters | unknown",
    "model": "rapifleX | timsTOF fleX | null",
    "serial_number": "ABC123 | null"
  },
  
  "acquisition": {
    "datetime": "2025-01-15T10:30:00Z",
    "method": "MALDI_imaging.par",
    "laser_power": 50.0,
    "laser_frequency_hz": 10000,
    "shots_per_pixel": 200,
    "raster_step_um": [20.0, 20.0]
  },
  
  "alignment": {
    "has_optical_registration": true,
    "teaching_points": [
      {"image": [100, 200], "stage": [1000.0, 2000.0]}
    ],
    "areas": [
      {"name": "Region1", "p1": [0, 0], "p2": [100, 100]}
    ],
    "optical_image_path": "optical_0000.tif"
  },
  
  "calibration": {
    "calibration_datetime": "2025-01-14T08:00:00Z",
    "calibration_mode": "external",
    "recalibrated": false
  },
  
  "processing": {
    "resampling": {
      "enabled": true,
      "method": "nearest_neighbor | tic_preserving",
      "axis_type": "reflector_tof | linear_tof | constant",
      "original_mass_range": [100.0, 2000.0],
      "target_bins": 29100
    },
    "normalization": null,
    "baseline_correction": null
  },
  
  "format_specific": {
    "// Preserved verbatim from original format": "for fields that don't fit above"
  }
}

Field Requirements by Category

| Category | Required | Notes |
|---|---|---|
| identity | YES | All fields required |
| spatial | YES | dimensions, pixel_size_um, n_spectra required |
| spectral | YES | mass_range, spectrum_type required |
| instrument | NO | Recommended, all fields optional |
| acquisition | NO | All fields optional |
| alignment | NO | Rapiflex-specific, all fields optional |
| calibration | NO | timsTOF-specific, all fields optional |
| processing | YES | Documents what transformations were applied |
| format_specific | NO | Catch-all for format-specific data |
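
The requirements in this section could be checked with something like the sketch below. The REQUIRED_FIELDS layout and the find_missing_fields name are hypothetical. Treating a null value as missing is also an assumption.

```python
from typing import Any, Dict, List

# Required categories and their required fields, per the requirements listed above.
# An empty list means the category must exist but no specific field is mandated.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "identity": [
        "source_format",
        "source_path",
        "conversion_timestamp",
        "converter_version",
    ],
    "spatial": ["dimensions", "pixel_size_um", "n_spectra"],
    "spectral": ["mass_range", "spectrum_type"],
    "processing": [],
}


def find_missing_fields(metadata: Dict[str, Any]) -> List[str]:
    """Return dotted paths of required categories/fields that are absent or null."""
    missing: List[str] = []
    for category, fields in REQUIRED_FIELDS.items():
        section = metadata.get(category)
        if not isinstance(section, dict):
            missing.append(category)  # whole category absent
            continue
        for name in fields:
            if section.get(name) is None:
                missing.append(f"{category}.{name}")
    return missing
```

Returning the list of missing paths, rather than raising immediately, leaves the fail-or-warn decision to the caller.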

Current vs Proposed Metadata Mapping

Rapiflex

| Current Field | Proposed Location |
|---|---|
| format: "Rapiflex" | identity.source_format |
| raster_x/y | spatial.pixel_size_um |
| teaching_points | alignment.teaching_points |
| areas | alignment.areas |
| shots_per_spot | acquisition.shots_per_pixel |
| laser_power | acquisition.laser_power |
| serial_number | instrument.serial_number |
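
One way to keep a mapping like this declarative is a lookup table of dotted target paths, sketched below for the Rapiflex case. The key names raster_x and raster_y, the helper names, and the lower-casing of the format value (to match the "rapiflex" spelling in the schema) are assumptions, not the current extractor behavior.

```python
from typing import Any, Dict

# Current Rapiflex field name -> location in the unified schema.
RAPIFLEX_FIELD_MAP: Dict[str, str] = {
    "teaching_points": "alignment.teaching_points",
    "areas": "alignment.areas",
    "shots_per_spot": "acquisition.shots_per_pixel",
    "laser_power": "acquisition.laser_power",
    "serial_number": "instrument.serial_number",
}


def set_by_path(target: Dict[str, Any], dotted_path: str, value: Any) -> None:
    """Set target['a']['b'] = value for dotted_path 'a.b', creating dicts as needed."""
    *parents, leaf = dotted_path.split(".")
    node = target
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


def map_rapiflex(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Build a partial unified-schema dict from a flat Rapiflex metadata dict."""
    unified: Dict[str, Any] = {}
    for old_name, new_path in RAPIFLEX_FIELD_MAP.items():
        if old_name in raw:
            set_by_path(unified, new_path, raw[old_name])
    if "format" in raw:
        # Assumption: normalize to the lower-case spelling used by the schema.
        set_by_path(unified, "identity.source_format", str(raw["format"]).lower())
    if "raster_x" in raw and "raster_y" in raw:
        # The two raster values collapse into a single [x, y] pair.
        pixel_size = [float(raw["raster_x"]), float(raw["raster_y"])]
        set_by_path(unified, "spatial.pixel_size_um", pixel_size)
    return unified
```

The same pattern would apply to the timsTOF and ImzML tables, each with its own lookup table.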

timsTOF

| Current Field | Proposed Location |
|---|---|
| bruker_format | identity.source_format |
| BeamScanSizeX/Y | spatial.pixel_size_um |
| calibration | calibration.* |
| laser_power | acquisition.laser_power |
| laser_frequency | acquisition.laser_frequency_hz |
| instrument_name | instrument.model |

ImzML

| Current Field | Proposed Location |
|---|---|
| file_mode | format_specific.file_mode |
| spectrum_type (from cvParam) | spectral.spectrum_type |
| pixel size x/y | spatial.pixel_size_um |
| scan_direction | acquisition.scan_direction (not yet listed in the v1.0.0 structure above) |

Implementation Plan

Phase 1: Quick Fix (Option A)

  • Add missing essential fields to current uns output
  • No schema versioning yet, just preserve more data

Phase 2: Schema Implementation

  • Create ThyraMetadataSchema dataclass in thyra/metadata/schema.py
  • Add validation for required fields
  • Add schema version to output
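
A possible starting point for the Phase 2 dataclass is sketched below. Only the class name ThyraMetadataSchema comes from this proposal. The nested class names, the choice to model only the required categories as dataclasses, and the to_json helper are assumptions for discussion.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional

SCHEMA_VERSION = "1.0.0"


@dataclass
class Identity:
    source_format: str
    source_path: str
    conversion_timestamp: str
    converter_version: str


@dataclass
class Spatial:
    dimensions: List[int]
    pixel_size_um: List[float]
    n_spectra: int
    coordinate_system: Optional[str] = None
    coordinate_bounds: Optional[List[float]] = None
    coordinate_offsets: Optional[List[int]] = None


@dataclass
class Spectral:
    mass_range: List[float]
    spectrum_type: str
    n_mass_bins: Optional[int] = None
    total_peaks: Optional[int] = None
    polarity: str = "unknown"


@dataclass
class ThyraMetadataSchema:
    # Required categories have no default, so omitting one fails at construction.
    identity: Identity
    spatial: Spatial
    spectral: Spectral
    processing: Dict[str, Any]
    # Optional categories are kept as plain dicts in this sketch.
    instrument: Optional[Dict[str, Any]] = None
    acquisition: Optional[Dict[str, Any]] = None
    alignment: Optional[Dict[str, Any]] = None
    calibration: Optional[Dict[str, Any]] = None
    format_specific: Dict[str, Any] = field(default_factory=dict)
    thyra_metadata_version: str = SCHEMA_VERSION

    def to_json(self) -> str:
        """Serialize to a JSON string, e.g. for storage in Zarr attributes."""
        return json.dumps(asdict(self))
```

Dataclasses catch a missing required field when the object is built, but they do not check types or value ranges, so the separate validation step is still needed.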

Phase 3: Extractor Migration

  • Update BrukerMetadataExtractor to populate unified schema
  • Update ImzMLMetadataExtractor to populate unified schema
  • Update RapiflexMetadataExtractor to populate unified schema

Phase 4: Converter Integration

  • Update BaseSpatialDataConverter to write unified schema
  • Add backward compatibility for reading old format
  • Add schema migration utilities
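
For the migration utilities, one option is a small registry of upgrade steps keyed by the version they start from. The sketch below is hypothetical throughout. It treats metadata without a thyra_metadata_version key as "legacy" and, lacking format knowledge, only parks the old fields under format_specific. A real step would reuse the per-format field mappings.

```python
from typing import Any, Callable, Dict

LEGACY = "legacy"  # label for metadata written before the schema carried a version
TARGET_VERSION = "1.0.0"

Metadata = Dict[str, Any]


def _legacy_to_1_0_0(old: Metadata) -> Metadata:
    """Placeholder step: preserve every old field verbatim under format_specific."""
    return {"thyra_metadata_version": "1.0.0", "format_specific": dict(old)}


# Each entry upgrades metadata from the keyed version to the next one.
MIGRATIONS: Dict[str, Callable[[Metadata], Metadata]] = {
    LEGACY: _legacy_to_1_0_0,
}


def upgrade(metadata: Metadata, target: str = TARGET_VERSION) -> Metadata:
    """Apply registered migration steps until the metadata reaches the target version."""
    current = dict(metadata)
    while current.get("thyra_metadata_version", LEGACY) != target:
        version = current.get("thyra_metadata_version", LEGACY)
        step = MIGRATIONS.get(version)
        if step is None:
            raise ValueError(f"No migration path from schema version {version!r}")
        current = step(current)
    return current
```

Readers would call upgrade() on whatever they load, so old and new files pass through the same code path afterwards.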

Open Questions

  1. Ontology references - Should we include MS ontology accessions (e.g., MS:1000127 for centroid)?
  2. Units convention - Explicit ({"value": 20, "unit": "um"}) or by naming convention (pixel_size_um)?
  3. Strictness - Should conversion fail if required fields are missing, or warn and continue?
  4. Raw metadata - Keep full raw_metadata dump alongside unified schema?
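
To make the strictness question concrete, both behaviors could sit behind a single flag, as in this minimal sketch. The MetadataValidationError class and report_missing function are hypothetical names, and which mode should be the default is exactly what is still open.

```python
import warnings
from typing import List


class MetadataValidationError(ValueError):
    """Raised in strict mode when required metadata fields are missing."""


def report_missing(missing: List[str], strict: bool) -> None:
    """Fail (strict) or warn and continue (lenient) when required fields are missing."""
    if not missing:
        return
    message = "Missing required metadata fields: " + ", ".join(missing)
    if strict:
        raise MetadataValidationError(message)
    warnings.warn(message, UserWarning, stacklevel=2)
```

A batch conversion might run lenient and collect the warnings, while a CI check on reference datasets could run strict.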

Related Issues

Labels

  • enhancement
  • metadata
  • FAIR
