[GSoC 2025] Optimize Data Generation Pipeline for DIA-MS Diffusion Model #16

@singjc

Background

Our project aims to train a diffusion model to deconvolute complex DIA-MS/MS data. A critical bottleneck in our current workflow is the data generation step that prepares input for model training.

Current Challenges

  • Generating data for a single DIA isolation window currently takes approximately 1.5 days and consumes ~800 GB of RAM
  • For each isolation window, the process extracts MS1 and MS2 spectra with a sliding window of N retention time points, so consecutive training samples overlap
  • This produces unnecessarily dense, largely redundant data, which may be contributing to the performance issues. The overlap is deliberate, however: it ensures that at least one of the input training samples contains the elution peak profile of an analyte
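
To make the redundancy concrete, here is a minimal sketch that counts how many windows a sliding scheme produces and how many times each scan ends up stored on average. The function names, the scan count, and the window size are illustrative assumptions; the issue does not state the actual value of N or the stride used by the pipeline.

```python
def sliding_window_starts(n_scans: int, window: int, stride: int = 1) -> list[int]:
    """Start indices of every full window of `window` scans, stepping by `stride`."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if n_scans < window:
        return []
    return list(range(0, n_scans - window + 1, stride))


def duplication_factor(n_scans: int, window: int, stride: int = 1) -> float:
    """Average number of windows in which each scan is stored."""
    starts = sliding_window_starts(n_scans, window, stride)
    return len(starts) * window / n_scans


# Hypothetical numbers: 100 retention time points, windows of 10 scans.
dense = sliding_window_starts(100, 10, stride=1)    # 91 windows
sparse = sliding_window_starts(100, 10, stride=10)  # 10 windows
print(len(dense), duplication_factor(100, 10, stride=1))    # 91 9.1
print(len(sparse), duplication_factor(100, 10, stride=10))  # 10 1.0
```

With a stride of 1, each scan is copied roughly `window` times, so the materialized data grows almost linearly with N. Whether this accounts for the observed memory use would need to be confirmed by profiling the real pipeline.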

Task Objectives

  • Analyze the current data generation pipeline to identify inefficiencies
  • Implement and test optimizations to reduce memory usage and processing time
  • Evaluate alternative approaches such as:
    • Using sequential (non-overlapping) retention time windows instead of sliding windows
    • Considering strategies to handle peak cutoffs at window boundaries
    • Identifying redundant data that can be safely discarded
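
One possible way to combine sequential windows with boundary handling is to let adjacent windows share a small number of scans. The sketch below is an assumption, not the project's method: `tiled_windows`, `contains_peak`, and the `overlap` parameter are hypothetical names, and the numbers are made up. With an overlap of `k` scans, any peak spanning at most `k + 1` consecutive scans falls entirely inside at least one window.

```python
def tiled_windows(n_scans: int, window: int, overlap: int = 0) -> list[tuple[int, int]]:
    """Half-open (start, end) scan ranges that tile the run with a fixed overlap."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if n_scans <= 0:
        return []
    if n_scans <= window:
        return [(0, n_scans)]
    stride = window - overlap
    starts = list(range(0, n_scans - window + 1, stride))
    # Shift one final window to the end so trailing scans are not dropped.
    if starts[-1] + window < n_scans:
        starts.append(n_scans - window)
    return [(s, s + window) for s in starts]


def contains_peak(windows: list[tuple[int, int]], peak_start: int, peak_end: int) -> bool:
    """True if some window fully contains the half-open peak range."""
    return any(s <= peak_start and peak_end <= e for s, e in windows)


# Hypothetical peak covering scans 8..11, which straddles the boundary at scan 10.
no_overlap = tiled_windows(100, 10, overlap=0)
with_overlap = tiled_windows(100, 10, overlap=4)
print(contains_peak(no_overlap, 8, 12))    # False: the peak is cut at the boundary
print(contains_peak(with_overlap, 8, 12))  # True: window (6, 16) holds the whole peak
print(len(no_overlap), len(with_overlap))  # 10 16
```

The overlap would need to be chosen from the expected chromatographic peak width, which trades some redundancy back for robustness at the boundaries.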

Deliverables

  • Modified code with documented optimizations
  • Performance comparison between original and optimized implementations
  • Analysis of any trade-offs in data quality vs. performance
  • Recommendations for future improvements
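
For the performance comparison, a small harness along the following lines could record wall time and peak memory for each implementation. This is a sketch using only the Python standard library. `measure`, `RunStats`, `extract_dense`, and `extract_tiled` are hypothetical names, and the two extraction functions are toy stand-ins rather than the project's code. Note that `tracemalloc` only sees allocations made through Python's memory allocator and slows execution, so a real benchmark might also monitor process-level memory.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RunStats:
    seconds: float
    peak_bytes: int
    result: Any


def measure(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> RunStats:
    """Run fn once and record elapsed wall time and peak traced memory."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        seconds = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return RunStats(seconds, peak, result)


def extract_dense(scans: list[float], window: int) -> list[list[float]]:
    """Toy stand-in: one window per retention time point (stride 1)."""
    return [scans[i:i + window] for i in range(len(scans) - window + 1)]


def extract_tiled(scans: list[float], window: int) -> list[list[float]]:
    """Toy stand-in: non-overlapping sequential windows."""
    return [scans[i:i + window] for i in range(0, len(scans), window)]


scans = [float(i) for i in range(10_000)]  # placeholder data, not real spectra
dense = measure(extract_dense, scans, 10)
tiled = measure(extract_tiled, scans, 10)
for name, stats in (("dense", dense), ("tiled", tiled)):
    print(f"{name}: {len(stats.result)} windows, "
          f"{stats.seconds:.4f} s, peak {stats.peak_bytes / 1e6:.2f} MB")
```

Reporting both implementations through the same harness keeps the comparison consistent. Repeating each run several times would give more stable timings.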

Resources

Difficulty

  • Beginner to Intermediate - This task provides a good entry point to understand our data pipeline while making meaningful contributions to the project's efficiency.

Labels

  • GSoC 2025: Tasks specific for GSoC2025
  • enhancement: New feature or request
  • help wanted: Extra attention is needed
