[GSoC 2025] Optimize Data Generation Pipeline for DIA-MS Diffusion Model

## Background
Our project aims to train a diffusion model to deconvolute complex DIA-MS/MS data. A critical bottleneck in our current workflow is the data generation step that prepares input for model training.

## Current Challenges

- The [current data generation process](https://github.com/Roestlab/diffusion-deconvolution-dia-msms-data/blob/main/dquartic/utils/data_generation.py) takes approximately 1.5 days and consumes ~800 GB of RAM to process a single DIA isolation window.
- The process extracts the MS1 and MS2 spectra within an isolation window using a sliding window of N overlapping retention time points.
- The overlap produces unnecessarily dense data, which may be contributing to the performance issues. However, it is done this way to ensure that at least one of the input training samples contains the elution peak profile of an analyte.
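
To make the density concrete, here is a minimal sketch of stride-1 sliding windows over the retention time axis. It assumes a hypothetical dense array of shape `(n_rt, n_mz)`; the actual data layout in `data_generation.py` may differ, and the function names are illustrative only. It shows that the windows can be exposed as a NumPy view without copying, and how many times each scan is repeated if the windows are materialized instead.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def sliding_windows(spectra: np.ndarray, window: int) -> np.ndarray:
    """Return stride-1 windows along the retention-time axis as a read-only view.

    spectra: hypothetical dense array of shape (n_rt, n_mz).
    Result shape: (n_rt - window + 1, window, n_mz). No data is copied.
    """
    view = sliding_window_view(spectra, window_shape=window, axis=0)
    # sliding_window_view appends the window axis last: (n_windows, n_mz, window).
    return np.moveaxis(view, -1, 1)


def redundancy_factor(n_rt: int, window: int) -> float:
    """Average number of times each scan is stored if every window is copied."""
    n_windows = n_rt - window + 1
    return n_windows * window / n_rt


spectra = np.arange(10 * 3, dtype=np.float32).reshape(10, 3)
windows = sliding_windows(spectra, window=4)
print(windows.shape)                   # (7, 4, 3)
print(np.shares_memory(windows, spectra))
print(redundancy_factor(10, 4))
```

For long runs the redundancy factor approaches the window length N, so copying every window multiplies memory use by roughly N, while the view keeps a single copy of the spectra.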

## Task Objectives

- Analyze the current data generation pipeline to identify inefficiencies
- Implement and test optimizations to reduce memory usage and processing time
- Evaluate alternative approaches such as:
   - Using sequential (non-overlapping) retention time windows instead of sliding windows
   - Considering strategies to handle peak cutoffs at window boundaries
   - Identifying redundant data that can be safely discarded
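
One possible way to explore the window alternatives above is to make the stride a parameter. The sketch below is an illustration, not the project's implementation, and `window_starts` and `is_contained` are hypothetical helpers. A stride equal to the window length gives sequential, non-overlapping windows; a smaller stride, such as half the window, is one option for keeping peaks that would otherwise be cut at a boundary.

```python
def window_starts(n_rt: int, window: int, stride: int) -> list[int]:
    """Start indices of windows of length `window` taken every `stride` scans.

    A final window aligned to the end of the run is appended when needed,
    so trailing scans are not dropped.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if n_rt < window:
        return []
    starts = list(range(0, n_rt - window + 1, stride))
    last = n_rt - window
    if starts[-1] != last:
        starts.append(last)
    return starts


def is_contained(starts: list[int], window: int, lo: int, hi: int) -> bool:
    """True if the half-open scan range [lo, hi) lies fully inside some window."""
    return any(s <= lo and hi <= s + window for s in starts)


# A hypothetical peak spanning scans 7..10 in a run of 20 scans.
non_overlapping = window_starts(20, window=8, stride=8)
half_overlap = window_starts(20, window=8, stride=4)
print(non_overlapping, is_contained(non_overlapping, 8, 7, 11))
print(half_overlap, is_contained(half_overlap, 8, 7, 11))
```

With this scheme, any peak no wider than `window - stride` scans falls entirely inside at least one window, so half-overlapping windows keep peaks up to half a window wide while producing far fewer windows than a stride of 1.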



## Deliverables

- Modified code with documented optimizations
- Performance comparison between original and optimized implementations
- Analysis of any trade-offs in data quality vs. performance
- Recommendations for future improvements
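
For the performance comparison, a small harness along these lines could be a starting point. It is a sketch using only the standard library; `measure` is a hypothetical helper, and the two functions being compared are stand-ins rather than the real pipeline. Note that `tracemalloc` only sees allocations made through Python's memory allocator and slows execution, so the numbers are best read as relative rather than absolute.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds, peak_traced_bytes)."""
    tracemalloc.start()
    t0 = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - t0
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def materialized_sum(n: int) -> int:
    """Stand-in for an implementation that builds everything in memory."""
    return sum([i * i for i in range(n)])


def streaming_sum(n: int) -> int:
    """Stand-in for an implementation that processes items lazily."""
    return sum(i * i for i in range(n))


for fn in (materialized_sum, streaming_sum):
    result, seconds, peak = measure(fn, 200_000)
    print(f"{fn.__name__}: {seconds:.4f} s, peak {peak / 1e6:.2f} MB")
```

Checking that both implementations return the same result before comparing time and memory helps separate performance changes from changes in data quality.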

## Resources

- Example data and processing notebook: [nbs/hela_2018.ipynb](https://github.com/Roestlab/diffusion-deconvolution-dia-msms-data/blob/main/nbs/hela_2018.ipynb)
- Background literature for understanding the structure of DIA-MS/MS data: [DIA-MS data in our wiki](https://github.com/Roestlab/diffusion-deconvolution-dia-msms-data/wiki/GSoC-2025)

## Difficulty
- Beginner to Intermediate. This task is a good entry point for understanding our data pipeline while making a meaningful contribution to the project's efficiency.
