Skip to content

GainPro is a PyTorch implementation of Generative Adversarial Imputation Networks (GAIN) for imputing missing iBAQ values in proteomics datasets.

License

Notifications You must be signed in to change notification settings

bigbio/GainPro

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GainPro

PyPi Version Colab Documentation HuggingFace

GainPro is a PyTorch implementation of Generative Adversarial Imputation Networks (GAIN) [1] for imputing missing iBAQ values in proteomics datasets. The package provides a unified command-line interface with multiple imputation methods including basic GAIN, GAIN-DANN (domain-adaptive), and pre-trained HuggingFace models.

Table of Contents

Features

  • Basic GAIN: Simple Generator + Discriminator architecture for general-purpose imputation
  • GAIN-DANN: Domain-adaptive imputation with Encoder/Decoder architecture
  • Pre-trained Models: Easy access to HuggingFace pre-trained models
  • Median Imputation: Simple baseline method
  • Flexible CLI: Unified gainpro command with intuitive subcommands
  • Python API: Full programmatic access to all functionality

Installation

From PyPI (Recommended)

The package is available on PyPI. Install it using:

pip install gainpro

From Source

  1. Clone the repository:

    git clone https://github.com/QuantitativeBiology/GainPro.git
    cd GainPro
  2. Create a Python environment (recommended):

    conda create -n gainpro python=3.10
    conda activate gainpro
  3. Install dependencies:

    pip install -r requirements.txt
  4. Install the package in development mode:

    pip install -e .

Quick Start

After installation, you can use the gainpro command-line interface:

# Basic GAIN imputation
gainpro gain -i data.csv

# With reference dataset for evaluation
gainpro gain -i data.csv --ref reference.csv

# Using a configuration file
gainpro gain --parameters configs/params_gain.json

Command-Line Usage

GainPro provides a unified CLI with the following subcommands:

gainpro gain - Basic GAIN Imputation

The basic GAIN command performs imputation using a Generator + Discriminator architecture.

Basic usage:

gainpro gain -i data.csv

With options:

gainpro gain -i data.csv -o imputed.csv --ofolder ./results/ --it 3000

Using a configuration file:

gainpro gain --parameters configs/params_gain.json

With reference dataset for evaluation:

gainpro gain -i data.csv --ref reference.csv

Note: When run without a reference, the command performs two phases:

  1. Evaluation run: Conceals a percentage of values (10% by default) during training, calculates RMSE, and creates test_imputed.csv for accuracy estimation
  2. Imputation run: Trains on the entire dataset and creates imputed.csv

Common options:

  • -i, --input: Path to input file (CSV, TSV, or Parquet)
  • -o, --output: Name of output file (default: imputed)
  • --ref: Path to reference (complete) dataset for evaluation
  • --ofolder: Output folder path (default: ./results)
  • --it: Number of training iterations (default: 2001)
  • --batchsize: Batch size (default: 128)
  • --miss: Missing rate for evaluation (0-1, default: 0.1)
  • --hint: Hint rate (0-1, default: 0.9)
  • --lrd: Learning rate for discriminator (default: 0.001)
  • --lrg: Learning rate for generator (default: 0.001)
  • --parameters: Path to JSON configuration file
  • --override: Override previous output files (1) or append (0, default)
  • --outall: Output all metrics (1) or minimal output (0, default)

gainpro train - Train GAIN-DANN Model

Train a domain-adaptive GAIN-DANN model:

gainpro train --config configs/params_gain_dann.json --save

gainpro impute - Impute with Trained Model

Use a trained GAIN-DANN checkpoint for imputation:

gainpro impute --checkpoint checkpoints/your_model --input data.csv --output imputed.csv

gainpro download - HuggingFace Pre-trained Models

Download and use pre-trained models from HuggingFace:

gainpro download --input data.csv --output imputed.csv

gainpro median - Median Imputation

Simple median imputation baseline:

gainpro median --input data.csv --output imputed.csv

Getting Help

For detailed help on any command:

gainpro --help
gainpro gain --help
gainpro train --help

Legacy command: The gain command is still available but deprecated. Use gainpro gain instead.

Python API

GainPro can also be used programmatically through its Python API:

from gainpro import utils, Network, Params, Metrics, Data
import torch
import pandas as pd

# Load your dataset
dataset_path = "your_dataset.tsv"
dataset_df = utils.build_protein_matrix(dataset_path)  # For TSV files
# dataset_df = pd.read_csv(dataset_path)  # For CSV files
dataset = dataset_df.values
missing_header = dataset_df.columns.tolist()

# Define your parameters
params = Params(
    input=dataset_path,
    output="imputed.csv",
    ref=None,
    output_folder=".",
    num_iterations=2001,
    batch_size=128,
    alpha=10,
    miss_rate=0.1,
    hint_rate=0.9,
    lr_D=0.001,
    lr_G=0.001,
    override=1,
    output_all=1,
)

# Define model architecture
input_dim = dataset.shape[1]
h_dim = input_dim
net_G = torch.nn.Sequential(
    torch.nn.Linear(input_dim * 2, h_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(h_dim, h_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(h_dim, input_dim),
    torch.nn.Sigmoid()
)
net_D = torch.nn.Sequential(
    torch.nn.Linear(input_dim * 2, h_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(h_dim, h_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(h_dim, input_dim),
    torch.nn.Sigmoid()
)

# Set up the model and data
metrics = Metrics(params)
network = Network(hypers=params, net_G=net_G, net_D=net_D, metrics=metrics)
data = Data(dataset=dataset, miss_rate=0.2, hint_rate=0.9, ref=None)

# Run evaluation and training
network.evaluate(data=data, missing_header=missing_header)
network.train(data=data, missing_header=missing_header)
print("Final Matrix:\n", metrics.data_imputed)

For more examples, see the use-case directory.

Repository Structure

Main components of the repository:

  • .github/workflows: CI/CD workflows for automated testing
  • datasets/: Sample datasets with missing values from PRIDE for testing
  • gainpro/: Core package source code
    • gainpro.py: Main CLI interface (unified command with subcommands)
    • model.py: Basic GAIN model implementation
    • gain_dann_model.py: GAIN-DANN model implementation
    • Other core modules (dataset, hypers, output, etc.)
  • configs/: Configuration files for different models
  • docs/source/: Documentation source files for ReadTheDocs
  • tests/: Unit tests to assess model functionality
  • use-case/: Examples demonstrating package usage
    • Installation examples
    • Test execution examples
    • HuggingFace model usage examples

Demo

The repository includes a breast cancer diagnostic dataset [2] in datasets/breast/:

  • breast.csv: Complete dataset
  • breastMissing_20.csv: Same dataset with 20% missing values
  • parameters.json: Example configuration file

Quick demo commands:

# Simple imputation
gainpro gain -i ./datasets/breast/breastMissing_20.csv

# With reference for evaluation
gainpro gain -i ./datasets/breast/breastMissing_20.csv --ref ./datasets/breast/breast.csv

# Using configuration file
gainpro gain --parameters ./datasets/breast/parameters.json

For detailed metric analysis, either:

  • Set --outall 1 to output all metrics
  • Use the Python API in an IPython console to access the metrics object (e.g., metrics.loss_D, metrics.loss_G, metrics.rmse_train)

DANN & GAIN Hybrid

The repository includes a hybrid model combining Domain Adversarial Neural Networks (DANN) with GAIN for domain-adaptive imputation. This is particularly useful when you have multiple datasets from different domains and want to learn domain-invariant representations.

Training a GAIN-DANN model:

gainpro train --config configs/params_gain_dann.json --save

Using a trained model:

gainpro impute --checkpoint checkpoints/your_model --input data.csv --output imputed.csv

For detailed information about the GAIN-DANN architecture and training procedure, see the documentation.

References

[1] J. Yoon, J. Jordon & M. van der Schaar (2018). GAIN: Missing Data Imputation using Generative Adversarial Nets
[2] https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic

About

GainPro is a PyTorch implementation of Generative Adversarial Imputation Networks (GAIN) for imputing missing iBAQ values in proteomics datasets.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 90.9%
  • Python 9.1%