GainPro is a PyTorch implementation of Generative Adversarial Imputation Networks (GAIN) [1] for imputing missing iBAQ values in proteomics datasets. The package provides a unified command-line interface with multiple imputation methods including basic GAIN, GAIN-DANN (domain-adaptive), and pre-trained HuggingFace models.
- Features
- Installation
- Quick Start
- Command-Line Usage
- Python API
- Repository Structure
- DANN & GAIN Hybrid
- References
- Basic GAIN: Simple Generator + Discriminator architecture for general-purpose imputation
- GAIN-DANN: Domain-adaptive imputation with Encoder/Decoder architecture
- Pre-trained Models: Easy access to HuggingFace pre-trained models
- Median Imputation: Simple baseline method
- Flexible CLI: Unified gainpro command with intuitive subcommands
- Python API: Full programmatic access to all functionality
The package is available on PyPI. Install it using:
pip install gainpro
Alternatively, install from source:
- Clone the repository:
git clone https://github.com/QuantitativeBiology/GainPro.git
cd GainPro
- Create a Python environment (recommended):
conda create -n gainpro python=3.10
conda activate gainpro
- Install dependencies:
pip install -r requirements.txt
- Install the package in development mode:
pip install -e .
After installation, you can use the gainpro command-line interface:
# Basic GAIN imputation
gainpro gain -i data.csv
# With reference dataset for evaluation
gainpro gain -i data.csv --ref reference.csv
# Using a configuration file
gainpro gain --parameters configs/params_gain.json
GainPro provides a unified CLI with the following subcommands:
The basic GAIN command performs imputation using a Generator + Discriminator architecture.
Basic usage:
gainpro gain -i data.csv
With options:
gainpro gain -i data.csv -o imputed.csv --ofolder ./results/ --it 3000
Using a configuration file:
gainpro gain --parameters configs/params_gain.json
With reference dataset for evaluation:
gainpro gain -i data.csv --ref reference.csv
Note: When run without a reference, the command performs two phases:
- Evaluation run: Conceals a percentage of values (10% by default) during training, calculates RMSE, and creates test_imputed.csv for accuracy estimation
- Imputation run: Trains on the entire dataset and creates imputed.csv
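The evaluation run described above can be pictured as "hide some known values, impute, then score only the hidden positions". The sketch below is a plain-Python illustration of that idea, not GainPro's code: the helper names (`conceal`, `rmse`, `column_mean_fill`) are hypothetical, and a column-mean fill stands in for the GAIN model.

```python
import math
import random

def conceal(rows, miss_rate=0.1, seed=0):
    """Hide a fraction of the observed entries; return the masked copy and the hidden positions."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(rows)
                for j, v in enumerate(row) if v is not None]
    hidden = rng.sample(observed, k=round(miss_rate * len(observed)))
    masked = [list(row) for row in rows]
    for i, j in hidden:
        masked[i][j] = None
    return masked, hidden

def rmse(truth, imputed, positions):
    """Root-mean-square error restricted to the concealed positions."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in positions]
    return math.sqrt(sum(squared) / len(squared))

def column_mean_fill(rows):
    """Stand-in imputer (NOT GAIN): fill each gap with its column's observed mean."""
    means = []
    for col in zip(*rows):
        seen = [v for v in col if v is not None]
        means.append(sum(seen) / len(seen))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Toy complete matrix: 20 samples x 5 features.
data = [[float(i + j) for j in range(5)] for i in range(20)]
masked, hidden = conceal(data, miss_rate=0.1)
imputed = column_mean_fill(masked)
score = rmse(data, imputed, hidden)
print(f"concealed {len(hidden)} values, RMSE on concealed entries: {score:.3f}")
```

Scoring only the concealed entries matters: the observed values are copied through unchanged, so including them would make any imputer look better than it is.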
Common options:
- -i, --input: Path to input file (CSV, TSV, or Parquet)
- -o, --output: Name of output file (default: imputed)
- --ref: Path to reference (complete) dataset for evaluation
- --ofolder: Output folder path (default: ./results)
- --it: Number of training iterations (default: 2001)
- --batchsize: Batch size (default: 128)
- --miss: Missing rate for evaluation (0-1, default: 0.1)
- --hint: Hint rate (0-1, default: 0.9)
- --lrd: Learning rate for discriminator (default: 0.001)
- --lrg: Learning rate for generator (default: 0.001)
- --parameters: Path to JSON configuration file
- --override: Override previous output files (1) or append (0, default)
- --outall: Output all metrics (1) or minimal output (0, default)
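When launching runs from a script, it can help to assemble the command from the options listed above rather than concatenating strings. The following is a small sketch using only the documented flags and defaults; `build_gain_command` is a hypothetical helper, not part of GainPro.

```python
import shlex

def build_gain_command(input_path, output="imputed", ofolder="./results",
                       iterations=2001, batch_size=128, miss=0.1, hint=0.9, ref=None):
    """Assemble a `gainpro gain` argument list from the documented flags."""
    cmd = [
        "gainpro", "gain",
        "-i", input_path,
        "-o", output,
        "--ofolder", ofolder,
        "--it", str(iterations),
        "--batchsize", str(batch_size),
        "--miss", str(miss),
        "--hint", str(hint),
    ]
    if ref is not None:
        # A reference dataset is optional and only used for evaluation.
        cmd += ["--ref", ref]
    return cmd

cmd = build_gain_command("data.csv", iterations=3000, ref="reference.csv")
print(shlex.join(cmd))

# To actually run it (requires gainpro to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the list form to `subprocess.run` avoids shell quoting problems with paths that contain spaces.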
Train a domain-adaptive GAIN-DANN model:
gainpro train --config configs/params_gain_dann.json --save
Use a trained GAIN-DANN checkpoint for imputation:
gainpro impute --checkpoint checkpoints/your_model --input data.csv --output imputed.csv
Download and use pre-trained models from HuggingFace:
gainpro download --input data.csv --output imputed.csv
Simple median imputation baseline:
gainpro median --input data.csv --output imputed.csv
For detailed help on any command:
gainpro --help
gainpro gain --help
gainpro train --help
Legacy command: The gain command is still available but deprecated. Use gainpro gain instead.
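For orientation, a median baseline like the one offered by the median subcommand replaces each missing entry with the median of the observed values. The sketch below assumes column-wise medians (this README does not state whether GainPro works per column or per row) and uses a hypothetical `median_impute` helper on a toy matrix.

```python
from statistics import median

def median_impute(rows):
    """Replace None entries with the median of the observed values in the same column."""
    columns = list(zip(*rows))
    medians = [median(v for v in col if v is not None) for col in columns]
    return [[medians[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Toy matrix: rows are samples, columns are features, None marks a missing value.
matrix = [
    [1.0, 10.0],
    [None, 20.0],
    [3.0, None],
    [8.0, 60.0],
]
filled = median_impute(matrix)
print(filled)
```

Because it ignores relationships between features, this baseline is mainly useful as a reference point when judging GAIN results.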
GainPro can also be used programmatically through its Python API:
from gainpro import utils, Network, Params, Metrics, Data
import torch
import pandas as pd
# Load your dataset
dataset_path = "your_dataset.tsv"
dataset_df = utils.build_protein_matrix(dataset_path) # For TSV files
# dataset_df = pd.read_csv(dataset_path) # For CSV files
dataset = dataset_df.values
missing_header = dataset_df.columns.tolist()
# Define your parameters
params = Params(
    input=dataset_path,
    output="imputed.csv",
    ref=None,
    output_folder=".",
    num_iterations=2001,
    batch_size=128,
    alpha=10,
    miss_rate=0.1,
    hint_rate=0.9,
    lr_D=0.001,
    lr_G=0.001,
    override=1,
    output_all=1,
)
# Define model architecture
input_dim = dataset.shape[1]
h_dim = input_dim
net_G = torch.nn.Sequential(
    torch.nn.Linear(input_dim * 2, h_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(h_dim, h_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(h_dim, input_dim),
    torch.nn.Sigmoid(),
)
net_D = torch.nn.Sequential(
    torch.nn.Linear(input_dim * 2, h_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(h_dim, h_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(h_dim, input_dim),
    torch.nn.Sigmoid(),
)
# Set up the model and data
metrics = Metrics(params)
network = Network(hypers=params, net_G=net_G, net_D=net_D, metrics=metrics)
data = Data(dataset=dataset, miss_rate=0.2, hint_rate=0.9, ref=None)
# Run evaluation and training
network.evaluate(data=data, missing_header=missing_header)
network.train(data=data, missing_header=missing_header)
print("Final Matrix:\n", metrics.data_imputed)
For more examples, see the use-case directory.
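The `input_dim * 2` width of the first layer in both networks follows from the GAIN formulation [1]: the generator receives the data concatenated with the missingness mask, and the discriminator receives the imputed data concatenated with a hint matrix. The sketch below walks through one forward pass on toy tensors under that formulation. How GainPro's Network class wires this internally is not shown in this README, so the variable names, the noise scale, and the hint variant (mask entries revealed with probability `hint_rate`) are assumptions.

```python
import torch

torch.manual_seed(0)
batch, input_dim = 4, 3
h_dim = input_dim

# Reduced generator for illustration; same input/output widths as in the example above.
net_G = torch.nn.Sequential(
    torch.nn.Linear(input_dim * 2, h_dim), torch.nn.ReLU(),
    torch.nn.Linear(h_dim, input_dim), torch.nn.Sigmoid(),
)

x = torch.rand(batch, input_dim)                       # toy data scaled to [0, 1]
mask = (torch.rand(batch, input_dim) > 0.2).float()    # 1 = observed, 0 = missing
noise = 0.01 * torch.rand(batch, input_dim)            # assumed noise scale

# Missing entries are replaced by noise before entering the generator.
x_in = mask * x + (1 - mask) * noise
g_out = net_G(torch.cat([x_in, mask], dim=1))          # input width: input_dim * 2

# Keep observed values; use the generator output only where data was missing.
x_hat = mask * x + (1 - mask) * g_out

# Hint: reveal each mask entry to the discriminator with probability hint_rate.
hint_rate = 0.9
reveal = (torch.rand(batch, input_dim) < hint_rate).float()
hint = reveal * mask
d_in = torch.cat([x_hat, hint], dim=1)                 # also input_dim * 2 wide

print(g_out.shape, d_in.shape)
```

The final Sigmoid in the generator is why inputs are assumed to be scaled to [0, 1] here: its outputs can only be combined with the observed values if both live on the same range.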
Main components of the repository:
- .github/workflows: CI/CD workflows for automated testing
- datasets/: Sample datasets with missing values from PRIDE for testing
- gainpro/: Core package source code
  - gainpro.py: Main CLI interface (unified command with subcommands)
  - model.py: Basic GAIN model implementation
  - gain_dann_model.py: GAIN-DANN model implementation
  - Other core modules (dataset, hypers, output, etc.)
- configs/: Configuration files for different models
- docs/source/: Documentation source files for ReadTheDocs
- tests/: Unit tests to assess model functionality
- use-case/: Examples demonstrating package usage
  - Installation examples
  - Test execution examples
  - HuggingFace model usage examples
The repository includes a breast cancer diagnostic dataset [2] in datasets/breast/:
- breast.csv: Complete dataset
- breastMissing_20.csv: Same dataset with 20% missing values
- parameters.json: Example configuration file
Quick demo commands:
# Simple imputation
gainpro gain -i ./datasets/breast/breastMissing_20.csv
# With reference for evaluation
gainpro gain -i ./datasets/breast/breastMissing_20.csv --ref ./datasets/breast/breast.csv
# Using configuration file
gainpro gain --parameters ./datasets/breast/parameters.json
For detailed metric analysis, either:
- Set --outall 1 to output all metrics
- Use the Python API in an IPython console to access the metrics object (e.g., metrics.loss_D, metrics.loss_G, metrics.rmse_train)
The repository includes a hybrid model combining Domain Adversarial Neural Networks (DANN) with GAIN for domain-adaptive imputation. This is particularly useful when you have multiple datasets from different domains and want to learn domain-invariant representations.
Training a GAIN-DANN model:
gainpro train --config configs/params_gain_dann.json --save
Using a trained model:
gainpro impute --checkpoint checkpoints/your_model --input data.csv --output imputed.csv
For detailed information about the GAIN-DANN architecture and training procedure, see the documentation.
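As background on the DANN side of the hybrid: domain-adversarial training is commonly implemented with a gradient reversal layer, which passes features through unchanged but flips the sign of the gradient coming back from a domain classifier, pushing the encoder toward domain-invariant features. The sketch below shows that general technique in PyTorch. Whether GainPro's GAIN-DANN model uses exactly this mechanism is not stated in this README, and the layer sizes and names (`GradientReversal`, `encoder`, `domain_classifier`) are hypothetical.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -lambd in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # lambd is a plain float and receives no gradient, hence the trailing None.
        return -ctx.lambd * grad_output, None

torch.manual_seed(0)
encoder = torch.nn.Linear(5, 3)             # hypothetical encoder
domain_classifier = torch.nn.Linear(3, 2)   # hypothetical classifier over 2 domains

x = torch.rand(8, 5)                        # toy batch: 8 samples, 5 features
domains = torch.randint(0, 2, (8,))         # toy domain labels

z = encoder(x)
logits = domain_classifier(GradientReversal.apply(z, 1.0))
loss = torch.nn.functional.cross_entropy(logits, domains)
loss.backward()

# The classifier is updated to predict the domain, while the encoder receives the
# reversed gradient and is therefore updated to make the domain harder to predict.
print(encoder.weight.grad.shape)
```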
[1] J. Yoon, J. Jordon & M. van der Schaar (2018). GAIN: Missing Data Imputation using Generative Adversarial Nets.
[2] Breast Cancer Wisconsin (Diagnostic) dataset, UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic