A reproducible computational framework for analyzing genotype–trait relationships in rice breeding populations using quantitative genetics, multivariate statistics, network analysis, and machine learning.
This repository implements a complete analytical pipeline to identify:
- key traits controlling grain yield
- direct interactions among agronomic traits
- hub traits in trait networks
- elite mutant genotypes using multi-trait selection
- yield-driving predictors using machine learning
- optimal trait combinations for breeding improvement
The workflow produces publication-ready figures and tables (600 dpi) suitable for journals such as:
- PLOS ONE
- Scientific Reports
- Frontiers in Plant Science
- Field Crops Research
Grain yield in rice is a complex quantitative trait controlled by multiple interacting morphological, physiological, and genetic components.
Traditional correlation analysis often fails to reveal true biological relationships among traits due to:
- indirect trait effects
- multicollinearity
- environmental variation
- genotype × trait interactions
To overcome these limitations, this pipeline integrates classical plant breeding statistics with modern computational approaches, including:
- BLUP-based genetic estimation
- Graphical Lasso partial correlation networks
- machine learning trait importance models
- multi-trait genotype selection indices (MGIDI)
- response surface modeling of harvest index
- principal component analysis (PCA)
- genotype–trait association networks
This integrated framework allows researchers to identify direct trait interactions, yield drivers, and elite genotypes for crop improvement.
The pipeline combines classical quantitative genetics with modern statistical learning methods.
| Method | Purpose |
|---|---|
| Mixed Linear Model (REML) | Estimate BLUP genotype effects |
| Variance Components | Partition genetic and residual variance |
| Broad-sense Heritability | Measure trait inheritance |
| Graphical Lasso | Estimate sparse partial correlation networks |
| Network Centrality | Identify hub traits driving trait interactions |
| Multiple Linear Regression | Detect key yield predictors |
| Partial Least Squares (PLS) | Estimate trait predictive importance |
| Random Forest | Model nonlinear trait effects |
| SHAP Analysis | Explain machine learning predictions |
| MGIDI Index | Multi-trait genotype selection |
| Path Analysis | Estimate direct trait effects on yield |
| PCA | Identify major sources of phenotypic variation |
| Response Surface Modeling | Analyze trait combinations affecting harvest index |
| Chord Diagram Networks | Visualize genotype–trait associations |
rice-genotype-trait-analysis
│
├── data
│   └── example_dataset.xlsx
│
├── scripts
│   ├── 01_BLUP_Heritability_Network.py
│   ├── 02_Trait_Network_ML_HI.py
│   ├── 03_GeneticParameters_MGIDI_EliteSelection.py
│   ├── 04_Yield_vs_Traits_Scatter.py
│   ├── 05_HI_Correlation_Matrix.py
│   ├── 06_Genotype_Trait_Chord_Diagram.py
│   ├── 07_HI_Response_Surface.py
│   ├── 08_Parent_vs_Mutants_TraitMatrix.py
│   └── 09_TargetTrait_Correlation_PCA.py
│
├── notebooks
│   └── colab_pipeline.ipynb
│
├── outputs
│   ├── tables
│   ├── figures
│   └── networks
│
├── pipeline_colab.py
├── requirements.txt
└── README.md
The pipeline expects a CSV or Excel dataset from a Randomized Complete Block Design (RCBD) experiment.
Genotype Replication Trait columns...
| Genotype | Replication | Plant height | Panicle length | Filled grains | Grain yield |
|---|---|---|---|---|---|
| Parent | 1 | 110 | 26 | 180 | 38 |
| Mutant1 | 1 | 118 | 28 | 195 | 41 |
| Mutant2 | 1 | 105 | 24 | 165 | 35 |
Typical agronomic traits include:
- Days to flowering
- Days to maturity
- Plant height
- Tillers per hill
- Effective tillers per hill
- Panicle length
- Primary branch per panicle
- Secondary branch per panicle
- Flag leaf length
- Filled grain per panicle
- Sterile grain per panicle
- Grain length
- Grain breadth
- 1000 grain weight
- Grain yield per hill
- Straw yield per hill
- Harvest index
This project implements an automated computational pipeline for analyzing rice phenotypic traits using statistical genetics, network analysis, and machine learning.
The workflow integrates classical quantitative genetics with modern data science tools to identify key determinants of agronomic performance.
All analyses are executed in a Python-based environment using Google Colab, enabling reproducibility, scalability, and automated figure generation.
Data Import
↓
Automatic Trait Detection
↓
BLUP Estimation & Heritability
↓
Correlation Analysis
↓
Graphical Lasso Trait Network
↓
Partial Correlation Modeling
↓
Machine Learning Trait Prediction
↓
Genetic Parameter Estimation
↓
MGIDI Multi-Trait Selection
↓
Yield–Trait Regression Analysis
↓
PCA Multivariate Trait Structure
↓
Interactive Visualization Dashboard
↓
Automated Report Generation
The phenotypic dataset containing genotype, replication, and agronomic trait measurements is imported into the computational environment.
Numeric trait columns are automatically detected using Python routines to ensure compatibility with datasets containing variable trait sets.
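The detection step can be sketched as follows. This is a minimal illustration, assuming the identifier columns are named `Genotype` and `Replication` as in the example table; the helper name `detect_trait_columns` and the `Notes` column are hypothetical and not taken from the repository scripts.

```python
import pandas as pd

def detect_trait_columns(df, id_columns=("Genotype", "Replication")):
    """Return the numeric columns that are not identifier columns."""
    numeric = df.select_dtypes(include="number").columns
    return [c for c in numeric if c not in id_columns]

# Small table in the expected layout; "Notes" is a hypothetical text column
# added only to show that non-numeric columns are skipped.
df = pd.DataFrame({
    "Genotype": ["Parent", "Mutant1", "Mutant2"],
    "Replication": [1, 1, 1],
    "Plant height": [110, 118, 105],
    "Grain yield": [38, 41, 35],
    "Notes": ["check", "", ""],
})

traits = detect_trait_columns(df)
print(traits)  # ['Plant height', 'Grain yield']
```

`Replication` is numeric but is excluded explicitly, otherwise it would be treated as a trait.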
Genotypic effects are estimated using Best Linear Unbiased Prediction (BLUP) based on a mixed linear model:
yᵢⱼ = μ + Rⱼ + Gᵢ + eᵢⱼ
Where:
- μ = population mean
- Rⱼ = replication effect
- Gᵢ = genotype effect
- eᵢⱼ = residual error
Variance components are used to compute broad-sense heritability (H²).
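A minimal sketch of this step with statsmodels, run on synthetic data. Two choices here are assumptions rather than details confirmed by the pipeline: replication is fitted as a fixed effect with genotype as the random effect, and heritability is computed on an entry-mean basis as σ²g / (σ²g + σ²e / r), where r is the number of replications.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic RCBD data: 12 genotypes x 3 replications (illustrative values only).
rng = np.random.default_rng(0)
n_geno, n_rep = 12, 3
geno_effect = rng.normal(0, 3, n_geno)
rep_effect = np.array([-1.0, 0.0, 1.0])
rows = [
    {"Genotype": f"G{i + 1}", "Replication": j + 1,
     "y": 40 + rep_effect[j] + geno_effect[i] + rng.normal(0, 1.5)}
    for i in range(n_geno) for j in range(n_rep)
]
data = pd.DataFrame(rows)

# Replication as a fixed effect, genotype as a random intercept, fitted by REML.
model = smf.mixedlm("y ~ C(Replication)", data, groups=data["Genotype"])
fit = model.fit(reml=True)

var_g = float(fit.cov_re.iloc[0, 0])  # genotypic variance
var_e = float(fit.scale)              # residual variance

# Entry-mean heritability (assumed formula, see the note above).
h2 = var_g / (var_g + var_e / n_rep)

# BLUPs are the predicted random genotype effects (deviations from the mean).
blup = pd.Series({g: eff.iloc[0] for g, eff in fit.random_effects.items()},
                 name="BLUP")
print(round(h2, 3))
print(blup.sort_values(ascending=False).head())
```

With real data, the same model would be fitted once per trait column to build the genotype × trait BLUP matrix.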
Trait relationships are initially evaluated using Pearson correlation matrices.
To reveal direct conditional relationships between traits, a Graphical Lasso model is used to estimate sparse inverse covariance matrices, generating partial correlation networks that highlight key trait interactions.
Network visualization identifies hub traits with strong influence on plant performance.
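The conversion from a sparse inverse covariance matrix to a partial correlation network can be sketched as below. The data are synthetic, and the penalty `alpha=0.1` and the edge threshold of 0.05 are illustrative assumptions, not the settings used by the pipeline.

```python
import numpy as np
import networkx as nx
from sklearn.covariance import GraphicalLasso
from sklearn.preprocessing import StandardScaler

# Synthetic traits with a known dependency structure (illustrative only).
rng = np.random.default_rng(1)
n = 200
tillers = rng.normal(size=n)
panicle = rng.normal(size=n)
grains = 0.8 * panicle + rng.normal(scale=0.6, size=n)
grain_yield = 0.6 * tillers + 0.7 * grains + rng.normal(scale=0.5, size=n)
names = ["Tillers", "Panicle length", "Filled grains", "Grain yield"]
X = StandardScaler().fit_transform(
    np.column_stack([tillers, panicle, grains, grain_yield]))

# Sparse precision (inverse covariance) matrix.
precision = GraphicalLasso(alpha=0.1).fit(X).precision_

# Partial correlation: rho_ij = -P_ij / sqrt(P_ii * P_jj).
d = np.sqrt(np.diag(precision))
partial_corr = -precision / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)

# Build the trait network; keep edges above a small threshold.
G = nx.Graph()
G.add_nodes_from(names)
for a in range(len(names)):
    for b in range(a + 1, len(names)):
        if abs(partial_corr[a, b]) > 0.05:
            G.add_edge(names[a], names[b], weight=partial_corr[a, b])

# Degree centrality as a simple hub score.
hubs = nx.degree_centrality(G)
print(sorted(hubs.items(), key=lambda kv: kv[1], reverse=True))
```

An edge in this network means two traits remain associated after conditioning on all other traits, which is what separates it from a plain Pearson correlation network.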
Machine learning models are applied to quantify the predictive contribution of individual traits.
Algorithms used include:
- Random Forest regression
- Gradient Boosting regression
- SHAP feature importance analysis
These models identify traits that most strongly influence yield or harvest index.
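As a rough sketch of the trait-importance idea, the example below fits a Random Forest on synthetic data and ranks traits by impurity-based importance. The trait values, coefficients, and hyperparameters are assumptions for illustration; SHAP is left out to keep the example dependency-light.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic predictors; yield depends on two of the three traits.
rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    "Filled grains": rng.normal(180, 15, n),
    "Panicle length": rng.normal(26, 2, n),
    "Plant height": rng.normal(110, 6, n),  # no effect on yield here
})
y = 0.15 * X["Filled grains"] + 0.5 * X["Panicle length"] + rng.normal(0, 1, n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances sum to 1 across traits.
importance = pd.Series(rf.feature_importances_, index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Impurity-based importance can be biased when predictors are strongly correlated, which is common among yield components, so it is best read alongside the partial correlation network rather than on its own.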
Classical quantitative genetic parameters are calculated to evaluate variability and selection potential:
- Genotypic coefficient of variation (GCV)
- Phenotypic coefficient of variation (PCV)
- Broad-sense heritability (H²)
- Genetic advance (GA)
- Genetic advance as percent of mean (GAM)
These metrics help determine which traits respond most effectively to selection.
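These parameters can be computed from the variance components as in the sketch below. It uses the conventional textbook formulas, which are assumed here rather than taken from the pipeline: phenotypic variance as σ²p = σ²g + σ²e, and a selection differential of k = 2.06 (5% selection intensity). The function name and input values are hypothetical.

```python
import math

def genetic_parameters(var_g, var_e, mean, k=2.06):
    """Classical variability parameters from variance components.

    var_g: genotypic variance, var_e: residual variance,
    mean: trait grand mean, k: selection differential (2.06 at 5% intensity).
    """
    var_p = var_g + var_e                  # phenotypic variance (assumed form)
    h2 = var_g / var_p                     # broad-sense heritability
    ga = k * h2 * math.sqrt(var_p)         # genetic advance
    return {
        "GCV": math.sqrt(var_g) / mean * 100,
        "PCV": math.sqrt(var_p) / mean * 100,
        "H2": h2,
        "GA": ga,
        "GAM": ga / mean * 100,            # genetic advance as % of mean
    }

# Illustrative values only.
params = genetic_parameters(var_g=9.0, var_e=3.0, mean=40.0)
for name, value in params.items():
    print(f"{name}: {value:.2f}")
```

PCV is always at least as large as GCV; a small gap between the two suggests limited environmental influence on the trait.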
The Multi-Trait Genotype–Ideotype Distance Index (MGIDI) is used to identify elite genotypes with optimal trait combinations.
Genotypes with lower MGIDI scores are considered closer to the ideal breeding profile.
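The distance-to-ideotype idea can be illustrated with a deliberately simplified sketch. This is not the full MGIDI: the published index computes distances on factor-analysis scores, whereas the example below skips the factor analysis and measures Euclidean distance directly on traits rescaled to 0–100. The function name, the trait directions, and the input values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def ideotype_distance(values, increase):
    """Simplified ideotype distance (NOT the full MGIDI, no factor analysis).

    values: genotype x trait table; increase: {trait: True to raise, False to lower}.
    Each trait must vary across genotypes, otherwise rescaling divides by zero.
    """
    sub = values[list(increase)].astype(float)
    scaled = (sub - sub.min()) / (sub.max() - sub.min()) * 100
    for trait, up in increase.items():
        if not up:
            scaled[trait] = 100 - scaled[trait]  # lower is better
    # After rescaling, the ideotype scores 100 on every trait.
    dist = np.sqrt(((scaled - 100) ** 2).sum(axis=1))
    return dist.sort_values()

# Illustrative genotype values; the choice of directions is an assumption.
values = pd.DataFrame(
    {"Grain yield": [38, 41, 35], "Plant height": [110, 118, 105]},
    index=["Parent", "Mutant1", "Mutant2"],
)
scores = ideotype_distance(values, {"Grain yield": True, "Plant height": False})
print(scores)
```

In this toy case `Parent` ranks first because it is moderately good on both traits, while each mutant is best on one trait and worst on the other.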
Linear regression models are applied to quantify relationships between grain yield and key yield components such as:
- panicle length
- filled grains per panicle
- straw yield
Regression plots with 95% confidence intervals are generated to visualize trait contributions.
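The regression and its 95% confidence band can be sketched with statsmodels as below. The data are synthetic and the column names `filled_grains` and `grain_yield` are hypothetical; plotting is omitted, but the `band` table holds the values a regression plot would draw.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic yield component data (illustrative only).
rng = np.random.default_rng(3)
n = 60
df = pd.DataFrame({"filled_grains": rng.normal(180, 15, n)})
df["grain_yield"] = 5 + 0.18 * df["filled_grains"] + rng.normal(0, 2, n)

# Simple linear regression of yield on one yield component.
fit = smf.ols("grain_yield ~ filled_grains", data=df).fit()

# Predicted mean and 95% confidence interval over a grid of trait values.
grid = pd.DataFrame({
    "filled_grains": np.linspace(df["filled_grains"].min(),
                                 df["filled_grains"].max(), 50)
})
band = fit.get_prediction(grid).summary_frame(alpha=0.05)

print(fit.params)
print(band[["mean", "mean_ci_lower", "mean_ci_upper"]].head())
```

The `mean_ci_*` columns give the confidence interval for the fitted line; the wider `obs_ci_*` columns in the same table are prediction intervals for individual observations.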
Principal Component Analysis (PCA) summarizes multivariate trait variation across genotypes.
PCA biplots simultaneously visualize:
- genotype clustering
- trait loadings
- major axes of phenotypic variation.
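A minimal sketch of the PCA step with scikit-learn, using synthetic genotype means. Standardizing the traits first and scaling loadings by the square root of the explained variance are common conventions assumed here, not confirmed details of the pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic genotype means for four traits (illustrative only).
rng = np.random.default_rng(4)
n_geno = 20
grains = rng.normal(180, 15, n_geno)
means = pd.DataFrame({
    "Filled grains": grains,
    "Panicle length": 26 + 0.05 * (grains - 180) + rng.normal(0, 1, n_geno),
    "Plant height": rng.normal(110, 6, n_geno),
    "Grain yield": 38 + 0.1 * (grains - 180) + rng.normal(0, 1.5, n_geno),
}, index=[f"G{i + 1}" for i in range(n_geno)])

# Standardize so traits on different scales contribute equally.
Z = StandardScaler().fit_transform(means)

pca = PCA(n_components=2)
scores = pd.DataFrame(pca.fit_transform(Z), index=means.index,
                      columns=["PC1", "PC2"])  # genotype coordinates

# Loadings: trait vectors drawn as arrows on a biplot.
loadings = pd.DataFrame(
    pca.components_.T * np.sqrt(pca.explained_variance_),
    index=means.columns, columns=["PC1", "PC2"])

print(pca.explained_variance_ratio_)
print(loadings)
```

A biplot overlays `scores` as points and `loadings` as arrows on the same PC1–PC2 axes.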
Interactive dashboards built with Plotly and Holoviews allow dynamic exploration of genotype–trait relationships.
All outputs, including figures, statistical tables, and analysis summaries, are automatically exported and compiled into downloadable results and PDF reports.
The workflow is implemented using open-source Python libraries:
- pandas – data processing
- NumPy – numerical computation
- statsmodels – mixed linear models
- scikit-learn – machine learning and PCA
- NetworkX – network analysis
- Plotly / Holoviews – interactive visualization
All computations run in Google Colab, enabling automated and reproducible analysis pipelines.
Clone the repository.
git clone https://github.com/yourusername/rice-genotype-trait-analysis
cd rice-genotype-trait-analysis
Install dependencies.
pip install -r requirements.txt
Run the main analysis script.
python pipeline_colab.py
Upload your dataset when prompted (.xlsx or .csv).
The pipeline will automatically perform the complete genotype–trait analysis workflow.
The pipeline produces multiple statistical tables and figures automatically.
Tables:
- BLUP_matrix_genotype_x_traits.csv
- VarianceComponents_H2BLUP.csv
- PartialCorrelation_Glasso.csv
- YieldPredictors_LinearModel.csv
- RandomForest_importance.csv
- MGIDI_scores.csv
- HubTraits_degree_centrality.csv
- PCA_Loadings_All_Traits_All_PCs.csv

Figures:
- PartialCorrelation_heatmap.png
- Trait_network.png
- Yield_predictor_barplot.png
- PLS_predicted_vs_actual.png
- RandomForest_importance.png
- SHAP_feature_importance.png
- MGIDI_Top10.png
- Scatter_yield_vs_traits.png
- Chord_genotype_trait_network.png
- HI_ResponseSurface.png
- PCA_Biplot.png

All figures are exported at 600 dpi, suitable for journal submission.
Md Rezve
PhD Applicant – Plant Breeding & Quantitative Genetics
Research interests:
- Quantitative genetics
- Trait network analysis
- Machine learning in crop improvement
- Mutation breeding
- Multi-trait genotype selection
If you use this pipeline, please cite:
Md Rezve (2026). Rice Genotype–Trait Network and Yield Driver Analysis Pipeline. GitHub repository.
MIT License
⭐ If this repository helps your research, please consider starring the project.