Skip to content

Commit cd86ba2

Browse files
GSOC 2025 evaluation blog (#37)
* Initial commit for blog * First draft for evaluation blog * Added link to dingo PR * Updated blog
1 parent d6b1e56 commit cd86ba2

File tree

1 file changed

+354
-0
lines changed

1 file changed

+354
-0
lines changed
Lines changed: 354 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,354 @@
1+
---
2+
layout: single
3+
title: "Initiating te dingo-stats package for statistics on flux sampling datasets"
4+
date: 2025-08-02
5+
author: Sotirios Touliopoulos
6+
author_profile: true
7+
read_time: true
8+
comments: true
9+
share: true
10+
related: true
11+
hidden: false
12+
---
13+
14+
15+
# Introduction to dingo-stats: package for statistics on flux sampling datasets
16+
17+
18+
<div style="text-align:center;">
19+
<br>
20+
<b>Sotirios Touliopoulos</b>
21+
<br>
22+
<b>Contributor to Google Summer of Code 2025 with GeomScale</b>
23+
<br>
24+
</div>
25+
26+
27+
> #### Mentors of this project: Elias Tsigaridas, Apostolos Chalkis, Haris Zafeiropoulos, Ioannis Psarros
28+
29+
30+
## Goal of the project
31+
32+
The goal of this project is to introduce [dingo-stats](https://github.com/SotirisTouliopoulos/dingo-stats): a python package for statistical analysis of flux sampling datasets. Till now, all post-sampling statistics were integrated into the `dingo` package. However, as the package grew, it became clear that separating the statistical analysis into its own package would be beneficial. This separation allows for a more focused development of statistical methods and makes it easier for users to access and utilize these methods without needing to install the entire `dingo` package.
33+
34+
The `dingo-stats` package provides a set of functions for performing statistical tests and visualizing results given any flux sampling dataset (not restricted to `dingo` samples). The package is designed to be user-friendly with a series of [tutorials](https://github.com/SotirisTouliopoulos/dingo-stats/tree/master/tests) and [documentation](https://dingo-stats.readthedocs.io/en/latest/) available.
35+
36+
37+
## Updated features
38+
39+
Functions regarding computing pairwise reaction correlations, clustering and graphs analysis from `dingo` have been updated and moved to `dingo-stats`. The removal of these functions from `dingo` was completed in Pull Request [#118](https://github.com/GeomScale/dingo/pull/118). These functions will not be presented in this blog, as they were presented in [last year's blog](https://sotiristouliopoulos.github.io/dingo/).
40+
41+
42+
## New features
43+
44+
The `dingo-stats` package will include the following new features:
45+
- Functions for modifying metabolic models (changing the objective function, objective direction, setting reaction bounds)
46+
- Functions for inspecting Sampling Distributions
47+
- Functions for retrieving pathway information and mapping reaction IDs to KEGG IDs.
48+
- Functions used to identify loopy reactions in metabolic models.
49+
- Functions for performing Differential Flux Analysis, including statistical tests such as the Kolmogorov-Smirnov test and Hypergeometric test for pathway enrichment.
50+
51+
52+
## Modify Metabolic Models
53+
54+
The `modify_model()` function modifies a given cobra model. Users can change objective function, optimal percentage (lower bound of the objective will be fixed based on the optimal percentage), define custom reaction bounds and change objective direction.
55+
56+
- `cobra_model` is a cobra model object
57+
- `objective_function` is a string with the requested objective function ID to be assigned to the model
58+
- `optimal_percentage` is the percentage of the FBA solution to be set as lower bound to the objectie function.
59+
- `reaction_bounds` is a dictionary of custom reaction bounds to be assigned to the model’s reactions
60+
- `objective_direction` defines the direction to optimize the model: max and min are available options.
61+
62+
This function enables the user to create and sample metabolic models simulating different conditions. Here, 2 new models are created based on updating the initial model:
63+
64+
- One that maximizes for biomass production (asking at least almost 100% of the biomass maximum FBA value)
65+
66+
```python
67+
ec_cobra_model_condition_100 = modify_model(
68+
cobra_model = ec_cobra_model,
69+
objective_function = "BIOMASS_Ecoli_core_w_GAM",
70+
optimal_percentage = 100,
71+
objective_direction = "max"
72+
)
73+
```
74+
75+
- One that maximize for biomass production (asking at least 0% of the biomass maximum FBA value)
76+
77+
```python
78+
ec_cobra_model_condition_0 = modify_model(
79+
cobra_model = ec_cobra_model,
80+
objective_function = "BIOMASS_Ecoli_core_w_GAM",
81+
optimal_percentage = 0,
82+
objective_direction = "max"
83+
)
84+
```
85+
86+
Now you are ready to sample your edited model with the sampling methods of your choice
87+
88+
89+
## Inspect Sampling Distributions
90+
91+
The `plot_grid_reactions_flux()` function plots a grid of selected flux distributions and enables inspections of sampling results to get early insights. Distributions here are chosen based on a list of reaction IDs (subset from Glycolysis and Pentose Phosphate Pathways) derived from KEGG pathway information (see helper functions in next paragraph). The corresponding KEGG pathways tutorial is presented in the next section. From the grid we can detect: Left/Right Shifted, Normal, Fixed (transparent based on the tolerance parameter) or other uncommon distributions.
92+
93+
- `samples` is a numpy 2D array of the samples
94+
- `model_reactions` is a list containing strings of reaction IDs
95+
- `tolerance` is a tolerance level to make distribution plot transparent based on flux range
96+
- `nrows` is the number of rows for the plot
97+
- `ncols` is the number of columns for the plot
98+
- `title` is the title of the plot
99+
100+
plot_grid_reactions_flux(
101+
samples = samples_glycolysis_ppp,
102+
model_reactions = reaction_ids_glycolysis_ppp,
103+
nrows = 5,
104+
ncols = 5,
105+
tolerance = 1e-3,
106+
title = "Sampling Distributions")
107+
108+
<center>
109+
<img src="https://dingo-stats.readthedocs.io/en/latest/_images/grid_flux_distributions.png"><br><br>
110+
</center>
111+
112+
The `sampling_statistics()` function calculates statistics on flux distributions for a reaction of interest. This information can be used to identify significantly altered flux distributions between different conditions
113+
114+
- `samples` is a numpy 2D array of the samples
115+
- `model_reactions` is a list containing strings of reaction IDs
116+
- `reaction_id` is a reaction ID of the selected reaction to calculate statistics on
117+
118+
```python
119+
mean, min, max, std, skewness, kurtosis = sampling_statistics(
120+
samples = samples,
121+
model_reactions = ec_cobra_reaction_ids,
122+
reaction_id = "FRD7")
123+
124+
print(mean, min, max, std, skewness, kurtosis)
125+
```
126+
127+
```python
128+
479.1369619828837 0.14995320273677437 993.9157983127856 289.3315287517938 0.09951034266820558 -1.2064534542990524
129+
```
130+
131+
132+
## KEGG Pathway Information
133+
134+
A set of helper functions is available to retrieve KEGG pathway information and map reaction IDs to KEGG IDs. These helper functions will not be presented here, but are available in the [dingo-stats documentation](https://dingo-stats.readthedocs.io/en/latest/index.html). Here, we will present how to leverage the pathway information to analyze a sampling dataset.
135+
136+
The `subset_model_reactions_from_pathway_info()` function works given a dataFrame created with the `get_kegg_pathways_from_reaction_ids()` function and returns all reaction IDs affiliated with a given KEGG pathway name or ID.
137+
138+
```python
139+
PPP_from_name = subset_model_reactions_from_pathway_info(
140+
kegg_info_df = df_kegg_pathways,
141+
pathway_query = "Pentose phosphate pathway")
142+
143+
Glycolysis_from_name = subset_model_reactions_from_pathway_info(
144+
kegg_info_df = df_kegg_pathways,
145+
pathway_query = "Glycolysis / Gluconeogenesis")
146+
147+
Glycolysis_from_id = subset_model_reactions_from_pathway_info(
148+
kegg_info_df = df_kegg_pathways,
149+
pathway_query = "rn00010")
150+
151+
print(PPP_from_name)
152+
print(Glycolysis_from_name)
153+
print(Glycolysis_from_id)
154+
```
155+
156+
```python
157+
['FBA', 'FBP', 'GND', 'PFK', 'PGL', 'RPE', 'RPI', 'TKT1']
158+
['ALCD2x', 'ENO', 'FBA', 'FBP', 'GAPD', 'PFK', 'PGK', 'PGM', 'PPCK', 'PPS', 'PYK', 'TPI']
159+
['ALCD2x', 'ENO', 'FBA', 'FBP', 'GAPD', 'PFK', 'PGK', 'PGM', 'PPCK', 'PPS', 'PYK', 'TPI']
160+
```
161+
162+
The `dictionary_reaction_id_to_pathway()` function takes one or multiple lists containing reaction IDs (corresponding to KEGG pathways) and creates a dictionary that maps the IDs to pathway names. If a reaction appears in more than one pathway, it is classified with the term `Multiple-Pathways`. This is useful for plotting to work with subsets of reactions and to replace names from the `df_kegg_pathways` dataframe like `Glycolysis / Gluconeogenesis` to `Glycolysis` and `Pentose phosphate pathway` to `PPP`.
163+
164+
- `**named_lists` are named lists where each argument is a list of reaction IDs and the argument name represents the pathway name.
165+
166+
```python
167+
named_lists = {
168+
"Glycolysis": Glycolysis,
169+
"PPP": PPP
170+
}
171+
172+
bigg_to_pathway_dict = dictionary_reaction_id_to_pathway(
173+
named_lists)
174+
175+
print(bigg_to_pathway_dict.get("GND"))
176+
print(bigg_to_pathway_dict.get("ENO"))
177+
print(bigg_to_pathway_dict.get("FBA"))
178+
```
179+
180+
```python
181+
"Pentose phosphate pathway"
182+
"Glycolysis / Gluconeogenesis"
183+
"Multiple-Pathways"
184+
```
185+
186+
The `reaction_in_pathway_binary_matrix()` function is used to create a new pandas dataframe with reactions as rows and different pathways as columns. The corresponding cell of the dataframe will show if a reaction belongs to a certain pathway (1) or not (0). If a reaction belongs to more than one pathways, then the column: `Multiple-Pathways` is created and the reaction matching this will only get True (1) there and not in the individual pathway columns (e.g. 1 in `Multiple-Pathways`, 0 in `Glycolysis` and 0 in `PPP`).
187+
188+
- `reaction_id_to_pathway_dict` is dictionary mapping reaction IDs to pathway names created with the dictionary_reaction_id_to_pathway function
189+
190+
```python
191+
binary_df = reaction_in_pathway_binary_matrix(
192+
reaction_id_to_pathway_dict = bigg_to_pathway_dict)
193+
```
194+
195+
The `plot_reaction_in_pathway_heatmap()` function is used to plot a heatmap of the `binary_df` created from the `reaction_in_pathway_binary_matrix()` function to better illustrate the connection between reactions and pathways.
196+
197+
- `binary_df` is a pandas dataFrame with binary values (0 or 1)
198+
- `font_size` is the font size for axis labels and ticks
199+
- `fig_width` is the width of the figure in pixels
200+
- `fig_height` is the height of the figure in pixels
201+
- `title` is the title of the plot
202+
203+
```python
204+
plot_reaction_in_pathway_heatmap(
205+
binary_df = binary_df,
206+
font_size = 8,
207+
fig_width = 600,
208+
fig_height = 600,
209+
title = "" )
210+
```
211+
212+
<center>
213+
<img src="https://dingo-stats.readthedocs.io/en/latest/_images/heatmap_pathways_binary.png"><br><br>
214+
</center>
215+
216+
217+
## Identify Loopy Reactions
218+
219+
The `loops_enumeration_from_fva()` function computes twice a Flux Variability Analysis (with `loopless` parameter to `False` and `True`) and identifies possible loopy reaction in given model and returns them in a list. Each item in the list is a tuple of a string (reaction ID) and a float (loopy score returned from the function)
220+
221+
- `ec_cobra_model` is a cobra model object
222+
- `fraction_of_optimum` is a float variable that defines the fraction_of_optimum parameter used in `flux_variability_analysis`
223+
224+
```python
225+
loopy_reactions_fva = loops_enumeration_from_fva(
226+
ec_cobra_model = ec_cobra_model,
227+
fraction_of_optimum = 0.999)
228+
229+
print(loopy_reactions_fva)
230+
```
231+
232+
```python
233+
[('SUCDi', 994.7794007141794), ('FRD7', 995.0539767141795)]
234+
```
235+
236+
The `loopy_reactions_turned_off_in_pfba()` function identifies which reactions from the loopy reactions set calculated with the `loops_enumeration_from_fva()` function can be turned off when performing a `loopless_solution` applied to `pFBA` results
237+
238+
- `loopy_cobra_model` is a cobra model object possibly containing loops (default model without any preproccess)
239+
- `loopy_reactions` is a list with loopy_reactions with reactions classified as loopy
240+
- `tol` is a tolerance value used to classify a numeric change as important or not
241+
242+
```python
243+
turned_off_reactions = loopy_reactions_turned_off_in_pfba(
244+
loopy_cobra_model = ec_cobra_model,
245+
loopy_reactions = loopy_reactions,
246+
tol = 1e-6)
247+
248+
print(turned_off_reactions)
249+
```
250+
251+
```python
252+
['FRD7']
253+
```
254+
255+
256+
## Differential Flux Analysis
257+
258+
The `significantly_altered_reactions()` function takes as input at least two (different) flux sampling datasets to compare and identifies significantly altered reactions. It performs a Kolmogorov-Smirnov (KS) non-parametric test and corrects p-value for multiple comparisons. It additionally calculates a fold change that together with the p-value classifies reactions as significantly altered or not. It returns two lists, one with the significantly changed (`significant_diff_reactions`) and one with the not significantly changed (`not_significant_diff_reactions`) reactions. Also, two dictionaries, one mapping reaction IDs to corrected p_values (`pval_dict`) and one mapping reactions IDs to fold change values (`fold_change_dict`)
259+
260+
- `conditions` is a list of different flux sampling arrays
261+
- `selected_comparisons` is a list showing which conditions to compare (useful when comparing more than two sampling arrays)
262+
- `cobra_model` is a cobra model object
263+
- `fold_change_cutoff` is a cutoff for fold-change to consider two distributions significantly different
264+
- `std_cutoff` is a cutoff to ensure distributions that are compared are not fixed to a certain value
265+
266+
```python
267+
conditions = [samples_condition_1, samples_condition_2]
268+
selected_comparisons = [(0, 1)]
269+
270+
(significant_diff_reactions,
271+
not_significant_diff_reactions,
272+
pval_dict,
273+
fold_change_dict) = significantly_altered_reactions(
274+
conditions = conditions,
275+
selected_comparisons = selected_comparisons,
276+
cobra_model = ec_cobra_model,
277+
fold_change_cutoff = 0.6,
278+
std_cutoff = 1e-3)
279+
```
280+
281+
The `plot_volcano()` function creates a volcano plot to show results from differential flux analysis. Volcano plot has Fold Change on the x-axis and -log10(p_value) on the y-axis. Users can provide a reactions list in the annotate parameter, to show reaction IDs on the plot. Also, lines showing the significance cutoffs may optionally be added, when providing `p_value_cutoff`, `fold_change_cutoff` and `show_cutoff_lines`.
282+
283+
- `pval_dict` is a dictionary mapping reaction ID to corrected p-value.
284+
- `fold_change_dict` is a dictionary mapping reaction ID to fold-change value.
285+
- `p_value_cutoff` is a significance threshold for p-value (to appear in the plot).
286+
- `fold_change_cutoff` is a threshold for fold-change (to appear in the plot).
287+
- `annotate` is a list of reaction IDs to annotate on the plot.
288+
- `width` is the width of the figure in pixels.
289+
- `height` is the height of the figure in pixels.
290+
- `title` is the title of the plot.
291+
- `show_cutoff_lines` is a boolean variable for whether to draw p-value and fold-change cutoff lines.
292+
293+
```python
294+
reactions_to_annotate = ["NH4t"]
295+
296+
plot_volcano(
297+
pval_dict = pval_dict,
298+
fold_change_dict = fold_change_dict,
299+
p_value_cutoff = 0.05,
300+
fold_change_cutoff = 0.6,
301+
annotate = reactions_to_annotate,
302+
width = 800,
303+
height = 600,
304+
title = "",
305+
show_cutoff_lines = True)
306+
```
307+
308+
<center>
309+
<img src="https://dingo-stats.readthedocs.io/en/latest/_images/volcano_plot.png"><br><br>
310+
</center>
311+
312+
313+
The `hypergeometric_test_pathway_enrichment()` function performs a hypergeometric test to find significantly affected pathways between our sampling conditions. It also calculated fold enrichment and p_values useful for filtering significant changes
314+
315+
- `significant_reactions` is a list of reaction IDs with altered flux.
316+
- `model_reactions` is a list of all reaction IDs considered in the analysis.
317+
- `reaction_to_pathways` is a dictinary mapping reaction ID -> list of pathway names.
318+
319+
```python
320+
hypergeometric_pathway_enrichment_df = hypergeometric_test_pathway_enrichment(
321+
significant_reactions = significant_diff_reactions,
322+
model_reactions = ec_cobra_reaction_ids,
323+
reaction_to_pathways = reaction_to_pathways)
324+
```
325+
326+
The `plot_pathway_enrichment()` function takes as input the hypergeometric_pathway_enrichment_df dataframe created with the `hypergeometric_test_pathway_enrichment()` function and plots the enrichment results in a bubble plot. Bubble size stands for pathway size (number of reactions) and reaction colour stands for -log10(p-value)
327+
328+
- `hypergeometric_enrichment_df` is a dataframe from the hypergeometric_test_pathway_enrichment function.
329+
- `pval_threshold` is a significance p value threshold for filtering.
330+
- `use_fdr` is a boolean for whether to use ‘fdr’ or ‘pval’ for filtering.
331+
- `font_size` is the font size for plot labels and title.
332+
- `width` is the width of the figure in pixels.
333+
- `height` is the height of the figure in pixels.
334+
335+
```python
336+
plot_pathway_enrichment(
337+
hypergeometric_enrichment_df = hypergeometric_pathway_enrichment_df,
338+
pval_threshold = 2,
339+
use_fdr = True,
340+
font_size = 14,
341+
width = 900,
342+
height = 600)
343+
```
344+
345+
<center>
346+
<img src="https://dingo-stats.readthedocs.io/en/latest/_images/pathway_enrichment.png"><br><br>
347+
</center>
348+
349+
350+
## References
351+
352+
- Ebrahim, A.; Lerman, J. A.; Palsson, B. O.; Hyduke, D. R. COBRApy: COnstraints-Based Reconstruction and Analysis for Python. BMC Systems Biology 2013, 7 (1). https://doi.org/10.1186/1752-0509-7-74.
353+
354+
- Apostolos Chalkis; Vissarion Fisikopoulos; Tsigaridas, E.; Haris Zafeiropoulos. Dingo: A Python Package for Metabolic Flux Sampling. Bioinformatics Advances 2024, 4 (1). https://doi.org/10.1093/bioadv/vbae037.

0 commit comments

Comments
 (0)