Reproducible pipelines for null-distribution and path-probability experiments on biomedical knowledge graphs (Hetionet permutations and related metapaths). This README is the single source for running everything end-to-end via Poe tasks.
# from repo root
cd environments
conda env create -f environment.yml # first time only
conda activate CAPP
cd .. # back to repo root

# Quick end-to-end validation (smoke-scale figures + model/comparison suite)
poe quick-test-end-to-end
# Full figure/diagnostic reproduction suite
poe reproduce-figures-full
# CI-oriented smoke command (figures + one model/comparison smoke target)
poe ci-smoke

If using conda directly:
conda run -n CAPP poe quick-test-end-to-end
conda run -n CAPP poe reproduce-figures-full
conda run -n CAPP poe ci-smoke

Run tasks with either:
- poe <task-name> (inside active CAPP)
- conda run -n CAPP poe <task-name> (without activation)

Poetry remains supported for dependency management, but canonical task execution is conda + poe.
- Hetionet hetmat and generated permutations are expected under data/:
  - Base graph: data/edges/*.sparse.npz
  - Generated perms: data/permutations/###.hetmat/edges/*.npz
- Downloaded prebuilt permutations land under data/downloads/hetionet-permutations/permutations/*.hetmat
- Empirical edge-frequency CSVs (produced by compute-edge-frequencies) land in results/empirical_edge_frequencies/.
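
To see which permutations are already on disk before launching a task, a small helper along the following lines may be useful. It is a sketch rather than part of the repository: the function name `list_permutation_ids` is hypothetical, and it relies only on the `data/permutations/###.hetmat` naming described in the layout above.

```python
from pathlib import Path


def list_permutation_ids(perm_root):
    """Return sorted integer IDs of directories named like 007.hetmat."""
    root = Path(perm_root)
    if not root.is_dir():
        return []
    ids = []
    for path in root.glob("*.hetmat"):
        stem = path.name[: -len(".hetmat")]
        # Keep only directories whose name is purely numeric before the suffix.
        if path.is_dir() and stem.isdigit():
            ids.append(int(stem))
    return sorted(ids)


if __name__ == "__main__":
    found = list_permutation_ids("data/permutations")
    print(f"{len(found)} local permutations: {found}")
```

Pointing the same helper at data/downloads/hetionet-permutations/permutations should work for the downloaded bundle, assuming it follows the same directory naming.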
- generate-permutations builds local degree-preserved permutations in data/permutations/.
  - Supports --count / --seed / --start.
- download-permutations fetches the full prebuilt Hetionet bundle (about 200 permutations) into data/downloads/hetionet-permutations/.
  - This task does not support --count.
  - Current source ZIP size is ~863 MB (about 823 MiB) before extraction.
  - Typical wall-clock estimate (download + extract):
    - fast links: ~4-12 minutes
    - slower links: ~15-35 minutes
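
The size and timing figures above can be sanity-checked with a little arithmetic. The sketch below converts decimal megabytes to mebibytes and estimates pure transfer time at a sustained link rate. The link rates in the example are hypothetical, and extraction time is not modelled, so real wall-clock time will be longer.

```python
def mb_to_mib(size_mb):
    """Convert decimal megabytes (10**6 bytes) to mebibytes (2**20 bytes)."""
    return size_mb * 10**6 / 2**20


def download_minutes(size_mb, link_mbit_per_s):
    """Transfer time in minutes for size_mb megabytes at a sustained link rate."""
    seconds = size_mb * 8 / link_mbit_per_s
    return seconds / 60


if __name__ == "__main__":
    print(f"863 MB = {mb_to_mib(863):.0f} MiB")
    for rate in (10, 20, 50):  # hypothetical sustained rates in Mbit/s
        minutes = download_minutes(863, rate)
        print(f"{rate} Mbit/s: {minutes:.1f} min (transfer only)")
```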
Use-cases:
- Core null/compositional pipeline and most figure scripts read from data/permutations/.
- compute-edge-frequencies (notebook 3 workflow) reads downloaded permutations from data/downloads/hetionet-permutations/permutations/.

All pipelines are encoded as Poethepoet tasks. Run from repo root:
poe --help # list tasks
poe <task-name> # run a task

fetch-hetmat – uses the active environment's python by default. If data/ already contains a hetmat, it validates metanode/metaedge counts; if not, it downloads Hetionet v1.0 JSON (https://github.com/dhimmel/hetionet/raw/76550e6c93fbe92124edc71725e8c7dd4ca8b1f5/hetnet/json/hetionet-v1.0.json.bz2) and builds the hetmat into data/. If needed, override the interpreter by setting CAPP_PY before running poe tasks.

generate-permutations or download-permutations – generate missing degree-preserved permutations (default target 50; skips existing) or fetch prebuilt ones
- Parameters: --count N (total permutations desired, incl. existing; default 50), --seed S (base seed; default 42), --start K (force starting index; default = next unused).
- Examples:
  - poe generate-permutations --count 10
  - poe generate-permutations --count 60 --seed 123
  - poe generate-permutations --start 20 --count 25

compute-edge-frequencies – empirical edge frequencies (feeds compositional notebooks)

train-null-models – null model training (results/null_models/)
- Defaults: train on permutations 1-20, validate on 21-30.
- Uses data/permutations/###.hetmat/edges/*.sparse.npz.
- Quick sample: poe train-null-models --edge-type CbG --training-perm-end 2 --validation-perm-start 3 --validation-perm-end 4 --skip-empirical-validation
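
Before a long training run it can help to confirm that every permutation in the training and validation ranges has the expected edge file. The sketch below builds those paths from the pattern quoted above. It assumes, without confirmation from the task source, that ranges are inclusive, that IDs are zero-padded to three digits, and that each file is named `<edge type>.sparse.npz`; `edge_paths` is a hypothetical helper name.

```python
from pathlib import Path


def edge_paths(perm_root, edge_type, start, end):
    """Build edge-file paths for permutations start..end (inclusive)."""
    root = Path(perm_root)
    return [
        # Assumed layout: <root>/<NNN>.hetmat/edges/<edge_type>.sparse.npz
        root / f"{perm:03d}.hetmat" / "edges" / f"{edge_type}.sparse.npz"
        for perm in range(start, end + 1)
    ]


if __name__ == "__main__":
    train = edge_paths("data/permutations", "CbG", 1, 20)
    valid = edge_paths("data/permutations", "CbG", 21, 30)
    missing = [p for p in train + valid if not p.exists()]
    print(f"{len(train)} training, {len(valid)} validation, {len(missing)} missing")
    for path in missing:
        print("missing:", path)
```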
compose-null – compositional null fitting
- Defaults: metapath CbGpPW (CbG -> GpPW) using validation permutations 21-30.
- Outputs to results/compositional_null/ (validation CSVs, summary, optional plot/checkpoint).
- Quick sample: poe compose-null --validation-perm-start 1 --validation-perm-end 2 --max-pairs 5000 --skip-plot
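
The decomposition of CbGpPW into CbG -> GpPW follows the Hetionet abbreviation convention, in which uppercase runs name node types and lowercase runs name edge kinds. A minimal sketch of that split is shown below; `split_metapath` is a hypothetical helper, not a function from this repository, and it deliberately rejects directed abbreviations that contain `<` or `>`.

```python
import re


def split_metapath(metapath):
    """Split an abbreviation such as CbGpPW into edge types such as CbG and GpPW."""
    # Tokens alternate between node types (uppercase) and edge kinds (lowercase).
    tokens = re.findall(r"[A-Z]+|[a-z]+", metapath)
    well_formed = (
        "".join(tokens) == metapath
        and len(tokens) >= 3
        and len(tokens) % 2 == 1
        and tokens[0].isupper()
    )
    if not well_formed:
        raise ValueError(f"cannot parse metapath abbreviation: {metapath!r}")
    # Each edge type is node + edge kind + node, sharing nodes with its neighbours.
    return ["".join(tokens[i:i + 3]) for i in range(0, len(tokens) - 2, 2)]


if __name__ == "__main__":
    for name in ("CbGpPW", "CtDaG", "CrCbG", "CbGaD"):
        print(name, "->", split_metapath(name))
```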
build-metapath-nulls – metapath null distributions
- Computes observed 2-edge metapath path probabilities and compares to compositional null predictions.
- Defaults: metapaths CbGpPW, CtDaG, CrCbG, CbGaD; model types rf and poly.
- Outputs to results/metapath_nulls/.
- Quick sample: poe build-metapath-nulls --metapath CbGpPW --model-type rf --max-pairs 5000 --skip-plot
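
One way to read "compositional" for a 2-edge metapath is that the expected number of paths between a source and a target is assembled from the two per-edge probabilities, summed over intermediate nodes. The sketch below illustrates that idea under an explicit independence assumption, using tiny made-up matrices. It is not taken from the repository's implementation, and `compose_two_edges` is a hypothetical name.

```python
def compose_two_edges(p_first, p_second):
    """Expected 2-edge path counts assuming the two edges are independent.

    p_first[s][m] is the probability of an edge from source s to intermediate m;
    p_second[m][t] is the probability of an edge from intermediate m to target t.
    """
    n_mid = len(p_second)
    n_target = len(p_second[0])
    if any(len(row) != n_mid for row in p_first):
        raise ValueError("inner dimensions of the two matrices do not match")
    return [
        [sum(row[m] * p_second[m][t] for m in range(n_mid)) for t in range(n_target)]
        for row in p_first
    ]


if __name__ == "__main__":
    # Made-up probabilities: 2 compounds x 3 genes, then 3 genes x 2 pathways.
    p_cbg = [[0.5, 0.0, 0.25], [0.0, 1.0, 0.0]]
    p_gppw = [[0.5, 0.0], [0.25, 0.5], [1.0, 0.0]]
    for row in compose_two_edges(p_cbg, p_gppw):
        print(row)
```

Comparing such predictions against path counts observed on the real graph is the kind of check this task reports, although the actual models (rf, poly) may combine the edge-level terms differently.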
validate-composition – compositional validation
- Validates compositional predictions across held-out permutations for default 2-hop metapaths.
- Outputs to results/compositional_validation/ (accuracy_by_metapath.csv, per_permutation_metrics.csv, validation_summary.json, optional plots).
- Quick sample: poe validate-composition --metapath CbGpPW --train-perms-end 2 --valid-perms-start 3 --valid-perms-end 4 --max-compared-pairs 20000 --skip-plot

analyze-composition-failures – failure analysis
- Runs stratified residual analysis by degree bins and reports where compositional predictions fail.
- Outputs to results/compositional_validation/ (failure_analysis.csv, degree_stratified_correlations.csv, correction_analysis.csv, optional plots).
- Quick sample: poe analyze-composition-failures --metapath CbGpPW --train-perms-end 2 --valid-perms-start 3 --valid-perms-end 4 --n-degree-bins 4 --samples-per-bin 20 --max-locations 10000 --skip-plot
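
As a rough picture of what a degree-stratified residual analysis involves, the sketch below groups records by degree bin and averages observed minus predicted within each bin. The bin edges, the record layout, and the function name are illustrative assumptions; the task itself may bin and summarise differently.

```python
from bisect import bisect_right
from collections import defaultdict


def mean_residual_by_degree_bin(records, bin_edges):
    """Average (observed - predicted) within each degree bin.

    records: iterable of (degree, observed, predicted) tuples.
    bin_edges: sorted lower bounds of bins 1..n; bin 0 holds smaller degrees.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for degree, observed, predicted in records:
        bin_index = bisect_right(bin_edges, degree)
        totals[bin_index] += observed - predicted
        counts[bin_index] += 1
    # Bins with no records are omitted from the result.
    return {b: totals[b] / counts[b] for b in sorted(counts)}


if __name__ == "__main__":
    # Made-up records; three edges give four bins.
    demo = [(1, 1.0, 0.5), (3, 0.0, 0.5), (10, 2.0, 1.0), (250, 5.0, 9.0)]
    print(mean_residual_by_degree_bin(demo, bin_edges=[5, 20, 100]))
```

A bin whose mean residual sits far from zero marks a degree range where the compositional prediction is systematically off.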
model-comparison-analysis
- Main outputs under results/model_comparison/<EDGE_TYPE>_results/:
  - model_comparison.csv
  - models_vs_analytical_comparison.csv
  - test_vs_empirical_comparison.csv (if empirical frequencies exist)
  - raw_logit_comparison.csv
  - probability_vs_raw_logit_comparison.csv
  - optional all-pairs exports:
    - <EDGE_TYPE>_all_model_predictions.csv(.gz)
    - <EDGE_TYPE>_predictions_by_degree.csv
    - <EDGE_TYPE>_predictions_metadata.json
- Quick sample (terminal smoke test): poe model-comparison-analysis --edge-type CtD --skip-plots --max-all-pairs 300000
- Notes:
  - All-pairs prediction export is guarded by --max-all-pairs (default 2,000,000) to avoid huge files on dense edge types.
  - For dense edge types, leave defaults (auto-skip) or disable with --no-generate-all-predictions.
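
To get a feel for when the all-pairs guard is likely to trigger, the sketch below compares the size of the full source x target grid with the cap. It assumes the guard counts every source-target pair and treats the cap as inclusive, neither of which is confirmed here. The node counts in the example (1,552 compounds, 137 diseases, 20,945 genes) are assumed Hetionet v1.0 values and should be checked against the local hetmat.

```python
def all_pairs_count(n_source, n_target):
    """Number of rows an all-pairs prediction export would contain."""
    return n_source * n_target


def should_export_all_pairs(n_source, n_target, max_all_pairs=2_000_000):
    """True when the full source x target grid fits under the cap (assumed inclusive)."""
    return all_pairs_count(n_source, n_target) <= max_all_pairs


if __name__ == "__main__":
    # Assumed node counts; verify against your own data before relying on them.
    compounds, diseases, genes = 1552, 137, 20945
    print("CtD pairs:", all_pairs_count(compounds, diseases))
    print("CtD under 300000 cap:", should_export_all_pairs(compounds, diseases, 300_000))
    print("CbG pairs:", all_pairs_count(compounds, genes))
    print("CbG under default cap:", should_export_all_pairs(compounds, genes))
```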
model-testing-summary
- Aggregates per-edge outputs from results/model_comparison/*_results/.
- Writes summaries to results/model_comparison_summary_with_degree/, including:
  - model_comparison_all_edges.csv
  - analytical_comparison_all_edges.csv
  - empirical_comparison_all_edges.csv
  - model_performance_summary.csv
  - graph_characteristics.csv
  - degree_analysis_summary.json
- Optional degree-analysis aggregation output: aggregate_degree_metrics.csv (if degree metrics exist or --run-degree-analysis is enabled)
- Quick sample: poe model-testing-summary --edge-type CtD --skip-plots

pathway-data-preparation (notebook 18a replacement)
- Builds degree-binned training data for pathway-NN benchmarks.
- Default metapath: CbGpPW (or run all defaults with --all-metapaths).
- Outputs under results/pathway_nn/training_data/:
  - <METAPATH>_training_data.csv
  - <METAPATH>_training_data_summary.json
  - pathway_data_preparation_run_summary.json
- Quick samples:
  - poe pathway-data-preparation --metapath CbGpPW
  - poe pathway-data-preparation --all-metapaths

pathway-train-random (notebook 18b replacement)
- Trains random baseline on degree-binned pathway data and writes benchmark metadata.
- Outputs:
  - results/pathway_nn/trained_models/<METAPATH>_Random.pkl
  - results/pathway_nn/benchmarks/<METAPATH>_Random_benchmark.json
  - results/pathway_nn/benchmarks/pathway_train_random_run_summary.json
- Quick sample: poe pathway-train-random --metapath CbGpPW

pathway-train-degree-product (notebook 18c replacement)
- Trains degree-product baseline on degree-binned pathway data.
- Uses notebook-parity synthetic edge-probability proxy columns for current compatibility.
- Outputs:
  - results/pathway_nn/trained_models/<METAPATH>_Degree_Product.pkl
  - results/pathway_nn/benchmarks/<METAPATH>_Degree_Product_benchmark.json
  - results/pathway_nn/benchmarks/pathway_train_degree_product_run_summary.json
- Quick sample: poe pathway-train-degree-product --metapath CbGpPW

pathway-train-negbin-glm (notebook 18d replacement)
- Trains Negative Binomial GLM baseline on degree-binned pathway data.
- Uses statsmodels NegBin GLM when available; falls back to deterministic log-linear fit if needed.
- Outputs:
  - results/pathway_nn/trained_models/<METAPATH>_NegBin_GLM.pkl
  - results/pathway_nn/benchmarks/<METAPATH>_NegBin_GLM_benchmark.json
  - results/pathway_nn/benchmarks/pathway_train_negbin_glm_run_summary.json
- Quick sample: poe pathway-train-negbin-glm --metapath CbGpPW

pathway-train-random-forest (notebook 18e replacement)
- Trains random forest baseline on degree-binned pathway data.
- Outputs:
  - results/pathway_nn/trained_models/<METAPATH>_Random_Forest.pkl
  - results/pathway_nn/benchmarks/<METAPATH>_Random_Forest_benchmark.json
  - results/pathway_nn/benchmarks/pathway_train_random_forest_run_summary.json
- Quick sample: poe pathway-train-random-forest --metapath CbGpPW

pathway-train-degree-signature-nn (notebook 18f replacement)
- Trains the Degree Signature neural network on degree-bin pathway features.
- Outputs:
  - results/pathway_nn/trained_models/<METAPATH>_Degree_Sig_NN.pt
  - results/pathway_nn/benchmarks/<METAPATH>_Degree_Sig_NN_benchmark.json
  - results/pathway_nn/visualizations/<METAPATH>_Degree_Sig_NN_validation.png
  - results/pathway_nn/intermediate/<METAPATH>_test_*.npy
  - results/pathway_nn/benchmarks/pathway_train_degree_signature_nn_run_summary.json
- Quick sample: poe pathway-train-degree-signature-nn --metapath CbGpPW --skip-plots

pathway-variance-estimation (notebook 18g replacement)
- Validates the trained Degree Sig NN against permutation graphs and estimates per-bin variance.
- If --n-inter-bins does not match the checkpoint input size, the task infers bins from the checkpoint and logs the adjustment.
- Outputs:
  - results/pathway_nn/variance_analysis/<METAPATH>_variance_estimates.csv
  - results/pathway_nn/variance_analysis/<METAPATH>_permutation_metrics.csv
  - results/pathway_nn/variance_analysis/<METAPATH>_validation_summary.json
  - results/pathway_nn/variance_analysis/permutation_<ID>_predictions.npy
  - results/pathway_nn/variance_analysis/all_permutations_results.npz
  - results/pathway_nn/variance_analysis/<METAPATH>_permutation_validation.png (unless --skip-plots)
- Quick sample: poe pathway-variance-estimation --metapath CbGpPW --n-permutations 2 --skip-plots

pathway-anomaly-detection (notebook 18h replacement)
- Computes positive anomalies (enrichment only), compares to DWPC, and exports ranked discoveries.
- If --n-inter-bins does not match the checkpoint input size, the task infers bins from the checkpoint and logs the adjustment.
- Outputs:
  - results/pathway_nn/anomaly_detection/<METAPATH>_all_anomalies.csv
  - results/pathway_nn/anomaly_detection/<METAPATH>_significant_anomalies.csv
  - results/pathway_nn/anomaly_detection/<METAPATH>_novel_discoveries.csv
  - results/pathway_nn/anomaly_detection/<METAPATH>_anomaly_summary.json
  - results/pathway_nn/anomaly_detection/<METAPATH>_volcano_plot.png (unless --skip-plots)
  - results/pathway_nn/anomaly_detection/<METAPATH>_dwpc_comparison.png (unless --skip-plots)
  - results/pathway_nn/anomaly_detection/<METAPATH>_anomaly_distributions.png (unless --skip-plots)
- Quick sample: poe pathway-anomaly-detection --metapath CbGpPW --max-pairs 5000 --skip-plots
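
The phrase "positive anomalies (enrichment only)" can be pictured as a one-sided z-score filter: pairs whose observed count exceeds the null prediction by enough standard deviations are kept and ranked, while depleted pairs are dropped. The sketch below illustrates that reading. The z-score definition, the threshold of 2.0, and the function name are assumptions, and the task's actual significance criteria may differ.

```python
import math


def positive_anomalies(pairs, z_threshold=2.0):
    """Rank enriched pairs by z-score.

    pairs: iterable of (pair_id, observed, predicted, variance) tuples.
    Returns (pair_id, z) tuples with z >= z_threshold, largest z first.
    """
    ranked = []
    for pair_id, observed, predicted, variance in pairs:
        if variance <= 0:
            continue  # no usable spread estimate for this pair
        z = (observed - predicted) / math.sqrt(variance)
        if z >= z_threshold:  # enrichment only: negative z is never kept
            ranked.append((pair_id, z))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up values for four source-target pairs.
    demo = [
        ("pair_a", 10.0, 4.0, 4.0),
        ("pair_b", 1.0, 4.0, 4.0),
        ("pair_c", 16.0, 4.0, 9.0),
        ("pair_d", 5.0, 4.0, 0.0),
    ]
    for pair_id, z in positive_anomalies(demo):
        print(pair_id, round(z, 2))
```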
learned-analytical (notebook 8 replacement)
- Trains/evaluates learned analytical formula variants and degree diagnostics.
- Outputs under results/learned_formula/.
- Quick sample: poe learned-analytical --edge-type CtD --n-candidates 2 3 5 --skip-comparison-plot

degree-conditioned-compositionality (notebook 11.x replacement)
- Degree-conditioned compositional analysis (Option A style) for a selected 2-hop metapath.
- Outputs under results/compositionality/.
- Quick sample: poe degree-conditioned-compositionality --metapath CbGpPW --perm-start 1 --perm-end 2 --max-pairs 5000 --skip-plots

degree-aware-compositional-model (notebook 12 replacement)
- Compares naive vs continuous degree-aware compositional predictions.
- Outputs under results/compositionality/degree_aware/.
- Quick sample: poe degree-aware-compositional-model --metapath CbGpPW --perm-start 1 --perm-end 2 --max-pairs 5000 --skip-plots

nn-architecture-exploration (notebook 21 replacement)
- Runs script-first optimizer/loss/architecture tests and saves analysis artifacts.
- Outputs under results/nn_optimizer_comparison/.
- Quick sample: poe nn-architecture-exploration --tests 1,2 --max-samples 5000 --max-epochs-linear 5 --patience-linear 3 --skip-plots --no-save-pkl

- make-pathcount-heatmaps – path-count variance figures (results/path_count_visualization/)
- assess-length-effects – length degradation analysis (results/length_degradation/)
- assess-sparsity – sparsity effects analysis (results/hierarchical_prediction/)
- assess-topology – topology/outlier diagnostics (results/topology_specific_outliers/)

All four scripts support --help and --smoke, and automatically use available local permutations when higher IDs are missing.

- baseline-pair-level – baseline pair-level
- pair-level-degree-correction – degree-aware correction
- feature-comparison – feature/binning comparisons
- control-experiments – control experiments
- linear-model-cv – linear CV
- degree-aware-correction-eval
- bias-diagnostics
- theoretical-correction-eval
- regularization-study
- gnn-variant-comparison – GNN/multitask tests (defaults to CbGpPW)
- focused-composition-tests – focused composition tests
- perm000-vs-permuted-comparison – perm000 vs perms comparison (supports --smoke)

- validate-dwpc – DWPC p-value validation suite
- smoke-figures – sequence task that runs smoke checks for all four figure/diagnostic scripts
- validate-figures – runs smoke-figures plus perm000-vs-permuted-comparison --smoke --skip-plots
- ci-smoke – CI-oriented smoke sequence (validate-figures + quick-baseline-pair-level)
- quick-test-end-to-end – smoke-scale end-to-end run across figures + model/comparison suite + focused composition
- reproduce-figures-full – full figure/diagnostic/model-comparison reproduction suite

poe fetch-hetmat
poe generate-permutations # or download-permutations
poe compute-edge-frequencies
poe train-null-models
poe compose-null
poe build-metapath-nulls
poe validate-composition
poe analyze-composition-failures
poe model-comparison-analysis --edge-type CtD --skip-plots --max-all-pairs 300000
poe model-testing-summary --edge-type CtD --skip-plots
poe pathway-data-preparation --metapath CbGpPW
poe pathway-train-random --metapath CbGpPW
poe pathway-train-degree-product --metapath CbGpPW
poe pathway-train-negbin-glm --metapath CbGpPW
poe pathway-train-random-forest --metapath CbGpPW
poe pathway-train-degree-signature-nn --metapath CbGpPW --skip-plots
poe pathway-variance-estimation --metapath CbGpPW --n-permutations 2 --skip-plots
poe pathway-anomaly-detection --metapath CbGpPW --max-pairs 5000 --skip-plots
poe learned-analytical --edge-type CtD --n-candidates 2 3 5 --skip-comparison-plot
poe degree-conditioned-compositionality --metapath CbGpPW --perm-start 1 --perm-end 2 --max-pairs 5000 --skip-plots
poe degree-aware-compositional-model --metapath CbGpPW --perm-start 1 --perm-end 2 --max-pairs 5000 --skip-plots
poe nn-architecture-exploration --tests 1,2 --max-samples 5000 --max-epochs-linear 5 --patience-linear 3 --skip-plots --no-save-pkl
# Figures / model-comparison analyses as needed

This is the canonical, script-backed set currently implemented for manuscript draft reproduction.
Prerequisites:
- Run from repo root with CAPP active.
- Ensure permutations exist under data/permutations/ (for manuscript-faithful runs, keep 000-020 available).

Run all currently implemented manuscript repro tasks:
poe repro-script-backed

Run in manuscript order (script-backed subset):
poe repro-manuscript-v1

Individual commands and outputs:
- Figure 2 (CbGpPWpG heatmap):
  - Command: poe repro-fig2-pathcount-heatmap
  - Output: results/path_count_visualization/CbGpPWpG_path_count_heatmap.png
- Figure 3 (model failures):
  - Command: poe repro-fig3-model-failures
  - Output: results/model_failures/model_failure_analysis.png
- Figure 4 (permutation similarity):
  - Command: poe repro-fig4-permutation-similarity
  - Output: results/permuations_similarlity/AeG_permutation_similarity.png
- Figure 13 (variance vs PMI):
  - Command: poe repro-fig13-variance-pmi
  - Output dir: results/variance_pmi/
- Figure 14 (perm0 vs perm-mean topology outliers):
  - Command: poe repro-fig14-topology-outliers
  - Output dir: results/topology_specific_outliers/
- Table 1 (count prediction performance):
  - Command: poe repro-table1-count-prediction
  - Output: results/model_comparison/table1_count_prediction.csv
- Figure 15:
  - Command: poe repro-fig3-model-failures (same backend/output as Figure 3)
  - Output: results/model_failures/model_failure_analysis.png
- Figure 16 (z-score + QQ calibration):
  - Command: poe repro-fig16-zscore-qq
  - Output dir: results/model_comparison/qq_and_zscore/
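
After a reproduction run, it may be convenient to check that each listed output actually exists. The sketch below does this with the task names and paths copied verbatim from the list above (including the spelling of the Figure 4 directory). The dictionary and the function name `missing_outputs` are hypothetical and not part of the repository.

```python
from pathlib import Path

# Task name -> expected output path, relative to the repo root.
EXPECTED_OUTPUTS = {
    "repro-fig2-pathcount-heatmap": "results/path_count_visualization/CbGpPWpG_path_count_heatmap.png",
    "repro-fig3-model-failures": "results/model_failures/model_failure_analysis.png",
    "repro-fig4-permutation-similarity": "results/permuations_similarlity/AeG_permutation_similarity.png",
    "repro-fig13-variance-pmi": "results/variance_pmi/",
    "repro-fig14-topology-outliers": "results/topology_specific_outliers/",
    "repro-table1-count-prediction": "results/model_comparison/table1_count_prediction.csv",
    "repro-fig16-zscore-qq": "results/model_comparison/qq_and_zscore/",
}


def missing_outputs(repo_root="."):
    """Return the tasks whose expected output file or directory is absent."""
    root = Path(repo_root)
    return [task for task, rel in EXPECTED_OUTPUTS.items() if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_outputs()
    if missing:
        print("Missing outputs for:", ", ".join(missing))
    else:
        print("All listed outputs are present.")
```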
- Heavy tasks may require HPC resources; adjust scripts accordingly.
- Superseded shell wrappers are archived under archive/scripts_legacy/.
- Legacy docs have been moved to archive/docs/ and will be deleted once reproducibility is confirmed.
- PYTHONPATH is set by poe tasks to include repo root for src/ imports.

This project used the AI assistants Claude and ChatGPT, developed by Anthropic and OpenAI respectively, during development. Their assistance included generating initial code snippets and improving documentation. All AI-generated content was reviewed, tested, and validated by human developers.