Add basic ARG analysis page

jeromekelleher · jeromekelleher · commit 33a6dbec962f · 2025-11-21T15:55:58.000Z
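
The helper script added in this commit (`docs/make_sc2ts_arg_subset.py`) selects `k` evenly spaced positions from the sample list using `np.round(np.linspace(0, ts.num_samples - 1, k))`. As a rough illustration of that index selection only, here is a minimal pure-Python sketch. The function name `evenly_spaced_indices` is hypothetical and not part of sc2ts, tskit, or NumPy, and results could differ from NumPy's at exact floating-point ties.

```python
def evenly_spaced_indices(n, k):
    """Return k indices spread evenly over range(n), including both endpoints.

    Hypothetical helper mirroring round(linspace(0, n - 1, k)); it is an
    illustration, not part of the sc2ts or NumPy APIs.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    if k == 1:
        # linspace with a single point returns just the start value.
        return [0]
    step = (n - 1) / (k - 1)
    # Python's round() rounds halves to even, as np.round does.
    return [round(i * step) for i in range(k)]


print(evenly_spaced_indices(10, 4))  # [0, 3, 6, 9]
```

In the script itself, the resulting indices are used to pick sample node IDs from `ts.samples()` before calling `ts.simplify`.
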
diff --git a/docs/arg_analysis.md b/docs/arg_analysis.md
@@ -1,51 +1,69 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.12
+    jupytext_version: 1.9.1
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+```{eval-rst}
+.. currentmodule:: sc2ts
+```
+
 (sec_arg_analysis)=
 # ARG analysis
 
+The sc2ts API provides convenience functions that compute summary
+dataframes for the nodes and mutations in a sc2ts-output ARG.
 
-## ARG analysis API
 
-The sc2ts API provides two convenience functions to compute summary
-dataframes for the nodes and mutations in a sc2ts-output ARG.
+## Prerequisites
 
-To see some examples, first download the (31MB) sc2ts inferred ARG
-from [Zenodo](https://zenodo.org/records/17558489/):
+Download a subset of the [sc2ts Viridian ARG](https://zenodo.org/records/17558489/)
+with 1000 samples:
 
 ```
-curl -O https://zenodo.org/records/17558489/files/sc2ts_viridian_v1.2.trees.tsz
+curl -O https://raw.githubusercontent.com/tskit-dev/sc2ts/refs/heads/main/docs/sc2ts_viridian_v1.2_subset_1000.trees.tsz
 ```
 
-We can then use these like
+We'll use this small subset as an example throughout.
+
+## Loading
 
-```python
+
+```{code-cell}
 import sc2ts
 import tszip
 
-ts = tszip.load("sc2ts_viridian_v1.2.trees.tsz")
-
-df_node = sc2ts.node_data(ts)
-df_mutation = sc2ts.mutation_data(ts)
+ts = tszip.load("sc2ts_viridian_v1.2_subset_1000.trees.tsz")
 ```
 
-See the [live demo](https://tskit.dev/explore/lab/index.html?path=sc2ts.ipynb)
-for a browser based interactive demo of using these dataframes for
-real-time pandemic-scale analysis.
-
-## Dataset API
+You can then use the full [tskit](https://tskit.dev/tskit/docs/)
+Python API on this ARG.
 
-Sc2ts also provides a convenient API for accessing large-scale
-alignments and metadata stored in
-[VCF Zarr](https://doi.org/10.1093/gigascience/giaf049) format.
+## Node data
 
-Resources:
+The {func}`node_data` function returns a Pandas dataframe of data for each
+node in the ARG.
 
-- See this [notebook](https://github.com/jeromekelleher/sc2ts-paper/blob/main/notebooks/example_data_processing.ipynb)
-for an example in which we access the data variant-by-variant and
-which explains the low-level data encoding
-- See the [VCF Zarr publication](https://doi.org/10.1093/gigascience/giaf049)
-for more details on and benchmarks on this dataset.
+```{code-cell}
+dfn = sc2ts.node_data(ts)
+dfn
+```
 
 
-**TODO** Add some references to API documentation
+## Mutation data
 
+The {func}`mutation_data` function returns a Pandas dataframe of data for each
+mutation in the ARG.
 
+```{code-cell}
+dfm = sc2ts.mutation_data(ts)
+dfm
+```
 
diff --git a/docs/make_sc2ts_arg_subset.py b/docs/make_sc2ts_arg_subset.py
@@ -0,0 +1,13 @@
+import tszip
+import numpy as np
+
+ts = tszip.load("sc2ts_viridian_v1.2.trees.tsz")
+
+k = 1000
+idx = np.round(np.linspace(0, ts.num_samples - 1, k)).astype(int)
+
+subset = ts.samples()[idx]
+print(subset)
+tss = ts.simplify(subset, filter_sites=False)
+
+tszip.compress(tss, f"sc2ts_viridian_v1.2_subset_{k}.trees.tsz")
diff --git a/docs/sc2ts_viridian_v1.2_subset_1000.trees.tsz b/docs/sc2ts_viridian_v1.2_subset_1000.trees.tsz