Merge pull request #577 from tskit-dev/add-dataset-docs-1

jeromekelleher · web-flow · commit 06a36a74b764 · 2025-11-20T17:03:58.000Z
Add subset dataset and start on documenting dataset
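
For reviewers, the axis convention behind the newly documented `num_samples` and `num_variants` properties can be sketched without the real data. This is a minimal stand-in that assumes only what the `sc2ts/dataset.py` hunk shows (both properties read `call_genotype.shape`). `FakeArray`, `FakeRoot` and `ShapeOnlyDataset` are hypothetical names and are not part of sc2ts; any axes beyond the first two are ignored here.

```python
from dataclasses import dataclass


@dataclass
class FakeArray:
    # Hypothetical stand-in for a Zarr array; only the shape is modelled.
    shape: tuple


@dataclass
class FakeRoot:
    # Hypothetical stand-in for the dataset's root group.
    call_genotype: FakeArray


class ShapeOnlyDataset:
    """Illustrative only: mirrors the two properties documented in this PR."""

    def __init__(self, root):
        self.root = root

    @property
    def num_samples(self):
        # Samples are the second axis of call_genotype.
        return self.root.call_genotype.shape[1]

    @property
    def num_variants(self):
        # Variants (one per site) are the first axis of call_genotype.
        return self.root.call_genotype.shape[0]


# Shape taken from the docs text in this PR: 29903 sites, first 1000 samples.
ds = ShapeOnlyDataset(FakeRoot(FakeArray(shape=(29903, 1000))))
print(ds.num_samples, ds.num_variants)
```

With the real subset file, the same two attributes would be read from the object returned by `sc2ts.Dataset(...)`, as in the docs page added below.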
diff --git a/docs/alignments_analysis.md b/docs/alignments_analysis.md
@@ -1,2 +1,61 @@
+---
+jupytext:
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.12
+    jupytext_version: 1.9.1
+kernelspec:
+  display_name: Python 3
+  language: python
+  name: python3
+---
+
+```{eval-rst}
+.. currentmodule:: sc2ts
+```
+
+
 (sec_alignments_analysis)=
+
 # Alignments analysis
+
+## Prerequisites
+
+Download the first 1000 samples of the Viridian dataset (450K):
+
+```
+curl -O https://raw.githubusercontent.com/tskit-dev/sc2ts/refs/heads/main/docs/viridian_mafft_subset_1000_v1.vcz.zip
+```
+
+We'll use this small subset as an example throughout.
+
+
+## Loading and getting information
+
+To load up a dataset, we use the {class}`Dataset` constructor:
+
+
+```{code-cell}
+import sc2ts
+
+ds = sc2ts.Dataset("viridian_mafft_subset_1000_v1.vcz.zip")
+ds
+```
+
+When a dataset object is the last expression in a notebook cell (as here),
+a summary of its contents is displayed. Here, we're working with a small subset
+of the Viridian dataset consisting of the first 1000 samples at the 29903
+sites in the SARS-CoV-2 genome.
+
+The basic information is also available in the {attr}`Dataset.num_samples`
+and {attr}`Dataset.num_variants` attributes.
+
+To get information on the metadata fields that are present, we can use:
+
+
+```{code-cell}
+ds.metadata.field_descriptors()
+```
+
+
diff --git a/docs/build.sh b/docs/build.sh
@@ -6,7 +6,7 @@
 
 REPORTDIR=_build/html/reports
 
-jupyter-book build  .
+jupyter-book build -W .
 RETVAL=$?
 if [ $RETVAL -ne 0 ]; then
     if [ -e $REPORTDIR ]; then
diff --git a/docs/cli.md b/docs/cli.md
@@ -1,20 +1,22 @@
-.. _sc2ts_sec_cli:
 
-Command line interface
-======================
+(sc2ts_sec_cli)=
+
+# Command line interface
 
 The ``sc2ts`` package provides a command line interface for running
 inference and working with sc2ts datasets. After installation, the
-``sc2ts`` entry point should be available::
+``sc2ts`` entry point should be available:
 
-    $ sc2ts --help
+```
+$ sc2ts --help
+```
 
 You can also invoke the CLI via the module::
+```
+$ python -m sc2ts --help
+```
 
-    $ python -m sc2ts --help
-
-Order of high-level commands
-----------------------------
+## Order of high-level commands
 
 In a typical end-to-end workflow, the main subcommands are used in the
 following order:
@@ -28,11 +30,22 @@ following order:
 4. ``minimise-metadata`` to generate an analysis-ready ARG with compact
    metadata suitable for use with the Python analysis APIs.
 
-Below we list all subcommands and options provided by the CLI. This
-output is generated directly from the Click definitions in
-``sc2ts.cli`` using the ``sphinx-click`` extension, and so stays in
-sync with the implementation.
 
-.. click:: sc2ts.cli:cli
-   :prog: sc2ts
-   :nested: full
+## CLI reference
+
+<!-- Below we list all subcommands and options provided by the CLI. This -->
+<!-- output is generated directly from the Click definitions in -->
+<!-- ``sc2ts.cli`` using the ``sphinx-click`` extension, and so stays in -->
+<!-- sync with the implementation. -->
+
+:::{todo}
+Add the sphinx-click output here somehow.
+:::
+
+<!-- ```{eval-rst} -->
+<!-- .. click:: sc2ts.cli:cli -->
+<!--    :prog: sc2ts infer -->
+<!--    :nested: full -->
+<!-- ``` -->
+
+
diff --git a/docs/make_viridian_subset.py b/docs/make_viridian_subset.py
@@ -0,0 +1,12 @@
+import sc2ts
+
+ds = sc2ts.Dataset("../viridian_mafft_2024-10-14_v1.vcz.zip")
+print(ds)
+
+samples = ds["sample_id"][:]
+k = 1000
+samples = samples[:k]
+path = f"viridian_mafft_subset_{k}_v1.vcz"
+ds.copy(path, sample_id=samples)
+sc2ts.Dataset.create_zip(path, path + ".zip")
+
diff --git a/docs/viridian_mafft_subset_1000_v1.vcz.zip b/docs/viridian_mafft_subset_1000_v1.vcz.zip
diff --git a/sc2ts/dataset.py b/sc2ts/dataset.py
@@ -294,10 +294,18 @@ def variants_chunk_size(self):
 
     @property
     def num_samples(self):
+        """
+        Return the number of samples in this dataset.
+        """
         return self.root.call_genotype.shape[1]
 
     @property
     def num_variants(self):
+        """
+        Return the number of variants in this dataset. A "variant" here is simply
+        a site, and does not necessarily show variation across samples; the
+        terminology is borrowed from VCF Zarr.
+        """
         return self.root.call_genotype.shape[0]
 
     def __str__(self):