Commit e213d1b

jonbrenas, leehart and alimanfoo authored
Expansion of the documentation (#592)
* Starting to expand the documentation.
* Expanded the doc for some basic functions.
* Adding details to genome_features
* Correction of a typo
* Taking a breather before tackling sample_metadata()
* Taking a breather after sample_metadata()
* Sample metadata should be done.
* SNP calls done.
* Dealt with site_annotations, thanks to Eric for the help
* Dealt with biallelic_snp_calls
* Dealt with haplotypes.
* Adding AIMs
* Started with cnv_hmm. Not sure what I wrote is correct.
* STarted on cnv_coverage_calls. I have no idea what some of the variables are.
* Started work on discordant read calls. A few holes yet.
* Dealt with gene_cnv. Some unknowns remain.
* Learning how to count to 5
* Started on snp_allele_frequencies.
* Dealt with snp_allele_frequencies and gene_cnv_frequencies
* Missed anopheles.py ... again
* Dealt with pca
* Dealt with njt
* Dealt with roh_hmm
* Dealt with diversity stats
* Done (for now)
* Update malariagen_data/anoph/cnv_data.py
  Co-authored-by: Alistair Miles <[email protected]>
* Update malariagen_data/anoph/cnv_data.py
  Co-authored-by: Alistair Miles <[email protected]>
* Update malariagen_data/anoph/genome_features.py
  Co-authored-by: Alistair Miles <[email protected]>
* Update malariagen_data/anoph/sample_metadata.py
  Co-authored-by: Alistair Miles <[email protected]>
* Update malariagen_data/anopheles.py
  Co-authored-by: Alistair Miles <[email protected]>
* Replaced ** in the docs
* Missed two files
---------
Co-authored-by: Lee <[email protected]>
Co-authored-by: Alistair Miles <[email protected]>
1 parent f88f7b9 commit e213d1b

14 files changed

+368
-27
lines changed

malariagen_data/anoph/aim_data.py

Lines changed: 15 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -65,7 +65,11 @@ def _prep_aims_param(self, *, aims: aim_params.aims) -> str:
6565
@check_types
6666
@doc(
6767
summary="Access ancestry informative marker variants.",
68-
returns="A dataset containing AIM positions and discriminating alleles.",
68+
returns="""
69+
A dataset with 2 dimensions: `variants` the number of AIMs sites, and `alleles` which will always be 2, each representing one of the species. It contains 2 coordinates:
70+
`variant_contig` has `variants` values and contains the chromosome arm of each AIM, and `variant_position` has `variants` values and contains the position of each AIM. It contains 1 data variable:
71+
`variant_allele` has (`variants`, `alleles`) values and contains the discriminating alleles for each AIM.
72+
""",
6973
)
7074
def aim_variants(self, aims: aim_params.aims) -> xr.Dataset:
7175
self._require_aim_analysis()
@@ -113,7 +117,16 @@ def _aim_calls_dataset(self, *, aims, sample_set):
113117
calls.
114118
""",
115119
returns="""
116-
A dataset containing AIM SNP sites, alleles and genotype calls.
120+
A dataset with 4 dimensions:
121+
`variants` the number of AIMs sites,
122+
`samples` the number of samples,
123+
`ploidy` the ploidy (2),
124+
and `alleles` which will always be 2, each representing one of the species. It contains 3 coordinates:
125+
`sample_id` has `samples` values and contains the identifier of each sample,
126+
`variant_contig` has `variants` values and contains the chromosome arm of each AIM,
127+
and `variant_position` has `variants` values and contains the position of each AIM. It contains 2 data variables:
128+
`call_genotype` has (`variants`, `samples`, `ploidy`) values and contains both calls for each sample and each AIM,
129+
`variant_allele` has (`variants`, `alleles`) values and contains the discriminating alleles for each AIM.
117130
""",
118131
)
119132
def aim_calls(
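
To make the layout described in the `aim_calls` docstring above more concrete, here is a minimal sketch that builds a small synthetic `xarray.Dataset` with the same dimension, coordinate and data variable names. Every value is invented, the contig names and the `alleles` dimension name are assumptions, and nothing here calls the malariagen_data API.

```python
import numpy as np
import xarray as xr

n_variants, n_samples, ploidy, n_alleles = 3, 2, 2, 2

# Synthetic stand-in for the documented dataset (all values are invented).
ds = xr.Dataset(
    data_vars={
        "call_genotype": (
            ("variants", "samples", "ploidy"),
            np.array(
                [[[0, 0], [0, 1]], [[1, 1], [0, 1]], [[0, 0], [1, 1]]],
                dtype="int8",
            ),
        ),
        "variant_allele": (
            ("variants", "alleles"),
            np.array([["A", "T"], ["C", "G"], ["G", "A"]]),
        ),
    },
    coords={
        "sample_id": ("samples", np.array(["S1", "S2"])),
        "variant_contig": ("variants", np.array(["2L", "2L", "3R"])),
        "variant_position": ("variants", np.array([100, 250, 75])),
    },
)

# Per-sample fraction of calls carrying the second allele (index 1),
# summing over the `variants` and `ploidy` axes.
gt = ds["call_genotype"].values
frac_alt = (gt == 1).sum(axis=(0, 2)) / (n_variants * ploidy)
```

Reducing over axes 0 and 2 leaves one value per sample, which is the kind of summary an ancestry analysis might start from.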

malariagen_data/anoph/base.py

Lines changed: 14 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -397,7 +397,16 @@ def _read_sample_sets(self, *, single_release: str):
397397
@check_types
398398
@doc(
399399
summary="Access a dataframe of sample sets",
400-
returns="A dataframe of sample sets, one row per sample set.",
400+
returns="""A dataframe of sample sets, one row per sample set. It contains eight columns:
401+
`sample_set` is the name of the sample set,
402+
`sample_count` is the number of samples the sample set contains,
403+
`study_id` is the identifier for the study that generated the sample set,
404+
`study_url` is the URL of the study on the MalariaGEN website,
405+
`terms_of_use_expiry_date` is the date when the terms of use expire,
406+
`terms_of_use_url` is the URL of the terms of use,
407+
`release` is the identifier of the release containing the sample set,
408+
`unrestricted_use` indicates whether the sample set can be used without restriction (e.g., if the terms of use have expired).
409+
""",
401410
)
402411
def sample_sets(
403412
self,
@@ -441,6 +450,7 @@ def sample_sets(
441450
@check_types
442451
@doc(
443452
summary="Find which release a sample set was included in.",
453+
returns="The release the sample set is part of.",
444454
)
445455
def lookup_release(self, sample_set: base_params.sample_set) -> str:
446456
if self._cache_sample_set_to_release is None:
@@ -455,6 +465,7 @@ def lookup_release(self, sample_set: base_params.sample_set) -> str:
455465
@check_types
456466
@doc(
457467
summary="Find which study a sample set belongs to.",
468+
returns="The study the sample set belongs to.",
458469
)
459470
def lookup_study(self, sample_set: base_params.sample_set) -> str:
460471
if self._cache_sample_set_to_study is None:
@@ -468,6 +479,7 @@ def lookup_study(self, sample_set: base_params.sample_set) -> str:
468479
@check_types
469480
@doc(
470481
summary="Find the study info for a sample set.",
482+
returns="The info for the study the sample set belongs to.",
471483
)
472484
def lookup_study_info(self, sample_set: base_params.sample_set) -> dict:
473485
if self._cache_sample_set_to_study_info is None:
@@ -483,6 +495,7 @@ def lookup_study_info(self, sample_set: base_params.sample_set) -> dict:
483495
@check_types
484496
@doc(
485497
summary="Find the terms-of-use info for a sample set.",
498+
returns="The terms-of-use info for the sample set.",
486499
)
487500
def lookup_terms_of_use_info(self, sample_set: base_params.sample_set) -> dict:
488501
if self._cache_sample_set_to_terms_of_use_info is None:

malariagen_data/anoph/cnv_data.py

Lines changed: 41 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -170,7 +170,19 @@ def _cnv_hmm_dataset(self, *, contig, sample_set, inline_array, chunks):
170170
@check_types
171171
@doc(
172172
summary="Access CNV HMM data from CNV calling.",
173-
returns="An xarray dataset of CNV HMM calls and associated data.",
173+
returns="""A dataset with 2 dimensions:
174+
`variants` the number of CNV regions in the selected region,
175+
`samples` the number of samples. There are 4 coordinates:
176+
`variant_position` has `variants` values and contains the initial position of each CNV region,
177+
`variant_end` has `variants` values and contains the final position of each CNV region,
178+
`variant_contig` has `variants` values and contains the contig of each CNV region,
179+
`sample_id` has `samples` values and contains the identifier of each sample. It contains 5 data variables:
180+
`call_CN`, it has (`variants`, `samples`) values and contains the number of copies for each sample and each CNV region,
181+
`call_RawCov`, it has (`variants`, `samples`) values and contains the raw coverage for each sample and each CNV region,
182+
`call_NormCov`, it has (`variants`, `samples`) values and contains the normalized coverage for each sample and each CNV region,
183+
`sample_coverage_variance`, it has `samples` values and contains the variance of the coverage for each sample,
184+
`sample_is_high_variance`, it has `samples` values and contains whether each sample has a high variance.
185+
""",
174186
)
175187
def cnv_hmm(
176188
self,
@@ -377,7 +389,19 @@ def _cnv_coverage_calls_dataset(
377389
@check_types
378390
@doc(
379391
summary="Access CNV HMM data from genome-wide CNV discovery and filtering.",
380-
returns="An xarray dataset of CNV alleles and genotypes.",
392+
returns="""A dataset with 2 dimensions:
393+
`variants` the number of CNV regions in the selected region,
394+
`samples` the number of samples. There are 5 coordinates:
395+
`variant_position` has `variants` values and contains the initial position of each CNV region,
396+
`variant_end` has `variants` values and contains the final position of each CNV region,
397+
`variant_contig` has `variants` values and contains the contig of each CNV region,
398+
`variant_id` has `variants` values and contains the identifier for each CNV region,
399+
`sample_id` has `samples` values and contains the identifier of each sample. It contains 4 data variables:
400+
`variant_CIPOS`, it has `variants` values and contains the confidence interval for the start position for each CNV region,
401+
`variant_CIEND`, it has `variants` values and contains the confidence interval for the end position for each CNV region,
402+
`variant_filter_pass`, it has `variants` values and is True for each CNV region that passes quality filters,
403+
`call_genotype`, it has (`variants`, `samples`) values and contains the coverage call for each sample and each CNV region.
404+
""",
381405
)
382406
def cnv_coverage_calls(
383407
self,
@@ -533,7 +557,21 @@ def _cnv_discordant_read_calls_dataset(
533557
@check_types
534558
@doc(
535559
summary="Access CNV discordant read calls data.",
536-
returns="An xarray dataset of CNV alleles and genotypes.",
560+
returns="""A dataset with 2 dimensions:
561+
`variants` the number of discordant read calls in the selected region,
562+
`samples` the number of samples. There are 5 coordinates:
563+
`variant_position` has `variants` values and contains the initial position of each discordant read call,
564+
`variant_end` has `variants` values and contains the final position of each discordant read call,
565+
`variant_id` has `variants` values and contains the identifier of each discordant read call,
566+
`variant_contig` has `variants` values and contains the contig of each discordant read call,
567+
`sample_id` has `samples` values and contains the identifier of each sample. It contains 6 data variables:
568+
`variant_Region`, it has `variants` values and contains the identifier of the region covered by each discordant read call,
569+
`variant_StartBreakpointMethod`, it has `variants` values and specifies how the start breakpoint was determined for each discordant read call,
570+
`variant_EndBreakpointMethod`, it has `variants` values and specifies how the end breakpoint was determined for each discordant read call,
571+
`call_genotype`, it has (`variants`, `samples`) values and contains the number of copies of each discordant read call for each sample,
572+
`sample_coverage_variance`, it has `samples` values and contains the variance of the coverage for each sample,
573+
`sample_is_high_variance`, it has `samples` values and contains whether each sample has a high variance.
574+
""",
537575
)
538576
def cnv_discordant_read_calls(
539577
self,

malariagen_data/anoph/distance.py

Lines changed: 6 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -86,6 +86,7 @@ def __init__(self, **kwargs):
8686
summary="""
8787
Compute pairwise distances between samples using biallelic SNP genotypes.
8888
""",
89+
returns=("dist", "samples", "n_snps_used"),
8990
)
9091
def biallelic_diplotype_pairwise_distances(
9192
self,
@@ -107,7 +108,9 @@ def biallelic_diplotype_pairwise_distances(
107108
random_seed: base_params.random_seed = 42,
108109
inline_array: base_params.inline_array = base_params.inline_array_default,
109110
chunks: base_params.chunks = base_params.native_chunks,
110-
) -> Tuple[np.ndarray, np.ndarray, int]:
111+
) -> Tuple[
112+
distance_params.dist, distance_params.samples, distance_params.n_snps_used
113+
]:
111114
# Change this name if you ever change the behaviour of this function, to
112115
# invalidate any previously cached data.
113116
name = "biallelic_diplotype_pairwise_distances"
@@ -234,6 +237,7 @@ def _biallelic_diplotype_pairwise_distances(
234237
summary="""
235238
Construct a neighbour-joining tree between samples using biallelic SNP genotypes.
236239
""",
240+
returns=("Z", "samples", "n_snps_used"),
237241
)
238242
def njt(
239243
self,
@@ -260,7 +264,7 @@ def njt(
260264
random_seed: base_params.random_seed = 42,
261265
inline_array: base_params.inline_array = base_params.inline_array_default,
262266
chunks: base_params.chunks = base_params.native_chunks,
263-
) -> Tuple[np.ndarray, np.ndarray, int]:
267+
) -> Tuple[distance_params.Z, distance_params.samples, distance_params.n_snps_used]:
264268
# Change this name if you ever change the behaviour of this function, to
265269
# invalidate any previously cached data.
266270
name = "njt_v1"
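
The `returns=(...)` tuples and the `Tuple[...]` annotations above rely on `Annotated` aliases that carry their own descriptions. How the `@doc` decorator consumes them is not shown in this diff, so the sketch below only illustrates, with the standard library, how such descriptions can be read back from a return annotation. `example` and `describe_return` are hypothetical names, and plain built-in types stand in for `np.ndarray`.

```python
from typing import Annotated, Tuple, get_args, get_type_hints

# Hypothetical aliases modelled on the ones in the diff (built-in types are
# used so that the sketch has no third-party dependencies).
samples = Annotated[list, "The list of the sample identifiers"]
n_snps_used = Annotated[int, "The number of SNPs used"]


def example() -> Tuple[samples, n_snps_used]:
    return ["S1", "S2"], 10


def describe_return(func):
    """Return the description attached to each element of the return tuple."""
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    descriptions = []
    for alias in get_args(hints["return"]):
        # For Annotated[T, doc], get_args yields (T, doc).
        _, doc = get_args(alias)
        descriptions.append(doc)
    return descriptions


docs = describe_return(example)
```

Keeping the description next to the type means a single alias can document the same return value wherever it is reused.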

malariagen_data/anoph/distance_params.py

Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,8 @@
22

33
from typing_extensions import Annotated, TypeAlias
44

5+
import numpy as np
6+
57
distance_metric: TypeAlias = Annotated[
68
Literal[
79
"cityblock",
@@ -20,6 +22,32 @@
2022

2123
default_nj_algorithm: nj_algorithm = "dynamic"
2224

25+
dist: TypeAlias = Annotated[
26+
np.ndarray,
27+
"""
28+
A numpy array containing the distance between each pair of samples.
29+
""",
30+
]
31+
32+
Z: TypeAlias = Annotated[
33+
np.ndarray,
34+
"""
35+
A neighbour-joining tree encoded as a numpy array. Each row in the
36+
array contains data for one internal node in the tree, in the order
37+
in which they were created by the neighbour-joining algorithm.
38+
Within each row there are five values: left child node identifier,
39+
right child node identifier, distance to left child, distance to
40+
right child, total number of leaves. This data structure is similar
41+
to that returned by scipy's hierarchical clustering functions,
42+
except that here we have two distance values for each internal node
43+
rather than one because distances to the children may be different.
44+
""",
45+
]
46+
47+
samples: TypeAlias = Annotated[np.ndarray, "The list of the sample identifiers"]
48+
49+
n_snps_used: TypeAlias = Annotated[int, "The number of SNPs used"]
50+
2351
center_x: TypeAlias = Annotated[int | float, "X coordinate where plotting is centered."]
2452

2553
center_y: TypeAlias = Annotated[int | float, "Y coordinate where plotting is centered."]
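
The description of `Z` above can be exercised on a tiny invented tree. The sketch assumes that leaves are numbered from 0 to n-1 and that the internal node stored in row i has identifier n + i, as in scipy's linkage matrices; the docstring does not state this convention, so treat it as an assumption. `leaves_under` is a hypothetical helper.

```python
import numpy as np

# Invented 4-leaf tree. Columns: left child, right child, distance to left
# child, distance to right child, total number of leaves.
Z = np.array(
    [
        [0, 1, 0.10, 0.20, 2],
        [2, 3, 0.30, 0.10, 2],
        [4, 5, 0.05, 0.05, 4],
    ]
)
n_leaves = 4


def leaves_under(Z, node, n_leaves):
    """Collect the leaf identifiers below a node by walking its children."""
    if node < n_leaves:
        return [node]
    row = Z[node - n_leaves]
    left, right = int(row[0]), int(row[1])
    return leaves_under(Z, left, n_leaves) + leaves_under(Z, right, n_leaves)


# Assumed convention: the last row holds the final node that was created.
root = n_leaves + len(Z) - 1
root_leaves = leaves_under(Z, root, n_leaves)

# The fifth column should agree with the number of leaves found by traversal.
counts_match = all(
    len(leaves_under(Z, n_leaves + i, n_leaves)) == int(Z[i, 4])
    for i in range(len(Z))
)
```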

malariagen_data/anoph/fst_params.py

Lines changed: 5 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -26,7 +26,11 @@
2626
df_pairwise_fst: TypeAlias = Annotated[
2727
pd.DataFrame,
2828
"""
29-
A dataframe of pairwise Fst and standard error values.
29+
A dataframe of pairwise Fst and standard error values. It has
30+
4 columns:
31+
`cohort1` and `cohort2` are the two cohorts,
32+
`fst` is the value of the Fst between the two cohorts,
33+
`se` is the standard error.
3034
""",
3135
]
3236
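
A long-format table with the four columns listed above is often easier to read as a square matrix. The sketch below uses invented cohort labels and Fst values, and it assumes each pair of cohorts appears once in the dataframe.

```python
import pandas as pd

# Invented values in the documented column layout.
df = pd.DataFrame(
    {
        "cohort1": ["A", "A", "B"],
        "cohort2": ["B", "C", "C"],
        "fst": [0.10, 0.25, 0.05],
        "se": [0.01, 0.02, 0.01],
    }
)

cohorts = sorted(set(df["cohort1"]) | set(df["cohort2"]))
matrix = pd.DataFrame(0.0, index=cohorts, columns=cohorts)
for row in df.itertuples(index=False):
    # Fst between two cohorts is symmetric, so fill both triangles.
    matrix.loc[row.cohort1, row.cohort2] = row.fst
    matrix.loc[row.cohort2, row.cohort1] = row.fst
```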

malariagen_data/anoph/genome_features.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -119,7 +119,7 @@ def _prep_gff_attributes(
119119
@check_types
120120
@doc(
121121
summary="Access genome feature annotations.",
122-
returns="A dataframe of genome annotations, one row per feature.",
122+
returns="A dataframe of genome annotations, one row per feature. The dataframe follows the GFF3 format (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md), including extra attributes `ID`, `Parent`, `Name` and `description` depending on the dataset.",
123123
)
124124
def genome_features(
125125
self,

malariagen_data/anoph/hap_data.py

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -321,7 +321,17 @@ def _haplotypes_for_contig(
321321
@check_types
322322
@doc(
323323
summary="Access haplotype data.",
324-
returns="A dataset of haplotypes and associated data.",
324+
returns="""A dataset with 4 dimensions:
325+
`variants` the number of sites in the selected region,
326+
`alleles` the number of alleles (2),
327+
`samples` the number of samples,
328+
and `ploidy` the ploidy (2). There are 3 coordinates:
329+
`variant_position` has `variants` values and contains the position of each site,
330+
`variant_contig` has `variants` values and contains the contig of each site,
331+
`sample_id` has `samples` values and contains the identifier of each sample. The data variables are:
332+
`variant_allele`, it has (`variants`, `alleles`) values and contains the reference followed by the alternate allele for each site,
333+
`call_genotype`, it has (`variants`, `samples`, `ploidy`) values and contains both calls for each site and each sample.
334+
""",
325335
)
326336
def haplotypes(
327337
self,

malariagen_data/anoph/het_params.py

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -47,7 +47,13 @@
4747
pd.DataFrame,
4848
"""
4949
A DataFrame where each row provides data about a single run of
50-
homozygosity.
50+
homozygosity. The columns are:
51+
`sample_id` containing the identifier of the sample,
52+
`contig` containing the contig,
53+
`roh_start` containing the start of the run of homozygosity,
54+
`roh_stop` containing the end of the run of homozygosity,
55+
`roh_length` containing the length of the run of homozygosity,
56+
`roh_is_marginal` containing whether the run of homozygosity is marginal.
5157
""",
5258
]
5359
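
With the columns listed above, per-sample summaries take one `groupby`. The sketch below works on invented rows; dropping the runs flagged as marginal is only an illustration of how `roh_is_marginal` could be used, not a recommendation taken from the source.

```python
import pandas as pd

# Invented rows in the documented column layout.
df_roh = pd.DataFrame(
    {
        "sample_id": ["S1", "S1", "S2"],
        "contig": ["2L", "3R", "2L"],
        "roh_start": [1_000, 5_000, 200],
        "roh_stop": [151_000, 305_000, 120_200],
        "roh_length": [150_000, 300_000, 120_000],
        "roh_is_marginal": [False, True, False],
    }
)

# Keep the runs that are not flagged as marginal, then total their length
# for each sample.
kept = df_roh[~df_roh["roh_is_marginal"]]
total_roh = kept.groupby("sample_id")["roh_length"].sum()
```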

malariagen_data/anoph/pca_params.py

Lines changed: 60 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -15,8 +15,66 @@
1515
df_pca: TypeAlias = Annotated[
1616
pd.DataFrame,
1717
"""
18-
A dataframe of sample metadata, with columns "PC1", "PC2", "PC3",
19-
etc., added.
18+
A dataframe of projections along principal components, one row per sample. The columns are:
19+
`sample_id` is the identifier of the sample,
20+
`partner_sample_id` is the identifier of the sample used by the partners who contributed it,
21+
`contributor` is the partner who contributed the sample,
22+
`country` is the country the sample was collected in,
23+
`location` is the location the sample was collected in,
24+
`year` is the year the sample was collected,
25+
`month` is the month the sample was collected,
26+
`latitude` is the latitude of the location the sample was collected in,
27+
`longitude` is the longitude of the location the sample was collected in,
28+
`sex_call` is the sex of the sample,
29+
`sample_set` is the sample set containing the sample,
30+
`release` is the release containing the sample,
31+
`quarter` is the quarter of the year the sample was collected,
32+
`study_id` is the identifier of the study the sample set containing the sample came from,
33+
`study_url` is the URL of the study the sample set containing the sample came from,
34+
`terms_of_use_expiry_date` is the date the terms of use for the sample expire,
35+
`terms_of_use_url` is the URL of the terms of use for the sample,
36+
`unrestricted_use` indicates whether the sample can be used without restrictions (e.g., if the terms of use have expired),
37+
`mean_cov` is the mean value of the coverage,
38+
`median_cov` is the median value of the coverage,
39+
`modal_cov` is the mode of the coverage,
40+
`mean_cov_2L` is the mean value of the coverage on 2L,
41+
`median_cov_2L` is the median value of the coverage on 2L,
42+
`mode_cov_2L` is the mode of the coverage on 2L,
43+
`mean_cov_2R` is the mean value of the coverage on 2R,
44+
`median_cov_2R` is the median value of the coverage on 2R,
45+
`mode_cov_2R` is the mode of the coverage on 2R,
46+
`mean_cov_3L` is the mean value of the coverage on 3L,
47+
`median_cov_3L` is the median value of the coverage on 3L,
48+
`mode_cov_3L` is the mode of the coverage on 3L,
49+
`mean_cov_3R` is the mean value of the coverage on 3R,
50+
`median_cov_3R` is the median value of the coverage on 3R,
51+
`mode_cov_3R` is the mode of the coverage on 3R,
52+
`mean_cov_X` is the mean value of the coverage on X,
53+
`median_cov_X` is the median value of the coverage on X,
54+
`mode_cov_X` is the mode of the coverage on X,
55+
`frac_gen_cov` is the fraction of the genome covered,
56+
`divergence` is the divergence,
57+
`contam_pct` is the percentage of contamination,
58+
`contam_LLR` is the log-likelihood ratio of contamination,
59+
`aim_species_fraction_arab` is the fraction of the gambcolu vs. arabiensis AIMs that indicated arabiensis (this column is only present for *Ag3*),
60+
`aim_species_fraction_colu` is the fraction of the gambiae vs. coluzzii AIMs that indicated coluzzii (this column is only present for *Ag3*),
61+
`aim_species_fraction_colu_no2l` is the fraction of the gambiae vs. coluzzii AIMs that indicated coluzzii, not including the chromosome arm 2L which contains an introgression (this column is only present for *Ag3*),
62+
`aim_species_gambcolu_arabiensis` is the taxonomic group assigned by the gambcolu vs. arabiensis AIMs (this column is only present for *Ag3*),
63+
`aim_species_gambiae_coluzzii` is the taxonomic group assigned by the gambiae vs. coluzzii AIMs (this column is only present for *Ag3*),
64+
`aim_species` is the taxonomic group assigned by the combination of both AIMs analyses (this column is only present for *Ag3*),
65+
`country_iso` is the ISO code of the country the sample was collected in,
66+
`admin1_name` is the name of the first administrative level the sample was collected in,
67+
`admin1_iso` is the ISO code of the first administrative level the sample was collected in,
68+
`admin2_name` is the name of the second administrative level the sample was collected in,
69+
`taxon` is the taxon assigned to the sample by the combination of the AIMs analysis and the cohort analysis,
70+
`cohort_admin1_year` is the cohort the sample belongs to when samples are grouped by first administrative level and year,
71+
`cohort_admin1_month` is the cohort the sample belongs to when samples are grouped by first administrative level and month,
72+
`cohort_admin1_quarter` is the cohort the sample belongs to when samples are grouped by first administrative level and quarter,
73+
`cohort_admin2_year` is the cohort the sample belongs to when samples are grouped by second administrative level and year,
74+
`cohort_admin2_month` is the cohort the sample belongs to when samples are grouped by second administrative level and month,
75+
`cohort_admin2_quarter` is the cohort the sample belongs to when samples are grouped by second administrative level and quarter,
76+
`PC?` is the projection along principal component ? (? being an integer between 1 and the number of components). There are as many such columns as components,
77+
`pca_fit` is whether this sample was used for fitting.
2078
""",
2179
]
2280
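
Because the number of `PC?` columns depends on the number of components, it can help to select them by name pattern instead of hard-coding them. The sketch below uses an invented dataframe with only a few of the documented columns.

```python
import pandas as pd

# Invented rows with a small subset of the documented columns.
df_pca = pd.DataFrame(
    {
        "sample_id": ["S1", "S2", "S3"],
        "taxon": ["gambiae", "coluzzii", "gambiae"],
        "PC1": [0.5, -1.2, 0.7],
        "PC2": [0.1, 0.3, -0.4],
        "pca_fit": [True, True, False],
    }
)

# Pick every column named "PC" followed by an integer.
pc_cols = [c for c in df_pca.columns if c.startswith("PC") and c[2:].isdigit()]

# Projections for the samples that were used for fitting.
fitted = df_pca.loc[df_pca["pca_fit"], ["sample_id", *pc_cols]]

# Mean position of each taxon along the first component.
mean_pc1 = df_pca.groupby("taxon")["PC1"].mean()
```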
