@@ -129,36 +129,6 @@ Sites which are not used for inference will
129129still be included in the final tree sequence, with mutations at those sites being placed
130130onto branches by {meth}`parsimony<tskit.Tree.map_mutations>`.
131131
132- ### Masks
133-
134- It is also possible to *completely* exclude sites and samples by specifying a boolean
135- `site_mask` and/or `sample_mask` when creating the `VariantData` object. Sites or samples with
136- a mask value of `True` will be omitted entirely from both inference and the final tree sequence.
137- This can be useful, for example, if you wish to select only a subset of the chromosome for
138- inference, e.g. to reduce computational load. You can also use it to restrict inference to a
139- particular contig, if your dataset contains multiple contigs. Note that if a `site_mask` is provided,
140- the ancestral states array should only specify alleles for the unmasked sites.
141-
142- Below, for instance, is an example of including only sites at positions below 80 in the contig
143- labelled "chr1" in the `example_data.vcz` file:
144-
145- ``` {code-cell}
146- import numpy as np
147-
148- # mask out any sites not associated with the contig named "chr1"
149- # (for demonstration: all sites in this .vcz file are from "chr1" anyway)
150- chr1_index = np.where(vcf_zarr.contig_id[:] == "chr1")[0]
151- site_mask = vcf_zarr.variant_contig[:] != chr1_index
152- # also mask out any sites with a position >= 80
153- site_mask[vcf_zarr.variant_position[:] >= 80] = True
154-
155- smaller_vdata = tsinfer.VariantData(
156- "_static/example_data.vcz",
157- ancestral_state="ancestral_state",
158- site_mask=site_mask,
159- )
160- print(f"The `smaller_vdata` object returns data for only {smaller_vdata.num_sites} sites")
161- ```
162132
163133### Topology inference
164134
@@ -257,6 +227,40 @@ software such as [tsdate](https://tskit.dev/software/tsdate.html): the _tsinfer_
257227algorithm is only intended to infer the genetic relationships between the samples
258228(i.e. the *topology* of the tree sequence).
259229
230+ ### Masks
231+
232+ It is possible to *completely* exclude sites and samples by specifying a boolean
233+ `site_mask` and/or `sample_mask` when creating the `VariantData` object. Sites or samples with
234+ a mask value of `True` will be omitted entirely from both inference and the final tree sequence.
235+ This can be useful, for example, if you wish to select only a subset of the chromosome for
236+ inference, e.g. to reduce computational load. You can also use it to restrict inference to a
237+ particular contig, if your dataset contains multiple contigs. Note that if a `site_mask` is provided,
238+ the ancestral states array should only specify alleles for the unmasked sites.
239+
240+ Below, for instance, is an example of including only sites at positions below 80 in the contig
241+ labelled "chr1" in the `example_data.vcz` file:
242+
243+ ``` {code-cell}
244+ import numpy as np
245+ import zarr
246+
247+ vcf_zarr = zarr.open("_static/example_data.vcz")
248+
249+ # mask out any sites not associated with the contig named "chr1"
250+ # (for demonstration: all sites in this .vcz file are from "chr1" anyway)
251+ chr1_index = np.where(vcf_zarr.contig_id[:] == "chr1")[0]
252+ site_mask = vcf_zarr.variant_contig[:] != chr1_index
253+ # also mask out any sites with a position >= 80
254+ site_mask[vcf_zarr.variant_position[:] >= 80] = True
255+
256+ smaller_vdata = tsinfer.VariantData(
257+ "_static/example_data.vcz",
258+ ancestral_state="ancestral_state",
259+ site_mask=site_mask,
260+ )
261+ print(f"The `smaller_vdata` object returns data for only {smaller_vdata.num_sites} sites")
262+ ```
263+
260264
261265(sec_usage_simulation_example)=
262266