-
This isn't a tsdate question, as it happens. It's tsinfer that assumes all your tips are at time 0. Instructions for how to use tsinfer to take advantage of the individual times are a bit scattered all over the place, and out of date, sorry. There's some discussion here: tskit-dev/tsinfer#602, but I'll try to put some better information together. Note that tsdate itself can cope perfectly well with historical samples.
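You can see the symptom directly on the tsinfer output, before any dating (just a sketch, assuming inferred_ts is the tree sequence returned by tsinfer.match_samples):

import numpy as np

# every sample (tip) node comes out at time 0, whatever the individuals' times
print(np.unique(inferred_ts.nodes_time[inferred_ts.samples()]))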
-
Ah yes, I remember now. Try setting

vdata = tsinfer.VariantData(
    "chr1/chr1.vcz",
    ancestral_state="variant_AA",
    individuals_time=sampling_times,  # tip dates stored in sampling_times
)
anc = tsinfer.generate_ancestors(vdata)
anc_ts = tsinfer.match_ancestors(vdata, anc, time_units="years")
inferred_ts = tsinfer.match_samples(vdata, anc_ts, force_sample_times=True)
# as before
simplified_ts = tsdate.preprocess_ts(
    inferred_ts,
    erase_flanks=True,
)
dated_ts, fit = tsdate.date(simplified_ts, ...)

Let me know if that works?
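If it helps as a check: when this works, the tip node times after the match_samples call should reflect the sampling times rather than all being 0 (a sketch, reusing the variables from the snippet above):

import numpy as np

tip_times = inferred_ts.nodes_time[inferred_ts.samples()]
print(np.unique(tip_times))  # should include the nonzero values from sampling_times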
-
Here's a temporary workaround.

import numpy as np
import tsinfer
import tskit

# hack: monkey-patch the new routine (defined below) onto AncestorData
tsinfer.AncestorData.insert_proxy_samples = insert_proxy_samples

# Now run the inference
anc = tsinfer.generate_ancestors(vdata)
# add proxy ancestors for samples > time 0, and change ancestor times to fit
anc_with_proxy = anc.insert_proxy_samples(vdata)
anc_ts = tsinfer.match_ancestors(vdata, anc_with_proxy, time_units="years")
inferred_ts = tsinfer.match_samples(vdata, anc_ts, force_sample_times=True)  # hopefully will work now

Here's the new method itself; the docstring should be more or less understandable?

import logging
logger = logging.getLogger("tsinfer")


def insert_proxy_samples(
    self,
    variant_data,
    *,
    sample_ids=None,
    epsilon=None,
    keep_ancestor_times=None,
    allow_mutation=None,  # deprecated alias
    **kwargs,
):
    """
    Take a set of samples from a :class:`.VariantData` instance and create additional
    "proxy sample ancestors" from them, returning a new :class:`.AncestorData`
    instance including both the current ancestors and the additional ancestors
    at the appropriate time points.

    A *proxy sample ancestor* is an ancestor based upon a known sample. At
    sites used in the full inference process, the haplotype of this ancestor
    is identical to that of the sample on which it is based. The time of the
    ancestor is taken to be a fraction ``epsilon`` older than the sample on
    which it is based.

    A common use of this function is to provide ancestral nodes for anchoring
    historical samples at the correct time when matching them into a tree
    sequence during the :func:`tsinfer.match_samples` stage of inference.
    For this reason, by default, the samples chosen from ``variant_data``
    are those associated with historical (i.e. non-contemporary)
    :ref:`individuals <sec_inference_data_model_individual>`. This can be
    altered by using the ``sample_ids`` parameter.

    If ``keep_ancestor_times`` is ``False``, times of the current ancestors
    may be increased to account for the times of the proxy sample ancestors.

    .. note::
        The proxy sample ancestors inserted here will correspond to extra nodes
        in the inferred tree sequence. At sites which are not used in the full
        inference process (e.g. sites unique to a single historical sample),
        these proxy sample ancestor nodes may have a different genotype from
        their corresponding sample.

    :param VariantData variant_data: The `VariantData` instance
        from which to select the samples used to create extra ancestors.
    :param list(int) sample_ids: A list of sample ids in the ``variant_data``
        instance that will be selected to create the extra ancestors. If
        ``None`` (default) select all the historical samples, i.e. those
        associated with an :ref:`sec_inference_data_model_individual` whose
        time is greater than zero. The order of ids is ignored, as are
        duplicate ids.
    :param list(float) epsilon: A list of small time increments
        determining how much older each proxy sample ancestor is than the
        corresponding sample listed in ``sample_ids``. A single value is also
        allowed, in which case it is used as the time increment for all selected
        proxy sample ancestors. If ``None`` (default), find :math:`{\\delta}t`, the
        smallest time difference between the sample times and the next
        oldest ancestor in the current :class:`.AncestorData` instance, setting
        ``epsilon`` = :math:`{\\delta}t / 100` (or, if all selected samples
        are at least as old as the oldest ancestor, take :math:`{\\delta}t`
        to be the smallest non-zero time difference between existing ancestors).
    :param bool keep_ancestor_times: If ``False`` (the default), the existing
        times of the ancestors in the current :class:`.AncestorData` instance
        may be increased so that derived states in the inserted proxy samples
        can have an ancestor with a mutation at that site above them (i.e. the
        infinite sites assumption is maintained). This is useful when site
        times have been approximated by using their frequency. Alternatively,
        if ``keep_ancestor_times`` is ``True``, existing ancestor times are
        preserved, and inserted proxy sample ancestors are allowed to
        possess derived alleles at sites where there are no pre-existing
        mutations in older ancestors. This can lead to a de-novo mutation at a
        site that also has a mutation elsewhere (i.e. breaking the infinite sites
        assumption).
    :param bool allow_mutation: Deprecated alias for ``keep_ancestor_times``.
    :param \\**kwargs: Further arguments passed to the constructor when creating
        the new :class:`AncestorData` instance which will be returned.

    :return: A new :class:`.AncestorData` object.
    :rtype: AncestorData
    """
    if allow_mutation is not None:
        if keep_ancestor_times is not None:
            raise ValueError(
                "Cannot specify both `allow_mutation` and `keep_ancestor_times`"
            )
        keep_ancestor_times = allow_mutation
    self._check_finalised()
    variant_data._check_finalised()
    if self.sequence_length != variant_data.sequence_length:
        raise ValueError("variant_data does not have the correct sequence length")
    used_sites = np.isin(variant_data.sites_position[:], self.sites_position[:])
    if np.sum(used_sites) != self.num_sites:
        raise ValueError("Genome positions in ancestors missing from variant_data")
    if sample_ids is None:
        sample_ids = []
        for i in variant_data.individuals():
            if i.time > 0:
                sample_ids += i.samples
    # sort by ID and make unique for quick haplotype access
    sample_ids, unique_indices = np.unique(np.array(sample_ids), return_index=True)
    sample_times = np.zeros(len(sample_ids), dtype=self.ancestors_time.dtype)
    for i, s in enumerate(sample_ids):
        sample = variant_data.sample(s)
        if sample.individual != tskit.NULL:
            sample_times[i] = variant_data.individual(sample.individual).time
    if epsilon is not None:
        epsilons = np.atleast_1d(epsilon)
        if len(epsilons) == 1:
            # all get the same epsilon
            epsilons = np.repeat(epsilons, len(sample_ids))
        else:
            if len(epsilons) != len(unique_indices):
                raise ValueError(
                    "The number of epsilon values must equal the number of "
                    f"sample_ids ({len(sample_ids)})"
                )
            epsilons = epsilons[unique_indices]
    else:
        anc_times = self.ancestors_time[:][::-1]  # find ascending time order
        older_index = np.searchsorted(anc_times, sample_times, side="right")
        # Don't include times older than the oldest ancestor
        allowed = older_index < self.num_ancestors
        if np.sum(allowed) > 0:
            delta_t = anc_times[older_index[allowed]] - sample_times[allowed]
        else:
            # All samples have times equal to or older than the oldest curr ancestor
            time_diffs = np.diff(anc_times)
            delta_t = np.min(time_diffs[time_diffs > 0])
        epsilons = np.repeat(np.min(delta_t) / 100.0, len(sample_ids))
    proxy_times = sample_times + epsilons
    time_sorted_indexes = np.argsort(proxy_times)
    reverse_time_sorted_indexes = time_sorted_indexes[::-1]
    # In cases where we have more than a handful of samples to use as proxies, it is
    # inefficient to access the haplotypes out of order, so we iterate and cache
    # (caution: the haplotypes list may be quite large in this case)
    haplotypes = [
        h[1]
        for h in variant_data.haplotypes(
            samples=sample_ids, sites=used_sites, recode_ancestral=True
        )
    ]
    new_anc_times = self.ancestors_time[:]  # this is a copy
    if not keep_ancestor_times:
        assert np.all(np.diff(self.ancestors_time) <= 0)
        # Find the youngest (max) ancestor ID constrained by each sample haplotype
        site_ancestor = -np.ones(self.num_sites, dtype=int)
        anc_min_time = np.zeros(self.num_ancestors, dtype=self.ancestors_time.dtype)
        # If (unusually) there are multiple ancestors for the same focal site, we
        # can take the youngest
        for ancestor_id, focal_sites in enumerate(self.ancestors_focal_sites):
            site_ancestor[focal_sites] = ancestor_id
        for hap_id in time_sorted_indexes:
            derived_sites = haplotypes[hap_id] > 0
            if np.sum(derived_sites) == 0:
                root = 0  # no derived sites, so only needs to be below the root
                for i, focal_sites in enumerate(self.ancestors_focal_sites):
                    if len(focal_sites) > 0:
                        if i > 0:
                            root = i - 1
                        anc_min_time[root] = proxy_times[hap_id] + epsilons[hap_id]
                        break
            else:
                max_anc_id = np.max(site_ancestor[derived_sites])  # youngest ancestor
                if max_anc_id >= 0:
                    anc_min_time[max_anc_id] = proxy_times[hap_id] + epsilons[hap_id]
        # Go from youngest to oldest, pushing up the times of the ancestors to
        # achieve compatibility with infinite sites.
        # TODO - replace with something more efficient that uses time_diffs
        for anc_id in range(self.num_ancestors - 1, -1, -1):
            current_time = new_anc_times[anc_id]
            if anc_min_time[anc_id] > current_time:
                new_anc_times[:(anc_id + 1)] += anc_min_time[anc_id] - current_time
        assert new_anc_times[1] > np.max(sample_times)  # root ancestor
    with self.__class__(  # Create new AncestorData instance to return
        variant_data.sites_position[:][used_sites],
        variant_data.sequence_length,
        **kwargs,
    ) as other:
        mutated_sites = set()  # To check if mutations have occurred yet
        ancestors_iter = self.ancestors()
        anc = next(ancestors_iter, None)
        for i in reverse_time_sorted_indexes:
            proxy_time = proxy_times[i]
            sample_id = sample_ids[i]
            haplotype = haplotypes[i]
            derived_sites = set(np.where(haplotype > 0)[0])
            while anc is not None and new_anc_times[anc.id] > proxy_time:
                anc_time = new_anc_times[anc.id]
                other.add_ancestor(
                    anc.start, anc.end, anc_time, anc.focal_sites, anc.haplotype
                )
                mutated_sites.update(anc.focal_sites)
                anc = next(ancestors_iter, None)
            if not derived_sites.issubset(mutated_sites):
                # only reachable when existing ancestor times were preserved
                assert keep_ancestor_times
                logger.info(
                    f"Infinite sites assumption deliberately broken: {sample_id} "
                    "contains an allele which requires a novel mutation."
                )
            logger.debug(
                f"Inserting proxy ancestor: sample {sample_id} at time {proxy_time}"
            )
            other.add_ancestor(
                start=0,
                end=self.num_sites,
                time=proxy_time,
                focal_sites=[],
                haplotype=haplotype,
            )
        # Add any ancestors remaining in the current instance
        while anc is not None:
            anc_time = new_anc_times[anc.id]
            other.add_ancestor(
                anc.start, anc.end, anc_time, anc.focal_sites, anc.haplotype
            )
            anc = next(ancestors_iter, None)
        other.clear_provenances()
        for timestamp, record in self.provenances():
            other.add_provenance(timestamp, record)
        other.record_provenance(command="insert_proxy_samples", **kwargs)
    assert other.num_ancestors == self.num_ancestors + len(sample_ids)
    return other
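For what it's worth, a rough sanity check after running the snippet above might look like this (just a sketch, reusing vdata, anc and anc_with_proxy from above; it simply re-counts the historical samples the same way the method does by default):

# one extra proxy ancestor is expected per historical sample
num_historical = sum(len(i.samples) for i in vdata.individuals() if i.time > 0)
assert anc_with_proxy.num_ancestors == anc.num_ancestors + num_historical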
-
Hello,
I generated a tree sequence using real samples, the majority of which have sampling dates > 0. However, after running tsdate, node_times for these tips remain at 0. Is it possible to set these dates to their actual times using tsdate?
To incorporate tip dates, I've done the following:
These dates were successfully stored in the individual metadata. However, dated_ts.nodes_time returns time = 0 for every node which is a tip. In addition, fit.node_posteriors() and the node metadata['mn'] values also give time = 0 for all tip nodes. A notebook file with more detail is attached here, and the basic check is sketched below. Thanks for your help!
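The check, roughly (a sketch; dated_ts and fit are the return values of tsdate.date):

import numpy as np

# times of the sample (tip) nodes in the dated tree sequence: all still 0.0
tips = dated_ts.samples()
print(np.unique(dated_ts.nodes_time[tips]))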
Best,
Nashwa