-
This isn't a tsdate question, as it happens. It's tsinfer that assumes all your tips are at time 0. Instructions for how to use tsinfer to take advantage of the individual times are a bit scattered all over the place, and out of date, sorry. There's some discussion here: tskit-dev/tsinfer#602, but I'll try to put some better information together. Note that tsdate itself can cope perfectly well with historical samples.
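You can see the symptom directly on the tsinfer output, before any dating (just a sketch, assuming inferred_ts is the tree sequence returned by tsinfer.match_samples):

import numpy as np

# every sample (tip) node comes out at time 0, whatever the individuals' times
print(np.unique(inferred_ts.nodes_time[inferred_ts.samples()]))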
-
Ah yes, I remember now. Try setting

vdata = tsinfer.VariantData(
    "chr1/chr1.vcz",
    ancestral_state="variant_AA",
    individuals_time=sampling_times,  # tip dates stored in sampling_times
)
anc = tsinfer.generate_ancestors(vdata)
anc_ts = tsinfer.match_ancestors(vdata, anc, time_units="years")
inferred_ts = tsinfer.match_samples(vdata, anc_ts, force_sample_times=True)
# as before
simplified_ts = tsdate.preprocess_ts(
    inferred_ts,
    erase_flanks=True,
)
dated_ts, fit = tsdate.date(simplified_ts, ...)

Let me know if that works?
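If it helps as a check: when this works, the tip node times after the match_samples call should reflect the sampling times rather than all being 0 (a sketch, reusing the variables from the snippet above):

import numpy as np

tip_times = inferred_ts.nodes_time[inferred_ts.samples()]
print(np.unique(tip_times))  # should include the nonzero values from sampling_times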
-
Here's a temporary workaround.

import numpy as np
import tsinfer
import tskit

# hack: monkey-patch the new routine (defined below) onto AncestorData
tsinfer.AncestorData.insert_proxy_samples = insert_proxy_samples

# Now run the inference
anc = tsinfer.generate_ancestors(vdata)
# add proxy ancestors for samples > time 0, and change ancestor times to fit
anc_with_proxy = anc.insert_proxy_samples(vdata)
anc_ts = tsinfer.match_ancestors(vdata, anc_with_proxy, time_units="years")
inferred_ts = tsinfer.match_samples(vdata, anc_ts, force_sample_times=True)  # hopefully will work now

Here's the new method itself; the docstring should be more or less understandable?

import logging
logger = logging.getLogger("tsinfer")


def insert_proxy_samples(
    self,
    variant_data,
    *,
    sample_ids=None,
    epsilon=None,
    keep_ancestor_times=None,
    allow_mutation=None,  # deprecated alias
    **kwargs,
):
    """
    Take a set of samples from a :class:`.VariantData` instance and create additional
    "proxy sample ancestors" from them, returning a new :class:`.AncestorData`
    instance including both the current ancestors and the additional ancestors
    at the appropriate time points.

    A *proxy sample ancestor* is an ancestor based upon a known sample. At
    sites used in the full inference process, the haplotype of this ancestor
    is identical to that of the sample on which it is based. The time of the
    ancestor is taken to be a fraction ``epsilon`` older than the sample on
    which it is based.

    A common use of this function is to provide ancestral nodes for anchoring
    historical samples at the correct time when matching them into a tree
    sequence during the :func:`tsinfer.match_samples` stage of inference.
    For this reason, by default, the samples chosen from ``variant_data``
    are those associated with historical (i.e. non-contemporary)
    :ref:`individuals <sec_inference_data_model_individual>`. This can be
    altered by using the ``sample_ids`` parameter.

    If ``keep_ancestor_times`` is ``False``, times of the current ancestors
    may be increased to account for the times of the proxy sample ancestors.

    .. note::
        The proxy sample ancestors inserted here will correspond to extra nodes
        in the inferred tree sequence. At sites which are not used in the full
        inference process (e.g. sites unique to a single historical sample),
        these proxy sample ancestor nodes may have a different genotype from
        their corresponding sample.

    :param VariantData variant_data: The `VariantData` instance
        from which to select the samples used to create extra ancestors.
    :param list(int) sample_ids: A list of sample ids in the ``variant_data``
        instance that will be selected to create the extra ancestors. If
        ``None`` (default) select all the historical samples, i.e. those
        associated with an :ref:`sec_inference_data_model_individual` whose
        time is greater than zero. The order of ids is ignored, as are
        duplicate ids.
    :param list(float) epsilon: A list of small time increments
        determining how much older each proxy sample ancestor is than the
        corresponding sample listed in ``sample_ids``. A single value is also
        allowed, in which case it is used as the time increment for all selected
        proxy sample ancestors. If ``None`` (default), find :math:`{\\delta}t`, the
        smallest time difference between the sample times and the next
        oldest ancestor in the current :class:`.AncestorData` instance, setting
        ``epsilon`` = :math:`{\\delta}t / 100` (or, if all selected samples
        are at least as old as the oldest ancestor, take :math:`{\\delta}t`
        to be the smallest non-zero time difference between existing ancestors).
    :param bool keep_ancestor_times: If ``False`` (the default), the existing
        times of the ancestors in the current :class:`.AncestorData` instance
        may be increased so that derived states in the inserted proxy samples
        can have an ancestor with a mutation at that site above them (i.e. the
        infinite sites assumption is maintained). This is useful when site
        times have been approximated by using their frequency. Alternatively,
        if ``keep_ancestor_times`` is ``True``, existing ancestor times are
        preserved, and inserted proxy sample ancestors are allowed to
        possess derived alleles at sites where there are no pre-existing
        mutations in older ancestors. This can lead to a de-novo mutation at a
        site that also has a mutation elsewhere (i.e. breaking the infinite sites
        assumption).
    :param bool allow_mutation: Deprecated alias for ``keep_ancestor_times``.
    :param \\**kwargs: Further arguments passed to the constructor when creating
        the new :class:`AncestorData` instance which will be returned.

    :return: A new :class:`.AncestorData` object.
    :rtype: AncestorData
    """
    if allow_mutation is not None:
        if keep_ancestor_times is not None:
            raise ValueError(
                "Cannot specify both `allow_mutation` and `keep_ancestor_times`"
            )
        keep_ancestor_times = allow_mutation
    self._check_finalised()
    variant_data._check_finalised()
    if self.sequence_length != variant_data.sequence_length:
        raise ValueError("variant_data does not have the correct sequence length")
    used_sites = np.isin(variant_data.sites_position[:], self.sites_position[:])
    if np.sum(used_sites) != self.num_sites:
        raise ValueError("Genome positions in ancestors missing from variant_data")
    if sample_ids is None:
        sample_ids = []
        for i in variant_data.individuals():
            if i.time > 0:
                sample_ids += i.samples
    # sort by ID and make unique for quick haplotype access
    sample_ids, unique_indices = np.unique(np.array(sample_ids), return_index=True)
    sample_times = np.zeros(len(sample_ids), dtype=self.ancestors_time.dtype)
    for i, s in enumerate(sample_ids):
        sample = variant_data.sample(s)
        if sample.individual != tskit.NULL:
            sample_times[i] = variant_data.individual(sample.individual).time
    if epsilon is not None:
        epsilons = np.atleast_1d(epsilon)
        if len(epsilons) == 1:
            # all get the same epsilon
            epsilons = np.repeat(epsilons, len(sample_ids))
        else:
            if len(epsilons) != len(unique_indices):
                raise ValueError(
                    "The number of epsilon values must equal the number of "
                    f"sample_ids ({len(sample_ids)})"
                )
            epsilons = epsilons[unique_indices]
    else:
        anc_times = self.ancestors_time[:][::-1]  # find ascending time order
        older_index = np.searchsorted(anc_times, sample_times, side="right")
        # Don't include times older than the oldest ancestor
        allowed = older_index < self.num_ancestors
        if np.sum(allowed) > 0:
            delta_t = anc_times[older_index[allowed]] - sample_times[allowed]
        else:
            # All samples have times equal to or older than the oldest curr ancestor
            time_diffs = np.diff(anc_times)
            delta_t = np.min(time_diffs[time_diffs > 0])
        epsilons = np.repeat(np.min(delta_t) / 100.0, len(sample_ids))
    proxy_times = sample_times + epsilons
    time_sorted_indexes = np.argsort(proxy_times)
    reverse_time_sorted_indexes = time_sorted_indexes[::-1]
    # In cases where we have more than a handful of samples to use as proxies, it is
    # inefficient to access the haplotypes out of order, so we iterate and cache
    # (caution: the haplotypes list may be quite large in this case)
    haplotypes = [
        h[1]
        for h in variant_data.haplotypes(
            samples=sample_ids, sites=used_sites, recode_ancestral=True
        )
    ]
    new_anc_times = self.ancestors_time[:]  # this is a copy
    if not keep_ancestor_times:
        assert np.all(np.diff(self.ancestors_time) <= 0)
        # Find the youngest (max) ancestor ID constrained by each sample haplotype
        site_ancestor = -np.ones(self.num_sites, dtype=int)
        anc_min_time = np.zeros(self.num_ancestors, dtype=self.ancestors_time.dtype)
        # If (unusually) there are multiple ancestors for the same focal site, we
        # can take the youngest
        for ancestor_id, focal_sites in enumerate(self.ancestors_focal_sites):
            site_ancestor[focal_sites] = ancestor_id
        for hap_id in time_sorted_indexes:
            derived_sites = haplotypes[hap_id] > 0
            if np.sum(derived_sites) == 0:
                root = 0  # no derived sites, so only needs to be below the root
                for i, focal_sites in enumerate(self.ancestors_focal_sites):
                    if len(focal_sites) > 0:
                        if i > 0:
                            root = i - 1
                        anc_min_time[root] = proxy_times[hap_id] + epsilons[hap_id]
                        break
            else:
                max_anc_id = np.max(site_ancestor[derived_sites])  # youngest ancestor
                if max_anc_id >= 0:
                    anc_min_time[max_anc_id] = proxy_times[hap_id] + epsilons[hap_id]
        # Go from youngest to oldest, pushing up the times of the ancestors to
        # achieve compatibility with infinite sites.
        # TODO - replace with something more efficient that uses time_diffs
        for anc_id in range(self.num_ancestors - 1, -1, -1):
            current_time = new_anc_times[anc_id]
            if anc_min_time[anc_id] > current_time:
                new_anc_times[:(anc_id + 1)] += anc_min_time[anc_id] - current_time
        assert new_anc_times[1] > np.max(sample_times)  # root ancestor
    with self.__class__(  # Create new AncestorData instance to return
        variant_data.sites_position[:][used_sites],
        variant_data.sequence_length,
        **kwargs,
    ) as other:
        mutated_sites = set()  # To check if mutations have occurred yet
        ancestors_iter = self.ancestors()
        anc = next(ancestors_iter, None)
        for i in reverse_time_sorted_indexes:
            proxy_time = proxy_times[i]
            sample_id = sample_ids[i]
            haplotype = haplotypes[i]
            derived_sites = set(np.where(haplotype > 0)[0])
            while anc is not None and new_anc_times[anc.id] > proxy_time:
                anc_time = new_anc_times[anc.id]
                other.add_ancestor(
                    anc.start, anc.end, anc_time, anc.focal_sites, anc.haplotype
                )
                mutated_sites.update(anc.focal_sites)
                anc = next(ancestors_iter, None)
            if not derived_sites.issubset(mutated_sites):
                # only reachable when existing ancestor times were preserved
                assert keep_ancestor_times
                logger.info(
                    f"Infinite sites assumption deliberately broken: {sample_id} "
                    "contains an allele which requires a novel mutation."
                )
            logger.debug(
                f"Inserting proxy ancestor: sample {sample_id} at time {proxy_time}"
            )
            other.add_ancestor(
                start=0,
                end=self.num_sites,
                time=proxy_time,
                focal_sites=[],
                haplotype=haplotype,
            )
        # Add any ancestors remaining in the current instance
        while anc is not None:
            anc_time = new_anc_times[anc.id]
            other.add_ancestor(
                anc.start, anc.end, anc_time, anc.focal_sites, anc.haplotype
            )
            anc = next(ancestors_iter, None)
        other.clear_provenances()
        for timestamp, record in self.provenances():
            other.add_provenance(timestamp, record)
        other.record_provenance(command="insert_proxy_samples", **kwargs)
    assert other.num_ancestors == self.num_ancestors + len(sample_ids)
    return other
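For what it's worth, a rough sanity check after running the snippet above might look like this (just a sketch, reusing vdata, anc and anc_with_proxy from above; it simply re-counts the historical samples the same way the method does by default):

# one extra proxy ancestor is expected per historical sample
num_historical = sum(len(i.samples) for i in vdata.individuals() if i.time > 0)
assert anc_with_proxy.num_ancestors == anc.num_ancestors + num_historical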
-
Hello,
I generated a tree sequence using real samples, the majority of which have sampling dates > 0. However, after running tsdate, node_times for these tips remain at 0. Is it possible to set these dates to their actual times using tsdate?
To incorporate tip dates, I've done the following:
These dates were successfully stored in the individual metadata. However, dated_ts.nodes_time returns time = 0 for every node which is a tip. In addition, fit.node_posteriors() and the node metadata['mn'] values also give time = 0 for all tip nodes. A notebook file with more detail is attached here, and the basic check is sketched below. Thanks for your help!
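The check, roughly (a sketch; dated_ts and fit are the return values of tsdate.date):

import numpy as np

# times of the sample (tip) nodes in the dated tree sequence: all still 0.0
tips = dated_ts.samples()
print(np.unique(dated_ts.nodes_time[tips]))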
Best,
Nashwa