diff --git a/README.md b/README.md index ace2b4572..7715f7958 100644 --- a/README.md +++ b/README.md @@ -5,10 +5,9 @@ GPSEA (Genotypes and Phenotypes - Statistical Evaluation of Associations) is a Python package for finding genotype-phenotype associations. - -See the [Tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html) -and a comprehensive [User guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html) -for more information. +See our documentation for the [setup](https://monarch-initiative.github.io/gpsea/stable/setup.html) instructions, +a [tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html) with an end-to-end genotype-phenotype association analysis, +and a comprehensive [user guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html) with everything else. The documentation comes in two flavors: diff --git a/docs/conf.py b/docs/conf.py index 4ee862193..94556ac79 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -63,7 +63,7 @@ # The short X.Y version. version = u'0.9' # The full version, including alpha/beta/rc tags. -release = u'0.9.1' +release = u'0.9.2' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. diff --git a/docs/img/tutorial/tbx5_protein_diagram.png b/docs/img/tutorial/tbx5_protein_diagram.png index cbd4cbc02..10422c9ec 100644 Binary files a/docs/img/tutorial/tbx5_protein_diagram.png and b/docs/img/tutorial/tbx5_protein_diagram.png differ diff --git a/docs/setup.rst b/docs/setup.rst index 1be54561b..71eac1804 100644 --- a/docs/setup.rst +++ b/docs/setup.rst @@ -4,8 +4,8 @@ Setup ##### -Here we show how to install GPSEA and to prepare your Python environment -for genotype-phenotype association analysis. +Here we show how to install GPSEA and prepare your Python environment +for genotype-phenotype association analyses. .. contents:: Table of Contents diff --git a/docs/tutorial.rst b/docs/tutorial.rst index cf6dc0551..4db660626 100644 --- a/docs/tutorial.rst +++ b/docs/tutorial.rst @@ -273,10 +273,14 @@ depending on presence of a single allele of a missense or truncating variant >>> from gpsea.analysis.clf import monoallelic_classifier >>> is_missense = variant_effect(VariantEffect.MISSENSE_VARIANT, tx_id) >>> truncating_effects = ( +... VariantEffect.TRANSCRIPT_ABLATION, +... VariantEffect.TRANSCRIPT_TRANSLOCATION, ... VariantEffect.FRAMESHIFT_VARIANT, +... VariantEffect.START_LOST, ... VariantEffect.STOP_GAINED, ... VariantEffect.SPLICE_DONOR_VARIANT, ... VariantEffect.SPLICE_ACCEPTOR_VARIANT, +... # more effects could be listed here ... ... ) >>> is_truncating = anyof(variant_effect(e, tx_id) for e in truncating_effects) >>> gt_clf = monoallelic_classifier( diff --git a/docs/user-guide/analyses/partitioning/genotype/variant_predicates.rst b/docs/user-guide/analyses/partitioning/genotype/variant_predicates.rst index 588d3da27..36287cae3 100644 --- a/docs/user-guide/analyses/partitioning/genotype/variant_predicates.rst +++ b/docs/user-guide/analyses/partitioning/genotype/variant_predicates.rst @@ -26,13 +26,11 @@ The predicates operate on several lines of information: +------------------------+-------------------------------------------------------------------------------------------------+ | Protein data | variant is located in a region encoding a protein domain, protein feature type | +------------------------+-------------------------------------------------------------------------------------------------+ -| Genome | overlap with a genomic region of interest | -+------------------------+-------------------------------------------------------------------------------------------------+ The scope of the builtin predicates is fairly narrow and likely insufficient for real-life analyses. -However, the predicates can be chained into a compound predicate +However, several predicates can be "chained" into a compound predicate using a boolean logic, to achive more expressivity for testing complex conditions, such as "variant is a missense or synonymous variant located in exon 6 of `NM_013275.6`". @@ -41,8 +39,9 @@ such as "variant is a missense or synonymous variant located in exon 6 of `NM_01 Examples ******** -Here we show examples of several simple variant predicates and -how to chain them for testing complex conditions. +Here we show how to use the builtin predicates for simple tests +and how to build a compound predicate from the builtin predicates, +for testing complex conditions. Load cohort @@ -112,10 +111,10 @@ See the :mod:`gpsea.analysis.predicate` module for a complete list of the builtin predicates. -Predicate chain -=============== +Compound predicates +=================== -Using the builtin predicates, we can build a logical chain to test complex conditions. +A compound predicate for testing complex conditions can be built from two or more predicates. For instance, we can test if the variant meets any of several conditions: >>> import gpsea.analysis.predicate as vp @@ -130,7 +129,13 @@ or *all* conditions: >>> missense_and_exon20.test(variant) True -All variant predicates overload Python ``&`` (AND) and ``|`` (OR) operators, to allow chaining. +All variant predicates overload Python ``&`` (AND) and ``|`` (OR) operators, +to combine a predicate pair into a compound predicate. + +.. note:: + + Combining three or or more predicates can be achieved with :func:`~gpsea.analysis.allof` + and :func:`~gpsea.analysis.anyof` functions. Therefore, there is nothing that prevents us to combine the predicates into multi-level tests, e.g. to test if the variant is a *"chromosomal deletion" or a deletion which removes at least 50 bp*: @@ -180,12 +185,16 @@ The builtin predicates should cover majority of use cases. However, if a predicate seems to be missing, feel free to submit an issue in our `GitHub tracker `_, -or to implement a custom predicate -by extending the :class:`~gpsea.analysis.predicate.VariantPredicate` class 😎. +or implement your own predicate by following the :ref:`custom-variant-predicate` +guide. +**** +Next +**** The variant predicate offers a flexible API for testing if variants meet a condition. -However, the genotype phenotype correlations are done on the individual level -and the variant predicates are used as a component of the genotype predicate. -The next sections show how to use variant predicates to assign individuals into groups. +However, the genotype phenotype correlations are studied on the level of individuals. +As described in :ref:`genotype-classifiers`, GPSEA uses the :class:`~gpsea.analysis.clf.GenotypeClassifier` API +to assign individuals into non-overlapping classes. Variant predicates are essential for creating such classifier. +We explain the details in the following sections. diff --git a/docs/user-guide/analyses/phenotype-scores.rst b/docs/user-guide/analyses/phenotype-scores.rst index 85347ed96..502a59f52 100644 --- a/docs/user-guide/analyses/phenotype-scores.rst +++ b/docs/user-guide/analyses/phenotype-scores.rst @@ -121,9 +121,11 @@ In this example, the point mutation is a mutation that meets the following condi '((change length == 0 AND reference allele length == 1) AND MISSENSE_VARIANT on NM_001042681.2)' -For the loss of function predicate, the following variant effects are considered loss of function: +For the loss-of-function predicate, the following is a non-exhausting list +of variant effects considered as a loss-of-function: >>> lof_effects = ( +... VariantEffect.TRANSCRIPT_TRANSLOCATION, ... VariantEffect.TRANSCRIPT_ABLATION, ... VariantEffect.FRAMESHIFT_VARIANT, ... VariantEffect.START_LOST, @@ -131,7 +133,7 @@ For the loss of function predicate, the following variant effects are considered ... ) >>> lof_mutation = anyof(variant_effect(eff, tx_id) for eff in lof_effects) >>> lof_mutation.description -'(TRANSCRIPT_ABLATION on NM_001042681.2 OR FRAMESHIFT_VARIANT on NM_001042681.2 OR START_LOST on NM_001042681.2 OR STOP_GAINED on NM_001042681.2)' +'(TRANSCRIPT_TRANSLOCATION on NM_001042681.2 OR TRANSCRIPT_ABLATION on NM_001042681.2 OR FRAMESHIFT_VARIANT on NM_001042681.2 OR START_LOST on NM_001042681.2 OR STOP_GAINED on NM_001042681.2)' The genotype predicate will bin the patient into two classes: a point mutation or the loss of function: @@ -154,6 +156,26 @@ Phenotype score This component is responsible for computing a phenotype score for an individual. As far as GPSEA framework is concerned, the phenotype score must be a floating point number or a `NaN` value if the score cannot be computed for an individual. +This is the essence of the :class:`~gpsea.analysis.pscore.PhenotypeScorer` class. + +GPSEA ships with several builtin phenotype scorers which can be used as + ++------------------------------------------------------------+---------------------------------------------+ +| Name | Description | ++============================================================+=============================================+ +| | Compute the total number of occurrences | +| * :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` | of specific phenotypic features | +| | (used in this section) | ++------------------------------------------------------------+---------------------------------------------+ +| | Compute the "adapted De Vries Score" | +| * :class:`~gpsea.analysis.pscore.DeVriesPhenotypeScorer` | for assessing severity | +| | of intellectual disability | ++------------------------------------------------------------+---------------------------------------------+ + +.. tip:: + + See :ref:`custom-phenotype-scorer` section to learn how to build a phenotype scorer from scratch. + Here we use the :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` for scoring the individuals based on the number of structural defects diff --git a/docs/user-guide/analyses/report/rere_phenotype_score_boxplot.png b/docs/user-guide/analyses/report/rere_phenotype_score_boxplot.png index 5f1fcada6..86e0d3a48 100644 Binary files a/docs/user-guide/analyses/report/rere_phenotype_score_boxplot.png and b/docs/user-guide/analyses/report/rere_phenotype_score_boxplot.png differ diff --git a/docs/user-guide/analyses/survival.rst b/docs/user-guide/analyses/survival.rst index 616bfa467..faef0bbb2 100644 --- a/docs/user-guide/analyses/survival.rst +++ b/docs/user-guide/analyses/survival.rst @@ -188,3 +188,96 @@ or `None` if computing the survival was impossible (see :func:`~gpsea.analysis.t The `Survival` reports the number of days until attaining the endpoint, here defined as end stage renal disease (`is_censored=False`), or until the individual dropped out of the analysis (`is_censored=True`). + + +Troubleshooting +=============== + +Sometimes the survival analysis fails and an :class:`~gpsea.analysis.AnalysisException` is raised. +For instance, the current Logrank test implementation reports a p value of `NaN` +if the survival is the same for all individuals. +This is unlikely an expected outcome, therefore GPSEA raises +an :class:`~gpsea.analysis.AnalysisException` to force the user to troubleshoot. + +To help with troubleshooting, the data computed prior detecting the error is included in the exception's +:attr:`~gpsea.analysis.AnalysisException.data` attribute. In survival analysis, the data should include +the identifiers, genotype classes, and survivals of the tested individuals. + +Let's show this on an example. We will create a toy cohort of 10 individuals +with onset of `Lynch syndrome I `_ +(`OMIM:120435`) at 40 years. + +>>> from gpsea.model import Cohort, Patient, Disease, Age +>>> onset = Age.from_iso8601_period("P40Y") +>>> individuals = [ +... Patient.from_raw_parts( +... labels=label, +... diseases=( +... Disease.from_raw_parts( +... term_id="OMIM:120435", +... name="Lynch syndrome I", +... is_observed=True, +... onset=onset, +... ), +... ), +... ) +... for label in "ABCDEFGHIJ" # 10 individuals +... ] +>>> cohort = Cohort.from_patients(individuals) + +We will assign them into genotype classes on random, ... + +>>> from gpsea.analysis.clf import random_classifier +>>> gt_clf = random_classifier(seed=123) +>>> gt_clf.description +'Classify the individual into random classes' + +... using the Lynch syndrome I diagnosis as the endpoint ... + +>>> from gpsea.analysis.temporal.endpoint import disease_onset +>>> endpoint = disease_onset(disease_id="OMIM:120435") +>>> endpoint.description +'Compute time until OMIM:120435 onset' + +... and we will use Logrank test for differences in survival. + +>>> from gpsea.analysis.temporal.stats import LogRankTest +>>> survival_statistic = LogRankTest() + +We put together the survival analysis ... + +>>> from gpsea.analysis.temporal import SurvivalAnalysis +>>> survival_analysis = SurvivalAnalysis( +... statistic=survival_statistic, +... ) + +... which we expect to fail with an :class:`~gpsea.analysis.AnalysisException`: + +>>> result = survival_analysis.compare_genotype_vs_survival( +... cohort=cohort, +... gt_clf=gt_clf, +... endpoint=endpoint, +... ) +Traceback (most recent call last): + ... +gpsea.analysis._base.AnalysisException: The survival values did not meet the expectation of the statistical test! + +The genotype classes and survival values can be retrieved from the exception: + +>>> from gpsea.analysis import AnalysisException +>>> try: +... result = survival_analysis.compare_genotype_vs_survival( +... cohort=cohort, +... gt_clf=gt_clf, +... endpoint=endpoint, +... ) +... except AnalysisException as ae: +... genotypes = ae.data["genotype"] +... survivals = ae.data["survival"] + +and the values can come in handy in troubleshooting: + +>>> genotypes[:3] +(0, 0, 0) +>>> survivals[:3] +(Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False), Survival(value=14610.0, is_censored=False)) diff --git a/docs/user-guide/custom-components.rst b/docs/user-guide/custom-components.rst new file mode 100644 index 000000000..af851a2c5 --- /dev/null +++ b/docs/user-guide/custom-components.rst @@ -0,0 +1,142 @@ +.. _custom-components: + +################# +Custom components +################# + +GPSEA aims to stay useful in the long run. Therefore, we took a great care +to adhere to "good software development practices" and we framed GPSEA functionalities +as a set of loosely coupled components. As a rule of thumb, each component corresponds +to a Python abstract base class which is then extended by the builtin components +and can also be extended by the future components, to serve both common or exotic use cases. + +The abstract base classes define the component API. +Per guidelines in Python's :mod:`abc` module, the abstract classes use :class:`abc.ABCMeta` as a metaclass +and the class API consists of methods annotated with the :func:`abc.abstractmethod` decorator. +These decorated methods must be overridden in the subclasses. + +The following sections provide guidance for customizing the most commonly used GPSEA components. + + +.. _custom-phenotype-scorer: + +**************** +Phenotype scorer +**************** + +:class:`~gpsea.analysis.pscore.PhenotypeScorer` computes a phenotype score for an individual. +The phenotype score is a `float` with range :math:`(-\infty, \infty)` where `NaN` indicates +that a score cannot be computed (e.g. the lab measurement value was not ascertained for the individual). + +Here we show an example of a toy phenotype scorer +for using body mass index (BMI) as a phenotype score. + +>>> import typing +>>> from gpsea.model import Patient +>>> from gpsea.analysis.pscore import PhenotypeScorer +>>> class BmiScorer(PhenotypeScorer): # ❶ +... +... def __init__( # ❷ +... self, +... id2bmi: typing.Mapping[str, float], +... ): +... self._id2bmi = id2bmi +... +... @property +... def name(self) -> str: # ❸ +... return "BMI phenotype scorer" +... +... @property +... def description(self) -> str: # ❹ +... return "Body mass index used as a phenotype score" +... +... @property +... def variable_name(self) -> str: # ❺ +... return "BMI" +... +... def score(self, patient: Patient) -> float: # ❻ +... try: +... return self._id2bmi[patient.labels.label] +... except KeyError: +... return float('nan') + +❶ The ``BmiScorer`` must extend :class:`~gpsea.analysis.pscore.PhenotypeScorer` +to be used as a phenotype scorer. +❷ The scorer needs a ``dict`` with `label` → `BMI` for the analyzed individuals. +We assume the user will pre-compute the corresponding ``dict``. + +Then, the scorer must expose several properties, including ❸ ``name``, ❹ ``description``, +and the ❺ ``variable_name`` it operates on. +The properties provide bookkeeping metadata to use in e.g. visualizations. +Try to choose short and concise names. + +The most important part of the scorer is the ❻ `score` method +which retrieves the BMI for an individual or returns `NaN` if the value is not available +and the individual should be omitted from the analysis. + +.. _custom-variant-predicate: + +***************** +Variant predicate +***************** + +The purpose of a :class:`~gpsea.analysis.predicate.VariantPredicate` is to test +if a variant meets a certain criterion and GPSEA ships with an array +of builtin predicates (see :mod:`gpsea.analysis.predicate` module). +However, chances are a custom predicate will be needed in future, +so we show how to how to extend +the :class:`~gpsea.analysis.predicate.VariantPredicate` class +to create one's own predicate. + +Specifically, we show how to create a predicate to test if the variant affects a glycine residue +of the transcript of interest. + +>>> from gpsea.model import Variant, VariantEffect +>>> from gpsea.analysis.predicate import VariantPredicate +>>> class AffectsGlycinePredicate(VariantPredicate): # ❶ +... def __init__( # ❷ +... self, +... tx_id: str, +... ): +... self._tx_id = tx_id +... self._aa_code = "Gly" +... +... @property +... def name(self) -> str: # ❸ +... return "Affects Glycine" +... +... @property +... def description(self) -> str: # ❹ +... return "affects a glycine residue" +... +... @property +... def variable_name(self) -> str: # ❺ +... return "affected aminoacid residue" +... +... def test(self, variant: Variant) -> bool: # ❻ +... tx_ann = variant.get_tx_anno_by_tx_id(self._tx_id) +... if tx_ann is not None: +... hgvsp = tx_ann.hgvsp +... if hgvsp is not None: +... return hgvsp.startswith(f"p.{self._aa_code}") +... return False +... +... def __eq__(self, value: object) -> bool: # ➐ +... return isinstance(value, AffectsGlycinePredicate) and self._tx_id == value._tx_id +... +... def __hash__(self) -> int: # ❽ +... return hash((self._tx_id,)) +... +... def __repr__(self) -> str: # ❾ +... return str(self) +... +... def __str__(self) -> str: # ➓ +... return f"AffectsGlycinePredicate(tx_id={self._tx_id})" + +❶ The ``AffectsGlycinePredicate`` must extend :class:`~gpsea.analysis.predicate.VariantPredicate`. +❷ We ask the user to provide the transcript accession `str` and we set the target aminoacid code to glycine ``Gly``. +Like in the :ref:`custom-phenotype-scorer` above, ❸❹❺ provide metadata required for the bookkeeping. +The ❻ ``test`` method includes the most interesting part - we retrieve the :class:`~gpsea.model.TranscriptAnnotation` +with the functional annotation data for the transcript of interest, and we test if the HGVS protein indicates +that the reference aminoacid is glycine. +Last, we override ➐ ``__eq__()`` and ❽ ``__hash__()`` (required) as well as ❾ ``__repr__()`` and ➓ ``__str__()`` (recommended). diff --git a/docs/user-guide/img/TBX5_protein_diagram.png b/docs/user-guide/img/TBX5_protein_diagram.png index cbd4cbc02..10422c9ec 100644 Binary files a/docs/user-guide/img/TBX5_protein_diagram.png and b/docs/user-guide/img/TBX5_protein_diagram.png differ diff --git a/docs/user-guide/index.rst b/docs/user-guide/index.rst index 2a3c92cc3..23b9e9718 100644 --- a/docs/user-guide/index.rst +++ b/docs/user-guide/index.rst @@ -28,4 +28,5 @@ in an interactive Python environment, such as Jupyter notebook. input-data exploratory analyses/index + custom-components glossary diff --git a/docs/user-guide/various/autosomal_recessive.rst b/docs/user-guide/various/autosomal_recessive.rst deleted file mode 100644 index be160a20c..000000000 --- a/docs/user-guide/various/autosomal_recessive.rst +++ /dev/null @@ -1,24 +0,0 @@ -:orphan: - -.. _autosomal_recessive: - -=============================================================== -Genotype-Phenotype Correlations in Autosomal Recessive Diseases -=============================================================== - -In autosomal dominant diseases, the analysis of genotype-phenotype correlations is comparatively simple because -each affected individual carries only one disease-associated allele. While it would in principle be possible to -restrict analysis to homozygotes (who carry two identical copies of the disease-associated allele, one each on the maternal and paternal chromosome), -in practice compound heterozygotes are common and it is desirable to perform other comparisons. - -For instance, one might compare the three groups missense/missense (homozygotes) against missense/nonse (compound heterozygotes) - and nonsense/nonsense (homozygotes) (see for example `Li JT et al. (2022) ` for an example with - the *SUOX* gene). - - - -GPSEA implements a 6-field Fisher Exact Test to support this kind of analysis. - -Tutorial -^^^^^^^^ -TODO diff --git a/src/gpsea/__init__.py b/src/gpsea/__init__.py index 3423bd7b7..95aceea3e 100644 --- a/src/gpsea/__init__.py +++ b/src/gpsea/__init__.py @@ -2,7 +2,7 @@ GPSEA is a library for finding genotype-phenotype associations. """ -__version__ = "0.9.1" +__version__ = "0.9.2" _overwrite = False """ diff --git a/src/gpsea/analysis/_base.py b/src/gpsea/analysis/_base.py index 1725cf9e8..72aa54483 100644 --- a/src/gpsea/analysis/_base.py +++ b/src/gpsea/analysis/_base.py @@ -395,6 +395,11 @@ class MonoPhenotypeAnalysisResult(AnalysisResult, metaclass=abc.ABCMeta): that tested a single genotype-phenotype association. """ + SAMPLE_ID = "patient_id" + """ + Name of the data index. + """ + GT_COL = "genotype" """ Name of column for storing genotype data. diff --git a/src/gpsea/analysis/clf/__init__.py b/src/gpsea/analysis/clf/__init__.py index 3cf48eda7..671596bee 100644 --- a/src/gpsea/analysis/clf/__init__.py +++ b/src/gpsea/analysis/clf/__init__.py @@ -9,6 +9,7 @@ monoallelic_classifier, biallelic_classifier, allele_count, + random_classifier, ) from ._util import ( prepare_classifiers_for_terms_of_interest, @@ -28,6 +29,7 @@ "monoallelic_classifier", "biallelic_classifier", "allele_count", + "random_classifier", "PhenotypeClassifier", "PhenotypeCategorization", "P", diff --git a/src/gpsea/analysis/clf/_gt_classifiers.py b/src/gpsea/analysis/clf/_gt_classifiers.py index 51a4ce289..c7d608733 100644 --- a/src/gpsea/analysis/clf/_gt_classifiers.py +++ b/src/gpsea/analysis/clf/_gt_classifiers.py @@ -1,3 +1,4 @@ +import random import typing from collections import Counter @@ -640,3 +641,62 @@ def diagnosis_classifier( diagnoses=diagnoses, labels=labels, ) + + +class RandomClassifier(GenotypeClassifier): + + CATS = ( + Categorization( + category=PatientCategory( + cat_id=0, name="A", + ) + ), + Categorization( + category=PatientCategory( + cat_id=1, name="B", + ) + ) + ) + + def __init__( + self, + seed: typing.Optional[float] = None, + ): + self._rng = random.Random(x=seed) + + @property + def name(self) -> str: + return "Random Classifier" + + @property + def description(self) -> str: + return "Classify the individual into random classes" + + @property + def variable_name(self) -> str: + return "Randomness" + + def get_categorizations(self) -> typing.Sequence[Categorization]: + return RandomClassifier.CATS + + def test(self, _: Patient) -> typing.Optional[Categorization]: + return self._rng.choice(RandomClassifier.CATS) + + def __eq__(self, value: object) -> bool: + return isinstance(value, RandomClassifier) and self._rng == value._rng + + def __hash__(self) -> int: + return hash((self._rng, )) + + +def random_classifier( + seed: typing.Optional[float] = None, +) -> GenotypeClassifier: + """ + Genotype classifier to assign an individual into one of two classes, `A` and `B` on random.. + + :param seed: the seed for the random number generator. + """ + return RandomClassifier( + seed=seed, + ) diff --git a/src/gpsea/analysis/pscore/_api.py b/src/gpsea/analysis/pscore/_api.py index 72049f04e..27a4279ab 100644 --- a/src/gpsea/analysis/pscore/_api.py +++ b/src/gpsea/analysis/pscore/_api.py @@ -141,13 +141,15 @@ def plot_boxplots( self, ax, colors=("darksalmon", "honeydew"), + median_color: str = "black", ): """ Draw box plot with distributions of phenotype scores for the genotype groups. :param gt_predicate: the genotype predicate used to produce the genotype groups. :param ax: the Matplotlib :class:`~matplotlib.axes.Axes` to draw on. - :param colors: a tuple with colors to use for coloring the box patches of the box plot. + :param colors: a sequence with colors to use for coloring the box patches of the box plot. + :param median_color: a `str` with the color for the boxplot median line. """ # skip the patients with unassigned genotype group bla = self._data.notna() @@ -175,9 +177,13 @@ def plot_boxplots( tick_labels=gt_cat_names, ) + # Set face colors of the boxes for patch, color in zip(bplot["boxes"], colors): patch.set_facecolor(color) + for median in bplot['medians']: + median.set_color(median_color) + def __eq__(self, value: object) -> bool: return isinstance(value, PhenotypeScoreAnalysisResult) and super( MonoPhenotypeAnalysisResult, self diff --git a/src/gpsea/analysis/temporal/_api.py b/src/gpsea/analysis/temporal/_api.py index 0ecffec10..831a3365e 100644 --- a/src/gpsea/analysis/temporal/_api.py +++ b/src/gpsea/analysis/temporal/_api.py @@ -168,7 +168,10 @@ def compare_genotype_vs_survival( Execute the survival analysis on a given `cohort`. """ - idx = pd.Index((patient.patient_id for patient in cohort), name="patient_id") + idx = pd.Index( + (patient.patient_id for patient in cohort), + name=MonoPhenotypeAnalysisResult.SAMPLE_ID, + ) data = pd.DataFrame( None, index=idx, @@ -194,8 +197,14 @@ def compare_genotype_vs_survival( vals = tuple(survivals[gt_cat] for gt_cat in gt_clf.get_categorizations()) result = self._statistic.compute_pval(vals) if math.isnan(result.pval): + partial = { + MonoPhenotypeAnalysisResult.SAMPLE_ID: tuple(data.index), + "genotype": tuple(data[MonoPhenotypeAnalysisResult.GT_COL]), + "survival": tuple(data[MonoPhenotypeAnalysisResult.PH_COL]), + } + raise AnalysisException( - dict(data=data), + partial, "The survival values did not meet the expectation of the statistical test!", ) diff --git a/src/gpsea/model/_base.py b/src/gpsea/model/_base.py index bb8f27b5c..0a33fa060 100644 --- a/src/gpsea/model/_base.py +++ b/src/gpsea/model/_base.py @@ -1,7 +1,6 @@ import enum import typing -import hpotk class Sex(enum.Enum): diff --git a/src/gpsea/model/_cohort.py b/src/gpsea/model/_cohort.py index a8559fd05..f671bf48a 100644 --- a/src/gpsea/model/_cohort.py +++ b/src/gpsea/model/_cohort.py @@ -13,7 +13,7 @@ from ._variant import Variant, VariantInfo -I = typing.TypeVar('I', bound=hpotk.model.Identified) +IDENTIFIED = typing.TypeVar('IDENTIFIED', bound=hpotk.model.Identified) """ Anything that extends `Identified` (e.g. `Disease`, `Phenotype`, `Measurement`). """ @@ -289,8 +289,8 @@ def _check_id( @staticmethod def _find_first_by_id( term_id: hpotk.TermId, - items: typing.Iterable[I], - ) -> typing.Optional[I]: + items: typing.Iterable[IDENTIFIED], + ) -> typing.Optional[IDENTIFIED]: for m in items: if m.identifier == term_id: return m @@ -299,13 +299,13 @@ def _find_first_by_id( @staticmethod def _unique_identifiers_of_identified( - items: typing.Iterable[I], + items: typing.Iterable[IDENTIFIED], ) -> typing.Collection[hpotk.TermId]: return set(item.identifier for item in items) @staticmethod def _count_unique_identifiers( - items: typing.Iterable[I], + items: typing.Iterable[IDENTIFIED], ) -> int: return len(Patient._unique_identifiers_of_identified(items)) @@ -668,8 +668,8 @@ def _count_individuals_with_condition( def _iterate_through_items( self, - extract_items: typing.Callable[[Patient,], typing.Iterable[I]], - ) -> typing.Iterator[I]: + extract_items: typing.Callable[[Patient,], typing.Iterable[IDENTIFIED]], + ) -> typing.Iterator[IDENTIFIED]: return itertools.chain(item for individual in self._members for item in extract_items(individual)) def _get_most_common( @@ -689,7 +689,7 @@ def _get_most_common( @staticmethod def _count_distinct_items( - items: typing.Iterable[I], + items: typing.Iterable[IDENTIFIED], ) -> int: return len(set(item.identifier for item in items)) diff --git a/src/gpsea/model/_protein.py b/src/gpsea/model/_protein.py index 5c69e9527..4df41657a 100644 --- a/src/gpsea/model/_protein.py +++ b/src/gpsea/model/_protein.py @@ -83,7 +83,7 @@ def __repr__(self) -> str: def _deprecation_warning(): warnings.warn( - f"`FeatureType` was deprecated and will be removed prior `v1.0.0`. Use a `str` instead!", + "`FeatureType` was deprecated and will be removed prior `v1.0.0`. Use a `str` instead!", DeprecationWarning, ) diff --git a/src/gpsea/model/_test_gt.py b/src/gpsea/model/_test_gt.py index 1a033e75b..6d581406f 100644 --- a/src/gpsea/model/_test_gt.py +++ b/src/gpsea/model/_test_gt.py @@ -30,7 +30,7 @@ def test_iteration(self): assert len(labels) == len(gts) == len(genotypes) - assert all(l.label in ('A', 'C', 'D') for l in labels) + assert all(sample_labels.label in ('A', 'C', 'D') for sample_labels in labels) assert all( gt in (Genotype.HETEROZYGOUS, Genotype.HEMIZYGOUS, Genotype.HOMOZYGOUS_REFERENCE) for gt in gts diff --git a/src/gpsea/model/_variant_effects.py b/src/gpsea/model/_variant_effects.py index 9567592d4..082e54ae3 100644 --- a/src/gpsea/model/_variant_effects.py +++ b/src/gpsea/model/_variant_effects.py @@ -24,14 +24,15 @@ class VariantEffect(enum.Enum): 'SO:0001583' """ + TRANSCRIPT_AMPLIFICATION = "SO:0001889" TRANSCRIPT_ABLATION = "SO:0001893" + TRANSCRIPT_TRANSLOCATION = "SO:0001883" SPLICE_ACCEPTOR_VARIANT = "SO:0001574" SPLICE_DONOR_VARIANT = "SO:0001575" STOP_GAINED = "SO:0001587" FRAMESHIFT_VARIANT = "SO:0001589" STOP_LOST = "SO:0001578" START_LOST = "SO:0002012" - TRANSCRIPT_AMPLIFICATION = "SO:0001889" INFRAME_INSERTION = "SO:0001821" INFRAME_DELETION = "SO:0001822" MISSENSE_VARIANT = "SO:0001583" @@ -118,14 +119,15 @@ def __str__(self) -> str: effect_to_display = { + VariantEffect.TRANSCRIPT_AMPLIFICATION: "transcript amplification", VariantEffect.TRANSCRIPT_ABLATION: "transcript ablation", + VariantEffect.TRANSCRIPT_TRANSLOCATION: "transcript translocation", VariantEffect.SPLICE_ACCEPTOR_VARIANT: "splice acceptor", VariantEffect.SPLICE_DONOR_VARIANT: "splice donor", VariantEffect.STOP_GAINED: "stop gained", VariantEffect.FRAMESHIFT_VARIANT: "frameshift", VariantEffect.STOP_LOST: "stop lost", VariantEffect.START_LOST: "start lost", - VariantEffect.TRANSCRIPT_AMPLIFICATION: "transcript amplification", VariantEffect.INFRAME_INSERTION: "inframe insertion", VariantEffect.INFRAME_DELETION: "inframe deletion", VariantEffect.MISSENSE_VARIANT: "missense", diff --git a/src/gpsea/preprocessing/_config.py b/src/gpsea/preprocessing/_config.py index 82e204f05..1294eb5e6 100644 --- a/src/gpsea/preprocessing/_config.py +++ b/src/gpsea/preprocessing/_config.py @@ -401,7 +401,7 @@ def _configure_imprecise_sv_annotator( ): # Setup cache for SVs if cache_dir is not None: - sv_cache_dir = os.path.join(cache_dir, "sv_cache") + _sv_cache_dir = os.path.join(cache_dir, "sv_cache") # TODO: implement the cache. # os.makedirs(sv_cache_dir, exist_ok=True) # var_cache = VariantAnnotationCache(sv_cache_dir) @@ -495,7 +495,7 @@ def load_phenopackets( # Keep track of the progress by wrapping the list of phenopackets # with TQDM 😎 cohort_iter = tqdm( - phenopackets, desc="Individuals Processed", file=sys.stdout, unit="individuals" + phenopackets, desc="Individuals Processed", file=sys.stdout, unit=" individuals" ) notepad = create_notepad(label="Phenopackets") cohort = cohort_creator.process(cohort_iter, notepad) diff --git a/src/gpsea/preprocessing/_generic.py b/src/gpsea/preprocessing/_generic.py index 4364573b6..0c52e04a2 100644 --- a/src/gpsea/preprocessing/_generic.py +++ b/src/gpsea/preprocessing/_generic.py @@ -41,12 +41,14 @@ def annotate(self, item: ImpreciseSvInfo) -> typing.Sequence[TranscriptAnnotatio def _map_to_variant_effects( self, - variant_class: str, + variant_class: VariantClass, ) -> typing.Sequence[VariantEffect]: if variant_class == VariantClass.DEL: return (VariantEffect.TRANSCRIPT_ABLATION,) elif variant_class == VariantClass.DUP: return (VariantEffect.TRANSCRIPT_AMPLIFICATION,) + elif variant_class == VariantClass.TRANSLOCATION: + return (VariantEffect.TRANSCRIPT_TRANSLOCATION,) else: # This mapping is most likely incomplete. # Please open a ticket if support diff --git a/src/gpsea/preprocessing/_phenopacket.py b/src/gpsea/preprocessing/_phenopacket.py index c7d90fb8e..227d17640 100644 --- a/src/gpsea/preprocessing/_phenopacket.py +++ b/src/gpsea/preprocessing/_phenopacket.py @@ -128,6 +128,7 @@ def find_coordinates( f"{variation_descriptor.vcf_record.genome_assembly} is not {self._build.identifier}." ) contig = self._build.contig_by_name(variation_descriptor.vcf_record.chrom) + assert contig is not None start = int(variation_descriptor.vcf_record.pos) - 1 ref = variation_descriptor.vcf_record.ref alt = variation_descriptor.vcf_record.alt @@ -142,6 +143,7 @@ def find_coordinates( seq_location = variation.copy_number.allele.sequence_location refseq_contig_name = seq_location.sequence_id.split(":")[1] contig = self._build.contig_by_name(refseq_contig_name) + assert contig is not None # Assuming SV coordinates are 1-based (VCF style), # so we subtract 1 to transform to 0-based coordinate system @@ -591,7 +593,11 @@ def _extract_age( ) -> typing.Optional[Age]: if individual.HasField("time_at_last_encounter"): tale = individual.time_at_last_encounter - return parse_age_element(tale, notepad) + return parse_age_element( + 'time_at_last_encounter', + tale, + notepad, + ) return None @staticmethod @@ -614,6 +620,7 @@ def _extract_vital_status( if vital_status.HasField("time_of_death"): age_of_death = parse_age_element( + 'time_of_death', time_element=vital_status.time_of_death, notepad=notepad, ) @@ -904,21 +911,27 @@ def parse_onset_element( We allow to use `GestationalAge`, `Age` or `OntologyClass` as onset. """ case = time_element.WhichOneof("element") - if case == "ontology_class": + if case == "age": + age = time_element.age + return Age.from_iso8601_period(value=age.iso8601duration) + elif case == "gestational_age": + age = time_element.gestational_age + return Age.gestational(weeks=age.weeks, days=age.days) + elif case == "ontology_class": if term_onset_parser is None: return None else: return term_onset_parser.process( ontology_class=time_element.ontology_class, notepad=notepad, - ) + ) else: - return parse_age_element( - time_element=time_element, - notepad=notepad, - ) + notepad.add_warning(f"`time_element` is in currently unsupported format `{case}`") + return None + def parse_age_element( + field: str, time_element: PPTimeElement, notepad: Notepad, ) -> typing.Optional[Age]: @@ -933,5 +946,8 @@ def parse_age_element( age = time_element.age return Age.from_iso8601_period(value=age.iso8601duration) else: - notepad.add_warning(f"`time_element` is in currently unsupported format `{case}`") + notepad.add_warning( + f"{case} of the {field} field cannot be parsed into age", + "Consider formatting the age as ISO8601 duration (e.g., \"P31Y2M\" for 31 years and 2 months)" + ) return None diff --git a/src/gpsea/preprocessing/_vep.py b/src/gpsea/preprocessing/_vep.py index 7c9596027..4ec2cb0bb 100644 --- a/src/gpsea/preprocessing/_vep.py +++ b/src/gpsea/preprocessing/_vep.py @@ -186,7 +186,7 @@ def format_coordinates_for_vep_query(vc: VariantCoordinates) -> str: # TODO: Verify are working correctly else: if len(vc.ref) == 0 or len(vc.alt) == 0: - raise ValueError(f'Trimmed alleles are not yet supported!') + raise ValueError('Trimmed alleles are not yet supported!') if len(vc.ref) == 1 and len(vc.alt) != 1: # INS/DUP start = start + 1 # we must "trim" diff --git a/src/gpsea/view/_draw_variants.py b/src/gpsea/view/_draw_variants.py index a98ea9847..a8cd27ba1 100644 --- a/src/gpsea/view/_draw_variants.py +++ b/src/gpsea/view/_draw_variants.py @@ -36,7 +36,7 @@ def _calc_aa_based_pos(pos_bases, tx_coordinates): :param exons: exon positions """ print(f'{pos_bases=}') - exons, cds_start, cds_end = tx_coordinates.exons, tx_coordinates.cds_start, tx_coordinates.cds_end + exons, _cds_start, _cds_end = tx_coordinates.exons, tx_coordinates.cds_start, tx_coordinates.cds_end num_nt = 0 @@ -325,7 +325,7 @@ def draw_fig(self, tx_coordinates: TranscriptCoordinates, protein_meta: ProteinM # get minimum position on chromosome for all transcripts min_exon_limit = np.min(exon_limits) feature_limits = np.array([(feature.info.start, feature.info.end) for feature in protein_meta.protein_features]) - feature_types = [pf.feature_type for pf in protein_meta.protein_features] + _feature_types = [pf.feature_type for pf in protein_meta.protein_features] feature_limits = (feature_limits * 3) - 2 + min_exon_limit # to convert from codons to bases variant_locations = list() for ann in tx_anns: @@ -334,11 +334,11 @@ def draw_fig(self, tx_coordinates: TranscriptCoordinates, protein_meta: ProteinM if prot_eff_loc is not None: variant_locations.append([prot_eff_loc.start, prot_eff_loc.end]) variant_locations = np.array(variant_locations) - variant_effects = np.array([(ann.variant_effects[0]) for ann in tx_anns]) + _variant_effects = np.array([(ann.variant_effects[0]) for ann in tx_anns]) exon_labels = [f'{i + 1}' for i in range(len(exon_limits))] protein_track_x_min, protein_track_x_max = 0.15, 0.85 - protein_track_y_min, protein_track_y_max = 0.492, 0.508 + protein_track_y_min, _protein_track_y_max = 0.492, 0.508 exon_y_min, exon_y_max = 0.39, 0.43 font_size = 12 text_padding = 0.004 @@ -364,7 +364,7 @@ def preprocess(x_absolute): # x_axis x_axis_y = protein_track_y_min - 0.02 x_axis_min_x, x_axis_max_x = protein_track_x_min, protein_track_x_max - big_tick_length, small_tick_length = 0.01, 0.005 + big_tick_length, _small_tick_length = 0.01, 0.005 draw_line(x_axis_min_x, x_axis_y, x_axis_max_x, x_axis_y, line_color=self.axis_color, line_width=1.0) # main line draw_line(x_axis_min_x, x_axis_y - big_tick_length, x_axis_min_x, x_axis_y, line_color=self.axis_color, diff --git a/src/gpsea/view/_protein_visualizable.py b/src/gpsea/view/_protein_visualizable.py index 40ed0ebfe..babac8bed 100644 --- a/src/gpsea/view/_protein_visualizable.py +++ b/src/gpsea/view/_protein_visualizable.py @@ -1,16 +1,24 @@ import typing -from gpsea.model import * +from gpsea.model import ( + Cohort, + ProteinMetadata, + TranscriptAnnotation, + TranscriptCoordinates, + Variant, + VariantEffect, +) import numpy as np +from gpsea.model.genome._genome import Region -class ProteinVisualizable: +class ProteinVisualizable: def __init__( - self, - tx_coordinates: TranscriptCoordinates, - protein_meta: ProteinMetadata, - cohort: Cohort, + self, + tx_coordinates: TranscriptCoordinates, + protein_meta: ProteinMetadata, + cohort: Cohort, ) -> None: self._tx_coordinates = tx_coordinates self._protein_meta = protein_meta @@ -19,7 +27,7 @@ def __init__( transcript_annotations = ProteinVisualizable._get_tx_anns( cohort.all_variants(), self._tx_coordinates.identifier ) - self._variant_regions_on_protein = list() + variant_regions_on_protein: typing.List[Region] = list() self._variant_effect = list() for tx_ann in transcript_annotations: variant_effects = tx_ann.variant_effects @@ -27,7 +35,7 @@ def __init__( continue prot_eff_loc = tx_ann.protein_effect_location if prot_eff_loc is not None: - self._variant_regions_on_protein.append(prot_eff_loc) + variant_regions_on_protein.append(prot_eff_loc) self._variant_effect.append(variant_effects[0]) self._protein_feature_names = list() @@ -40,13 +48,17 @@ def __init__( self._protein_feature_starts.append(feature.info.start) self._protein_feature_ends.append(feature.info.end) - self._variant_locations = np.array([item.start for item in self._variant_regions_on_protein]) + self._variant_locations = np.array( + [item.start for item in variant_regions_on_protein] + ) - #variant_locations = (variant_locations * 3) - 2 + min_exon_limit # to convert from codons to bases - #variant_effects = np.array([(ann.variant_effects[0]) for ann in tx_anns]) + # variant_locations = (variant_locations * 3) - 2 + min_exon_limit # to convert from codons to bases + # variant_effects = np.array([(ann.variant_effects[0]) for ann in tx_anns]) # count marker occurrences and remove duplicates self._variant_locations_counted_absolute, self._marker_counts = np.unique( - self._variant_locations, axis=0, return_counts=True, + self._variant_locations, + axis=0, + return_counts=True, ) if protein_meta.protein_length > 0: @@ -59,8 +71,8 @@ def __init__( @staticmethod def _get_tx_anns( - variants: typing.Iterable[Variant], - tx_id: str, + variants: typing.Iterable[Variant], + tx_id: str, ) -> typing.Sequence[TranscriptAnnotation]: """ By default, the API returns transcript annotations for many transcripts. @@ -74,7 +86,9 @@ def _get_tx_anns( tx_ann = ann break if tx_ann is None: - raise ValueError(f'The transcript annotation for {tx_id} was not found!') + raise ValueError( + f"The transcript annotation for {tx_id} was not found!" + ) else: tx_anns.append(tx_ann) @@ -103,7 +117,7 @@ def protein_feature_starts(self) -> typing.Sequence[int]: @property def protein_feature_ends(self) -> typing.Sequence[int]: return self._protein_feature_ends - + @property def protein_feature_types(self) -> typing.Sequence[str]: return self._protein_feature_types @@ -129,7 +143,7 @@ def protein_length(self) -> int: @property def protein_feature_names(self) -> typing.Sequence[str]: return self._protein_feature_names - + @property def variant_effects(self) -> typing.Sequence[VariantEffect]: return self._variant_effect diff --git a/src/gpsea/view/_txp.py b/src/gpsea/view/_txp.py index de9bb351d..eda97bc01 100644 --- a/src/gpsea/view/_txp.py +++ b/src/gpsea/view/_txp.py @@ -2,9 +2,7 @@ from collections import defaultdict from matplotlib import pyplot as plt from matplotlib.patches import Rectangle -from matplotlib.collections import PatchCollection from matplotlib.lines import Line2D -import typing from gpsea.model import Variant, TranscriptCoordinates, ProteinMetadata @@ -30,15 +28,15 @@ def draw_variants(self, variants: typing.Iterable[Variant], tx: TranscriptCoordinates, protein: ProteinMetadata): title = f"{protein.protein_id} ({protein.label})" - fig, ax = plt.subplots(1, figsize=(10, 10)) + _, ax = plt.subplots(1, figsize=(10, 10)) protein_domains = set() - THRESHOLD = 2 + _THRESHOLD = 2 BOTTOM_MARGIN = 20 amino_acid_len = tx.get_codon_count() # draw a box that is ten aax tall, where aax is the dimension of one amino acid prot_start = get_interpolated_location_in_protein(1, amino_acid_len) prot_end = get_interpolated_location_in_protein(amino_acid_len, amino_acid_len) - box_height = 10/amino_acid_len + _box_height = 10/amino_acid_len prot_width = prot_end - prot_start + 1 protein_height = prot_width/20 #rect = Rectangle((prot_start, BOTTOM_MARGIN), prot_width, protein_height) @@ -59,11 +57,11 @@ def draw_variants(self, variants: typing.Iterable[Variant], hgvs_cdna = hgvs variant_effects = tx_annot.variant_effects if len(variant_effects) > 1: - var_effect = "MULTIPLE" + _var_effect = "MULTIPLE" elif len(variant_effects) == 0: - var_effect = "UNKNOWN" + _var_effect = "UNKNOWN" else: - var_effect = variant_effects[0].name + _var_effect = variant_effects[0].name for p in tx_annot.protein_affected: for f in p.domains(): protein_domains.add(f.info) @@ -92,7 +90,7 @@ def draw_variants(self, variants: typing.Iterable[Variant], start = feature.start end = feature.end #print(name, start, end, box_color) - box_height = 10/amino_acid_len + _box_height = 10/amino_acid_len prot_width = prot_end - prot_start + 1 protein_height = prot_width/20 #rect = Rectangle((prot_start, BOTTOM_MARGIN), prot_width, protein_height) diff --git a/tests/analysis/pscore/test_de_vries_scorer.py b/tests/analysis/pscore/test_de_vries_scorer.py index a96384c7f..e62d5edbf 100644 --- a/tests/analysis/pscore/test_de_vries_scorer.py +++ b/tests/analysis/pscore/test_de_vries_scorer.py @@ -4,7 +4,7 @@ import pytest from gpsea.analysis.pscore import DeVriesPhenotypeScorer -from gpsea.model import Patient, SampleLabels, Phenotype, Sex +from gpsea.model import Patient, Phenotype intrauterine_growth_retardation = 'HP:0001511' small_for_gestational_age = 'HP:0001518' diff --git a/tests/preprocessing/test_phenopacket.py b/tests/preprocessing/test_phenopacket.py index 3e066efce..5503bd065 100644 --- a/tests/preprocessing/test_phenopacket.py +++ b/tests/preprocessing/test_phenopacket.py @@ -11,6 +11,7 @@ from phenopackets.schema.v2.core.interpretation_pb2 import GenomicInterpretation from phenopackets.schema.v2.phenopackets_pb2 import Phenopacket +from gpsea.model import VariantClass from gpsea.model.genome import GenomeBuild, Strand from gpsea.preprocessing import VVHgvsVariantCoordinateFinder @@ -138,7 +139,7 @@ def read_genomic_interpretation_json(fpath: str) -> GenomicInterpretation: class TestPhenopacketPatientCreator: - @pytest.fixture + @pytest.fixture(scope="class") def functional_annotator( self, fpath_project_dir: str, @@ -152,7 +153,7 @@ def functional_annotator( cache_dir=fpath_variant_cache_dir, ) - @pytest.fixture + @pytest.fixture(scope="class") def imprecise_sv_functional_annotator( self, genome_build: GenomeBuild, @@ -163,7 +164,7 @@ def imprecise_sv_functional_annotator( ), ) - @pytest.fixture + @pytest.fixture(scope="class") def variant_coordinate_finder( self, genome_build: GenomeBuild, @@ -176,7 +177,7 @@ def variant_coordinate_finder( def onset_term_parser(self) -> PhenopacketOntologyTermOnsetParser: return PhenopacketOntologyTermOnsetParser.default_parser() - @pytest.fixture + @pytest.fixture(scope="class") def patient_creator( self, hpo: hpotk.MinimalOntology, @@ -197,7 +198,7 @@ def patient_creator( term_onset_parser=onset_term_parser, ) - @pytest.fixture + @pytest.fixture(scope="class") def phenopacket( self, fpath_phenopacket_dir: str, @@ -299,6 +300,39 @@ def test_phenopacket_patient_creator( assert disease.onset.is_postnatal is True assert disease.onset.days == pytest.approx(20.) + # variants + assert len(patient.variants) == 3 + snp = patient.variants[0] + assert snp.variant_info.has_variant_coordinates() + snp_vc = snp.variant_info.variant_coordinates + assert snp_vc is not None + assert snp_vc.chrom == "6" + assert snp_vc.start == 32_040_420 + assert snp_vc.end == 32_040_421 + assert snp_vc.ref == "C" + assert snp_vc.alt == "T" + assert snp_vc.change_length == 0 + assert snp_vc.variant_class == VariantClass.SNV + + imprecise_sv = patient.variants[1] + assert imprecise_sv.variant_info.has_sv_info() + sv_vi = imprecise_sv.variant_info.sv_info + assert sv_vi is not None + assert sv_vi.gene_id == "HGNC:2600" + assert sv_vi.gene_symbol == "CYP21A2" + assert sv_vi.structural_type.value == "SO:1000029" # `chromosomal_deletion` + assert sv_vi.variant_class == VariantClass.DEL + + imprecise_tra = patient.variants[2] + assert imprecise_tra.variant_info.has_sv_info() + tra_vi = imprecise_tra.variant_info.sv_info + assert tra_vi is not None + assert tra_vi.gene_id == "HGNC:24650" + assert tra_vi.gene_symbol == "EHMT1" + assert tra_vi.structural_type.value == "SO:1000044" # `chromosomal_translocation` + assert tra_vi.variant_class == VariantClass.TRANSLOCATION + + def test_individual_with_no_genotype( self, phenopacket: Phenopacket, diff --git a/tests/test_data/phenopackets/PMID_30968594_individual_1.json b/tests/test_data/phenopackets/PMID_30968594_individual_1.json index 712366691..32650f033 100644 --- a/tests/test_data/phenopackets/PMID_30968594_individual_1.json +++ b/tests/test_data/phenopackets/PMID_30968594_individual_1.json @@ -266,6 +266,29 @@ } } } + }, + { + "subjectOrBiosampleId": "individual 1", + "interpretationStatus": "CAUSATIVE", + "variantInterpretation": { + "variationDescriptor": { + "id": "var_TerBgCxXbYQtjsIsIjzgpliJm", + "label": "translocation: t(9;15)(q34.1;q13) breakpoint in EHMT1", + "geneContext": { + "valueId": "HGNC:24650", + "symbol": "EHMT1" + }, + "moleculeContext": "genomic", + "structuralType": { + "id": "SO:1000044", + "label": "chromosomal_translocation" + }, + "allelicState": { + "id": "GENO:0000135", + "label": "heterozygous" + } + } + } } ] } diff --git a/tests/test_data/phenopackets/README.md b/tests/test_data/phenopackets/README.md index e4a58e595..5f0788ea7 100644 --- a/tests/test_data/phenopackets/README.md +++ b/tests/test_data/phenopackets/README.md @@ -5,4 +5,4 @@ The phenopackets used for testing. ## `PMID_30968594_individual_1.json` A phenopacket with individual characteristics, phenotype features, measurements, -interpretations, and diseases. +interpretations (a SNP, an imprecise SV, and a translocation), and diseases.