Skip to content

Commit d0f69d0

Browse files
authored
Merge pull request #398 from monarch-initiative/fix-bugs
Fix various bugs
2 parents dd3dcc4 + bf691ef commit d0f69d0

File tree

9 files changed

+209
-21
lines changed

9 files changed

+209
-21
lines changed

README.md

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -5,10 +5,9 @@
55

66
GPSEA (Genotypes and Phenotypes - Statistical Evaluation of Associations) is a Python package for finding genotype-phenotype associations.
77

8-
9-
See the [Tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html)
10-
and a comprehensive [User guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html)
11-
for more information.
8+
See our documentation for the [setup](https://monarch-initiative.github.io/gpsea/stable/setup.html) instructions,
9+
a [tutorial](https://monarch-initiative.github.io/gpsea/stable/tutorial.html) with an end-to-end genotype-phenotype association analysis,
10+
and a comprehensive [user guide](https://monarch-initiative.github.io/gpsea/stable/user-guide/index.html) with everything else.
1211

1312
The documentation comes in two flavors:
1413

docs/setup.rst

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -4,8 +4,8 @@
44
Setup
55
#####
66

7-
Here we show how to install GPSEA and to prepare your Python environment
8-
for genotype-phenotype association analysis.
7+
Here we show how to install GPSEA and prepare your Python environment
8+
for genotype-phenotype association analyses.
99

1010

1111
.. contents:: Table of Contents

docs/user-guide/analyses/partitioning/genotype/variant_predicates.rst

Lines changed: 9 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -185,12 +185,16 @@ The builtin predicates should cover majority of use cases.
185185
However, if a predicate seems to be missing,
186186
feel free to submit an issue in our
187187
`GitHub tracker <https://github.com/monarch-initiative/gpsea/issues>`_,
188-
or to implement a custom predicate
189-
by extending the :class:`~gpsea.analysis.predicate.VariantPredicate` class 😎.
188+
or implement your own predicate by following the :ref:`custom-variant-predicate`
189+
guide.
190190

191191

192+
****
193+
Next
194+
****
192195

193196
The variant predicate offers a flexible API for testing if variants meet a condition.
194-
However, the genotype phenotype correlations are done on the individual level
195-
and the variant predicates are used as a component of the genotype predicate.
196-
The next sections show how to use variant predicates to assign individuals into groups.
197+
However, the genotype phenotype correlations are studied on the level of individuals.
198+
As described in :ref:`genotype-classifiers`, GPSEA uses the :class:`~gpsea.analysis.clf.GenotypeClassifier` API
199+
to assign individuals into non-overlapping classes. Variant predicates are essential for creating such classifier.
200+
We explain the details in the following sections.

docs/user-guide/analyses/phenotype-scores.rst

Lines changed: 20 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -156,6 +156,26 @@ Phenotype score
156156
This component is responsible for computing a phenotype score for an individual.
157157
As far as GPSEA framework is concerned, the phenotype score must be a floating point number
158158
or a `NaN` value if the score cannot be computed for an individual.
159+
This is the essence of the :class:`~gpsea.analysis.pscore.PhenotypeScorer` class.
160+
161+
GPSEA ships with several builtin phenotype scorers which can be used as
162+
163+
+------------------------------------------------------------+---------------------------------------------+
164+
| Name | Description |
165+
+============================================================+=============================================+
166+
| | Compute the total number of occurrences |
167+
| * :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` | of specific phenotypic features |
168+
| | (used in this section) |
169+
+------------------------------------------------------------+---------------------------------------------+
170+
| | Compute the "adapted De Vries Score" |
171+
| * :class:`~gpsea.analysis.pscore.DeVriesPhenotypeScorer` | for assessing severity |
172+
| | of intellectual disability |
173+
+------------------------------------------------------------+---------------------------------------------+
174+
175+
.. tip::
176+
177+
See :ref:`custom-phenotype-scorer` section to learn how to build a phenotype scorer from scratch.
178+
159179

160180
Here we use the :class:`~gpsea.analysis.pscore.CountingPhenotypeScorer` for scoring
161181
the individuals based on the number of structural defects
Lines changed: 142 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,142 @@
1+
.. _custom-components:
2+
3+
#################
4+
Custom components
5+
#################
6+
7+
GPSEA aims to stay useful in the long run. Therefore, we took a great care
8+
to adhere to "good software development practices" and we framed GPSEA functionalities
9+
as a set of loosely coupled components. As a rule of thumb, each component corresponds
10+
to a Python abstract base class which is then extended by the builtin components
11+
and can also be extended by the future components, to serve both common or exotic use cases.
12+
13+
The abstract base classes define the component API.
14+
Per guidelines in Python's :mod:`abc` module, the abstract classes use :class:`abc.ABCMeta` as a metaclass
15+
and the class API consists of methods annotated with the :func:`abc.abstractmethod` decorator.
16+
These decorated methods must be overridden in the subclasses.
17+
18+
The following sections provide guidance for customizing the most commonly used GPSEA components.
19+
20+
21+
.. _custom-phenotype-scorer:
22+
23+
****************
24+
Phenotype scorer
25+
****************
26+
27+
:class:`~gpsea.analysis.pscore.PhenotypeScorer` computes a phenotype score for an individual.
28+
The phenotype score is a `float` with range :math:`(-\infty, \infty)` where `NaN` indicates
29+
that a score cannot be computed (e.g. the lab measurement value was not ascertained for the individual).
30+
31+
Here we show an example of a toy phenotype scorer
32+
for using body mass index (BMI) as a phenotype score.
33+
34+
>>> import typing
35+
>>> from gpsea.model import Patient
36+
>>> from gpsea.analysis.pscore import PhenotypeScorer
37+
>>> class BmiScorer(PhenotypeScorer): #
38+
...
39+
... def __init__( #
40+
... self,
41+
... id2bmi: typing.Mapping[str, float],
42+
... ):
43+
... self._id2bmi = id2bmi
44+
...
45+
... @property
46+
... def name(self) -> str: #
47+
... return "BMI phenotype scorer"
48+
...
49+
... @property
50+
... def description(self) -> str: #
51+
... return "Body mass index used as a phenotype score"
52+
...
53+
... @property
54+
... def variable_name(self) -> str: #
55+
... return "BMI"
56+
...
57+
... def score(self, patient: Patient) -> float: #
58+
... try:
59+
... return self._id2bmi[patient.labels.label]
60+
... except KeyError:
61+
... return float('nan')
62+
63+
❶ The ``BmiScorer`` must extend :class:`~gpsea.analysis.pscore.PhenotypeScorer`
64+
to be used as a phenotype scorer.
65+
❷ The scorer needs a ``dict`` with `label` → `BMI` for the analyzed individuals.
66+
We assume the user will pre-compute the corresponding ``dict``.
67+
68+
Then, the scorer must expose several properties, including ❸ ``name``, ❹ ``description``,
69+
and the ❺ ``variable_name`` it operates on.
70+
The properties provide bookkeeping metadata to use in e.g. visualizations.
71+
Try to choose short and concise names.
72+
73+
The most important part of the scorer is the ❻ `score` method
74+
which retrieves the BMI for an individual or returns `NaN` if the value is not available
75+
and the individual should be omitted from the analysis.
76+
77+
.. _custom-variant-predicate:
78+
79+
*****************
80+
Variant predicate
81+
*****************
82+
83+
The purpose of a :class:`~gpsea.analysis.predicate.VariantPredicate` is to test
84+
if a variant meets a certain criterion and GPSEA ships with an array
85+
of builtin predicates (see :mod:`gpsea.analysis.predicate` module).
86+
However, chances are a custom predicate will be needed in future,
87+
so we show how to how to extend
88+
the :class:`~gpsea.analysis.predicate.VariantPredicate` class
89+
to create one's own predicate.
90+
91+
Specifically, we show how to create a predicate to test if the variant affects a glycine residue
92+
of the transcript of interest.
93+
94+
>>> from gpsea.model import Variant, VariantEffect
95+
>>> from gpsea.analysis.predicate import VariantPredicate
96+
>>> class AffectsGlycinePredicate(VariantPredicate): #
97+
... def __init__( #
98+
... self,
99+
... tx_id: str,
100+
... ):
101+
... self._tx_id = tx_id
102+
... self._aa_code = "Gly"
103+
...
104+
... @property
105+
... def name(self) -> str: #
106+
... return "Affects Glycine"
107+
...
108+
... @property
109+
... def description(self) -> str: #
110+
... return "affects a glycine residue"
111+
...
112+
... @property
113+
... def variable_name(self) -> str: #
114+
... return "affected aminoacid residue"
115+
...
116+
... def test(self, variant: Variant) -> bool: #
117+
... tx_ann = variant.get_tx_anno_by_tx_id(self._tx_id)
118+
... if tx_ann is not None:
119+
... hgvsp = tx_ann.hgvsp
120+
... if hgvsp is not None:
121+
... return hgvsp.startswith(f"p.{self._aa_code}")
122+
... return False
123+
...
124+
... def __eq__(self, value: object) -> bool: #
125+
... return isinstance(value, AffectsGlycinePredicate) and self._tx_id == value._tx_id
126+
...
127+
... def __hash__(self) -> int: #
128+
... return hash((self._tx_id,))
129+
...
130+
... def __repr__(self) -> str: #
131+
... return str(self)
132+
...
133+
... def __str__(self) -> str: #
134+
... return f"AffectsGlycinePredicate(tx_id={self._tx_id})"
135+
136+
❶ The ``AffectsGlycinePredicate`` must extend :class:`~gpsea.analysis.predicate.VariantPredicate`.
137+
❷ We ask the user to provide the transcript accession `str` and we set the target aminoacid code to glycine ``Gly``.
138+
Like in the :ref:`custom-phenotype-scorer` above, ❸❹❺ provide metadata required for the bookkeeping.
139+
The ❻ ``test`` method includes the most interesting part - we retrieve the :class:`~gpsea.model.TranscriptAnnotation`
140+
with the functional annotation data for the transcript of interest, and we test if the HGVS protein indicates
141+
that the reference aminoacid is glycine.
142+
Last, we override ➐ ``__eq__()`` and ❽ ``__hash__()`` (required) as well as ❾ ``__repr__()`` and ➓ ``__str__()`` (recommended).

docs/user-guide/index.rst

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,4 +28,5 @@ in an interactive Python environment, such as Jupyter notebook.
2828
input-data
2929
exploratory
3030
analyses/index
31+
custom-components
3132
glossary

src/gpsea/analysis/pscore/_api.py

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -141,13 +141,15 @@ def plot_boxplots(
141141
self,
142142
ax,
143143
colors=("darksalmon", "honeydew"),
144+
median_color: str = "black",
144145
):
145146
"""
146147
Draw box plot with distributions of phenotype scores for the genotype groups.
147148
148149
:param gt_predicate: the genotype predicate used to produce the genotype groups.
149150
:param ax: the Matplotlib :class:`~matplotlib.axes.Axes` to draw on.
150-
:param colors: a tuple with colors to use for coloring the box patches of the box plot.
151+
:param colors: a sequence with colors to use for coloring the box patches of the box plot.
152+
:param median_color: a `str` with the color for the boxplot median line.
151153
"""
152154
# skip the patients with unassigned genotype group
153155
bla = self._data.notna()
@@ -175,9 +177,13 @@ def plot_boxplots(
175177
tick_labels=gt_cat_names,
176178
)
177179

180+
# Set face colors of the boxes
178181
for patch, color in zip(bplot["boxes"], colors):
179182
patch.set_facecolor(color)
180183

184+
for median in bplot['medians']:
185+
median.set_color(median_color)
186+
181187
def __eq__(self, value: object) -> bool:
182188
return isinstance(value, PhenotypeScoreAnalysisResult) and super(
183189
MonoPhenotypeAnalysisResult, self

src/gpsea/preprocessing/_config.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -495,7 +495,7 @@ def load_phenopackets(
495495
# Keep track of the progress by wrapping the list of phenopackets
496496
# with TQDM 😎
497497
cohort_iter = tqdm(
498-
phenopackets, desc="Individuals Processed", file=sys.stdout, unit="individuals"
498+
phenopackets, desc="Individuals Processed", file=sys.stdout, unit=" individuals"
499499
)
500500
notepad = create_notepad(label="Phenopackets")
501501
cohort = cohort_creator.process(cohort_iter, notepad)

src/gpsea/preprocessing/_phenopacket.py

Lines changed: 24 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -128,6 +128,7 @@ def find_coordinates(
128128
f"{variation_descriptor.vcf_record.genome_assembly} is not {self._build.identifier}."
129129
)
130130
contig = self._build.contig_by_name(variation_descriptor.vcf_record.chrom)
131+
assert contig is not None
131132
start = int(variation_descriptor.vcf_record.pos) - 1
132133
ref = variation_descriptor.vcf_record.ref
133134
alt = variation_descriptor.vcf_record.alt
@@ -142,6 +143,7 @@ def find_coordinates(
142143
seq_location = variation.copy_number.allele.sequence_location
143144
refseq_contig_name = seq_location.sequence_id.split(":")[1]
144145
contig = self._build.contig_by_name(refseq_contig_name)
146+
assert contig is not None
145147

146148
# Assuming SV coordinates are 1-based (VCF style),
147149
# so we subtract 1 to transform to 0-based coordinate system
@@ -591,7 +593,11 @@ def _extract_age(
591593
) -> typing.Optional[Age]:
592594
if individual.HasField("time_at_last_encounter"):
593595
tale = individual.time_at_last_encounter
594-
return parse_age_element(tale, notepad)
596+
return parse_age_element(
597+
'time_at_last_encounter',
598+
tale,
599+
notepad,
600+
)
595601
return None
596602

597603
@staticmethod
@@ -614,6 +620,7 @@ def _extract_vital_status(
614620

615621
if vital_status.HasField("time_of_death"):
616622
age_of_death = parse_age_element(
623+
'time_of_death',
617624
time_element=vital_status.time_of_death,
618625
notepad=notepad,
619626
)
@@ -904,21 +911,27 @@ def parse_onset_element(
904911
We allow to use `GestationalAge`, `Age` or `OntologyClass` as onset.
905912
"""
906913
case = time_element.WhichOneof("element")
907-
if case == "ontology_class":
914+
if case == "age":
915+
age = time_element.age
916+
return Age.from_iso8601_period(value=age.iso8601duration)
917+
elif case == "gestational_age":
918+
age = time_element.gestational_age
919+
return Age.gestational(weeks=age.weeks, days=age.days)
920+
elif case == "ontology_class":
908921
if term_onset_parser is None:
909922
return None
910923
else:
911924
return term_onset_parser.process(
912925
ontology_class=time_element.ontology_class,
913926
notepad=notepad,
914-
)
927+
)
915928
else:
916-
return parse_age_element(
917-
time_element=time_element,
918-
notepad=notepad,
919-
)
929+
notepad.add_warning(f"`time_element` is in currently unsupported format `{case}`")
930+
return None
931+
920932

921933
def parse_age_element(
934+
field: str,
922935
time_element: PPTimeElement,
923936
notepad: Notepad,
924937
) -> typing.Optional[Age]:
@@ -933,5 +946,8 @@ def parse_age_element(
933946
age = time_element.age
934947
return Age.from_iso8601_period(value=age.iso8601duration)
935948
else:
936-
notepad.add_warning(f"`time_element` is in currently unsupported format `{case}`")
949+
notepad.add_warning(
950+
f"{case} of the {field} field cannot be parsed into age",
951+
"Consider formatting the age as ISO8601 duration (e.g., \"P31Y2M\" for 31 years and 2 months)"
952+
)
937953
return None

0 commit comments

Comments
 (0)