Commit b4bb9d4

Add SPDI normalization with refseq files

1 parent: f584fc9

28 files changed: +998 −393 lines

.env

Lines changed: 1 addition & 0 deletions

```diff
@@ -1,2 +1,3 @@
 UTILITIES_DATA_VERSION=113c119
+UTA_DATABASE_SCHEMA=uta_20240523b
 PYARD_DATABASE_VERSION=3580
```
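Both `.env` and the `secrets.env` file described in the README use simple `KEY=VALUE` lines. As a minimal sketch of how such a file can be parsed (a hypothetical helper; the repo itself may load these via `source` or a dotenv library instead):

```python
import os

def load_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

# Parse the same variables that appear in .env above
env = load_env(
    "UTILITIES_DATA_VERSION=113c119\n"
    "UTA_DATABASE_SCHEMA=uta_20240523b\n"
    "PYARD_DATABASE_VERSION=3580\n"
)
print(env["UTA_DATABASE_SCHEMA"])  # uta_20240523b
os.environ.update(env)  # make the values visible to the process
```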

.github/workflows/cicd.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -33,6 +33,7 @@ jobs:
         run: ./fetch_utilities_data.sh && python -m pytest
         env:
           MONGODB_READONLY_PASSWORD: ${{ secrets.MONGODB_READONLY_PASSWORD }}
+          UTA_DATABASE_URL: ${{ secrets.UTA_DATABASE_URL }}
 
   deploy:
     name: Deploy to dev
```

.gitignore

Lines changed: 4 additions & 4 deletions

```diff
@@ -4,8 +4,8 @@
 .pytest_cache
 __pycache__
 .venv
-utilities/FASTA
-utilities/mongo_utilities.py
-/data
 secrets.env
-app/temp.py
+/data
+/seqrepo
+/tmp
+/utilities/FASTA
```

.vscode/settings.json

Lines changed: 0 additions & 3 deletions

```diff
@@ -10,9 +10,6 @@
     "python.testing.pytestArgs": [
         "."
     ],
-    "[python]": {
-        "editor.defaultFormatter": "ms-python.autopep8",
-    },
     "autopep8.args": [
         "--max-line-length=200"
    ],
```

README.md

Lines changed: 147 additions & 8 deletions

````diff
@@ -42,17 +42,156 @@ The operations return the following status codes:
 
 ## Testing
 
-To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code Testing functionality which should discover them automatically. You can also
-run `python3 -m pytest` from the terminal to execute them all.
+For local development, you will have to create a `secrets.env` file in the root of the repo and add to it the MongoDB
+password and the UTA Postgres database connection string (see the UTA section below for details):
+
+```
+MONGODB_READONLY_PASSWORD=...
+UTA_DATABASE_URL=...
+```
+
+Then, you will need to run `fetch_utilities_data.sh` in a terminal to fetch the required data files:
+
+```shell
+$ ./fetch_utilities_data.sh
+```
+
+To run the [integration tests](https://github.com/FHIR/genomics-operations/tree/main/tests), you can use the VS Code
+Testing functionality, which should discover them automatically. You can also run `python3 -m pytest` from the terminal
+to execute them all.
 
 Additionally, since the tests run against the Mongo DB database, if you need to update the test data in this repo, you
 can run `OVERWRITE_TEST_EXPECTED_DATA=true python3 -m pytest` from the terminal and then create a pull request with the
 changes.
 
-## Update py-ard database
+## Heroku Deployment
+
+Currently, there are two environments running in Heroku:
+- Dev: <https://fhir-gen-ops-dev-ca42373833b6.herokuapp.com/>
+- Prod: <https://fhir-gen-ops.herokuapp.com/>
+
+Pull requests will trigger a deployment to the dev environment automatically after being merged.
+
+The ["Manual Deployment"](https://github.com/FHIR/genomics-operations/actions/workflows/manual_deployment.yml) workflow
+can be used to deploy code to either the `dev` or `prod` environments. To do so, select "Run workflow", ignore
+the "Use workflow from" dropdown which lists the branches in the current repo (it cannot be removed), and then
+select the environment, the branch and the repository. By default, the `https://github.com/FHIR/genomics-operations`
+repo is specified, but you can replace it with any fork.
+
+Deployments to the prod environment can only be triggered manually from the `main` branch of the repo using the Manual
+Deployment workflow.
+
+### Heroku Stack
+
+Make sure that the Python version under [`runtime.txt`](./runtime.txt) is
+[supported](https://devcenter.heroku.com/articles/python-support#supported-runtimes) by the
+[Heroku stack](https://devcenter.heroku.com/articles/stack) that is currently running in each environment.
+
+## UTA Database
+
+The Biocommons [hgvs](https://github.com/biocommons/hgvs) library, which is used for variant parsing, validation and
+normalization, requires access to a copy of the [UTA](https://github.com/biocommons/uta) Postgres database.
+
+We have provisioned a Heroku Postgres instance in the Prod environment which contains the imported data from a database
+dump as described [here](https://github.com/biocommons/uta#installing-from-database-dumps).
+
+We define a `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file which contains the name of the
+currently imported database schema.
+
+### Database import procedure (takes about 30 minutes)
+
+- Go to the UTA dump download site (http://dl.biocommons.org/uta/) and get the latest `<UTA_SCHEMA>.pgd.gz` file.
+- Go to https://dashboard.heroku.com/apps/fhir-gen-ops/resources and click on the "Heroku Postgres" instance (it will
+open a new window)
+- Go to the Settings tab
+- Click "View Credentials"
+- Use the fields from this window to fill in the variables below
+
+```shell
+$ POSTGRES_HOST="<Heroku Postgres Host>"
+$ POSTGRES_DATABASE="<Heroku Postgres Database>"
+$ POSTGRES_USER="<Heroku Postgres User>"
+$ PGPASSWORD="<Heroku Postgres Password>"
+$ UTA_SCHEMA="<UTA Schema>" # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
+$ gzip -cdq ${UTA_SCHEMA}.pgd.gz | grep -v '^GRANT USAGE ON SCHEMA .* TO anonymous;$' | grep -v '^ALTER .* OWNER TO uta_admin;$' | psql -U ${POSTGRES_USER} -1 -v ON_ERROR_STOP=1 -d ${POSTGRES_DATABASE} -h ${POSTGRES_HOST} -Eae
+```
+
+Note: The `grep -v` commands are required because the Heroku Postgres instance doesn't allow us to create a new role.
+
+Once complete, make sure you update the `UTA_DATABASE_SCHEMA` environment variable in the [`.env`](.env) file and commit
+it.
+
+### Connection string
+
+The connection string for this database can be found in the same Heroku Postgres Settings tab under "View Credentials".
+It is pre-populated in the Heroku runtime under the `UTA_DATABASE_URL` environment variable. Additionally, we set the
+same `UTA_DATABASE_URL` environment variable in GitHub so the CI can use this database when running the tests.
+
+For local development, set `UTA_DATABASE_URL` to the Heroku Postgres connection string in the `secrets.env` file.
+Alternatively, you can set it to `postgresql://anonymous:anonymous@uta.biocommons.org/uta` if you'd like to use the HGVS
+public instance.
+
+### Testing the database
+
+```shell
+$ source secrets.env
+$ pgcli "${UTA_DATABASE_URL}"
+> set schema '<UTA Schema>'; # Specify the UTA schema of the UTA dump you downloaded (example: uta_20240523b)
+> select count(*) from alembic_version
+union select count(*) from associated_accessions
+union select count(*) from exon
+union select count(*) from exon_aln
+union select count(*) from exon_set
+union select count(*) from gene
+union select count(*) from meta
+union select count(*) from origin
+union select count(*) from seq
+union select count(*) from seq_anno
+union select count(*) from transcript
+union select count(*) from translation_exception;
+```
+
+### Update utilities data
+
+The RefSeq metadata from the UTA database needs to be in sync with the RefSeq data which is available for the Seqfetcher
+Utility endpoint. Currently, this is stored in GitHub as release artifacts. Similarly, the PyARD SQLite database is also
+stored as a release artifact.
+
+To update the RefSeq data and PyARD database, you will have to run `./utilities/pack_utilities_data.py`. Here is a
+step-by-step guide on how to do this:
+
+```shell
+$ mkdir seqrepo
+$ cd seqrepo
+$ python3 -m venv .venv
+$ . .venv/bin/activate
+$ pip install setuptools==75.7.0
+$ pip install biocommons.seqrepo==0.6.9
+$ # See https://github.com/biocommons/biocommons.seqrepo/issues/171 for a bug that's causing issues with the builtin
+$ # rsync on OSX.
+$ # This step is OSX-specific; the standard package managers should have rsync available on Linux.
+$ brew install rsync
+$ # Fetch seqrepo data (should take about 16 minutes)
+$ seqrepo --rsync-exe /opt/homebrew/bin/rsync -r . pull --update-latest
+$ # If you get a "Permission denied" error, then you can run the following command (using the temp directory which
+$ # got created):
+$ # > chmod +w 2024-02-20.r4521u5y && mv 2024-02-20.r4521u5y 2024-02-20 && ln -s 2024-02-20 latest
+$
+$ # Exit the venv and cd to the genomics-operations repo.
+$
+$ # Pack the utilities data (should take about 25 minutes)
+$ python ./utilities/pack_utilities_data.py
+```
+You should see a warning in the output log if the current `PYARD_DATABASE_VERSION` is outdated, and you can change
+`PYARD_DATABASE_VERSION` in the `.env` file if you wish to switch to the latest version that is printed in this log.
+
+Now you should set a new value for `UTILITIES_DATA_VERSION` in the `.env` file, create a new branch and commit this
+change in it. Then also create a git tag for this commit with the `UTILITIES_DATA_VERSION` value and push it to GitHub
+along with the branch. Now you can use this tag to create a new [release](https://github.com/FHIR/genomics-operations/releases).
+Inside this release, you need to attach all the `*.tar.gz` files from the `./tmp` folder which was created after
+`pack_utilities_data.py` ran successfully.
+
+Once the release is published, create a PR from this new branch and merge it.
 
-- Run `pyard.init(data_dir='./data/pyard', imgt_version=<new version>)` to download the new version
-- Run `cd data/pyard && tar -czf pyard.sqlite3.tar.gz pyard-<new version>.sqlite3`
-- Upload `pyard.sqlite3.tar.gz` in a new release on GitHub
-- Update `PYARD_DATABASE_VERSION` in `.env`
-- Update `UTILITIES_DATA_VERSION` in `.env` with the new tag ID (short git sha)
+Finally, in order to validate the new release locally, run `fetch_utilities_data.sh` locally to recreate the `data`
+directory (delete it first if you have it already).
````

app/__init__.py

Lines changed: 7 additions & 0 deletions

```diff
@@ -3,6 +3,13 @@
 from flask_cors import CORS
 import os
 
+import hgvs
+# Disable the hgvs LRU cache to avoid blowing up memory
+# TODO: Revisit this, since this caching might not use a ton of memory.
+hgvs.global_config.lru_cache.maxsize = 0
+# Disable HGVS strict bounds checks as a workaround for liftover failures: https://github.com/biocommons/hgvs/issues/717
+hgvs.global_config.mapping.strict_bounds = False
+
 
 def create_app():
     # App and API
```
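The commit sets `hgvs.global_config.lru_cache.maxsize = 0` to disable caching. A stdlib analogy of what a `maxsize` of zero means, using `functools.lru_cache` (this is only an illustration of the semantics, not the hgvs library's own cache implementation):

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=0)  # maxsize=0 keeps no entries, so every call recomputes
def double(x):
    calls["n"] += 1
    return x * 2

double(1)
double(1)  # a second identical call is NOT served from cache
print(calls["n"])  # 2: caching is effectively off
```

With `maxsize=None` the second call would have been a cache hit and the counter would stay at 1, which is the memory/CPU trade-off the comment in the diff refers to.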

app/api_spec.yml

Lines changed: 58 additions & 0 deletions

```diff
@@ -1330,6 +1330,64 @@ paths:
         type: string
       example: "NM_001127510.3:c.145A>T"
 
+  /utilities/normalize-variant-hgvs:
+    get:
+      summary: "Normalize Variant HGVS"
+      operationId: "app.utilities_endpoints.normalize_variant_hgvs"
+      tags:
+      - "Operations Utilities (not part of balloted HL7 Operations)"
+      responses:
+        "200":
+          description: "Returns a normalized variant in both GRCh37 and GRCh38."
+          content:
+            application/json:
+              schema:
+                type: object
+      parameters:
+      - name: variant
+        in: query
+        required: true
+        description: "Variant."
+        schema:
+          type: string
+        example: "NM_021960.4:c.740C>T"
+
+  /utilities/seqfetcher/1/sequence/{acc}:
+    get:
+      summary: "Seqfetcher"
+      operationId: "app.utilities_endpoints.seqfetcher"
+      tags:
+      - "Operations Utilities (not part of balloted HL7 Operations)"
+      responses:
+        "200":
+          description: "Returns RefSeq subsequence"
+          content:
+            text/plain:
+              schema:
+                type: string
+      parameters:
+      - name: acc
+        in: path
+        required: true
+        description: Accession
+        schema:
+          type: string
+        example: "NC_000001.10"
+      - name: start
+        in: query
+        required: true
+        description: Subsequence start index
+        schema:
+          type: integer
+        example: 10000
+      - name: end
+        in: query
+        required: true
+        description: Subsequence end index
+        schema:
+          type: integer
+        example: 10010
+
   /utilities/normalize-hla:
     get:
       description: >
```
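A quick way to exercise the two new utility endpoints is to build their request URLs from the spec's examples and issue plain GETs. A sketch using only the standard library (the base URL is the dev deployment from the README; the helper names are illustrative, not part of the repo):

```python
from urllib.parse import urlencode, quote

# Dev deployment base URL from the README; swap in prod or localhost as needed.
BASE = "https://fhir-gen-ops-dev-ca42373833b6.herokuapp.com"

def normalize_variant_hgvs_url(variant):
    # GET /utilities/normalize-variant-hgvs?variant=...
    return f"{BASE}/utilities/normalize-variant-hgvs?{urlencode({'variant': variant})}"

def seqfetcher_url(acc, start, end):
    # GET /utilities/seqfetcher/1/sequence/{acc}?start=...&end=...
    return f"{BASE}/utilities/seqfetcher/1/sequence/{quote(acc)}?{urlencode({'start': start, 'end': end})}"

print(normalize_variant_hgvs_url("NM_021960.4:c.740C>T"))
print(seqfetcher_url("NC_000001.10", 10000, 10010))
```

The resulting URLs can then be fetched with `curl` or `urllib.request.urlopen`; the first returns JSON (the variant normalized against GRCh37 and GRCh38), the second plain text (the RefSeq subsequence).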
