Set of Command Line Interface tools to process Open Targets Genetics GWAS data.
pip install gentroutils
To see all available commands after installation run
gentroutils --help
To run a single step run
uv run gentroutils -s gwas_catalog_release # After cloning the repository
gentroutils -s gwas_catalog_release -c otter_config.yaml # When installed by pip
The gentroutils repository uses the otter framework to build the set of tasks to run. The current implementation of tasks can be found in the config.yaml file in the root of the repository. To run gentroutils installed via pip you need to define the otter config that looks like the config.yaml file.
Example config
For the top level fields refer to the otter documentation
[!NOTE] All destination_template values must point to Google Cloud Storage (GCS) bucket objects. All source_template values must point to FTP server paths. If this is not enforced, the user may experience silent failures.
---
work_path: ./work
log_level: DEBUG
scratchpad:
steps:
  gwas_catalog_release:
    - name: crawl release metadata
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
      promote: "true"
    - name: fetch associations
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
      promote: true
    - name: fetch studies
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
      promote: true
    - name: fetch ancestries
      stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
      source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
      destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
      promote: true
    - name: curation study
      requires:
        - fetch studies
      previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
      studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
      destination_template: gs://gwas_catalog_inputs/gentroutils/curation/{release_date}/GWAS_Catalog_study_curation.tsv
      summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
      promote: true
The config above defines the steps that are run in parallel by the otter framework.
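Because misdirected templates may fail silently, it can help to check a config before running it. The following is a minimal sketch and not part of gentroutils: the function name find_template_problems is hypothetical, and it assumes the config has already been loaded into a Python dictionary with the same shape as the example config.

```python
def find_template_problems(config: dict) -> list[str]:
    """Return one message per template whose URI scheme is not the expected one."""
    # Expected prefixes, taken from the note on destination_template / source_template.
    expected = {"destination_template": "gs://", "source_template": "ftp://"}
    problems = []
    for step_name, tasks in (config.get("steps") or {}).items():
        for task in tasks or []:
            for field, prefix in expected.items():
                value = task.get(field)
                if value is not None and not str(value).startswith(prefix):
                    problems.append(
                        f"{step_name} / {task.get('name')}: {field} should start with {prefix}"
                    )
    return problems


# Hypothetical, deliberately broken config used only to exercise the check.
example = {
    "steps": {
        "gwas_catalog_release": [
            {
                "name": "fetch studies",
                "source_template": "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt",
                "destination_template": "/tmp/{release_date}/studies.tsv",  # not a GCS path
            }
        ]
    }
}

for problem in find_template_problems(example):
    print(problem)
```

Fields that a task does not define (for example source_template in the crawl release metadata task) are skipped rather than reported.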
The list of tasks (defined in the config.yaml file) that can be run are:
- name: crawl release metadata
  stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
  destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"
  promote: "true"
This task fetches the latest GWAS Catalog release metadata from the https://www.ebi.ac.uk/gwas/api/search/stats endpoint and saves it to the specified destination.
Note
Task parameters
- The stats_uri is used to fetch the latest release date and other metadata.
- The destination_template is where the metadata will be saved; it uses the {release_date} placeholder to specify the release date dynamically. By default, the release date is looked up directly in the stats_uri JSON output.
- The promote field is set to true, which means the output will be promoted to the latest release: the file will be saved under gs://gwas_catalog_inputs/gentroutils/latest/stats.json after the task is completed. If the promote field is set to false, the file will not be promoted and will be saved under the specified path with the release date.
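The relationship between the {release_date} placeholder and the promoted path can be illustrated as follows. This is only a sketch of the path arithmetic described above, not the gentroutils implementation: the helper names render_template and promoted_path are hypothetical, and the release date shown is an illustrative value whose format is an assumption.

```python
def render_template(template: str, release_date: str) -> str:
    """Fill the {release_date} placeholder with a concrete value."""
    return template.replace("{release_date}", release_date)


def promoted_path(template: str) -> str:
    """Path of the promoted copy: the placeholder is replaced with 'latest'."""
    return render_template(template, "latest")


template = "gs://gwas_catalog_inputs/gentroutils/{release_date}/stats.json"

# Dated path; "2025-01-01" is an illustrative release date, not a real one.
print(render_template(template, "2025-01-01"))
# Path used when promote is true.
print(promoted_path(template))
```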
- name: fetch associations
  stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
  source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-associations_ontology-annotated.tsv"
  destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_associations_ontology_annotated.tsv"
  promote: true
This task fetches the GWAS Catalog associations file from the specified FTP server and saves it to the specified destination.
Note
Task parameters
- The stats_uri is used to fetch the latest release date and other metadata.
- The source_template is the URL of the GWAS Catalog associations file; it uses the {release_date} placeholder to specify the release date dynamically. The release date is fetched from the stats_uri endpoint.
- The destination_template is where the associations file will be saved; it also uses the {release_date} placeholder. The release date is fetched from the stats_uri endpoint.
- The promote field is set to true, which means the output will be promoted to the latest release: the file will be saved under gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_associations_ontology_annotated.tsv after the task is completed. If the promote field is set to false, the file will not be promoted and will be saved under the specified path with the release date.
- name: fetch studies
  stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
  source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-studies-v1.0.3.1.txt"
  destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_studies.tsv"
  promote: true
This task fetches the GWAS Catalog studies file from the specified FTP server and saves it to the specified destination.
Note
Task parameters
- The stats_uri is used to fetch the latest release date and other metadata.
- The source_template is the URL of the GWAS Catalog studies file; it uses the {release_date} placeholder to specify the release date dynamically. The release date is fetched from the stats_uri endpoint.
- The destination_template is where the studies file will be saved; it also uses the {release_date} placeholder. The release date is fetched from the stats_uri endpoint.
- The promote field is set to true, which means the output will be promoted to the latest release: the file will be saved under gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv after the task is completed. If the promote field is set to false, the file will not be promoted and will be saved under the specified path with the release date.
- name: fetch ancestries
  stats_uri: "https://www.ebi.ac.uk/gwas/api/search/stats"
  source_template: "ftp://ftp.ebi.ac.uk/pub/databases/gwas/releases/{release_date}/gwas-catalog-download-ancestries-v1.0.3.1.txt"
  destination_template: "gs://gwas_catalog_inputs/gentroutils/{release_date}/gwas_catalog_download_ancestries.tsv"
  promote: true
This task fetches the GWAS Catalog ancestries file from the specified FTP server and saves it to the specified destination.
Note
Task parameters
- The stats_uri is used to fetch the latest release date and other metadata.
- The source_template is the URL of the GWAS Catalog ancestries file; it uses the {release_date} placeholder to specify the release date dynamically. The release date is fetched from the stats_uri endpoint.
- The destination_template is where the ancestries file will be saved; it also uses the {release_date} placeholder. The release date is fetched from the stats_uri endpoint.
- The promote field is set to true, which means the output will be promoted to the latest release: the file will be saved under gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_ancestries.tsv after the task is completed. If the promote field is set to false, the file will not be promoted and will be saved under the specified path with the release date.
- name: curation study
  requires:
    - fetch studies
  previous_curation: gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv
  studies: gs://gwas_catalog_inputs/gentroutils/latest/gwas_catalog_download_studies.tsv
  destination_template: gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv
  summary_statistics_glob: gs://gwas_catalog_inputs/raw_summary_statistics/*.h.tsv.gz
  promote: true
This task builds the GWAS Catalog curation file that is later used as a template for manual curation. It requires the fetch studies task to be completed before it can run, because the curation file is built from the list of studies in the download studies file.
Note
Task parameters
- The requires field specifies that this task depends on the fetch studies task, meaning it will only run after the studies have been fetched.
- The previous_curation field specifies the path to the previous curation file. The new curation file is built on top of the previous one.
- The studies field is the path to the studies file that was fetched in the fetch studies task. This file is used to build the curation file.
- The destination_template is where the curation file will be saved; it uses the {release_date} placeholder to specify the release date dynamically. The release date is fetched from the stats_uri endpoint.
- The promote field is set to true, which means the output will be promoted to the latest release: the file will be saved under gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv after the task is completed. If the promote field is set to false, the file will not be promoted and will be saved under the specified path with the release date.
- The summary_statistics_glob field specifies the glob pattern used to list all synced summary statistics files from GCS. This is used to identify which studies have summary statistics available.
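The effect of the requires field on execution order can be sketched with the standard library. This is an illustration of dependency ordering in general, assuming the task names from the example config; it does not reproduce how the otter framework schedules tasks internally.

```python
from graphlib import TopologicalSorter

# Task name -> names of the tasks it requires (taken from the example config).
requires = {
    "crawl release metadata": [],
    "fetch associations": [],
    "fetch studies": [],
    "fetch ancestries": [],
    "curation study": ["fetch studies"],
}

sorter = TopologicalSorter(requires)
sorter.prepare()

batches = []
while sorter.is_active():
    # Tasks whose requirements are all completed; these could run in parallel.
    ready = sorted(sorter.get_ready())
    batches.append(ready)
    sorter.done(*ready)

for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

In this sketch the four tasks without requirements form the first batch, and curation study only becomes ready once fetch studies is marked as done.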
The base of the curation process for GWAS Catalog data is defined in docs/gwas_catalog_curation.md. The original solution uses an R script to prepare the data for curation, after which the data are curated manually. The curation task automates the preparation of the data and provides a template for manual curation. The manual curation itself is still required.
The automated process includes:
- Reading the download studies file with the list of studies that are currently coming from the latest GWAS Catalog release.
- Reading the previous curation file that contains the list of the curated studies from the previous release.
- Listing all synced summary statistics files matching the summary_statistics_glob parameter to identify which studies have summary statistics available. Note that this can be more than the list of studies in the download studies file, as syncing also involves the unpublished studies.
- Comparing the three datasets with the following logic:
  - If the study is present in both the previous curation and the download studies, the study is marked as curated.
  - If the study is present in the download studies but not in the previous curation, the study is marked as to_curate or has_no_sumstats, depending on the presence of summary statistics files.
  - If the study is present in the previous curation but not in the download studies, the study is marked as removed.
- The output of the curation process is a file that contains the list of studies with their status (curated, new, removed) and the fields that are required for manual curation. The output file is saved to the destination_template path specified in the task configuration, that is, under the gs://gwas_catalog_inputs/curation/{release_date}/raw/gwas_catalog_study_curation.tsv path.
- The output file is then promoted to the latest release path gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv so that it can be used for manual curation.
- The manual curation process is then performed on the gs://gwas_catalog_inputs/curation/latest/raw/gwas_catalog_study_curation.tsv file. This step is not automated and requires manual intervention. The output from the manual curation process should then be saved to both gs://gwas_catalog_inputs/curation/latest/curated/GWAS_Catalog_study_curation.tsv and gs://gwas_catalog_inputs/curation/{release_date}/curated/GWAS_Catalog_study_curation.tsv. The curated file is then used by the Open Targets Staging DAGs.
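The comparison logic above can be sketched with plain sets. This is a simplified illustration rather than the code used by the curation task: the function name classify_studies and the study identifiers are hypothetical. It also assumes that a new study with summary statistics is marked to_curate and one without is marked has_no_sumstats, and that studies appearing only in the summary statistics listing are left out of the output.

```python
def classify_studies(
    download_studies: set[str],
    previous_curation: set[str],
    sumstats_studies: set[str],
) -> dict[str, str]:
    """Assign a curation status to every study seen in either input file."""
    statuses = {}
    for study in sorted(download_studies | previous_curation):
        if study in download_studies and study in previous_curation:
            statuses[study] = "curated"
        elif study in download_studies:
            # New study: the status depends on summary statistics availability (assumed mapping).
            statuses[study] = "to_curate" if study in sumstats_studies else "has_no_sumstats"
        else:
            # Present in the previous curation only.
            statuses[study] = "removed"
    return statuses


# Hypothetical identifiers used only for illustration.
statuses = classify_studies(
    download_studies={"STUDY_A", "STUDY_B", "STUDY_C"},
    previous_curation={"STUDY_A", "STUDY_D"},
    sumstats_studies={"STUDY_B", "STUDY_E"},  # STUDY_E stands for an unpublished study
)
for study, status in statuses.items():
    print(study, status)
```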
To be able to contribute to the project you need to set it up. This project runs on:
- python 3.13
- uv (dependency manager)
To set up the project run
make dev
The command installs the dependencies above if they are not present (the only initial requirements are curl and bash)
and then installs all Python dependencies listed in pyproject.toml. Finally, the command installs the pre-commit hooks
that must run before a commit is created.
The project has additional dev dependencies that include the list of packages used for testing purposes.
All of the dev dependencies are automatically installed by uv.
To see all available dev commands
Run the following command to see all available dev commands
make help
To check CLI execution manually you need to run
uv run gentroutils
This software was developed as part of the Open Targets project. For more information please see: http://www.opentargets.org