Project overview

Mauricio Martinez edited this page Jun 6, 2022 · 4 revisions

The PDCM ETL is a pipeline that takes input data through a series of transformations and loads the result into a database holding the information needed to power the PDCM Finder website (https://www.cancermodels.org/).

More specifically, the database populated by this ETL is the data source for a PostgREST instance, which automatically exposes a set of endpoints that the web application consumes.

The code

The ETL is written in Python and uses Luigi, a library that provides a workflow manager.

Input

The main source of data for the ETL is the data that is sent by the different providers, but the process also needs additional information.

Before running the ETL, we need to set the folder that contains all the input data. We refer to it here as [data_dir].

Providers' data

This is the data that every provider gives us. It is a set of Excel files that we later convert to TSV files.

The providers' data can be found at [data_dir]/data/UPDOG.

Mapping Rules

The mapping rules are two JSON files containing rules that map diagnoses and treatments to ontology terms. This matters because different providers can use different names for the same term, so we need to harmonise the data to power the search functionality in PDCM Finder.

The mapping rules can be found at [data_dir]/mapping.
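
To illustrate the harmonisation idea, here is a minimal sketch of a rule lookup. The rule layout (`provider`, `value`, `ontology_term`), the sample values, and the function names are all hypothetical; the real schema of the JSON files in [data_dir]/mapping may differ.

```python
import json

# Hypothetical rule layout, NOT the real schema of the files in [data_dir]/mapping.
SAMPLE_RULES = json.loads("""
[
  {"provider": "provider_a", "value": "breast ca", "ontology_term": "Example Term 1"},
  {"provider": "provider_b", "value": "Breast Cancer", "ontology_term": "Example Term 1"}
]
""")


def build_rule_index(rules):
    """Index rules by (provider, normalised value) for constant-time lookups."""
    return {
        (rule["provider"], rule["value"].strip().lower()): rule["ontology_term"]
        for rule in rules
    }


def map_term(index, provider, value):
    """Return the ontology term for a provider value, or None if no rule matches."""
    return index.get((provider, value.strip().lower()))


index = build_rule_index(SAMPLE_RULES)
```

With this index, `map_term(index, "provider_a", " Breast CA ")` and `map_term(index, "provider_b", "breast cancer")` resolve to the same term, which is the property the search functionality relies on.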

Markers

Molecular data contains gene names, which we validate by mapping them to existing genes. To avoid querying that data at execution time, we download the markers information beforehand and use it as an input file for the process.

The markers data can be found at [data_dir]/markers/markers.tsv. The data is fetched from:

https://www.genenames.org/cgi-bin/download/custom?col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_status&col=gd_prev_sym&col=gd_aliases&col=gd_pub_acc_ids&col=gd_pub_refseq_ids&col=gd_name_aliases&col=gd_pub_ensembl_id&col=gd_pub_eg_id&status=Approved&hgnc_dbtag=on&order_by=gd_app_sym_sort&format=text&submit=submit
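
As a rough sketch of how a downloaded markers file could be used to check gene names, the snippet below builds a lookup from any known symbol to its approved symbol. The column headers (`Approved symbol`, `Previous symbols`, `Alias symbols`) are assumptions about the export format and should be checked against the real markers.tsv; the sample rows and the function name are illustrative only.

```python
import csv
import io

# Tiny stand-in for markers.tsv. The column headers are assumptions about the
# HGNC export; verify them against the real file before relying on this.
SAMPLE_TSV = (
    "HGNC ID\tApproved symbol\tPrevious symbols\tAlias symbols\n"
    "HGNC:1100\tBRCA1\t\tRNF53, BRCC1\n"
    "HGNC:11998\tTP53\t\tp53, LFS1\n"
)


def build_symbol_lookup(handle):
    """Map every known symbol (approved, previous, alias) to the approved symbol."""
    lookup = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        approved = row["Approved symbol"]
        # Approved symbols are assigned directly, so they always win.
        lookup[approved.upper()] = approved
        for column in ("Previous symbols", "Alias symbols"):
            for symbol in (row.get(column) or "").split(","):
                symbol = symbol.strip()
                if symbol:
                    # setdefault never overwrites an existing entry.
                    lookup.setdefault(symbol.upper(), approved)
    return lookup


lookup = build_symbol_lookup(io.StringIO(SAMPLE_TSV))
```

Looking up `lookup.get("P53")` then returns the approved symbol for that alias, while an unknown name returns `None` and can be flagged for review.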

NCIT terms

The NCIT ontology, along with the mapping rules, allows us to relate diagnoses and treatments to ontology terms.

The NCIT terms can be found at [data_dir]/ontology/ncit.obo. The data is fetched from: http://purl.obolibrary.org/obo/ncit.obo
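
For orientation, an OBO file is a sequence of stanzas such as `[Term]`, each holding `tag: value` lines. The following is a minimal sketch of reading term ids, names, and parents from such a file. It is not the ETL's own parser; the sample ids and names are placeholders, and only the `id`, `name`, and `is_a` tags are handled.

```python
import io

# Small stand-in for ncit.obo with placeholder ids and names.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: NCIT:C0001
name: Example Parent Term

[Term]
id: NCIT:C0002
name: Example Child Term
is_a: NCIT:C0001 ! Example Parent Term
"""


def parse_obo_terms(handle):
    """Return {term_id: {"name": str, "is_a": [parent_ids]}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw in handle:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; only [Term] stanzas are collected.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        tag, value = line.split(":", 1)
        value = value.split(" ! ")[0].strip()  # drop the trailing comment
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["is_a"].append(value)
    return terms


terms = parse_obo_terms(io.StringIO(SAMPLE_OBO))
```

The `is_a` lists give the parent links, which is what makes it possible to walk from a mapped term up to broader ontology terms.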

Ontolia

Ontolia is an internal tool that relates regimens and terms. This information is used to improve the search engine in PDCM Finder. After Ontolia is executed, its output must be copied to [data_dir]/ontology/ontolia_output.txt.

The Ontolia repository can be found at: https://github.com/PDCMFinder/ontolia
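
Since every input lives at a fixed location under [data_dir], a quick pre-flight check can catch a missing file before the ETL starts. The sketch below uses the relative paths listed in this section; the dictionary keys and the `missing_inputs` helper are hypothetical names and not part of the ETL code.

```python
from pathlib import Path

# Relative locations taken from this page; the key names are hypothetical.
EXPECTED_INPUTS = {
    "providers": "data/UPDOG",
    "mapping_rules": "mapping",
    "markers": "markers/markers.tsv",
    "ncit_terms": "ontology/ncit.obo",
    "ontolia_output": "ontology/ontolia_output.txt",
}


def missing_inputs(data_dir):
    """Return the names of expected inputs that do not exist under data_dir."""
    root = Path(data_dir)
    return sorted(
        name
        for name, relative in EXPECTED_INPUTS.items()
        if not (root / relative).exists()
    )
```

Calling `missing_inputs("/path/to/data_dir")` returns an empty list when everything is in place, or the names of the inputs that still need to be downloaded or copied.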

Output

The output of the ETL process is a populated database, which ultimately powers the PDCM Finder website through a PostgREST API.
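
PostgREST exposes tables and views as HTTP resources and accepts filters as query parameters of the form `column=eq.value`. As a sketch of what a read request against such an API could look like, the helper below builds a URL; the base URL, the `model_view` resource, and its columns are placeholders, not names taken from the real database.

```python
from urllib.parse import urlencode


def postgrest_url(base_url, table, filters=None, select=None):
    """Build a PostgREST read URL using equality filters (column=eq.value)."""
    params = {}
    if select:
        params["select"] = ",".join(select)
    for column, value in (filters or {}).items():
        params[column] = f"eq.{value}"
    # Keep commas unescaped so the select list stays readable.
    query = urlencode(params, safe=",")
    return f"{base_url.rstrip('/')}/{table}" + (f"?{query}" if query else "")


# "model_view" and its columns are placeholders, not names from the real database.
url = postgrest_url(
    "http://localhost:3000",
    "model_view",
    filters={"provider": "ABC"},
    select=["model_id", "provider"],
)
```

The resulting URL can be fetched with any HTTP client; only equality filters are covered here, although PostgREST supports other operators as well.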
