External Databases Cross‐referencing

Mauricio Martinez edited this page Oct 13, 2023 · 8 revisions

Part of the ETL process is the generation of links to external resources, which provide additional information for some elements in our data, like genes, studies, etc.

Types of resources

Raw data resources

Resources linked from a molecular characterization (the combination of [sample id - sample type - data type - platform]). Usually, the link corresponds to a study.

Examples

The links in the molecular_characterization table are stored in the column external_db_links. At the level of the model, the list of raw data resources the model links to is stored in the column raw_data_resources in the table search_index.

Link - Column association

Raw data resources links are associated with the column raw_data_url.

Source column

Ids are taken from raw_data_url in the molecular_characterization table.

raw_data_url content examples

  • GSM1986309&PRJNA307186
  • ERR4290182

Cancer annotation resources

Resources linked from molecular data records. They can be resources with gene/variant information. These links are associated with the molecular data tables (cna_molecular_data, expression_molecular_data, mutation_measurement_data, and biomarker_molecular_data).

Examples

The links in the molecular data tables are stored in the column external_db_links. At the level of the model, the list of cancer annotation resources the model links to is stored in the column cancer_annotation_resources in the table search_index.

Link - Column association

If the resource is of type Gene, the link will be associated with the column hgnc_symbol. If the type is Variant, the link will be associated with the column amino_acid_change.

Source columns

Ids are taken from different columns depending on the resource.

hgnc_symbol content examples

  • DVL1
  • CDK11B

amino_acid_change content examples

  • K384R
  • F271S

variation_id content examples

  • rs3115849
  • rs307377&CM098260

Resources that require downloading

For certain resources, we need to download some data before the ETL can generate links. This is the case for Civic, OncoMx, and ClinGen. For each of them, we first download the list of genes/symbols available on their platform, and then cross-reference those names with the symbols in our molecular data, so that we only generate links that actually exist.

Why not just make the ETL download this data each time?

Downloading the data can be time-consuming, and we decided we don't need "fresh" data every time we run the ETL. We download the data from time to time, and the ETL just uses the stored copy.

Where is the downloaded data stored?

We keep a folder in our data repository https://gitlab.ebi.ac.uk/mouse-informatics/pdxfinder-data/-/tree/master/externalDBs. That folder contains the downloaded data in the expected format.

How to download the data?

We manually run https://github.com/PDCMFinder/pdcm-etl/blob/master/etl/jobs/util/external_resources/download_resources_data.py every time we want to have updated data from the resources. This downloads the data locally, so we need to move it to the data repository and replace the files in externalDBs.

Resources that don't require downloading

For other resources, link generation doesn't require downloading a list of symbols beforehand. In those cases, the links can be generated with information that is already in our data. This is the case for resources like dbSNP, COSMIC, or OpenCravat. For them, the links are built using data from the same record of the molecular data table/dataset. We call them inline links.

External resources configuration file

The file https://github.com/PDCMFinder/pdcm-etl/blob/master/etl/external_resources.yaml contains the configuration for all the external resources that will be taken into account in the link generation process.

The following table describes the different resources for which we generate links.

  • id: Internal identifier to be used in internal references.
  • name: Informative text for internal use, to help distinguish resources that differ only in their type.
  • label: Actual resource name to be displayed as the text of the generated links.
  • type: Indicates the type of data the resource holds. This helps in the logic to decide which links to generate.
  • link_building_method: Indicates the method to build the links.
  • link_template: Template to create the link to the resource for a specific entry or replacement of column values.

| id | name | label | type | link_building_method | link_template |
|----|------|-------|------|----------------------|---------------|
| 1 | Civic | Civic (Genes) | Gene | referenceLookup | https://civicdb.org/links/entrez_name/ENTRY_ID |
| 2 | Civic | Civic (Variants) | Variant | referenceLookup | https://civicdb.org/links?idtype=variant&id=ENTRY_ID |
| 3 | OncoMX | OncoMx (Genes) | Gene | referenceLookup | https://oncomx.org/searchview/?gene=ENTRY_ID |
| 4 | dbSNP | dbSNP (Variants) | Variant | dbSNPInlineLink | https://www.ncbi.nlm.nih.gov/snp/RS_ID |
| 5 | COSMIC | COSMIC (Variants) | Variant | COSMICInlineLink | https://cancer.sanger.ac.uk/cosmic/mutation/overview?id=COSMIC_ID |
| 6 | OpenCravat | OpenCravat (Variants) | Variant | OpenCravatInlineLink | https://run.opencravat.org/webapps/variantreport/index.html?alt_base=ALT_BASE&chrom=chrCHROM&pos=POSITION&ref_base=REF_BASE |
| 7 | ENA | ENA (Studies) | Study | ENAInlineLink | https://www.ebi.ac.uk/ena/browser/view/ENA_ID |
| 8 | EGA | EGA (Studies) | Study | EGAInlineLink | https://ega-archive.org/studies/EGA_ID |
| 9 | GEO | GEO (Studies) | Study | GEOInlineLink | https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GEO_ID |
| 10 | ClinGen | ClinGen (Genes) | Gene | referenceLookup | https://search.clinicalgenome.org/kb/genes/ENTRY_ID |
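
To illustrate how a link_template turns into a concrete link, here is a minimal sketch. It assumes the upper-case tokens in the templates (ENTRY_ID, RS_ID, and so on) are plain placeholders that get substituted with column values. The function name fill_template is hypothetical and is not taken from the ETL code.

```python
from typing import Dict
from urllib.parse import quote


def fill_template(link_template: str, values: Dict[str, str]) -> str:
    """Replace each placeholder token in the template with its value.

    Hypothetical helper: the real ETL may perform the substitution differently.
    """
    link = link_template
    for placeholder, value in values.items():
        # Percent-encode the value so unusual characters cannot break the URL.
        link = link.replace(placeholder, quote(str(value), safe=""))
    return link


# Gene resource: the placeholder is filled with an hgnc_symbol value.
civic_gene_link = fill_template(
    "https://civicdb.org/links/entrez_name/ENTRY_ID", {"ENTRY_ID": "DVL1"}
)
print(civic_gene_link)
```

Using simple string replacement keeps the templates readable in the YAML file, at the cost of requiring placeholder names that do not collide with other parts of the URL.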

Linking methods

The value in the configuration file for the field link_building_method determines how the link will be created.

Note: The variation_id column is present only in mutation molecular data, so resources that need that column can only be processed in mutation data.

referenceLookup

The resource data has been previously downloaded and contains a set of entries used as a lookup table to find matches when scanning columns which might require a link.

This is the method used to create Civic, OncoMx, and ClinGen links.
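
The lookup idea can be sketched as follows. This is only an illustration under stated assumptions: the downloaded entries are represented as a plain Python set, the set content is made up for the example, and build_reference_links is a hypothetical name rather than the ETL's actual function.

```python
from typing import Dict, Iterable, Set


def build_reference_links(
    symbols: Iterable[str], known_entries: Set[str], link_template: str
) -> Dict[str, str]:
    """Return a link for every symbol that exists in the downloaded entries."""
    links = {}
    for symbol in symbols:
        # Only symbols present in the resource get a link, so every
        # generated link points to a page that actually exists.
        if symbol in known_entries:
            links[symbol] = link_template.replace("ENTRY_ID", symbol)
    return links


# Made-up lookup set standing in for the previously downloaded data.
downloaded_entries = {"DVL1"}

links = build_reference_links(
    ["DVL1", "CDK11B"],
    downloaded_entries,
    "https://search.clinicalgenome.org/kb/genes/ENTRY_ID",
)
print(links)
```

In this sketch CDK11B gets no link because it is absent from the example lookup set, which mirrors how symbols unknown to a resource are skipped.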

dbSNPInlineLink

Resource type: Cancer annotation resources.

Source Column: variation_id.

Inline link for dbSNP. It means the id (rsId) needs to be extracted from the variation_id column.

Regular expression: rs\d+
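
As a sketch of the extraction step, the regular expression above can be applied with Python's re module. The function name dbsnp_link is hypothetical, and taking the first match when several rsIds are present is an assumption of this example.

```python
import re
from typing import Optional

RS_ID_PATTERN = re.compile(r"rs\d+")
DBSNP_TEMPLATE = "https://www.ncbi.nlm.nih.gov/snp/RS_ID"


def dbsnp_link(variation_id: Optional[str]) -> Optional[str]:
    """Build a dbSNP link from a variation_id value, or None if no rsId is found."""
    match = RS_ID_PATTERN.search(variation_id or "")
    if match is None:
        return None
    return DBSNP_TEMPLATE.replace("RS_ID", match.group(0))


print(dbsnp_link("rs307377&CM098260"))
print(dbsnp_link("CM098260"))
```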

COSMICInlineLink

Resource type: Cancer annotation resources.

Source Column: variation_id.

Inline link for COSMIC. It means the id (COSM###) needs to be extracted from the variation_id column. If there are several values, only the first one is processed.

Regular expression: COSM(\d+)
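
A minimal sketch of the "first value only" behaviour follows. It assumes that the captured numeric part (the group in the regular expression) is what replaces COSMIC_ID in the template. The function name cosmic_link and the sample ids are made up for illustration.

```python
import re
from typing import Optional

COSMIC_PATTERN = re.compile(r"COSM(\d+)")
COSMIC_TEMPLATE = "https://cancer.sanger.ac.uk/cosmic/mutation/overview?id=COSMIC_ID"


def cosmic_link(variation_id: Optional[str]) -> Optional[str]:
    """Build a COSMIC link from the first COSM id found in variation_id."""
    # re.search stops at the first match, so later ids are ignored.
    match = COSMIC_PATTERN.search(variation_id or "")
    if match is None:
        return None
    # Assumption: the template expects only the digits captured by the group.
    return COSMIC_TEMPLATE.replace("COSMIC_ID", match.group(1))


# Two made-up ids in one value: only the first one produces a link.
print(cosmic_link("COSM111&COSM222"))
```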

OpenCravatInlineLink

Resource type: Cancer annotation resources.

Source Columns: alt_base, chromosome, position, ref_base.

Inline link for OpenCravat. It means the link needs the data alt_base, chromosome, position, and ref_base to be built.

Other restrictions: Column variation_id must contain the string "rs". For example: "rs121913512".
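
The following sketch shows how the four source columns and the variation_id restriction could fit together. It is an illustration only: the record is modelled as a plain dictionary, the sample values for the four source columns are made up, opencravat_link is a hypothetical name, and the "rs" check is done case-insensitively, which is an assumption.

```python
from typing import Any, Dict, Optional

OPENCRAVAT_TEMPLATE = (
    "https://run.opencravat.org/webapps/variantreport/index.html"
    "?alt_base=ALT_BASE&chrom=chrCHROM&pos=POSITION&ref_base=REF_BASE"
)
REQUIRED_COLUMNS = ("alt_base", "chromosome", "position", "ref_base")


def opencravat_link(record: Dict[str, Any]) -> Optional[str]:
    """Build an OpenCravat link, or None when the record does not qualify."""
    variation_id = record.get("variation_id") or ""
    # Restriction from the text: variation_id must contain an rs marker.
    if "rs" not in variation_id.lower():
        return None
    # All four source columns are needed to fill the template.
    if any(record.get(column) in (None, "") for column in REQUIRED_COLUMNS):
        return None
    return (
        OPENCRAVAT_TEMPLATE.replace("ALT_BASE", str(record["alt_base"]))
        .replace("CHROM", str(record["chromosome"]))
        .replace("POSITION", str(record["position"]))
        .replace("REF_BASE", str(record["ref_base"]))
    )


# Made-up record for illustration.
example_record = {
    "variation_id": "rs121913512",
    "alt_base": "T",
    "chromosome": "7",
    "position": 12345,
    "ref_base": "A",
}
print(opencravat_link(example_record))
```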

ENAInlineLink

Resource type: Raw data resources.

Source Column: raw_data_url.

Inline link for ENA studies. ID extracted from the raw_data_url column.

Regular expression: PRJ[EDN][A-Z][0-9]{0,15}|[EDS]R[SXRP][0-9]{6,}

EGAInlineLink

Resource type: Raw data resources.

Source Column: raw_data_url.

Inline link for EGA studies. ID extracted from the raw_data_url column.

Regular expression: EGA[A-Za-z0-9]+

GEOInlineLink

Resource type: Raw data resources.

Source Column: raw_data_url.

Inline link for GEO studies. ID extracted from the raw_data_url column.

Regular expression: GSM[A-Za-z0-9]+
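
The three study link methods (ENA, EGA, GEO) share the same shape, so they can be sketched together. The example below applies each regular expression to a raw_data_url value and fills the matching template. The function name study_links is hypothetical, and generating one link per match (rather than only the first) is an assumption of this sketch.

```python
import re
from typing import Dict, List, Optional

# (label, regular expression, link template, placeholder), as listed in this page.
STUDY_RESOURCES = [
    (
        "ENA (Studies)",
        re.compile(r"PRJ[EDN][A-Z][0-9]{0,15}|[EDS]R[SXRP][0-9]{6,}"),
        "https://www.ebi.ac.uk/ena/browser/view/ENA_ID",
        "ENA_ID",
    ),
    (
        "EGA (Studies)",
        re.compile(r"EGA[A-Za-z0-9]+"),
        "https://ega-archive.org/studies/EGA_ID",
        "EGA_ID",
    ),
    (
        "GEO (Studies)",
        re.compile(r"GSM[A-Za-z0-9]+"),
        "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GEO_ID",
        "GEO_ID",
    ),
]


def study_links(raw_data_url: Optional[str]) -> List[Dict[str, str]]:
    """Return one link per accession found in a raw_data_url value."""
    links = []
    for label, pattern, template, placeholder in STUDY_RESOURCES:
        # The patterns have no capture groups, so findall returns whole matches.
        for accession in pattern.findall(raw_data_url or ""):
            links.append(
                {"resource": label, "link": template.replace(placeholder, accession)}
            )
    return links


for link in study_links("GSM1986309&PRJNA307186"):
    print(link["resource"], link["link"])
```

With the example value GSM1986309&PRJNA307186 from the raw_data_url section, this sketch yields one ENA link and one GEO link, since a single column value can carry ids for more than one resource.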