External Databases Cross‐referencing

Part of the ETL process is the generation of links to external resources, which provide additional information about some elements in our data, such as genes and studies.

Types of resources

Raw data resources

Resources linked from a molecular characterization (the combination of [sample id - sample type - data type - platform]). Usually, the link corresponds to a study.

Examples

The links in the molecular_characterization table are stored in the column external_db_links. At the level of the model, the list of raw data resources the model links to is stored in the column raw_data_resources in the table search_index.

Link - Column association

Raw data resources links are associated with the column raw_data_url.

Source column

Ids are taken from raw_data_url in the molecular_characterization table.

raw_data_url content examples

GSM1986309&PRJNA307186
ERR4290182

Cancer annotation resources

Resources linked from molecular data records. They can be resources with gene/variant information. These links are associated with the molecular data tables (cna_molecular_data, expression_molecular_data, mutation_measurement_data, and biomarker_molecular_data).

Examples

The links in the molecular data tables are stored in the column external_db_links. At the level of the model, the list of cancer annotation resources the model links to is stored in the column cancer_annotation_resources in the table search_index.

Link - Column association

If the resource is of type Gene, the link will be associated with the column hgnc_symbol. If the type is Variant, the link will be associated with the column amino_acid_change.

Source columns

Ids are taken from different columns depending on the resource.

`hgnc_symbol` content examples

DVL1
CDK11B

`amino_acid_change` content examples

K384R
F271S

`variation_id` content examples

rs3115849
rs307377&CM098260

Resources that require downloading

For certain resources, we need to download some data before the ETL can generate links. This is the case for Civic, OncoMx, and ClinGen. For each of them, we first download the list of genes/symbols available on their platform and then match those names against the symbols in our molecular data, so that we only generate links that actually exist.

Why not just make the ETL download this data each time?

The download of data can be time-consuming and we decided we don't need "fresh" data every single time we run the ETL. We can download the data from time to time and then the ETL will just use it.

Where is the downloaded data stored?

We keep a folder in our data repository https://gitlab.ebi.ac.uk/mouse-informatics/pdxfinder-data/-/tree/master/externalDBs. That folder contains the downloaded data in the expected format.

How to download the data?

We manually run https://github.com/PDCMFinder/pdcm-etl/blob/master/etl/jobs/util/external_resources/download_resources_data.py every time we want to have updated data from the resources. This downloads the data locally, so we need to move it to the data repository and replace the files in externalDBs.

Resources that don't require downloading

For other resources, link generation doesn't need a previous download of a list of symbols. In those cases the links can be generated with information that is already in our data. This is the case for resources like dbSNP, COSMIC, or OpenCravat. For them, the links are built using data from the same record of the molecular data table/dataset. We call them inline links.

External resources configuration file

The file https://github.com/PDCMFinder/pdcm-etl/blob/master/etl/external_resources.yaml contains the configuration for all the external resources that will be taken into account in the link generation process.

Each resource in the file is described by the following fields:

id: Internal identifier to be used in internal references.
name: Informative text for internal use, to help distinguish resources that differ only in their type.
label: Actual resource name to be displayed as the text of the generated links.
type: Indicates the type of data the resource holds (Gene, Variant, or Study). This helps in the logic to decide which links to generate.
link_building_method: Indicates the method to build the links.
link_template: Template used to create the link. Its placeholder (for example ENTRY_ID or RS_ID) is replaced with the matched entry or with column values.

The following table lists the resources for which we generate links.

id	name	label	type	link_building_method	link_template
1	Civic	Civic (Genes)	Gene	referenceLookup	https://civicdb.org/links/entrez_name/ENTRY_ID
2	Civic	Civic (Variants)	Variant	referenceLookup	https://civicdb.org/links?idtype=variant&id=ENTRY_ID
3	OncoMX	OncoMx (Genes)	Gene	referenceLookup	https://oncomx.org/searchview/?gene=ENTRY_ID
4	dbSNP	dbSNP (Variants)	Variant	dbSNPInlineLink	https://www.ncbi.nlm.nih.gov/snp/RS_ID
5	COSMIC	COSMIC (Variants)	Variant	COSMICInlineLink	https://cancer.sanger.ac.uk/cosmic/mutation/overview?id=COSMIC_ID
6	OpenCravat	OpenCravat (Variants)	Variant	OpenCravatInlineLink	https://run.opencravat.org/webapps/variantreport/index.html?alt_base=ALT_BASE&chrom=chrCHROM&pos=POSITION&ref_base=REF_BASE
7	ENA	ENA (Studies)	Study	ENAInlineLink	https://www.ebi.ac.uk/ena/browser/view/ENA_ID
8	EGA	EGA (Studies)	Study	EGAInlineLink	https://ega-archive.org/studies/EGA_ID
9	GEO	GEO (Studies)	Study	GEOInlineLink	https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GEO_ID
10	ClinGen	ClinGen (Genes)	Gene	referenceLookup	https://search.clinicalgenome.org/kb/genes/ENTRY_ID

Linking methods

The value in the configuration file for the field link_building_method determines how the link will be created.

Note: The variation_id column is present only in mutation molecular data, so resources that need that column can only be processed in mutation data.

referenceLookup

The resource data has been previously downloaded and contains a set of entries used as a lookup table to find matches when scanning columns which might require a link.

This is the method used to create Civic, OncoMx, and ClinGen links.
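
The lookup idea can be sketched as follows. This is a minimal illustration, not the ETL's actual implementation: it assumes the downloaded entries are available as a Python set, and the function name `build_reference_lookup_link` and the sample entries are hypothetical. The template is the Civic (Genes) one from the configuration table.

```python
from typing import Optional, Set

# Template for Civic (Genes), as listed in the configuration table.
CIVIC_GENE_TEMPLATE = "https://civicdb.org/links/entrez_name/ENTRY_ID"


def build_reference_lookup_link(
    value: Optional[str], known_entries: Set[str], link_template: str
) -> Optional[str]:
    """Return a link only if `value` is present in the downloaded entries."""
    if value is None or value not in known_entries:
        return None
    return link_template.replace("ENTRY_ID", value)


# Hypothetical subset of downloaded entries, for illustration only.
civic_genes = {"DVL1", "TP53"}

print(build_reference_lookup_link("DVL1", civic_genes, CIVIC_GENE_TEMPLATE))
print(build_reference_lookup_link("CDK11B", civic_genes, CIVIC_GENE_TEMPLATE))
```

With this sample set, the first call yields a link and the second returns `None`, because `CDK11B` is not among the (hypothetical) downloaded entries.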

dbSNPInlineLink

Resource type: Cancer annotation resources.

Source Column: variation_id.

Inline link for dbSNP. It means the id (rsId) needs to be extracted from the variation_id column.

Regular expression: rs\d+
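
As an illustration, the extraction could look like the sketch below. It applies the regular expression above to the `variation_id` examples shown earlier and fills the dbSNP template from the configuration table. The function name `build_dbsnp_link` is hypothetical, and using the first match is an assumption.

```python
import re
from typing import Optional

# Template and pattern as documented for dbSNP.
DBSNP_TEMPLATE = "https://www.ncbi.nlm.nih.gov/snp/RS_ID"
RS_ID_PATTERN = re.compile(r"rs\d+")


def build_dbsnp_link(variation_id: Optional[str]) -> Optional[str]:
    """Extract the first rsId from variation_id and build the dbSNP link."""
    if not variation_id:
        return None
    match = RS_ID_PATTERN.search(variation_id)
    if match is None:
        return None
    return DBSNP_TEMPLATE.replace("RS_ID", match.group(0))


print(build_dbsnp_link("rs307377&CM098260"))
print(build_dbsnp_link("rs3115849"))
```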

COSMICInlineLink

Resource type: Cancer annotation resources.

Source Column: variation_id.

Inline link for COSMIC. It means the id (COSM###) needs to be extracted from the variation_id column. If there are several values, only the first one is processed.

Regular expression: COSM(\d+)
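
A possible sketch of this extraction is shown below. `re.search` returns the first match, which mirrors the "only the first one is processed" rule. The function name `build_cosmic_link` and the sample `variation_id` value are hypothetical, and replacing COSMIC_ID with the captured digits (rather than the full `COSM###` string) is an assumption based on the capture group in the regular expression.

```python
import re
from typing import Optional

# Template and pattern as documented for COSMIC.
COSMIC_TEMPLATE = "https://cancer.sanger.ac.uk/cosmic/mutation/overview?id=COSMIC_ID"
COSMIC_PATTERN = re.compile(r"COSM(\d+)")


def build_cosmic_link(variation_id: Optional[str]) -> Optional[str]:
    """Build a COSMIC link from the first COSM id found in variation_id."""
    if not variation_id:
        return None
    match = COSMIC_PATTERN.search(variation_id)  # first occurrence only
    if match is None:
        return None
    # Assumption: the placeholder takes the numeric part captured by the group.
    return COSMIC_TEMPLATE.replace("COSMIC_ID", match.group(1))


# Hypothetical value containing two COSMIC ids.
print(build_cosmic_link("COSM1234&COSM5678"))
```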

OpenCravatInlineLink

Resource type: Cancer annotation resources.

Source Columns: alt_base, chromosome, position, ref_base.

Inline link for OpenCravat. It means the link needs the data alt_base, chromosome, position, and ref_base to be built.

Other restrictions: Column variation_id must contain the string rs. For example: "rs121913512".
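
The sketch below shows one way the template could be filled from the four source columns. It is an illustration under assumptions, not the ETL code: the function name `build_opencravat_link` is hypothetical, the restriction is assumed to be a check for the lowercase string `rs` (as in the example value), and the chromosome, position, and base values are made up.

```python
from typing import Optional

# Template as listed in the configuration table for OpenCravat.
OPENCRAVAT_TEMPLATE = (
    "https://run.opencravat.org/webapps/variantreport/index.html"
    "?alt_base=ALT_BASE&chrom=chrCHROM&pos=POSITION&ref_base=REF_BASE"
)


def build_opencravat_link(
    variation_id: Optional[str],
    chromosome: Optional[str],
    position: Optional[int],
    ref_base: Optional[str],
    alt_base: Optional[str],
) -> Optional[str]:
    """Fill the OpenCravat template from the columns of a single record."""
    # Assumed restriction: variation_id must contain "rs".
    if not variation_id or "rs" not in variation_id:
        return None
    if None in (chromosome, position, ref_base, alt_base):
        return None
    replacements = {
        "ALT_BASE": alt_base,
        "CHROM": str(chromosome),
        "POSITION": str(position),
        "REF_BASE": ref_base,
    }
    link = OPENCRAVAT_TEMPLATE
    for placeholder, value in replacements.items():
        link = link.replace(placeholder, value)
    return link


# Hypothetical record values, for illustration only.
print(build_opencravat_link("rs121913512", "1", 12345, "A", "G"))
```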

ENAInlineLink

Resource type: Raw data resources.

Source Column: raw_data_url.

Inline link for ENA studies. ID extracted from the raw_data_url column.

Regular expression: PRJ[EDN][A-Z][0-9]{0,15}|[EDS]R[SXRP][0-9]{6,}

EGAInlineLink

Resource type: Raw data resources.

Source Column: raw_data_url.

Inline link for EGA studies. ID extracted from the raw_data_url column.

Regular expression: EGA[A-Za-z0-9]+

GEOInlineLink

Resource type: Raw data resources.

Source Column: raw_data_url.

Inline link for GEO studies. ID extracted from the raw_data_url column.

Regular expression: GSM[A-Za-z0-9]+
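
To illustrate how the three study patterns behave on the `raw_data_url` examples shown earlier, here is a minimal sketch. The patterns and templates are the documented ones; the function name `build_study_links`, the dictionary layout, and the choice to use only the first match per resource are assumptions.

```python
import re
from typing import Dict, Optional

# label -> (pattern, link template, placeholder), taken from this page.
STUDY_RESOURCES = {
    "ENA": (
        re.compile(r"PRJ[EDN][A-Z][0-9]{0,15}|[EDS]R[SXRP][0-9]{6,}"),
        "https://www.ebi.ac.uk/ena/browser/view/ENA_ID",
        "ENA_ID",
    ),
    "EGA": (
        re.compile(r"EGA[A-Za-z0-9]+"),
        "https://ega-archive.org/studies/EGA_ID",
        "EGA_ID",
    ),
    "GEO": (
        re.compile(r"GSM[A-Za-z0-9]+"),
        "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GEO_ID",
        "GEO_ID",
    ),
}


def build_study_links(raw_data_url: Optional[str]) -> Dict[str, str]:
    """Return one link per study resource whose pattern matches raw_data_url."""
    links: Dict[str, str] = {}
    for label, (pattern, template, placeholder) in STUDY_RESOURCES.items():
        match = pattern.search(raw_data_url or "")
        if match:
            links[label] = template.replace(placeholder, match.group(0))
    return links


print(build_study_links("GSM1986309&PRJNA307186"))
print(build_study_links("ERR4290182"))
```

The first example value yields both a GEO and an ENA link, since it contains one id for each resource; the second yields only an ENA link.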
