- Docker
- Docker Compose plugin
- A reverse proxy such as NGINX, Traefik, or similar (configuring this is out of scope for this guide)
- A valid HTTPS certificate (configuring this is out of scope for this guide)
The following assay types can be ingested into an EpiVar node:
- RNA-seq
- ATAC-seq
- H3K4me1
- H3K4me3
- H3K27ac
- H3K27me3
- A metadata file for the bigWig tracks, which can be one of the following:
  - An XLSX file with one or more sheets (see an example for the Aracena et al. dataset), each with the following headers:
    - `file.path`: relative path to the bigWig file, without the `EPIVAR_TRACKS_DIR` environment variable directory prefix
    - `ethnicity`: ethnicity / population group ID (not name!)
      - if set to `Exclude sample`, the sample will be skipped
    - `condition`: condition / experimental group ID (not name!)
    - `sample_name`: full sample name, uniquely identifying the sample within the `assay`, `condition`, `donor`, and `track.view` variables
    - `donor`: donor ID (i.e., individual ID)
    - `track.view`: literal value, one of `signal_forward`, `signal_reverse`, or `signal_unstranded`
    - `track.track_type`: must be the literal value `bigWig`
    - `assay.name`: one of the available assays
    The file may have additional headers, but these will be discarded internally.
  - OR, a JSON file containing a list of objects with the following keys, mapping to the above headers in order:
    - `path`
    - `ethnicity`
    - `condition`
    - `sample_name`
    - `donor`
    - `view`
    - `type`
    - `assay`
- A dataset configuration file, which takes the form described in the example configuration file. Here, assays available in this node can be specified, as well as experimental conditions, population groups, and functions for interacting with the genotype VCF file.
  This file specifies information about the dataset being hosted by the EpiVar node, including dataset title, sample groups and experimental treatments (in both of these, each entry has an ID and a name), assembly ID (`hg19` or `hg38`), and how to find samples in the genotype VCF file.
- A human-readable dataset description file, in Markdown format, to show in the About Dataset tab in the browser. See an example for the Aracena et al. dataset.
- A bgzipped, Tabix-indexed VCF containing sample variants, using one of two available reference genomes (`hg19`/`hg38`).
- A set of normalized signal matrices: one per assay, each containing columns of samples and rows of features (see an example for ATAC-seq).
- A set of bigWigs, one or two (in the case of RNA-seq; forward/reverse view) per sample-assay pair.
- Peak and gene-peak-link CSV files, respectively containing the following:
  - Peak files are, by default, named according to the template `<qtls-directory>/QTLs_complete_$ASSAY.csv`, where `$ASSAY` is one of the available assays. This template naming can be changed (keeping the `$ASSAY` replaceable value) using the `EPIVAR_QTLS_TEMPLATE` environment variable.
    They are CSVs where each row contains data about the SNP, the p-value associations between the genotypes and each treatment, and the corresponding assay. There are a couple of truncated example files in `/input-files/qtls`. The required headings are the following:
    - `rsID`: the rsID of the SNP
    - `snp`: the SNP in the SNP-peak association, formatted like `chr#_######` (UCSC-formatted chromosome name, underscore, position)
    - `feature`: the feature name, either `chr#_startpos_endpos` or `GENE_NAME`
    - `pvalue.*`, where `*` is the ID of the condition, as specified in the `metadata.json` file (see above)
      - These are floating point numbers
    - `feature_type`: the assay the peak is from, e.g., `RNA-seq`
    As an example, the header row for the Aracena et al. dataset's RNA-seq QTLs file is the following:
    `rsID,snp,feature,pvalue.NI,pvalue.Flu,feature_type`
  - Some peaks are associated with genes, and their links should be provided in a gene-peak-link CSV file. This takes the form of a CSV with header row:
    `"symbol","peak_ids","feature_type"`
    where `symbol` is the gene symbol, and should be unique; `peak_ids` is a feature string composed of an underscore-separated contig/start position/end position (e.g., `chr1_9998_11177`); and `feature_type` is the name of the assay (see available assays).
    See the version of this file for the hg19 Aracena et al. dataset for an example.
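As a rough pre-ingestion check on the peak files described above, the following sketch expands the default file-name template for an assay, reports which required headings are missing from a header row, and splits a feature string into its parts. The helper names (`qtlsPathForAssay`, `missingQtlHeadings`, `parseFeature`) are illustrative and not part of EpiVar. It assumes header cells contain no embedded commas and that region features start with `chr`.

```javascript
// Hypothetical helpers (not part of EpiVar) for sanity-checking QTL CSV inputs.

// Expands the default naming template from this guide. The QTLs directory is a
// parameter because its real location depends on your deployment.
function qtlsPathForAssay(qtlsDirectory, assay) {
  const template = `${qtlsDirectory}/QTLs_complete_$ASSAY.csv`;
  return template.replace("$ASSAY", assay);
}

// Returns the required headings that are missing from a header row, given the
// condition IDs used by the dataset (e.g., ["NI", "Flu"]).
function missingQtlHeadings(headerRow, conditionIds) {
  // Assumption: no embedded commas; optional surrounding double quotes are stripped.
  const present = new Set(
    headerRow.trim().split(",").map((h) => h.trim().replace(/^"|"$/g, ""))
  );
  const required = [
    "rsID",
    "snp",
    "feature",
    ...conditionIds.map((id) => `pvalue.${id}`),
    "feature_type",
  ];
  return required.filter((h) => !present.has(h));
}

// Parses a feature string: either chr#_startpos_endpos or a gene name.
function parseFeature(feature) {
  const match = /^(chr[^_]+)_(\d+)_(\d+)$/.exec(feature);
  if (!match) return { kind: "gene", name: feature };
  return {
    kind: "region",
    contig: match[1],
    start: Number(match[2]),
    end: Number(match[3]),
  };
}

const exampleHeader = "rsID,snp,feature,pvalue.NI,pvalue.Flu,feature_type";
console.log(qtlsPathForAssay("/path/to/qtls", "RNA-seq"));
console.log(missingQtlHeadings(exampleHeader, ["NI", "Flu"]));
console.log(parseFeature("chr1_9998_11177"));
```

Contigs whose names contain underscores would be treated as gene names by this pattern, so adjust the regular expression if your dataset uses them.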
In order to follow this guide, you should have experience deploying Docker containers, including configuring volumes, networks, and environment variables.
To turn a metadata XLSX file into a JSON file, run the following command:
# the -i is important here; otherwise, the metadata.json file will be blank!
docker run -i ghcr.io/c3g/epivar-server node ./scripts/metadata-to-json.js < path/to/metadata.xlsx > data/metadata.json
Alternatively, generate a metadata JSON matching the required format directly.
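If you write the metadata JSON by hand, a small check along the following lines may catch mistakes before ingestion. This is a sketch, not EpiVar's own validation: `validateMetadata` is a hypothetical helper, the example record contains made-up placeholder values, and the list of assays passed in is assumed to be whatever your node's configuration enables.

```javascript
// Hypothetical validator (not part of EpiVar) for hand-written metadata records.
const REQUIRED_KEYS = [
  "path", "ethnicity", "condition", "sample_name", "donor", "view", "type", "assay",
];
const VIEWS = ["signal_forward", "signal_reverse", "signal_unstranded"];

// Returns a list of human-readable problems; an empty list means none were found.
// `assays` is the list of assay names enabled for your node.
function validateMetadata(records, assays) {
  if (!Array.isArray(records)) return ["metadata must be a JSON list of objects"];
  const errors = [];
  records.forEach((record, i) => {
    if (record === null || typeof record !== "object") {
      errors.push(`record ${i}: not an object`);
      return;
    }
    for (const key of REQUIRED_KEYS) {
      const value = record[key];
      if (value === undefined || value === null || value === "") {
        errors.push(`record ${i}: missing "${key}"`);
      }
    }
    // Paths are relative to the tracks directory, so a leading slash is suspicious.
    if (typeof record.path === "string" && record.path.startsWith("/")) {
      errors.push(`record ${i}: "path" must be relative to the tracks directory`);
    }
    if (record.view !== undefined && !VIEWS.includes(record.view)) {
      errors.push(`record ${i}: unknown view "${record.view}"`);
    }
    if (record.type !== undefined && record.type !== "bigWig") {
      errors.push(`record ${i}: "type" must be the literal value "bigWig"`);
    }
    if (record.assay !== undefined && !assays.includes(record.assay)) {
      errors.push(`record ${i}: unknown assay "${record.assay}"`);
    }
  });
  return errors;
}

// Made-up example record; every value here is a placeholder.
const exampleRecords = [
  {
    path: "rnaseq/sample1.forward.bw",
    ethnicity: "GROUP_A",
    condition: "NI",
    sample_name: "sample1_NI_RNA-seq_forward",
    donor: "donor1",
    view: "signal_forward",
    type: "bigWig",
    assay: "RNA-seq",
  },
];
console.log(validateMetadata(exampleRecords, ["RNA-seq", "ATAC-seq"]));
```

To run it against a real file, load the parsed contents of `metadata.json` in place of `exampleRecords`.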
In a production instance, you will need multiple volumes/bind mounts from the host filesystem to the server Docker container.
- Your dataset's config file should be bound to `/app/config.js` inside the container.
- Your dataset's about file (in Markdown format) should be bound to `/app/data/about.md` inside the container.
- The genotype `.vcf.gz` and `.vcf.gz.tbi` should be bound to `/app/data/genotypes.vcf.gz` and `/app/data/genotypes.vcf.gz.tbi`, respectively.
- Your `metadata.json` file, pre-existing or as created above, should be bound to `/app/data/metadata.json`.
- The node must be provided with a tracks folder containing subfolders and bigWig files matching the paths specified in the `metadata.json` file described above. This folder can be bound as read-only, and should be bound to `/tracks`, e.g.: `/path/to/tracks-dir:/tracks:ro`.
- The node must be provided with a readable/writable volume in which to place merged tracks, computed on-the-fly for visualization, bound to `/mergedTracks` inside the container, e.g.: `/volumes/mergedTracks:/mergedTracks`.
- The Redis container, used for caching, does not strictly require a filesystem mount. However, to preserve the cache across system restarts, a readable/writable volume should be bound to the Redis container's `/data` path, e.g.: `/volumes/redis:/data`.
- The database must be persisted to the host filesystem, bound to `/var/lib/postgresql/data` inside the container, e.g.: `/volumes/db:/var/lib/postgresql/data`.
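One way to confirm that the server container's mounts are in place is a small pre-flight check like the sketch below. It is not part of EpiVar: `missingPaths` is a hypothetical helper, and it assumes the tracks directory inside the container is `/tracks`, as in the bind-mount example above. The existence test is passed in as a function, so inside the container you could supply Node's `fs.existsSync`; the demonstration here uses a fake set of paths instead.

```javascript
// Hypothetical pre-flight check (not part of EpiVar) for server container mounts.
const SERVER_PATHS = [
  "/app/config.js",
  "/app/data/about.md",
  "/app/data/genotypes.vcf.gz",
  "/app/data/genotypes.vcf.gz.tbi",
  "/app/data/metadata.json",
  "/tracks",
  "/mergedTracks",
];

// Returns the expected paths for which `exists` reports false. Each metadata
// record's relative `path` is resolved against /tracks (an assumption).
function missingPaths(exists, metadataRecords = []) {
  const trackPaths = metadataRecords.map((record) => `/tracks/${record.path}`);
  return [...SERVER_PATHS, ...trackPaths].filter((p) => !exists(p));
}

// Demonstration with a fake filesystem in which one bigWig is absent.
const fakeFilesystem = new Set([...SERVER_PATHS, "/tracks/rnaseq/sample1.bw"]);
const demoRecords = [{ path: "rnaseq/sample1.bw" }, { path: "rnaseq/sample2.bw" }];
console.log(missingPaths((p) => fakeFilesystem.has(p), demoRecords));
```

The Redis and database volumes live in other containers, so they are deliberately not covered by this check.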
A few required environment variables have no default values and must be configured when deploying a node. Some of them are secrets, so they should not be shared or made public:
EPIVAR_NODE_BASE_URL=https://my-node-url.example.org
EPIVAR_SESSION_SECRET=some-long-secret-value-do-not-share-me
POSTGRES_PASSWORD=my-secure-password
EPIVAR_PG_CONNECTION=postgresql://postgres:my-secure-password@epivar-db:5432/postgres
- `EPIVAR_NODE_BASE_URL` should not have a trailing slash, nor the `/api` suffix, even though API endpoints are served under this suffix.
- These environment variables can either be configured in a `.env` file and attached to a Docker Compose container via the `env_file` directive attached to both the EpiVar server container and the Postgres container, or put into the `environment` directive in the Compose file directly. `EPIVAR_SESSION_SECRET` and `EPIVAR_PG_CONNECTION` are for the EpiVar server container, and `POSTGRES_PASSWORD` is for the Postgres container. Either way, make sure not to commit them to any public repository.
- Several other configuration options are available, and documented in commented code, in the `/envConfig.js` file. Many of the default values there match how the Docker container is configured, especially the filesystem path options, so change them with caution.
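The rules above for the required variables can be checked mechanically before deploying. The sketch below parses the text of a `.env` file and reports unset variables, a trailing slash or `/api` suffix on `EPIVAR_NODE_BASE_URL`, and a mismatch between `POSTGRES_PASSWORD` and the password embedded in `EPIVAR_PG_CONNECTION`. Both `parseDotEnv` and `checkEnv` are hypothetical helpers, not part of EpiVar. The parser assumes plain `KEY=VALUE` lines without quoting, and the password comparison assumes the connection string authenticates with the password set through `POSTGRES_PASSWORD`, as in the example values above.

```javascript
// Hypothetical pre-deployment check (not part of EpiVar) for a .env file's contents.
const REQUIRED_VARS = [
  "EPIVAR_NODE_BASE_URL",
  "EPIVAR_SESSION_SECRET",
  "POSTGRES_PASSWORD",
  "EPIVAR_PG_CONNECTION",
];

// Minimal KEY=VALUE parser; assumes no quoting and no inline comments.
function parseDotEnv(text) {
  const env = {};
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith("#")) continue;
    const eq = trimmed.indexOf("=");
    if (eq === -1) continue;
    env[trimmed.slice(0, eq).trim()] = trimmed.slice(eq + 1).trim();
  }
  return env;
}

// Returns a list of problems; an empty list means none were found.
function checkEnv(env) {
  const problems = REQUIRED_VARS
    .filter((name) => !env[name])
    .map((name) => `${name} is not set`);
  const baseUrl = env.EPIVAR_NODE_BASE_URL || "";
  if (baseUrl.endsWith("/")) {
    problems.push("EPIVAR_NODE_BASE_URL must not have a trailing slash");
  }
  if (baseUrl.endsWith("/api")) {
    problems.push("EPIVAR_NODE_BASE_URL must not include the /api suffix");
  }
  if (env.EPIVAR_PG_CONNECTION && env.POSTGRES_PASSWORD) {
    try {
      // Assumption: the connection string uses the POSTGRES_PASSWORD value.
      const url = new URL(env.EPIVAR_PG_CONNECTION);
      if (decodeURIComponent(url.password) !== env.POSTGRES_PASSWORD) {
        problems.push("password in EPIVAR_PG_CONNECTION differs from POSTGRES_PASSWORD");
      }
    } catch {
      problems.push("EPIVAR_PG_CONNECTION is not a parseable URL");
    }
  }
  return problems;
}

const sampleDotEnv = [
  "EPIVAR_NODE_BASE_URL=https://my-node-url.example.org",
  "EPIVAR_SESSION_SECRET=some-long-secret-value-do-not-share-me",
  "POSTGRES_PASSWORD=my-secure-password",
  "EPIVAR_PG_CONNECTION=postgresql://postgres:my-secure-password@epivar-db:5432/postgres",
].join("\n");
console.log(checkEnv(parseDotEnv(sampleDotEnv)));
```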
Assuming you have set up a Docker Compose file, similar to the one we provide as an example, you can start the node using the following command:
docker compose up -d
First, import the assembly gene list and gene-peak association data into the database using the following command:
docker compose exec -i epivar-server node ./scripts/import-genes.mjs < ./input-files/flu-infection-gene-peaks.csv
Then, import peaks and pre-computed peak matrix values into the database using the following command:
docker compose exec epivar-server node ./scripts/import-peaks.js
Note: This will take a while.
Then, calculate summary data for the peaks:
# Aggregate data for peaks grouped by SNP and gene, used for autocomplete:
docker compose exec epivar-server node ./scripts/calculate-peak-groups.mjs
# Used to generate Manhattan plots for chromosome/assay pairs, binned by SNP position:
docker compose exec epivar-server node ./scripts/calculate-top-peaks.mjs
Finally, ensure the cache is clear in case any values have been added accidentally during data ingestion, or are remaining from prior data ingestions:
docker compose exec epivar-server node ./scripts/clear-cache.js
In order to connect an EpiVar node to the EpiVar Browser portal, the node must be publicly accessible with a valid HTTPS certificate and a reverse proxy passing traffic to the EpiVar server (the configuration of which is out of scope for this guide).
Then, contact us at epivar@computationalgenomics.ca with information about your node and your dataset, including the domain name and path of your instance, and we will decide whether to include your node in the list of nodes available in the EpiVar Browser.