Reference Genome instruction #112
Open
MikhailAf wants to merge 8 commits into develop from feature/guide-reference-genome
c934f68 Reference Genome instruction (MikhailAf)
08a701d Merge branch 'develop' into feature/guide-reference-genome (eeliane)
cd0d53b Merge branch 'develop' into feature/guide-reference-genome (MikhailAf)
0c24517 Updated details (srz11d)
07551dd Supported file formats link fixed (MikhailAf)
6117315 Merge branch 'develop' into feature/guide-reference-genome (MikhailAf)
ab1f263 All linkes are fixed (MikhailAf)
a76f8d1 Merge branch 'develop' into feature/guide-reference-genome (MikhailAf)
Binary file added
BIN
+293 KB
...e/doc-odm-user-guide/doc-odm-user-guide/images/variants/file-manager-genome.jpg
Binary file added
BIN
+534 KB
...de/doc-odm-user-guide/doc-odm-user-guide/images/variants/post-link-variants.jpg
Binary file added
BIN
+165 KB
...de/doc-odm-user-guide/doc-odm-user-guide/images/variants/reference-metadata.jpg
Binary file added
BIN
+472 KB
...guide/doc-odm-user-guide/doc-odm-user-guide/images/variants/variant-me-data.jpg
Binary file added
BIN
+825 KB
docs/user-guide/doc-odm-user-guide/doc-odm-user-guide/images/variants/vcf-file.jpg
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,160 @@ | ||
| # Working with Reference Genomes | ||
|
|
||
| This guide explains how to import, manage, and extract VCF files using ODM APIs, with a focus on working with different | ||
| reference genomes. ODM is a flexible platform that allows users to work with various species by importing custom | ||
| reference genomes. | ||
|
|
||
| ## Understanding Reference Genomes | ||
|
|
||
| A reference genome is a standardized representation of a species' genome, used as a comparative framework to identify | ||
| genetic variations. When analyzing genetic variants, sequencing data is aligned to a reference genome, enabling | ||
| researchers to determine mutations, structural changes, and variant effects. | ||
|
|
||
| By default, ODM uses the human genome (**GRCh38**) as the reference genome. Any VCF file uploaded without specifying | ||
| a reference genome will be mapped against GRCh38. However, researchers working with non-human species, such as plants, | ||
| animals, or model organisms, require custom reference genomes for accurate data interpretation. ODM supports the import | ||
| of alternative reference genomes, ensuring flexibility for diverse research applications. | ||
|
|
||
| ## VCF Format Overview | ||
|
|
||
| The Variant Call Format (VCF) is a widely used file format for storing genetic variation data from sequencing experiments. | ||
| It is designed to be both human-readable and machine-parsable, making it ideal for bioinformatics applications. | ||
|
|
||
| A typical VCF file consists of two main sections: | ||
|
|
||
| * **Header**: Contains metadata such as the VCF version, reference genome, and column descriptions. | ||
| * **Body**: Lists variant data, where each row represents a single variant. | ||
|
|
||
| ### **Important Columns in the VCF Body:** | ||
|
|
||
| * **CHROM**: Chromosome where the variant is located. | ||
| * **POS**: Genomic position of the variant. | ||
| * **ID**: Identifier (e.g., dbSNP ID). | ||
| * **REF**: Reference base(s) at the variant position. | ||
| * **ALT**: Alternative base(s) observed at the variant position. | ||
| * **QUAL**: Phred-scaled quality score of the variant call. | ||
| * **FILTER**: Quality control filter status. | ||
| * **INFO**: Additional metadata, such as allele frequency. | ||
| * **FORMAT**: Defines genotype fields for samples. | ||
| * **Sample Data**: Genotype information for individual samples. | ||
|
|
||
|  | ||
|
|
||
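The column layout above can be illustrated with a small parser. This is a minimal sketch for a single tab-separated data line, not a full VCF reader: the function name `parse_vcf_line` and the shape of the returned dictionary are choices made for this illustration, and header handling and value typing beyond `POS` are left out.

```python
from typing import Dict, List

# The eight fixed columns that start every VCF data line.
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, object]:
    """Split one VCF body line into its columns (illustrative sketch only)."""
    fields = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(FIXED_COLUMNS, fields[:8]))

    # POS is a 1-based integer position; ALT may list several alleles.
    record["POS"] = int(fields[1])
    record["ALT"] = fields[4].split(",")

    # INFO holds semicolon-separated key=value pairs or bare flags.
    info: Dict[str, object] = {}
    if fields[7] != ".":
        for item in fields[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True
    record["INFO"] = info

    # FORMAT names the per-sample fields; each sample column follows that order.
    samples: Dict[str, Dict[str, str]] = {}
    if len(fields) > 9:
        format_keys = fields[8].split(":")
        for name, raw in zip(sample_names, fields[9:]):
            samples[name] = dict(zip(format_keys, raw.split(":")))
    record["samples"] = samples
    return record
```

Sample names normally come from the `#CHROM` header line of the file; here they are passed in explicitly to keep the sketch short.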
| ## Importing Reference Genomes into ODM | ||
|
|
||
| To enable proper variant mapping, users can import new reference genomes into ODM. Each reference genome is imported from | ||
| an annotation file in Gene Transfer Format (**GTF**), which describes gene features such as genes, transcripts, exons, and coding regions. | ||
|
|
||
| New reference genomes can be sourced from public repositories such as Ensembl, or from custom datasets. | ||
|
|
||
| ### Steps to Import a New Reference Genome via API | ||
|
|
||
| 1. Use the `POST /api/v1/reference-genomes` endpoint to import a new reference genome. | ||
| 2. Provide the required details, including: | ||
| * **annotationUrl**: URL of the genome annotation file in GTF format (gzip-compressed, `.gtf.gz`). | ||
| * **organism**: Scientific name of the species (e.g., *Zea mays*). | ||
| * **assembly**: Genome assembly version (e.g., Zm-B73-REFERENCE-NAM-5.0). | ||
| * **release**: Minor version of the reference genome. | ||
| * **name**: A custom title for the reference genome, typically derived from the species, assembly, and release details. | ||
|
|
||
| **Example request:** | ||
|
|
||
| ``` json | ||
| { | ||
| "annotationUrl": "https://ftp.ebi.ac.uk/ensemblgenomes/pub/release-60/plants/gtf/zea_mays/Zea_mays.Zm-B73-REFERENCE-NAM-5.0.60.chr.gtf.gz", | ||
| "organism": "Zea mays", | ||
| "assembly": "Zm-B73-REFERENCE-NAM-5.0", | ||
| "release": "113.8", | ||
| "name": "Zea mays (maize) Zm-B73" | ||
| } | ||
| ``` | ||
|
|
||
| **Example response:** | ||
|
|
||
| ``` json | ||
| { | ||
| "genestack:accession": "GSF1278535" | ||
| } | ||
| ``` | ||
|
|
||
| A successful response returns the unique accession assigned to the imported genome. Users can locate the imported genome | ||
| in ODM’s **File Manager** using this accession: | ||
|
|
||
|  | ||
|
|
||
| Once the genome is available, it can be used as a reference for variant files. | ||
|
|
||
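The import call described in this section could be scripted roughly as follows. This is a sketch under stated assumptions: the base URL of the ODM instance and the name of the authentication header (`token_header`) are not given in this guide and must be adapted to your deployment; only the endpoint path, the payload fields, and the `genestack:accession` response key come from the text above.

```python
import json
import urllib.request


def build_reference_genome_request(
    base_url: str,
    token: str,
    *,
    annotation_url: str,
    organism: str,
    assembly: str,
    release: str,
    name: str,
    token_header: str = "Genestack-API-Token",  # assumption: check your instance
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that imports a reference genome."""
    payload = {
        "annotationUrl": annotation_url,
        "organism": organism,
        "assembly": assembly,
        "release": release,
        "name": name,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/reference-genomes",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", token_header: token},
        method="POST",
    )


def extract_accession(response_body: str) -> str:
    """Read the accession from a response shaped like the example above."""
    return json.loads(response_body)["genestack:accession"]
```

The prepared request can be sent with `urllib.request.urlopen(request)`, and the decoded response body passed to `extract_accession` to obtain the identifier used in the File Manager.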
| ## Importing Gene Variant Files | ||
|
|
||
| To import VCF files and link them to a specific reference genome, users must specify the reference genome ID in the metadata. | ||
|
|
||
| ### Steps to Import Gene Variant Files | ||
|
|
||
| 1. Use the `POST /api/v1/jobs/import/variant` endpoint to upload a VCF file. | ||
| 2. Create a metadata file in tabular format specifying the reference genome for the VCF file. | ||
|  | ||
| A metadata file in tabular format ensures the VCF file is linked to the correct reference genome. | ||
|
|
||
| 3. Include the metadata file link and the VCF file link in the API request. | ||
|
|
||
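A tabular metadata file such as the one in step 2 can be produced with a few lines of Python. The sketch below only shows how to write tab-separated values; the column names in the usage note are hypothetical placeholders, because the actual attribute names depend on the metadata template used in your ODM instance.

```python
import csv
import io
from typing import Dict, List


def render_metadata_tsv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Render metadata rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `render_metadata_tsv([{"Name": "Zea_mays_example", "Reference Genome": "GSF1278535"}], ["Name", "Reference Genome"])` yields a two-line TSV; both column names here are illustrative only and should be replaced with the attributes your template expects.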
| **Example Request to Import VCF:** | ||
|
|
||
| ``` json | ||
| { | ||
| "source": "S3", | ||
| "metadataLink": "s3://my_instance/SRL_GenVariant/Metadata_RefGen2.tsv", | ||
| "dataLink": "s3://my_instance/SRL_GenVariant/Zea_mays_example.vcf", | ||
| "templateId": "GSF1150101" | ||
| } | ||
| ``` | ||
|
|
||
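The request body above can be assembled and validated in code before it is sent. This is a minimal sketch: the base URL and the authentication header name are assumptions not stated in this guide, and `"S3"` is used as the default source only because it is the value shown in the example.

```python
import json
import urllib.request


def build_variant_import_request(
    base_url: str,
    token: str,
    *,
    metadata_link: str,
    data_link: str,
    template_id: str,
    source: str = "S3",
    token_header: str = "Genestack-API-Token",  # assumption: check your instance
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that starts a VCF import job."""
    payload = {
        "source": source,
        "metadataLink": metadata_link,
        "dataLink": data_link,
        "templateId": template_id,
    }
    # Fail early on empty values rather than submitting an incomplete job.
    missing = [key for key, value in payload.items() if not value]
    if missing:
        raise ValueError(f"Missing required fields: {', '.join(missing)}")

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/jobs/import/variant",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", token_header: token},
        method="POST",
    )
```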
| **Example response:** | ||
|
|
||
| ``` json | ||
| { | ||
| "jobExecId": 2005, | ||
| "startedBy": "sharon.ruiz.lopez@genestack.com", | ||
| "jobName": "IMPORT_VARIANT_VCF", | ||
| "status": "STARTING", | ||
| "createTime": "14-03-2025 06:11:48" | ||
| } | ||
| ``` | ||
|
|
||
| The response indicates that the job has started. Users can track progress using the endpoint: | ||
|
|
||
| `GET /api/v1/jobs/{jobExecId}/output` | ||
|
|
||
| Once completed, the system assigns an accession number to the imported file. | ||
|
|
||
| **Expected response**: | ||
|
|
||
| ``` json | ||
| { | ||
| "status": "COMPLETED", | ||
| "result": { | ||
| "groupAccession": "GSF1278671" | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| This confirms that the VCF file has been successfully imported and linked to the specified reference genome. | ||
|
|
||
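Tracking the job until it completes can be sketched as a simple polling loop. Several details here are assumptions rather than documented behaviour: that the output endpoint also reports a `status` while the job is still running, that `FAILED` is a possible terminal status, and the base URL and authentication header name. The fetch function is passed in as a parameter so the loop can be exercised without a live server.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Sequence


def fetch_job_output(
    base_url: str,
    token: str,
    job_exec_id: int,
    token_header: str = "Genestack-API-Token",  # assumption: check your instance
) -> Dict:
    """Call GET /api/v1/jobs/{jobExecId}/output and decode the JSON body."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/jobs/{job_exec_id}/output",
        headers={token_header: token},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_group_accession(
    fetch: Callable[[int], Dict],
    job_exec_id: int,
    poll_seconds: float = 5.0,
    max_polls: int = 60,
    failure_statuses: Sequence[str] = ("FAILED",),  # assumed status name
) -> str:
    """Poll until the job reports COMPLETED, then return its groupAccession."""
    for _ in range(max_polls):
        output = fetch(job_exec_id)
        status = output.get("status")
        if status == "COMPLETED":
            return output["result"]["groupAccession"]
        if status in failure_statuses:
            raise RuntimeError(f"Job {job_exec_id} ended with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job {job_exec_id} did not complete after {max_polls} polls")
```

A typical call would wrap the fetcher with the instance details, for example `wait_for_group_accession(lambda job_id: fetch_job_output(base_url, token, job_id), 2005)`.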
| ## Linking Gene Variant Files to Sample Metadata | ||
|
|
||
| After importing a VCF file, it must be linked to its corresponding sample metadata. This is done via the following API endpoint: | ||
|
|
||
| `POST /api/v1/as-curator/integration/link/variant/group/{sourceId}/to/sample/group/{targetId}` | ||
|
|
||
|  | ||
| Users can link VCF files to sample metadata by providing both accession numbers. | ||
|
|
||
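The linking call can be prepared in the same way as the import requests. In this sketch the base URL and authentication header name are assumptions, and the request is sent without a body because the guide shows both accessions in the path only; adjust this if your instance expects otherwise.

```python
import urllib.request
from urllib.parse import quote


def build_link_request(
    base_url: str,
    token: str,
    variant_group_accession: str,
    sample_group_accession: str,
    token_header: str = "Genestack-API-Token",  # assumption: check your instance
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request linking variants to samples."""
    # sourceId is the variant group, targetId is the sample group.
    path = (
        "/api/v1/as-curator/integration/link/variant/group/"
        f"{quote(variant_group_accession, safe='')}"
        "/to/sample/group/"
        f"{quote(sample_group_accession, safe='')}"
    )
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        headers={token_header: token},
        method="POST",
    )
```

Here `variant_group_accession` would be the `groupAccession` returned by the import job (for example `GSF1278671`), and `sample_group_accession` the accession of the sample group it describes.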
| Once linked, the variant data becomes accessible in the Gene Variant Data section of ODM. | ||
|
|
||
|  | ||
| Successfully imported and linked VCF files can be explored in ODM’s Gene Variant Data section. | ||
|
|
||
| ODM provides a flexible and scalable solution for working with VCF files, supporting multiple reference genomes beyond | ||
| the default human genome. By leveraging these capabilities, users can efficiently import, manage, and link genetic | ||
| variant data across different species, enhancing their data analysis workflows. | ||