|
| 1 | +# Supplementary Information |
| 2 | + |
| 3 | + |
| 4 | +## File Formats |
| 5 | + |
| 6 | +Reference genomes and annotation files can be downloaded from: |
| 7 | +* Ensembl database: https://asia.ensembl.org/info/data/ftp/index.html |
| 8 | +* UCSC database: https://hgdownload.soe.ucsc.edu/downloads.html |
| 9 | +* NCBI database: https://www.ncbi.nlm.nih.gov/guide/howto/dwn-genome/ |
| 10 | + |
| 11 | +The top of an Ensembl *Homo sapiens* FASTA file: |
| 12 | + |
| 13 | +```{r, echo=FALSE, out.width="100%"} |
| 14 | +knitr::include_graphics("images/supplementary/chr_fasta_full_name.png") |
| 15 | +``` |
| 16 | + |
| 17 | +Each sequence in a FASTA file begins with a header line, indicated by a leading `>`. The header line contains the chromosome name and may carry some extra information. A minimal header can consist of just the chromosome name. |
| 18 | + |
| 19 | +```{r, echo=FALSE, out.width="100%"} |
| 20 | +knitr::include_graphics("images/supplementary/chr_fasta.png") |
| 21 | +``` |
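As a minimal sketch of how such a header can be read programmatically, the Python function below splits a header line into the sequence name and any trailing description. The function name `parse_fasta_header` and the example header strings are illustrative assumptions, not part of any library, and the sketch assumes the name is the first whitespace-delimited token after `>`.

```python
def parse_fasta_header(line):
    """Split a FASTA header line into (sequence name, description).

    Hypothetical helper for illustration: assumes the sequence name is
    the first whitespace-delimited token after the leading '>'.
    """
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    parts = line[1:].strip().split(None, 1)
    if not parts:
        raise ValueError("FASTA header has no sequence name")
    name = parts[0]
    # Anything after the first token is optional extra information.
    description = parts[1] if len(parts) > 1 else ""
    return name, description


# Illustrative headers: one with extra information, one minimal.
full_header = parse_fasta_header(">1 dna:chromosome extra information")
minimal_header = parse_fasta_header(">chr1")
```

Here `full_header[0]` is `"1"` and `minimal_header` is `("chr1", "")`, which shows why the name in the first position is the part that has to agree with other files.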
| 22 | + |
| 23 | +The lines following the header contain that chromosome’s sequence. |
| 24 | + |
| 25 | +```{r, echo=FALSE, out.width="100%"} |
| 26 | +knitr::include_graphics("images/supplementary/fasta_seq.png") |
| 27 | +``` |
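The header-then-sequence layout described above can be sketched as a small Python reader that collects the sequence lines belonging to each header. The function name `read_fasta` and the toy two-record input are assumptions made for illustration; real genome files are large, so a maintained parser may be a better choice in practice.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open FASTA text handle.

    Illustrative sketch: the name is the first token after '>', and all
    following lines up to the next header are joined into one sequence.
    """
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Toy input standing in for a real file opened with open(path).
example = io.StringIO(">1 toy record\nACGT\nNNAC\n>2\nGGCC\n")
records = dict(read_fasta(example))
```

With this toy input, `records` maps `"1"` to `"ACGTNNAC"` and `"2"` to `"GGCC"`.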
| 28 | + |
| 29 | +Annotation files are usually in GTF or GFF3 format. Below is a GTF file: |
| 30 | + |
| 31 | +```{r, echo=FALSE, out.width="100%"} |
| 32 | +knitr::include_graphics("images/supplementary/gtf_file.png") |
| 33 | +``` |
| 34 | + |
| 35 | +A GTF file is a tab-separated file, meaning that its columns are delimited by tab characters. A GTF file always has 9 columns containing the following information (taken from here): |
| 36 | + |
| 37 | +1. seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Note: the chromosome name format must match the FASTA file, e.g. if the FASTA file has `chr1` then the GTF file should also have `chr1` in this column. If the FASTA file has `1` then the GTF file should have `1` in this column. |
| 38 | +2. source - name of the program that generated this feature, or the data source (database or project name) |
| 39 | +3. feature - feature type name, e.g. Gene, Variation, Similarity |
| 40 | +4. start - Start position of the feature, with sequence numbering starting at 1. |
| 41 | +5. end - End position of the feature, with sequence numbering starting at 1. |
| 42 | +6. score - A floating point value. |
| 43 | +7. strand - defined as + (forward) or - (reverse). |
| 44 | +8. frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on. |
| 45 | +9. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature. |
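
The nine columns listed above can be sketched in Python as follows. This is a minimal illustration rather than a complete GTF parser: the names `parse_gtf_line`, `parse_attributes` and `names_missing_from_fasta`, as well as the toy record, are assumptions introduced here. It assumes attribute values contain no semicolons and keeps only the last value when a tag is repeated.

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_attributes(text):
    """Turn 'tag "value"; tag "value";' into a dict (simplified)."""
    attributes = {}
    for pair in text.strip().split(";"):
        pair = pair.strip()
        if not pair:
            continue
        tag, _, value = pair.partition(" ")
        attributes[tag] = value.strip().strip('"')
    return attributes


def parse_gtf_line(line):
    """Split one tab-separated GTF line into its 9 named columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])  # 1-based start position
    record["end"] = int(record["end"])      # 1-based end position
    record["attribute"] = parse_attributes(record["attribute"])
    return record


def names_missing_from_fasta(gtf_names, fasta_names):
    """Return GTF seqnames that do not appear among the FASTA headers."""
    return sorted(set(gtf_names) - set(fasta_names))


# Toy record, not taken from a real annotation file.
toy_line = '1\ttoy_source\tgene\t100\t900\t.\t+\t.\tgene_id "GENE0001"; gene_name "TOY1";'
record = parse_gtf_line(toy_line)
mismatched = names_missing_from_fasta(["chr1", "chr2"], ["1", "2"])
```

In this sketch `record["attribute"]["gene_id"]` is `"GENE0001"`, and `mismatched` lists both `chr1` and `chr2` because the FASTA names lack the `chr` prefix, which is the naming inconsistency the note on the seqname column warns about.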