Tutorial
This tutorial assumes you have an Amazon Web Services account registered with the NMDP. This will grant you access to a public machine image with all the data, tools, and compute infrastructure you need to proceed. If you do not have these things, go here first.
If you have a GitHub account with a public key:
git clone [email protected]:nmdp-bioinformatics/pipeline.git
This will allow you to contribute changes; otherwise, you may clone and use the pipeline anonymously:
git clone https://github.com/nmdp-bioinformatics/pipeline.git
Either command creates a local clone (working copy) of the GitHub repository, which contains several shell scripts for parallel execution of pipeline components.
Public sample data from the Sequence Read Archive (SRA) are provided here:
/opt/data/DRP000941/
Each of the 73 files contains phased NGS data for 6-locus HLA typing published by Hosomichi et al. (2013). The files must be decompressed from SRA format to FASTQ before processing; the SRA Toolkit provides utilities for this purpose. The decompressed data are also provided in the fastq/ directory.
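For example, assuming the SRA Toolkit is installed and the runs are stored as .sra files (the filename below and the exact flags are illustrative and may vary by toolkit version), a single run can be decompressed with:
fastq-dump --split-files /opt/data/DRP000941/DRR003809.sra
The --split-files option writes paired-end mates to separate _1 and _2 FASTQ files.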
Be careful: please do not process all of the SRA data on the NMDP-provided AMI. Use the following commands to process a single sample, then scale up as you wish using your own Amazon account or another high-performance compute environment.
From within your cloned pipeline directory:
./tutorial/prep_data_dir.bash ~/workdir
./splitter.bash ~/workdir
Processing the data will take several minutes, depending on your instance type. In the meantime, you may proceed using the preprocessed results.
Clinical interpretation of HLA DNA sequence for transplantation is typically confined to the antigen recognition sites (ARS), which correspond to exons 2 and 3 of class I genes and exon 2 of class II genes. The NMDP's interpretation service currently requires consensus sequences that are trimmed of sequence representing other structural elements (non-ARS exons, introns, promoters, and other untranslated regions). We will use HLA-A results for a single homozygous sample, DRR003809 (cell line DKB), for illustrative purposes. The first step is to filter contigs by region, in this case exons 2 and 3 of HLA-A:
groovy -classpath /mnt/scratch/caleb/ngs-feature-1.0-SNAPSHOT/lib/ngs-feature-1.2-SNAPSHOT.jar:/mnt/scratch/caleb/ngs-tools-1.0-SNAPSHOT/lib/biojava.jar:/mnt/scratch/caleb/ngs-tools-1.0-SNAPSHOT/lib/picard-1.102.0.jar:/mnt/scratch/caleb/ngs-tools-1.0-SNAPSHOT/lib/guava-17.0.jar:/mnt/scratch/caleb/ngs-variant-1.0-SNAPSHOT/lib/ngs-variant-1.2-SNAPSHOT.jar /mnt/scratch/caleb/splice-bam.groovy -i /opt/data/DRP000941/results/DRR003809_1.fastq.contigs.bwa.sorted.bam -x /opt/nmdp/regions/clinical-exons/hla-a.txt -g HLA-A -m -c -b 0.5 > DKB.fasta
The parameters are as follows:
-classpath includes the required Java dependencies
-i is the input file of aligned consensus sequences
-x is the file of filter regions, in this case exons 2 and 3 of HLA-A (see the sketch after this list)
-g is a passthrough parameter used in FASTA headers
-m enforces strict merging of consensus contigs
-c creates phased cDNA from exon sequences on the same contig
-b is the minimum breadth-of-coverage filter
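For reference, the regions file supplied to -x is plain text with one tab-separated exon index and genomic coordinate per line; this is the same format parsed by the Groovy snippet near the end of this tutorial. The coordinates below are illustrative placeholders, not the actual contents of hla-a.txt:
2	chr6:29942757-29943026
3	chr6:29943268-29943543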
You may validate that the consensus sequences were correctly filtered by uploading them to the UCSC Genome Browser BLAT tool:

Finally, use the NMDP's interpretation service to assign nomenclature on the command-line. A simple web-interface is also available (not shown):
curl -T DKB.fasta http://interp.b12x.org/hla/api/rs/interp | awk '{split($0,a,"|"); print a[1]}'
The awk command limits the output to the first assigned allele. After curl's transfer progress meter, the result is:
HLA-A*24:02:01:01+HLA-A*24:02:01:01
This is consistent with the known typing for cell line DKB.
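To inspect the service's full response, which is pipe-delimited with the assigned allele in the first field, simply run the same command without the awk filter:
curl -T DKB.fasta http://interp.b12x.org/hla/api/rs/interp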
With this tutorial completed, you have some of the basic elements needed to construct an HML message. The following demonstrates some additional tools and processes that should facilitate creation of fully compliant HML documents appropriate for submitting genetic data to the NMDP.
The following Groovy code snippet uses BioJava's FASTA reader to validate consensus sequences.
import java.io.BufferedReader
import java.io.File
import java.io.FileReader
import org.biojava.bio.BioException
import org.biojava.bio.seq.SequenceIterator
import org.biojava.bio.seq.io.SeqIOTools
File file = new File("DKB.fasta")
BufferedReader reader = new BufferedReader(new FileReader(file))
try {
for (SequenceIterator sequences = SeqIOTools.readFastaDNA(reader); sequences.hasNext(); ) {
def sequence = sequences.nextSequence()
// do something with sequence
}
}
catch (BioException error) {
println "invalid sequence-tag: ${error}"
}

For other file formats, such as VCF, the feature parser has corresponding methods to validate DNA fields.
The following Groovy code snippet uses the feature parser to reformat HLA-A clinical exons into proper HML targeted-region tags.
import org.nmdp.ngs.feature.Locus
import org.nmdp.ngs.feature.parser.FeatureParser
def filename = "/opt/nmdp/regions/clinical-exons/hla-a.txt"
new File(filename).eachLine { line ->
def (index, coordinate) = line.split("\t")
def locus = FeatureParser.parseLocus(coordinate)
println "<targeted-region \
\n assembly=\"GRCh38\" \
\n contig=\"${locus.getContig()}\" \
\n start=\"${locus.getMin()}\" \
\n end=\"${locus.getMax()}\" \
\n strand=\"1\" \
\n id=\"file://${filename}\" \
\n description=\"HLA-A exon ${index}\"/>"
}

Including the interpretation in HML is as simple as placing the interpreted allele between glstring tags such as:
<gl:glstring>HLA-A*24:02:01:01+HLA-A*24:02:01:01</gl:glstring>
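For orientation, here is a minimal sketch of how such a glstring might sit within a larger HML document. The surrounding element names and attribute values are illustrative assumptions rather than a schema reference; consult the HML specification for the authoritative structure:
<sample id="DRR003809">
  <typing gene-family="HLA">
    <allele-assignment allele-db="IMGT/HLA">
      <gl:glstring>HLA-A*24:02:01:01+HLA-A*24:02:01:01</gl:glstring>
    </allele-assignment>
  </typing>
</sample>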