
LoadingLocalData

Andrew Ramsay edited this page Dec 10, 2020 · 12 revisions

Loading a new dataset into NPLinker

This guide describes the steps that first-time users of NPLinker need to take in order to load a new dataset.

Prerequisites:

  • Docker
  • Metabolomics data. Typically the contents of the "Clustered Spectra as MGF" .zip file from your GNPS job.
  • Genomics data. At minimum a folder of antiSMASH .gbk files. BiG-SCAPE data can optionally be provided, but will be generated if not found.
  • A strain mappings CSV file (see below for details)

1. Create a shared folder for NPLinker files

When using the Docker version of NPLinker, the application has no direct access to the files stored on your system. Instead, you give Docker access to a chosen "shared folder" where one or more datasets are located. All files you wish to load into NPLinker must be inside this folder somewhere, but you can have any number of other folders inside the top level one to organise different datasets.

It doesn't matter where the shared folder is located, or what it is called. Simply pick a location and create a new empty folder. This guide assumes the folder is called nplinker_shared.

2. Create a dedicated folder for the dataset

Inside the nplinker_shared folder, create a new subfolder. Once again the name doesn't matter. This guide assumes the folder is called dataset_1 but feel free to substitute the name of your dataset.
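Steps 1 and 2 can be done in a file manager or terminal; as a sketch, they amount to creating two nested folders. The snippet below assumes the example names used in this guide (nplinker_shared in your home folder, dataset_1 inside it) — substitute your own.

```python
# Sketch: create the shared folder layout described in steps 1 and 2.
# "nplinker_shared" and "dataset_1" are just the example names from
# this guide; any names and any location will work.
from pathlib import Path

shared = Path.home() / "nplinker_shared"
dataset = shared / "dataset_1"

# parents=True creates both levels; exist_ok=True makes it safe to re-run
dataset.mkdir(parents=True, exist_ok=True)
print(dataset.is_dir())  # True
```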

3. Create a basic NPLinker configuration file

NPLinker has various options that can be configured. The easiest way to do this is by using a configuration file. The file contains text formatted using simple TOML syntax and so has a .toml extension. A complete example of an NPLinker configuration file can be found here, but a typical example will be only a few lines long.

To configure NPLinker to load a dataset from the dataset_1 folder, create a new file in nplinker_shared called nplinker.toml. The Docker version of NPLinker is already configured to look for a file in this location with this name.

Open the file in a text editor, and add the following content:

[dataset]
root = "path to dataset_1"

Examples assuming nplinker_shared is in your home folder

  • OSX:
    • root = "/Users/myusername/nplinker_shared/dataset_1"
  • Linux:
    • root = "/home/myusername/nplinker_shared/dataset_1"
  • Windows (NOTE: use "/" or an escaped "\\" as path separators — a single "\" is not valid inside a TOML string):
    • root = "c:/Users/myusername/nplinker_shared/dataset_1"

4. Populating the dataset folder

4.1 Metabolomics data

NPLinker is designed to work with the folder structure generated by GNPS jobs. Download the results of your job using the "Download clustered spectra as MGF" link, and extract the zip file inside the dataset_1 folder. The content may vary slightly depending on the GNPS workflow used. NPLinker is known to work with the following workflow outputs:

  • METABOLOMICS-SNETS (version 1.2.3)
  • METABOLOMICS-SNETS-V2 (version release_14)
  • FEATURE-BASED-MOLECULAR-NETWORKING (version 1.2.3)

Some of the files/folders generated by GNPS are not used by NPLinker, but can safely be left in place. The files and folders which NPLinker expects to find are listed below. "*.tsv" and similar entries indicate that NPLinker will load any file with a ".tsv" extension; the exact filename is not important.

  • clusterinfosummarygroup_attributes_withIDs_withcomponentID/*.clustersummary OR clusterinfo_summary/*.tsv
  • networkedges_selfloop/*.selfloop
  • *.mgf OR spectra/*.mgf
  • (optional, not in all workflows) metadata_table/metadata_table*.txt
  • (optional, not in all workflows) quantification_table_reformatted/*.csv
  • (optional) DB_result/*.tsv OR result_specnets_DB/*.tsv
  • (optional) params.xml

4.2 Genomics data

On the genomics side, NPLinker requires at minimum a folder of antiSMASH .gbk files. These may be in a single flat folder or in subfolders. Simply create an "antismash" folder inside dataset_1 and copy/move these files into that folder.

BiG-SCAPE files are not required, but if you already have them available create a new "bigscape" folder inside dataset_1 and copy/move them there.

If BiG-SCAPE files are not available, BiG-SCAPE will be run during the NPLinker loading process and the results stored in the same location (this will only happen once per dataset).
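Gathering the .gbk files can be sketched as a small copy script. The source path below is hypothetical — point it at wherever your antiSMASH output actually lives:

```python
# Sketch: gather antiSMASH .gbk files into an "antismash" subfolder of
# the dataset, as described above. Source folder may be flat or nested.
import shutil
from pathlib import Path

def collect_gbk(source: Path, dataset: Path) -> int:
    """Copy every .gbk file under `source` into dataset/antismash."""
    antismash_dir = dataset / "antismash"
    antismash_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for gbk in source.rglob("*.gbk"):   # rglob walks subfolders too
        shutil.copy2(gbk, antismash_dir / gbk.name)
        count += 1
    return count

# Usage (both paths are examples, substitute your own):
# collect_gbk(Path("/path/to/antismash_output"),
#             Path.home() / "nplinker_shared" / "dataset_1")
```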

At this stage, you should have a dataset_1 folder which looks something like this:

nplinker_shared/
    nplinker.toml
    dataset_1/
        antismash/    (your .gbk files)
        bigscape/     (optional; generated if not supplied)
        ...extracted GNPS job files (spectra, clusterinfo_summary, etc.)

5. Creating strain mappings

The next key step is to create a set of strain mappings so that NPLinker can correctly identify strains across the genomics and metabolomics data. To supply these mappings to the application, begin by creating a file called "strain_mappings.csv" in the dataset_1 folder.

Open the file in a text editor. You should add a single line for each strain in the dataset. The first column of each line should contain the most relevant/useful strain label. Each subsequent column should contain the other labels used to refer to the same strain throughout the dataset (NOTE: the number of columns on each line does NOT need to be consistent).

Here is a trivial example:

strain1,strain1A,strain1.B,strain1_C,strainONE
strain2,strainTWO,strainTWO_
strain3

Taking this example line by line:

  • strain1 is also known as strain1A, strain1.B, strain1_C, and strainONE. NPLinker will therefore treat any instances of the latter 4 labels as equivalent to strain1
  • strain2 is also known as strainTWO and strainTWO_, so again NPLinker will treat these as equivalent
  • strain3 is only known as strain3 in the dataset; it has no other labels
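The rules above can be sketched as a few lines of Python: read each row with the csv module (which handles the variable column counts), take the first column as the canonical label, and map every label on the row back to it. This is illustrative only — NPLinker does this internally when it loads the file:

```python
# Sketch: build an alias -> canonical-strain lookup from the example
# strain_mappings.csv content above. Rows may have different lengths.
import csv, io

csv_text = """\
strain1,strain1A,strain1.B,strain1_C,strainONE
strain2,strainTWO,strainTWO_
strain3
"""

alias_to_strain = {}
for row in csv.reader(io.StringIO(csv_text)):
    canonical = row[0]
    for label in row:              # the canonical label maps to itself too
        alias_to_strain[label] = canonical

print(alias_to_strain["strainONE"])   # strain1
print(alias_to_strain["strain3"])     # strain3
```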

When NPLinker loads your data, it will warn if it encounters any strains that don't appear in the set of mappings you supply. To make it easier to determine if there are missing mappings, NPLinker generates a pair of CSV files called "unknown_strains_met.csv" and "unknown_strains_gen.csv" each time it is executed. These files will be located inside the dataset folder. Each contains a list of the unknown strain labels found in the metabolomics and genomics data.
