Import Data Using Python Script

This page demonstrates how to use the import_ODM_data.py script to import data into ODM. Note that you must be a member of the curator group in ODM to import and edit data.

Requirements

Read the full list of requirements here

Optional files

You can optionally also provide:

  • The accession of a template to validate against rather than the default. Use --template <ACCESSION> to specify.
  • The server address if you want to apply the script to a different ODM server. Use --host <HOST> to specify.
  • Any data in the Tabular format (Data Frame) as a TSV, hosted at an HTTPS web address
  • Gene expression data in GCT format, hosted at an HTTPS web address
  • Gene expression or Cell expression data in TSV format, hosted at an HTTPS web address
  • Gene expression metadata in TSV format, hosted at an HTTPS web address
  • Gene variant data in VCF format, hosted at an HTTPS web address
  • Gene variant metadata in TSV format, hosted at an HTTPS web address
  • Flow cytometry data in .facs format, hosted at an HTTPS web address
  • Flow cytometry metadata in TSV format, hosted at an HTTPS web address
  • A cross-reference mapping file, in TSV format, hosted at an HTTPS web address. You can also use --mapping_file_accession instead to specify a previously uploaded mapping file.
  • A libraries file in TSV format, hosted at an HTTPS web address, or the accession of an existing library file
  • A preparations file in TSV format, hosted at an HTTPS web address, or the accession of an existing preparations file
  • A Cell metadata file in TSV format, hosted at an HTTPS web address

Once imported, studies, samples, libraries, preparations, cell metadata, and signal metadata will be queryable and editable from both the User Interface and the APIs, whilst the signal data will only be queryable via the APIs.

Linking using sample source ID

By default, linking is done via the Sample Source ID key, so this needs to be consistent across the above files for linking to occur. You can read about linking core data types here and more details about signal data linking on this page.
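
Because linking depends on matching IDs, it can help to compare the files before importing. The following is a minimal Python sketch and not part of the import script. It assumes that the samples TSV has a column literally named `Sample Source ID` and that the sample columns of a tabular expression TSV use the same IDs as their headers. The function name `unlinked_ids`, its parameters, and the toy data are hypothetical.

```python
import csv
import io


def unlinked_ids(samples_tsv, expression_tsv, n_feature_attributes,
                 id_column="Sample Source ID"):
    """Return (IDs only in the samples file, IDs only in the expression header).

    Assumption: `id_column` is the header of the linking column in the
    samples file; adjust it to match your template.
    """
    sample_rows = csv.DictReader(io.StringIO(samples_tsv), delimiter="\t")
    sample_ids = {row[id_column] for row in sample_rows}

    # The first line of the expression file is the header. The columns after
    # the leading feature columns are expected to be sample IDs.
    header = next(csv.reader(io.StringIO(expression_tsv), delimiter="\t"))
    expression_ids = set(header[n_feature_attributes:])

    return (sorted(sample_ids - expression_ids),
            sorted(expression_ids - sample_ids))


# Toy data for illustration only.
samples = ("Sample Source ID\tSpecies\n"
           "HG00119\tHomo sapiens\n"
           "HG00121\tHomo sapiens\n")
expression = "Text Feature Two\tHG00119\tHG00183\nf2_1\t0.804\t0.591\n"

print(unlinked_ids(samples, expression, n_feature_attributes=1))
# (['HG00121'], ['HG00183'])
```

Two empty lists mean every sample ID appears in both files.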

Getting a Genestack API token

Before you begin, you will need a Genestack API token.

For instructions on how to generate a token, refer to the Quick Start guide.

Script usage

If you are using a Genestack API token, run the script by typing:

odm-import-data --token [token] --host [HOST] --study [URL to study file] --samples [URL to samples file]

Or, if you are using an Access Token, run the script specifying the token and the template accession:

odm-import-data --access-token [access-token] --host [HOST] --study [URL to study file] --samples [URL to samples file] --template [template accession]

Important note: always specify the template accession when you upload a study from a file URL using an Access Token.

Optionally include data files by appending any or all of the following to the above command:

--expression [URL] --expression_metadata [URL]
--variant [URL] --variant_metadata [URL]
--flow_cytometry [URL] --flow_cytometry_metadata [URL]
--mapping_file [URL] --mapping_file_metadata [URL]
--libraries [URL]
--preparations [URL]
--cell [URL]
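
If you assemble the command programmatically, one option is to build the argument list in Python. The sketch below uses only the flags listed above. The helper name `build_import_command` and the example.com URLs are hypothetical placeholders, and it covers only the Genestack API token form of the command.

```python
import shlex

# Optional flags copied from the list above; each takes a URL.
OPTIONAL_FLAGS = (
    "expression", "expression_metadata",
    "variant", "variant_metadata",
    "flow_cytometry", "flow_cytometry_metadata",
    "mapping_file", "mapping_file_metadata",
    "libraries", "preparations", "cell",
)


def build_import_command(token, host, study, samples, **optional):
    """Return the odm-import-data argument list for a Genestack API token."""
    unknown = set(optional) - set(OPTIONAL_FLAGS)
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")

    command = ["odm-import-data", "--token", token, "--host", host,
               "--study", study, "--samples", samples]
    # Keep the flag order stable so the generated command is reproducible.
    for name in OPTIONAL_FLAGS:
        if name in optional:
            command += [f"--{name}", optional[name]]
    return command


command = build_import_command(
    token="<TOKEN>",
    host="<HOST>",
    study="https://example.com/study.tsv",      # placeholder URL
    samples="https://example.com/samples.tsv",  # placeholder URL
    expression="https://example.com/expression.gct",
)
# shlex.join quotes each argument for pasting into a POSIX shell.
print(shlex.join(command))
```

To run the import from Python, the list can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string sidesteps shell quoting issues.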

Importing Multiple Tabular Files

  • Test_basic_generic_expression.tsv, a tab-separated file containing tabular expression data with two text features and two numeric features, followed by expression values for four samples.

| Text Feature One | Text Feature Two | Numeric Feature One | Numeric Feature Two | HG00119 | HG00121 | HG00183 | HG00176 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| f1_1 | f2_1 | 1.069 | 2.218 | 0.804 | 0.350 | 0.591 | 7.260 |
| f1_2 | f2_2 | 4.845 | 0.391 | 0.729 | 5.657 | 11.730 | 11.007 |
| f1_3 | f2_3 | 1.427 | 0.147 | 1.588 | 8.145 | 1.480 | 2.718 |
| f1_4 | f2_4 | 4.854 | 3.723 | 0.645 | 4.493 | 0.862 | 1.370 |
| f1_5 | f2_5 | 10.563 | 4.217 | 1.102 | 1.627 | 3.157 | 4.393 |

  • Test_basic_generic_expression_3nfa.tsv, a tab-separated file with three feature attributes (1 text + 2 numeric columns). This format requires setting "numberOfFeatureAttributes": 3 during import. The remaining columns represent sample-level expression values.

| Text Feature Two | Numeric Feature One | Numeric Feature Two | HG00119 | HG00121 | HG00183 | HG00176 |
| --- | --- | --- | --- | --- | --- | --- |
| f2_1 | 1.069 | 2.218 | 0.804 | 0.350 | 0.591 | 7.260 |
| f2_2 | 4.845 | 0.391 | 0.729 | 5.657 | 11.730 | 11.007 |
| f2_3 | 1.427 | 3.147 | 1.588 | 8.145 | 1.480 | 2.718 |
| f2_4 | 4.854 | 3.723 | 0.645 | 4.493 | 0.862 | 1.370 |
| f2_5 | 10.563 | 4.217 | 1.102 | 1.627 | 3.157 | 4.393 |

To import a dataset that has multiple Tabular data files in TSV (tab-separated values) format, you need to specify numberOfFeatureAttributes for each file. The example call below imports a dataset that contains a Study, Samples, and two Tabular datasets.

Each Tabular dataset has a different number of feature attributes, which is set via the -nfa parameter that follows its --expression URL.

odm-import-data \
--token <TOKEN> \
--server <HOST> \
--study https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.study.tsv \
--samples https://bio-test-data.s3.us-east-1.amazonaws.com/odm/user-guide/Test_samples.tsv \
--expression https://bio-test-data.s3.us-east-1.amazonaws.com/odm/user-guide/Test_basic_generic_expression.tsv \
-nfa 4 \
-dc "Lipidomics" \
--expression https://bio-test-data.s3.us-east-1.amazonaws.com/odm/user-guide/Test_basic_generic_expression_3nfa.tsv \
-nfa 3

!!! abstract "Data Class Behavior"
    In the example above, we use the -dc parameter to set the data class for one data set, while omitting it for the other.
    If no data class is specified, it will default to "Other".
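
In the two example files, the -nfa value equals the number of leading feature columns before the first sample column. If you already know the sample IDs used as column headers, that count can be derived from the header line. The following is a minimal sketch; the function name `count_feature_attributes` is hypothetical and not part of odm-import-data.

```python
def count_feature_attributes(header_line, sample_ids):
    """Count the leading columns of a TSV header that are not sample IDs."""
    columns = header_line.rstrip("\n").split("\t")
    known = set(sample_ids)
    count = 0
    for name in columns:
        if name in known:
            break  # first sample column reached
        count += 1
    if count == len(columns):
        raise ValueError("no sample column found in header")
    return count


sample_ids = ["HG00119", "HG00121", "HG00183", "HG00176"]
header_4 = "\t".join(["Text Feature One", "Text Feature Two",
                      "Numeric Feature One", "Numeric Feature Two"] + sample_ids)
header_3 = "\t".join(["Text Feature Two", "Numeric Feature One",
                      "Numeric Feature Two"] + sample_ids)

print(count_feature_attributes(header_4, sample_ids))  # 4
print(count_feature_attributes(header_3, sample_ids))  # 3
```

The two results match the -nfa 4 and -nfa 3 values used in the example call.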

Updating data files

To update a data file (e.g. TSV, GCT, VCF file) rather than adding another data file, append the accession of the data file to be updated in square brackets to the URL of the data file import. Existing study and sample accessions must be supplied. See the example below:

--study_accession GSF994039 \
--samples GSF994040 \
--expression http://example.com/expression.gct[GSF994565] \
--expression_metadata http://example.com/expression_metadata.tsv \
--variant http://example.com/variations.vcf[GSF994700] \
--variant_metadata http://example.com/variant_metadata.tsv
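
The bracket convention is simple string composition, which the following Python sketch illustrates. The helper names are hypothetical, and the pattern assumes that accessions look like `GSF` followed by digits, as in the examples on this page.

```python
import re

# Assumption: accessions look like "GSF" followed by digits.
_ACCESSION_SUFFIX = re.compile(r"^(?P<url>.+)\[(?P<accession>GSF\d+)\]$")


def with_accession(url, accession):
    """Append the accession of the file to update, e.g. url[GSF994565]."""
    return f"{url}[{accession}]"


def split_accession(value):
    """Return (url, accession); accession is None when no suffix is present."""
    match = _ACCESSION_SUFFIX.match(value)
    if match is None:
        return value, None
    return match.group("url"), match.group("accession")


target = with_accession("https://example.com/expression.gct", "GSF994565")
print(target)                   # https://example.com/expression.gct[GSF994565]
print(split_accession(target))  # ('https://example.com/expression.gct', 'GSF994565')
```

Some shells (zsh, for example) treat square brackets as glob patterns, so it is safest to quote such arguments when typing the command by hand.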

Example files

The following are some example files to illustrate file formats:

For working with Cell metadata and Cell expression use the following example files:

Run the script with the files above by typing the following, replacing [token] with your token. Note that you may need to escape or quote strings depending on your command-line interface:

odm-import-data \
--token [token] \
--host [HOST] \
--study https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.study.tsv \
--samples https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.samples.tsv \
--expression https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.gct \
--expression_metadata https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.gct.tsv \
--variant https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.vcf \
--variant_metadata https://s3.amazonaws.com/bio-test-data/odm/Test_1000g/Test_1000g.vcf.tsv

Script example (Study → Samples → Cells → Expression)

odm-import-data \
--server <HOST> \
--token <TOKEN> \
--study 's3://bio-test-data/User_guide_test_data/Single_cell_data/study_metadata.tsv' \
--samples 's3://bio-test-data/User_guide_test_data/Single_cell_data/samples.tsv' \
--cells 's3://bio-test-data/User_guide_test_data/Single_cell_data/cells_2_samples_full_match.tsv' \
--expression 's3://bio-test-data/User_guide_test_data/Single_cell_data/expression_2_cells_linked_to_samples.tsv' \
--data-class 'Single-cell transcriptomics' \
--number-of-feature-attributes 1 \
--allow-duplicates