Supported Data Formats

!!! info "About this guide" This guide provides a basic overview of file formats and data supported in the ODM. For a detailed description and instructions on using various data formats and working with them (sorting, filtering, sampling), visit the Supported Data Formats page in our Advanced User Guide.

  • :octicons-table-16: TSV (Tabular data)


    Upload and manage tabular data (TSV files) seamlessly within the ODM. Work with Samples, Libraries, Preparations, Expression data, and more.

  • :fontawesome-solid-signal:{ .lg .middle } GCT (Gene Expression)


    Upload and work with GCT (Gene Cluster Text) files in the Open Data Manager (ODM). Optimize the analysis of matrix-compatible datasets.

  • :material-dna:{ .lg .middle } VCF (Variants)


    Upload and work with VCF (Variant Call Format) files to search, filter, retrieve, and analyze genetic variants in the ODM.

  • :fontawesome-solid-disease:{ .lg .middle } HDF5 (e.g. Single Cell)


    Upload and store HDF5 (Hierarchical Data Format 5) files as attachments in the ODM. Future releases will enable seamless analysis of Single Cell data stored within TSV and HDF5 formats.

  • :fontawesome-solid-wave-square:{ .lg .middle } FACS (Flow Cytometry)


    Upload and work with FACS (Fluorescence-Activated Cell Sorting) files in the Open Data Manager (ODM) to efficiently analyze Flow Cytometry data.

  • :fontawesome-solid-file-lines:{ .lg .middle } Attached Files


    Upload and organize a diverse range of attached non-indexed files in the ODM. Manage your entire data catalog, including access and collaboration, across all file types.

TSV (Tabular data)

In ODM, you can upload any tabular data formatted as TSV (tab-separated values). As long as your file represents a data frame, ODM can import and index it. A data frame is a data structure that organizes data into a two-dimensional table of rows and columns, similar to a spreadsheet.

A data frame contains two main elements:

  • Features: These are the entities measured in an experiment (e.g., genes, proteins, metabolites, pathways, sales regions, etc.).
  • Measurements (or values): These are the actual values recorded for each feature under different conditions (e.g., gene expression values, protein abundance, pathway activity, sales volume, etc.).

Simple Data Frame

The example below demonstrates the simplest and most common type of data frame.

Data Frame Simple

Here, the features (genes) are listed in the first column, while the rest of the table contains measurements of gene expression across multiple samples. Each column represents a different sample, with the column name indicating the corresponding dataset of gene expression values.
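
As a minimal sketch of this layout, the Python snippet below writes a small tab-separated data frame and reads it back using only the standard library. The gene identifiers, sample names, and values are made up for illustration, and nothing here calls the ODM itself.

```python
import csv
import io

# Hypothetical example data: features (genes) in the first column,
# one column of measurements per sample.
header = ["gene", "sample_1", "sample_2", "sample_3"]
rows = [
    ["ENSG00000141510", 12.5, 8.1, 10.0],
    ["ENSG00000012048", 3.2, 4.7, 2.9],
]

# Write the data frame as TSV text.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(header)
writer.writerows(rows)
tsv_text = buffer.getvalue()

# Read it back: map each feature to its per-sample measurements.
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
feature_column, *sample_names = next(reader)
measurements = {
    feature: dict(zip(sample_names, map(float, values)))
    for feature, *values in reader
}

print(measurements["ENSG00000141510"]["sample_2"])  # prints 8.1
```

The first header cell names the feature column, and every other header cell names a sample, which mirrors the description above.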

Complex Data Frame

Data Frame Complex

This format accommodates a wide range of data types that can be uploaded and indexed in ODM.

For a detailed description and instructions on using TSV, visit the Supported Data Formats page in our Advanced User Guide.

GCT (Gene Expression)

The ODM supports GCT (Gene Cluster Text) files, which are commonly used for storing gene expression datasets, such as microarray and RNA-seq data. These files provide a structured, tab-delimited format for organizing expression values across different samples. ODM automatically recognizes GCT files, enabling integration and analysis.

Supported GCT Formats

ODM accepts the following GCT file formats:

  • .gct – Standard GCT file
  • .gct.gz, .gct.zip – Compressed versions of the standard GCT file (available via API)
  • .gct.tsv, .gct.tsv.gz, .gct.tsv.zip – GCT files with additional expression metadata (available via API)

Structure of a GCT File

A GCT file consists of a structured matrix with gene expression values. The key components include:

  1. File Version: The first line always contains the file version, which is #1.2 for the GCT format.
  2. Matrix Dimensions: The second line specifies the number of genes (rows) and the number of samples (columns), excluding metadata columns.
  3. Header Row: The third line contains column labels:
    • Name (gene identifier, case insensitive)
    • Description (text description of the gene, case insensitive)
    • Sample identifiers (unique, single-word names without spaces)
  4. Data Matrix:
    • Each row corresponds to a gene, with its identifier and description in the first two columns.
    • The remaining columns contain expression values for each sample.
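
The structure above can be illustrated with a small parser. This is only a sketch in plain Python, not the ODM's own importer: the function name `parse_gct`, the example genes, and the sample names are hypothetical, and the checks cover just the four components listed above.

```python
def parse_gct(text):
    """Parse GCT text into (sample_ids, {gene: (description, [values])})."""
    lines = text.strip().splitlines()

    # 1. File version
    if lines[0].strip() != "#1.2":
        raise ValueError("first line must be the version string #1.2")

    # 2. Matrix dimensions: number of genes, number of samples
    n_genes, n_samples = (int(x) for x in lines[1].split("\t")[:2])

    # 3. Header row: Name, Description (case insensitive), then sample IDs
    header = lines[2].split("\t")
    if [h.lower() for h in header[:2]] != ["name", "description"]:
        raise ValueError("header must start with Name and Description")
    sample_ids = header[2:]
    if len(sample_ids) != n_samples:
        raise ValueError("sample count does not match line 2")

    # 4. Data matrix: identifier, description, then one value per sample
    matrix = {}
    for line in lines[3:]:
        name, description, *values = line.split("\t")
        matrix[name] = (description, [float(v) for v in values])
    if len(matrix) != n_genes:
        raise ValueError("gene count does not match line 2")
    return sample_ids, matrix


# Hypothetical two-gene, three-sample file.
example = "\n".join([
    "#1.2",
    "2\t3",
    "Name\tDescription\tS1\tS2\tS3",
    "TP53\ttumor protein p53\t12.5\t8.1\t10.0",
    "BRCA1\tBRCA1 DNA repair associated\t3.2\t4.7\t2.9",
])
sample_ids, matrix = parse_gct(example)
print(sample_ids, matrix["TP53"][1])
```

Checking the declared dimensions against the actual rows and columns is a cheap way to catch a truncated or hand-edited file before upload.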

For a detailed description and instructions on using GCT, visit the Supported Data Formats page in our Advanced User Guide.

VCF (Variants)

The ODM supports VCF (Variant Call Format) files, a widely used format for storing genetic variation data. VCF files provide a structured, tab-delimited representation of genetic variants and are typically generated as output from variant calling pipelines. These files contain detailed information about sequence variations, including single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants.

Supported VCF Formats

ODM accepts the following VCF file formats:

  • .vcf – Standard uncompressed VCF file
  • .vcf.gz, .vcf.zip – Compressed versions of the standard VCF file (available via API)

Structure of a VCF File

A VCF file consists of three main components:

  1. Meta-information lines (##)
  2. Header line (#) – The final metadata line, which defines the column names for variant data:
    • CHROM (chromosome)
    • POS (genomic position)
    • ID (variant identifier)
    • REF (reference allele)
    • ALT (alternate allele(s))
    • QUAL (quality score)
    • FILTER (filter status)
    • INFO (additional annotations)
  3. Data lines – Each row represents a variant, detailing its position, reference and alternate alleles, quality scores, and annotations.
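
To make the three components concrete, here is a minimal sketch in plain Python that separates them. The function name `read_vcf` is hypothetical, the two records are a small illustrative sample, and real pipelines would normally rely on a dedicated VCF library rather than this simplified reader.

```python
def read_vcf(text):
    """Split VCF text into meta lines, column names, and variant records."""
    meta, columns, records = [], None, []
    for line in text.splitlines():
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information line
            meta.append(line[2:])
        elif line.startswith("#"):
            # Header line: column names, without the leading '#'
            columns = line[1:].split("\t")
        else:
            # Data line: one variant per row
            if columns is None:
                raise ValueError("data line found before the header line")
            records.append(dict(zip(columns, line.split("\t"))))
    return meta, columns, records


example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14",
    "20\t17330\t.\tT\tA\t3\tq10\tDP=11",
])
meta, columns, records = read_vcf(example_vcf)
print(columns)
print([(r["CHROM"], r["POS"], r["FILTER"]) for r in records])
```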

Common Fields in VCF Files

VCF files provide essential information for genomic studies, with key fields including:

  • INFO: Contains annotations about the variant, such as depth of coverage (DP), allele frequency (AF), and functional effect (EFF).
  • FILTER: Specifies whether the variant has passed quality control thresholds (PASS or filter conditions like q10 for quality < 10).
  • FORMAT: Defines the structure of genotype-related fields in the sample data.
  • Sample Columns: Contain individual genotype information, including genotype (GT), depth (DP), and phase set (PS).
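
The fields above are plain delimited strings, so they can be unpacked with a few lines of Python. This is a sketch under simplifying assumptions: the helper names are hypothetical, values are kept as strings, and edge cases covered by the VCF specification (multiple filters, multi-valued entries) are ignored.

```python
def parse_info(info):
    """Turn 'DP=14;AF=0.5;DB' into {'DP': '14', 'AF': '0.5', 'DB': True}."""
    result = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        result[key] = value if sep else True  # flag entries carry no value
    return result


def parse_sample(format_field, sample_field):
    """Pair FORMAT keys (e.g. 'GT:DP') with one sample column's values."""
    return dict(zip(format_field.split(":"), sample_field.split(":")))


def passed_filters(filter_field):
    """True only when the FILTER column reads PASS."""
    return filter_field == "PASS"


info = parse_info("DP=14;AF=0.5;DB")
sample = parse_sample("GT:DP", "0|1:8")
print(info, sample, passed_filters("q10"))
```

The FORMAT column acts as a key list for every sample column on the same row, which is why `parse_sample` takes both strings.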

For detailed specifications on the VCF format, refer to the official VCF documentation.

Using VCF Files in ODM

The ODM allows users to:

  • Import and store genomic variants.
  • Integrate variant data with sample metadata for downstream analysis.
  • Search and filter genetic variants.
  • Perform cross-sample comparisons and study variant distributions.

For a detailed description and instructions on ODM capabilities for using VCF, visit the Supported Data Formats page in our Advanced User Guide.

HDF5 (e.g. Single Cell)

!!! info "Limitations" HDF5 is supported as an Attached File in ODM, with the ability to view and search by File Structure (Contents) only. We are working on full functionality for HDF5 data content parsing, search, and filtering.

HDF5 (Hierarchical Data Format version 5) is a widely used data format in genomic research, particularly in Single Cell studies. It is designed to store large, complex datasets efficiently, making it a preferred choice for structured biological data such as gene expression matrices and metadata.

The ODM now supports uploading HDF5 files as Attached Files, searching them by File Structure (Contents), and managing them within Studies.

Supported HDF5 Formats

  • .h5, .h5ad – Standard HDF5 files
  • .h5.gz, .h5ad.zip – Compressed versions of standard HDF5 files

Viewing File Structure (Contents)

  • Access File Contents via GUI: The ODM displays File Contents on the Data Tab of the Metadata Editor. To open them, click the Content button.
  • Retrieve File Contents via API: You can retrieve File Contents for a list of files or for a single file by its unique Genestack Accession.

Searching HDF5 Files by File Contents

Users can search via GUI and API for:

  • Unique File: by Genestack Accession.

  • Files:

    • By File Contents fields/pathways.
    • By Study Genestack Accession.
  • Studies:

    • By File Contents fields/pathways.
    • By File Genestack Accession.

For a detailed description and instructions on using HDF5, visit the Supported Data Formats page in our Advanced User Guide.

FACS (Flow Cytometry)

For a detailed description and instructions on using FACS, visit the Supported Data Formats page in our Advanced User Guide.

Attached Files

For a detailed description and instructions on using Attached Files, visit the Supported Data Formats page in our Advanced User Guide.