General Descriptions

Haoliang Xue edited this page Apr 2, 2025 · 2 revisions

1. Analysis Workflow

KaMRaT itself sits at the center of the workflow. It is a C++ program that takes a feature count matrix as input and outputs a reduced matrix of informative features, selected according to the information provided in the design file or the FASTA file.

Some notes include:

  • The term "feature" generally means "k-mer feature", but certain KaMRaT modules also accept general features such as genes or transcripts (see section 3.1 Feature count matrix below for details).
  • A k-mer feature count matrix can be generated from FASTQ files with the provided workflow.

[Figure: KaMRaT analysis workflow]

2. Choice of k-mer Length

Currently, KaMRaT only accepts k-mers of at most 32 nt, because each k-mer is encoded in a single uint64 variable (at 2 bits per nucleotide, 64 bits hold at most 32 nucleotides).
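
The length limit can be illustrated with a minimal Python sketch. It assumes a 2-bit code per nucleotide and an A/C/G/T to 0..3 mapping; both the mapping and the `encode_kmer` helper are hypothetical illustrations, not KaMRaT's actual implementation.

```python
# Assumed mapping; KaMRaT's internal nucleotide codes may differ.
NUC_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_PER_NUC = 2
MAX_K = 64 // BITS_PER_NUC  # 32 nucleotides fit in 64 bits


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer that fits in an unsigned 64-bit word."""
    if not 0 < len(kmer) <= MAX_K:
        raise ValueError(f"k must be between 1 and {MAX_K}, got {len(kmer)}")
    code = 0
    for nuc in kmer.upper():
        # Shift the previous nucleotides left and append the new 2-bit code.
        code = (code << BITS_PER_NUC) | NUC_CODE[nuc]
    return code


print(encode_kmer("ACGT"))                 # 27, i.e. 0b00011011
print(encode_kmer("T" * 32) == 2**64 - 1)  # True: all 64 bits are used
```

A 33-mer would need 66 bits under this scheme, which is why the sketch rejects it.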

In addition, we recommend choosing an odd $k$ to avoid confounding a k-mer with its reverse complement in unstranded data. For example, with $k=6$, a 6-mer such as AAATTT is identical to its own reverse complement and therefore carries no strand information.
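
This can be checked directly. The sketch below (helper names are illustrative, not part of KaMRaT) counts the k-mers that equal their own reverse complement. With an odd $k$ there are none, because the middle base would have to be its own complement.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_self_reverse_complement(kmer: str) -> bool:
    return kmer.upper() == reverse_complement(kmer)


def count_self_rc(k: int) -> int:
    """Count k-mers identical to their own reverse complement (brute force)."""
    return sum(
        is_self_reverse_complement("".join(bases))
        for bases in product("ACGT", repeat=k)
    )


print(reverse_complement("AAATTT"))  # AAATTT
print(count_self_rc(6))              # 64
print(count_self_rc(5))              # 0
```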

3. Inputs

3.1 Feature count matrix

KaMRaT takes a feature count matrix as input, which contains features in rows and samples in columns. Features can be:

  • k-mers for any module,
  • general features such as genes or transcripts, for the index, filter and score modules.

The feature counts can be either normalised or non-normalised.

The matrix should be a tab-separated file, gzipped (.tsv.gz) or not. It should have $(P + 1)$ rows and $(N + 1)$ columns, where:

  • $P$ is the number of features;
  • $N$ is the number of samples;
  • the first row is the matrix header and the first column holds the feature strings (k-mers or general feature names).
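
As a rough illustration of this layout, the Python sketch below parses a small tab-separated matrix and checks that every row has the same number of columns as the header. The function names and the toy data are made up for this example; KaMRaT does its own parsing.

```python
import gzip
import io


def open_matrix(path):
    """Open a plain or gzipped TSV in text mode, based on the file extension."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def parse_count_matrix(handle):
    """Return (samples, features, counts) from a tab-separated count matrix."""
    header = next(handle).rstrip("\r\n").split("\t")
    samples = header[1:]  # the first header cell labels the feature column
    features, counts = [], []
    for line_no, line in enumerate(handle, start=2):
        if not line.strip():
            continue  # tolerate blank lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} columns, got {len(fields)}"
            )
        features.append(fields[0])
        counts.append([float(x) for x in fields[1:]])  # raw or normalised counts
    return samples, features, counts


# Toy matrix with P = 2 features and N = 3 samples.
demo = io.StringIO(
    "feature\tsample1\tsample2\tsample3\n"
    "ACGTACGTACGTACGTACGTACGTACGTACG\t0\t12\t3\n"
    "TTGACCATTGACCATTGACCATTGACCATTG\t7\t0\t1.5\n"
)
samples, features, counts = parse_count_matrix(demo)
print(len(features) + 1, "rows x", len(samples) + 1, "columns")  # 3 rows x 4 columns
```

For a file on disk, `parse_count_matrix(open_matrix(path))` would apply the same check to a .tsv or .tsv.gz matrix.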

3.2 (Optional) FASTA file

The FASTA file is required by the mask and query modules; it provides the sequences to mask or query.

3.3 (Optional) Design file

The design file is required by the filter and score modules.

The design file has two columns:

  • the first column gives the sample names (i.e., the column names of the input count matrix);
  • the second column gives the value associated with each sample, used for filtering or scoring.

In KaMRaT filter, the second column can be either "UP" or "DOWN", indicating whether the sample should be considered as up-regulated or down-regulated for filtering.

In KaMRaT score, the second column can be:

  • a string indicating the sample's condition, for the classification methods;
  • a real value for correlation evaluation.

Please note that the design file should NOT contain any header row.
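
A design file can be sanity-checked before running KaMRaT. The sketch below is only an illustration under stated assumptions: it treats the two columns as whitespace-separated (the delimiter is not specified here), and all helper names are hypothetical.

```python
import io


def parse_design(handle):
    """Return {sample: value} from a header-less, two-column design file."""
    design = {}
    for line_no, line in enumerate(handle, start=1):
        fields = line.split()  # assumption: columns separated by whitespace
        if not fields:
            continue
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        sample, value = fields
        design[sample] = value
    return design


def unknown_samples(design, matrix_samples):
    """Design samples that do not appear in the count matrix header."""
    return sorted(set(design) - set(matrix_samples))


def invalid_filter_values(design):
    """Samples whose value is neither UP nor DOWN (relevant to filter only)."""
    return sorted(s for s, v in design.items() if v not in {"UP", "DOWN"})


design = parse_design(io.StringIO("sample1\tUP\nsample2\tDOWN\nsample3\tUP\n"))
print(unknown_samples(design, ["sample1", "sample2"]))  # ['sample3']
print(invalid_filter_values(design))                    # []
```

A stray header row such as `sample condition` would show up in both reports, which makes the no-header rule easy to verify.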

4. Outputs

4.1 KaMRaT index

For the index module, three index files are written to the folder indicated by the -idxdir argument.

4.2 Other modules

For other modules, three types of output format are available, as selected by the -outfmt argument.

  • By default, a final output count matrix is generated in tab-separated value format, with features as rows and columns in the same order as in the input.
  • For the filter, merge and score modules, -outfmt fa generates a FASTA file containing the informative sequences.
  • For the filter, mask, merge and score modules, -outfmt bin generates an intermediate binary file to be taken as input by another module.

In count matrix output, the values are normalised if normalisation was applied in the index module. Values are also rounded to the nearest integer by default; set the -counts argument to obtain decimal values.

5. Latest Release Notes

The current release of KaMRaT is v1.2. Compared to the previous release, v1.1, it introduces several changes:

  • The filter, merge and score modules can now each output a FASTA file containing the selected/merged sequences.
  • The filter, mask, merge, score and query modules now output the count table with values rounded to the nearest integer. Decimal values can be obtained by setting the -counts argument.
  • The index module now accepts a normalisation factor base that is too large or too small: instead of throwing an exception, it emits a warning.
  • The mask module now allows simultaneously indicating sequences to select and suppress.
  • Updated license.

For previous releases, please refer to here.