CFASTA is a high-performance C++ library and command-line tool for processing, analyzing, and converting biological sequence data in FASTA and BLAST formats. It includes functionality for binary encoding/decoding, matrix operations, and advanced statistical analyses.
- FASTA file processing: Parse, manipulate, and write FASTA files
- BLAST file processing: Parse, manipulate, and write BLAST output files
- Binary conversion: Encode and decode FASTA/BLAST files to efficient binary format
- Matrix operations: Arithmetic (add, subtract, multiply, divide), matrix combinations, and eigenfunction support
- Statistical analysis: Perform GLLM, dbMEM, rDNA, CCA, and PCA analyses on biological data
- C++17-compatible compiler
- Eigen3 (for matrix operations)
- Boost (for filesystem operations)
- GTest (for unit testing)
mkdir build
cd build
cmake ..
make
# Encode FASTA file to binary
./cfasta_tool encode-fasta input.fasta output.bin
# Decode binary back to FASTA
./cfasta_tool decode-fasta input.bin output.fasta
# Encode BLAST file to binary
./cfasta_tool encode-blast input.blast output.bin
# Decode binary back to BLAST
./cfasta_tool decode-blast input.bin output.blast
# Perform matrix operation (add, subtract, multiply, divide)
./cfasta_tool matrix-op add matrix_a.txt matrix_b.txt result.txt
# Perform PCA analysis
./cfasta_tool pca data.txt pca_results
# Perform CCA analysis
./cfasta_tool cca response.txt predictor.txt cca_results
# Perform dbMEM analysis
./cfasta_tool dbmem distance.txt response.txt dbmem_results
# Fit GLLM model
./cfasta_tool gllm response.txt predictor.txt gaussian 3 gllm_results
# Perform rDNA analysis
./cfasta_tool rdna sequences.txt reference.txt rdna_results.txt
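The on-disk layout written by encode-fasta and read by decode-fasta is not described here. As an illustration of why a binary representation can be more compact than text, the following sketch packs each nucleotide into 2 bits (four bases per byte). The names PackedSequence, packSequence, and unpackSequence are hypothetical and are not part of the CFASTA API. The sketch assumes sequences contain only A, C, G, and T, and it does not claim to match CFASTA's actual format.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical 2-bit packing (A=00, C=01, G=10, T=11).
// Illustration only; this is NOT CFASTA's documented binary format.
struct PackedSequence {
    std::size_t length = 0;           // number of bases stored
    std::vector<std::uint8_t> bytes;  // four bases per byte, first base in the high bits
};

inline std::uint8_t baseToCode(char base) {
    switch (base) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default:
            // Ambiguity codes such as N need a wider encoding than 2 bits.
            throw std::invalid_argument("unsupported base");
    }
}

inline PackedSequence packSequence(const std::string& seq) {
    PackedSequence packed;
    packed.length = seq.size();
    packed.bytes.assign((seq.size() + 3) / 4, 0);  // round up to whole bytes
    for (std::size_t i = 0; i < seq.size(); ++i) {
        const unsigned shift = static_cast<unsigned>(6 - 2 * (i % 4));
        packed.bytes[i / 4] |= static_cast<std::uint8_t>(baseToCode(seq[i]) << shift);
    }
    return packed;
}

inline std::string unpackSequence(const PackedSequence& packed) {
    static const char kBases[] = {'A', 'C', 'G', 'T'};
    std::string seq(packed.length, 'A');
    for (std::size_t i = 0; i < packed.length; ++i) {
        const unsigned shift = static_cast<unsigned>(6 - 2 * (i % 4));
        seq[i] = kBases[(packed.bytes[i / 4] >> shift) & 0x3];
    }
    return seq;
}
```

With this scheme, unpackSequence(packSequence(s)) returns s in upper case. The stored length is what lets the decoder ignore padding bits in the final byte.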
#include <Eigen/Dense>

#include "fasta_processor.h"
#include "blast_processor.h"
#include "binary_converter.h"
#include "matrix_operations.h"
#include "statistical_analysis.h"

int main() {
    // Parse FASTA file
    cfasta::FastaProcessor fastaProcessor;
    auto entries = fastaProcessor.parseFile("input.fasta");

    // Encode the parsed entries to binary
    auto binaryData = fastaProcessor.encodeToBinary(entries);

    // Load a matrix from a text file
    cfasta::MatrixOperations matrixOps;
    Eigen::MatrixXd matrix = matrixOps.loadMatrix("matrix.txt");

    // Perform PCA on the loaded matrix
    cfasta::StatisticalAnalysis statsAnalysis;
    auto pcaResult = statsAnalysis.performPCA(matrix);

    return 0;
}
Standard FASTA format with sequence headers starting with '>' character:
>Sequence1 description
ATGCATGCATGCATGCATGCATGCATGC
>Sequence2 description
GTACGTACGTACGTACGTACGTACGTAC
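To show how the format above maps onto records, here is a minimal, standalone parsing sketch. It is independent of cfasta::FastaProcessor, whose internals are not shown here. FastaRecord, parseFasta, and parseFastaString are hypothetical names. The sketch assumes the first whitespace-delimited token after '>' is the identifier and that sequence data may span several lines.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type for illustration; not CFASTA's own entry type.
struct FastaRecord {
    std::string id;           // first token after '>'
    std::string description;  // remainder of the header line, may be empty
    std::string sequence;     // concatenation of all sequence lines
};

inline std::vector<FastaRecord> parseFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty()) continue;
        if (line[0] == '>') {
            // Header line: split into identifier and free-text description.
            FastaRecord rec;
            const std::size_t space = line.find_first_of(" \t");
            if (space == std::string::npos) {
                rec.id = line.substr(1);
            } else {
                rec.id = line.substr(1, space - 1);
                rec.description = line.substr(space + 1);
            }
            records.push_back(std::move(rec));
        } else if (!records.empty()) {
            // Sequence line: append to the most recent record.
            records.back().sequence += line;
        }
        // Sequence data before the first header is silently ignored.
    }
    return records;
}

// Convenience wrapper for parsing in-memory text.
inline std::vector<FastaRecord> parseFastaString(const std::string& text) {
    std::istringstream in(text);
    return parseFasta(in);
}
```

Applied to the two-record example above, this yields one record with id "Sequence1" and one with id "Sequence2", each with description "description".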
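Standard BLAST tabular output format (-outfmt 6). The '#' comment lines in the example are emitted only by the commented variant (-outfmt 7); the data rows are the same in both, and fields are tab-separated in real BLAST output: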
# BLASTN 2.13.0+
# Database: nr
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
#
Query1 Subject1 98.5 100 1 0 1 100 50 150 1e-50 200.5
Query1 Subject2 90.0 80 7 1 5 85 10 90 1e-30 150.2
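The twelve default columns listed in the Fields comment above can be read into a plain struct. The following is a sketch only. BlastHit and parseBlastLine are hypothetical names rather than the interface of blast_processor.h, and the code assumes identifiers contain no whitespace so that fields can be split on any run of tabs or spaces.

```cpp
#include <sstream>
#include <string>

// Hypothetical struct mirroring the 12 default tabular columns.
struct BlastHit {
    std::string queryId;
    std::string subjectId;
    double percentIdentity = 0.0;
    int alignmentLength = 0;
    int mismatches = 0;
    int gapOpens = 0;
    int queryStart = 0;
    int queryEnd = 0;
    int subjectStart = 0;
    int subjectEnd = 0;
    double evalue = 0.0;
    double bitScore = 0.0;
};

// Parses one line. Returns false for blank lines, '#' comment lines, and
// lines with missing or malformed fields; `hit` is only modified on success.
inline bool parseBlastLine(const std::string& line, BlastHit& hit) {
    if (line.empty() || line[0] == '#') return false;
    std::istringstream fields(line);
    BlastHit parsed;
    if (!(fields >> parsed.queryId >> parsed.subjectId >> parsed.percentIdentity
                 >> parsed.alignmentLength >> parsed.mismatches >> parsed.gapOpens
                 >> parsed.queryStart >> parsed.queryEnd
                 >> parsed.subjectStart >> parsed.subjectEnd
                 >> parsed.evalue >> parsed.bitScore)) {
        return false;
    }
    hit = parsed;
    return true;
}
```

Returning false for comment lines lets the same loop consume both plain and commented tabular output without a separate filtering pass.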
- include/: Header files
- src/: Source files
- test/: Unit tests
- lib/: External libraries
- build/: Build artifacts (not tracked in git)
- PCA (principal component analysis): Reduces data dimensionality while retaining most of the variation.
- CCA (canonical correspondence analysis): Multivariate method to elucidate relationships between biological assemblages and their environment.
- dbMEM (distance-based Moran's eigenvector maps): Spatial eigenfunction analysis method for modeling spatial structure.
- GLLM: Extends GLMs with latent variables to model correlations in multivariate responses.
- rDNA: Specialized analysis for ribosomal DNA sequences.
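To make the PCA description above concrete, the sketch below estimates the first principal component of a small data set using power iteration on the sample covariance matrix. This is an illustrative technique written against the standard library only. It is not how performPCA is implemented, and a real implementation would more likely use an Eigen eigen-solver. Matrix, PrincipalComponent, covarianceMatrix, and firstPrincipalComponent are hypothetical names. Rows are assumed to be observations, columns variables, and all rows the same length.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // rows = observations

struct PrincipalComponent {
    std::vector<double> direction;  // unit-length loading vector
    double variance;                // variance explained along that direction
};

// Sample covariance matrix (denominator n - 1) of the column variables.
inline Matrix covarianceMatrix(const Matrix& data) {
    const std::size_t n = data.size();
    if (n < 2 || data[0].empty()) {
        throw std::invalid_argument("need at least two non-empty observations");
    }
    const std::size_t p = data[0].size();
    std::vector<double> mean(p, 0.0);
    for (const auto& row : data)
        for (std::size_t j = 0; j < p; ++j) mean[j] += row[j] / static_cast<double>(n);
    Matrix cov(p, std::vector<double>(p, 0.0));
    for (const auto& row : data)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k)
                cov[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) /
                             static_cast<double>(n - 1);
    return cov;
}

inline double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

inline std::vector<double> multiply(const Matrix& m, const std::vector<double>& v) {
    std::vector<double> out(m.size(), 0.0);
    for (std::size_t j = 0; j < m.size(); ++j) out[j] = dot(m[j], v);
    return out;
}

// Power iteration: repeatedly apply the covariance matrix and renormalize.
// Caveat: convergence fails if the start vector is exactly orthogonal to the
// leading eigenvector, and is slow when the top two eigenvalues are close.
inline PrincipalComponent firstPrincipalComponent(const Matrix& data,
                                                  int iterations = 200) {
    const Matrix cov = covarianceMatrix(data);
    const std::size_t p = cov.size();
    std::vector<double> v(p);
    for (std::size_t j = 0; j < p; ++j) v[j] = 1.0 + static_cast<double>(j);
    const double startNorm = std::sqrt(dot(v, v));
    for (double& x : v) x /= startNorm;
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> w = multiply(cov, v);
        const double norm = std::sqrt(dot(w, w));
        if (norm == 0.0) break;  // zero covariance along v: nothing to refine
        for (std::size_t j = 0; j < p; ++j) v[j] = w[j] / norm;
    }
    // Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    return {v, dot(v, multiply(cov, v))};
}
```

Projecting each mean-centered observation onto the returned direction gives its score on the first component; further components would require deflating the covariance matrix or switching to a full eigendecomposition.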