AI-Ecology-Lab/C_FASTA


CFASTA - C++ Compiler and Decompiler for FASTA and BLAST files

CFASTA is a high-performance C++ library and command-line tool for processing, analyzing, and converting biological sequence data in FASTA and BLAST formats. It also provides binary encoding and decoding, matrix operations, and statistical analyses (GLLM, dbMEM, rDNA, CCA, and PCA).

Features

  • FASTA file processing: Parse, manipulate, and write FASTA files
  • BLAST file processing: Parse, manipulate, and write BLAST output files
  • Binary conversion: Encode and decode FASTA/BLAST files to efficient binary format
  • Matrix operations: Perform matrix operations such as addition, subtraction, multiplication, and division, with eigenfunction support
  • Statistical analysis: Perform GLLM, dbMEM, rDNA, CCA, and PCA analyses on biological data
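
The README does not document the binary layout that CFASTA uses. As an illustration of why a binary encoding can be compact, the sketch below packs nucleotides at 2 bits per base. It is an assumption-laden example, not CFASTA's actual format: the names `baseToCode`, `packBases`, and `unpackBases` are hypothetical, and only the four unambiguous bases A, C, G, and T are handled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helpers for illustration; not part of the CFASTA API.

// Map a base to its 2-bit code: A=0, C=1, G=2, T=3.
inline std::uint8_t baseToCode(char base) {
    switch (base) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default: throw std::invalid_argument("unsupported base");
    }
}

// Pack four bases per byte, with the first base in the two highest bits.
inline std::vector<std::uint8_t> packBases(const std::string& seq) {
    std::vector<std::uint8_t> packed((seq.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < seq.size(); ++i) {
        const unsigned shift = static_cast<unsigned>(6 - 2 * (i % 4));
        packed[i / 4] |= static_cast<std::uint8_t>(baseToCode(seq[i]) << shift);
    }
    return packed;
}

// Unpack `length` bases. The length has to be stored separately because
// the final byte may contain padding bits.
inline std::string unpackBases(const std::vector<std::uint8_t>& packed,
                               std::size_t length) {
    assert(length <= packed.size() * 4);
    static const char kBases[] = {'A', 'C', 'G', 'T'};
    std::string seq;
    seq.reserve(length);
    for (std::size_t i = 0; i < length; ++i) {
        const unsigned shift = static_cast<unsigned>(6 - 2 * (i % 4));
        seq.push_back(kBases[(packed[i / 4] >> shift) & 0x3]);
    }
    return seq;
}
```

With this scheme a 28-base sequence occupies 7 bytes instead of 28. A real format would also need to store headers, sequence lengths, and ambiguity codes such as N, which this sketch leaves out.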

Dependencies

  • C++17 compatible compiler
  • Eigen3 (for matrix operations)
  • Boost (for filesystem operations)
  • GTest (for unit testing)

Building

mkdir build
cd build
cmake ..
make

Usage

Command-line interface

# Encode FASTA file to binary
./cfasta_tool encode-fasta input.fasta output.bin

# Decode binary back to FASTA
./cfasta_tool decode-fasta input.bin output.fasta

# Encode BLAST file to binary
./cfasta_tool encode-blast input.blast output.bin

# Decode binary back to BLAST
./cfasta_tool decode-blast input.bin output.blast

# Perform matrix operation (add, subtract, multiply, divide)
./cfasta_tool matrix-op add matrix_a.txt matrix_b.txt result.txt

# Perform PCA analysis
./cfasta_tool pca data.txt pca_results

# Perform CCA analysis
./cfasta_tool cca response.txt predictor.txt cca_results

# Perform dbMEM analysis
./cfasta_tool dbmem distance.txt response.txt dbmem_results

# Fit GLLM model
./cfasta_tool gllm response.txt predictor.txt gaussian 3 gllm_results

# Perform rDNA analysis
./cfasta_tool rdna sequences.txt reference.txt rdna_results.txt

Library API

#include <Eigen/Dense>

#include "fasta_processor.h"
#include "blast_processor.h"
#include "binary_converter.h"
#include "matrix_operations.h"
#include "statistical_analysis.h"

int main() {
    // Parse FASTA file
    cfasta::FastaProcessor fastaProcessor;
    auto entries = fastaProcessor.parseFile("input.fasta");

    // Encode the parsed entries to binary
    auto binaryData = fastaProcessor.encodeToBinary(entries);

    // Load a matrix from a text file
    cfasta::MatrixOperations matrixOps;
    Eigen::MatrixXd matrix = matrixOps.loadMatrix("matrix.txt");

    // Perform PCA on the loaded matrix
    cfasta::StatisticalAnalysis statsAnalysis;
    auto pcaResult = statsAnalysis.performPCA(matrix);

    return 0;
}

File Format Support

FASTA

Standard FASTA format, in which each sequence header line starts with a '>' character:

>Sequence1 description
ATGCATGCATGCATGCATGCATGCATGC
>Sequence2 description
GTACGTACGTACGTACGTACGTACGTAC
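
To make the format concrete, here is a minimal sketch of a reader for it, using only the standard library. It is not CFASTA's `FastaProcessor`: the `FastaRecord` struct and the `readFasta` and `parseSample` functions are hypothetical names chosen for this example. Sequences that span several lines are concatenated into one string.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for illustration; not CFASTA's own entry class.
struct FastaRecord {
    std::string header;    // text after the leading '>'
    std::string sequence;  // all sequence lines joined together
};

// Read every record from a stream containing FASTA text.
inline std::vector<FastaRecord> readFasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings.
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back(FastaRecord{line.substr(1), std::string()});
        } else if (!records.empty()) {
            records.back().sequence += line;
        }
        // Sequence lines that appear before the first header are ignored.
    }
    return records;
}

// Usage: parse the two-record sample from an in-memory string.
inline std::vector<FastaRecord> parseSample() {
    std::istringstream in(
        ">Sequence1 description\nATGCATGCATGCATGCATGCATGCATGC\n"
        ">Sequence2 description\nGTACGTACGTACGTACGTACGTACGTAC\n");
    std::vector<FastaRecord> records = readFasta(in);
    assert(records.size() == 2);
    return records;
}
```

Taking a `std::istream&` rather than a file name lets the same function read from files, strings, or standard input.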

BLAST

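Standard BLAST tabular output format. Data rows use the twelve default columns of -outfmt 6; the '#' comment lines shown in the example are the kind written by -outfmt 7 (tabular with comment lines):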

# BLASTN 2.13.0+
# Database: nr
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
#
Query1  Subject1  98.5  100  1  0  1  100  50  150  1e-50  200.5
Query1  Subject2  90.0  80   7  1  5  85   10  90   1e-30  150.2
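
As a rough illustration of how one data row maps onto the twelve fields listed above, the sketch below parses a single line into a struct. It is not CFASTA's `BlastProcessor`: `BlastHit`, `parseBlastLine`, and `firstSampleHit` are hypothetical names. Fields are split on any whitespace, so both tab-separated output and the space-aligned sample are accepted, on the assumption that sequence identifiers contain no spaces.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical struct mirroring the twelve default tabular columns.
struct BlastHit {
    std::string queryId;
    std::string subjectId;
    double percentIdentity = 0.0;
    int alignmentLength = 0;
    int mismatches = 0;
    int gapOpens = 0;
    int queryStart = 0;
    int queryEnd = 0;
    int subjectStart = 0;
    int subjectEnd = 0;
    double evalue = 0.0;
    double bitScore = 0.0;
};

// Parse one line. Returns false for blank lines, '#' comment lines,
// and lines with missing or malformed fields.
inline bool parseBlastLine(const std::string& line, BlastHit& hit) {
    if (line.empty() || line[0] == '#') return false;
    std::istringstream fields(line);
    fields >> hit.queryId >> hit.subjectId >> hit.percentIdentity
           >> hit.alignmentLength >> hit.mismatches >> hit.gapOpens
           >> hit.queryStart >> hit.queryEnd
           >> hit.subjectStart >> hit.subjectEnd
           >> hit.evalue >> hit.bitScore;
    return !fields.fail();
}

// Usage: parse the first data row of the sample.
inline BlastHit firstSampleHit() {
    BlastHit hit;
    const bool ok = parseBlastLine(
        "Query1  Subject1  98.5  100  1  0  1  100  50  150  1e-50  200.5", hit);
    assert(ok);
    (void)ok;  // avoids an unused-variable warning when NDEBUG disables assert
    return hit;
}
```

Returning a success flag instead of throwing keeps the caller's loop simple: comment lines and malformed rows can be skipped with a single `if`.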

Project Structure

  • include/: Header files
  • src/: Source files
  • test/: Unit tests
  • lib/: External libraries
  • build/: Build artifacts (not tracked in git)

Statistical Analysis Methods

PCA (Principal Component Analysis)

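Projects the data onto orthogonal axes (principal components) ordered by the variance they explain, which reduces dimensionality while retaining most of the variation.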
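
The idea behind PCA can be sketched without any external library: compute the covariance matrix of the data, then find its leading eigenvector, which is the first principal axis. The example below uses power iteration for that step. It is a simplified illustration under stated assumptions, not how CFASTA's `performPCA` works (CFASTA depends on Eigen, and its implementation is not shown in this README). The names `Matrix`, `covariance`, and `firstPrincipalAxis` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rows are observations, columns are variables.
using Matrix = std::vector<std::vector<double>>;

// Sample covariance matrix (divides by n - 1); needs at least two rows.
inline Matrix covariance(const Matrix& data) {
    assert(data.size() >= 2);
    const std::size_t n = data.size();
    const std::size_t p = data[0].size();
    std::vector<double> mean(p, 0.0);
    for (const auto& row : data) {
        assert(row.size() == p);
        for (std::size_t j = 0; j < p; ++j) mean[j] += row[j] / n;
    }
    Matrix cov(p, std::vector<double>(p, 0.0));
    for (const auto& row : data)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k)
                cov[j][k] += (row[j] - mean[j]) * (row[k] - mean[k]) / (n - 1);
    return cov;
}

// Leading eigenvector of a covariance matrix by power iteration.
// Caveat: if the start vector happens to be orthogonal to the leading
// eigenvector, the iteration cannot find it.
inline std::vector<double> firstPrincipalAxis(const Matrix& cov,
                                              int iterations = 200) {
    const std::size_t p = cov.size();
    std::vector<double> v(p);
    for (std::size_t j = 0; j < p; ++j) v[j] = 1.0 / (j + 1);  // asymmetric start
    for (int it = 0; it < iterations; ++it) {
        std::vector<double> next(p, 0.0);
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t k = 0; k < p; ++k) next[j] += cov[j][k] * v[k];
        double norm = 0.0;
        for (double x : next) norm += x * x;
        norm = std::sqrt(norm);
        if (norm == 0.0) break;  // zero product: keep the previous vector
        for (std::size_t j = 0; j < p; ++j) v[j] = next[j] / norm;
    }
    return v;
}
```

For points lying on the line y = x, such as (1,1), (2,2), (3,3), the first principal axis points along that line. A full PCA would compute all eigenvectors and eigenvalues, for example with an eigen-decomposition routine from a linear algebra library, and then project the centered data onto the leading axes.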

CCA (Canonical Correspondence Analysis)

A multivariate ordination method that relates the composition of biological assemblages to their environment.

dbMEM (distance-based Moran's Eigenvector Maps)

A spatial eigenfunction method that derives orthogonal spatial variables from a distance matrix and uses them to model spatial structure.

GLLM (Generalized Linear Latent Models)

Extends GLMs with latent variables to model correlations in multivariate responses.

rDNA Analysis

Specialized analysis for ribosomal DNA sequences.

License

MIT License
