🧬 DNA Sequence Alignment - CSCI 570

⚠️ ACADEMIC INTEGRITY NOTICE

This repository contains a refactored and enhanced version of the DNA sequence alignment project. The implementation, code structure, and architecture shown here are significantly different from the original academic submission for CSCI 570.

This refactored codebase was created AFTER course completion and includes:

- Complete architectural redesign with modular components
- Enhanced type safety and professional code organization
- Additional tooling, testing, and documentation
- Improved performance benchmarking and visualization

The original course project submission remains separate and was completed in accordance with all academic integrity policies. This public version is intended for portfolio and learning purposes only.

A comprehensive implementation of DNA sequence alignment algorithms using dynamic programming. This project provides two algorithmic approaches: a standard space-intensive algorithm and a space-optimized variant, along with complete tooling for testing, benchmarking, and visualization.

✨ Features

Two Algorithm Implementations
- Basic: Standard Needleman-Wunsch (O(m×n) space)
- Efficient: Hirschberg's Algorithm (O(min(m,n)) space)
Modular Architecture
- Clean separation of concerns
- Reusable core utilities
- Type-safe with comprehensive type hints
Performance Benchmarking
- Execution time measurement
- Memory usage tracking
- Automated graph generation
Production-Ready Code
- Comprehensive error handling
- Input validation
- Detailed documentation

🚀 Quick Start

Run Basic Algorithm

./basic.sh data/input/in1.txt data/output/output.txt

Run Efficient Algorithm

./efficient.sh data/input/in1.txt data/output/output.txt

Run All Test Cases

./run_batch.sh

📂 Project Structure

CSCI-570_Sequence-alignment/
├── src/                      # Source code
│   ├── main.py              # Single entry point
│   ├── basic.py             # Basic DP algorithm
│   ├── efficient.py         # Space-optimized algorithm
│   ├── alignment_core.py    # Shared utilities
│   ├── string_processor.py  # Input processing
│   ├── io_utils.py          # File I/O operations
│   ├── perf_utils.py        # Performance measurement
│   └── cost_constants.py    # Cost configuration
├── data/
│   ├── input/               # Test input files
│   ├── output/              # Results and graphs
│   └── SampleTestCases/     # Sample test cases
├── test_alignment.py        # Comprehensive unit tests
├── basic.sh                 # Run basic algorithm
├── efficient.sh             # Run efficient algorithm
├── run_batch.sh             # Batch processing
├── setup.sh                 # Environment setup
└── requirements.txt         # Python dependencies

🏗️ Architecture

Module Overview

Core Entry Point

main.py - Single entry point that orchestrates the entire workflow
- CLI argument parsing and validation
- Algorithm selection
- Performance measurement
- Error handling and user feedback

Algorithm Modules

basic.py - Standard Needleman-Wunsch algorithm
- Time: O(m × n)
- Space: O(m × n)
- Stores full DP table for alignment reconstruction
efficient.py - Hirschberg's space-optimized algorithm
- Time: O(m × n)
- Space: O(min(m, n))
- Uses divide-and-conquer with DP

Shared Utilities

alignment_core.py - Core alignment logic shared by both algorithms
- init_dp_table() - Initialize DP table with base cases
- compute_cell_cost() - Calculate cell costs
- backtrack_alignment() - Reconstruct alignment
- validate_sequences() - Input validation
string_processor.py - Input parsing and sequence expansion
- Handles positional duplication rules
- Input validation and error handling
io_utils.py - Input/output operations
- File reading/writing
- JSON results management
- Performance data logging
perf_utils.py - Performance measurement
- Execution time tracking
- Memory usage monitoring
cost_constants.py - Configuration
- Mismatch penalty matrix (ALPHA)
- Gap penalty (DELTA)

Data Flow

Input File
    ↓
readInput() → Parse raw lines
    ↓
processStrings() → Expand sequences
    ↓
validate_sequences() → Validate DNA bases
    ↓
basic() OR efficient() → Compute alignment
    ↓
writeOutput() → Save results
    ↓
Output Files (text + JSON)

Design Principles

Single Responsibility - Each module has one clear purpose
DRY (Don't Repeat Yourself) - Common logic in alignment_core
Type Safety - Type hints throughout for better IDE support
Error Handling - Comprehensive validation and error messages
Modularity - Easy to extend with new algorithms

Usage

Command Line Interface

python src/main.py <mode> <input_file> <output_file>

Arguments:

mode: Algorithm to use (basic or efficient)
input_file: Path to input data file
output_file: Path to store results

Example:

python src/main.py basic data/input/in1.txt data/output/out1.txt

Shell Scripts

Run Single Test

./basic.sh data/input/in1.txt data/output/output.txt
./efficient.sh data/input/in1.txt data/output/output.txt

Run Batch Processing

./run_batch.sh

Processes all test cases and generates performance comparison graphs.

Environment Setup

./setup.sh

Creates virtual environment and installs dependencies.

Input Format

The input file should follow this format:

ACTG           # First sequence
1              # Index to duplicate after (for seq1)
3              # Another index (for seq1)
TACG           # Second sequence
0              # Index to duplicate after (for seq2)

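Each number is an index into the sequence built so far: a copy of that entire sequence is inserted immediately after the given index, so every index line doubles the sequence length.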
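The expansion rule can be sketched as follows. This is a minimal illustration, assuming indices are 0-based and each insertion acts on the string produced by the previous step; the function name `expand` is hypothetical and may not match what string_processor.py actually exposes.

```python
from typing import Iterable


def expand(base: str, indices: Iterable[int]) -> str:
    """Insert a copy of the current string after each index, in order."""
    current = base
    for idx in indices:
        if not 0 <= idx < len(current):
            raise ValueError(f"index {idx} out of range for length {len(current)}")
        # Split just after position idx and splice a full copy in between.
        current = current[: idx + 1] + current + current[idx + 1 :]
    return current


seq1 = expand("ACTG", [1, 3])
seq2 = expand("TACG", [0])
print(seq1)  # ACACACACTGTGTGTG
print(seq2)  # TTACGACG
```

Under this reading, a base string of length L followed by k indices expands to length L × 2^k, which is why a short input file can produce long alignment problems.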

🧮 Algorithms

Basic Algorithm (Needleman-Wunsch)

Approach: Standard dynamic programming with full table storage

Implementation:

Initialize an (m+1)×(n+1) DP table with base cases
Fill table bottom-up using recurrence relation
Backtrack from bottom-right to reconstruct alignment

Complexity:

Time: O(m × n)
Space: O(m × n)

Use Cases:

Small to medium sequences
When full DP table is needed for analysis
Educational purposes

Code Structure:

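from typing import Tuple

from alignment_core import (
    backtrack_alignment,
    compute_cell_cost,
    init_dp_table,
    validate_sequences,
)


def basic(seq1: str, seq2: str) -> Tuple[int, str, str]:
    # Validate inputs
    validate_sequences(seq1, seq2)

    # Sequence lengths; the table has one extra row and column for empty prefixes
    m, n = len(seq1), len(seq2)

    # Initialize full DP table
    dp = init_dp_table(m, n)

    # Fill table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = compute_cell_cost(seq1, seq2, i, j, dp)

    # Backtrack to get alignment
    aligned1, aligned2 = backtrack_alignment(seq1, seq2, dp)

    return dp[m][n], aligned1, aligned2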
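The structure above delegates to helpers in alignment_core whose bodies are not shown here. As a self-contained sketch of the same three steps, the version below inlines them, assuming the usual recurrence dp[i][j] = min(dp[i-1][j-1] + ALPHA, dp[i-1][j] + DELTA, dp[i][j-1] + DELTA) and the `_` gap character seen in the output format. The function name `align_full_table` is hypothetical.

```python
from typing import Dict, List, Tuple

DELTA = 30  # gap penalty from the cost model
ALPHA: Dict[str, Dict[str, int]] = {
    "A": {"A": 0, "C": 110, "G": 48, "T": 94},
    "C": {"A": 110, "C": 0, "G": 118, "T": 48},
    "G": {"A": 48, "C": 118, "G": 0, "T": 110},
    "T": {"A": 94, "C": 48, "G": 110, "T": 0},
}


def align_full_table(seq1: str, seq2: str) -> Tuple[int, str, str]:
    m, n = len(seq1), len(seq2)
    # dp[i][j] = minimum cost of aligning seq1[:i] with seq2[:j]
    dp: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * DELTA  # seq1 prefix against nothing: all gaps
    for j in range(1, n + 1):
        dp[0][j] = j * DELTA  # seq2 prefix against nothing: all gaps

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + ALPHA[seq1[i - 1]][seq2[j - 1]],  # pair the two bases
                dp[i - 1][j] + DELTA,  # gap in seq2
                dp[i][j - 1] + DELTA,  # gap in seq1
            )

    # Walk back from the bottom-right cell, replaying one optimal choice per step.
    out1: List[str] = []
    out2: List[str] = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + ALPHA[seq1[i - 1]][seq2[j - 1]]:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + DELTA:
            out1.append(seq1[i - 1])
            out2.append("_")
            i -= 1
        else:
            out1.append("_")
            out2.append(seq2[j - 1])
            j -= 1
    return dp[m][n], "".join(reversed(out1)), "".join(reversed(out2))


print(align_full_table("A", "G"))  # (48, 'A', 'G')
```

When several optimal alignments exist, the order of the checks in the backtracking loop decides which one is returned; the real backtrack_alignment() may break ties differently.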

Efficient Algorithm (Hirschberg)

Approach: Space-optimized divide-and-conquer with DP

Implementation:

Base case: Use standard DP for small sequences
Divide: Split first sequence at midpoint
Conquer: Compute forward and backward costs
Combine: Find optimal split point, recurse on halves

Complexity:

Time: O(m × n)
Space: O(min(m, n))

Use Cases:

Large sequences where memory is constrained
Production systems with memory limitations
Processing multiple alignments concurrently

Key Functions:

compute_cost_row() - Calculate costs using only 2 rows
align_small_sequences() - Handle base cases
hirschberg_recursive() - Main divide-and-conquer logic
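The divide-and-conquer steps can be sketched in a self-contained form. This is an illustration under assumptions, not the contents of efficient.py: the names `last_row`, `hirschberg`, and `alignment_cost` are hypothetical stand-ins, the base case is a single-character first sequence, and the backward pass is computed by running the same row routine on reversed strings (valid because ALPHA is symmetric).

```python
from typing import Dict, List, Tuple

DELTA = 30
GAP = "_"
ALPHA: Dict[str, Dict[str, int]] = {
    "A": {"A": 0, "C": 110, "G": 48, "T": 94},
    "C": {"A": 110, "C": 0, "G": 118, "T": 48},
    "G": {"A": 48, "C": 118, "G": 0, "T": 110},
    "T": {"A": 94, "C": 48, "G": 110, "T": 0},
}


def last_row(x: str, y: str) -> List[int]:
    """Costs of aligning all of x with every prefix of y, keeping two rows."""
    prev = [j * DELTA for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * DELTA] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = min(
                prev[j - 1] + ALPHA[x[i - 1]][y[j - 1]],
                prev[j] + DELTA,
                curr[j - 1] + DELTA,
            )
        prev = curr
    return prev


def hirschberg(x: str, y: str) -> Tuple[str, str]:
    if not x:
        return GAP * len(y), y
    if not y:
        return x, GAP * len(x)
    if len(x) == 1:
        # Pair x with its cheapest partner in y, unless two gaps are cheaper.
        k = min(range(len(y)), key=lambda j: ALPHA[x][y[j]])
        if ALPHA[x][y[k]] <= 2 * DELTA:
            return GAP * k + x + GAP * (len(y) - k - 1), y
        return x + GAP * len(y), GAP + y

    mid = len(x) // 2
    forward = last_row(x[:mid], y)
    backward = last_row(x[mid:][::-1], y[::-1])
    # Choose the split of y that minimises left cost plus right cost.
    split = min(range(len(y) + 1), key=lambda j: forward[j] + backward[len(y) - j])
    left1, left2 = hirschberg(x[:mid], y[:split])
    right1, right2 = hirschberg(x[mid:], y[split:])
    return left1 + right1, left2 + right2


def alignment_cost(a: str, b: str) -> int:
    """Score an alignment column by column."""
    return sum(DELTA if GAP in (p, q) else ALPHA[p][q] for p, q in zip(a, b))


a, b = hirschberg("AACTG", "ACTTG")
print(a, b, alignment_cost(a, b))
```

In this sketch the rows run along the second sequence, so working memory scales with its length; passing the shorter sequence as `y` is one way to reach the O(min(m, n)) bound.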

Cost Model

Mismatch Penalties (ALPHA matrix):

     A    C    G    T
A    0   110   48   94
C   110   0   118   48
G    48  118   0   110
T    94   48  110   0

Gap Penalty (DELTA): 30

The penalty values reflect biochemical similarity between nucleotide bases:

Lower penalties (48) for substitutions within the same chemical class: A-G (both purines) and C-T (both pyrimidines)
Higher penalties (94 to 118) for substitutions between a purine and a pyrimidine
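One consequence of these numbers is worth working through: pairing two bases only pays off when the mismatch costs less than two gaps (2 × DELTA = 60). The sketch below builds the matrix as a nested dict (an assumed representation; cost_constants.py may store it differently) and lists the mismatches that pass that test.

```python
BASES = "ACGT"
ROWS = [
    [0, 110, 48, 94],
    [110, 0, 118, 48],
    [48, 118, 0, 110],
    [94, 48, 110, 0],
]
# ALPHA[a][b] is the cost of pairing base a with base b.
ALPHA = {a: dict(zip(BASES, row)) for a, row in zip(BASES, ROWS)}
DELTA = 30

# The matrix is symmetric and matches cost nothing.
is_symmetric = all(ALPHA[a][b] == ALPHA[b][a] for a in BASES for b in BASES)
zero_diagonal = all(ALPHA[a][a] == 0 for a in BASES)

# Unordered mismatch pairs that are cheaper than opening two gaps.
worthwhile = sorted(
    {"".join(sorted((a, b))) for a in BASES for b in BASES
     if a != b and ALPHA[a][b] < 2 * DELTA}
)
print(is_symmetric, zero_diagonal)  # True True
print(worthwhile)  # ['AG', 'CT']
```

With the default values, then, an optimal alignment only ever pairs unequal bases as A-G or C-T; every other mismatch is cheaper to express as two gaps. Changing ALPHA or DELTA changes that set.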

👨‍💻 Development

Code Quality Standards

Type Hints: All functions have complete type annotations
Docstrings: Google-style documentation for all modules and functions
Error Handling: Comprehensive validation with informative error messages
Testing: Unit tests for core functions live in test_alignment.py

Adding a New Algorithm

Create a new file in src/ (e.g., new_algorithm.py)
Import from alignment_core for shared utilities
Implement with signature: (seq1: str, seq2: str) -> Tuple[int, str, str]
Update main.py to include the algorithm:

from new_algorithm import new_algorithm

def select_algorithm(mode: str):
    algorithms = {
        "basic": basic,
        "efficient": efficient,
        "new": new_algorithm  # Add here
    }
    return algorithms[mode]

Modifying Cost Parameters

Edit src/cost_constants.py:

# Adjust mismatch penalties
ALPHA['A']['G'] = 50

# Change gap penalty
DELTA = 25

Project Setup

# Make scripts executable
chmod +x *.sh

# Set up environment
./setup.sh

# Smoke-test the basic algorithm on a sample case
python src/main.py basic data/SampleTestCases/input1.txt /tmp/test.txt

📊 Output Format

Text Output File

Each run produces a text file with:

114                    # Minimum alignment cost
AACT_G                # Aligned sequence 1
A_CTTG                # Aligned sequence 2
12.345                # Execution time (ms)
2048                  # Memory usage (KB)
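The last two lines of the output could be gathered by a small wrapper around the alignment call. The sketch below is an assumption-laden stand-in: it uses `time.perf_counter` and `tracemalloc` from the standard library, whereas the project lists psutil for memory monitoring, so perf_utils.py most likely measures memory differently. The name `measure` is hypothetical.

```python
import time
import tracemalloc
from typing import Any, Callable, Tuple


def measure(func: Callable[..., Any], *args: Any) -> Tuple[Any, float, float]:
    """Run func(*args) and return (result, elapsed_ms, peak_kb)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args)
    finally:
        # Always stop tracing, even if func raises.
        elapsed_ms = (time.perf_counter() - start) * 1000
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed_ms, peak_bytes / 1024


result, elapsed_ms, peak_kb = measure(sorted, [3, 1, 2])
print(result, f"{elapsed_ms:.3f} ms", f"{peak_kb:.1f} KB")
```

`tracemalloc` reports peak Python-level allocations during the call, while a psutil-based reading reports memory for the whole process, so the two are not directly comparable.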

JSON Performance Log

Results are also logged to data/output/result.json:

{
	"10": {
		"basic": {
			"run": 10,
			"cost": 114,
			"time": 12.345,
			"memory": 2048
		},
		"efficient": {
			"run": 10,
			"cost": 114,
			"time": 15.678,
			"memory": 1024
		}
	}
}
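A log with this shape can be maintained by merging each run into the existing file. The following is a minimal sketch, assuming the top-level key is the run identifier as a string and the second-level key is the mode; `log_result` is a hypothetical name, not necessarily the function in io_utils.py.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def log_result(path: Path, run: int, mode: str, cost: int,
               time_ms: float, memory_kb: float) -> Dict[str, Any]:
    """Merge one measurement into the JSON log and return the full log."""
    text = path.read_text() if path.exists() else ""
    data: Dict[str, Any] = json.loads(text) if text.strip() else {}
    # Keep entries from other modes of the same run; overwrite only this mode.
    data.setdefault(str(run), {})[mode] = {
        "run": run,
        "cost": cost,
        "time": time_ms,
        "memory": memory_kb,
    }
    path.write_text(json.dumps(data, indent="\t"))
    return data


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "result.json"
    log_result(log_path, 10, "basic", 114, 12.345, 2048)
    merged = log_result(log_path, 10, "efficient", 114, 15.678, 1024)

print(sorted(merged["10"]))  # ['basic', 'efficient']
```

Keying by run and then by mode keeps the basic and efficient measurements for the same input side by side, which is convenient when plotting the comparison graphs.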

Performance Graphs

Running batch tests generates comparison graphs:

data/output/time_comparison.png - Runtime comparison
data/output/memory_comparison.png - Memory usage comparison

🧪 Testing

Run Unit Tests

The project includes comprehensive unit tests covering all modules:

# Run all unit tests
python test_alignment.py

# Run with verbose output
python test_alignment.py -v

Test Coverage:

✅ Basic alignment algorithm
✅ Efficient alignment algorithm
✅ Core alignment utilities
✅ String processing
✅ Input/output operations
✅ Cost constants validation
✅ Algorithm consistency checks
✅ Integration tests

Verify Installation

# Test basic algorithm
python src/main.py basic data/SampleTestCases/input1.txt /tmp/test_basic.txt

# Test efficient algorithm
python src/main.py efficient data/SampleTestCases/input1.txt /tmp/test_efficient.txt

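# Verify both produce the same alignment cost (line 1 of each file; time and memory lines will differ)
[ "$(head -1 /tmp/test_basic.txt)" = "$(head -1 /tmp/test_efficient.txt)" ] && echo "costs match"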

Run All Test Cases

./run_batch.sh

This will process all input files and generate performance comparison graphs.

📦 Dependencies

psutil>=5.9.0      # Memory monitoring
matplotlib>=3.7.0  # Graph generation

Install with:

pip install -r requirements.txt

🎓 Algorithm Comparison

Aspect            Basic                   Efficient
Algorithm         Needleman-Wunsch        Hirschberg
Time Complexity   O(m×n)                  O(m×n)
Space Complexity  O(m×n)                  O(min(m,n))
Memory Usage      Higher                  Lower
Implementation    Simpler                 More complex
Best For          Small/medium sequences  Large sequences
DP Table          Full table stored       Only 2 rows at a time
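To make the space rows concrete, the arithmetic below counts DP cells only. It is a rough sketch: it ignores Python object overhead, recursion stack, and the output strings, and the two-row figure assumes the rows run along the shorter sequence. Both function names are hypothetical.

```python
def full_table_cells(m: int, n: int) -> int:
    """Cells in the (m+1) x (n+1) table kept by the basic algorithm."""
    return (m + 1) * (n + 1)


def two_row_cells(m: int, n: int) -> int:
    """Cells in two rows laid along the shorter sequence."""
    return 2 * (min(m, n) + 1)


m = n = 1000
print(full_table_cells(m, n))  # 1002001
print(two_row_cells(m, n))     # 2002
```

For two sequences of 1000 bases, that is roughly a 500-fold difference in live table cells, while the number of cell updates stays proportional to m × n in both cases.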

📝 Notes

Input files must follow the specified format
Sequences should contain only valid DNA bases (A, C, G, T)
Both algorithms produce the same minimum alignment cost; the aligned strings themselves can differ when more than one optimal alignment exists
The efficient algorithm trades some code complexity for significant memory savings
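The base-validity rule above can be sketched as a small check. The function name `validate_sequences` matches the one listed under alignment_core, but this body is an assumption: it accepts uppercase A, C, G, T only and raises `ValueError` on the first offending character.

```python
VALID_BASES = frozenset("ACGT")


def validate_sequences(*sequences: str) -> None:
    """Raise ValueError if any sequence contains a character other than A, C, G, T."""
    for number, seq in enumerate(sequences, start=1):
        for position, base in enumerate(seq):
            if base not in VALID_BASES:
                raise ValueError(
                    f"sequence {number}: invalid base {base!r} at position {position}"
                )


validate_sequences("ACGT", "TTAG")  # passes silently

try:
    validate_sequences("ACGT", "ACGU")
except ValueError as exc:
    print(exc)  # sequence 2: invalid base 'U' at position 3
```

Reporting the sequence number and position keeps the error actionable for long expanded strings, where spotting a single stray character by eye is impractical.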

Error Handling

The program handles various error conditions:

Invalid arguments: Clear usage instructions
Missing files: Helpful file-not-found messages
Invalid sequences: Validation with specific error messages
Malformed input: Detailed parsing errors
I/O errors: Graceful handling with user feedback

Example:

$ python src/main.py basic nonexistent.txt output.txt
Error: Input file not found: nonexistent.txt

$ python src/main.py invalid data/input/in1.txt output.txt
Error: Unknown mode 'invalid'
Valid modes are: basic, efficient
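Messages like these could come from a small argument check that runs before any alignment work. The sketch below is hypothetical (`check_args` and `VALID_MODES` are illustrative names, and main.py may use argparse or a different ordering of checks); it returns the message text rather than printing it, which keeps it easy to test.

```python
import os
from typing import List, Optional

VALID_MODES = ("basic", "efficient")


def check_args(argv: List[str]) -> Optional[str]:
    """Return an error message for unusable arguments, or None if they look fine."""
    if len(argv) != 3:
        return "Usage: python src/main.py <mode> <input_file> <output_file>"
    mode, input_file, _output_file = argv
    if mode not in VALID_MODES:
        return (
            f"Error: Unknown mode '{mode}'\n"
            f"Valid modes are: {', '.join(VALID_MODES)}"
        )
    if not os.path.isfile(input_file):
        return f"Error: Input file not found: {input_file}"
    return None


print(check_args(["invalid", "data/input/in1.txt", "output.txt"]))
# Error: Unknown mode 'invalid'
# Valid modes are: basic, efficient
```

A caller would typically print a non-None result to stderr and exit with a non-zero status so that shell scripts such as run_batch.sh can detect the failure.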

🚀 Performance Tips

For large sequences: Use the efficient algorithm to save memory
Batch processing: Use run_batch.sh for multiple files
Monitoring: Check JSON logs for performance trends
Optimization: Adjust cost parameters based on your use case

📄 License

This project is part of CSCI 570 coursework.

Developed with ❤️ for CSCI 570

adylagad/CSCI-570_Sequence-alignment
