Skip to content

adylagad/CSCI-570_Sequence-alignment

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

11 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿงฌ DNA Sequence Alignment - CSCI 570

Python Code Quality

โš ๏ธ ACADEMIC INTEGRITY NOTICE

This repository contains a refactored and enhanced version of the DNA sequence alignment project. The implementation, code structure, and architecture shown here are significantly different from the original academic submission for CSCI 570.

This refactored codebase was created AFTER course completion and includes:

  • Complete architectural redesign with modular components
  • Enhanced type safety and professional code organization
  • Additional tooling, testing, and documentation
  • Improved performance benchmarking and visualization

The original course project submission remains separate and was completed in accordance with all academic integrity policies. This public version is intended for portfolio and learning purposes only.


A comprehensive implementation of DNA sequence alignment algorithms using dynamic programming. This project provides two algorithmic approaches: a standard space-intensive algorithm and a space-optimized variant, along with complete tooling for testing, benchmarking, and visualization.


๏ฟฝ Table of Contents


โœจ Features

  • Two Algorithm Implementations
    • Basic: Standard Needleman-Wunsch (O(mร—n) space)
    • Efficient: Hirschberg's Algorithm (O(min(m,n)) space)
  • Modular Architecture
    • Clean separation of concerns
    • Reusable core utilities
    • Type-safe with comprehensive type hints
  • Performance Benchmarking
    • Execution time measurement
    • Memory usage tracking
    • Automated graph generation
  • Production-Ready Code
    • Comprehensive error handling
    • Input validation
    • Detailed documentation

๐Ÿš€ Quick Start

Run Basic Algorithm

./basic.sh data/input/in1.txt data/output/output.txt

Run Efficient Algorithm

./efficient.sh data/input/in1.txt data/output/output.txt

Run All Test Cases

./run_batch.sh

๐Ÿ“‚ Project Structure

CSCI-570_Sequence-alignment/
โ”œโ”€โ”€ src/                      # Source code
โ”‚   โ”œโ”€โ”€ main.py              # Single entry point
โ”‚   โ”œโ”€โ”€ basic.py             # Basic DP algorithm
โ”‚   โ”œโ”€โ”€ efficient.py         # Space-optimized algorithm
โ”‚   โ”œโ”€โ”€ alignment_core.py    # Shared utilities
โ”‚   โ”œโ”€โ”€ string_processor.py  # Input processing
โ”‚   โ”œโ”€โ”€ io_utils.py          # File I/O operations
โ”‚   โ”œโ”€โ”€ perf_utils.py        # Performance measurement
โ”‚   โ””โ”€โ”€ cost_constants.py    # Cost configuration
โ”œโ”€โ”€ data/
โ”‚   โ”œโ”€โ”€ input/               # Test input files
โ”‚   โ”œโ”€โ”€ output/              # Results and graphs
โ”‚   โ””โ”€โ”€ SampleTestCases/     # Sample test cases
โ”œโ”€โ”€ test_alignment.py        # Comprehensive unit tests
โ”œโ”€โ”€ basic.sh                 # Run basic algorithm
โ”œโ”€โ”€ efficient.sh             # Run efficient algorithm
โ”œโ”€โ”€ run_batch.sh             # Batch processing
โ”œโ”€โ”€ setup.sh                 # Environment setup
โ””โ”€โ”€ requirements.txt         # Python dependencies

๐Ÿ—๏ธ Architecture

Module Overview

Core Entry Point

  • main.py - Single entry point that orchestrates the entire workflow
    • CLI argument parsing and validation
    • Algorithm selection
    • Performance measurement
    • Error handling and user feedback

Algorithm Modules

  • basic.py - Standard Needleman-Wunsch algorithm

    • Time: O(m ร— n)
    • Space: O(m ร— n)
    • Stores full DP table for alignment reconstruction
  • efficient.py - Hirschberg's space-optimized algorithm

    • Time: O(m ร— n)
    • Space: O(min(m, n))
    • Uses divide-and-conquer with DP

Shared Utilities

  • alignment_core.py - Core alignment logic shared by both algorithms

    • init_dp_table() - Initialize DP table with base cases
    • compute_cell_cost() - Calculate cell costs
    • backtrack_alignment() - Reconstruct alignment
    • validate_sequences() - Input validation
  • string_processor.py - Input parsing and sequence expansion

    • Handles positional duplication rules
    • Input validation and error handling
  • io_utils.py - Input/output operations

    • File reading/writing
    • JSON results management
    • Performance data logging
  • perf_utils.py - Performance measurement

    • Execution time tracking
    • Memory usage monitoring
  • cost_constants.py - Configuration

    • Mismatch penalty matrix (ALPHA)
    • Gap penalty (DELTA)

Data Flow

Input File
    โ†“
readInput() โ†’ Parse raw lines
    โ†“
processStrings() โ†’ Expand sequences
    โ†“
validate_sequences() โ†’ Validate DNA bases
    โ†“
basic() OR efficient() โ†’ Compute alignment
    โ†“
writeOutput() โ†’ Save results
    โ†“
Output Files (text + JSON)

Design Principles

  1. Single Responsibility - Each module has one clear purpose
  2. DRY (Don't Repeat Yourself) - Common logic in alignment_core
  3. Type Safety - Type hints throughout for better IDE support
  4. Error Handling - Comprehensive validation and error messages
  5. Modularity - Easy to extend with new algorithms

๏ฟฝ Usage

Command Line Interface

python src/main.py <mode> <input_file> <output_file>

Arguments:

  • mode: Algorithm to use (basic or efficient)
  • input_file: Path to input data file
  • output_file: Path to store results

Example:

python src/main.py basic data/input/in1.txt data/output/out1.txt

Shell Scripts

Run Single Test

./basic.sh data/input/in1.txt data/output/output.txt
./efficient.sh data/input/in1.txt data/output/output.txt

Run Batch Processing

./run_batch.sh

Processes all test cases and generates performance comparison graphs.

Environment Setup

./setup.sh

Creates virtual environment and installs dependencies.

Input Format

The input file should follow this format:

ACTG           # First sequence
1              # Index to duplicate after (for seq1)
3              # Another index (for seq1)
TACG           # Second sequence
0              # Index to duplicate after (for seq2)

Each number indicates where to insert a copy of the entire sequence.


๐Ÿงฎ Algorithms

Basic Algorithm (Needleman-Wunsch)

Approach: Standard dynamic programming with full table storage

Implementation:

  1. Initialize mร—n DP table with base cases
  2. Fill table bottom-up using recurrence relation
  3. Backtrack from bottom-right to reconstruct alignment

Complexity:

  • Time: O(m ร— n)
  • Space: O(m ร— n)

Use Cases:

  • Small to medium sequences
  • When full DP table is needed for analysis
  • Educational purposes

Code Structure:

def basic(seq1: str, seq2: str) -> Tuple[int, str, str]:
    # Validate inputs
    validate_sequences(seq1, seq2)

    # Initialize full DP table
    dp = init_dp_table(m, n)

    # Fill table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = compute_cell_cost(seq1, seq2, i, j, dp)

    # Backtrack to get alignment
    aligned1, aligned2 = backtrack_alignment(seq1, seq2, dp)

    return dp[m][n], aligned1, aligned2

Efficient Algorithm (Hirschberg)

Approach: Space-optimized divide-and-conquer with DP

Implementation:

  1. Base case: Use standard DP for small sequences
  2. Divide: Split first sequence at midpoint
  3. Conquer: Compute forward and backward costs
  4. Combine: Find optimal split point, recurse on halves

Complexity:

  • Time: O(m ร— n)
  • Space: O(min(m, n))

Use Cases:

  • Large sequences where memory is constrained
  • Production systems with memory limitations
  • Processing multiple alignments concurrently

Key Functions:

  • compute_cost_row() - Calculate costs using only 2 rows
  • align_small_sequences() - Handle base cases
  • hirschberg_recursive() - Main divide-and-conquer logic

Cost Model

Mismatch Penalties (ALPHA matrix):

     A    C    G    T
A    0   110   48   94
C   110   0   118   48
G    48  118   0   110
T    94   48  110   0

Gap Penalty (DELTA): 30

The penalty values reflect biological similarity between nucleotide bases:

  • Lower penalties for chemically similar bases (e.g., A-G both purines)
  • Higher penalties for dissimilar bases

๐Ÿ‘จโ€๐Ÿ’ป Development

Code Quality Standards

  • Type Hints: All functions have complete type annotations
  • Docstrings: Google-style documentation for all modules and functions
  • Error Handling: Comprehensive validation with informative error messages
  • Testing: Unit tests for core functions (recommended)

Adding a New Algorithm

  1. Create a new file in src/ (e.g., new_algorithm.py)
  2. Import from alignment_core for shared utilities
  3. Implement with signature: (seq1: str, seq2: str) -> Tuple[int, str, str]
  4. Update main.py to include the algorithm:
from new_algorithm import new_algorithm

def select_algorithm(mode: str):
    algorithms = {
        "basic": basic,
        "efficient": efficient,
        "new": new_algorithm  # Add here
    }
    return algorithms[mode]

Modifying Cost Parameters

Edit src/cost_constants.py:

# Adjust mismatch penalties
ALPHA['A']['G'] = 50

# Change gap penalty
DELTA = 25

Project Setup

# Make scripts executable
chmod +x *.sh

# Set up environment
./setup.sh

# Run tests
python src/main.py basic data/SampleTestCases/input1.txt /tmp/test.txt

๐Ÿ“Š Output Format

Text Output File

Each run produces a text file with:

114                    # Minimum alignment cost
AACT_G                # Aligned sequence 1
A_CTTG                # Aligned sequence 2
12.345                # Execution time (ms)
2048                  # Memory usage (KB)

JSON Performance Log

Results are also logged to data/output/result.json:

{
	"10": {
		"basic": {
			"run": 10,
			"cost": 114,
			"time": 12.345,
			"memory": 2048
		},
		"efficient": {
			"run": 10,
			"cost": 114,
			"time": 15.678,
			"memory": 1024
		}
	}
}

Performance Graphs

Running batch tests generates comparison graphs:

  • data/output/time_comparison.png - Runtime comparison
  • data/output/memory_comparison.png - Memory usage comparison

๐Ÿงช Testing

Run Unit Tests

The project includes comprehensive unit tests covering all modules:

# Run all unit tests
python test_alignment.py

# Run with verbose output
python test_alignment.py -v

Test Coverage:

  • โœ… Basic alignment algorithm
  • โœ… Efficient alignment algorithm
  • โœ… Core alignment utilities
  • โœ… String processing
  • โœ… Input/output operations
  • โœ… Cost constants validation
  • โœ… Algorithm consistency checks
  • โœ… Integration tests

Verify Installation

# Test basic algorithm
python src/main.py basic data/SampleTestCases/input1.txt /tmp/test_basic.txt

# Test efficient algorithm
python src/main.py efficient data/SampleTestCases/input1.txt /tmp/test_efficient.txt

# Verify both produce same alignment cost
diff /tmp/test_basic.txt /tmp/test_efficient.txt | head -1

Run All Test Cases

./run_batch.sh

This will process all input files and generate performance comparison graphs.


๐Ÿ“ฆ Dependencies

psutil>=5.9.0      # Memory monitoring
matplotlib>=3.7.0  # Graph generation

Install with:

pip install -r requirements.txt

๐ŸŽ“ Algorithm Comparison

Aspect Basic Efficient
Algorithm Needleman-Wunsch Hirschberg
Time Complexity O(mร—n) O(mร—n)
Space Complexity O(mร—n) O(min(m,n))
Memory Usage Higher Lower
Implementation Simpler More Complex
Best For Small/medium sequences Large sequences
DP Table Full table stored Only 2 rows at a time

๐Ÿ“ Notes

  • Input files must follow the specified format
  • Sequences should contain only valid DNA bases (A, C, G, T)
  • Both algorithms produce identical alignment results
  • The efficient algorithm trades some code complexity for significant memory savings

๏ฟฝ Error Handling

The program handles various error conditions:

  • Invalid arguments: Clear usage instructions
  • Missing files: Helpful file-not-found messages
  • Invalid sequences: Validation with specific error messages
  • Malformed input: Detailed parsing errors
  • I/O errors: Graceful handling with user feedback

Example:

$ python src/main.py basic nonexistent.txt output.txt
Error: Input file not found: nonexistent.txt

$ python src/main.py invalid data/input/in1.txt output.txt
Error: Unknown mode 'invalid'
Valid modes are: basic, efficient

๐Ÿš€ Performance Tips

  1. For large sequences: Use the efficient algorithm to save memory
  2. Batch processing: Use run_batch.sh for multiple files
  3. Monitoring: Check JSON logs for performance trends
  4. Optimization: Adjust cost parameters based on your use case

๐Ÿ“„ License

This project is part of CSCI 570 coursework.


Developed with โค๏ธ for CSCI 570