⚠️ ACADEMIC INTEGRITY NOTICE

This repository contains a refactored and enhanced version of the DNA sequence alignment project. The implementation, code structure, and architecture shown here are significantly different from the original academic submission for CSCI 570.
This refactored codebase was created AFTER course completion and includes:
- Complete architectural redesign with modular components
- Enhanced type safety and professional code organization
- Additional tooling, testing, and documentation
- Improved performance benchmarking and visualization
The original course project submission remains separate and was completed in accordance with all academic integrity policies. This public version is intended for portfolio and learning purposes only.
A comprehensive implementation of DNA sequence alignment algorithms using dynamic programming. This project provides two algorithmic approaches: a standard space-intensive algorithm and a space-optimized variant, along with complete tooling for testing, benchmarking, and visualization.
- Two Algorithm Implementations
  - Basic: Standard Needleman-Wunsch (O(m×n) space)
  - Efficient: Hirschberg's Algorithm (O(min(m,n)) space)
- Modular Architecture
  - Clean separation of concerns
  - Reusable core utilities
  - Type-safe with comprehensive type hints
- Performance Benchmarking
  - Execution time measurement
  - Memory usage tracking
  - Automated graph generation
- Production-Ready Code
  - Comprehensive error handling
  - Input validation
  - Detailed documentation
./basic.sh data/input/in1.txt data/output/output.txt
./efficient.sh data/input/in1.txt data/output/output.txt
./run_batch.sh

CSCI-570_Sequence-alignment/
├── src/                      # Source code
│   ├── main.py               # Single entry point
│   ├── basic.py              # Basic DP algorithm
│   ├── efficient.py          # Space-optimized algorithm
│   ├── alignment_core.py     # Shared utilities
│   ├── string_processor.py   # Input processing
│   ├── io_utils.py           # File I/O operations
│   ├── perf_utils.py         # Performance measurement
│   └── cost_constants.py     # Cost configuration
├── data/
│   ├── input/                # Test input files
│   ├── output/               # Results and graphs
│   └── SampleTestCases/      # Sample test cases
├── test_alignment.py         # Comprehensive unit tests
├── basic.sh                  # Run basic algorithm
├── efficient.sh              # Run efficient algorithm
├── run_batch.sh              # Batch processing
├── setup.sh                  # Environment setup
└── requirements.txt          # Python dependencies
- main.py - Single entry point that orchestrates the entire workflow
  - CLI argument parsing and validation
  - Algorithm selection
  - Performance measurement
  - Error handling and user feedback
- basic.py - Standard Needleman-Wunsch algorithm
  - Time: O(m × n)
  - Space: O(m × n)
  - Stores full DP table for alignment reconstruction
- efficient.py - Hirschberg's space-optimized algorithm
  - Time: O(m × n)
  - Space: O(min(m, n))
  - Uses divide-and-conquer with DP
- alignment_core.py - Core alignment logic shared by both algorithms
  - init_dp_table() - Initialize DP table with base cases
  - compute_cell_cost() - Calculate cell costs
  - backtrack_alignment() - Reconstruct alignment
  - validate_sequences() - Input validation
- string_processor.py - Input parsing and sequence expansion
  - Handles positional duplication rules
  - Input validation and error handling
- io_utils.py - Input/output operations
  - File reading/writing
  - JSON results management
  - Performance data logging
- perf_utils.py - Performance measurement
  - Execution time tracking
  - Memory usage monitoring
- cost_constants.py - Configuration
  - Mismatch penalty matrix (ALPHA)
  - Gap penalty (DELTA)
Input File
↓
readInput() → Parse raw lines
↓
processStrings() → Expand sequences
↓
validate_sequences() → Validate DNA bases
↓
basic() OR efficient() → Compute alignment
↓
writeOutput() → Save results
↓
Output Files (text + JSON)
- Single Responsibility - Each module has one clear purpose
- DRY (Don't Repeat Yourself) - Common logic in alignment_core
- Type Safety - Type hints throughout for better IDE support
- Error Handling - Comprehensive validation and error messages
- Modularity - Easy to extend with new algorithms
python src/main.py <mode> <input_file> <output_file>

Arguments:
- mode: Algorithm to use (basic or efficient)
- input_file: Path to input data file
- output_file: Path to store results

Example:
python src/main.py basic data/input/in1.txt data/output/out1.txt

./basic.sh data/input/in1.txt data/output/output.txt
./efficient.sh data/input/in1.txt data/output/output.txt

./run_batch.sh
Processes all test cases and generates performance comparison graphs.

./setup.sh
Creates virtual environment and installs dependencies.
The input file should follow this format:
ACTG # First sequence
1 # Index to duplicate after (for seq1)
3 # Another index (for seq1)
TACG # Second sequence
0 # Index to duplicate after (for seq2)
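Each number is a 0-based index into the string built so far: a copy of that entire string is inserted immediately after the indexed position, so every step doubles the sequence length.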
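As a rough illustration of the duplication rule, the sketch below expands a base string from a list of indices. The name `expand_sequence` is hypothetical and is not taken from string_processor.py; the sketch assumes 0-based indices and that each insertion applies to the string built so far.

```python
from typing import List


def expand_sequence(base: str, indices: List[int]) -> str:
    """Expand `base` by inserting the current string into itself after each index."""
    current = base
    for idx in indices:
        # Assumption: an index outside the current string is treated as malformed input.
        if not 0 <= idx < len(current):
            raise ValueError(f"index {idx} out of range for length {len(current)}")
        # Insert a full copy of the current string right after position idx.
        current = current[: idx + 1] + current + current[idx + 1:]
    return current


# The sample input above: ACTG with indices 1 and 3, TACG with index 0.
seq1 = expand_sequence("ACTG", [1, 3])
seq2 = expand_sequence("TACG", [0])
```

Under these assumptions a base string of length L with k indices always expands to length L × 2^k, which gives a cheap sanity check on parsed input.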
Approach: Standard dynamic programming with full table storage
Implementation:
- Initialize the (m+1)×(n+1) DP table with base cases (row 0 and column 0 hold pure gap costs)
- Fill the table bottom-up using the recurrence relation
- Backtrack from the bottom-right cell to reconstruct the alignment
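The three steps above might look like the following self-contained sketch. It does not use the project's alignment_core helpers: `align_full_table` is a hypothetical name, the penalty values are copied from the cost configuration described in this README, and `_` marks a gap as in the sample output.

```python
from typing import Dict, List, Tuple

DELTA = 30  # gap penalty
ALPHA: Dict[str, Dict[str, int]] = {  # mismatch penalties
    "A": {"A": 0, "C": 110, "G": 48, "T": 94},
    "C": {"A": 110, "C": 0, "G": 118, "T": 48},
    "G": {"A": 48, "C": 118, "G": 0, "T": 110},
    "T": {"A": 94, "C": 48, "G": 110, "T": 0},
}


def align_full_table(x: str, y: str) -> Tuple[int, str, str]:
    """Return (min cost, aligned x, aligned y) using a full DP table."""
    m, n = len(x), len(y)
    # Step 1: base cases. Aligning a prefix with the empty string costs only gaps.
    dp: List[List[int]] = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * DELTA
    for j in range(1, n + 1):
        dp[0][j] = j * DELTA
    # Step 2: fill each cell from its three neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + ALPHA[x[i - 1]][y[j - 1]],  # pair x[i-1] with y[j-1]
                dp[i - 1][j] + DELTA,                          # x[i-1] against a gap
                dp[i][j - 1] + DELTA,                          # y[j-1] against a gap
            )
    # Step 3: walk back from the bottom-right cell, collecting characters in reverse.
    out_x: List[str] = []
    out_y: List[str] = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + ALPHA[x[i - 1]][y[j - 1]]:
            out_x.append(x[i - 1]); out_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + DELTA:
            out_x.append(x[i - 1]); out_y.append("_"); i -= 1
        else:
            out_x.append("_"); out_y.append(y[j - 1]); j -= 1
    return dp[m][n], "".join(reversed(out_x)), "".join(reversed(out_y))
```

The order of the branches in the backtracking loop is an arbitrary tie-breaking choice; a different order can yield a different alignment with the same minimum cost.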
Complexity:
- Time: O(m × n)
- Space: O(m × n)
Use Cases:
- Small to medium sequences
- When full DP table is needed for analysis
- Educational purposes
Code Structure:
from typing import Tuple

from alignment_core import backtrack_alignment, compute_cell_cost, init_dp_table, validate_sequences


def basic(seq1: str, seq2: str) -> Tuple[int, str, str]:
    # Validate inputs
    validate_sequences(seq1, seq2)
    m, n = len(seq1), len(seq2)
    # Initialize full (m+1) x (n+1) DP table
    dp = init_dp_table(m, n)
    # Fill table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = compute_cell_cost(seq1, seq2, i, j, dp)
    # Backtrack to get alignment
    aligned1, aligned2 = backtrack_alignment(seq1, seq2, dp)
    return dp[m][n], aligned1, aligned2

Approach: Space-optimized divide-and-conquer with DP
Implementation:
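- Base case: Use standard DP for small sequences
- Divide: Split the first sequence at its midpoint
- Conquer: Compute forward costs of the left half against every prefix of the second sequence, and backward costs of the right half against every suffix, keeping only two rows at a time
- Combine: Pick the split point of the second sequence that minimizes forward + backward cost, recurse on the two halves, and concatenate the results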
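The divide, conquer, and combine steps can be sketched as below. This is only an illustration under stated assumptions: `last_row_costs` and `best_split` are hypothetical names (not the project's compute_cost_row() or hirschberg_recursive()), the penalties are copied from the cost configuration in this README, and the recursion and base case are left out.

```python
from typing import Dict, List, Tuple

DELTA = 30  # gap penalty
ALPHA: Dict[str, Dict[str, int]] = {  # mismatch penalties (symmetric)
    "A": {"A": 0, "C": 110, "G": 48, "T": 94},
    "C": {"A": 110, "C": 0, "G": 118, "T": 48},
    "G": {"A": 48, "C": 118, "G": 0, "T": 110},
    "T": {"A": 94, "C": 48, "G": 110, "T": 0},
}


def last_row_costs(x: str, y: str) -> List[int]:
    """Cost of aligning all of x with each prefix y[:j], storing only two rows."""
    prev = [j * DELTA for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * DELTA] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = min(
                prev[j - 1] + ALPHA[x[i - 1]][y[j - 1]],  # match or mismatch
                prev[j] + DELTA,                          # gap in y
                curr[j - 1] + DELTA,                      # gap in x
            )
        prev = curr
    return prev


def best_split(x: str, y: str) -> Tuple[int, int]:
    """Return (k, cost): y[:k] goes with the left half of x, y[k:] with the right."""
    mid = len(x) // 2
    forward = last_row_costs(x[:mid], y)
    # Reversing both strings turns suffix costs into prefix costs.
    backward = last_row_costs(x[mid:][::-1], y[::-1])
    n = len(y)
    totals = [forward[k] + backward[n - k] for k in range(n + 1)]
    k = min(range(n + 1), key=totals.__getitem__)
    return k, totals[k]
```

The minimum of `forward[k] + backward[n - k]` equals the optimal cost of the whole problem, which is why the two recursive calls can be solved independently and their alignments joined.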
Complexity:
- Time: O(m × n)
- Space: O(min(m, n))
Use Cases:
- Large sequences where memory is constrained
- Production systems with memory limitations
- Processing multiple alignments concurrently
Key Functions:
- compute_cost_row() - Calculate costs using only 2 rows
- align_small_sequences() - Handle base cases
- hirschberg_recursive() - Main divide-and-conquer logic
Mismatch Penalties (ALPHA matrix):
| | A | C | G | T |
|---|---|---|---|---|
| A | 0 | 110 | 48 | 94 |
| C | 110 | 0 | 118 | 48 |
| G | 48 | 118 | 0 | 110 |
| T | 94 | 48 | 110 | 0 |

Gap Penalty (DELTA): 30
The penalty values reflect biological similarity between nucleotide bases:
- Lower penalties for chemically similar bases (e.g., A-G both purines)
- Higher penalties for dissimilar bases
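To make the penalties concrete, here is a minimal sketch that scores an already-gapped alignment with the values above. The function `alignment_cost` is hypothetical and is not part of the project's modules; it assumes `_` is the gap character, as in the sample output.

```python
from typing import Dict

DELTA = 30  # gap penalty
ALPHA: Dict[str, Dict[str, int]] = {  # mismatch penalties
    "A": {"A": 0, "C": 110, "G": 48, "T": 94},
    "C": {"A": 110, "C": 0, "G": 118, "T": 48},
    "G": {"A": 48, "C": 118, "G": 0, "T": 110},
    "T": {"A": 94, "C": 48, "G": 110, "T": 0},
}


def alignment_cost(aligned1: str, aligned2: str, gap: str = "_") -> int:
    """Sum the penalties of two equal-length aligned strings, column by column."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned strings must have the same length")
    total = 0
    for a, b in zip(aligned1, aligned2):
        if a == gap and b == gap:
            raise ValueError("a column cannot contain two gaps")
        if a == gap or b == gap:
            total += DELTA          # one side is a gap
        else:
            total += ALPHA[a][b]    # 0 when the bases match
    return total
```

For example, a purine-purine substitution (A with G) costs 48, while replacing that substitution with two gaps would cost 60, so the mismatch is preferred; an A with C mismatch (110) is more expensive than two gaps.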
- Type Hints: All functions have complete type annotations
- Docstrings: Google-style documentation for all modules and functions
- Error Handling: Comprehensive validation with informative error messages
- Testing: Unit tests for core functions (recommended)
- Create a new file in src/ (e.g., new_algorithm.py)
- Import from alignment_core for shared utilities
- Implement with signature: (seq1: str, seq2: str) -> Tuple[int, str, str]
- Update main.py to include the algorithm:

from new_algorithm import new_algorithm

def select_algorithm(mode: str):
    algorithms = {
        "basic": basic,
        "efficient": efficient,
        "new": new_algorithm  # Add here
    }
    return algorithms[mode]

Edit src/cost_constants.py:
# Adjust mismatch penalties (update both directions to keep the matrix symmetric)
ALPHA['A']['G'] = 50
ALPHA['G']['A'] = 50
# Change gap penalty
DELTA = 25

# Make scripts executable
chmod +x *.sh
# Set up environment
./setup.sh
# Run tests
python src/main.py basic data/SampleTestCases/input1.txt /tmp/test.txt

Each run produces a text file with:
114 # Minimum alignment cost
AACT_G # Aligned sequence 1
A_CTTG # Aligned sequence 2
12.345 # Execution time (ms)
2048 # Memory usage (KB)
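For downstream scripts, the five-line layout above could be read back with something like the following sketch. `RunResult` and `parse_output` are hypothetical names, not part of io_utils.py, and the sketch assumes any `#` annotation on a line is a comment to be ignored.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    cost: int
    aligned1: str
    aligned2: str
    time_ms: float
    memory_kb: float


def parse_output(text: str) -> RunResult:
    """Parse the five-line result layout: cost, two aligned strings, time, memory."""
    # Drop trailing "# ..." annotations and blank lines.
    lines = [line.split("#", 1)[0].strip() for line in text.splitlines()]
    lines = [line for line in lines if line]
    if len(lines) != 5:
        raise ValueError(f"expected 5 lines, found {len(lines)}")
    cost, aligned1, aligned2, time_ms, memory_kb = lines
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences differ in length")
    return RunResult(int(cost), aligned1, aligned2, float(time_ms), float(memory_kb))
```

Checking that both aligned strings have the same length is a cheap way to catch truncated or hand-edited result files.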
Results are also logged to data/output/result.json:
{
  "10": {
    "basic": {
      "run": 10,
      "cost": 114,
      "time": 12.345,
      "memory": 2048
    },
    "efficient": {
      "run": 10,
      "cost": 114,
      "time": 15.678,
      "memory": 1024
    }
  }
}

Running batch tests generates comparison graphs:
- data/output/time_comparison.png - Runtime comparison
- data/output/memory_comparison.png - Memory usage comparison
The project includes comprehensive unit tests covering all modules:
# Run all unit tests
python test_alignment.py
# Run with verbose output
python test_alignment.py -v

Test Coverage:
- ✅ Basic alignment algorithm
- ✅ Efficient alignment algorithm
- ✅ Core alignment utilities
- ✅ String processing
- ✅ Input/output operations
- ✅ Cost constants validation
- ✅ Algorithm consistency checks
- ✅ Integration tests
# Test basic algorithm
python src/main.py basic data/SampleTestCases/input1.txt /tmp/test_basic.txt
# Test efficient algorithm
python src/main.py efficient data/SampleTestCases/input1.txt /tmp/test_efficient.txt
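# Verify both produce the same alignment cost (first line of each output file;
# a plain diff would also flag the differing time and memory lines)
[ "$(head -1 /tmp/test_basic.txt)" = "$(head -1 /tmp/test_efficient.txt)" ] && echo "costs match"

./run_batch.sh
This will process all input files and generate performance comparison graphs.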
psutil>=5.9.0 # Memory monitoring
matplotlib>=3.7.0 # Graph generation
Install with:
pip install -r requirements.txt

| Aspect | Basic | Efficient |
|---|---|---|
| Algorithm | Needleman-Wunsch | Hirschberg |
| Time Complexity | O(m×n) | O(m×n) |
| Space Complexity | O(m×n) | O(min(m,n)) |
| Memory Usage | Higher | Lower |
| Implementation | Simpler | More Complex |
| Best For | Small/medium sequences | Large sequences |
| DP Table | Full table stored | Only 2 rows at a time |
- Input files must follow the specified format
- Sequences should contain only valid DNA bases (A, C, G, T)
- Both algorithms produce the same minimum alignment cost; the aligned strings themselves can differ when several alignments are equally optimal
- The efficient algorithm trades some code complexity for significant memory savings
The program handles various error conditions:
- Invalid arguments: Clear usage instructions
- Missing files: Helpful file-not-found messages
- Invalid sequences: Validation with specific error messages
- Malformed input: Detailed parsing errors
- I/O errors: Graceful handling with user feedback
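One way such checks might be wired into the command line is sketched below. `UsageError`, `parse_cli`, and `VALID_MODES` are hypothetical names and do not come from main.py; the messages mirror the wording used in this README.

```python
import sys
from pathlib import Path
from typing import List, Tuple

VALID_MODES = ("basic", "efficient")


class UsageError(Exception):
    """Raised for any problem the user can fix by changing the command line."""


def parse_cli(argv: List[str]) -> Tuple[str, Path, Path]:
    """Validate [mode, input_file, output_file] and return them as typed values."""
    if len(argv) != 3:
        raise UsageError("Usage: python src/main.py <mode> <input_file> <output_file>")
    mode, input_file, output_file = argv
    if mode not in VALID_MODES:
        raise UsageError(
            f"Unknown mode '{mode}'\nValid modes are: {', '.join(VALID_MODES)}"
        )
    input_path = Path(input_file)
    if not input_path.is_file():
        raise UsageError(f"Input file not found: {input_file}")
    return mode, input_path, Path(output_file)


def main(argv: List[str]) -> int:
    try:
        mode, input_path, output_path = parse_cli(argv)
    except UsageError as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    # The selected algorithm would be invoked here.
    return 0
```

Raising a single exception type and printing it in one place keeps the message format consistent and lets the process return a non-zero exit status for scripts such as run_batch.sh to detect.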
Example:
$ python src/main.py basic nonexistent.txt output.txt
Error: Input file not found: nonexistent.txt
$ python src/main.py invalid data/input/in1.txt output.txt
Error: Unknown mode 'invalid'
Valid modes are: basic, efficient

- For large sequences: Use the efficient algorithm to save memory
- Batch processing: Use run_batch.sh for multiple files
- Monitoring: Check JSON logs for performance trends
- Optimization: Adjust cost parameters based on your use case
This project is part of CSCI 570 coursework.
Developed with ❤️ for CSCI 570