Automated pipeline for downloading, cleaning, and analyzing mainly alpha-helix protein structures from the CATH database.
- Overview
- Notebook Structure
- Requirements
- Installation
- Usage
- Output Structure
- Section Details
- Main Functions
- Contributing
- License
This project implements a complete pipeline for analyzing proteins with mainly alpha-helix structure:
- Automated download of PDB structures from CATH database
- Cleaning and preprocessing of PDB files
- Amino acid frequency analysis
- Advanced analysis with DSSP (Define Secondary Structure of Proteins)
- Alpha-helix type classification
- Publication-quality visualizations
- 6 main processing cells
- ~1,962 lines of Python code
- 42 specialized functions
- 9 visualization functions
- Optimized for macOS with parallel processing
The cath-protocol.ipynb notebook is organized into 6 main cells:
- Library imports
- Global variable configuration
- Initial environment setup
- CATH domain downloading
- Parallel download processing
- File validation
- Heteroatom removal
- Chain filtering
- PDB file standardization
- Amino acid counting
- Statistical analysis
- Distribution visualizations
- Secondary structure analysis
- Structural metrics calculation
- Alpha-helix detection
- Alpha-helix type identification
- Automatic classification
- Comparative analysis
# Scientific core
numpy
pandas
matplotlib
seaborn
# Bioinformatics
biopython
biotite
# Structural analysis
dssp (mkdssp)
# Utilities
requests
tqdm
- DSSP: For secondary structure analysis
# macOS
brew install dssp
# Linux (Ubuntu/Debian)
sudo apt-get install dssp
- Clone the repository:
git clone https://github.com/madsondeluna/cath-mainly-alpha.git
cd cath-mainly-alpha
- Create a virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install numpy pandas matplotlib seaborn biopython biotite requests tqdm
- Install DSSP (if not already installed):
brew install dssp # macOS
Open the notebook in Jupyter:
jupyter notebook cath-protocol.ipynb
Execute cells in sequential order (0 → 1 → 2 → 3 → 4 → 5).
Edit variables in Cell 0 to customize:
# Output directories
OUTPUT_DIR = "output"
PDB_DIR = "pdb_files"
CLEANED_DIR = "cleaned_pdb"
# Download parameters
MAX_WORKERS = 4 # Number of parallel downloads
TIMEOUT = 30 # Timeout in seconds
# Analysis filters
MIN_HELIX_LENGTH = 4 # Minimum helix length
After execution, the following directory structure will be created:
cath-mainly-alpha/
├── pdb_files/ # Downloaded PDB files
│ ├── domain1.pdb
│ ├── domain2.pdb
│ └── ...
├── cleaned_pdb/ # Cleaned PDB files
│ ├── domain1_clean.pdb
│ ├── domain2_clean.pdb
│ └── ...
├── output/ # Analysis results
│ ├── amino_acid_freq.csv
│ ├── helix_analysis.csv
│ ├── helix_types.csv
│ └── figures/ # Visualizations
│ ├── aa_distribution.png
│ ├── helix_length_dist.png
│ └── ...
└── logs/ # Execution logs
└── processing.log
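The layout above could be created up front with a few lines of standard-library Python. This is only a minimal sketch and not the notebook's actual setup code: the helper name `create_project_dirs` is hypothetical, and only the directory names are taken from the tree.

```python
from pathlib import Path

# Directory names taken from the tree above; the helper itself is hypothetical.
PROJECT_DIRS = ("pdb_files", "cleaned_pdb", "output/figures", "logs")


def create_project_dirs(root="."):
    """Create the expected directories under `root` and return their paths."""
    created = []
    for name in PROJECT_DIRS:
        path = Path(root) / name
        # parents=True also creates `output/` for `output/figures`;
        # exist_ok=True makes repeated calls harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_project_dirs()` from the repository root before running the cells would leave existing files untouched.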
Purpose: Prepare the execution environment
What it does:
- Imports all necessary libraries (NumPy, Pandas, BioPython, etc.)
- Defines global variables and directory paths
- Configures matplotlib for high-quality visualizations
- Initializes loggers for tracking
Main imports:
from Bio.PDB import PDBParser, PDBIO, Select
from Bio.SeqUtils import seq1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Purpose: Download PDB files from the CATH database
Main functions:
- download_cath_domain(domain_id, output_dir): Downloads a single CATH domain
  - Parameters: domain ID, output directory
  - Returns: Path to downloaded file or None
- download_cath_domains_parallel(domain_list, output_dir, max_workers=4): Parallel download of multiple domains
  - Uses ThreadPoolExecutor for parallelization
  - Progress bar with tqdm
- validate_pdb_file(filepath): Validates PDB file integrity
  - Checks format and structure
  - Returns: True/False
- get_cath_domain_list(cath_version='current'): Gets list of CATH domains
  - Filters only "Mainly Alpha" class
  - Returns: List of IDs
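To illustrate what a format check might involve, here is a minimal standard-library sketch. It is not the notebook's `validate_pdb_file`; the name `looks_like_pdb` and the `min_atoms` parameter are assumptions, and the only rule applied is that every ATOM record must carry parsable coordinates in the fixed-width PDB columns.

```python
from pathlib import Path


def looks_like_pdb(filepath, min_atoms=1):
    """Return True if the file has at least `min_atoms` parsable ATOM records."""
    path = Path(filepath)
    if not path.is_file() or path.stat().st_size == 0:
        return False
    atoms = 0
    with path.open() as handle:
        for line in handle:
            if not line.startswith("ATOM"):
                continue
            try:
                # PDB columns 31-54 hold x, y, z as three 8-character fields.
                for start in (30, 38, 46):
                    float(line[start:start + 8])
            except ValueError:
                return False  # truncated or corrupted coordinate field
            atoms += 1
    return atoms >= min_atoms
```

A stricter check could also parse the file with Bio.PDB's `PDBParser`, at the cost of speed.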
Process:
- Gets list of mainly alpha-helix domains from CATH
- Creates output directory if it doesn't exist
- Downloads PDB files in parallel (4 threads by default)
- Validates each downloaded file
- Records download statistics (success/failure)
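The parallel step in this process can be sketched with `concurrent.futures`. This is an illustration under assumptions, not the notebook's `download_cath_domains_parallel`: the function name `download_all` is hypothetical, and the actual network call is passed in as `fetch` so the sketch does not guess at any CATH URL.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(domain_ids, fetch, max_workers=4):
    """Run `fetch(domain_id)` for every ID in a thread pool.

    `fetch` is a caller-supplied function (hypothetical here) that returns
    a file path on success and returns None or raises on failure.
    """
    succeeded, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, d): d for d in domain_ids}
        for future in as_completed(futures):
            domain_id = futures[future]
            try:
                path = future.result()
            except Exception:
                path = None  # treat any error as a failed download
            if path is None:
                failed.append(domain_id)
            else:
                succeeded[domain_id] = path
    return succeeded, sorted(failed)
```

Threads suit this step because downloads are I/O-bound; the returned pair gives the success/failure statistics directly.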
Output:
- PDB files in pdb_files/
- Download log in logs/download.log
Purpose: Clean and standardize PDB files
Main functions:
- remove_heteroatoms(structure): Removes heteroatoms (water, ligands)
  - Keeps only protein atoms
  - Returns: Clean structure
- filter_chains(structure, chain_ids=None): Filters specific chains
  - If None, keeps all
  - Returns: Filtered structure
- standardize_residues(structure): Standardizes residue names
  - Removes non-standard residues
  - Returns: Standardized structure
- clean_pdb_file(input_path, output_path): Main cleaning function
  - Applies all filters
  - Saves clean file
- batch_clean_pdb(input_dir, output_dir): Cleans multiple files
  - Batch processing
  - Progress bar
Process:
- Reads original PDB file
- Removes heteroatoms (water, ions, ligands)
- Filters chains (if specified)
- Standardizes residue names
- Removes modified residues
- Saves clean file
- Validates output file
Cleaning criteria:
- Removes HETATM
- Keeps only standard amino acids (20 types)
- Removes residues with occupancy < 0.5
- Removes atoms with B-factor > 100
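The criteria above can be expressed as a line-level filter over the fixed-width PDB columns. This is a simplified sketch, not the notebook's Bio.PDB-based cleaning: the names `keep_atom_line` and `clean_pdb_lines` are hypothetical, the occupancy rule is applied per atom rather than per residue, and records such as TER and END are dropped along with everything else that is not an accepted ATOM line.

```python
STANDARD_RESIDUES = frozenset(
    "ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE "
    "LEU LYS MET PHE PRO SER THR TRP TYR VAL".split()
)


def keep_atom_line(line, min_occupancy=0.5, max_bfactor=100.0):
    """Return True if a PDB line passes the cleaning criteria."""
    if not line.startswith("ATOM"):
        return False  # drops HETATM and all non-coordinate records
    if line[17:20] not in STANDARD_RESIDUES:
        return False  # residue name lives in columns 18-20
    try:
        occupancy = float(line[54:60])  # columns 55-60
        bfactor = float(line[60:66])    # columns 61-66
    except ValueError:
        return False  # malformed or truncated record
    return occupancy >= min_occupancy and bfactor <= max_bfactor


def clean_pdb_lines(lines):
    """Keep only the ATOM records that satisfy every criterion."""
    return [line for line in lines if keep_atom_line(line)]
```

Working on structure objects, as the notebook does, makes residue-level decisions easier; the text filter is shown only because it makes each threshold explicit.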
Output:
- Clean PDB files in cleaned_pdb/
- Cleaning report in logs/cleaning.log
Purpose: Analyze amino acid composition in structures
Main functions:
- count_amino_acids(pdb_file): Counts frequency of each amino acid
  - Returns: Dictionary {aa: count}
- calculate_aa_statistics(pdb_dir): Aggregate statistics from all files
  - Mean, standard deviation, percentiles
  - Returns: DataFrame with statistics
- plot_aa_distribution(stats_df, output_path): Bar plot of distribution
  - Custom colors by property (hydrophobic, polar, etc.)
  - Saves high-resolution figure
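As a rough illustration of the aggregation step, the sketch below turns per-structure counts into percentages and summarizes them across structures. It uses only the standard library and returns plain dictionaries, whereas the notebook returns a DataFrame; the function names are hypothetical.

```python
from statistics import mean, stdev


def relative_frequencies(counts):
    """Convert {aa: count} into {aa: percentage of all residues}."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def aggregate_frequencies(per_structure_counts):
    """Mean and sample standard deviation of each amino acid's percentage."""
    freqs = [relative_frequencies(c) for c in per_structure_counts]
    amino_acids = sorted({aa for f in freqs for aa in f})
    summary = {}
    for aa in amino_acids:
        # An amino acid absent from a structure counts as 0 % there.
        values = [f.get(aa, 0.0) for f in freqs]
        summary[aa] = {
            "mean": mean(values),
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary
```

Averaging percentages rather than raw counts keeps large structures from dominating the summary.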
Analyses performed:
- Absolute frequency of each amino acid
- Relative frequency (%)
- Comparison between structures
- Identification of most/least common amino acids
- Analysis by physicochemical property:
- Hydrophobic (A, V, L, I, M, F, W, P)
- Polar (S, T, N, Q, C, Y)
- Positively charged (K, R, H)
- Negatively charged (D, E)
- Special (G, P)
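The property grouping above can be applied to a one-letter sequence as follows. This is a sketch with a hypothetical function name; the group membership is copied from the list, so proline is counted in both the hydrophobic and the special group and the percentages can sum to more than 100.

```python
from collections import Counter

# Group membership copied from the list above (P appears twice on purpose).
PROPERTY_GROUPS = {
    "hydrophobic": "AVLIMFWP",
    "polar": "STNQCY",
    "positive": "KRH",
    "negative": "DE",
    "special": "GP",
}


def group_percentages(sequence):
    """Return the percentage of residues falling in each property group."""
    counts = Counter(sequence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        name: 100.0 * sum(counts[aa] for aa in members) / total
        for name, members in PROPERTY_GROUPS.items()
    }
```

For example, `group_percentages("AAKDGP")` reports half of the residues as hydrophobic, because the two alanines and the proline all fall in that group.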
Visualizations:
- Bar plot: general distribution
- Heatmap: comparison between structures
- Box plot: variability
Output:
- output/amino_acid_freq.csv: Frequency table
- output/aa_statistics.csv: Aggregate statistics
- output/figures/aa_distribution.png: Main plot
- output/figures/aa_heatmap.png: Comparative heatmap
Purpose: Detailed secondary structure analysis using DSSP
Main functions:
- run_dssp(pdb_file): Runs DSSP on PDB file
  - Returns: DSSP object with annotations
- parse_dssp_output(dssp_result): Parses DSSP output
  - Extracts secondary structure
  - Returns: Structured DataFrame
- identify_helices(dssp_df): Identifies alpha-helix segments
  - Filters by minimum length
  - Returns: List of helices [(start, end, length)]
- calculate_helix_metrics(helix_segment, structure): Calculates geometric metrics
  - Length, angles, twist
  - Returns: Dict with metrics
- analyze_helix_geometry(pdb_file): Complete geometric analysis
  - Helix axis, radius, pitch
  - Returns: DataFrame with geometry
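Helix identification amounts to finding runs of the DSSP code `H` that meet the minimum length. The sketch below works on a plain string of per-residue DSSP codes instead of the DataFrame the notebook uses; the function name is hypothetical, and the returned indices are 0-based positions in that string, not PDB residue numbers.

```python
from itertools import groupby


def find_helix_segments(ss_string, min_length=4, helix_code="H"):
    """Return (start, end, length) for each run of `helix_code`.

    `start` and `end` are inclusive, 0-based positions in `ss_string`.
    """
    segments = []
    position = 0
    for code, run in groupby(ss_string):
        length = len(list(run))
        if code == helix_code and length >= min_length:
            segments.append((position, position + length - 1, length))
        position += length
    return segments
```

The default `min_length=4` mirrors `MIN_HELIX_LENGTH` from the configuration cell.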
Calculated metrics:
For each detected alpha-helix:
- Length: Number of residues
- Phi/psi angles: Dihedral angles
- Rise per residue: Advance along the helix axis per residue (~1.5 Å ideal)
- Twist: Rotation per residue (~100° ideal)
- Radius: Helix radius (~2.3 Å ideal)
- Pitch: Helix pitch (~5.4 Å ideal)
- RMSD: Deviation from ideal helix
- Regularity: Structural regularity score
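Rise, twist, and pitch are linked: one turn is 360°, so residues per turn equals 360 divided by the twist, and the pitch is the rise multiplied by residues per turn. The short sketch below (function names are illustrative) checks these relations for an ideal alpha-helix with a rise of 1.5 Å and a twist of 100°, which gives 3.6 residues per turn and a pitch of about 5.4 Å.

```python
def residues_per_turn(twist_deg):
    """Number of residues needed to complete one 360-degree turn."""
    return 360.0 / twist_deg


def helix_pitch(rise, twist_deg):
    """Axial advance per full turn, in the same unit as `rise`."""
    return rise * residues_per_turn(twist_deg)


# Ideal alpha-helix: rise 1.5 Å per residue, twist 100° per residue.
alpha_turn = residues_per_turn(100.0)
alpha_pitch = helix_pitch(1.5, 100.0)
```

Comparing a measured pitch against `helix_pitch(rise, twist)` is a quick consistency check on the other two metrics.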
Identified secondary structures:
- H: alpha-helix
- G: 3₁₀-helix
- I: pi-helix
- E: beta-sheet
- B: beta-bridge
- T: Turn
- S: Bend
- C: Coil/loop
Visualizations:
- Helix length distribution
- Ramachandran plots (phi/psi)
- Geometric metrics distribution
- Comparison with ideal values
Output:
- output/dssp_analysis.csv: Complete DSSP analysis
- output/helix_metrics.csv: Metrics for each helix
- output/helix_geometry.csv: Detailed geometry
- output/figures/helix_length_dist.png: Length distribution
- output/figures/ramachandran.png: Ramachandran plot
- output/figures/geometry_metrics.png: Geometric metrics
Purpose: Classify alpha-helices into specific types
Main functions:
- classify_helix_type(helix_metrics): Classifies helix type based on metrics
  - alpha-helix, 3₁₀-helix, pi-helix, irregular
  - Returns: Helix type
- identify_helix_kinks(helix_segment, threshold=20): Detects kinks (bends) in helices
  - Threshold: deviation angle in degrees
  - Returns: List of kink positions
- calculate_helix_stability(helix_segment): Estimates helix stability
  - Based on hydrogen bonds
  - Returns: Stability score (0-1)
- find_helix_capping_residues(helix_segment): Identifies capping residues (N-cap and C-cap)
  - Important for stability
  - Returns: {n_cap: residue, c_cap: residue}
- analyze_helix_surface(helix_segment): Analyzes solvent exposure
  - Identifies hydrophobic/hydrophilic faces
  - Returns: Dict with surface analysis
- detect_helix_helix_interactions(structure): Detects interactions between helices
  - Packing, crossing angles
  - Returns: List of interacting pairs
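One way to read the kink threshold is as a limit on how much the local helix axis may change direction between neighbouring positions. The sketch below assumes the local axis vectors have already been estimated by some other step (that estimation is out of scope here) and only flags large direction changes. It is not the notebook's `identify_helix_kinks`; both function names are hypothetical.

```python
import math


def angle_between(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("axis vectors must be non-zero")
    # Clamp to [-1, 1] so rounding error cannot break acos.
    cosine = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cosine))


def find_kinks(axis_vectors, threshold=20.0):
    """Return (index, angle) where the axis turns by more than `threshold`.

    `index` is the position of the second vector in each compared pair.
    """
    kinks = []
    for i in range(1, len(axis_vectors)):
        angle = angle_between(axis_vectors[i - 1], axis_vectors[i])
        if angle > threshold:
            kinks.append((i, angle))
    return kinks
```

How the axis vectors are derived matters a great deal: axes fitted over short windows are noisy and can exceed a 20° threshold even in regular helices.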
Classified helix types:
- Canonical alpha-helix
  - Rise: ~1.5 Å/residue
  - Twist: ~100°/residue
  - 3.6 residues/turn
- 3₁₀-helix
  - Rise: ~2.0 Å/residue
  - Twist: ~120°/residue
  - 3.0 residues/turn
  - Tighter
- Pi-helix
  - Rise: ~1.2 Å/residue
  - Twist: ~87°/residue
  - 4.4 residues/turn
  - Wider
- Irregular helix
  - Significant deviations from patterns
  - May contain kinks
  - RMSD > 1.0 Å
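A simple way to turn these reference values into a classifier is a nearest-reference rule on rise and twist, with the RMSD cutoff applied first. The sketch below is an assumption about how such a rule could look, not the notebook's `classify_helix_type`: the scaling factors and `max_distance` are arbitrary illustrative choices, and only the reference rise, twist, and RMSD values come from the list above.

```python
import math

# Reference geometry copied from the list above.
REFERENCE_HELICES = {
    "alpha": (1.5, 100.0),
    "3-10": (2.0, 120.0),
    "pi": (1.2, 87.0),
}


def classify_helix(rise, twist, rmsd=None, rmsd_cutoff=1.0,
                   rise_scale=0.25, twist_scale=10.0, max_distance=1.5):
    """Label a helix by its nearest reference in scaled (rise, twist) space.

    The scales and `max_distance` are illustrative tolerances, not values
    taken from the notebook.
    """
    if rmsd is not None and rmsd > rmsd_cutoff:
        return "irregular"
    best_name, best_distance = None, float("inf")
    for name, (ref_rise, ref_twist) in REFERENCE_HELICES.items():
        distance = math.hypot((rise - ref_rise) / rise_scale,
                              (twist - ref_twist) / twist_scale)
        if distance < best_distance:
            best_name, best_distance = name, distance
    return best_name if best_distance <= max_distance else "irregular"
```

Scaling each axis by a tolerance keeps the twist, whose numbers are much larger, from dominating the distance.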
Special analyses:
- Helix dipole: Net dipole moment along the helix axis
- Capping motifs: Terminal residue patterns
- Hydrophobic moment: Amphipathicity of the helix
- Helix-helix packing: Packing geometry
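The hydrophobic moment is commonly computed with Eisenberg's formula: each residue contributes its hydrophobicity as a vector rotated by 100° per position, and the moment is the length of the vector sum. The sketch below assumes that formulation; the function name is hypothetical, and the hydrophobicity scale is passed in by the caller because the source does not say which scale the notebook uses.

```python
import math


def hydrophobic_moment(sequence, scale, delta_deg=100.0):
    """Length of the summed hydrophobicity vectors around the helix axis.

    `scale` maps one-letter residue codes to hydrophobicity values and is
    supplied by the caller; `delta_deg` is the rotation per residue.
    """
    delta = math.radians(delta_deg)
    sin_sum = sum(scale[aa] * math.sin(n * delta)
                  for n, aa in enumerate(sequence))
    cos_sum = sum(scale[aa] * math.cos(n * delta)
                  for n, aa in enumerate(sequence))
    return math.hypot(sin_sum, cos_sum)


# Toy two-value scale for illustration only; not a published scale.
TOY_SCALE = {"L": 1.0, "K": -1.0}
amphipathic = hydrophobic_moment("LKKLLKKLLKKLLKKLLK", TOY_SCALE)
uniform = hydrophobic_moment("L" * 18, TOY_SCALE)
```

With this toy scale the patterned sequence gives a large moment, while 18 identical residues cancel out to almost zero because they span exactly five full turns.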
Visualizations:
- Helix type distribution
- Kink map along sequence
- Capping residue analysis
- Solvent exposure profile
- Helix-helix interaction network
Output:
- output/helix_classification.csv: Classification of each helix
- output/helix_types_summary.csv: Summary by type
- output/kinks_analysis.csv: Kink analysis
- output/capping_residues.csv: Capping residues
- output/helix_interactions.csv: Helix interactions
- output/figures/helix_types_pie.png: Type distribution
- output/figures/kinks_heatmap.png: Kink map
- output/figures/interaction_network.png: Interaction network
- download_cath_domain(): Individual download
- download_cath_domains_parallel(): Parallel download
- validate_pdb_file(): File validation
- get_cath_domain_list(): Domain list
- remove_heteroatoms(): Remove non-protein atoms
- filter_chains(): Filter chains
- standardize_residues(): Standardize residues
- clean_pdb_file(): Complete cleaning
- count_amino_acids(): Count amino acids
- calculate_aa_statistics(): Aggregate statistics
- run_dssp(): Run DSSP
- parse_dssp_output(): Parse results
- identify_helices(): Identify helices
- calculate_helix_metrics(): Geometric metrics
- analyze_helix_geometry(): Detailed geometry
- classify_helix_type(): Classify types
- identify_helix_kinks(): Detect kinks
- calculate_helix_stability(): Stability
- find_helix_capping_residues(): Capping
- detect_helix_helix_interactions(): Interactions
- plot_aa_distribution(): Amino acid distribution
- plot_helix_length_distribution(): Lengths
- plot_ramachandran(): Ramachandran plot
- plot_helix_types(): Helix types
- plot_interaction_network(): Interaction network
This pipeline can be used for:
- Structural research: Characterization of alpha-helices in different contexts
- Comparative analysis: Compare properties between protein families
- Structure prediction: Validate computational models
- Protein design: Inform rational design
- Education: Demonstrate protein structure concepts
- Download: ~100 structures/minute (4 threads)
- Cleaning: ~50 structures/minute
- DSSP analysis: ~20 structures/minute
- Classification: ~30 structures/minute
For large datasets (>1000 structures), it is recommended to:
- Increase MAX_WORKERS to 8-16 (if you have enough CPU)
- Run on a machine with an SSD
- Use at least 8 GB of RAM
Contributions are welcome! Please:
- Fork the project
- Create a feature branch (git checkout -b feature/NewAnalysis)
- Commit your changes (git commit -m 'Add new analysis of X')
- Push to the branch (git push origin feature/NewAnalysis)
- Open a Pull Request
This project is under the MIT License. See the LICENSE file for more details.
Madson de Luna
- GitHub: @madsondeluna
- CATH Database: http://www.cathdb.info/
- DSSP: Kabsch W, Sander C (1983). "Dictionary of protein secondary structure"
- BioPython: Cock et al. (2009). "Biopython: freely available Python tools"
- Complete functional pipeline
- 42 implemented functions
- 9 visualization types
- Support for parallel processing
- AlphaFold integration
- Molecular dynamics analysis
- Web interface
- REST API
- Export to additional formats (PyMOL, Chimera)