CATH Mainly Alpha - Protein Structure Analysis

Automated pipeline for downloading, cleaning, and analyzing mainly alpha-helix protein structures from the CATH database.

Overview

This project implements a complete pipeline for analyzing proteins with mainly alpha-helix structure:

Automated download of PDB structures from CATH database
Cleaning and preprocessing of PDB files
Amino acid frequency analysis
Advanced analysis with DSSP (Define Secondary Structure of Proteins)
Alpha-helix type classification
Publication-quality visualizations

Project Statistics

6 main processing cells
~1,962 lines of Python code
42 specialized functions
9 visualization functions
Optimized for macOS with parallel processing

Notebook Structure

The cath-protocol.ipynb notebook is organized into 6 main cells:

Cell 0: Environment Setup (50 lines)

Library imports
Global variable configuration
Initial environment setup

Cell 1: PDB Structure Download (279 lines, 7 functions)

CATH domain downloading
Parallel download processing
File validation

Cell 2: Structure Cleaning (322 lines, 8 functions)

Heteroatom removal
Chain filtering
PDB file standardization

Cell 3: Amino Acid Frequency Analysis (156 lines, 3 functions)

Amino acid counting
Statistical analysis
Distribution visualizations

Cell 4: Advanced DSSP Analysis (558 lines, 13 functions)

Secondary structure analysis
Structural metrics calculation
Alpha-helix detection

Cell 5: Helix Type Classification (597 lines, 11 functions)

Alpha-helix type identification
Automatic classification
Comparative analysis

Requirements

Python Dependencies

# Scientific core
numpy
pandas
matplotlib
seaborn

# Bioinformatics
biopython
biotite

# Structural analysis
dssp (mkdssp)

# Utilities
requests
tqdm

External Tools

DSSP: For secondary structure analysis

# macOS
brew install dssp

# Linux (Ubuntu/Debian)
sudo apt-get install dssp

Installation

Clone the repository:

git clone https://github.com/madsondeluna/cath-mainly-alpha.git
cd cath-mainly-alpha

Create a virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install numpy pandas matplotlib seaborn biopython biotite requests tqdm

Install DSSP (if not already installed):

brew install dssp  # macOS

Usage

Basic Execution

Open the notebook in Jupyter:

jupyter notebook cath-protocol.ipynb

Execute cells in sequential order (0 → 1 → 2 → 3 → 4 → 5).

Configuration

Edit variables in Cell 0 to customize:

# Output directories
OUTPUT_DIR = "output"
PDB_DIR = "pdb_files"
CLEANED_DIR = "cleaned_pdb"

# Download parameters
MAX_WORKERS = 4  # Number of parallel downloads
TIMEOUT = 30     # Timeout in seconds

# Analysis filters
MIN_HELIX_LENGTH = 4  # Minimum helix length

Output Structure

After execution, the following directory structure will be created:

cath-mainly-alpha/
├── pdb_files/              # Downloaded PDB files
│   ├── domain1.pdb
│   ├── domain2.pdb
│   └── ...
├── cleaned_pdb/            # Cleaned PDB files
│   ├── domain1_clean.pdb
│   ├── domain2_clean.pdb
│   └── ...
├── output/                 # Analysis results
│   ├── amino_acid_freq.csv
│   ├── helix_analysis.csv
│   ├── helix_types.csv
│   └── figures/           # Visualizations
│       ├── aa_distribution.png
│       ├── helix_length_dist.png
│       └── ...
└── logs/                  # Execution logs
    └── processing.log

Section Details

Cell 0: Environment Setup

Purpose: Prepare the execution environment

What it does:

Imports all necessary libraries (NumPy, Pandas, BioPython, etc.)
Defines global variables and directory paths
Configures matplotlib for high-quality visualizations
Initializes loggers for tracking

Main imports:

from Bio.PDB import PDBParser, PDBIO, Select
from Bio.SeqUtils import seq1
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Cell 1: PDB Structure Download

Purpose: Download PDB files from the CATH database

Main functions:

download_cath_domain(domain_id, output_dir)
- Downloads a single CATH domain
- Parameters: domain ID, output directory
- Returns: Path to downloaded file or None
download_cath_domains_parallel(domain_list, output_dir, max_workers=4)
- Parallel download of multiple domains
- Uses ThreadPoolExecutor for parallelization
- Progress bar with tqdm
validate_pdb_file(filepath)
- Validates PDB file integrity
- Checks format and structure
- Returns: True/False
get_cath_domain_list(cath_version='current')
- Gets list of CATH domains
- Filters only "Mainly Alpha" class
- Returns: List of IDs

Process:

Gets list of mainly alpha-helix domains from CATH
Creates output directory if it doesn't exist
Downloads PDB files in parallel (4 threads by default)
Validates each downloaded file
Records download statistics (success/failure)

Output:

PDB files in pdb_files/
Download log in logs/download.log

Cell 2: Structure Cleaning

Purpose: Clean and standardize PDB files

Main functions:

remove_heteroatoms(structure)
- Removes heteroatoms (water, ligands)
- Keeps only protein atoms
- Returns: Clean structure
filter_chains(structure, chain_ids=None)
- Filters specific chains
- If None, keeps all
- Returns: Filtered structure
standardize_residues(structure)
- Standardizes residue names
- Removes non-standard residues
- Returns: Standardized structure
clean_pdb_file(input_path, output_path)
- Main cleaning function
- Applies all filters
- Saves clean file
batch_clean_pdb(input_dir, output_dir)
- Cleans multiple files
- Batch processing
- Progress bar

Process:

Reads original PDB file
Removes heteroatoms (water, ions, ligands)
Filters chains (if specified)
Standardizes residue names
Removes modified residues
Saves clean file
Validates output file

Cleaning criteria:

Removes HETATM
Keeps only standard amino acids (20 types)
Removes residues with occupancy < 0.5
Removes atoms with B-factor > 100

Output:

Clean PDB files in cleaned_pdb/
Cleaning report in logs/cleaning.log

Cell 3: Amino Acid Frequency Analysis

Purpose: Analyze amino acid composition in structures

Main functions:

count_amino_acids(pdb_file)
- Counts frequency of each amino acid
- Returns: Dictionary {aa: count}
calculate_aa_statistics(pdb_dir)
- Aggregate statistics from all files
- Mean, standard deviation, percentiles
- Returns: DataFrame with statistics
plot_aa_distribution(stats_df, output_path)
- Bar plot of distribution
- Custom colors by property (hydrophobic, polar, etc.)
- Saves high-resolution figure

Analyses performed:

Absolute frequency of each amino acid
Relative frequency (%)
Comparison between structures
Identification of most/least common amino acids
Analysis by physicochemical property:
- Hydrophobic (A, V, L, I, M, F, W, P)
- Polar (S, T, N, Q, C, Y)
- Positively charged (K, R, H)
- Negatively charged (D, E)
- Special (G, P)

Visualizations:

Bar plot: general distribution
Heatmap: comparison between structures
Box plot: variability

Output:

output/amino_acid_freq.csv: Frequency table
output/aa_statistics.csv: Aggregate statistics
output/figures/aa_distribution.png: Main plot
output/figures/aa_heatmap.png: Comparative heatmap

Cell 4: Advanced DSSP Analysis

Purpose: Detailed secondary structure analysis using DSSP

Main functions:

run_dssp(pdb_file)
- Runs DSSP on PDB file
- Returns: DSSP object with annotations
parse_dssp_output(dssp_result)
- Parses DSSP output
- Extracts secondary structure
- Returns: Structured DataFrame
identify_helices(dssp_df)
- Identifies alpha-helix segments
- Filters by minimum length
- Returns: List of helices [(start, end, length)]
calculate_helix_metrics(helix_segment, structure)
- Calculates geometric metrics
- Length, angles, twist
- Returns: Dict with metrics
analyze_helix_geometry(pdb_file)
- Complete geometric analysis
- Helix axis, radius, pitch
- Returns: DataFrame with geometry

Calculated metrics:

For each detected alpha-helix:

Length: Number of residues
Phi/psi angles: Dihedral angles
Rise per residue: Advance per residue (ideal 3.6 Å)
Twist: Rotation per residue (~100° ideal)
Radius: Helix radius (~2.3 Å ideal)
Pitch: Helix pitch (~5.4 Å ideal)
RMSD: Deviation from ideal helix
Regularity: Structural regularity score

Identified secondary structures:

H: alpha-helix
G: 3₁₀-helix
I: pi-helix
E: beta-sheet
B: beta-bridge
T: Turn
S: Bend
C: Coil/loop

Visualizations:

Helix length distribution
Ramachandran plots (phi/psi)
Geometric metrics distribution
Comparison with ideal values

Output:

output/dssp_analysis.csv: Complete DSSP analysis
output/helix_metrics.csv: Metrics for each helix
output/helix_geometry.csv: Detailed geometry
output/figures/helix_length_dist.png: Length distribution
output/figures/ramachandran.png: Ramachandran plot
output/figures/geometry_metrics.png: Geometric metrics

Cell 5: Helix Type Classification

Purpose: Classify alpha-helices into specific types

Main functions:

classify_helix_type(helix_metrics)
- Classifies helix type based on metrics
- alpha-helix, 3₁₀-helix, pi-helix, irregular
- Returns: Helix type
identify_helix_kinks(helix_segment, threshold=20)
- Detects kinks (bends) in helices
- Threshold: deviation angle in degrees
- Returns: List of kink positions
calculate_helix_stability(helix_segment)
- Estimates helix stability
- Based on hydrogen bonds
- Returns: Stability score (0-1)
find_helix_capping_residues(helix_segment)
- Identifies capping residues (N-cap and C-cap)
- Important for stability
- Returns: {n_cap: residue, c_cap: residue}
analyze_helix_surface(helix_segment)
- Analyzes solvent exposure
- Identifies hydrophobic/hydrophilic faces
- Returns: Dict with surface analysis
detect_helix_helix_interactions(structure)
- Detects interactions between helices
- Packing, crossing angles
- Returns: List of interacting pairs

Classified helix types:

Canonical alpha-helix
- Rise: ~1.5 Å/residue
- Twist: ~100°/residue
- 3.6 residues/turn
3₁₀-helix
- Rise: ~2.0 Å/residue
- Twist: ~120°/residue
- 3.0 residues/turn
- Tighter
Pi-helix
- Rise: ~1.2 Å/residue
- Twist: ~87°/residue
- 4.4 residues/turn
- Wider
Irregular helix
- Significant deviations from patterns
- May contain kinks
- RMSD > 1.0 Å

Special analyses:

Helix dipole: Helix dipole moment
Capping motifs: Terminal residue patterns
Hydrophobic moment: Hydrophobic moment
Helix-helix packing: Packing geometry

Visualizations:

Helix type distribution
Kink map along sequence
Capping residue analysis
Solvent exposure profile
Helix-helix interaction network

Output:

output/helix_classification.csv: Classification of each helix
output/helix_types_summary.csv: Summary by type
output/kinks_analysis.csv: Kink analysis
output/capping_residues.csv: Capping residues
output/helix_interactions.csv: Helix interactions
output/figures/helix_types_pie.png: Type distribution
output/figures/kinks_heatmap.png: Kink map
output/figures/interaction_network.png: Interaction network

Main Functions

By Category

Download and I/O

download_cath_domain(): Individual download
download_cath_domains_parallel(): Parallel download
validate_pdb_file(): File validation
get_cath_domain_list(): Domain list

Cleaning

remove_heteroatoms(): Remove non-protein atoms
filter_chains(): Filter chains
standardize_residues(): Standardize residues
clean_pdb_file(): Complete cleaning

Sequence Analysis

count_amino_acids(): Count amino acids
calculate_aa_statistics(): Aggregate statistics

Structural Analysis (DSSP)

run_dssp(): Run DSSP
parse_dssp_output(): Parse results
identify_helices(): Identify helices
calculate_helix_metrics(): Geometric metrics
analyze_helix_geometry(): Detailed geometry

Classification

classify_helix_type(): Classify types
identify_helix_kinks(): Detect kinks
calculate_helix_stability(): Stability
find_helix_capping_residues(): Capping
detect_helix_helix_interactions(): Interactions

Visualization

plot_aa_distribution(): Amino acid distribution
plot_helix_length_distribution(): Lengths
plot_ramachandran(): Ramachandran plot
plot_helix_types(): Helix types
plot_interaction_network(): Interaction network

Scientific Applications

This pipeline can be used for:

Structural research: Characterization of alpha-helices in different contexts
Comparative analysis: Compare properties between protein families
Structure prediction: Validate computational models
Protein design: Inform rational design
Education: Demonstrate protein structure concepts

Performance

Download: ~100 structures/minute (4 threads)
Cleaning: ~50 structures/minute
DSSP analysis: ~20 structures/minute
Classification: ~30 structures/minute

For large datasets (>1000 structures), it is recommended to:

Increase MAX_WORKERS to 8-16 (if you have enough CPU)
Run on machine with SSD
Use at least 8GB RAM

Contributing

Contributions are welcome! Please:

Fork the project
Create a feature branch (git checkout -b feature/NewAnalysis)
Commit your changes (git commit -m 'Add new analysis of X')
Push to the branch (git push origin feature/NewAnalysis)
Open a Pull Request

License

This project is under the MIT License. See the LICENSE file for more details.

Author

Madson de Luna

GitHub: @madsondeluna

References

CATH Database: http://www.cathdb.info/
DSSP: Kabsch W, Sander C (1983). "Dictionary of protein secondary structure"
BioPython: Cock et al. (2009). "Biopython: freely available Python tools"

Updates

Current Version

Complete functional pipeline
42 implemented functions
9 visualization types
Support for parallel processing

Planned Features

AlphaFold integration
Molecular dynamics analysis
Web interface
REST API
Export to additional formats (PyMOL, Chimera)

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
cath-protocol.ipynb		cath-protocol.ipynb
cath_protocol_v1.ipynb		cath_protocol_v1.ipynb

Folders and files

Latest commit

History

Repository files navigation

CATH Mainly Alpha - Protein Structure Analysis

Table of Contents

Overview

Project Statistics

Notebook Structure

Cell 0: Environment Setup (50 lines)

Cell 1: PDB Structure Download (279 lines, 7 functions)

Cell 2: Structure Cleaning (322 lines, 8 functions)

Cell 3: Amino Acid Frequency Analysis (156 lines, 3 functions)

Cell 4: Advanced DSSP Analysis (558 lines, 13 functions)

Cell 5: Helix Type Classification (597 lines, 11 functions)

Requirements

Python Dependencies

External Tools

Installation

Usage

Basic Execution

Configuration

Output Structure

Section Details

Cell 0: Environment Setup

Cell 1: PDB Structure Download

Cell 2: Structure Cleaning

Cell 3: Amino Acid Frequency Analysis

Cell 4: Advanced DSSP Analysis

Cell 5: Helix Type Classification

Main Functions

By Category

Download and I/O

Cleaning

Sequence Analysis

Structural Analysis (DSSP)

Classification

Visualization

Scientific Applications

Performance

Contributing

License

Author

References

Updates

Current Version

Planned Features

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages