
Commit 195da2c: Added ESM-2 language model for larger peptide generation
Parent: fb4cb28
9 files changed, +265 / -39 lines

.DS_Store
Binary file (0 Bytes) not shown.

CLAUDE.md

Lines changed: 99 additions & 0 deletions (new file)

# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a bioinformatics tool for generating control peptides used in neoantigen analysis and benchmarking with pVACtools. The project targets cancer immunotherapy research by providing high-quality reference peptide sets.

## Core Architecture

The project consists of two main components:

1. **CLI Tool** (`generate_control_peptides.py`) - Main peptide generation engine
2. **GUI Interface** (`peptide_gui.py`) - User-friendly wrapper using PySimpleGUI

### Peptide Generation Methods

Three generation strategies are implemented as separate functions:

- `generate_random_peptides()` - Random amino acid sequences from the 20 standard amino acids (see the sketch after this list)
- `sample_peptides_from_fasta()` - Extracts subsequences from existing protein sequences
- `generate_llm_peptides()` - Uses HuggingFace protein language models (ProtGPT2 or ESM-2) for biologically plausible sequences
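The random strategy is the simplest of the three. A minimal sketch of what it amounts to, assuming the 20-letter amino acid alphabet the CLI keeps in its `AMINO_ACIDS` constant (the committed function may differ in details):

```python
import random
from typing import List

# The CLI defines a constant listing the 20 standard residues; shown inline here.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_random_peptides(length: int, count: int) -> List[str]:
    # Draw each residue uniformly at random from the 20-letter alphabet.
    return ["".join(random.choices(AMINO_ACIDS, k=length)) for _ in range(count)]

# generate_random_peptides(9, 3) -> e.g. ['MKVLWAGHT', 'QPNDSYRCE', 'TTIEGFHLA']
```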
### Key Design Patterns

- **Input validation**: Comprehensive parameter bounds checking (length 1-50, count 1-10M)
- **Error handling**: Graceful degradation with informative error messages
- **Reproducibility**: Random seed support for deterministic results
- **Batch processing**: Optimized generation for large peptide counts

## Development Commands

### Setup

```bash
pip install -r requirements.txt
```

### Testing the CLI

```bash
# Test random generation
python generate_control_peptides.py --length 9 --count 10 --source random --output test.fasta --seed 42

# Test FASTA sampling (requires protein sequences in data/)
python generate_control_peptides.py --length 9 --count 10 --source fasta --fasta_file data/GCF_000001405.40/protein.faa --output test.fasta

# Test ProtGPT2 generation (good for short peptides, 8-12 AA)
python generate_control_peptides.py --length 9 --count 5 --source llm --llm_model protgpt2 --temperature 1.2 --output test.fasta

# Test ESM-2 generation (better for longer peptides, 10+ AA)
python generate_control_peptides.py --length 15 --count 5 --source llm --llm_model esm2 --temperature 1.0 --output test.fasta
```

### Testing the GUI

```bash
python peptide_gui.py
```

### Code Validation

```bash
# Check syntax
python3 -m py_compile generate_control_peptides.py
python3 -m py_compile peptide_gui.py

# View help
python generate_control_peptides.py --help
```

## Important Implementation Details

### Multi-LLM Architecture
- **ProtGPT2**: 738M parameter model, token-based (~4 AA per token), good for short peptides (8-12 AA)
- **ESM-2**: 35M parameter model (`facebook/esm2_t12_35M_UR50D`), amino acid-level tokenization, better for longer peptides (10+ AA)
- Modular design allows easy addition of new protein language models
- Uses batch generation for efficiency (ProtGPT2: 50/batch, ESM-2: 10/batch)
- Includes retry logic with model-specific attempt limits
- Filters generated text to valid amino acids only (see the sketch after this list)
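To make the token budget and the filtering step concrete, a small worked sketch (the raw string below is illustrative, not real model output):

```python
# ProtGPT2 tokens cover roughly 4 amino acids each, so the token budget for an
# N-residue peptide is about N/4, with a floor of 5 tokens.
length = 9
max_length = max(5, (length + 3) // 4)   # (9 + 3) // 4 = 3, so the floor of 5 applies

# Generated text is reduced to the 20 standard residues before the length check;
# whitespace, newlines and non-standard letters are dropped.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
gen_text = "M K\nVLW AGXZ"               # illustrative raw output
pep = "".join(c for c in gen_text if c in AMINO_ACIDS)
print(pep)                               # MKVLWAG -> kept only if len(pep) == length
```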
### GUI-CLI Communication

The GUI builds the command-line arguments and executes the CLI tool via subprocess, streaming its output to the GUI text box in real time.
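`peptide_gui.py` is not part of this diff, so the following is only a sketch of that pattern; the `run_cli()` helper and the `append_output` callback are hypothetical names:

```python
import subprocess
import sys

def run_cli(cli_args, append_output):
    """Run the peptide CLI and stream its output to a GUI callback line by line."""
    cmd = [sys.executable, "generate_control_peptides.py"] + list(cli_args)
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave warnings with regular output
        text=True,
    )
    for line in proc.stdout:       # lines arrive as the CLI prints them
        append_output(line.rstrip())
    return proc.wait()

# e.g. run_cli(["--source", "random", "--length", "9", "--count", "10",
#               "--output", "test.fasta", "--seed", "42"], print)
```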
### Input Validation Bounds

- Peptide length: 1-50 amino acids
- Count: 1-10,000,000 peptides
- LLM parameters: temperature > 0, top_k ≥ 1, top_p in (0, 1], repetition_penalty > 0
- Model selection: protgpt2 (recommended for ≤12 AA), esm2 (recommended for ≥10 AA)
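These bounds are enforced in `main()`. Condensed into one helper for reference (a restatement, not the committed code, which prints to stderr and calls `sys.exit(1)`):

```python
def validate(args):
    # Mirrors the documented bounds; argparse has already type-converted the values.
    if not 1 <= args.length <= 50:
        raise SystemExit("Error: Peptide length must be between 1 and 50")
    if not 1 <= args.count <= 10_000_000:
        raise SystemExit("Error: Count must be between 1 and 10,000,000 peptides")
    if args.source == "llm":
        if args.temperature <= 0:
            raise SystemExit("Error: Temperature must be positive")
        if args.top_k < 1:
            raise SystemExit("Error: top_k must be at least 1")
        if not 0 < args.top_p <= 1:
            raise SystemExit("Error: top_p must be in (0, 1]")
        if args.repetition_penalty <= 0:
            raise SystemExit("Error: Repetition penalty must be positive")
```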
### File Formats

- Input: FASTA format (`.fasta`, `.faa`) for protein sequences
- Output: FASTA format with headers like `>peptide_1`, `>peptide_2`, etc.
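A minimal sketch of the output convention (the actual writer in the CLI may differ in details):

```python
def write_fasta(peptides, path):
    # One record per peptide, with headers numbered sequentially from 1.
    with open(path, "w") as handle:
        for i, pep in enumerate(peptides, start=1):
            handle.write(f">peptide_{i}\n{pep}\n")

# write_fasta(["ACDEFGHIK", "LMNPQRSTV"], "control_peptides.fasta")
```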
## Dependencies

Core dependencies are specified in `requirements.txt`:

- PySimpleGUI (GUI framework)
- transformers (protein language models)
- torch (PyTorch backend)
- sentencepiece (tokenization support)

Beyond these requirements, the project uses only the Python standard library.
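A `requirements.txt` consistent with this list would look roughly like the following; any version pins used in the repository are not shown in this commit:

```
PySimpleGUI
transformers
torch
sentencepiece
```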

README.md

Lines changed: 12 additions & 2 deletions

@@ -9,7 +9,7 @@ This project provides a robust, reproducible tool for generating large, high-qua
 - **Three peptide generation modes:**
   - Random (from 20 amino acids)
   - Sampled from a user-supplied FASTA file
-  - Generated by a protein language model (ProtGPT2 via HuggingFace Transformers)
+  - Generated by protein language models (ProtGPT2 or ESM-2 via HuggingFace Transformers)
 - **FASTA output** compatible with pVACtools
 - **Command-line interface** for batch processing
 - **Simple GUI** (using PySimpleGUI) for non-technical users

@@ -32,7 +32,17 @@ pip install -r requirements.txt
 ### Command Line

 ```bash
-python generate_control_peptides.py --mode [random|fasta|protGPT2] --length 9 --count 1000000 --output peptides.fasta [--fasta_file input.fasta] [--temperature 1.0] [--top_k 50]
+# Random generation
+python generate_control_peptides.py --source random --length 9 --count 1000 --output peptides.fasta --seed 42
+
+# FASTA sampling
+python generate_control_peptides.py --source fasta --length 9 --count 1000 --fasta_file input.fasta --output peptides.fasta
+
+# LLM generation (ProtGPT2 - good for short peptides)
+python generate_control_peptides.py --source llm --llm_model protgpt2 --length 9 --count 100 --output peptides.fasta
+
+# LLM generation (ESM-2 - better for longer peptides)
+python generate_control_peptides.py --source llm --llm_model esm2 --length 15 --count 100 --output peptides.fasta --temperature 1.2
 ```

 ### GUI
Two binary files changed (16.7 KB and 5.26 KB); contents not shown.

generate_control_peptides.py

Lines changed: 139 additions & 32 deletions

@@ -42,34 +42,140 @@ def sample_peptides_from_fasta(fasta_path: Path, length: int, count: int) -> Lis
         peptides.append(random.choice(all_subseqs))
     return peptides[:count]

-def generate_protgpt2_peptides(length: int, count: int, temperature: float = 1.0, top_k: int = 950, top_p: float = 0.9, repetition_penalty: float = 1.2) -> List[str]:
+def generate_llm_peptides(length: int, count: int, model_name: str = "protgpt2", temperature: float = 1.0, top_k: int = 950, top_p: float = 0.9, repetition_penalty: float = 1.2) -> List[str]:
     try:
-        from transformers import pipeline
+        from transformers import pipeline, AutoTokenizer, AutoModel, AutoModelForMaskedLM
+        import torch
+        import torch.nn.functional as F
     except ImportError:
-        print("Error: transformers package is required for ProtGPT2 generation. Please install with 'pip install transformers torch'", file=sys.stderr)
+        print("Error: transformers and torch packages are required for LLM generation. Please install with 'pip install transformers torch'", file=sys.stderr)
         sys.exit(1)
-    # Each token is ~4 amino acids, so for a peptide of length N, set max_length ≈ N/4 (rounded up)
-    max_length = max(5, (length + 3) // 4)  # ensure at least 1 token
-    protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2", framework="pt")
+    # Model configurations
+    if model_name.lower() == "protgpt2":
+        model_id = "nferruz/ProtGPT2"
+        # Each token is ~4 amino acids, so for a peptide of length N, set max_length ≈ N/4 (rounded up)
+        max_length = max(5, (length + 3) // 4)  # ensure at least 1 token
+        prompt = "<|endoftext|>"
+        use_pipeline = True
+    elif model_name.lower() == "esm2":
+        model_id = "facebook/esm2_t12_35M_UR50D"  # ESM-2 35M model (faster, smaller)
+        max_length = length + 10  # ESM works with direct amino acid sequences
+        use_pipeline = False
+    else:
+        print(f"Error: Unsupported model '{model_name}'. Supported models: protgpt2, esm2", file=sys.stderr)
+        sys.exit(1)
+
     peptides = []
     tries = 0
-    batch_size = min(50, count)  # Generate more peptides per batch for efficiency
-    while len(peptides) < count and tries < count * 10:
-        sequences = protgpt2("<|endoftext|>", max_length=max_length, do_sample=True, top_k=top_k, top_p=top_p, temperature=temperature, repetition_penalty=repetition_penalty, num_return_sequences=min(count - len(peptides), batch_size), eos_token_id=0)
-        if not sequences or not hasattr(sequences, '__iter__'):
-            tries += 1
-            continue
-        for seq in sequences:
-            if not isinstance(seq, dict):
-                continue
-            gen_text = seq.get('generated_text', '')
-            if not isinstance(gen_text, str):
+    batch_size = min(50, count) if model_name.lower() == "protgpt2" else min(10, count)
+
+    if use_pipeline:
+        # ProtGPT2 pipeline approach
+        llm_pipeline = pipeline('text-generation', model=model_id, framework="pt")
+        while len(peptides) < count and tries < count * 10:
+            sequences = llm_pipeline(prompt, max_length=max_length, do_sample=True, top_k=top_k, top_p=top_p, temperature=temperature, repetition_penalty=repetition_penalty, num_return_sequences=min(count - len(peptides), batch_size), eos_token_id=0)
+            if not sequences or not hasattr(sequences, '__iter__'):
+                tries += 1
                 continue
-            # Remove whitespace and newlines, keep only valid amino acids
-            pep = ''.join([c for c in gen_text if c in AMINO_ACIDS])
-            if len(pep) == length:
-                peptides.append(pep)
-        tries += 1
+            for seq in sequences:
+                if not isinstance(seq, dict):
+                    continue
+                gen_text = seq.get('generated_text', '')
+                if not isinstance(gen_text, str):
+                    continue
+                # Remove whitespace and newlines, keep only valid amino acids
+                pep = ''.join([c for c in gen_text if c in AMINO_ACIDS])
+                if len(pep) == length:
+                    peptides.append(pep)
+            tries += 1
+    else:
+        # ESM-2 approach using masked language modeling
+        print(f"Loading ESM-2 model (this may take a moment)...", file=sys.stderr)
+        tokenizer = AutoTokenizer.from_pretrained(model_id)
+        model = AutoModelForMaskedLM.from_pretrained(model_id)
+
+        while len(peptides) < count and tries < count * 10:  # Reduce tries for efficiency
+            try:
+                for _ in range(min(count - len(peptides), batch_size)):
+                    # Simple approach: create a sequence with random masks
+                    # Start with a random seed and mask 40% of positions
+                    seed_length = max(1, length // 3)
+                    seed_seq = "".join(random.choices(AMINO_ACIDS, k=seed_length))
+
+                    # Create masked sequence: seed + masks for remaining positions
+                    remaining_masks = length - seed_length
+                    masked_sequence = seed_seq + "<mask>" * remaining_masks
+
+                    # Tokenize
+                    inputs = tokenizer(masked_sequence, return_tensors="pt", max_length=512, truncation=True)
+
+                    with torch.no_grad():
+                        outputs = model(**inputs)
+                        predictions = outputs.logits
+
+                    # Apply temperature
+                    predictions = predictions / temperature
+
+                    # Find mask positions and generate amino acids
+                    mask_token_id = tokenizer.mask_token_id
+                    input_ids = inputs.input_ids[0]
+
+                    generated_sequence = []
+                    token_idx = 0
+
+                    for token_id in input_ids:
+                        if token_id == mask_token_id:
+                            # Get prediction for this mask
+                            logits = predictions[0, token_idx]
+
+                            # Simple top-k sampling
+                            if top_k > 0 and top_k < logits.size(-1):
+                                top_k_logits, top_k_indices = torch.topk(logits, top_k)
+                                # Sample from top-k
+                                probs = F.softmax(top_k_logits, dim=-1)
+                                sampled_idx = torch.multinomial(probs, 1).item()
+                                sampled_token_id = top_k_indices[sampled_idx].item()
+                            else:
+                                # Sample from full distribution
+                                probs = F.softmax(logits, dim=-1)
+                                sampled_token_id = torch.multinomial(probs, 1).item()
+
+                            sampled_token = tokenizer.decode([sampled_token_id])
+
+                            # Only add valid amino acids
+                            if sampled_token in AMINO_ACIDS:
+                                generated_sequence.append(sampled_token)
+                            else:
+                                # Fallback to random amino acid
+                                generated_sequence.append(random.choice(AMINO_ACIDS))
+                        else:
+                            # Keep original token if it's an amino acid
+                            original_token = tokenizer.decode([token_id])
+                            if original_token in AMINO_ACIDS:
+                                generated_sequence.append(original_token)
+
+                        token_idx += 1
+
+                    # Create final peptide
+                    final_peptide = "".join(generated_sequence)
+
+                    # Ensure exact length
+                    if len(final_peptide) >= length:
+                        final_peptide = final_peptide[:length]
+                    elif len(final_peptide) < length:
+                        # Pad with random amino acids if too short
+                        final_peptide += "".join(random.choices(AMINO_ACIDS, k=length - len(final_peptide)))
+
+                    if len(final_peptide) == length:
+                        peptides.append(final_peptide)
+
+            except Exception as e:
+                print(f"Warning: ESM-2 generation error: {e}", file=sys.stderr)
+                # Fallback to random generation for this attempt
+                fallback_peptide = "".join(random.choices(AMINO_ACIDS, k=length))
+                peptides.append(fallback_peptide)
+
+            tries += 1
     if len(peptides) < count:
         print(f"Warning: Only generated {len(peptides)} peptides of requested {count} with exact length {length}.", file=sys.stderr)
     return peptides[:count]
@@ -83,14 +189,15 @@ def main():
     parser = argparse.ArgumentParser(description="Generate control peptides for neoantigen analysis.")
     parser.add_argument('--length', type=int, required=True, help='Peptide length (e.g., 8, 9, 10)')
     parser.add_argument('--count', type=int, required=True, help='Number of peptides to generate')
-    parser.add_argument('--source', choices=['random', 'fasta', 'protgpt2'], required=True, help='Source of peptides: random, fasta, or protgpt2')
+    parser.add_argument('--source', choices=['random', 'fasta', 'llm'], required=True, help='Source of peptides: random, fasta, or llm')
+    parser.add_argument('--llm_model', choices=['protgpt2', 'esm2'], default='protgpt2', help='LLM model to use for generation (protgpt2 or esm2)')
     parser.add_argument('--fasta_file', type=Path, help='Path to reference FASTA file (required if source is fasta)')
     parser.add_argument('--output', type=Path, default=Path('control_peptides.fasta'), help='Output FASTA file')
-    parser.add_argument('--seed', type=int, help='Random seed for reproducibility (not used for protgpt2)')
-    parser.add_argument('--temperature', type=float, default=1.0, help='Temperature for ProtGPT2 generation (higher = more random)')
-    parser.add_argument('--top_k', type=int, default=950, help='Top-k sampling for ProtGPT2')
-    parser.add_argument('--top_p', type=float, default=0.9, help='Top-p (nucleus) sampling for ProtGPT2')
-    parser.add_argument('--repetition_penalty', type=float, default=1.2, help='Repetition penalty for ProtGPT2')
+    parser.add_argument('--seed', type=int, help='Random seed for reproducibility (not used for llm models)')
+    parser.add_argument('--temperature', type=float, default=1.0, help='Temperature for LLM generation (higher = more random)')
+    parser.add_argument('--top_k', type=int, default=950, help='Top-k sampling for LLM')
+    parser.add_argument('--top_p', type=float, default=0.9, help='Top-p (nucleus) sampling for LLM')
+    parser.add_argument('--repetition_penalty', type=float, default=1.2, help='Repetition penalty for LLM')
     args = parser.parse_args()

     # Input validation

@@ -100,7 +207,7 @@ def main():
     if args.count < 1 or args.count > 10000000:
         print('Error: Count must be between 1 and 10,000,000 peptides', file=sys.stderr)
         sys.exit(1)
-    if args.source == 'protgpt2':
+    if args.source == 'llm':
         if args.temperature <= 0:
             print('Error: Temperature must be positive', file=sys.stderr)
             sys.exit(1)

@@ -114,7 +221,7 @@ def main():
         print('Error: Repetition penalty must be positive', file=sys.stderr)
         sys.exit(1)

-    if args.seed is not None and args.source != 'protgpt2':
+    if args.seed is not None and args.source != 'llm':
         random.seed(args.seed)

     if args.source == 'random':

@@ -124,8 +231,8 @@ def main():
             print('Error: --fasta_file is required when source is fasta', file=sys.stderr)
             sys.exit(1)
         peptides = sample_peptides_from_fasta(args.fasta_file, args.length, args.count)
-    elif args.source == 'protgpt2':
-        peptides = generate_protgpt2_peptides(args.length, args.count, args.temperature, args.top_k, args.top_p, args.repetition_penalty)
+    elif args.source == 'llm':
+        peptides = generate_llm_peptides(args.length, args.count, args.llm_model, args.temperature, args.top_k, args.top_p, args.repetition_penalty)
     else:
         print(f"Unknown source: {args.source}", file=sys.stderr)
         sys.exit(1)
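The renamed function keeps the old call shape and adds a `model_name` argument ahead of the sampling parameters. Called directly (outside the CLI), usage would look roughly like this:

```python
from generate_control_peptides import generate_llm_peptides

# ProtGPT2: short peptides via the text-generation pipeline
nonamers = generate_llm_peptides(length=9, count=5, model_name="protgpt2", temperature=1.2)

# ESM-2: longer peptides via iterative mask filling
fifteen_mers = generate_llm_peptides(length=15, count=5, model_name="esm2", temperature=1.0)
```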
