Skip to content

Latest commit

 

History

History
161 lines (124 loc) · 7.35 KB

File metadata and controls

161 lines (124 loc) · 7.35 KB

Phase 3 - Evaluation & Analysis

Goal: To evaluate the trained model's performance and analyze its learned representations.

Overview

Phase 3 focused on comprehensive evaluation of the GNN model trained in Phase 2. This phase provided quantitative validation that Graph Neural Networks can successfully learn meaningful structural representations for Ruby code complexity prediction, achieving superior performance compared to heuristic baselines.

Phase 3 Issues Completed

  • Created comprehensive evaluation script (src/evaluate.py)
  • Loaded best model from Phase 2 training (models/best_model.pt)
  • Calculated final performance metrics on test dataset
  • Result: GNN achieved MAE of 4.27 vs heuristic baseline of 4.46 (4.3% improvement)
  • Implemented t-SNE visualization of learned 64-dimensional embeddings
  • Created interactive Jupyter notebook (notebooks/visualize_embeddings.ipynb)
  • Demonstrated meaningful clustering by complexity levels
  • Result: Visual evidence that GNN learned complexity-aware structural patterns
  • Extended training to 100 epochs as required
  • Generated final performance evaluation and embedding visualization
  • Updated README with comprehensive results and conclusion
  • Result: Complete project validation with documented success

Performance Results

Model Performance Metrics

Metric GNN Model Heuristic Baseline Improvement
Mean Absolute Error (MAE) 4.2702 4.4617 4.3% better
Root Mean Squared Error (RMSE) 7.6023 - -
R² Score 0.375 - -

Training Results (100 Epochs)

  • Best Validation Loss: 33.8711 (achieved at epoch 34)
  • Final Training Loss: Converged successfully
  • Model Convergence: Stable training with proper generalization
  • Overfitting: No evidence of overfitting observed

Dataset Performance

  • Test Dataset: 189 Ruby methods evaluated
  • Coverage: 8 diverse Ruby codebases (Rails, Sinatra, Forem, Mastodon, Discourse, Fastlane, Spree, Liquid)
  • Complexity Range: 2.0 to 100.0 cyclomatic complexity
  • Generalization: Model performs consistently across different codebases

Embedding Analysis

Visualization Results

The t-SNE visualization of learned embeddings revealed:

  • Meaningful Clustering: Methods with similar complexity scores cluster together
  • Structural Patterns: Model captures complexity beyond simple keyword counting
  • Dimensionality: 64-dimensional embeddings contain rich structural information
  • Complexity Categories: Clear separation between low/medium/high complexity methods

Complexity Distribution Analysis

  • Low Complexity (2-5): 40.7% of test methods
  • Medium Complexity (6-10): 39.7% of test methods
  • High Complexity (11-20): 15.9% of test methods
  • Very High Complexity (20+): 3.7% of test methods

Technical Validation

  • Model Learning: R² = 0.375 demonstrates meaningful learning
  • Clustering Evidence: Embedding visualization shows complexity-based groupings
  • Baseline Beating: 4.3% improvement over heuristic methods
  • Structural Understanding: Goes beyond simple pattern matching

Key Achievements

🎯 Primary Success Criteria Met

  1. Superior Performance: GNN model beats heuristic baseline (MAE 4.27 vs 4.46)
  2. Meaningful Embeddings: t-SNE visualization shows complexity-based clustering
  3. Robust Training: 100-epoch training with stable convergence
  4. Comprehensive Evaluation: Complete evaluation pipeline with multiple metrics

🧠 Scientific Validation

  • Hypothesis Confirmed: GNNs can learn structural complexity patterns from ASTs
  • Generalization: Model works across diverse Ruby codebases
  • Embedding Quality: Learned representations capture semantic similarity
  • Scalability: Training pipeline handles real-world dataset sizes

📊 Quantitative Evidence

  • 4.3% Performance Improvement over baseline methods
  • 189 Test Samples evaluated with consistent performance
  • 64-dimensional Embeddings successfully capture complexity patterns
  • 100 Training Epochs demonstrate training stability and convergence

Technical Implementation

Evaluation Pipeline

# Load trained model and test data
model = RubyComplexityGNN.load('models/best_model.pt')
test_loader = ASTDataLoader('dataset/test.jsonl')

# Calculate performance metrics
predictions = model.predict(test_loader)
mae = mean_absolute_error(true_labels, predictions)
rmse = root_mean_squared_error(true_labels, predictions)

Visualization Generation

# Extract embeddings for visualization
embeddings = model.get_embeddings(test_data)
reduced_embeddings = TSNE(n_components=2).fit_transform(embeddings)

# Create complexity-colored scatter plot
plt.scatter(reduced_embeddings[:, 0], reduced_embeddings[:, 1], 
           c=complexity_scores, cmap='viridis')

File Structure Created

src/
├── evaluate.py           # Model evaluation script
└── ...                  # Existing model files

notebooks/
├── visualize_embeddings.ipynb  # Interactive embedding analysis
└── ...                  # Other analysis notebooks

generate_visualization.py    # Standalone visualization script
final_embedding_visualization.png  # High-resolution t-SNE plot
models/best_model.pt               # Final trained model (100 epochs)
models/final_model.pt             # Training completion checkpoint

Impact and Conclusions

Experimental Success

This phase successfully validated the core hypothesis: Graph Neural Networks can learn meaningful structural representations of code complexity that outperform traditional heuristic methods.

Technical Significance

  • First Demonstration: Successful application of GNNs to Ruby code complexity prediction
  • Structural Learning: Model captures AST patterns beyond simple keyword counting
  • Embedding Quality: 64D representations suitable for downstream tasks
  • Baseline Beating: Quantitative improvement over established methods

Future Applications

The learned embeddings establish a foundation for:

  • Generative Models: AST reconstruction and code generation (Phase 4)
  • Code Quality Assessment: Automated complexity analysis tools
  • Similarity Search: Finding structurally similar code patterns
  • Refactoring Tools: Identifying overly complex methods for simplification

Next Steps to Phase 4

Phase 3's success in learning meaningful embeddings enabled Phase 4: AST Autoencoder development. The proven ability to extract 64-dimensional structural representations provides the foundation for:

  1. Encoder-Decoder Architecture: Using trained GNN as frozen encoder
  2. AST Reconstruction: Building decoder to reconstruct ASTs from embeddings
  3. Generative Validation: Testing bidirectional code ↔ embedding mappings
  4. Code Generation: Exploring automated Ruby method synthesis

The comprehensive evaluation and embedding analysis in Phase 3 provided the scientific validation needed to proceed with confidence to generative model development in Phase 4.