This document describes the methodology used by the Coding-Doc-Agent to mine documentation drift events from GitHub repositories.
Documentation drift occurs when:
- Source code is modified (bug fixes, new features, refactoring)
- Corresponding documentation (docstrings, comments, external docs) is not updated
- Documentation becomes inconsistent with the actual code behavior
This creates a "drift" between what the code does and what the documentation says it does.
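For illustration, here is a hypothetical drifted docstring (the function name and behavior are invented for this example): the code was changed to take a `factor` parameter, but the docstring still describes the old hard-coded behavior.

```python
def scale(values, factor=2):
    """Multiply each value by 3."""  # drift: the code now multiplies by `factor` (default 2)
    return [v * factor for v in values]
```

A reader trusting the docstring would predict `[3, 6, 9]` for `scale([1, 2, 3])`, but the code returns `[2, 4, 6]` — that mismatch is the drift.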
The tool focuses on well-maintained scientific computing repositories:
- SciPy: Python library for scientific computing
- NumPy: Fundamental package for numerical computing in Python
These repositories are chosen because they:
- Have extensive documentation
- Maintain high code quality standards
- Have active communities that fix documentation drift
- Provide rich datasets of drift-fixing commits
The tool identifies commits that fix documentation drift by searching commit messages for specific keywords:
- `update docs`, `update documentation`, `correct docs`, `correct documentation`
- `fix docs`, `fix documentation`, `docs fix`, `documentation fix`
- `fix formula`, `fix docstring`, `fix comment`, `update comment`
- `sync comment`, `sync documentation`
These keywords indicate that a developer recognized and fixed a documentation inconsistency.
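The keyword filter above amounts to a case-insensitive substring match. A minimal sketch (function and constant names are illustrative, not the tool's actual API):

```python
# Keywords that suggest a commit fixes documentation drift
DRIFT_KEYWORDS = [
    "update docs", "update documentation", "fix docs", "fix documentation",
    "fix formula", "fix docstring", "sync comment", "sync documentation",
    "correct docs", "correct documentation", "docs fix", "documentation fix",
    "update comment", "fix comment",
]

def is_drift_fixing(commit_message: str) -> bool:
    """Return True if the message contains any drift-fixing keyword (case-insensitive)."""
    msg = commit_message.lower()
    return any(kw in msg for kw in DRIFT_KEYWORDS)
```

For example, `is_drift_fixing("DOC: Fix docstring in scipy.optimize")` matches, while a pure feature commit like `"ENH: add new solver"` does not.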
For each drift-fixing commit, the tool extracts two states:
- Before state: the code and documentation prior to the fix
  - Represents the "drift event", where documentation is inconsistent
  - Labeled as "Drifted" in the dataset
- After state: the code and documentation after the fix
  - Represents the corrected state, where documentation matches the code
  - Labeled as "Consistent" in the dataset
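In Git terms, the "before" state is the file at the fixing commit's first parent, and the "after" state is the file at the commit itself. A sketch using the GitHub contents API (the helper names here are illustrative, not the tool's actual functions):

```python
import urllib.request

def parent_sha(commit: dict) -> str:
    """The 'before' state lives at the first parent of the fixing commit."""
    return commit["parents"][0]["sha"]

def fetch_file_at(repo: str, path: str, ref: str) -> str:
    """Fetch raw file content at a specific ref via the GitHub contents API."""
    url = f"https://api.github.com/repos/{repo}/contents/{path}?ref={ref}"
    req = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github.raw+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def before_after(repo: str, commit: dict, path: str) -> tuple[str, str]:
    """Return (drifted, consistent) versions of one modified file."""
    return (
        fetch_file_at(repo, path, parent_sha(commit)),  # before: "Drifted"
        fetch_file_at(repo, path, commit["sha"]),       # after: "Consistent"
    )
```

Merge commits have multiple parents; using the first parent follows the mainline history, which is the usual convention for mining tools.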
The tool extracts structured segments from modified files:
- Function definitions with their docstrings
- Class definitions with their docstrings
- Method definitions within classes
- Context lines of actual implementation code
- Parse the file line by line
- Identify function/class definitions (`def` or `class` keywords)
- Extract the associated docstring (`"""` or `'''` delimited)
- Include surrounding code context (up to 10 lines)
- Create structured segment with metadata:
- filename
- start line number
- code block
- documentation block
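The segmentation steps above can be sketched with Python's standard `ast` module instead of line-by-line parsing (a simplification of what the tool does; the function name is illustrative):

```python
import ast

def extract_segments(filename: str, source: str, context: int = 10) -> list[dict]:
    """Emit one segment per function/class: its docstring plus up to
    `context` lines of implementation, with the metadata fields the
    dataset schema expects."""
    lines = source.splitlines()
    segments = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno  # 1-based line of the def/class statement
            segments.append({
                "filename": filename,
                "start_line": start,
                "code": "\n".join(lines[start - 1 : start - 1 + context]),
                "documentation": ast.get_docstring(node) or "",
            })
    return segments
```

`ast.walk` also visits methods nested inside classes, which covers the "method definitions within classes" case without extra code.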
Each drift event is stored with the following structure:
```python
{
    'repository': 'scipy/scipy',
    'commit_sha': 'abc123...',
    'commit_message': 'DOC: Fix formula in linear_model',
    'commit_date': '2024-01-15T10:30:00',
    'author': 'John Doe',
    'file': 'scipy/optimize/linear_model.py',
    'patch': '--- a/file\n+++ b/file\n...',
    'before_segments': [
        {
            'filename': 'linear_model.py',
            'start_line': 42,
            'code': 'def fit(self, X, y):...',
            'documentation': '"""Old incorrect docs"""'
        }
    ],
    'after_segments': [
        {
            'filename': 'linear_model.py',
            'start_line': 42,
            'code': 'def fit(self, X, y):...',
            'documentation': '"""New correct docs"""'
        }
    ]
}
```

This methodology enables several research directions:
- What types of documentation drift are most common?
- Which parts of documentation (parameters, returns, examples) drift most?
- How long does drift persist before being fixed?
- Train models to detect drift automatically
- Learn to generate corrected documentation
- Predict which code changes require doc updates
- Build linters that detect documentation drift
- Create IDE plugins that warn about potential drift
- Develop automated documentation update tools
- Measure documentation quality in repositories
- Track drift rates over time
- Compare documentation practices across projects
- Language Support: Currently focuses on Python files
- Simple Heuristics: Keyword-based detection may miss some drift events
- Context Window: Limited code context (10 lines)
- API Rate Limits: GitHub API limits affect mining speed
- Support for C/C++/Fortran (important for NumPy/SciPy)
- More sophisticated commit classification (ML-based)
- Semantic analysis of code-documentation consistency
- Cross-repository drift pattern analysis
- Real-time drift detection during development
The methodology has been validated through:
- Manual inspection of extracted drift events
- Comparison with known documentation issues
- Test suite covering core functionality
- Example runs on real repositories
If you use this methodology in your research, please cite:
```bibtex
@misc{drift_mining_methodology,
  title={Documentation Drift Mining Methodology},
  author={Coding-Doc-Agent Project},
  year={2024},
  howpublished={\url{https://github.com/pranavgupta0001/Coding-Doc-Agent}}
}
```

- "Documentation Debt" - Examining Technical Debt in Documentation
- "Code Comment Quality" - Studies on maintaining code comments
- "Mining Software Repositories" - Techniques for extracting insights from version control
- NumPy/SciPy Documentation Guidelines