This guide will help you get started with the Documentation Drift Miner quickly.
- Python 3.8 or higher
- pip (Python package manager)
- GitHub account (for API token - optional but recommended)
git clone https://github.com/pranavgupta0001/Coding-Doc-Agent.git
cd Coding-Doc-Agent
pip install -r requirements.txt

Without a token, you're limited to 60 API requests per hour. With a token, you get 5,000 requests per hour.
- Go to https://github.com/settings/tokens
- Click "Generate new token (classic)"
- Give it a name like "Drift Miner"
- Select scope: public_repo (for accessing public repositories)
- Click "Generate token"
- Copy the token (you won't see it again!)
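However the token ends up being stored, a script needs a way to pick it up. The sketch below is one possible approach, not the tool's own loader: `load_github_token` is a hypothetical helper name, and the assumption is that the token lives either in the `GITHUB_TOKEN` environment variable or in a `GITHUB_TOKEN=...` line of a `.env` file.

```python
import os
from pathlib import Path


def load_github_token(env=None, dotenv_path=".env"):
    """Return the GitHub token, or None if none is configured.

    Hypothetical helper (not part of drift_miner): checks the GITHUB_TOKEN
    environment variable first, then a GITHUB_TOKEN=... line in a .env file.
    """
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        return token

    path = Path(dotenv_path)
    if not path.is_file():
        return None

    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "GITHUB_TOKEN":
            # Drop optional surrounding quotes.
            value = value.strip().strip("\"'")
            return value or None
    return None


if __name__ == "__main__":
    if load_github_token():
        print("Token configured")
    else:
        print("No token found: limited to 60 API requests per hour")
```

Checking the environment first means a value exported in the shell takes precedence over whatever is written in `.env`.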
Option A: Environment variable
export GITHUB_TOKEN="your_token_here"

Option B: .env file
cp .env.example .env
# Edit .env and add your token, or overwrite the file in one step:
echo "GITHUB_TOKEN=your_token_here" > .env

python3 drift_miner.py --repos numpy/numpy --max-commits 50

Expected output:
Mining repository: numpy/numpy
Checked 10 commits...
Found drift-fixing commit: a1b2c3d - DOC: Fix formula in mean function
Checked 20 commits...
...
Found 3 drift events in 50 commits
Results saved to drift_events.json
==================================================
SUMMARY
==================================================
Total drift events found: 3
...
python3 drift_miner.py \
--repos scipy/scipy numpy/numpy \
--max-commits 100 \
--output my_analysis.json

Create a file my_mining.py:
from drift_miner import DriftMiner
# Initialize
miner = DriftMiner()
# Mine repositories
scipy_events = miner.mine_repository('scipy/scipy', max_commits=20)
numpy_events = miner.mine_repository('numpy/numpy', max_commits=20)
# Combine results
miner.drift_events.extend(scipy_events)
miner.drift_events.extend(numpy_events)
# Save and summarize
miner.save_results('my_results.json')
summary = miner.generate_summary()
print(f"Found {summary['total_drift_events']} drift events")

Run it:
python3 my_mining.py

The tool creates a JSON file with this structure:
[
  {
    "repository": "numpy/numpy",
    "commit_sha": "abc123def456...",
    "commit_message": "DOC: Fix incorrect formula in numpy.mean",
    "commit_date": "2024-01-15T10:30:00",
    "author": "Jane Developer",
    "file": "numpy/core/fromnumeric.py",
    "before_segments": [
      {
        "filename": "fromnumeric.py",
        "start_line": 100,
        "code": "def mean(a, axis=None):\n return sum(a) / count(a)",
        "documentation": "\"\"\"Calculate mean using formula: sum/n\"\"\""
      }
    ],
    "after_segments": [
      {
        "filename": "fromnumeric.py",
        "start_line": 100,
        "code": "def mean(a, axis=None):\n return sum(a) / count(a)",
        "documentation": "\"\"\"Calculate mean using formula: Σx/n where n is count\"\"\""
      }
    ]
  }
]

- before_segments: Documentation BEFORE the fix (Drifted state)
- after_segments: Documentation AFTER the fix (Consistent state)
- commit_sha: Unique identifier to view the commit on GitHub
- patch: The actual diff showing what changed
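To see how the before and after states of a single event differ, the two `documentation` fields can be compared directly. The following is a minimal sketch using the standard-library `difflib`; `doc_diff` is a hypothetical helper, and it assumes every segment carries a `documentation` string as in the sample structure above.

```python
import difflib


def doc_diff(event):
    """Return unified-diff lines between an event's before/after documentation.

    Hypothetical helper: assumes each segment has a "documentation" string.
    """
    before = "\n".join(
        seg["documentation"] for seg in event.get("before_segments", [])
    ).splitlines()
    after = "\n".join(
        seg["documentation"] for seg in event.get("after_segments", [])
    ).splitlines()
    return list(
        difflib.unified_diff(
            before, after, fromfile="before", tofile="after", lineterm=""
        )
    )


if __name__ == "__main__":
    # Sample event trimmed down to the two fields this sketch reads.
    sample = {
        "before_segments": [
            {"documentation": '"""Calculate mean using formula: sum/n"""'}
        ],
        "after_segments": [
            {"documentation": '"""Calculate mean using formula: Σx/n where n is count"""'}
        ],
    }
    for line in doc_diff(sample):
        print(line)
```

Lines prefixed with `-` come from the drifted documentation and lines prefixed with `+` from the fixed version; an empty result means the documentation text itself did not change between the two states.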
import json
from collections import Counter

# Load results
with open('drift_events.json', 'r') as f:
    events = json.load(f)

# Print first event
print(json.dumps(events[0], indent=2))

# Count by repository
repos = Counter(e['repository'] for e in events)
print(repos)

# Find events with specific keywords
formula_fixes = [e for e in events if 'formula' in e['commit_message'].lower()]
print(f"Found {len(formula_fixes)} formula fixes")

Each event includes a commit_sha. View it on GitHub:
https://github.com/{repository}/commit/{commit_sha}
Example:
https://github.com/numpy/numpy/commit/abc123def456
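The URL pattern above can be filled in from the `repository` and `commit_sha` fields of each event. A small sketch, where `commit_url` is a hypothetical helper name rather than something the tool provides:

```python
def commit_url(event):
    """Build the GitHub commit URL for one drift event."""
    return f"https://github.com/{event['repository']}/commit/{event['commit_sha']}"


if __name__ == "__main__":
    event = {"repository": "numpy/numpy", "commit_sha": "abc123def456"}
    print(commit_url(event))
    # prints: https://github.com/numpy/numpy/commit/abc123def456
```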
Begin with a small number of commits to test:
python3 drift_miner.py --repos numpy/numpy --max-commits 10

If you hit rate limits:
Error accessing repository: 403 Forbidden
Note: This is likely due to API rate limiting. Please provide a GitHub token.
Solution: Add a GitHub token (see Step 3 above)
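To confirm that a 403 really comes from an exhausted quota, GitHub's REST API exposes a `/rate_limit` endpoint. The sketch below uses only the standard library; `fetch_rate_limit` and `seconds_until_reset` are hypothetical helper names, and the code assumes the response nests the core quota under `resources.core` with `limit`, `remaining`, and `reset` (a Unix timestamp).

```python
import json
import time
import urllib.request


def fetch_rate_limit(token=None):
    """Query GitHub's /rate_limit endpoint and return the decoded JSON payload."""
    request = urllib.request.Request("https://api.github.com/rate_limit")
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def seconds_until_reset(payload, now=None):
    """Return 0 if core requests remain, else the seconds until the quota resets."""
    core = payload["resources"]["core"]
    if core["remaining"] > 0:
        return 0
    now = time.time() if now is None else now
    return max(0, int(core["reset"] - now))


if __name__ == "__main__":
    payload = fetch_rate_limit()  # pass a token to check the authenticated quota
    core = payload["resources"]["core"]
    print(f"Remaining: {core['remaining']}/{core['limit']}")
    print(f"Seconds until reset: {seconds_until_reset(payload)}")
```

If the remaining count is zero, waiting for the reported number of seconds or supplying a token should clear the error.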
Recent commits are more likely to have accessible file content:
python3 drift_miner.py --repos numpy/numpy --max-commits 200

Mine several projects at once:
python3 drift_miner.py \
--repos scipy/scipy numpy/numpy pandas-dev/pandas \
--max-commits 50 \
--output multi_repo_analysis.json

Using PyGithub directly:
from github import Github
g = Github("your_token")
rate = g.get_rate_limit()
print(f"Remaining: {rate.core.remaining}/{rate.core.limit}")

Solution:
pip install -r requirements.txt

Solution: Add a GitHub token (see Setup section)
This is normal! Not all commits fix documentation drift. Try:
- Increasing --max-commits
- Using repositories with more documentation commits
Solution: Mine fewer commits or filter results:
import json

with open('drift_events.json', 'r') as f:
    events = json.load(f)

# Keep only events that have at least one "before" segment
filtered = [e for e in events if len(e['before_segments']) > 0]

with open('filtered_events.json', 'w') as f:
    json.dump(filtered, f, indent=2)

- Run the test suite: python3 test_drift_miner.py
- Try the example script: python3 example_usage.py
- Read the methodology: METHODOLOGY.md
- Explore the output JSON files
- Build your own analysis scripts!
- Check the README.md for detailed documentation
- Review METHODOLOGY.md for research background
- Open an issue on GitHub for bugs or feature requests
Happy mining! 🚀