Feat: Add repository documentation storage and processing with comprehensive test suite #80
Open
theycallmeswift wants to merge 7 commits into coleam00:main from
Enhanced the repository extractor validation in parse_github_repository to also check for a valid driver. Added .cursor to .gitignore to prevent tracking editor-specific files.
Added support for processing repository documentation files and storing them in Supabase alongside Neo4j code analysis. The DirectNeo4jExtractor now accepts an optional Supabase client and processes documentation using the new process_repository_docs function. Documentation chunking, metadata, and code example extraction are handled in utils.py, and the parse_github_repository tool now returns both code and documentation processing results.
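As a rough illustration of the first step described above (finding the documentation files in a cloned repository), here is a minimal sketch. It is not the code from this PR: `find_doc_files`, `DOC_EXTENSIONS`, and `SKIP_DIRS` are hypothetical names, and the list of skipped directories is an assumption. The extensions are the ones named in the PR description.

```python
from pathlib import Path

# File types named in the PR description.
DOC_EXTENSIONS = {".md", ".rst", ".txt", ".ipynb"}
# Assumption: directories that are unlikely to hold real documentation.
SKIP_DIRS = {".git", "node_modules", "__pycache__"}


def find_doc_files(repo_root):
    """Return sorted, repo-relative POSIX paths of documentation files."""
    root = Path(repo_root)
    found = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        # Skip anything nested under an ignored directory.
        if any(part in SKIP_DIRS for part in rel.parts[:-1]):
            continue
        if path.suffix.lower() in DOC_EXTENSIONS:
            found.append(rel.as_posix())
    return sorted(found)
```

Returning repo-relative paths keeps the result independent of where the repository was cloned, which should make it easier to attach stable metadata to each stored chunk.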
Replaces nbconvert with a pure Python function for converting Jupyter notebooks to markdown in documentation processing. Updates process_document_files to use the new function and adds comprehensive unit tests for documentation discovery, processing, and metadata extraction. Enhances .gitignore for test artifacts and configures test dependencies and pytest options in pyproject.toml.
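A pure Python notebook-to-markdown conversion can be quite small, because an `.ipynb` file is plain JSON with a `cells` list. The sketch below shows one possible approach, not the function added in this PR: `notebook_to_markdown` is a hypothetical name, cell outputs are deliberately dropped, and falling back to `python` as the fence language is an assumption.

```python
import json

FENCE = "`" * 3  # built at runtime so this example nests cleanly in markdown


def notebook_to_markdown(notebook_json: str) -> str:
    """Convert the JSON text of a Jupyter notebook into markdown."""
    nb = json.loads(notebook_json)
    # Assumption: use the kernelspec language if present, else "python".
    language = nb.get("metadata", {}).get("kernelspec", {}).get("language", "python")
    parts = []
    for cell in nb.get("cells", []):
        source = cell.get("source", "")
        # The notebook format allows source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        source = source.strip()
        if not source:
            continue
        cell_type = cell.get("cell_type")
        if cell_type == "markdown":
            parts.append(source)
        elif cell_type == "code":
            parts.append(f"{FENCE}{language}\n{source}\n{FENCE}")
        # Other cell types (for example "raw") are skipped in this sketch.
    return "\n\n".join(parts)
```

Dropping outputs keeps the converted text focused on prose and source code, which is usually what matters for chunking and search; a fuller converter might also keep text outputs.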
Introduces end-to-end (E2E) test support with new helpers for database and MCP server interaction, adds E2E tests for GitHub repository parsing, and splits unit and E2E test configurations. Updates pyproject.toml with new dependencies, test markers, and coverage settings. Moves and refines unit test fixtures, and adds a Makefile for common development tasks.
Refactored all knowledge_graphs module imports to use relative imports for package compatibility. Improved create_repository_source_id in utils.py to normalize both SSH and HTTPS repository URLs to a consistent format, and updated related tests for consistency. Added ruff as a development dependency and updated .gitignore and pyproject.toml for ruff support. Cleaned up test and Makefile targets to clarify unit vs. e2e tests. Minor code and logging improvements throughout for clarity and maintainability.
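To make the URL normalization idea concrete, here is a minimal sketch of mapping SSH and HTTPS repository URLs to one identifier. It does not reproduce `create_repository_source_id` from `utils.py`: the function name `normalize_repo_url` is hypothetical, and the `host/owner/repo` output format is an assumption chosen for illustration.

```python
import re


def normalize_repo_url(repo_url: str) -> str:
    """Map SSH and HTTPS forms of a repository URL to 'host/owner/repo'."""
    url = repo_url.strip()
    # scp-like SSH form: git@github.com:owner/repo.git
    match = re.match(r"^[\w.-]+@([\w.-]+):(.+)$", url)
    if match is None:
        # URL form: https://host/owner/repo(.git) or ssh://git@host/owner/repo
        match = re.match(
            r"^(?:https?|ssh|git)://(?:[^@/]+@)?([^/:]+)(?::\d+)?/(.+)$", url
        )
    if match is None:
        raise ValueError(f"Unrecognized repository URL: {repo_url!r}")
    host, path = match.groups()
    path = path.rstrip("/")
    if path.endswith(".git"):
        path = path[: -len(".git")]
    return f"{host.lower()}/{path}"
```

With this sketch, `git@github.com:owner/repo.git` and `https://github.com/owner/repo` both normalize to `github.com/owner/repo`, so records created from either URL form would share one source ID.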
Introduced construct_doc_url to standardize documentation URL creation using repository source IDs. Updated process_repository_docs to use the new function and improved formatting in utils.py. Adjusted test to patch the correct function for error handling.
This PR enhances the parse_github_repository tool with comprehensive documentation processing capabilities, storing repository documentation in Supabase for semantic search and RAG operations alongside the existing Neo4j code analysis.
✨ Key Features Added
📚 Documentation Processing
.md, .rst, .txt, and .ipynb files from GitHub repositories
🔄 Enhanced Repository Processing
🧪 Comprehensive Test Infrastructure
📁 Files Changed
src/utils.py with 500+ lines of documentation processing logic
src/crawl4ai_mcp.py to integrate documentation processing into repository parsing
pyproject.toml with new test dependencies and project structure
🚀 Benefits
🔗 Integration Points
🧪 Testing
This enhancement significantly expands the MCP server's capabilities, making it a comprehensive solution for both code analysis and documentation processing from GitHub repositories.