Commit a1718b2
Feat/general refinement (#6)
* feat/general-refinement - refactor: consolidate models and clean up project structure
- Move elicit models (ObstacleResolutionDecision, RequirementsClarification) from server.py to models.py
- Remove duplicate model definitions to follow DRY principle
- Update imports in server.py to use centralized models
- Remove PROJECT_SUMMARY.md file for cleaner project structure
- Improve code organization and maintainability
* feat/general-refinement - fix: prioritize PyPI installation and use GitHub Container Registry
- Update README installation instructions to prioritize PyPI package over git clone
- Change primary installation method to use 'uv add mcp-as-a-judge' and 'pip install mcp-as-a-judge'
- Update Docker Compose to use pre-built images from GitHub Container Registry for production
- Separate development and production Docker configurations with profiles
- Ensure all Docker instructions reference ghcr.io/hepivax/mcp-as-a-judge
- Keep git clone only for development and source builds
- Improve user experience by making package installation the default path
* feat/general-refinement - fix: remove unnecessary port configurations for MCP stdio communication
- Remove PORT and TRANSPORT build args from Dockerfile (MCP uses stdio, not HTTP)
- Remove EXPOSE directive and port mappings from Docker configurations
- Update docker-compose.yml to remove port mappings and add stdin_open/tty for stdio
- Remove nginx service (not needed for MCP servers)
- Update Docker run commands in README to use -it instead of port mappings
- Fix health check to use process check instead of HTTP endpoint
- Add note in README explaining MCP uses stdio communication
- Simplify Docker configuration for proper MCP server deployment
* feat/general-refinement - docs: add LLM-as-a-Judge concept and MCP client requirements
- Add explanation that concept derives from LLM-as-a-Judge paradigm
- Specify MCP client requirements with official documentation links:
  - Sampling capability required for AI-powered code evaluation
  - Elicitation capability required for user decision prompts
- Link to official MCP docs for sampling and elicitation concepts
- Enhance features section to reference specific MCP capabilities
- Improve clarity on technical requirements for proper functionality
* feat/general-refinement - docs: highlight main purpose of improving developer-AI interface
- Add prominent section explaining core mission to enhance developer-AI collaboration
- Emphasize preventing poor AI decisions and involving humans in critical choices
- Update main description to highlight transformation of developer-AI experience
- Add focus on intelligent AI-human collaboration with clear boundaries
- Make it clear this is about improving the interface between developers and AI assistants
- Position as solution for better AI-human workflow in software development
* feat/general-refinement - docs: add judge icons for better visual branding
- Replace 🚨 with ⚖️ in main title for better thematic representation
- Add ⚖️ to Main Purpose section header
- Update Five Powerful Tools to Five Powerful Judge Tools with ⚖️ icon
- Add ⚖️ to Concept section for consistent judge theme
- Improve visual identity and reinforce the 'judge' concept throughout README
- Create cohesive branding with scales of justice emoji
* feat/general-refinement - refactor: replace static research validation with AI-powered evaluation
- Replace hardcoded research validation logic with intelligent AI evaluation
- Embed research, plan, design, and user requirements into validation prompt
- Use LLM sampling to assess research comprehensiveness and design alignment
- Evaluate if design is properly based on research findings
- Check for exploration of existing solutions, alternatives, and best practices
- Validate research quality and actionable insights
- Provide detailed feedback on research gaps and design-research alignment
- Maintain obstacle resolution pattern for user involvement in decisions
- Improve validation accuracy and reduce false positives from static checks
* feat/general-refinement - fix: correct judge_code_change trigger and use Pydantic JSON schema
- Fix judge_code_change trigger: must be called BEFORE making any file changes
- Replace hardcoded JSON format with actual Pydantic model schema
- Use JudgeResponse.model_json_schema() for consistent response format
- Ensure proper validation timing: code review before file modification
- Improve prompt accuracy by using actual model schema instead of manual format
- Maintain consistency between expected response format and actual model structure
* feat/general-refinement - enhance: add Pragmatic Programmer principles to evaluation criteria
- Integrate key concepts from The Pragmatic Programmer book into judge prompts
- Add DRY Principle, Orthogonality, and Design by Contract evaluations
- Include Defensive Programming, Fail Fast, and Broken Windows Theory
- Add Tracer Bullets, Reversibility, and Good Enough Software principles
- Enhance with Test Early/Test Often and Premature Optimization awareness
- Include Easy to Change, Refactoring Strategy, and Plain Text Power concepts
- Add Rubber Duck Debugging and 'Use the Source, Luke' references
- Improve evaluation guidelines with pragmatic context-driven approach
- Balance perfectionism with practical software delivery principles
- Create more comprehensive and industry-standard evaluation criteria
* feat/general-refinement - enhance: add comprehensive software engineering best practices to evaluation criteria
- Integrate DRY Principle, Orthogonality, and Design by Contract evaluations
- Add Defensive Programming, Fail Fast, and Broken Windows Theory concepts
- Include Tracer Bullets, Reversibility, and Good Enough Software principles
- Enhance with Test Early/Test Often and Premature Optimization awareness
- Add Easy to Change, Refactoring Strategy, and Plain Text Power concepts
- Include Rubber Duck Debugging and authoritative source validation
- Improve evaluation guidelines with context-driven approach
- Balance perfectionism with practical software delivery principles
- Create more comprehensive and industry-standard evaluation criteria
- Focus on maintainable, working software over perfect solutions
* feat/general-refinement - fix: ensure judge_code_change is called for new file creation
- Make it explicit that judge_code_change must be called BEFORE creating ANY new files
- Add comprehensive list of file operations that require code review
- Include new Python files, configuration files, scripts, and modules
- Update parameter descriptions to clarify new file content vs modifications
- Change prompt language from 'code changes' to 'code content' for clarity
- Ensure all file operations involving code are properly validated
- Prevent creation of unreviewed code files in any format
* feat/general-refinement - enhance: make judge_code_change documentation impossible to miss
- Add prominent 🚨🚨🚨 alerts and visual emphasis for mandatory requirement
- Specify exact triggers: save-file, str-replace-editor, and other code-writing tools
- Add explicit consequences of not calling: SWE compliance violations, security risks
- Include clear example workflow showing proper usage timing
- Change from 'BEFORE' to 'IMMEDIATELY AFTER' for clarity on timing
- Add specific tool names that trigger the requirement
- Make file_path parameter required instead of optional
- Emphasize this is mandatory compliance, not optional review
- Use multiple warning levels and visual cues to prevent oversight
* feat/general-refinement - fix: replace relative imports with absolute imports and enhance pre-commit
- Replace 'from .models' with 'from mcp_as_a_judge.models' in server.py
- Replace 'from .server' with 'from mcp_as_a_judge.server' in __init__.py
- Add gitleaks security scanning to pre-commit hooks (first priority)
- Add additional pre-commit hooks for better code quality
- Ensure all imports are absolute for better maintainability
- Improve import clarity and avoid relative import issues
- Note: ruff already provides black, isort, and flake8 functionality
* feat/general-refinement - fix: apply pre-commit hook auto-fixes
- Fix trailing whitespace in multiple files
- Fix end-of-file issues in docker-compose.yml
- Apply isort import sorting to all Python files
- Apply black code formatting to 9 Python files
- Fix prettier formatting for markdown and YAML files
- All security checks passed (gitleaks found no secrets)
- Pre-commit hooks are now working correctly and enforcing quality standards
* feat/general-refinement - fix: resolve flake8 and mypy errors
- Remove poetry-check from pre-commit (we use uv, not poetry)
- Fix all flake8 D202 errors (blank lines after docstrings)
- Fix flake8 D400 error (missing period in docstring)
- Fix boolean comparison issues (== True/False -> direct boolean checks)
- Add missing return type annotations to all test functions
- Add missing docstrings to __init__ methods in conftest.py
- Extract research validation logic to reduce complexity (C901)
- Create _validate_research_quality helper function
- Replace duplicated research validation code with helper function call
- Improve code maintainability and reduce cyclomatic complexity
* feat/general-refinement - fix: resolve complexity and final flake8 errors
- Extract _evaluate_coding_plan helper function to reduce complexity
- Reduce judge_coding_plan complexity from 15 to under 10 (C901 resolved)
- Remove duplicated prompt code and use helper functions
- Fix final D202 error (blank line after docstring)
- All flake8 errors now resolved
- Improve code maintainability with better separation of concerns
- Helper functions make code more testable and reusable
* feat/general-refinement - fix: apply black formatting
- Black automatically reformatted server.py for consistent style
- All flake8 errors resolved ✅
- Gitleaks security scan passing ✅
- Code formatting and style checks passing ✅
- Only mypy type checking issues remain (expected for MCP project)
* feat/general-refinement - test: demonstrate pre-commit blocking behavior with pytest
* feat/general-refinement - test: this commit should be blocked by pre-commit
* feat/general-refinement - add: pytest to pre-commit hooks and demonstrate blocking behavior
- Add pytest hook to run tests before every commit
- Configure pytest with verbose output and short traceback
- Fix test assertions to match actual server name format
- Demonstrate pre-commit blocking with multiple hook failures
- All hooks now properly validate code quality before commits
* feat/general-refinement - refactor: move prompts to Markdown files with Jinja2 templating
✨ MAJOR REFACTORING: Externalized Prompts for Better Maintainability
🎯 **What Changed:**
- **Extracted all hardcoded prompts** to separate Markdown files in src/prompts/
- **Added Jinja2 templating** for dynamic variable substitution
- **Created PromptLoader utility** for loading and rendering templates
- **Comprehensive test coverage** for prompt loading functionality
📁 **New Structure:**
- src/prompts/judge_coding_plan.md - Main evaluation prompt
- src/prompts/judge_code_change.md - Code review prompt
- src/prompts/research_validation.md - Research quality validation
- src/mcp_as_a_judge/prompt_loader.py - Template loading utility
- tests/test_prompt_loader.py - Full test coverage
🚀 **Benefits:**
- **Easy editing**: Prompts now in readable Markdown format
- **Version control**: Track prompt changes separately from code
- **Maintainability**: No more giant f-strings in Python code
- **Flexibility**: Jinja2 templating for dynamic content
- **Testability**: Isolated prompt testing and validation
- **Collaboration**: Non-developers can edit prompts easily
✅ **Quality Assurance:**
- All existing tests pass (28/28)
- New comprehensive prompt loader tests
- Backward compatibility maintained
- No functional changes to evaluation logic
This refactoring makes the codebase much more maintainable and allows for easier prompt iteration and improvement! 🎉
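
As a rough illustration of the prompt-externalization idea described above, here is a minimal sketch of a loader that reads a Markdown template from a directory and fills in variables. It is not the project's `PromptLoader`: the class body, the `render` method, the template text, and the variable names are all assumptions made for illustration. It also swaps in the standard library's `string.Template` (`$name` placeholders) in place of Jinja2 (`{{ name }}` placeholders) so that it runs without extra dependencies.

```python
import tempfile
from pathlib import Path
from string import Template


class PromptLoader:
    """Load Markdown prompt templates from a directory and fill in variables.

    Hypothetical sketch: the real utility uses Jinja2; string.Template is
    used here only to keep the example dependency-free.
    """

    def __init__(self, prompts_dir: Path) -> None:
        self.prompts_dir = Path(prompts_dir)

    def render(self, name: str, **variables: str) -> str:
        # Read the raw Markdown template, e.g. <prompts_dir>/judge_code_change.md
        source = (self.prompts_dir / name).read_text(encoding="utf-8")
        # substitute() raises KeyError for a missing variable instead of
        # silently leaving a placeholder in the rendered prompt.
        return Template(source).substitute(**variables)


# Demo with a throwaway template file (contents are made up for this sketch).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "judge_code_change.md").write_text(
        "Review the following change to `$file_path`:\n\n$code_change\n",
        encoding="utf-8",
    )
    loader = PromptLoader(Path(tmp))
    prompt = loader.render(
        "judge_code_change.md",
        file_path="src/app.py",
        code_change="print('hi')",
    )

print(prompt)
```

Keeping the template text in a file means a prompt edit shows up as a Markdown diff rather than a change to a Python string literal.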
* feat/general-refinement - refactor prompts with perfect system/user separation and fix all mypy issues
- Reorganized prompts into system/ and user/ directories for clear separation
- System prompts contain behavioral instructions (HOW to evaluate)
- User prompts contain simple requests (WHAT to evaluate)
- Fixed all mypy type checking issues with proper annotations
- Updated pre-commit configuration for proper mypy integration
- Removed unused files (docker-compose.yml, example files)
- All tests passing (29/29) with full type safety
- Perfect separation of concerns in prompt architecture
* feat/general-refinement - Fix deterministic JSON parsing and remove exception swallowing
- Add ResearchValidationResponse Pydantic model for proper validation
- Create robust _extract_json_from_response() function to handle:
  * Markdown code blocks
  * Plain JSON objects
  * JSON embedded in explanatory text
  * Proper error handling for malformed responses
- Replace manual json.loads() + dict.get() with Pydantic model_validate_json()
- Remove exception swallowing that masked real parsing errors
- Remove inappropriate raise_obstacle suggestions from parsing errors
- Apply consistent parsing pattern to all LLM sampling functions:
  * _validate_research_quality
  * _evaluate_workflow_guidance
  * _evaluate_coding_plan
  * judge_code_change
- Add comprehensive test suite (tests/test_json_extraction.py) with 8 test cases
- Fix context injection issues by using proper Context type annotations
- All 37 tests passing, mypy clean
Resolves the "Invalid JSON expected value at line 1 column 1" error
caused by LLMs returning JSON wrapped in Markdown code blocks.
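
The extraction cases listed above (fenced Markdown, plain JSON, JSON embedded in prose, malformed input) can be sketched roughly as follows. This is an illustrative guess at the approach, not the project's `_extract_json_from_response()`: the function name, the regular expression, and the choice of `ValueError` are assumptions. The returned string could then be handed to a Pydantic model's `model_validate_json()`.

```python
import json
import re

# Matches a fenced block such as ```json ... ``` and captures its body.
_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)


def extract_json_from_response(text: str) -> str:
    """Return the first decodable JSON object in an LLM reply, as a string.

    Hypothetical helper: raises ValueError when no JSON object is found,
    so parsing failures surface instead of being swallowed.
    """
    # Try the bodies of any Markdown code fences first, then the whole reply.
    candidates = _FENCE_RE.findall(text) + [text]
    decoder = json.JSONDecoder()
    for candidate in candidates:
        # Attempt to decode an object starting at each "{" in turn.
        for match in re.finditer(r"\{", candidate):
            try:
                _, end = decoder.raw_decode(candidate, match.start())
            except json.JSONDecodeError:
                continue
            return candidate[match.start():end]
    raise ValueError("no JSON object found in response")


fenced_reply = 'Here is my verdict:\n```json\n{"approved": true}\n```'
embedded_reply = 'Sure. {"approved": false, "feedback": "add tests"} Hope that helps.'

print(extract_json_from_response(fenced_reply))
print(extract_json_from_response(embedded_reply))
```

Using `json.JSONDecoder.raw_decode` avoids hand-written brace matching: it stops at the end of the first complete value and ignores any trailing explanatory text.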
* feat/general-refinement - Improve README documentation structure
- Remove Technical Prerequisites section
- Update AI assistants section to show only supported ones in clean table format
- Change Critical Requirements to MCP Client Prerequisites with bold formatting
- Convert Five Powerful Judge Tools to List of Tools with tools emoji
- Reorganize tools section as a clean table with tool names and descriptions
- Streamline documentation for better readability and focus
* feat/general-refinement - Upgrade to Python 3.13.5 and improve coverage configuration
- Upgrade Python version from 3.12 to 3.13.5 across all configurations:
  * Update .python-version, pyproject.toml, and all GitHub workflows
  * Update Dockerfile to use python:3.13-slim base images
  * Update README badge and CONTRIBUTING.md requirements
  * Regenerate uv.lock with Python 3.13 dependencies
- Add Python 3.13+ to system prerequisites in README
- Improve coverage configuration in pyproject.toml:
  * Add comprehensive source and omit patterns
  * Configure exclude_lines for better coverage reporting
  * Set XML output configuration
- Update CI workflow for better Codecov integration:
  * Set fail_ci_if_error to false for more reliable CI
  * Add verbose output for better debugging
  * Ensure CODECOV_TOKEN environment variable is properly set
- All 37 tests passing on Python 3.13.5
- MyPy type checking clean with Python 3.13
* feat/general-refinement - Fix Dockerfile merge conflict by keeping stdio-only MCP configuration
- Remove HTTP/port-related configurations (PORT, TRANSPORT, EXPOSE)
- Keep Python 3.13-slim base images for latest Python version
- Maintain process-based health check using pgrep instead of HTTP curl
- Ensure MCP server remains stdio-only as intended for MCP protocol
- Resolve merge conflict with main branch while preserving Python 3.13 upgrade
* feat/general-refinement - Implement dynamic Docker image versioning and fix merge conflicts
- Replace hardcoded version '1.0.0' with dynamic VERSION build argument in Dockerfile
- Add VERSION build arg with 'latest' default for flexible versioning
- Update CI workflow to pass development version (dev-{commit-sha}) for test builds
- Update release workflow to pass actual tag version for production builds
- Remove HTTP/port configurations to keep MCP server stdio-only as intended
- Maintain Python 3.13-slim base images while resolving main branch conflicts
- Ensure proper version tracking across PyPI packages and Docker images
- Enable automatic versioning without manual Dockerfile updates
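
The versioning rule described above (tag version for releases, `dev-{commit-sha}` for test builds, `latest` as the default) can be expressed as a small function. This is only a sketch of the selection logic: the function name and its parameters are hypothetical, and the actual workflows may compute the value differently, for example in shell inside the CI configuration. It assumes a Git ref of the form `refs/tags/<tag>` for tagged builds.

```python
def resolve_image_version(git_ref: str, commit_sha: str) -> str:
    """Pick the value to pass as the VERSION build argument.

    Hypothetical helper, not taken from the repository's workflows.
    """
    tag_prefix = "refs/tags/"
    if git_ref.startswith(tag_prefix):
        # Release builds: use the tag name as the version.
        return git_ref[len(tag_prefix):]
    if commit_sha:
        # CI test builds: mark the image as a development build.
        return f"dev-{commit_sha}"
    # Nothing known about the build: fall back to the Dockerfile default.
    return "latest"


print(resolve_image_version("refs/tags/v1.2.0", "abc1234"))
print(resolve_image_version("refs/heads/main", "abc1234"))
```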
---------
Co-authored-by: Zvi Fried <[email protected]>
1 parent 1abfbdb commit a1718b2
File tree
38 files changed (+1844, -1451 lines changed)
- .github/workflows
- scripts
- src
  - mcp_as_a_judge
  - prompts
    - system
    - user
- tests