Commit a1718b2

OtherVibes and Zvi Fried authored
Feat/general refinement (#6)
* feat/general-refinement - refactor: consolidate models and clean up project structure
  - Move elicit models (ObstacleResolutionDecision, RequirementsClarification) from server.py to models.py
  - Remove duplicate model definitions to follow DRY principle
  - Update imports in server.py to use centralized models
  - Remove PROJECT_SUMMARY.md file for cleaner project structure
  - Improve code organization and maintainability

* feat/general-refinement - fix: prioritize PyPI installation and use GitHub Container Registry
  - Update README installation instructions to prioritize PyPI package over git clone
  - Change primary installation method to use 'uv add mcp-as-a-judge' and 'pip install mcp-as-a-judge'
  - Update Docker Compose to use pre-built images from GitHub Container Registry for production
  - Separate development and production Docker configurations with profiles
  - Ensure all Docker instructions reference ghcr.io/hepivax/mcp-as-a-judge
  - Keep git clone only for development and source builds
  - Improve user experience by making package installation the default path

* feat/general-refinement - fix: remove unnecessary port configurations for MCP stdio communication
  - Remove PORT and TRANSPORT build args from Dockerfile (MCP uses stdio, not HTTP)
  - Remove EXPOSE directive and port mappings from Docker configurations
  - Update docker-compose.yml to remove port mappings and add stdin_open/tty for stdio
  - Remove nginx service (not needed for MCP servers)
  - Update Docker run commands in README to use -it instead of port mappings
  - Fix health check to use process check instead of HTTP endpoint
  - Add note in README explaining MCP uses stdio communication
  - Simplify Docker configuration for proper MCP server deployment

* feat/general-refinement - docs: add LLM-as-a-Judge concept and MCP client requirements
  - Add explanation that concept derives from LLM-as-a-Judge paradigm
  - Specify MCP client requirements with official documentation links:
    - Sampling capability required for AI-powered code evaluation
    - Elicitation capability required for user decision prompts
  - Link to official MCP docs for sampling and elicitation concepts
  - Enhance features section to reference specific MCP capabilities
  - Improve clarity on technical requirements for proper functionality

* feat/general-refinement - docs: highlight main purpose of improving developer-AI interface
  - Add prominent section explaining core mission to enhance developer-AI collaboration
  - Emphasize preventing AI poor decisions and involving humans in critical choices
  - Update main description to highlight transformation of developer-AI experience
  - Add focus on intelligent AI-human collaboration with clear boundaries
  - Make it clear this is about improving the interface between developers and AI assistants
  - Position as solution for better AI-human workflow in software development

* feat/general-refinement - docs: add judge icons for better visual branding
  - Replace 🚨 with ⚖️ in main title for better thematic representation
  - Add ⚖️ to Main Purpose section header
  - Update Five Powerful Tools to Five Powerful Judge Tools with ⚖️ icon
  - Add ⚖️ to Concept section for consistent judge theme
  - Improve visual identity and reinforce the 'judge' concept throughout README
  - Create cohesive branding with scales of justice emoji

* feat/general-refinement - refactor: replace static research validation with AI-powered evaluation
  - Replace hardcoded research validation logic with intelligent AI evaluation
  - Embed research, plan, design, and user requirements into validation prompt
  - Use LLM sampling to assess research comprehensiveness and design alignment
  - Evaluate if design is properly based on research findings
  - Check for exploration of existing solutions, alternatives, and best practices
  - Validate research quality and actionable insights
  - Provide detailed feedback on research gaps and design-research alignment
  - Maintain obstacle resolution pattern for user involvement in decisions
  - Improve validation accuracy and reduce false positives from static checks

* feat/general-refinement - fix: correct judge_code_change trigger and use Pydantic JSON schema
  - Fix judge_code_change trigger: must be called BEFORE making any file changes
  - Replace hardcoded JSON format with actual Pydantic model schema
  - Use JudgeResponse.model_json_schema() for consistent response format
  - Ensure proper validation timing: code review before file modification
  - Improve prompt accuracy by using actual model schema instead of manual format
  - Maintain consistency between expected response format and actual model structure

* feat/general-refinement - enhance: add Pragmatic Programmer principles to evaluation criteria
  - Integrate key concepts from The Pragmatic Programmer book into judge prompts
  - Add DRY Principle, Orthogonality, and Design by Contract evaluations
  - Include Defensive Programming, Fail Fast, and Broken Windows Theory
  - Add Tracer Bullets, Reversibility, and Good Enough Software principles
  - Enhance with Test Early/Test Often and Premature Optimization awareness
  - Include Easy to Change, Refactoring Strategy, and Plain Text Power concepts
  - Add Rubber Duck Debugging and 'Use the Source, Luke' references
  - Improve evaluation guidelines with pragmatic context-driven approach
  - Balance perfectionism with practical software delivery principles
  - Create more comprehensive and industry-standard evaluation criteria

* feat/general-refinement - enhance: add comprehensive software engineering best practices to evaluation criteria
  - Integrate DRY Principle, Orthogonality, and Design by Contract evaluations
  - Add Defensive Programming, Fail Fast, and Broken Windows Theory concepts
  - Include Tracer Bullets, Reversibility, and Good Enough Software principles
  - Enhance with Test Early/Test Often and Premature Optimization awareness
  - Add Easy to Change, Refactoring Strategy, and Plain Text Power concepts
  - Include Rubber Duck Debugging and authoritative source validation
  - Improve evaluation guidelines with context-driven approach
  - Balance perfectionism with practical software delivery principles
  - Create more comprehensive and industry-standard evaluation criteria
  - Focus on maintainable, working software over perfect solutions

* feat/general-refinement - fix: ensure judge_code_change is called for new file creation
  - Make it explicit that judge_code_change must be called BEFORE creating ANY new files
  - Add comprehensive list of file operations that require code review
  - Include new Python files, configuration files, scripts, and modules
  - Update parameter descriptions to clarify new file content vs modifications
  - Change prompt language from 'code changes' to 'code content' for clarity
  - Ensure all file operations involving code are properly validated
  - Prevent creation of unreviewed code files in any format

* feat/general-refinement - enhance: make judge_code_change documentation impossible to miss
  - Add prominent 🚨🚨🚨 alerts and visual emphasis for mandatory requirement
  - Specify exact triggers: save-file, str-replace-editor, and other code-writing tools
  - Add explicit consequences of not calling: SWE compliance violations, security risks
  - Include clear example workflow showing proper usage timing
  - Change from 'BEFORE' to 'IMMEDIATELY AFTER' for clarity on timing
  - Add specific tool names that trigger the requirement
  - Make file_path parameter required instead of optional
  - Emphasize this is mandatory compliance, not optional review
  - Use multiple warning levels and visual cues to prevent oversight

* feat/general-refinement - fix: replace relative imports with absolute imports and enhance pre-commit
  - Replace 'from .models' with 'from mcp_as_a_judge.models' in server.py
  - Replace 'from .server' with 'from mcp_as_a_judge.server' in __init__.py
  - Add gitleaks security scanning to pre-commit hooks (first priority)
  - Add additional pre-commit hooks for better code quality
  - Ensure all imports are absolute for better maintainability
  - Improve import clarity and avoid relative import issues
  - Note: ruff already provides black, isort, and flake8 functionality

* feat/general-refinement - fix: apply pre-commit hook auto-fixes
  - Fix trailing whitespace in multiple files
  - Fix end-of-file issues in docker-compose.yml
  - Apply isort import sorting to all Python files
  - Apply black code formatting to 9 Python files
  - Fix prettier formatting for markdown and YAML files
  - All security checks passed (gitleaks found no secrets)
  - Pre-commit hooks are now working correctly and enforcing quality standards

* feat/general-refinement - fix: resolve flake8 and mypy errors
  - Remove poetry-check from pre-commit (we use uv, not poetry)
  - Fix all flake8 D202 errors (blank lines after docstrings)
  - Fix flake8 D400 error (missing period in docstring)
  - Fix boolean comparison issues (== True/False -> direct boolean checks)
  - Add missing return type annotations to all test functions
  - Add missing docstrings to __init__ methods in conftest.py
  - Extract research validation logic to reduce complexity (C901)
  - Create _validate_research_quality helper function
  - Replace duplicated research validation code with helper function call
  - Improve code maintainability and reduce cyclomatic complexity

* feat/general-refinement - fix: resolve complexity and final flake8 errors
  - Extract _evaluate_coding_plan helper function to reduce complexity
  - Reduce judge_coding_plan complexity from 15 to under 10 (C901 resolved)
  - Remove duplicated prompt code and use helper functions
  - Fix final D202 error (blank line after docstring)
  - All flake8 errors now resolved
  - Improve code maintainability with better separation of concerns
  - Helper functions make code more testable and reusable

* feat/general-refinement - fix: apply black formatting
  - Black automatically reformatted server.py for consistent style
  - All flake8 errors resolved ✅
  - Gitleaks security scan passing ✅
  - Code formatting and style checks passing ✅
  - Only mypy type checking issues remain (expected for MCP project)

* feat/general-refinement - test: demonstrate pre-commit blocking behavior with pytest

* feat/general-refinement - test: this commit should be blocked by pre-commit

* feat/general-refinement - add: pytest to pre-commit hooks and demonstrate blocking behavior
  - Add pytest hook to run tests before every commit
  - Configure pytest with verbose output and short traceback
  - Fix test assertions to match actual server name format
  - Demonstrate pre-commit blocking with multiple hook failures
  - All hooks now properly validate code quality before commits

* feat/general-refinement - refactor: move prompts to Markdown files with Jinja2 templating

  ✨ MAJOR REFACTORING: Externalized Prompts for Better Maintainability

  🎯 **What Changed:**
  - **Extracted all hardcoded prompts** to separate Markdown files in src/prompts/
  - **Added Jinja2 templating** for dynamic variable substitution
  - **Created PromptLoader utility** for loading and rendering templates
  - **Comprehensive test coverage** for prompt loading functionality

  📁 **New Structure:**
  - src/prompts/judge_coding_plan.md - Main evaluation prompt
  - src/prompts/judge_code_change.md - Code review prompt
  - src/prompts/research_validation.md - Research quality validation
  - src/mcp_as_a_judge/prompt_loader.py - Template loading utility
  - tests/test_prompt_loader.py - Full test coverage

  🚀 **Benefits:**
  - **Easy editing**: Prompts now in readable Markdown format
  - **Version control**: Track prompt changes separately from code
  - **Maintainability**: No more giant f-strings in Python code
  - **Flexibility**: Jinja2 templating for dynamic content
  - **Testability**: Isolated prompt testing and validation
  - **Collaboration**: Non-developers can edit prompts easily

  ✅ **Quality Assurance:**
  - All existing tests pass (28/28)
  - New comprehensive prompt loader tests
  - Backward compatibility maintained
  - No functional changes to evaluation logic

  This refactoring makes the codebase much more maintainable and allows for easier prompt iteration and improvement! 🎉

* feat/general-refinement - refactor prompts with perfect system/user separation and fix all mypy issues
  - Reorganized prompts into system/ and user/ directories for clear separation
  - System prompts contain behavioral instructions (HOW to evaluate)
  - User prompts contain simple requests (WHAT to evaluate)
  - Fixed all mypy type checking issues with proper annotations
  - Updated pre-commit configuration for proper mypy integration
  - Removed unused files (docker-compose.yml, example files)
  - All tests passing (29/29) with full type safety
  - Perfect separation of concerns in prompt architecture

* feat/general-refinement - Fix deterministic JSON parsing and remove exception swallowing
  - Add ResearchValidationResponse Pydantic model for proper validation
  - Create robust _extract_json_from_response() function to handle:
    * Markdown code blocks
    * Plain JSON objects
    * JSON embedded in explanatory text
    * Proper error handling for malformed responses
  - Replace manual json.loads() + dict.get() with Pydantic model_validate_json()
  - Remove exception swallowing that masked real parsing errors
  - Remove inappropriate raise_obstacle suggestions from parsing errors
  - Apply consistent parsing pattern to all LLM sampling functions:
    * _validate_research_quality
    * _evaluate_workflow_guidance
    * _evaluate_coding_plan
    * judge_code_change
  - Add comprehensive test suite (tests/test_json_extraction.py) with 8 test cases
  - Fix context injection issues by using proper Context type annotations
  - All 37 tests passing, mypy clean

  Resolves the Invalid JSON expected value at line 1 column 1 error caused by LLMs returning JSON wrapped in markdown code blocks.

* feat/general-refinement - Improve README documentation structure
  - Remove Technical Prerequisites section
  - Update AI assistants section to show only supported ones in clean table format
  - Change Critical Requirements to MCP Client Prerequisites with bold formatting
  - Convert Five Powerful Judge Tools to List of Tools with tools emoji
  - Reorganize tools section as a clean table with tool names and descriptions
  - Streamline documentation for better readability and focus

* feat/general-refinement - Upgrade to Python 3.13.5 and improve coverage configuration
  - Upgrade Python version from 3.12 to 3.13.5 across all configurations:
    * Update .python-version, pyproject.toml, and all GitHub workflows
    * Update Dockerfile to use python:3.13-slim base images
    * Update README badge and CONTRIBUTING.md requirements
    * Regenerate uv.lock with Python 3.13 dependencies
  - Add Python 3.13+ to system prerequisites in README
  - Improve coverage configuration in pyproject.toml:
    * Add comprehensive source and omit patterns
    * Configure exclude_lines for better coverage reporting
    * Set XML output configuration
  - Update CI workflow for better Codecov integration:
    * Set fail_ci_if_error to false for more reliable CI
    * Add verbose output for better debugging
    * Ensure CODECOV_TOKEN environment variable is properly set
  - All 37 tests passing on Python 3.13.5
  - MyPy type checking clean with Python 3.13

* feat/general-refinement - Fix Dockerfile merge conflict by keeping stdio-only MCP configuration
  - Remove HTTP/port-related configurations (PORT, TRANSPORT, EXPOSE)
  - Keep Python 3.13-slim base images for latest Python version
  - Maintain process-based health check using pgrep instead of HTTP curl
  - Ensure MCP server remains stdio-only as intended for MCP protocol
  - Resolve merge conflict with main branch while preserving Python 3.13 upgrade

* feat/general-refinement - Implement dynamic Docker image versioning and fix merge conflicts
  - Replace hardcoded version '1.0.0' with dynamic VERSION build argument in Dockerfile
  - Add VERSION build arg with 'latest' default for flexible versioning
  - Update CI workflow to pass development version (dev-{commit-sha}) for test builds
  - Update release workflow to pass actual tag version for production builds
  - Remove HTTP/port configurations to keep MCP server stdio-only as intended
  - Maintain Python 3.13-slim base images while resolving main branch conflicts
  - Ensure proper version tracking across PyPI packages and Docker images
  - Enable automatic versioning without manual Dockerfile updates

---------

Co-authored-by: Zvi Fried <[email protected]>
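
The commit message describes a JSON-extraction helper that copes with markdown code blocks, plain JSON objects, and JSON embedded in explanatory text. The helper's actual code is not shown on this page, so the following is only a minimal sketch of that idea using the standard library. The function name `extract_json_from_response` and its error messages are assumptions for illustration, not the implementation in server.py.

```python
import json
import re

# Hypothetical sketch: the real _extract_json_from_response() may differ.
FENCE = "`" * 3  # three backticks, built this way to keep the example readable
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def extract_json_from_response(text: str) -> str:
    """Return the first JSON object found in an LLM response, as a JSON string."""
    # 1. If the response contains a fenced code block, look inside it first.
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text

    # 2. Locate the first opening brace; anything before it is treated as prose.
    start = candidate.find("{")
    if start == -1:
        raise ValueError("No JSON object found in response")

    # 3. raw_decode parses one JSON value and ignores any trailing text.
    try:
        obj, _end = json.JSONDecoder().raw_decode(candidate[start:])
    except json.JSONDecodeError as exc:
        raise ValueError(f"Malformed JSON in response: {exc}") from exc

    # Re-serialise so callers always receive clean JSON text.
    return json.dumps(obj)
```

The returned string could then be handed to a Pydantic model's `model_validate_json()`, as the commit message describes. Raising on malformed input, instead of returning a default, matches the stated goal of not swallowing parsing errors.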
1 parent 1abfbdb commit a1718b2

38 files changed, +1844 -1451 lines changed

.github/workflows/ci.yml

Lines changed: 32 additions & 27 deletions
@@ -7,76 +7,79 @@ on:
     branches: [ main, develop ]

 env:
-  PYTHON_VERSION: "3.12"
+  PYTHON_VERSION: "3.13"

 jobs:
   test:
     name: Test Suite
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.12", "3.13"]
-
+        python-version: ["3.13"]
+
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
-
+
       - name: Install uv
         uses: astral-sh/setup-uv@v4
         with:
           version: "latest"
-
+
       - name: Set up Python ${{ matrix.python-version }}
         run: uv python install ${{ matrix.python-version }}
-
+
       - name: Install dependencies
         run: |
           uv sync --all-extras --dev
-
+
       - name: Run linting
         run: |
           uv run ruff check src tests
           uv run ruff format --check src tests
-
+
       - name: Run type checking
         run: |
           uv run mypy src
-
+
       - name: Run tests
         run: |
           uv run pytest --cov=src/mcp_as_a_judge --cov-report=xml --cov-report=term-missing
-
+
       - name: Upload coverage to Codecov
         uses: codecov/codecov-action@v4
         with:
           file: ./coverage.xml
-          fail_ci_if_error: true
+          fail_ci_if_error: false
+          verbose: true
           token: ${{ secrets.CODECOV_TOKEN }}
+        env:
+          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}

   security:
     name: Security Scan
     runs-on: ubuntu-latest
-
+
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
-
+
       - name: Install uv
         uses: astral-sh/setup-uv@v4
         with:
           version: "latest"
-
+
       - name: Set up Python
         run: uv python install ${{ env.PYTHON_VERSION }}
-
+
       - name: Install dependencies
         run: uv sync --all-extras --dev
-
+
       - name: Run safety check
         run: |
           uv add --dev safety
           uv run safety check
-
+
       - name: Run bandit security linter
         run: |
           uv add --dev bandit
@@ -86,31 +89,31 @@ jobs:
     name: Build Package
     runs-on: ubuntu-latest
     needs: [test, security]
-
+
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
-
+
       - name: Install uv
         uses: astral-sh/setup-uv@v4
         with:
           version: "latest"
-
+
       - name: Set up Python
         run: uv python install ${{ env.PYTHON_VERSION }}
-
+
       - name: Install dependencies
         run: uv sync --all-extras --dev
-
+
       - name: Build package
         run: |
           uv build --no-sources
-
+
       - name: Check package
         run: |
           uv add --dev twine
           uv run twine check dist/*
-
+
       - name: Upload build artifacts
         uses: actions/upload-artifact@v4
         with:
@@ -122,19 +125,21 @@ jobs:
     name: Build Docker Image
     runs-on: ubuntu-latest
     needs: [test, security]
-
+
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
-
+
       - name: Set up Docker Buildx
         uses: docker/setup-buildx-action@v3
-
+
       - name: Build Docker image
         uses: docker/build-push-action@v5
         with:
           context: .
           push: false
           tags: mcp-as-a-judge:test
+          build-args: |
+            VERSION=dev-${{ github.sha }}
           cache-from: type=gha
           cache-to: type=gha,mode=max

.github/workflows/dependabot-auto-merge.yml

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ jobs:
         uses: dependabot/fetch-metadata@v2
         with:
           github-token: "${{ secrets.GITHUB_TOKEN }}"
-
+
       - name: Auto-merge Dependabot PRs for patch and minor updates
         if: steps.metadata.outputs.update-type == 'version-update:semver-patch' || steps.metadata.outputs.update-type == 'version-update:semver-minor'
         run: gh pr merge --auto --merge "$PR_URL"

.github/workflows/release.yml

Lines changed: 18 additions & 16 deletions
@@ -6,7 +6,7 @@ on:
       - 'v*'

 env:
-  PYTHON_VERSION: "3.12"
+  PYTHON_VERSION: "3.13"
   REGISTRY: ghcr.io
   IMAGE_NAME: ${{ github.repository }}

@@ -18,39 +18,39 @@ jobs:
       contents: write
       packages: write
       id-token: write
-
+
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
-
+
       - name: Install uv
         uses: astral-sh/setup-uv@v4
         with:
           version: "latest"
-
+
       - name: Set up Python
         run: uv python install ${{ env.PYTHON_VERSION }}
-
+
       - name: Install dependencies
         run: uv sync --all-extras --dev
-
+
       - name: Extract version from tag
         id: version
         run: |
           VERSION=${GITHUB_REF#refs/tags/v}
           echo "VERSION=$VERSION" >> $GITHUB_OUTPUT
           echo "TAG=${GITHUB_REF#refs/tags/}" >> $GITHUB_OUTPUT
-
+
       - name: Verify version matches pyproject.toml
         run: |
           PROJECT_VERSION=$(uv run python -c "import tomllib; print(tomllib.load(open('pyproject.toml', 'rb'))['project']['version'])")
           if [ "$PROJECT_VERSION" != "${{ steps.version.outputs.VERSION }}" ]; then
             echo "Version mismatch: tag=${{ steps.version.outputs.VERSION }}, pyproject.toml=$PROJECT_VERSION"
             exit 1
           fi
-
+
       - name: Generate changelog
         id: changelog
         run: |
@@ -60,23 +60,23 @@ jobs:
           git log --pretty=format:"- %s" $(git describe --tags --abbrev=0 HEAD^)..HEAD >> $GITHUB_OUTPUT || echo "- Initial release" >> $GITHUB_OUTPUT
           echo "" >> $GITHUB_OUTPUT
           echo "EOF" >> $GITHUB_OUTPUT
-
+
       - name: Build package
         run: |
           uv build --no-sources
-
+
       - name: Publish to PyPI
         uses: pypa/gh-action-pypi-publish@release/v1
         with:
           password: ${{ secrets.PYPI_API_TOKEN }}
-
+
       - name: Log in to Container Registry
         uses: docker/login-action@v3
         with:
           registry: ${{ env.REGISTRY }}
           username: ${{ github.actor }}
           password: ${{ secrets.GITHUB_TOKEN }}
-
+
       - name: Extract metadata for Docker
         id: meta
         uses: docker/metadata-action@v5
@@ -88,20 +88,22 @@ jobs:
             type=semver,pattern={{major}}.{{minor}}
             type=semver,pattern={{major}}
             type=raw,value=latest,enable={{is_default_branch}}
-
+
       - name: Set up Docker Buildx
         uses: docker/setup-buildx-action@v3
-
+
       - name: Build and push Docker image
         uses: docker/build-push-action@v5
         with:
           context: .
           push: true
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
+          build-args: |
+            VERSION=${{ steps.version.outputs.VERSION }}
           cache-from: type=gha
           cache-to: type=gha,mode=max
-
+
       - name: Create GitHub Release
         uses: actions/create-release@v1
         env:
@@ -112,7 +114,7 @@ jobs:
           body: ${{ steps.changelog.outputs.CHANGELOG }}
           draft: false
           prerelease: false
-
+
       - name: Upload release assets
         uses: actions/upload-release-asset@v1
         env:

.github/workflows/semantic-release.yml

Lines changed: 8 additions & 8 deletions
@@ -7,7 +7,7 @@ on:
   workflow_dispatch:

 env:
-  PYTHON_VERSION: "3.12"
+  PYTHON_VERSION: "3.13"

 jobs:
   semantic-release:
@@ -19,34 +19,34 @@ jobs:
       issues: write
       pull-requests: write
       id-token: write
-
+
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
         with:
           fetch-depth: 0
           token: ${{ secrets.GITHUB_TOKEN }}
-
+
       - name: Install uv
         uses: astral-sh/setup-uv@v4
         with:
           version: "latest"
-
+
       - name: Set up Python
         run: uv python install ${{ env.PYTHON_VERSION }}
-
+
       - name: Install dependencies
         run: uv sync --all-extras --dev
-
+
       - name: Setup Node.js
         uses: actions/setup-node@v4
         with:
           node-version: '20'
-
+
       - name: Install semantic-release
         run: |
           npm install -g semantic-release @semantic-release/changelog @semantic-release/git @semantic-release/github
-
+
       - name: Run semantic-release
         env:
           GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
