- Added `scripts/test_performance_regression.py` to measure LLM Judge evaluation time and memory usage.
- Test fails if elapsed time > 30s or memory usage > 600MB (default, configurable).
- Integrated as a new job in CI after llm-judge; uploads the `performance_metrics.json` artifact.
- Thresholds and rationale are documented in both the script and workflow.
- Ensures no performance regressions are merged.
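
The threshold check described above can be sketched roughly as follows. This is a minimal illustration, not the contents of `scripts/test_performance_regression.py`: the environment variable names `PERF_MAX_SECONDS` and `PERF_MAX_MEMORY_MB` and both function names are hypothetical, and `tracemalloc` only tracks Python-level allocations, so the real script may measure memory differently.

```python
import os
import time
import tracemalloc

# Defaults mirror the documented thresholds (30s, 600MB).
# The environment variable names are hypothetical, used only for illustration.
MAX_SECONDS = float(os.environ.get("PERF_MAX_SECONDS", "30"))
MAX_MEMORY_MB = float(os.environ.get("PERF_MAX_MEMORY_MB", "600"))


def measure(func, *args, **kwargs):
    """Run func once and return (elapsed_seconds, peak_python_memory_mb)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        func(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return elapsed, peak_bytes / (1024 * 1024)


def check_thresholds(elapsed, memory_mb,
                     max_seconds=MAX_SECONDS, max_memory_mb=MAX_MEMORY_MB):
    """Return a list of threshold violations; an empty list means the check passed."""
    failures = []
    if elapsed > max_seconds:
        failures.append(f"elapsed time {elapsed:.2f}s exceeds {max_seconds:.0f}s")
    if memory_mb > max_memory_mb:
        failures.append(f"memory {memory_mb:.1f}MB exceeds {max_memory_mb:.0f}MB")
    return failures
```

A caller would wrap the LLM Judge evaluation in `measure(...)`, pass the result to `check_thresholds(...)`, and exit non-zero if the returned list is not empty.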
- Created branch `feature/playwright-e2e-testing` for Playwright E2E testing setup
- Added `package.json` and `playwright.config.ts` for Playwright
- To be followed by E2E test suites, fixtures, and CI integration
- Added `scripts/generate_final_report.py` to aggregate test, coverage, LLM Judge, and performance results into a single Markdown report (`final_test_report.md`).
- Updated `.github/workflows/verify.yml` to run the report script after all tests and upload the report as an artifact for every job.
- The report includes: test summary, coverage, LLM Judge results, performance metrics, recommendations, and a stub for comparisons to previous runs.
- All steps are robust to missing files and always generate a report for CI/CD visibility.
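
One way to get the "robust to missing files" behavior is to treat every input as optional and emit a placeholder when it cannot be read. The sketch below assumes the inputs are JSON files; the function names, the section titles, and the report heading are hypothetical and not taken from `scripts/generate_final_report.py`.

```python
import json
from pathlib import Path


def load_json(path):
    """Return parsed JSON from path, or None if the file is missing or unreadable."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError covers missing files; ValueError covers invalid JSON or encoding.
        return None


def build_report(sections):
    """Render Markdown from {section title: input path}; missing inputs get a placeholder."""
    lines = ["# Final Test Report", ""]
    for title, path in sections.items():
        data = load_json(path)
        lines.append(f"## {title}")
        if data is None:
            lines.append(f"_No data found at `{path}`._")
        elif isinstance(data, dict):
            lines.extend(f"- {key}: {value}" for key, value in data.items())
        else:
            lines.append(f"- {data}")
        lines.append("")
    return "\n".join(lines)
```

Because `build_report` never raises on a missing input, the report can be written and uploaded even when an earlier job failed before producing its artifact.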
- File: `scripts/e2e_health_check.py`
- Purpose: Checks if all required services are up before running E2E tests:
  - Streamlit app (`localhost:8501`)
  - Ollama API (`localhost:11434`, models: `mistral`, `nomic-embed-text`)
  - ChromaDB (`localhost:8000`)
  - Redis (`localhost:6379`)
- Usage: `poetry run python scripts/e2e_health_check.py` (or `python scripts/e2e_health_check.py`)
- Behavior:
  - Prints ✅/❌ for each service
  - Fails fast with clear error if any service is down
  - Summary at the end (ALL SERVICES HEALTHY 🟢 or SOME SERVICES UNHEALTHY 🔴)
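
A simplified sketch of such a health check is shown below. It only tests that each port accepts a TCP connection, which is an assumption made for brevity: the actual `scripts/e2e_health_check.py` may issue HTTP requests instead, and this sketch does not verify that the `mistral` and `nomic-embed-text` models are available in Ollama. The function names are hypothetical.

```python
import socket

# Hosts and ports as listed above.
SERVICES = {
    "Streamlit app": ("localhost", 8501),
    "Ollama API": ("localhost", 11434),
    "ChromaDB": ("localhost", 8000),
    "Redis": ("localhost", 6379),
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services, probe=is_port_open):
    """Probe each service, print a status line, and return {name: healthy}."""
    results = {}
    for name, (host, port) in services.items():
        healthy = probe(host, port)
        print(f"{'✅' if healthy else '❌'} {name} ({host}:{port})")
        results[name] = healthy
    return results


def main():
    """Print the summary line and return a process exit code (0 = healthy)."""
    results = check_services(SERVICES)
    if all(results.values()):
        print("ALL SERVICES HEALTHY 🟢")
        return 0
    print("SOME SERVICES UNHEALTHY 🔴")
    return 1
```

When run as a script, the process would exit with the value returned by `main()` so that CI stops before the E2E job starts. The `probe` parameter exists so the check can be exercised without real services.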
- Refactored: `tests/e2e/specs/basic-e2e.spec.ts`
  - Uses robust `data-testid` selectors via `ChatHelper` for all main flows
  - DRY: All common actions (send message, upload, select mode) use helpers
  - Retry logic and explicit waits for LLM and document upload steps
- Benefits:
  - Less flakiness from UI changes or slow infra
  - Easier to maintain and debug
- Always run the health check before E2E: `poetry run python scripts/e2e_health_check.py`
- If a test fails:
  - Check the Playwright HTML report for logs/screenshots
  - Ensure all infra is up and healthy
  - If infra is slow, increase timeouts in helpers
- Harden other E2E specs to use helpers/selectors (optional)
- Add more health checks as needed (e.g., S3, Supabase)
- Keep this doc updated with new E2E patterns
- Start the app using local scripts (e.g., `./start_basicchat.sh` or `./start_dev.sh`)
- Do NOT use Docker Compose or Docker images for E2E tests
- Run E2E tests with `bunx playwright test --reporter=list` against the local app
- In CI/CD, ensure the app is started from source, not from a Docker image
- Double-check with `docker ps` that no containers are running for the app during E2E
- Stop any running Docker containers for the app: run `docker ps` and, if any containers are running for the app, `docker stop <container_id>`
- Start the app from source: `./start_basicchat.sh` (or `./start_dev.sh`)
- In a new terminal, run E2E tests: `bunx playwright test --reporter=list`
- Ensure your pipeline installs dependencies and starts the app from source before running E2E tests.
- Do not use Docker images for E2E unless you are explicitly testing the container build.
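
The manual `docker ps` check could be automated before the E2E run along these lines. This is a sketch under assumptions: the function names are hypothetical, and matching containers by a keyword in their name (for example `basicchat`, guessed from the start script name) may not reflect how the app's containers are actually named.

```python
import subprocess


def matching_containers(docker_ps_names, keyword):
    """Filter newline-separated container names by a case-insensitive keyword."""
    names = [line.strip() for line in docker_ps_names.splitlines() if line.strip()]
    return [name for name in names if keyword.lower() in name.lower()]


def running_app_containers(keyword):
    """List running containers whose name contains keyword, via `docker ps`."""
    output = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True,
        text=True,
        check=True,  # raise if the docker CLI itself fails
    ).stdout
    return matching_containers(output, keyword)
```

A pre-test step could call `running_app_containers("basicchat")` and abort with a message if the list is not empty, so E2E never runs against a container by accident.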
Last updated: Ensured E2E always runs against local code, not Docker images.