feat(api): add observability baseline with structured logging and telemetry #6

Merged
raychrisgdp merged 19 commits into main from feature/observability-baseline
Jan 3, 2026
@raychrisgdp raychrisgdp commented Jan 3, 2026

Summary

Issues & Goals:

  • Enable structured JSON logging for easier debugging and log correlation
  • Provide request correlation IDs for tracing requests across the system
  • Add telemetry endpoint for monitoring system health and database status
  • Implement log redaction to protect sensitive data (tokens, passwords, emails)

Implementation Highlights:

  • Structured Logging (backend/logging.py): New JSON formatter with context-aware request_id field, redaction filter for sensitive keys and email addresses, and dual output (stdout + rotating file handler)
  • Request Middleware (backend/middleware.py): Request logging middleware that generates/reuses correlation IDs, logs HTTP requests with method/path/status/duration, handles exceptions with error logging, and echoes X-Request-Id in response headers
  • Telemetry Endpoint (backend/api/v1/telemetry.py): New /api/v1/telemetry endpoint providing system health metrics (status, version, uptime), database connectivity and migration version, graceful degradation on errors, and optional metrics placeholders for future PRs
  • Configuration (backend/config.py): Added LOG_LEVEL, TELEMETRY_ENABLED, and LOG_FILE_PATH settings with helper methods for log level resolution and file path defaults
  • Integration (backend/main.py, backend/cli/main.py): Integrated logging setup in FastAPI lifespan and CLI startup, registered middleware and telemetry router conditionally based on settings
  • Documentation (docs/USER_GUIDE.md, README.md): Added comprehensive observability section covering environment variables, telemetry endpoint usage, and log file configuration
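A minimal sketch of the structured-logging idea described above, using only the standard library. It assumes a context variable carries the request ID; the names `request_id_var` and `JsonFormatter` are hypothetical and are not taken from backend/logging.py.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Hypothetical context variable; the middleware would set it once per request.
request_id_var = contextvars.ContextVar("request_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (one line per record)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # None outside a request context (for example, CLI commands).
            "request_id": request_id_var.get(),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])
    logging.getLogger("demo").info("server started")
```

Reading the ID from a context variable inside the formatter keeps call sites unchanged: code logs normally, and the correlation ID is attached at format time.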

How to Test

Prerequisites:

  1. Install dependencies (including dev dependencies for FastAPI/uvicorn):
    # Install dev dependencies (recommended - uses Makefile)
    # This will use 'uv sync' if uv.lock exists, otherwise 'uv pip install'
    make dev
    
    # Or manually (if uv.lock exists):
    uv sync --extra dev
    
    # Or manually (if no uv.lock):
    uv venv && uv pip install --python .venv/bin/python -e ".[dev]"
    
    # Verify uvicorn is installed in the project venv:
    .venv/bin/python -c "import uvicorn; print(f'uvicorn version: {uvicorn.__version__}')"
    # Should show: uvicorn version: 0.40.0 (or similar)
    
    # Note: If dependencies change, update the lock file:
    # uv lock  # Updates uv.lock with current pyproject.toml dependencies

Backend API Testing:

  1. Start the API server:

    # Start in background (recommended for testing):
    uv run python -m backend.main > /tmp/taskgenie.log 2>&1 &
    
    # Or start in foreground:
    uv run python -m backend.main
    # Or: uvicorn backend.main:app --reload
    
    # Verify server started:
    sleep 2 && tail -5 /tmp/taskgenie.log
    # Should see: "Uvicorn running on http://127.0.0.1:8080"
  2. Test structured logging:

    • Make any API request (e.g., curl http://127.0.0.1:8080/health)
    • Verify logs are in JSON format with fields: timestamp, level, logger, message, request_id
    • Check that request_id is present (UUID4 format) and echoed in response headers as X-Request-Id
    • Verify logs are written to both stdout and ~/.taskgenie/logs/taskgenie.jsonl (or configured LOG_FILE_PATH)
  3. Test request logging middleware:

    • Make API requests with different methods: GET /health, GET /api/v1/tasks
    • Verify each request logs an http_request event with fields: method, path, status, duration_ms
    • Check that X-Request-Id header is present in all responses
    • Test request ID reuse: curl -H "X-Request-Id: test-id-123" http://127.0.0.1:8080/health and verify the same ID is echoed back
    • Test that unsafe request IDs (too long or non-ASCII) are rejected and a new UUID is generated instead
  4. Test log redaction:

    • Trigger logs containing sensitive data (e.g., authorization headers, email addresses)
    • Verify sensitive keys (authorization, token, password, secret, cookie, email) are redacted as [redacted]
    • Verify email addresses in string values are replaced with [redacted-email]
    • Check that redaction works for nested dictionaries and lists
  5. Test telemetry endpoint:

    curl http://127.0.0.1:8080/api/v1/telemetry
    • Verify response includes: status ("ok" or "degraded"), version, uptime_s, db.connected, db.migration_version
    • Verify optional.event_queue_size and optional.agent_runs_active are present with null values
    • Test degraded status: Temporarily break database connection and verify status="degraded" with error message in db.error
  6. Test configuration:

    • Set LOG_LEVEL=DEBUG and verify debug logs appear
    • Set TELEMETRY_ENABLED=false and verify /api/v1/telemetry returns 404
    • Set LOG_FILE_PATH=/tmp/test.jsonl and verify logs are written to custom path
    • Verify DEBUG=true automatically sets log level to DEBUG
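The request ID behavior exercised in step 3 (reuse a safe incoming X-Request-Id, otherwise generate a UUID4) could look roughly like the sketch below. The function name `resolve_request_id` and the 128-character limit are assumptions for illustration; the PR does not state the actual limit.

```python
import uuid

# Assumed limit; the real maximum length used by the middleware is not shown in this PR.
MAX_REQUEST_ID_LENGTH = 128


def resolve_request_id(incoming):
    """Reuse a safe client-supplied ID, otherwise generate a fresh UUID4 string."""
    if (
        incoming
        and len(incoming) <= MAX_REQUEST_ID_LENGTH
        and incoming.isascii()
        and incoming.isprintable()
    ):
        return incoming
    # Missing, too long, non-ASCII, or containing control characters: replace it.
    return str(uuid.uuid4())


if __name__ == "__main__":
    print(resolve_request_id("test-id-123"))
    print(resolve_request_id("x" * 500))
```

Rejecting control characters as well as non-ASCII input is a defensive choice here, since the value is echoed back in a response header and written to logs.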

CLI Testing:

  1. Test CLI logging:
    uv run tgenie --help
    • Verify CLI output includes structured JSON logs
    • Check that request_id is null for CLI operations (not in request context)
    • Verify logs are written to configured log file path

Expected Behavior:

  • All API requests generate structured JSON logs with correlation IDs
  • Request IDs are propagated via X-Request-Id header for tracing
  • Sensitive data is automatically redacted from logs
  • Telemetry endpoint provides system health metrics
  • Logs are written to both stdout and rotating file handler
  • Configuration respects environment variables and defaults
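To illustrate the redaction behavior listed above, here is a hedged sketch of a recursive redaction helper. The key list and the [redacted] / [redacted-email] markers come from the test steps in this PR; the function name `redact`, the exact-match key comparison, and the email regex are assumptions and may differ from backend/logging.py.

```python
import re

SENSITIVE_KEYS = {"authorization", "token", "password", "secret", "cookie", "email"}
# Deliberately simple pattern; an assumption, not the PR's actual regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value):
    """Return a copy of value with sensitive keys and email addresses masked."""
    if isinstance(value, dict):
        return {
            key: "[redacted]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        # Tuples are returned as lists, which is fine for JSON output.
        return [redact(item) for item in value]
    if isinstance(value, str):
        return EMAIL_RE.sub("[redacted-email]", value)
    return value


if __name__ == "__main__":
    print(redact({"user": {"password": "hunter2", "note": "contact a@b.com"}}))
```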

Related Issues

  • Implements PR-016: Observability Baseline

Author Checklist

  • Synced with latest main branch
  • Self-reviewed
  • All tests pass locally (27 tests passing)
  • Documentation updated (USER_GUIDE.md, README.md)
  • No breaking changes
  • Manual testing completed (server starts, endpoints work, structured logging verified)

Additional Notes

Key Implementation Areas for Review

Backend API:

  • backend/logging.py: JSON formatter implementation, redaction filter logic, context variable usage for request ID propagation
  • backend/middleware.py: Request ID generation/reuse logic, HTTP request logging, exception handling, response header injection
  • backend/api/v1/telemetry.py: Database health checks, migration version retrieval, graceful error handling
  • backend/config.py: New observability settings and helper methods for log level/file path resolution
  • backend/main.py: Logging setup in lifespan, middleware registration, conditional telemetry router registration
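For reviewers of the config helpers, the log level resolution might be sketched as follows. This assumes that DEBUG=true takes precedence over LOG_LEVEL and that unknown level names fall back to INFO; neither rule is confirmed by the PR text, and `resolve_log_level` is a hypothetical name.

```python
import logging


def resolve_log_level(log_level, debug):
    """Map LOG_LEVEL and DEBUG settings to a numeric logging level."""
    if debug:
        # Assumption: DEBUG=true wins over any LOG_LEVEL value.
        return logging.DEBUG
    level = logging.getLevelName((log_level or "INFO").upper())
    # getLevelName returns a string such as "Level BOGUS" for unknown names.
    return level if isinstance(level, int) else logging.INFO


if __name__ == "__main__":
    print(resolve_log_level("warning", debug=False))
```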

Testing:

  • tests/test_logging.py: Unit tests for JSON formatter, redaction filter, and logging setup
  • tests/test_middleware.py: Middleware tests for request ID handling, request logging, error logging
  • tests/api/test_telemetry.py: Integration tests for telemetry endpoint response shape and degraded status
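The response shape and degraded status covered by the telemetry tests can be illustrated with a standard-library sketch. It assumes a SQLite database and an Alembic-style `alembic_version` table for the migration version; the function `build_telemetry` and the default version string are hypothetical, and the real endpoint may be built differently.

```python
import sqlite3
import time

START_TIME = time.monotonic()


def build_telemetry(db_path, version="0.0.0"):
    """Build a telemetry payload, degrading gracefully on database errors."""
    db = {"connected": False, "migration_version": None}
    status = "ok"
    try:
        conn = sqlite3.connect(db_path)
        try:
            conn.execute("SELECT 1")
            db["connected"] = True
            # Assumed table name; raises OperationalError if the table is missing.
            row = conn.execute("SELECT version_num FROM alembic_version").fetchone()
            db["migration_version"] = row[0] if row else None
        finally:
            conn.close()
    except sqlite3.Error as exc:
        status = "degraded"
        db["error"] = str(exc)
    return {
        "status": status,
        "version": version,
        "uptime_s": round(time.monotonic() - START_TIME, 3),
        "db": db,
        # Placeholders reserved for future PRs.
        "optional": {"event_queue_size": None, "agent_runs_active": None},
    }


if __name__ == "__main__":
    print(build_telemetry(":memory:"))
```

Catching the error inside the handler, rather than letting it propagate, is what lets the endpoint report "degraded" instead of failing with a 500.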

Documentation:

  • docs/USER_GUIDE.md: Comprehensive observability section with environment variables and telemetry usage
  • README.md: Quick reference for observability features

Testing Notes

  • Manual Testing Completed: Server starts successfully, all endpoints respond correctly
  • Structured Logging: JSON logs verified with proper fields (timestamp, level, logger, message, request_id, event, method, path, status, duration_ms)
  • Request ID Propagation: Auto-generated UUIDs and custom ID reuse both working correctly
  • Telemetry Endpoint: Returns correct JSON with status, version, uptime_s, db.connected, db.migration_version
  • Log File: Created at ~/.taskgenie/logs/taskgenie.jsonl with proper JSON formatting
  • Log file rotation: Verify logs rotate when file exceeds 10MB (5 backup files retained) - Not tested manually
  • Request ID context: Verify request_id is properly scoped per request and doesn't leak between requests - Verified via manual testing
  • Redaction edge cases: Test redaction with various sensitive data patterns (nested structures, list values) - Unit tests cover this
  • Telemetry degraded mode: Test telemetry endpoint behavior when database is unavailable or migration table missing - Unit tests cover this
  • Configuration precedence: Verify environment variables override defaults correctly - Unit tests cover this
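Since log rotation was not tested manually, a small sketch of the rotation settings mentioned above (10MB per file, 5 backups) may help reviewers check them in isolation. It uses the standard library's RotatingFileHandler; the helper name `build_file_handler` is hypothetical, and 10MB is interpreted here as 10 * 1024 * 1024 bytes, which is an assumption.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # assumed interpretation of "10MB"
BACKUP_COUNT = 5


def build_file_handler(log_file_path):
    """Create a rotating file handler, creating the parent directory if needed."""
    path = Path(log_file_path).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    return RotatingFileHandler(
        path, maxBytes=MAX_BYTES, backupCount=BACKUP_COUNT, encoding="utf-8"
    )


if __name__ == "__main__":
    handler = build_file_handler("~/.taskgenie/logs/taskgenie.jsonl")
    logging.getLogger().addHandler(handler)
```

To observe rotation quickly in a test, the same handler can be built with a much smaller maxBytes and written to until a `.1` backup file appears.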

- Set DEBUG to false in .env.example for production readiness.
- Removed unnecessary database and LLM configuration options from .env.example.
- Updated the Typer dependency in pyproject.toml and uv.lock to remove the 'all' extra, simplifying the installation process.
- Improved developer quickstart instructions for installing dependencies and running the application.
- Enhanced PR-002 task CRUD API documentation with additional details on response shapes and pagination.

These changes aim to streamline configuration, clarify setup instructions, and improve API documentation.
- Added a new API v1 for task management, including endpoints for creating, retrieving, updating, and deleting tasks.
- Introduced task schemas for request validation and response formatting.
- Implemented error handling for task not found scenarios with a standardized error response.
- Updated the Makefile to include precommit checks in the test coverage command.
- Removed linting step from CI workflow to streamline the testing process.

These changes enhance the API functionality for task management and improve error handling, contributing to a more robust application.
- Add structured JSON logging with redaction filter
- Implement request correlation IDs via middleware
- Add telemetry endpoint with DB health and migration version
- Add comprehensive test coverage (27 tests)
- Update PR-016 spec with implementation details
@raychrisgdp raychrisgdp self-assigned this Jan 3, 2026
@raychrisgdp raychrisgdp marked this pull request as draft January 3, 2026 04:12
- Replace magic values with constants (HTTP_OK, UUID_LENGTH, etc.)
- Move imports to top level
- Remove unused imports
- Fix PLR2004 and PLC0415 violations
- Add model_validator to TaskUpdate to reject title: null (prevents DB integrity errors)
- Fix async generator return type annotation in test fixture
- Add noqa comment for magic number in pagination test
- Add test for null title rejection

Fixes CI/CD issues: mypy errors and ruff warnings
feat(api): implement task CRUD API endpoints
- Add structured JSON logging with redaction filter
- Implement request correlation IDs via middleware
- Add telemetry endpoint with DB health and migration version
- Add comprehensive test coverage (27 tests)
- Update PR-016 spec with implementation details
- Replace magic values with constants (HTTP_OK, UUID_LENGTH, etc.)
- Move imports to top level
- Remove unused imports
- Fix PLR2004 and PLC0415 violations
- Set logger level explicitly for backend.middleware logger
- Use caplog.at_level() with specific logger name to capture logs
- Fixes test failures where logs weren't being captured
- Set logger.propagate = True to ensure logs reach root logger
- Set root logger level explicitly for caplog capture
- Fixes test failures after rebase onto main
- Configure logger at module level to ensure logs are captured
- Set logger levels to DEBUG in test functions for better isolation
- Ensures tests pass when run individually or with PR-016 test suite
- Note: test isolation issue persists when running full test suite in parallel
- Merge tasks router and telemetry router in main.py
- Keep logger configuration in test_middleware.py
- Keep type annotation in api/v1/__init__.py
…dler

- Use custom LogCaptureHandler instead of caplog for better isolation
- Set propagate=False to avoid interference from setup_logging()
- Ensure logger is enabled and configured right before request
- Add verification checks to ensure logger is properly configured
- Added details on structured logging and telemetry configuration.
- Updated logging section with environment variables and examples.
- Included information about the telemetry endpoint and its usage.
- Clarified request_id handling in logging format.
@raychrisgdp raychrisgdp changed the title feat: implement PR-016 observability baseline feat(api): add observability baseline with structured logging and telemetry Jan 3, 2026
@raychrisgdp raychrisgdp marked this pull request as ready for review January 3, 2026 08:20
- Added 'make lock' target to update the uv.lock file after modifying pyproject.toml.
- Enhanced 'make dev' and 'make install-all' to use 'uv sync' if uv.lock exists, ensuring consistent dependency installation.
- Updated CI workflow to reflect changes in dependency installation logic.
@raychrisgdp raychrisgdp merged commit b399897 into main Jan 3, 2026
2 checks passed
@raychrisgdp raychrisgdp deleted the feature/observability-baseline branch January 3, 2026 08:44