Conversation

@mahdibayouli
Collaborator

Add Observability and Monitoring Stack

Overview

This PR introduces a comprehensive observability and monitoring solution for the SkillForge microservices architecture. It establishes a containerized stack with Prometheus, Grafana, Loki, and AlertManager to provide real-time metrics, logging, and alerting capabilities.

Components Added

  • Prometheus: Central metrics collector with alert rules
  • Grafana: Visualization platform with pre-configured dashboards
  • Loki + Promtail: Centralized logging solution
  • AlertManager: Alert routing and notifications
  • MailHog: Local SMTP server for alert testing
  • MongoDB Exporter: Database metrics collection
  • Test Scripts: Tools for generating test traffic and alerts

Documentation

Three main documentation files are included:

  • README.md: Overview of the stack, components, and quick-start guide
  • grafana/README.md: Detailed dashboard documentation with testing instructions
  • ALERTS.md: Alert system reference with screenshots and links to UIs

Key Features

  • Auto-provisioned dashboards for all services (GenAI, MongoDB, Spring Boot, User Service, Logs)
  • Predefined alert rules for service availability, JVM heap usage, and security issues
  • Centralized logging with search capabilities
  • Local testing capability for metrics and alerts without external dependencies
  • Environment variable configuration for flexible port mapping

Testing

All components have been tested locally and include step-by-step instructions for:

  • Generating sample metrics and logs
  • Triggering and resolving alerts
  • Visualizing application performance

This monitoring solution enables the team to quickly detect issues, troubleshoot performance problems, and maintain system reliability across all SkillForge services.

@mahdibayouli self-assigned this on Jul 16, 2025
@mahdibayouli linked an issue on Jul 16, 2025 that may be closed by this pull request (18 tasks)
@mahdibayouli added the documentation and monitoring labels on Jul 16, 2025
Contributor

Copilot AI left a comment

Pull Request Overview

This PR adds a full observability stack for SkillForge, including Prometheus, Grafana, Loki, and AlertManager, and instruments services with consistent metrics and version tagging.

  • Added APP_VERSION and management.metrics.tags to service configs for uniform Prometheus labeling
  • Integrated Micrometer counters into user service and provided debug endpoints for testing
  • Provisioned Prometheus scrape rules, alerting, Grafana dashboards, Loki config, and test scripts

Reviewed Changes

Copilot reviewed 43 out of 54 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| server/skillforge-user/src/main/resources/application.yml | Define spring.application.version and add management.metrics.tags |
| server/skillforge-user/src/main/java/com/gitittogether/skillForge/server/user/service/user/UserServiceImpl.java | Inject and increment signup/auth-failure counters |
| server/skillforge-user/src/main/java/com/gitittogether/skillForge/server/user/controller/DebugAuthController.java | Register debug auth-failure counter but missing trigger endpoint |
| monitoring/prometheus/alert.rules.yml | Alert rule for UserAuthFailuresHigh |
| monitoring/scripts/test_user_auth_failures.py | Script to generate high auth-failure traffic |
| monitoring/grafana/README.md | Dashboard overview documentation |
Comments suppressed due to low confidence (2)

monitoring/grafana/README.md:16

  • The README refers to mongo.json, but the actual file is mongodb.json. Update the table entry to match the correct filename.
| `mongo.json`             | MongoDB internals       | Prometheus          |

*/
@RestController
@RequestMapping("/api/v1/debug")
public class DebugAuthController {

Copilot AI Jul 16, 2025

The DebugAuthController defines a counter but no endpoint to trigger increments. Consider adding a mapping method (e.g., @GetMapping) to call authFailureCounter.increment() for testing.

Comment on lines 6 to 11
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import reactor.core.publisher.Mono;

@Profile("dev")
@Slf4j
@Component

Copilot AI Jul 16, 2025

[nitpick] Removing @Profile("dev") makes this logging filter active in all environments, which could degrade performance in production. Consider scoping it to non-production profiles or making it conditional.

Comment on lines +3 to +4

GATEWAY_URL = "http://server.localhost:8081/api/v1/users/login"

Copilot AI Jul 16, 2025

The GATEWAY_URL uses 'server.localhost', which may not resolve in most environments; consider using 'localhost' or an environment variable for the host.

Suggested change
GATEWAY_URL = "http://server.localhost:8081/api/v1/users/login"
import os
GATEWAY_URL = f"http://{os.getenv('GATEWAY_HOST', 'localhost')}:8081/api/v1/users/login"
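
A slightly fuller sketch of the suggestion above, assuming the script stays in Python. `GATEWAY_HOST` comes from the suggested change; `GATEWAY_PORT` and the `build_login_url` helper are hypothetical names introduced here for illustration, and the 8081 default is taken from the original constant.

```python
import os


def build_login_url(env=None):
    """Build the login URL from environment variables, with local defaults."""
    env = os.environ if env is None else env
    host = env.get("GATEWAY_HOST", "localhost")
    # GATEWAY_PORT is a hypothetical variable; 8081 matches the original constant.
    port = env.get("GATEWAY_PORT", "8081")
    return f"http://{host}:{port}/api/v1/users/login"


GATEWAY_URL = build_login_url()
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment; setting `GATEWAY_HOST=server.localhost` would reproduce the original URL on machines where that name resolves.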

severity: warning
annotations:
summary: "High user authentication failure count"
description: "More than 100 failed logins in the past minute."

Copilot AI Jul 16, 2025

The alert description does not match the expr (increase >50 over 2m). Update either the threshold in the expr or the description to reflect 'more than 50 failed logins in 2 minutes'.

Suggested change
description: "More than 100 failed logins in the past minute."
description: "More than 50 failed logins in the past 2 minutes."
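
One way to keep the description from drifting out of sync with the expression is to derive both from the same two numbers. The sketch below is a hypothetical helper, not part of the PR: the metric name `user_auth_failures_total` is a placeholder (the actual expression in alert.rules.yml is not shown in this thread), and the threshold and window are the ones quoted in the review comment.

```python
def auth_failure_alert(threshold, window_minutes, metric="user_auth_failures_total"):
    """Return an expr and a description built from the same threshold and window.

    The default metric name is a placeholder, not the one used in the PR.
    """
    unit = "minute" if window_minutes == 1 else f"{window_minutes} minutes"
    return {
        "expr": f"increase({metric}[{window_minutes}m]) > {threshold}",
        "description": f"More than {threshold} failed logins in the past {unit}.",
    }


rule = auth_failure_alert(threshold=50, window_minutes=2)
print(rule["expr"])
print(rule["description"])
```

Generating the rule text this way is only one option; a simpler alternative is a small check in CI that compares the numbers in the description against those in the expression.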

Comment on lines +64 to +66
if (userSignupCounter != null) {
    userSignupCounter.increment();
}

Copilot AI Jul 16, 2025

With constructor injection, these counter beans will not be null at runtime, so the null-check is unnecessary. You can remove the guard to simplify the code.

Suggested change
if (userSignupCounter != null) {
    userSignupCounter.increment();
}
userSignupCounter.increment();

- --compatible-mode
- --web.listen-address=:9216
environment:
- MONGODB_URI=mongodb://${MONGODB_USERNAME:-root}:${MONGODB_PASSWORD:-password}@mongo:27017/admin?authSource=admin
Collaborator

  • Use the MONGO_URL environment variable
  • Background: local and cloud MongoDB URIs do not share the same structure
  • Solution: pass the full URL in as an environment variable here
Suggested change
- MONGODB_URI=mongodb://${MONGODB_USERNAME:-root}:${MONGODB_PASSWORD:-password}@mongo:27017/admin?authSource=admin
- MONGODB_URI=${MONGO_URL}
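
The reasoning behind this suggestion can be sketched in Python: a local URI and a cloud URI differ in scheme and structure, so assembling the URI from a username and password only covers the local case, while a single URL variable covers both. `resolve_mongo_uri` is a hypothetical helper for illustration only; the variable names and defaults mirror the compose snippet above, and the fallback branch is an addition that the suggested change itself does not include.

```python
import os


def resolve_mongo_uri(env=None):
    """Prefer a complete MONGO_URL; otherwise fall back to the local compose defaults."""
    env = os.environ if env is None else env
    url = env.get("MONGO_URL")
    if url:
        # A full URL works for any deployment, e.g. a cloud mongodb+srv:// address.
        return url
    user = env.get("MONGODB_USERNAME", "root")
    password = env.get("MONGODB_PASSWORD", "password")
    return f"mongodb://{user}:{password}@mongo:27017/admin?authSource=admin"
```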

mongo-exporter:
image: percona/mongodb_exporter:0.30
command:
- --mongodb.uri=mongodb://${MONGODB_USERNAME:-root}:${MONGODB_PASSWORD:-password}@mongo:27017/admin?authSource=admin
Collaborator

  • Use the MONGO_URL environment variable
  • Background: local and cloud MongoDB URIs do not share the same structure
  • Solution: pass the full URL in as an environment variable here
Suggested change
- --mongodb.uri=mongodb://${MONGODB_USERNAME:-root}:${MONGODB_PASSWORD:-password}@mongo:27017/admin?authSource=admin
- --mongodb.uri=${MONGO_URL}

Collaborator

@GravityDarkLab left a comment

Small changes in the docker compose file.
Other than that, LGTM 🚀 The documentation is also amazing 👏

@mahdibayouli merged commit d45be19 into main on Jul 17, 2025 (9 checks passed)
@GravityDarkLab deleted the 43-monitoring-implement-prometheus-grafana-with-alerting branch on July 20, 2025 02:48
Development

Successfully merging this pull request may close these issues.

[Monitoring] Implement Prometheus & Grafana with Alerting

3 participants