43 monitoring implement prometheus grafana with alerting #75
…ll monitoring stack
…enai_tokens_used_total counter
…ll stack observability
…h custom metrics for tokens and exceptions
…erformance and request metrics
…se performance and operations
… and memory utilization metrics
…and HTTP metrics for presentation clarity
…e for comprehensive observability
…trics collection and alerting
…nd request monitoring
Pull Request Overview
This PR adds a full observability stack for SkillForge, including Prometheus, Grafana, Loki, and AlertManager, and instruments services with consistent metrics and version tagging.
- Added `APP_VERSION` and `management.metrics.tags` to service configs for uniform Prometheus labeling
- Integrated Micrometer counters into user service and provided debug endpoints for testing
- Provisioned Prometheus scrape rules, alerting, Grafana dashboards, Loki config, and test scripts
Reviewed Changes
Copilot reviewed 43 out of 54 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| server/skillforge-user/src/main/resources/application.yml | Define spring.application.version and add management.metrics.tags |
| server/skillforge-user/src/main/java/com/gitittogether/skillForge/server/user/service/user/UserServiceImpl.java | Inject and increment signup/auth-failure counters |
| server/skillforge-user/src/main/java/com/gitittogether/skillForge/server/user/controller/DebugAuthController.java | Register debug auth-failure counter but missing trigger endpoint |
| monitoring/prometheus/alert.rules.yml | Alert rule for UserAuthFailuresHigh |
| monitoring/scripts/test_user_auth_failures.py | Script to generate high auth-failure traffic |
| monitoring/grafana/README.md | Dashboard overview documentation |
Comments suppressed due to low confidence (2)
monitoring/grafana/README.md:16
- The README refers to `mongo.json`, but the actual file is `mongodb.json`. Update the table entry to match the correct filename.
| `mongo.json` | MongoDB internals | Prometheus |
| */ | ||
| @RestController | ||
| @RequestMapping("/api/v1/debug") | ||
| public class DebugAuthController { |
Copilot
AI
Jul 16, 2025
The DebugAuthController defines a counter but no endpoint to trigger increments. Consider adding a mapping method (e.g., @GetMapping) to call authFailureCounter.increment() for testing.
| import org.springframework.stereotype.Component; | ||
| import org.springframework.web.server.ServerWebExchange; | ||
| import reactor.core.publisher.Mono; | ||
|
|
||
| @Profile("dev") | ||
| @Slf4j | ||
| @Component |
Copilot
AI
Jul 16, 2025
[nitpick] Removing @Profile("dev") makes this logging filter active in all environments, which could degrade performance in production. Consider scoping it to non-production profiles or making it conditional.
| GATEWAY_URL = "http://server.localhost:8081/api/v1/users/login" |
Copilot
AI
Jul 16, 2025
The GATEWAY_URL uses 'server.localhost', which may not resolve in most environments; consider using 'localhost' or an environment variable for the host.
| GATEWAY_URL = "http://server.localhost:8081/api/v1/users/login" | |
| import os | |
| GATEWAY_URL = f"http://{os.getenv('GATEWAY_HOST', 'localhost')}:8081/api/v1/users/login" |
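A minimal sketch of the environment-variable approach in the suggestion, extended so the port can be overridden as well. `GATEWAY_HOST`, the default port, and the login path come from the suggestion; `GATEWAY_PORT` and the helper name `build_gateway_url` are hypothetical and not part of the PR.

```python
import os

LOGIN_PATH = "/api/v1/users/login"  # path taken from the original script


def build_gateway_url(env=None):
    """Build the login URL from environment variables, defaulting to localhost:8081."""
    env = os.environ if env is None else env
    host = env.get("GATEWAY_HOST", "localhost")
    port = env.get("GATEWAY_PORT", "8081")  # GATEWAY_PORT is a hypothetical variable
    return f"http://{host}:{port}{LOGIN_PATH}"


GATEWAY_URL = build_gateway_url()
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment; setting `GATEWAY_HOST=server.localhost` would reproduce the original URL.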
| severity: warning | ||
| annotations: | ||
| summary: "High user authentication failure count" | ||
| description: "More than 100 failed logins in the past minute." |
Copilot
AI
Jul 16, 2025
The alert description does not match the expr (increase >50 over 2m). Update either the threshold in the expr or the description to reflect 'more than 50 failed logins in 2 minutes'.
| description: "More than 100 failed logins in the past minute." | |
| description: "More than 50 failed logins in the past 2 minutes." |
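To make the threshold semantics concrete, here is a rough Python sketch of what "increase > 50 over 2m" means for a counter. It is a simplification, not how Prometheus computes it: the real `increase()` also extrapolates to the window boundaries and handles counter resets. The function names and the sample data are hypothetical.

```python
def simple_increase(samples, window_seconds, now):
    """Approximate a counter's increase over the trailing window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Simplification: no extrapolation and no counter-reset handling.
    """
    in_window = [v for t, v in samples if now - window_seconds <= t <= now]
    if len(in_window) < 2:
        return 0.0
    return float(in_window[-1] - in_window[0])


def should_fire(samples, now, threshold=50, window_seconds=120):
    """True when the increase over the window strictly exceeds the threshold."""
    return simple_increase(samples, window_seconds, now) > threshold


# Hypothetical scrape data: 30-second interval, cumulative failed logins.
samples = [(0, 10), (30, 25), (60, 40), (90, 55), (120, 70)]
fires = should_fire(samples, now=120)  # increase is 70 - 10 = 60, above 50
```

Under these assumptions the alert fires at 60 failures in two minutes, which the original description ("more than 100 ... in the past minute") would have misreported.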
| if (userSignupCounter != null) { | ||
| userSignupCounter.increment(); | ||
| } |
Copilot
AI
Jul 16, 2025
With constructor injection, these counter beans will not be null at runtime, so the null-check is unnecessary. You can remove the guard to simplify the code.
| if (userSignupCounter != null) { | |
| userSignupCounter.increment(); | |
| } | |
| userSignupCounter.increment(); |
| - --compatible-mode | ||
| - --web.listen-address=:9216 | ||
| environment: | ||
| - MONGODB_URI=mongodb://${MONGODB_USERNAME:-root}:${MONGODB_PASSWORD:-password}@mongo:27017/admin?authSource=admin |
- Use the `MONGO_URL` environment variable.
- Background: local and cloud MongoDB URIs do not share the same structure.
- Solution: pass the full URL in as an environment variable here.
| - MONGODB_URI=mongodb://${MONGODB_USERNAME:-root}:${MONGODB_PASSWORD:-password}@mongo:27017/admin?authSource=admin | |
| - MONGODB_URI=${MONGO_URL} |
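The reasoning behind this suggestion can be sketched in Python: take the full connection string from `MONGO_URL` when it is provided, so that no particular URI structure is assumed. The fallback to the local default is an assumption added for illustration (the suggestion itself uses `${MONGO_URL}` directly), and `resolve_mongo_uri` is a hypothetical helper name.

```python
import os


def resolve_mongo_uri(env=None):
    """Return MONGO_URL when set; otherwise build the local default URI."""
    env = os.environ if env is None else env
    url = env.get("MONGO_URL")
    if url:
        # Cloud URIs (for example the mongodb+srv:// scheme) are passed through untouched.
        return url
    # Local fallback mirrors the defaults in the original compose line (assumption).
    user = env.get("MONGODB_USERNAME", "root")
    password = env.get("MONGODB_PASSWORD", "password")
    return f"mongodb://{user}:{password}@mongo:27017/admin?authSource=admin"
```

Treating the URL as an opaque value avoids assembling it from username, password, host, and port, which is the part that differs between local and cloud deployments.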
| mongo-exporter: | ||
| image: percona/mongodb_exporter:0.30 | ||
| command: | ||
| - --mongodb.uri=mongodb://${MONGODB_USERNAME:-root}:${MONGODB_PASSWORD:-password}@mongo:27017/admin?authSource=admin |
- Use the `MONGO_URL` environment variable.
- Background: local and cloud MongoDB URIs do not share the same structure.
- Solution: pass the full URL in as an environment variable here.
| - --mongodb.uri=mongodb://${MONGODB_USERNAME:-root}:${MONGODB_PASSWORD:-password}@mongo:27017/admin?authSource=admin | |
| - --mongodb.uri=${MONGO_URL} |
GravityDarkLab
left a comment
Small changes are needed in the Docker Compose file.
Other than that, LGTM 🚀 - The documentation is also amazing 👏
Add Observability and Monitoring Stack
Overview
This PR introduces a comprehensive observability and monitoring solution for the SkillForge microservices architecture. It establishes a containerized stack with Prometheus, Grafana, Loki, and AlertManager to provide real-time metrics, logging, and alerting capabilities.
Components Added
Documentation
Three main documentation files are included:
- `README.md`: Overview of the stack, components, and quick-start guide
- `grafana/README.md`: Detailed dashboard documentation with testing instructions
- `ALERTS.md`: Alert system reference with screenshots and links to UIs
Key Features
Testing
All components have been tested locally and include step-by-step instructions for:
This monitoring solution enables the team to quickly detect issues, troubleshoot performance problems, and maintain system reliability across all SkillForge services.