
Commit f3e89fc

Browse files
Monitoring page (#18)
1 parent 21829bb commit f3e89fc

File tree

8 files changed: +200, -6 lines


guides/monitoring/dashboard1.jpg (47.1 KB)

guides/monitoring/dashboard2.jpg (105 KB)

guides/monitoring/dashboard3.jpg (83.8 KB)

guides/monitoring/dashboards.jpg (53.4 KB)

guides/monitoring/index.md

Lines changed: 200 additions & 6 deletions
@@ -6,7 +6,7 @@ review_date: 2026-04-17
guide_category:
- Managing operations
guide_category_order: 1
-guide_description: Talking about metrics and observability for TrustGraph deployments using Prometheus and Grafana.
+guide_description: Monitor TrustGraph deployments using Prometheus and Grafana
guide_difficulty: intermediate
guide_time: 30 min
guide_emoji: 📈
@@ -17,11 +17,205 @@ guide_labels:
- Operations
---

-{: .wip }
-> This page is planned but not yet complete.
# Monitoring

-FIXME: Coming soon

{% capture requirements %}
<ul style="margin: 0; padding-left: 20px;">
<li>A running TrustGraph deployment</li>
<li>Access to Grafana (port 3000)</li>
<li>Access to Prometheus (port 9090)</li>
</ul>
{% endcapture %}

{% include guide/guide-intro-box.html
   description=page.guide_description
   difficulty=page.guide_difficulty
   duration=page.guide_time
   you_will_need=requirements
   goal="Access metrics, logs, and dashboards to monitor TrustGraph system health and performance."
%}
## Observability Stack

TrustGraph deployments include a complete observability stack for monitoring system health and performance.

**Components:**

- **Prometheus** - Time-series metrics database that collects and stores metrics from all TrustGraph services
- **Grafana** - Visualization platform providing dashboards for metrics and logs
- **Loki** - Log aggregation system that collects logs from TrustGraph components

**What's monitored:**

TrustGraph components expose metrics and logs automatically. The monitoring stack currently captures:

- All TrustGraph processing components (flows, queues, processors)
- API gateway request/response metrics
- Pulsar message queue statistics
- System resource usage

**Note:** Infrastructure components (Cassandra, Pulsar, etc.) are not yet integrated into the monitoring stack but may be added in future releases.
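As a rough sketch of how Prometheus discovers these services, a scrape configuration might look like the following. The job names, hostnames, and ports here are illustrative assumptions, not the configuration shipped with TrustGraph:

```yaml
# prometheus.yml (illustrative sketch -- job names and targets are assumptions)
global:
  scrape_interval: 15s          # how often to pull metrics from each target

scrape_configs:
  - job_name: trustgraph-processors   # flows, queues, processors
    static_configs:
      - targets: ["processor:8000"]   # hypothetical metrics endpoint
  - job_name: api-gateway
    static_configs:
      - targets: ["api-gateway:8000"] # hypothetical metrics endpoint
```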
## Accessing Grafana

Grafana provides the primary interface for viewing metrics and logs.

Access Grafana at `http://localhost:3000` (or your deployment hostname).

Default credentials:

- Username: `admin`
- Password: `admin`

![Grafana overview screen](overview.jpg)

The home screen shows:

- Recent dashboards
- Starred dashboards
- Navigation to logs and metrics
## Available Dashboards

Grafana includes pre-configured dashboards for monitoring TrustGraph:

![Grafana dashboards list](dashboards.jpg)

**TrustGraph Dashboard** - Main monitoring dashboard showing:

- Flow processing rates
- Queue depths and throughput
- API request metrics
- System resource usage
- Processing latency

**Custom Dashboards** - Create additional dashboards for:

- Specific flow instances
- Document processing metrics
- LLM usage and costs
- Custom business metrics
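If you want custom dashboards to survive container restarts, Grafana's file-based provisioning is one option. A sketch follows; the file paths, folder, and provider name are assumptions about your deployment, not TrustGraph defaults:

```yaml
# /etc/grafana/provisioning/dashboards/custom.yaml (illustrative)
apiVersion: 1
providers:
  - name: trustgraph-custom       # hypothetical provider name
    folder: TrustGraph            # folder shown in the Grafana UI
    type: file
    options:
      path: /var/lib/grafana/dashboards   # drop dashboard JSON files here
```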
## Viewing Logs

Loki collects logs from all TrustGraph components, providing centralized log access.

![TrustGraph logs](logs.jpg)

**Log sources:**

- Processing flows (document-load, graph-rag, etc.)
- API gateway requests
- Initialization services
- Error messages and stack traces

**Query logs:**

- Use the Explore interface in Grafana
- Filter by component, log level, or time range
- Search log content for debugging

**Current limitation:** Only TrustGraph components send logs to Loki. Infrastructure components (Cassandra, Pulsar) log to their own systems.
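In the Explore interface, these filters translate into LogQL queries. The label name used below (`container`) is an assumption about how your deployment tags log streams; substitute whatever labels your Loki setup applies:

```logql
# All logs from a single component (label name is an assumption)
{container="api-gateway"}

# Only lines containing "error"
{container="api-gateway"} |= "error"

# Error lines per minute, broken down by container
sum by (container) (count_over_time({container=~".+"} |= "error" [1m]))
```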
## TrustGraph Dashboard

The main TrustGraph dashboard provides comprehensive system monitoring:

![TrustGraph dashboard part 1](dashboard1.jpg)

**Top section shows:**

- Knowledge backlog - backlog on knowledge extraction queues
- Graph and triple load backlog
- LLM latency as a heatmap
- Error rates

![TrustGraph dashboard part 2](dashboard2.jpg)

**Middle section displays:**

- Request rates per queue
- Pub/sub queue backlogs
- Chunk size histogram
- Number of rate-limit events

![TrustGraph dashboard part 3](dashboard3.jpg)

**Bottom section includes:**

- Resource utilization (CPU, memory)
- Models in use, with token counts
- Token usage
- Token cost (based on the token cost configuration)
## Using Prometheus

Prometheus provides direct access to raw metrics data for custom queries and analysis.

Access Prometheus at `http://localhost:9090`.

![Prometheus web UI](prometheus.jpg)

**Use Prometheus to:**

- Execute custom PromQL queries
- Explore available metrics
- Test alert expressions
- Analyze time-series data
- Debug metric collection

**Example queries:**

```promql
# Message processing rate (per second, averaged over the last minute)
rate(trustgraph_messages_processed_total[1m])

# Queue depth for a specific flow
trustgraph_queue_depth{flow="default"}

# API request latency (95th percentile)
histogram_quantile(0.95, rate(trustgraph_api_duration_seconds_bucket[5m]))
```
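The same queries can be issued programmatically via Prometheus's HTTP API (`GET /api/v1/query`). Below is a minimal sketch: the URL shape and JSON response layout follow the Prometheus API, while the base URL and helper names are assumptions for illustration. The parsing step runs against a canned sample response rather than a live server:

```python
import json
from urllib.parse import urlencode

PROM_URL = "http://localhost:9090"  # assumption: default Prometheus port


def instant_query_url(promql: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{PROM_URL}/api/v1/query?" + urlencode({"query": promql})


def extract_samples(response: dict) -> dict:
    """Map each series' label set to its sampled value.

    Instant-vector responses look like:
    {"status": "success", "data": {"resultType": "vector",
        "result": [{"metric": {...labels...}, "value": [<ts>, "<value>"]}]}}
    """
    return {
        json.dumps(series["metric"], sort_keys=True): float(series["value"][1])
        for series in response["data"]["result"]
    }


# Canned sample in the documented response shape (not live data)
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"flow": "default"}, "value": [1700000000, "42"]}],
    },
}

print(instant_query_url('trustgraph_queue_depth{flow="default"}'))
print(extract_samples(sample))
```

Fetching the URL (with `urllib.request` or `requests`) and feeding the decoded JSON to `extract_samples` gives you label-to-value pairs suitable for scripts or external dashboards.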
## Monitoring Best Practices

**Regular checks:**

- Monitor queue depths - growing queues indicate processing bottlenecks
- Track error rates - spikes suggest configuration or resource issues
- Watch processing latency - increasing latency means slower responses
- Review logs for errors - catch issues before they impact users

**Set up alerts for:**

- Queue depth exceeding thresholds
- Error rates above acceptable levels
- Processing failures
- Resource exhaustion (CPU, memory, disk)
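As a sketch, a queue-depth alert might look like the rule below. The metric name comes from the example queries on this page; the threshold, hold duration, and labels are assumptions you would tune for your deployment:

```yaml
# alert-rules.yml (illustrative -- threshold and labels are assumptions)
groups:
  - name: trustgraph
    rules:
      - alert: QueueBacklogHigh
        expr: trustgraph_queue_depth{flow="default"} > 1000
        for: 10m                 # condition must hold 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Queue depth for flow {{ $labels.flow }} above 1000 for 10m"
```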
**Performance tuning:**

- Use metrics to identify slow processors
- Balance flow processing resources
- Optimize queue configurations
- Scale components based on load patterns
189+
## Troubleshooting with Metrics
190+
191+
**Queue backlog growing:**
192+
- Check processor resource limits
193+
- Verify flow configuration
194+
- Look for processing errors in logs
195+
- Consider scaling processors
196+
197+
**High error rates:**
198+
- Filter logs by error level
199+
- Identify failing components
200+
- Check resource availability
201+
- Review recent configuration changes
202+
203+
**Slow API responses:**
204+
- Check API gateway metrics
205+
- Review queue processing times
206+
- Verify LLM response latency
207+
- Examine database query performance
208+
209+
**Resource exhaustion:**
210+
- Monitor CPU and memory usage
211+
- Identify resource-hungry components
212+
- Adjust container limits
213+
- Scale horizontally if needed
214+
215+
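To make "queue backlog growing" concrete, a trend check like the following can distinguish a steadily climbing queue from a normal burst. The metric name appears in the example queries on this page; the windows and thresholds are assumptions:

```promql
# Positive per-second trend in queue depth over the last 15 minutes
deriv(trustgraph_queue_depth{flow="default"}[15m]) > 0

# Depth grew by more than 500 over the last 15 minutes
(trustgraph_queue_depth{flow="default"}
  - trustgraph_queue_depth{flow="default"} offset 15m) > 500
```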
## Next Steps

-This page will contain guides for metrics, alerts, and observability in TrustGraph deployments.

- Configure alert rules in Prometheus
- Create custom Grafana dashboards for your workflows
- Export metrics to external monitoring systems
- Set up long-term metric retention
- Integrate infrastructure component metrics

guides/monitoring/logs.jpg (126 KB)

guides/monitoring/overview.jpg (324 KB)

guides/monitoring/prometheus.jpg (152 KB)
