review_date: 2026-04-17
guide_category:
  - Managing operations
guide_category_order: 1
guide_description: Monitor TrustGraph deployments using Prometheus and Grafana
guide_difficulty: intermediate
guide_time: 30 min
guide_emoji: 📈
guide_labels:
  - Operations
---

# Monitoring

{% capture requirements %}
<ul style="margin: 0; padding-left: 20px;">
  <li>A running TrustGraph deployment</li>
  <li>Access to Grafana (port 3000)</li>
  <li>Access to Prometheus (port 9090)</li>
</ul>
{% endcapture %}

{% include guide/guide-intro-box.html
description=page.guide_description
difficulty=page.guide_difficulty
duration=page.guide_time
you_will_need=requirements
goal="Access metrics, logs, and dashboards to monitor TrustGraph system health and performance."
%}

## Observability Stack

TrustGraph deployments include a complete observability stack for monitoring system health and performance.

**Components:**

- **Prometheus** - Time-series metrics database that collects and stores metrics from all TrustGraph services
- **Grafana** - Visualization platform providing dashboards for metrics and logs
- **Loki** - Log aggregation system that collects logs from TrustGraph components

**What's monitored:**

TrustGraph components expose metrics and logs automatically. The monitoring stack currently captures:

- All TrustGraph processing components (flows, queues, processors)
- API gateway request/response metrics
- Pulsar message queue statistics
- System resource usage

**Note:** Infrastructure components (Cassandra, Pulsar, etc.) are not yet integrated into the monitoring stack but may be added in future releases.

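To see exactly which services are being scraped, you can ask Prometheus for its target list. The sketch below is illustrative only, assuming Prometheus on the default port 9090 (as listed in the requirements) and the `requests` library installed.

```python
# list_targets.py - show which services Prometheus is scraping and their health
import requests

PROM_URL = "http://localhost:9090"  # change to your deployment hostname

def list_targets(base_url: str) -> None:
    # /api/v1/targets reports every scrape target and whether its last scrape succeeded
    resp = requests.get(f"{base_url}/api/v1/targets", timeout=10)
    resp.raise_for_status()
    for target in resp.json()["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        print(f"{job:30} {target['scrapeUrl']:50} {target['health']}")

if __name__ == "__main__":
    list_targets(PROM_URL)
```
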
## Accessing Grafana

Grafana provides the primary interface for viewing metrics and logs.

Access Grafana at `http://localhost:3000` (or your deployment hostname).

Default credentials:

- Username: `admin`
- Password: `admin`

![Grafana overview screen](overview.jpg)

The home screen shows:

- Recent dashboards
- Starred dashboards
- Navigation to logs and metrics

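If you want to script a quick reachability check before opening the UI, Grafana exposes a health endpoint at `/api/health`. This is a minimal sketch, assuming the default port 3000 and the `requests` library installed; adjust the hostname for your deployment.

```python
# check_grafana.py - verify Grafana is reachable before opening dashboards
import requests

GRAFANA_URL = "http://localhost:3000"  # change to your deployment hostname

def check_grafana(base_url: str) -> None:
    # /api/health needs no authentication and reports database status
    resp = requests.get(f"{base_url}/api/health", timeout=5)
    resp.raise_for_status()
    health = resp.json()
    print(f"Grafana version: {health.get('version')}")
    print(f"Database: {health.get('database')}")

if __name__ == "__main__":
    check_grafana(GRAFANA_URL)
```
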
## Available Dashboards

Grafana includes pre-configured dashboards for monitoring TrustGraph:

![Grafana dashboards list](dashboards.jpg)

**TrustGraph Dashboard** - Main monitoring dashboard showing:

- Flow processing rates
- Queue depths and throughput
- API request metrics
- System resource usage
- Processing latency

**Custom Dashboards** - Create additional dashboards for:

- Specific flow instances
- Document processing metrics
- LLM usage and costs
- Custom business metrics

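To see which dashboards a deployment actually ships with, you can also list them through Grafana's HTTP API. This is a minimal sketch, assuming the default `admin`/`admin` credentials and port 3000 shown above; the search endpoint returns each dashboard's title and URL.

```python
# list_dashboards.py - list dashboards provisioned in Grafana
import requests

GRAFANA_URL = "http://localhost:3000"
AUTH = ("admin", "admin")  # default credentials; change if you have updated them

def list_dashboards(base_url: str) -> None:
    # /api/search returns dashboards (type=dash-db) matching the query
    resp = requests.get(
        f"{base_url}/api/search",
        params={"query": "", "type": "dash-db"},
        auth=AUTH,
        timeout=5,
    )
    resp.raise_for_status()
    for item in resp.json():
        print(f"{item['title']}  ->  {base_url}{item['url']}")

if __name__ == "__main__":
    list_dashboards(GRAFANA_URL)
```
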
## Viewing Logs

Loki collects logs from all TrustGraph components, providing centralized log access.

![TrustGraph logs](logs.jpg)

**Log sources:**

- Processing flows (document-load, graph-rag, etc.)
- API gateway requests
- Initialization services
- Error messages and stack traces

**Query logs:**

- Use the Explore interface in Grafana
- Filter by component, log level, or time range
- Search log content for debugging

**Current limitation:** Only TrustGraph components send logs to Loki. Infrastructure components (Cassandra, Pulsar) log to their own systems.

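Logs can also be pulled straight from Loki's HTTP API, which is handy for scripting checks outside Grafana. This is a minimal sketch with loudly stated assumptions: it assumes Loki is reachable on its default port 3100 (in some deployments Loki is only reachable from inside the container network, via Grafana), and the label selector and `ERROR` filter below are placeholders to adapt to your deployment's log labels.

```python
# recent_errors.py - pull recent error lines from Loki
import time
import requests

LOKI_URL = "http://localhost:3100"  # assumed default Loki port; may not be exposed
# Assumed LogQL query: any stream with a non-empty "job" label, filtered to lines containing "ERROR"
QUERY = '{job=~".+"} |= "ERROR"'

def recent_errors(base_url: str, minutes: int = 15) -> None:
    now_ns = int(time.time() * 1e9)
    start_ns = now_ns - minutes * 60 * 10**9
    # /loki/api/v1/query_range runs a LogQL query over a time window
    resp = requests.get(
        f"{base_url}/loki/api/v1/query_range",
        params={"query": QUERY, "start": start_ns, "end": now_ns, "limit": 50},
        timeout=10,
    )
    resp.raise_for_status()
    for stream in resp.json()["data"]["result"]:
        labels = stream["stream"]
        for _ts, line in stream["values"]:
            print(f"[{labels}] {line}")

if __name__ == "__main__":
    recent_errors(LOKI_URL)
```
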
## TrustGraph Dashboard

The main TrustGraph dashboard provides comprehensive system monitoring:

![TrustGraph dashboard part 1](dashboard1.jpg)

**Top section shows:**

- Knowledge backlog - backlog on the knowledge extraction queues
- Graph and triple load backlog
- LLM latency, shown as a heatmap
- Error rates

![TrustGraph dashboard part 2](dashboard2.jpg)

**Middle section displays:**

- Request rates per queue
- Pub/sub queue backlogs
- Histogram of chunk size counts
- Count of rate-limit events

![TrustGraph dashboard part 3](dashboard3.jpg)

**Bottom section includes:**

- Resource utilization (CPU, memory)
- Models in use, with token counts
- Token usage
- Token cost (based on the configured token costs)

## Using Prometheus

Prometheus provides direct access to raw metrics data for custom queries and analysis.

Access Prometheus at `http://localhost:9090`.

![Prometheus web UI](prometheus.jpg)

**Use Prometheus to:**

- Execute custom PromQL queries
- Explore available metrics
- Test alert expressions
- Analyze time-series data
- Debug metric collection

**Example queries:**

```promql
# Total messages processed per minute
rate(trustgraph_messages_processed_total[1m])

# Queue depth for a specific flow
trustgraph_queue_depth{flow="default"}

# API request latency (95th percentile)
histogram_quantile(0.95, rate(trustgraph_api_duration_seconds_bucket[5m]))
```

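The same queries can be run programmatically against Prometheus's HTTP API, which is useful for scripting health checks or exporting data. This is a minimal sketch, assuming Prometheus on the default port 9090 and the `requests` library installed; the metric names are the illustrative ones used above, so verify them against your deployment (the first function lists what is actually exposed).

```python
# query_prometheus.py - discover TrustGraph metrics and run an instant query
import requests

PROM_URL = "http://localhost:9090"

def list_trustgraph_metrics(base_url: str) -> list[str]:
    # /api/v1/label/__name__/values returns every metric name Prometheus has seen
    resp = requests.get(f"{base_url}/api/v1/label/__name__/values", timeout=10)
    resp.raise_for_status()
    return [m for m in resp.json()["data"] if m.startswith("trustgraph")]

def instant_query(base_url: str, promql: str) -> None:
    # /api/v1/query evaluates a PromQL expression at the current time
    resp = requests.get(f"{base_url}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        print(sample["metric"], "=", sample["value"][1])

if __name__ == "__main__":
    print("TrustGraph metrics:", list_trustgraph_metrics(PROM_URL)[:20])
    # Example metric name from this guide; replace with one reported above
    instant_query(PROM_URL, "rate(trustgraph_messages_processed_total[1m])")
```
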
## Monitoring Best Practices

**Regular checks:**

- Monitor queue depths - growing queues indicate processing bottlenecks
- Track error rates - spikes suggest configuration or resource issues
- Watch processing latency - increasing latency means slower responses
- Review logs for errors - catch issues before they impact users

**Set up alerts for** (a minimal polling sketch follows this list):

- Queue depth exceeding thresholds
- Error rates above acceptable levels
- Processing failures
- Resource exhaustion (CPU, memory, disk)

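Prometheus alerting rules and Alertmanager are the usual way to implement these alerts; as a lightweight starting point, the sketch below simply polls Prometheus and flags a threshold breach. It is illustrative only, assuming the example `trustgraph_queue_depth` metric name used earlier in this guide and an arbitrary queue-depth threshold of 1000; substitute the metric and threshold that match your deployment.

```python
# queue_depth_alert.py - poll Prometheus and warn when a queue depth crosses a threshold
import time
import requests

PROM_URL = "http://localhost:9090"
EXPR = "trustgraph_queue_depth"  # assumed example metric name; verify in Prometheus
THRESHOLD = 1000                 # assumed illustrative threshold

def check_once() -> None:
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": EXPR}, timeout=10)
    resp.raise_for_status()
    for sample in resp.json()["data"]["result"]:
        depth = float(sample["value"][1])
        if depth > THRESHOLD:
            print(f"ALERT: {sample['metric']} depth={depth} exceeds {THRESHOLD}")

if __name__ == "__main__":
    while True:
        check_once()
        time.sleep(60)  # check every minute
```
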
**Performance tuning:**

- Use metrics to identify slow processors
- Balance flow processing resources
- Optimize queue configurations
- Scale components based on load patterns

## Troubleshooting with Metrics

**Queue backlog growing** (see the trend-check sketch after this list):

- Check processor resource limits
- Verify flow configuration
- Look for processing errors in logs
- Consider scaling processors

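To confirm whether a backlog is actually growing rather than just high, compare the queue depth now with its value an hour ago using a Prometheus range query. This is a minimal sketch with the same assumptions as above: default port 9090 and the illustrative `trustgraph_queue_depth` metric name.

```python
# backlog_trend.py - compare queue depth now vs. one hour ago
import time
import requests

PROM_URL = "http://localhost:9090"
EXPR = "trustgraph_queue_depth"  # assumed example metric name; verify in Prometheus

def backlog_trend(window_s: int = 3600, step_s: int = 300) -> None:
    end = time.time()
    # /api/v1/query_range evaluates the expression at regular steps over a window
    resp = requests.get(
        f"{PROM_URL}/api/v1/query_range",
        params={"query": EXPR, "start": end - window_s, "end": end, "step": step_s},
        timeout=10,
    )
    resp.raise_for_status()
    for series in resp.json()["data"]["result"]:
        values = series["values"]  # list of [timestamp, value] pairs
        if not values:
            continue
        first, last = float(values[0][1]), float(values[-1][1])
        trend = "growing" if last > first else "stable or shrinking"
        print(f"{series['metric']}: {first} -> {last} ({trend})")

if __name__ == "__main__":
    backlog_trend()
```
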
**High error rates:**

- Filter logs by error level
- Identify failing components
- Check resource availability
- Review recent configuration changes

**Slow API responses:**

- Check API gateway metrics
- Review queue processing times
- Verify LLM response latency
- Examine database query performance

**Resource exhaustion:**

- Monitor CPU and memory usage
- Identify resource-hungry components
- Adjust container limits
- Scale horizontally if needed

## Next Steps

- Configure alert rules in Prometheus
- Create custom Grafana dashboards for your workflows
- Export metrics to external monitoring systems
- Set up long-term metric retention
- Integrate infrastructure component metrics