Commit 00219d0

mpfusterjpe442 authored and committed

misc changes

1 parent 279d42b commit 00219d0
File tree

15 files changed

+76
-109
lines changed


advocacy_docs/supported-open-source/warehousepg/wem/get-started.mdx

Lines changed: 23 additions & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -16,14 +16,36 @@ To start using the interface:
1616

1717
For security, sessions automatically expire after a period of inactivity; if a timeout occurs, the system will display a "Session Expired" message and redirect you to the login screen. You can also manually terminate your session at any time by clicking the Logout icon located in the sidebar footer.
1818

19+
## Configuring WEM settings post-installation
20+
21+
Once you have installed WEM, you may need to fine-tune how WEM connects to your cluster or external services (like Prometheus and Alertmanager). There are two primary ways to manage these configurations:
22+
23+
1. Using the WEM settings tab
24+
25+
Administrators can modify most operational parameters directly through the browser:
26+
27+
1. Navigate to **User Management** > **Settings**.
28+
1. Update fields such as **Prometheus URL** or **Backup History Database Path**.
29+
30+
!!! Note
31+
While convenient, some low-level system parameters are only accessible via the configuration file.
32+
33+
2. Editing the configuration file manually
34+
35+
For parameters not exposed in the WEM console or for automated deployments, you can edit the configuration file directly on the host server:
36+
37+
1. Stop the service: `systemctl stop wem`.
38+
1. Edit the file `/etc/wem/wem.conf` on your WEM cluster and modify the desired parameter.
39+
1. Start the service again to apply the changes: `systemctl start wem`.
40+
41+
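The stop/edit/start sequence above can be sketched as a shell session. The edit step is shown non-interactively here against a temporary copy; `PROMETHEUS_URL` is a hypothetical parameter name used only for illustration, and on a real host the `sed` line would target `/etc/wem/wem.conf` between `systemctl stop wem` and `systemctl start wem`:

```shell
# Sketch only: edit a wem.conf-style file non-interactively.
# PROMETHEUS_URL is an illustrative parameter name, not a documented one.
conf=$(mktemp)
printf 'PROMETHEUS_URL=http://old-host:9090\nLOG_LEVEL=info\n' > "$conf"

# Point the parameter at a new Prometheus endpoint (GNU sed in-place edit)
sed -i 's|^PROMETHEUS_URL=.*|PROMETHEUS_URL=http://prometheus.example.com:9090|' "$conf"

# Confirm the change took effect
grep '^PROMETHEUS_URL=' "$conf"
```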
1942
## Navigating the interface structure
2043

2144
The WEM interface is organized into three functional areas:
2245
1. **Sidebar (left)**: The primary navigation menu.
2346
2. **Header (top)**: Displays the current page title, system time, and global controls such as filters and refresh triggers.
2447
3. **Main content (center)**: The primary workspace where data tables, charts, and configuration tools are rendered.
2548

26-
2749
## Understanding user roles and permissions
2850

2951
WEM utilizes Role-Based Access Control (RBAC). The panels available in the console are determined by the role assigned to your user account.

advocacy_docs/supported-open-source/warehousepg/wem/installing/wem.mdx

Lines changed: 0 additions & 13 deletions
@@ -19,23 +19,10 @@ Install the WarehousePG Enterprise Manager (WEM) package on your designated host
1919

2020
1. Install WEM on your designated host:
2121

22-
<TabContainer syncKey="install">
23-
<Tab title="RHEL 8, 9">
24-
2522
```bash
2623
sudo dnf install -y whpg-enterprise-manager
2724
```
2825

29-
</Tab>
30-
<Tab title="RHEL 7">
31-
32-
```bash
33-
sudo yum install -y whpg-enterprise-manager
34-
```
35-
36-
</Tab>
37-
</TabContainer>
38-
3926
## Configuring WEM
4027

4128
Edit the configuration file `/etc/wem/wem.conf` and configure the following parameters:
Lines changed: 13 additions & 26 deletions
@@ -1,11 +1,14 @@
11
---
22
title: Managing alerts
33
navTitle: Managing alerts
4-
description: Use the Alerts panel to monitor system health through real-time notifications and manage the incident lifecycle.
4+
description: Use the Alerts panel to integrate with Prometheus Alertmanager and govern the incident lifecycle through real-time notifications.
55
deepToC: true
66
---
77

8-
The **Alerts** panel serves as the central nervous system for your cluster, aggregating health signals from canary checks, segment events, and resource breaches. Use these actions to identify and resolve system degradation before it impacts your users.
8+
The **Alerts** panel serves as the central nervous system for your cluster, aggregating health signals from across your infrastructure. This panel integrates directly with Prometheus Alertmanager to provide a unified interface for incident response and rule management.
9+
10+
!!! Warning "Alertmanager required"
11+
If the **Alerts** panel displays `Alertmanager Not Configured`, you must set the `ALERTMANAGER_URL` environment variable for the WEM service. See [Configuring WEM](../installing/wem#configuring-wem) and [Configuring WEM settings post-installation](../get-started#configuring-wem-settings-post-installation) for details.
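For example, the variable can be exported in the environment the WEM service reads; for a systemd-managed service this would typically live in the unit's `Environment=` setting or an `EnvironmentFile`. The endpoint below is hypothetical:

```shell
# Hypothetical Alertmanager endpoint; the variable must be visible
# to the WEM service process, not just your login shell.
export ALERTMANAGER_URL="http://alertmanager.example.com:9093"
echo "$ALERTMANAGER_URL"
```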
912

1013
### Identifying alert sources
1114

@@ -14,6 +17,7 @@ Alerts are automatically generated from several monitoring vectors:
1417
- **Segment down events:** Triggered if a segment becomes unreachable or enters a recovery state.
1518
- **Resource threshold breaches:** Fired when CPU, Memory, or Disk Usage cross predefined limits.
1619
- **System errors:** Critical database engine events captured from the WHPG log stream.
20+
- **WEM outages:** If Prometheus is unable to reach the WEM service, it triggers an alert.
1721

1822
### Understanding severity levels
1923

@@ -23,27 +27,10 @@ WEM displays severity levels to help you prioritize your operational workflow:
2327
- **Info:** Routine informational notices regarding system changes or successful task completions.
2428

2529

26-
### Overseeing active notifications
27-
28-
Use the **Active Alerts** view to identify and respond to current system issues that require immediate attention.
29-
- **Prioritize by severity:** Review the impact level of all active notifications. Focus on critical alerts first, followed by warning alerts to prevent escalation into service outages.
30-
- **Investigate alert details:** Select any active alert to review its summary and message. Use this data to identify exactly which component—such as a specific segment or a resource threshold—is currently failing.
31-
- **Acknowledge system events:** If you have **Admin** or **Operator** privileges, click the **Ack** button on an active alert. This signals to the rest of the team that the issue is being investigated, preventing duplicate effort and maintaining a clear timeline for the incident lifecycle.
32-
33-
### Analyzing incident history
34-
35-
Use the **Alert History View** to perform retrospective analysis and identify recurring patterns across your infrastructure.
36-
- **Audit past incidents:** Review the history of resolved alerts to identify intermittent service instability or consistent resource pressure. Use the summary info cards to see the distribution of critical vs. informational events over time.
37-
- **Filter by time and severity:*** Narrow your audit trail to specific windows, such as the last 24 hours or the last 7 days. Apply severity filters to isolate specific tiers, such as reviewing all critical events from the past month to identify stability trends.
38-
39-
### Configuring alert logic and rules
40-
41-
Use the **Settings** tab to manage the detection logic that triggers system notifications.
42-
43-
!!! Important
44-
Access to this tab is restricted to users with the **Admin** or **Operator** role privilege.
45-
46-
- **Manage alert rules:** Review the configuration dashboard to track active rules by name, type, and severity. You can enable or disable specific rules without deleting them to tune your monitoring during maintenance windows.
47-
- **Create custom detection logic:** Use the create rule button to define new monitoring parameters. Choose the type of alert, which includes Threshold (numerical values), Pattern (status codes like "down"), or Anomaly (statistical outliers).
48-
- **Define technical conditions:** Specify the target metric (such as disk_usage or query_duration) and input the technical logic in the condition (json) field. This allows you to tailor alerts to the specific performance characteristics of your hardware.
49-
30+
### Managing the incident lifecycle
31+
Use the specialized tabs to move through the stages of alert detection, suppression, and resolution.
32+
- **Respond to current threats:** Use the **Active Alerts** tab to identify and prioritize immediate issues. Filter by severity to address critical failures first, ensuring that total service outages are resolved before investigating warning or info events.
33+
- **Suppress noise during maintenance:** Use the **Silences** tab to temporarily mute specific alerts. This is essential during scheduled maintenance or segment recovery windows to prevent alert fatigue and ensure that your notification channels remain focused on unexpected issues.
34+
- **Audit dispatch history:** Review the **Notifications** tab to see exactly when and where alerts were sent (e.g., Slack, Email, or PagerDuty). Use this to verify that the correct stakeholders were notified during an incident.
35+
- **Evaluate detection logic:** Browse the **Alert Rules** tab to inspect the active triggers defined in your Prometheus configuration. This view allows you to verify the technical conditions (thresholds, durations, and labels) that govern how WEM identifies system degradation.
36+
- **Perform retrospective analysis:** Use the **Alert History** tab to identify recurring patterns. By auditing resolved alerts, you can isolate intermittent hardware failures or recurring resource pressure that may require long-term capacity planning.
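As a sketch of what the **Alert Rules** tab surfaces, a threshold-style trigger might look like this in a Prometheus rules file. The rule name, job label, and timing below are illustrative assumptions, not taken from WEM's shipped configuration:

```yaml
groups:
  - name: wem-examples
    rules:
      - alert: SegmentDown
        # Fires when a segment exporter target has been unreachable for 5 minutes
        expr: up{job="whpg-segment"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Segment {{ $labels.instance }} is unreachable"
```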

advocacy_docs/supported-open-source/warehousepg/wem/monitoring/cluster-overview.mdx

Lines changed: 2 additions & 2 deletions
@@ -1,11 +1,11 @@
11
---
22
title: Verifying the cluster health
33
navTitle: Verifying the cluster health
4-
description: Use the Cluster Overview panel to monitor real-time WarehousePG cluster health, verify node availability, and track critical connectivity metrics to ensure high availability
4+
description: Use the Cluster panel to monitor real-time WarehousePG cluster health, verify node availability, and track critical connectivity metrics to ensure high availability
55
deepToC: true
66
---
77

8-
The **Cluster Overview** panel provides a high-level summary of the WarehousePG (WHPG) cluster configuration and real-time health metrics. This panel is the primary starting point for verifying cluster availability and resource utilization.
8+
The **Cluster** panel provides a high-level summary of the WarehousePG (WHPG) cluster configuration and real-time health metrics. This panel is the primary starting point for verifying cluster availability and resource utilization.
99

1010
### Confirming core cluster availability
1111

advocacy_docs/supported-open-source/warehousepg/wem/monitoring/index.mdx

Lines changed: 2 additions & 3 deletions
@@ -12,14 +12,13 @@ navigation:
1212

1313
Verify cluster health, maintain operational awareness, and respond to system events in real time through the following core actions:
1414

15-
- [Verifying the cluster health:](cluster-overview) Maintain a real-time view of the cluster’s topology and health via the **Cluster Overview** panel. Use this to ensure that coordinator, standby, and segment nodes are online and correctly configured.
15+
- [Verifying the cluster health:](cluster-overview) Maintain a real-time view of the cluster’s topology and health via the **Cluster** panel. Use this to ensure that coordinator, standby, and segment nodes are online and correctly configured.
1616

1717
- [Visualizing hardware performance:](system-metrics) Use the **System Metrics** panel to track the physical health of your infrastructure. Use these charts to identify OS-level bottlenecks, such as CPU spikes, memory exhaustion, or network latency across specific hosts.
1818

1919
- [Validating database responsiveness:](monitoring) Ensure the database engine is actively processing requests. Use the **Monitoring** panel to review automated canary checks—synthetic SQL probes that verify connectivity and execution speed.
2020

2121
- [Auditing system logs:](logs) The **Logs** panel allows you to investigate the unified stream of system and database telemetry. Search through coordinator and segment logs to pinpoint the root cause of query failures or administrative changes.
2222

23-
- [Managing alerts:](alerts) Use the **Alerts** panel to oversee the the notification lifecycle. Define thresholds for critical metrics and manage the queue of active alerts to ensure proactive intervention.
24-
23+
- [Managing alerts:](alerts) Use the **Alerts** panel to integrate with Prometheus Alertmanager and govern the incident lifecycle through real-time notifications.
2524

advocacy_docs/supported-open-source/warehousepg/wem/monitoring/logs.mdx

Lines changed: 7 additions & 7 deletions
@@ -10,13 +10,13 @@ The **Logs** panel serves as a centralized diagnostic hub. By consolidating inte
1010
### Understanding log levels
1111

1212
WarehousePG Enterprise Manager (WEM) displays severity levels generated directly by the underlying WarehousePG (WHPG) engine. These levels categorize every log entry based on its impact on database operations:
13-
- DEBUG: Contains granular technical details used primarily for deep-dive troubleshooting and development analysis.
14-
- INFO: Provides standard informational messages regarding routine system operations.
15-
- LOG: Reports standard engine-level events and process completions.
16-
- WARNING: Highlights events that are not fatal but may indicate potential configuration issues or approaching resource limits.
17-
- PANIC: Indicates a critical error that caused all database sessions to be disconnected; the system will usually attempt a restart after a PANIC.
18-
- FATAL: Indicates an error that caused a specific session to be terminated, though the rest of the database remains operational.
19-
- ERROR: Reports a problem that prevented a specific command or query from completing successfully.
13+
- `DEBUG`: Contains granular technical details used primarily for deep-dive troubleshooting and development analysis.
14+
- `INFO`: Provides standard informational messages regarding routine system operations.
15+
- `LOG`: Reports standard engine-level events and process completions.
16+
- `WARNING`: Highlights events that are not fatal but may indicate potential configuration issues or approaching resource limits.
17+
- `ERROR`: Reports a problem that prevented a specific command or query from completing successfully.
18+
- `FATAL`: Indicates an error that caused a specific session to be terminated, though the rest of the database remains operational.
19+
- `PANIC`: Indicates a critical error that caused all database sessions to be disconnected; the system will usually attempt a restart after a `PANIC`.
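These levels can be used to filter a raw log stream directly on a host. A minimal sketch follows; the sample lines are fabricated for illustration, and real WHPG log files carry more fields than shown here:

```shell
# Fabricated sample entries in a simplified "timestamp LEVEL message" shape
log=$(mktemp)
cat > "$log" <<'EOF'
2025-01-01 10:00:01 INFO checkpoint complete
2025-01-01 10:00:05 WARNING resource queue nearing limit
2025-01-01 10:00:09 ERROR division by zero in query 4711
2025-01-01 10:00:12 LOG autovacuum launched
EOF

# Keep only entries at WARNING severity or worse
grep -E ' (WARNING|ERROR|FATAL|PANIC) ' "$log"
```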
2020

2121
### Performing structured database analysis
2222

Lines changed: 7 additions & 27 deletions
@@ -1,40 +1,20 @@
11
---
22
title: Validating database responsiveness
33
navTitle: Validating database responsiveness
4-
description: Use the Monitor panel to rack proactive health indicators and automated canary check results to ensure database availability.
4+
description: Use the Monitor panel to track proactive health indicators and automated canary check results to ensure database availability.
55
deepToC: true
66
---
77

88
The **Monitoring** panel provides proactive verification of cluster health through automated canary checks. Unlike passive metrics, these checks execute active tasks to ensure the database engine is responding correctly and meeting performance baselines.
99

1010
### Performing proactive health checks: canary checks
1111

12-
Canary checks are recurring, automated scripts that simulate real-world operations to verify the end-to-end integrity of the system. You can configure these test on the [Users panel](../security/user-management).
12+
Canary checks are recurring, automated scripts that simulate real-world operations to verify the end-to-end integrity of the system. You can configure these tests on the [User Management panel](../security/user-management).
1313

1414
Use the **Canary Checks** tab to verify that the database can successfully execute core operations.
15-
- **Run health checks on demand:** Trigger an immediate execution of any check in the list to verify a fix or test real-time connectivity. While you must go to the [Users panel](../security/user-management) to create or edit a check, the monitoring panel allows you to run and stop them at any time to get an instant status update.
16-
- **Measure query execution speed:** Observe the duration column during a run. If the millisecond count is rising even while the status remains "Success," you are seeing early signs of resource saturation before it impacts users.
17-
- **Respond to status changes:** Follow the required action protocol for non-passing checks. A "Warning" status is a signal to begin capacity planning, while a "Critical" status requires an immediate audit of system logs.
15+
- **Assess overall probe health:** Review the header metrics to get an instant snapshot of system integrity. Compare the **Successful** count against the **Total Checks** count to identify if a specific subset of your monitoring is failing.
16+
- **Verify scheduler activity:** Check the status of the scheduler to ensure it is running. This confirms that the WarehousePG Enterprise Manager (WEM) engine is actively triggering your background probes. If the scheduler is stopped, your health data will become stale and you will lose proactive visibility.
17+
- **Run health checks on demand:** Trigger an immediate execution of any check in the list to verify a fix or test real-time connectivity. While you must go to the [User Management panel](../security/user-management) to create or edit a check, the **Monitoring** panel allows you to run and stop them at any time to get an instant status update.
18+
- **Benchmark execution speed:** Monitor the **Average Duration** metric to establish a baseline for expected responsiveness. A sudden spike in this metric, even if checks are still successful, serves as an early warning of resource saturation or network latency.
19+
- **Investigate specific check failures:** Audit the **Health Checks** table to isolate the root cause of a failure. By checking which specific probe is non-passing, you can determine if the issue is a total service outage or a localized subsystem failure.
1820
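Conceptually, a canary check is just a timed probe plus a status classification. The sketch below makes that concrete; the `probe` function stands in for a real SQL round trip (such as `psql -c 'SELECT 1'` against the coordinator), and the 500 ms threshold is an illustrative assumption:

```shell
# Stand-in for a real connectivity probe against the coordinator
probe() { sleep 0.1; }

# Time the probe in milliseconds
start=$(date +%s%N)
probe
elapsed_ms=$(( ( $(date +%s%N) - start ) / 1000000 ))

# Classify the result the way the Canary Checks tab does
if [ "$elapsed_ms" -lt 500 ]; then
  echo "Success (${elapsed_ms} ms)"
else
  echo "Warning: slow probe (${elapsed_ms} ms)"
fi
```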

19-
20-
### Analyzing query concurrency and scheduling
21-
22-
Use the **Smart Analysis** tab to understand how high-volume traffic impacts the internal database scheduler and locking mechanisms.
23-
24-
- **Identify resource queue bottlenecks:** Monitor the **Queued** query count. If this number is high, your resource queues are likely too restrictive or your cluster lacks the CPU/Memory to handle the current concurrent load.
25-
- **Investigate lock contention:** Check the **Lock Waits** indicator. A high count suggests that long-running transactions are holding onto resources and blocking other users, which is a common cause of application performance degradation.
26-
- **Correlate connections with performance:** Use the **Connections Over Time** graph to see if spikes in new sessions lead to a rise in average query time. This helps you determine your cluster's true breaking point for concurrent users.
27-
28-
### Managing the alert lifecycle
29-
30-
Use the **Alerts** tab to track threshold violations and coordinate a response to system incidents.
31-
- **Acknowledge active incidents:** Use the **Ack** button to signal that an investigation is underway. This prevents duplicate efforts and helps maintain an accurate timeline for incident resolution.
32-
- **Prioritize by severity:** Sort the alert log by impact level to focus on critical infrastructure failures before addressing informational messages or performance warnings.
33-
34-
35-
## Tracking historical performance drift
36-
37-
Use the **Trends** tab to move beyond real-time status and analyze the long-term stability of the cluster.
38-
- **Detecting intermittent service instability:** Review the history of check results to identify services that fail and recover automatically. Repeated triggers for the same service often point to underlying resource exhaustion or network instability rather than a total hardware failure.
39-
- **Identifying performance degradation:** Monitor if check durations are gradually increasing over weeks or months. This "drift" helps you visualize how data growth and increased user load are slowly impacting your baseline performance.
40-
- **Isolating recurring warning windows:** Review the frequency of warnings to see if the system consistently underperforms during specific times, such as during daily backup windows or heavy ETL (Extract, Transform, Load) cycles.

advocacy_docs/supported-open-source/warehousepg/wem/monitoring/system-metrics.mdx

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ The **System Metrics** panel provides a comprehensive view of the hardware and d
1111

1212
Use the **System Metrics** tab to identify immediate hardware saturation that could be impacting query response times.
1313

14-
- **Spot processing bottlenecks:** Check the **CPU Usage % over time** to see if specific nodes are hitting 100% utilization. If one node is consistently higher than others, you may have data skew issues.
14+
- **Spot processing bottlenecks:** Check the **CPU Usage % Over Time** to see if specific nodes are hitting 100% utilization. If one node is consistently higher than others, you may have data skew issues.
1515
- **Assess memory pressure:** Monitor **Available Memory** vs. **Cached Memory**. If the available memory is low and cached is also shrinking, the OS is under pressure and may start swapping, which significantly slows down database operations.
1616
- **Validate storage and network throughput:** Review **Disk I/O** and **Network Traffic** graphs. High disk read rates during unexpected times might indicate inefficient queries that are forcing full table scans instead of using indexes.
1717
- **Evaluate system load averages:** Observe the 1m, 5m, and 15m load averages. If the 15-minute load consistently exceeds the number of available CPU cores, the host is overloaded and tasks are queuing at the OS level.
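The load-versus-cores comparison can be reproduced from the shell on any Linux host. This is a sketch of the heuristic described above, not a WEM feature:

```shell
# Number of online CPU cores and the 15-minute load average (Linux)
cores=$(getconf _NPROCESSORS_ONLN)
load15=$(awk '{print $3}' /proc/loadavg)

# Load averages are fractional, so compare with awk rather than [ ]
awk -v l="$load15" -v c="$cores" 'BEGIN {
  if (l + 0 > c + 0) print "queuing likely: load15=" l ", cores=" c
  else               print "headroom ok: load15=" l ", cores=" c
}'
```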
