Commit 5aed624

committed
Mention new metrics, and adjust installation instructions to assume p4metrics not monitor_metrics.sh
1 parent 1daae6f commit 5aed624

2 files changed

+78
-73
lines changed


INSTALL.md

Lines changed: 30 additions & 51 deletions
Original file line number | Diff line number | Diff line change
@@ -34,7 +34,7 @@ On other related servers, e.g. running Swarm, Hansoft, Helix TeamHub (HTH), etc,
3434
- [Importing Prometheus data into Victoria Metrics](#importing-prometheus-data-into-victoria-metrics)
3535
- [Install node exporter](#install-node-exporter)
3636
- [Install p4prometheus - details](#install-p4prometheus---details)
37-
- [Install monitor metrics cron jobs](#install-monitor-metrics-cron-jobs)
37+
- [Install p4metrics and monitor\_locks systemd timer service](#install-p4metrics-and-monitor_locks-systemd-timer-service)
3838
- [Checking for blocked commands](#checking-for-blocked-commands)
3939
- [Start and enable service](#start-and-enable-service)
4040
- [Alerting](#alerting)
@@ -43,15 +43,15 @@ On other related servers, e.g. running Swarm, Hansoft, Helix TeamHub (HTH), etc,
4343
- [Prometheus config to reference alertmanager rules](#prometheus-config-to-reference-alertmanager-rules)
4444
- [Troubleshooting](#troubleshooting)
4545
- [p4prometheus](#p4prometheus)
46-
- [monitor metrics](#monitor-metrics)
46+
- [p4metrics](#p4metrics)
4747
- [node exporter](#node-exporter)
4848
- [prometheus](#prometheus)
4949
- [Grafana](#grafana)
5050
- [Advanced config options](#advanced-config-options)
5151
- [Windows Installation](#windows-installation)
5252
- [Windows Exporter](#windows-exporter)
5353
- [P4prometheus on Windows](#p4prometheus-on-windows)
54-
- [Running monitor\_metrics.sh](#running-monitor_metricssh)
54+
- [Running p4metrics](#running-p4metrics)
5555
- [Installing Programs as Services](#installing-programs-as-services)
5656

5757
# Metrics Available
@@ -261,7 +261,7 @@ It is API compatible and thus a drop in for querying. It is configured as a Prom
261261

262262
Run the following as root:
263263

264-
export PVER="1.74.0"
264+
export PVER="1.87.5" # Adjust to suitable recent value!
265265
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v$PVER/victoria-metrics-v$PVER.tar.gz
266266
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v$PVER/vmutils-v$PVER.tar.gz
267267

@@ -275,7 +275,7 @@ Run the following as root:
275275
mv vmbackup-prod /usr/local/bin/
276276
mv vmrestore-prod /usr/local/bin/
277277

278-
Create service file:
278+
Create service file (**adjust retentionPeriod as desired**):
279279

280280
```ini
281281
cat << EOF > /etc/systemd/system/victoria-metrics.service
@@ -429,13 +429,10 @@ Get latest release download link: https://github.com/perforce/p4prometheus/relea
429429

430430
Run the following as `root` (using link copied from above page):
431431

432-
export PVER=0.8.8
432+
export PVER=0.9.6 # Adjust this to latest release!
433433
wget https://github.com/perforce/p4prometheus/releases/download/v$PVER/p4prometheus.linux-amd64.gz
434-
435434
gunzip p4prometheus.linux-amd64.gz
436-
437-
chmod +x p4prometheus.linux-amd64
438-
435+
chmod +rx p4prometheus.linux-amd64
439436
mv p4prometheus.linux-amd64 /usr/local/bin/p4prometheus
440437

441438
As user `perforce` run as below.
@@ -546,20 +543,14 @@ Check that metrics are being written:
546543

547544
grep lines /hxlogs/metrics/p4_cmds.prom
548545

546+
## Install p4metrics and monitor_locks systemd timer service
549547

550-
## Install monitor metrics cron jobs
551-
552-
Download the following files (or use [Automated Script Installation](#automated-script-installation)):
553-
554-
* [monitor_metrics.sh](scripts/monitor_metrics.sh) or for use with wget, download raw file: [*right click this link > copy link address*](https://raw.githubusercontent.com/perforce/p4prometheus/master/scripts/monitor_metrics.sh)
555-
* [monitor_wrapper.sh](scripts/monitor_wrapper.sh) or for use with wget, download raw file: [*right click this link > copy link address*](https://raw.githubusercontent.com/perforce/p4prometheus/master/scripts/monitor_wrapper.sh)
556-
* [monitor_metrics.py](scripts/monitor_metrics.py) or for use with wget, download raw file: [*right click this link > copy link address*](https://raw.githubusercontent.com/perforce/p4prometheus/master/scripts/monitor_metrics.py)
548+
Use [Automated Script Installation](#automated-script-installation)!!
557549

558-
There is a convenience script to keep things up-to-date in future:
550+
Check out: [install_p4metrics function in installer script for details](https://github.com/perforce/p4prometheus/blob/master/scripts/install_p4prom.sh#L361)
559551

560-
* [check_for_updates.sh](scripts/check_for_updates.sh) or for use with wget, download raw file: [*right click this link > copy link address*](https://raw.githubusercontent.com/perforce/p4prometheus/master/scripts/check_for_updates.sh). It relies on the `jq` utility to parse GitHub and update the above scripts if new releases have been made.
561-
562-
Configure them for your metrics directory (e.g. `/hxlogs/metrics`)
552+
Note that `p4metrics` has now replaced `monitor_metrics.sh` - the latter is retained as an example, but should
553+
not be installed!
563554

564555
Please note that `monitor_metrics.py` (which is called by `monitor_wrapper.sh`) runs `lslocks` and
565556
cross references locks found with `p4 monitor show` output. This is incredibly useful for
@@ -568,16 +559,8 @@ if you are not collecting the data at the time!
568559

569560
Warning: make sure that `lslocks` is installed on your Linux distribution!
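
A quick pre-flight check along the following lines can confirm that `lslocks` is on the `PATH` before relying on lock metrics. This is only a minimal sketch: the helper name `require_cmd` is hypothetical and is not part of p4prometheus.

```bash
# Hypothetical helper (not part of p4prometheus): report any missing commands.
# Returns 0 if every named command is found on the PATH, 1 otherwise.
require_cmd() {
    local missing=0
    local cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" > /dev/null 2>&1; then
            echo "Missing required command: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}

# lslocks is normally provided by the util-linux package.
if require_cmd lslocks; then
    echo "lslocks found"
else
    echo "lslocks not found: install it before enabling lock monitoring"
fi
```

The same helper can be given several names at once, e.g. `require_cmd lslocks p4`, if you want to check other prerequisites in one go.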
570561

571-
Install in crontab (for user `perforce` or `$OSUSER`) to run every minute:
572-
573-
INSTANCE=1
574-
*/1 * * * * /p4/common/site/bin/monitor_metrics.sh $INSTANCE > /dev/null 2>&1 ||:
575-
*/1 * * * * /p4/common/site/bin/monitor_wrapper.sh $INSTANCE > /dev/null 2>&1 ||:
576-
577-
For non-SDP installation:
578-
579-
*/1 * * * * /path/to/monitor_metrics.sh -p $P4PORT -u $P4USER -nosdp > /dev/null 2>&1 ||:
580-
*/1 * * * * /path/to/monitor_wrapper.sh -p $P4PORT -u $P4USER -nosdp > /dev/null 2>&1 ||:
562+
`monitor_wrapper.sh` and `monitor_metrics.py` (which it calls) were previously installed in `crontab`.
563+
They are now installed as a systemd timer service.
581564

582565
If not using SDP then please ensure that an appropriate LONG TERM TICKET is set up in the environment
583566
that this script is running in.
@@ -627,7 +610,7 @@ Or open URL in a browser.
627610

628611
# Alerting
629612

630-
Done via alertmanager. Optional component
613+
Done via `alertmanager`. Optional component.
631614

632615
Setup is very similar to the above.
633616

@@ -690,7 +673,7 @@ See sample config file here:
690673

691674
Note that Makefile format requires a `<tab>` char (not spaces) at the start of 'action' lines.
692675

693-
```
676+
```Makefile
694677
# Makefile for alertmanager
695678
validate:
696679
amtool check-config alertmanager.yml
@@ -702,7 +685,7 @@ restart: validate
702685

703686
Then you can validate your config:
704687

705-
```
688+
```bash
706689
# make validate
707690
amtool check-config alertmanager.yml
708691
Checking 'alertmanager.yml' SUCCESS
@@ -774,6 +757,7 @@ Make sure all *firewalls* are appropriately configured and the various component
774757
Port defaults are:
775758
* Grafana: 3000
776759
* Prometheus: 9090
760+
* Victoria Metrics: 8428
777761
* Node_exporter: 9100
778762
* Alertmanager: 9093
779763

@@ -791,19 +775,22 @@ You can just grep for the most basic metric a couple of times (make sure it is i
791775
# TYPE p4_prom_log_lines_read counter
792776
p4_prom_log_lines_read{serverid="master.1",sdpinst="1"} 7143
793777

794-
## monitor metrics
778+
## p4metrics
795779

796-
Make sure monitor_metrics.sh is working:
780+
Make sure `p4metrics` is working:
797781

798782
```bash
799-
bash -xv /p4/common/site/bin/monitor_metrics.sh 1
783+
sudo systemctl status p4metrics
784+
sudo journalctl -u p4metrics --no-pager | less
800785
```
801786

802-
Or if not using SDP, copy the [monitor_metrics.sh script](scripts/monitor_metrics.sh) to an appropriate place such as `/usr/local/bin` and install it in your crontab.
787+
You can test it via:
788+
789+
p4metrics --config p4metrics.yaml --debug --dry.run
803790

804791
Check that appropriate files are listed in your metrics dir (and are being updated every minute), e.g.
805792

806-
ls -l /hxlogs/metrics
793+
ls -ltr /hxlogs/metrics/
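
Rather than eyeballing timestamps, a small check like the one below can flag metrics files that have stopped updating. It is a sketch under assumptions: the function name `stale_prom_files` is hypothetical, the files are assumed to end in `.prom` (as `p4_cmds.prom` does), and `/hxlogs/metrics` is the example directory used in this document.

```bash
# Hypothetical helper: list *.prom files in a directory that were last
# modified more than N minutes ago (default 2).
stale_prom_files() {
    local dir="$1"
    local minutes="${2:-2}"
    find "$dir" -maxdepth 1 -name '*.prom' -mmin +"$minutes" -print
}

# Adjust METRICS_DIR to your own metrics directory.
METRICS_DIR="${METRICS_DIR:-/hxlogs/metrics}"
if [ -d "$METRICS_DIR" ]; then
    stale=$(stale_prom_files "$METRICS_DIR" 2)
    if [ -n "$stale" ]; then
        echo "Files not updated in the last 2 minutes:"
        echo "$stale"
    else
        echo "All .prom files in $METRICS_DIR look fresh"
    fi
else
    echo "Metrics directory $METRICS_DIR not found"
fi
```

An empty result only means nothing is stale; it is still worth confirming with `ls -ltr` that the expected files exist at all.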
807794

808795
## node exporter
809796

@@ -875,21 +862,13 @@ and see what the output is.
875862

876863
The executable takes the `--config` parameter, and the YAML file has the same format as for the Linux version. You can specify paths with forward slashes if desired, e.g. `c:/p4/metrics`
877864

878-
## Running monitor_metrics.sh
879-
880-
Download [Git Bash](https://gitforwindows.org/) and install.
881-
882-
Edit `monitor_metrics.sh` and adjust path settings, e.g. `/p4/metrics` -> `/c/p4/metrics`
883-
884-
Test the script with your installation (analyse it's settings). First make sure your admin user is logged in.
885-
886-
bash -xv ./monitor_metrics.sh -p $P4PORT -u $P4USER -nosdp
865+
## Running p4metrics
887866

888-
When it is working and writing metric files to your defined metrics directory, then create a .BAT wrapper, e.g. `run_monitor_metrics.bat` with something like the following contents (adjusted for your local settings):
867+
Edit `p4metrics.yaml` and adjust path settings, e.g. `/p4/metrics` -> `/c/p4/metrics`
889868

890-
cmd /c ""C:\Program Files (x86)\Git\bin\bash.exe" --login -i -- C:\p4\monitor\monitor_metrics.sh -p localhost:1666 -u perforce -nosdp"
869+
Test the tool with your installation (analyse its settings). First make sure your admin user is logged in.
891870

892-
Then you can create a Task Scheduler entry which runs `run_monitor_metrics.bat` every minute, for example.
871+
p4metrics.exe --config p4metrics.yaml --debug --dry.run
893872

894873
It is important that the user account used has a long login ticket specified.
895874

README.md

Lines changed: 48 additions & 22 deletions
Original file line number | Diff line number | Diff line change
@@ -24,7 +24,7 @@ Uses [go-libp4dlog](https://github.com/rcowham/go-libp4dlog) for actual log file
2424
- [Detailed Installation Instructions](#detailed-installation-instructions)
2525
- [Metrics Available](#metrics-available)
2626
- [P4Prometheus Metrics](#p4prometheus-metrics)
27-
- [Monitor\_metrics.sh Metrics](#monitor_metricssh-metrics)
27+
- [p4metrics Metrics](#p4metrics-metrics)
2828
- [Locks Metrics](#locks-metrics)
2929

3030
## Support Status
@@ -130,40 +130,66 @@ Note these metrics will all have these labels: sdpinst (if SDP), serverid. Extra
130130
| p4_total_write_held_seconds | table | The total write locks held in seconds (by table) |
131131
| p4_total_trigger_lapse_seconds | trigger | The total lapse time for triggers in seconds (by trigger) |
132132

133-
## Monitor_metrics.sh Metrics
133+
## p4metrics Metrics
134+
135+
These were previously written by `monitor_metrics.sh` but that has been superseded by `p4metrics`.
136+
137+
For backwards compatibility with history of previously collected metrics the existing metric
138+
names/types/labels have been kept the same.
134139

135140
Note these metrics will all have these labels: sdpinst (if SDP), serverid. Extra metric labels are shown in the table.
136141

137142
| Metric Name | Labels | Description |
138143
| ----------- | ------ | ----------- |
139-
| p4_server_uptime | | P4D Server uptime (seconds) |
144+
| p4_auth_ssl_cert_expires | | Epoch seconds when Helix Auth Service SSL cert expires |
145+
| p4_auth_version | version | The version of the Helix Auth Service (unknown means <= 2022.1) |
140146
| p4_change_counter | | P4D change counter - monitor normal activity for submits etc |
141-
| p4_monitor_by_cmd | cmd | P4 running processes - counted by cmd |
142-
| p4_monitor_by_user | user | P4 running processes - counted by user |
143-
| p4_process_count | | P4 running processes - counted via 'ps' |
144147
| p4_completed_cmds | | Completed p4 commands - simple grep of log file (turned off for large logs) |
145-
| p4_sdp_checkpoint_log_time | | Time of last checkpoint log - helps check if automated jobs are running |
146-
| p4_sdp_checkpoint_duration | | Time taken for last checkpoint/restore action - check for sudden increases |
147-
| p4_replica_curr_jnl | servername | Current journal for server (from "servers -J" |
148-
| p4_replica_curr_pos | servername | Current journal for server - key measure of replication lag (from "servers -J" |
149-
| p4_error_count | subsystem, error_id, level | Server errors by id - for sudden spurts of errors |
150-
| p4_pull_errors | | P4 pull transfers failed count - to monitor replication status |
151-
| p4_pull_queue | | P4 pull files in queue count - for replication |
152-
| p4_pull_replica_journals_behind | | How many journals replica is behind |
153-
| p4_pull_replication_error | | Set to 1 if replication error detected or 0 if working |
154-
| p4_pull_replica_lag | | How many bytes replica is behind in current journal (-1 = error) |
155-
| p4_licensed_user_count | | P4D Licensed User count |
156-
| p4_licensed_user_limit | | P4D Licensed User Limit |
148+
| p4_error_count | subsystem, error_id, level | (Deprecated - monitor_metrics.sh) Server errors by id - for sudden spurts of errors |
149+
| p4_errors_count | subsys, severity | Server errors by subsystem and severity (e.g. error/fatal) - for sudden spurts of errors |
150+
| p4_filesys_min | filesys | Value of P4D configurable filesys.*.min |
157151
| p4_license_expires | | P4D License expiry (epoch secs) |
158-
| p4_license_time_remaining | | P4D License time remaining (secs) |
159-
| p4_license_support_expires | | P4D License support expiry (epoch secs) |
160152
| p4_license_info | info | P4D License info (if present) |
161153
| p4_license_IP | IP | P4D License IP address (if present) |
162-
| p4_filesys_min | filesys | Value of P4D configurable filesys.*.min |
154+
| p4_license_support_expires | | P4D License support expiry (epoch secs) |
155+
| p4_license_time_remaining | | P4D License time remaining (secs) |
156+
| p4_licensed_user_count | | P4D Licensed User count |
157+
| p4_licensed_user_limit | | P4D Licensed User Limit |
158+
| p4_monitor_by_cmd | cmd | P4 running processes - counted by cmd |
159+
| p4_monitor_by_state | state | P4 running processes - counted by state (see 'p4 monitor show' status field) |
160+
| p4_monitor_by_user | user | P4 running processes - counted by user |
161+
| p4_monitor_max_cmd_time | user | Max time in seconds for a non-service user command in the monitor table |
163162
| p4_p4d_build_info | version | P4D Version/build info |
164163
| p4_p4d_server_type | services | P4D server type/services |
165-
| p4_ssl_cert_expires | | P4D SSL certificate expiry epoch seconds |
164+
| p4_process_count | | (Deprecated monitor_metrics.sh) P4 running processes - counted via 'ps' |
165+
| p4_processes_count | | P4 running processes - counted via 'ps' |
166+
| p4_pull_error_count | | P4 pull transfers in failed state - to monitor replication status |
167+
| p4_pull_errors | | P4 pull transfers failed count - to monitor replication status |
168+
| p4_pull_queue | | (Deprecated monitor_metrics.sh) P4 pull files in queue count - for replication |
169+
| p4_pull_queue_bytes | | P4 `pull -ls` how many bytes in archive pull queue |
170+
| p4_pull_queue_count | | P4 `pull -ls` how many files in archive pull queue (not in failed state) |
171+
| p4_pull_queue_total | | P4 `pull -ls` how many files in archive pull queue (total) |
172+
| p4_pull_replica_bytes_behind | | How many total bytes replica is behind (`pull -ljv`) |
173+
| p4_pull_replica_journals_behind | | How many journals replica is behind |
174+
| p4_pull_replica_lag | | How many bytes replica is behind in current journal (-1 = error) |
175+
| p4_pull_replication_error | | Set to 1 if replication error detected or 0 if working |
176+
| p4_replica_curr_jnl | servername | Current journal for server (from "servers -J") |
177+
| p4_replica_curr_pos | servername | Current journal position for server - key measure of replication lag (from "servers -J") |
178+
| p4_sdp_checkpoint_duration | | Time taken for last checkpoint/restore action - check for sudden increases |
179+
| p4_sdp_checkpoint_log_time | | Time of last checkpoint log - helps check if automated jobs are running |
180+
| p4_sdp_verify_duration | | How long in seconds last SDP p4verify.sh run took |
181+
| p4_sdp_verify_errors | type | Verify errors by type (submitted/shelved/spec/upload) |
182+
| p4_sdp_verify_log_modtime | | Epoch time when SDP log file p4verify.log was last modified |
166183
| p4_sdp_version | version | SDP Version |
184+
| p4_server_uptime | | P4D Server uptime (seconds) |
185+
| p4_ssl_cert_expires | | P4D SSL certificate expiry epoch seconds |
186+
| p4_swarm_authorized | | Set to 1 if SDP superuser authorized to login to Swarm via API (e.g. 401) |
187+
| p4_swarm_error | | Set to 1 if swarm returns an http error (timeout or 500) (good for alerting) |
188+
| p4_swarm_future_tasks | | Count of future swarm tasks from `/queue/status` URL |
189+
| p4_swarm_max_workers | | Maximum number of swarm workers from `/queue/status` URL |
190+
| p4_swarm_tasks | | Count of current swarm tasks from `/queue/status` URL |
191+
| p4_swarm_version | | Swarm version string |
192+
| p4_swarm_workers | | Count of current swarm workers from `/queue/status` URL |
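
To spot-check one of the metrics in the table above from the command line, the Prometheus text-format files can be read directly. The following is a minimal sketch, not part of p4prometheus: the helper name `metric_value` is hypothetical, and the sample file contents (including the `gauge` type and the value) are made up for illustration, using the label format shown for `p4_prom_log_lines_read`.

```bash
# Hypothetical helper: print the value of the first sample of a metric
# found in a Prometheus text-format (.prom) file.
metric_value() {
    local metric="$1"
    local file="$2"
    awk -v m="$metric" '
        /^#/ { next }                   # skip HELP/TYPE comment lines
        {
            name = $1
            sub(/[{].*/, "", name)      # strip any {label="..."} part
            if (name == m) { print $NF; exit }
        }' "$file"
}

# Made-up sample data: 2592000 seconds is exactly 30 days.
sample=$(mktemp)
cat > "$sample" << 'EOF'
# TYPE p4_license_time_remaining gauge
p4_license_time_remaining{serverid="master.1",sdpinst="1"} 2592000
EOF

secs=$(metric_value p4_license_time_remaining "$sample")
echo "License days remaining: $(( secs / 86400 ))"
rm -f "$sample"
```

Against a real installation, point the function at the relevant file in your metrics directory instead of the temporary sample. The helper assumes samples carry no trailing timestamp, so the last field on the line is the value.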
167193

168194
## Locks Metrics
169195
