# Tuning hints for monitoring "large" queue managers

If you have a large queue manager - perhaps several thousand queues - then a lot of data can be produced when
monitoring those queues. Some default configuration options might need tuning to get acceptable performance. Reducing
the frequency of generation and/or collection may be appropriate. Tuning can be applied in several places: in this
collector, in the database configuration, and in the queue manager.

The following sections describe the different pieces that you might want to look at.

The document is mostly written from the viewpoint of using Prometheus as the database. That is mainly because
Prometheus has a unique "pull" model, where the server calls the collector at configured intervals. The other databases
and collector technologies supported from this repository have a simpler model, "pushing" data to the various backends.
However, much of the document is relevant regardless of where the metrics end up.

## Collector location
It is most efficient to run the collector program as a local bindings application, connecting directly to the queue
manager. That removes all the MQ Client network flows that would otherwise be needed for every message.

If you cannot avoid running as a client (for example, you are trying to monitor the MQ Appliance or z/OS), then keep the
network latency between the queue manager and collector as low as possible. For z/OS, you might consider running the
collector in a zLinux LPAR on the same machine, or perhaps in a zCX container.

Also configure the client to take advantage of read-ahead when getting publications. This is done by setting
`DEFREADA(YES)` on the nominated ReplyQueue(s).
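As an illustration, the attribute could be set with an MQSC command like the following, assuming the collector's reply queue is a predefined local queue named `MONITOR.REPLY` (a hypothetical name - use whichever queue your configuration nominates):

```
ALTER QLOCAL(MONITOR.REPLY) DEFREADA(YES)
```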

## Collection processing time
The collector reports how long it takes to collect and process the data on each interval. You can see this in a debug
log. The Prometheus collector also has an `ibmmq_qmgr_exporter_collection_time` metric. Note that this time is the value
as seen by the main collection thread; the real total time as seen by Prometheus is usually longer, as there is likely
still work going on in the background to send metrics to the database, and for them to be successfully ingested.

The first time that the collection time exceeds the Prometheus default `scrape_timeout` value, a warning message is
emitted. This can be ignored if you are expecting a scrape to take longer. But it can be helpful if you didn't
know that you might need to do some tuning.

The true total time taken for a scrape can be seen in Prometheus directly. For example, you can use the administrative
interface at `http://<server>:9090/targets?search=` and find the target corresponding to your queue manager.
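As a sketch, a PromQL expression such as the following could be used on a dashboard to watch the worst reported collection time over a recent window (the one-hour range is an arbitrary choice, not a recommendation):

```
max_over_time(ibmmq_qmgr_exporter_collection_time[1h])
```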

For other collectors, there is no specific metric. But the timestamps on each collection block allow you to deduce the
time taken: the difference between successive iterations is the collection time plus the `interval` configuration
value.

## Ensuring collection intervals have enough time to run
The Prometheus `scrape_configs` configuration attributes can be set for all or some collectors. In particular,
you will probably want to change the `scrape_interval` and `scrape_timeout` values for the jobs associated with large
queue managers. Use the reported collection processing time as a basis from which to set these values.
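For example, a _prometheus.yml_ job for a large queue manager might look like this sketch (the job name, target address, and timings are illustrative - base the numbers on your observed collection times):

```
scrape_configs:
  - job_name: "ibmmq-large-qmgr"   # hypothetical job name
    scrape_interval: 60s           # comfortably above the observed collection time
    scrape_timeout: 50s            # must not exceed scrape_interval
    static_configs:
      - targets: ["mqhost.example.com:9157"]   # illustrative collector endpoint
```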

For other collector models, the collector-specific `interval` attribute determines the gap between each push of the
metrics. There is no "maximum" collection time.

## Reducing metric publication interval from queue manager
By default, the queue manager publishes resource metrics every 10 seconds. This matches fairly well with the Prometheus
default scrape interval of 15s. But if you increase the scrape interval, you might also want to reduce the frequency of
publications so that fewer "merges" have to be done when processing the subscription destination queues. Setting the
following stanza in the _qm.ini_ file changes that frequency:
```
  TuningParameters:
    MonitorPublishHeartBeat = 30
```
This value is given in seconds, and the attribute name is case-sensitive. As increasing the value reduces the frequency
of generation, it may cause you to miss short-lived transient spikes in some values. That is the tradeoff you have to
evaluate. But having a value smaller than the time taken to process the publications might result in a never-ending
scrape. The publication-processing portion of the scrape can be seen in a debug log.

## Reducing subscriptions made to queue manager
Reducing the total number of subscriptions made will reduce the data that needs to be processed, at the cost of
missing some metrics that you might find useful. See also the section in the [README](README.md) file about using
durable subscriptions.

* You can disable all use of published resource metrics, and rely on the `DISPLAY xxSTATUS` responses. This clearly
  reduces the data, but you lose out on many useful metrics. It is essentially how we monitor z/OS queue managers, as
  they do not have the publication model for metrics. If you want this approach, set the `global.usePublications`
  configuration option to `false`.

* You can reduce the total number of subscriptions made for queue metrics. The `filters.queueSubscriptionSelector` list
  defines the sets of topics that you might be interested in. The complete set - for now - is
  [OPENCLOSE, INQSET, PUT, GET, GENERAL]. In many cases, only the last three of these may be of interest. The smaller
  set reduces the number of publications per queue. Within each set, multiple metrics are created, but there is no way
  to report on only a subset of the metrics in each set.

* You can choose not to subscribe to any queue metrics, but still subscribe to metrics for other resources such as the
  queue manager and Native HA, by setting the filter to `NONE`. If you do this, then many queue metrics become
  unavailable. However, the current queue depth will still be available, as it can also be determined from the
  `DISPLAY QSTATUS` response.
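Pulling those options together, the relevant part of a collector YAML configuration might look something like this sketch (check the sample configuration files in this repository for the exact key layout):

```
global:
  usePublications: true        # set to false to rely solely on DISPLAY xxSTATUS
filters:
  queueSubscriptionSelector:   # subscribe only to the topic sets you need
    - PUT
    - GET
    - GENERAL
```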

## Reducing the number of monitored objects and status requests
Each object type (queues, channels etc) has a block in the collector configuration that names which objects should be
monitored. While both positive and negative wildcards can be used in these blocks, it is probably most efficient to use
only positive wildcards. That allows the `DISPLAY xxSTATUS` requests to pass the wildcards directly to the queue
manager commands; if there are any negative patterns, the collector has to work out which objects match the pattern, and
then inquire on the remainder individually.
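For example, a monitored-queues block using only positive wildcards might look like this sketch (the queue name patterns are purely illustrative):

```
objects:
  queues:
    - "APP.*"        # positive wildcard: passed straight through to the queue manager
    - "PAYROLL.*"
    # avoid negative patterns such as "!SYSTEM.*" where possible, as they
    # force the collector to expand and filter the object list itself
```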

## Other configuration options
The `global.pollInterval` and `global.rediscoverInterval` options may help to further reduce inquiries.

The first of these controls how frequently the `DISPLAY xxSTATUS` commands are issued, assuming
`global.useObjectStatus` is `true`. In some circumstances, you might not need all of the responses as regularly as the
published metrics are handled.

The second attribute controls how frequently the collector reassesses the list of objects to be monitored, and their
more stable attributes such as the `DESCRIPTION` or `MAXDEPTH` settings on a queue. If you have a large number of
queues that do not change frequently, then you might want to increase the rediscovery interval. The default is 1 hour.
The tradeoff here is that newly-defined queues may not have any metrics reported until this interval expires.
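Combining those two options, a configuration fragment might look like this (the values are illustrative, not recommendations; check the repository's sample configuration files for the exact value format expected):

```
global:
  useObjectStatus: true
  pollInterval: 30s        # how often the DISPLAY xxSTATUS commands are issued
  rediscoverInterval: 4h   # how often the object list and stable attributes are refreshed
```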

## Dividing the workload
One further approach that you might like to consider, though I wouldn't usually recommend it, is to have two or more
collectors running against the same queue manager, each configured to monitor a different set of queues. So a
collector listening on port 9157 might monitor queues A*-M*, while another collector on port 9158 monitors queues N*-Z*.
You would likely need additional configuration to reduce duplication of other components, for example by using the
`jobname` or `instance` as a filter element on dashboard queries, but it might be one way to reduce the time taken for a
single scrape.
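The Prometheus side of such a split might be sketched as two separate scrape jobs (the job names, host, and ports are illustrative):

```
scrape_configs:
  - job_name: "qmgr1-queues-a-m"   # collector configured for queues A*-M*
    static_configs:
      - targets: ["mqhost.example.com:9157"]
  - job_name: "qmgr1-queues-n-z"   # collector configured for queues N*-Z*
    static_configs:
      - targets: ["mqhost.example.com:9158"]
```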

## Very slow queue managers
The collectors wait for a short time for each response to a status request. If the timeout expires with no expected
message appearing, then an error is reported. Some queue managers - particularly when hosted in cloud services - have
appeared to "stall" for a period: even though they are not especially busy, the response messages have not appeared in
time. The default wait of 3 seconds can be tuned using the `connection.waitInterval` option.
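If you see these timeouts, the wait can be extended with a fragment like this sketch (the value of 10 is just an example; check the sample configuration files for the exact units and layout):

```
connection:
  waitInterval: 10   # wait longer for each status response (default is 3 seconds)
```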

For all collectors _except_ Prometheus, a small number of these timeout errors are permitted consecutively. The failure
count is reset after a successful collection. See _pkg/errors/errors.go_ for details. The Prometheus collector has an
automatic reconnect option after failures, so it does not currently use this strategy.