Description
Note: This initiative is not complete. The only issue defined so far should give us more information on what else needs to be done once that issue is complete.
For https://github.com/2i2c-org/meta/issues/1814, we need to provide an easily explainable set of numbers that BD can use to talk to communities. We currently provide 'active users' only. We would like to provide:
- Total number of user sessions
- Median length of each user session
- 99th percentile length of user sessions
Fundamentally, this can be expressed as a Prometheus histogram, although we don't yet have a way to provide one.
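To make the histogram framing concrete, here is a minimal sketch in plain Python of how a list of session lengths could be folded into cumulative `le`-style buckets, and how the three numbers above could then be read back out. The bucket bounds and function names are assumptions made for illustration, and the quantile estimate uses linear interpolation inside a bucket in the spirit of `histogram_quantile`, not its exact implementation.

```python
from bisect import bisect_left
from math import inf

# Hypothetical bucket upper bounds in seconds (5 min up to 8 h, then +Inf).
BUCKETS = [300, 900, 1800, 3600, 7200, 14400, 28800, inf]


def to_cumulative_buckets(durations, bounds=BUCKETS):
    """Count sessions per bucket, cumulatively, like Prometheus `le` buckets."""
    counts = [0] * len(bounds)
    for duration in durations:
        # Index of the first bound that is >= duration.
        counts[bisect_left(bounds, duration)] += 1
    cumulative, running_total = [], 0
    for count in counts:
        running_total += count
        cumulative.append(running_total)
    return cumulative


def estimate_quantile(q, cumulative, bounds=BUCKETS):
    """Estimate a quantile by interpolating linearly inside one bucket."""
    total = cumulative[-1]
    if total == 0:
        return None
    rank = q * total
    idx = next(i for i, c in enumerate(cumulative) if c >= rank)
    if bounds[idx] == inf:
        # No upper bound to interpolate towards; fall back to the last finite bound.
        return bounds[idx - 1] if idx > 0 else None
    lower = bounds[idx - 1] if idx > 0 else 0.0
    below = cumulative[idx - 1] if idx > 0 else 0
    in_bucket = cumulative[idx] - below
    if in_bucket == 0:
        return lower
    return lower + (bounds[idx] - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    session_lengths = [600, 1200, 2400, 3000, 5000, 20000]  # made-up data
    cumulative = to_cumulative_buckets(session_lengths)
    print("total sessions:", cumulative[-1])
    print("median estimate (s):", estimate_quantile(0.5, cumulative))
    print("p99 estimate (s):", estimate_quantile(0.99, cumulative))
```

The estimates are only as precise as the bucket boundaries, so the choice of bounds would need to match the session lengths we actually expect to see.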
Problem
This is a little difficult to do currently with the information we have in Prometheus, because we don't really have any raw data on sessions. Session presence or absence is calculated by the presence or absence of metrics (like `kube_pod_status_phase`), which makes this difficult. While we can do things like 'number of running servers at any given instant' easily, it's more difficult to do 'number of running servers today'. It's even more difficult to do things like 'median session length of users today'. A useful way to think of this is that PromQL excels at 'map' operations and 'filter' operations, but struggles with 'reduce' operations.
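As an illustration of the 'reduce' step that is awkward in PromQL, the sketch below takes the timestamps at which a pod's metric was present and folds them into sessions, splitting wherever the gap between samples is larger than a tolerance. The scrape interval, the tolerance, and the function name are all assumptions; this is outside-of-Prometheus post-processing, not something PromQL does for us.

```python
from statistics import median


def sessions_from_samples(timestamps, scrape_interval=60, max_missed=2):
    """Group sample timestamps into (start, end) sessions.

    A new session starts whenever two consecutive samples are more than
    `scrape_interval * max_missed` seconds apart (both values are assumptions).
    """
    max_gap = scrape_interval * max_missed
    sessions = []
    start = previous = None
    for ts in sorted(timestamps):
        if start is None:
            start = previous = ts
        elif ts - previous > max_gap:
            # The metric was absent for too long: close the current session.
            sessions.append((start, previous))
            start = previous = ts
        else:
            previous = ts
    if start is not None:
        sessions.append((start, previous))
    return sessions


if __name__ == "__main__":
    # Made-up timestamps (seconds) at which a user pod was seen running.
    samples = [0, 60, 120, 600, 660, 3000]
    sessions = sessions_from_samples(samples)
    lengths = [end - start for start, end in sessions]
    print("sessions:", sessions)
    print("median length (s):", median(lengths))
```

Note that a session observed in a single scrape gets a length of zero here, so measured lengths are lower bounds with an error of up to one scrape interval at each end.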
It's possible that someone with more PromQL knowledge than me could do this, but I've asked around and haven't found a solution we can actually trust.
The last time we tried to do this, we ended up adding these metrics to JupyterHub directly so we could trust the active user counts: jupyterhub/jupyterhub#4214
Possible solutions
- PromQL + Grafana, based on our existing metrics. This may include using a Prometheus recording rule to create a new time series.
- Add this to JupyterHub as a metric, same as we did last time. However, that was possible because JupyterHub already tracked the one thing that was needed: users and when they were last active. While `Spawner` (the ORM object) in JupyterHub does have `started` and `last_activity` timestamps, these are reset whenever a user's server stops, so we can't really use them here. It also doesn't contain historical information: we can only know about the last server start / stop, nothing before that.
- Collect new metrics from the user pods. In particular, collect the metrics from jupyter-server/jupyter_server#1471 ("prometheus: Expose 3 activity metrics") and then do solution 1.
- Write our own Prometheus exporter in Python that treats Prometheus as a source of data, computes the additional info we want, and exports it for collection again by Prometheus. We know this will work, so this is the fallback if we can't do it through the other options.
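The last option could look roughly like the sketch below, which uses only the Python standard library: build a `query_range` URL for the Prometheus HTTP API, pull out the timestamps at which each pod was running, and render a derived value in the text exposition format for Prometheus to scrape again. The base URL, the `jupyter-.*` pod name pattern, and the output metric name `hub_user_sessions_in_window` are assumptions for illustration, and a real exporter would also need an HTTP endpoint and the session-counting logic in between.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_range_url(base_url, query, start, end, step):
    """Build a URL for the Prometheus HTTP API range query endpoint."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def fetch_json(url):
    """Fetch and decode a JSON response (no retries or auth in this sketch)."""
    with urlopen(url) as response:
        return json.load(response)


def running_timestamps(payload):
    """Map each pod name to the timestamps at which its series had value 1."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    by_pod = {}
    for series in payload["data"]["result"]:
        pod = series["metric"].get("pod", "")
        by_pod[pod] = [
            float(ts) for ts, value in series["values"] if float(value) == 1.0
        ]
    return by_pod


def render_gauge(name, help_text, samples):
    """Render (labels, value) pairs in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed query: user pods named jupyter-* reported as Running.
    query = 'kube_pod_status_phase{phase="Running", pod=~"jupyter-.*"}'
    # Hypothetical Prometheus address and time window (Unix seconds).
    url = query_range_url("http://localhost:9090", query, 1700000000, 1700086400, 60)
    print(url)
    # fetch_json(url) would go here; a tiny canned payload keeps the sketch offline.
    payload = {
        "status": "success",
        "data": {"resultType": "matrix", "result": [
            {"metric": {"pod": "jupyter-a"}, "values": [[0, "1"], [60, "0"], [120, "1"]]},
        ]},
    }
    seen = running_timestamps(payload)
    # Hypothetical metric name; the value is a placeholder, not a real session count.
    print(render_gauge("hub_user_sessions_in_window", "Sessions seen in the window",
                       [({"hub": "demo"}, len(seen))]))
```

Keeping the query, the parsing, and the rendering as separate functions means the session logic in the middle can be tested against canned payloads without a running Prometheus.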
Path forward
We'll experiment with 2i2c-org/meta#2544 to figure out if we can get at least active user session counts set up, and if not, what additional work we would need to do.