
BUILDS_INPROGRESS metric not working as expected #2051

@rgaiacs

Description

As reported downstream on jupyterhub/mybinder.org-deploy#3594, the binderhub_inprogress_builds / BUILDS_INPROGRESS metric is not working as expected.

(Grafana screenshots: the observed gauge only ever increases, while the expected gauge rises and falls.)

BUILDS_INPROGRESS is defined in

```python
BUILDS_INPROGRESS = Gauge("binderhub_inprogress_builds", "Builds currently in progress")
```
The Prometheus client_python `Gauge` can go up and down, but the observed Grafana visualisation shows it only ever going up. `track_inprogress()` is used in

```python
with BUILDS_INPROGRESS.track_inprogress():
    done = False
    failed = False

    def _check_result(future):
        nonlocal done
        nonlocal failed
        try:
            r = future.result()
            app_log.debug("task completed: %s", r)
        except Exception:
            app_log.error("task failed", exc_info=True)
            done = True
            failed = True
            # TODO: Propagate error to front-end

    build_starttime = time.perf_counter()
    pool = self.settings["build_pool"]

    # Start building
    submit_future = pool.submit(build.submit)
    submit_future.add_done_callback(_check_result)
    IOLoop.current().add_callback(lambda: submit_future)
    log_future = None

    # initial waiting event
    await self.emit(
        {
            "phase": "waiting",
            "message": "Waiting for build to start...\n",
        }
    )

    while not done:
        progress = await q.get()
        # FIXME: If pod goes into an unrecoverable stage, such as
        # ImagePullBackoff or whatever, we should fail properly.
        if progress.kind == ProgressEvent.Kind.BUILD_STATUS_CHANGE:
            phase = progress.payload.value
            if progress.payload == ProgressEvent.BuildStatus.PENDING:
                # nothing to do, just waiting
                continue
            elif progress.payload == ProgressEvent.BuildStatus.BUILT:
                if build_only:
                    message = "Done! Image built\n"
                    phase = "ready"
                else:
                    message = "Built image, launching...\n"
                event = {
                    "phase": phase,
                    "message": message,
                    "imageName": image_name,
                }
                BUILD_TIME.labels(status="success").observe(
                    time.perf_counter() - build_starttime
                )
                BUILD_COUNT.labels(
                    status="success", **self.repo_metric_labels
                ).inc()
                done = True
            elif progress.payload == ProgressEvent.BuildStatus.RUNNING:
                # start capturing build logs once the pod is running
                if log_future is None:
                    log_future = pool.submit(build.stream_logs)
                    log_future.add_done_callback(_check_result)
                continue
            elif progress.payload == ProgressEvent.BuildStatus.BUILT:
                # Do nothing, is ok!
                continue
            elif progress.payload == ProgressEvent.BuildStatus.FAILED:
                event = {"phase": phase}
            elif progress.payload == ProgressEvent.BuildStatus.UNKNOWN:
                event = {"phase": phase}
            else:
                raise ValueError(
                    f"Found unknown phase {phase} in ProgressEvent"
                )
        elif progress.kind == ProgressEvent.Kind.LOG_MESSAGE:
            # The logs are coming out of repo2docker, so we expect
            # them to be JSON structured anyway
            event = progress.payload
            payload = json.loads(event)
            if payload.get("phase") in ("failure", "failed"):
                failed = True
                BUILD_TIME.labels(status="failure").observe(
                    time.perf_counter() - build_starttime
                )
                BUILD_COUNT.labels(
                    status="failure", **self.repo_metric_labels
                ).inc()
        await self.emit(event)
```
as described in the Prometheus client_python documentation, so the gauge should be decremented whenever the `with` block exits.
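To sanity-check the documented semantics, here is a stdlib-only sketch (`MiniGauge` is a hypothetical stand-in, not the real `prometheus_client.Gauge`): a `track_inprogress()`-style context manager decrements on both normal exit and on an escaping exception, so a gauge that only goes up suggests the `with` block is never exiting at all.

```python
import contextlib


class MiniGauge:
    # Hypothetical stdlib-only stand-in for prometheus_client.Gauge,
    # mimicking the documented track_inprogress() semantics.
    def __init__(self):
        self.value = 0

    @contextlib.contextmanager
    def track_inprogress(self):
        self.value += 1  # inc() when the with block is entered
        try:
            yield
        finally:
            self.value -= 1  # dec() on exit, even via an exception


g = MiniGauge()

with g.track_inprogress():
    assert g.value == 1
assert g.value == 0  # normal exit decrements

try:
    with g.track_inprogress():
        raise RuntimeError("build failed")
except RuntimeError:
    pass
assert g.value == 0  # an escaping exception also decrements
```

If the real gauge behaves like this, a permanently increasing value would point at the handler never reaching the end of the `with` block rather than at the metric itself.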

My search suggests two possible causes:

  • the process might be terminated before the `with` block completes
  • the `async` / `await` control flow
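The async/await hypothesis can be probed with a stdlib-only sketch (the counter and context manager here are hypothetical stand-ins for the gauge, not `prometheus_client` itself): in asyncio, cancelling a coroutine that is suspended inside the `with` block still runs the `finally` decrement, which would point the finger at hard process termination rather than at `async` / `await` per se.

```python
import asyncio
import contextlib

# Hypothetical stand-in for the gauge value.
in_progress = 0


@contextlib.contextmanager
def track_inprogress():
    global in_progress
    in_progress += 1
    try:
        yield
    finally:
        in_progress -= 1


async def build_handler():
    with track_inprogress():
        await asyncio.sleep(60)  # stands in for `await q.get()`


async def main():
    task = asyncio.create_task(build_handler())
    await asyncio.sleep(0)  # let the handler enter the with block
    assert in_progress == 1
    task.cancel()  # e.g. the client disconnects mid-build
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return in_progress


print(asyncio.run(main()))  # prints 0: cancellation still ran the decrement
```

A SIGKILL or OOM-kill, by contrast, gives the process no chance to run the `finally` block, and the in-memory gauge is simply reborn at whatever value the restarted process accumulates.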

@yuvipanda and @manics, what does your experience say?

Should we replace `BUILDS_INPROGRESS` with a downstream calculation of `BUILDS_STARTED - BUILDS_COMPLETED - BUILDS_FAILED`?
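If we go that route, the downstream calculation could be a Prometheus recording rule along these lines (the metric names below are illustrative guesses, not the actual exported names, which would need checking against the `/metrics` endpoint):

```yaml
groups:
  - name: binderhub
    rules:
      - record: binderhub:builds_inprogress:estimate
        expr: |
          binderhub_builds_started_total
            - binderhub_builds_completed_total
            - binderhub_builds_failed_total
```

Note that plain counter subtraction is still approximate across process restarts, since all three counters reset to zero together while the true in-progress count may not be zero at that moment.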
