
BUILDS_INPROGRESS metric not working as expected #2051

@rgaiacs

Description

As reported downstream on jupyterhub/mybinder.org-deploy#3594, the binderhub_inprogress_builds / BUILDS_INPROGRESS metric is not working as expected.

(Grafana screenshots: the observed gauge only ever increases, while the expected gauge rises and falls.)

BUILDS_INPROGRESS is defined in

```python
BUILDS_INPROGRESS = Gauge("binderhub_inprogress_builds", "Builds currently in progress")
```
The Prometheus client_python `Gauge` can go up and down, but the observed Grafana visualisation shows it only ever going up. `track_inprogress()` is used in

```python
with BUILDS_INPROGRESS.track_inprogress():
    done = False
    failed = False

    def _check_result(future):
        nonlocal done
        nonlocal failed
        try:
            r = future.result()
            app_log.debug("task completed: %s", r)
        except Exception:
            app_log.error("task failed", exc_info=True)
            done = True
            failed = True
            # TODO: Propagate error to front-end

    build_starttime = time.perf_counter()
    pool = self.settings["build_pool"]

    # Start building
    submit_future = pool.submit(build.submit)
    submit_future.add_done_callback(_check_result)
    IOLoop.current().add_callback(lambda: submit_future)
    log_future = None

    # initial waiting event
    await self.emit(
        {
            "phase": "waiting",
            "message": "Waiting for build to start...\n",
        }
    )

    while not done:
        progress = await q.get()
        # FIXME: If pod goes into an unrecoverable stage, such as
        # ImagePullBackoff or whatever, we should fail properly.
        if progress.kind == ProgressEvent.Kind.BUILD_STATUS_CHANGE:
            phase = progress.payload.value
            if progress.payload == ProgressEvent.BuildStatus.PENDING:
                # nothing to do, just waiting
                continue
            elif progress.payload == ProgressEvent.BuildStatus.BUILT:
                if build_only:
                    message = "Done! Image built\n"
                    phase = "ready"
                else:
                    message = "Built image, launching...\n"
                event = {
                    "phase": phase,
                    "message": message,
                    "imageName": image_name,
                }
                BUILD_TIME.labels(status="success").observe(
                    time.perf_counter() - build_starttime
                )
                BUILD_COUNT.labels(
                    status="success", **self.repo_metric_labels
                ).inc()
                done = True
            elif progress.payload == ProgressEvent.BuildStatus.RUNNING:
                # start capturing build logs once the pod is running
                if log_future is None:
                    log_future = pool.submit(build.stream_logs)
                    log_future.add_done_callback(_check_result)
                continue
            elif progress.payload == ProgressEvent.BuildStatus.BUILT:
                # Do nothing, is ok!
                continue
            elif progress.payload == ProgressEvent.BuildStatus.FAILED:
                event = {"phase": phase}
            elif progress.payload == ProgressEvent.BuildStatus.UNKNOWN:
                event = {"phase": phase}
            else:
                raise ValueError(
                    f"Found unknown phase {phase} in ProgressEvent"
                )
        elif progress.kind == ProgressEvent.Kind.LOG_MESSAGE:
            # The logs are coming out of repo2docker, so we expect
            # them to be JSON structured anyway
            event = progress.payload
            payload = json.loads(event)
            if payload.get("phase") in ("failure", "failed"):
                failed = True
                BUILD_TIME.labels(status="failure").observe(
                    time.perf_counter() - build_starttime
                )
                BUILD_COUNT.labels(
                    status="failure", **self.repo_metric_labels
                ).inc()
        await self.emit(event)
```
as described in the Prometheus client_python documentation, so the gauge should be decremented whenever the `with` block exits.
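To sanity-check the documented semantics, here is a stdlib-only sketch (`MiniGauge` is a hypothetical stand-in, not the real `prometheus_client.Gauge`): a `track_inprogress()`-style context manager decrements on both normal exit and on an escaping exception, so a gauge that only goes up suggests the `with` block is never exiting at all.

```python
import contextlib


class MiniGauge:
    # Hypothetical stdlib-only stand-in for prometheus_client.Gauge,
    # mimicking the documented track_inprogress() semantics.
    def __init__(self):
        self.value = 0

    @contextlib.contextmanager
    def track_inprogress(self):
        self.value += 1  # inc() when the with block is entered
        try:
            yield
        finally:
            self.value -= 1  # dec() on exit, even via an exception


g = MiniGauge()

with g.track_inprogress():
    assert g.value == 1
assert g.value == 0  # normal exit decrements

try:
    with g.track_inprogress():
        raise RuntimeError("build failed")
except RuntimeError:
    pass
assert g.value == 0  # an escaping exception also decrements
```

If the real gauge behaves like this, a permanently increasing value would point at the handler never reaching the end of the `with` block rather than at the metric itself.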

My search suggests two possible causes:

  • the process might be terminated before the `with` block completes
  • the `async` / `await` control flow
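The async/await hypothesis can be probed with a stdlib-only sketch (the counter and context manager here are hypothetical stand-ins for the gauge, not `prometheus_client` itself): in asyncio, cancelling a coroutine that is suspended inside the `with` block still runs the `finally` decrement, which would point the finger at hard process termination rather than at `async` / `await` per se.

```python
import asyncio
import contextlib

# Hypothetical stand-in for the gauge value.
in_progress = 0


@contextlib.contextmanager
def track_inprogress():
    global in_progress
    in_progress += 1
    try:
        yield
    finally:
        in_progress -= 1


async def build_handler():
    with track_inprogress():
        await asyncio.sleep(60)  # stands in for `await q.get()`


async def main():
    task = asyncio.create_task(build_handler())
    await asyncio.sleep(0)  # let the handler enter the with block
    assert in_progress == 1
    task.cancel()  # e.g. the client disconnects mid-build
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return in_progress


print(asyncio.run(main()))  # prints 0: cancellation still ran the decrement
```

A SIGKILL or OOM-kill, by contrast, gives the process no chance to run the `finally` block, and the in-memory gauge is simply reborn at whatever value the restarted process accumulates.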

@yuvipanda and @manics, what does your experience say?

Should we replace `BUILDS_INPROGRESS` with a downstream calculation of `BUILDS_STARTED - BUILDS_COMPLETED - BUILDS_FAILED`?
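If we go that route, the downstream calculation could be a Prometheus recording rule along these lines (the metric names below are illustrative guesses, not the actual exported names, which would need checking against the `/metrics` endpoint):

```yaml
groups:
  - name: binderhub
    rules:
      - record: binderhub:builds_inprogress:estimate
        expr: |
          binderhub_builds_started_total
            - binderhub_builds_completed_total
            - binderhub_builds_failed_total
```

Note that plain counter subtraction is still approximate across process restarts, since all three counters reset to zero together while the true in-progress count may not be zero at that moment.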
