Fix trimmed labels being used in other metrics, due to aliasing (#4588)

vitorguidi · web-flow · commit 777669ef70ad · 2025-01-08T13:33:14.000-03:00
This addresses the errors on b/384820142.

The main issue is that, in _MetricsRecorder, trimmed_labels was not a
copy but a reference to self._labels, so deleting job from it also
removed job from self._labels. As a result, many utask duration
samples with different jobs reduced to the same labels (deduplication
seems to only take into account the labels declared in
monitoring_metrics.py).
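
The aliasing problem can be reproduced outside ClusterFuzz. The sketch
below is illustrative only: trim_aliased and trim_copied are
hypothetical helpers that mirror the old and new bodies of __exit__,
and the label values are made up. It assumes the label values are
immutable (strings and booleans), which is why a shallow copy is
enough.

```python
def trim_aliased(labels, task_succeeded, error_condition):
  """Old behavior: binds a second name to the caller's dict and mutates it."""
  trimmed_labels = labels  # Same object, not a copy.
  del trimmed_labels['job']
  trimmed_labels['task_succeeded'] = task_succeeded
  trimmed_labels['error_condition'] = error_condition
  return trimmed_labels


def trim_copied(labels, task_succeeded, error_condition):
  """New behavior: builds a new top-level dict, so the caller's dict is kept."""
  trimmed_labels = {
      **labels, 'task_succeeded': task_succeeded,
      'error_condition': error_condition
  }
  del trimmed_labels['job']
  return trimmed_labels


# Hypothetical label values, for illustration only.
labels = {'task': 'fuzz', 'job': 'some_job', 'platform': 'LINUX'}

trimmed = trim_copied(labels, True, 'N/A')
print('job' in labels)    # True: the original labels still carry job.
print('job' in trimmed)   # False: only the copy was trimmed.

trimmed = trim_aliased(labels, True, 'N/A')
print('job' in labels)    # False: job was deleted from the original too.
print(trimmed is labels)  # True: both names refer to one dict.
```

With the aliased version, any metric emitted later from the same
labels dict silently loses job and gains task_succeeded and
error_condition, which matches the symptom described above.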

The other error is related to saturation, where time series would be
written too often. This probably happens due to the loss of the job
label, causing many requests to a subset of the original labels. It
may go away once this lands; otherwise, more investigation is needed.
diff --git a/src/clusterfuzz/_internal/bot/tasks/utasks/__init__.py b/src/clusterfuzz/_internal/bot/tasks/utasks/__init__.py
@@ -177,10 +177,11 @@ def __exit__(self, _exc_type, _exc_value, _traceback):
     # Get rid of job as a label, so we can have another metric to make
     # error conditions more explicit, respecting the 30k distinct
     # labels limit recommended by gcp.
-    trimmed_labels = self._labels
+    trimmed_labels = {
+        **self._labels, 'task_succeeded': task_succeeded,
+        'error_condition': error_condition
+    }
     del trimmed_labels['job']
-    trimmed_labels['task_succeeded'] = task_succeeded
-    trimmed_labels['error_condition'] = error_condition
     monitoring_metrics.TASK_OUTCOME_COUNT_BY_ERROR_TYPE.increment(
         trimmed_labels)