Limit concurrency for evaluations/inferences in ui e2e tests (tensorzero#2948)

Aaron1011 · Copilot · web-flow · commit 237201b67f47 · 2025-07-31T03:40:08.000Z
* Limit concurrency for evaluations/inferences in ui e2e tests

When regenerating the fixtures, we need to run evaluations with a
concurrency limit of 1, and only load one of the (duplicate) image
inferences on the playground page. This prevents us from having multiple
in-flight inferences with identical inputs at once, which can cause their
shared cache entry to get overwritten.

While we might be able to increase the concurrency limit in 'normal'
(non-regen) mode, we've had a lot of issues with this test. For now,
let's have regen and non-regen modes run exactly the same logic.
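
The race described above can be illustrated with a small, self-contained
simulation. This is only a sketch under stated assumptions: the cache, the
"model", and the names `runWithLimit`, `simulate`, and `infer` are all
hypothetical and are not taken from the TensorZero codebase. It assumes a
cache keyed by input and a model whose output differs from call to call.

```typescript
// Hypothetical simulation; none of these names come from the TensorZero codebase.

// Run `tasks` with at most `limit` of them in flight at a time.
async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results = new Array<T>(tasks.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const i = next++; // claimed synchronously, so no two workers share an index
      results[i] = await tasks[i]();
    }
  };
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker(),
  );
  await Promise.all(workers);
  return results;
}

// Simulate two datapoints with identical inputs against a non-deterministic "model".
async function simulate(
  limit: number,
): Promise<{ writes: number; outputs: string[] }> {
  const cache = new Map<string, string>(); // assumed: cache keyed by input
  let writes = 0;
  let calls = 0;

  const infer = async (input: string): Promise<string> => {
    const hit = cache.get(input);
    if (hit !== undefined) return hit; // cache hit: no model call, no write
    // Stand-in for model latency; lets the other request start before we write.
    await new Promise<void>((resolve) => setTimeout(resolve, 0));
    const output = `answer-${++calls}`; // differs per call, like a sampled completion
    cache.set(input, output); // a second in-flight request overwrites the first entry
    writes++;
    return output;
  };

  const outputs = await runWithLimit(
    [() => infer("same-image"), () => infer("same-image")],
    limit,
  );
  return { writes, outputs };
}
```

In this sketch, `simulate(2)` lets both requests miss the cache, so the same
key is written twice with different outputs, while `simulate(1)` produces a
single write and the second request is served from the cache.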

* Fix typo

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
diff --git a/ui/app/routes/evaluations/LaunchEvaluationModal.tsx b/ui/app/routes/evaluations/LaunchEvaluationModal.tsx
@@ -197,6 +197,7 @@ function EvaluationForm({
           type="number"
           id="concurrency_limit"
           name="concurrency_limit"
+          data-testid="concurrency-limit"
           min="1"
           value={concurrencyLimit}
           onChange={(e) => setConcurrencyLimit(e.target.value)}
diff --git a/ui/e2e_tests/evaluations.spec.ts b/ui/e2e_tests/evaluations.spec.ts
@@ -27,6 +27,10 @@ test("push the new run button, launch an evaluation", async ({ page }) => {
   await page.getByText("Select a variant").click();
   await page.waitForTimeout(500);
   await page.getByRole("option", { name: "gpt4o_mini_initial_prompt" }).click();
+  // IMPORTANT - we need to set concurrency to 1 in order to prevent a race condition
+  // when regenerating fixtures, as we intentionally have multiple datapoints with
+  // identical inputs. See https://www.notion.so/tensorzerodotcom/Evaluations-cache-non-determinism-23a7520bbad3801f80fceaa7e859ce06
+  await page.getByTestId("concurrency-limit").fill("1");
   await page.getByRole("button", { name: "Launch" }).click();
 
   await expect(
@@ -70,6 +74,10 @@ test("push the new run button, launch an image evaluation", async ({
   await page.getByText("Select a variant").click();
   await page.waitForTimeout(500);
   await page.getByRole("option", { name: "honest_answer" }).click();
+  // IMPORTANT - we need to set concurrency to 1 in order to prevent a race condition
+  // when regenerating fixtures, as we intentionally have multiple datapoints with
+  // identical inputs. See https://www.notion.so/tensorzerodotcom/Evaluations-cache-non-determinism-23a7520bbad3801f80fceaa7e859ce06
+  await page.getByTestId("concurrency-limit").fill("1");
   await page.getByRole("button", { name: "Launch" }).click();
 
   await expect(
diff --git a/ui/e2e_tests/playground.spec.ts b/ui/e2e_tests/playground.spec.ts
@@ -100,7 +100,10 @@ test("playground should work for extract_entities JSON function with 2 variants"
 test("playground should work for image_judger function with images in input", async ({
   page,
 }) => {
-  await page.goto("/playground?limit=2");
+  // We set 'limit=1' so that we don't make parallel inference requests
+  // (two of the datapoints have the same input, and could trample on each other's
+  // cache entries)
+  await page.goto("/playground?limit=1");
   await expect(page.getByText("Select a function")).toBeVisible();
 
   // Select function 'image_judger' by typing in the combobox
@@ -122,14 +125,14 @@ test("playground should work for image_judger function with images in input", as
   await expect(page.getByText("baz")).toBeVisible();
   await expect(page.getByRole("link", { name: "honest_answer" })).toBeVisible();
 
-  // Verify that there are 2 inputs and 2 reference outputs
-  await expect(page.getByRole("heading", { name: "Input" })).toHaveCount(2);
+  // Verify that there is 1 input and 1 reference output
+  await expect(page.getByRole("heading", { name: "Input" })).toHaveCount(1);
   await expect(
     page.getByRole("heading", { name: "Reference Output" }),
-  ).toHaveCount(2);
+  ).toHaveCount(1);
 
-  // Verify that images are rendered in the input elements
-  await expect(page.locator("img")).toHaveCount(2);
+  // Verify that the image is rendered in the input element
+  await expect(page.locator("img")).toHaveCount(1);
 
   // Wait for at least one textbox containing "crab"
   // Wait for and assert at least one exists