
Commit fec8763

Reduce embedding encode batch size to prevent GPU OOM
The hardcoded batch_size=256 caused CUDA OOM on the 8GB Vast.ai GPU when the backend already has models loaded for serving.
1 parent b48d733 commit fec8763

File tree

1 file changed (+1, -1)

src/lean_explore/util/embedding_client.py

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ def _encode():
         encode_kwargs = {
             "show_progress_bar": False,
             "convert_to_numpy": True,
-            "batch_size": 256,  # Larger batches for GPU utilization
+            "batch_size": 32,
         }
         if is_query:
             encode_kwargs["prompt_name"] = "query"
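
A fixed batch size trades throughput for a hard memory ceiling: the commit simply drops it to 32 so encoding fits alongside the models already resident on the 8GB GPU. An alternative pattern, sketched below, is to retry with a halved batch size when an out-of-memory error occurs. This is not what the commit does; the function name `encode_in_batches` and the use of Python's built-in `MemoryError` as a stand-in for a CUDA OOM exception are illustrative assumptions.

```python
def encode_in_batches(encode_fn, texts, batch_size=32, min_batch_size=1):
    """Encode texts in slices, halving batch_size on out-of-memory errors.

    A generic fallback sketch; encode_fn stands in for a model's encode
    call, and MemoryError stands in for a framework-specific OOM error.
    """
    results = []
    i = 0
    while i < len(texts):
        batch = texts[i : i + batch_size]
        try:
            results.extend(encode_fn(batch))
            i += batch_size  # advance only after a successful batch
        except MemoryError:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further; propagate the OOM
            batch_size //= 2  # retry the same slice with a smaller batch
    return results
```

The fixed `batch_size=32` in the commit is the simpler choice when the available headroom is known; a retry loop like this only pays off when memory pressure varies at runtime.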
