without deploying vLLM. The long-term vision is to replace this external
server approach entirely with native DML transformer operations that
run model inference directly inside SystemDS's matrix engine.

### Previous Approach: Py4J Callback (PR #2430)

The initial implementation (closed PR #2430) loaded HuggingFace models
directly inside a Python worker process and used Py4J callbacks to bridge
Java and Python:

```
Python worker (loads model into GPU memory)
    ^
    | Py4J callback: generateBatch(prompts)
    v
Java JMLC (PreparedScript.generateBatchWithMetrics)
```
281+
This approach had several drawbacks:
- **Tight coupling:** Model loading, tokenization, and inference all lived
  in `llm_worker.py`, requiring Python-side changes for every model config.
- **No standard API:** Used a custom Py4J callback protocol instead of the
  OpenAI-compatible `/v1/completions` interface that vLLM and other servers
  already provide.
- **Limited optimization:** The Python worker reimplemented batching and
  tokenization rather than leveraging vLLM's continuous batching,
  PagedAttention, and KV cache management.
- **Process lifecycle:** Java had to manage the Python worker process
  (`loadModel()` / `releaseModel()`) with 300-second timeouts for large models.
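For reference, the Python side of a Py4J callback has roughly the following shape. This is a hedged sketch, not the actual `llm_worker.py`: the class name, method body, and the Java interface string are hypothetical stand-ins.

```python
class LlmWorker:
    """Hypothetical sketch of the Python-side callback object that the
    Java JMLC layer invoked over Py4J. The real llm_worker.py also
    handled model loading, tokenization, and batching."""

    def generateBatch(self, prompts):
        # Stand-in for running the HuggingFace model on a batch of prompts.
        return [f"completion for: {p}" for p in prompts]

    class Java:
        # Py4J convention: an inner `Java` class lists the Java interface(s)
        # this Python object implements, so Java can call back into it.
        # The interface name below is hypothetical.
        implements = ["org.apache.sysds.api.jmlc.GenerateCallback"]
```

In this pattern, the Python process registers such an object with a Py4J gateway's callback server, and every Java-side `generateBatch` call crosses the socket bridge into Python; this is exactly the coupling the HTTP approach removes.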
293+
The current approach (this PR) replaces the Py4J callback with a native
DML built-in (`llmPredict`) that issues HTTP requests to any
OpenAI-compatible server:

```
DML script: llmPredict(prompts, url=..., model=...)
  -> LlmPredictCPInstruction (Java HTTP client)
  -> Any OpenAI-compatible server (vLLM, llm_server.py, etc.)
```
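The flow above can be sketched end to end in Python: a stub standing in for any OpenAI-compatible server, and a client mirroring what `LlmPredictCPInstruction` does in Java (build a JSON payload, POST it to `/v1/completions`, parse the first choice). The request and response field names follow the OpenAI completions schema; the function and class names here are illustrative, not SystemDS APIs.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class StubCompletionsServer(BaseHTTPRequestHandler):
    """Minimal stand-in for an OpenAI-compatible server (vLLM, TGI, etc.)."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # Echo the prompt back in the standard completions response shape.
        reply = json.dumps({"choices": [{"text": "echo: " + body["prompt"]}]})
        data = reply.encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
    def log_message(self, *args):
        pass  # silence per-request logging

def llm_predict(prompt, url, model):
    """Client-side shape of the Java instruction: build JSON, POST to
    /v1/completions, extract choices[0].text from the response."""
    payload = json.dumps({"model": model, "prompt": prompt}).encode()
    req = Request(url + "/v1/completions", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

server = HTTPServer(("127.0.0.1", 0), StubCompletionsServer)
threading.Thread(target=server.serve_forever, daemon=True).start()
out = llm_predict("hello", f"http://127.0.0.1:{server.server_address[1]}",
                  "stub-model")
server.shutdown()
print(out)  # echo: hello
```

Because the client only depends on the `/v1/completions` contract, swapping the stub for vLLM or any other backend is purely a URL change.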
303+
Benefits of the current approach:
- **Decoupled:** The inference server is independent; swap vLLM for TGI,
  Ollama, or any OpenAI-compatible endpoint without changing DML scripts or
  Java code.
- **Standard protocol:** Uses the `/v1/completions` API, making benchmarks
  directly comparable across backends.
- **Server-side optimization:** vLLM handles batching, KV cache management,
  PagedAttention, and speculative decoding transparently.
- **Simpler Java code:** `LlmPredictCPInstruction` is a single 216-line class
  that builds JSON, sends HTTP, and parses the response, with no process
  management.
313+
## Benchmark Results

### Evaluation Methodology