Commit 5f23e56

Add Py4J comparison section to README
Document the architectural evolution from the previous Py4J callback approach (PR #2430) to the current llmPredict DML built-in with HTTP.
1 parent fe35989 commit 5f23e56

1 file changed: scripts/staging/llm-bench/README.md (+46 -0)
@@ -265,6 +265,52 @@

without deploying vLLM. The long-term vision is to replace this external
server approach entirely with native DML transformer operations that
run model inference directly inside SystemDS's matrix engine.

### Previous Approach: Py4J Callback (PR #2430)

The initial implementation (closed PR #2430) loaded HuggingFace models
directly inside a Python worker process and used Py4J callbacks to bridge
Java and Python:

```
Python worker (loads model into GPU memory)
      ^
      | Py4J callback: generateBatch(prompts)
      v
Java JMLC (PreparedScript.generateBatchWithMetrics)
```

This approach had several drawbacks:

- **Tight coupling:** Model loading, tokenization, and inference all lived
  in `llm_worker.py`, requiring Python-side changes for every model config.
- **No standard API:** Used a custom Py4J callback protocol instead of the
  OpenAI-compatible `/v1/completions` interface that vLLM and other servers
  already provide.
- **Limited optimization:** The Python worker reimplemented batching and
  tokenization rather than leveraging vLLM's continuous batching,
  PagedAttention, and KV cache management.
- **Process lifecycle:** Java had to manage the Python worker process
  (`loadModel()` / `releaseModel()`) with 300-second timeouts for large models.

The current approach (this PR) replaces the Py4J callback with a native
DML built-in (`llmPredict`) that issues HTTP requests to any
OpenAI-compatible server:

```
DML script: llmPredict(prompts, url=..., model=...)
    -> LlmPredictCPInstruction (Java HTTP client)
    -> Any OpenAI-compatible server (vLLM, llm_server.py, etc.)
```
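
To make the request shape concrete, here is a hedged Python sketch of the JSON exchanged with an OpenAI-compatible `/v1/completions` endpoint. The field names follow the public completions schema; the helper names and the exact parameters the Java instruction sends are illustrative assumptions, not the repository's code.

```python
import json

# Hypothetical helpers mirroring what a client like LlmPredictCPInstruction
# does over HTTP: build a JSON request body, then pull the generated text
# out of the JSON response.

def build_completion_request(prompt, model, max_tokens=64):
    """Body for POST /v1/completions (OpenAI-compatible schema)."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def extract_completion_text(response_body):
    """Read choices[0].text from a /v1/completions JSON response."""
    return json.loads(response_body)["choices"][0]["text"]

req = build_completion_request("Translate to French: hello", "my-model")
# Canned response in the shape vLLM and other compatible servers return:
resp = json.dumps({"choices": [{"index": 0, "text": "bonjour"}]})
print(extract_completion_text(resp))  # -> bonjour
```

Because every backend in the table below speaks this same schema, the client side never changes when the server is swapped.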

Benefits of the current approach:

- **Decoupled:** The inference server is independent: swap vLLM for TGI,
  Ollama, or any OpenAI-compatible endpoint without changing DML scripts
  or Java code.
- **Standard protocol:** Uses the `/v1/completions` API, making benchmarks
  directly comparable across backends.
- **Server-side optimization:** vLLM handles batching, KV cache management,
  PagedAttention, and speculative decoding transparently.
- **Simpler Java code:** `LlmPredictCPInstruction` is a single 216-line
  class that builds JSON, sends HTTP requests, and parses responses, with
  no process management.
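
The decoupling above can be exercised without a GPU: any process that answers `POST /v1/completions` will do. Below is a hedged Python sketch of a minimal stub endpoint in the spirit of `llm_server.py` (whose actual implementation is not shown here); the class and function names are illustrative, and the "generation" is a trivial echo rather than model inference.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CompletionsHandler(BaseHTTPRequestHandler):
    """Answers /v1/completions with an OpenAI-shaped JSON body."""

    def do_POST(self):
        if self.path != "/v1/completions":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        req = json.loads(self.rfile.read(length))
        # Echo-style "generation"; a real server runs model inference here.
        body = json.dumps({
            "model": req.get("model", "stub"),
            "choices": [{"index": 0,
                         "text": "stub completion for: " + req["prompt"]}],
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep output quiet

def serve_once(port=0):
    """Serve exactly one request in the background; returns (server, port)."""
    server = HTTPServer(("127.0.0.1", port), CompletionsHandler)
    threading.Thread(target=server.handle_request, daemon=True).start()
    return server, server.server_address[1]

if __name__ == "__main__":
    _, port = serve_once()
    payload = json.dumps({"model": "stub-model", "prompt": "hello"}).encode()
    req = urllib.request.Request(
        f"http://127.0.0.1:{port}/v1/completions", data=payload,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["text"])
    # -> stub completion for: hello
```

Pointing `llmPredict`'s `url` parameter at such a stub exercises the full HTTP path while keeping benchmark runs hardware-independent.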
## Benchmark Results
### Evaluation Methodology
