without deploying vLLM. The long-term vision is to replace this external
server approach entirely with native DML transformer operations that
run model inference directly inside SystemDS's matrix engine.

### Previous Approach: Py4J Callback (PR #2430)

The initial implementation (closed PR #2430) loaded HuggingFace models
directly inside a Python worker process and used Py4J callbacks to bridge
Java and Python:

```
Python worker (loads model into GPU memory)
    ^
    | Py4J callback: generateBatch(prompts)
    v
Java JMLC (PreparedScript.generateBatchWithMetrics)
```
281+
This approach had several drawbacks:
- **Tight coupling:** Model loading, tokenization, and inference all lived
  in `llm_worker.py`, requiring Python-side changes for every model config.
- **No standard API:** Used a custom Py4J callback protocol instead of the
  OpenAI-compatible `/v1/completions` interface that vLLM and other servers
  already provide.
- **Limited optimization:** The Python worker reimplemented batching and
  tokenization rather than leveraging vLLM's continuous batching,
  PagedAttention, and KV cache management.
- **Process lifecycle:** Java had to manage the Python worker process
  (`loadModel()` / `releaseModel()`) with 300-second timeouts for large models.
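For reference, the Python side of a Py4J callback has roughly the following shape. This is a hedged sketch, not the actual `llm_worker.py`: the class name, method body, and the Java interface string are hypothetical stand-ins.

```python
class LlmWorker:
    """Hypothetical sketch of the Python-side callback object that the
    Java JMLC layer invoked over Py4J. The real llm_worker.py also
    handled model loading, tokenization, and batching."""

    def generateBatch(self, prompts):
        # Stand-in for running the HuggingFace model on a batch of prompts.
        return [f"completion for: {p}" for p in prompts]

    class Java:
        # Py4J convention: an inner `Java` class lists the Java interface(s)
        # this Python object implements, so Java can call back into it.
        # The interface name below is hypothetical.
        implements = ["org.apache.sysds.api.jmlc.GenerateCallback"]
```

In this pattern, the Python process registers such an object with a Py4J gateway's callback server, and every Java-side `generateBatch` call crosses the socket bridge into Python; this is exactly the coupling the HTTP approach removes.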
293+
The current approach (this PR) replaces the Py4J callback with a native
DML built-in (`llmPredict`) that issues HTTP requests to any
OpenAI-compatible server:

```
DML script: llmPredict(prompts, url=..., model=...)
  -> LlmPredictCPInstruction (Java HTTP client)
  -> Any OpenAI-compatible server (vLLM, llm_server.py, etc.)
```
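The flow above can be sketched end to end in Python: a stub standing in for any OpenAI-compatible server, and a client mirroring what `LlmPredictCPInstruction` does in Java (build a JSON payload, POST it to `/v1/completions`, parse the first choice). The request and response field names follow the OpenAI completions schema; the function and class names here are illustrative, not SystemDS APIs.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class StubCompletionsServer(BaseHTTPRequestHandler):
    """Minimal stand-in for an OpenAI-compatible server (vLLM, TGI, etc.)."""
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        # Echo the prompt back in the standard completions response shape.
        reply = json.dumps({"choices": [{"text": "echo: " + body["prompt"]}]})
        data = reply.encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)
    def log_message(self, *args):
        pass  # silence per-request logging

def llm_predict(prompt, url, model):
    """Client-side shape of the Java instruction: build JSON, POST to
    /v1/completions, extract choices[0].text from the response."""
    payload = json.dumps({"model": model, "prompt": prompt}).encode()
    req = Request(url + "/v1/completions", data=payload,
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

server = HTTPServer(("127.0.0.1", 0), StubCompletionsServer)
threading.Thread(target=server.serve_forever, daemon=True).start()
out = llm_predict("hello", f"http://127.0.0.1:{server.server_address[1]}",
                  "stub-model")
server.shutdown()
print(out)  # echo: hello
```

Because the client only depends on the `/v1/completions` contract, swapping the stub for vLLM or any other backend is purely a URL change.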
303+
Benefits of the current approach:
- **Decoupled:** The inference server is independent; swap vLLM for TGI,
  Ollama, or any OpenAI-compatible endpoint without changing DML scripts or
  Java code.
- **Standard protocol:** Uses the `/v1/completions` API, making benchmarks
  directly comparable across backends.
- **Server-side optimization:** vLLM handles batching, KV cache management,
  PagedAttention, and speculative decoding transparently.
- **Simpler Java code:** `LlmPredictCPInstruction` is a single 216-line class
  that builds JSON, sends HTTP, and parses the response, with no process
  management.
313+
## Benchmark Results

### Evaluation Methodology