Compare Large Language Models (LLMs) and Small Language Models (SLMs) for local vs cloud inference scenarios. Learn deployment patterns leveraging ONNX Runtime acceleration, WebGPU execution, and hybrid RAG experiences. Includes a Chainlit RAG demo with a local model plus an optional OpenWebUI exploration. You will adapt a WebGPU inference starter and evaluate Phi vs GPT-OSS-20B capability & cost/perf trade-offs.
- Contrast SLM vs LLM on latency, memory, quality axes
- Deploy models with ONNXRuntime and (where supported) WebGPU
- Run browser-based inference (privacy-preserving interactive demo)
- Integrate a Chainlit RAG pipeline with a local SLM backend
- Evaluate using lightweight quality + cost heuristics
- Sessions 1–3 completed
- `chainlit` installed (already in `requirements.txt` for Module 08)
- WebGPU-capable browser (latest Edge / Chrome on Windows 11)
- Foundry Local running (`foundry service status`)
Windows remains the primary target environment. For macOS developers awaiting native binaries:
- Run Foundry Local in a Windows 11 VM (Parallels / UTM) OR a remote Windows workstation.
- Expose the service (default port 5273) and set on macOS: `export FOUNDRY_LOCAL_ENDPOINT=http://<windows-host>:5273/v1`
- Use the same Python virtual environment steps as prior sessions.
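A minimal connectivity check from macOS, assuming the Windows host exposes the OpenAI-compatible endpoint at the address above (the `api_key` value is a placeholder; the local service ignores it):

```python
# Sketch: verify the remote Foundry Local endpoint responds from macOS.
import os
from openai import OpenAI

base_url = os.environ.get("FOUNDRY_LOCAL_ENDPOINT", "http://localhost:5273/v1")
client = OpenAI(base_url=base_url, api_key="not-needed")  # key is ignored locally
resp = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=10,
)
print(resp.choices[0].message.content)
```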
Chainlit install (both platforms):

```bash
pip install chainlit
```

Pull and run both models, then probe them:

```bash
foundry model run phi-4-mini
foundry model run gpt-oss-20b

# Quick capability probes (one-shot non-interactive)
foundry model run phi-4-mini --prompt "Summarize retrieval augmented generation in 2 sentences."
foundry model run gpt-oss-20b --prompt "Summarize retrieval augmented generation in 2 sentences."

# Basic token / latency test (repeat a few times for intuition)
foundry model run phi-4-mini --prompt "List 5 creative IoT edge AI ideas."
foundry model run gpt-oss-20b --prompt "List 5 creative IoT edge AI ideas."
```

Track: response depth, factual accuracy, stylistic richness, latency.
Observe how throughput changes with GPU acceleration enabled versus CPU-only execution.
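One way to build intuition is a rough tokens-per-second probe via the OpenAI-compatible endpoint. This sketch assumes the local service returns OpenAI-style `usage` fields (it may not, hence the guard); run it once per execution mode and compare:

```python
# Sketch: rough tokens/sec probe; repeat with GPU enabled vs CPU-only.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")
t0 = time.time()
resp = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "List 5 creative IoT edge AI ideas."}],
    max_tokens=200,
)
elapsed = time.time() - t0
tokens = resp.usage.completion_tokens if resp.usage else None  # usage may be absent
rate = f"~{tokens / elapsed:.1f} tok/s" if tokens else "usage not reported"
print(f"{elapsed:.2f}s elapsed; {rate}")
```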
Adapt the starter `04-webgpu-inference` (create `samples/04-cutting-edge/webgpu_demo.html`):

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <title>Foundry Local WebGPU Demo</title>
  <style>
    body{font-family:system-ui;margin:2rem;max-width:820px;}
    textarea{width:100%;height:120px;}
    pre{background:#111;color:#eee;padding:1rem;}
    .resp{white-space:pre-wrap;margin-top:1rem;border:1px solid #444;padding:1rem;border-radius:6px;}
  </style>
</head>
<body>
  <h1>WebGPU Inference (Experimental)</h1>
  <p>Demonstration placeholder for a WebGPU-backed transformer (concept). Replace with an actual JS runtime once exposed by Foundry Local or associated runtime libs.</p>
  <textarea id="prompt" placeholder="Enter your prompt..."></textarea>
  <button id="run">Generate</button>
  <div id="out" class="resp"></div>
  <script>
    document.getElementById('run').onclick = async () => {
      const p = document.getElementById('prompt').value.trim();
      if (!p) return;
      document.getElementById('out').textContent = 'Running (simulated)...';
      // Placeholder: in a real implementation you'd call into a WASM/WebGPU pipeline or a local gateway endpoint.
      const resp = await fetch('http://localhost:5273/v1/chat/completions', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          model: 'phi-4-mini',
          messages: [{ role: 'user', content: p }],
          max_tokens: 200,
          temperature: 0.5
        })
      }).then(r => r.json()).catch(e => ({ error: e.toString() }));
      if (resp.error) {
        document.getElementById('out').textContent = 'Error: ' + resp.error;
      } else {
        document.getElementById('out').textContent = resp.choices?.[0]?.message?.content || JSON.stringify(resp, null, 2);
      }
    };
  </script>
</body>
</html>
```

Open the file in a browser and observe the low-latency local roundtrip.
Minimal `samples/04-cutting-edge/chainlit_app.py`:

```python
#!/usr/bin/env python3
"""Chainlit RAG demo using Foundry Local SLM as backend."""
import chainlit as cl
from openai import OpenAI

DOCS = [
    "Foundry Local enables local model execution with OpenAI-compatible APIs.",
    "RAG combines retrieval and generation for grounded answers.",
    "SLMs provide efficiency advantages on constrained hardware."
]

client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")

def build_context(query: str):
    # Naive lexical scoring: rank docs by query-word overlap
    scored = sorted(DOCS, key=lambda d: sum(w in d.lower() for w in query.lower().split()), reverse=True)
    return "\n".join(scored[:2])

@cl.on_message
async def main(message: cl.Message):
    ctx = build_context(message.content)
    resp = client.chat.completions.create(
        model="phi-4-mini",
        messages=[
            {"role": "system", "content": "Answer using ONLY the provided context. If insufficient, say so."},
            {"role": "user", "content": f"Context:\n{ctx}\n\nQuestion: {message.content}"}
        ],
        max_tokens=300,
        temperature=0.3
    )
    await cl.Message(content=resp.choices[0].message.content).send()
```

Run:

```bash
chainlit run samples/04-cutting-edge/chainlit_app.py -w
```

Deliverables:
- Replace the placeholder fetch logic with streaming tokens (use the `stream=True` endpoint variant once enabled; see the sketch below)
- Add a client-side latency chart for phi-4-mini vs gpt-oss-20b toggles
- Embed RAG context inline (textarea for reference docs)
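A hedged sketch of the streaming deliverable using the OpenAI Python client, assuming Foundry Local enables the `stream=True` variant (not guaranteed yet):

```python
# Sketch: accumulate streamed tokens, assuming stream=True is supported locally.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")
stream = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain RAG in one paragraph."}],
    max_tokens=150,
    stream=True,
)
for chunk in stream:
    if chunk.choices:  # final chunks can arrive without choices
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```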
| Category | Phi-4-mini | GPT-OSS-20B | Observation |
|---|---|---|---|
| Latency (cold) | Fast | Slower | SLM warms quickly |
| Memory | Low | High | Device feasibility |
| Context adherence | Good | Strong | Larger model may be more verbose |
| Cost (local) | Minimal | Higher (resource) | Energy/time trade-off |
| Best use case | Edge apps | Deep reasoning | Hybrid pipeline possible |
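To put rough numbers behind the cost row (and the Cost Emulation enhancement later in this session), one option is to map measured token counts onto hypothetical cloud prices. All figures below are invented for illustration:

```python
# Sketch: emulate cloud cost from local token counts. Prices are made up.
PRICE_PER_1K = {"slm-tier": 0.0002, "llm-tier": 0.0060}  # USD per 1K tokens, illustrative

def emulated_cost(prompt_tokens: int, completion_tokens: int, tier: str) -> float:
    """Hypothetical spend if the same traffic ran against a cloud tier."""
    return (prompt_tokens + completion_tokens) / 1000 * PRICE_PER_1K[tier]

print(emulated_cost(350, 200, "slm-tier"))   # e.g. phi-4-mini-sized workload
print(emulated_cost(350, 200, "llm-tier"))   # e.g. gpt-oss-20b-sized workload
```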
```bash
# List catalog (no --running flag; loaded models are those you have previously run)
foundry model list

# For runtime metrics use the Python benchmark script (Session 3) and OS tools
# (Task Manager / nvidia-smi) instead of 'model stats'. Example:
#   cd Workshop/samples
#   set BENCH_MODELS=phi-4-mini,gpt-oss-20b
#   python -m session03.benchmark_oss_models
```

| Symptom | Cause | Action |
|---|---|---|
| Web page fetch fails | CORS or service down | Use curl to verify endpoint; enable CORS proxy if needed |
| Chainlit blank | Env not active | Activate venv & reinstall deps |
| High latency | Model just loaded | Warm with small prompt sequence |
- Foundry Local SDK: https://github.com/microsoft/Foundry-Local/tree/main/sdk/python
- Chainlit Docs: https://docs.chainlit.io
- RAG Evaluation (Ragas): https://docs.ragas.io
Session Duration: 30 min
Difficulty: Advanced
| Workshop Artifact | Scenario | Objective | Data / Prompt Source |
|---|---|---|---|
| `samples/session04/model_compare.py` / `notebooks/session04_model_compare.ipynb` | Architecture team evaluating SLM vs LLM for executive summary generator | Quantify latency + token usage delta | Single `COMPARE_PROMPT` env var |
| `chainlit_app.py` (RAG demo) | Internal knowledge assistant prototype | Ground short answers with minimal lexical retrieval | Inline `DOCS` list in file |
| `webgpu_demo.html` | Futuristic on-device browser inference preview | Show low-latency local roundtrip + UX narrative | Live user prompt only |
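For orientation, here is a hedged sketch of what `model_compare.py` might contain; beyond the `COMPARE_PROMPT` env var named above, the structure and the fallback prompt are assumptions, not the actual script:

```python
# Sketch: time one prompt across both models and report latency + token usage.
import os
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")
prompt = os.environ.get("COMPARE_PROMPT", "Summarize retrieval augmented generation in 2 sentences.")

for model in ("phi-4-mini", "gpt-oss-20b"):
    t0 = time.time()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    dt = time.time() - t0
    usage = resp.usage  # may be None depending on the local service
    print(model, f"{dt:.2f}s", f"completion_tokens={usage.completion_tokens if usage else 'n/a'}")
```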
The product org wants an executive briefing generator. A lightweight SLM (phi‑4‑mini) drafts summaries; a larger LLM (gpt‑oss‑20b) may refine only high‑priority reports. Session scripts capture empirical latency & token metrics to justify a hybrid design, while the Chainlit demo illustrates how grounded retrieval keeps small model answers factual. The WebGPU concept page provides a vision path for fully client‑side processing when browser acceleration matures.
```python
# Hybrid pipeline sketch: SLM drafts; escalate longer drafts to the larger model
draft, _ = chat_once('phi-4-mini', messages=[{"role": "user", "content": prompt}], max_tokens=280)
if len(draft) < 600:  # heuristic: escalate only for longer briefs or flagged topics
    final = draft
else:
    final, _ = chat_once('gpt-oss-20b', messages=[{"role": "user", "content": f"Refine and polish:\n{draft}"}], max_tokens=220)
```

Track both latency components to report a blended average cost.
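One way to compute that blended figure; the latencies and escalation rate below are illustrative placeholders, not measurements:

```python
# Sketch: blend per-path latency by escalation rate. Numbers are placeholders.
slm_ms, llm_ms = 900.0, 4200.0   # measured average latency of each stage
escalation_rate = 0.25           # fraction of drafts refined by gpt-oss-20b
blended_ms = (1 - escalation_rate) * slm_ms + escalation_rate * (slm_ms + llm_ms)
print(f"blended avg: {blended_ms:.0f} ms")
```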
| Focus | Enhancement | Why | Implementation Hint |
|---|---|---|---|
| Comparative Metrics | Track token usage + first-token latency | Holistic perf view | Use enhanced benchmark script (Session 3) with BENCH_STREAM=1 |
| Hybrid Pipeline | SLM draft → LLM refine | Reduce latency & cost | Generate with phi-4-mini, refine summary w/ gpt-oss-20b |
| Streaming UI | Better UX in Chainlit | Incremental feedback | Use stream=True once local streaming is exposed; accumulate chunks |
| WebGPU Caching | Faster JS init | Reduce recompile overhead | Cache compiled shader artifacts (future runtime capability) |
| Deterministic QA Set | Fair model comparison | Remove variance | Fixed prompt list + temperature=0 for evaluation runs |
| Output Scoring | Structured quality lens | Move beyond anecdotes | Simple rubric: coherence / factuality / brevity (1–5) |
| Energy / Resource Notes | Classroom discussion | Show trade-offs | Use OS monitors (Task Manager, nvidia-smi) + benchmark script outputs |
| Cost Emulation | Pre-cloud justification | Plan scaling | Map tokens to hypothetical cloud pricing for TCO narrative |
| Latency Decomposition | Identify bottlenecks | Target optimizations | Measure prompt prep, request send, first token, full completion |
| RAG + LLM Fallback | Quality safety net | Improve difficult queries | If SLM answer length < threshold or low confidence → escalate |
```python
# Hybrid pipeline hint: draft with the SLM, then refine with the LLM
draft, _ = chat_once('phi-4-mini', messages=[{"role": "user", "content": task}], max_tokens=300, temperature=0.4)
refine, _ = chat_once('gpt-oss-20b', messages=[{"role": "user", "content": f"Improve clarity but keep facts:\n{draft}"}], max_tokens=220, temperature=0.3)
```

```python
# Latency decomposition hint: time prompt prep and full generation separately
import time

t0 = time.time()
# build messages here
prep_ms = (time.time() - t0) * 1000

t1 = time.time()
text, _ = chat_once(alias, messages=msgs, max_tokens=180)
full_ms = (time.time() - t1) * 1000
print({"prep_ms": prep_ms, "full_gen_ms": full_ms})
```

Use consistent measurement scaffolding across models for fair comparisons.
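To move the Output Scoring enhancement beyond anecdotes, a small rubric scorer might look like the following; the axis names follow the enhancement table, and the ratings shown are made-up examples:

```python
# Sketch: average manual 1-5 ratings per rubric axis for each model.
from statistics import mean

AXES = ("coherence", "factuality", "brevity")

def rubric_score(ratings: dict[str, int]) -> float:
    """Average the 1-5 ratings across the rubric axes."""
    return mean(ratings[a] for a in AXES)

scores = {  # example ratings only; fill in from your own evaluation runs
    "phi-4-mini": {"coherence": 4, "factuality": 4, "brevity": 5},
    "gpt-oss-20b": {"coherence": 5, "factuality": 5, "brevity": 3},
}
for model, ratings in scores.items():
    print(model, round(rubric_score(ratings), 2))
```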