Hands-On Learning Path for Building Production-Ready Edge AI Applications
Master local AI deployment with Microsoft Foundry Local, from first chat completion to multi-agent orchestration in 6 progressive sessions.
Welcome to the EdgeAI for Beginners Workshop - your practical, hands-on guide to building intelligent applications that run entirely on local hardware. This workshop transforms theoretical Edge AI concepts into real-world skills through progressively challenging exercises using Microsoft Foundry Local and Small Language Models (SLMs).
The Edge AI Revolution is Here
Organizations worldwide are shifting from cloud-dependent AI to edge computing for three critical reasons:
- Privacy & Compliance - Process sensitive data locally without cloud transmission (HIPAA, GDPR, financial regulations)
- Performance - Eliminate network latency (50-500ms local vs 500-2000ms cloud round-trip)
- Cost Control - Remove per-token API costs and scale without cloud expenses
But Edge AI is Different
Running AI on-premises requires new skills:
- Model selection and optimization for resource constraints
- Local service management and hardware acceleration
- Prompt engineering for smaller models
- Production deployment patterns for edge devices
This Workshop Delivers Those Skills
In 6 focused sessions (~3 hours total), you'll progress from "Hello World" to deploying production-ready multi-agent systems - all running locally on your machine.
By completing this workshop, you will be able to:
**Deploy and Manage Local AI Services**
- Install and configure Microsoft Foundry Local
- Select appropriate models for edge deployment
- Manage model lifecycle (download, load, cache)
- Monitor resource usage and optimize performance
**Build AI-Powered Applications**
- Implement OpenAI-compatible chat completions locally
- Design effective prompts for Small Language Models
- Handle streaming responses for better UX
- Integrate local models into existing applications
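As a sketch of what "OpenAI-compatible" means in practice, the following stdlib-only Python builds and posts a chat completion request to a local endpoint. The endpoint URL and model alias are assumptions, and the helper names are illustrative, not the workshop's actual API:

```python
import json
import urllib.request

def build_chat_request(model, prompt, stream=False):
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def chat(endpoint, model, prompt):
    """POST the payload to a local OpenAI-compatible server and return the reply.
    Requires a running Foundry Local service at `endpoint`."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{endpoint}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]

# Example (requires the service running locally):
# print(chat("http://127.0.0.1:58123/v1", "phi-4-mini", "What is edge AI?"))
```

The same payload shape works with the official `openai` Python client by pointing `base_url` at the local endpoint.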
**Create RAG (Retrieval Augmented Generation) Systems**
- Build semantic search with embeddings
- Ground LLM responses in domain-specific knowledge
- Evaluate RAG quality with industry-standard metrics
- Scale from prototype to production
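To make the embedding-and-search idea concrete, here is a minimal, dependency-free sketch of semantic retrieval. The bag-of-words `embed()` is a deliberate stand-in for a real embedding model such as `all-MiniLM-L6-v2`; the documents are made up:

```python
import math
import re
from collections import Counter

def embed(text):
    # Placeholder "embedding": bag-of-words term counts. A real pipeline
    # would call an embedding model instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Edge AI runs models on local hardware.",
    "RAG grounds answers in retrieved documents.",
    "Bananas are rich in potassium.",
]
top = retrieve("How does RAG ground answers?", docs, k=1)
```

The retrieved passages would then be prepended to the prompt so the model answers from grounded context rather than memory alone.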
**Optimize Model Performance**
- Benchmark multiple models for your use case
- Measure latency, throughput, and first-token time
- Select optimal models based on speed/quality tradeoffs
- Compare SLM vs LLM trade-offs in real scenarios
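The latency, throughput, and first-token metrics above can be captured with a small timing harness like this sketch; the `fake_stream` generator stands in for a real streaming completion:

```python
import time

def measure_stream(generate):
    """Time a streaming generation: first-token latency, total latency,
    and tokens/sec. `generate` yields tokens (stub or real client stream)."""
    start = time.perf_counter()
    first_token = None
    tokens = 0
    for _ in generate():
        tokens += 1
        if first_token is None:
            first_token = time.perf_counter() - start
    total = time.perf_counter() - start
    return {
        "first_token_s": first_token,
        "total_s": total,
        "tokens_per_s": tokens / total if total > 0 else 0.0,
        "tokens": tokens,
    }

# Stub generator standing in for a real streaming completion:
def fake_stream():
    for tok in ["edge", "ai", "rocks"]:
        time.sleep(0.01)
        yield tok

metrics = measure_stream(fake_stream)
```

Running the harness several times per model and comparing medians (and p95) is how the session 3 benchmark script approaches model comparison.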
**Orchestrate Multi-Agent Systems**
- Design specialized agents for different tasks
- Implement agent memory and context management
- Coordinate agents in complex workflows
- Route requests intelligently across multiple models
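A minimal illustration of specialized agents with rolling memory, using a stubbed LLM callable in place of a real local chat client. All names here are illustrative, not the workshop's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    system_prompt: str
    memory: list = field(default_factory=list)

    def act(self, task, llm):
        # `llm` is any callable (system, user) -> str; stubbed below.
        # A real version would call the local chat endpoint.
        context = " ".join(self.memory[-3:])  # short rolling memory window
        reply = llm(self.system_prompt, f"{context}\n{task}".strip())
        self.memory.append(reply)
        return reply

def stub_llm(system, user):
    # Stand-in for a chat completion call.
    return f"[{system}] {user[:40]}"

researcher = Agent("researcher", "Research the topic")
editor = Agent("editor", "Polish the draft")

# Coordinate: researcher drafts, editor refines the draft.
draft = researcher.act("Why does edge AI matter?", stub_llm)
final = editor.act(draft, stub_llm)
```

Swapping `stub_llm` for two different model-backed callables is one way to give each agent role its own model, as session 5 does.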
**Deploy Production-Ready Solutions**
- Implement error handling and retry logic
- Monitor token usage and system resources
- Build scalable architectures with model-as-tools patterns
- Plan migration paths from edge to hybrid (edge + cloud)
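Retry logic for transient failures (for example, a model still loading) can be sketched as follows; `with_retry` and the flaky stub are illustrative, not the samples' actual helpers:

```python
import time

def with_retry(fn, attempts=3, backoff_s=0.1):
    """Call fn(); on failure, retry with exponential backoff."""
    delay = backoff_s
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the error
            time.sleep(delay)
            delay *= 2

# Stub that fails twice before succeeding, simulating a cold start:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("model still loading")
    return "ok"

result = with_retry(flaky)
```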
By the end of this workshop, you will have created:
| Session | Deliverable | Skills Demonstrated |
|---|---|---|
| 1 | Chat application with streaming | Service setup, basic completions, streaming UX |
| 2 | RAG system with evaluation | Embeddings, semantic search, quality metrics |
| 3 | Multi-model benchmark suite | Performance measurement, model comparison |
| 4 | SLM vs LLM comparator | Trade-off analysis, optimization strategies |
| 5 | Multi-agent orchestrator | Agent design, memory management, coordination |
| 6 | Intelligent routing system | Intent detection, model selection, scalability |
| Skill Level | Sessions 1-2 | Sessions 3-4 | Sessions 5-6 |
|---|---|---|---|
| Beginner | ✅ Setup & basics | ❌ Too advanced | |
| Intermediate | ✅ Quick review | ✅ Core learning | |
| Advanced | ✅ Breeze through | ✅ Refinement | ✅ Production patterns |
After this workshop, you'll be prepared to:
✅ Build Privacy-First Applications
- Healthcare apps handling PHI/PII locally
- Financial services with compliance requirements
- Government systems with data sovereignty needs
✅ Optimize for Edge Environments
- IoT devices with limited resources
- Offline-first mobile applications
- Low-latency real-time systems
✅ Design Intelligent Architectures
- Multi-agent systems for complex workflows
- Hybrid edge-cloud deployments
- Cost-optimized AI infrastructure
✅ Lead Edge AI Initiatives
- Evaluate Edge AI feasibility for projects
- Select appropriate models and frameworks
- Architect scalable local AI solutions
| Session | Topic | Focus | Duration |
|---|---|---|---|
| 1 | Getting Started with Foundry Local | Install, validate, first completions | 30 min |
| 2 | Building AI Solutions with RAG | Prompt engineering, embeddings, evaluation | 30 min |
| 3 | Open Source Models | Model discovery, benchmarking, selection | 30 min |
| 4 | Cutting Edge Models | SLM vs LLM, optimization, frameworks | 30 min |
| 5 | AI-Powered Agents | Agent design, orchestration, memory | 30 min |
| 6 | Models as Tools | Routing, chaining, scaling strategies | 30 min |
System Requirements:
- OS: Windows 10/11, macOS 11+, or Linux (Ubuntu 20.04+)
- RAM: 8GB minimum, 16GB+ recommended
- Storage: 10GB+ free space for models
- CPU: Modern processor with AVX2 support
- GPU (optional): CUDA-compatible or Qualcomm NPU for acceleration
Software Requirements:
- Python 3.8+
- Microsoft Foundry Local
- Git
- Visual Studio Code (recommended)
Windows:

```bash
winget install Microsoft.FoundryLocal
```

macOS:

```bash
brew tap microsoft/foundrylocal
brew install foundrylocal
```

Verify Installation:

```bash
foundry --version
foundry service status
```

Ensure Azure AI Foundry Local is running with a fixed port:

```bash
# Set Foundry Local to use port 58123 (default)
foundry service set --port 58123 --show

# Or use a different port
foundry service set --port 58000 --show
```

Verify it's working:

```bash
# Check service status
foundry service status

# Test the endpoint
curl http://127.0.0.1:58123/v1/models
```

Finding Available Models: to see which models are available in your Foundry Local instance, query the models endpoint:

```bash
# cmd/bash/powershell
foundry model list
```

Using the Web Endpoint:

```bash
# Windows PowerShell
powershell -Command "Invoke-RestMethod -Uri 'http://127.0.0.1:58123/v1/models' -Method Get"

# Or using curl (if available)
curl http://127.0.0.1:58123/v1/models
```

```bash
# Clone repository
git clone https://github.com/microsoft/edgeai-for-beginners.git
cd edgeai-for-beginners/Workshop

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# Windows:
.\.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Start Foundry Local and load a model
foundry model run phi-4-mini

# Run the chat bootstrap sample
cd samples
python -m session01.chat_bootstrap "What is edge AI?"
```

✅ Success! You should see a streaming response about edge AI.
Progressive hands-on examples demonstrating each concept:
| Session | Sample | Description | Run Time |
|---|---|---|---|
| 1 | `chat_bootstrap.py` | Basic & streaming chat | ~30s |
| 2 | `rag_pipeline.py` | RAG with embeddings | ~45s |
| 2 | `rag_eval_ragas.py` | RAG quality evaluation | ~60s |
| 3 | `benchmark_oss_models.py` | Multi-model benchmarking | ~2-3m |
| 4 | `model_compare.py` | SLM vs LLM comparison | ~45s |
| 5 | `agents_orchestrator.py` | Multi-agent system | ~60s |
| 6 | `models_router.py` | Intent-based routing | ~45s |
| 6 | `models_pipeline.py` | Multi-step pipeline | ~60s |
Interactive exploration with explanations and visualizations:
| Session | Notebook | Description | Difficulty |
|---|---|---|---|
| 1 | `session01_chat_bootstrap.ipynb` | Chat basics & streaming | ⭐ Beginner |
| 2 | `session02_rag_pipeline.ipynb` | Build RAG system | ⭐⭐ Intermediate |
| 2 | `session02_rag_eval_ragas.ipynb` | Evaluate RAG quality | ⭐⭐ Intermediate |
| 3 | `session03_benchmark_oss_models.ipynb` | Model benchmarking | ⭐⭐ Intermediate |
| 4 | `session04_model_compare.ipynb` | Model comparison | ⭐⭐ Intermediate |
| 5 | `session05_agents_orchestrator.ipynb` | Agent orchestration | ⭐⭐⭐ Advanced |
| 6 | `session06_models_router.ipynb` | Intent routing | ⭐⭐⭐ Advanced |
| 6 | `session06_models_pipeline.ipynb` | Pipeline orchestration | ⭐⭐⭐ Advanced |
Comprehensive guides and references:
| Document | Description | Use When |
|---|---|---|
| QUICK_START.md | Fast-track setup guide | Starting from scratch |
| QUICK_REFERENCE.md | Command & API cheat sheet | Need quick answers |
| FOUNDRY_SDK_QUICKREF.md | SDK patterns & examples | Writing code |
| ENV_CONFIGURATION.md | Environment variable guide | Configuring samples |
| notebooks/TROUBLESHOOTING.md | Common issues & fixes | Debugging problems |
Beginner Path:
- ✅ Session 1: Getting Started (focus on setup and basic chat)
- ✅ Session 2: RAG Basics (skip evaluation initially)
- ✅ Session 3: Simple Benchmarking (2 models only)
- ⏭️ Skip Sessions 4-6 for now
- 🔄 Return to Sessions 4-6 after building first application
Intermediate Path:
- ⚡ Session 1: Quick setup validation
- ✅ Session 2: Complete RAG pipeline with evaluation
- ✅ Session 3: Full benchmarking suite
- ✅ Session 4: Model optimization
- ✅ Sessions 5-6: Focus on architecture patterns
Advanced Path:
- ⚡ Sessions 1-3: Quick review and validation
- ✅ Session 4: Optimization deep-dive
- ✅ Session 5: Multi-agent architecture
- ✅ Session 6: Production patterns and scaling
- 🚀 Extend: Build custom routing logic and hybrid deployments
If you're following the condensed 6-session workshop format, use these dedicated guides (each maps to and complements the broader module docs above):
| Workshop Session | Guide | Core Focus |
|---|---|---|
| 1 | Session01-GettingStartedFoundryLocal | Install, validate, run phi & GPT-OSS-20B, acceleration |
| 2 | Session02-BuildAISolutionsRAG | Prompt engineering, RAG patterns, CSV & document grounding, migration |
| 3 | Session03-OpenSourceModels | Hugging Face integration, benchmarking, model selection |
| 4 | Session04-CuttingEdgeModels | SLM vs LLM, WebGPU, Chainlit RAG, ONNX acceleration |
| 5 | Session05-AIPoweredAgents | Agent roles, memory, tools, orchestration |
| 6 | Session06-ModelsAsTools | Routing, chaining, scaling path to Azure |
Each session file includes: abstract, learning objectives, 30‑minute demo flow, starter project, validation checklist, troubleshooting, and references to the official Foundry Local Python SDK.
Install workshop dependencies (Windows):

```bash
cd Workshop
py -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
```

macOS / Linux:

```bash
cd Workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

If running the Foundry Local service on a different (Windows) machine or VM from macOS, export the endpoint:

```bash
export FOUNDRY_LOCAL_ENDPOINT=http://<windows-host>:5273/v1
```

| Session | Script(s) | Description |
|---|---|---|
| 1 | `samples/session01/chat_bootstrap.py` | Bootstrap service & streaming chat |
| 2 | `samples/session02/rag_pipeline.py` | Minimal RAG (in-memory embeddings) |
| 2 | `samples/session02/rag_eval_ragas.py` | RAG evaluation with ragas metrics |
| 3 | `samples/session03/benchmark_oss_models.py` | Multi-model latency & throughput benchmarking |
| 4 | `samples/session04/model_compare.py` | SLM vs LLM comparison (latency & sample output) |
| 5 | `samples/session05/agents_orchestrator.py` | Two-agent research → editorial pipeline |
| 6 | `samples/session06/models_router.py` | Intent-based routing demo |
| 6 | `samples/session06/models_pipeline.py` | Multi-step plan/execute/refine chain |
| Variable | Purpose | Example |
|---|---|---|
| `FOUNDRY_LOCAL_ALIAS` | Default single model alias for basic samples | `phi-4-mini` |
| `SLM_ALIAS` / `LLM_ALIAS` | Explicit SLM vs larger model for comparison | `phi-4-mini` / `gpt-oss-20b` |
| `BENCH_MODELS` | Comma-separated list of aliases to benchmark | `qwen2.5-0.5b,mistral-7b` |
| `BENCH_ROUNDS` | Benchmark repetitions per model | `3` |
| `BENCH_PROMPT` | Prompt used in benchmarking | `Explain retrieval augmented generation briefly.` |
| `EMBED_MODEL` | Sentence-transformers embedding model | `sentence-transformers/all-MiniLM-L6-v2` |
| `RAG_QUESTION` | Override test query for RAG pipeline | `Why use RAG with local inference?` |
| `AGENT_QUESTION` | Override agents pipeline query | `Explain why edge AI matters for compliance.` |
| `AGENT_MODEL_PRIMARY` | Model alias for research agent | `phi-4-mini` |
| `AGENT_MODEL_EDITOR` | Model alias for editor agent (can differ) | `gpt-oss-20b` |
| `SHOW_USAGE` | When `1`, prints token usage per completion | `1` |
| `RETRY_ON_FAIL` | When `1`, retry once on transient chat errors | `1` |
| `RETRY_BACKOFF` | Seconds to wait before retry | `1.0` |
If a variable isn’t set, scripts fall back to sensible defaults. For single‑model demos you typically only need FOUNDRY_LOCAL_ALIAS.
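The fallback behavior can be sketched in a few lines; the helper name `env()` is illustrative, not the samples' actual code:

```python
import os

def env(name, default):
    """Read a workshop setting, falling back to a sensible default."""
    return os.environ.get(name, default)

# Typed conversions happen at the call site:
alias = env("FOUNDRY_LOCAL_ALIAS", "phi-4-mini")
rounds = int(env("BENCH_ROUNDS", "3"))
show_usage = env("SHOW_USAGE", "0") == "1"
```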
All samples now share a helper, `samples/workshop_utils.py`, providing:

- Cached `FoundryLocalManager` + OpenAI client creation
- A `chat_once()` helper with optional retry + usage printing
- Simple token usage reporting (enable via `SHOW_USAGE=1`)
This reduces duplication and highlights best practices for efficient local model orchestration.
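The caching idea behind that helper can be illustrated with `functools.lru_cache`; this is a stand-in pattern (with a stub client factory), not the actual contents of `workshop_utils.py`:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_client(alias):
    """Create (and cache) a client for a model alias so repeated calls
    reuse one connection. The real helper wraps FoundryLocalManager and
    an OpenAI client; a stub dict stands in here."""
    return {"alias": alias, "connection": object()}

def chat_once(alias, prompt, llm=None):
    # Reuses the cached client; `llm` stands in for
    # client.chat.completions.create(...).
    client = get_client(alias)
    llm = llm or (lambda p: f"({client['alias']}) {p}")
    return llm(prompt)

a = get_client("phi-4-mini")
b = get_client("phi-4-mini")  # same cached object, no re-initialization
```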
| Theme | Enhancement | Sessions | Env / Toggle |
|---|---|---|---|
| Determinism | Fixed temperature + stable prompt sets | 1–6 | Set temperature=0, top_p=1 |
| Token Usage Visibility | Consistent cost/efficiency teaching | 1–6 | SHOW_USAGE=1 |
| Streaming First Token | Perceived latency metric | 1,3,4,6 | BENCH_STREAM=1 (benchmark) |
| Retry Resilience | Handles transient cold-start | All | RETRY_ON_FAIL=1 + RETRY_BACKOFF |
| Multi-Model Agents | Heterogeneous role specialization | 5 | AGENT_MODEL_PRIMARY, AGENT_MODEL_EDITOR |
| Adaptive Routing | Intent + cost heuristics | 6 | Extend router with escalation logic |
| Vector Memory | Long-term semantic recall | 2,5,6 | Integrate FAISS/Chroma embedding index |
| Trace Export | Auditing & evaluation | 2,5,6 | Append JSON lines per step |
| Quality Rubrics | Qualitative tracking | 3–6 | Secondary scoring prompts |
| Smoke Tests | Quick pre-workshop validation | All | python Workshop/tests/smoke.py |
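The Adaptive Routing row above can be sketched as a keyword-based intent router; the intents, keywords, and routing table below are illustrative, not the session 6 sample's actual logic:

```python
def detect_intent(prompt):
    # Crude keyword heuristics; a real router might also weigh
    # prompt length or estimated cost before escalating.
    p = prompt.lower()
    if any(w in p for w in ("code", "function", "bug")):
        return "coding"
    if any(w in p for w in ("summarize", "tl;dr")):
        return "summarization"
    return "general"

# Hypothetical routing table: small model for cheap tasks,
# larger model for harder ones.
ROUTES = {
    "coding": "gpt-oss-20b",
    "summarization": "phi-4-mini",
    "general": "phi-4-mini",
}

def route(prompt):
    return ROUTES[detect_intent(prompt)]
```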
```cmd
set FOUNDRY_LOCAL_ALIAS=phi-4-mini
set SHOW_USAGE=1
python Workshop\tests\smoke.py
```

Expect stable token counts across repeated identical inputs.
Use rag_eval_ragas.py to compute answer relevancy, faithfulness, and context precision on a tiny synthetic dataset:
```bash
cd Workshop/samples
python -m session02.rag_eval_ragas
```

Extend by supplying a larger JSONL of questions, contexts, and ground truths, then converting to a Hugging Face Dataset.
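A sketch of that extension path: parse JSONL rows using the field names ragas commonly expects (`question`, `contexts`, `ground_truth`), then convert with `Dataset.from_list` if the `datasets` package is installed. The sample rows are made up:

```python
import json

jsonl = """\
{"question": "Why RAG?", "contexts": ["RAG grounds answers."], "ground_truth": "It grounds answers."}
{"question": "Why edge?", "contexts": ["Edge AI is local."], "ground_truth": "Data stays local."}
"""

# One JSON object per line -> list of row dicts.
rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]

# With the `datasets` package installed, the rows convert directly:
#   from datasets import Dataset
#   ds = Dataset.from_list(rows)
```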
The workshop deliberately uses only currently documented / stable Foundry Local CLI commands.
| Category | Command | Purpose |
|---|---|---|
| Core | `foundry --version` | Show installed version |
| Service | `foundry service start` | Start local service (if not auto-started) |
| Service | `foundry service status` | Show service status |
| Models | `foundry model list` | List catalog / available models |
| Models | `foundry model download <alias>` | Download model weights into cache |
| Models | `foundry model run <alias>` | Launch (load) a model locally; combine with `--prompt` for one-shot use |
| Models | `foundry model unload <alias>` / `foundry model stop <alias>` | Unload a model from memory (if supported) |
| Cache | `foundry cache list` | List cached (downloaded) models |
Instead of the deprecated `model chat` subcommand, use:

```bash
foundry model run <alias> --prompt "Your question here"
```

This executes a single prompt/response cycle, then exits.
| Deprecated / Undocumented | Replacement / Guidance |
|---|---|
| `foundry model chat <model> "..."` | `foundry model run <model> --prompt "..."` |
| `foundry model list --running` | Use plain `foundry model list` + recent activity / logs |
| `foundry model list --cached` | `foundry cache list` |
| `foundry model stats <model>` | Use the benchmark Python script + OS tools (Task Manager / `nvidia-smi`) |
| `foundry model benchmark ...` | `samples/session03/benchmark_oss_models.py` |
- Latency, p95, tokens/sec: `samples/session03/benchmark_oss_models.py`
- First-token latency (streaming): set `BENCH_STREAM=1`
- Resource usage: OS monitors (Task Manager, Activity Monitor, `nvidia-smi`)
As new CLI telemetry commands stabilize upstream, they can be incorporated with minimal edits to session markdowns.
An automated linter prevents reintroduction of deprecated CLI patterns inside fenced code blocks of markdown files:
Script: Workshop/scripts/lint_markdown_cli.py
Deprecated patterns are blocked inside code fences.
Recommended replacements:
| Deprecated | Replacement |
|---|---|
| `foundry model chat <a> "..."` | `foundry model run <a> --prompt "..."` |
| `model list --running` | `model list` |
| `model list --cached` | `cache list` |
| `model stats` | Benchmark script + system tools |
| `model benchmark` | `samples/session03/benchmark_oss_models.py` |
| `model list --available` | `model list` |
Run locally:

```bash
python Workshop/scripts/lint_markdown_cli.py --verbose
```

GitHub Action: `.github/workflows/markdown-cli-lint.yml` runs on every push & PR.
Optional pre-commit hook:

```bash
echo "python Workshop/scripts/lint_markdown_cli.py" > .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
```

| Task | CLI One-Liner | SDK (Python) Equivalent | Notes |
|---|---|---|---|
| Run a model once (prompt) | `foundry model run phi-4-mini --prompt "Hello"` | `manager = FoundryLocalManager("phi-4-mini"); client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key or "not-needed"); client.chat.completions.create(model=manager.get_model_info("phi-4-mini").id, messages=[{"role": "user", "content": "Hello"}])` | SDK bootstraps service & caching automatically |
| Download (cache) model | `foundry model download qwen2.5-0.5b` | `FoundryLocalManager("qwen2.5-0.5b")  # triggers download/load` | Manager picks the best variant if the alias maps to multiple builds |
| List catalog | `foundry model list` | Instantiate a manager per alias, or maintain a known list | CLI aggregates; SDK currently works per alias |
| List cached models | `foundry cache list` | `manager.list_cached_models()` | After manager init (any alias) |
| Get endpoint URL | (implicit) | `manager.endpoint` | Used to create an OpenAI-compatible client |
| Warm a model | `foundry model run <alias>`, then first prompt | `chat_once(alias, messages=[...])` (utility) | Utilities handle initial cold-start warmup latency |
| Measure latency | `python -m session03.benchmark_oss_models` | `import benchmark_oss_models` (or the exporter script) | Prefer the script for consistent metrics |
| Stop / unload model | `foundry model unload <alias>` | (Not exposed; restart service / process) | Typically not required for workshop flow |
| Retrieve token usage | (view output) | `resp.usage.total_tokens` | Provided if the backend returns a usage object |
Use the script Workshop/scripts/export_benchmark_markdown.py to run a fresh benchmark (same logic as samples/session03/benchmark_oss_models.py) and emit a GitHub-friendly Markdown table plus raw JSON.
```cmd
python Workshop\scripts\export_benchmark_markdown.py --models "qwen2.5-0.5b,mistral-7b" --prompt "Explain retrieval augmented generation briefly." --rounds 3 --output benchmark_report.md
```

Generated files:

| File | Contents |
|---|---|
| `benchmark_report.md` | Markdown table + interpretation hints |
| `benchmark_report.json` | Raw metrics array (for diffing / trend tracking) |
Set BENCH_STREAM=1 in the environment to include first-token latency if supported.