Hands-On Learning Path for Building Production-Ready Edge AI Applications
Master local AI deployment with Microsoft Foundry Local, from first chat completion to multi-agent orchestration in 6 progressive sessions.
Welcome to the EdgeAI for Beginners Workshop - your practical, hands-on guide to building intelligent applications that run entirely on local hardware. This workshop transforms theoretical Edge AI concepts into real-world skills through progressively challenging exercises using Microsoft Foundry Local and Small Language Models (SLMs).
The Edge AI Revolution is Here
Organizations worldwide are shifting from cloud-dependent AI to edge computing for three critical reasons:
- Privacy & Compliance - Process sensitive data locally without cloud transmission (HIPAA, GDPR, financial regulations)
- Performance - Eliminate network latency (50-500ms local vs 500-2000ms cloud round-trip)
- Cost Control - Remove per-token API costs and scale without cloud expenses
But Edge AI is Different
Running AI on-premises requires new skills:
- Model selection and optimization for resource constraints
- Local service management and hardware acceleration
- Prompt engineering for smaller models
- Production deployment patterns for edge devices
This Workshop Delivers Those Skills
In 6 focused sessions (~3 hours total), you'll progress from "Hello World" to deploying production-ready multi-agent systems - all running locally on your machine.
By completing this workshop, you will be able to:
**Deploy and Manage Local AI Services**
- Install and configure Microsoft Foundry Local
- Select appropriate models for edge deployment
- Manage model lifecycle (download, load, cache)
- Monitor resource usage and optimize performance
**Build AI-Powered Applications**
- Implement OpenAI-compatible chat completions locally
- Design effective prompts for Small Language Models
- Handle streaming responses for better UX
- Integrate local models into existing applications
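As a sketch of what "OpenAI-compatible" means in practice, the following stdlib-only Python builds and posts a chat completion request to a local endpoint. The endpoint URL and model alias are assumptions, and the helper names are illustrative, not the workshop's actual API:

```python
import json
import urllib.request

def build_chat_request(model, prompt, stream=False):
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def chat(endpoint, model, prompt):
    """POST the payload to a local OpenAI-compatible server and return the reply.
    Requires a running Foundry Local service at `endpoint`."""
    body = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{endpoint}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]

# Example (requires the service running locally):
# print(chat("http://127.0.0.1:58123/v1", "phi-4-mini", "What is edge AI?"))
```

The same payload shape works with the official `openai` Python client by pointing `base_url` at the local endpoint.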
**Create RAG (Retrieval Augmented Generation) Systems**
- Build semantic search with embeddings
- Ground LLM responses in domain-specific knowledge
- Evaluate RAG quality with industry-standard metrics
- Scale from prototype to production
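To make the embedding-and-search idea concrete, here is a minimal, dependency-free sketch of semantic retrieval. The bag-of-words `embed()` is a deliberate stand-in for a real embedding model such as `all-MiniLM-L6-v2`; the documents are made up:

```python
import math
import re
from collections import Counter

def embed(text):
    # Placeholder "embedding": bag-of-words term counts. A real pipeline
    # would call an embedding model instead.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

docs = [
    "Edge AI runs models on local hardware.",
    "RAG grounds answers in retrieved documents.",
    "Bananas are rich in potassium.",
]
top = retrieve("How does RAG ground answers?", docs, k=1)
```

The retrieved passages would then be prepended to the prompt so the model answers from grounded context rather than memory alone.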
**Optimize Model Performance**
- Benchmark multiple models for your use case
- Measure latency, throughput, and first-token time
- Select optimal models based on speed/quality tradeoffs
- Compare SLM vs LLM trade-offs in real scenarios
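The latency, throughput, and first-token metrics above can be captured with a small timing harness like this sketch; the `fake_stream` generator stands in for a real streaming completion:

```python
import time

def measure_stream(generate):
    """Time a streaming generation: first-token latency, total latency,
    and tokens/sec. `generate` yields tokens (stub or real client stream)."""
    start = time.perf_counter()
    first_token = None
    tokens = 0
    for _ in generate():
        tokens += 1
        if first_token is None:
            first_token = time.perf_counter() - start
    total = time.perf_counter() - start
    return {
        "first_token_s": first_token,
        "total_s": total,
        "tokens_per_s": tokens / total if total > 0 else 0.0,
        "tokens": tokens,
    }

# Stub generator standing in for a real streaming completion:
def fake_stream():
    for tok in ["edge", "ai", "rocks"]:
        time.sleep(0.01)
        yield tok

metrics = measure_stream(fake_stream)
```

Running the harness several times per model and comparing medians (and p95) is how the session 3 benchmark script approaches model comparison.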
**Orchestrate Multi-Agent Systems**
- Design specialized agents for different tasks
- Implement agent memory and context management
- Coordinate agents in complex workflows
- Route requests intelligently across multiple models
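A minimal illustration of specialized agents with rolling memory, using a stubbed LLM callable in place of a real local chat client. All names here are illustrative, not the workshop's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    system_prompt: str
    memory: list = field(default_factory=list)

    def act(self, task, llm):
        # `llm` is any callable (system, user) -> str; stubbed below.
        # A real version would call the local chat endpoint.
        context = " ".join(self.memory[-3:])  # short rolling memory window
        reply = llm(self.system_prompt, f"{context}\n{task}".strip())
        self.memory.append(reply)
        return reply

def stub_llm(system, user):
    # Stand-in for a chat completion call.
    return f"[{system}] {user[:40]}"

researcher = Agent("researcher", "Research the topic")
editor = Agent("editor", "Polish the draft")

# Coordinate: researcher drafts, editor refines the draft.
draft = researcher.act("Why does edge AI matter?", stub_llm)
final = editor.act(draft, stub_llm)
```

Swapping `stub_llm` for two different model-backed callables is one way to give each agent role its own model, as session 5 does.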
**Deploy Production-Ready Solutions**
- Implement error handling and retry logic
- Monitor token usage and system resources
- Build scalable architectures with model-as-tools patterns
- Plan migration paths from edge to hybrid (edge + cloud)
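Retry logic for transient failures (for example, a model still loading) can be sketched as follows; `with_retry` and the flaky stub are illustrative, not the samples' actual helpers:

```python
import time

def with_retry(fn, attempts=3, backoff_s=0.1):
    """Call fn(); on failure, retry with exponential backoff."""
    delay = backoff_s
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the error
            time.sleep(delay)
            delay *= 2

# Stub that fails twice before succeeding, simulating a cold start:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("model still loading")
    return "ok"

result = with_retry(flaky)
```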
By the end of this workshop, you will have created:
| Session | Deliverable | Skills Demonstrated |
|---|---|---|
| 1 | Chat application with streaming | Service setup, basic completions, streaming UX |
| 2 | RAG system with evaluation | Embeddings, semantic search, quality metrics |
| 3 | Multi-model benchmark suite | Performance measurement, model comparison |
| 4 | SLM vs LLM comparator | Trade-off analysis, optimization strategies |
| 5 | Multi-agent orchestrator | Agent design, memory management, coordination |
| 6 | Intelligent routing system | Intent detection, model selection, scalability |
| Skill Level | Sessions 1-2 | Sessions 3-4 | Sessions 5-6 |
|---|---|---|---|
| Beginner | ✅ Setup & basics | ❌ Too advanced | |
| Intermediate | ✅ Quick review | ✅ Core learning | |
| Advanced | ✅ Breeze through | ✅ Refinement | ✅ Production patterns |
After this workshop, you'll be prepared to:
✅ Build Privacy-First Applications
- Healthcare apps handling PHI/PII locally
- Financial services with compliance requirements
- Government systems with data sovereignty needs
✅ Optimize for Edge Environments
- IoT devices with limited resources
- Offline-first mobile applications
- Low-latency real-time systems
✅ Design Intelligent Architectures
- Multi-agent systems for complex workflows
- Hybrid edge-cloud deployments
- Cost-optimized AI infrastructure
✅ Lead Edge AI Initiatives
- Evaluate Edge AI feasibility for projects
- Select appropriate models and frameworks
- Architect scalable local AI solutions
| Session | Topic | Focus | Duration |
|---|---|---|---|
| 1 | Getting Started with Foundry Local | Install, validate, first completions | 30 min |
| 2 | Building AI Solutions with RAG | Prompt engineering, embeddings, evaluation | 30 min |
| 3 | Open Source Models | Model discovery, benchmarking, selection | 30 min |
| 4 | Cutting Edge Models | SLM vs LLM, optimization, frameworks | 30 min |
| 5 | AI-Powered Agents | Agent design, orchestration, memory | 30 min |
| 6 | Models as Tools | Routing, chaining, scaling strategies | 30 min |
System Requirements:
- OS: Windows 10/11, macOS 11+, or Linux (Ubuntu 20.04+)
- RAM: 8GB minimum, 16GB+ recommended
- Storage: 10GB+ free space for models
- CPU: Modern processor with AVX2 support
- GPU (optional): CUDA-compatible or Qualcomm NPU for acceleration
Software Requirements:
- Python 3.8+
- Microsoft Foundry Local
- Git
- Visual Studio Code (recommended)
Windows:

```bash
winget install Microsoft.FoundryLocal
```

macOS:

```bash
brew tap microsoft/foundrylocal
brew install foundrylocal
```

Verify Installation:

```bash
foundry --version
foundry service status
```

Ensure Azure AI Foundry Local is running with a fixed port:

```bash
# Set Foundry Local to use port 58123 (default)
foundry service set --port 58123 --show

# Or use a different port
foundry service set --port 58000 --show
```

Verify it's working:

```bash
# Check service status
foundry service status

# Test the endpoint
curl http://127.0.0.1:58123/v1/models
```

Finding Available Models: to see which models are available in your Foundry Local instance, query the models endpoint:

```bash
# cmd/bash/powershell
foundry model list
```

Using the Web Endpoint:

```bash
# Windows PowerShell
powershell -Command "Invoke-RestMethod -Uri 'http://127.0.0.1:58123/v1/models' -Method Get"

# Or using curl (if available)
curl http://127.0.0.1:58123/v1/models
```

```bash
# Clone repository
git clone https://github.com/microsoft/edgeai-for-beginners.git
cd edgeai-for-beginners/Workshop

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# Windows:
.\.venv\Scripts\activate
# macOS/Linux:
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# Start Foundry Local and load a model
foundry model run phi-4-mini

# Run the chat bootstrap sample
cd samples
python -m session01.chat_bootstrap "What is edge AI?"
```

✅ Success! You should see a streaming response about edge AI.
Progressive hands-on examples demonstrating each concept:
| Session | Sample | Description | Run Time |
|---|---|---|---|
| 1 | `chat_bootstrap.py` | Basic & streaming chat | ~30s |
| 2 | `rag_pipeline.py` | RAG with embeddings | ~45s |
| 2 | `rag_eval_ragas.py` | RAG quality evaluation | ~60s |
| 3 | `benchmark_oss_models.py` | Multi-model benchmarking | ~2-3m |
| 4 | `model_compare.py` | SLM vs LLM comparison | ~45s |
| 5 | `agents_orchestrator.py` | Multi-agent system | ~60s |
| 6 | `models_router.py` | Intent-based routing | ~45s |
| 6 | `models_pipeline.py` | Multi-step pipeline | ~60s |
Interactive exploration with explanations and visualizations:
| Session | Notebook | Description | Difficulty |
|---|---|---|---|
| 1 | `session01_chat_bootstrap.ipynb` | Chat basics & streaming | ⭐ Beginner |
| 2 | `session02_rag_pipeline.ipynb` | Build RAG system | ⭐⭐ Intermediate |
| 2 | `session02_rag_eval_ragas.ipynb` | Evaluate RAG quality | ⭐⭐ Intermediate |
| 3 | `session03_benchmark_oss_models.ipynb` | Model benchmarking | ⭐⭐ Intermediate |
| 4 | `session04_model_compare.ipynb` | Model comparison | ⭐⭐ Intermediate |
| 5 | `session05_agents_orchestrator.ipynb` | Agent orchestration | ⭐⭐⭐ Advanced |
| 6 | `session06_models_router.ipynb` | Intent routing | ⭐⭐⭐ Advanced |
| 6 | `session06_models_pipeline.ipynb` | Pipeline orchestration | ⭐⭐⭐ Advanced |
Comprehensive guides and references:
| Document | Description | Use When |
|---|---|---|
| QUICK_START.md | Fast-track setup guide | Starting from scratch |
| QUICK_REFERENCE.md | Command & API cheat sheet | Need quick answers |
| FOUNDRY_SDK_QUICKREF.md | SDK patterns & examples | Writing code |
| ENV_CONFIGURATION.md | Environment variable guide | Configuring samples |
| notebooks/TROUBLESHOOTING.md | Common issues & fixes | Debugging problems |
Beginner Path:
- ✅ Session 1: Getting Started (focus on setup and basic chat)
- ✅ Session 2: RAG Basics (skip evaluation initially)
- ✅ Session 3: Simple Benchmarking (2 models only)
- ⏭️ Skip Sessions 4-6 for now
- 🔄 Return to Sessions 4-6 after building first application
Intermediate Path:
- ⚡ Session 1: Quick setup validation
- ✅ Session 2: Complete RAG pipeline with evaluation
- ✅ Session 3: Full benchmarking suite
- ✅ Session 4: Model optimization
- ✅ Sessions 5-6: Focus on architecture patterns
Advanced Path:
- ⚡ Sessions 1-3: Quick review and validation
- ✅ Session 4: Optimization deep-dive
- ✅ Session 5: Multi-agent architecture
- ✅ Session 6: Production patterns and scaling
- 🚀 Extend: Build custom routing logic and hybrid deployments
If you're following the condensed 6-session workshop format, use these dedicated guides (each maps to and complements the broader module docs above):
| Workshop Session | Guide | Core Focus |
|---|---|---|
| 1 | Session01-GettingStartedFoundryLocal | Install, validate, run phi & GPT-OSS-20B, acceleration |
| 2 | Session02-BuildAISolutionsRAG | Prompt engineering, RAG patterns, CSV & document grounding, migration |
| 3 | Session03-OpenSourceModels | Hugging Face integration, benchmarking, model selection |
| 4 | Session04-CuttingEdgeModels | SLM vs LLM, WebGPU, Chainlit RAG, ONNX acceleration |
| 5 | Session05-AIPoweredAgents | Agent roles, memory, tools, orchestration |
| 6 | Session06-ModelsAsTools | Routing, chaining, scaling path to Azure |
Each session file includes: abstract, learning objectives, 30‑minute demo flow, starter project, validation checklist, troubleshooting, and references to the official Foundry Local Python SDK.
Install workshop dependencies (Windows):

```bash
cd Workshop
py -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
```

macOS / Linux:

```bash
cd Workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

If running the Foundry Local service on a different (Windows) machine or VM from macOS, export the endpoint:

```bash
export FOUNDRY_LOCAL_ENDPOINT=http://<windows-host>:5273/v1
```

| Session | Script(s) | Description |
|---|---|---|
| 1 | `samples/session01/chat_bootstrap.py` | Bootstrap service & streaming chat |
| 2 | `samples/session02/rag_pipeline.py` | Minimal RAG (in-memory embeddings) |
| 2 | `samples/session02/rag_eval_ragas.py` | RAG evaluation with ragas metrics |
| 3 | `samples/session03/benchmark_oss_models.py` | Multi-model latency & throughput benchmarking |
| 4 | `samples/session04/model_compare.py` | SLM vs LLM comparison (latency & sample output) |
| 5 | `samples/session05/agents_orchestrator.py` | Two-agent research → editorial pipeline |
| 6 | `samples/session06/models_router.py` | Intent-based routing demo |
| 6 | `samples/session06/models_pipeline.py` | Multi-step plan/execute/refine chain |
| Variable | Purpose | Example |
|---|---|---|
| `FOUNDRY_LOCAL_ALIAS` | Default single model alias for basic samples | `phi-4-mini` |
| `SLM_ALIAS` / `LLM_ALIAS` | Explicit SLM vs larger model for comparison | `phi-4-mini` / `gpt-oss-20b` |
| `BENCH_MODELS` | Comma-separated list of aliases to benchmark | `qwen2.5-0.5b,mistral-7b` |
| `BENCH_ROUNDS` | Benchmark repetitions per model | `3` |
| `BENCH_PROMPT` | Prompt used in benchmarking | `Explain retrieval augmented generation briefly.` |
| `EMBED_MODEL` | Sentence-transformers embedding model | `sentence-transformers/all-MiniLM-L6-v2` |
| `RAG_QUESTION` | Override test query for RAG pipeline | `Why use RAG with local inference?` |
| `AGENT_QUESTION` | Override agents pipeline query | `Explain why edge AI matters for compliance.` |
| `AGENT_MODEL_PRIMARY` | Model alias for research agent | `phi-4-mini` |
| `AGENT_MODEL_EDITOR` | Model alias for editor agent (can differ) | `gpt-oss-20b` |
| `SHOW_USAGE` | When `1`, prints token usage per completion | `1` |
| `RETRY_ON_FAIL` | When `1`, retry once on transient chat errors | `1` |
| `RETRY_BACKOFF` | Seconds to wait before retry | `1.0` |
If a variable isn’t set, scripts fall back to sensible defaults. For single‑model demos you typically only need FOUNDRY_LOCAL_ALIAS.
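The fallback behavior can be sketched in a few lines; the helper name `env()` is illustrative, not the samples' actual code:

```python
import os

def env(name, default):
    """Read a workshop setting, falling back to a sensible default."""
    return os.environ.get(name, default)

# Typed conversions happen at the call site:
alias = env("FOUNDRY_LOCAL_ALIAS", "phi-4-mini")
rounds = int(env("BENCH_ROUNDS", "3"))
show_usage = env("SHOW_USAGE", "0") == "1"
```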
All samples now share a helper, `samples/workshop_utils.py`, providing:

- Cached `FoundryLocalManager` + OpenAI client creation
- A `chat_once()` helper with optional retry + usage printing
- Simple token usage reporting (enable via `SHOW_USAGE=1`)
This reduces duplication and highlights best practices for efficient local model orchestration.
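The caching idea behind that helper can be illustrated with `functools.lru_cache`; this is a stand-in pattern (with a stub client factory), not the actual contents of `workshop_utils.py`:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_client(alias):
    """Create (and cache) a client for a model alias so repeated calls
    reuse one connection. The real helper wraps FoundryLocalManager and
    an OpenAI client; a stub dict stands in here."""
    return {"alias": alias, "connection": object()}

def chat_once(alias, prompt, llm=None):
    # Reuses the cached client; `llm` stands in for
    # client.chat.completions.create(...).
    client = get_client(alias)
    llm = llm or (lambda p: f"({client['alias']}) {p}")
    return llm(prompt)

a = get_client("phi-4-mini")
b = get_client("phi-4-mini")  # same cached object, no re-initialization
```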
| Theme | Enhancement | Sessions | Env / Toggle |
|---|---|---|---|
| Determinism | Fixed temperature + stable prompt sets | 1–6 | Set temperature=0, top_p=1 |
| Token Usage Visibility | Consistent cost/efficiency teaching | 1–6 | SHOW_USAGE=1 |
| Streaming First Token | Perceived latency metric | 1,3,4,6 | BENCH_STREAM=1 (benchmark) |
| Retry Resilience | Handles transient cold-start | All | RETRY_ON_FAIL=1 + RETRY_BACKOFF |
| Multi-Model Agents | Heterogeneous role specialization | 5 | AGENT_MODEL_PRIMARY, AGENT_MODEL_EDITOR |
| Adaptive Routing | Intent + cost heuristics | 6 | Extend router with escalation logic |
| Vector Memory | Long-term semantic recall | 2,5,6 | Integrate FAISS/Chroma embedding index |
| Trace Export | Auditing & evaluation | 2,5,6 | Append JSON lines per step |
| Quality Rubrics | Qualitative tracking | 3–6 | Secondary scoring prompts |
| Smoke Tests | Quick pre-workshop validation | All | python Workshop/tests/smoke.py |
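The Adaptive Routing row above can be sketched as a keyword-based intent router; the intents, keywords, and routing table below are illustrative, not the session 6 sample's actual logic:

```python
def detect_intent(prompt):
    # Crude keyword heuristics; a real router might also weigh
    # prompt length or estimated cost before escalating.
    p = prompt.lower()
    if any(w in p for w in ("code", "function", "bug")):
        return "coding"
    if any(w in p for w in ("summarize", "tl;dr")):
        return "summarization"
    return "general"

# Hypothetical routing table: small model for cheap tasks,
# larger model for harder ones.
ROUTES = {
    "coding": "gpt-oss-20b",
    "summarization": "phi-4-mini",
    "general": "phi-4-mini",
}

def route(prompt):
    return ROUTES[detect_intent(prompt)]
```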
```cmd
set FOUNDRY_LOCAL_ALIAS=phi-4-mini
set SHOW_USAGE=1
python Workshop\tests\smoke.py
```

Expect stable token counts across repeated identical inputs.
Use rag_eval_ragas.py to compute answer relevancy, faithfulness, and context precision on a tiny synthetic dataset:
```bash
cd Workshop/samples
python -m session02.rag_eval_ragas
```

Extend by supplying a larger JSONL of questions, contexts, and ground truths, then converting to a Hugging Face Dataset.
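A sketch of that extension path: parse JSONL rows using the field names ragas commonly expects (`question`, `contexts`, `ground_truth`), then convert with `Dataset.from_list` if the `datasets` package is installed. The sample rows are made up:

```python
import json

jsonl = """\
{"question": "Why RAG?", "contexts": ["RAG grounds answers."], "ground_truth": "It grounds answers."}
{"question": "Why edge?", "contexts": ["Edge AI is local."], "ground_truth": "Data stays local."}
"""

# One JSON object per line -> list of row dicts.
rows = [json.loads(line) for line in jsonl.splitlines() if line.strip()]

# With the `datasets` package installed, the rows convert directly:
#   from datasets import Dataset
#   ds = Dataset.from_list(rows)
```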
The workshop deliberately uses only currently documented / stable Foundry Local CLI commands.
| Category | Command | Purpose |
|---|---|---|
| Core | `foundry --version` | Show installed version |
| Service | `foundry service start` | Start local service (if not auto-started) |
| Service | `foundry service status` | Show service status |
| Models | `foundry model list` | List catalog / available models |
| Models | `foundry model download <alias>` | Download model weights into cache |
| Models | `foundry model run <alias>` | Launch (load) a model locally; combine with `--prompt` for one-shot use |
| Models | `foundry model unload <alias>` / `foundry model stop <alias>` | Unload a model from memory (if supported) |
| Cache | `foundry cache list` | List cached (downloaded) models |
Instead of the deprecated `model chat` subcommand, use:

```bash
foundry model run <alias> --prompt "Your question here"
```

This executes a single prompt/response cycle, then exits.
| Deprecated / Undocumented | Replacement / Guidance |
|---|---|
| `foundry model chat <model> "..."` | `foundry model run <model> --prompt "..."` |
| `foundry model list --running` | Use plain `foundry model list` + recent activity / logs |
| `foundry model list --cached` | `foundry cache list` |
| `foundry model stats <model>` | Use the benchmark Python script + OS tools (Task Manager / `nvidia-smi`) |
| `foundry model benchmark ...` | `samples/session03/benchmark_oss_models.py` |
- Latency, p95, tokens/sec: `samples/session03/benchmark_oss_models.py`
- First-token latency (streaming): set `BENCH_STREAM=1`
- Resource usage: OS monitors (Task Manager, Activity Monitor, `nvidia-smi`)
As new CLI telemetry commands stabilize upstream, they can be incorporated with minimal edits to session markdowns.
An automated linter prevents reintroduction of deprecated CLI patterns inside fenced code blocks of markdown files:
Script: Workshop/scripts/lint_markdown_cli.py
Deprecated patterns are blocked inside code fences.
Recommended replacements:
| Deprecated | Replacement |
|---|---|
| `foundry model chat <a> "..."` | `foundry model run <a> --prompt "..."` |
| `model list --running` | `model list` |
| `model list --cached` | `cache list` |
| `model stats` | Benchmark script + system tools |
| `model benchmark` | `samples/session03/benchmark_oss_models.py` |
| `model list --available` | `model list` |
Run locally:

```bash
python Workshop/scripts/lint_markdown_cli.py --verbose
```

GitHub Action: `.github/workflows/markdown-cli-lint.yml` runs on every push & PR.
Optional pre-commit hook:

```bash
echo "python Workshop/scripts/lint_markdown_cli.py" > .git/hooks/pre-commit
chmod +x .git/hooks/pre-commit
```

| Task | CLI One-Liner | SDK (Python) Equivalent | Notes |
|---|---|---|---|
| Run a model once (prompt) | `foundry model run phi-4-mini --prompt "Hello"` | `manager = FoundryLocalManager("phi-4-mini"); client = OpenAI(base_url=manager.endpoint, api_key=manager.api_key or "not-needed"); client.chat.completions.create(model=manager.get_model_info("phi-4-mini").id, messages=[{"role": "user", "content": "Hello"}])` | SDK bootstraps service & caching automatically |
| Download (cache) model | `foundry model download qwen2.5-0.5b` | `FoundryLocalManager("qwen2.5-0.5b")  # triggers download/load` | Manager picks the best variant if the alias maps to multiple builds |
| List catalog | `foundry model list` | Instantiate a manager per alias, or maintain a known list | CLI aggregates; SDK currently works per alias |
| List cached models | `foundry cache list` | `manager.list_cached_models()` | After manager init (any alias) |
| Get endpoint URL | (implicit) | `manager.endpoint` | Used to create an OpenAI-compatible client |
| Warm a model | `foundry model run <alias>`, then first prompt | `chat_once(alias, messages=[...])` (utility) | Utilities handle initial cold-start warmup latency |
| Measure latency | `python -m session03.benchmark_oss_models` | `import benchmark_oss_models` (or the exporter script) | Prefer the script for consistent metrics |
| Stop / unload model | `foundry model unload <alias>` | (Not exposed; restart service / process) | Typically not required for workshop flow |
| Retrieve token usage | (view output) | `resp.usage.total_tokens` | Provided if the backend returns a usage object |
Use the script Workshop/scripts/export_benchmark_markdown.py to run a fresh benchmark (same logic as samples/session03/benchmark_oss_models.py) and emit a GitHub-friendly Markdown table plus raw JSON.
```cmd
python Workshop\scripts\export_benchmark_markdown.py --models "qwen2.5-0.5b,mistral-7b" --prompt "Explain retrieval augmented generation briefly." --rounds 3 --output benchmark_report.md
```

Generated files:

| File | Contents |
|---|---|
| `benchmark_report.md` | Markdown table + interpretation hints |
| `benchmark_report.json` | Raw metrics array (for diffing / trend tracking) |
Set BENCH_STREAM=1 in the environment to include first-token latency if supported.