
Commit 1499203

chrisdMSFT and Yi Ren authored
Hello WindowsML (microsoft#1711)
# Hello Windows ML!

Adding GenAI support for the Windows ML (WinML) ORT-based distribution.

## ✅ Behavior Changes:

- `SetProviderSessionOptions` now calls `SessionOptionsAppendExecutionProvider_V2` when building the Execution Providers, to leverage the new Plugin EP pattern.

## 🎯 New Build Targets:

WinML builds for two targets, x64 and arm64. In both cases we define `USE_WINML`, along with the following:

- `windows_x64_winml_relwithdebinfo` - `USE_DML`, `USE_CUDA`
- `windows_arm64_winml_relwithdebinfo` - `USE_DML`

When building, it is REQUIRED to specify the version of `Microsoft.WindowsAppSDK.ML` we are building against in the `WINML_SDK_VERSION` variable.

### Example:

```bash
cmake --preset windows_arm64_winml_relwithdebinfo -T cuda='E:\_work\1\onnxruntime-genai\cuda_sdk\v12.2'
cmake --preset windows_x64_winml_relwithdebinfo -T cuda='E:\_work\1\onnxruntime-genai\cuda_sdk\v12.2' -DWINML_SDK_VERSION='1.8.1075-preview-g295f112894'
```

## 📋 Testing

- Don't break existing build targets; added `win-winml-x64-build.yml` for CI.

## 🤔 Open Questions:

- The decision between `SessionOptionsAppendExecutionProvider` and `SessionOptionsAppendExecutionProvider_V2` seems specific not to WinML but to the new Plugin EP pattern in ORT.
- Should the WinML and V2 concepts be split out? Does non-WinML GenAI code want to use the V2 pattern?

## 🪫 Batteries not included (debt):

- Add targeted WinML tests.

---------

Co-authored-by: Yi Ren <[email protected]>
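
For reference, a sketch contrasting the two append patterns named in the behavior change above. This is illustrative only: the `SessionOptionsAppendExecutionProvider_V2` signature shown here is an assumption based on recent `onnxruntime_c_api.h` plugin-EP additions and should be verified against the ORT headers actually in use.

```cpp
// Sketch only: V1 selects a provider by name with string options, while
// V2 (the plugin EP pattern) selects explicit OrtEpDevice instances.
// The V2 signature below is an assumption -- check onnxruntime_c_api.h.
#include "onnxruntime_c_api.h"

OrtStatus* AppendV1(const OrtApi* api, OrtSessionOptions* so,
                    const char* provider_name) {
  const char* keys[] = {"device_id"};  // illustrative option key/value
  const char* vals[] = {"0"};
  return api->SessionOptionsAppendExecutionProvider(so, provider_name,
                                                    keys, vals, 1);
}

OrtStatus* AppendV2(const OrtApi* api, OrtEnv* env, OrtSessionOptions* so,
                    const OrtEpDevice* const* devices, size_t num_devices) {
  // V2 takes explicit devices discovered from the environment rather
  // than a hard-coded provider name; no extra options passed here.
  return api->SessionOptionsAppendExecutionProvider_V2(so, env, devices,
                                                       num_devices,
                                                       nullptr, nullptr, 0);
}
```
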
1 parent 5f60ecc commit 1499203

28 files changed, +1375 -250 lines changed

.github/copilot-instructions.md

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
# ONNX Runtime GenAI - AI Coding Agent Instructions

## Architecture Overview

This is **ONNX Runtime GenAI**, a high-performance inference library for generative AI models. The codebase implements the complete generative AI loop including preprocessing, ONNX Runtime inference, logits processing, search/sampling, and KV cache management.

### Core Components

- **`src/models/`** - Model implementations with support for LLMs, VLMs (Vision), ALMs (Audio), and Pipeline models
- **`src/engine/`** - Request batching engine for concurrent model execution with dynamic scheduling
- **`src/generators.h`** - Central generator logic coordinating the full inference pipeline
- **`src/ort_genai.h`** - Zero-cost C++ wrapper around the C API for automatic resource management
- **Language bindings**: Python (`src/python/`), C# (`src/csharp/`), Java (`src/java/`), Objective-C (`src/objectivec/`)

### Key Abstractions

```cpp
// Core inference flow: Model -> Generator -> Tokenizer
auto model = OgaModel::Create("phi-2");
auto tokenizer = OgaTokenizer::Create(*model);
auto generator = OgaGenerator::Create(*model, params);
```

The `State` class hierarchy in `src/models/model.h` handles device-specific execution, while the `Engine` class in `src/engine/` manages request batching and scheduling.

## Build System & Development Workflow

### Primary Build Commands

```bash
# Cross-platform Python build script (preferred)
python build.py --config Release --use_cuda --build_java --enable_tests

# Platform-specific scripts
build.bat  # Windows batch
build.sh   # Linux/Mac shell
```

### Key Build Options (cmake/options.cmake)

- `USE_CUDA/USE_DML/USE_ROCM` - Hardware acceleration backends
- `USE_WINML` - Windows ML integration requiring `WINML_SDK_VERSION` parameter
- `ENABLE_JAVA/ENABLE_PYTHON` - Language binding compilation
- `USE_GUIDANCE` - Constrained generation support

### WinML Build Pattern

WinML builds require explicit SDK version specification:

```bash
# WinML build - WINML_SDK_VERSION is mandatory
python build.py --use_winml -DWINML_SDK_VERSION=1.8.2084
```

WinML integration downloads `Microsoft.WindowsAppSDK.ML` via NuGet and copies headers/libs to a local `ort/` directory.

### Testing

```bash
# Python tests with test models
python -m pytest -sv test_onnxruntime_genai_api.py -k "test_name" --test_models ..\test_models

# C++ unit tests via CMake/CTest
ctest --build-config Release --output-on-failure
```

## Code Patterns & Conventions

### Device Interface Pattern

Each hardware backend implements `DeviceInterface` (defined in `src/smartptrs.h`):

```cpp
struct CudaInterface : DeviceInterface {
  std::unique_ptr<DeviceBuffer> Allocate(size_t size) override;
  void CopyToDevice(DeviceSpan<T> dst, std::span<const T> src) override;
};
```

### Model State Management

Models follow the `State` pattern where each model type extends the base `State` class:

```cpp
struct State {
  virtual DeviceSpan<float> Run(int total_length,
                                DeviceSpan<int32_t>& next_tokens) = 0;
  virtual void RewindTo(size_t index) {}  // For session continuation
};
```

### Error Handling Convention

Use `OgaCheckResult()` wrapper for C API error propagation:

```cpp
OgaCheckResult(OgaCreateModel(model_path, &model));  // Throws std::runtime_error
```

### Memory Management

- **DeviceSpan/DeviceBuffer**: Device-agnostic memory abstractions
- **std::unique_ptr with custom deleters**: For C API resource cleanup (see the sketch after this list)
- **LeakChecked<T>**: Debug-mode leak detection for core types
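
A minimal sketch of the custom-deleter pattern, assuming the `OgaCreateModel`/`OgaDestroyModel`/`OgaDestroyResult` entry points from the GenAI C API; verify the exact signatures in `ort_genai_c.h`:

```cpp
// Minimal sketch: std::unique_ptr with a custom deleter for a C API
// handle; assumes OgaCreateModel/OgaDestroyModel from ort_genai_c.h.
#include <memory>
#include <stdexcept>
#include "ort_genai_c.h"

struct OgaModelDeleter {
  void operator()(OgaModel* p) const noexcept { OgaDestroyModel(p); }
};
using OgaModelPtr = std::unique_ptr<OgaModel, OgaModelDeleter>;

OgaModelPtr MakeModel(const char* config_path) {
  OgaModel* raw = nullptr;
  if (OgaResult* result = OgaCreateModel(config_path, &raw)) {
    OgaDestroyResult(result);  // a non-null OgaResult signals an error
    throw std::runtime_error("OgaCreateModel failed");
  }
  return OgaModelPtr{raw};  // handle is freed automatically on scope exit
}
```
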
## Critical Integration Points

### ONNX Runtime Dependency Management

ADO pipelines obtain ORT lib/headers via three methods:

1. **Explicit `ORT_HOME`** - Pipeline provides pre-built ORT artifacts (preferred)
2. **Auto-download via CMake** - `cmake/ortlib.cmake` fetches from the ORT-Nightly feed when `ORT_HOME` is unset
3. **Python build driver** - `tools/python/util/dependency_resolver.py` downloads NuGet packages

### Model Loading Pipeline

1. **Config parsing** (`src/config.cpp`) - Reads `genai_config.json` model metadata
2. **ONNX session creation** via `onnxruntime_api.h` wrappers
3. **Device interface selection** based on provider availability
4. **KV cache initialization** (`src/models/kv_cache.cpp`) for transformer models

### Multi-Modal Support

Vision models (Phi-Vision) use separate processor classes:

- `PhiImageProcessor` - Image tokenization and preprocessing
- `MultiModalProcessor` - Coordinates text/image inputs

### Execution Provider Detection

Hardware acceleration auto-detection follows this priority (a sketch follows the list):

1. CUDA (if `USE_CUDA=ON` and CUDA runtime available)
2. DirectML (Windows, if `USE_DML=ON`)
3. CPU fallback
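
An illustrative reduction of that priority chain (not the project's actual detection code; all names below are hypothetical):

```cpp
// Hypothetical illustration of the fallback chain described above.
enum class Provider { Cuda, DirectML, Cpu };

Provider SelectProvider(bool built_with_cuda, bool cuda_runtime_available,
                        bool built_with_dml, bool on_windows) {
  if (built_with_cuda && cuda_runtime_available) return Provider::Cuda;  // 1
  if (built_with_dml && on_windows) return Provider::DirectML;           // 2
  return Provider::Cpu;  // 3: always-available fallback
}
```
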
## Project-Specific Gotchas

### Windows-Specific Build Requirements

- **Visual Studio 2022** required for C++20 features
- **WinML integration** requires specific NuGet package versions (see `cmake/nuget.cmake`)
- **Cross-compilation** for ARM64/ARM64EC supported via CMake platform flags

### Model Compatibility Matrix

The repo supports specific model architectures - check `src/models/model_type.h` for the canonical list. New models require (a sketch of step 2 follows the list):

1. Config template in model directory
2. State implementation extending base `State` class
3. Optional custom processors for multi-modal inputs
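
A minimal sketch of step 2, assuming only the `State` interface shown under "Model State Management" (input/output binding and KV cache plumbing elided):

```cpp
// Hypothetical model state extending the base State interface; real
// implementations bind ONNX inputs/outputs and manage the KV cache.
struct MyModelState : State {
  DeviceSpan<float> Run(int total_length,
                        DeviceSpan<int32_t>& next_tokens) override {
    // 1) feed next_tokens to the ONNX session,
    // 2) run inference for the current step,
    // 3) return the logits consumed by search/sampling.
    return logits_;
  }
  DeviceSpan<float> logits_;
};
```
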
### Performance Considerations

- **KV caching** is automatically managed but can be configured via `runtime_settings.cpp`
- **Continuous decoding** (session continuation) requires careful state management
- **Multi-LoRA** adapters use separate weight loading in `src/models/adapters.cpp`

## Testing Strategy

Tests are organized by language binding:

- **C++ tests**: `test/` directory, focused on core API validation
- **Python tests**: `test/python/`, includes end-to-end model testing
- **Platform tests**: Android/iOS tests run via emulator/simulator

Always test with actual model files from `test/test_models/` directory rather than mock data.
.github/workflows/win-winml-x64-build.yml

Lines changed: 69 additions & 0 deletions
@@ -1 +1,70 @@
```yaml
name: "Windows WinML x64 Build"
on:
  workflow_dispatch:
  push:
    branches:
      - main
      - rel-*
  pull_request:

concurrency:
  group: ${{ github.workflow }}-${{ github.event_name == 'pull_request' && github.ref || github.sha }}
  cancel-in-progress: true

env:
  AZCOPY_AUTO_LOGIN_TYPE: MSI
  AZCOPY_MSI_CLIENT_ID: 63b63039-6328-442f-954b-5a64d124e5b4
  cuda_dir: "${{ github.workspace }}\\cuda_sdk"
  cuda_version: "12.2"
  CUDA_PATH: ${{ github.workspace }}\\cuda_sdk\\v12.2
  binaryDir: 'build/cuda/win-x64'
  ORT_NIGHTLY_REST_API: "https://feeds.dev.azure.com/aiinfra/PublicPackages/_apis/packaging/Feeds/ORT-Nightly/packages?packageNameQuery=Microsoft.ML.OnnxRuntime.Gpu.Windows&api-version=6.0-preview.1"
  ORT_PACKAGE_NAME: "Microsoft.ML.OnnxRuntime.Gpu.Windows"

jobs:
  windows-cuda-x64-build:
    runs-on: ["self-hosted", "1ES.Pool=onnxruntime-genai-Win2022-GPU-A10"]
    steps:
      - name: Checkout OnnxRuntime GenAI repo
        uses: actions/checkout@v4
        with:
          submodules: true

      - uses: actions/setup-python@v5
        with:
          python-version: '3.11.x'
          architecture: 'x64'

      - name: Setup VCPKG
        uses: microsoft/onnxruntime-github-actions/[email protected]
        with:
          vcpkg-version: '2025.03.19'
          vcpkg-hash: '17e96169cd3f266c4716fcdc1bb728e6a64f103941ece463a2834d50694eba4fb48f30135503fd466402afa139abc847ef630733c442595d1c34979f261b0114'
          cmake-version: '3.31.6'
          cmake-hash: '0f1584e8666cf4a65ec514bd02afe281caabf1d45d2c963f3151c41484f457386aa03273ab25776a670be02725354ce0b46f3a5121857416da37366342a833a0'
          add-cmake-to-path: 'true'
          disable-terrapin: 'false'

      - name: Download cuda
        run: |
          azcopy.exe cp --recursive "https://lotusscus.blob.core.windows.net/models/cuda_sdk/v${{ env.cuda_version }}" ${{ env.cuda_dir}}

      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '8.0.x'

      - name: Install Rust Toolchain
        run: |
          $exePath = "$env:TEMP\rustup-init.exe"
          (New-Object Net.WebClient).DownloadFile('https://static.rust-lang.org/rustup/dist/x86_64-pc-windows-msvc/rustup-init.exe', $exePath)
          & $exePath -y --default-toolchain=1.86.0
          Add-Content $env:GITHUB_PATH "$env:USERPROFILE\.cargo\bin"

      - name: Configure CMake
        run: |
          cmake --preset windows_x64_winml_relwithdebinfo -T cuda=${{ env.cuda_dir }}\\v${{ env.cuda_version }} -DWINML_SDK_VERSION='1.8.2088'

      - name: Build with CMake
        run: |
          cmake --build --preset windows_x64_winml_relwithdebinfo --parallel
          cmake --build --preset windows_x64_winml_relwithdebinfo --target PyPackageBuild
```
