
Commit 381f1ce

Continue plugin demo (openvinotoolkit#3203)
CVS-165486
1 parent 67365b1 commit 381f1ce

File tree: 7 files changed, +176 −1 lines changed

Lines changed: 173 additions & 0 deletions

# Code Completion and Copilot served via OpenVINO Model Server

## Intro

With the rise of AI PC capabilities, hosting your own Visual Studio Code assistant is within reach. In this demo, we will show how to deploy local LLM serving with OVMS and integrate it with the Continue extension, using iGPU or NPU acceleration.

## Requirements

- Windows (for standalone app) or Linux (using Docker)
- Python installed (for model preparation only)
- Intel Meteor Lake, Lunar Lake, Arrow Lake or newer Intel CPU
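
Optionally, you can check which accelerators OpenVINO detects on your machine. This is a quick sanity check added here for convenience, not part of the original demo; it assumes the `openvino` Python package is installed (for example by the export requirements in the next section):

```console
python -c "import openvino as ov; print(ov.Core().available_devices)"
```

The printed list should include `NPU` or `GPU` alongside `CPU` if the corresponding drivers are installed.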

## Prepare Code Chat/Edit Model

We need a medium-size model to keep the pace at about 50 ms/word so the chat feels responsive.
It will work in streaming mode, meaning the chat response and code diff generation roll out gradually in real time.

Download the export script, install its dependencies, and create a directory for the models:
```console
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/1/demos/common/export_models/requirements.txt
mkdir models
```
> **Note:** Users in China need to set the environment variable `HF_ENDPOINT="https://hf-mirror.com"` before running the export script to connect to the HF Hub.
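
For example, the variable can be set for the current shell like this (a quick illustration added here, not part of the original instructions):

```console
# Linux (bash)
export HF_ENDPOINT="https://hf-mirror.com"
# Windows (PowerShell)
$env:HF_ENDPOINT = "https://hf-mirror.com"
```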

Export `codellama/CodeLlama-7b-Instruct-hf`:
```console
python export_model.py text_generation --source_model codellama/CodeLlama-7b-Instruct-hf --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device NPU --overwrite_models
```

> **Note:** Use `--target_device GPU` for Intel GPU or omit this parameter to run on Intel CPU.

## Prepare Code Completion Model

For this task we need a smaller, lighter model that produces code faster than the chat model.
Since we do not want to wait for the code to appear, the model should be responsive enough to generate multi-line blocks of code ahead of time as we type.
Code completion works in non-streaming, unary mode. Do not use an instruct model; there is no chat involved in the process.

Export `Qwen/Qwen2.5-Coder-1.5B`:
```console
python export_model.py text_generation --source_model Qwen/Qwen2.5-Coder-1.5B --weight-format int4 --config_file_path models/config_all.json --model_repository_path models --target_device NPU --overwrite_models
```

Verify that the workspace is set up properly by examining `models/config_all.json`:
```json
{
    "mediapipe_config_list": [
        {
            "name": "codellama/CodeLlama-7b-Instruct-hf",
            "base_path": "codellama/CodeLlama-7b-Instruct-hf"
        },
        {
            "name": "Qwen/Qwen2.5-Coder-1.5B",
            "base_path": "Qwen/Qwen2.5-Coder-1.5B"
        }
    ],
    "model_config_list": []
}
```

```console
tree models
models
├── codellama
│   └── CodeLlama-7b-Instruct-hf
│       ├── config.json
│       ├── generation_config.json
│       ├── graph.pbtxt
│       ├── openvino_detokenizer.bin
│       ├── openvino_detokenizer.xml
│       ├── openvino_model.bin
│       ├── openvino_model.xml
│       ├── openvino_tokenizer.bin
│       ├── openvino_tokenizer.xml
│       ├── special_tokens_map.json
│       ├── tokenizer_config.json
│       ├── tokenizer.json
│       └── tokenizer.model
├── config_all.json
└── Qwen
    └── Qwen2.5-Coder-1.5B
        ├── added_tokens.json
        ├── config.json
        ├── generation_config.json
        ├── graph.pbtxt
        ├── merges.txt
        ├── openvino_detokenizer.bin
        ├── openvino_detokenizer.xml
        ├── openvino_model.bin
        ├── openvino_model.xml
        ├── openvino_tokenizer.bin
        ├── openvino_tokenizer.xml
        ├── special_tokens_map.json
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── vocab.json

4 directories, 29 files
```

## Set Up Server

Run OpenVINO Model Server with both models loaded at the same time:

### Windows: deploying on bare metal

Please refer to the OpenVINO Model Server installation guide first: [link](../../docs/deploying_server_baremetal.md)

```console
ovms --rest_port 8000 --config_path ./models/config_all.json
```

### Linux: via Docker

```bash
docker run -d --rm -v $(pwd)/:/workspace/ -p 8000:8000 openvino/model_server:2025.1 --rest_port 8000 --config_path /workspace/models/config_all.json
```

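Once the server is up, you can optionally verify that the chat model responds before configuring the IDE. The request below is a minimal sketch against the OpenAI-compatible chat completions endpoint exposed by OVMS under `/v3`; the prompt and `max_tokens` value are arbitrary, and bash line continuations are used:

```bash
curl http://localhost:8000/v3/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "codellama/CodeLlama-7b-Instruct-hf",
        "messages": [{"role": "user", "content": "Write a hello world function in Python"}],
        "max_tokens": 100
      }'
```
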
## Set Up Visual Studio Code

### Download the [Continue plugin](https://www.continue.dev/)

![search_continue_plugin](search_continue_plugin.png)

### Set Up Local Assistant

We need to point the Continue plugin to our OpenVINO Model Server instance.
Open the configuration file:

![setup_local_assistant](setup_local_assistant.png)

Add both models and specify their roles:
```yaml
name: Local Assistant
version: 1.0.0
schema: v1
models:
  -
    name: OVMS CodeLlama-7b-Instruct-hf
    provider: openai
    model: codellama/CodeLlama-7b-Instruct-hf
    apiKey: unused
    apiBase: localhost:8000/v3
    roles:
      - chat
      - edit
      - apply
  -
    name: OVMS Qwen2.5-Coder-1.5B
    provider: openai
    model: Qwen/Qwen2.5-Coder-1.5B
    apiKey: unused
    apiBase: localhost:8000/v3
    roles:
      - autocomplete
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
```

## Have Fun

- to use the chat feature, click the Continue button in the left sidebar
- use `CTRL+L` to select source code and include it in a chat message
- use `CTRL+I` to select source code and edit it via a chat request
- simply write code to see code autocompletion (NOTE: this is turned off by default)

![final](final.png)

## Troubleshooting

OpenVINO Model Server uses Python to apply chat templates. If you get an error during model loading, enable Unicode UTF-8 in your system settings:

![utf8](utf8.png)
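
As a quick related check on Windows, you can print the active console code page; 65001 corresponds to UTF-8. This hint is an addition to the original demo and only covers the console code page, not the system-wide "Beta Unicode UTF-8" setting:

```console
chcp
Active code page: 65001
```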

docs/deploying_server_baremetal.md

Lines changed: 3 additions & 1 deletion
````diff
@@ -111,7 +111,9 @@ Run `setupvars` script to set required environment variables.
 .\ovms\setupvars.ps1
 ```
 
-> **Note**: Running this script changes Python settings for the shell that runs it.Environment variables are set only for the current shell so make sure you rerun the script before using model server in a new shell.
+> **Note**: Running this script changes Python settings for the shell that runs it. Environment variables are set only for the current shell so make sure you rerun the script before using model server in a new shell.
+
+> **Note**: When serving LLM models, OVMS uses Python's Jinja package to apply chat template. Please ensure you have Windows "Beta Unicode UTF-8 for worldwide language support" enabled. [Instruction](llm_utf8_troubleshoot.png)
 
 You can also build model server from source by following the [developer guide](windows_developer_guide.md).
````
docs/llm_utf8_troubleshoot.png
