Commit 30d620e

XingLiu0923 and Xing Liu authored
feat(embedder): make embedder support openai compatible model like qwen (#169)
* feat(embedder): make embedder support openai compatible model like qwen
* refactor config py

---------

Co-authored-by: Xing Liu <[email protected]>
1 parent 732e0fb commit 30d620e

14 files changed, +5965 -627 lines changed

README.es.md

Lines changed: 15 additions & 0 deletions
@@ -239,6 +239,20 @@ La configuración de base_url del Cliente OpenAI está diseñada principalmente

 **Próximamente**: En futuras actualizaciones, DeepWiki soportará un modo donde los usuarios deberán proporcionar sus propias claves API en las solicitudes. Esto permitirá a los clientes empresariales con canales privados utilizar sus disposiciones API existentes sin compartir credenciales con el despliegue de DeepWiki.

+## 🧩 Uso de modelos de embedding compatibles con OpenAI (por ejemplo, Alibaba Qwen)
+
+Si deseas usar modelos de embedding compatibles con la API de OpenAI (como Alibaba Qwen), sigue estos pasos:
+
+1. Sustituye el contenido de `api/config/embedder.json` por el de `api/config/embedder_openai_compatible.json`.
+2. En el archivo `.env` de la raíz del proyecto, configura las variables de entorno necesarias, por ejemplo:
+```
+OPENAI_API_KEY=tu_api_key
+OPENAI_API_BASE_URL=tu_endpoint_compatible_openai
+```
+3. El programa sustituirá automáticamente los placeholders de embedder.json por los valores de tus variables de entorno.
+
+Así puedes cambiar fácilmente a cualquier servicio de embedding compatible con OpenAI sin modificar el código.
+
 ## 🤖 Funciones de Preguntas e Investigación Profunda

 ### Función de Preguntas
@@ -317,3 +331,4 @@ Este proyecto está licenciado bajo la Licencia MIT - consulta el archivo [LICEN
 ## ⭐ Historial de Estrellas

 [![Gráfico de Historial de Estrellas](https://api.star-history.com/svg?repos=AsyncFuncAI/deepwiki-open&type=Date)](https://star-history.com/#AsyncFuncAI/deepwiki-open&Date)
+

README.ja.md

Lines changed: 1 addition & 0 deletions
@@ -444,3 +444,4 @@ _DeepWiki の動作を見る!_
 ## ⭐ スター履歴

 [![スター履歴チャート](https://api.star-history.com/svg?repos=AsyncFuncAI/deepwiki-open&type=Date)](https://star-history.com/#AsyncFuncAI/deepwiki-open&Date)
+

README.kr.md

Lines changed: 1 addition & 0 deletions
@@ -427,3 +427,4 @@ DeepResearch를 사용하려면 질문 제출 전 Ask 인터페이스에서 "Dee
 ## ⭐ 스타 히스토리

 [![Star History Chart](https://api.star-history.com/svg?repos=AsyncFuncAI/deepwiki-open&type=Date)](https://star-history.com/#AsyncFuncAI/deepwiki-open&Date)
+

README.md

Lines changed: 14 additions & 0 deletions
@@ -241,6 +241,20 @@ The OpenAI Client's base_url configuration is designed primarily for enterprise

 **Coming Soon**: In future updates, DeepWiki will support a mode where users need to provide their own API keys in requests. This will allow enterprise customers with private channels to use their existing API arrangements without sharing credentials with the DeepWiki deployment.

+## 🧩 Using OpenAI-Compatible Embedding Models (e.g., Alibaba Qwen)
+
+If you want to use embedding models compatible with the OpenAI API (such as Alibaba Qwen), follow these steps:
+
+1. Replace the contents of `api/config/embedder.json` with those from `api/config/embedder_openai_compatible.json`.
+2. In your project root `.env` file, set the relevant environment variables, for example:
+```
+OPENAI_API_KEY=your_api_key
+OPENAI_API_BASE_URL=your_openai_compatible_endpoint
+```
+3. The program will automatically substitute placeholders in embedder.json with the values from your environment variables.
+
+This allows you to seamlessly switch to any OpenAI-compatible embedding service without code changes.
+
 ## 🛠️ Advanced Setup

 ### Environment Variables
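
As a rough illustration of what "OpenAI-compatible" means for the two variables in the README steps above, the sketch below builds (but does not send) an embeddings request using only the standard library. The helper name `build_embeddings_request` is hypothetical and not part of this commit; the `/embeddings` path and the request fields follow the OpenAI REST convention, and whether a given provider accepts `dimensions` is an assumption to verify against that provider's documentation.

```python
import json
import os
import urllib.request


def build_embeddings_request(texts, model="text-embedding-v3", dimensions=256):
    """Build (but do not send) a POST request for an OpenAI-style /embeddings endpoint.

    Hypothetical helper for illustration; the defaults mirror the values in the
    OpenAI-compatible embedder config added by this commit.
    """
    # Both variables are the ones named in the README; a KeyError here means
    # the .env file was not loaded or the variable is missing.
    api_key = os.environ["OPENAI_API_KEY"]
    base_url = os.environ["OPENAI_API_BASE_URL"].rstrip("/")

    payload = {
        "model": model,
        "input": list(texts),
        "dimensions": dimensions,      # assumption: the provider supports this field
        "encoding_format": "float",
    }
    return urllib.request.Request(
        url=f"{base_url}/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; that step is left out so the sketch stays free of network access.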

README.vi.md

Lines changed: 1 addition & 0 deletions
@@ -420,3 +420,4 @@ Dự án này được cấp phép theo Giấy phép MIT - xem file [LICENSE](LI
 ## ⭐ Lịch sử

 [![Biểu đồ lịch sử](https://api.star-history.com/svg?repos=AsyncFuncAI/deepwiki-open&type=Date)](https://star-history.com/#AsyncFuncAI/deepwiki-open&Date)
+

README.zh.md

Lines changed: 15 additions & 0 deletions
@@ -352,3 +352,18 @@ OpenAI 客户端的 base_url 配置主要为拥有私有 API 渠道的企业用
 - 支持与第三方 OpenAI API 兼容服务的集成

 **即将推出**:在未来的更新中,DeepWiki 将支持一种模式,用户需要在请求中提供自己的 API 密钥。这将允许拥有私有渠道的企业客户使用其现有的 API 安排,而不是与 DeepWiki 部署共享凭据。
+
+## 🧩 使用 OpenAI 兼容的 Embedding 模型(如阿里巴巴 Qwen)
+
+如果你希望使用 OpenAI 以外、但兼容 OpenAI 接口的 embedding 模型(如阿里巴巴 Qwen),请参考以下步骤:
+
+1.`api/config/embedder_openai_compatible.json` 的内容替换 `api/config/embedder.json`
+2. 在项目根目录的 `.env` 文件中,配置相应的环境变量,例如:
+```
+OPENAI_API_KEY=你的_api_key
+OPENAI_API_BASE_URL=你的_openai_兼容接口地址
+```
+3. 程序会自动用环境变量的值替换 embedder.json 里的占位符。
+
+这样即可无缝切换到 OpenAI 兼容的 embedding 服务,无需修改代码。
+

api/config.py

Lines changed: 35 additions & 2 deletions
@@ -1,8 +1,9 @@
 import os
 import json
 import logging
+import re
 from pathlib import Path
-from typing import List
+from typing import List, Union, Dict, Any

 logger = logging.getLogger(__name__)

@@ -48,6 +49,36 @@
     "BedrockClient": BedrockClient
 }

+def replace_env_placeholders(config: Union[Dict[str, Any], List[Any], str, Any]) -> Union[Dict[str, Any], List[Any], str, Any]:
+    """
+    Recursively replace placeholders like "${ENV_VAR}" in string values
+    within a nested configuration structure (dicts, lists, strings)
+    with environment variable values. Logs a warning if a placeholder is not found.
+    """
+    pattern = re.compile(r"\$\{([A-Z0-9_]+)\}")
+
+    def replacer(match: re.Match[str]) -> str:
+        env_var_name = match.group(1)
+        original_placeholder = match.group(0)
+        env_var_value = os.environ.get(env_var_name)
+        if env_var_value is None:
+            logger.warning(
+                f"Environment variable placeholder '{original_placeholder}' was not found in the environment. "
+                f"The placeholder string will be used as is."
+            )
+            return original_placeholder
+        return env_var_value
+
+    if isinstance(config, dict):
+        return {k: replace_env_placeholders(v) for k, v in config.items()}
+    elif isinstance(config, list):
+        return [replace_env_placeholders(item) for item in config]
+    elif isinstance(config, str):
+        return pattern.sub(replacer, config)
+    else:
+        # Handles numbers, booleans, None, etc.
+        return config
+
 # Load JSON configuration file
 def load_json_config(filename):
     try:
@@ -65,7 +96,9 @@ def load_json_config(filename):
             return {}

         with open(config_path, 'r') as f:
-            return json.load(f)
+            config = json.load(f)
+            config = replace_env_placeholders(config)
+            return config
     except Exception as e:
         logger.error(f"Error loading configuration file {filename}: {str(e)}")
         return {}
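
To make the behaviour of the new helper concrete, here is a small self-contained sketch. It restates `replace_env_placeholders` in condensed form (same pattern and branching as the diff above, shortened log message) and runs it on a hypothetical config. The variable names `DEMO_API_KEY` and `DEMO_UNSET` are made up for the example. It highlights three properties that follow from the code: unset variables leave the placeholder in place, lower-case names are not matched by `[A-Z0-9_]+`, and non-string values pass through unchanged.

```python
import logging
import os
import re
from typing import Any

logger = logging.getLogger(__name__)

# Same pattern as the commit: upper-case letters, digits and underscores only.
PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")


def replace_env_placeholders(config: Any) -> Any:
    """Condensed restatement of the helper added in this commit."""

    def replacer(match: re.Match) -> str:
        value = os.environ.get(match.group(1))
        if value is None:
            logger.warning("Placeholder %s is not set; leaving it as is.", match.group(0))
            return match.group(0)
        return value

    if isinstance(config, dict):
        return {k: replace_env_placeholders(v) for k, v in config.items()}
    if isinstance(config, list):
        return [replace_env_placeholders(item) for item in config]
    if isinstance(config, str):
        return PLACEHOLDER.sub(replacer, config)
    return config  # numbers, booleans, None


# Hypothetical values, set here only so the example is reproducible.
os.environ["DEMO_API_KEY"] = "sk-demo"
os.environ.pop("DEMO_UNSET", None)

resolved = replace_env_placeholders({
    "api_key": "${DEMO_API_KEY}",             # set: replaced by the value
    "base_url": "https://${DEMO_UNSET}/v1",   # unset: placeholder survives
    "lower": "${demo_api_key}",               # lower-case: pattern does not match
    "batch_size": 10,                         # non-strings pass through
    "tags": ["${DEMO_API_KEY}", 1],           # lists are walked element by element
})
print(resolved)
```

Note that only values are rewritten; dictionary keys are copied as they are.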

api/config/embedder.json.bak

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
+{
+  "embedder": {
+    "client_class": "OpenAIClient",
+    "batch_size": 500,
+    "model_kwargs": {
+      "model": "text-embedding-3-small",
+      "dimensions": 256,
+      "encoding_format": "float"
+    }
+  },
+  "embedder_ollama": {
+    "client_class": "OllamaClient",
+    "model_kwargs": {
+      "model": "nomic-embed-text"
+    }
+  },
+  "retriever": {
+    "top_k": 20
+  },
+  "text_splitter": {
+    "split_by": "word",
+    "chunk_size": 350,
+    "chunk_overlap": 100
+  }
+}

api/config/embedder_openai_compatible.json

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+{
+  "embedder": {
+    "client_class": "OpenAIClient",
+    "initialize_kwargs": {
+      "api_key": "${OPENAI_API_KEY}",
+      "base_url": "${OPENAI_API_BASE_URL}"
+    },
+    "batch_size": 10,
+    "model_kwargs": {
+      "model": "text-embedding-v3",
+      "dimensions": 256,
+      "encoding_format": "float"
+    }
+  },
+  "embedder_ollama": {
+    "client_class": "OllamaClient",
+    "model_kwargs": {
+      "model": "nomic-embed-text"
+    }
+  },
+  "retriever": {
+    "top_k": 20
+  },
+  "text_splitter": {
+    "split_by": "word",
+    "chunk_size": 350,
+    "chunk_overlap": 100
+  }
+}
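
Because the loader in `api/config.py` only logs a warning and keeps the literal `${...}` string when a variable is unset, a config like the one above could end up passing the text `${OPENAI_API_KEY}` to the client as if it were a key. One possible safeguard, sketched below, is to scan a loaded config for placeholders that survived substitution. The function `find_unresolved` is hypothetical and not part of this commit; the JSON fragment is a shortened copy of the config shown above.

```python
import json
import re
from typing import Any, List, Tuple

# Same placeholder syntax the commit's loader recognises.
PLACEHOLDER = re.compile(r"\$\{([A-Z0-9_]+)\}")


def find_unresolved(config: Any, path: str = "") -> List[Tuple[str, str]]:
    """Return (location, variable name) pairs for placeholders still present in string values."""
    if isinstance(config, dict):
        found = []
        for key, value in config.items():
            found += find_unresolved(value, f"{path}.{key}" if path else key)
        return found
    if isinstance(config, list):
        found = []
        for index, item in enumerate(config):
            found += find_unresolved(item, f"{path}[{index}]")
        return found
    if isinstance(config, str):
        return [(path, name) for name in PLACEHOLDER.findall(config)]
    return []  # numbers, booleans, None carry no placeholders


# Simulates a config loaded while neither variable was set.
loaded = json.loads("""
{
  "embedder": {
    "client_class": "OpenAIClient",
    "initialize_kwargs": {
      "api_key": "${OPENAI_API_KEY}",
      "base_url": "${OPENAI_API_BASE_URL}"
    },
    "batch_size": 10
  }
}
""")

missing = find_unresolved(loaded)
for location, name in missing:
    print(f"{location}: environment variable {name} is not set")
```

A caller could raise an error when `missing` is non-empty instead of continuing with an unusable key; whether that is preferable to the commit's warn-and-continue behaviour is a design choice for the project.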

api/data_pipeline.py

Lines changed: 3 additions & 8 deletions
@@ -15,6 +15,8 @@
 from api.ollama_patch import OllamaDocumentProcessor
 from urllib.parse import urlparse, urlunparse, quote

+from api.tools.embedder import get_embedder
+
 # Configure logging
 logger = logging.getLogger(__name__)

@@ -366,14 +368,7 @@ def prepare_data_pipeline(is_ollama_embedder: bool = None):
     splitter = TextSplitter(**configs["text_splitter"])
     embedder_config = get_embedder_config()

-    if not embedder_config:
-        raise ValueError("No embedder configuration found")
-
-    # Create embedder based on configuration
-    embedder = adal.Embedder(
-        model_client=embedder_config["model_client"](),
-        model_kwargs=embedder_config["model_kwargs"],
-    )
+    embedder = get_embedder(is_local_ollama=is_ollama_embedder)

     if is_ollama_embedder:
         # Use Ollama document processor for single-document processing
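
The body of `get_embedder` lives in `api/tools/embedder.py`, which is not shown in this excerpt. The sketch below is therefore only a guess at the general shape such a helper might take, using stand-in classes so it runs without the project's dependencies. `StubClient`, `StubEmbedder`, and the `CONFIGS` dictionary are all hypothetical. The point it illustrates is the one difference from the removed inline code: the client is constructed with the optional `initialize_kwargs` section rather than with no arguments.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class StubClient:
    """Stand-in for a model client such as OpenAIClient (hypothetical, for illustration)."""
    api_key: str = ""
    base_url: str = ""


@dataclass
class StubEmbedder:
    """Stand-in for adal.Embedder; it only records what it was built with."""
    model_client: Any
    model_kwargs: Dict[str, Any] = field(default_factory=dict)


# Hypothetical, already-resolved configuration shaped like the JSON files in this commit.
CONFIGS = {
    "embedder": {
        "model_client": StubClient,
        "initialize_kwargs": {"api_key": "sk-demo", "base_url": "https://example.invalid/v1"},
        "model_kwargs": {"model": "text-embedding-v3", "dimensions": 256},
    },
    "embedder_ollama": {
        "model_client": StubClient,
        "model_kwargs": {"model": "nomic-embed-text"},
    },
}


def get_embedder(is_local_ollama: bool = False) -> StubEmbedder:
    """Pick a config section, build its client with optional initialize_kwargs, wrap it."""
    section = CONFIGS["embedder_ollama" if is_local_ollama else "embedder"]
    # Sections without initialize_kwargs fall back to the client's defaults.
    client = section["model_client"](**section.get("initialize_kwargs", {}))
    return StubEmbedder(model_client=client, model_kwargs=section["model_kwargs"])


embedder = get_embedder()
print(embedder.model_client.base_url, embedder.model_kwargs["model"])
```

Centralising construction in one function also means the call site in `prepare_data_pipeline` no longer needs to know which keys a config section contains.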
