
Commit 28e4a40

docs: refine readme to improve readability.

1 parent 21962a3 commit 28e4a40

30 files changed: +900 -1268 lines changed

README.md

Lines changed: 8 additions & 102 deletions
@@ -79,108 +79,13 @@ limitations under the License. -->
 
 ---
 
-## 3. Code Architecture
-```
-├── xllm/
-| : main source folder
-│ ├── api_service/ # code for api services
-│ ├── core/
-│ │ : xllm core features folder
-│ │ ├── common/
-│ │ ├── distributed_runtime/ # code for distributed and pd serving
-│ │ ├── framework/ # code for execution orchestration
-│ │ ├── kernels/ # adaption for npu kernels adaption
-│ │ ├── layers/ # model layers impl
-│ │ ├── platform/ # adaption for various platform
-│ │ ├── runtime/ # code for worker and executor
-│ │ ├── scheduler/ # code for batch and pd scheduler
-│ │ └── util/
-│ ├── function_call # code for tool call parser
-│ ├── models/ # models impl
-│ ├── processors/ # code for vlm pre-processing
-│ ├── proto/ # communication protocol
-│ ├── pybind/ # code for python bind
-| └── server/ # xLLM server
-├── examples/ # examples of calling xLLM
-├── tools/ # code for npu time generations
-└── xllm.cpp # entrypoint of xLLM
-```
-
-Supported models list:
-- DeepSeek-V3/R1
-- DeepSeek-R1-Distill-Qwen
-- Kimi-k2
-- Llama2/3
-- MiniCPM-V
-- MiMo-VL
-- Qwen2/2.5/QwQ
-- Qwen2.5-VL
-- Qwen3 / Qwen3-MoE
-- Qwen3-VL / Qwen3-VL-MoE
-- GLM4.5 / GLM4.6 / GLM-4.6V / GLM-4.7
-- VLM-R1
-
----
-
-## 4. Quick Start
-#### Installation
-First, download the image we provide:
-```bash
-# A2 x86
-docker pull xllm/xllm-ai:xllm-dev-hb-rc2-x86
-# A2 arm
-docker pull xllm/xllm-ai:xllm-dev-hb-rc2-arm
-# A3 arm
-docker pull xllm/xllm-ai:xllm-dev-hc-rc2-arm
-# or
-# A2 x86
-docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-hb-rc2-x86
-# A2 arm
-docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-hb-rc2-arm
-# A3 arm
-docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-hc-rc2-arm
-```
-Then create the corresponding container:
-```bash
-sudo docker run -it --ipc=host -u 0 --privileged --name mydocker --network=host --device=/dev/davinci0 --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc -v /var/queue_schedule:/var/queue_schedule -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi -v /usr/local/sbin/:/usr/local/sbin/ -v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf -v /var/log/npu/slog/:/var/log/npu/slog -v /export/home:/export/home -w /export/home -v ~/.ssh:/root/.ssh -v /var/log/npu/profiling/:/var/log/npu/profiling -v /var/log/npu/dump/:/var/log/npu/dump -v /home/:/home/ -v /runtime/:/runtime/ -v /etc/hccn.conf:/etc/hccn.conf xllm/xllm-ai:xllm-dev-hb-rc2-x86
-```
+## 3. Quick Start
 
-Install official repo and submodules:
-```bash
-git clone https://github.com/jd-opensource/xllm
-cd xllm
-git submodule init
-git submodule update
-```
-The compilation depends on [vcpkg](https://github.com/microsoft/vcpkg). The Docker image already includes VCPKG_ROOT preconfigured. If you want to manually set it up, you can:
-```bash
-git clone https://gitcode.com/xLLM-AI/vcpkg.git
-cd vcpkg && git checkout ffc42e97c866ce9692f5c441394832b86548422c
-export VCPKG_ROOT=/your/path/to/vcpkg
-```
-
-#### Compilation
-When compiling, generate executable files `build/xllm/core/server/xllm` under `build/`:
-```bash
-python setup.py build
-```
-Or, compile directly using the following command to generate the whl package under `dist/`:
-```bash
-python setup.py bdist_wheel
-```
-
-#### Launch
-Run the following command to start xLLM engine:
-```bash
-./build/xllm/core/server/xllm \ # launch xllm server
---model=/path/to/your/llm \ # model path(to replace with your own path)
---port=9977 \ # set service port to 9977
---max_memory_utilization 0.90 # set the maximal utilization of device memory
-```
+Please refer to [Quick Start](docs/en/getting_started/quick_start.md) for more details. Besides, please check the model support status at [Model Support List](docs/en/supported_models.md).
 
 ---
 
-## 5. Contributing
+## 4. Contributing
 There are several ways you can contribute to xLLM:
 
 1. Reporting Issues (Bugs & Errors)
@@ -200,14 +105,14 @@ If you have problems about development, please check our document: **[Document](
 
 ---
 
-## 6. Community & Support
+## 5. Community & Support
 If you encounter any issues along the way, you are welcomed to submit reproducible steps and log snippets in the project's Issues area, or contact the xLLM Core team directly via your internal Slack. In addition, we have established official WeChat groups. You can access the following QR code to join. Welcome to contact us!
 
 <div align="center">
 <img src="docs/assets/wechat_qrcode.jpg" alt="qrcode3" width="50%" />
 </div>
 
-## 7. Acknowledgment
+## 6. Acknowledgment
 
 This project was made possible thanks to the following open-source projects:
 - [ScaleLLM](https://github.com/vectorch-ai/ScaleLLM) - xLLM draws inspiration from ScaleLLM's graph construction method and references its runtime execution.
@@ -217,6 +122,7 @@ This project was made possible thanks to the following open-source projects:
 - [safetensors](https://github.com/huggingface/safetensors) - xLLM relies on the C binding safetensors capability.
 - [Partial JSON Parser](https://github.com/promplate/partial-json-parser) - Implement xLLM's C++ JSON parser with insights from Python and Go implementations.
 - [concurrentqueue](https://github.com/cameron314/concurrentqueue) - A fast multi-producer, multi-consumer lock-free concurrent queue for C++11.
+- [Flashinfer](https://github.com/flashinfer-ai/flashinfer) - High-performance NVIDIA GPU kernels.
 
 
 Thanks to the following collaborating university laboratories:
@@ -235,13 +141,13 @@ Thanks to all the following [developers](https://github.com/jd-opensource/xllm/g
 
 ---
 
-## 8. License
+## 7. License
 [Apache License](LICENSE)
 
 #### xLLM is provided by JD.com
 #### Thanks for your Contributions!
 
-## 9. Citation
+## 8. Citation
 
 If you think this repository is helpful to you, welcome to cite us:
 ```
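
For reference, the quick-start workflow that this commit moves out of README.md condenses to the sketch below, assembled from the deleted lines above; the model path is a placeholder, and `--max_memory_utilization 0.90` caps device-memory usage at 90% as the removed comments describe.

```bash
# Clone the repository together with its submodules (from the removed Installation steps).
git clone https://github.com/jd-opensource/xllm
cd xllm
git submodule init
git submodule update

# Build the server; the executable is generated at build/xllm/core/server/xllm.
python setup.py build

# Launch the xLLM engine (replace /path/to/your/llm with an actual model path).
./build/xllm/core/server/xllm \
  --model=/path/to/your/llm \
  --port=9977 \
  --max_memory_utilization 0.90
```

The full, maintained version of these steps now lives in [Quick Start](docs/en/getting_started/quick_start.md).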

README_zh.md

Lines changed: 8 additions & 104 deletions
@@ -74,112 +74,15 @@ xLLM provides powerful intelligent computing capability, combining hardware-level compute optimization with
 - Speculative inference optimization, using multi-core parallelism to improve efficiency;
 - Dynamic load balancing of MoE experts, enabling efficient adjustment of expert distribution.
 
-
 ---
 
-## 3. Code Structure
-```
-├── xllm/
-| : main source directory
-│ ├── api_service/ # API service implementation
-│ ├── core/
-│ │ : xllm core features directory
-│ │ ├── common/
-│ │ ├── distributed_runtime/ # distributed and PD serving implementation
-│ │ ├── framework/ # engine execution modules
-│ │ ├── kernels/ # kernel adaptation for domestic chips
-│ │ ├── layers/ # model layer implementations
-│ │ ├── platform/ # multi-platform compatibility layer
-│ │ ├── runtime/ # worker/executor role implementations
-│ │ ├── scheduler/ # batch scheduling and PD scheduling
-│ │ └── util/
-│ ├── function_call # function call implementation
-│ ├── models/ # model implementations
-│ ├── processors/ # pre-processing for multimodal models
-│ ├── proto/ # communication protocols
-│ ├── pybind/ # python bindings
-| └── server/ # xLLM server instance
-├── examples/ # examples of calling the service
-├── tools/ # NPU timeline generation tools
-└── xllm.cpp # xLLM entrypoint
-```
+## 3. Quick Start
 
-Currently supported models:
-- DeepSeek-V3/R1
-- DeepSeek-R1-Distill-Qwen
-- Kimi-k2
-- Llama2/3
-- MiniCPM-V
-- MiMo-VL
-- Qwen2/2.5/QwQ
-- Qwen2.5-VL
-- Qwen3 / Qwen3-MoE
-- Qwen3-VL / Qwen3-VL-MoE
-- GLM-4.5 / GLM-4.6 / GLM-4.6V / GLM-4.7
-- VLM-R1
-
----
-
-
-## 4. Quick Start
-#### Installation
-First, download the image we provide:
-```bash
-# A2 x86
-docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-hb-rc2-x86
-# A2 arm
-docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-hb-rc2-arm
-# A3 arm
-docker pull quay.io/jd_xllm/xllm-ai:xllm-dev-hc-rc2-arm
-# or
-# A2 x86
-docker pull xllm/xllm-ai:xllm-dev-hb-rc2-x86
-# A2 arm
-docker pull xllm/xllm-ai:xllm-dev-hb-rc2-arm
-# A3 arm
-docker pull xllm/xllm-ai:xllm-dev-hc-rc2-arm
-```
-Then create the corresponding container:
-```bash
-sudo docker run -it --ipc=host -u 0 --privileged --name mydocker --network=host --device=/dev/davinci0 --device=/dev/davinci_manager --device=/dev/devmm_svm --device=/dev/hisi_hdc -v /var/queue_schedule:/var/queue_schedule -v /usr/local/Ascend/driver:/usr/local/Ascend/driver -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi -v /usr/local/sbin/:/usr/local/sbin/ -v /var/log/npu/conf/slog/slog.conf:/var/log/npu/conf/slog/slog.conf -v /var/log/npu/slog/:/var/log/npu/slog -v /export/home:/export/home -w /export/home -v ~/.ssh:/root/.ssh -v /var/log/npu/profiling/:/var/log/npu/profiling -v /var/log/npu/dump/:/var/log/npu/dump -v /home/:/home/ -v /runtime/:/runtime/ -v /etc/hccn.conf:/etc/hccn.conf xllm/xllm-ai:xllm-dev-hb-rc2-x86
-```
-
-Clone the official repository and its submodule dependencies:
-```bash
-git clone https://github.com/jd-opensource/xllm
-cd xllm
-git submodule init
-git submodule update
-```
-Compilation depends on [vcpkg](https://github.com/microsoft/vcpkg), which is already preconfigured in the image. To set it up manually, run:
-```bash
-git clone https://gitcode.com/xLLM-AI/vcpkg.git
-cd vcpkg && git checkout ffc42e97c866ce9692f5c441394832b86548422c
-export VCPKG_ROOT=/your/path/to/vcpkg
-```
-
-#### Compilation
-Run the build to generate the executable `build/xllm/core/server/xllm` under `build/`:
-```bash
-python setup.py build
-```
-Or compile directly with the following command to generate a whl package under `dist/`:
-```bash
-python setup.py bdist_wheel
-```
-
-#### Launch
-Run a command such as the following to start the xllm engine:
-```bash
-./build/xllm/core/server/xllm \ # launch the xllm server
---model=/path/to/your/llm \ # model path (replace with your actual path)
---port=9977 \ # set the service port to 9977
---max_memory_utilization 0.90 # set maximum device-memory utilization to 90%
-```
+Please refer to the [Quick Start guide](docs/zh/getting_started/quick_start.md). In addition, check the model support status in the [Supported Models list](docs/zh/supported_models.md).
 
 ---
 
-## 5. Becoming a Contributor
+## 4. Becoming a Contributor
 You can contribute to xLLM in the following ways:
 
 1. Report issues (bugs and errors) in Issues
@@ -199,7 +102,7 @@ python setup.py bdist_wheel
 
 ---
 
-## 6. Community & Support
+## 5. Community & Support
 If you run into any problems while developing or using xLLM, feel free to submit reproducible steps or log snippets in the project's Issues area.
 If you have an internal corporate Slack, contact the xLLM Core team directly. We have also set up official WeChat groups; scan the QR code below to join. Feel free to reach out to us:
 
@@ -209,7 +112,7 @@ python setup.py bdist_wheel
 
 ---
 
-## 7. Acknowledgements
+## 6. Acknowledgements
 This project was made possible by the following open-source projects:
 
 - [ScaleLLM](https://github.com/vectorch-ai/ScaleLLM) - Adopts ScaleLLM's graph-construction approach and draws on its runtime execution.
@@ -219,6 +122,7 @@ python setup.py bdist_wheel
 - [safetensors](https://github.com/huggingface/safetensors) - Relies on its C-binding safetensors capability.
 - [Partial JSON Parser](https://github.com/promplate/partial-json-parser) - xLLM's C++ JSON parser draws on the design of the Python and Go implementations.
 - [concurrentqueue](https://github.com/cameron314/concurrentqueue) - A high-performance lock-free queue.
+- [Flashinfer](https://github.com/flashinfer-ai/flashinfer) - High-performance NVIDIA GPU kernels.
 
 Thanks to the following collaborating university laboratories:
 
@@ -236,14 +140,14 @@ python setup.py bdist_wheel
 
 ---
 
-## 8. License
+## 7. License
 
 [Apache License](LICENSE)
 
 #### xLLM is provided by JD.com
 #### Thank you for your interest in and contributions to xLLM!
 
-## 9. Citation
+## 8. Citation
 
 If you find this repository helpful, please cite us:
 ```

docs/en/dev_guide/code-arch.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Code Architecture
+
+```
+├── xllm/
+| : main source folder
+│ ├── api_service/ # code for api services
+│ ├── core/
+│ │ : xllm core features folder
+│ │ ├── common/
+│ │ ├── distributed_runtime/ # code for distributed and pd serving
+│ │ ├── framework/ # code for execution orchestration
+│ │ ├── kernels/ # adaption for npu kernels adaption
+│ │ ├── layers/ # model layers impl
+│ │ ├── platform/ # adaption for various platform
+│ │ ├── runtime/ # code for worker and executor
+│ │ ├── scheduler/ # code for batch and pd scheduler
+│ │ └── util/
+│ ├── function_call # code for tool call parser
+│ ├── models/ # models impl
+│ ├── processors/ # code for vlm pre-processing
+│ ├── proto/ # communication protocol
+│ ├── pybind/ # code for python bind
+| └── server/ # xLLM server
+├── examples/ # examples of calling xLLM
+├── tools/ # code for npu time generations
+└── xllm.cpp # entrypoint of xLLM
+```

docs/en/features/disagg_pd.md

Lines changed: 4 additions & 4 deletions
@@ -38,8 +38,8 @@ ENABLE_DECODE_RESPONSE_TO_SERVICE=true ./xllm_master_serving --etcd_addr="127.0.
 3. Start xLLM
 - Taking Qwen2-7B as an example
 - Start Prefill Instance
-``` shell linenums="1" hl_lines="10"
-./xllm --model=Qwen2-7B-Instruct \
+```bash
+/path/to/xllm --model=Qwen2-7B-Instruct \
 --port=8010 \
 --devices="npu:0" \
 --master_node_addr="127.0.0.1:18888" \
@@ -54,8 +54,8 @@ ENABLE_DECODE_RESPONSE_TO_SERVICE=true ./xllm_master_serving --etcd_addr="127.0.
 --nnodes=1
 ```
 - Start Decode Instance
-```shell linenums="1" hl_lines="11"
-./xllm --model=Qwen2-7B-Instruct \
+```bash
+/path/to/xllm --model=Qwen2-7B-Instruct \
 --port=8020 \
 --devices="npu:1" \
 --master_node_addr="127.0.0.1:18898" \
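
Together, the two hunks illustrate the disaggregated prefill/decode pattern: one xllm process per role, each with its own port, NPU device, and master node address. Below is a minimal sketch of that pattern using only the flags visible in this excerpt; the diff elides further options between the lines shown, and the binary location and the `--nnodes=1` flag on the decode side are assumptions carried over from the prefill block.

```bash
# Prefill instance: handles the prompt-processing (prefill) phase on npu:0.
/path/to/xllm --model=Qwen2-7B-Instruct \
  --port=8010 \
  --devices="npu:0" \
  --master_node_addr="127.0.0.1:18888" \
  --nnodes=1

# Decode instance: handles the token-generation (decode) phase on npu:1.
/path/to/xllm --model=Qwen2-7B-Instruct \
  --port=8020 \
  --devices="npu:1" \
  --master_node_addr="127.0.0.1:18898" \
  --nnodes=1
```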
