How to Resolve Compatibility and Runtime Issues During Integration of XSched with XpuOS/llama.cpp? #23
Description
Dear XSched Development Team,
During testing of the integration between your team's modified XpuOS/llama.cpp (adapted for XSched) and XSched, I successfully compiled the code on an NVIDIA A100 GPU (Ampere architecture, compute capability 8.0). However, multiple runtime exceptions occur during execution, blocking further integration testing. I am reporting the detailed issues, abnormal behaviors, and relevant context below and sincerely request your support and guidance.
1. Integration Environment
- Hardware: NVIDIA A100 GPU (only supports Level 1 preemption, corresponding to the enum `kPreemptLevelBlock`).
- Software: Your team's modified XpuOS/llama.cpp; the XSched scheduler.
- Current Status: Code compiled successfully; XSched loads normally; basic XQueue creation and priority management work. However, the process crashes immediately when executing inference tasks.
2. Specific Issues & Abnormal Behaviors
2.1 Hardcoded Preemption Level Incompatible with A100
In your modified XpuOS/llama.cpp, the preemption level is hardcoded to Level 2 (`kPreemptLevelDeactivate`, which is an enum constant specifying the queue preemption level, not a function).
However, the NVIDIA A100 only supports Level 1 preemption (`kPreemptLevelBlock`) and does NOT support Level 2.
This mismatch causes:
- `cuda error 907: operation not permitted`
- followed by `segmentation fault (core dumped)`
- the process crashes instantly when running inference.
2.2 Missing Automatic Hardware Adaptation; Environment Variable Not Effective
XpuOS/llama.cpp currently lacks automatic detection/adaptation logic for preemption levels.
It only uses the hardcoded Level 2 and cannot automatically fall back to Level 1 based on hardware capabilities.
I attempted to manually set the environment variable `XSCHED_AUTO_XQUEUE_LEVEL=1`, but it has no effect on explicit queue creation in XpuOS/llama.cpp: Level 2 is still enforced. Switching levels currently requires modifying the source code by hand, which severely reduces integration efficiency.
2.3 Conflict Between CUDA Graph and XSched Event Recording
XpuOS/llama.cpp enables CUDA Graph for batch processing to improve inference performance.
However, XSched inserts scheduling events during GPU task execution, leading to a conflict.
Symptom: while CUDA Graph capture is in progress, XSched attempts to record scheduling events on the captured stream, an operation the capture mechanism does not allow. This exacerbates runtime instability, crashes the process, and prevents inference from completing.
3. Questions & Requests
Regarding the above integration issues, I would appreciate clarification and solutions:
- Could you provide a patch that makes the hardcoded Level 2 preemption in XpuOS/llama.cpp configurable, or add automatic detection of the hardware's supported preemption level, so that GPUs that only support Level 1 (such as the A100) are handled correctly?
- How can `XSCHED_AUTO_XQUEUE_LEVEL=1` be made to actually take effect during queue creation in XpuOS/llama.cpp, without manual code changes?
- For the CUDA Graph vs. XSched event-recording conflict, is there a feasible solution (e.g., disabling specific features or adjusting the event-recording logic) to ensure stable inference execution?
Thank you for your excellent work. I look forward to your reply and your support in completing the integration testing.