
How to Resolve Compatibility and Runtime Issues During Integration of XSched with XpuOS/llama.cpp? #23

@kylin1019

Description


Dear XSched Development Team,

During testing of the integration between your team's modified XpuOS/llama.cpp (adapted for XSched) and XSched, I successfully compiled the code on an NVIDIA A100 GPU (Ampere architecture, compute capability 8.0). However, multiple runtime exceptions occur during execution, blocking further integration testing. I am reporting the detailed issues, abnormal behaviors, and relevant context below and sincerely request your support and guidance.

1. Integration Environment

  1. Hardware: NVIDIA A100 GPU (only supports Level 1 preemption, corresponding to the enum kPreemptLevelBlock).
  2. Software: Your team's modified XpuOS/llama.cpp, XSched scheduler.
  3. Current Status: Code compiled successfully; XSched loads normally; basic XQueue creation and priority management work. However, the process crashes immediately when executing inference tasks.

2. Specific Issues & Abnormal Behaviors

2.1 Hardcoded Preemption Level Incompatible with A100

In your modified XpuOS/llama.cpp, the preemption level is hardcoded to Level 2 via kPreemptLevelDeactivate (note: this is not a function but an enum constant that selects Level 2 preemption for a queue).
However, the NVIDIA A100 only supports Level 1 preemption (kPreemptLevelBlock) and does NOT support Level 2.

This mismatch causes:

  • CUDA error 907 (operation not permitted)
  • Followed by a segmentation fault (core dumped)
  • The process crashes immediately when inference starts.

2.2 Missing Automatic Hardware Adaptation; Environment Variable Not Effective

XpuOS/llama.cpp currently lacks automatic detection/adaptation logic for preemption levels.
It only uses the hardcoded Level 2 and cannot automatically fall back to Level 1 based on hardware capabilities.

I attempted to manually set the environment variable:

XSCHED_AUTO_XQUEUE_LEVEL=1

But this variable has no effect on the explicit queue creation in XpuOS/llama.cpp — Level 2 is still enforced. Switching levels currently requires modifying the source code, which severely slows down integration.

2.3 Conflict Between CUDA Graph and XSched Event Recording

XpuOS/llama.cpp enables CUDA Graph for batch processing to improve inference performance.
However, XSched inserts scheduling events during GPU task execution, leading to a conflict.

Symptom:
While a CUDA Graph capture is active on a stream, XSched still attempts to record scheduling events on that stream; such event operations are not permitted during graph capture. This invalidates the capture, destabilizes the run, and crashes the process before inference can complete.

3. Questions & Requests

Regarding the above integration issues, I would appreciate clarification and solutions:

  1. Could you provide a patch or modify the source code to make the hardcoded Level 2 preemption in XpuOS/llama.cpp configurable, or add automatic hardware preemption level detection logic to achieve compatibility with GPUs that only support Level 1, such as the A100?
  2. How can XSCHED_AUTO_XQUEUE_LEVEL=1 be made to actually take effect during queue creation in XpuOS/llama.cpp, without manual code changes?
  3. For the CUDA Graph vs. XSched event conflict, is there a feasible solution (e.g., disable specific features, adjust event recording logic) to ensure stable inference execution?

Thank you for your excellent work. I look forward to your reply and support so that I can complete the integration testing.
