Description
I tried running DJL XGBoost inference on GPU; however, it internally seems to be falling back to the CPU predictor.
Expected Behavior
It should have run on the GPU. The following flame graph shows inference running on the CPU instead:
How to Reproduce?
POM dependencies:
<dependency>
    <groupId>ai.djl.ml.xgboost</groupId>
    <artifactId>xgboost-gpu</artifactId>
    <version>0.29.0</version>
</dependency>
<dependency>
    <groupId>ai.djl.pytorch</groupId>
    <artifactId>pytorch-native-cu121</artifactId>
    <classifier>linux-x86_64</classifier>
    <scope>runtime</scope>
    <version>2.1.2</version>
</dependency>
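For what it's worth, I can also run a small check to confirm which XGBoost engine build is loaded and how many GPUs DJL reports at runtime. This is just a sketch; it assumes Engine.getEngine("XGBoost") and Engine.getGpuCount() behave the way I expect with the artifacts above:

import ai.djl.Device;
import ai.djl.engine.Engine;

public class XgbGpuCheck {
    public static void main(String[] args) {
        // Resolve the XGBoost engine that the xgboost-gpu artifact registers.
        Engine engine = Engine.getEngine("XGBoost");
        System.out.println("Engine: " + engine.getEngineName() + " " + engine.getVersion());
        // getGpuCount() reports how many CUDA devices DJL can see; 0 would explain a CPU fallback.
        System.out.println("GPU count: " + engine.getGpuCount());
        // Device.gpu() only builds a device handle; it does not probe the hardware.
        System.out.println("Default GPU device: " + Device.gpu());
    }
}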
Code block used for model init
Path modelDir = Paths.get(modelInternalParametersFile).getParent();
try {
    Device device;
    Translator<DenseFeature[], BatchModelOutput> translator = new XGBoostTranslator();
    try {
        // Attempt to get the GPU device.
        device = Device.gpu();
        log.info("GPU device found. Initializing XGBoost model on GPU.");
    } catch (Exception e) {
        log.error("No GPU device found or DJL CUDA engine not configured. Falling back to CPU.", e);
        e.printStackTrace();
        device = Device.cpu();
    }
    // Use the Criteria API to load the model with our custom translator.
    Criteria<DenseFeature[], BatchModelOutput> criteria = Criteria.builder()
            .setTypes(DenseFeature[].class, BatchModelOutput.class)
            .optDevice(device)
            .optEngine("XGBoost")
            .optModelPath(modelDir)
            .optTranslator(translator)
            .optModelName(MODEL_INTERNAL_PARAMETER)
            .build();
    // Load the model and create a single, reusable, thread-safe predictor.
    this.model = criteria.loadModel();
    this.predictor = model.newPredictor(translator);
    log.info("DjlXGBoostModel initialized successfully on device: {}.", model.getNDManager().getDevice());
} catch (Exception e) {
    log.error("Failed to load DJL XGBoost model from path: {}", modelDir, e);
    throw new RuntimeException("Could not initialize DjlXGBoostModel", e);
}
This init block does not throw any exception or log any error, which suggests it is able to detect the GPU.
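As far as I understand, Device.gpu() only returns a device handle and does not actually probe for CUDA, so the catch block in the snippet above may never trigger even when no GPU is usable. A more explicit fallback I considered looks roughly like this (a sketch, assuming Engine.getGpuCount() from ai.djl.engine.Engine):

// Sketch: choose the device based on the GPU count reported by the XGBoost engine,
// rather than relying on Device.gpu() throwing (it does not inspect the hardware).
Engine xgbEngine = Engine.getEngine("XGBoost");
Device device;
if (xgbEngine.getGpuCount() > 0) {
    device = Device.gpu();
    log.info("XGBoost engine reports {} GPU(s); using {}.", xgbEngine.getGpuCount(), device);
} else {
    device = Device.cpu();
    log.warn("XGBoost engine reports no GPUs; falling back to CPU.");
}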
Environment Info
CUDA toolkit: cuda-repo-debian12-12-8-local_12.8.0-570.86.10-1_amd64.deb.1
In the Dockerfile:
ENV PATH /usr/local/cuda-12.8/bin:/usr/local/nvidia/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/local/cuda-12.8/lib64:/usr/local/nvidia/lib64:${LD_LIBRARY_PATH}
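Inside the container I can also print the environment the JVM actually sees, plus the GPU count DJL's CUDA helper reports (a sketch; I am assuming ai.djl.util.cuda.CudaUtils.getGpuCount() is the right call):

import ai.djl.util.cuda.CudaUtils;

public class CudaEnvCheck {
    public static void main(String[] args) {
        // Confirm the CUDA paths from the Dockerfile are visible to the JVM.
        System.out.println("PATH=" + System.getenv("PATH"));
        System.out.println("LD_LIBRARY_PATH=" + System.getenv("LD_LIBRARY_PATH"));
        // CudaUtils loads libcudart and reports how many devices it can see.
        System.out.println("CUDA devices visible to DJL: " + CudaUtils.getGpuCount());
    }
}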
nvidia-smi output:
Within the same environment I am able to run the DJL PyTorch engine on the GPU; however, XGBoost seems to be falling back to the CPU.
I wanted to understand why I am not able to schedule XGBoost on the GPU. Does this indicate a potential bug, or a misconfiguration on our part?