Implement compute domain DRA plugin with state management #153

Merged: enoodle merged 3 commits into main from erez/compute-domain-dra-plugin-implementation on Jan 15, 2026

enoodle (Contributor) commented Jan 12, 2026

Summary

This PR adds the full implementation of the compute domain DRA plugin, building on the skeleton from #154.

Changes

  • State management (state.go): Manages prepared claims and domain info with checkpoint persistence for crash recovery
  • CDI handling (cdi.go): Generates Container Device Interface specs for device exposure to containers
  • Checkpoint persistence (checkpoint.go): Saves/restores plugin state across restarts
  • Device node simulation (nvcdi_device.go): Creates fake device nodes via symlinks to /dev/null
  • Health check server (health.go): gRPC health check for liveness/readiness probes
  • Driver integration (driver.go): Wires up state management, implements Prepare/Unprepare methods
  • Helm templates: Added OpenShift SCC annotation for hostmount-anyuid
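The state management and checkpoint bullets describe persisting prepared claims so the plugin can recover after a crash. The idea can be sketched roughly as follows, using only the Go standard library rather than the kubelet checkpoint manager that this PR pulls in. The type names (`PreparedClaim`, `Checkpoint`), the JSON layout, the file name, and the claim and device names are all assumptions made for illustration and are not taken from the PR.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// PreparedClaim is a hypothetical record of one prepared resource claim.
type PreparedClaim struct {
	Devices []string `json:"devices"`
}

// Checkpoint is a hypothetical on-disk layout: the state plus a checksum
// that lets a restart detect a truncated or corrupted file.
type Checkpoint struct {
	Claims   map[string]PreparedClaim `json:"claims"` // keyed by claim UID
	Checksum string                   `json:"checksum"`
}

// checksum hashes the JSON encoding of the claims. encoding/json sorts
// map keys, so the encoding is deterministic.
func checksum(claims map[string]PreparedClaim) (string, error) {
	raw, err := json.Marshal(claims)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// Save writes to a temporary file and renames it into place, so a crash
// mid-write does not leave a partial file at path. This sketch skips fsync.
func Save(path string, claims map[string]PreparedClaim) error {
	sum, err := checksum(claims)
	if err != nil {
		return err
	}
	raw, err := json.Marshal(Checkpoint{Claims: claims, Checksum: sum})
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, raw, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// Load restores the claims. A missing file is treated as empty state.
func Load(path string) (map[string]PreparedClaim, error) {
	raw, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return map[string]PreparedClaim{}, nil
	}
	if err != nil {
		return nil, err
	}
	var cp Checkpoint
	if err := json.Unmarshal(raw, &cp); err != nil {
		return nil, err
	}
	sum, err := checksum(cp.Claims)
	if err != nil {
		return nil, err
	}
	if sum != cp.Checksum {
		return nil, errors.New("checkpoint checksum mismatch")
	}
	return cp.Claims, nil
}

// demoRoundTrip saves one claim, reloads it, and verifies the result.
func demoRoundTrip() error {
	dir, err := os.MkdirTemp("", "checkpoint-demo")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "checkpoint.json")

	want := map[string]PreparedClaim{"claim-uid-1": {Devices: []string{"channel-0"}}}
	if err := Save(path, want); err != nil {
		return err
	}
	got, err := Load(path)
	if err != nil {
		return err
	}
	if len(got) != 1 || len(got["claim-uid-1"].Devices) != 1 || got["claim-uid-1"].Devices[0] != "channel-0" {
		return fmt.Errorf("unexpected state after restore: %v", got)
	}
	return nil
}

func main() {
	if err := demoRoundTrip(); err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println("round trip ok")
}
```

The write-then-rename step is the part that matters for crash recovery: a reader sees either the previous checkpoint or the new one, never a half-written file.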

Testing

  • Added comprehensive unit tests for all new components
  • All existing tests pass
  • Linting passes

Dependencies

  • Adds k8s.io/kubernetes for checkpoint manager

enoodle force-pushed the erez/compute-domain-dra-plugin-implementation branch from 65a5900 to d9908e0 on January 12, 2026 14:30
enoodle marked this pull request as draft on January 12, 2026 14:35
enoodle force-pushed the erez/compute-domain-dra-plugin-implementation branch 2 times, most recently from 55baafd to e2fee94, on January 13, 2026 23:09
enoodle marked this pull request as ready for review on January 13, 2026 23:09

Add full implementation for the compute domain DRA plugin:
- State management with checkpoint persistence for crash recovery
- CDI (Container Device Interface) spec generation for device exposure
- Health check server for liveness/readiness probes
- Device node simulation via symlinks to /dev/null
- Driver tests for prepare/unprepare resource claims
- OpenShift SCC annotation in helm templates
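
For readers unfamiliar with CDI, the spec generation mentioned in this commit can be pictured with a small sketch. The structs below mirror only a subset of the CDI JSON schema (`cdiVersion`, `kind`, `devices`, `containerEdits`, `deviceNodes`). The kind `example.com/compute-domain`, the device names, the host directory, and the `buildSpec` helper are hypothetical, and the actual `cdi.go` in this PR may be structured quite differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"path"
)

// DeviceNode describes one device node exposed to a container.
type DeviceNode struct {
	Path     string `json:"path"`               // path inside the container
	HostPath string `json:"hostPath,omitempty"` // path on the host, if different
}

// ContainerEdits lists the changes applied to a container that uses a device.
type ContainerEdits struct {
	Env         []string     `json:"env,omitempty"`
	DeviceNodes []DeviceNode `json:"deviceNodes,omitempty"`
}

// Device is one named entry in a CDI spec.
type Device struct {
	Name           string         `json:"name"`
	ContainerEdits ContainerEdits `json:"containerEdits"`
}

// Spec is the top-level CDI document.
type Spec struct {
	Version string   `json:"cdiVersion"`
	Kind    string   `json:"kind"`
	Devices []Device `json:"devices"`
}

// buildSpec (hypothetical helper) creates one CDI device per name and maps
// /dev/<name> in the container to <hostDir>/<name> on the host.
func buildSpec(kind string, names []string, hostDir string) Spec {
	spec := Spec{Version: "0.5.0", Kind: kind} // version chosen as an assumption
	for _, name := range names {
		spec.Devices = append(spec.Devices, Device{
			Name: name,
			ContainerEdits: ContainerEdits{
				DeviceNodes: []DeviceNode{{
					Path:     path.Join("/dev", name),
					HostPath: path.Join(hostDir, name),
				}},
			},
		})
	}
	return spec
}

// qualifiedName returns the fully qualified CDI device name, "<kind>=<device>".
func qualifiedName(kind, device string) string {
	return kind + "=" + device
}

func main() {
	spec := buildSpec("example.com/compute-domain", []string{"channel0", "channel1"}, "/var/run/fake-dev")
	out, err := json.MarshalIndent(spec, "", "  ")
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(string(out))
	fmt.Println(qualifiedName(spec.Kind, "channel0"))
}
```

A container runtime that supports CDI resolves a fully qualified name such as the one returned by `qualifiedName` against the spec files on the node and applies the listed edits to the container.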
…lugin

- Add computeDomainDevicePluginLabelKey to status-updater node labeling
- Add compute-domain-dra-plugin to integration test setup
- Enable computeDomainDraPlugin in integration test values
enoodle force-pushed the erez/compute-domain-dra-plugin-implementation branch from e2fee94 to b8e3022 on January 14, 2026 16:23
gshaibi (Contributor) left a comment

Amazing PR, WELL DONE!!!
Approved with some minor nit comments

Path: path.Join(config.flags.kubeletRegistrarDirectoryPath, consts.ComputeDomainDriverName+"-reg.sock"),
}).String()
log.Info("connecting to registration socket", "path", regSockPath)
regConn, err := grpc.NewClient(
Contributor:
not that relevant, but do we want to close it somewhere?

Contributor Author:
👍
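
On the question of closing the registration connection: `*grpc.ClientConn` has a `Close() error` method, so it satisfies `io.Closer`. One common way to handle this is to tie the close to the function that owns the connection and fold any close error into the returned error. The sketch below shows that pattern; `runWithConn` and `fakeConn` are hypothetical names, `fakeConn` stands in for the gRPC connection so the example runs without the gRPC module, and this is not necessarily the change that was made in the PR.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// runWithConn runs work and always closes c afterwards. Any error from
// Close is joined with the error returned by work.
func runWithConn(c io.Closer, work func() error) (err error) {
	defer func() {
		err = errors.Join(err, c.Close())
	}()
	return work()
}

// fakeConn is a stand-in for a real connection such as *grpc.ClientConn.
type fakeConn struct {
	closed bool
}

// Close marks the connection closed and rejects a second close.
func (f *fakeConn) Close() error {
	if f.closed {
		return errors.New("already closed")
	}
	f.closed = true
	return nil
}

func main() {
	conn := &fakeConn{}
	err := runWithConn(conn, func() error {
		// A real plugin would serve requests here until shutdown.
		return nil
	})
	fmt.Println("error:", err, "closed:", conn.closed)
}
```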

if readErr == nil && current == target {
return nil
}
if err := os.Remove(path); err != nil {
Contributor:
Suggested change
- if err := os.Remove(path); err != nil {
+ if err = os.Remove(path); err != nil {

return err
}
} else {
if err := os.Remove(path); err != nil {
Contributor:
Suggested change
- if err := os.Remove(path); err != nil {
+ if err = os.Remove(path); err != nil {
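
Both suggestions swap `:=` for `=`. In Go, `if err := ...` declares a new `err` scoped to that `if` statement, shadowing any outer `err`, while `if err = ...` assigns to the one already in scope. The sketch below is a hypothetical reconstruction of a symlink helper in the spirit of the reviewed code, written with plain assignment. The name `ensureSymlink`, its signature, and the surrounding control flow are assumptions; only the `readErr`/`current`/`target` check and the `os.Remove` calls appear in the review snippets.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// ensureSymlink makes path a symlink to target, replacing whatever is
// currently at path. Hypothetical reconstruction, not the PR's code.
func ensureSymlink(path, target string) error {
	info, err := os.Lstat(path)
	switch {
	case errors.Is(err, os.ErrNotExist):
		// Nothing to remove.
	case err != nil:
		return err
	case info.Mode()&os.ModeSymlink != 0:
		current, readErr := os.Readlink(path)
		if readErr == nil && current == target {
			return nil // already points at the right target
		}
		// Plain assignment reuses the function-level err; `err :=` here
		// would declare a second variable visible only inside this if.
		if err = os.Remove(path); err != nil {
			return err
		}
	default:
		if err = os.Remove(path); err != nil {
			return err
		}
	}
	return os.Symlink(target, path)
}

// demoEnsureSymlink replaces a stale regular file with a symlink to
// /dev/null and checks that a second call leaves it in place.
func demoEnsureSymlink() error {
	dir, err := os.MkdirTemp("", "fake-dev")
	if err != nil {
		return err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "fake-device0") // hypothetical device name
	if err = os.WriteFile(path, []byte("stale"), 0o600); err != nil {
		return err
	}
	for i := 0; i < 2; i++ {
		if err = ensureSymlink(path, "/dev/null"); err != nil {
			return err
		}
	}
	got, err := os.Readlink(path)
	if err != nil {
		return err
	}
	if got != "/dev/null" {
		return fmt.Errorf("symlink points at %q", got)
	}
	return nil
}

func main() {
	if err := demoEnsureSymlink(); err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("symlink ok")
}
```

In a function shaped like this one the two spellings behave the same, because the error is returned immediately. The plain assignment mainly avoids a shadowed variable that linters tend to flag.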

enoodle enabled auto-merge on January 15, 2026 15:50
enoodle merged commit 9476038 into main on Jan 15, 2026
4 of 5 checks passed
enoodle deleted the erez/compute-domain-dra-plugin-implementation branch on January 15, 2026 15:53