
Conversation

@sambuc sambuc commented Dec 1, 2025

This merges the code from #72, with a couple of fixes.

kanya-approve and others added 3 commits November 26, 2025 11:09
* Ensure the mutex is not copied, even when the nodeServer is copied by
  storing a pointer to the mutex, instead of the mutex itself

* Use Mutex instead of RWMutex: even "readers" of the map go on to write
  the state, so two concurrent read-lock holders would write the state
  file at the same time and corrupt it on storage.

* Mutex / RWMutex are not recursive / re-entrant in Go, so in two cases
  we do not call `Unlock()` through `defer` as `persistState()` also
  takes the lock.

* As a rule of thumb, locking a mutex should happen as close as possible
  to the resource it protects, to minimize the size of the critical
  section / the time spent holding the lock.
@sambuc sambuc marked this pull request as ready for review December 1, 2025 08:16
@sambuc sambuc requested a review from a team as a code owner December 1, 2025 08:16
@sambuc sambuc changed the title Sambuc/feat merge restart pr 1 sambuc/feat merge restart pr 1 Dec 1, 2025

```go
// Use kubelet plugin directory for state persistence
stateFile := "/var/lib/kubelet/plugins/csi-rclone/mounted_volumes.json"

return &nodeServer{
```
Member
not a big deal, but it does not hurt to make this a constant

Suggested change

```diff
-stateFile := "/var/lib/kubelet/plugins/csi-rclone/mounted_volumes.json"
+const stateFile string = "/var/lib/kubelet/plugins/csi-rclone/mounted_volumes.json"
```

Author

You will find this (and the comment) moved directly into the instantiation of nodeServer just below, as that is the only place where this variable is used, in a commit in #78.

Comment on lines +156 to +162
```go
onDaemonReady := func() error {
	if d.ns != nil {
		return d.ns.remountTrackedVolumes(context.Background())
	}
	return nil
}
return d.ns.RcloneOps.Run(onDaemonReady)
```
Member

suggestion: Here it is better to pass a context into the enclosing function and propagate it into remountTrackedVolumes. Even better, wrap the context passed into Run in a timeout, so that if remounting the volumes takes a long time the server can give up and start.

Member
You can even decide whether to have a timeout for the total remount of all storages, or just make a new timeout context for each call to Mount for each mountpoint. You could even do both.

```go
RcloneOps Operations

// Track mounted volumes for automatic remounting
mountedVolumes map[string]*MountedVolume
```
Member
question: Why a pointer here and not just `map[string]MountedVolume`?

Author
I assume it keeps the map entries small, making the map simpler and faster to reallocate when needed, since we append to it often.

The MountedVolume struct is not that small; this also decouples it a bit more from the map.

Member
You really should not overthink that in Go. This map will not see intensive use, so it should stay as simple as it can get; in Go, that means using values as much as possible.
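For what it's worth, a value map does come with one small idiom to know about. This sketch (with an assumed, simplified `MountedVolume`) shows it:

```go
package main

import "fmt"

// Assumed shape for illustration; the PR's MountedVolume has more fields.
type MountedVolume struct {
	VolumeID   string
	TargetPath string
}

// updateTarget shows the one caveat of a value map: you cannot assign
// through the map (m[id].TargetPath = ... does not compile), so you read
// the value out, modify it, and store it back.
func updateTarget(m map[string]MountedVolume, id, newPath string) {
	mv := m[id]
	mv.TargetPath = newPath
	m[id] = mv
}

func main() {
	// Value map, as the reviewer suggests: the simplest option in Go.
	byID := map[string]MountedVolume{
		"vol-1": {VolumeID: "vol-1", TargetPath: "/mnt/a"},
	}
	updateTarget(byID, "vol-1", "/mnt/b")
	fmt.Println(byID["vol-1"].TargetPath) // /mnt/b
}
```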

Comment on lines +391 to +396
```go
ns.mutex.Unlock()
klog.Infof("Tracked mounted volume %s at path %s", volumeId, targetPath)

if err := ns.persistState(); err != nil {
	klog.Errorf("Failed to persist volume state: %v", err)
}
```
Member

suggestion: I think that updating the mountedVolumes map and writing it to file should happen in one single lock/unlock cycle, not two. As it stands, as soon as you unlock the mutex, another trackMountedVolume call can grab it, which can cause weird behaviour or make us miss persisting the state to file.

Member
What you could do is make persistState be a fully standalone private function that just takes a map and a file name and writes that file. And the locking happens all in the methods that just call the function.

You could even go as far as calling that function unsafePersistState and/or also making it clear in the docstring that it is not goroutine safe.

Comment on lines +493 to +519
```go
// Load volume state from disk
func (ns *nodeServer) loadState() error {
	if ns.stateFile == "" {
		return nil
	}

	ns.mutex.Lock()
	defer ns.mutex.Unlock()

	data, err := os.ReadFile(ns.stateFile)
	if err != nil {
		if os.IsNotExist(err) {
			klog.Info("No persisted volume state found, starting fresh")
			return nil
		}
		return fmt.Errorf("failed to read state file: %v", err)
	}

	var volumes map[string]*MountedVolume
	if err := json.Unmarshal(data, &volumes); err != nil {
		return fmt.Errorf("failed to unmarshal volume state: %v", err)
	}

	ns.mountedVolumes = volumes
	klog.Infof("Loaded %d tracked volumes from %s", len(ns.mountedVolumes), ns.stateFile)
	return nil
}
```
Member
suggestion (non-blocking): I am being incredibly picky here, but it is not a bad idea to do one of the following:

  1. Note in the docstring that this will overwrite the map of volumes in memory.
  2. Fail if the map in memory is not empty when this is called (maybe a bit too extreme).
  3. Merge. This may require looping through the map read from disk. You could also try to unmarshal directly into ns.mountedVolumes, but I am not really sure whether that overwrites the map or just updates the keys it finds.

Comment on lines +414 to +425
```go
func (ns *nodeServer) remountTrackedVolumes(ctx context.Context) error {
	ns.mutex.Lock()
	defer ns.mutex.Unlock()

	if len(ns.mountedVolumes) == 0 {
		klog.Info("No tracked volumes to remount")
		return nil
	}

	klog.Infof("Remounting %d tracked volumes", len(ns.mountedVolumes))

	for volumeId, mv := range ns.mountedVolumes {
```
Member
suggestion: This is similar to the other cases. It is safer to load the state file and remount in one lock/unlock cycle.



5 participants