Description
Hi there, I just got out of one nice rabbit hole and thought I wouldn't keep the journey to myself :)
TL;DR
Now I know that attaching a disk from one project as a PersistentVolume in a GKE cluster in another project is not allowed, but the driver gave me a really hard time on the way to realizing that - which is what I'd like to improve for any other unfortunate journeymen headed for the same dead end.
The rabbit hole
Trust me, it was a well-meant design decision. We decided to keep some persistent disks in their own project (let's call it DISK_PROJECT), outside the scope of an ephemeral project hosting a GKE cluster (GKE_PROJECT). Up to that point I hadn't tripped any warning that this isn't really supposed to work; everything out there only says you can't attach disks from different zones, which was no problem for us. We set it up per the instructions in the GKE docs. All seemed fine until we tried to bind the PV/PVC to a workload, when things started sliding in the wrong direction without us knowing it yet (the logs are from the workload's namespace events).
Warning FailedAttachVolume 12s (x8 over 2m24s) attachdetach-controller AttachVolume.Attach failed for volume "PERSISTENT_VOLUME_NAME" : rpc error: code = Internal desc = Failed to getDisk: googleapi: Error 403: Required 'compute.disks.get' permission for 'projects/DISK_PROJECT/zones/ZONE/disks/DISK_NAME', forbidden
Treading that path for the first time, it wasn't particularly easy to figure out that the permissions should be granted to the "hidden" engine robot account, but we cracked that one.
Curious (and cautious) about which permissions would be required in total, we naturally hit a couple of new ones right after adding the previous ones, notably:
37s Warning FailedAttachVolume pod/POD AttachVolume.Attach failed for volume "PERSISTENT_VOLUME_NAME" : rpc error: code = Internal desc = Failed to Attach: failed cloud service attach disk call: googleapi: Error 403: Required 'compute.instances.attachDisk' permission for 'projects/DISK_PROJECT/zones/ZONE/instances/GKE_NODE'
More details:
Reason: forbidden, Message: Required 'compute.instances.attachDisk' permission for 'projects/DISK_PROJECT/zones/ZONE/instances/GKE_NODE'
Reason: forbidden, Message: Required 'compute.disks.use' permission for 'projects/DISK_PROJECT/zones/ZONE/disks/DISK_NAME'
Something's already off in this one - it mentions the GKE_NODE in the DISK_PROJECT, while the node actually lives in its own GKE_PROJECT. But we didn't notice, so out of a little desperation we added the DISK_PROJECT's compute.admin
role to the GKE_PROJECT's robot account, at which point the "403 Forbidden" errors disappeared - but then the strangest (though, in hindsight, more obvious) thing happened. The error became a "404 Not Found", because the GCP API couldn't find the node: it didn't exist in the DISK_PROJECT, only in the GKE_PROJECT:
14s Warning FailedAttachVolume pod/POD AttachVolume.Attach failed for volume "PERSISTENT_VOLUME_NAME" : rpc error: code = Internal desc = Failed to Attach: failed cloud service attach disk call: googleapi: Error 404: The resource 'projects/DISK_PROJECT/zones/ZONE/instances/GKE_NODE' was not found, notFound
That was a dead end, and at this point I grew suspicious that the driver (v1.8.7
in our case, but master
seems to have it too) "incorrectly" derives the node's project from the disk's. Then I actually found the code that extracts the project from the disk's volumeHandle
and uses it in the call to AttachDisk,
which eventually leads to the GCP API call that produces the error. Ha!
Before filing a bug for this "obviously" incorrect assumption, I tried bypassing the driver to see whether I'd have more luck calling the API's attachDisk
directly (redacted):
curl --request POST \
  "https://compute.googleapis.com/compute/v1/projects/$GKE_PROJECT/zones/$ZONE/instances/$GKE_NODE/attachDisk" \
  --data "{\"deviceName\": \"$DISK_NAME\", \"source\": \"projects/$DISK_PROJECT/zones/$ZONE/disks/$DISK_NAME\"}"
Finally, it gave me the error I had been missing the whole time:
Invalid value for field 'resource.source': 'projects/DISK_PROJECT/zones/ZONE/disks/DISK_NAME'. Disk must be in the same project as the instance or instance template.
Whoops, OK then!
Next steps
Depending on whether the driver wants to enforce that rule itself or delegate it to the API, there are two likely ways to fix this.
- The "same project" rule is a solid one; the driver should warn about such misconfigurations right away.
Since the driver already derives the project from the disk's handle, the assumption is already there. However, nothing warns the user before the GCP API calls start (which sends one down the - hopefully quite unnecessary - permissions path toward a dead end). An obvious and easy fix would be to add the check to validateControllerPublishVolumeRequest.
- The driver should delegate the responsibility to the API (no, please don't! ;))
However nasty, it could still be a legitimate way to let things flow. In that case, the node's project should enter AttachDisk as yet another instanceProject
argument (or whatever) so it gets passed correctly to the API and the user sees the final error.
Either way, right now the driver hides the last, most important API error behind its silent assumption, which is hopefully worth tackling.
Thanks for reading this novel up to here 🎉 😄