Description
Hi there, I just got out of one nice rabbit hole and thought I wouldn't keep the journey to myself :)
TL;DR
Now I know that attaching a disk from one project as a PersistentVolume in a GKE cluster in another project is not allowed, but the driver gave me a really hard time on the way to realizing that - which is what I'd like to improve for any other unfortunate journeymen headed for the same dead end.
The rabbit hole
Trust me, it was a well-meant design decision. We decided to keep some persistent disks in their own project (let's call it DISK_PROJECT), outside the scope of an ephemeral project hosting a GKE cluster (GKE_PROJECT). Up to that point I hadn't tripped any warning that this isn't really supposed to work; everything out there only says you can't attach disks from different zones, which was no problem for us. We set it up per the instructions in the GKE docs. All seemed fine until we tried to bind the PV/PVC to a workload, when things started sliding in the wrong direction without us knowing it yet (the logs are from the workload's namespace events).
Warning FailedAttachVolume 12s (x8 over 2m24s) attachdetach-controller AttachVolume.Attach failed for volume "PERSISTENT_VOLUME_NAME" : rpc error: code = Internal desc = Failed to getDisk: googleapi: Error 403: Required 'compute.disks.get' permission for 'projects/DISK_PROJECT/zones/ZONE/disks/DISK_NAME', forbidden
Treading that path for the first time, it wasn't particularly easy to figure out that the permissions should be granted to the "hidden" engine robot account, but we cracked that one.
Curious (and cautious) about which permissions would be required in total, we naturally hit a couple of new ones right after adding the previous ones, notably:
37s Warning FailedAttachVolume pod/POD AttachVolume.Attach failed for volume "PERSISTENT_VOLUME_NAME" : rpc error: code = Internal desc = Failed to Attach: failed cloud service attach disk call: googleapi: Error 403: Required 'compute.instances.attachDisk' permission for 'projects/DISK_PROJECT/zones/ZONE/instances/GKE_NODE'
More details:
Reason: forbidden, Message: Required 'compute.instances.attachDisk' permission for 'projects/DISK_PROJECT/zones/ZONE/instances/GKE_NODE'
Reason: forbidden, Message: Required 'compute.disks.use' permission for 'projects/DISK_PROJECT/zones/ZONE/disks/DISK_NAME'
Something's already off in this one - it mentions the GKE_NODE in the DISK_PROJECT, while the node actually lives in its own GKE_PROJECT. But we didn't notice, so out of a little desperation we added the DISK_PROJECT's compute.admin
role to the GKE_PROJECT's robot account, at which point the "403 Forbidden" errors disappeared - but then the strangest (though, in hindsight, more obvious) thing happened. The error became a "404 Not Found", because the GCP API couldn't find the node: it didn't exist in the DISK_PROJECT, only in the GKE_PROJECT:
14s Warning FailedAttachVolume pod/POD AttachVolume.Attach failed for volume "PERSISTENT_VOLUME_NAME" : rpc error: code = Internal desc = Failed to Attach: failed cloud service attach disk call: googleapi: Error 404: The resource 'projects/DISK_PROJECT/zones/ZONE/instances/GKE_NODE' was not found, notFound
That was a dead end, and at this point I grew suspicious that the driver (v1.8.7
in our case, but master
seems to have it too) "incorrectly" derives the node's project from the disk's. Then I actually found the code that extracts the project from the disk's volumeHandle
and uses it in the call to AttachDisk,
which eventually leads to the GCP API call that produces the error. Ha!
Before filing a bug for this "obviously" incorrect assumption, I tried bypassing the driver to see whether I'd have more luck calling the API's attachDisk
directly (redacted):
curl --request POST \
  "https://compute.googleapis.com/compute/v1/projects/$GKE_PROJECT/zones/$ZONE/instances/$GKE_NODE/attachDisk" \
  --data "{\"deviceName\": \"$DISK_NAME\", \"source\": \"projects/$DISK_PROJECT/zones/$ZONE/disks/$DISK_NAME\"}"
Finally, it gave me the error I had been missing the whole time:
Invalid value for field 'resource.source': 'projects/DISK_PROJECT/zones/ZONE/disks/DISK_NAME'. Disk must be in the same project as the instance or instance template.
Whoops, OK then!
Next steps
Depending on whether the driver wants to enforce that rule itself or delegate it to the API, there are two likely ways to fix this.
- The "same project" rule is a solid one; the driver should warn about such misconfigurations right away.
Since the driver already derives the project from the disk's handle, the assumption is already there. However, nothing warns the user before the GCP API calls start (which sends one down the - hopefully quite unnecessary - permissions path toward a dead end). An obvious and easy fix would be to add the check to validateControllerPublishVolumeRequest.
- The driver should delegate the responsibility to the API (no, please don't! ;))
However nasty, it could still be a legitimate way to let things flow. In that case, the node's project should enter AttachDisk as yet another instanceProject
argument (or whatever) so it gets passed correctly to the API and the user sees the final error.
Either way, right now the driver hides the last, most important API error behind its silent assumption, which is hopefully worth tackling.
Thanks for reading this novel up to here 🎉 😄