Add VPC egress firewall support for Cloud Run (#158)

ElleNajt · claude · web-flow · commit aa99e63b4a3e · 2026-03-01T19:22:09.000-08:00
* Upload large commands to GCS when they exceed env var limits

Cloud Run passes commands via environment variables, which have a ~32KB
limit. When a command exceeds 30KB, upload it to GCS and replace it with
a bootstrap script that downloads and executes it.
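
The size check described above can be sketched as follows. This is a minimal illustration rather than the code in this commit: `plan_command` and `COMMAND_SIZE_LIMIT` are hypothetical names, the bucket is a placeholder, and the upload itself is left to the caller so the snippet runs without GCP credentials.

```python
import hashlib
from typing import Optional, Tuple

COMMAND_SIZE_LIMIT = 30_000  # assumed threshold, leaving headroom below the ~32KB limit


def plan_command(
    command: str, bucket: str, prefix: str = "cloudrun-commands"
) -> Tuple[str, Optional[str]]:
    """Return (command_to_pass, gcs_path_or_None).

    Small commands pass through unchanged. Large ones are replaced by a
    bootstrap command that downloads the script from GCS and runs it.
    The caller must upload the original command to gcs_path.
    """
    # Measure encoded bytes, not characters: non-ASCII text takes more bytes.
    if len(command.encode()) <= COMMAND_SIZE_LIMIT:
        return command, None

    # Content-addressed name, so identical commands map to the same object.
    digest = hashlib.sha256(command.encode()).hexdigest()[:16]
    gcs_path = f"{prefix}/{digest}.sh"
    bootstrap = (
        f'gcloud storage cp "gs://{bucket}/{gcs_path}" /tmp/large_command.sh '
        "&& bash /tmp/large_command.sh"
    )
    return bootstrap, gcs_path


small, path = plan_command("echo hi", "my-bucket")
large, large_path = plan_command("x" * 40_000, "my-bucket")
```

Hashing the command keeps the object name stable, so re-running the same large command can skip the upload if the object already exists.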

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add VPC Direct Egress support for egress firewall

Route Cloud Run container traffic through a VPC where Cloud NGFW
firewall policies control outbound access by domain name (FQDN rules).

We previously tried iptables inside the container but found that
curl -6 bypasses iptables on Cloud Run, ip6tables kills the container,
and /proc/sys is read-only. The VPC approach applies firewall rules at
the GCP infrastructure level, outside the container.

Changes:
- Add vpc_network/vpc_subnet/vpc_egress to CloudRunClientConfig
- Configure run_v2.VpcAccess on job creation
- Add vpc_network/vpc_subnet/vpc_egress to ClaudeCodeClientConfig
- Document egress firewall setup in README (with example FQDN rules)
- Add integration test for VPC egress (allowed/blocked domains)
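
How the three new fields relate can be sketched as below, assuming the rules stated in the docstrings (a subnet is required whenever a network is set, and egress is one of two strings). `VpcSettings` and `validate` are hypothetical names used only for illustration; this commit does not add validation of its own.

```python
from dataclasses import dataclass
from typing import Optional

VALID_EGRESS = ("all-traffic", "private-ranges-only")


@dataclass(frozen=True)
class VpcSettings:
    """Hypothetical stand-in for the three vpc_* config fields."""

    vpc_network: Optional[str] = None
    vpc_subnet: Optional[str] = None
    vpc_egress: Optional[str] = None

    def validate(self) -> None:
        if self.vpc_network is None:
            return  # no network set, so VPC access is not configured at all
        if not self.vpc_subnet:
            raise ValueError("vpc_subnet is required when vpc_network is set")
        if self.vpc_egress is not None and self.vpc_egress not in VALID_EGRESS:
            raise ValueError(f"vpc_egress must be one of {VALID_EGRESS}")


# A complete configuration passes silently.
VpcSettings("my-egress-vpc", "my-egress-subnet", "all-traffic").validate()
```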

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/safetytooling/infra/cloud_run/README.md b/safetytooling/infra/cloud_run/README.md
@@ -177,12 +177,131 @@ client = ClaudeCodeClient(
 - **Without this, Claude could take over your entire GCP project** - don't skip this step!
 
 **What this doesn't limit:**
-- Outbound network access (Claude could exfiltrate data to external URLs)
+- Outbound network access (see Egress Firewall below)
 - Anthropic API usage (Claude could use your API key for other purposes)
 
 For the "yolo Claude" use case, the main risks are data exfiltration and API key abuse.
 Containers are ephemeral (destroyed after job), so there's no persistence risk.
 
+## Egress Firewall (Recommended)
+
+By default, containers can make outbound requests to any host. To restrict egress (e.g., only allow `api.anthropic.com` and Google APIs), use VPC Direct Egress with Cloud NGFW firewall rules.
+
+**How it works:** When `vpc_network` is set, all container traffic routes through a VPC where a Cloud NGFW firewall policy controls access by domain name (FQDN rules). This covers both IPv4 and IPv6.
+
+**Usage:**
+
+```python
+client = ClaudeCodeClient(
+    project_id="my-project",
+    gcs_bucket="my-bucket",
+    api_key_secret="anthropic-api-key-USERNAME",
+    service_account="claude-runner@my-project.iam.gserviceaccount.com",
+    vpc_network="my-egress-vpc",       # VPC with NGFW firewall policy
+    vpc_subnet="my-egress-subnet",     # Subnet in the VPC
+    vpc_egress="all-traffic",          # Route all traffic through VPC
+)
+```
+
+**One-time GCP setup:**
+
+1. **VPC + Subnet** (with Private Google Access for Google APIs):
+   ```bash
+   gcloud compute networks create egress-firewall-vpc --subnet-mode=custom
+   gcloud compute networks subnets create egress-firewall-subnet \
+       --network=egress-firewall-vpc --region=us-central1 \
+       --range=10.100.0.0/24 --enable-private-ip-google-access
+   ```
+
+2. **Cloud Router + NAT** (required for internet access from VPC):
+   ```bash
+   gcloud compute routers create egress-firewall-router \
+       --network=egress-firewall-vpc --region=us-central1
+   gcloud compute routers nats create egress-firewall-nat \
+       --router=egress-firewall-router --region=us-central1 \
+       --auto-allocate-nat-external-ips \
+       --endpoint-types=ENDPOINT_TYPE_VM,ENDPOINT_TYPE_MANAGED_PROXY_LB \
+       --nat-all-subnet-ip-ranges
+   ```
+   Note: `ENDPOINT_TYPE_MANAGED_PROXY_LB` is required — Cloud Run Direct VPC Egress uses managed proxy load balancers internally.
+
+3. **Cloud NGFW firewall policy** with FQDN rules:
+   ```bash
+   # Create policy and associate with VPC
+   gcloud compute network-firewall-policies create egress-firewall-policy --global
+   gcloud compute network-firewall-policies associations create \
+       --firewall-policy=egress-firewall-policy --network=egress-firewall-vpc --global-firewall-policy
+
+   # Allow DNS
+   gcloud compute network-firewall-policies rules create 100 \
+       --firewall-policy=egress-firewall-policy --direction=EGRESS --action=allow \
+       --dest-ip-ranges=0.0.0.0/0 --layer4-configs=udp:53,tcp:53 --global-firewall-policy
+
+   # Allow metadata server
+   gcloud compute network-firewall-policies rules create 200 \
+       --firewall-policy=egress-firewall-policy --direction=EGRESS --action=allow \
+       --dest-ip-ranges=169.254.169.254/32 --layer4-configs=all --global-firewall-policy
+
+   # Allow Google APIs (list each subdomain — wildcards not supported)
+   gcloud compute network-firewall-policies rules create 250 \
+       --firewall-policy=egress-firewall-policy --direction=EGRESS --action=allow \
+       --dest-fqdns=storage.googleapis.com,oauth2.googleapis.com,www.googleapis.com,\
+   secretmanager.googleapis.com,accounts.googleapis.com,cloudresourcemanager.googleapis.com,\
+   run.googleapis.com,logging.googleapis.com,gcr.io,iamcredentials.googleapis.com \
+       --layer4-configs=tcp:443 --global-firewall-policy
+
+   # Allow Private Google Access VIPs
+   gcloud compute network-firewall-policies rules create 300 \
+       --firewall-policy=egress-firewall-policy --direction=EGRESS --action=allow \
+       --dest-ip-ranges=199.36.153.0/24 --layer4-configs=tcp:443 --global-firewall-policy
+
+   # Allow your API providers
+   gcloud compute network-firewall-policies rules create 400 \
+       --firewall-policy=egress-firewall-policy --direction=EGRESS --action=allow \
+       --dest-fqdns=api.anthropic.com,openrouter.ai \
+       --layer4-configs=tcp:443 --global-firewall-policy
+
+   # Allow package managers + GitHub (needed if agents install dependencies)
+   gcloud compute network-firewall-policies rules create 450 \
+       --firewall-policy=egress-firewall-policy --direction=EGRESS --action=allow \
+       --dest-fqdns=registry.npmjs.org,pypi.org,files.pythonhosted.org,\
+   crates.io,static.crates.io,proxy.golang.org,sum.golang.org,index.golang.org,\
+   rubygems.org,github.com,raw.githubusercontent.com,objects.githubusercontent.com \
+       --layer4-configs=tcp:443,tcp:80 --global-firewall-policy
+
+   # Deny everything else (IPv4 + IPv6)
+   gcloud compute network-firewall-policies rules create 10000 \
+       --firewall-policy=egress-firewall-policy --direction=EGRESS --action=deny \
+       --dest-ip-ranges=0.0.0.0/0 --layer4-configs=all --global-firewall-policy
+   gcloud compute network-firewall-policies rules create 10001 \
+       --firewall-policy=egress-firewall-policy --direction=EGRESS --action=deny \
+       --dest-ip-ranges=::/0 --layer4-configs=all --global-firewall-policy
+   ```
+
+**Costs:** Cloud NAT charges per VM-hour and per GB processed ([pricing](https://cloud.google.com/nat/pricing)). NGFW Standard charges $0.018/GB on internet-bound traffic evaluated by FQDN rules ([pricing](https://cloud.google.com/firewall/pricing)) — negligible for typical API call workloads but could add up if transferring large files.
+
+**Key facts:**
+- FQDN rules don't support wildcards — must list each Google API subdomain individually
+- IPv6 is fully blocked at the VPC level (deny `::/0`)
+
+**Verifying your setup:**
+
+An integration test is included at `tests/test_vpc_egress.py`. It launches a Cloud Run container with VPC egress enabled and curls several domains from inside:
+
+- Allowed domains (`api.anthropic.com`, `pypi.org`, `registry.npmjs.org`) should return an HTTP response
+- Blocked domains (`example.com`) should time out (HTTP code `000`)
+- IPv6 requests (`curl -6`) should be blocked
+
+```bash
+# Set env vars for your GCP project
+export GCP_PROJECT_ID=my-project GCS_BUCKET=my-bucket
+export API_KEY_SECRET=anthropic-api-key-USERNAME
+export SERVICE_ACCOUNT=claude-runner@my-project.iam.gserviceaccount.com
+export VPC_NETWORK=egress-firewall-vpc VPC_SUBNET=egress-firewall-subnet
+
+pytest tests/test_vpc_egress.py -v --run-integration
+```
+
 ## How It Works
 
 ```
@@ -255,6 +374,9 @@ ClaudeCodeClientConfig(
     memory: str = "2Gi",   # Up to 32Gi
     skip_permissions: bool = True,  # --dangerously-skip-permissions
     image: str = DEFAULT_CLAUDE_CODE_IMAGE,  # Pre-built image with Claude Code
+    vpc_network: str = None,       # VPC for egress firewall (see Egress Firewall section)
+    vpc_subnet: str = None,        # Subnet in the VPC (required when vpc_network is set)
+    vpc_egress: str = "all-traffic",  # "all-traffic" or "private-ranges-only"
 )
 ```
 
@@ -333,6 +455,9 @@ CloudRunClientConfig(
     env: dict = {},        # Environment variables
     secrets: dict = {},    # Secret Manager secrets as env vars
     service_account: str = None,  # Restricted service account (see Security Hardening)
+    vpc_network: str = None,       # VPC for egress firewall
+    vpc_subnet: str = None,        # Subnet in the VPC
+    vpc_egress: str = None,        # "all-traffic" or "private-ranges-only"
 )
 ```
 
diff --git a/safetytooling/infra/cloud_run/claude_code_client.py b/safetytooling/infra/cloud_run/claude_code_client.py
@@ -87,6 +87,12 @@ class ClaudeCodeClientConfig:
                         SECURITY: Use a restricted service account to limit container access.
                         See README for setup instructions.
                         Format: "name@project.iam.gserviceaccount.com"
+        vpc_network: VPC network name for Direct VPC Egress. When set with vpc_egress="all-traffic",
+                     all outbound traffic routes through the VPC where Cloud NGFW firewall policies
+                     control access. This covers both IPv4 and IPv6. Requires a Cloud NAT gateway
+                     with ENDPOINT_TYPE_MANAGED_PROXY_LB on the VPC for internet access.
+        vpc_subnet: VPC subnet name (required when vpc_network is set).
+        vpc_egress: VPC egress setting - "all-traffic" or "private-ranges-only" (default: "all-traffic").
     """
 
     project_id: str
@@ -102,6 +108,9 @@ class ClaudeCodeClientConfig:
     image: str = DEFAULT_CLAUDE_CODE_IMAGE
     api_key_secret: str | None = None
     service_account: str | None = None
+    vpc_network: str | None = None
+    vpc_subnet: str | None = None
+    vpc_egress: str = "all-traffic"
 
 
 # Instructions prepended to task when output_instructions=True
@@ -405,6 +414,9 @@ def __init__(
             env={},
             secrets=secrets,
             service_account=self.config.service_account,
+            vpc_network=self.config.vpc_network,
+            vpc_subnet=self.config.vpc_subnet,
+            vpc_egress=self.config.vpc_egress if self.config.vpc_network else None,
         )
         self._cloud_run = CloudRunClient(cloud_run_config)
 
diff --git a/safetytooling/infra/cloud_run/cloud_run_client.py b/safetytooling/infra/cloud_run/cloud_run_client.py
@@ -102,6 +102,9 @@ class CloudRunClientConfig:
     env: dict[str, str] = field(default_factory=dict)
     secrets: dict[str, str] = field(default_factory=dict)
     service_account: str | None = None
+    vpc_network: str | None = None
+    vpc_subnet: str | None = None
+    vpc_egress: str | None = None  # "all-traffic" or "private-ranges-only"
 
 
 @dataclass(frozen=True)
@@ -496,15 +499,27 @@ def _get_or_create_job(self, timeout: int) -> str:
             if self.config.service_account:
                 job.template.template.service_account = self.config.service_account
 
+            if self.config.vpc_network:
+                vpc_access = run_v2.VpcAccess(
+                    network_interfaces=[
+                        run_v2.VpcAccess.NetworkInterface(
+                            network=self.config.vpc_network,
+                            subnetwork=self.config.vpc_subnet,
+                        )
+                    ],
+                )
+                if self.config.vpc_egress == "all-traffic":
+                    vpc_access.egress = run_v2.VpcAccess.VpcEgress.ALL_TRAFFIC
+                job.template.template.vpc_access = vpc_access
+
             parent = f"projects/{self.config.project_id}/locations/{self.config.region}"
-            request = CreateJobRequest(parent=parent, job=job, job_id=job_id)
 
+            request = CreateJobRequest(parent=parent, job=job, job_id=job_id)
             try:
                 operation = self._jobs_client.create_job(request=request)
                 created_job = operation.result()
                 job_name = created_job.name
             except Exception as e:
-                # Job might already exist (from previous process/session)
                 if "already exists" in str(e).lower():
                     job_name = f"{parent}/jobs/{job_id}"
                 else:
@@ -513,6 +528,19 @@ def _get_or_create_job(self, timeout: int) -> str:
             self._job_cache[config_hash] = job_name
             return job_name
 
+    _GCS_COMMANDS_PREFIX: ClassVar[str] = "cloudrun-commands"
+    _COMMAND_SIZE_LIMIT: ClassVar[int] = 30000  # Leave headroom below 32768 env var limit
+
+    def _upload_command_to_gcs(self, command: str) -> str:
+        """Upload a large command to GCS and return its path."""
+        cmd_hash = hashlib.sha256(command.encode()).hexdigest()[:16]
+        gcs_path = f"{self._GCS_COMMANDS_PREFIX}/{cmd_hash}.sh"
+        bucket = self._storage_client.bucket(self.config.gcs_bucket)
+        blob = bucket.blob(gcs_path)
+        if not blob.exists():
+            blob.upload_from_string(command, content_type="text/plain")
+        return gcs_path
+
     def _run_job_execution(
         self,
         job_name: str,
@@ -524,7 +552,17 @@ def _run_job_execution(
         """Run an execution of an existing job with specific inputs/outputs/command.
 
         Uses RunJobRequest.Overrides to pass per-execution environment variables.
+        If the command exceeds the env var size limit, it's uploaded to GCS and
+        a small bootstrap script downloads it and runs it with bash.
         """
+        # If command is too large for an env var, stash it in GCS
+        if len(command.encode()) > self._COMMAND_SIZE_LIMIT:
+            gcs_path = self._upload_command_to_gcs(command)
+            command = (
+                f'gcloud storage cp "gs://{self.config.gcs_bucket}/{gcs_path}" /tmp/large_command.sh '
+                f"&& bash /tmp/large_command.sh"
+            )
+
         # Build env var overrides for this execution
         env_overrides = [
             run_v2.EnvVar(name="OUTPUT_GCS_PATH", value=output_gcs_path),
@@ -634,6 +672,9 @@ def _compute_config_hash(self) -> str:
             self.config.memory,
             self.config.service_account or "",
             self.config.gcs_bucket,
+            self.config.vpc_network or "",
+            self.config.vpc_subnet or "",
+            self.config.vpc_egress or "",
         ]
         # Add sorted env vars
         for k, v in sorted(self.config.env.items()):
diff --git a/tests/test_vpc_egress.py b/tests/test_vpc_egress.py