Description:
The GitHub Actions Cache Server is experiencing frequent PostgreSQL connection terminations during cache operations, resulting in 500 errors and failed cache requests. The error occurs across multiple operations including CreateCacheEntry and GetCacheEntryDownloadURL.
The deployment uses S3 for storage and PostgreSQL as the database.
Error:
[request error] [unhandled] [POST] http:///twirp/github.actions.results.api.v1.CacheService/CreateCacheEntry
H3Error: Connection terminated unexpectedly
at /app/server/node_modules/pg-pool/index.js:45:11
... 8 lines matching cause stack trace ...
at async Object.reserveCache (file:///app/server/index.mjs:5048:13) {
cause: Error: Connection terminated unexpectedly
at /app/server/node_modules/pg-pool/index.js:45:11
at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
at async PostgresDriver.acquireConnection (file:///app/server/node_modules/kysely/dist/esm/dialect/postgres/postgres-driver.js:21:24)
at async RuntimeDriver.acquireConnection (file:///app/server/node_modules/kysely/dist/esm/driver/runtime-driver.js:44:28)
at async DefaultConnectionProvider.provideConnection (file:///app/server/node_modules/kysely/dist/esm/driver/default-connection-provider.js:8:28)
at async DefaultQueryExecutor.executeQuery (file:///app/server/node_modules/kysely/dist/esm/query-executor/query-executor-base.js:34:16)
at async SelectQueryBuilderImpl.execute (file:///app/server/node_modules/kysely/dist/esm/query-builder/select-query-builder.js:317:24)
at async SelectQueryBuilderImpl.executeTakeFirst (file:///app/server/node_modules/kysely/dist/esm/query-builder/select-query-builder.js:321:26)
at async getUpload (file:///app/server/index.mjs:4667:15)
at async Object.reserveCache (file:///app/server/index.mjs:5048:13),
statusCode: 500,
fatal: false,
unhandled: true,
statusMessage: undefined,
data: undefined
}
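For illustration only, here is a minimal sketch of how a caller could tell this failure apart from other errors and retry it. It assumes the error shape shown in the trace above, where the outer H3Error carries the pg-pool error in `cause`. The names `isTransientConnectionError` and `withRetry` are hypothetical and are not part of the cache server, pg, or kysely. This would only mask the symptom; it does not explain why the connections are being terminated.

```javascript
// Hypothetical helpers for illustration; not part of the cache server, pg, or kysely.
const TRANSIENT_MESSAGE = "Connection terminated unexpectedly";

// Walk the `cause` chain, because the H3Error in the trace wraps the pg-pool error.
function isTransientConnectionError(err) {
  let current = err;
  // The depth limit guards against a cyclic `cause` chain.
  for (let depth = 0; current != null && depth < 10; depth++) {
    if (typeof current.message === "string" && current.message.includes(TRANSIENT_MESSAGE)) {
      return true;
    }
    current = current.cause;
  }
  return false;
}

// Retry `operation` with exponential backoff, but only for the transient error above.
async function withRetry(operation, { attempts = 3, baseDelayMs = 100 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Rethrow anything that is not a dropped connection, or when out of attempts.
      if (!isTransientConnectionError(err) || attempt >= attempts) {
        throw err;
      }
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

A call such as `withRetry(() => someDatabaseQuery())`, where `someDatabaseQuery` is a placeholder, would re-run the query up to three times when the connection is dropped and would rethrow every other error immediately.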
Database Activity:
PostgreSQL audit logs show frequent DELETE operations on the "uploads" and "upload_parts" tables. This suggests the cleanup processes are working, yet connections are still being terminated during regular operations.
DELETE FROM "uploads" WHERE "id" = $1
DELETE FROM ONLY "public"."upload_parts" WHERE $1 OPERATOR(pg_catalog.=) "upload_id"
Impact:
Cache operations fail with 500 errors
GitHub Actions workflows experience cache misses
Service becomes unreliable under load
Helm config:
replicaCount: 2
service:
  type: ClusterIP
  port: 80
resources:
  limits:
    cpu: 1500m
    memory: 3Gi
  requests:
    cpu: 1000m
    memory: 2.5Gi
livenessProbe:
  httpGet:
    path: /
    port: cache
  initialDelaySeconds: 60
  timeoutSeconds: 30
  periodSeconds: 30
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /
    port: cache
  initialDelaySeconds: 60
  timeoutSeconds: 30
  periodSeconds: 30
  failureThreshold: 5
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 5
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 70
  scaleDownStabilizationWindowSeconds: 600
persistentVolumeClaim:
  enabled: true
  template:
    metadata:
      name: cache-data
      labels: {}
      annotations: {}
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
      storageClassName: efs-sc
tmpVolume:
  ephemeral:
    volumeClaimTemplate:
      metadata:
        labels:
          type: github-actions-cache-server-tmp
      spec:
        accessModes:
          - ReadWriteMany
        storageClassName: efs-sc
        resources:
          requests:
            storage: 10Gi
env:
  name: ENABLE_DIRECT_DOWNLOADS
  value: "true"