2 changes: 2 additions & 0 deletions .gitignore
@@ -15,3 +15,5 @@ test-results
*.goe
**/kuttl-test.json
**/kubeconfig
AGENT.md
.gitlab-ci.yml
246 changes: 246 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,94 @@ handling on your side.

## Unreleased

## 2025-09-23 - Release v100.0.11

### Fixes/Improvements

- health: Fail liveness on sustained port exhaustion so the pod is automatically restarted
  - `/healthz` returns 503 when EADDRNOTAVAIL errors exceed the threshold within the window
  - Enabled by default; controlled via environment variables:
    - `HEALTH_FAIL_ON_PORT_EXHAUSTION` (default: true)
    - `PORT_EXHAUSTION_WINDOW` (default: 60s)
    - `PORT_EXHAUSTION_THRESHOLD` (default: 8)
- registry: Record EADDRNOTAVAIL in the HTTP transport; a sliding-window tracker powers the health gate (a sketch follows this list)
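A minimal sketch of how such a sliding-window gate can be built; the `/healthz` behavior, the 503 status, and the env defaults above come from this release, while the type and function names here are illustrative:

```go
package health

import (
	"net/http"
	"sync"
	"time"
)

// errorWindow tracks EADDRNOTAVAIL occurrences and reports whether the
// threshold was exceeded within the trailing window.
type errorWindow struct {
	mu        sync.Mutex
	window    time.Duration // PORT_EXHAUSTION_WINDOW, default 60s
	threshold int           // PORT_EXHAUSTION_THRESHOLD, default 8
	stamps    []time.Time
}

// Record is called by the HTTP transport whenever a dial fails with
// "cannot assign requested address" (EADDRNOTAVAIL).
func (w *errorWindow) Record() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.stamps = append(w.stamps, time.Now())
}

// Exceeded prunes timestamps older than the window, then checks the threshold.
func (w *errorWindow) Exceeded() bool {
	w.mu.Lock()
	defer w.mu.Unlock()
	cutoff := time.Now().Add(-w.window)
	kept := w.stamps[:0]
	for _, t := range w.stamps {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	w.stamps = kept
	return len(w.stamps) > w.threshold
}

// healthz responds 503 while the gate is tripped, so the kubelet's liveness
// probe restarts the pod.
func healthz(w *errorWindow) http.HandlerFunc {
	return func(rw http.ResponseWriter, _ *http.Request) {
		if w.Exceeded() {
			http.Error(rw, "sustained port exhaustion", http.StatusServiceUnavailable)
			return
		}
		rw.WriteHeader(http.StatusOK)
	}
}
```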

### Notes

- No manifest changes required; your existing livenessProbe on `/healthz` will now trigger a restart when ports are exhausted.

## 2025-09-23 - Release v100.0.10a

### Fixes/Improvements

- registry: Add a periodic HTTP transport janitor that proactively closes idle connections (a sketch follows this list)
  - New helper `StartTransportJanitor(interval)` runs `CloseIdleConnections()` across all cached transports
  - Wired into both the `run` and `webhook` commands; stops gracefully on shutdown
  - Default interval: 5m; tunable via `REGISTRY_TRANSPORT_JANITOR_INTERVAL` (set `0` to disable)
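Under the hood this is likely little more than a ticker that sweeps the transport cache; a sketch under that assumption (`StartTransportJanitor` and `CloseIdleConnections` come from the notes above, the cache variable and locking are illustrative):

```go
package registry

import (
	"net/http"
	"sync"
	"time"
)

var (
	transportMu    sync.RWMutex
	transportCache = map[string]*http.Transport{} // keyed by registry + TLS mode
)

// StartTransportJanitor closes idle connections on every cached transport at
// the given interval. The returned stop function ends the goroutine; an
// interval of 0 disables the janitor entirely.
func StartTransportJanitor(interval time.Duration) (stop func()) {
	if interval <= 0 {
		return func() {}
	}
	done := make(chan struct{})
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				transportMu.RLock()
				for _, tr := range transportCache {
					tr.CloseIdleConnections() // drop idle sockets so they can't go stale
				}
				transportMu.RUnlock()
			case <-done:
				return
			}
		}
	}()
	return func() { close(done) }
}
```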

### Rationale

- Mitigates long-running cluster issues where outbound dials eventually fail with
`connect: cannot assign requested address` (ephemeral port/SNAT exhaustion) by
improving connection reuse and cleaning up idle sockets over time.

### Notes

- This complements existing changes: shared transports per registry, tuned
`MaxConnsPerHost`/idle pools/timeouts, and per‑registry in‑flight caps.

## 2025-09-19 - Release v100.0.9a

### Changes

- registry: Increase default per-attempt timeouts to 60s for tag and manifest fetches
- registry: Make per-attempt timeouts env-tunable via `REGISTRY_TAG_TIMEOUT` and `REGISTRY_MANIFEST_TIMEOUT`
- http transport: Default `ResponseHeaderTimeout` raised to 60s; env-tunable via `REGISTRY_RESPONSE_HEADER_TIMEOUT`
- docs: Document new envs and updated defaults

### Notes

- If your registries are occasionally slow under load, set `REGISTRY_TAG_TIMEOUT=90s`, `REGISTRY_MANIFEST_TIMEOUT=90s`, and `REGISTRY_RESPONSE_HEADER_TIMEOUT=90s` to tolerate longer server delays. Consider also lowering concurrency and adding per‑registry rate limits. A sketch of how such duration envs can be parsed follows.
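For reference, a hypothetical helper showing how duration envs like these are typically read; the actual function name and fallback behavior in the codebase may differ:

```go
package env

import (
	"os"
	"time"
)

// durationFromEnv returns the parsed duration from the named variable,
// falling back to def when it is unset or malformed.
func durationFromEnv(name string, def time.Duration) time.Duration {
	v := os.Getenv(name)
	if v == "" {
		return def
	}
	d, err := time.ParseDuration(v) // accepts values like "60s", "90s", "2m"
	if err != nil {
		return def
	}
	return d
}

// e.g. the per-attempt tag fetch timeout, with the documented 60s default
var tagTimeout = durationFromEnv("REGISTRY_TAG_TIMEOUT", 60*time.Second)
```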

## 2025-09-19 - Release v100.0.8a

### Changes

- registry-scanner: Stabilize JWT auth dedupe and retries; add nil-guards around metrics to avoid panics in tests
- registry-scanner: Fix the jittered exponential backoff math for retries (see the sketch after this list)
- tests(registry): Add JWT singleflight, different-scopes, and retry/backoff tests; reset the Prometheus registry per test
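The release notes don't spell out the corrected formula; one common formulation is "full jitter" (sleep a uniform random duration in [0, base·2^attempt], capped), sketched here as an assumption:

```go
package retry

import (
	"math/rand"
	"time"
)

// backoff returns the delay before the given zero-based retry attempt.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base << uint(attempt) // base * 2^attempt
	if d <= 0 || d > max {     // d <= 0 guards against shift overflow
		d = max
	}
	// Full jitter: uniform in [0, d]. Spreads retries out so callers that
	// failed together don't retry together.
	return time.Duration(rand.Int63n(int64(d) + 1))
}
```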

### Notes

- This release contains the JWT singleflight + authorizer transport cache improvements; ensure you update the embedded `registry-scanner` module.

## 2025-09-19 - Release v100.0.7

### Fixes

- fix(continuous): initialize Argo client when warm-up is disabled (`--warmup-cache=false`) to prevent panic in `runContinuousOnce`

### Changes

- scheduler: continuous tick cadence set to ~1s (from ~100ms)
- docs: clarify boolean flag usage for `--warmup-cache=false`

### Notes

- If you disable warm-up, continuous mode starts immediately; each ~1s tick lists and filters applications, then dispatches those that are due (a simplified sketch follows). Unsupported applications are skipped.
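A simplified sketch of that tick loop; `listApps`, `isDue`, and `dispatch` are illustrative names, not the project's actual API:

```go
package scheduler

import (
	"context"
	"time"
)

// runContinuous lists and filters applications on every tick, dispatching
// only those whose per-app --interval has elapsed.
func runContinuous(ctx context.Context, listApps func() []string, isDue func(string) bool, dispatch func(string)) {
	ticker := time.NewTicker(1 * time.Second) // the ~1s cadence from this release
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, app := range listApps() { // unsupported apps are filtered out here
				if isDue(app) {
					go dispatch(app) // concurrency caps and in-flight guards apply inside
				}
			}
		}
	}
}
```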

## 2025-09-19 - Release v100.0.6a

### Changes

- scheduler(continuous): increase tick cadence from ~100ms to ~1s to reduce log noise and API/list pressure; no change to per-app `--interval` gating
- docs(readme): remove Mermaid diagram; add ASCII architecture; add rate limiting/backpressure section; add phase comparison table (stock vs tuned)

### Notes

- Behavior impact: only the scheduler’s discovery cadence changes; application dispatch still respects `--interval`, in-flight guards, fairness (LRU/fail-first, cooldown, per-repo-cap), and concurrency caps.
- Recommended: if startup delay is undesirable, run with `--warmup-cache=false`.

### Upgrade notes (no really, you MUST read this)

* **Attention**: By default, `argocd-image-updater` now uses the K8s API to retrieve applications instead of the Argo CD API. It is also now recommended to install it in the same namespace that Argo CD runs in (`argocd` by default). For existing installations, which are running in a dedicated namespace.
@@ -29,6 +117,164 @@ handling on your side.

* refactor: make argocd-image-updater-config volume mapping optional (#145)


## 2025-09-18 - Release v100.0.5a

### Fixes

- fix(git): Prevent panic in the batched writer when `GetCreds` is nil or the write-back method is not Git
  - Only enqueue batched writes when `wbc.Method == git`
  - Guard in `repoWriter.commitBatch` against a missing `GetCreds` (skip with a log message)

### Tests

- test(git): Strengthen batched writer test to set `Method: WriteBackGit` and provide `GetCreds` stub, so missing-GetCreds would fail tests

### Notes

- No flags or defaults changed; safe upgrade from v100.0.4a

## 2025-09-18 - Release v100.0.4a

### Changes

- test(git): Add unit test verifying batched writer flushes per-branch (monorepo safety)
- fix(git): Guard `getWriteBackBranch` against nil Application source
- docs: Clarify `--max-concurrency=0` (auto) in README quick reference

### Notes

- All existing tests pass. No changes to defaults or flags.

## 2025-09-18 - Release v100.0.3a

### Highlights

- Continuous mode: per-app scheduling with independent timers (no full-cycle waits)
- Auto concurrency: `--max-concurrency=0` computes workers from CPUs/apps
- Robust registry auth and I/O: singleflight + retries with backoff on `/jwt/auth`, tag and manifest operations
- Safer connection handling: transport reuse, tuned timeouts, per‑registry in‑flight caps
- Git efficiency: per‑repo batched writer + retries
- Deep metrics: apps, cycles, registry, JWT

### New features

- feat(mode): `--mode=continuous` (default remains `cycle`)
- feat(concurrency): `--max-concurrency=0` for auto sizing
- feat(schedule): LRU / fail-first with `--schedule`; fairness with `--per-repo-cap`, `--cooldown`
- feat(auth): JWT `/jwt/auth` retries with backoff (singleflight dedupe; see the sketch after this list)
  - Env: `REGISTRY_JWT_ATTEMPTS` (default 7), `REGISTRY_JWT_RETRY_BASE` (200ms), `REGISTRY_JWT_RETRY_MAX` (3s)
- feat(metrics): Per-application timings and state
  - `argocd_image_updater_application_update_duration_seconds{application}`
  - `argocd_image_updater_application_last_attempt_timestamp{application}`
  - `argocd_image_updater_application_last_success_timestamp{application}`
  - `argocd_image_updater_images_considered_total{application}`
  - `argocd_image_updater_images_skipped_total{application}`
  - `argocd_image_updater_scheduler_skipped_total{reason}`
- feat(metrics): Cycle timing
  - `argocd_image_updater_update_cycle_duration_seconds`
  - `argocd_image_updater_update_cycle_last_end_timestamp`
- feat(metrics): Registry visibility
  - `argocd_image_updater_registry_in_flight_requests{registry}`
  - `argocd_image_updater_registry_request_duration_seconds{registry}`
  - `argocd_image_updater_registry_http_status_total{registry,code}`
  - `argocd_image_updater_registry_request_retries_total{registry,op}`
  - `argocd_image_updater_registry_errors_total{registry,kind}`
- feat(metrics): Singleflight effectiveness
  - `argocd_image_updater_singleflight_leaders_total{kind}`
  - `argocd_image_updater_singleflight_followers_total{kind}`
- feat(metrics): JWT visibility
  - `argocd_image_updater_registry_jwt_auth_requests_total{registry,service,scope}`
  - `argocd_image_updater_registry_jwt_auth_errors_total{registry,service,scope,reason}`
  - `argocd_image_updater_registry_jwt_auth_duration_seconds{registry,service,scope}`
  - `argocd_image_updater_registry_jwt_token_ttl_seconds{registry,service,scope}`
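The singleflight dedupe on `/jwt/auth` presumably follows the standard `golang.org/x/sync/singleflight` pattern; a sketch under that assumption (`token`, `fetchToken`, and the key shape are illustrative):

```go
package auth

import "golang.org/x/sync/singleflight"

var group singleflight.Group

// token deduplicates concurrent requests for the same (registry, service,
// scope) tuple: one caller becomes the leader and performs the real
// /jwt/auth call (with retries); the rest wait for and share its result.
func token(registry, service, scope string, fetchToken func() (string, error)) (string, error) {
	key := registry + "|" + service + "|" + scope
	v, err, _ := group.Do(key, func() (interface{}, error) {
		t, err := fetchToken()
		return t, err
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}
```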

### Improvements

- perf(registry): HTTP transport reuse; tuned `MaxIdleConns`, `MaxIdleConnsPerHost`, `MaxConnsPerHost`; response and handshake timeouts
- perf(registry): Per‑registry in‑flight cap to prevent connection storms (sketched after this list)
- resiliency(registry): Jittered retries for tags/manifests; `/jwt/auth` retries with backoff
- perf(git): Batched per‑repo writer; retries for fetch/shallow-fetch/push
- sched: Fairness via LRU/fail-first, cooldown, and per-repo caps
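A per-registry in-flight cap can be as small as a counting semaphore wrapped around the HTTP transport; a sketch, with the default of 15 taken from the list below and everything else illustrative:

```go
package registry

import "net/http"

// limitedTransport caps concurrent requests to one registry; excess callers
// block on the semaphore and queue instead of dialing new sockets.
type limitedTransport struct {
	base http.RoundTripper
	sem  chan struct{} // buffered channel as counting semaphore
}

func newLimitedTransport(base http.RoundTripper, limit int) *limitedTransport {
	return &limitedTransport{base: base, sem: make(chan struct{}, limit)} // limit = 15 by default
}

func (t *limitedTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	t.sem <- struct{}{}        // acquire a slot; blocks when the cap is reached
	defer func() { <-t.sem }() // release; a production version might hold the slot until the body is closed
	return t.base.RoundTrip(req)
}
```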

### Defaults enabled (no flags)

- Transport reuse and tuned timeouts
- Per‑registry in‑flight cap (default 15)
- Authorizer cache per (registry, repo)
- Singleflight on tags, manifests, and `/jwt/auth`
- Retries: tags/manifests (3x), JWT auth (defaults above)
- Git retries (env-overridable); Batched writer (disable via `GIT_BATCH_DISABLE=true`)

### Docs

- docs(install): Performance flags and defaults (continuous mode, auto concurrency, JWT retry envs)
- docs(metrics): Expanded metrics section

### Tests

- test: Unit tests for transport caching, metrics wrappers, continuous scheduler basics, and end-to-end build

### Known issues

- Under very high concurrency and bursty load, upstream registry/SNAT limits may still cause intermittent timeouts. The new caps, retries, and singleflight significantly reduce the impact; tune per‑registry limits and consider HTTP/2 where available.

## 2025-09-17 - Release v99.9.9 - 66de072

### New features

* feat: Reuse HTTP transports for registries with keep-alives and timeouts
* feat: Initialize registry refresh-token map to enable token reuse
* feat: Add Makefile `DOCKER` variable to support `podman`

### Improvements

* perf: Cache transports per registry+TLS mode; add sensible connection/timeouts
* resiliency: Retry/backoff for registry tag listing
* resiliency: Retry/backoff for git fetch/shallow-fetch/push during write-back

### Tests/Docs

* test: Add unit tests for transport caching and token map init
* docs: Requirements/notes updates

### Upgrade notes

* None

### Bug fixes

* None

### Bugs

* Under very high concurrency (300–500), nodes may hit ephemeral port exhaustion after 2–3 hours, causing registry dials to fail:

Example error observed:

`dial tcp 10.2.163.141:5000: connect: cannot assign requested address`

Notes:
- This typically manifests across all registries simultaneously under heavy outbound connection churn.
- Root cause is excessive parallel dials combined with short‑lived connections (TIME_WAIT buildup), not a specific registry outage.
- Mitigations available in v100.0.0a: larger keep‑alive pools, lower `MaxConnsPerHost`, and the ability to close idle connections on cache clear. Operational mitigations: reduce updater concurrency and/or per‑registry limits (e.g., 500→250; 50 rps→20–30 rps) while investigating.

Details:
- Old ports are “released” only after TIME_WAIT (2MSL). With HTTP/1.1 and big bursts, you create more concurrent outbound sockets than the ephemeral range can recycle before TIME_WAIT expires, so you hit “cannot assign requested address” even though old sockets eventually close.
- Why it still happens under 250/100 RPS:
  - Each new dial consumes a unique local ephemeral port to the same destination tuple. TIME_WAIT lasts ~60–120s (kernel dependent); bursty concurrency plus a short interval means you outpace port reuse.
  - Go HTTP/1.1 doesn’t pipeline; reuse works only if there’s an idle kept‑alive socket. If many goroutines need sockets at once, you dial anyway.
  - Often compounded by SNAT limits at the node (Kubernetes egress): a per‑destination NAT port cap can exhaust even faster.
- How to confirm quickly:
  - Check TIME_WAIT to the registry IP:port: `ss -antp | grep :5000 | grep TIME_WAIT | wc -l`
  - Check the ephemeral range: `sysctl net.ipv4.ip_local_port_range`
  - In Kubernetes, inspect node SNAT usage (some clouds cap SNAT ports per node/destination).
- What fixes it (software‑side, regardless of kernel/NAT tuning; see the transport sketch after this list):
  - Add a hard per‑registry in‑flight cap (e.g., 10–15) so requests queue instead of dialing new sockets.
  - Lower `MaxConnsPerHost` further (e.g., 15). Keep large idle pools to maximize reuse.
  - Add jitter to scheduling (avoid synchronized bursts); consider a 30s interval over 15s.
  - If the registry supports HTTP/2 over TLS, H2 multiplexing drastically reduces socket use.
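Putting those points together, an illustrative `http.Transport` tuned along these lines; the specific values are examples consistent with the advice above, not the project's shipped defaults:

```go
package registry

import (
	"net"
	"net/http"
	"time"
)

func newTunedTransport() *http.Transport {
	return &http.Transport{
		DialContext: (&net.Dialer{
			Timeout:   10 * time.Second,
			KeepAlive: 30 * time.Second, // keep sockets alive so they can be reused
		}).DialContext,
		MaxIdleConns:          100, // large idle pool maximizes reuse
		MaxIdleConnsPerHost:   50,
		MaxConnsPerHost:       15, // hard cap: requests queue instead of dialing new sockets
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 60 * time.Second,
		ForceAttemptHTTP2:     true, // H2 multiplexing drastically reduces socket count over TLS
	}
}
```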

## 2020-12-06 - Release v0.8.0

### Upgrade notes (no really, you MUST read this)
4 changes: 2 additions & 2 deletions Dockerfile
@@ -2,10 +2,10 @@ FROM golang:1.24 AS builder

RUN mkdir -p /src/argocd-image-updater
WORKDIR /src/argocd-image-updater
# copy the entire repo first so local replaces (./registry-scanner) exist in context
COPY . .
# cache dependencies as a layer for faster rebuilds
COPY go.mod go.sum ./
RUN go mod download
COPY . .

RUN mkdir -p dist && \
make controller
23 changes: 13 additions & 10 deletions Makefile
@@ -12,6 +12,9 @@ ARCH?=$(shell go env GOARCH)
OUTDIR?=dist
BINNAME?=argocd-image-updater

# Container runtime (override with DOCKER=podman)
DOCKER?=docker

CURRENT_DIR=$(shell pwd)
VERSION=$(shell cat ${CURRENT_DIR}/VERSION)
GIT_COMMIT=$(shell git rev-parse HEAD)
@@ -87,14 +90,14 @@ controller:

.PHONY: image
image: clean-image
docker build \
${DOCKER} build \
-t ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG} \
--pull \
.

.PHONY: multiarch-image
multiarch-image:
docker buildx build \
${DOCKER} buildx build \
-t ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG} \
--progress plain \
--pull \
@@ -103,7 +106,7 @@ multiarch-image:

.PHONY: multiarch-image-push
multiarch-image-push:
docker buildx build \
${DOCKER} buildx build \
-t ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG} \
--progress plain \
--pull \
@@ -113,7 +116,7 @@ multiarch-image-push:

.PHONY: image-push
image-push: image
docker push ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG}
${DOCKER} push ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG}

.PHONY: release-binaries
release-binaries:
@@ -130,10 +133,10 @@ release-binaries:

.PHONY: extract-binary
extract-binary:
docker rm argocd-image-updater-${IMAGE_TAG} || true
docker create --name argocd-image-updater-${IMAGE_TAG} ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG}
docker cp argocd-image-updater-${IMAGE_TAG}:/usr/local/bin/argocd-image-updater /tmp/argocd-image-updater_${IMAGE_TAG}_linux-amd64
docker rm argocd-image-updater-${IMAGE_TAG}
${DOCKER} rm argocd-image-updater-${IMAGE_TAG} || true
${DOCKER} create --name argocd-image-updater-${IMAGE_TAG} ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG}
${DOCKER} cp argocd-image-updater-${IMAGE_TAG}:/usr/local/bin/argocd-image-updater /tmp/argocd-image-updater_${IMAGE_TAG}_linux-amd64
${DOCKER} rm argocd-image-updater-${IMAGE_TAG}

.PHONY: lint
lint:
@@ -148,7 +151,7 @@ codegen: manifests

.PHONY: run-test
run-test:
docker run -v $(HOME)/.kube:/kube --rm -it \
${DOCKER} run -v $(HOME)/.kube:/kube --rm -it \
-e ARGOCD_TOKEN \
${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG} \
--kubeconfig /kube/config \
@@ -157,5 +160,5 @@ run-test:

.PHONY: serve-docs
serve-docs:
docker run ${MKDOCS_RUN_ARGS} --rm -it -p 8000:8000 -v ${CURRENT_DIR}:/docs ${MKDOCS_DOCKER_IMAGE} serve -a 0.0.0.0:8000
${DOCKER} run ${MKDOCS_RUN_ARGS} --rm -it -p 8000:8000 -v ${CURRENT_DIR}:/docs ${MKDOCS_DOCKER_IMAGE} serve -a 0.0.0.0:8000
