Commit 2c24eb8

Author: Eidmantas Ivanauskas

feat: transport/timeouts/singleflight/jwt-retries + continuous scheduler, batched git writer, docs; release v100.0.6a
1 parent 5f7bb29 commit 2c24eb8

24 files changed: +1808 -592 lines changed

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -15,3 +15,5 @@ test-results
 *.goe
 **/kuttl-test.json
 **/kubeconfig
+AGENT.md
+.gitlab-ci.yml

CHANGELOG.md

Lines changed: 170 additions & 0 deletions
@@ -6,6 +6,18 @@ handling on your side.
 
 ## Unreleased
 
+## 2025-09-19 - Release v100.0.6a
+
+### Changes
+
+- scheduler(continuous): increase tick cadence from ~100ms to ~1s to reduce log noise and API/list pressure; no change to per-app `--interval` gating
+- docs(readme): remove Mermaid diagram; add ASCII architecture; add rate limiting/backpressure section; add phase comparison table (stock vs tuned)
+
+### Notes
+
+- Behavior impact: only the scheduler’s discovery cadence changes; application dispatch still respects `--interval`, in-flight guards, fairness (LRU/fail-first, cooldown, per-repo-cap), and concurrency caps.
+- Recommended: if startup delay is undesirable, run with `--warmup-cache=false`.
+
 ### Upgrade notes (no really, you MUST read this)
 
 * **Attention**: By default, `argocd-image-updater` now uses the K8s API to retrieve applications, instead of the Argo CD API. Also, it is now recommended to install in the same namespace as Argo CD is running in (`argocd` by default). For existing installations, which are running in a dedicated namespace.
@@ -29,6 +41,164 @@ handling on your side.
 
 * refactor: make argocd-image-updater-config volume mapping optional (#145)
 
+
+## 2025-09-18 - Release v100.0.5a
+
+### Fixes
+
+- fix(git): Prevent panic in batched writer when `GetCreds` is nil or write-back method is not Git
+  - Only enqueue batched writes when `wbc.Method == git`
+  - Guard in `repoWriter.commitBatch` for missing `GetCreds` (skip with log)
+
+### Tests
+
+- test(git): Strengthen batched writer test to set `Method: WriteBackGit` and provide `GetCreds` stub, so missing-GetCreds would fail tests
+
+### Notes
+
+- No flags or defaults changed; safe upgrade from v100.0.4a
+
+## 2025-09-18 - Release v100.0.4a
+
+### Changes
+
+- test(git): Add unit test verifying batched writer flushes per-branch (monorepo safety)
+- fix(git): Guard `getWriteBackBranch` against nil Application source
+- docs: Clarify `--max-concurrency=0` (auto) in README quick reference
+
+### Notes
+
+- All existing tests pass. No changes to defaults or flags.
+
+## 2025-09-18 - Release v100.0.3a
+
+### Highlights
+
+- Continuous mode: per-app scheduling with independent timers (no full-cycle waits)
+- Auto concurrency: `--max-concurrency=0` computes workers from CPUs/apps
+- Robust registry auth and I/O: singleflight + retries with backoff on `/jwt/auth`, tag and manifest operations
+- Safer connection handling: transport reuse, tuned timeouts, per‑registry in‑flight caps
+- Git efficiency: per‑repo batched writer + retries
+- Deep metrics: apps, cycles, registry, JWT
+
+### New features
+
+- feat(mode): `--mode=continuous` (default remains `cycle`)
+- feat(concurrency): `--max-concurrency=0` for auto sizing
+- feat(schedule): LRU / fail-first with `--schedule`; fairness with `--per-repo-cap`, `--cooldown`
+- feat(auth): JWT `/jwt/auth` retries with backoff (singleflight dedupe)
+  - Env: `REGISTRY_JWT_ATTEMPTS` (default 7), `REGISTRY_JWT_RETRY_BASE` (200ms), `REGISTRY_JWT_RETRY_MAX` (3s)
+- feat(metrics): Per-application timings and state
+  - `argocd_image_updater_application_update_duration_seconds{application}`
+  - `argocd_image_updater_application_last_attempt_timestamp{application}`
+  - `argocd_image_updater_application_last_success_timestamp{application}`
+  - `argocd_image_updater_images_considered_total{application}`
+  - `argocd_image_updater_images_skipped_total{application}`
+  - `argocd_image_updater_scheduler_skipped_total{reason}`
+- feat(metrics): Cycle timing
+  - `argocd_image_updater_update_cycle_duration_seconds`
+  - `argocd_image_updater_update_cycle_last_end_timestamp`
+- feat(metrics): Registry visibility
+  - `argocd_image_updater_registry_in_flight_requests{registry}`
+  - `argocd_image_updater_registry_request_duration_seconds{registry}`
+  - `argocd_image_updater_registry_http_status_total{registry,code}`
+  - `argocd_image_updater_registry_request_retries_total{registry,op}`
+  - `argocd_image_updater_registry_errors_total{registry,kind}`
+- feat(metrics): Singleflight effectiveness
+  - `argocd_image_updater_singleflight_leaders_total{kind}`
+  - `argocd_image_updater_singleflight_followers_total{kind}`
+- feat(metrics): JWT visibility
+  - `argocd_image_updater_registry_jwt_auth_requests_total{registry,service,scope}`
+  - `argocd_image_updater_registry_jwt_auth_errors_total{registry,service,scope,reason}`
+  - `argocd_image_updater_registry_jwt_auth_duration_seconds{registry,service,scope}`
+  - `argocd_image_updater_registry_jwt_token_ttl_seconds{registry,service,scope}`
+
+### Improvements
+
+- perf(registry): HTTP transport reuse; tuned `MaxIdleConns`, `MaxIdleConnsPerHost`, `MaxConnsPerHost`; response and handshake timeouts
+- perf(registry): Per‑registry in‑flight cap to prevent connection storms
+- resiliency(registry): Jittered retries for tags/manifests; `/jwt/auth` retries with backoff
+- perf(git): Batched per‑repo writer; retries for fetch/shallow-fetch/push
+- sched: Fairness via LRU/fail-first, cooldown, and per-repo caps
+
+### Defaults enabled (no flags)
+
+- Transport reuse and tuned timeouts
+- Per‑registry in‑flight cap (default 15)
+- Authorizer cache per (registry, repo)
+- Singleflight on tags, manifests, and `/jwt/auth`
+- Retries: tags/manifests (3x), JWT auth (defaults above)
+- Git retries (env-overridable); Batched writer (disable via `GIT_BATCH_DISABLE=true`)
+
+### Docs
+
+- docs(install): Performance flags and defaults (continuous mode, auto concurrency, JWT retry envs)
+- docs(metrics): Expanded metrics section
+
+### Tests
+
+- test: Unit tests for transport caching, metrics wrappers, continuous scheduler basics, and end-to-end build
+
+### Known issues
+
+- Under very high concurrency and bursty load, upstream registry/SNAT limits may still cause intermittent timeouts. The new caps, retries, and singleflight significantly reduce impact; tune per‑registry limits and consider HTTP/2 where available.
+
+## 2025-09-17 - Release v99.9.9 - 66de072
+
+### New features
+
+* feat: Reuse HTTP transports for registries with keep-alives and timeouts
+* feat: Initialize registry refresh-token map to enable token reuse
+* feat: Add Makefile `DOCKER` variable to support `podman`
+
+### Improvements
+
+* perf: Cache transports per registry+TLS mode; add sensible connection/timeouts
+* resiliency: Retry/backoff for registry tag listing
+* resiliency: Retry/backoff for git fetch/shallow-fetch/push during write-back
+
+### Tests/Docs
+
+* test: Add unit tests for transport caching and token map init
+* docs: Requirements/notes updates
+
+### Upgrade notes
+
+* None
+
+### Bug fixes
+
+* None
+
+### Bugs
+
+* Under very high concurrency (300–500) after 2–3 hours, nodes may hit ephemeral port exhaustion causing registry dials to fail:
+
+Example error observed:
+
+`dial tcp 10.2.163.141:5000: connect: cannot assign requested address`
+
+Notes:
+- This typically manifests across all registries simultaneously under heavy outbound connection churn.
+- Root cause is excessive parallel dials combined with short‑lived connections (TIME_WAIT buildup), not a specific registry outage.
+- Mitigations available in v100.0.0a: larger keep‑alive pools, lower MaxConnsPerHost, and ability to close idle on cache clear. Operational mitigations: reduce updater concurrency and/or per‑registry limits (e.g., 500→250; 50 rps→20–30 rps) while investigating.
+
+Details:
+- Old ports are “released” only after TIME_WAIT (2MSL). With HTTP/1.1 and big bursts, you create more concurrent outbound sockets than the ephemeral range can recycle before TIME_WAIT expires, so you hit “cannot assign requested address” even though old sockets eventually close.
+- Why it still happens under 250/100 RPS:
+  - Each new dial consumes a unique local ephemeral port to the same dst tuple. TIME_WAIT lasts ~60–120s (kernel dependent). Bursty concurrency + short interval means you outpace reuse.
+  - Go HTTP/1.1 doesn’t pipeline; reuse works only if there’s an idle kept‑alive socket. If many goroutines need sockets at once, you dial anyway.
+  - Often compounded by SNAT limits at the node (Kubernetes egress): per‑dst NAT port cap can exhaust even faster.
+- How to confirm quickly:
+  - Check TIME_WAIT to the registry IP:port: `ss -antp | grep :5000 | grep TIME_WAIT | wc -l`
+  - Check ephemeral range: `sysctl net.ipv4.ip_local_port_range`
+  - In Kubernetes, inspect node SNAT usage (some clouds cap SNAT ports per node/destination).
+- What fixes it (software‑side, regardless of kernel/NAT tuning):
+  - Add a hard per‑registry in‑flight cap (e.g., 10–15) so requests queue instead of dialing new sockets.
+  - Lower `MaxConnsPerHost` further (e.g., 15). Keep large idle pools to maximize reuse.
+  - Add jitter to scheduling (avoid synchronized bursts); consider 30s interval over 15s.
+  - If the registry supports HTTP/2 over TLS, H2 multiplexing drastically reduces sockets.
+
 ## 2020-12-06 - Release v0.8.0
 
 ### Upgrade notes (no really, you MUST read this)

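The "What fixes it" list in the v99.9.9 entry recommends a hard per-registry in-flight cap plus a lower `MaxConnsPerHost`, so that requests queue instead of dialing new sockets. The sketch below shows one way to express both in Go, using a buffered channel as a semaphore. It is an illustration under stated assumptions, not the code from this commit: `limiter`, `newLimiter`, `acquire`, `newTransport` and `peakInFlight` are hypothetical names, the registry address is a placeholder, and every transport value other than the cap of 15 is an arbitrary example.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// limiter caps concurrent requests per registry. Each registry gets a
// buffered channel that acts as a counting semaphore.
type limiter struct {
	mu    sync.Mutex
	limit int
	sems  map[string]chan struct{}
}

func newLimiter(perRegistry int) *limiter {
	return &limiter{limit: perRegistry, sems: make(map[string]chan struct{})}
}

// acquire blocks until a slot for registry is free and returns a function
// that releases the slot.
func (l *limiter) acquire(registry string) func() {
	l.mu.Lock()
	sem, ok := l.sems[registry]
	if !ok {
		sem = make(chan struct{}, l.limit)
		l.sems[registry] = sem
	}
	l.mu.Unlock()
	sem <- struct{}{}
	return func() { <-sem }
}

// newTransport builds a reusable transport whose per-host connection limit
// matches the in-flight cap. The other values are illustrative only.
func newTransport(maxConns int) *http.Transport {
	return &http.Transport{
		MaxIdleConns:          256,
		MaxIdleConnsPerHost:   maxConns,
		MaxConnsPerHost:       maxConns,
		IdleConnTimeout:       90 * time.Second,
		TLSHandshakeTimeout:   10 * time.Second,
		ResponseHeaderTimeout: 30 * time.Second,
	}
}

// peakInFlight runs workers simulated requests against one registry and
// returns the highest concurrency observed while holding a slot.
func peakInFlight(workers, limit int) int {
	l := newLimiter(limit)
	var (
		mu        sync.Mutex
		cur, peak int
		wg        sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			release := l.acquire("registry.example:5000") // placeholder address
			defer release()
			mu.Lock()
			cur++
			if cur > peak {
				peak = cur
			}
			mu.Unlock()
			time.Sleep(time.Millisecond) // stand-in for a registry request
			mu.Lock()
			cur--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak in-flight:", peakInFlight(50, 15))
	fmt.Println("MaxConnsPerHost:", newTransport(15).MaxConnsPerHost)
}
```

In this arrangement the semaphore bounds how many goroutines can ask for a socket at once, and `MaxConnsPerHost` makes the transport block further dials at the same limit, which is the queue-instead-of-dial behavior the entry describes.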
Makefile

Lines changed: 13 additions & 10 deletions
@@ -12,6 +12,9 @@ ARCH?=$(shell go env GOARCH)
 OUTDIR?=dist
 BINNAME?=argocd-image-updater
 
+# Container runtime (override with DOCKER=podman)
+DOCKER?=docker
+
 CURRENT_DIR=$(shell pwd)
 VERSION=$(shell cat ${CURRENT_DIR}/VERSION)
 GIT_COMMIT=$(shell git rev-parse HEAD)
@@ -87,14 +90,14 @@ controller:
 
 .PHONY: image
 image: clean-image
-	docker build \
+	${DOCKER} build \
 		-t ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG} \
 		--pull \
 		.
 
 .PHONY: multiarch-image
 multiarch-image:
-	docker buildx build \
+	${DOCKER} buildx build \
 		-t ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG} \
 		--progress plain \
 		--pull \
@@ -103,7 +106,7 @@ multiarch-image:
 
 .PHONY: multiarch-image-push
 multiarch-image-push:
-	docker buildx build \
+	${DOCKER} buildx build \
 		-t ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG} \
 		--progress plain \
 		--pull \
@@ -113,7 +116,7 @@ multiarch-image-push:
 
 .PHONY: image-push
 image-push: image
-	docker push ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG}
+	${DOCKER} push ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG}
 
 .PHONY: release-binaries
 release-binaries:
@@ -130,10 +133,10 @@ release-binaries:
 
 .PHONY: extract-binary
 extract-binary:
-	docker rm argocd-image-updater-${IMAGE_TAG} || true
-	docker create --name argocd-image-updater-${IMAGE_TAG} ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG}
-	docker cp argocd-image-updater-${IMAGE_TAG}:/usr/local/bin/argocd-image-updater /tmp/argocd-image-updater_${IMAGE_TAG}_linux-amd64
-	docker rm argocd-image-updater-${IMAGE_TAG}
+	${DOCKER} rm argocd-image-updater-${IMAGE_TAG} || true
+	${DOCKER} create --name argocd-image-updater-${IMAGE_TAG} ${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG}
+	${DOCKER} cp argocd-image-updater-${IMAGE_TAG}:/usr/local/bin/argocd-image-updater /tmp/argocd-image-updater_${IMAGE_TAG}_linux-amd64
+	${DOCKER} rm argocd-image-updater-${IMAGE_TAG}
 
 .PHONY: lint
 lint:
@@ -148,7 +151,7 @@ codegen: manifests
 
 .PHONY: run-test
 run-test:
-	docker run -v $(HOME)/.kube:/kube --rm -it \
+	${DOCKER} run -v $(HOME)/.kube:/kube --rm -it \
 		-e ARGOCD_TOKEN \
 		${IMAGE_PREFIX}${IMAGE_NAME}:${IMAGE_TAG} \
 		--kubeconfig /kube/config \
@@ -157,5 +160,5 @@ run-test:
 
 .PHONY: serve-docs
 serve-docs:
-	docker run ${MKDOCS_RUN_ARGS} --rm -it -p 8000:8000 -v ${CURRENT_DIR}:/docs ${MKDOCS_DOCKER_IMAGE} serve -a 0.0.0.0:8000
+	${DOCKER} run ${MKDOCS_RUN_ARGS} --rm -it -p 8000:8000 -v ${CURRENT_DIR}:/docs ${MKDOCS_DOCKER_IMAGE} serve -a 0.0.0.0:8000
 