feat: skip cloud-init ready report and add standalone report_ready script#8056
feat: skip cloud-init ready report and add standalone report_ready script#8056awesomenix wants to merge 2 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a VHD-baked mechanism to suppress cloud-init’s built-in “ready” report and replaces it with an explicit, CSE-invoked readiness/failure report to Azure wireserver.
Changes:
- Add a standalone
report_ready.pyscript that writes Hyper-V KVP provisioning status and POSTs health to wireserver. - Bake
report_ready.pyinto multiple VHD build variants and copy it to/opt/azure/containers/. - Add
skipCloudInitReadyReportto write cloud-init config, and invoke the reporting script fromcse_start.sh.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| vhdbuilder/packer/vhd-image-builder-mariner.json | Adds report_ready.py to files uploaded into the build VM. |
| vhdbuilder/packer/vhd-image-builder-mariner-cvm.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-mariner-arm64.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-flatcar.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-flatcar-arm64.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-cvm.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-base.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-arm64-gen2.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/vhd-image-builder-acl.json | Same: stage report_ready.py into build VM. |
| vhdbuilder/packer/packer_source.sh | Copies report_ready.py into /opt/azure/containers/ with execute perms. |
| vhdbuilder/packer/install-dependencies.sh | Invokes skipCloudInitReadyReport during VHD build. |
| vhdbuilder/packer/imagecustomizer/azlosguard/azlosguard.yml | Ensures OSGuard imagecustomizer also places report_ready.py into /opt/azure/containers/. |
| parts/linux/cloud-init/artifacts/report_ready.py | New standalone readiness/failure reporting implementation. |
| parts/linux/cloud-init/artifacts/cse_start.sh | Calls report_ready.py on provisioning success/failure if present on the VHD. |
| parts/linux/cloud-init/artifacts/cse_install.sh | Adds skipCloudInitReadyReport() helper that writes cloud-init config to skip built-in ready report. |
| addMarinerNvidiaRepo | ||
| updateDnfWithNvidiaPkg | ||
| overrideNetworkConfig || exit 1 | ||
| skipCloudInitReadyReport || exit 1 |
There was a problem hiding this comment.
skipCloudInitReadyReport is already invoked unconditionally earlier in this script, so calling it again in the Mariner/AzureLinux block is redundant and makes the flow harder to reason about. Consider removing this second call to keep the configuration applied in a single place.
| skipCloudInitReadyReport || exit 1 |
| if [ -x /opt/azure/containers/report_ready.py ]; then | ||
| if [ "$EXIT_CODE" -eq 0 ]; then | ||
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" | ||
| else | ||
| python3 /opt/azure/containers/report_ready.py -v --failure --description "ExitCode: ${EXIT_CODE}, ${message_string}" || echo "WARNING: Failed to report failure to Azure fabric" | ||
| fi |
There was a problem hiding this comment.
This report_ready.py invocation runs synchronously before log upload/exit, and can block provisioning for up to ~100s on wireserver timeouts/retries (GET/POST timeouts are 30s with multiple retries). If this is intended to be best-effort (as suggested by || echo "WARNING"), consider running it in the background on success and/or moving it after upload_logs (especially on failure) or passing tighter retry/timeout settings to avoid delaying provisioning and log upload.
| if [ -x /opt/azure/containers/report_ready.py ]; then | ||
| if [ "$EXIT_CODE" -eq 0 ]; then | ||
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" | ||
| else | ||
| python3 /opt/azure/containers/report_ready.py -v --failure --description "ExitCode: ${EXIT_CODE}, ${message_string}" || echo "WARNING: Failed to report failure to Azure fabric" | ||
| fi |
There was a problem hiding this comment.
This change updates cloud-init/CSE scripts under parts/, which are covered by snapshot-style golden tests in pkg/agent/testdata/* (e.g., baker_test.go reads ./testdata/<folder>/CustomData). Please run make generate (or regenerate the testdata via the repo’s standard workflow) and include the updated golden files in this PR; otherwise CI is likely to fail due to mismatched expected CustomData/CSE outputs.
…ript Bake experimental_skip_ready_report into VHD via cloud.cfg.d to skip cloud-init's built-in health ready report to Azure fabric. Add a standalone Python script (report_ready.py) that can be invoked from CSE to report ready at the appropriate time during node provisioning. Depends on canonical/cloud-init#6771. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
89a690e to
ed3d302
Compare
vhdbuilder/packer/packer_source.sh
Outdated
| cpAndMode $CLOUD_INIT_BASE_PKG_SRC $CLOUD_INIT_BASE_PKG_DEST 0644 | ||
|
|
||
| CLOUD_INIT_AZURE_PKG_SRC=/home/packer/cloud-init-packages/cloud-init-azure_all.deb | ||
| CLOUD_INIT_AZURE_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init-azure_all.deb | ||
| cpAndMode $CLOUD_INIT_AZURE_PKG_SRC $CLOUD_INIT_AZURE_PKG_DEST 0644 | ||
|
|
||
| CLOUD_INIT_PKG_SRC=/home/packer/cloud-init-packages/cloud-init_all.deb | ||
| CLOUD_INIT_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init_all.deb | ||
| cpAndMode $CLOUD_INIT_PKG_SRC $CLOUD_INIT_PKG_DEST 0644 | ||
|
|
There was a problem hiding this comment.
copyPackerFiles() now unconditionally copies Ubuntu-only cloud-init .deb artifacts from /home/packer/cloud-init-packages into /opt/azure/cloud-init-packages. Those source files are only provisioned in vhd-image-builder-base.json, so Mariner/Flatcar/ACL packer templates will not have them and the VHD build will fail when copyPackerFiles runs. Guard these cpAndMode calls behind an Ubuntu check (and/or a file existence check) so non-Ubuntu builds don’t require these artifacts.
| cpAndMode $CLOUD_INIT_BASE_PKG_SRC $CLOUD_INIT_BASE_PKG_DEST 0644 | |
| CLOUD_INIT_AZURE_PKG_SRC=/home/packer/cloud-init-packages/cloud-init-azure_all.deb | |
| CLOUD_INIT_AZURE_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init-azure_all.deb | |
| cpAndMode $CLOUD_INIT_AZURE_PKG_SRC $CLOUD_INIT_AZURE_PKG_DEST 0644 | |
| CLOUD_INIT_PKG_SRC=/home/packer/cloud-init-packages/cloud-init_all.deb | |
| CLOUD_INIT_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init_all.deb | |
| cpAndMode $CLOUD_INIT_PKG_SRC $CLOUD_INIT_PKG_DEST 0644 | |
| CLOUD_INIT_AZURE_PKG_SRC=/home/packer/cloud-init-packages/cloud-init-azure_all.deb | |
| CLOUD_INIT_AZURE_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init-azure_all.deb | |
| CLOUD_INIT_PKG_SRC=/home/packer/cloud-init-packages/cloud-init_all.deb | |
| CLOUD_INIT_PKG_DEST=/opt/azure/cloud-init-packages/cloud-init_all.deb | |
| # Ubuntu-only cloud-init .deb artifacts may not be present for non-Ubuntu images. | |
| # Guard these copies so Mariner/Flatcar/ACL builds do not fail when the source | |
| # directory or files are missing. | |
| if [ -d "/home/packer/cloud-init-packages" ]; then | |
| cpAndMode $CLOUD_INIT_BASE_PKG_SRC $CLOUD_INIT_BASE_PKG_DEST 0644 | |
| cpAndMode $CLOUD_INIT_AZURE_PKG_SRC $CLOUD_INIT_AZURE_PKG_DEST 0644 | |
| cpAndMode $CLOUD_INIT_PKG_SRC $CLOUD_INIT_PKG_DEST 0644 | |
| fi |
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init-base_all.deb", | ||
| "destination": "/home/packer/cloud-init-packages/cloud-init-base_all.deb" | ||
| }, | ||
| { | ||
| "type": "file", | ||
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init-azure_all.deb", | ||
| "destination": "/home/packer/cloud-init-packages/cloud-init-azure_all.deb" | ||
| }, | ||
| { | ||
| "type": "file", | ||
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init_all.deb", | ||
| "destination": "/home/packer/cloud-init-packages/cloud-init_all.deb" | ||
| }, | ||
| { | ||
| "type": "file", |
There was a problem hiding this comment.
This adds prebuilt cloud-init .deb artifacts into the repo and bakes them into the VHD. This creates a hard-to-audit dependency surface (no explicit version metadata/checksums, no provenance, and manual updates), and it increases repo/VHD build maintenance burden. Prefer downloading versioned packages during the VHD build from an approved source (and tracking the version via components.json/renovate where applicable), or at least include explicit versioning + integrity verification for these artifacts.
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init-base_all.deb", | |
| "destination": "/home/packer/cloud-init-packages/cloud-init-base_all.deb" | |
| }, | |
| { | |
| "type": "file", | |
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init-azure_all.deb", | |
| "destination": "/home/packer/cloud-init-packages/cloud-init-azure_all.deb" | |
| }, | |
| { | |
| "type": "file", | |
| "source": "vhdbuilder/packer/cloud-init-packages/cloud-init_all.deb", | |
| "destination": "/home/packer/cloud-init-packages/cloud-init_all.deb" | |
| }, | |
| { | |
| "type": "file", |
14bad23 to
e1d2a33
Compare
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
e1d2a33 to
bb6f742
Compare
|
|
||
| if [ -x /opt/azure/containers/report_ready.py ]; then | ||
| if [ "$EXIT_CODE" -eq 0 ]; then | ||
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" |
There was a problem hiding this comment.
Running report_ready.py synchronously here can add noticeable tail latency to successful provisioning (each attempt can block up to the HTTP timeout(s) plus retry delay). Consider running the success-path report in the background (similar to log upload) and/or tightening the per-request timeouts so a transient wireserver issue doesn't extend CSE completion by ~minutes.
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" | |
| python3 /opt/azure/containers/report_ready.py -v || echo "WARNING: Failed to report ready to Azure fabric" & |
| def write_provisioning_kvp(report: str) -> None: | ||
| """Write a provisioning report to the KVP pool file.""" | ||
| if len(report) >= KVP_AZURE_MAX_VALUE_SIZE: | ||
| report = report[:KVP_AZURE_MAX_VALUE_SIZE - 1] |
There was a problem hiding this comment.
KVP_AZURE_MAX_VALUE_SIZE is a byte limit, but the current truncation uses Python string length/slicing, which can still exceed the byte limit for non-ASCII input (e.g., descriptions containing UTF-8 multibyte chars). Truncate based on the encoded UTF-8 byte length before packing the record to avoid producing oversized KVP values.
| def write_provisioning_kvp(report: str) -> None: | |
| """Write a provisioning report to the KVP pool file.""" | |
| if len(report) >= KVP_AZURE_MAX_VALUE_SIZE: | |
| report = report[:KVP_AZURE_MAX_VALUE_SIZE - 1] | |
| def _truncate_utf8_to_max_bytes(text: str, max_bytes: int) -> str: | |
| """Truncate a string so its UTF-8 encoding is at most max_bytes - 1 bytes.""" | |
| if max_bytes <= 0: | |
| return "" | |
| limit = max_bytes - 1 | |
| encoded = text.encode("utf-8") | |
| if len(encoded) <= limit: | |
| return text | |
| truncated = encoded[:limit] | |
| # Ignore incomplete multibyte sequences at the end to keep valid UTF-8. | |
| return truncated.decode("utf-8", errors="ignore") | |
| def write_provisioning_kvp(report: str) -> None: | |
| """Write a provisioning report to the KVP pool file.""" | |
| report = _truncate_utf8_to_max_bytes(report, KVP_AZURE_MAX_VALUE_SIZE) |
| retry_delay=args.retry_delay, | ||
| ) | ||
| return 0 | ||
| except RuntimeError: |
There was a problem hiding this comment.
main() only catches RuntimeError; other unexpected exceptions (e.g., parsing issues, subprocess errors) will emit a full traceback. Since this is invoked from provisioning, consider catching Exception at the top-level and returning a non-zero exit with a concise logged error message to keep logs readable while still signaling failure.
| except RuntimeError: | |
| except Exception as exc: | |
| LOG.error("report_ready failed: %s", exc) |
Bake experimental_skip_ready_report into VHD via cloud.cfg.d to skip
cloud-init's built-in health ready report to Azure fabric. Add a
standalone Python script (report_ready.py) that can be invoked from
CSE to report ready at the appropriate time during node provisioning.
Depends on canonical/cloud-init#6771.
I tried importing the cloud init library directly and copilot suggested this
The problems:
more. It works on the VM (cloud-init's deps are installed), but it's a large dependency surface.
construct one or monkey-patch it.
instantiated_handler_registry.registered_items["telemetry"], which is only populated during cloud-init's normal boot. Outside of cloud-init's lifecycle, this
returns None and silently skips KVP reporting.
cloud-init versions.
A middle ground: you could import just the KVP handler class directly and the low-level wireserver pieces, bypassing the high-level functions:
from cloudinit.reporting.handlers import HyperVKvpReportingHandler
handler = HyperVKvpReportingHandler()
handler.write_key("PROVISIONING_REPORT", report_string)
But the wireserver reporting (GoalState + health POST) has deep entanglement with url_helper, distro objects, and telemetry decorators that make it hard to
call standalone.