
[kaizen] CI tweaks#4739

Merged
sharder996 merged 6 commits into main from kaizen/ci-tweaks
Mar 19, 2026

Conversation

@sharder996
Collaborator

Some tweaks to things that are running on an automated basis:

  1. Added a rule for merging conflicts when converting coverage reports for TICS
  2. Added retry functionality for when we prune cached vcpkg packages. The growing list of cached packages sometimes causes rate limiting errors.
  3. Stopped the distro-scraper from removing existing data when merging its output with an existing file.
  4. Edited the fedora scraper to get architecture specific images from different mirrors (they are not all hosted in the same place). Factored out a bit of common code while I was at it.
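For point 2, a minimal retry wrapper could look roughly like the sketch below. The function name, attempt count, and backoff parameters are hypothetical; the actual pruning call in the workflow may differ.

```python
import time


def with_retries(fn, attempts=3, base_delay=2.0):
    """Call fn(), retrying with exponential backoff on failure.

    Intended for transient errors such as API rate limiting while
    pruning cached vcpkg packages. Re-raises on the final attempt.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == attempts - 1:
                raise
            # Back off 2s, 4s, 8s, ... before the next attempt.
            time.sleep(base_delay * 2 ** attempt)
```

A plain exponential backoff like this is usually enough for rate-limit errors; adding jitter is a common refinement if many jobs run concurrently.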

@sharder996 sharder996 requested review from a team and jimporter and removed request for a team March 17, 2026 20:48
Contributor

@jimporter jimporter left a comment


This looks good overall to me. Regarding the Fedora scraper changes, it's a shame we have to scrape the Apache-generated HTML directory listings. (I couldn't find any alternatives after a bit of poking around.) That feels like it could be brittle, but I don't know enough about the issue to have a better idea.

Just a couple suggestions below about making things a bit more Pythonic.

Comment on lines +9 to +10
PRIMARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora/linux/releases"
SECONDARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora-secondary/releases"
Contributor


If you added trailing slashes to these, then some of the remaining code would (arguably) be simpler:

Suggested change
PRIMARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora/linux/releases"
SECONDARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora-secondary/releases"
PRIMARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora/linux/releases/"
SECONDARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora-secondary/releases/"

Comment on lines +33 to +34
url = f"{PRIMARY_RELEASES_URL}/"
text = await self._fetch_text(session, url)
Contributor


... then you could do this:

Suggested change
url = f"{PRIMARY_RELEASES_URL}/"
text = await self._fetch_text(session, url)
text = await self._fetch_text(session, PRIMARY_RELEASES_URL)

This could also use _fetch_dir_listing to avoid needing to use a regex to parse the HTML.

self.logger.info("Sending HEAD request to %s", url)
fedora_arch = ARCH_MAP.get(label, label)
base = SECONDARY_RELEASES_URL if label in SECONDARY_ARCHES else PRIMARY_RELEASES_URL
images_url = f"{base}/{version}/Cloud/{fedora_arch}/images/"
Contributor


... and finally this (after adding from urllib.parse import urljoin near the top):

Suggested change
images_url = f"{base}/{version}/Cloud/{fedora_arch}/images/"
images_url = urljoin(base, f"{version}/Cloud/{fedora_arch}/images/")

(urljoin is a bit picky and you need the trailing slash in the base value for this to work.)
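To illustrate that pickiness (version "40" here is just an example path segment, not taken from the scraper):

```python
from urllib.parse import urljoin

# With a trailing slash on the base, the relative path is appended:
with_slash = "https://dl.fedoraproject.org/pub/fedora/linux/releases/"
print(urljoin(with_slash, "40/Cloud/x86_64/images/"))
# -> https://dl.fedoraproject.org/pub/fedora/linux/releases/40/Cloud/x86_64/images/

# Without it, urljoin replaces the last path segment ("releases"):
no_slash = "https://dl.fedoraproject.org/pub/fedora/linux/releases"
print(urljoin(no_slash, "40/Cloud/x86_64/images/"))
# -> https://dl.fedoraproject.org/pub/fedora/linux/40/Cloud/x86_64/images/
```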

raise RuntimeError("No images to determine latest version")
text = await self._fetch_text(session, url)
# Match href values that are plain filenames (no path separators or query strings)
return re.findall(r'href="([^"/?][^"/]*)"', text)
Contributor

@jimporter jimporter Mar 18, 2026


Rather than relying on regexes here, maybe use a real HTML parser? This would require adding bs4 as a dependency and then from bs4 import BeautifulSoup.

Suggested change
return re.findall(r'href="([^"/?][^"/]*)"', text)
doc = BeautifulSoup(text, "html.parser")
# Directory entries are links inside the main <pre> block immediately
# following the "Parent Directory".
entries = (doc.find("pre").find(string="Parent Directory").parent
.find_next_siblings("a"))
return [i.text for i in entries]

Contributor


This probably then would need some extra filtering for directories vs non-directories (by the callers?). I'm not 100% sure this is worth the effort, but it would help prevent potential future issues if something changes on the Fedora end.
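A minimal sketch of that filtering, assuming Apache's autoindex convention of ending directory hrefs with a trailing `/` (the helper name is hypothetical):

```python
def split_listing(entries):
    """Split directory-listing entries into (dirs, files).

    Apache autoindex pages end directory hrefs with '/', so the
    suffix distinguishes subdirectories from plain files.
    """
    dirs = [e.rstrip("/") for e in entries if e.endswith("/")]
    files = [e for e in entries if not e.endswith("/")]
    return dirs, files
```

Callers that only want version directories could then take just the first element of the returned pair.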

@sharder996
Collaborator Author

@jimporter Agree with you on all points. I don't view this code as mission-critical or production code, so my resiliency standards are lower. FWIW, the image source I decided to pull from comes from Fedora's release tooling, which I would expect to be a bit more stable.

Contributor

@jimporter jimporter left a comment


@sharder996 LGTM.

I think there's a delicate balance between making this code bulletproof (since it runs automatically) and keeping it simple. Even though it's not a crisis if this code breaks once in a while, it's never fun (IMO) to deal with flaky automation. On the other hand, lots of extra complexity makes maintenance harder.

There might be things we can do to simplify this code (though I know some of my suggestions added to the boilerplate, if nothing else). It's a tough question though, since simplicity plus robustness probably means we'd need to spend more time thinking about the best way to get both. In any case, these changes are good to go, and I'll spend some time thinking about how to get the right balance (including whether my previous reviews of this code nudged us towards unnecessary complexity).

@sharder996 sharder996 added this pull request to the merge queue Mar 19, 2026
Merged via the queue into main with commit 5850771 Mar 19, 2026
10 checks passed
