Conversation
This looks good overall to me. Regarding the Fedora scraper changes, it's a shame we have to scrape the Apache-generated HTML directory listings. (I couldn't find any alternatives after a bit of poking around.) That feels like it could be brittle, but I don't know enough about the issue to have a better idea.
Just a couple suggestions below about making things a bit more Pythonic.
```python
PRIMARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora/linux/releases"
SECONDARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora-secondary/releases"
```
If you added trailing slashes to these, then some of the remaining code would (arguably) be simpler:
```diff
-PRIMARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora/linux/releases"
-SECONDARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora-secondary/releases"
+PRIMARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora/linux/releases/"
+SECONDARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora-secondary/releases/"
```
```python
url = f"{PRIMARY_RELEASES_URL}/"
text = await self._fetch_text(session, url)
```
... then you could do this:
```diff
-url = f"{PRIMARY_RELEASES_URL}/"
-text = await self._fetch_text(session, url)
+text = await self._fetch_text(session, PRIMARY_RELEASES_URL)
```
This could also use _fetch_dir_listing to avoid needing to use a regex to parse the HTML.
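A hypothetical sketch of what that call site could look like if it reused a `_fetch_dir_listing`-style helper instead of regex-parsing the HTML itself. The helper is stubbed out here so the sketch runs offline, and the entry names ("40", "41") are purely illustrative; the real helper's signature and return format in the scraper may differ.

```python
import asyncio

# Assumes the trailing-slash form of the constant suggested above.
PRIMARY_RELEASES_URL = "https://dl.fedoraproject.org/pub/fedora/linux/releases/"

class FakeScraper:
    async def _fetch_dir_listing(self, session, url):
        # Stubbed so the sketch runs offline; the real helper would fetch
        # and parse the Apache directory listing at `url`.
        return ["40", "41"]

    async def latest_version(self, session):
        entries = await self._fetch_dir_listing(session, PRIMARY_RELEASES_URL)
        # Keep only numeric release directories and take the newest.
        versions = [int(e) for e in entries if e.isdigit()]
        return max(versions)

result = asyncio.run(FakeScraper().latest_version(session=None))
print(result)  # → 41
```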
```python
self.logger.info("Sending HEAD request to %s", url)
fedora_arch = ARCH_MAP.get(label, label)
base = SECONDARY_RELEASES_URL if label in SECONDARY_ARCHES else PRIMARY_RELEASES_URL
images_url = f"{base}/{version}/Cloud/{fedora_arch}/images/"
```
... and finally this (after adding from urllib.parse import urljoin near the top):
```diff
-images_url = f"{base}/{version}/Cloud/{fedora_arch}/images/"
+images_url = urljoin(base, f"{version}/Cloud/{fedora_arch}/images/")
```
(urljoin is a bit picky and you need the trailing slash in the base value for this to work.)
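A quick illustration of that pickiness, runnable on its own (the version number "40" and architecture are just examples, not taken from the PR):

```python
from urllib.parse import urljoin

# With a trailing slash on the base, the relative path is appended as expected.
with_slash = urljoin(
    "https://dl.fedoraproject.org/pub/fedora/linux/releases/",
    "40/Cloud/x86_64/images/",
)

# Without the trailing slash, urljoin treats "releases" as a file name
# and replaces it, silently dropping that path segment.
without_slash = urljoin(
    "https://dl.fedoraproject.org/pub/fedora/linux/releases",
    "40/Cloud/x86_64/images/",
)

print(with_slash)     # → .../releases/40/Cloud/x86_64/images/
print(without_slash)  # → .../linux/40/Cloud/x86_64/images/  (no "releases"!)
```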
```python
raise RuntimeError("No images to determine latest version")
text = await self._fetch_text(session, url)
# Match href values that are plain filenames (no path separators or query strings)
return re.findall(r'href="([^"/?][^"/]*)"', text)
```
Rather than relying on regexes here, maybe use a real HTML parser? This would require adding bs4 as a dependency and then from bs4 import BeautifulSoup.
```diff
-return re.findall(r'href="([^"/?][^"/]*)"', text)
+doc = BeautifulSoup(text, "html.parser")
+# Directory entries are links inside the main <pre> block immediately
+# following the "Parent Directory".
+entries = (doc.find("pre").find(string="Parent Directory").parent
+           .find_next_siblings("a"))
+return [i.text for i in entries]
```

(I passed an explicit `"html.parser"` so BeautifulSoup doesn't warn about guessing the parser.)
This probably then would need some extra filtering for directories vs non-directories (by the callers?). I'm not 100% sure this is worth the effort, but it would help prevent potential future issues if something changes on the Fedora end.
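One possible shape for that filtering, sketched without bs4: in Apache-style listings, directory hrefs end with "/" while plain files do not, so callers could split the parsed entries on that. The entry names below are illustrative, not from the actual listing.

```python
# Hypothetical entries as they might come back from a listing parser.
entries = ["40/", "41/", "Fedora-Cloud-Base-40.qcow2", "CHECKSUM"]

# Directories end with "/"; strip the slash for convenience.
directories = [e.rstrip("/") for e in entries if e.endswith("/")]
files = [e for e in entries if not e.endswith("/")]

print(directories)  # → ['40', '41']
print(files)        # → ['Fedora-Cloud-Base-40.qcow2', 'CHECKSUM']
```

This convention could of course break if the listing format changes, which is the brittleness concern raised above.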
@jimporter Agree with you on all points. I don't view this code as mission-critical or production code, so my resiliency standards are lower. FWIW, the image source I decided to pull from comes from Fedora's release tooling, which I would expect to be a bit more stable.
jimporter left a comment
@sharder996 LGTM.
I think there's a delicate balance between making this code bulletproof, since it runs automatically, and keeping it simple. Even though it's not a crisis if this code breaks once in a while, it's never fun (IMO) to deal with flaky automation. On the other hand, lots of extra complexity makes maintenance harder.
There might be things we can do to simplify this code (though I know some of my suggestions added to the boilerplate, if nothing else). It's a tough question though, since simplicity plus robustness probably means we'd need to spend more time thinking about the best way to get both. In any case, these changes are good to go, and I'll spend some time thinking about how to get the right balance (including whether my previous reviews of this code nudged us towards unnecessary complexity).
Some tweaks to things that are running on an automated basis: