Scrape github home for repo description, stars, and tags #511
rkent wants to merge 1 commit into ros-infrastructure:ros2 from
Conversation
Signed-off-by: R Kent James <kent@caspia.com>
Sorry, noticed a small problem after I opened this PR.
tfoote left a comment
With this being github specific it seems that using the API might be better.
For example: curl -L -H "Accept: application/vnd.github+json" -H "X-GitHub-Api-Version: 2022-11-28" https://api.github.com/repos/ros2/rclcpp
Especially if we're doing this for a bunch of packages we'll potentially want/need to use a token. I'd love to get this into the prefetch with caching instead of in the Jekyll build, in the same way that we precache the pip descriptions.
How does a token work with running on the buildfarm? I use a token myself in my github scrapes, so I am familiar with it, but the limits are per account. I don't know if the "official" ROS github accounts have other competing uses, or if you can use a one-off account of some sort for this.
So in this case, I suspect that our usage will be below the need for a token. There's a volume/rate limit for anonymous access. But if we need a token I can provision it into the job configuration on the buildfarm. We have a number of bot accounts that we could leverage. Especially for this we'd want a bot with no permissions so that there's virtually no security risk. We can have the script pick up the token from the environment if it exists.
As I said, the API limit for unauthenticated access is only 60 requests per hour. We need one access per repo, so there is no way that could be done with unauthenticated access. My intention is to write the script to use the token if it exists, and otherwise revert to the scrape. Re "I'd love to get this into the prefetch with caching instead of in the Jekyll too": that is more complex here because you need to know the list of repos to do that, which is an earlier step in the Ruby megafile. I'm investigating the options though.
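To make the "use the token if it exists" behavior concrete, here is a minimal sketch of how such a script might query the GitHub API, picking up a token from the environment when one is provisioned. The helper names, the `GITHUB_TOKEN` variable name, and the fallback structure are assumptions for illustration, not code from this PR; the endpoint and headers match the `curl` example above.

```python
import json
import os
import urllib.request

# Sketch: query the GitHub API for the metadata this PR scrapes, with an
# optional token from the environment (variable name is an assumption).
API_VERSION = "2022-11-28"

def build_headers(token=None):
    """Build request headers; Authorization is added only when a token is set."""
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": API_VERSION,
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers

def extract_repo_metadata(api_json):
    """Reduce a GitHub API repo response to the fields this PR displays."""
    return {
        "description": api_json.get("description") or "",
        "stars": api_json.get("stargazers_count", 0),
        "tags": api_json.get("topics", []),
    }

def fetch_repo_metadata(owner, repo):
    """Fetch and reduce metadata for one repository (network call)."""
    token = os.environ.get("GITHUB_TOKEN")  # provisioned on the buildfarm, if at all
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers=build_headers(token),
    )
    with urllib.request.urlopen(req) as resp:
        return extract_repo_metadata(json.load(resp))
```

An unauthenticated caller would simply leave `GITHUB_TOKEN` unset; a scrape fallback could wrap `fetch_repo_metadata` and catch the rate-limit error.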
Ahh, for that I would suggest that we just do things in the rosdistro. You can iterate the rosdistro pretty quickly using the python-rosdistro library. With that it just needs to filter for github urls into a set, iterate them, and write the files to a cache. Then in Jekyll it can just query said cache instead of walking at generate time; if the cache misses, the data falls back to a default. The cache could have a last-updated timestamp and a TTL to prevent re-crawling too quickly. The rosdistro DistributionCache also has all the packages by name as well as their full package.xml (https://github.com/ros-infrastructure/rosdistro/blob/master/src/rosdistro/distribution_cache.py), which is where we can get most if not all the metadata we need for #444.
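The prefetch step suggested above could be sketched roughly as follows: collect GitHub URLs from a rosdistro, deduplicate them into a set of owner/name slugs, and honor a TTL on cache entries. The function names and cache layout are assumptions, not an existing interface, and the python-rosdistro calls are isolated in one function since that library and the network are only available in the prefetch environment.

```python
import re
import time

# Hedged sketch of the proposed prefetch: walk a rosdistro, collect the set of
# GitHub repo URLs, and keep cache entries with a last-updated timestamp and a
# TTL so re-runs don't re-crawl too quickly.

GITHUB_URL_RE = re.compile(r"https?://github\.com/([^/]+)/([^/.]+)")

def github_slugs(repo_urls):
    """Filter a list of VCS URLs down to a sorted set of owner/name slugs."""
    slugs = set()
    for url in repo_urls:
        m = GITHUB_URL_RE.match(url or "")
        if m:
            slugs.add(f"{m.group(1)}/{m.group(2)}")
    return sorted(slugs)

def cache_is_fresh(entry, ttl_seconds, now=None):
    """True if a cache entry was updated within the TTL window."""
    now = time.time() if now is None else now
    return (now - entry.get("updated", 0)) < ttl_seconds

def collect_rosdistro_urls(dist_name):
    """Gather repository URLs via the python-rosdistro library (network call)."""
    import rosdistro  # assumed available in the prefetch environment
    index = rosdistro.get_index(rosdistro.get_index_url())
    dist_file = rosdistro.get_distribution_file(index, dist_name)
    urls = []
    for repo in dist_file.repositories.values():
        for vcs in (repo.source_repository, repo.release_repository):
            if vcs is not None:
                urls.append(vcs.url)
    return urls
```

Jekyll would then read the written cache files and fall back to a default on a miss, exactly as described above.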
I'm trying to figure out the next step for this PR. Looking at the results, I find e.g. https://dev-rosindex.rosdabbler.com/?search_repos=true with the repo descriptions much more appealing than https://index.ros.org/?search_repos=true without the descriptions. You say "The rosdistro DistributionCache also has all the packages by name as well as their full package.xml" - true, but this PR is about the repo name, not the package name. @tfoote, are you proposing that "if we need a token I can provision it into the job configuration on the buildfarm" and that this PR then be switched from using web scraping to using the API? The web scraping works without the API. So I think we should move ahead with this somehow; I'm just not sure what changes need to be made.
So I guess there's two levels of feedback on this. One is that I'd really like to separate the polling/scraping from the ruby build process. That would allow us to potentially separate out the results and cache them. The extra external interactions will make a long-running job longer, either via API or web scraping. To be a better citizen and also be more robust we should also switch to using the API. If you haven't seen a large jump in execution time we could land this as is and have a follow-up enhancement request of switching it to a separate python caching process, so that the ruby generator will just rely on a local file for the metadata instead.
I'll be mostly away until after Christmas, but let me briefly respond here. The problem with the API is the limit on the number of operations you can do. It's quite limited unauthenticated, and even authenticated we might be pushing the limits of what we can do on an hourly basis. I don't mind doing the caching, but so far our caching work hasn't actually resulted in less time, because we end up having to calculate the cache every time we do something. Not sure what the answer to that is yet. I'll have more time after Christmas to discuss this more.
Enjoy the holidays! We can sync up after. Indeed we haven't broken out the caching into a separate process so we're not getting the benefits yet. For the API access we can configure the buildfarm to provide the necessary token for authenticated access. This one will be easy as it's a read only token so no permissions actually needed. And even if we're running the full job, the workspace usually will persist so we can make the querying smarter to only retry existing content on some backoff like daily to lower the query burden for iterative queries. |
Here we scrape the homepage of github repositories, extracting and then displaying various items.
This PR started as a followup to a (perhaps future) proposal to search github for ROS packages, and the stars were needed to rate repos. Later repo descriptions and tags were added since, why not? But it turns out that the repo descriptions in the repo list are the most useful of the scraped items. (You can see the current result of this PR, as well as a few others such as including download counts and discovered github packages, in https://dev-rosindex.rosdabbler.com/).
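As a rough illustration of the scraping approach (not the PR's actual implementation), the repo description is mirrored in the page's `og:description` meta tag, which can be extracted with the standard-library HTML parser. Selectors for stars and tags depend on GitHub's current markup and are omitted; as discussed below, the whole approach is fragile by design.

```python
from html.parser import HTMLParser

# Hedged sketch: pull the repo description out of a GitHub repo page by
# reading the og:description meta tag. Stars and tags would need additional,
# markup-dependent selectors and are not shown here.

class _MetaDescription(HTMLParser):
    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        if d.get("property") == "og:description" and self.description is None:
            self.description = d.get("content")

def scrape_description(html_text):
    """Extract the repo description from a GitHub repo page's HTML, or None."""
    parser = _MetaDescription()
    parser.feed(html_text)
    return parser.description
```

A caller would fetch the repo homepage and pass the response body to `scrape_description`, falling back to an empty string when the tag is absent.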
I expect a couple of controversies with this PR:
Scraping the web page like this is sensitive to future, unannounced changes to the web page layout by GitHub. The same information is also available from the GitHub API, but that is rate limited to just 60 requests per hour unauthenticated. It might be possible to authenticate the requests, which allows 5000 requests per hour, but that could require additional effort by ros-infrastructure to set up and manage an account. I am suggesting in this PR that we do the scraping, but be aware that in the future this might have to change if the stability is not sufficient.
At least @nuclearsandwich is not enthusiastic about using github stars. Obviously I think it is useful (though imperfect like any metric), and worth doing to improve the ability of rosindex to identify significant packages.
In a future PR I may propose (and dev-rosindex shows) using download counts as an additional metric. That is also useful, but there is surprisingly little correlation between download counts and stars. Download counts are better for locating common utility repos that are useful but not that exciting, while GitHub stars show repos that specifically impressed a number of different users. So something like slam_toolbox has one of the highest GitHub star ratings (1900) while only in the 10th percentile for download counts.