
Scrape github home for repo description, stars, and tags #511

Open

rkent wants to merge 1 commit into ros-infrastructure:ros2 from rkent:scrape-repo-home

Conversation

@rkent
Contributor

@rkent rkent commented Apr 8, 2025

Here we scrape the homepage of github repositories, extracting and then displaying various items.

This PR started as a followup to a (perhaps future) proposal to search github for ROS packages, and the stars were needed to rate repos. Later repo descriptions and tags were added since, why not? But it turns out that the repo descriptions in the repo list are the most useful of the scraped items. (You can see the current result of this PR, as well as a few others such as including download counts and discovered github packages, in https://dev-rosindex.rosdabbler.com/).

I expect a couple of controversies with this PR:

  1. Scraping the web page.

Scraping the web page like this is sensitive to future, unannounced changes to the web page layout by Github. The same information is also available in the Github API, but that is rate limited to just 60 requests per hour unauthenticated. It might be possible to authenticate the request, which allows 5000 requests per hour, but that could require additional effort by ros-infrastructure to set up and manage an account. I am suggesting in this PR that we do the scraping, but be aware that this might have to change in the future if the stability is not sufficient.
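To make the trade-off concrete, here is a minimal sketch (not part of this PR) of what the API path could look like. The endpoint and field names (`description`, `stargazers_count`, `topics`) come from the public GitHub REST API (`GET /repos/{owner}/{repo}`); the helper names are hypothetical.

```python
# Sketch only: fetching description, stars, and topics via the GitHub
# REST API instead of scraping the repo homepage. Helper names are
# hypothetical; unauthenticated calls are limited to 60 per hour.
import json
import urllib.request

def fetch_repo_json(owner, repo, token=None):
    """Fetch the raw repository JSON from the GitHub API."""
    req = urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}",
        headers={"Accept": "application/vnd.github+json"},
    )
    if token:  # authenticated requests get 5000/hour instead of 60
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def extract_repo_fields(payload):
    """Pull out the three fields this PR scrapes from the repo homepage."""
    return {
        "description": payload.get("description") or "",
        "stars": payload.get("stargazers_count", 0),
        "tags": payload.get("topics", []),
    }
```

The extraction step is kept separate from the fetch so it works the same regardless of whether the payload came from the API or from a scrape.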

  2. Use of Github stars to rate packages.

At least @nuclearsandwich is not enthusiastic about using github stars. Obviously I think it is useful (though imperfect like any metric), and worth doing to improve the ability of rosindex to identify significant packages.

In a future PR I may propose (and dev-rosindex shows) using download counts as an additional metric. That is also useful, but there is surprisingly little correlation between that and stars. Download counts are better for locating common utility repos that are useful but not that exciting, while github stars show repos that specifically impressed a number of different users. So something like slam_toolbox has one of the highest Github stars ratings (1900) while being only in the 10th percentile for download counts.

Signed-off-by: R Kent James <kent@caspia.com>
@rkent rkent force-pushed the scrape-repo-home branch from 8dcf777 to ebc6acd Compare April 8, 2025 22:30
@rkent
Contributor Author

rkent commented Apr 8, 2025

Sorry, noticed a small problem after I opened this PR.

Member

@tfoote tfoote left a comment


With this being github specific it seems that using the API might be better.

For example:

```shell
curl -L -H "Accept: application/vnd.github+json" \
     -H "X-GitHub-Api-Version: 2022-11-28" \
     https://api.github.com/repos/ros2/rclcpp
```

Especially if we're doing this for a bunch of packages we'll potentially want/need to use a token. I'd love to get this into the prefetch with caching instead of in the Jekyll too. In the same way that we precache the pip descriptions.

@rkent
Contributor Author

rkent commented Apr 10, 2025

How does a token work with running on the buildfarm? I use a token myself in my github scrapes, so I am familiar with it, but the limits are per account. I don't know if the "official" ROS github accounts have other competing uses, or if you can use a one-off account of some sort for this.

@tfoote
Member

tfoote commented Apr 11, 2025

So in this case, I suspect that our usage will be below the need for a token. There's a volume/rate limit for anonymous access.

But if we need a token I can provision it into the job configuration on the buildfarm. We have a number of bot accounts that we could leverage. Especially for this we'd want a bot with no permissions so that there's virtually no security risk.

We can have the script pick up the token from the environment if it exists.

@rkent
Contributor Author

rkent commented Apr 11, 2025

As I said, the API limit for unauthenticated access is only 60 per hour. We need one access per repo, so there is no way that could be done with unauthenticated access.

My intentions are to write the script to use the token if it exists, otherwise revert to the scrape.
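That fallback could be as simple as a check on the environment; a minimal sketch, assuming the buildfarm exposes the token via an environment variable (the variable name `GITHUB_TOKEN` and the function name are hypothetical):

```python
# Sketch of the proposed strategy: use the API when a token has been
# provisioned in the environment, otherwise fall back to scraping.
import os

def choose_fetch_strategy(env=None):
    """Return ("api", token) if a token is available, else ("scrape", None)."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN")
    if token:
        return ("api", token)
    return ("scrape", None)
```

Passing the environment in explicitly keeps the decision testable without touching the real process environment.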

Re "I'd love to get this into the prefetch with caching instead of in the Jekyll too" that is more complex here because you need to know the list of repos to do that, which is an earlier step in the Ruby megafile. I'm investigating the options though.

@tfoote
Member

tfoote commented Apr 15, 2025

Re "I'd love to get this into the prefetch with caching instead of in the Jekyll too" that is more complex here because you need to know the list of repos to do that, which is an earlier step in the Ruby megafile. I'm investigating the options though.

Ahh for that, I would suggest that we just do things in the rosdistro. You can iterate the rosdistro pretty quickly using the python-rosdistro library.

```python
import rosdistro

index = rosdistro.get_index(rosdistro.get_index_url())
for dist in index.distributions:
    distro_file = rosdistro.get_distribution_file(index, dist)
    repositories = distro_file.repositories

    print(f"Repositories in ROS distribution '{dist}':")
    for repo_name in sorted(repositories.keys()):
        repo_data = repositories[repo_name].get_data()
        print(f"- {repo_name}, {repo_data['source'] if 'source' in repo_data else '<no source repo>'}")
```

With that it just needs to filter for github urls into a set, iterate them and write the files to a cache. And in Jekyll it can just query said cache instead of walking at generate time. If the cache misses the data goes to a default.

The cache could have a last updated timestamp and a TTL to prevent re-crawling too quickly.
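That TTL check could be as simple as comparing the cache file's modification time against a cutoff; a sketch under that assumption (file layout and names are hypothetical):

```python
# Sketch: skip re-crawling a repo if its cached metadata file is
# younger than the TTL. Uses the file's mtime as the "last updated"
# timestamp; a timestamp stored inside the file would work equally well.
import time
from pathlib import Path

def is_cache_fresh(path, ttl_seconds=86400, now=None):
    """True if the cache file exists and is younger than ttl_seconds."""
    p = Path(path)
    if not p.exists():
        return False
    age = (time.time() if now is None else now) - p.stat().st_mtime
    return age < ttl_seconds
```

A missing file simply reads as stale, which gives the "cache miss falls back to a default" behavior described above.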

The rosdistro DistributionCache also has all the packages by name as well as their full package.xml (https://github.com/ros-infrastructure/rosdistro/blob/master/src/rosdistro/distribution_cache.py), which is where we can get most if not all of the metadata we need for #444.

@rkent
Contributor Author

rkent commented Dec 10, 2025

I'm trying to figure out the next step for this PR. Looking at the results, I find e.g. https://dev-rosindex.rosdabbler.com/?search_repos=true with the repo descriptions much more appealing than https://index.ros.org/?search_repos=true without them. You say "The rosdistro DistributionCache also has all the packages by name as well as their full package.xml" - true, but this PR is about the repo name, not the package name.

@tfoote are you proposing that "But if we need a token I can provision it into the job configuration on the buildfarm," and then that this PR be switched from using web scraping to using the API? The web scraping works without the API.

So I think we should move ahead with this somehow. I'm just not sure what changes need to be done.

@tfoote
Member

tfoote commented Dec 19, 2025

So I guess there's two levels of feedback on this. One is that I'd really like to separate the polling/scraping from the ruby build process. That would allow us to potentially separate those results and cache them. The extra external interactions will make a long-running job longer, whether via API or web scraping.

To be a better citizen and also be more robust we should also switch to use the API.

If you haven't seen a large jump in execution time we could land this as is and have a follow-up enhancement request to switch it to a separate python caching process, so that the ruby generator can just rely on a local file for the metadata instead.

@rkent
Contributor Author

rkent commented Dec 19, 2025

I'll be mostly away until after Christmas, but let me briefly respond here.

The problem with the API is the rate limit on the number of operations you can do. It's quite limited unauthenticated, and even authenticated we might be pushing the limits of what we can do on an hourly basis.

I don't mind doing the caching, but so far our caching work hasn't actually resulted in less time, because we end up having to calculate the cache every time we do something. Not sure what the answer to that is yet.

I'll have more time after Christmas to discuss this more.

@tfoote
Member

tfoote commented Dec 19, 2025

Enjoy the holidays! We can sync up after. Indeed we haven't broken out the caching into a separate process, so we're not getting the benefits yet. For the API access we can configure the buildfarm to provide the necessary token for authenticated access. This one will be easy as it's a read-only token, so no permissions are actually needed. And even if we're running the full job, the workspace usually will persist, so we can make the querying smarter to only re-query existing content on some backoff, like daily, to lower the query burden for iterative runs.
