
Conversation

@jai-deepsource
Contributor

  • Switch from Poetry to uv for Python dependency management
  • Add retry logic with exponential backoff for GitHub rate limits
  • Reduce MAX_CONCURRENCY from 25 to 10 to avoid rate limits
  • Remove MAX_REPOSITORIES cap so all repos are processed
  • Sort repositories.toml alphabetically, add/remove some repos
  • Update CONTRIBUTING.md with uv setup instructions
  • Update frontend dependencies (nuxt, prettier, etc.)

@vercel

vercel bot commented Jan 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Review Updated (UTC)
good-first-issue Ignored Ignored Preview Jan 20, 2026 10:11pm


requires-python = ">=3.9"
dependencies = [
    "github3.py>=4.0.1",
    "toml>=0.10.2",


loguru is placed in optional dev dependencies instead of main dependencies, causing ImportError when running populate.py in production

📝 Patch Details
diff --git a/pyproject.toml b/pyproject.toml
index c788dcf..bf9317c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -14,6 +14,7 @@ dependencies = [
     "emoji>=0.6.0",
     "requests>=2.23.0",
     "python-slugify>=4.0.1",
+    "loguru>=0.7.2",
 ]
 
 [project.optional-dependencies]
@@ -21,7 +22,6 @@ dev = [
     "ruff>=0.1.8",
     "types-toml>=0.10.8.7",
     "types-python-slugify>=8.0.0.3",
-    "loguru>=0.7.2",
     "mypy>=1.7.1",
 ]
 

Analysis

Why the bug exists:
The gfi/populate.py script is part of the production code and imports loguru at the module level (from loguru import logger at line 20). The module uses the logger extensively throughout the main execution flow with calls like logger.info(), logger.warning(), and logger.error(). However, in the migration from Poetry to uv, loguru was mistakenly left in [project.optional-dependencies] dev instead of being moved to the main [project] dependencies section.

When it manifests:
When anyone installs the package with just pip install good-first-issue (without the dev extras), and then attempts to run the populate script or import the gfi module, Python will raise an ImportError: No module named 'loguru' because loguru is not installed.

Impact:
This breaks the core functionality of the populate script, which is a key production component for gathering repository data. Any production deployment or normal usage of this module will fail immediately on import.

The fix:
I moved loguru>=0.7.2 from [project.optional-dependencies] dev to the main [project] dependencies list. This ensures loguru is installed whenever the package is installed, making it available for the populate.py script in all deployment scenarios. The package version constraint (>=0.7.2) is preserved.
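
As a quick sanity check on the fix, the snippet below (a sketch, not part of the patch) verifies that loguru is declared as a runtime dependency of the installed distribution; it assumes the distribution name good-first-issue used above.

from importlib.metadata import requires

# Requirement strings for extras carry an 'extra == "..."' marker; keep only
# the main [project] dependencies.
runtime_requirements = [
    req for req in (requires("good-first-issue") or []) if "extra ==" not in req
]
assert any(req.startswith("loguru") for req in runtime_requirements), "loguru must be a main dependency"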

if good_first_issues and repository.language:
    # store the repo info
    info: RepositoryInfo = {}
    info["name"] = name


Thread pool worker blocked by 60-180 second sleep during rate limiting, preventing parallel processing of other repositories

📝 Patch Details
diff --git a/gfi/populate.py b/gfi/populate.py
index b06443a..f9f4d5a 100644
--- a/gfi/populate.py
+++ b/gfi/populate.py
@@ -9,7 +9,7 @@ from collections import Counter
 from concurrent.futures import ThreadPoolExecutor
 from operator import itemgetter
 from os import getenv, path
-from typing import TypedDict, Dict, Union, Sequence, Optional, Mapping
+from typing import TypedDict, Dict, Union, Sequence, Optional, Mapping, Tuple
 
 import toml
 
@@ -61,7 +61,16 @@ RepositoryInfo = Dict["str", Union[str, int, Sequence]]
 
 
 def get_repository_info(identifier: RepositoryIdentifier) -> Optional[RepositoryInfo]:
-    """Get the relevant information needed for the repository from its owner login and name."""
+    """Get the relevant information needed for the repository from its owner login and name.
+    
+    Note: This function does NOT retry on rate limits. Retries are handled at the caller level
+    to avoid blocking thread pool workers.
+    
+    Returns:
+        - RepositoryInfo dict if successful
+        - None if repository doesn't meet criteria or not found
+        - "RATE_LIMIT" string if rate limited (signal for retry outside thread pool)
+    """
     owner, name = identifier["owner"], identifier["name"]
 
     logger.info("Getting info for {}/{}", owner, name)
@@ -69,82 +78,71 @@ def get_repository_info(identifier: RepositoryIdentifier) -> Optional[Repository
     # create a logged in GitHub client
     client = login(token=getenv("GH_ACCESS_TOKEN"))
 
-    max_retries = 3
-
-    for attempt in range(max_retries):
-        try:
-            repository = client.repository(owner, name)
-            # Don't find issues inside archived repos.
-            if repository.archived:
-                return None
-
-            good_first_issues = set(
-                itertools.chain.from_iterable(
-                    repository.issues(
-                        labels=label,
-                        state=ISSUE_STATE,
-                        number=ISSUE_LIMIT,
-                        sort=ISSUE_SORT,
-                        direction=ISSUE_SORT_DIRECTION,
-                    )
-                    for label in ISSUE_LABELS
+    try:
+        repository = client.repository(owner, name)
+        # Don't find issues inside archived repos.
+        if repository.archived:
+            return None
+
+        good_first_issues = set(
+            itertools.chain.from_iterable(
+                repository.issues(
+                    labels=label,
+                    state=ISSUE_STATE,
+                    number=ISSUE_LIMIT,
+                    sort=ISSUE_SORT,
+                    direction=ISSUE_SORT_DIRECTION,
                 )
+                for label in ISSUE_LABELS
             )
-            logger.info("\t found {} good first issues", len(good_first_issues))
-            # check if repo has at least one good first issue
-            if good_first_issues and repository.language:
-                # store the repo info
-                info: RepositoryInfo = {}
-                info["name"] = name
-                info["owner"] = owner
-                info["description"] = emojize(repository.description or "")
-                info["language"] = repository.language
-                info["slug"] = slugify(repository.language, replacements=SLUGIFY_REPLACEMENTS)
-                info["url"] = repository.html_url
-                info["stars"] = repository.stargazers_count
-                info["stars_display"] = numerize.numerize(repository.stargazers_count)
-                info["last_modified"] = repository.pushed_at.isoformat()
-                info["id"] = str(repository.id)
-
-                # get the latest issues with the tag
-                issues = []
-                for issue in good_first_issues:
-                    issues.append(
-                        {
-                            "title": issue.title,
-                            "url": issue.html_url,
-                            "number": issue.number,
-                            "comments_count": issue.comments_count,
-                            "created_at": issue.created_at.isoformat(),
-                        }
-                    )
-
-                info["issues"] = issues
-                return info
-            else:
-                logger.info("\t skipping due to insufficient issues or info")
-                return None
-
-        except exceptions.ForbiddenError:
-            if attempt < max_retries - 1:
-                wait_time = 60 * (attempt + 1)  # 60s, 120s, 180s
-                logger.warning("Rate limited on {}/{}. Waiting {}s before retry...",
-                             owner, name, wait_time)
-                time.sleep(wait_time)
-            else:
-                logger.error("Rate limit exceeded after {} retries: {}/{}",
-                           max_retries, owner, name)
-                return None
+        )
+        logger.info("\t found {} good first issues", len(good_first_issues))
+        # check if repo has at least one good first issue
+        if good_first_issues and repository.language:
+            # store the repo info
+            info: RepositoryInfo = {}
+            info["name"] = name
+            info["owner"] = owner
+            info["description"] = emojize(repository.description or "")
+            info["language"] = repository.language
+            info["slug"] = slugify(repository.language, replacements=SLUGIFY_REPLACEMENTS)
+            info["url"] = repository.html_url
+            info["stars"] = repository.stargazers_count
+            info["stars_display"] = numerize.numerize(repository.stargazers_count)
+            info["last_modified"] = repository.pushed_at.isoformat()
+            info["id"] = str(repository.id)
+
+            # get the latest issues with the tag
+            issues = []
+            for issue in good_first_issues:
+                issues.append(
+                    {
+                        "title": issue.title,
+                        "url": issue.html_url,
+                        "number": issue.number,
+                        "comments_count": issue.comments_count,
+                        "created_at": issue.created_at.isoformat(),
+                    }
+                )
 
-        except exceptions.NotFoundError:
-            logger.warning("Not Found: {}/{}", owner, name)
+            info["issues"] = issues
+            return info
+        else:
+            logger.info("\t skipping due to insufficient issues or info")
             return None
 
-        except exceptions.ConnectionError:
-            logger.warning("Connection error: {}/{}", owner, name)
-            return None
+    except exceptions.ForbiddenError:
+        # Don't sleep in the worker thread. Return a signal for retry at caller level.
+        logger.warning("Rate limited on {}. Will retry later.", f"{owner}/{name}")
+        return "RATE_LIMIT"
+
+    except exceptions.NotFoundError:
+        logger.warning("Not Found: {}/{}", owner, name)
+        return None
 
-    return None
+    except exceptions.ConnectionError:
+        logger.warning("Connection error: {}/{}", owner, name)
+        return None
 
 
 if __name__ == "__main__":
@@ -175,14 +173,39 @@ if __name__ == "__main__":
         # shuffle the order of the repositories
         random.shuffle(repositories)
 
-        with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
-            results = executor.map(get_repository_info, repositories)
-
-        # filter out repositories with valid data and increment tag counts
-        for result in results:
-            if result:
-                REPOSITORIES.append(result)
-                TAGS[result["language"]] += 1
+        # Process repositories with retry logic outside the thread pool to avoid blocking workers
+        retry_queue = repositories.copy()
+        attempt_count = {}
+        max_retries_per_repo = 3
+
+        while retry_queue:
+            with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
+                results = executor.map(get_repository_info, retry_queue)
+
+            next_retry_queue = []
+            for identifier, result in zip(retry_queue, results):
+                if result == "RATE_LIMIT":
+                    # Handle rate limit with retry outside thread pool
+                    repo_key = f"{identifier['owner']}/{identifier['name']}"
+                    current_attempt = attempt_count.get(repo_key, 0) + 1
+                    attempt_count[repo_key] = current_attempt
+
+                    if current_attempt < max_retries_per_repo:
+                        next_retry_queue.append(identifier)
+                    else:
+                        logger.error("Rate limit exceeded after {} retries: {}",
+                                   max_retries_per_repo, repo_key)
+                elif result:
+                    REPOSITORIES.append(result)
+                    TAGS[result["language"]] += 1
+
+            retry_queue = next_retry_queue
+            if retry_queue:
+                # Sleep outside thread pool to avoid blocking workers
+                wait_time = 60 * min(attempt_count.values()) if attempt_count else 60
+                logger.info("Retrying {} repositories. Waiting {}s to avoid rate limits...", 
+                           len(retry_queue), wait_time)
+                time.sleep(wait_time)
 
     # write to generated JSON files
 

Analysis

Bug Analysis

Why It Happens:
The original code at line 98 in get_repository_info() calls time.sleep(wait_time) when a ForbiddenError (rate limit) is caught. This function is executed inside a ThreadPoolExecutor worker with max_workers=10. When the thread sleeps, it remains blocked and cannot process other repository tasks.

When It Manifests:

  • When processing multiple repositories through the ThreadPoolExecutor
  • When one or more repositories trigger a GitHub API rate limit (ForbiddenError)
  • The thread sleeps for 60s, 120s, or 180s depending on the retry attempt
  • During this sleep, the worker thread is completely unavailable to the thread pool

Impact:

  • With 10 worker threads, if 2-3 repositories hit rate limits simultaneously, 2-3 threads are sleeping for up to 3 minutes
  • Only 7-8 threads are left to process work, reducing parallelism by 20-30%
  • If more repositories hit rate limits, parallelism degrades further
  • The ThreadPoolExecutor becomes inefficient, defeating its purpose for parallel processing
  • Overall execution time increases significantly, eroding the speedup that parallel processing was meant to provide

Fix Applied

Solution: Move rate limit retry logic outside the thread pool to avoid blocking worker threads.

Key Changes:

  1. Modified get_repository_info() to return "RATE_LIMIT" signal instead of sleeping when a ForbiddenError occurs
  2. Removed time.sleep() from inside the worker function
  3. Implemented retry loop at the main level (lines 179-197):
    • Collects all repositories that hit rate limits
    • Sleeps outside the thread pool (main thread, not blocking workers)
    • All 10 worker threads remain available during the sleep
    • Retries rate-limited repositories in the next executor batch

Why It Solves The Issue:

  • Worker threads never sleep, so they can immediately pick up new tasks
  • Rate limit sleep happens in the main thread, not blocking any workers
  • All 10 workers remain available for processing throughout rate limit waits
  • Parallelism is preserved, enabling efficient processing of all repositories
  • The ThreadPoolExecutor achieves its intended performance benefit
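
Stripped of the GitHub specifics, the retry-outside-the-pool pattern looks roughly like the sketch below. The fetch_one name and the fixed 60-second wait are illustrative assumptions, not the PR's exact code, which is in the patch above.

import time
from concurrent.futures import ThreadPoolExecutor

RATE_LIMITED = "RATE_LIMIT"  # sentinel a worker returns instead of sleeping
MAX_RETRIES = 3
WAIT_SECONDS = 60

def fetch_one(item):
    """Hypothetical worker: returns a result, None to skip, or RATE_LIMITED."""
    ...

def fetch_all(items, max_workers=10):
    # items are assumed hashable (the PR builds an "owner/name" key for this)
    results, pending, attempts = [], list(items), {}
    while pending:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            outcomes = list(executor.map(fetch_one, pending))

        retry = []
        for item, outcome in zip(pending, outcomes):
            if outcome == RATE_LIMITED:
                attempts[item] = attempts.get(item, 0) + 1
                if attempts[item] < MAX_RETRIES:
                    retry.append(item)  # re-queue for the next batch
            elif outcome is not None:
                results.append(outcome)

        pending = retry
        if pending:
            # Sleep in the main thread only; no worker thread is blocked.
            time.sleep(WAIT_SECONDS)
    return results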

- Add astral-sh/setup-uv@v7 step (fixes make pre-build failure)
- Update runner to blacksmith-2vcpu-ubuntu-2404
- Bump actions/checkout to v6
- Bump actions/setup-python to v6

Signed-off-by: Jai Pradeesh <[email protected]>
- Add GitHubRateLimiter class for thread-safe API throttling
- Share single GitHub client across all workers
- Proactively pause when quota drops below 100
- Coordinate all workers to pause together on rate limit hit
- Reduce MAX_CONCURRENCY from 10 to 5
- Remove unused itertools and Mapping imports
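
The merged code for that follow-up is not shown here, but based on the commit message a thread-safe limiter along these lines is plausible. The class name and the quota threshold of 100 come from the message above; the body, and the use of github3.py's rate_limit() endpoint, are assumptions for illustration only.

import threading
import time

class GitHubRateLimiter:
    """Sketch of a shared, thread-safe throttle; not the merged implementation."""

    def __init__(self, client, min_remaining=100):
        self._client = client              # single github3.py client shared by all workers
        self._min_remaining = min_remaining
        self._lock = threading.Lock()

    def wait_if_needed(self):
        # Holding the lock while sleeping makes every other worker queue up
        # behind it, so all workers effectively pause together.
        with self._lock:
            core = self._client.rate_limit()["resources"]["core"]
            if core["remaining"] < self._min_remaining:
                pause = max(core["reset"] - time.time(), 0) + 1
                time.sleep(pause)

Each worker would presumably call wait_if_needed() before every API request, so a depleted quota pauses the whole pool rather than burning retries.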
@jai-deepsource jai-deepsource merged commit d85c8c4 into master Jan 20, 2026
2 checks passed
@jai-deepsource jai-deepsource deleted the misc-improvements branch January 20, 2026 23:53