Migrate to uv and improve rate limit handling #2557
Conversation
jai-deepsource commented on Jan 20, 2026
- Switch from Poetry to uv for Python dependency management
- Add retry logic with exponential backoff for GitHub rate limits (see the sketch after this list)
- Reduce MAX_CONCURRENCY from 25 to 10 to avoid rate limits
- Remove MAX_REPOSITORIES cap so all repos are processed
- Sort repositories.toml alphabetically, add/remove some repos
- Update CONTRIBUTING.md with uv setup instructions
- Update frontend dependencies (nuxt, prettier, etc.)
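As context for the retry item above, here is a minimal sketch of retry with exponential backoff. The helper name and parameters are illustrative assumptions, not the PR's actual code; the real changes are in the patches below.

```python
import time

def with_backoff(fn, max_retries: int = 3, base_delay: float = 60.0):
    """Retry fn on failure, doubling the wait between attempts (60s, then 120s)."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:  # a real caller would catch the specific rate-limit error
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```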
Vercel: 1 Skipped Deployment
requires-python = ">=3.9"
dependencies = [
    "github3.py>=4.0.1",
    "toml>=0.10.2",
loguru is placed in optional dev dependencies instead of main dependencies, causing ImportError when running populate.py in production
📝 Patch Details
diff --git a/pyproject.toml b/pyproject.toml
index c788dcf..bf9317c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -14,6 +14,7 @@ dependencies = [
"emoji>=0.6.0",
"requests>=2.23.0",
"python-slugify>=4.0.1",
+ "loguru>=0.7.2",
]
[project.optional-dependencies]
@@ -21,7 +22,6 @@ dev = [
"ruff>=0.1.8",
"types-toml>=0.10.8.7",
"types-python-slugify>=8.0.0.3",
- "loguru>=0.7.2",
"mypy>=1.7.1",
]
Analysis
Why the bug exists:
The gfi/populate.py script is part of the production code and imports loguru at the module level (from loguru import logger at line 20). The module uses the logger extensively throughout the main execution flow with calls like logger.info(), logger.warning(), and logger.error(). However, in the migration from Poetry to uv, loguru was mistakenly left in [project.optional-dependencies] dev instead of being moved to the main [project] dependencies section.
When it manifests:
When anyone installs the package with just pip install good-first-issue (without the dev extras), and then attempts to run the populate script or import the gfi module, Python will raise an ImportError: No module named 'loguru' because loguru is not installed.
Impact:
This breaks the core functionality of the populate script, which is a key production component for gathering repository data. Any production deployment or normal usage of this module will fail immediately on import.
The fix:
I moved loguru>=0.7.2 from [project.optional-dependencies] dev to the main [project] dependencies list. This ensures loguru is installed whenever the package is installed, making it available for the populate.py script in all deployment scenarios. The package version constraint (>=0.7.2) is preserved.
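A minimal repro sketch of the failure mode described above, assuming a hypothetical environment where only the base package is installed (no dev extra):

```python
# In an environment installed without the dev extra, the module-level
# import that gfi/populate.py performs fails immediately.
try:
    from loguru import logger  # same module-level import as populate.py
except ImportError as exc:
    raise SystemExit(f"populate.py would fail at import time: {exc}")
print("loguru import OK:", logger)
```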
if good_first_issues and repository.language:
    # store the repo info
    info: RepositoryInfo = {}
    info["name"] = name
Thread pool worker blocked by 60-180 second sleep during rate limiting, preventing parallel processing of other repositories
📝 Patch Details
diff --git a/gfi/populate.py b/gfi/populate.py
index b06443a..f9f4d5a 100644
--- a/gfi/populate.py
+++ b/gfi/populate.py
@@ -9,7 +9,7 @@ from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from operator import itemgetter
from os import getenv, path
-from typing import TypedDict, Dict, Union, Sequence, Optional, Mapping
+from typing import TypedDict, Dict, Union, Sequence, Optional, Mapping, Tuple
import toml
@@ -61,7 +61,16 @@ RepositoryInfo = Dict["str", Union[str, int, Sequence]]
def get_repository_info(identifier: RepositoryIdentifier) -> Optional[RepositoryInfo]:
- """Get the relevant information needed for the repository from its owner login and name."""
+ """Get the relevant information needed for the repository from its owner login and name.
+
+ Note: This function does NOT retry on rate limits. Retries are handled at the caller level
+ to avoid blocking thread pool workers.
+
+ Returns:
+ - RepositoryInfo dict if successful
+ - None if repository doesn't meet criteria or not found
+ - "RATE_LIMIT" string if rate limited (signal for retry outside thread pool)
+ """
owner, name = identifier["owner"], identifier["name"]
logger.info("Getting info for {}/{}", owner, name)
@@ -69,82 +78,71 @@ def get_repository_info(identifier: RepositoryIdentifier) -> Optional[Repository
# create a logged in GitHub client
client = login(token=getenv("GH_ACCESS_TOKEN"))
- max_retries = 3
-
- for attempt in range(max_retries):
- try:
- repository = client.repository(owner, name)
- # Don't find issues inside archived repos.
- if repository.archived:
- return None
-
- good_first_issues = set(
- itertools.chain.from_iterable(
- repository.issues(
- labels=label,
- state=ISSUE_STATE,
- number=ISSUE_LIMIT,
- sort=ISSUE_SORT,
- direction=ISSUE_SORT_DIRECTION,
- )
- for label in ISSUE_LABELS
+ try:
+ repository = client.repository(owner, name)
+ # Don't find issues inside archived repos.
+ if repository.archived:
+ return None
+
+ good_first_issues = set(
+ itertools.chain.from_iterable(
+ repository.issues(
+ labels=label,
+ state=ISSUE_STATE,
+ number=ISSUE_LIMIT,
+ sort=ISSUE_SORT,
+ direction=ISSUE_SORT_DIRECTION,
)
+ for label in ISSUE_LABELS
)
- logger.info("\t found {} good first issues", len(good_first_issues))
- # check if repo has at least one good first issue
- if good_first_issues and repository.language:
- # store the repo info
- info: RepositoryInfo = {}
- info["name"] = name
- info["owner"] = owner
- info["description"] = emojize(repository.description or "")
- info["language"] = repository.language
- info["slug"] = slugify(repository.language, replacements=SLUGIFY_REPLACEMENTS)
- info["url"] = repository.html_url
- info["stars"] = repository.stargazers_count
- info["stars_display"] = numerize.numerize(repository.stargazers_count)
- info["last_modified"] = repository.pushed_at.isoformat()
- info["id"] = str(repository.id)
-
- # get the latest issues with the tag
- issues = []
- for issue in good_first_issues:
- issues.append(
- {
- "title": issue.title,
- "url": issue.html_url,
- "number": issue.number,
- "comments_count": issue.comments_count,
- "created_at": issue.created_at.isoformat(),
- }
- )
-
- info["issues"] = issues
- return info
- else:
- logger.info("\t skipping due to insufficient issues or info")
- return None
-
- except exceptions.ForbiddenError:
- if attempt < max_retries - 1:
- wait_time = 60 * (attempt + 1) # 60s, 120s, 180s
- logger.warning("Rate limited on {}/{}. Waiting {}s before retry...",
- owner, name, wait_time)
- time.sleep(wait_time)
- else:
- logger.error("Rate limit exceeded after {} retries: {}/{}",
- max_retries, owner, name)
- return None
+ )
+ logger.info("\t found {} good first issues", len(good_first_issues))
+ # check if repo has at least one good first issue
+ if good_first_issues and repository.language:
+ # store the repo info
+ info: RepositoryInfo = {}
+ info["name"] = name
+ info["owner"] = owner
+ info["description"] = emojize(repository.description or "")
+ info["language"] = repository.language
+ info["slug"] = slugify(repository.language, replacements=SLUGIFY_REPLACEMENTS)
+ info["url"] = repository.html_url
+ info["stars"] = repository.stargazers_count
+ info["stars_display"] = numerize.numerize(repository.stargazers_count)
+ info["last_modified"] = repository.pushed_at.isoformat()
+ info["id"] = str(repository.id)
+
+ # get the latest issues with the tag
+ issues = []
+ for issue in good_first_issues:
+ issues.append(
+ {
+ "title": issue.title,
+ "url": issue.html_url,
+ "number": issue.number,
+ "comments_count": issue.comments_count,
+ "created_at": issue.created_at.isoformat(),
+ }
+ )
- except exceptions.NotFoundError:
- logger.warning("Not Found: {}/{}", owner, name)
+ info["issues"] = issues
+ return info
+ else:
+ logger.info("\t skipping due to insufficient issues or info")
return None
- except exceptions.ConnectionError:
- logger.warning("Connection error: {}/{}", owner, name)
- return None
+ except exceptions.ForbiddenError:
+ # Don't sleep in the worker thread. Return a signal for retry at caller level.
+ logger.warning("Rate limited on {}. Will retry later.", f"{owner}/{name}")
+ return "RATE_LIMIT"
+
+ except exceptions.NotFoundError:
+ logger.warning("Not Found: {}/{}", owner, name)
+ return None
- return None
+ except exceptions.ConnectionError:
+ logger.warning("Connection error: {}/{}", owner, name)
+ return None
if __name__ == "__main__":
@@ -175,14 +173,39 @@ if __name__ == "__main__":
# shuffle the order of the repositories
random.shuffle(repositories)
- with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
- results = executor.map(get_repository_info, repositories)
-
- # filter out repositories with valid data and increment tag counts
- for result in results:
- if result:
- REPOSITORIES.append(result)
- TAGS[result["language"]] += 1
+ # Process repositories with retry logic outside the thread pool to avoid blocking workers
+ retry_queue = repositories.copy()
+ attempt_count = {}
+ max_retries_per_repo = 3
+
+ while retry_queue:
+ with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
+ results = executor.map(get_repository_info, retry_queue)
+
+ next_retry_queue = []
+ for identifier, result in zip(retry_queue, results):
+ if result == "RATE_LIMIT":
+ # Handle rate limit with retry outside thread pool
+ repo_key = f"{identifier['owner']}/{identifier['name']}"
+ current_attempt = attempt_count.get(repo_key, 0) + 1
+ attempt_count[repo_key] = current_attempt
+
+ if current_attempt < max_retries_per_repo:
+ next_retry_queue.append(identifier)
+ else:
+ logger.error("Rate limit exceeded after {} retries: {}",
+ max_retries_per_repo, repo_key)
+ elif result:
+ REPOSITORIES.append(result)
+ TAGS[result["language"]] += 1
+
+ retry_queue = next_retry_queue
+ if retry_queue:
+ # Sleep outside thread pool to avoid blocking workers
+ wait_time = 60 * min(attempt_count.values()) if attempt_count else 60
+ logger.info("Retrying {} repositories. Waiting {}s to avoid rate limits...",
+ len(retry_queue), wait_time)
+ time.sleep(wait_time)
# write to generated JSON files
Analysis
Bug Analysis
Why It Happens:
The original code at line 98 in get_repository_info() calls time.sleep(wait_time) when a ForbiddenError (rate limit) is caught. This function is executed inside a ThreadPoolExecutor worker with max_workers=10. When the thread sleeps, it remains blocked and cannot process other repository tasks.
When It Manifests:
- When processing multiple repositories through the ThreadPoolExecutor
- When one or more repositories trigger a GitHub API rate limit (ForbiddenError)
- The thread sleeps for 60s, 120s, or 180s depending on the retry attempt
- During this sleep, the worker thread is completely unavailable to the thread pool
Impact:
- With 10 worker threads, if 2-3 repositories hit rate limits simultaneously, 2-3 threads are sleeping for up to 3 minutes
- Only 7-8 threads are left to process work, reducing parallelism by 20-30%
- If more repositories hit rate limits, parallelism degrades further
- The ThreadPoolExecutor becomes inefficient, defeating its purpose for parallel processing
- Overall execution time increases significantly relative to an unblocked pool
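A minimal, self-contained sketch of this failure mode, with a 3-second sleep standing in for the real 60-180 second waits:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def task(i: int) -> int:
    if i == 0:
        time.sleep(3)  # stands in for the 60-180s rate-limit sleep
    return i

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(task, range(6)))
# One pinned worker stretches the batch to ~3s even though the
# other five tasks finish almost instantly.
print(f"elapsed: {time.monotonic() - start:.1f}s, results: {results}")
```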
Fix Applied
Solution: Move rate limit retry logic outside the thread pool to avoid blocking worker threads.
Key Changes:
- Modified get_repository_info() to return a "RATE_LIMIT" signal instead of sleeping when a ForbiddenError occurs
- Removed time.sleep() from inside the worker function
- Implemented a retry loop at the main level (lines 179-197):
- Collects all repositories that hit rate limits
- Sleeps outside the thread pool (main thread, not blocking workers)
- All 10 worker threads remain available during the sleep
- Retries rate-limited repositories in the next executor batch
Why It Solves The Issue:
- Worker threads never sleep - they can immediately pick up new tasks
- Rate limit sleep happens in the main thread, not blocking any workers
- All 10 workers remain available for processing throughout rate limit waits
- Parallelism is preserved, enabling efficient processing of all repositories
- The ThreadPoolExecutor achieves its intended performance benefit
- Add astral-sh/setup-uv@v7 step (fixes make pre-build failure)
- Update runner to blacksmith-2vcpu-ubuntu-2404
- Bump actions/checkout to v6
- Bump actions/setup-python to v6

Signed-off-by: Jai Pradeesh <[email protected]>
Force-pushed from ac9005e to 0b46b42
- Add GitHubRateLimiter class for thread-safe API throttling
- Share single GitHub client across all workers
- Proactively pause when quota drops below 100
- Coordinate all workers to pause together on rate limit hit
- Reduce MAX_CONCURRENCY from 10 to 5
- Remove unused itertools and Mapping imports
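A minimal sketch of what such a coordinator could look like. The method names and threshold handling here are assumptions for illustration; the PR's actual GitHubRateLimiter implementation is not shown in this excerpt.

```python
import threading
import time

class GitHubRateLimiter:
    """Illustrative thread-safe limiter: all workers pause together
    when the shared GitHub quota drops below a threshold."""

    def __init__(self, min_remaining: int = 100):
        self._lock = threading.Lock()
        self._resume_at = 0.0  # Unix epoch until which all workers wait
        self.min_remaining = min_remaining

    def before_request(self) -> None:
        # Every worker calls this before hitting the API; if a shared
        # pause is active, sleep in the caller until it elapses.
        with self._lock:
            delay = self._resume_at - time.time()
        if delay > 0:
            time.sleep(delay)

    def after_response(self, remaining: int, reset_epoch: float) -> None:
        # Fed with the X-RateLimit-Remaining / X-RateLimit-Reset header
        # values; proactively schedule a shared pause when quota runs low.
        if remaining < self.min_remaining:
            with self._lock:
                self._resume_at = max(self._resume_at, reset_epoch)
```

Sleeping in before_request still parks workers, but here that is the point: every worker pauses in step until the quota resets, instead of individual threads stalling unpredictably mid-batch.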