
Conversation

@jai-deepsource
Contributor

  • Switch from Poetry to uv for Python dependency management
  • Add retry logic with exponential backoff for GitHub rate limits
  • Reduce MAX_CONCURRENCY from 25 to 10 to avoid rate limits
  • Remove MAX_REPOSITORIES cap so all repos are processed
  • Sort repositories.toml alphabetically, add/remove some repos
  • Update CONTRIBUTING.md with uv setup instructions
  • Update frontend dependencies (nuxt, prettier, etc.)

@vercel

vercel bot commented Jan 20, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Review Updated (UTC)
good-first-issue Ignored Ignored Preview Jan 20, 2026 10:11pm


requires-python = ">=3.9"
dependencies = [
    "github3.py>=4.0.1",
    "toml>=0.10.2",


loguru is placed in optional dev dependencies instead of main dependencies, causing ImportError when running populate.py in production

📝 Patch Details
diff --git a/pyproject.toml b/pyproject.toml
index c788dcf..bf9317c 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -14,6 +14,7 @@ dependencies = [
     "emoji>=0.6.0",
     "requests>=2.23.0",
     "python-slugify>=4.0.1",
+    "loguru>=0.7.2",
 ]
 
 [project.optional-dependencies]
@@ -21,7 +22,6 @@ dev = [
     "ruff>=0.1.8",
     "types-toml>=0.10.8.7",
     "types-python-slugify>=8.0.0.3",
-    "loguru>=0.7.2",
     "mypy>=1.7.1",
 ]
 

Analysis

Why the bug exists:
The gfi/populate.py script is part of the production code and imports loguru at the module level (from loguru import logger at line 20). The module uses the logger extensively throughout the main execution flow with calls like logger.info(), logger.warning(), and logger.error(). However, in the migration from Poetry to uv, loguru was mistakenly left in [project.optional-dependencies] dev instead of being moved to the main [project] dependencies section.

When it manifests:
When anyone installs the package with just pip install good-first-issue (without the dev extras), and then attempts to run the populate script or import the gfi module, Python will raise an ImportError: No module named 'loguru' because loguru is not installed.

Impact:
This breaks the core functionality of the populate script, which is a key production component for gathering repository data. Any production deployment or normal usage of this module will fail immediately on import.

The fix:
I moved loguru>=0.7.2 from [project.optional-dependencies] dev to the main [project] dependencies list. This ensures loguru is installed whenever the package is installed, making it available for the populate.py script in all deployment scenarios. The package version constraint (>=0.7.2) is preserved.
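
As a quick sanity check on the fix, the snippet below (a sketch, not part of the patch) verifies that loguru is declared as a runtime dependency of the installed distribution; it assumes the distribution name good-first-issue used above.

from importlib.metadata import requires

# Requirement strings for extras carry an 'extra == "..."' marker; keep only
# the main [project] dependencies.
runtime_requirements = [
    req for req in (requires("good-first-issue") or []) if "extra ==" not in req
]
assert any(req.startswith("loguru") for req in runtime_requirements), "loguru must be a main dependency"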

if good_first_issues and repository.language:
    # store the repo info
    info: RepositoryInfo = {}
    info["name"] = name


Thread pool worker blocked by 60-180 second sleep during rate limiting, preventing parallel processing of other repositories

📝 Patch Details
diff --git a/gfi/populate.py b/gfi/populate.py
index b06443a..f9f4d5a 100644
--- a/gfi/populate.py
+++ b/gfi/populate.py
@@ -9,7 +9,7 @@ from collections import Counter
 from concurrent.futures import ThreadPoolExecutor
 from operator import itemgetter
 from os import getenv, path
-from typing import TypedDict, Dict, Union, Sequence, Optional, Mapping
+from typing import TypedDict, Dict, Union, Sequence, Optional, Mapping, Tuple
 
 import toml
 
@@ -61,7 +61,16 @@ RepositoryInfo = Dict["str", Union[str, int, Sequence]]
 
 
 def get_repository_info(identifier: RepositoryIdentifier) -> Optional[RepositoryInfo]:
-    """Get the relevant information needed for the repository from its owner login and name."""
+    """Get the relevant information needed for the repository from its owner login and name.
+    
+    Note: This function does NOT retry on rate limits. Retries are handled at the caller level
+    to avoid blocking thread pool workers.
+    
+    Returns:
+        - RepositoryInfo dict if successful
+        - None if repository doesn't meet criteria or not found
+        - "RATE_LIMIT" string if rate limited (signal for retry outside thread pool)
+    """
     owner, name = identifier["owner"], identifier["name"]
 
     logger.info("Getting info for {}/{}", owner, name)
@@ -69,82 +78,71 @@ def get_repository_info(identifier: RepositoryIdentifier) -> Optional[Repository
     # create a logged in GitHub client
     client = login(token=getenv("GH_ACCESS_TOKEN"))
 
-    max_retries = 3
-
-    for attempt in range(max_retries):
-        try:
-            repository = client.repository(owner, name)
-            # Don't find issues inside archived repos.
-            if repository.archived:
-                return None
-
-            good_first_issues = set(
-                itertools.chain.from_iterable(
-                    repository.issues(
-                        labels=label,
-                        state=ISSUE_STATE,
-                        number=ISSUE_LIMIT,
-                        sort=ISSUE_SORT,
-                        direction=ISSUE_SORT_DIRECTION,
-                    )
-                    for label in ISSUE_LABELS
+    try:
+        repository = client.repository(owner, name)
+        # Don't find issues inside archived repos.
+        if repository.archived:
+            return None
+
+        good_first_issues = set(
+            itertools.chain.from_iterable(
+                repository.issues(
+                    labels=label,
+                    state=ISSUE_STATE,
+                    number=ISSUE_LIMIT,
+                    sort=ISSUE_SORT,
+                    direction=ISSUE_SORT_DIRECTION,
                 )
+                for label in ISSUE_LABELS
             )
-            logger.info("\t found {} good first issues", len(good_first_issues))
-            # check if repo has at least one good first issue
-            if good_first_issues and repository.language:
-                # store the repo info
-                info: RepositoryInfo = {}
-                info["name"] = name
-                info["owner"] = owner
-                info["description"] = emojize(repository.description or "")
-                info["language"] = repository.language
-                info["slug"] = slugify(repository.language, replacements=SLUGIFY_REPLACEMENTS)
-                info["url"] = repository.html_url
-                info["stars"] = repository.stargazers_count
-                info["stars_display"] = numerize.numerize(repository.stargazers_count)
-                info["last_modified"] = repository.pushed_at.isoformat()
-                info["id"] = str(repository.id)
-
-                # get the latest issues with the tag
-                issues = []
-                for issue in good_first_issues:
-                    issues.append(
-                        {
-                            "title": issue.title,
-                            "url": issue.html_url,
-                            "number": issue.number,
-                            "comments_count": issue.comments_count,
-                            "created_at": issue.created_at.isoformat(),
-                        }
-                    )
-
-                info["issues"] = issues
-                return info
-            else:
-                logger.info("\t skipping due to insufficient issues or info")
-                return None
-
-        except exceptions.ForbiddenError:
-            if attempt < max_retries - 1:
-                wait_time = 60 * (attempt + 1)  # 60s, 120s, 180s
-                logger.warning("Rate limited on {}/{}. Waiting {}s before retry...",
-                             owner, name, wait_time)
-                time.sleep(wait_time)
-            else:
-                logger.error("Rate limit exceeded after {} retries: {}/{}",
-                           max_retries, owner, name)
-                return None
+        )
+        logger.info("\t found {} good first issues", len(good_first_issues))
+        # check if repo has at least one good first issue
+        if good_first_issues and repository.language:
+            # store the repo info
+            info: RepositoryInfo = {}
+            info["name"] = name
+            info["owner"] = owner
+            info["description"] = emojize(repository.description or "")
+            info["language"] = repository.language
+            info["slug"] = slugify(repository.language, replacements=SLUGIFY_REPLACEMENTS)
+            info["url"] = repository.html_url
+            info["stars"] = repository.stargazers_count
+            info["stars_display"] = numerize.numerize(repository.stargazers_count)
+            info["last_modified"] = repository.pushed_at.isoformat()
+            info["id"] = str(repository.id)
+
+            # get the latest issues with the tag
+            issues = []
+            for issue in good_first_issues:
+                issues.append(
+                    {
+                        "title": issue.title,
+                        "url": issue.html_url,
+                        "number": issue.number,
+                        "comments_count": issue.comments_count,
+                        "created_at": issue.created_at.isoformat(),
+                    }
+                )
 
-        except exceptions.NotFoundError:
-            logger.warning("Not Found: {}/{}", owner, name)
+            info["issues"] = issues
+            return info
+        else:
+            logger.info("\t skipping due to insufficient issues or info")
             return None
 
-        except exceptions.ConnectionError:
-            logger.warning("Connection error: {}/{}", owner, name)
-            return None
+    except exceptions.ForbiddenError:
+        # Don't sleep in the worker thread. Return a signal for retry at caller level.
+        logger.warning("Rate limited on {}. Will retry later.", f"{owner}/{name}")
+        return "RATE_LIMIT"
+
+    except exceptions.NotFoundError:
+        logger.warning("Not Found: {}/{}", owner, name)
+        return None
 
-    return None
+    except exceptions.ConnectionError:
+        logger.warning("Connection error: {}/{}", owner, name)
+        return None
 
 
 if __name__ == "__main__":
@@ -175,14 +173,39 @@ if __name__ == "__main__":
         # shuffle the order of the repositories
         random.shuffle(repositories)
 
-        with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
-            results = executor.map(get_repository_info, repositories)
-
-        # filter out repositories with valid data and increment tag counts
-        for result in results:
-            if result:
-                REPOSITORIES.append(result)
-                TAGS[result["language"]] += 1
+        # Process repositories with retry logic outside the thread pool to avoid blocking workers
+        retry_queue = repositories.copy()
+        attempt_count = {}
+        max_retries_per_repo = 3
+
+        while retry_queue:
+            with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as executor:
+                results = executor.map(get_repository_info, retry_queue)
+
+            next_retry_queue = []
+            for identifier, result in zip(retry_queue, results):
+                if result == "RATE_LIMIT":
+                    # Handle rate limit with retry outside thread pool
+                    repo_key = f"{identifier['owner']}/{identifier['name']}"
+                    current_attempt = attempt_count.get(repo_key, 0) + 1
+                    attempt_count[repo_key] = current_attempt
+
+                    if current_attempt < max_retries_per_repo:
+                        next_retry_queue.append(identifier)
+                    else:
+                        logger.error("Rate limit exceeded after {} retries: {}",
+                                   max_retries_per_repo, repo_key)
+                elif result:
+                    REPOSITORIES.append(result)
+                    TAGS[result["language"]] += 1
+
+            retry_queue = next_retry_queue
+            if retry_queue:
+                # Sleep outside thread pool to avoid blocking workers
+                wait_time = 60 * min(attempt_count.values()) if attempt_count else 60
+                logger.info("Retrying {} repositories. Waiting {}s to avoid rate limits...", 
+                           len(retry_queue), wait_time)
+                time.sleep(wait_time)
 
     # write to generated JSON files
 

Analysis

Bug Analysis

Why It Happens:
The original code at line 98 in get_repository_info() calls time.sleep(wait_time) when a ForbiddenError (rate limit) is caught. This function is executed inside a ThreadPoolExecutor worker with max_workers=10. When the thread sleeps, it remains blocked and cannot process other repository tasks.

When It Manifests:

  • When processing multiple repositories through the ThreadPoolExecutor
  • When one or more repositories trigger a GitHub API rate limit (ForbiddenError)
  • The thread sleeps for 60s, 120s, or 180s depending on the retry attempt
  • During this sleep, the worker thread is completely unavailable to the thread pool

Impact:

  • With 10 worker threads, if 2-3 repositories hit rate limits simultaneously, 2-3 threads are sleeping for up to 3 minutes
  • Only 7-8 threads are left to process work, reducing parallelism by 20-30%
  • If more repositories hit rate limits, parallelism degrades further
  • The ThreadPoolExecutor becomes inefficient, defeating its purpose for parallel processing
  • Overall execution time increases significantly, eroding the speedup that parallel processing was meant to provide

Fix Applied

Solution: Move rate limit retry logic outside the thread pool to avoid blocking worker threads.

Key Changes:

  1. Modified get_repository_info() to return "RATE_LIMIT" signal instead of sleeping when a ForbiddenError occurs
  2. Removed time.sleep() from inside the worker function
  3. Implemented retry loop at the main level (lines 179-197):
    • Collects all repositories that hit rate limits
    • Sleeps outside the thread pool (main thread, not blocking workers)
    • All 10 worker threads remain available during the sleep
    • Retries rate-limited repositories in the next executor batch

Why It Solves The Issue:

  • Worker threads never sleep, so they can immediately pick up new tasks
  • Rate limit sleep happens in the main thread, not blocking any workers
  • All 10 workers remain available for processing throughout rate limit waits
  • Parallelism is preserved, enabling efficient processing of all repositories
  • The ThreadPoolExecutor achieves its intended performance benefit
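
Stripped of the GitHub specifics, the retry-outside-the-pool pattern looks roughly like the sketch below. The fetch_one name and the fixed 60-second wait are illustrative assumptions, not the PR's exact code, which is in the patch above.

import time
from concurrent.futures import ThreadPoolExecutor

RATE_LIMITED = "RATE_LIMIT"  # sentinel a worker returns instead of sleeping
MAX_RETRIES = 3
WAIT_SECONDS = 60

def fetch_one(item):
    """Hypothetical worker: returns a result, None to skip, or RATE_LIMITED."""
    ...

def fetch_all(items, max_workers=10):
    # items are assumed hashable (the PR builds an "owner/name" key for this)
    results, pending, attempts = [], list(items), {}
    while pending:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            outcomes = list(executor.map(fetch_one, pending))

        retry = []
        for item, outcome in zip(pending, outcomes):
            if outcome == RATE_LIMITED:
                attempts[item] = attempts.get(item, 0) + 1
                if attempts[item] < MAX_RETRIES:
                    retry.append(item)  # re-queue for the next batch
            elif outcome is not None:
                results.append(outcome)

        pending = retry
        if pending:
            # Sleep in the main thread only; no worker thread is blocked.
            time.sleep(WAIT_SECONDS)
    return results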

- Add astral-sh/setup-uv@v7 step (fixes make pre-build failure)
- Update runner to blacksmith-2vcpu-ubuntu-2404
- Bump actions/checkout to v6
- Bump actions/setup-python to v6

Signed-off-by: Jai Pradeesh <[email protected]>
- Add GitHubRateLimiter class for thread-safe API throttling
- Share single GitHub client across all workers
- Proactively pause when quota drops below 100
- Coordinate all workers to pause together on rate limit hit
- Reduce MAX_CONCURRENCY from 10 to 5
- Remove unused itertools and Mapping imports
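
The merged code for that follow-up is not shown here, but based on the commit message a thread-safe limiter along these lines is plausible. The class name and the quota threshold of 100 come from the message above; the body, and the use of github3.py's rate_limit() endpoint, are assumptions for illustration only.

import threading
import time

class GitHubRateLimiter:
    """Sketch of a shared, thread-safe throttle; not the merged implementation."""

    def __init__(self, client, min_remaining=100):
        self._client = client              # single github3.py client shared by all workers
        self._min_remaining = min_remaining
        self._lock = threading.Lock()

    def wait_if_needed(self):
        # Holding the lock while sleeping makes every other worker queue up
        # behind it, so all workers effectively pause together.
        with self._lock:
            core = self._client.rate_limit()["resources"]["core"]
            if core["remaining"] < self._min_remaining:
                pause = max(core["reset"] - time.time(), 0) + 1
                time.sleep(pause)

Each worker would presumably call wait_if_needed() before every API request, so a depleted quota pauses the whole pool rather than burning retries.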
@jai-deepsource jai-deepsource merged commit d85c8c4 into master Jan 20, 2026
2 checks passed
@jai-deepsource jai-deepsource deleted the misc-improvements branch January 20, 2026 23:53