27 changes: 27 additions & 0 deletions .chloggen/gh-scrape-limits.yaml
@@ -0,0 +1,27 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. receiver/filelog)
component: receiver/github

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Add concurrency limit and pull request filtering to reduce rate limiting

# Mandatory: One or more tracking issues related to the change. You can use the PR number here if no issue exists.
issues: [43388]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext:

# If your change doesn't affect end users or the exported elements of any package,
# you should instead start your pull request title with [chore] or use the "Skip Changelog" label.
# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: []
6 changes: 6 additions & 0 deletions receiver/githubreceiver/README.md
@@ -77,6 +77,8 @@ receivers:
            enabled: true
        github_org: <myfancyorg>
        search_query: "org:<myfancyorg> topic:<o11yalltheway>" # Recommended optional query override, defaults to "{org,user}:<github_org>"
        max_concurrent_requests: 100 # Optional, default: 100
        pull_request_lookback_days: 30 # Optional, default: 30
        endpoint: "https://selfmanagedenterpriseserver.com" # Optional
        auth:
          authenticator: bearertokenauth/github
@@ -97,6 +99,10 @@ service:

`search_query` (optional): A filter to narrow down repositories. Defaults to `org:<github_org>` (or `user:<username>`). For example, use `repo:<org>/<repo>` to target a specific repository. Any valid GitHub search syntax is allowed.

`max_concurrent_requests` (optional, default: 100): Maximum number of concurrent API requests, used to stay under GitHub's secondary rate limit of 100 concurrent requests. Set it lower if the token is shared with other tools, or higher (at your own risk) if you understand the implications.

`pull_request_lookback_days` (optional, default: 30): Days to look back for merged pull requests. Set to 0 for unlimited history. Open pull requests are always fetched regardless of this setting.
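
For example, a collector whose token is also used by other tooling might tighten both settings. A hypothetical snippet (the `scrapers`/`scraper` nesting follows the example above; values are illustrative, not recommendations):

```yaml
receivers:
  github:
    scrapers:
      scraper:
        github_org: <myfancyorg>
        max_concurrent_requests: 50   # leave headroom for other consumers of the token
        pull_request_lookback_days: 7 # scrape only recently merged PRs
```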

`metrics` (optional): Enable or disable metrics scraping. See the [metrics documentation](./documentation.md) for details.

### Scraping
1 change: 1 addition & 0 deletions receiver/githubreceiver/go.mod
@@ -31,6 +31,7 @@ require (
	go.uber.org/goleak v1.3.0
	go.uber.org/multierr v1.11.0
	go.uber.org/zap v1.27.1
	golang.org/x/sync v0.18.0
)

require (
2 changes: 2 additions & 0 deletions receiver/githubreceiver/go.sum
95 changes: 81 additions & 14 deletions receiver/githubreceiver/internal/scraper/githubscraper/README.md
@@ -42,23 +42,90 @@ to prevent abuse and maintain API availability. The following secondary limit is
particularly relevant:

- **Concurrent Requests Limit**: The API allows no more than 100 concurrent
  requests. This limit is shared across the REST and GraphQL APIs. The scraper
  provides a `max_concurrent_requests` configuration option (default: 100) to
  control concurrency and reduce the likelihood of exceeding this limit.

## Configuration Options for Rate Limiting

To reduce rate limit issues, the GitHub scraper provides two configuration
options:

### Concurrency Control

```yaml
receivers:
  github:
    scrapers:
      github:
        max_concurrent_requests: 100 # Default: 100
```

The `max_concurrent_requests` option limits how many repositories are scraped
concurrently. GitHub's API enforces a secondary rate limit of 100 concurrent
requests (shared between REST and GraphQL APIs). The default value of 100
respects this limit.

**When to adjust:**
- Set lower (e.g., 50) if you're also using GitHub's API from other tools with
the same token
- Set higher than 100 only if you understand the risk of secondary rate limit
errors
- The receiver will warn if this value exceeds 100
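
This cap is implemented with a weighted semaphore from `golang.org/x/sync/semaphore` (the dependency this PR adds). The standalone sketch below shows the same pattern; the repository list and `scrapeRepo` are illustrative stand-ins, not the receiver's actual API:

```go
package main

import (
	"context"
	"fmt"
	"sync"

	"golang.org/x/sync/semaphore"
)

// scrapeRepo stands in for the per-repository API calls the scraper makes.
func scrapeRepo(_ context.Context, name string) {
	fmt.Println("scraping", name)
}

func main() {
	const maxConcurrentRequests = 100 // mirrors the config default
	sem := semaphore.NewWeighted(maxConcurrentRequests)
	ctx := context.Background()

	repos := []string{"repo-a", "repo-b", "repo-c"} // illustrative

	var wg sync.WaitGroup
	for _, repo := range repos {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			// Block until one of the maxConcurrentRequests slots frees up.
			if err := sem.Acquire(ctx, 1); err != nil {
				return // context cancelled
			}
			defer sem.Release(1)
			scrapeRepo(ctx, name)
		}(repo)
	}
	wg.Wait()
}
```

Acquiring inside the goroutine (rather than before spawning it) matches what this PR does in `scraper.go`: all goroutines start immediately, but only `max_concurrent_requests` of them may talk to the API at once.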

### Pull Request Time Filtering

```yaml
receivers:
  github:
    scrapers:
      github:
        pull_request_lookback_days: 30 # Default: 30
```

The `pull_request_lookback_days` option limits how far back to query for merged
pull requests. Open pull requests are always fetched regardless of age. The
scraper will stop paginating through merged PRs once it encounters PRs older
than the lookback period, significantly reducing API calls for repositories with
large PR histories.

**When to adjust:**
- Set to 0 to fetch all historical pull requests (may consume significant API
quota)
- Increase (e.g., 90) if your team's PR cycle time is longer
- Decrease (e.g., 7) if you only care about very recent merged PRs

**Note:** The implementation fetches open and merged PRs separately to enable
early termination of pagination for merged PRs, minimizing unnecessary API
calls.
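
Concretely, because merged PRs are returned newest-first, the first PR older than the cutoff ends the walk. A simplified sketch of that loop, where `fetchMergedPRPage` and the `pullRequest` type are hypothetical stand-ins for the generated GraphQL client:

```go
package example

import (
	"context"
	"time"
)

type pullRequest struct {
	MergedAt time.Time
}

// fetchMergedPRPage stands in for one GraphQL page of merged PRs, ordered
// by merge date, newest first. It returns the page, the next cursor
// (empty when the result set is exhausted), and any error.
func fetchMergedPRPage(_ context.Context, _ string) ([]pullRequest, string, error) {
	return nil, "", nil
}

func collectMergedPRs(ctx context.Context, lookbackDays int) ([]pullRequest, error) {
	var cutoff time.Time
	if lookbackDays > 0 {
		cutoff = time.Now().AddDate(0, 0, -lookbackDays) // zero means unlimited
	}

	var out []pullRequest
	cursor := ""
	for {
		page, next, err := fetchMergedPRPage(ctx, cursor)
		if err != nil {
			return nil, err
		}
		done := next == ""
		for _, pr := range page {
			// Newest-first ordering: the first PR past the cutoff means every
			// remaining PR on this and later pages is older, so stop paginating.
			if !cutoff.IsZero() && pr.MergedAt.Before(cutoff) {
				done = true
				break
			}
			out = append(out, pr)
		}
		if done {
			return out, nil
		}
		cursor = next
	}
}
```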

### Example Configuration

```yaml
receivers:
  github:
    collection_interval: 300s # 5 minutes
    scrapers:
      github:
        github_org: myorg
        search_query: "org:myorg topic:observability"
        max_concurrent_requests: 50
        pull_request_lookback_days: 30
        endpoint: https://github.example.com # For GitHub Enterprise
        auth:
          authenticator: bearertokenauth/github
```

### Recommendations

Based on the limitations above, we recommend:

- One instance of the receiver per team
- Each instance should have its own token, since rate limits are tied to the token
- Use `search_query` to limit repositories to a reasonable number
- Set `collection_interval` to 300s (5 minutes) or higher to avoid primary rate limits
- Use `max_concurrent_requests: 100` (default) to prevent secondary rate limit errors
- Use `pull_request_lookback_days: 30` (default) to limit historical PR queries
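
As a rough sanity check on `collection_interval` (illustrative numbers, not measurements): if a scrape of 100 repositories costs on the order of 4 API calls per repository, each cycle uses roughly 400 requests. At a 300s interval that is 12 cycles per hour, about 4,800 requests, just under the 5,000 requests/hour primary limit of a typical authenticated token. More repositories or deeper PR history calls for a proportionally longer interval.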

**Additional Resources:**

@@ -4,7 +4,10 @@
package githubscraper // import "github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver/internal/scraper/githubscraper"

import (
	"errors"

	"go.opentelemetry.io/collector/config/confighttp"
	"go.uber.org/multierr"

	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver/internal"
	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver/internal/metadata"
@@ -19,4 +22,27 @@ type Config struct {
	GitHubOrg string `mapstructure:"github_org"`
	// SearchQuery is the query to use when defining a custom search for repository data
	SearchQuery string `mapstructure:"search_query"`
	// MaxConcurrentRequests limits the number of concurrent API requests to prevent
	// exceeding GitHub's secondary rate limit of 100 concurrent requests.
	// Default: 100
	MaxConcurrentRequests int `mapstructure:"max_concurrent_requests"`
	// PullRequestLookbackDays limits how far back to query for merged/closed pull requests.
	// Open pull requests are always fetched regardless of age.
	// Set to 0 to fetch all historical pull requests.
	// Default: 30
	PullRequestLookbackDays int `mapstructure:"pull_request_lookback_days"`
}

func (cfg *Config) Validate() error {
	var errs error

	if cfg.MaxConcurrentRequests <= 0 {
		errs = multierr.Append(errs, errors.New("max_concurrent_requests must be greater than 0"))
	}

	if cfg.PullRequestLookbackDays < 0 {
		errs = multierr.Append(errs, errors.New("pull_request_lookback_days cannot be negative"))
	}

	return errs
}
@@ -23,9 +23,71 @@ func TestConfig(t *testing.T) {
	clientConfig.Timeout = 15 * time.Second

	expectedConfig := &Config{
		MetricsBuilderConfig:    metadata.DefaultMetricsBuilderConfig(),
		ClientConfig:            clientConfig,
		MaxConcurrentRequests:   100,
		PullRequestLookbackDays: 30,
	}

	assert.Equal(t, expectedConfig, defaultConfig)
}

func TestConfigValidate(t *testing.T) {
	tests := []struct {
		name        string
		config      *Config
		expectedErr string
	}{
		{
			name: "valid config with defaults",
			config: &Config{
				MaxConcurrentRequests:   100,
				PullRequestLookbackDays: 30,
			},
			expectedErr: "",
		},
		{
			name: "valid config with zero lookback (unlimited)",
			config: &Config{
				MaxConcurrentRequests:   50,
				PullRequestLookbackDays: 0,
			},
			expectedErr: "",
		},
		{
			name: "invalid negative max_concurrent_requests",
			config: &Config{
				MaxConcurrentRequests:   -1,
				PullRequestLookbackDays: 30,
			},
			expectedErr: "max_concurrent_requests must be greater than 0",
		},
		{
			name: "invalid zero max_concurrent_requests",
			config: &Config{
				MaxConcurrentRequests:   0,
				PullRequestLookbackDays: 30,
			},
			expectedErr: "max_concurrent_requests must be greater than 0",
		},
		{
			name: "invalid negative lookback days",
			config: &Config{
				MaxConcurrentRequests:   100,
				PullRequestLookbackDays: -1,
			},
			expectedErr: "pull_request_lookback_days cannot be negative",
		},
	}

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			err := tt.config.Validate()
			if tt.expectedErr == "" {
				assert.NoError(t, err)
			} else {
				assert.ErrorContains(t, err, tt.expectedErr)
			}
		})
	}
}
@@ -18,8 +18,10 @@ import (
// This file implements factory for the GitHub Scraper as part of the GitHub Receiver

const (
	TypeStr                        = "scraper"
	defaultHTTPTimeout             = 15 * time.Second
	defaultMaxConcurrentRequests   = 100
	defaultPullRequestLookbackDays = 30
)

type Factory struct{}
@@ -28,8 +30,10 @@ func (*Factory) CreateDefaultConfig() internal.Config {
	clientConfig := confighttp.NewDefaultClientConfig()
	clientConfig.Timeout = defaultHTTPTimeout
	return &Config{
		MetricsBuilderConfig:    metadata.DefaultMetricsBuilderConfig(),
		ClientConfig:            clientConfig,
		MaxConcurrentRequests:   defaultMaxConcurrentRequests,
		PullRequestLookbackDays: defaultPullRequestLookbackDays,
	}
}

@@ -17,6 +17,7 @@ import (
	"go.opentelemetry.io/collector/pdata/pmetric"
	"go.opentelemetry.io/collector/receiver"
	"go.uber.org/zap"
	"golang.org/x/sync/semaphore"

	"github.com/open-telemetry/opentelemetry-collector-contrib/receiver/githubreceiver/internal/metadata"
)
@@ -30,6 +31,7 @@ type githubScraper struct {
	logger *zap.Logger
	mb     *metadata.MetricsBuilder
	rb     *metadata.ResourceBuilder
	sem    *semaphore.Weighted // Concurrency limiter
}

func (ghs *githubScraper) start(ctx context.Context, host component.Host) (err error) {
@@ -95,6 +97,28 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {

	ghs.mb.RecordVcsRepositoryCountDataPoint(now, int64(count))

	// Log warning if repository count exceeds configured concurrency limit
	if len(repos) > ghs.cfg.MaxConcurrentRequests {
		ghs.logger.Sugar().Warnf(
			"Found %d repositories but max_concurrent_requests is set to %d. "+
				"Consider using search_query to reduce repository count or increase max_concurrent_requests. "+
				"Note: GitHub's API limit is 100 concurrent requests.",
			len(repos), ghs.cfg.MaxConcurrentRequests,
		)
	}

	// Log warning if max_concurrent_requests exceeds GitHub's limit
	if ghs.cfg.MaxConcurrentRequests > 100 {
		ghs.logger.Sugar().Warnf(
			"max_concurrent_requests is set to %d which exceeds GitHub's API limit of 100 concurrent requests. "+
				"This may result in secondary rate limit errors.",
			ghs.cfg.MaxConcurrentRequests,
		)
	}

	// Initialize semaphore for concurrency control
	ghs.sem = semaphore.NewWeighted(int64(ghs.cfg.MaxConcurrentRequests))

	// Get the ref (branch) count (future branch data) for each repo and record
	// the given metrics
	var wg sync.WaitGroup
@@ -110,6 +134,13 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {
		go func() {
			defer wg.Done()

			// Acquire semaphore before making API calls
			if err := ghs.sem.Acquire(ctx, 1); err != nil {
				ghs.logger.Sugar().Errorf("failed to acquire semaphore: %v", err)
				return
			}
			defer ghs.sem.Release(1)

			branches, count, err := ghs.getBranches(ctx, genClient, name, trunk)
			if err != nil {
				ghs.logger.Sugar().Errorf("error getting branch count: %v", zap.Error(err))
@@ -164,7 +195,7 @@ func (ghs *githubScraper) scrape(ctx context.Context) (pmetric.Metrics, error) {
			ghs.mb.RecordVcsContributorCountDataPoint(now, int64(contribs), url, name)

			// Get change (pull request) data
			prs, err := ghs.getPullRequests(ctx, genClient, name, ghs.cfg.PullRequestLookbackDays)
			if err != nil {
				ghs.logger.Sugar().Errorf("error getting pull requests: %v", zap.Error(err))
			}