Google Cloud BigQuery improvements #18467
Conversation
Walkthrough

Adds explicit BigQuery pagination: introduces `createQueryJob`, changes `getRowsForQuery` to accept a query job and page tokens, implements a paginated `processCollection` loop emitting per-row or batched events, removes `processSingle`, and bumps package and source version metadata and a dependency.
Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant Src as Source
  participant BQ as BigQuery API
  rect rgb(242,248,255)
    note over Src: run() → getQueryOpts() → createQueryJob()
    Src->>Src: getQueryOpts()
    Src->>BQ: createQueryJob(queryOpts)
    BQ-->>Src: job
  end
  rect rgb(245,255,245)
    loop paginate (pageToken, maxPages)
      Src->>BQ: getRowsForQuery(job, pageSize, pageToken)
      BQ-->>Src: { rows[], nextPageToken }
      alt eventSize == 1
        Src->>Src: processEvent(row) / $emit(row, meta)
      else
        Src->>Src: generateMetaForCollection(rows) / $emit(rows, meta)
      end
      Src->>Src: update pageToken / increment page count
      alt stop conditions
        Src->>Src: stop on no rows or no nextPageToken or maxPages reached
      end
    end
  end
```
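As a rough illustration of the loop the diagram describes, here is a minimal sketch in plain JavaScript. The helper names (`createQueryJob`, `getRowsForQuery`) come from the walkthrough; the injected-function shape, the defaults, and the single-row emission are simplifying assumptions, not the PR's actual source:

```js
// Minimal sketch of the paginated run loop (assumed shape, not the PR's code).
async function runPaginated({ createQueryJob, getRowsForQuery, emit, queryOpts }) {
  const pageSize = 1000; // rows fetched per page
  const maxPages = 10;   // safety cap per run

  const job = await createQueryJob(queryOpts);
  let pageToken = null;

  for (let page = 0; page < maxPages; page++) {
    const { rows, pageToken: nextPageToken } =
      await getRowsForQuery(job, pageSize, pageToken);
    if (rows.length === 0) break;       // stop: no rows returned
    for (const row of rows) emit(row);  // per-row emission; batching omitted
    if (!nextPageToken) break;          // stop: final page reached
    pageToken = nextPageToken;          // advance to the next page
  }
}
```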
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
components/google_cloud/sources/bigquery-new-row/bigquery-new-row.mjs (1)
103-126: Restore compatibility with the updated `getRowsForQuery` signature

Line 116 still passes `queryOpts` and treats the return value as an array, but `getRowsForQuery` now expects a query job and returns an object. The current call will throw (`job.getQueryResults is not a function`) and prevent the source from bootstrapping its `lastResultId`. Please create the job first and destructure the rows from the new return shape.

```diff
- const rows = await this.getRowsForQuery(queryOpts, this.datasetId);
+ const job = await this.createQueryJob(queryOpts);
+ const { rows } = await this.getRowsForQuery(job, limit);
```
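For context, the corrected bootstrap might read roughly as follows. The surrounding function shape and the sample-page size are assumptions; only the two BigQuery calls come from the suggestion above, and `getQueryOpts`/`_updateLastResultId` are the names used elsewhere in this review:

```js
// Hypothetical bootstrap sketch; the createQueryJob/getRowsForQuery pair
// follows the suggested fix, everything else is assumed for illustration.
async function bootstrapLastResultId(source, limit = 10) {
  const queryOpts = source.getQueryOpts();
  const job = await source.createQueryJob(queryOpts);
  const { rows } = await source.getRowsForQuery(job, limit);
  if (rows.length > 0) {
    source._updateLastResultId(rows); // seed dedupe state from the sample rows
  }
}
```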
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`pnpm-lock.yaml` is excluded by `!**/pnpm-lock.yaml`
📒 Files selected for processing (4)
- `components/google_cloud/package.json` (2 hunks)
- `components/google_cloud/sources/bigquery-new-row/bigquery-new-row.mjs` (1 hunk)
- `components/google_cloud/sources/bigquery-query-results/bigquery-query-results.mjs` (1 hunk)
- `components/google_cloud/sources/common/bigquery.mjs` (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-12-12T19:23:09.039Z
Learnt from: jcortes
PR: PipedreamHQ/pipedream#14935
File: components/sailpoint/package.json:15-18
Timestamp: 2024-12-12T19:23:09.039Z
Learning: When developing Pipedream components, do not add built-in Node.js modules like `fs` to `package.json` dependencies, as they are native modules provided by the Node.js runtime.
Applied to files:
components/google_cloud/package.json
🧬 Code graph analysis (1)
components/google_cloud/sources/common/bigquery.mjs (1)
components/google_cloud/sources/bigquery-new-row/bigquery-new-row.mjs (2)
`queryOpts` (110-115), `rows` (116-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pnpm publish
- GitHub Check: Publish TypeScript components
- GitHub Check: Verify TypeScript components
- GitHub Check: Lint Code Base
🔇 Additional comments (2)
components/google_cloud/package.json (1)
23-23: Confirm compatibility with `@pipedream/platform@^3.1.0`

This jumps the SDK from the pre-1.0 line to the 3.x series, so previous helper APIs and runtime behavior may have breaking changes. Please re-run the affected Google Cloud sources end-to-end (or at least rebuild + dry-run them) to confirm nothing regresses before publishing. If you've already validated this, just note it in the PR.
components/google_cloud/sources/bigquery-query-results/bigquery-query-results.mjs (1)
11-11: Version bump matches the BigQuery pagination work

Updating the source version to 0.1.6 lines up with the underlying pagination changes; looks good.
Actionable comments posted: 0
🧹 Nitpick comments (3)
components/google_cloud/sources/common/bigquery.mjs (3)
74-78: Make page size and max pages configurable (avoids OOMs and fits the PR objective).

Hard-coded `pageSize = 1000` and `maxPages = 10` may still cause memory pressure for large rows. Expose them as props and use them here.

Apply within this segment:

```diff
- const pageSize = 1000, maxPages = 10;
+ const pageSize = this.pageSize ?? 1000;
+ const maxPages = this.maxPages ?? 10;
```

Then add props (outside this range):

```js
// In props:
pageSize: {
  type: "integer",
  label: "Rows per page",
  description: "Max rows fetched per BigQuery page (lower values reduce memory).",
  default: 1000,
  min: 1,
},
maxPages: {
  type: "integer",
  label: "Max pages per run",
  description: "Safety cap on pages processed per run (prevents long/expensive pulls).",
  default: 10,
  min: 1,
},
```
74-114: Simplify pagination loop with a bounded for-loop.

The `allProcessed` flag is unnecessary. A bounded for-loop is clearer and avoids state juggling.

```diff
-let pageToken = null;
-let allProcessed = false;
-let pageCount = 0;
-
-while (!allProcessed) {
+let pageToken = null;
+for (let pageCount = 0; pageCount < maxPages; pageCount++) {
   const {
     rows,
     pageToken: nextPageToken,
   } = await this.getRowsForQuery(job, pageSize, pageToken);

   if (rows.length === 0) {
-    allProcessed = true;
     break;
   }

   chunk(rows, this.eventSize).forEach((batch) => {
     if (this.eventSize === 1) {
       const meta = this.generateMeta(batch[0], timestamp);
       this.$emit(batch[0], meta);
     } else {
       const meta = this.generateMetaForCollection(batch, timestamp);
       const data = {
         rows: batch,
         rowCount: batch.length,
       };
       this.$emit(data, meta);
     }
   });

-  pageCount++;
-  if (pageCount >= maxPages) {
-    allProcessed = true;
-  }

   if (this.uniqueKey) {
     this._updateLastResultId(rows);
   }

   if (!nextPageToken) {
     break;
   }
   pageToken = nextPageToken;
-}
+}
```
79-114: Add contextual error handling to improve OOM reporting.

Wrap page fetch/emit in try/catch and include page index, page size, and row counts in the error to aid debugging (aligns with issue #18352).

```diff
-while (!allProcessed) {
+while (!allProcessed) {
+  try {
   const {
     rows,
     pageToken: nextPageToken,
   } = await this.getRowsForQuery(job, pageSize, pageToken);

   if (rows.length === 0) {
     allProcessed = true;
     break;
   }

   chunk(rows, this.eventSize).forEach((batch) => {
     if (this.eventSize === 1) {
       const meta = this.generateMeta(batch[0], timestamp);
       this.$emit(batch[0], meta);
     } else {
       const meta = this.generateMetaForCollection(batch, timestamp);
       const data = {
         rows: batch,
         rowCount: batch.length,
       };
       this.$emit(data, meta);
     }
   });

   pageCount++;
   if (pageCount >= maxPages) {
     allProcessed = true;
   }

   if (this.uniqueKey) {
     this._updateLastResultId(rows);
   }

   if (!nextPageToken) {
     break;
   }
   pageToken = nextPageToken;
+  } catch (err) {
+    const context = {
+      pageCount,
+      pageSize,
+      pageToken,
+      eventSize: this.eventSize,
+    };
+    err.message = `BigQuery pagination error: ${err.message} | context=${JSON.stringify(context)}`;
+    throw err;
+  }
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `components/google_cloud/sources/common/bigquery.mjs` (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/google_cloud/sources/common/bigquery.mjs (1)
components/google_cloud/sources/bigquery-new-row/bigquery-new-row.mjs (2)
`queryOpts` (110-115), `rows` (116-116)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pnpm publish
- GitHub Check: Lint Code Base
- GitHub Check: Verify TypeScript components
- GitHub Check: Publish TypeScript components
🔇 Additional comments (3)
components/google_cloud/sources/common/bigquery.mjs (3)
110-114: Break on missing next page token; pagination termination fixed.

Exiting when `nextPageToken` is falsy prevents re-reading the first page. Good.
33-41: `Dataset.createQueryJob` is supported: available on both the BigQuery client and Dataset instance; no changes needed.
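For reference, a minimal sketch of both call sites against the @google-cloud/bigquery Node client; the dataset name and query are placeholders, and the manual-paging tuple (`[rows, nextQuery]`) reflects the client's documented return shape:

```js
import { BigQuery } from "@google-cloud/bigquery";

const bigquery = new BigQuery();
const dataset = bigquery.dataset("my_dataset"); // placeholder dataset ID

// createQueryJob is available on both the client and a Dataset instance.
const [ job ] = await dataset.createQueryJob({ query: "SELECT 1 AS n" });

// Fetch one page manually; nextQuery (if non-null) carries the pageToken.
const [ rows, nextQuery ] = await job.getQueryResults({
  maxResults: 1000,
  autoPaginate: false,
});
console.log(rows.length, nextQuery && nextQuery.pageToken);
```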
42-59: Confirm `getRowsForQuery` usages updated
Ripgrep found no JS/TS references; check .mjs and other files for any lingering old-signature calls.
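One way to re-run that check locally, assuming ripgrep is installed (the globs are illustrative):

```sh
# Look for lingering old-signature calls across .mjs/.js/.ts files.
rg --glob '*.mjs' --glob '*.js' --glob '*.ts' 'getRowsForQuery\('
```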
LGTM!
Resolves #18352
Summary by CodeRabbit

New Features
- Explicit pagination for BigQuery query results, with per-row or batched event emission.

Refactor
- `getRowsForQuery` now takes a query job and page tokens; `processSingle` removed in favor of a paginated `processCollection` loop.

Chores
- Bumped package and source version metadata and a dependency.