Add parallelism to parseBytes and transform #400

Merged
Benjamin BENOIST (benjben) merged 1 commit into v2 from parallelism
Feb 14, 2025
Conversation

@benjben

For pods with more than one CPU, tests have shown that throughput improves when we add some parallelism to the CPU-intensive steps.

More precisely, the tests showed the best throughput when parallelism is added to the transform step only.

I wonder whether we should remove the parallelism on parseBytes. Given that it is already implemented, I'd be in favor of keeping it with a low default that results in no parallelism, so that we can change it one day if we want to.

Also, we've observed better throughput with writeBatchConcurrency = 2, so I updated the default.
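
To make the idea concrete, here is a minimal Python sketch of per-step parallelism sized from the CPU count. It is an illustration under stated assumptions, not the loader's code: `workers_for`, `run_step`, `process`, the `parse_factor` and `transform_factor` parameters, and the two stand-in step functions are all hypothetical names. The defaults mirror the description above: parseBytes stays sequential and only transform fans out. A thread pool stands in for whatever mechanism the loader actually uses, and since CPU-bound Python threads do not truly run in parallel, the sketch shows the structure only.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def workers_for(factor: float, cpus: int) -> int:
    """Scale a worker count with the CPU count, never going below 1."""
    return max(1, math.ceil(factor * cpus))


def run_step(step, items, workers: int) -> list:
    """Apply `step` to every item with at most `workers` running at once.

    With workers == 1 the step runs sequentially on the calling thread,
    which is the "no parallelism" case. Input order is preserved.
    """
    if workers <= 1:
        return [step(item) for item in items]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(step, items))


# Hypothetical stand-ins for the two CPU-intensive steps.
def parse_bytes(raw: bytes) -> str:
    return raw.decode("utf-8")


def transform(event: str) -> str:
    return event.upper()


def process(batch, cpus: int, parse_factor: float = 0.0,
            transform_factor: float = 1.0) -> list:
    # Assumed defaults: a factor of 0.0 yields 1 worker (sequential parsing),
    # while a factor of 1.0 gives the transform step one worker per CPU.
    parsed = run_step(parse_bytes, batch, workers_for(parse_factor, cpus))
    return run_step(transform, parsed, workers_for(transform_factor, cpus))
```

Keeping a separate factor per step means parseBytes can be given parallelism later by changing one default, without touching the transform step.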

@benjben Benjamin BENOIST (benjben) merged commit 13292fc into v2 Feb 14, 2025
2 checks passed
@benjben Benjamin BENOIST (benjben) deleted the parallelism branch February 14, 2025 09:22
Benjamin BENOIST (benjben) added a commit that referenced this pull request Feb 14, 2025
Benjamin BENOIST (benjben) added a commit that referenced this pull request Feb 17, 2025
Benjamin BENOIST (benjben) pushed a commit that referenced this pull request Feb 17, 2025
- Update license to SLULA 1.1
- Cluster by event_name when creating new table (#402)
- Add parallelism to parseBytes and transform (#400)
- Decrease default batching.maxBytes to 10 MB (#398)
- Fix and improve ProcessingSpec for legacy column mode (#396)
- Add legacyColumnMode configuration (#394)
- Add e2e_latency_millis metric (#391)
- Fix startup on missing existing table (#384)
- Add option to exit on missing Iglu schemas (#382)
- Refactor health monitoring (#381)
- Feature flag to support the legacy column style -- bug fixes (#379 #380)
- Require alter table when schema is evolved for contexts
- Allow for delay in Writer discovering new columns
- Stay healthy if BigQuery table exceeds column limit (#372)
- Recover from server-side schema mismatch exceptions
- Improve exception handling immediately after altering the table
- Manage Writer resource to be consistent with Snowflake Loader