Improve batch logic #145
Conversation
Pull request overview
This PR refactors the batch ingestion logic to improve performance and memory efficiency when inserting data into Weaviate. The implementation introduces a producer-consumer pattern with two distinct modes: fixed-size batching (default) and dynamic batching (opt-in).
Key Changes:
- Introduces fixed-size batch ingestion as the default, with configurable batch size and concurrent requests to prevent cluster overload
- Adds dynamic batch mode as an opt-in feature for high-throughput scenarios using multiprocessing
- Implements a memory-efficient producer-consumer pattern with queue-based coordination between data generation and ingestion
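The producer-consumer pattern described above can be sketched in plain Python. This is a minimal illustration, not the PR's actual `__producer_consumer_ingest` implementation: function and variable names here are hypothetical, and the `batches.append(...)` call stands in for a real Weaviate batch insert. The key idea is a bounded queue, so data is generated only as fast as the consumers ingest it:

```python
import queue
import threading

SENTINEL = None  # marks end of stream for consumers

def producer(q, total, num_consumers):
    """Lazily generate objects and push them onto the bounded queue."""
    for i in range(total):
        q.put({"id": i})
    for _ in range(num_consumers):  # one sentinel per consumer
        q.put(SENTINEL)

def consumer(q, batch_size, batches, lock):
    """Drain the queue, flushing fixed-size batches."""
    batch = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        batch.append(item)
        if len(batch) == batch_size:
            with lock:
                batches.append(list(batch))  # stand-in for a Weaviate batch insert
            batch.clear()
    if batch:  # flush the final partial batch
        with lock:
            batches.append(list(batch))

def ingest(total=10, batch_size=4, num_consumers=2):
    q = queue.Queue(maxsize=batch_size * num_consumers)  # bounds memory use
    batches, lock = [], threading.Lock()
    workers = [threading.Thread(target=consumer, args=(q, batch_size, batches, lock))
               for _ in range(num_consumers)]
    for w in workers:
        w.start()
    producer(q, total, num_consumers)
    for w in workers:
        w.join()
    return batches
```

Because the queue is bounded, peak memory is proportional to `batch_size * num_consumers` rather than the full dataset size.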
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| weaviate_cli/managers/data_manager.py | Core refactoring implementing new producer-consumer ingestion pattern with _ErrorTracker class, new __producer_consumer_ingest method supporting both fixed-size and dynamic batch modes, and simplified update logic |
| weaviate_cli/managers/config_manager.py | Adds support for slow connection environments via SLOW_CONNECTION environment variable that doubles client timeouts |
| weaviate_cli/defaults.py | Introduces MAX_WORKERS constant and new CreateDataDefaults fields for batch configuration (batch_size, dynamic_batch) |
| weaviate_cli/commands/create.py | Adds CLI options for --dynamic_batch, --batch_size, and --concurrent_requests with validation logic |
| .github/workflows/release.yaml | Updates GitHub Actions artifact upload/download actions to newer versions |
| .github/workflows/main.yaml | Updates GitHub Actions upload-artifact to v5 |
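The `SLOW_CONNECTION` handling in `config_manager.py` might look roughly like the sketch below. The default timeout values and function name are illustrative assumptions, not the PR's actual code; only the behavior (an environment variable that doubles client timeouts) comes from the change description:

```python
import os

# Illustrative defaults; the real values live in the CLI's config manager.
INIT_TIMEOUT = 2    # seconds
QUERY_TIMEOUT = 60  # seconds

def effective_timeouts(env=os.environ):
    """Double client timeouts when SLOW_CONNECTION is set in the environment."""
    factor = 2 if env.get("SLOW_CONNECTION") else 1
    return INIT_TIMEOUT * factor, QUERY_TIMEOUT * factor
```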
Instead of generating all the data first and then ingesting it, we use fixed-size batching and let each worker generate its corresponding batch, which makes ingestion more memory efficient.
Streaming is combined with multiprocessing, while the fixed-size implementation stays the same to avoid overloading the server.
This commit re-attempts the connection in case the link is slow and the gRPC checks take longer than the timeout; each re-attempt uses a larger timeout. It also fixes a logic issue when querying data for multi-tenant collections with auto-tenant creation.
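The retry-with-growing-timeout behavior can be sketched as follows. The function name, parameters, and the use of `TimeoutError` are assumptions for illustration; the actual retry lives in the CLI's connection code:

```python
def connect_with_retry(connect, base_timeout=5.0, attempts=3, factor=2.0):
    """Call `connect(timeout=...)`, retrying with a larger timeout each time.

    Slow links get progressively more time to complete the gRPC checks
    before the connection is declared failed.
    """
    timeout = base_timeout
    last_exc = None
    for _ in range(attempts):
        try:
            return connect(timeout=timeout)
        except TimeoutError as exc:
            last_exc = exc
            timeout *= factor  # next attempt gets a larger timeout
    raise last_exc
```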
Improve the batch logic to use fixed-size batching by default (dynamic batching is still supported via an argument when creating data), which prevents overloading the cluster. The ingestion logic was also improved so that large datasets can be ingested with lower memory consumption, since data is generated as it is consumed (generator → queue → consumers).