Improve batch logic #145
Conversation
Pull request overview
This PR refactors the batch ingestion logic to improve performance and memory efficiency when inserting data into Weaviate. The implementation introduces a producer-consumer pattern with two distinct modes: fixed-size batching (default) and dynamic batching (opt-in).
Key Changes:
- Introduces fixed-size batch ingestion as the default, with configurable batch size and concurrent requests to prevent cluster overload
- Adds dynamic batch mode as an opt-in feature for high-throughput scenarios using multiprocessing
- Implements a memory-efficient producer-consumer pattern with queue-based coordination between data generation and ingestion
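The producer-consumer pattern described above can be sketched in plain Python. This is a minimal illustration, not the PR's actual `__producer_consumer_ingest` implementation: function and variable names here are hypothetical, and the `batches.append(...)` call stands in for a real Weaviate batch insert. The key idea is a bounded queue, so data is generated only as fast as the consumers ingest it:

```python
import queue
import threading

SENTINEL = None  # marks end of stream for consumers

def producer(q, total, num_consumers):
    """Lazily generate objects and push them onto the bounded queue."""
    for i in range(total):
        q.put({"id": i})
    for _ in range(num_consumers):  # one sentinel per consumer
        q.put(SENTINEL)

def consumer(q, batch_size, batches, lock):
    """Drain the queue, flushing fixed-size batches."""
    batch = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        batch.append(item)
        if len(batch) == batch_size:
            with lock:
                batches.append(list(batch))  # stand-in for a Weaviate batch insert
            batch.clear()
    if batch:  # flush the final partial batch
        with lock:
            batches.append(list(batch))

def ingest(total=10, batch_size=4, num_consumers=2):
    q = queue.Queue(maxsize=batch_size * num_consumers)  # bounds memory use
    batches, lock = [], threading.Lock()
    workers = [threading.Thread(target=consumer, args=(q, batch_size, batches, lock))
               for _ in range(num_consumers)]
    for w in workers:
        w.start()
    producer(q, total, num_consumers)
    for w in workers:
        w.join()
    return batches
```

Because the queue is bounded, peak memory is proportional to `batch_size * num_consumers` rather than the full dataset size.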
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| weaviate_cli/managers/data_manager.py | Core refactoring implementing new producer-consumer ingestion pattern with _ErrorTracker class, new __producer_consumer_ingest method supporting both fixed-size and dynamic batch modes, and simplified update logic |
| weaviate_cli/managers/config_manager.py | Adds support for slow connection environments via SLOW_CONNECTION environment variable that doubles client timeouts |
| weaviate_cli/defaults.py | Introduces MAX_WORKERS constant and new CreateDataDefaults fields for batch configuration (batch_size, dynamic_batch) |
| weaviate_cli/commands/create.py | Adds CLI options for --dynamic_batch, --batch_size, and --concurrent_requests with validation logic |
| .github/workflows/release.yaml | Updates GitHub Actions artifact upload/download actions to newer versions |
| .github/workflows/main.yaml | Updates GitHub Actions upload-artifact to v5 |
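The `SLOW_CONNECTION` handling in `config_manager.py` might look roughly like the sketch below. The default timeout values and function name are illustrative assumptions, not the PR's actual code; only the behavior (an environment variable that doubles client timeouts) comes from the change description:

```python
import os

# Illustrative defaults; the real values live in the CLI's config manager.
INIT_TIMEOUT = 2    # seconds
QUERY_TIMEOUT = 60  # seconds

def effective_timeouts(env=os.environ):
    """Double client timeouts when SLOW_CONNECTION is set in the environment."""
    factor = 2 if env.get("SLOW_CONNECTION") else 1
    return INIT_TIMEOUT * factor, QUERY_TIMEOUT * factor
```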
Instead of generating all the data first and then ingesting it, we use fixed-size batching and let each worker generate its corresponding batch, which makes ingestion more memory efficient.
Streaming is combined with multiprocessing, while the fixed-size implementation stays the same to avoid overloading the server.
This commit re-attempts the connection in case the link is slow and the gRPC checks take longer than the timeout; each re-attempt uses a larger timeout. It also fixes a logic issue when querying data for multi-tenant collections with auto-tenant creation.
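The retry-with-growing-timeout behavior can be sketched as follows. The function name, parameters, and the use of `TimeoutError` are assumptions for illustration; the actual retry lives in the CLI's connection code:

```python
def connect_with_retry(connect, base_timeout=5.0, attempts=3, factor=2.0):
    """Call `connect(timeout=...)`, retrying with a larger timeout each time.

    Slow links get progressively more time to complete the gRPC checks
    before the connection is declared failed.
    """
    timeout = base_timeout
    last_exc = None
    for _ in range(attempts):
        try:
            return connect(timeout=timeout)
        except TimeoutError as exc:
            last_exc = exc
            timeout *= factor  # next attempt gets a larger timeout
    raise last_exc
```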
Improve the batch logic to use fixed-size batching by default (dynamic batching is still supported via an argument when creating data), which prevents overloading the cluster. The ingestion logic was also improved so that large datasets can be ingested with lower memory consumption, since data is generated as it is consumed (generator → queue → consumers).