feat: Configurable batch size #1941
Conversation
Please make sure all the checkboxes are checked:
Walkthrough: makes the chunk batch size configurable.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cognee/api/v1/cognify/routers/get_cognify_router.py (1)
76-81: Update endpoint documentation with new parameter.

The endpoint docstring comprehensively documents other request parameters but is missing documentation for the new `chunks_per_batch` field. API users may not discover this configuration option.

🔎 Proposed documentation addition
```diff
 ## Request Parameters

 - **datasets** (Optional[List[str]]): List of dataset names to process. Dataset names are resolved to datasets owned by the authenticated user.
 - **dataset_ids** (Optional[List[UUID]]): List of existing dataset UUIDs to process. UUIDs allow processing of datasets not owned by the user (if permitted).
 - **run_in_background** (Optional[bool]): Whether to execute processing asynchronously. Defaults to False (blocking).
 - **custom_prompt** (Optional[str]): Custom prompt for entity extraction and graph generation. If provided, this prompt will be used instead of the default prompts for knowledge graph extraction.
 - **ontology_key** (Optional[List[str]]): Reference to one or more previously uploaded ontology files to use for knowledge graph construction.
+- **chunks_per_batch** (Optional[int]): Number of chunks to process per task batch. Higher values (e.g., 50) can speed up processing for large documents but may cause max_token errors if too large. Defaults to 100 for standard tasks, 10 for temporal tasks.
```

cognee/api/v1/cognify/cognify.py (1)
47-47: Fix type annotation for optional parameter.

The parameter `chunks_per_batch: int = None` should use `Optional[int]` to properly indicate it accepts None. This is particularly important for public API functions per the coding guidelines.

🔎 Proposed fix

```diff
- chunks_per_batch: int = None,
+ chunks_per_batch: Optional[int] = None,
```
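As a quick illustration of the annotation issue (a minimal sketch, not the project's actual function — the real `cognify()` is async and takes many more parameters), a strict type checker rejects `int = None` because `None` is not an `int`, while `Optional[int]` makes the None default explicit:

```python
from typing import Optional


def cognify_stub(chunks_per_batch: Optional[int] = None) -> int:
    """Hypothetical stub mirroring the signature under review."""
    # Fall back to the documented default of 100 when nothing is passed.
    return chunks_per_batch if chunks_per_batch is not None else 100
```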
🧹 Nitpick comments (1)
cognee/cli/commands/cognify_command.py (1)
16-33: Consider mentioning batch size configuration in command description.

The command description thoroughly explains the processing pipeline but doesn't mention the new `--chunks-per-batch` option. While users can discover it via `--help`, adding a brief mention would improve discoverability.

Example addition

```diff
 Processing Pipeline:
 1. **Document Classification**: Identifies document types and structures
 2. **Permission Validation**: Ensures user has processing rights
 3. **Text Chunking**: Breaks content into semantically meaningful segments
 4. **Entity Extraction**: Identifies key concepts, people, places, organizations
 5. **Relationship Detection**: Discovers connections between entities
 6. **Graph Construction**: Builds semantic knowledge graph with embeddings
 7. **Content Summarization**: Creates hierarchical summaries for navigation
+
+Performance tuning is available via --chunks-per-batch to control batch processing size.
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
- cognee/api/v1/cognify/cognify.py
- cognee/api/v1/cognify/routers/get_cognify_router.py
- cognee/cli/commands/cognify_command.py
- cognee/modules/cognify/config.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
- cognee/api/v1/cognify/routers/get_cognify_router.py
- cognee/cli/commands/cognify_command.py
- cognee/modules/cognify/config.py
- cognee/api/v1/cognify/cognify.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
- cognee/api/v1/cognify/routers/get_cognify_router.py
- cognee/cli/commands/cognify_command.py
- cognee/modules/cognify/config.py
- cognee/api/v1/cognify/cognify.py
cognee/api/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Public APIs should be type-annotated in Python where practical
Files:
- cognee/api/v1/cognify/routers/get_cognify_router.py
- cognee/api/v1/cognify/cognify.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
- cognee/api/v1/cognify/routers/get_cognify_router.py
- cognee/cli/commands/cognify_command.py
- cognee/modules/cognify/config.py
- cognee/api/v1/cognify/cognify.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/modules/cognify/config.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
📚 Learning: 2024-11-13T14:55:05.912Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
Applied to files:
- cognee/api/v1/cognify/routers/get_cognify_router.py
- cognee/cli/commands/cognify_command.py
- cognee/api/v1/cognify/cognify.py
📚 Learning: 2024-10-07T11:20:44.876Z
Learnt from: borisarzentar
Repo: topoteretes/cognee PR: 144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-07T11:20:44.876Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
Applied to files:
cognee/api/v1/cognify/cognify.py
🧬 Code graph analysis (2)
cognee/api/v1/cognify/routers/get_cognify_router.py (1)
new-examples/demos/custom_graph_model_entity_schema_definition.py (1)
Field(35-38)
cognee/api/v1/cognify/cognify.py (2)
cognee/api/v1/cognify/routers/get_cognify_router.py (1)
cognify (60-162)

cognee/modules/cognify/config.py (1)

get_cognify_config (25-26)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test dataset database deletion in Cognee
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Test Cognify - Edge Centered Payload
- GitHub Check: End-to-End Tests / Test adding of label for data in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
- GitHub Check: CLI Tests / CLI Integration Tests
🔇 Additional comments (4)
cognee/api/v1/cognify/routers/get_cognify_router.py (1)
49-53: LGTM! Well-documented field.

The `chunks_per_batch` field is properly typed, has a clear description, and includes helpful examples. Good attention to API documentation.

cognee/cli/commands/cognify_command.py (1)
65-69: LGTM! CLI option properly configured.

The `--chunks-per-batch` argument is well-defined with helpful guidance for users. The practical example ("try 50 for large single documents") aligns with the PR objective of improving processing speed.

cognee/api/v1/cognify/cognify.py (2)
278-281: LGTM! Config-driven default properly implemented.

The logic correctly derives `chunks_per_batch` from configuration when not explicitly provided, with a sensible fallback to 100. This aligns with the PR objective of making batch size configurable.
278-281: Verify different default values are intentional.The fallback values differ between task types: 100 for default tasks (line 280) and 10 for temporal tasks (line 338). While this appears to preserve existing behavior, please confirm this difference is intentional and appropriate for the different processing characteristics.
Consider adding a comment explaining why temporal tasks use a different default:

```python
# Temporal tasks use smaller batch size (10) to handle time-series processing efficiently
chunks_per_batch = configured if configured is not None else 10
```

Also applies to: 334-338
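The precedence the reviewer describes (explicit argument, then configured value, then a task-type fallback) can be sketched as follows. The constants and function name are illustrative, taken from the defaults quoted in the review, not the project's actual code:

```python
from typing import Optional

DEFAULT_TASKS_FALLBACK = 100   # fallback for standard tasks (per the review)
TEMPORAL_TASKS_FALLBACK = 10   # fallback for temporal tasks (per the review)


def resolve_chunks_per_batch(
    explicit: Optional[int], configured: Optional[int], temporal: bool = False
) -> int:
    """Resolve the effective batch size: explicit arg > config value > fallback."""
    if explicit is not None:
        return explicit
    if configured is not None:
        return configured
    return TEMPORAL_TASKS_FALLBACK if temporal else DEFAULT_TASKS_FALLBACK
```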
```diff
  config: Config = None,
  custom_prompt: Optional[str] = None,
- chunks_per_batch: int = 100,
+ chunks_per_batch: int = None,
```
Fix type annotation for optional parameter.
The parameter chunks_per_batch: int = None should use Optional[int] for proper type checking, consistent with the Optional import at line 3 and the coding guidelines requiring type annotations for public APIs.
🔎 Proposed fix
```diff
- chunks_per_batch: int = None,
+ chunks_per_batch: Optional[int] = None,
```

🤖 Prompt for AI Agents
In cognee/api/v1/cognify/cognify.py around line 255, the optional parameter is
annotated as "chunks_per_batch: int = None" but should use Optional[int]; update
the signature to "chunks_per_batch: Optional[int] = None" (the Optional import
exists at line 3) so static type checkers and the public API follow the
project's typing guidelines.
```diff
  async def get_temporal_tasks(
-     user: User = None, chunker=TextChunker, chunk_size: int = None, chunks_per_batch: int = 10
+     user: User = None, chunker=TextChunker, chunk_size: int = None, chunks_per_batch: int = None
```
Fix type annotation for optional parameter.
The parameter chunks_per_batch: int = None should use Optional[int] for consistency with other functions in this file and proper type checking.
🔎 Proposed fix
```diff
- user: User = None, chunker=TextChunker, chunk_size: int = None, chunks_per_batch: int = None
+ user: User = None, chunker=TextChunker, chunk_size: int = None, chunks_per_batch: Optional[int] = None
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
user: User = None, chunker=TextChunker, chunk_size: int = None, chunks_per_batch: Optional[int] = None
```
🤖 Prompt for AI Agents
In cognee/api/v1/cognify/cognify.py around line 313, the parameter annotation
chunks_per_batch: int = None is incorrect for an optional parameter; change it
to chunks_per_batch: Optional[int] and ensure typing.Optional is imported (add
"from typing import Optional" if not already present) so the signature matches
other functions and type checkers accept None as a valid value.
```diff
  if chunks_per_batch is None:
-     chunks_per_batch = 10
+     from cognee.modules.cognify.config import get_cognify_config
+
+     configured = get_cognify_config().chunks_per_batch
+     chunks_per_batch = configured if configured is not None else 10
```
🛠️ Refactor suggestion | 🟠 Major
Remove redundant import.
The lazy import of get_cognify_config at line 335 is unnecessary since it's already imported at the module level (line 6).
🔎 Proposed fix
```diff
 if chunks_per_batch is None:
-    from cognee.modules.cognify.config import get_cognify_config
-
     configured = get_cognify_config().chunks_per_batch
     chunks_per_batch = configured if configured is not None else 10
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if chunks_per_batch is None:
    configured = get_cognify_config().chunks_per_batch
    chunks_per_batch = configured if configured is not None else 10
```
🤖 Prompt for AI Agents
In cognee/api/v1/cognify/cognify.py around lines 334 to 338, remove the
redundant lazy import of get_cognify_config inside the if block (the function is
already imported at module level on line 6); delete the "from
cognee.modules.cognify.config import get_cognify_config" line and leave the code
to call get_cognify_config() directly so chunks_per_batch is assigned from the
already-imported function.
```diff
  classification_model: object = DefaultContentPrediction
  summarization_model: object = SummarizedContent
  triplet_embedding: bool = False
+ chunks_per_batch: Optional[int] = None
```
🛠️ Refactor suggestion | 🟠 Major
Add field documentation.
The new chunks_per_batch field lacks a docstring or Field description explaining its purpose, valid range, and environment variable name. Based on the coding guidelines, undocumented fields are considered incomplete.
🔎 Proposed documentation addition
+ """
+ chunks_per_batch: Number of chunks to process per task batch in Cognify.
+ Can be configured via CHUNKS_PER_BATCH environment variable.
+ Higher values (e.g., 50) can improve processing speed for large documents,
+ but may cause max_token errors if set too high. Defaults to 100 for default tasks
+ and 10 for temporal tasks when not specified.
+ """
chunks_per_batch: Optional[int] = NoneAlternatively, use Pydantic's Field with description:
```diff
-    chunks_per_batch: Optional[int] = None
+    chunks_per_batch: Optional[int] = Field(
+        default=None,
+        description="Number of chunks to process per task batch (configurable via CHUNKS_PER_BATCH env var)"
+    )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
chunks_per_batch: Optional[int] = Field(
    default=None,
    description="Number of chunks to process per task batch in Cognify. Can be configured via CHUNKS_PER_BATCH environment variable. Higher values can improve processing speed but may cause max_token errors if set too high."
)
```
🤖 Prompt for AI Agents
In cognee/modules/cognify/config.py around line 12, the new chunks_per_batch
field is missing documentation; add a concise Field description (or a docstring
above the attribute) that states its purpose (how it controls batching of
chunks), the valid range (e.g., positive integer limits or None meaning no
batching), and the corresponding environment variable name if applicable; use
Pydantic's Field(description="...") to include this metadata and update any
README or env docs to match.
Fix AttributeError when args.chunks_per_batch is not present in the argparse.Namespace object. Use getattr() with default value of None to safely access the optional chunks_per_batch parameter. This resolves test failures in test_cli_edge_cases.py where Namespace objects were created without the chunks_per_batch attribute.

Changes:
- Use getattr(args, 'chunks_per_batch', None) instead of direct access
- Update test assertion to expect chunks_per_batch=None parameter

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
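The getattr() pattern from the commit above can be demonstrated in isolation (a sketch with a stripped-down Namespace, not the actual CLI wiring):

```python
import argparse

# Namespace built the way the older tests build it, without the new attribute.
legacy_args = argparse.Namespace(datasets=None, chunk_size=None)

# Direct attribute access (legacy_args.chunks_per_batch) would raise
# AttributeError; getattr degrades gracefully to the None default.
chunks_per_batch = getattr(legacy_args, "chunks_per_batch", None)
```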
…meter

Update all cognify test assertions to expect the chunks_per_batch=None parameter that was added to the CLI command. This fixes three failing tests:
- test_execute_basic_cognify
- test_cognify_invalid_chunk_size
- test_cognify_nonexistent_ontology_file

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <[email protected]>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py (1)
259-265: Update all test assertions to include chunks_per_batch parameter.

Only `test_cognify_empty_datasets_list` was updated to expect `chunks_per_batch=None` (line 376), but the other test methods in this class (`test_cognify_invalid_chunk_size` at lines 259-265 and `test_cognify_nonexistent_ontology_file` at lines 292-298) still use the old assertion without `chunks_per_batch`. Since `assert_awaited_once_with()` performs exact parameter matching, these tests will fail if the actual `cognify()` call now includes the `chunks_per_batch` parameter.

🔧 Proposed fix to add chunks_per_batch to all test assertions
Fix for test_cognify_invalid_chunk_size:
```diff
 mock_cognee.cognify.assert_awaited_once_with(
     datasets=None,
     chunk_size=-100,
     ontology_file_path=None,
     chunker=TextChunker,
     run_in_background=False,
+    chunks_per_batch=None,
 )
```

Fix for test_cognify_nonexistent_ontology_file:

```diff
 mock_cognee.cognify.assert_awaited_once_with(
     datasets=None,
     chunk_size=None,
     ontology_file_path="/nonexistent/path/ontology.owl",
     chunker=TextChunker,
     run_in_background=False,
+    chunks_per_batch=None,
 )
```

Also applies to: 292-298, 370-377
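The exact-matching behavior of `assert_awaited_once_with()` that makes these assertion updates necessary can be reproduced with a bare AsyncMock (a standalone sketch, independent of the cognee tests):

```python
import asyncio
from unittest.mock import AsyncMock

mock_cognify = AsyncMock()
asyncio.run(mock_cognify(datasets=None, chunks_per_batch=None))

# Passes only because every keyword matches exactly; omitting
# chunks_per_batch in this assertion would raise AssertionError.
mock_cognify.assert_awaited_once_with(datasets=None, chunks_per_batch=None)
```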
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- CLAUDE.md
- cognee/cli/commands/cognify_command.py
- cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
✅ Files skipped from review due to trivial changes (1)
- CLAUDE.md
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/cli/commands/cognify_command.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
🧠 Learnings (2)
📓 Common learnings
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
📚 Learning: 2024-11-13T14:55:05.912Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
Applied to files:
cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: CLI Tests / CLI Integration Tests
Actionable comments posted: 0
🧹 Nitpick comments (2)
cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py (1)
265-265: Consider adding test coverage for the new `--chunks-per-batch` CLI option.

The tests correctly verify that `chunks_per_batch=None` is passed when the CLI argument is not provided. However, there are no tests verifying the positive case where a user actually provides `--chunks-per-batch` with various values.

Consider adding test cases such as:
- Valid positive integer value (e.g., `--chunks-per-batch 50`)
- Edge cases (e.g., `--chunks-per-batch 1`, very large values)
- Invalid values if validation exists (e.g., negative numbers, zero, non-integers)

This would ensure the CLI option is properly parsed, validated, and threaded through to the cognify function as intended by the PR objectives.

Also applies to: 299-299, 378-378
cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py (1)
241-241: Add test coverage for the `--chunks-per-batch` CLI option.

While the test correctly verifies the default behavior (`chunks_per_batch=None`), there's no test case in this file that verifies what happens when a user actually provides the `--chunks-per-batch` argument.

Consider adding a test method such as:

```python
@patch("cognee.cli.commands.cognify_command.asyncio.run", side_effect=_mock_run)
def test_execute_with_chunks_per_batch(self, mock_asyncio_run):
    """Test execute with chunks_per_batch specified"""
    mock_cognee = MagicMock()
    mock_cognee.cognify = AsyncMock(return_value="success")
    with patch.dict(sys.modules, {"cognee": mock_cognee}):
        command = CognifyCommand()
        args = argparse.Namespace(
            datasets=None,
            chunk_size=None,
            ontology_file=None,
            chunker="TextChunker",
            background=False,
            verbose=False,
            chunks_per_batch=50,
        )
        command.execute(args)

        from cognee.modules.chunking.TextChunker import TextChunker

        mock_cognee.cognify.assert_awaited_once_with(
            datasets=None,
            chunk_size=None,
            ontology_file_path=None,
            chunker=TextChunker,
            run_in_background=False,
            chunks_per_batch=50,
        )
```

This would verify that the new configurable parameter is properly threaded through the command execution path, which is a key objective of this PR.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py
- cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
- cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
- cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
- cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
- cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
- cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
- cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py
cognee/tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
cognee/tests/**/*.py: Place Python tests under cognee/tests/ organized by type (unit, integration, cli_tests)
Name Python test files test_*.py and use pytest.mark.asyncio for async tests
Files:
- cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
- cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
Learnt from: borisarzentar
Repo: topoteretes/cognee PR: 144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-07T11:20:44.876Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
Learnt from: borisarzentar
Repo: topoteretes/cognee PR: 144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-16T07:06:28.669Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
📚 Learning: 2024-11-13T14:55:05.912Z
Learnt from: 0xideas
Repo: topoteretes/cognee PR: 205
File: cognee/tests/unit/processing/chunks/chunk_by_paragraph_test.py:7-7
Timestamp: 2024-11-13T14:55:05.912Z
Learning: When changes are made to the chunking implementation in `cognee/tasks/chunks`, the ground truth values in the corresponding tests in `cognee/tests/unit/processing/chunks` need to be updated accordingly.
Applied to files:
- cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
- cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py
📚 Learning: 2024-10-16T07:06:28.669Z
Learnt from: borisarzentar
Repo: topoteretes/cognee PR: 144
File: cognee/tasks/chunking/query_chunks.py:1-17
Timestamp: 2024-10-16T07:06:28.669Z
Learning: The `query_chunks` function in `cognee/tasks/chunking/query_chunks.py` is used within the `search` function in `cognee/api/v1/search/search_v2.py`.
Applied to files:
- cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py
- cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (23)
- GitHub Check: End-to-End Tests / Test Feedback Enrichment
- GitHub Check: End-to-End Tests / Test graph edge ingestion
- GitHub Check: End-to-End Tests / Test multi tenancy with different situations in Cognee
- GitHub Check: End-to-End Tests / Test permissions with different situations in Cognee
- GitHub Check: End-to-End Tests / Test dataset database deletion in Cognee
- GitHub Check: End-to-End Tests / Conversation sessions test (Redis)
- GitHub Check: End-to-End Tests / Conversation sessions test (FS)
- GitHub Check: End-to-End Tests / Test Entity Extraction
- GitHub Check: End-to-End Tests / Test Cognify - Edge Centered Payload
- GitHub Check: End-to-End Tests / Run Telemetry Pipeline Test
- GitHub Check: End-to-End Tests / Deduplication Test
- GitHub Check: End-to-End Tests / Test dataset database handlers in Cognee
- GitHub Check: End-to-End Tests / Concurrent Subprocess access test
- GitHub Check: End-to-End Tests / Test adding of label for data in Cognee
- GitHub Check: End-to-End Tests / Test using different async databases in parallel in Cognee
- GitHub Check: End-to-End Tests / S3 Bucket Test
- GitHub Check: Basic Tests / Run Unit Tests
- GitHub Check: End-to-End Tests / Server Start Test
- GitHub Check: Basic Tests / Run Simple Examples BAML
- GitHub Check: Basic Tests / Run Integration Tests
- GitHub Check: Basic Tests / Run Simple Examples
- GitHub Check: CLI Tests / CLI Integration Tests
- GitHub Check: CLI Tests / CLI Functionality Tests
Description
Acceptance Criteria
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit
New Features
Style / Documentation
Tests