
Commit 1d674d4

feat: Configurable batch size (#1941)
## Summary by CodeRabbit

* **New Features**
  * Added a configurable chunks-per-batch setting to control per-batch processing size via CLI flag, API payload, and configuration; defaults are now driven by config with an automatic fallback.
* **Style / Documentation**
  * Updated contribution/style guidelines (formatting, line length, string-quote rule, pre-commit note).
* **Tests**
  * Updated CLI tests to verify propagation of the new chunks-per-batch parameter.

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
2 parents 2c29868 + b7d5bf5 commit 1d674d4

File tree

7 files changed: +35 −10 lines
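Before the per-file diffs, here is how the new knob is exercised end to end. A minimal sketch from Python — hedged: `cognee.add`/`cognee.cognify` accepting these arguments is inferred from the CLI tests in this commit rather than shown directly in the diff, and the document path is a placeholder:

```python
import asyncio

import cognee


async def main():
    # Ingest a document, then build the knowledge graph while
    # processing 50 chunks per task batch instead of relying on
    # the configured or hard-coded default.
    await cognee.add("docs/large_report.txt")  # placeholder path
    await cognee.cognify(chunks_per_batch=50)


asyncio.run(main())
```

The same override is available as `--chunks-per-batch 50` on the cognify CLI command and as `"chunks_per_batch": 50` in the API payload, as the diffs below show.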

CLAUDE.md

Lines changed: 6 additions & 4 deletions
@@ -427,10 +427,12 @@ git checkout -b feature/your-feature-name
 
 ## Code Style
 
-- Ruff for linting and formatting (configured in `pyproject.toml`)
-- Line length: 100 characters
-- Pre-commit hooks run ruff automatically
-- Type hints encouraged (mypy checks enabled)
+- **Formatter**: Ruff (configured in `pyproject.toml`)
+- **Line length**: 100 characters
+- **String quotes**: Use double quotes `"` not single quotes `'` (enforced by ruff-format)
+- **Pre-commit hooks**: Run ruff linting and formatting automatically
+- **Type hints**: Encouraged (mypy checks enabled)
+- **Important**: Always run `pre-commit run --all-files` before committing to catch formatting issues
 
 ## Testing Strategy
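The new string-quote bullet matches ruff-format's default `quote-style = "double"`; in practice it rewrites single-quoted literals, for example:

```python
package = 'cognee'  # before: rewritten by ruff-format
package = "cognee"  # after: double quotes, per the updated guideline
```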

cognee/api/v1/cognify/cognify.py

Lines changed: 11 additions & 6 deletions
@@ -252,7 +252,7 @@ async def get_default_tasks(  # TODO: Find out a better way to do this (Boris's
     chunk_size: int = None,
     config: Config = None,
     custom_prompt: Optional[str] = None,
-    chunks_per_batch: int = 100,
+    chunks_per_batch: int = None,
     **kwargs,
 ) -> list[Task]:
     if config is None:
@@ -272,12 +272,14 @@ async def get_default_tasks(  # TODO: Find out a better way to do this (Boris's
         "ontology_config": {"ontology_resolver": get_default_ontology_resolver()}
     }
 
-    if chunks_per_batch is None:
-        chunks_per_batch = 100
-
     cognify_config = get_cognify_config()
     embed_triplets = cognify_config.triplet_embedding
 
+    if chunks_per_batch is None:
+        chunks_per_batch = (
+            cognify_config.chunks_per_batch if cognify_config.chunks_per_batch is not None else 100
+        )
+
     default_tasks = [
         Task(classify_documents),
         Task(
@@ -308,7 +310,7 @@ async def get_default_tasks(  # TODO: Find out a better way to do this (Boris's
 
 
 async def get_temporal_tasks(
-    user: User = None, chunker=TextChunker, chunk_size: int = None, chunks_per_batch: int = 10
+    user: User = None, chunker=TextChunker, chunk_size: int = None, chunks_per_batch: int = None
 ) -> list[Task]:
     """
     Builds and returns a list of temporal processing tasks to be executed in sequence.
@@ -330,7 +332,10 @@ async def get_temporal_tasks(
         list[Task]: A list of Task objects representing the temporal processing pipeline.
     """
     if chunks_per_batch is None:
-        chunks_per_batch = 10
+        from cognee.modules.cognify.config import get_cognify_config
+
+        configured = get_cognify_config().chunks_per_batch
+        chunks_per_batch = configured if configured is not None else 10
 
     temporal_tasks = [
         Task(classify_documents),
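Both builders now resolve the batch size with the same precedence: an explicit `chunks_per_batch` argument wins, then `CognifyConfig.chunks_per_batch`, then a hard-coded fallback (100 for the default pipeline, 10 for the temporal one). A standalone sketch of that chain — the helper name is hypothetical, not part of the commit:

```python
from typing import Optional


def resolve_chunks_per_batch(
    explicit: Optional[int], configured: Optional[int], fallback: int
) -> int:
    """Mirrors the precedence added in get_default_tasks / get_temporal_tasks."""
    if explicit is not None:
        return explicit  # caller-supplied argument wins
    if configured is not None:
        return configured  # CognifyConfig.chunks_per_batch (e.g. loaded from .env)
    return fallback  # 100 for default tasks, 10 for temporal tasks


assert resolve_chunks_per_batch(None, None, 100) == 100  # no overrides
assert resolve_chunks_per_batch(None, 20, 100) == 20  # config beats fallback
assert resolve_chunks_per_batch(50, 20, 100) == 50  # explicit beats config
```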

cognee/api/v1/cognify/routers/get_cognify_router.py

Lines changed: 6 additions & 0 deletions
@@ -46,6 +46,11 @@ class CognifyPayloadDTO(InDTO):
         examples=[[]],
         description="Reference to one or more previously uploaded ontologies",
     )
+    chunks_per_batch: Optional[int] = Field(
+        default=None,
+        description="Number of chunks to process per task batch in Cognify (overrides default).",
+        examples=[10, 20, 50, 100],
+    )
 
 
 def get_cognify_router() -> APIRouter:
@@ -146,6 +151,7 @@ async def cognify(payload: CognifyPayloadDTO, user: User = Depends(get_authentic
             config=config_to_use,
             run_in_background=payload.run_in_background,
             custom_prompt=payload.custom_prompt,
+            chunks_per_batch=payload.chunks_per_batch,
         )
 
         # If any cognify run errored return JSONResponse with proper error status code
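With the DTO field in place, API clients can send the override in the cognify payload. A hedged example using `requests` — the base URL, route path, payload `datasets` field, and bearer-token auth are illustrative assumptions, not shown in this diff:

```python
import requests

response = requests.post(
    "http://localhost:8000/api/v1/cognify",  # assumed mount point for the router
    json={
        "datasets": ["my_dataset"],  # assumed pre-existing payload field
        "chunks_per_batch": 50,  # the new optional field added here
    },
    headers={"Authorization": "Bearer <token>"},  # assumed auth scheme
)
print(response.status_code, response.json())
```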

cognee/cli/commands/cognify_command.py

Lines changed: 6 additions & 0 deletions
@@ -62,6 +62,11 @@ def configure_parser(self, parser: argparse.ArgumentParser) -> None:
         parser.add_argument(
             "--verbose", "-v", action="store_true", help="Show detailed progress information"
         )
+        parser.add_argument(
+            "--chunks-per-batch",
+            type=int,
+            help="Number of chunks to process per task batch (try 50 for large single documents).",
+        )
 
     def execute(self, args: argparse.Namespace) -> None:
         try:
@@ -111,6 +116,7 @@ async def run_cognify():
                 chunk_size=args.chunk_size,
                 ontology_file_path=args.ontology_file,
                 run_in_background=args.background,
+                chunks_per_batch=getattr(args, "chunks_per_batch", None),
             )
             return result
         except Exception as e:
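One note on `getattr(args, "chunks_per_batch", None)`: argparse maps `--chunks-per-batch` to the attribute `chunks_per_batch` and already defaults it to `None` when the flag is omitted, so the `getattr` fallback is defensive rather than strictly necessary. A quick self-contained check:

```python
import argparse

parser = argparse.ArgumentParser(prog="cognee cognify")
parser.add_argument("--chunks-per-batch", type=int)

# Flag provided: parsed as an int under the dest "chunks_per_batch".
args = parser.parse_args(["--chunks-per-batch", "50"])
print(args.chunks_per_batch)  # 50

# Flag omitted: the attribute still exists and defaults to None,
# which is exactly what the command forwards to cognify.
args = parser.parse_args([])
print(args.chunks_per_batch)  # None
```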

cognee/modules/cognify/config.py

Lines changed: 2 additions & 0 deletions
@@ -9,13 +9,15 @@ class CognifyConfig(BaseSettings):
     classification_model: object = DefaultContentPrediction
     summarization_model: object = SummarizedContent
     triplet_embedding: bool = False
+    chunks_per_batch: Optional[int] = None
     model_config = SettingsConfigDict(env_file=".env", extra="allow")
 
     def to_dict(self) -> dict:
         return {
             "classification_model": self.classification_model,
             "summarization_model": self.summarization_model,
             "triplet_embedding": self.triplet_embedding,
+            "chunks_per_batch": self.chunks_per_batch,
         }
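Because `CognifyConfig` is a pydantic `BaseSettings` with `env_file=".env"`, the new field should also be settable from the environment. A sketch assuming pydantic-settings' default case-insensitive, unprefixed field-to-variable mapping (the diff does not show an `env_prefix` either way):

```python
import os

from cognee.modules.cognify.config import CognifyConfig

# CHUNKS_PER_BATCH=25 in the process environment (or in .env)
# should populate the new optional field.
os.environ["CHUNKS_PER_BATCH"] = "25"

config = CognifyConfig()
print(config.chunks_per_batch)  # 25
print(config.to_dict()["chunks_per_batch"])  # 25
```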

cognee/tests/cli_tests/cli_unit_tests/test_cli_commands.py

Lines changed: 1 addition & 0 deletions
@@ -238,6 +238,7 @@ def test_execute_basic_cognify(self, mock_asyncio_run):
                 ontology_file_path=None,
                 chunker=TextChunker,
                 run_in_background=False,
+                chunks_per_batch=None,
             )
 
     @patch("cognee.cli.commands.cognify_command.asyncio.run")

cognee/tests/cli_tests/cli_unit_tests/test_cli_edge_cases.py

Lines changed: 3 additions & 0 deletions
@@ -262,6 +262,7 @@ def test_cognify_invalid_chunk_size(self, mock_asyncio_run):
                 ontology_file_path=None,
                 chunker=TextChunker,
                 run_in_background=False,
+                chunks_per_batch=None,
             )
 
     @patch("cognee.cli.commands.cognify_command.asyncio.run", side_effect=_mock_run)
@@ -295,6 +296,7 @@ def test_cognify_nonexistent_ontology_file(self, mock_asyncio_run):
                 ontology_file_path="/nonexistent/path/ontology.owl",
                 chunker=TextChunker,
                 run_in_background=False,
+                chunks_per_batch=None,
             )
 
     @patch("cognee.cli.commands.cognify_command.asyncio.run")
@@ -373,6 +375,7 @@ def test_cognify_empty_datasets_list(self, mock_asyncio_run):
                 ontology_file_path=None,
                 chunker=TextChunker,
                 run_in_background=False,
+                chunks_per_batch=None,
             )
