@@ -296,7 +296,11 @@ def index(
     For the time being, documents are indexed using their hashes, and users
     are not able to specify the uid of the document.
 
-    Important:
+    .. versionchanged:: 0.3.25
+        Added ``scoped_full`` cleanup mode.
+
+    .. important::
+
         * In full mode, the loader should be returning
           the entire dataset, and not just a subset of the dataset.
           Otherwise, the auto_cleanup will remove documents that it is not
@@ -309,7 +313,7 @@ def index(
           chunks, and we index them using a batch size of 5, we'll have 3 batches
           all with the same source id. In general, to avoid doing too much
           redundant work select as big a batch size as possible.
-        * The `scoped_full` mode is suitable if determining an appropriate batch size
+        * The ``scoped_full`` mode is suitable if determining an appropriate batch size
           is challenging or if your data loader cannot return the entire dataset at
           once. This mode keeps track of source IDs in memory, which should be fine
           for most use cases. If your dataset is large (10M+ docs), you will likely
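For context, here is a minimal sketch of what the documented ``scoped_full`` behavior looks like when calling ``index``. It assumes langchain_core's in-memory helpers (InMemoryRecordManager, InMemoryVectorStore, DeterministicFakeEmbedding) as illustrative stand-ins for a real record manager, vector store, and embedding model; none of these specifics come from the diff itself.

    # Hypothetical demo of cleanup="scoped_full": stale records are only
    # deleted for source ids seen in this run, so partial loads are safe.
    from langchain_core.documents import Document
    from langchain_core.embeddings import DeterministicFakeEmbedding
    from langchain_core.indexing import InMemoryRecordManager, index
    from langchain_core.vectorstores import InMemoryVectorStore

    record_manager = InMemoryRecordManager(namespace="scoped_full_demo")
    record_manager.create_schema()
    vector_store = InMemoryVectorStore(DeterministicFakeEmbedding(size=16))

    # Two chunks from one source; the loader may yield any subset of sources.
    docs = [
        Document(page_content="chunk 1", metadata={"source": "a.txt"}),
        Document(page_content="chunk 2", metadata={"source": "a.txt"}),
    ]

    result = index(
        docs,
        record_manager,
        vector_store,
        cleanup="scoped_full",
        source_id_key="source",  # required for incremental/scoped_full cleanup
    )
    print(result)  # e.g. {'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}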
@@ -378,10 +382,6 @@ def index(
         TypeError: If ``vectorstore`` is not a VectorStore or a DocumentIndex.
         AssertionError: If ``source_id`` is None when cleanup mode is incremental.
             (should be unreachable code).
-
-    .. version_modified:: 0.3.25
-
-        * Added `scoped_full` cleanup mode.
     """
     # Behavior is deprecated, but we keep it for backwards compatibility.
     # # Warn only once per process.
@@ -636,26 +636,30 @@ async def aindex(
         documents were deleted, which documents should be skipped.
 
     For the time being, documents are indexed using their hashes, and users
-    are not able to specify the uid of the document.
-
-    Important:
-        * In full mode, the loader should be returning
-          the entire dataset, and not just a subset of the dataset.
-          Otherwise, the auto_cleanup will remove documents that it is not
-          supposed to.
-        * In incremental mode, if documents associated with a particular
-          source id appear across different batches, the indexing API
-          will do some redundant work. This will still result in the
-          correct end state of the index, but will unfortunately not be
-          100% efficient. For example, if a given document is split into 15
-          chunks, and we index them using a batch size of 5, we'll have 3 batches
-          all with the same source id. In general, to avoid doing too much
-          redundant work select as big a batch size as possible.
-        * The `scoped_full` mode is suitable if determining an appropriate batch size
-          is challenging or if your data loader cannot return the entire dataset at
-          once. This mode keeps track of source IDs in memory, which should be fine
-          for most use cases. If your dataset is large (10M+ docs), you will likely
-          need to parallelize the indexing process regardless.
+    are not able to specify the uid of the document.
+
+    .. versionchanged:: 0.3.25
+        Added ``scoped_full`` cleanup mode.
+
+    .. important::
+
+        * In full mode, the loader should be returning
+          the entire dataset, and not just a subset of the dataset.
+          Otherwise, the auto_cleanup will remove documents that it is not
+          supposed to.
+        * In incremental mode, if documents associated with a particular
+          source id appear across different batches, the indexing API
+          will do some redundant work. This will still result in the
+          correct end state of the index, but will unfortunately not be
+          100% efficient. For example, if a given document is split into 15
+          chunks, and we index them using a batch size of 5, we'll have 3 batches
+          all with the same source id. In general, to avoid doing too much
+          redundant work select as big a batch size as possible.
+        * The ``scoped_full`` mode is suitable if determining an appropriate batch size
+          is challenging or if your data loader cannot return the entire dataset at
+          once. This mode keeps track of source IDs in memory, which should be fine
+          for most use cases. If your dataset is large (10M+ docs), you will likely
+          need to parallelize the indexing process regardless.
 
     Args:
         docs_source: Data loader or iterable of documents to index.
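The async path mirrors the sync one. A hedged sketch under the same assumptions (in-memory stand-ins, hypothetical document contents), using ``aindex`` and the async schema-creation hook:

    # Hypothetical async counterpart using aindex; same in-memory stand-ins.
    import asyncio

    from langchain_core.documents import Document
    from langchain_core.embeddings import DeterministicFakeEmbedding
    from langchain_core.indexing import InMemoryRecordManager, aindex
    from langchain_core.vectorstores import InMemoryVectorStore

    async def main() -> None:
        record_manager = InMemoryRecordManager(namespace="scoped_full_demo")
        await record_manager.acreate_schema()
        vector_store = InMemoryVectorStore(DeterministicFakeEmbedding(size=16))

        docs = [Document(page_content="chunk 1", metadata={"source": "a.txt"})]

        # source_id_key is mandatory for scoped_full, mirroring the sync API.
        result = await aindex(
            docs,
            record_manager,
            vector_store,
            cleanup="scoped_full",
            source_id_key="source",
        )
        print(result)

    asyncio.run(main())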
@@ -720,10 +724,6 @@ async def aindex(
         TypeError: If ``vector_store`` is not a VectorStore or DocumentIndex.
         AssertionError: If ``source_id_key`` is None when cleanup mode is
             incremental or ``scoped_full`` (should be unreachable).
-
-    .. version_modified:: 0.3.25
-
-        * Added `scoped_full` cleanup mode.
     """
     # Behavior is deprecated, but we keep it for backwards compatibility.
     # # Warn only once per process.