ImportDocumentAsync Overwriting Chunks with Same Document ID? #1004

mahadzk · 2025-02-14T00:52:29Z

mahadzk
Feb 14, 2025

Hi everyone,

I'm working with Microsoft Kernel Memory's ImportDocumentAsync() and running into an issue where only the last chunk is being stored in Azure AI Search when I pass multiple chunks with the same documentId.

Setup:
I'm performing chunking before calling ImportDocumentAsync().
I'm not using the built-in chunking (TextPartitioningOptions).
Each chunk has a unique PageNumber in the metadata.
However, only one chunk appears in AI Search (likely overwriting previous ones).
Issue:
Does ImportDocumentAsync() expect all chunks of a document to be passed together?
If I pass multiple chunks sequentially with the same documentId, does it overwrite earlier chunks?
Should I be adding a unique identifier per chunk (like ChunkID) to ensure all chunks are indexed?
What I’ve Tried:
Verified logging: Each chunk is passed separately, but only one remains in AI Search.
Tested adding PageNumber as metadata, but that didn’t prevent overwrites.
Considering modifying documentId per chunk, but that might break document-level search.
Would appreciate insights from anyone who has worked with Microsoft Kernel Memory and Azure AI Search indexing!
I can provide the code snippet if that would be helpful.

Thanks!

mahadzk · 2025-02-18T02:24:12Z

mahadzk
Feb 18, 2025
Author

@dluc Maybe you or someone else can guide on this?

3 replies

dluc Feb 18, 2025
Maintainer

why are you passing chunks with a document ID? Have you tried passing the entire document, and leave it to KM to create the chunks?

mahadzk Feb 18, 2025
Author

Our initial implementation relied on KMs built in chunking, but we now need Page Numbers associated with each chunk.
As I understand it, when passing a full document to KM, metadata can only be set once at the document level, meaning we can’t attach PageNumber metadata to individual chunks.
To work around this, we are pre-chunking the document page by page and adding PageNumber and FileName metadata before passing it for embedding.
The challenge we’re facing is that all chunks share the same documentId, which causes conflicts in Azure AI Search since it appears to overwrite previous chunks. However, we rely on documentId for search and deletion, so modifying it per chunk could break document-level operations.
Would appreciate insight and guidance on the best way to approach this!

dluc Feb 18, 2025
Maintainer

I see, try something like this:

documentId = document Id +"_" + chunk number
add a myDocId tag to identify the file, where myDocId = a value pointing to the file where chunks come from

(1) will address the conflict, so each chunk is treated independently.
(2) allows to use Tags to filter by your document ID

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ImportDocumentAsync Overwriting Chunks with Same Document ID? #1004

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

ImportDocumentAsync Overwriting Chunks with Same Document ID? #1004

Uh oh!

mahadzk Feb 14, 2025

Replies: 1 comment · 3 replies

Uh oh!

Uh oh!

mahadzk Feb 18, 2025 Author

Uh oh!

dluc Feb 18, 2025 Maintainer

Uh oh!

mahadzk Feb 18, 2025 Author

Uh oh!

dluc Feb 18, 2025 Maintainer

mahadzk
Feb 14, 2025

Replies: 1 comment 3 replies

mahadzk
Feb 18, 2025
Author

dluc Feb 18, 2025
Maintainer

mahadzk Feb 18, 2025
Author

dluc Feb 18, 2025
Maintainer