Skip to content

Conversation

@cdoern
Copy link
Contributor

@cdoern cdoern commented Oct 29, 2025

What does this PR do?

chunk_id in the Chunk class executes actual logic to compute a chunk ID. This sort of logic should not live in the API spec.

Instead, the providers should be in charge of calling generate_chunk_id, and pass it to Chunk.

this removes the incorrect dependency between Provider impl and API impl

@cdoern
Copy link
Contributor Author

cdoern commented Oct 29, 2025

this relates to conversation in #3895, fixes like this will make the API easier to split into a separate extensions package.

return str(uuid.UUID(hashlib.sha256(hash_input).hexdigest()[:32]))

@property
def chunk_id(self) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the bigger issue is its existence in the API file itself. The API should never have behavior like this at all. It should be data types and method declarations. Not implementation. Can we actually remove the usage of this property itself? You should use immutable datatypes -- something at the top level in the routers, etc. should say "oh there is no chunk id so I will generate one and then I will initialize an immutable object with this chunk id"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes sure, that makes more sense

@cdoern cdoern changed the title fix: remove generate_chunk_id from providers.utils fix: remove chunk_id property from Chunk class Oct 29, 2025
@cdoern cdoern changed the title fix: remove chunk_id property from Chunk class fix!: remove chunk_id property from Chunk class Oct 29, 2025
chunk_window = f"{i}-{i + len(toks)}"
chunk_id = generate_chunk_id(chunk, text, chunk_window)
chunk_metadata = metadata.copy()
chunk_metadata["chunk_id"] = chunk_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need this in both places

chunk_id in the Chunk class executes actual logic to compute a chunk ID. This sort of logic should not live in the API spec.

Instead, the providers should be in charge of calling generate_chunk_id, and pass it to `Chunk`.

this removes the incorrect dependency between Provider impl and API impl

Signed-off-by: Charlie Doern <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants