✨(back) Add ODT parsing support and refactor document parser routing#354
✨(back) Add ODT parsing support and refactor document parser routing#354
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds ODT (OpenDocument Text) parsing: new ODT converter and error type, MIME-based parser routing with an ODT mixin, dependency on Changes
Sequence DiagramsequenceDiagram
participant Client
participant BaseParser as BaseParser
participant OdtMixin as OdtParserMixin
participant Converter as OdtToMd
participant ODFDO as odfdo.Document
participant Output as Markdown
Client->>BaseParser: parse_document(name, "application/vnd.oasis.opendocument.text", content)
BaseParser->>BaseParser: check MIME -> ODT branch
BaseParser->>OdtMixin: parse_odt_document(content)
OdtMixin->>Converter: extract(content)
Converter->>ODFDO: Document(BytesIO(content))
alt success
ODFDO-->>Converter: doc
Converter->>ODFDO: to_markdown()
ODFDO-->>Converter: markdown_text
Converter-->>OdtMixin: markdown_text
OdtMixin-->>BaseParser: markdown_text
BaseParser-->>Client: parsed markdown (str)
else failure
ODFDO-->>Converter: raises (TypeError/BadZip/XMLSyntaxError)
Converter->>Converter: log error
Converter-->>OdtMixin: raise OdtParsingError
OdtMixin-->>BaseParser: propagate OdtParsingError
BaseParser-->>Client: error propagated
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
d9a8762 to
c6e1e54
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
src/backend/chat/agent_rag/document_converter/odt.py (1)
22-25: Add exception handling for corrupt ODT ZIP structures.ODT files are ZIP-based, and corrupt files will raise
zipfile.BadZipFilefromodfdo.Document. The current handling only catchesTypeErrorandFileNotFoundError, missing this failure mode. Consider also catchinglxml.etree.XMLSyntaxErrorfor malformed XML content within the ZIP.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/agent_rag/document_converter/odt.py` around lines 22 - 25, The except block that currently catches (TypeError, FileNotFoundError) and raises OdtParsingError should also handle zipfile.BadZipFile and lxml.etree.XMLSyntaxError so corrupt ZIP structures and malformed XML inside ODTs are wrapped consistently; update the exception tuple in the except (used around odfdo.Document parsing, logger.error and raise OdtParsingError) to include zipfile.BadZipFile and lxml.etree.XMLSyntaxError, and add the necessary imports for zipfile and lxml.etree at the top of the module.src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (1)
215-215: Misleading comment: says "PDF" but fixture is for ODT.The comment was likely copy-pasted from
mock_document_api. Consider updating it.✏️ Proposed fix
- # Mock PDF parsing + # Mock ODT parsing🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py` at line 215, The inline comment above the ODT fixture in the test function (test_conversation_with_document_upload) incorrectly says "Mock PDF parsing"; update that comment to accurately state "Mock ODT parsing" (or a more generic "Mock document parsing") so it matches the fixture used and avoids confusion when reviewing/mock_document_api is referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/backend/chat/agent_rag/document_converter/odt.py`:
- Around line 15-16: Rename the class OtdToMd to OdtToMd to fix the typo and
keep naming consistent with OdtParsingError and the module name (odt.py); update
any references/usages of OtdToMd (constructor calls, imports, type hints) to the
new OdtToMd identifier so there are no unresolved references.
In
`@src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py`:
- Around line 393-394: The test function test_parse_odt has an incorrect
docstring copied from a PDF test; update the docstring to accurately describe
the ODT parsing behavior being tested (e.g., that it should extract
content/pages/text from an ODT file correctly or validate ODT-specific parsing
edge cases) so it reflects the purpose of test_parse_odt in the Adaptive PDF/ODT
parser tests.
In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 988-1007: The test uses pdf_base64 but it actually encodes ODT
content and builds a data URL with application/pdf; rename the variable
pdf_base64 to odt_base64 (or similar) where sample_odt_content.read() is
encoded, and update the Attachment.url data URL to use the correct MIME type for
ODT (e.g., application/vnd.oasis.opendocument.text) so UIMessage/Attachment
(fields name, contentType, url) consistently reflect ODT content.
---
Nitpick comments:
In `@src/backend/chat/agent_rag/document_converter/odt.py`:
- Around line 22-25: The except block that currently catches (TypeError,
FileNotFoundError) and raises OdtParsingError should also handle
zipfile.BadZipFile and lxml.etree.XMLSyntaxError so corrupt ZIP structures and
malformed XML inside ODTs are wrapped consistently; update the exception tuple
in the except (used around odfdo.Document parsing, logger.error and raise
OdtParsingError) to include zipfile.BadZipFile and lxml.etree.XMLSyntaxError,
and add the necessary imports for zipfile and lxml.etree at the top of the
module.
In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Line 215: The inline comment above the ODT fixture in the test function
(test_conversation_with_document_upload) incorrectly says "Mock PDF parsing";
update that comment to accurately state "Mock ODT parsing" (or a more generic
"Mock document parsing") so it matches the fixture used and avoids confusion
when reviewing/mock_document_api is referenced.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: be896f2d-6e79-4df9-9174-9052f5dd3098
⛔ Files ignored due to path filters (1)
src/backend/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
CHANGELOG.mdDockerfilesrc/backend/chat/agent_rag/document_converter/markitdown.pysrc/backend/chat/agent_rag/document_converter/odt.pysrc/backend/chat/agent_rag/document_converter/parser.pysrc/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odtsrc/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.pysrc/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.pysrc/backend/conversations/settings.pysrc/backend/core/tests/file_upload/test_generate_temporary_url.pysrc/backend/pyproject.toml
src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
Outdated
Show resolved
Hide resolved
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/backend/chat/agent_rag/document_converter/parser.py (1)
56-57: Marknameas intentionally unused.This override should keep the base signature, but the parameter is never read. Renaming it to
_namekeeps the contract intact and clears the static-analysis warning.Small cleanup
- def parse_odt_document(self, name: str, content: bytes) -> str: + def parse_odt_document(self, _name: str, content: bytes) -> str: return OtdToMd().extract(content)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/agent_rag/document_converter/parser.py` around lines 56 - 57, The parse_odt_document method currently has an unused parameter name; to silence static-analysis warnings, rename the parameter to _name (keeping the str type annotation) in the def parse_odt_document(self, _name: str, content: bytes) -> str signature and leave the body returning OtdToMd().extract(content) unchanged (method name: parse_odt_document, helper: OtdToMd.extract).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/backend/chat/agent_rag/document_converter/parser.py`:
- Around line 44-50: BaseParser currently defines parse_pdf_document and
parse_odt_document as NotImplementedError stubs which causes any backend that
still constructs BaseParser (see BaseParser usage in base_rag_backend.py) to
crash on PDF/ODT uploads; either provide a safe concrete fallback in BaseParser
(implement parse_pdf_document and parse_odt_document with a minimal,
non-throwing behavior such as returning an empty string or simple text
extraction) or make parser injection mandatory by changing base_rag_backend.py
to require a concrete parser instance (do not instantiate BaseParser by default)
and surface a clear error if no parser is supplied; update the code paths
referencing BaseParser, parse_pdf_document, and parse_odt_document accordingly
so uploads never hit a NotImplementedError at runtime.
---
Nitpick comments:
In `@src/backend/chat/agent_rag/document_converter/parser.py`:
- Around line 56-57: The parse_odt_document method currently has an unused
parameter name; to silence static-analysis warnings, rename the parameter to
_name (keeping the str type annotation) in the def parse_odt_document(self,
_name: str, content: bytes) -> str signature and leave the body returning
OtdToMd().extract(content) unchanged (method name: parse_odt_document, helper:
OtdToMd.extract).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: e3762c8f-1b32-49ca-8e6d-37aad81e543f
⛔ Files ignored due to path filters (1)
src/backend/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (10)
CHANGELOG.mdsrc/backend/chat/agent_rag/document_converter/markitdown.pysrc/backend/chat/agent_rag/document_converter/odt.pysrc/backend/chat/agent_rag/document_converter/parser.pysrc/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odtsrc/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.pysrc/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.pysrc/backend/conversations/settings.pysrc/backend/core/tests/file_upload/test_generate_temporary_url.pysrc/backend/pyproject.toml
✅ Files skipped from review due to trivial changes (6)
- src/backend/pyproject.toml
- src/backend/chat/agent_rag/document_converter/markitdown.py
- src/backend/conversations/settings.py
- src/backend/core/tests/file_upload/test_generate_temporary_url.py
- src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
- CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (2)
- src/backend/chat/agent_rag/document_converter/odt.py
- src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
c6e1e54 to
e3e0659
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (1)
988-1005:⚠️ Potential issue | 🟡 MinorFix ODT attachment MIME consistency in test payload.
Line 988 uses
pdf_base64for ODT bytes, and Line 1004 setsdata:application/pdfwhile the attachment is ODT. Keepname,contentType, andurlMIME aligned.✏️ Proposed fix
- pdf_base64 = base64.b64encode(sample_odt_content.read()).decode("utf-8") + odt_base64 = base64.b64encode(sample_odt_content.read()).decode("utf-8") @@ - url=f"data:application/pdf;base64,{pdf_base64}", + url=f"data:application/vnd.oasis.opendocument.text;base64,{odt_base64}",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py` around lines 988 - 1005, The test constructs an ODT attachment but encodes it as pdf_base64 and uses a PDF data URL; update the payload so the encoded bytes and MIME match the ODT attachment: rename or replace pdf_base64 with odt_base64 (the base64 of sample_odt_content), set Attachment.contentType to "application/vnd.oasis.opendocument.text" (already present) and change the Attachment.url data URL prefix to "data:application/vnd.oasis.opendocument.text;base64," so UIMessage.experimental_attachments (Attachment) has consistent name, contentType and url values.src/backend/chat/agent_rag/document_converter/odt.py (1)
15-16:⚠️ Potential issue | 🟡 MinorRename
OtdToMdtoOdtToMdfor consistency.Line 15 uses
OtdToMd, which is inconsistent withodt.py/OdtParsingErrornaming and makes symbol discovery harder.✏️ Proposed fix
-class OtdToMd: +class OdtToMd: """Convert an ODT file to Markdown using odfdo."""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/agent_rag/document_converter/odt.py` around lines 15 - 16, Rename the class OtdToMd to OdtToMd to match the file and existing symbols (e.g., OdtParsingError); update the class declaration and any references/imports/usages of OtdToMd across the codebase to OdtToMd (including tests, type hints, and factory/creator calls) and keep the existing docstring text intact but ensure it reflects the new class name if referenced; run tests to ensure no remaining references to OtdToMd remain.
🧹 Nitpick comments (2)
src/backend/chat/agent_rag/document_converter/parser.py (1)
56-57: Mark intentionally unused parameter explicitly.Line 56 keeps
namefor signature compatibility, but it is unused. Consider_nameto make intent explicit and quiet static warnings.♻️ Suggested tweak
- def parse_odt_document(self, name: str, content: bytes) -> str: + def parse_odt_document(self, _name: str, content: bytes) -> str: return OtdToMd().extract(content)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/agent_rag/document_converter/parser.py` around lines 56 - 57, The parameter name in parse_odt_document is intentionally unused but currently named `name`; to make this explicit and silence static warnings, rename the parameter to `_name: str` (or prefix it with an underscore) in the parse_odt_document signature and keep the body returning OtdToMd().extract(content) unchanged; update any callers if necessary to preserve compatibility while keeping the unused parameter clearly marked.src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (1)
197-277: Consider factoring duplicated API mock setup into a helper.
fixture_mock_odt_document_apiduplicates most offixture_mock_document_api; this increases maintenance cost for future endpoint/shape changes.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py` around lines 197 - 277, The fixture_fixture_mock_odt_document_api duplicates most of fixture_mock_document_api; extract the repeated responses setup into a shared helper (e.g., register_albert_and_find_mocks or build_document_api_mocks) that accepts parameters like document_name, document_content, prompt_tokens, completion_tokens, search_method, search_score and registers the common responses for the /v1/collections, /v1/parse-beta, /v1/documents, /v1/search and the find API endpoints; then call this helper from both fixture_mock_odt_document_api and fixture_mock_document_api to remove duplication and keep only odt-specific overrides in fixture_mock_odt_document_api.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/backend/chat/agent_rag/document_converter/odt.py`:
- Around line 22-26: Update the exception handling around the
odfdo.Document(BytesIO(content)) call to also catch zipfile.BadZipFile and
lxml.etree.XMLSyntaxError in addition to TypeError and FileNotFoundError so all
parsing failures are wrapped as OdtParsingError; keep the existing
logger.error("Failed to parse ODT document: %s", e) and re-raise
OdtParsingError(_("Failed to parse ODT document: %(error)s") % {"error": e})
from e so errors from odfdo.Document, corrupt zip archives, and malformed XML
are all handled (refer to odfdo.Document, BytesIO(content), logger.error, and
OdtParsingError).
---
Duplicate comments:
In `@src/backend/chat/agent_rag/document_converter/odt.py`:
- Around line 15-16: Rename the class OtdToMd to OdtToMd to match the file and
existing symbols (e.g., OdtParsingError); update the class declaration and any
references/imports/usages of OtdToMd across the codebase to OdtToMd (including
tests, type hints, and factory/creator calls) and keep the existing docstring
text intact but ensure it reflects the new class name if referenced; run tests
to ensure no remaining references to OtdToMd remain.
In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 988-1005: The test constructs an ODT attachment but encodes it as
pdf_base64 and uses a PDF data URL; update the payload so the encoded bytes and
MIME match the ODT attachment: rename or replace pdf_base64 with odt_base64 (the
base64 of sample_odt_content), set Attachment.contentType to
"application/vnd.oasis.opendocument.text" (already present) and change the
Attachment.url data URL prefix to
"data:application/vnd.oasis.opendocument.text;base64," so
UIMessage.experimental_attachments (Attachment) has consistent name, contentType
and url values.
---
Nitpick comments:
In `@src/backend/chat/agent_rag/document_converter/parser.py`:
- Around line 56-57: The parameter name in parse_odt_document is intentionally
unused but currently named `name`; to make this explicit and silence static
warnings, rename the parameter to `_name: str` (or prefix it with an underscore)
in the parse_odt_document signature and keep the body returning
OtdToMd().extract(content) unchanged; update any callers if necessary to
preserve compatibility while keeping the unused parameter clearly marked.
In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 197-277: The fixture_fixture_mock_odt_document_api duplicates most
of fixture_mock_document_api; extract the repeated responses setup into a shared
helper (e.g., register_albert_and_find_mocks or build_document_api_mocks) that
accepts parameters like document_name, document_content, prompt_tokens,
completion_tokens, search_method, search_score and registers the common
responses for the /v1/collections, /v1/parse-beta, /v1/documents, /v1/search and
the find API endpoints; then call this helper from both
fixture_mock_odt_document_api and fixture_mock_document_api to remove
duplication and keep only odt-specific overrides in
fixture_mock_odt_document_api.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: d24dd655-d76d-4fd4-baec-dfc782d57a7c
⛔ Files ignored due to path filters (1)
src/backend/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
CHANGELOG.mdsrc/backend/chat/agent_rag/document_converter/markitdown.pysrc/backend/chat/agent_rag/document_converter/odt.pysrc/backend/chat/agent_rag/document_converter/parser.pysrc/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odtsrc/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.pysrc/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.pysrc/backend/conversations/settings.pysrc/backend/core/tests/file_upload/test_generate_temporary_url.pysrc/backend/pyproject.tomlsrc/frontend/apps/e2e/__tests__/app-conversations/common.ts
✅ Files skipped from review due to trivial changes (5)
- CHANGELOG.md
- src/backend/core/tests/file_upload/test_generate_temporary_url.py
- src/backend/chat/agent_rag/document_converter/markitdown.py
- src/backend/pyproject.toml
- src/frontend/apps/e2e/tests/app-conversations/common.ts
🚧 Files skipped from review as they are similar to previous changes (2)
- src/backend/conversations/settings.py
- src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
e3e0659 to
043d17a
Compare
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (1)
988-1005:⚠️ Potential issue | 🟡 MinorFix ODT data URL MIME mismatch in attachment payload.
At Line 988,
pdf_base64encodes ODT bytes, and at Line 1004 the URL still usesdata:application/pdf. This is inconsistent withcontentType="application/vnd.oasis.opendocument.text"and can hide routing bugs.✏️ Proposed fix
- pdf_base64 = base64.b64encode(sample_odt_content.read()).decode("utf-8") + odt_base64 = base64.b64encode(sample_odt_content.read()).decode("utf-8") @@ - url=f"data:application/pdf;base64,{pdf_base64}", + url=f"data:application/vnd.oasis.opendocument.text;base64,{odt_base64}",🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py` around lines 988 - 1005, The attachment payload builds a data URL with pdf_base64 (ODT bytes) but uses the wrong MIME type "application/pdf"; update the data URL to match Attachment.contentType ("application/vnd.oasis.opendocument.text") so the UIMessage/Attachment created in the test (variables: pdf_base64, sample_odt_content, UIMessage, Attachment, TextUIPart) uses a consistent MIME type for the data: URL.
🧹 Nitpick comments (2)
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (2)
197-278: Extract a shared API-mock builder to avoid drift.This ODT fixture is almost identical to
fixture_mock_document_api; only document-specific values differ. Consolidating into one helper/parametrized fixture will reduce maintenance risk.♻️ Proposed refactor sketch
+def _register_document_api_mocks(*, document_name: str, document_content: str): + prompt_tokens = 10 + completion_tokens = 20 + search_method = "semantic" + search_score = 0.9 + + responses.post("https://albert.api.etalab.gouv.fr/v1/collections", json={"id": "123", "name": "test-collection"}, status=status.HTTP_200_OK) + responses.post( + "https://albert.api.etalab.gouv.fr/v1/parse-beta", + json={ + "data": [{"content": document_content, "metadata": {"document_name": document_name}}], + "usage": {"prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens}, + }, + status=status.HTTP_200_OK, + ) + responses.post("https://albert.api.etalab.gouv.fr/v1/documents", json={"id": 456}, status=status.HTTP_201_CREATED) + responses.post( + "https://albert.api.etalab.gouv.fr/v1/search", + json={ + "data": [{ + "method": search_method, + "chunk": {"id": 123, "content": document_content, "metadata": {"document_name": document_name}}, + "score": search_score, + }], + "usage": {"prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens}, + }, + status=status.HTTP_200_OK, + ) + responses.post("https://find.api.example.com/api/v1.0/documents/index/", json={"id": "456", "status": "indexed"}, status=status.HTTP_200_OK) + responses.post( + "https://find.api.example.com/api/v1.0/documents/search/", + json=[{"_source": {"title.fr": document_name, "content.fr": document_content}, "_score": search_score}], + status=status.HTTP_200_OK, + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py` around lines 197 - 278, The ODT-specific responses in fixture_mock_odt_document_api largely duplicate fixture_mock_document_api; extract a shared API-mock builder (e.g., a helper function build_mock_document_api or a parametrized fixture) that accepts document_name, document_content, prompt_tokens, completion_tokens, search_method, and search_score, then have fixture_mock_odt_document_api and fixture_mock_document_api call that helper with their specific values; update places that previously referenced fixture_mock_odt_document_api to use the new helper/parametrized fixture and remove duplicated response setup inside fixture_mock_odt_document_api.
971-1244: Consider parametrizing PDF/ODT conversation tests.This test duplicates most of
test_post_conversation_with_document_uploadwith only format-specific values changed. A parametrized test would keep assertions aligned across formats and reduce future copy/paste regressions.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py` around lines 971 - 1244, The two near-duplicate tests (test_post_conversation_with_document_upload and test_post_conversation_with_odt_document_upload) should be merged into a single parametrized test; replace these separate functions with one test function (e.g., test_post_conversation_with_document_upload_parametrized) decorated with pytest.mark.parametrize over format-specific values (contentType, filename, sample content fixture, expected mime/data URL prefix, and expected extracted text like "This is the content of the ODT."). In the body, use the same setup logic currently in test_post_conversation_with_odt_document_upload (creating ChatConversationFactory, forcing auth, encoding sample content, building UIMessage with experimental_attachments) but substitute the parametrized variables for Attachment(name, contentType, url) and expected response strings; keep agent_model, response assertions, and pydantic_messages assertions but derive format-specific expected values from the parameters (e.g., attachment filename and expected source url/title). Ensure you reference and reuse the existing symbols: UIMessage, Attachment, TextUIPart, mock_ai_agent_service(FunctionModel(...)), replace_uuids_with_placeholder, and chat_conversation.pydantic_messages to maintain all existing assertions while removing duplicated code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 988-1005: The attachment payload builds a data URL with pdf_base64
(ODT bytes) but uses the wrong MIME type "application/pdf"; update the data URL
to match Attachment.contentType ("application/vnd.oasis.opendocument.text") so
the UIMessage/Attachment created in the test (variables: pdf_base64,
sample_odt_content, UIMessage, Attachment, TextUIPart) uses a consistent MIME
type for the data: URL.
---
Nitpick comments:
In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 197-278: The ODT-specific responses in
fixture_mock_odt_document_api largely duplicate fixture_mock_document_api;
extract a shared API-mock builder (e.g., a helper function
build_mock_document_api or a parametrized fixture) that accepts document_name,
document_content, prompt_tokens, completion_tokens, search_method, and
search_score, then have fixture_mock_odt_document_api and
fixture_mock_document_api call that helper with their specific values; update
places that previously referenced fixture_mock_odt_document_api to use the new
helper/parametrized fixture and remove duplicated response setup inside
fixture_mock_odt_document_api.
- Around line 971-1244: The two near-duplicate tests
(test_post_conversation_with_document_upload and
test_post_conversation_with_odt_document_upload) should be merged into a single
parametrized test; replace these separate functions with one test function
(e.g., test_post_conversation_with_document_upload_parametrized) decorated with
pytest.mark.parametrize over format-specific values (contentType, filename,
sample content fixture, expected mime/data URL prefix, and expected extracted
text like "This is the content of the ODT."). In the body, use the same setup
logic currently in test_post_conversation_with_odt_document_upload (creating
ChatConversationFactory, forcing auth, encoding sample content, building
UIMessage with experimental_attachments) but substitute the parametrized
variables for Attachment(name, contentType, url) and expected response strings;
keep agent_model, response assertions, and pydantic_messages assertions but
derive format-specific expected values from the parameters (e.g., attachment
filename and expected source url/title). Ensure you reference and reuse the
existing symbols: UIMessage, Attachment, TextUIPart,
mock_ai_agent_service(FunctionModel(...)), replace_uuids_with_placeholder, and
chat_conversation.pydantic_messages to maintain all existing assertions while
removing duplicated code.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2e49a0ed-402a-4ae0-9a01-e68e2f23b20f
⛔ Files ignored due to path filters (1)
src/backend/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
CHANGELOG.mdsrc/backend/chat/agent_rag/document_converter/markitdown.pysrc/backend/chat/agent_rag/document_converter/odt.pysrc/backend/chat/agent_rag/document_converter/parser.pysrc/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odtsrc/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.pysrc/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.pysrc/backend/conversations/settings.pysrc/backend/core/tests/file_upload/test_generate_temporary_url.pysrc/backend/pyproject.tomlsrc/frontend/apps/e2e/__tests__/app-conversations/common.ts
✅ Files skipped from review due to trivial changes (4)
- src/backend/chat/agent_rag/document_converter/markitdown.py
- CHANGELOG.md
- src/backend/pyproject.toml
- src/backend/core/tests/file_upload/test_generate_temporary_url.py
🚧 Files skipped from review as they are similar to previous changes (5)
- src/frontend/apps/e2e/tests/app-conversations/common.ts
- src/backend/chat/agent_rag/document_converter/odt.py
- src/backend/conversations/settings.py
- src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
- src/backend/chat/agent_rag/document_converter/parser.py
043d17a to
cbe1cd1
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
docs/async-tasks-analysis.md (1)
35-35: Consider completing sentence fragments for clarity.Lines 35, 61, and 80 contain sentence fragments missing subjects ("Could be..." without stating what "could be"). While understandable from context, complete sentences improve readability in technical documentation.
✍️ Suggested grammatical improvements
-**Trade-off**: Pre-computing costs LLM tokens even if the user never asks for a summary. Could be opt-in per project or triggered lazily on first access then cached. +**Trade-off**: Pre-computing costs LLM tokens even if the user never asks for a summary. This could be opt-in per project or triggered lazily on first access then cached.-**Alternative (simpler, no LLM call)**: Implement a `history_processor` in PydanticAI that truncates/trims old messages at request time (sliding window). Cheaper but loses context. Could be a good first step before full LLM-based compaction. +**Alternative (simpler, no LLM call)**: Implement a `history_processor` in PydanticAI that truncates/trims old messages at request time (sliding window). Cheaper but loses context. This could be a good first step before full LLM-based compaction.-Multiple sync Langfuse SDK calls block the event loop. Could be offloaded, but `asyncio.to_thread()` is the simpler fix for these. +Multiple sync Langfuse SDK calls block the event loop. These could be offloaded, but `asyncio.to_thread()` is the simpler fix for them.Also applies to: 61-61, 80-80
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/async-tasks-analysis.md` at line 35, Several sentences in the doc are fragments lacking an explicit subject (e.g., "Could be opt-in per project or triggered lazily on first access then cached."): update the phrases at the occurrences of "Could be..." (lines referencing the fragment text) to full sentences by adding a subject and verb—e.g., "This can be made opt-in per project, or it can be triggered lazily on first access and then cached."—apply the same pattern to the other fragments so each "Could be..." becomes a complete sentence that preserves meaning and tone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/async-tasks-analysis.md`:
- Around line 65-78: The document has duplicate section numbering ("### 4." used
twice); rename the second "### 4. Periodic maintenance (LOW VALUE, future need)"
to "### 5. Periodic maintenance (LOW VALUE, future need)" and then update "###
5. Langfuse trace updates (LOW VALUE)" to "### 6. Langfuse trace updates (LOW
VALUE)" so the headings "Title generation", "Periodic maintenance", and
"Langfuse trace updates" have consecutive, unique numbers.
In
`@src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py`:
- Around line 408-414: Test asserts the wrong call signature for the ODT route:
BaseParser.parse_document only forwards the file content to parse_odt_document,
not a name= kwarg. Update the assertion that references
parser.parse_odt_document (and the test around parser.parse_document) to expect
a single positional content argument (e.g.,
mock_parse.assert_called_once_with(sample_odt)) or the equivalent content-only
call rather than mock_parse.assert_called_once_with(name="sample.odt",
content=sample_odt); keep the rest of the test unchanged.
---
Nitpick comments:
In `@docs/async-tasks-analysis.md`:
- Line 35: Several sentences in the doc are fragments lacking an explicit
subject (e.g., "Could be opt-in per project or triggered lazily on first access
then cached."): update the phrases at the occurrences of "Could be..." (lines
referencing the fragment text) to full sentences by adding a subject and
verb—e.g., "This can be made opt-in per project, or it can be triggered lazily
on first access and then cached."—apply the same pattern to the other fragments
so each "Could be..." becomes a complete sentence that preserves meaning and
tone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 94dd5c97-21aa-46de-89fb-75bd5718a076
⛔ Files ignored due to path filters (1)
src/backend/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
CHANGELOG.mddocs/async-tasks-analysis.mdsrc/backend/chat/agent_rag/document_converter/markitdown.pysrc/backend/chat/agent_rag/document_converter/odt.pysrc/backend/chat/agent_rag/document_converter/parser.pysrc/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odtsrc/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.pysrc/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.pysrc/backend/conversations/settings.pysrc/backend/core/tests/file_upload/test_generate_temporary_url.pysrc/backend/pyproject.tomlsrc/frontend/apps/e2e/__tests__/app-conversations/common.ts
✅ Files skipped from review due to trivial changes (4)
- src/backend/chat/agent_rag/document_converter/markitdown.py
- src/backend/core/tests/file_upload/test_generate_temporary_url.py
- src/backend/pyproject.toml
- CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (3)
- src/backend/conversations/settings.py
- src/frontend/apps/e2e/tests/app-conversations/common.ts
- src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
docs/async-tasks-analysis.md
Outdated
| ### 4. Title generation (MEDIUM VALUE) | ||
|
|
||
| **Current flow**: `_generate_title()` runs after streaming completes - an extra LLM call that delays the final response. | ||
|
|
||
| **With background tasks**: Fire-and-forget after the stream ends. Title appears asynchronously in the UI. | ||
|
|
||
| ### 4. Periodic maintenance (LOW VALUE, future need) | ||
|
|
||
| No cleanup tasks exist today. Could eventually need: | ||
| - Orphaned vector store collection cleanup | ||
| - Old conversation purging | ||
| - Usage analytics aggregation | ||
|
|
||
| ### 5. Langfuse trace updates (LOW VALUE) |
There was a problem hiding this comment.
Fix duplicate section numbering.
Lines 65 and 71 both use "### 4." for different sections. This creates confusion in the document structure.
📝 Proposed fix for section numbering
### 4. Title generation (MEDIUM VALUE)
**Current flow**: `_generate_title()` runs after streaming completes - an extra LLM call that delays the final response.
**With background tasks**: Fire-and-forget after the stream ends. Title appears asynchronously in the UI.
-### 4. Periodic maintenance (LOW VALUE, future need)
+### 5. Periodic maintenance (LOW VALUE, future need)
No cleanup tasks exist today. Could eventually need:
- Orphaned vector store collection cleanup
- Old conversation purging
- Usage analytics aggregation
-### 5. Langfuse trace updates (LOW VALUE)
+### 6. Langfuse trace updates (LOW VALUE)
Multiple sync Langfuse SDK calls block the event loop. Could be offloaded, but `asyncio.to_thread()` is the simpler fix for these.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ### 4. Title generation (MEDIUM VALUE) | |
| **Current flow**: `_generate_title()` runs after streaming completes - an extra LLM call that delays the final response. | |
| **With background tasks**: Fire-and-forget after the stream ends. Title appears asynchronously in the UI. | |
| ### 4. Periodic maintenance (LOW VALUE, future need) | |
| No cleanup tasks exist today. Could eventually need: | |
| - Orphaned vector store collection cleanup | |
| - Old conversation purging | |
| - Usage analytics aggregation | |
| ### 5. Langfuse trace updates (LOW VALUE) | |
| ### 4. Title generation (MEDIUM VALUE) | |
| **Current flow**: `_generate_title()` runs after streaming completes - an extra LLM call that delays the final response. | |
| **With background tasks**: Fire-and-forget after the stream ends. Title appears asynchronously in the UI. | |
| ### 5. Periodic maintenance (LOW VALUE, future need) | |
| No cleanup tasks exist today. Could eventually need: | |
| - Orphaned vector store collection cleanup | |
| - Old conversation purging | |
| - Usage analytics aggregation | |
| ### 6. Langfuse trace updates (LOW VALUE) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/async-tasks-analysis.md` around lines 65 - 78, The document has
duplicate section numbering ("### 4." used twice); rename the second "### 4.
Periodic maintenance (LOW VALUE, future need)" to "### 5. Periodic maintenance
(LOW VALUE, future need)" and then update "### 5. Langfuse trace updates (LOW
VALUE)" to "### 6. Langfuse trace updates (LOW VALUE)" so the headings "Title
generation", "Periodic maintenance", and "Langfuse trace updates" have
consecutive, unique numbers.
src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
Outdated
Show resolved
Hide resolved
97a596f to
97e2c05
Compare
- Add odfdo dependency for ODT-to-markdown conversion - Refactor BaseParser to route by content type (PDF, ODT, other) - Extract OdtParserMixin and AdaptivePdfParserMixin for composability - Add OdtParsingError for corrupt/empty ODT files - Accept ODT in RAG upload formats, add pandoc to Docker image - Add tests for PDF/ODT routing, adaptive method selection, and ODT errors Signed-off-by: Laurent Paoletti <lp@providenz.fr>
97e2c05 to
bc28e05
Compare
|


Purpose
Allow ODT documents
Proposal
odfdolibrary for markdown conversionBaseParserto centralize content-type routing (PDF, ODT, fallback to markitdown), replacing duplicated routing inAlbertParserandAdaptivePdfParserOdtParserMixinand renameAdaptiveParserMixintoAdaptivePdfParserMixinfor clarity and composabilityOdtParsingErrorwith translatable message for corrupt/empty ODTfiles
application/vnd.oasis.opendocument.textin accepted RAG upload formatsSummary by CodeRabbit
New Features
Settings / Frontend
Tests
Documentation
Chores