Skip to content

✨(back) Add ODT parsing support and refactor document parser routing#354

Open
providenz wants to merge 1 commit intomainfrom
providenz/odt-parsing
Open

✨(back) Add ODT parsing support and refactor document parser routing#354
providenz wants to merge 1 commit intomainfrom
providenz/odt-parsing

Conversation

@providenz
Copy link
Copy Markdown
Collaborator

@providenz providenz commented Mar 23, 2026

Purpose

Allow ODT documents

Proposal

  • Add ODT (OpenDocument Text) support using odfdo library for markdown conversion
  • Refactor BaseParser to centralize content-type routing (PDF, ODT, fallback to markitdown), replacing duplicated routing in AlbertParser and
    AdaptivePdfParser
  • Extract OdtParserMixin and rename AdaptiveParserMixin to
    AdaptivePdfParserMixin for clarity and composability
  • Add OdtParsingError with translatable message for corrupt/empty ODT
    files
  • Register application/vnd.oasis.opendocument.text in accepted RAG upload formats

Summary by CodeRabbit

  • New Features

    • Added OpenDocument Text (ODT) support so users can upload and process .odt documents for document search and RAG workflows.
  • Settings / Frontend

    • Upload accept lists updated to allow ODT files.
  • Tests

    • Unit, integration, and e2e tests added/expanded to cover ODT parsing, routing, and upload flows.
  • Documentation

    • CHANGELOG updated; added analysis on background-task approaches and parsing tradeoffs.
  • Chores

    • Added runtime dependency for ODT parsing.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds ODT (OpenDocument Text) parsing: new ODT converter and error type, MIME-based parser routing with an ODT mixin, dependency on odfdo, configuration and frontend test updates, extensive backend unit and integration tests, and a changelog entry. No other behavioral changes reported.

Changes

Cohort / File(s) Summary
Core ODT Parsing
src/backend/chat/agent_rag/document_converter/odt.py
New OdtParsingError and OdtToMd.extract(content: bytes) -> str using odfdo.Document(BytesIO(content)).to_markdown(). Logs parsing failures and raises OdtParsingError with chained original exception.
Parser Routing & Mixins
src/backend/chat/agent_rag/document_converter/parser.py
Adds CT_PDF, CT_ODT; centralizes MIME routing in BaseParser.parse_document() with abstract parse_pdf_document and parse_odt_document; adds OdtParserMixin; renames AdaptiveParserMixinAdaptivePdfParserMixin; updates AlbertParser and AdaptivePdfParser inheritance and removes their parse_document() overrides.
Markdown Converter
src/backend/chat/agent_rag/document_converter/markitdown.py
Minor simplification: _convert() now returns conversion.text_content directly.
Configuration & Dependency
src/backend/conversations/settings.py, src/backend/pyproject.toml
Adds application/vnd.oasis.opendocument.text to default RAG_FILES_ACCEPTED_FORMATS; pins runtime dependency odfdo==3.22.1.
Backend Unit Tests
src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
Adds sample_odt fixture and ODT parsing tests; extends PDF routing tests for text-rich vs mixed/OCR routing.
Integration Tests
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
Adds sample_odt_content and mock_odt_document_api fixtures; new end-to-end test for ODT upload, parsing, RAG search events, and persisted messages.
Misc Tests
src/backend/core/tests/file_upload/test_generate_temporary_url.py
Includes my-document.odt in tested S3 key formats.
Frontend E2E
src/frontend/apps/e2e/__tests__/app-conversations/common.ts
Appends application/vnd.oasis.opendocument.text to CONFIG.chat_upload_accept for e2e tests.
Changelog & Docs
CHANGELOG.md, docs/async-tasks-analysis.md
Documents "add ODT parsing support" under Unreleased → Added; adds analysis doc on background task strategy.

Sequence Diagram

sequenceDiagram
    participant Client
    participant BaseParser as BaseParser
    participant OdtMixin as OdtParserMixin
    participant Converter as OdtToMd
    participant ODFDO as odfdo.Document
    participant Output as Markdown

    Client->>BaseParser: parse_document(name, "application/vnd.oasis.opendocument.text", content)
    BaseParser->>BaseParser: check MIME -> ODT branch
    BaseParser->>OdtMixin: parse_odt_document(content)
    OdtMixin->>Converter: extract(content)
    Converter->>ODFDO: Document(BytesIO(content))
    alt success
        ODFDO-->>Converter: doc
        Converter->>ODFDO: to_markdown()
        ODFDO-->>Converter: markdown_text
        Converter-->>OdtMixin: markdown_text
        OdtMixin-->>BaseParser: markdown_text
        BaseParser-->>Client: parsed markdown (str)
    else failure
        ODFDO-->>Converter: raises (TypeError/BadZip/XMLSyntaxError)
        Converter->>Converter: log error
        Converter-->>OdtMixin: raise OdtParsingError
        OdtMixin-->>BaseParser: propagate OdtParsingError
        BaseParser-->>Client: error propagated
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

backend, documentation

Suggested reviewers

  • qbey
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main changes: adding ODT parsing support and refactoring document parser routing, matching the primary objectives of the PR.
Docstring Coverage ✅ Passed Docstring coverage is 96.67% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch providenz/odt-parsing

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@providenz providenz force-pushed the providenz/odt-parsing branch 4 times, most recently from d9a8762 to c6e1e54 Compare March 23, 2026 16:47
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
src/backend/chat/agent_rag/document_converter/odt.py (1)

22-25: Add exception handling for corrupt ODT ZIP structures.

ODT files are ZIP-based, and corrupt files will raise zipfile.BadZipFile from odfdo.Document. The current handling only catches TypeError and FileNotFoundError, missing this failure mode. Consider also catching lxml.etree.XMLSyntaxError for malformed XML content within the ZIP.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/agent_rag/document_converter/odt.py` around lines 22 - 25,
The except block that currently catches (TypeError, FileNotFoundError) and
raises OdtParsingError should also handle zipfile.BadZipFile and
lxml.etree.XMLSyntaxError so corrupt ZIP structures and malformed XML inside
ODTs are wrapped consistently; update the exception tuple in the except (used
around odfdo.Document parsing, logger.error and raise OdtParsingError) to
include zipfile.BadZipFile and lxml.etree.XMLSyntaxError, and add the necessary
imports for zipfile and lxml.etree at the top of the module.
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (1)

215-215: Misleading comment: says "PDF" but fixture is for ODT.

The comment was likely copy-pasted from mock_document_api. Consider updating it.

✏️ Proposed fix
-    # Mock PDF parsing
+    # Mock ODT parsing
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`
at line 215, The inline comment above the ODT fixture in the test function
(test_conversation_with_document_upload) incorrectly says "Mock PDF parsing";
update that comment to accurately state "Mock ODT parsing" (or a more generic
"Mock document parsing") so it matches the fixture used and avoids confusion
when reviewing/mock_document_api is referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/chat/agent_rag/document_converter/odt.py`:
- Around line 15-16: Rename the class OtdToMd to OdtToMd to fix the typo and
keep naming consistent with OdtParsingError and the module name (odt.py); update
any references/usages of OtdToMd (constructor calls, imports, type hints) to the
new OdtToMd identifier so there are no unresolved references.

In
`@src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py`:
- Around line 393-394: The test function test_parse_odt has an incorrect
docstring copied from a PDF test; update the docstring to accurately describe
the ODT parsing behavior being tested (e.g., that it should extract
content/pages/text from an ODT file correctly or validate ODT-specific parsing
edge cases) so it reflects the purpose of test_parse_odt in the Adaptive PDF/ODT
parser tests.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 988-1007: The test uses pdf_base64 but it actually encodes ODT
content and builds a data URL with application/pdf; rename the variable
pdf_base64 to odt_base64 (or similar) where sample_odt_content.read() is
encoded, and update the Attachment.url data URL to use the correct MIME type for
ODT (e.g., application/vnd.oasis.opendocument.text) so UIMessage/Attachment
(fields name, contentType, url) consistently reflect ODT content.

---

Nitpick comments:
In `@src/backend/chat/agent_rag/document_converter/odt.py`:
- Around line 22-25: The except block that currently catches (TypeError,
FileNotFoundError) and raises OdtParsingError should also handle
zipfile.BadZipFile and lxml.etree.XMLSyntaxError so corrupt ZIP structures and
malformed XML inside ODTs are wrapped consistently; update the exception tuple
in the except (used around odfdo.Document parsing, logger.error and raise
OdtParsingError) to include zipfile.BadZipFile and lxml.etree.XMLSyntaxError,
and add the necessary imports for zipfile and lxml.etree at the top of the
module.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Line 215: The inline comment above the ODT fixture in the test function
(test_conversation_with_document_upload) incorrectly says "Mock PDF parsing";
update that comment to accurately state "Mock ODT parsing" (or a more generic
"Mock document parsing") so it matches the fixture used and avoids confusion
when reviewing/mock_document_api is referenced.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: be896f2d-6e79-4df9-9174-9052f5dd3098

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd41e8 and 2bc2150.

⛔ Files ignored due to path filters (1)
  • src/backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • CHANGELOG.md
  • Dockerfile
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • src/backend/chat/agent_rag/document_converter/odt.py
  • src/backend/chat/agent_rag/document_converter/parser.py
  • src/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odt
  • src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
  • src/backend/conversations/settings.py
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
  • src/backend/pyproject.toml

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/backend/chat/agent_rag/document_converter/parser.py (1)

56-57: Mark name as intentionally unused.

This override should keep the base signature, but the parameter is never read. Renaming it to _name keeps the contract intact and clears the static-analysis warning.

Small cleanup
-    def parse_odt_document(self, name: str, content: bytes) -> str:
+    def parse_odt_document(self, _name: str, content: bytes) -> str:
         return OtdToMd().extract(content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/agent_rag/document_converter/parser.py` around lines 56 -
57, The parse_odt_document method currently has an unused parameter name; to
silence static-analysis warnings, rename the parameter to _name (keeping the str
type annotation) in the def parse_odt_document(self, _name: str, content: bytes)
-> str signature and leave the body returning OtdToMd().extract(content)
unchanged (method name: parse_odt_document, helper: OtdToMd.extract).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/chat/agent_rag/document_converter/parser.py`:
- Around line 44-50: BaseParser currently defines parse_pdf_document and
parse_odt_document as NotImplementedError stubs which causes any backend that
still constructs BaseParser (see BaseParser usage in base_rag_backend.py) to
crash on PDF/ODT uploads; either provide a safe concrete fallback in BaseParser
(implement parse_pdf_document and parse_odt_document with a minimal,
non-throwing behavior such as returning an empty string or simple text
extraction) or make parser injection mandatory by changing base_rag_backend.py
to require a concrete parser instance (do not instantiate BaseParser by default)
and surface a clear error if no parser is supplied; update the code paths
referencing BaseParser, parse_pdf_document, and parse_odt_document accordingly
so uploads never hit a NotImplementedError at runtime.

---

Nitpick comments:
In `@src/backend/chat/agent_rag/document_converter/parser.py`:
- Around line 56-57: The parse_odt_document method currently has an unused
parameter name; to silence static-analysis warnings, rename the parameter to
_name (keeping the str type annotation) in the def parse_odt_document(self,
_name: str, content: bytes) -> str signature and leave the body returning
OtdToMd().extract(content) unchanged (method name: parse_odt_document, helper:
OtdToMd.extract).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: e3762c8f-1b32-49ca-8e6d-37aad81e543f

📥 Commits

Reviewing files that changed from the base of the PR and between 2bc2150 and c6e1e54.

⛔ Files ignored due to path filters (1)
  • src/backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • CHANGELOG.md
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • src/backend/chat/agent_rag/document_converter/odt.py
  • src/backend/chat/agent_rag/document_converter/parser.py
  • src/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odt
  • src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
  • src/backend/conversations/settings.py
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
  • src/backend/pyproject.toml
✅ Files skipped from review due to trivial changes (6)
  • src/backend/pyproject.toml
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • src/backend/conversations/settings.py
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
  • src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/backend/chat/agent_rag/document_converter/odt.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py

@providenz providenz force-pushed the providenz/odt-parsing branch from c6e1e54 to e3e0659 Compare March 25, 2026 12:28
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (1)

988-1005: ⚠️ Potential issue | 🟡 Minor

Fix ODT attachment MIME consistency in test payload.

Line 988 uses pdf_base64 for ODT bytes, and Line 1004 sets data:application/pdf while the attachment is ODT. Keep name, contentType, and url MIME aligned.

✏️ Proposed fix
-    pdf_base64 = base64.b64encode(sample_odt_content.read()).decode("utf-8")
+    odt_base64 = base64.b64encode(sample_odt_content.read()).decode("utf-8")
@@
-                url=f"data:application/pdf;base64,{pdf_base64}",
+                url=f"data:application/vnd.oasis.opendocument.text;base64,{odt_base64}",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`
around lines 988 - 1005, The test constructs an ODT attachment but encodes it as
pdf_base64 and uses a PDF data URL; update the payload so the encoded bytes and
MIME match the ODT attachment: rename or replace pdf_base64 with odt_base64 (the
base64 of sample_odt_content), set Attachment.contentType to
"application/vnd.oasis.opendocument.text" (already present) and change the
Attachment.url data URL prefix to
"data:application/vnd.oasis.opendocument.text;base64," so
UIMessage.experimental_attachments (Attachment) has consistent name, contentType
and url values.
src/backend/chat/agent_rag/document_converter/odt.py (1)

15-16: ⚠️ Potential issue | 🟡 Minor

Rename OtdToMd to OdtToMd for consistency.

Line 15 uses OtdToMd, which is inconsistent with odt.py / OdtParsingError naming and makes symbol discovery harder.

✏️ Proposed fix
-class OtdToMd:
+class OdtToMd:
     """Convert an ODT file to Markdown using odfdo."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/agent_rag/document_converter/odt.py` around lines 15 - 16,
Rename the class OtdToMd to OdtToMd to match the file and existing symbols
(e.g., OdtParsingError); update the class declaration and any
references/imports/usages of OtdToMd across the codebase to OdtToMd (including
tests, type hints, and factory/creator calls) and keep the existing docstring
text intact but ensure it reflects the new class name if referenced; run tests
to ensure no remaining references to OtdToMd remain.
🧹 Nitpick comments (2)
src/backend/chat/agent_rag/document_converter/parser.py (1)

56-57: Mark intentionally unused parameter explicitly.

Line 56 keeps name for signature compatibility, but it is unused. Consider _name to make intent explicit and quiet static warnings.

♻️ Suggested tweak
-    def parse_odt_document(self, name: str, content: bytes) -> str:
+    def parse_odt_document(self, _name: str, content: bytes) -> str:
         return OtdToMd().extract(content)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/backend/chat/agent_rag/document_converter/parser.py` around lines 56 -
57, The parameter name in parse_odt_document is intentionally unused but
currently named `name`; to make this explicit and silence static warnings,
rename the parameter to `_name: str` (or prefix it with an underscore) in the
parse_odt_document signature and keep the body returning
OtdToMd().extract(content) unchanged; update any callers if necessary to
preserve compatibility while keeping the unused parameter clearly marked.
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (1)

197-277: Consider factoring duplicated API mock setup into a helper.

fixture_mock_odt_document_api duplicates most of fixture_mock_document_api; this increases maintenance cost for future endpoint/shape changes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`
around lines 197 - 277, The fixture_fixture_mock_odt_document_api duplicates
most of fixture_mock_document_api; extract the repeated responses setup into a
shared helper (e.g., register_albert_and_find_mocks or build_document_api_mocks)
that accepts parameters like document_name, document_content, prompt_tokens,
completion_tokens, search_method, search_score and registers the common
responses for the /v1/collections, /v1/parse-beta, /v1/documents, /v1/search and
the find API endpoints; then call this helper from both
fixture_mock_odt_document_api and fixture_mock_document_api to remove
duplication and keep only odt-specific overrides in
fixture_mock_odt_document_api.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/backend/chat/agent_rag/document_converter/odt.py`:
- Around line 22-26: Update the exception handling around the
odfdo.Document(BytesIO(content)) call to also catch zipfile.BadZipFile and
lxml.etree.XMLSyntaxError in addition to TypeError and FileNotFoundError so all
parsing failures are wrapped as OdtParsingError; keep the existing
logger.error("Failed to parse ODT document: %s", e) and re-raise
OdtParsingError(_("Failed to parse ODT document: %(error)s") % {"error": e})
from e so errors from odfdo.Document, corrupt zip archives, and malformed XML
are all handled (refer to odfdo.Document, BytesIO(content), logger.error, and
OdtParsingError).

---

Duplicate comments:
In `@src/backend/chat/agent_rag/document_converter/odt.py`:
- Around line 15-16: Rename the class OtdToMd to OdtToMd to match the file and
existing symbols (e.g., OdtParsingError); update the class declaration and any
references/imports/usages of OtdToMd across the codebase to OdtToMd (including
tests, type hints, and factory/creator calls) and keep the existing docstring
text intact but ensure it reflects the new class name if referenced; run tests
to ensure no remaining references to OtdToMd remain.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 988-1005: The test constructs an ODT attachment but encodes it as
pdf_base64 and uses a PDF data URL; update the payload so the encoded bytes and
MIME match the ODT attachment: rename or replace pdf_base64 with odt_base64 (the
base64 of sample_odt_content), set Attachment.contentType to
"application/vnd.oasis.opendocument.text" (already present) and change the
Attachment.url data URL prefix to
"data:application/vnd.oasis.opendocument.text;base64," so
UIMessage.experimental_attachments (Attachment) has consistent name, contentType
and url values.

---

Nitpick comments:
In `@src/backend/chat/agent_rag/document_converter/parser.py`:
- Around line 56-57: The parameter name in parse_odt_document is intentionally
unused but currently named `name`; to make this explicit and silence static
warnings, rename the parameter to `_name: str` (or prefix it with an underscore)
in the parse_odt_document signature and keep the body returning
OtdToMd().extract(content) unchanged; update any callers if necessary to
preserve compatibility while keeping the unused parameter clearly marked.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 197-277: The fixture_fixture_mock_odt_document_api duplicates most
of fixture_mock_document_api; extract the repeated responses setup into a shared
helper (e.g., register_albert_and_find_mocks or build_document_api_mocks) that
accepts parameters like document_name, document_content, prompt_tokens,
completion_tokens, search_method, search_score and registers the common
responses for the /v1/collections, /v1/parse-beta, /v1/documents, /v1/search and
the find API endpoints; then call this helper from both
fixture_mock_odt_document_api and fixture_mock_document_api to remove
duplication and keep only odt-specific overrides in
fixture_mock_odt_document_api.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: d24dd655-d76d-4fd4-baec-dfc782d57a7c

📥 Commits

Reviewing files that changed from the base of the PR and between c6e1e54 and e3e0659.

⛔ Files ignored due to path filters (1)
  • src/backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • CHANGELOG.md
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • src/backend/chat/agent_rag/document_converter/odt.py
  • src/backend/chat/agent_rag/document_converter/parser.py
  • src/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odt
  • src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
  • src/backend/conversations/settings.py
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
  • src/backend/pyproject.toml
  • src/frontend/apps/e2e/__tests__/app-conversations/common.ts
✅ Files skipped from review due to trivial changes (5)
  • CHANGELOG.md
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • src/backend/pyproject.toml
  • src/frontend/apps/e2e/tests/app-conversations/common.ts
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/backend/conversations/settings.py
  • src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py

@providenz providenz force-pushed the providenz/odt-parsing branch from e3e0659 to 043d17a Compare March 25, 2026 13:20
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (1)

988-1005: ⚠️ Potential issue | 🟡 Minor

Fix ODT data URL MIME mismatch in attachment payload.

At Line 988, pdf_base64 encodes ODT bytes, and at Line 1004 the URL still uses data:application/pdf. This is inconsistent with contentType="application/vnd.oasis.opendocument.text" and can hide routing bugs.

✏️ Proposed fix
-    pdf_base64 = base64.b64encode(sample_odt_content.read()).decode("utf-8")
+    odt_base64 = base64.b64encode(sample_odt_content.read()).decode("utf-8")
@@
-                url=f"data:application/pdf;base64,{pdf_base64}",
+                url=f"data:application/vnd.oasis.opendocument.text;base64,{odt_base64}",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`
around lines 988 - 1005, The attachment payload builds a data URL with
pdf_base64 (ODT bytes) but uses the wrong MIME type "application/pdf"; update
the data URL to match Attachment.contentType
("application/vnd.oasis.opendocument.text") so the UIMessage/Attachment created
in the test (variables: pdf_base64, sample_odt_content, UIMessage, Attachment,
TextUIPart) uses a consistent MIME type for the data: URL.
🧹 Nitpick comments (2)
src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py (2)

197-278: Extract a shared API-mock builder to avoid drift.

This ODT fixture is almost identical to fixture_mock_document_api; only document-specific values differ. Consolidating into one helper/parametrized fixture will reduce maintenance risk.

♻️ Proposed refactor sketch
+def _register_document_api_mocks(*, document_name: str, document_content: str):
+    prompt_tokens = 10
+    completion_tokens = 20
+    search_method = "semantic"
+    search_score = 0.9
+
+    responses.post("https://albert.api.etalab.gouv.fr/v1/collections", json={"id": "123", "name": "test-collection"}, status=status.HTTP_200_OK)
+    responses.post(
+        "https://albert.api.etalab.gouv.fr/v1/parse-beta",
+        json={
+            "data": [{"content": document_content, "metadata": {"document_name": document_name}}],
+            "usage": {"prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens},
+        },
+        status=status.HTTP_200_OK,
+    )
+    responses.post("https://albert.api.etalab.gouv.fr/v1/documents", json={"id": 456}, status=status.HTTP_201_CREATED)
+    responses.post(
+        "https://albert.api.etalab.gouv.fr/v1/search",
+        json={
+            "data": [{
+                "method": search_method,
+                "chunk": {"id": 123, "content": document_content, "metadata": {"document_name": document_name}},
+                "score": search_score,
+            }],
+            "usage": {"prompt_tokens": prompt_tokens, "completion_tokens": completion_tokens},
+        },
+        status=status.HTTP_200_OK,
+    )
+    responses.post("https://find.api.example.com/api/v1.0/documents/index/", json={"id": "456", "status": "indexed"}, status=status.HTTP_200_OK)
+    responses.post(
+        "https://find.api.example.com/api/v1.0/documents/search/",
+        json=[{"_source": {"title.fr": document_name, "content.fr": document_content}, "_score": search_score}],
+        status=status.HTTP_200_OK,
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`
around lines 197 - 278, The ODT-specific responses in
fixture_mock_odt_document_api largely duplicate fixture_mock_document_api;
extract a shared API-mock builder (e.g., a helper function
build_mock_document_api or a parametrized fixture) that accepts document_name,
document_content, prompt_tokens, completion_tokens, search_method, and
search_score, then have fixture_mock_odt_document_api and
fixture_mock_document_api call that helper with their specific values; update
places that previously referenced fixture_mock_odt_document_api to use the new
helper/parametrized fixture and remove duplicated response setup inside
fixture_mock_odt_document_api.

971-1244: Consider parametrizing PDF/ODT conversation tests.

This test duplicates most of test_post_conversation_with_document_upload with only format-specific values changed. A parametrized test would keep assertions aligned across formats and reduce future copy/paste regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`
around lines 971 - 1244, The two near-duplicate tests
(test_post_conversation_with_document_upload and
test_post_conversation_with_odt_document_upload) should be merged into a single
parametrized test; replace these separate functions with one test function
(e.g., test_post_conversation_with_document_upload_parametrized) decorated with
pytest.mark.parametrize over format-specific values (contentType, filename,
sample content fixture, expected mime/data URL prefix, and expected extracted
text like "This is the content of the ODT."). In the body, use the same setup
logic currently in test_post_conversation_with_odt_document_upload (creating
ChatConversationFactory, forcing auth, encoding sample content, building
UIMessage with experimental_attachments) but substitute the parametrized
variables for Attachment(name, contentType, url) and expected response strings;
keep agent_model, response assertions, and pydantic_messages assertions but
derive format-specific expected values from the parameters (e.g., attachment
filename and expected source url/title). Ensure you reference and reuse the
existing symbols: UIMessage, Attachment, TextUIPart,
mock_ai_agent_service(FunctionModel(...)), replace_uuids_with_placeholder, and
chat_conversation.pydantic_messages to maintain all existing assertions while
removing duplicated code.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 988-1005: The attachment payload builds a data URL with pdf_base64
(ODT bytes) but uses the wrong MIME type "application/pdf"; update the data URL
to match Attachment.contentType ("application/vnd.oasis.opendocument.text") so
the UIMessage/Attachment created in the test (variables: pdf_base64,
sample_odt_content, UIMessage, Attachment, TextUIPart) uses a consistent MIME
type for the data: URL.

---

Nitpick comments:
In
`@src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py`:
- Around line 197-278: The ODT-specific responses in
fixture_mock_odt_document_api largely duplicate fixture_mock_document_api;
extract a shared API-mock builder (e.g., a helper function
build_mock_document_api or a parametrized fixture) that accepts document_name,
document_content, prompt_tokens, completion_tokens, search_method, and
search_score, then have fixture_mock_odt_document_api and
fixture_mock_document_api call that helper with their specific values; update
places that previously referenced fixture_mock_odt_document_api to use the new
helper/parametrized fixture and remove duplicated response setup inside
fixture_mock_odt_document_api.
- Around line 971-1244: The two near-duplicate tests
(test_post_conversation_with_document_upload and
test_post_conversation_with_odt_document_upload) should be merged into a single
parametrized test; replace these separate functions with one test function
(e.g., test_post_conversation_with_document_upload_parametrized) decorated with
pytest.mark.parametrize over format-specific values (contentType, filename,
sample content fixture, expected mime/data URL prefix, and expected extracted
text like "This is the content of the ODT."). In the body, use the same setup
logic currently in test_post_conversation_with_odt_document_upload (creating
ChatConversationFactory, forcing auth, encoding sample content, building
UIMessage with experimental_attachments) but substitute the parametrized
variables for Attachment(name, contentType, url) and expected response strings;
keep agent_model, response assertions, and pydantic_messages assertions but
derive format-specific expected values from the parameters (e.g., attachment
filename and expected source url/title). Ensure you reference and reuse the
existing symbols: UIMessage, Attachment, TextUIPart,
mock_ai_agent_service(FunctionModel(...)), replace_uuids_with_placeholder, and
chat_conversation.pydantic_messages to maintain all existing assertions while
removing duplicated code.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2e49a0ed-402a-4ae0-9a01-e68e2f23b20f

📥 Commits

Reviewing files that changed from the base of the PR and between e3e0659 and 043d17a.

⛔ Files ignored due to path filters (1)
  • src/backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • CHANGELOG.md
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • src/backend/chat/agent_rag/document_converter/odt.py
  • src/backend/chat/agent_rag/document_converter/parser.py
  • src/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odt
  • src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
  • src/backend/conversations/settings.py
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
  • src/backend/pyproject.toml
  • src/frontend/apps/e2e/__tests__/app-conversations/common.ts
✅ Files skipped from review due to trivial changes (4)
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • CHANGELOG.md
  • src/backend/pyproject.toml
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/frontend/apps/e2e/tests/app-conversations/common.ts
  • src/backend/chat/agent_rag/document_converter/odt.py
  • src/backend/conversations/settings.py
  • src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
  • src/backend/chat/agent_rag/document_converter/parser.py

@providenz providenz force-pushed the providenz/odt-parsing branch from 043d17a to cbe1cd1 Compare March 25, 2026 13:51
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
docs/async-tasks-analysis.md (1)

35-35: Consider completing sentence fragments for clarity.

Lines 35, 61, and 80 contain sentence fragments missing subjects ("Could be..." without stating what "could be"). While understandable from context, complete sentences improve readability in technical documentation.

✍️ Suggested grammatical improvements
-**Trade-off**: Pre-computing costs LLM tokens even if the user never asks for a summary. Could be opt-in per project or triggered lazily on first access then cached.
+**Trade-off**: Pre-computing costs LLM tokens even if the user never asks for a summary. This could be opt-in per project or triggered lazily on first access then cached.
-**Alternative (simpler, no LLM call)**: Implement a `history_processor` in PydanticAI that truncates/trims old messages at request time (sliding window). Cheaper but loses context. Could be a good first step before full LLM-based compaction.
+**Alternative (simpler, no LLM call)**: Implement a `history_processor` in PydanticAI that truncates/trims old messages at request time (sliding window). Cheaper but loses context. This could be a good first step before full LLM-based compaction.
-Multiple sync Langfuse SDK calls block the event loop. Could be offloaded, but `asyncio.to_thread()` is the simpler fix for these.
+Multiple sync Langfuse SDK calls block the event loop. These could be offloaded, but `asyncio.to_thread()` is the simpler fix for them.

Also applies to: 61-61, 80-80

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/async-tasks-analysis.md` at line 35, Several sentences in the doc are
fragments lacking an explicit subject (e.g., "Could be opt-in per project or
triggered lazily on first access then cached."): update the phrases at the
occurrences of "Could be..." (lines referencing the fragment text) to full
sentences by adding a subject and verb—e.g., "This can be made opt-in per
project, or it can be triggered lazily on first access and then cached."—apply
the same pattern to the other fragments so each "Could be..." becomes a complete
sentence that preserves meaning and tone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/async-tasks-analysis.md`:
- Around line 65-78: The document has duplicate section numbering ("### 4." used
twice); rename the second "### 4. Periodic maintenance (LOW VALUE, future need)"
to "### 5. Periodic maintenance (LOW VALUE, future need)" and then update "###
5. Langfuse trace updates (LOW VALUE)" to "### 6. Langfuse trace updates (LOW
VALUE)" so the headings "Title generation", "Periodic maintenance", and
"Langfuse trace updates" have consecutive, unique numbers.

In
`@src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py`:
- Around line 408-414: Test asserts the wrong call signature for the ODT route:
BaseParser.parse_document only forwards the file content to parse_odt_document,
not a name= kwarg. Update the assertion that references
parser.parse_odt_document (and the test around parser.parse_document) to expect
a single positional content argument (e.g.,
mock_parse.assert_called_once_with(sample_odt)) or the equivalent content-only
call rather than mock_parse.assert_called_once_with(name="sample.odt",
content=sample_odt); keep the rest of the test unchanged.

---

Nitpick comments:
In `@docs/async-tasks-analysis.md`:
- Line 35: Several sentences in the doc are fragments lacking an explicit
subject (e.g., "Could be opt-in per project or triggered lazily on first access
then cached."): update the phrases at the occurrences of "Could be..." (lines
referencing the fragment text) to full sentences by adding a subject and
verb—e.g., "This can be made opt-in per project, or it can be triggered lazily
on first access and then cached."—apply the same pattern to the other fragments
so each "Could be..." becomes a complete sentence that preserves meaning and
tone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 94dd5c97-21aa-46de-89fb-75bd5718a076

📥 Commits

Reviewing files that changed from the base of the PR and between 043d17a and cbe1cd1.

⛔ Files ignored due to path filters (1)
  • src/backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • CHANGELOG.md
  • docs/async-tasks-analysis.md
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • src/backend/chat/agent_rag/document_converter/odt.py
  • src/backend/chat/agent_rag/document_converter/parser.py
  • src/backend/chat/tests/agent_rag/document_converter/fixtures/sample.odt
  • src/backend/chat/tests/agent_rag/document_converter/test_adaptive_pdf_parser.py
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py
  • src/backend/conversations/settings.py
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
  • src/backend/pyproject.toml
  • src/frontend/apps/e2e/__tests__/app-conversations/common.ts
✅ Files skipped from review due to trivial changes (4)
  • src/backend/chat/agent_rag/document_converter/markitdown.py
  • src/backend/core/tests/file_upload/test_generate_temporary_url.py
  • src/backend/pyproject.toml
  • CHANGELOG.md
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/backend/conversations/settings.py
  • src/frontend/apps/e2e/tests/app-conversations/common.ts
  • src/backend/chat/tests/views/chat/conversations/test_conversation_with_document_upload.py

Comment on lines +65 to +78
### 4. Title generation (MEDIUM VALUE)

**Current flow**: `_generate_title()` runs after streaming completes - an extra LLM call that delays the final response.

**With background tasks**: Fire-and-forget after the stream ends. Title appears asynchronously in the UI.

### 4. Periodic maintenance (LOW VALUE, future need)

No cleanup tasks exist today. Could eventually need:
- Orphaned vector store collection cleanup
- Old conversation purging
- Usage analytics aggregation

### 5. Langfuse trace updates (LOW VALUE)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix duplicate section numbering.

Lines 65 and 71 both use "### 4." for different sections. This creates confusion in the document structure.

📝 Proposed fix for section numbering
 ### 4. Title generation (MEDIUM VALUE)
 
 **Current flow**: `_generate_title()` runs after streaming completes - an extra LLM call that delays the final response.
 
 **With background tasks**: Fire-and-forget after the stream ends. Title appears asynchronously in the UI.
 
-### 4. Periodic maintenance (LOW VALUE, future need)
+### 5. Periodic maintenance (LOW VALUE, future need)
 
 No cleanup tasks exist today. Could eventually need:
 - Orphaned vector store collection cleanup
 - Old conversation purging
 - Usage analytics aggregation
 
-### 5. Langfuse trace updates (LOW VALUE)
+### 6. Langfuse trace updates (LOW VALUE)
 
 Multiple sync Langfuse SDK calls block the event loop. Could be offloaded, but `asyncio.to_thread()` is the simpler fix for these.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
### 4. Title generation (MEDIUM VALUE)
**Current flow**: `_generate_title()` runs after streaming completes - an extra LLM call that delays the final response.
**With background tasks**: Fire-and-forget after the stream ends. Title appears asynchronously in the UI.
### 4. Periodic maintenance (LOW VALUE, future need)
No cleanup tasks exist today. Could eventually need:
- Orphaned vector store collection cleanup
- Old conversation purging
- Usage analytics aggregation
### 5. Langfuse trace updates (LOW VALUE)
### 4. Title generation (MEDIUM VALUE)
**Current flow**: `_generate_title()` runs after streaming completes - an extra LLM call that delays the final response.
**With background tasks**: Fire-and-forget after the stream ends. Title appears asynchronously in the UI.
### 5. Periodic maintenance (LOW VALUE, future need)
No cleanup tasks exist today. Could eventually need:
- Orphaned vector store collection cleanup
- Old conversation purging
- Usage analytics aggregation
### 6. Langfuse trace updates (LOW VALUE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/async-tasks-analysis.md` around lines 65 - 78, The document has
duplicate section numbering ("### 4." used twice); rename the second "### 4.
Periodic maintenance (LOW VALUE, future need)" to "### 5. Periodic maintenance
(LOW VALUE, future need)" and then update "### 5. Langfuse trace updates (LOW
VALUE)" to "### 6. Langfuse trace updates (LOW VALUE)" so the headings "Title
generation", "Periodic maintenance", and "Langfuse trace updates" have
consecutive, unique numbers.

@providenz providenz force-pushed the providenz/odt-parsing branch 2 times, most recently from 97a596f to 97e2c05 Compare March 25, 2026 15:06
- Add odfdo dependency for ODT-to-markdown conversion
- Refactor BaseParser to route by content type (PDF, ODT, other)
- Extract OdtParserMixin and AdaptivePdfParserMixin for composability
- Add OdtParsingError for corrupt/empty ODT files
- Accept ODT in RAG upload formats, add pandoc to Docker image
- Add tests for PDF/ODT routing, adaptive method selection, and ODT errors

Signed-off-by: Laurent Paoletti <lp@providenz.fr>
@providenz providenz force-pushed the providenz/odt-parsing branch from 97e2c05 to bc28e05 Compare March 25, 2026 15:11
@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
35.3% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant