Commit 6ba9228

feat: Add DocX parsing strategy to enable configuring convert-to-PDF (#173)
Adds an optional `docx_parsing_strategy` field (default `None`) to the SDK's `ParserConfig`, mirroring `docx_parsing_strategy` in `ParserConfig` in [compass-parser](https://github.com/cohere-ai/compass-parser/blob/e0ea790c040de8dc2246c96de63d2598264692dd/parser/compass_parser/parser_types.py#L410).

Following the pattern established by `presentation_parsing_strategy`, the SDK by default does not pass an explicit value for `docx_parsing_strategy` and lets the server decide.

**Related PRs**
- cohere-ai/compass-parser#247 - Corresponding compass-parser PR
- #165 - Similar, but adds the convert-to-PDF presentation strategy

**Blocking**
- Once this PR is merged and the version is bumped and released, [this PR adding e2e docx parsing eval can be merged](cohere-ai/compass-eval#21).
1 parent 768a1cb commit 6ba9228

2 files changed: +18 -5 lines changed
cohere_compass/models/config.py

Lines changed: 16 additions & 3 deletions
@@ -113,6 +113,19 @@ def _missing_(cls, value: Any):
         return cls.Unstructured
 
 
+class DocxParsingStrategy(str, Enum):
+    """Enum for specifying the parsing strategy for DOCX files."""
+
+    # Uses https://github.com/microsoft/markitdown
+    MarkItDown = "MarkItDown"
+    # Converts the DOCX to PDF and uses the PDF parsing strategy
+    ConvertToPDF = "ConvertToPDF"
+
+    @classmethod
+    def _missing_(cls, value: Any):
+        return cls.MarkItDown
+
+
 class ParsingStrategy(str, Enum):
     """Enum for specifying the parsing strategy to use."""
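The `_missing_` hook in the new enum means any unrecognized value resolves to `MarkItDown` instead of raising `ValueError`. A minimal standalone sketch of that behavior, using a local copy of the enum from the diff above (the sample input strings are illustrative only):

```python
from enum import Enum
from typing import Any


class DocxParsingStrategy(str, Enum):
    """Local copy of the enum from the diff, so this sketch runs standalone."""

    MarkItDown = "MarkItDown"
    ConvertToPDF = "ConvertToPDF"

    @classmethod
    def _missing_(cls, value: Any):
        # Enum calls this hook only when no member matches `value`.
        return cls.MarkItDown


# Exact values resolve to their own members.
assert DocxParsingStrategy("ConvertToPDF") is DocxParsingStrategy.ConvertToPDF

# Unknown values, including case mismatches, fall back to MarkItDown.
assert DocxParsingStrategy("converttopdf") is DocxParsingStrategy.MarkItDown
assert DocxParsingStrategy("SomethingElse") is DocxParsingStrategy.MarkItDown

# Because the enum subclasses str, members compare equal to plain strings.
assert DocxParsingStrategy.ConvertToPDF == "ConvertToPDF"
```

One consequence worth noting: a misspelled strategy name is silently mapped to `MarkItDown`, so callers who want `ConvertToPDF` need the exact spelling.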

@@ -193,12 +206,12 @@ class ParserConfig(BaseModel):
     vertical_table_crop_margin: int = 100
     horizontal_table_crop_margin: int = 100
 
+    pdf_parsing_config: PDFParsingConfig = PDFParsingConfig()
     pdf_parsing_strategy: PDFParsingStrategy = PDFParsingStrategy.QuickText
     tabular_parsing_strategy: TabularParsingStrategy = TabularParsingStrategy.Granular
-
-    pdf_parsing_config: PDFParsingConfig = PDFParsingConfig()
-
     presentation_parsing_strategy: PresentationParsingStrategy | None = None
+    docx_parsing_strategy: DocxParsingStrategy | None = None
+
     enable_assets_returned_as_base64: bool = True
 
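The `None` default is what lets the server pick the DOCX strategy when the caller does not set one. The sketch below illustrates that idea with a standard-library dataclass standing in for the pydantic model. `ParserConfigSketch` and `to_payload` are hypothetical names introduced for illustration; how the SDK actually serializes `ParserConfig` is not shown in this commit.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Any, Optional


class DocxParsingStrategy(str, Enum):
    MarkItDown = "MarkItDown"
    ConvertToPDF = "ConvertToPDF"


@dataclass
class ParserConfigSketch:
    """Hypothetical stand-in for ParserConfig with only two of its fields."""

    enable_assets_returned_as_base64: bool = True
    docx_parsing_strategy: Optional[DocxParsingStrategy] = None


def to_payload(config: ParserConfigSketch) -> dict[str, Any]:
    """Hypothetical helper: omit unset (None) fields from the request body."""
    payload: dict[str, Any] = {}
    for key, value in asdict(config).items():
        if value is None:
            continue  # Leave the choice to the server.
        payload[key] = value.value if isinstance(value, Enum) else value
    return payload


# Default config: no DOCX strategy is sent at all.
default_payload = to_payload(ParserConfigSketch())

# Explicit opt-in: the strategy is sent as its string value.
explicit_payload = to_payload(
    ParserConfigSketch(docx_parsing_strategy=DocxParsingStrategy.ConvertToPDF)
)
```

With a pydantic v2 model, `model_dump(exclude_none=True)` has a similar effect, though whether the SDK relies on it is an assumption here.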

pyproject.toml

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "cohere-compass-sdk"
-version = "2.2.1"
+version = "2.2.2"
 authors = []
 description = "Cohere Compass SDK"
 readme = "README.md"
@@ -93,4 +93,4 @@ omit = [
 ]
 
 [tool.coverage.html]
-directory = "coverage_html"
+directory = "coverage_html"
