Commit 6f54d3f

Merge remote-tracking branch 'origin/main' into pprados/fix_password
# Conflicts:
#	requirements/ingest/ingest.txt
#	test_unstructured/partition/pdf_image/test_pdf.py
#	unstructured/partition/pdf.py
#	unstructured/partition/pdf_image/pdfminer_processing.py
2 parents 7cf19df + 4140f62 commit 6f54d3f

62 files changed (+8998 -1358 lines changed)

.github/workflows/ingest-test-fixtures-update-pr.yml

Lines changed: 2 additions & 0 deletions
@@ -94,6 +94,8 @@ jobs:
 AZURE_SEARCH_API_KEY: ${{ secrets.AZURE_SEARCH_API_KEY }}
 OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
 OCTOAI_API_KEY: ${{ secrets.OCTOAI_API_KEY }}
+ASTRA_DB_APPLICATION_TOKEN: ${{secrets.ASTRA_DB_TOKEN}}
+ASTRA_DB_API_ENDPOINT: ${{secrets.ASTRA_DB_ENDPOINT}}
 OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
 OVERWRITE_FIXTURES: "true"
 CI: "true"

CHANGELOG.md

Lines changed: 70 additions & 1 deletion
@@ -1,4 +1,71 @@
-## 0.16.3-dev1
+## 0.16.9
+
+### Enhancements
+
+### Features
+
+### Fixes
+
+- **Fix NLTK Download** to not download from unstructured S3 Bucket
+
+## 0.16.8
+
+### Enhancements
+- **Metrics: Weighted table average is optional**
+
+### Features
+
+### Fixes
+
+## 0.16.7
+
+### Enhancements
+- **Add image_alt_mode to partition_html** Adds an `image_alt_mode` parameter to `partition_html()` to control how alt text is extracted from images in HTML documents for `html_parser_version=v2` . The parameter can be set to `to_text` to extract alt text as text from `<img>` html tags
+
+### Features
+
+### Fixes
+
+
+## 0.16.6
+
+### Enhancements
+- **Every `<table>` tag is considered to be ontology.Table** Added special handling for tables in HTML partitioning (`html_parser_version=v2`. This change is made to improve the accuracy of table extraction from HTML documents.
+- **Every HTML has default ontology class assigned** When parsing HTML with `html_parser_version=v2` to ontology each defined HTML in the Ontology has assigned default ontology class. This way it is possible to assign ontology class instead of UncategorizedText when the HTML tag is predicted correctly without class assigned class
+- **Use (number of actual table) weighted average for table metrics** In evaluating table metrics the mean aggregation now uses the actual number of tables in a document to weight the metric scores
+
+### Features
+
+### Fixes
+- **ElementMetadata consolidation** Now `text_as_html` metadata is combined across all elements in CompositeElement when chunking HTML output
+
+## 0.16.5
+
+### Enhancements
+
+### Features
+
+### Fixes
+- **Fixes parsing HTML v2 parser** Now max recursion limit is set and value is correctly extracted from ontology element
+
+
+## 0.16.4
+
+### Enhancements
+
+* **`value` attribute in `<input/>` element is parsed to `OntologyElement.text` in ontology**
+* **`id` and `class` attributes removed from Table subtags in HTML partitioning**
+* **cleaned `to_html` and newly introduced `to_text` in `OntologyElement`**
+* **Elements created from V2 HTML are less granular** Added merging of adjacent text elements and inline html tags in the HTML partitioner to reduce the number of elements created from V2 HTML.
+
+### Features
+
+* **Add support for link extraction in pdf hi_res strategy.** The `partition_pdf()` function now supports link extraction when using the `hi_res` strategy, allowing users to extract hyperlinks from PDF documents more effectively.
+
+### Fixes
+
+
+## 0.16.3
 
 ### Enhancements
 
@@ -8,6 +75,8 @@
 
 * **Use password** to load PDF with all modes
 * **V2 elements without first parent ID can be parsed**
+* **Fix missing elements when layout element parsed in V2 ontology**
+* updated **unstructured-inference** to be **0.8.1** in requirements/extra-pdf-image.in
 
 ## 0.16.2
 
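Note on the table-metrics entries in the CHANGELOG hunk (0.16.6 and 0.16.8): the changelog says the mean aggregation weights each document's score by its actual number of tables, and that this weighting was later made optional. The following is only a rough sketch of that idea, not the library's implementation; the function name `aggregate_table_scores` and the `weighted` flag are hypothetical and do not come from the commit.

```python
from typing import Sequence


def aggregate_table_scores(
    scores: Sequence[float],
    table_counts: Sequence[int],
    weighted: bool = True,  # hypothetical flag mirroring "weighted average is optional"
) -> float:
    """Aggregate per-document table metric scores into one mean.

    scores[i] is the metric for document i; table_counts[i] is the number
    of actual tables in that document. Names here are illustrative only.
    """
    if len(scores) != len(table_counts):
        raise ValueError("scores and table_counts must have the same length")
    if not scores:
        raise ValueError("at least one document is required")

    if not weighted:
        # Plain mean: every document counts equally.
        return sum(scores) / len(scores)

    total_tables = sum(table_counts)
    if total_tables == 0:
        raise ValueError("weighted mean is undefined when no document has tables")
    # Weighted mean: documents with more tables contribute proportionally more.
    return sum(s * n for s, n in zip(scores, table_counts)) / total_tables


doc_scores = [1.0, 0.5]   # per-document metric values (made-up numbers)
doc_tables = [1, 3]       # tables per document (made-up numbers)
weighted_mean = aggregate_table_scores(doc_scores, doc_tables)
plain_mean = aggregate_table_scores(doc_scores, doc_tables, weighted=False)
```

With these made-up inputs the weighted mean is pulled toward the three-table document's score, while the plain mean treats both documents equally.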