Commit 76a09de

Merge branch 'main' into pprados/fix_password
2 parents 2b1a401 + 0245661 commit 76a09de

100 files changed (+3314 -16319 lines changed)

.github/workflows/ci.yml

Lines changed: 1 addition & 0 deletions
@@ -398,3 +398,4 @@ jobs:
 image: "unstructured:dev"
 severity-cutoff: critical
 only-fixed: true
+output-format: table

.grype.yaml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+ignore:
+- vulnerability: CVE-2024-11053

CHANGELOG.md

Lines changed: 31 additions & 2 deletions
@@ -1,8 +1,37 @@
-## 0.16.10-dev0
+## 0.16.12-dev5
 
 ### Enhancements
 
-- **Enhance quote standardization tests with additional Unicode scenarios
+- **Prepare auto-partitioning for pluggable partitioners**. Move toward a uniform partitioner call signature so a custom or override partitioner can be registered without code changes.
+- **Add NDJSON file type support**
+
+### Features
+
+### Fixes
+
+- Base image has been updated, trigger new workflows
+- **Upgrade ruff to latest.** Previously the ruff version was pinned to <0.5. Remove that pin and fix the handful of lint items that resulted.
+- **CSV with asserted XLS content-type is correctly identified as CSV.** Resolves a bug where a CSV file with an asserted content-type of `application/vnd.ms-excel` was incorrectly identified as an XLS file.
+- **Improve element-type mapping for Chinese text.** Fixes bug where Chinese text would produce large numbers of false-positive `Title` elements.
+- **Improve element-type mapping for HTML.** Fixes bug where certain non-title elements were classified as `Title`.
+
+## 0.16.11
+
+### Enhancements
+
+- **Enhance quote standardization tests** with additional Unicode scenarios
+- **Relax table segregation rule in chunking.** Previously a `Table` element was always segregated into its own pre-chunk such that the `Table` appeared alone in a chunk or was split into multiple `TableChunk` elements, but never combined with `Text`-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
+- **Compute chunk length based solely on `element.text`.** Previously `.metadata.text_as_html` was also considered and since it is always longer than the text (due to HTML tag overhead) it was the effective length criterion. Remove text-as-html from the length calculation such that text-length is the sole criterion for sizing a chunk.
+
+### Features
+
+### Fixes
+
+- Fix ipv4 regex to correctly include up to three digit octets.
+
+## 0.16.10
+
+### Enhancements
 
 ### Features
 
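The 0.16.11 changelog entry about the IPv4 regex says only that octets of up to three digits should be matched; the pattern itself is not shown in this part of the diff. As a rough illustration of the idea, here is a minimal Python sketch. The pattern `IPV4_CANDIDATE` and the helper `find_ipv4` are hypothetical names made up for this example and are not taken from the library.

```python
import re

# Hypothetical pattern for illustration only; not the regex used by the library.
# Each octet is 1 to 3 digits. The lookarounds stop the match from starting or
# ending in the middle of a longer run of digits and dots.
IPV4_CANDIDATE = re.compile(
    r"(?<![\d.])(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?![\d.])"
)


def find_ipv4(text: str) -> list[str]:
    """Return dotted-quad substrings of `text` whose octets are all in 0..255."""
    found = []
    for match in IPV4_CANDIDATE.finditer(text):
        # The regex only limits digit count; the numeric range is checked here.
        if all(int(octet) <= 255 for octet in match.groups()):
            found.append(match.group(0))
    return found
```

Keeping the numeric range check outside the regex keeps the pattern short; a single-regex version with alternations per octet would also work but is harder to read.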

CONTRIBUTING.md

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
+## Contributing to Unstructured
+
+[![Contributor Covenant](https://img.shields.io/badge/Contributor%20Covenant-2.1-4baaaa.svg)](code_of_conduct.md)
+
+👍🎉 First off, thank you for taking the time to contribute! 🎉👍
+
+The following is a set of guidelines for contributing to the open source ecosystem of preprocessing pipeline APIs and supporting libraries hosted [here](https://github.com/Unstructured-IO).
+
+This is meant to help the review process go smoothly, save the reviewer(s) time in catching common issues, and avoid submitting PRs that will be rejected by the CI.
+
+In some cases it's convenient to put up a PR that's not ready for final review. This is fine (and under those circumstances it's not necessary to go through this checklist), but the PR should be put in draft mode so everyone knows it's not ready for review.
+
+### How to Contribute?
+
+If you want to contribute, start working through the Unstructured codebase, navigate to the GitHub "issues" tab and start looking through interesting issues. If you are not sure where to start, try one of the smaller/easier issues, i.e. those with the "good first issue" label, and then take a look at the issues with the "contributions welcome" label. These are issues that we believe are particularly well suited for outside contributions, often because we probably won't get to them right now. If you decide to start on an issue, leave a comment so that other people know that you're working on it. If you want to help out, but not alone, use the issue comment thread to coordinate.
+
+
+## Pull-Request Checklist
+
+The following is a list of tasks to be completed before submitting a pull request for final review.
+
+### Before creating PR:
+
+1. Follow coding best practices
+   1. [ ] Make sure all new classes/functions/methods have docstrings.
+   1. [ ] Make sure all new functions/methods have type hints (optional for tests).
+   1. [ ] Make sure all new functions/methods have associated tests.
+   1. [ ] Update `CHANGELOG.md` and `__version__.py` if the core code has changed
+   <br/><br/>
+1. Ensure environment is consistent
+   1. [ ] Update dependencies in `.in` files if needed (pay special attention to whether the current PR depends on changes to internal repos that are not packaged - if so the commit needs to be bumped).
+   1. [ ] If dependencies have changed, recompile dependencies with `make pip-compile`.
+   1. [ ] Make sure local virtual environment matches what CI will see - reinstall internal/external dependencies as needed.\
+      <sub>Follow the [virtualenv install instructions](https://github.com/Unstructured-IO/community#mac--homebrew) if you are unsure about working with virtual environments.</sub>
+   <br/><br/>
+1. Run tests and checks locally
+   1. [ ] Run tests locally with `make test`. Some repositories have supplemental tests with targets like `make test-integration` or `make test-sample-docs`. If applicable, run these as well. Try to make sure all tests are passing before submitting the PR, unless you are submitting in draft mode.
+   1. [ ] Run typing, linting, and formatting checks with `make check`. Some repositories have supplemental checks with targets like `make check-scripts` or `make check-notebooks`. If applicable, run these as well. Try to make sure all checks are passing before submitting the PR, unless you are submitting in draft mode.
+   <br/><br/>
+1. Ensure code is clean
+   1. [ ] Remove all debugging artifacts.
+   1. [ ] Remove commented out code.
+   1. [ ] For actual comments, note that our typical format is `# NOTE(<username>): <comment>`
+   1. [ ] Double-check that everything has been committed and pushed; the local feature branch should be clean.
+
+### PR Guidelines:
+
+1. [ ] PR title should follow [conventional commit](https://www.conventionalcommits.org/en/v1.0.0/) standards.
+
+1. [ ] PR description should give enough detail that the reviewer knows what they are reviewing - sometimes a copy-paste of the added `CHANGELOG.md` items is enough, sometimes more detail is needed.
+
+1. [ ] If applicable, add a testing section to the PR description that recommends steps a reviewer can take to verify the changes, e.g. a snippet of code they can run locally.
+
+### License
+
+Unstructured open source projects are licensed under the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
+
+Include a license at the top of new `setup.py` files:
+
+- [Python license example](https://github.com/Unstructured-IO/unstructured/blob/main/setup.py)
+
+
+## Conventions
+
+For pull requests, our convention is to squash and merge. For PR titles, we use [conventional commit](https://www.freecodecamp.org/news/how-to-write-better-git-commit-messages/#conventional-commits) messages. The format should look like
+
+- `<type>: <description>`.
+
+For example, if the PR addresses a new feature, the PR title should look like:
+
+- `feat: Implements exciting new feature`.
+
+For feature branches, the naming convention is:
+
+- `<username>/<description>`.
+
+For the commit above, coming from the user called `contributor`, the branch name would look like:
+
+- `contributor/exciting-new-feature`.
+
+Here is a list of some of the most common possible commit types:
+
+- `feat` – a new feature is introduced with the changes
+- `fix` – a bug fix has occurred
+- `chore` – changes that do not relate to a fix or feature and don't modify src or test files (for example updating dependencies)
+- `refactor` – refactored code that neither fixes a bug nor adds a feature
+- `docs` – updates to documentation such as the README or other markdown files
+
+### Why should you write better commit messages?
+
+By writing good commits, you are simply future-proofing yourself. You could save yourself and/or coworkers hours of digging around while troubleshooting by providing that helpful description 🙂.
+
+The extra time it takes to write a thoughtful commit message as a letter to your potential future self is extremely worthwhile. On large scale projects, documentation is imperative for maintenance.
+
+Collaboration and communication are of utmost importance within engineering teams. The Git commit message is a prime example of this. I highly suggest setting up a convention for commit messages on your team if you do not already have one in place.
+
+
+## Code of Conduct
+
+In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.
+
+### Enforcement
+
+Please report unacceptable behavior to [email protected]. All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. Further details of specific enforcement policies may be posted separately.
+
+Project maintainers who do not follow or enforce the Code of Conduct in good faith may face temporary or permanent repercussions as determined by other members of the project's leadership.
+
+Thank you! 🤗
+
+The Unstructured Team
+
+
+## Learn more
+
+| Section | Description |
+|-|-|
+| [Company Website](https://unstructured.io) | Unstructured.io product and company info |
+| [Documentation](https://unstructured-io.github.io/unstructured) | Full API documentation |
+| [Working with Pull Requests](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) | About pull requests |
+| [Code of Conduct](https://www.contributor-covenant.org/version/1/4/code-of-conduct/) | Contributor Covenant Code Of Conduct |
+| [Conventional Commits](https://www.freecodecamp.org/news/how-to-write-better-git-commit-messages/) | How to write better git commit messages |
+| [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) | Lightweight convention on top of commit messages |
+| [First Contributions](https://github.com/firstcontributions/first-contributions/blob/main/README.md) | Beginners' guide to make their first contribution! |
+
+
+## Contributing Guides
+
+If you're stumped 😓, here are some good examples of contribution guidelines:
+
+- The GitHub Docs [contribution guidelines](https://github.com/github/docs/blob/main/CONTRIBUTING.md).
+- The Ruby on Rails [contribution guidelines](https://github.com/rails/rails/blob/main/CONTRIBUTING.md).
+- The Open Government [contribution guidelines](https://github.com/opengovernment/opengovernment/blob/master/CONTRIBUTING.md).
+- The MMOCR [contribution guidelines](https://mmocr.readthedocs.io/en/dev-1.x/notes/contribution_guide.html).
+- The HuggingFace [contribution guidelines](https://huggingface2.notion.site/Contribution-Guide-19411c29298644df8e9656af45a7686d).
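
The "Conventions" section of the CONTRIBUTING.md added above describes a PR title format (`<type>: <description>`) and a feature-branch format (`<username>/<description>`). A small local check for both might look like the Python sketch below. It is not part of this commit: the function names and the exact regexes are assumptions made for illustration, and the type list is limited to the five types the guide names even though the guide calls them only "some of the most common".

```python
import re

# Commit types named in the guide; a real check would probably accept more.
COMMIT_TYPES = ("feat", "fix", "chore", "refactor", "docs")

_types = "|".join(COMMIT_TYPES)
# "<type>: <description>" with a non-empty description after the colon and space.
TITLE_RE = re.compile(rf"^(?:{_types}): \S.*$")
# "<username>/<description>": exactly one slash, no whitespace on either side.
BRANCH_RE = re.compile(r"^[^/\s]+/[^/\s]+$")


def is_valid_pr_title(title: str) -> bool:
    """Return True if `title` follows the `<type>: <description>` convention."""
    return TITLE_RE.match(title) is not None


def is_valid_branch_name(branch: str) -> bool:
    """Return True if `branch` follows the `<username>/<description>` convention."""
    return BRANCH_RE.match(branch) is not None
```

The branch pattern is deliberately permissive about the description part, so that both the guide's example `contributor/exciting-new-feature` and the branch merged in this commit, `pprados/fix_password`, pass.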

Dockerfile

Lines changed: 3 additions & 1 deletion
@@ -12,7 +12,9 @@ COPY example-docs example-docs
 RUN chown -R notebook-user:notebook-user /app && \
   apk add font-ubuntu git && \
   fc-cache -fv && \
-  ln -s /usr/bin/python3.11 /usr/bin/python3
+  if [ "$(readlink -f /usr/bin/python3)" != "/usr/bin/python3.11" ]; then \
+    ln -sf /usr/bin/python3.11 /usr/bin/python3; \
+  fi
 
 USER notebook-user
1820

example-docs/simple.ndjson

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+{"element_id": "a06d2d9e65212d4aa955c3ab32950ffa", "metadata": {"category_depth": 0, "file_directory": "unstructured/example-docs", "filename": "simple.docx", "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "languages": ["eng"], "last_modified": "2024-07-06T16:44:51"}, "text": "These are a few of my favorite things:", "type": "Title"}
+{"element_id": "b334c93e9b1cbca3b6f6d78ce8bc2484", "metadata": {"category_depth": 0, "file_directory": "unstructured/example-docs", "filename": "simple.docx", "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "languages": ["eng"], "last_modified": "2024-07-06T16:44:51", "parent_id": "a06d2d9e65212d4aa955c3ab32950ffa"}, "text": "Parrots", "type": "ListItem"}
+{"element_id": "76469ecb9f1459943c8d8cca1a550b5a", "metadata": {"category_depth": 0, "file_directory": "unstructured/example-docs", "filename": "simple.docx", "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "languages": ["eng"], "last_modified": "2024-07-06T16:44:51", "parent_id": "a06d2d9e65212d4aa955c3ab32950ffa"}, "text": "Hockey", "type": "ListItem"}
+{"element_id": "261fac731945a138415adc2dd4434b17", "metadata": {"category_depth": 0, "file_directory": "unstructured/example-docs", "filename": "simple.docx", "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "languages": ["eng"], "last_modified": "2024-07-06T16:44:51"}, "text": "Analysis", "type": "Title"}
+{"element_id": "95f392d32c5271bfdb30eaef45921e59", "metadata": {"category_depth": 0, "file_directory": "unstructured/example-docs", "filename": "simple.docx", "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "languages": ["eng"], "last_modified": "2024-07-06T16:44:51", "parent_id": "261fac731945a138415adc2dd4434b17"}, "text": "This is my first thought. This is my second thought.", "type": "NarrativeText"}
+{"element_id": "0de25bd6f0d74bc4f909f2678f385736", "metadata": {"category_depth": 0, "file_directory": "unstructured/example-docs", "filename": "simple.docx", "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "languages": ["eng"], "last_modified": "2024-07-06T16:44:51", "parent_id": "261fac731945a138415adc2dd4434b17"}, "text": "This is my third thought.", "type": "NarrativeText"}
+{"element_id": "f296a3bc8a901f19199fda1da92829b6", "metadata": {"category_depth": 0, "file_directory": "unstructured/example-docs", "filename": "simple.docx", "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "languages": ["eng"], "last_modified": "2024-07-06T16:44:51", "parent_id": "261fac731945a138415adc2dd4434b17"}, "text": "2023", "type": "UncategorizedText"}
+{"element_id": "78c62edbc674fdca0f6a0e3ffb459f86", "metadata": {"category_depth": 0, "file_directory": "unstructured/example-docs", "filename": "simple.docx", "filetype": "application/vnd.openxmlformats-officedocument.wordprocessingml.document", "languages": ["eng"], "last_modified": "2024-07-06T16:44:51"}, "text": "DOYLESTOWN, PA 18901", "type": "Address"}
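
The example file above stores one serialized element per line (NDJSON). For readers unfamiliar with the format, a minimal reader can be sketched with the Python standard library alone. This is not the partitioner added by the commit: `iter_ndjson` is a hypothetical helper, and the two sample records are abbreviated versions of the lines in `simple.ndjson`.

```python
import io
import json
from typing import Any, Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[dict[str, Any]]:
    """Yield one parsed JSON object per non-blank line of `stream`."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            # Report which line failed so a bad record is easy to locate.
            raise ValueError(f"invalid JSON on line {line_number}: {exc.msg}") from exc


# Abbreviated records modeled on simple.ndjson (most metadata omitted).
sample = io.StringIO(
    '{"text": "Parrots", "type": "ListItem"}\n'
    '{"text": "Analysis", "type": "Title"}\n'
)
elements = list(iter_ndjson(sample))
print([(element["type"], element["text"]) for element in elements])
```

Because each line is parsed independently, a reader like this can stream large files without loading them whole, which is the usual reason to prefer NDJSON over a single JSON array.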
