Commit 889a4d5

feat: Add filter step (#7)
* Add filter step
* Fix cli
* Fix elasticsearch
* Bump version of docker compose
* install docker compose manually
* Install as root
* apt-get update before docker install command
* update to use self hosted CI image
* Add script to install docker compose
* Update docker compose install script
* Install docker as sudo
* Set missing variables
* Print version of docker compose
* Add docker compose installation as part of update fixture step
* Specify docker compose version
* Fix docker compose command
* Update ingest test fixtures (#10)
  Co-authored-by: rbiseck3 <[email protected]>
* remove file used for testing
* Update e2e tests with new docker compose install script
* lint shell
* fix sqlite issue
* Generate filter step from cli inputs
* bugfix and add ingest test
* update s3 example file
* Remove glob in original connectors
* Add file size in local indexer
* Update ingest test fixtures (#11)
  Co-authored-by: rbiseck3 <[email protected]>

---------

Co-authored-by: Unstructured-DevOps <[email protected]>
Co-authored-by: rbiseck3 <[email protected]>
1 parent 0510b4f commit 889a4d5

117 files changed: +10043 -2962 lines changed

Large commits have some content hidden by default, so only part of the diff is shown here.
.github/workflows/e2e.yml

Lines changed: 6 additions & 8 deletions
@@ -47,10 +47,6 @@ jobs:
       uses: ./.github/actions/base-cache
       with:
         python-version: ${{ matrix.python-version }}
-    - name: Setup docker-compose
-      uses: KengoTODA/actions-setup-docker-compose@v1
-      with:
-        version: '2.22.0'
     - name: Test (end-to-end)
       env:
         AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
@@ -108,6 +104,8 @@ jobs:
         sudo apt-get install -y tesseract-ocr-kor
         sudo apt-get install diffstat
         tesseract --version
+        sudo make install-docker-compose
+        docker compose version
         ./test_e2e/test-src.sh
 
 
@@ -132,10 +130,6 @@ jobs:
       uses: ./.github/actions/base-cache
       with:
         python-version: ${{ matrix.python-version }}
-    - name: Setup docker-compose
-      uses: KengoTODA/actions-setup-docker-compose@v1
-      with:
-        version: '2.22.0'
     - name: Test (end-to-end)
       env:
         AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
@@ -164,6 +158,8 @@ jobs:
         DATABRICKS_USERNAME: ${{secrets.DATABRICKS_USERNAME}}
         DATABRICKS_PASSWORD: ${{secrets.DATABRICKS_PASSWORD}}
         DATABRICKS_CATALOG: ${{secrets.DATABRICKS_CATALOG}}
+        SHAREPOINT_CLIENT_ID: ${{secrets.SHAREPOINT_CLIENT_ID}}
+        SHAREPOINT_CRED: ${{secrets.SHAREPOINT_CRED}}
         TABLE_OCR: "tesseract"
         OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
         CI: "true"
@@ -178,4 +174,6 @@ jobs:
         sudo apt-get install -y tesseract-ocr-kor
         sudo apt-get install diffstat
         tesseract --version
+        sudo make install-docker-compose
+        docker compose version
         ./test_e2e/test-dest.sh

.github/workflows/ingest-test-fixtures-update-pr.yml

Lines changed: 2 additions & 4 deletions
@@ -38,10 +38,6 @@ jobs:
       uses: ./.github/actions/base-cache
       with:
         python-version: ${{ env.PYTHON_VERSION }}
-    - name: Setup docker-compose
-      uses: KengoTODA/actions-setup-docker-compose@v1
-      with:
-        version: '2.22.0'
     - name: Update test fixtures
       env:
         AIRTABLE_PERSONAL_ACCESS_TOKEN: ${{ secrets.AIRTABLE_PERSONAL_ACCESS_TOKEN }}
@@ -95,6 +91,8 @@ jobs:
         sudo apt-get install -y tesseract-ocr
         sudo apt-get install -y tesseract-ocr-kor
         tesseract --version
+        sudo make install-docker-compose
+        docker compose version
         ./test_e2e/test-src.sh
 
     - name: Save branch name to environment file

CHANGELOG.md

Lines changed: 2 additions & 3 deletions
@@ -1,9 +1,10 @@
-## 0.0.2-dev1
+## 0.0.2-dev2
 
 ### Enhancements
 
 * **Use uuid for s3 identifiers** Update unique id to use uuid derived from file path rather than the filepath itself.
 * **V2 connectors precheck support** All steps in the v2 pipeline support an optional precheck call, which encompasses the previous check connection functionality.
+* **Filter Step** Support dedicated step as part of the pipeline to filter documents.
 
 ## 0.0.1
 
@@ -19,8 +20,6 @@
 
 ## 0.0.0
 
-### Enhancements
-
 ### Features
 
 * **Initial Migration** Create the structure of this repo from the original code in the [Unstructured](https://github.com/Unstructured-IO/unstructured) project.

Makefile

Lines changed: 4 additions & 0 deletions
@@ -42,6 +42,10 @@ install-all-deps:
 install-pandoc:
 	ARCH=${ARCH} ./scripts/install-pandoc.sh
 
+.PHONY: install-docker-compose
+install-docker-compose:
+	ARCH=${ARCH} ./scripts/install-docker-compose.sh
+
 .PHONY: install-ci
 install-ci: install-all-connectors install-all-embedders
 	pip install -r requirements/local_partition/pdf.txt

scripts/install-docker-compose.sh

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+DOCKER_ARCH=${ARCH}
+if [ "${ARCH}" = "x86_64" ]; then
+  TARGETARCH="amd64"
+elif [ "${ARCH}" = "arm64" ] || [ "${ARCH}" = "aarch64" ]; then
+  TARGETARCH="arm64"
+fi
+TARGETOS=linux
+DOCKER_VERSION=26.1.3
+BUILDX_VERSION=0.16.0
+DOCKER_COMPOSE_VERSION=2.28.1
+
+curl -fLo docker.tgz https://download.docker.com/${TARGETOS}/static/stable/"${DOCKER_ARCH}"/docker-${DOCKER_VERSION}.tgz
+tar zxvf docker.tgz
+rm -rf docker.tgz
+mkdir -p /usr/local/lib/docker/cli-plugins
+curl -fLo /usr/local/lib/docker/cli-plugins/docker-buildx "https://github.com/docker/buildx/releases/download/v${BUILDX_VERSION}/buildx-v${BUILDX_VERSION}.linux-${TARGETARCH}"
+chmod +x /usr/local/lib/docker/cli-plugins/docker-buildx
+curl -SL https://github.com/docker/compose/releases/download/v${DOCKER_COMPOSE_VERSION}/docker-compose-${TARGETOS}-"${DOCKER_ARCH}" -o /usr/local/lib/docker/cli-plugins/docker-compose
+chmod +x /usr/local/lib/docker/cli-plugins/docker-compose
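
Review note: the script above uses two naming schemes for the same architecture. The Docker static download and the Compose release asset take the raw `ARCH` value, while the buildx asset takes the mapped `TARGETARCH`. As written, an unrecognized `ARCH` leaves `TARGETARCH` unset (which `set -u` would report only when the buildx URL is built), and `ARCH=arm64` is passed unchanged into the Docker and Compose URLs. The sketch below shows one way to make both mappings explicit. It is not part of the commit: `to_target_arch` and `to_docker_arch` are hypothetical helper names, and the assumption that the Docker and Compose downloads expect `x86_64`/`aarch64` should be checked against the actual release listings.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the commit): map a `uname -m` style value
# to the amd64/arm64 naming used by the buildx release assets.
to_target_arch() {
  case "$1" in
    x86_64 | amd64) echo "amd64" ;;
    arm64 | aarch64) echo "arm64" ;;
    *)
      echo "unsupported architecture: $1" >&2
      return 1
      ;;
  esac
}

# Hypothetical helper (not in the commit): map the same input to the
# x86_64/aarch64 naming assumed here for the Docker static download
# path and the Compose release asset name.
to_docker_arch() {
  case "$1" in
    x86_64 | amd64) echo "x86_64" ;;
    arm64 | aarch64) echo "aarch64" ;;
    *)
      echo "unsupported architecture: $1" >&2
      return 1
      ;;
  esac
}
```

With helpers like these, the script could set `TARGETARCH="$(to_target_arch "${ARCH}")"` and `DOCKER_ARCH="$(to_docker_arch "${ARCH}")"` once at the top, so an unsupported value fails immediately with a clear message.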

test_e2e/dest/elasticsearch.sh

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ source "$SCRIPT_DIR"/env_setup/elasticsearch/common/es-dest-ingest-test-creds.en
 function cleanup {
   # Index cleanup
   echo "Stopping Elasticsearch Docker container"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/elasticsearch/common/docker-compose.yaml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/elasticsearch/common/docker-compose.yaml down --remove-orphans -v
 
   # Local file cleanup
   cleanup_dir "$WORK_DIR"

test_e2e/dest/kafka-local.sh

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ function cleanup {
   cleanup_dir "$OUTPUT_DIR"
 
   echo "Stopping local Kafka instance"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/kafka/docker-compose.yml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/kafka/docker-compose.yml down --remove-orphans -v
 }
 
 trap cleanup EXIT

test_e2e/dest/opensearch.sh

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ source "$SCRIPT_DIR"/cleanup.sh
 function cleanup {
   # Index cleanup
   echo "Stopping OpenSearch Docker container"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/opensearch/common/docker-compose.yaml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/opensearch/common/docker-compose.yaml down --remove-orphans -v
 
   # Local file cleanup
   cleanup_dir "$WORK_DIR"

test_e2e/dest/pgvector.sh

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ DATABASE_TYPE="pgvector"
 source "$SCRIPT_DIR"/cleanup.sh
 function cleanup {
   echo "Stopping SQL DB Docker container"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/sql/docker-compose-"$DATABASE_TYPE".yaml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/sql/docker-compose-"$DATABASE_TYPE".yaml down --remove-orphans -v
   # Local file cleanup
   cleanup_dir "$WORK_DIR"
   cleanup_dir "$OUTPUT_DIR"

test_e2e/dest/weaviate.sh

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ source "$SCRIPT_DIR"/cleanup.sh
 function cleanup {
   # Index cleanup
   echo "Stopping Weaviate Docker container"
-  docker-compose -f "$SCRIPT_DIR"/env_setup/weaviate/docker-compose.yml down --remove-orphans -v
+  docker compose -f "$SCRIPT_DIR"/env_setup/weaviate/docker-compose.yml down --remove-orphans -v
 
   # Local file cleanup
   cleanup_dir "$WORK_DIR"
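
Review note: the cleanup functions in these test scripts move from the standalone `docker-compose` binary to the `docker compose` CLI plugin, so they now require the plugin to be installed wherever the tests run. If the scripts also need to work on machines that only have the standalone binary, a small wrapper could pick whichever is available. This is a sketch, not part of the commit; `compose_cmd` and `compose` are hypothetical names.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not in the commit): print the Compose invocation
# available on this machine, preferring the `docker compose` plugin.
compose_cmd() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"
  else
    echo "no Docker Compose installation found" >&2
    return 1
  fi
}

# Hypothetical wrapper: forward all arguments to the detected command.
compose() {
  local cmd
  cmd="$(compose_cmd)" || return 1
  # Word splitting is intentional: "docker compose" must become two words.
  # shellcheck disable=SC2086
  $cmd "$@"
}
```

A cleanup function could then call, for example, `compose -f "$SCRIPT_DIR"/env_setup/kafka/docker-compose.yml down --remove-orphans -v` without caring which variant is installed. The commit itself takes the simpler route of installing the plugin in CI and calling `docker compose` directly.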
