|
1 | 1 | --- |
2 | | -title: Docker Installation |
3 | | -description: The instructions below guide you on how to use the unstructured library inside a Docker container. |
| 2 | +title: Docker installation |
4 | 3 | --- |
5 | 4 |
|
6 | | -## Prerequisites |
| 5 | +Follow these steps to run the Unstructured open source library inside a Docker container. |
7 | 6 |
|
8 | | -If you haven’t installed Docker on your machine, you can find the installation guide [here](https://docs.docker.com/get-docker/). |
| 7 | +<Steps> |
| 8 | + <Step title="Install and run Docker"> |
| 9 | + If you do not have Docker already installed and running, you can install and run a tool such as Docker Desktop, which is |
| 10 | + available for macOS, Windows, and Linux. Learn how to install and run: |
| 11 | + |
| 12 | + - [Docker Desktop on Mac](https://docs.docker.com/desktop/setup/install/mac-install/) |
| 13 | + - [Docker Desktop on Windows](https://docs.docker.com/desktop/setup/install/windows-install/) |
| 14 | + - [Docker Desktop on Linux](https://docs.docker.com/desktop/setup/install/linux/) |
| 15 | + </Step> |
| 16 | + <Step title="Pull the Unstructured Docker image"> |
| 17 | + <Info>If you are an experienced Docker user, you plan to parse only a single type of data, and you want to accelerate the image-building process, you can [build your own Docker image](#building-your-own-docker-image) instead of pulling the latest prebuilt image.</Info> |
| 18 | + <Tabs> |
| 19 | + <Tab title="Docker Desktop UI"> |
| 20 | + <Note> |
| 21 | + The following steps are for AMD64-based systems. |
9 | 22 |
|
10 | | -<Note> |
11 | | - We build multi-platform images to support both x86\_64 and Apple silicon hardware. Using docker pull should download the appropriate image for your architecture. However, if needed, you can specify the platform with the –platform flag, e.g., –platform linux/amd64. |
| 23 | + If you are using an ARM64-based system (such as Apple Silicon), follow the instructions on the **Docker CLI** tab in this step instead. |
| 24 | + </Note> |
12 | 25 |
|
13 | | - We do not support GPU usage with the Unstructured library inside a Docker container. |
14 | | -</Note> |
| 26 | + 1. In your Docker Desktop UI's search box, enter `downloads.unstructured.io/unstructured-io/unstructured:latest`. |
| 27 | + 2. On the **Images** tab, next to **unstructured-io/unstructured**, click **Pull**. |
15 | 28 |
|
16 | | -## Pulling the Docker Image |
| 29 | + To list the available images on your machine, in the sidebar, click **Images**. |
17 | 30 |
|
18 | | -We create Docker images for every push to the main branch. These images are tagged with the respective short commit hash (like fbc7a69) and the application version (e.g., 0.5.5-dev1). The most recent image also receives the latest tag. To use these images, pull them from our repository: |
| 31 | + To remove this image from your machine at any time, click the trash can (**Delete**) icon next to the image in the |
| 32 | + list of available images. |
| 33 | + </Tab> |
| 34 | + <Tab title="Docker CLI"> |
| 35 | + From your terminal or command prompt, run the following command. |
19 | 36 |
|
20 | | -```go |
21 | | -docker pull downloads.unstructured.io/unstructured-io/unstructured:latest |
| 37 | + <Tip>If you have the Docker Desktop UI running, you can click the **Terminal** button in the UI's lower right corner to run a Docker CLI session from within the Docker Desktop UI.</Tip> |
22 | 38 |
|
23 | | -``` |
| 39 | + For AMD64-based systems, run the following command: |
24 | 40 |
|
| 41 | + ```bash |
| 42 | + # The AMD64 platform is the default. |
| 43 | + docker pull downloads.unstructured.io/unstructured-io/unstructured:latest |
25 | 44 |
|
26 | | -## Using the Docker Image |
| 45 | + # Or, to explicitly specify the AMD64 platform: |
| 46 | + docker pull --platform=linux/amd64 downloads.unstructured.io/unstructured-io/unstructured:latest |
| 47 | + ``` |
27 | 48 |
|
28 | | -After pulling the image, you can create and start a container from it: |
| 49 | + For ARM64-based systems (such as Apple Silicon), run the following command instead: |
29 | 50 |
|
30 | | -```go |
31 | | -# create the container |
32 | | -docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest |
| 51 | + ```bash |
| 52 | + docker pull --platform=linux/arm64 downloads.unstructured.io/unstructured-io/unstructured:latest |
| 53 | + ``` |
33 | 54 |
|
34 | | -# start a bash shell inside the running Docker container |
35 | | -docker exec -it unstructured bash |
| 55 | + To list the available images on your machine, run the following command: |
| 56 | + |
| 57 | + ```bash |
| 58 | + docker images |
| 59 | + ``` |
36 | 60 |
|
37 | | -``` |
| 61 | + To remove this image from your machine at any time, run the following command: |
| 62 | + |
| 63 | + ```bash |
| 64 | + docker rmi downloads.unstructured.io/unstructured-io/unstructured:latest |
| 65 | + ``` |
| 66 | + </Tab> |
| 67 | + </Tabs> |
| 68 | + </Step> |
| 69 | + <Step title="Create and run a container from the image"> |
| 70 | + <Tabs> |
| 71 | + <Tab title="Docker Desktop UI"> |
| 72 | + <Note> |
| 73 | + The following steps are for AMD64-based systems. |
38 | 74 |
|
| 75 | + If you are using an ARM64-based system (such as Apple Silicon), follow the instructions on the **Docker CLI** tab in this step instead. |
| 76 | + </Note> |
| 77 | + |
| 78 | + 1. In the Docker Desktop UI's sidebar, click **Images**. |
| 79 | + 2. Next to **unstructured-io/unstructured**, click the play (**Run**) icon. |
| 80 | + 3. Expand **Optional settings**. |
| 81 | + 4. For **Container name**, enter some name for your container, such as `unstructured`. |
| 82 | + 5. In the sidebar, click **Containers**. |
| 83 | + 6. Next to your container, click the play (**Start**) icon. |
| 84 | + </Tab> |
| 85 | + <Tab title="Docker CLI"> |
| 86 | + For AMD64-based systems, run the following command, replacing `<container-name>` with some name for your container, such as `unstructured`: |
39 | 87 |
|
40 | | -## Building Your Own Docker Image |
41 | | -You can also build your own Docker image. If you only plan to parse a single type of data, you can accelerate the build process by excluding certain packages or requirements needed for other data types. Refer to the Dockerfile to determine which lines are necessary for your requirements. |
| 88 | + ```bash |
| 89 | + # The AMD64 platform is the default. |
| 90 | + docker run -dt --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest |
42 | 91 |
|
43 | | -```go |
44 | | -make docker-build |
| 92 | + # Or, to explicitly specify the AMD64 platform: |
| 93 | + docker run -dt --platform=linux/amd64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest |
| 94 | + ``` |
45 | 95 |
|
46 | | -# start a bash shell inside the running Docker container |
47 | | -make docker-start-bash |
| 96 | + For ARM64-based systems (such as Apple Silicon), run the following command instead, replacing `<container-name>` with some name for your container, such as `unstructured`: |
| 97 | + |
| 98 | + ```bash |
| 99 | + docker run -dt --platform=linux/arm64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest |
| 100 | + ``` |
| 101 | + </Tab> |
| 102 | + </Tabs> |
| 103 | + </Step> |
| 104 | + <Step title="Interact with the Unstructured open source library by running code inside the container"> |
| 105 | + <Tabs> |
| 106 | + <Tab title="Docker Desktop UI"> |
| 107 | + 1. In the Docker Desktop UI, in the lower right corner, click the **Terminal** button. |
| 108 | + 2. To start a terminal session inside the container, run the following command, replacing `<container-name>` with the name of your container, such as `unstructured`: |
| 109 | + |
| 110 | + ```bash |
| 111 | + docker exec -it <container-name> bash |
| 112 | + ``` |
| 113 | + |
| 114 | + 3. Run Unstructured open source library calls from inside the container. For example, start the Python interpreter: |
| 115 | + |
| 116 | + ```bash |
| 117 | + python |
| 118 | + ``` |
| 119 | + |
| 120 | + And then run the following commands, one command at a time, to make calls to the Unstructured open source library. |
| 121 | + These calls process a PDF file in the `/app/example-docs/pdf` directory named `layout-parser-paper.pdf`. The |
| 122 | + processed data is written as a JSON file named `layout-parser-paper-output.json` in that same directory: |
| 123 | + |
 | 124 | + ```python
| 125 | + >>> from unstructured.partition.pdf import partition_pdf |
| 126 | + >>> from unstructured.staging.base import elements_to_json |
| 127 | + >>> elements = partition_pdf(filename="/app/example-docs/pdf/layout-parser-paper.pdf") |
| 128 | + >>> elements_to_json(elements=elements, filename="/app/example-docs/pdf/layout-parser-paper-output.json") |
| 129 | + ``` |
| 130 | + |
| 131 | + After the last call finishes running, exit the Python interpreter, and then print the contents of the JSON file to the terminal: |
| 132 | + |
| 133 | + ```bash |
| 134 | + >>> exit() |
| 135 | + |
 | 136 | + cat /app/example-docs/pdf/layout-parser-paper-output.json
| 137 | + ``` |
| 138 | + |
| 139 | + 4. To exit the terminal session, run the following command, or press `Ctrl+D`: |
| 140 | + |
| 141 | + ```bash |
| 142 | + exit |
| 143 | + ``` |
| 144 | + </Tab> |
| 145 | + <Tab title="Docker CLI"> |
| 146 | + 1. Run the following command, replacing `<container-name>` with the name of your container, such as `unstructured`: |
| 147 | + |
| 148 | + ```bash |
| 149 | + docker exec -it <container-name> bash |
| 150 | + ``` |
| 151 | + |
| 152 | + 2. Run Unstructured open source library calls from inside the container. For example, start the Python interpreter: |
| 153 | + |
| 154 | + ```bash |
| 155 | + python |
| 156 | + ``` |
| 157 | + |
| 158 | + And then run the following commands, one command at a time, to make calls to the Unstructured open source library. |
| 159 | + These calls process a PDF file in the `/app/example-docs/pdf` directory named `layout-parser-paper.pdf`. The |
| 160 | + processed data is written as a JSON file named `layout-parser-paper-output.json` in that same directory: |
| 161 | + |
 | 162 | + ```python
| 163 | + >>> from unstructured.partition.pdf import partition_pdf |
| 164 | + >>> from unstructured.staging.base import elements_to_json |
| 165 | + >>> elements = partition_pdf(filename="/app/example-docs/pdf/layout-parser-paper.pdf") |
| 166 | + >>> elements_to_json(elements=elements, filename="/app/example-docs/pdf/layout-parser-paper-output.json") |
| 167 | + ``` |
| 168 | + |
| 169 | + After the last call finishes running, exit the Python interpreter, and then print the contents of the JSON file to the terminal: |
| 170 | + |
| 171 | + ```bash |
| 172 | + >>> exit() |
| 173 | + |
 | 174 | + cat /app/example-docs/pdf/layout-parser-paper-output.json
| 175 | + ``` |
| 176 | + |
 | 177 | + 3. To exit the terminal session, run the following command, or press `Ctrl+D`:
| 178 | + |
| 179 | + ```bash |
| 180 | + exit |
| 181 | + ``` |
| 182 | + </Tab> |
| 183 | + </Tabs> |
| 184 | + </Step> |
| 185 | + <Step title="Interact with the Unstructured open source library by running code outside the container"> |
| 186 | + You can also interact with the Unstructured open source library by running code that is on the |
| 187 | + same machine as the running container but not within the container itself. To do this, you can |
| 188 | + use the Docker CLI to create a container that mounts the local directory containing the |
| 189 | + code into the container itself, and then run that code from the container. |
| 190 | + |
| 191 | + 1. Run one of the following commands, replacing the following placeholders with the appropriate values: |
48 | 192 |
|
49 | | -``` |
| 193 | + - Replace `<host-path>` with the path to the directory containing your code, for example `/Users/<username>/my_example_code/`. |
| 194 | + - Replace `<container-path>` with the path to some directory within the container to mount `<host-path>` into, for example `/app/my_example_code/`. If |
| 195 | + `<container-path>` does not already exist, it will be created at the same time that the container is created. |
| 196 | + - Replace `<container-name>` with some name for your container, such as `unstructured_mount`. |
| 197 | + |
| 198 | + For AMD64-based systems, run the following command: |
50 | 199 |
|
| 200 | + ```bash |
| 201 | + # The AMD64 platform is the default. |
 | 202 | + docker run -dt -v <host-path>:<container-path> --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
51 | 203 |
|
52 | | -## Interacting with Python Inside the Container |
53 | | -Once inside the running Docker container, you can directly test the library using Python’s interactive mode: |
| 204 | + # Or, to explicitly specify the AMD64 platform: |
| 205 | + docker run -dt -v <host-path>:<container-path> --platform=linux/amd64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest |
| 206 | + ``` |
54 | 207 |
|
55 | | -```go |
56 | | -python3 |
| 208 | + For ARM64-based systems (such as Apple Silicon), run the following command instead: |
57 | 209 |
|
58 | | ->>> from unstructured.partition.pdf import partition_pdf |
59 | | ->>> elements = partition_pdf(filename="example-docs/pdf/layout-parser-paper-fast.pdf") |
| 210 | + ```bash |
| 211 | + docker run -dt -v <host-path>:<container-path> --platform=linux/arm64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest |
| 212 | + ``` |
60 | 213 |
|
61 | | ->>> from unstructured.partition.text import partition_text |
62 | | ->>> elements = partition_text(filename="example-docs/fake-text.txt") |
| 214 | + 2. Start a terminal session inside the container by running the following command, replacing `<container-name>` with the name of your container, such as `unstructured_mount`: |
63 | 215 |
|
64 | | -``` |
| 216 | + ```bash |
| 217 | + docker exec -it <container-name> bash |
| 218 | + ``` |
| 219 | + |
| 220 | + 3. Add `<container-path>` to the `PYTHONPATH` environment variable within the container by running the following commands, |
| 221 | + replacing `<container-path>` with the path to the target directory within the container: |
| 222 | + |
| 223 | + ```bash |
| 224 | + PYTHONPATH="${PYTHONPATH}:<container-path>" |
| 225 | + export PYTHONPATH |
| 226 | + ``` |
| 227 | + |
| 228 | + 4. Run Unstructured open source library calls, referencing your code from `<container-path>`. |
| 229 | + |
 | 230 | + For example, if you have a file named `main.py` in `<host-path>` that contains the four commands following `>>>` from the previous step,
| 231 | + you can run it as follows, replacing `<container-path>` with the path to the target directory within the container: |
| 232 | + |
| 233 | + ```bash |
| 234 | + python <container-path>/main.py |
| 235 | + ``` |
| 236 | + |
| 237 | + To print the contents of the JSON file to the terminal, run the following command: |
| 238 | + |
| 239 | + ```bash |
| 240 | + cat /app/example-docs/pdf/layout-parser-paper-output.json |
| 241 | + ``` |
| 242 | + |
| 243 | + 5. To exit the terminal session, run the following command, or press `Ctrl+D`: |
| 244 | + |
| 245 | + ```bash |
| 246 | + exit |
| 247 | + ``` |
| 248 | + </Step> |
| 249 | + <Step title="Stop running the container"> |
 | 250 | + If you do not need to keep running the container, you can stop it as follows:
| 251 | + <Tabs> |
| 252 | + <Tab title="Docker Desktop UI"> |
| 253 | + 1. In the Docker Desktop UI, in the sidebar, click **Containers**. |
| 254 | + 2. Next to your container, click the square (**Stop**) icon. |
| 255 | + </Tab> |
| 256 | + <Tab title="Docker CLI"> |
| 257 | + Run the following command, replacing `<container-name>` with the name of your container, such as `unstructured` or `unstructured_mount`: |
| 258 | + |
| 259 | + ```bash |
| 260 | + docker stop <container-name> |
| 261 | + ``` |
| 262 | + </Tab> |
| 263 | + </Tabs> |
| 264 | + </Step> |
| 265 | +</Steps> |
| 266 | + |
| 267 | +## Building your own Docker image |
| 268 | + |
| 269 | +You can build your own Docker image instead of pulling the latest prebuilt image. |
| 270 | +If you only plan to parse a single type of data, you can accelerate the |
| 271 | +build process by excluding certain packages or requirements needed for other data types. Refer to the |
| 272 | +[Dockerfile](https://github.com/Unstructured-IO/unstructured/blob/main/Dockerfile) to determine which lines |
| 273 | +are necessary for your requirements. |
| 274 | + |
| 275 | +```bash |
| 276 | +make docker-build |
| 277 | + |
| 278 | +# Start a Bash shell inside of the running Docker container. |
| 279 | +make docker-start-bash |
| 280 | +``` |
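
The new page repeats each `docker pull` and `docker run` command once per architecture. As a minimal sketch of how the right `--platform` value could be chosen automatically, the Python helper below maps the host's machine type to a Docker platform string and assembles the same commands shown in the diff. The function names (`docker_platform`, `pull_command`, `run_command`) and the machine-name mapping are assumptions made for illustration; they are not part of the Unstructured library or of Docker. The helper only builds the argument lists and does not invoke Docker.

```python
import platform
import shlex

# Image reference taken from the documentation above.
IMAGE = "downloads.unstructured.io/unstructured-io/unstructured:latest"

# Assumed mapping from lowercased platform.machine() values to Docker
# platform strings. Extend it if your host reports a different name.
_MACHINE_TO_PLATFORM = {
    "x86_64": "linux/amd64",
    "amd64": "linux/amd64",
    "arm64": "linux/arm64",
    "aarch64": "linux/arm64",
}


def docker_platform(machine=None):
    """Return the Docker platform string for a machine type.

    If machine is None, the host's own machine type is used.
    """
    name = (machine or platform.machine()).lower()
    if name not in _MACHINE_TO_PLATFORM:
        raise ValueError(f"Unrecognized machine type: {name!r}")
    return _MACHINE_TO_PLATFORM[name]


def pull_command(machine=None):
    """Build the `docker pull` argument list for the given machine type."""
    return ["docker", "pull", f"--platform={docker_platform(machine)}", IMAGE]


def run_command(container_name, machine=None, host_path=None, container_path=None):
    """Build the `docker run` argument list.

    The -v mount is added only when both host_path and container_path are given.
    """
    cmd = ["docker", "run", "-dt"]
    if host_path and container_path:
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd += [f"--platform={docker_platform(machine)}", "--name", container_name, IMAGE]
    return cmd


# Print the commands for an ARM64 host as copy-pasteable shell lines.
print(shlex.join(pull_command("arm64")))
print(shlex.join(run_command("unstructured", "arm64")))
```

To run a command rather than print it, the returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where Docker is installed and running.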