Commit 035622c

Open source installation steps: update full local installation and Docker installation instructions (#638)
1 parent 969bf87 commit 035622c

2 files changed

+293
-72
lines changed

Lines changed: 253 additions & 37 deletions
Original file line number | Diff line number | Diff line change
@@ -1,64 +1,280 @@
11
---
2-
title: Docker Installation
3-
description: The instructions below guide you on how to use the unstructured library inside a Docker container.
2+
title: Docker installation
43
---
54

6-
## Prerequisites
5+
Follow these steps to run the Unstructured open source library inside a Docker container.
76

8-
If you haven’t installed Docker on your machine, you can find the installation guide [here](https://docs.docker.com/get-docker/).
7+
<Steps>
8+
<Step title="Install and run Docker">
9+
If you do not have Docker already installed and running, you can install and run a tool such as Docker Desktop, which is
10+
available for macOS, Windows, and Linux. Learn how to install and run:
11+
12+
- [Docker Desktop on Mac](https://docs.docker.com/desktop/setup/install/mac-install/)
13+
- [Docker Desktop on Windows](https://docs.docker.com/desktop/setup/install/windows-install/)
14+
- [Docker Desktop on Linux](https://docs.docker.com/desktop/setup/install/linux/)
15+
</Step>
16+
<Step title="Pull the Unstructured Docker image">
17+
<Info>If you are an experienced Docker user, you plan to parse only a single type of data, and you want to accelerate the image-building process, you can [build your own Docker image](#building-your-own-docker-image) instead of pulling the latest prebuilt image.</Info>
18+
<Tabs>
19+
<Tab title="Docker Desktop UI">
20+
<Note>
21+
The following steps are for AMD64-based systems.
922

10-
<Note>
11-
We build multi-platform images to support both x86\_64 and Apple silicon hardware. Using `docker pull` should download the appropriate image for your architecture. However, if needed, you can specify the platform with the `--platform` flag, e.g., `--platform linux/amd64`.
23+
If you are using an ARM64-based system (such as Apple Silicon), follow the instructions on the **Docker CLI** tab in this step instead.
24+
</Note>
1225

13-
We do not support GPU usage with the Unstructured library inside a Docker container.
14-
</Note>
26+
1. In your Docker Desktop UI's search box, enter `downloads.unstructured.io/unstructured-io/unstructured:latest`.
27+
2. On the **Images** tab, next to **unstructured-io/unstructured**, click **Pull**.
1528

16-
## Pulling the Docker Image
29+
To list the available images on your machine, in the sidebar, click **Images**.
1730

18-
We create Docker images for every push to the main branch. These images are tagged with the respective short commit hash (like fbc7a69) and the application version (e.g., 0.5.5-dev1). The most recent image also receives the latest tag. To use these images, pull them from our repository:
31+
To remove this image from your machine at any time, click the trash can (**Delete**) icon next to the image in the
32+
list of available images.
33+
</Tab>
34+
<Tab title="Docker CLI">
35+
From your terminal or command prompt, run the following command.
1936

20-
```go
21-
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
37+
<Tip>If you have the Docker Desktop UI running, you can click the **Terminal** button in the UI's lower right corner to run a Docker CLI session from within the Docker Desktop UI.</Tip>
2238

23-
```
39+
For AMD64-based systems, run the following command:
2440

41+
```bash
42+
# The AMD64 platform is the default.
43+
docker pull downloads.unstructured.io/unstructured-io/unstructured:latest
2544

26-
## Using the Docker Image
45+
# Or, to explicitly specify the AMD64 platform:
46+
docker pull --platform=linux/amd64 downloads.unstructured.io/unstructured-io/unstructured:latest
47+
```
2748

28-
After pulling the image, you can create and start a container from it:
49+
For ARM64-based systems (such as Apple Silicon), run the following command instead:
2950

30-
```go
31-
# create the container
32-
docker run -dt --name unstructured downloads.unstructured.io/unstructured-io/unstructured:latest
51+
```bash
52+
docker pull --platform=linux/arm64 downloads.unstructured.io/unstructured-io/unstructured:latest
53+
```
3354

34-
# start a bash shell inside the running Docker container
35-
docker exec -it unstructured bash
55+
To list the available images on your machine, run the following command:
56+
57+
```bash
58+
docker images
59+
```
3660

37-
```
61+
To remove this image from your machine at any time, run the following command:
62+
63+
```bash
64+
docker rmi downloads.unstructured.io/unstructured-io/unstructured:latest
65+
```
66+
</Tab>
67+
</Tabs>
68+
</Step>
69+
<Step title="Create and run a container from the image">
70+
<Tabs>
71+
<Tab title="Docker Desktop UI">
72+
<Note>
73+
The following steps are for AMD64-based systems.
3874

75+
If you are using an ARM64-based system (such as Apple Silicon), follow the instructions on the **Docker CLI** tab in this step instead.
76+
</Note>
77+
78+
1. In the Docker Desktop UI's sidebar, click **Images**.
79+
2. Next to **unstructured-io/unstructured**, click the play (**Run**) icon.
80+
3. Expand **Optional settings**.
81+
4. For **Container name**, enter some name for your container, such as `unstructured`.
82+
5. In the sidebar, click **Containers**.
83+
6. Next to your container, click the play (**Start**) icon.
84+
</Tab>
85+
<Tab title="Docker CLI">
86+
For AMD64-based systems, run the following command, replacing `<container-name>` with some name for your container, such as `unstructured`:
3987

40-
## Building Your Own Docker Image
41-
You can also build your own Docker image. If you only plan to parse a single type of data, you can accelerate the build process by excluding certain packages or requirements needed for other data types. Refer to the Dockerfile to determine which lines are necessary for your requirements.
88+
```bash
89+
# The AMD64 platform is the default.
90+
docker run -dt --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
4291

43-
```go
44-
make docker-build
92+
# Or, to explicitly specify the AMD64 platform:
93+
docker run -dt --platform=linux/amd64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
94+
```
4595

46-
# start a bash shell inside the running Docker container
47-
make docker-start-bash
96+
For ARM64-based systems (such as Apple Silicon), run the following command instead, replacing `<container-name>` with some name for your container, such as `unstructured`:
97+
98+
```bash
99+
docker run -dt --platform=linux/arm64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
100+
```
101+
</Tab>
102+
</Tabs>
103+
</Step>
104+
<Step title="Interact with the Unstructured open source library by running code inside the container">
105+
<Tabs>
106+
<Tab title="Docker Desktop UI">
107+
1. In the Docker Desktop UI, in the lower right corner, click the **Terminal** button.
108+
2. To start a terminal session inside the container, run the following command, replacing `<container-name>` with the name of your container, such as `unstructured`:
109+
110+
```bash
111+
docker exec -it <container-name> bash
112+
```
113+
114+
3. Run Unstructured open source library calls from inside the container. For example, start the Python interpreter:
115+
116+
```bash
117+
python
118+
```
119+
120+
And then run the following commands, one command at a time, to make calls to the Unstructured open source library.
121+
These calls process a PDF file in the `/app/example-docs/pdf` directory named `layout-parser-paper.pdf`. The
122+
processed data is written as a JSON file named `layout-parser-paper-output.json` in that same directory:
123+
124+
```bash
125+
>>> from unstructured.partition.pdf import partition_pdf
126+
>>> from unstructured.staging.base import elements_to_json
127+
>>> elements = partition_pdf(filename="/app/example-docs/pdf/layout-parser-paper.pdf")
128+
>>> elements_to_json(elements=elements, filename="/app/example-docs/pdf/layout-parser-paper-output.json")
129+
```
130+
131+
After the last call finishes running, exit the Python interpreter, and then print the contents of the JSON file to the terminal:
132+
133+
```bash
134+
>>> exit()
135+
136+
cat /app/example-docs/pdf/layout-parser-paper-output.json
137+
```
138+
139+
4. To exit the terminal session, run the following command, or press `Ctrl+D`:
140+
141+
```bash
142+
exit
143+
```
144+
</Tab>
145+
<Tab title="Docker CLI">
146+
1. Run the following command, replacing `<container-name>` with the name of your container, such as `unstructured`:
147+
148+
```bash
149+
docker exec -it <container-name> bash
150+
```
151+
152+
2. Run Unstructured open source library calls from inside the container. For example, start the Python interpreter:
153+
154+
```bash
155+
python
156+
```
157+
158+
And then run the following commands, one command at a time, to make calls to the Unstructured open source library.
159+
These calls process a PDF file in the `/app/example-docs/pdf` directory named `layout-parser-paper.pdf`. The
160+
processed data is written as a JSON file named `layout-parser-paper-output.json` in that same directory:
161+
162+
```bash
163+
>>> from unstructured.partition.pdf import partition_pdf
164+
>>> from unstructured.staging.base import elements_to_json
165+
>>> elements = partition_pdf(filename="/app/example-docs/pdf/layout-parser-paper.pdf")
166+
>>> elements_to_json(elements=elements, filename="/app/example-docs/pdf/layout-parser-paper-output.json")
167+
```
168+
169+
After the last call finishes running, exit the Python interpreter, and then print the contents of the JSON file to the terminal:
170+
171+
```bash
172+
>>> exit()
173+
174+
cat /app/example-docs/pdf/layout-parser-paper-output.json
175+
```
176+
177+
3. To exit the terminal session, run the following command, or press `Ctrl+D`:
178+
179+
```bash
180+
exit
181+
```
182+
</Tab>
183+
</Tabs>
184+
</Step>
185+
<Step title="Interact with the Unstructured open source library by running code outside the container">
186+
You can also interact with the Unstructured open source library by running code that is on the
187+
same machine as the running container but not within the container itself. To do this, you can
188+
use the Docker CLI to create a container that mounts the local directory containing the
189+
code into the container itself, and then run that code from the container.
190+
191+
1. Run one of the following commands, replacing the following placeholders with the appropriate values:
48192

49-
```
193+
- Replace `<host-path>` with the path to the directory containing your code, for example `/Users/<username>/my_example_code/`.
194+
- Replace `<container-path>` with the path to some directory within the container to mount `<host-path>` into, for example `/app/my_example_code/`. If
195+
`<container-path>` does not already exist, it will be created at the same time that the container is created.
196+
- Replace `<container-name>` with some name for your container, such as `unstructured_mount`.
197+
198+
For AMD64-based systems, run the following command:
50199

200+
```bash
201+
# The AMD64 platform is the default.
202+
docker run -dt -v <host-path>:<container-path> --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
51203

52-
## Interacting with Python Inside the Container
53-
Once inside the running Docker container, you can directly test the library using Python’s interactive mode:
204+
# Or, to explicitly specify the AMD64 platform:
205+
docker run -dt -v <host-path>:<container-path> --platform=linux/amd64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
206+
```
54207

55-
```go
56-
python3
208+
For ARM64-based systems (such as Apple Silicon), run the following command instead:
57209

58-
>>> from unstructured.partition.pdf import partition_pdf
59-
>>> elements = partition_pdf(filename="example-docs/pdf/layout-parser-paper-fast.pdf")
210+
```bash
211+
docker run -dt -v <host-path>:<container-path> --platform=linux/arm64 --name <container-name> downloads.unstructured.io/unstructured-io/unstructured:latest
212+
```
60213

61-
>>> from unstructured.partition.text import partition_text
62-
>>> elements = partition_text(filename="example-docs/fake-text.txt")
214+
2. Start a terminal session inside the container by running the following command, replacing `<container-name>` with the name of your container, such as `unstructured_mount`:
63215

64-
```
216+
```bash
217+
docker exec -it <container-name> bash
218+
```
219+
220+
3. Add `<container-path>` to the `PYTHONPATH` environment variable within the container by running the following commands,
221+
replacing `<container-path>` with the path to the target directory within the container:
222+
223+
```bash
224+
PYTHONPATH="${PYTHONPATH}:<container-path>"
225+
export PYTHONPATH
226+
```
227+
228+
4. Run Unstructured open source library calls, referencing your code from `<container-path>`.
229+
230+
For example, if you have a file named `main.py` in `<host-path>` that contains the four commands following `>>>` from the previous step,
231+
you can run it as follows, replacing `<container-path>` with the path to the target directory within the container:
232+
233+
```bash
234+
python <container-path>/main.py
235+
```
236+
237+
To print the contents of the JSON file to the terminal, run the following command:
238+
239+
```bash
240+
cat /app/example-docs/pdf/layout-parser-paper-output.json
241+
```
242+
243+
5. To exit the terminal session, run the following command, or press `Ctrl+D`:
244+
245+
```bash
246+
exit
247+
```
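
As a minimal sketch of the `main.py` file mentioned in step 4, the following script wraps the same two library calls shown in the previous step (`partition_pdf` and `elements_to_json`). The names `INPUT_PDF`, `output_path_for`, and `main` are illustrative and not part of the Unstructured library. The file-existence check and the import guard are assumptions, added only so that the script exits cleanly when it is run outside the container.

```python
# main.py: hypothetical example file placed in <host-path>.
from pathlib import Path

# Sample PDF that ships inside the container (path taken from the previous step).
INPUT_PDF = Path("/app/example-docs/pdf/layout-parser-paper.pdf")


def output_path_for(input_path: Path) -> Path:
    """Return a JSON path next to the input file, for example foo.pdf -> foo-output.json."""
    return input_path.with_name(f"{input_path.stem}-output.json")


def main() -> None:
    # Outside the container the sample file is absent, so stop early.
    if not INPUT_PDF.is_file():
        print(f"{INPUT_PDF} not found; run this script inside the container.")
        return
    try:
        from unstructured.partition.pdf import partition_pdf
        from unstructured.staging.base import elements_to_json
    except ImportError:
        print("The unstructured library is not installed in this environment.")
        return

    # Same calls as the interactive example, with paths passed as strings.
    elements = partition_pdf(filename=str(INPUT_PDF))
    elements_to_json(elements=elements, filename=str(output_path_for(INPUT_PDF)))


if __name__ == "__main__":
    main()
```

With the default `INPUT_PDF`, the output lands at `/app/example-docs/pdf/layout-parser-paper-output.json`, which is the file printed by the `cat` command in step 4.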
248+
</Step>
249+
<Step title="Stop running the container">
250+
If you do not need to keep running the container, you can stop it as follows:
251+
<Tabs>
252+
<Tab title="Docker Desktop UI">
253+
1. In the Docker Desktop UI, in the sidebar, click **Containers**.
254+
2. Next to your container, click the square (**Stop**) icon.
255+
</Tab>
256+
<Tab title="Docker CLI">
257+
Run the following command, replacing `<container-name>` with the name of your container, such as `unstructured` or `unstructured_mount`:
258+
259+
```bash
260+
docker stop <container-name>
261+
```
262+
</Tab>
263+
</Tabs>
264+
</Step>
265+
</Steps>
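
The `docker run` commands in the steps above differ only in the `--platform` value and the optional `-v` mount. The following sketch shows one way to assemble those arguments from a machine architecture name. The helper names `docker_platform` and `docker_run_args` are hypothetical and are not part of Docker or the Unstructured library. The mapping from architecture names to platform values is an assumption that covers only the two platforms named in this page.

```python
from __future__ import annotations

# Image name taken from the steps above.
IMAGE = "downloads.unstructured.io/unstructured-io/unstructured:latest"


def docker_platform(machine: str) -> str:
    """Map a machine architecture name to a Docker --platform value."""
    arch = machine.lower()
    if arch in ("arm64", "aarch64"):
        return "linux/arm64"
    if arch in ("x86_64", "amd64"):
        return "linux/amd64"
    raise ValueError(f"Unsupported architecture: {machine}")


def docker_run_args(
    container_name: str,
    machine: str,
    host_path: str | None = None,
    container_path: str | None = None,
) -> list[str]:
    """Build the argument list for the `docker run` commands shown above."""
    args = ["docker", "run", "-dt"]
    # Add the bind mount only when both paths are given.
    if host_path and container_path:
        args += ["-v", f"{host_path}:{container_path}"]
    args += [f"--platform={docker_platform(machine)}", "--name", container_name, IMAGE]
    return args


# Example: print the command for an ARM64 machine with no mount.
print(" ".join(docker_run_args("unstructured", "arm64")))
```

To target the current machine, you could pass the result of the standard library's `platform.machine()` as `machine`, and hand the returned list to `subprocess.run`.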
266+
267+
## Building your own Docker image
268+
269+
You can build your own Docker image instead of pulling the latest prebuilt image.
270+
If you only plan to parse a single type of data, you can accelerate the
271+
build process by excluding certain packages or requirements needed for other data types. Refer to the
272+
[Dockerfile](https://github.com/Unstructured-IO/unstructured/blob/main/Dockerfile) to determine which lines
273+
are necessary for your requirements.
274+
275+
```bash
276+
make docker-build
277+
278+
# Start a Bash shell inside of the running Docker container.
279+
make docker-start-bash
280+
```
