Commit 7472e1b

docs: add a quick start page to the readme and docs (#240)
* added quick start section to the readme
* added quick start to docs
* parenthetical on extra deps
* typo
* fix typo
* fixed mixed tabs/spaces
1 parent 601f250 commit 7472e1b

2 files changed: 77 additions, 7 deletions

README.md

Lines changed: 40 additions & 3 deletions
@@ -49,11 +49,48 @@ about. Bricks in the library fall into three categories:
 - :performing_arts: ***Staging bricks*** that format data for downstream tasks, such as ML inference
   and data labeling.
 <br></br>
-## :eight_pointed_black_star: Installation
+## :eight_pointed_black_star: Quick Start
+
+Use the following instructions to get up and running with `unstructured` and test your
+installation.
+
+- Install the Python SDK with `pip install unstructured[local-inference]`
+  - If you do not need to process PDFs or images, you can run `pip install unstructured`
+- Install the following system dependencies if they are not already available on your system.
+  Depending on what document types you're parsing, you may not need all of these.
+  - `libmagic-dev` (filetype detection)
+  - `poppler-utils` (images and PDFs)
+  - `tesseract-ocr` (images and PDFs)
+  - `libreoffice` (MS Office docs)
+- Run the following to install NLTK dependencies. `unstructured` will handle this automatically
+  soon.
+  - `python -c "import nltk; nltk.download('punkt')"`
+  - `python -c "import nltk; nltk.download('averaged_perceptron_tagger')"`
+- If you are parsing PDFs, run the following to install the `detectron2` model, which
+  `unstructured` uses for layout detection:
+  - `pip install "detectron2@git+https://github.com/facebookresearch/[email protected]#egg=detectron2"`
+
+At this point, you should be able to run the following code:

-To install the library, run `pip install unstructured`.
+```python
+from unstructured.partition.auto import partition
+
+elements = partition(filename="example-docs/fake-email.eml")
+```
+
+And if you installed with `local-inference`, you should be able to run this as well:
+
+```python
+from unstructured.partition.auto import partition
+
+elements = partition("example-docs/layout-parser-paper.pdf")
+```
+
+
+## :coffee: Installation Instructions for Local Development

-## :coffee: Getting Started
+The following instructions are intended to help you get up and running with `unstructured`
+locally if you are planning to contribute to the project.

 * Using `pyenv` to manage virtualenv's is recommended but not necessary
 * Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.
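
Note on the README hunk above: the quick start asks readers to install system dependencies "if they are not already available", but does not say how to check. The following is a minimal sketch of one way to do that from Python, not part of the commit. The executable names (`pdftoppm`, `tesseract`, `soffice`) are assumptions about what these packages typically install on Debian/Ubuntu, and the `libmagic` lookup only detects the shared library, not the `-dev` headers.

```python
import ctypes.util
import shutil

# Assumed mapping from the packages named in the quick start to an
# executable each one typically puts on PATH (Debian/Ubuntu naming).
EXECUTABLES = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
}


def check_system_dependencies():
    """Return a dict mapping each package name to True if it appears to be installed."""
    status = {pkg: shutil.which(exe) is not None for pkg, exe in EXECUTABLES.items()}
    # libmagic is a shared library rather than a command, so look it up by name.
    status["libmagic-dev"] = ctypes.util.find_library("magic") is not None
    return status


if __name__ == "__main__":
    for package, found in check_system_dependencies().items():
        print(f"{package}: {'found' if found else 'missing'}")
```

A "missing" result only matters for the document types that need that package, as the quick start itself points out.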

docs/source/installing.rst

Lines changed: 37 additions & 4 deletions
@@ -1,10 +1,43 @@
 Installation
 ============

-You can install the library by cloning the repo and running ``make install`` from the
-root directory. Developers can run ``make install-local`` to install the dev and test
-requirements alongside the base requirements. If you want a minimal installation without any
-parser specific dependencies, run ``make install-base``.
+Quick Start
+-----------
+
+Use the following instructions to get up and running with ``unstructured`` and test your
+installation.
+
+* Install the Python SDK with ``pip install unstructured[local-inference]``
+    * If you do not need to process PDFs or images, you can run ``pip install unstructured``
+
+* Install the following system dependencies if they are not already available on your system. Depending on what document types you're parsing, you may not need all of these.
+    * ``libmagic-dev`` (filetype detection)
+    * ``poppler-utils`` (images and PDFs)
+    * ``tesseract-ocr`` (images and PDFs)
+    * ``libreoffice`` (MS Office docs)
+
+* Run the following to install NLTK dependencies. ``unstructured`` will handle this automatically soon.
+    * ``python -c "import nltk; nltk.download('punkt')"``
+    * ``python -c "import nltk; nltk.download('averaged_perceptron_tagger')"``
+
+* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
+    * ``pip install "detectron2@git+https://github.com/facebookresearch/[email protected]#egg=detectron2"``
+
+At this point, you should be able to run the following code:
+
+.. code:: python
+
+  from unstructured.partition.auto import partition
+
+  elements = partition(filename="example-docs/fake-email.eml")
+
+And if you installed with `local-inference`, you should be able to run this as well:
+
+.. code:: python
+
+  from unstructured.partition.auto import partition
+
+  elements = partition("example-docs/layout-parser-paper.pdf")


 Installation with ``conda`` on Windows
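
Note on the quick start snippets above: both stop at `elements = partition(...)` and do not show how to confirm the result looks sensible. The sketch below is one way to inspect it and is not part of the commit. It assumes that `partition` returns a list of element objects and that `str(element)` yields the element's text; the helper names `summarize_elements` and `preview` are hypothetical.

```python
from collections import Counter


def summarize_elements(elements):
    """Count elements by class name, e.g. how many titles vs. paragraphs."""
    return Counter(type(element).__name__ for element in elements)


def preview(elements, limit=5, width=80):
    """Return the first `limit` elements as strings truncated to `width` characters."""
    return [str(element)[:width] for element in elements[:limit]]


# Intended usage, assuming `unstructured` is installed and the example
# file from the quick start exists relative to the working directory:
#
#   from unstructured.partition.auto import partition
#   elements = partition(filename="example-docs/fake-email.eml")
#   print(summarize_elements(elements))
#   print("\n".join(preview(elements)))
```

An empty list or a summary with a single element type would suggest that a system dependency from the list above is missing for that document type.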
