diff --git a/README.md b/README.md
index d6de584..c62a293 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@ [![Coverage Status](https://coveralls.io/repos/18F/doc_processing_toolkit/badge.png)](https://coveralls.io/r/18F/doc_processing_toolkit)
 ##### About
-Python library to extract text from any file type compatiable with [TIKA](http://tika.apache.org/). It defaults to OCR when text extraction of a PDF file fails.
+Python library to extract text from any file type compatiable with [Tika](http://tika.apache.org/). It defaults to OCR when text extraction of a PDF file fails.
 ##### Dependencies
 - [Apache Tika](http://tika.apache.org/)
@@ -13,14 +13,14 @@ Python library to extract text from any file type compatiable with [TIKA](http:/
 ##### Installation
 1. Download tika-server-1.7.jar from [Apache Tika](http://www.apache.org/dyn/closer.cgi/tika/tika-server-1.7.jar)
-2. Mac: `brew install ghostscripts` Ubuntu: `sudo apt-get install ghostscript`
+2. Mac: `brew install ghostscript` Ubuntu: `sudo apt-get install ghostscript`
 3. Mac: `brew install tesseract` Ubuntu: `sudo apt-get install tesseract-ocr`
 4. Mac: `brew tap homebrew/x11` and `brew install xpdf` Ubuntu: `sudo apt-get install poppler-utils`
 5. Install Python dependencies with `pip install -r requirements.txt`
 ##### Usage
 These script assume that an instance of Tika server is running.
-Starting Tika Servers
+Starting Tika Server
 `java -jar tika-server-1.7.jar --port 9998`
 In Python script