Commit 63fb899

Merge branch 'master' of github.com:chrismattmann/tika-python into kill-server
2 parents 431f024 + 2cfe0de commit 63fb899

8 files changed, +155 -45 lines changed

README.md

Lines changed: 18 additions & 11 deletions
@@ -5,9 +5,9 @@ tika-python
 ===========
 A Python port of the [Apache Tika](http://tika.apache.org/)
 library that makes Tika available using the
-[Tika REST Server](http://wiki.apache.org/tika/TikaJAXRS).
+[Tika REST Server](http://wiki.apache.org/tika/TikaJAXRS).

-This makes Apache Tika available as a Python library,
+This makes Apache Tika available as a Python library,
 installable via Setuptools, Pip and Easy Install.

 To use this library, you need to have Java 7+ installed on your
@@ -22,8 +22,14 @@ Installation (with pip)

 Installation (without pip)
 --------------------------
-1. `python setup.py build`
-2. `python setup.py install`
+1. `python setup.py build`
+2. `python setup.py install`
+
+Airgap Environment Setup
+------------------------
+To run in a disconnected environment, download a Tika server JAR and set the `TIKA_SERVER_JAR` environment variable to `file:///<yourpath>/tika-server.jar`. This tells `python-tika` to "download" the file from that local path, place it at `/tmp/tika-server.jar`, and run it as a background process.
+
+This is the only way to run `python-tika` without internet access. If the variable is not set, the default behavior is to check the Tika version and pull the latest server JAR from Apache every time.

 Environment Variables
 ---------------------
@@ -58,11 +64,11 @@ print(parsed["content"])

 Parser Interface
 ----------------------
-The parser interface extracts text and metadata using the /rmeta
+The parser interface extracts text and metadata using the /rmeta
 interface. This is one of the better ways to get the internal XHTML
 content extracted.

-Note:
+Note:
 ![Alert Icon](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon28.png "Alert")
 The parser interface needs the following environment variable set on the console for printing of the extracted content.
 ```export PYTHONIOENCODING=utf8```
@@ -85,7 +91,7 @@ Specify Output Format To XHTML
 ---------------------
 The parser interface is optionally able to output the content as XHTML rather than plain text.

-Note:
+Note:
 ![Alert Icon](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon28.png "Alert")
 The parser interface needs the following environment variable set on the console for printing of the extracted content.
 ```export PYTHONIOENCODING=utf8```
@@ -129,7 +135,7 @@ print(detector.from_file('/path/to/file'))
 Config Interface
 ----------------------
 The config interface allows you to inspect the Tika Server environment's
-configuration including what parsers, mime types, and detectors the
+configuration including what parsers, mime types, and detectors the
 server has been configured with.

 ```
@@ -143,7 +149,7 @@ print(config.getDetectors())

 Language Detection Interface
 ---------------------------------
-The language detection interface provides a 2 character language
+The language detection interface provides a 2 character language
 code based on the text in the provided file.

 ```
@@ -187,10 +193,10 @@ Changing the Tika Classpath
 ---------------------------
 You can update the classpath that Tika server uses by
 setting the classpath as a set of ':' delimited strings.
-For example if you want to get Tika-Python working with
+For example if you want to get Tika-Python working with
 [GeoTopicParsing](http://wiki.apache.org/tika/GeoTopicParser),
 you can do this, replace paths below with your own paths, as
-identified [here](http://wiki.apache.org/tika/GeoTopicParser)
+identified [here](http://wiki.apache.org/tika/GeoTopicParser)
 and make sure that you have done this:

 kill Tika server (if already running):
@@ -294,6 +300,7 @@ Contributors
 * Igor Tokarev, Freelance
 * Imraan Parker, Freelance
 * Annie K. Didier, JPL
+* Juan Elosua, TEGRA Cybersecurity Center

 Thanks
 ======
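
The air-gapped setup described in the README change can be sketched in Python. This is a minimal illustration, not part of the commit: `configure_airgapped_tika` is a hypothetical helper and `tika-server.jar` is a placeholder filename. It assumes `TIKA_SERVER_JAR` is read when `tika` is first imported, so the variable is set before any import.

```python
import os
from pathlib import Path


def configure_airgapped_tika(jar_path):
    """Hypothetical helper: point tika-python at a local server jar.

    Builds a file:// URI from a local path and stores it in the
    TIKA_SERVER_JAR environment variable named in the README.
    """
    # abspath makes the path absolute without requiring the file to exist yet.
    uri = Path(os.path.abspath(jar_path)).as_uri()
    os.environ["TIKA_SERVER_JAR"] = uri
    return uri


# Placeholder path; replace with the location of the downloaded jar.
uri = configure_airgapped_tika("tika-server.jar")

# Assumption: import tika only after the variable is set, for example:
# from tika import parser
```

Building the URI with `Path.as_uri()` avoids hand-writing the `file:///` prefix and percent-encodes characters such as spaces in the path.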

setup.py

Lines changed: 1 addition & 0 deletions
@@ -20,6 +20,7 @@

 import os.path
 import tika
+from io import open

 try:
     from ez_setup import use_setuptools

tika/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-__version__ = "1.22"
+__version__ = "1.23"

 try:
     __import__('pkg_resources').declare_namespace(__name__)

tika/parser.py

Lines changed: 29 additions & 15 deletions
@@ -20,10 +20,10 @@
 import os
 import json

-def from_file(filename, serverEndpoint=ServerEndpoint, xmlContent=False, headers=None, config_path=None, requestOptions={}):
+def from_file(filename, service='all', serverEndpoint=ServerEndpoint, xmlContent=False, headers=None, config_path=None, requestOptions={}):
     '''
     Parses a file for metadata and content
-    :param filename: path to file which needs to be parsed
+    :param filename: path to the file to parse, or a binary file object opened with open(path, 'rb')
     :param serverEndpoint: Server endpoint url
     :param xmlContent: Whether or not XML content be requested.
                     Default is 'False', which results in text content.
@@ -33,11 +33,11 @@ def from_file(filename, serverEndpoint=ServerEndpoint, xmlContent=False, headers
             'content' has a str value and metadata has a dict type value.
     '''
     if not xmlContent:
-        jsonOutput = parse1('all', filename, serverEndpoint, headers=headers, config_path=config_path, requestOptions=requestOptions)
+        output = parse1(service, filename, serverEndpoint, headers=headers, config_path=config_path, requestOptions=requestOptions)
     else:
-        jsonOutput = parse1('all', filename, serverEndpoint, services={'meta': '/meta', 'text': '/tika', 'all': '/rmeta/xml'},
+        output = parse1(service, filename, serverEndpoint, services={'meta': '/meta', 'text': '/tika', 'all': '/rmeta/xml'},
                             headers=headers, config_path=config_path, requestOptions=requestOptions)
-    return _parse(jsonOutput)
+    return _parse(output, service)


 def from_buffer(string, serverEndpoint=ServerEndpoint, xmlContent=False, headers=None, config_path=None, requestOptions={}):
@@ -61,20 +61,35 @@ def from_buffer(string, serverEndpoint=ServerEndpoint, xmlContent=False, headers

     return _parse((status,response))

-def _parse(jsonOutput):
+def _parse(output, service='all'):
     '''
-    Parses JSON response from Tika REST API server
-    :param jsonOutput: JSON output from Tika Server
+    Parses response from Tika REST API server
+    :param output: output from Tika Server
+    :param service: service requested from the tika server
+                    Default is 'all', which results in recursive text content+metadata.
+                    'meta' returns only metadata
+                    'text' returns only content
     :return: a dictionary having 'metadata' and 'content' values
     '''
-    parsed={}
-    if not jsonOutput:
+    parsed={'metadata': None, 'content': None}
+    if not output:
         return parsed
-
-    parsed["status"] = jsonOutput[0]
-    if jsonOutput[1] == None or jsonOutput[1] == "":
+
+    parsed["status"] = output[0]
+    if output[1] == None or output[1] == "":
+        return parsed
+
+    if service == "text":
+        parsed["content"] = output[1]
+        return parsed
+
+    realJson = json.loads(output[1])
+
+    parsed["metadata"] = {}
+    if service == "meta":
+        for key in realJson:
+            parsed["metadata"][key] = realJson[key]
         return parsed
-    realJson = json.loads(jsonOutput[1])

     content = ""
     for js in realJson:
@@ -85,7 +100,6 @@ def _parse(jsonOutput):
         content = None

     parsed["content"] = content
-    parsed["metadata"] = {}

     for js in realJson:
         for n in js:
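
The service-dependent branching that this change adds to `_parse` can be sketched as a standalone function. This is a simplified illustration rather than the library's code: `parse_response` is a hypothetical name, and the `'all'` branch assumes the server returns a JSON list of documents whose extracted text sits under an `X-TIKA:content` key, which the diff only shows in part.

```python
import json


def parse_response(output, service="all"):
    """Hypothetical, simplified stand-in for the patched _parse.

    `output` is a (status, body) tuple as returned by the server call.
    """
    parsed = {"metadata": None, "content": None}
    if not output:
        return parsed

    parsed["status"] = output[0]
    body = output[1]
    if body is None or body == "":
        return parsed

    # 'text' returns the plain body with no JSON decoding at all.
    if service == "text":
        parsed["content"] = body
        return parsed

    data = json.loads(body)
    parsed["metadata"] = {}

    # 'meta' returns a single JSON object of metadata keys.
    if service == "meta":
        parsed["metadata"].update(data)
        return parsed

    # Anything else is treated as 'all'. Assumption: a list of documents,
    # each carrying its text under "X-TIKA:content".
    parts = []
    for doc in data:
        for key, value in doc.items():
            if key == "X-TIKA:content":
                parts.append(value)
            else:
                parsed["metadata"][key] = value
    parsed["content"] = "".join(parts) or None
    return parsed
```

Initialising both keys to `None` up front is what lets callers write `result['metadata'] is None` for the `'text'` service without a `KeyError`.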

tika/tests/files/rwservlet.pdf

34.4 KB
Binary file not shown.

tika/tests/test_from_file_service.py

Lines changed: 59 additions & 0 deletions
1+
#!/usr/bin/env python
2+
# encoding: utf-8
3+
# Licensed to the Apache Software Foundation (ASF) under one or more
4+
# contributor license agreements. See the NOTICE file distributed with
5+
# this work for additional information regarding copyright ownership.
6+
# The ASF licenses this file to You under the Apache License, Version 2.0
7+
# (the "License"); you may not use this file except in compliance with
8+
# the License. You may obtain a copy of the License at
9+
#
10+
# http://www.apache.org/licenses/LICENSE-2.0
11+
#
12+
# Unless required by applicable law or agreed to in writing, software
13+
# distributed under the License is distributed on an "AS IS" BASIS,
14+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15+
# See the License for the specific language governing permissions and
16+
# limitations under the License.
17+
#
18+
# python -m unittest tika.tests.test_from_file_service
19+
20+
import unittest
21+
import tika.parser
22+
23+
24+
class CreateTest(unittest.TestCase):
25+
'test different services in from_file parsing: Content, Metadata or both in recursive mode'
26+
27+
def test_default_service(self):
28+
'parse file using default service'
29+
result = tika.parser.from_file(
30+
'https://boe.es/boe/dias/2019/12/02/pdfs/BOE-A-2019-17288.pdf')
31+
self.assertEqual(result['metadata']['Content-Type'],'application/pdf')
32+
self.assertIn('AUTORIDADES Y PERSONAL',result['content'])
33+
def test_default_service_explicit(self):
34+
'parse file using default service explicitly'
35+
result = tika.parser.from_file(
36+
'https://boe.es/boe/dias/2019/12/02/pdfs/BOE-A-2019-17288.pdf', service='all')
37+
self.assertEqual(result['metadata']['Content-Type'],'application/pdf')
38+
self.assertIn('AUTORIDADES Y PERSONAL',result['content'])
39+
def test_text_service(self):
40+
'parse file using the content only service'
41+
result = tika.parser.from_file(
42+
'https://boe.es/boe/dias/2019/12/02/pdfs/BOE-A-2019-17288.pdf', service='text')
43+
self.assertIsNone(result['metadata'])
44+
self.assertIn('AUTORIDADES Y PERSONAL',result['content'])
45+
def test_meta_service(self):
46+
'parse file using the content only service'
47+
result = tika.parser.from_file(
48+
'https://boe.es/boe/dias/2019/12/02/pdfs/BOE-A-2019-17288.pdf', service='meta')
49+
self.assertIsNone(result['content'])
50+
self.assertEqual(result['metadata']['Content-Type'],'application/pdf')
51+
def test_invalid_service(self):
52+
'parse file using an invalid service should perform the default parsing'
53+
result = tika.parser.from_file(
54+
'https://boe.es/boe/dias/2019/12/02/pdfs/BOE-A-2019-17288.pdf', service='bad')
55+
self.assertEqual(result['metadata']['Content-Type'],'application/pdf')
56+
self.assertIn('AUTORIDADES Y PERSONAL',result['content'])
57+
58+
if __name__ == '__main__':
59+
unittest.main()
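
The behaviour that `test_invalid_service` relies on comes from the lookup `services.get(option, services['all'])` in `tika/tika.py`: an unknown service name falls back to the `'all'` endpoint. A minimal sketch of that fallback follows. `resolve_service` is a hypothetical name, and the endpoint mapping is the XHTML variant that `from_file` passes in this diff, used here purely for illustration.

```python
# Endpoint mapping as passed by from_file when xmlContent=True in this diff.
SERVICES = {"meta": "/meta", "text": "/tika", "all": "/rmeta/xml"}


def resolve_service(option, services=SERVICES):
    """Hypothetical helper: map a service name to an endpoint path.

    Unknown names fall back to the 'all' endpoint, mirroring the
    dict.get(option, default) lookup shown in tika/tika.py.
    """
    return services.get(option, services["all"])


endpoint_for_bad = resolve_service("bad")
endpoint_for_text = resolve_service("text")
```

Because `_parse` only special-cases `'text'` and `'meta'`, a name such as `'bad'` also takes the default branch there, so both sides of the call agree on the fallback.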

tika/tests/test_tika.py

Lines changed: 22 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -15,32 +15,45 @@
1515
# See the License for the specific language governing permissions and
1616
# limitations under the License.
1717
#
18-
#python -m unittest tests.tests
19-
18+
# python -m unittest tests.tests
19+
import os
2020
import unittest
2121
import tika.parser
2222

2323

2424
class CreateTest(unittest.TestCase):
25-
"test for file types"
25+
"""test for file types"""
2626

2727
def test_remote_pdf(self):
28-
'parse remote PDF'
28+
"""parse remote PDF"""
2929
self.assertTrue(tika.parser.from_file(
3030
'http://appsrv.achd.net/reports/rwservlet?food_rep_insp&P_ENCOUNTER=201504160015'))
31+
3132
def test_remote_html(self):
32-
'parse remote HTML'
33-
self.assertTrue(tika.parser.from_file(
34-
'http://neverssl.com/index.html'))
33+
"""parse remote HTML"""
34+
self.assertTrue(tika.parser.from_file('http://neverssl.com/index.html'))
35+
3536
def test_remote_mp3(self):
36-
'parese remote mp3'
37+
"""parse remote mp3"""
3738
self.assertTrue(tika.parser.from_file(
3839
'https://archive.org/download/Ainst-Spaceshipdemo.mp3/Ainst-Spaceshipdemo.mp3'))
40+
3941
def test_remote_jpg(self):
40-
'parse remote jpg'
42+
"""parse remote jpg"""
4143
self.assertTrue(tika.parser.from_file(
4244
'https://www.nasa.gov/sites/default/files/thumbnails/image/j2m-shareable.jpg'))
4345

46+
def test_local_binary(self):
47+
"""parse file binary"""
48+
file = os.path.join(os.path.dirname(__file__), 'files', 'rwservlet.pdf')
49+
with open(file, 'rb') as file_obj:
50+
self.assertTrue(tika.parser.from_file(file_obj))
51+
52+
def test_local_path(self):
53+
"""parse file path"""
54+
file = os.path.join(os.path.dirname(__file__), 'files', 'rwservlet.pdf')
55+
self.assertTrue(tika.parser.from_file(file))
56+
4457

4558
if __name__ == '__main__':
4659
unittest.main()

tika/tika.py

Lines changed: 25 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,7 @@
5858
detected = detector.from_buffer('some buffered content', config_path='/path/to/configfile')
5959
6060
'''
61+
import types
6162

6263
USAGE = """
6364
tika.py [-v] [-e] [-o <outputDir>] [--server <TikaServerEndpoint>] [--install <UrlToTikaServerJar>] [--port <portNumber>] <command> <option> <urlOrPathToFile>
@@ -141,6 +142,7 @@ def make_content_disposition_header(fn):
141142
from os import walk
142143
import signal
143144
import logging
145+
import io
144146

145147
log_path = os.getenv('TIKA_LOG_PATH', tempfile.gettempdir())
146148
log_file = os.path.join(log_path, 'tika.log')
@@ -162,7 +164,7 @@ def make_content_disposition_header(fn):
162164
log.setLevel(logging.INFO)
163165

164166
Windows = True if platform.system() == "Windows" else False
165-
TikaVersion = os.getenv('TIKA_VERSION', '1.22')
167+
TikaVersion = os.getenv('TIKA_VERSION', '1.23')
166168
TikaJarPath = os.getenv('TIKA_PATH', tempfile.gettempdir())
167169
TikaFilesPath = tempfile.gettempdir()
168170
TikaServerLogFilePath = log_path
@@ -328,9 +330,11 @@ def parse1(option, urlOrPath, serverEndpoint=ServerEndpoint, verbose=Verbose, ti
328330
log.warning('config option must be one of meta, text, or all; using all.')
329331
service = services.get(option, services['all'])
330332
if service == '/tika': responseMimeType = 'text/plain'
331-
headers.update({'Accept': responseMimeType, 'Content-Disposition': make_content_disposition_header(path)})
332-
status, response = callServer('put', serverEndpoint, service, open(path, 'rb'),
333-
headers, verbose, tikaServerJar, config_path=config_path, rawResponse=rawResponse, requestOptions=requestOptions)
333+
headers.update({'Accept': responseMimeType, 'Content-Disposition': make_content_disposition_header(path.encode('utf-8') if type(path) is unicode_string else path)})
334+
with urlOrPath if _is_file_object(urlOrPath) else open(path, 'rb') as f:
335+
status, response = callServer('put', serverEndpoint, service, f,
336+
headers, verbose, tikaServerJar, config_path=config_path,
337+
rawResponse=rawResponse, requestOptions=requestOptions)
334338

335339
if file_type == 'remote': os.unlink(path)
336340
return (status, response)
@@ -547,7 +551,6 @@ def callServer(verb, serverEndpoint, service, data, headers, verbose=Verbose, ti
547551
effectiveRequestOptions.update(requestOptions)
548552

549553
resp = verbFn(serviceUrl, encodedData, **effectiveRequestOptions)
550-
encodedData.close() # closes the file reading data
551554

552555
if verbose:
553556
print(sys.stderr, "Request headers: ", headers)
@@ -701,14 +704,26 @@ def toFilename(url):
701704
value = re.sub(r'[^\w\s\.\-]', '-', path).strip().lower()
702705
return re.sub(r'[-\s]+', '-', value).strip("-")[-200:]
703706

704-
707+
708+
def _is_file_object(f):
709+
try:
710+
file_types = (types.FileType, io.IOBase)
711+
except AttributeError:
712+
file_types = (io.IOBase,)
713+
714+
return isinstance(f, file_types)
715+
705716
def getRemoteFile(urlOrPath, destPath):
706717
'''
707718
Fetches URL to local path or just returns absolute path.
708719
:param urlOrPath: resource locator, generally URL or path
709720
:param destPath: path to store the resource, usually a path on file system
710-
:return: tuple having (path, 'local'/'remote')
721+
:return: tuple having (path, 'local'/'remote'/'binary')
711722
'''
723+
# handle binary stream input
724+
if _is_file_object(urlOrPath):
725+
return (urlOrPath.name, 'binary')
726+
712727
urlp = urlparse(urlOrPath)
713728
if urlp.scheme == '':
714729
return (os.path.abspath(urlOrPath), 'local')
@@ -774,8 +789,6 @@ def checkPortIsOpen(remoteServerHost=ServerHost, port = Port):
774789
return True
775790
else :
776791
return False
777-
sock.close()
778-
#FIXME: the above line is unreachable
779792

780793
except KeyboardInterrupt:
781794
print("You pressed Ctrl+C")
@@ -789,6 +802,9 @@ def checkPortIsOpen(remoteServerHost=ServerHost, port = Port):
789802
print("Couldn't connect to server")
790803
sys.exit()
791804

805+
finally:
806+
sock.close()
807+
792808
def main(argv=None):
793809
"""Run Tika from command line according to USAGE."""
794810
global Verbose
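
The file-object handling added to `parse1` and `getRemoteFile` can be tried in isolation. The sketch below is a Python 3 only illustration under stated assumptions: `is_file_object` and `read_source` are hypothetical names, the Python 2 `types.FileType` fallback from the diff is left out, and a temporary file stands in for a real document.

```python
import io
import os
import tempfile


def is_file_object(obj):
    """Return True for open file objects and in-memory streams (Python 3)."""
    return isinstance(obj, io.IOBase)


def read_source(path_or_file):
    """Read bytes from either a path or an already-open binary file object."""
    # Same conditional context manager shape as the diff: reuse the caller's
    # file object if one was given, otherwise open the path ourselves.
    with path_or_file if is_file_object(path_or_file) else open(path_or_file, "rb") as f:
        return f.read()


# A temporary file stands in for a document to parse.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"sample bytes")
    sample_path = tmp.name

from_path = read_source(sample_path)

handle = open(sample_path, "rb")
from_handle = read_source(handle)
handle_closed_after = handle.closed  # the with block closed the caller's handle

os.unlink(sample_path)
```

One consequence of using the caller's object as the context manager is that it is closed when the block exits, so a file object passed in this way cannot be read again afterwards.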
