Commit a955af2

Merge pull request #67 in CATS/pyclowder2 from BD-2097-simple-extractor-on-file-extraction-e.g-wordcount to develop
* commit '09b5a3e9cc0466862a23513591672fa497c8c2b1':
  Update sample-extractors/simple-extractor/README.md
  simple extractor onbuild
  Removed init.py
  Removed invalid code in Dockerfile.
  Update sample-extractors/simple-extractor/__init__.py
  Changed print statement to logger debug.
  Fixed Dockerfile.
  Updated gitignore
  1. Moved simple extractor out of pyclowder as part of sample extractors. 2. Started adding README for simple extractor 3. Updated docker file for simple extractor
  Removed extra blank line that was causing build failure.
  simple extractor on file with wordcount example
2 parents b528b56 + 09b5a3e commit a955af2

10 files changed, 196 additions and 1 deletion

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 .idea/*
+.DS_Store
 
 # Byte-compiled / optimized / DLL files
 __pycache__/

README.md

Lines changed: 47 additions & 0 deletions
@@ -223,3 +223,50 @@ the docker container.
 
 If you need any python packages installed you will need to create a file called requirements.txt. If this file exists
 the docker build process will use `pip install -r requirements.txt` to install these packages.
+
+## SimpleExtractor
+Motivation: the Simple Extractor lets Python developers wrap their existing Python code as a Clowder extractor with little effort and without detailed knowledge of the PyClowder library.
+
+A Simple Extractor takes a developer-defined main function as its input parameter, calls it to do the extraction, packs the function's output into the metadata structure defined by the Simple Extractor, and submits it back to Clowder.
+
+The user's function must return a `dict` object containing metadata, previews, or both.
+```markdown
+result = {
+    'metadata': {},
+    'previews': [
+        'filename',
+        {'file': 'filename'},
+        {'file': 'filename', 'metadata': {}, 'mimetype': 'image/jpeg'}
+    ]}
+```
+
+### Example:
+`wordcount-simpleextractor` is the simplest example to illustrate how to wrap existing Python code as a Simple Extractor.
+
+`wordcount.py` is a regular Python file written by the developer. Its `wordcount` function invokes the `wc` command on the input file to count lines, words, and characters, and packs the counts into a Python dict.
+```markdown
+import subprocess
+
+def wordcount(input_file):
+    result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
+    (lines, words, characters, _) = result.split()
+    metadata = {
+        'lines': lines,
+        'words': words,
+        'characters': characters
+    }
+    result = {
+        'metadata': metadata
+    }
+    return result
+```
+
+To build wordcount as a Simple Extractor Docker image, set two environment variables in the Dockerfile, as shown below. `EXTRACTION_FUNC` is the name of the extraction function; in wordcount.py this is `wordcount`. `EXTRACTION_MODULE` is the name of the module file that defines the extraction function.
+```markdown
+FROM clowder/extractors-simple-extractor:onbuild
+
+ENV EXTRACTION_FUNC="wordcount"
+ENV EXTRACTION_MODULE="wordcount"
+```
+
+
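
To illustrate the return format described in the README hunk above, here is a minimal sketch of an extraction function that returns metadata together with previews in each of the three listed forms. The function name `describe_file`, the preview file names, and the metadata key are hypothetical and not part of this commit.

```python
import os


def describe_file(input_file):
    """Hypothetical extraction function following the README's result format."""
    metadata = {
        'basename': os.path.basename(input_file),
    }
    previews = [
        'preview.png',                    # bare file name
        {'file': 'preview.png'},          # dict with only 'file'
        {'file': 'preview.jpg', 'metadata': {}, 'mimetype': 'image/jpeg'},
    ]
    return {'metadata': metadata, 'previews': previews}
```

Either key may be omitted; a function that only produces metadata can return `{'metadata': {...}}` on its own.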

docker.sh

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ export DEBUG=${DEBUG:-""}
 ${DEBUG} docker build --tag clowder/pyclowder:latest .
 ${DEBUG} docker build --tag clowder/pyclowder:onbuild --file Dockerfile.onbuild .
 ${DEBUG} docker build --tag clowder/extractors-binary-preview:onbuild sample-extractors/binary-preview
+${DEBUG} docker build --tag clowder/extractors-simple-extractor:onbuild sample-extractors/simple-extractor
 
 # build sample extractors
 ${DEBUG} docker build --tag clowder/extractors-wordcount:latest sample-extractors/wordcount
+${DEBUG} docker build --tag clowder/extractors-wordcount-simpleextractor:latest sample-extractors/wordcount-simple-extractor

sample-extractors/README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ This folder contains example extractors:
 
 * wordcount: A simple extractor that takes a text file and counts the number of characters, words and lines.
 * echo: A simple extractor that shows how check_message can be used to tell the extractor not to download the actual file.
-
+* wordcount-simpleextractor: the wordcount extractor implemented with the newly added simpleextractor module in Pyclowder2.
 Additional files in this folder are:
 
 * example.conf: an example of the ubuntu upstart file
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
+FROM clowder/pyclowder:onbuild
+
+ENV EXTRACTION_FUNC=""
+ENV EXTRACTION_MODULE=""
+
+# install any packages
+ONBUILD COPY packages.* Dockerfile /home/clowder/
+ONBUILD RUN if [ -e packages.apt ]; then \
+    apt-get -q -q update \
+    && xargs apt-get -y install --no-install-recommends < packages.apt \
+    && rm -rf /var/lib/apt/lists/*; \
+    fi
+
+# install any python packages
+ONBUILD COPY requirements.txt* Dockerfile /home/clowder/
+ONBUILD RUN if [ -e requirements.txt ]; then \
+    pip install --no-cache-dir -r requirements.txt; \
+    fi
+
+# copy all files
+ONBUILD ADD . /home/clowder/
+
+CMD python -c "from simple_extractor import SimpleExtractor; from ${EXTRACTION_MODULE} import *; SimpleExtractor(${EXTRACTION_FUNC}).start()"
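
The `CMD` above interpolates the two environment variables into a `python -c` string, so an empty or misspelled value only surfaces as a syntax or import error at container start. As a hedged alternative, the same lookup could be done in Python with `importlib`. This is a sketch only; `load_extraction_func` is a hypothetical helper and is not part of this commit.

```python
import importlib
import os


def load_extraction_func(environ=os.environ):
    """Resolve the extraction callable named by the two environment variables."""
    module_name = environ.get('EXTRACTION_MODULE', '')
    func_name = environ.get('EXTRACTION_FUNC', '')
    if not module_name or not func_name:
        raise ValueError('EXTRACTION_MODULE and EXTRACTION_FUNC must both be set')
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)  # raises AttributeError if the name is missing
    if not callable(func):
        raise TypeError('%s.%s is not callable' % (module_name, func_name))
    return func
```

Under this assumption the container entry point would pass the result to the commit's class, along the lines of `SimpleExtractor(load_extraction_func()).start()`.
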
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
+# Simple Extractor
+
+The goal of the simple extractor is to make writing an extractor as easy as possible. It wraps almost all of the
+complexities in itself and exposes only two environment variables: ```EXTRACTION_FUNC``` and ```EXTRACTION_MODULE```. Environment
+variable ```EXTRACTION_FUNC``` is the name of the method that needs to be called when this extractor receives a message from
+the message broker. The other environment variable ```EXTRACTION_MODULE``` is the name of the Python module in which the ```EXTRACTION_FUNC``` function is declared.
+
+# When to Use This
+
+1. This simple extractor is meant to be used in those situations when there is already some Python code available that
+needs to be wrapped as an extractor as quickly as possible.
+2. This extractor CURRENTLY outputs JSON format metadata or a list of preview files. If your extractor generates
+any additional information like generated files, datasets, collections, thumbnails, etc., this method cannot be used and
+you have to write your extractor the normal way using [PyClowder2](https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/browse).
+3. [Docker](https://www.docker.com/) is the recommended way of developing / wrapping your code using the Simple Extractor.
+
+## Steps for Writing an Extractor Using the Simple Extractor
+
+To write an extractor using the Simple Extractor, you need to have your Python program available. The main function of
+this Python program is supposed to take an input file path as its parameter. It needs to return a Python dictionary that
+can contain either metadata information ("metadata"), details about file previews ("previews") or both. For example:
+
+``` json
+{
+    "metadata": dict(),
+    "previews": array()
+}
+```
+
+1. Let's call your main Python program file ```your_python_program.py``` and the main function ```your_main_function```.
+
+2. Let's create a Dockerfile for your extractor. Dockerfile contents need to be:
+
+    FROM clowder/extractors-simple-extractor:onbuild
+    ENV EXTRACTION_FUNC="your_main_function"
+    ENV EXTRACTION_MODULE="your_python_program"
+
+
+
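
For the steps in this README, a minimal sketch of what `your_python_program.py` might contain is given below. The body is purely illustrative: reporting the file size and the `size_in_bytes` key are assumptions made for the example, and only the function name and the shape of the return value come from the README.

```python
import os


def your_main_function(input_file_path):
    """Hypothetical main function: report the size of the input file."""
    size_in_bytes = os.path.getsize(input_file_path)
    # Only "metadata" is returned here; "previews" is optional.
    return {'metadata': {'size_in_bytes': size_in_bytes}}
```
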
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+#!/usr/bin/env python
+
+import logging
+from pyclowder.extractors import Extractor
+import pyclowder.files
+
+
+class SimpleExtractor(Extractor):
+    def __init__(self, extraction):
+        Extractor.__init__(self)
+        self.extraction = extraction
+        self.setup()
+        # setup logging for the extractor
+        logging.getLogger('pyclowder').setLevel(logging.INFO)
+        self.logger = logging.getLogger('__main__')
+        self.logger.setLevel(logging.INFO)
+
+    def process_message(self, connector, host, secret_key, resource, parameters):
+        input_file = resource["local_paths"][0]
+        file_id = resource['id']
+        result = self.extraction(input_file)
+        if 'metadata' in result.keys():
+            metadata = self.get_metadata(result.get('metadata'), 'file', file_id, host)
+            self.logger.info("upload metadata")
+            self.logger.debug(metadata)
+            pyclowder.files.upload_metadata(connector, host, secret_key, file_id, metadata)
+        if 'previews' in result.keys():
+            self.logger.info("upload previews")
+            for preview in result['previews']:
+                if isinstance(preview, basestring):
+                    preview = {'file': preview}
+                else:
+                    continue
+                self.logger.info("upload preview")
+                pyclowder.files.upload_preview(connector, host, secret_key, file_id, preview.get('file'),
+                                               preview.get('metadata'), preview.get('mimetype'))
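
As the preview loop in `process_message` reads here, the `else: continue` branch appears to skip every preview that is not a plain string, so the dict forms listed in the README (`{'file': ...}` with optional `metadata` and `mimetype`) would never reach `upload_preview`. The sketch below shows one way the entries could be normalized before uploading. It assumes Python 3 (`str` in place of the committed `basestring`), and `normalize_previews` is a hypothetical helper, not part of pyclowder.

```python
def normalize_previews(previews):
    """Return a list of preview dicts, wrapping bare file names."""
    normalized = []
    for preview in previews:
        if isinstance(preview, str):
            preview = {'file': preview}
        elif not isinstance(preview, dict) or 'file' not in preview:
            continue  # skip entries that do not name a file
        normalized.append({
            'file': preview['file'],
            'metadata': preview.get('metadata'),
            'mimetype': preview.get('mimetype'),
        })
    return normalized
```

Each returned dict carries the three values that the committed code passes to `pyclowder.files.upload_preview`, with `None` where the caller gave nothing.
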
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+FROM clowder/extractors-simple-extractor:onbuild
+
+ENV EXTRACTION_FUNC="wordcount"
+ENV EXTRACTION_MODULE="wordcount"
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+{
+  "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
+  "name": "ncsa.wordcount",
+  "version": "1.0",
+  "description": "WordCount simple extractor. Counts the number of characters, words and lines in the text file that was uploaded.",
+  "author": "Bing Zhang <[email protected]>",
+  "contributors": [],
+  "contexts": [
+    {
+      "lines": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#lines",
+      "words": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#words",
+      "characters": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#characters"
+    }
+  ],
+  "repository": {
+    "repType": "git",
+    "repUrl": "https://opensource.ncsa.illinois.edu/stash/scm/cats/pyclowder.git"
+  },
+  "process": {
+    "file": [
+      "text/*",
+      "application/json"
+    ]
+  },
+  "external_services": [],
+  "dependencies": [],
+  "bibtex": []
+}
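
The `contexts` entry above declares the three keys that the wordcount function returns. A small sanity check along the following lines could catch a metadata key that has no declared context. This is a sketch under the assumption that every returned key should be declared; `undeclared_keys` is a hypothetical helper, and the JSON is an abbreviated copy of the file above.

```python
import json

# Abbreviated copy of the extractor description shown above.
EXTRACTOR_INFO = json.loads('''
{
  "name": "ncsa.wordcount",
  "contexts": [
    {
      "lines": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#lines",
      "words": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#words",
      "characters": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#characters"
    }
  ]
}
''')


def undeclared_keys(metadata, extractor_info):
    """Return the metadata keys that no inline context object declares."""
    declared = set()
    for context in extractor_info.get('contexts', []):
        if isinstance(context, dict):  # contexts may also be URL strings
            declared.update(context.keys())
    return sorted(set(metadata) - declared)
```
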
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+import subprocess
+
+
+def wordcount(input_file):
+    result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
+    (lines, words, characters, _) = result.split()
+    metadata = {
+        'lines': lines,
+        'words': words,
+        'characters': characters
+    }
+    result = {
+        'metadata': metadata
+    }
+    return result
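
Two properties of the committed `wordcount` are worth noting: the counts come straight from `result.split()` and are therefore strings (bytes under Python 3) rather than integers, and the four-way unpacking raises `ValueError` when the file name contains whitespace. The sketch below is a hedged pure-Python alternative that avoids the `wc` subprocess. `wordcount_pure` is a hypothetical name, and its word count only approximates `wc`, whose definition of whitespace depends on the locale.

```python
def wordcount_pure(input_file):
    """Count lines, words, and bytes of a file without calling wc."""
    with open(input_file, 'rb') as handle:
        data = handle.read()
    metadata = {
        'lines': data.count(b'\n'),   # wc counts newline characters
        'words': len(data.split()),   # whitespace-separated tokens
        'characters': len(data),      # wc prints a byte count by default
    }
    return {'metadata': metadata}
```

The return value has the same shape as the committed function, so it could be named in `EXTRACTION_FUNC` in the same way.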
