Commit f2c30bf

simple extractor onbuild
1 parent ab8d37f commit f2c30bf

5 files changed: 41 additions, 28 deletions

README.md

Lines changed: 8 additions & 14 deletions
@@ -225,9 +225,9 @@ If you need any python packages installed you will need to create file called re
 the docker build process will use `pip install -r requirements.txt` to install these packages.
 
 ## SimpleExtractor
-Motivation: design and implement an simple extractor to bridge Python developer and knowledge of PyClowder library. It is inevitable for us to assume a learning curve for Python developer to be familiar with PyClowder api library and start to code extractor.
+Motivation: design and implement a simple extractor to bridge Python developer and knowledge of PyClowder library. It requires little effort for Python developers to wrap their python code into Clowder's extractors.
 
-Instead of making a complete extractor, SimpleExtractor provides a way to decouple the function of metadata computation from the full implementation of extractor.  Developers will just focus on their own extract function body. Meanwhile, simple extractor will take developer defined function as input to do extraction and then parse and organize the metadata output into Clowder defined metadata data-struct and submit back to Clowder.
+Simple extractors take developer defined main function as input parameter to do extraction and then parse and pack extraction's output into Simple extractor defined metadata data-struct and submit back to Clowder.
 
 Users' function must have to return a ``dict'' object containing metdata and previews.
 ```markdown
@@ -241,17 +241,12 @@ result = {
 ```
 
 ### Example:
-Extraction on single file is most common extractor type. In Clowder extractors' repositories, we would think 90% or more extractors are computing extraction on single file(e.g., wordcount, meangrey, ocr, etc.), which means user uploads a file onto Clowder, and then that uploaded file will be forwarded to an applicable extractor to compute metadata and extractor will post back Clowder the attached computing metadata and previews for this particular file.
+`wordcount-simpleextractor` is the simplest example to illustrate how to wrap existing Python code as a Simple Extractor.
 
-
-`wordcount-simpleextractor` is the simplest example to use SimpleExtractor. It consists of three files, which will be illustrated respectivaly.
-
-
-wordcount.py is regular python file which is defined and provided by users. In the code, wordcount invoke `wc` command to process input file and extractor lines, words, characters. It packs metadata into python dict.
+wordcount.py is regular python file which is defined and provided by Python developers. In the code, wordcount invoke `wc` command to process input file to extract lines, words, characters. It packs metadata into python dict.
 ```markdown
 import subprocess
 
-
 def wordcount(input_file):
     result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
     (lines, words, characters, _) = result.split()
@@ -266,13 +261,12 @@ def wordcount(input_file):
     return result
 ```
 
-To build wordcount simple extractor as docker image, we provide the template Dockerfile shown below. EXTRACTION_FUNC is environment variable and must be assigned to extraction function, where in wordcount.py, the function is `wordcount`.
+To build wordcount as a Simpel extractor docker image, users just simply assign two environment variables in Dockerfile shown below. EXTRACTION_FUNC is environment variable and has to be assigned as extraction function, where in wordcount.py, the extraction function is `wordcount`. Environment variable EXTRACTION_MODULE is the name of module file containing the definition of extraction function.
 ```markdown
-FROM clowder/pyclowder:onbuild
-
-ENV EXTRACTION_FUNC="wordcount"
+FROM clowder/extractors-simple-extractor:onbuild
 
-CMD python -c "from pyclowder.simpleextractor import SimpleExtractor; from wordcount import *; SimpleExtractor(${EXTRACTION_FUNC}).start()"
+ENV EXTRACTION_FUNC="wordcount"
+ENV EXTRACTION_MODULE="wordcount"
 ```
 
 
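The README hunks above elide the middle of `wordcount.py`, so the exact shape of the returned dict is not visible in this commit. As a rough illustration only, a function of the kind the README describes might look like the sketch below. It swaps the `wc` subprocess call for pure-Python counting so that it runs anywhere. The key names inside `"metadata"` and the `demo` helper are assumptions made for this sketch, not taken from the repository.

```python
import os
import tempfile


def wordcount(input_file):
    """Count lines, words, and bytes in a file, mirroring the default `wc` columns."""
    with open(input_file, "rb") as handle:
        data = handle.read()

    # Key names below are illustrative assumptions; the real wordcount.py
    # body is elided in the diff.
    metadata = {
        "lines": data.count(b"\n"),   # newline count, as `wc -l` reports
        "words": len(data.split()),   # whitespace-separated tokens
        "characters": len(data),      # byte count, as `wc -c` reports
    }
    # The README says the function returns a dict holding metadata and/or previews.
    return {"metadata": metadata}


def demo():
    """Hypothetical helper: run wordcount on a small temporary file."""
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"hello world\nfoo\n")
        path = tmp.name
    try:
        return wordcount(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

With the Dockerfile from this commit, such a file would be referenced by setting both `EXTRACTION_MODULE` and `EXTRACTION_FUNC` to `wordcount`.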

docker.sh

Lines changed: 2 additions & 2 deletions
@@ -10,8 +10,8 @@ export DEBUG=${DEBUG:-""}
 ${DEBUG} docker build --tag clowder/pyclowder:latest .
 ${DEBUG} docker build --tag clowder/pyclowder:onbuild --file Dockerfile.onbuild .
 ${DEBUG} docker build --tag clowder/extractors-binary-preview:onbuild sample-extractors/binary-preview
-${DEBUG} docker build --tag clowder/extractors-simple-extractor:latest sample-extractors/simple-extractor
+${DEBUG} docker build --tag clowder/extractors-simple-extractor:onbuild sample-extractors/simple-extractor
 
 # build sample extractors
 ${DEBUG} docker build --tag clowder/extractors-wordcount:latest sample-extractors/wordcount
-${DEBUG} docker build --tag clowder/extractors-wordcount-simpleextractor:latest sample-extractors/wordcount-simpleextractor
+${DEBUG} docker build --tag clowder/extractors-wordcount-simpleextractor:latest sample-extractors/wordcount-simple-extractor

sample-extractors/simple-extractor/Dockerfile

Lines changed: 18 additions & 1 deletion
@@ -3,4 +3,21 @@ FROM clowder/pyclowder:onbuild
 ENV EXTRACTION_FUNC=""
 ENV EXTRACTION_MODULE=""
 
-CMD python -c "from simple_extractor import SimpleExtractor; from ${EXTRACTION_MODULE} import *; SimpleExtractor(${EXTRACTION_FUNC}).start()"
+# install any packages
+ONBUILD COPY packages.* Dockerfile /home/clowder/
+ONBUILD RUN if [ -e packages.apt ]; then \
+apt-get -q -q update \
+&& xargs apt-get -y install --no-install-recommends < packages.apt \
+&& rm -rf /var/lib/apt/lists/*; \
+fi
+
+# install any python packages
+ONBUILD COPY requirements.txt* Dockerfile /home/clowder/
+ONBUILD RUN if [ -e requirements.txt ]; then \
+pip install --no-cache-dir -r requirements.txt; \
+fi
+
+# copy all files
+ONBUILD ADD . /home/clowder/
+
+CMD python -c "from simple_extractor import SimpleExtractor; from ${EXTRACTION_MODULE} import *; SimpleExtractor(${EXTRACTION_FUNC}).start()"
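The `CMD` line relies on the shell substituting `${EXTRACTION_MODULE}` and `${EXTRACTION_FUNC}` into a `python -c` string when the container starts. To make that lookup easier to follow, here is a minimal sketch of the same idea done with `importlib` instead of shell substitution. It is not code from this commit: `load_extraction_func` is a hypothetical helper, and the sketch stops short of constructing `SimpleExtractor`, whose implementation is not shown in the diff.

```python
import importlib
import os


def load_extraction_func(environ=None):
    """Resolve the callable named by EXTRACTION_MODULE and EXTRACTION_FUNC.

    Hypothetical helper for illustration; the image in this commit uses
    `from <module> import *` inside `python -c` instead.
    """
    environ = os.environ if environ is None else environ
    module_name = environ["EXTRACTION_MODULE"]  # e.g. "wordcount"
    func_name = environ["EXTRACTION_FUNC"]      # e.g. "wordcount"

    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError(f"{module_name}.{func_name} is not callable")
    return func


if __name__ == "__main__":
    # Standard-library names are used here only so the sketch runs on its own.
    dumps = load_extraction_func(
        {"EXTRACTION_MODULE": "json", "EXTRACTION_FUNC": "dumps"}
    )
    print(dumps({"ok": True}))
```

One consequence of either approach is that the module named in `EXTRACTION_MODULE` must be importable from the container's working directory, which is why the `ONBUILD ADD . /home/clowder/` step copies the developer's files into the image.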
Lines changed: 11 additions & 10 deletions
@@ -1,23 +1,23 @@
 # Simple Extractor
 
 The goal of the simple extractor is to make writing of an extractor as easy as possible. It wraps almost all of the
-complexities in itself and exposes only one environment variable called ```EXTRACTION_FUNC```. This environment
-variable needs to contain the name of the method that needs to be called when this extractor receives a message from
-the message broker.
+complexities in itself and exposes only two environment variables: ```EXTRACTION_FUNC``` and ```EXTRACTION_MODULE```. Environment
+variable ```EXTRACTION_FUNC``` is the name of the method that needs to be called when this extractor receives a message from
+the message broker. The other environment variable ```EXTRACTION_MODULE`` is the module name of python file where ```EXTRACTION_FUNC``` function has been declared.
 
 # When to Use This
 
 1. This simple extractor is meant to be used in those situations when there is already some Python code available that
 needs to be wrapped as an extractor as quickly as possible.
-2. This extractor ONLY generates JSON format metadata or a list of preview files. If your extractor generates
+2. This extractor CURRENTLY outputs JSON format metadata or a list of preview files. If your extractor generates
 any additional information like generated files, datasets, collections, thumbnails, etc., this method cannot be use and
-you need to write your extractor the normal way using [PyClowder2](https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/browse)
+you have to write your extractor the normal way using [PyClowder2](https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/browse)
 3. [Docker](https://www.docker.com/) is the recommended way of developing / wrapping your code using the Simple Extractor.
 
 ## Steps for Writing an Extractor Using the Simple Extractor
 
 To write an extractor using the Simple Extractor, you need to have your Python program available. The main function of
-this Python program needs to accept an input file path as its parameter. It needs to return a Python dictionary that
+this Python program is supposed to take an input file path as its parameter. It needs to return a Python dictionary that
 can contain either metadata information ("metadata"), details about file previews ("previews") or both. For example:
 
 ``` json
@@ -29,10 +29,11 @@ can contain either metadata information ("metadata"), details about file preview
 
 1. Let's call your main Python program file ```your_python_program.py``` and the main function ```your_main_function```.
 
-2. Let's create a Dockerfile for your extractor. Its contents need to be:
+2. Let's create a Dockerfile for your extractor. Dockerfile contents need to be:
+
+FROM clowder/extractors-simple-extractor:onbuild
+ENV EXTRACTION_FUNC="your_main_function"
+ENV EXTRACTION_MODULE="your_python_program"
 
-FROM clowder/extractors-simple-extractor:latest
-ENV EXTRACTION_FUNC="your_python_program.your_main_function"
 
-TODO: Complete this.
 

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1-
FROM clowder/extractors-simple-extractor:latest
1+
FROM clowder/extractors-simple-extractor:onbuild
22

33
ENV EXTRACTION_FUNC="wordcount"
4+
ENV EXTRACTION_MODULE="wordcount"
