README.md: 8 additions & 14 deletions
@@ -225,9 +225,9 @@ If you need any python packages installed you will need to create file called re
225
225
the docker build process will use `pip install -r requirements.txt` to install these packages.
226
226
227
227
## SimpleExtractor
228
-
Motivation: design and implement an simple extractor to bridge Python developer and knowledge of PyClowder library. It is inevitable for us to assume a learning curve for Python developer to be familiar with PyClowder api library and start to code extractor.
228
+
Motivation: design and implement a simple extractor that bridges the gap between Python developers and the PyClowder library, so that developers can wrap their Python code as Clowder extractors with little effort.
229
229
230
-
Instead of making a complete extractor, SimpleExtractor provides a way to decouple the function of metadata computation from the full implementation of extractor. Developers will just focus on their own extract function body. Meanwhile, simple extractor will take developer defined function as input to do extraction and then parse and organize the metadata output into Clowder defined metadata data-struct and submit back to Clowder.
230
+
A Simple Extractor takes a developer-defined main function as its input parameter, runs the extraction, then parses and packs the extraction output into the metadata data structure defined by the Simple Extractor and submits it back to Clowder.
231
231
232
232
The user's function must return a `dict` object containing metadata and previews.
233
233
```markdown
@@ -241,17 +241,12 @@ result = {
241
241
```
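The return contract described above can be checked before a function is wrapped. This is a minimal sketch, assuming only what the text states (a dict with `metadata` and/or `previews` keys). The helper name `check_result` is hypothetical and not part of PyClowder, and the type checks (metadata as a dict, previews as a list) are assumptions for illustration.

```python
def check_result(result):
    """Return result unchanged if it has the expected shape, otherwise raise."""
    allowed = {"metadata", "previews"}
    if not isinstance(result, dict):
        raise TypeError("extraction function must return a dict")
    unknown = set(result) - allowed
    if unknown:
        raise ValueError("unexpected keys: %s" % sorted(unknown))
    if not result:
        raise ValueError("result must contain 'metadata', 'previews', or both")
    # Assumption: metadata is a JSON-style dict and previews is a list of file paths.
    if "metadata" in result and not isinstance(result["metadata"], dict):
        raise TypeError("'metadata' must be a dict")
    if "previews" in result and not isinstance(result["previews"], list):
        raise TypeError("'previews' must be a list")
    return result
```

Calling `check_result(your_function(path))` in a local test catches shape problems before the code is built into a Docker image.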
242
242
243
243
### Example:
244
-
Extraction on single file is most common extractor type. In Clowder extractors' repositories, we would think 90% or more extractors are computing extraction on single file(e.g., wordcount, meangrey, ocr, etc.), which means user uploads a file onto Clowder, and then that uploaded file will be forwarded to an applicable extractor to compute metadata and extractor will post back Clowder the attached computing metadata and previews for this particular file.
244
+
`wordcount-simpleextractor` is the simplest example illustrating how to wrap existing Python code as a Simple Extractor.
245
245
246
-
247
-
`wordcount-simpleextractor` is the simplest example to use SimpleExtractor. It consists of three files, which will be illustrated respectivaly.
248
-
249
-
250
-
wordcount.py is regular python file which is defined and provided by users. In the code, wordcount invoke `wc` command to process input file and extractor lines, words, characters. It packs metadata into python dict.
246
+
`wordcount.py` is a regular Python file written and provided by the Python developer. In the code, `wordcount` invokes the `wc` command on the input file to extract the number of lines, words, and characters, and packs this metadata into a Python dict.
251
247
```markdown
252
248
import subprocess
253
249
254
-
255
250
def wordcount(input_file):
256
251
    result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
257
252
    (lines, words, characters, _) = result.split()
@@ -266,13 +261,12 @@ def wordcount(input_file):
266
261
    return result
267
262
```
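The diff hunks above elide the middle of the function. As a self-contained sketch of the same idea under Python 3, the version below assumes the counts are returned under a `metadata` key with the field names `lines`, `words`, and `characters`. Those field names are illustrative and not taken from the repository.

```python
import subprocess


def wordcount(input_file):
    # wc prints "<lines> <words> <bytes> <filename>"; check_output returns bytes in Python 3.
    output = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
    # Keep only the first three fields so a file name containing spaces
    # does not break the unpacking.
    fields = output.decode('utf-8', errors='replace').split(None, 3)[:3]
    lines, words, characters = (int(value) for value in fields)
    # Assumed result layout: counts nested under the "metadata" key.
    return {
        'metadata': {
            'lines': lines,
            'words': words,
            'characters': characters,
        }
    }
```

Decoding the output and limiting the split are the two changes relative to the snippet in the diff. Without them, the unpacking yields bytes objects and fails on paths that contain whitespace.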
268
263
269
-
To build wordcount simple extractor as docker image, we provide the template Dockerfile shown below. EXTRACTION_FUNC is environment variable and must be assigned to extraction function, where in wordcount.py, the function is `wordcount`.
264
+
To build wordcount as a Simple Extractor Docker image, users only need to set two environment variables in the Dockerfile shown below. `EXTRACTION_FUNC` must be set to the name of the extraction function; in wordcount.py, the extraction function is `wordcount`. `EXTRACTION_MODULE` is the name of the module file containing the definition of the extraction function.
The goal of the simple extractor is to make writing of an extractor as easy as possible. It wraps almost all of the
4
-
complexities in itself and exposes only one environment variable called ```EXTRACTION_FUNC```. This environment
5
-
variable needs to contain the name of the method that needs to be called when this extractor receives a message from
6
-
the message broker.
4
+
complexities in itself and exposes only two environment variables: ```EXTRACTION_FUNC``` and ```EXTRACTION_MODULE```. Environment
5
+
variable ```EXTRACTION_FUNC``` is the name of the method that needs to be called when this extractor receives a message from
6
+
the message broker. The other environment variable, ```EXTRACTION_MODULE```, is the name of the Python module in which the ```EXTRACTION_FUNC``` function is defined.
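To make the role of the two variables concrete, here is a minimal sketch of how a module name and a function name can be resolved into a callable in Python. It is an illustration under stated assumptions, not the Simple Extractor's actual implementation. The helper name `load_extraction_function` is hypothetical.

```python
import importlib
import os


def load_extraction_function(env=None):
    """Resolve EXTRACTION_MODULE and EXTRACTION_FUNC into a callable."""
    # Default to the process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    # Import the module by name, e.g. "wordcount" for wordcount.py.
    module = importlib.import_module(env['EXTRACTION_MODULE'])
    # Look up the function by name inside that module.
    func = getattr(module, env['EXTRACTION_FUNC'])
    if not callable(func):
        raise TypeError('%s is not callable' % env['EXTRACTION_FUNC'])
    return func
```

With `EXTRACTION_MODULE=wordcount` and `EXTRACTION_FUNC=wordcount`, such a lookup would return the `wordcount` function, provided `wordcount.py` is on the Python import path.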
7
7
8
8
# When to Use This
9
9
10
10
1. This simple extractor is meant to be used in those situations when there is already some Python code available that
11
11
needs to be wrapped as an extractor as quickly as possible.
12
-
2. This extractor ONLY generates JSON format metadata or a list of preview files. If your extractor generates
12
+
2. This extractor CURRENTLY outputs JSON format metadata or a list of preview files. If your extractor generates
13
13
any additional information like generated files, datasets, collections, thumbnails, etc., this method cannot be used and
14
-
you need to write your extractor the normal way using [PyClowder2](https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/browse)
14
+
you have to write your extractor the normal way using [PyClowder2](https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/browse)
15
15
3. [Docker](https://www.docker.com/) is the recommended way of developing / wrapping your code using the Simple Extractor.
16
16
17
17
## Steps for Writing an Extractor Using the Simple Extractor
18
18
19
19
To write an extractor using the Simple Extractor, you need to have your Python program available. The main function of
20
-
this Python program needs to accept an input file path as its parameter. It needs to return a Python dictionary that
20
+
this Python program is supposed to take an input file path as its parameter. It needs to return a Python dictionary that
21
21
can contain either metadata information ("metadata"), details about file previews ("previews") or both. For example:
22
22
23
23
```json
@@ -29,10 +29,11 @@ can contain either metadata information ("metadata"), details about file preview
29
29
30
30
1. Let's call your main Python program file ```your_python_program.py``` and the main function ```your_main_function```.
31
31
32
-
2. Let's create a Dockerfile for your extractor. Its contents need to be:
32
+
2. Let's create a Dockerfile for your extractor. The Dockerfile contents need to be: