Merge pull request #67 in CATS/pyclowder2 from BD-2097-simple-extractor-on-file-extraction-e.g-wordcount to develop

robkooper · robkooper · commit a955af287c52 · 2018-06-19T13:36:44.000-05:00
* commit '09b5a3e9cc0466862a23513591672fa497c8c2b1':
  Update sample-extractors/simple-extractor/README.md
  simple extractor onbuild
  Removed init.py
  Removed invalid code in Dockerfile.
  Update sample-extractors/simple-extractor/__init__.py
  Changed print statement to logger debug.
  Fixed Dockerfile.
  Updated gitignore
  1. Moved simple extractor out of pyclowder as part of sample extractors. 2. Started adding README for simple extractor 3. Updated docker file for simple extractor
  Removed extra blank line that was causing build failure.
  simple extractor on file with wordcount example
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,5 @@
 .idea/*
+.DS_Store
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
diff --git a/README.md b/README.md
@@ -223,3 +223,50 @@ the docker container.
 
 If you need any Python packages installed, you will need to create a file called requirements.txt. If this file exists
 the docker build process will use `pip install -r requirements.txt` to install these packages.
+
+## SimpleExtractor
+Motivation: the Simple Extractor lets Python developers wrap their existing Python code as a Clowder extractor with little effort and without detailed knowledge of the PyClowder library.
+
+A Simple Extractor takes a developer-defined main function as its input parameter, calls it to perform the extraction, then packs the function's output into the metadata structure defined by the Simple Extractor and submits it back to Clowder.
+
+The user's function must return a `dict` object containing metadata, previews, or both.
+```python
+result = {
+  'metadata': {},
+  'previews': [
+      'filename',
+      {'file': 'filename'},
+      {'file': 'filename', 'metadata': {}, 'mimetype': 'image/jpeg'}
+  ]}
+```
+
+### Example: 
+`wordcount-simpleextractor` is the simplest example to illustrate how to wrap existing Python code as a Simple Extractor.
+
+wordcount.py is a regular Python file written and provided by the Python developer. In the code, `wordcount` invokes the `wc` command on the input file to extract the number of lines, words, and characters, and packs them into a Python dict as metadata.
+```python
+import subprocess
+  
+def wordcount(input_file):
+    result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
+    (lines, words, characters, _) = result.split()
+    metadata = {
+        'lines': lines,
+        'words': words,
+        'characters': characters
+    }
+    result = {
+        'metadata': metadata
+    }
+    return result
+```
+
+To build wordcount as a Simple Extractor docker image, users only need to set two environment variables in the Dockerfile shown below. EXTRACTION_FUNC must be set to the name of the extraction function; in wordcount.py, that function is `wordcount`. EXTRACTION_MODULE is the name of the module that defines the extraction function.
+```dockerfile
+FROM clowder/extractors-simple-extractor:onbuild
+
+ENV EXTRACTION_FUNC="wordcount"
+ENV EXTRACTION_MODULE="wordcount"
+```
+
+
diff --git a/docker.sh b/docker.sh
@@ -10,6 +10,8 @@ export DEBUG=${DEBUG:-""}
 ${DEBUG} docker build --tag clowder/pyclowder:latest .
 ${DEBUG} docker build --tag clowder/pyclowder:onbuild --file Dockerfile.onbuild .
 ${DEBUG} docker build --tag clowder/extractors-binary-preview:onbuild sample-extractors/binary-preview
+${DEBUG} docker build --tag clowder/extractors-simple-extractor:onbuild sample-extractors/simple-extractor
 
 # build sample extractors
 ${DEBUG} docker build --tag clowder/extractors-wordcount:latest sample-extractors/wordcount
+${DEBUG} docker build --tag clowder/extractors-wordcount-simpleextractor:latest sample-extractors/wordcount-simple-extractor
diff --git a/sample-extractors/README.md b/sample-extractors/README.md
@@ -2,7 +2,7 @@ This folder contains example extractors:
 
 * wordcount: A simple extractor that takes a text file and counts the number of characters, words and lines.
 * echo: A simple extractor that shows how check_message can be used to tell the extractor not to download the actual file.
-
+* wordcount-simpleextractor: A wordcount extractor built on the newly added Simple Extractor module in PyClowder2.
 Additional files in this folder are:
 
 * example.conf: an example of the ubuntu upstart file
diff --git a/sample-extractors/simple-extractor/Dockerfile b/sample-extractors/simple-extractor/Dockerfile
@@ -0,0 +1,23 @@
+FROM clowder/pyclowder:onbuild
+
+ENV EXTRACTION_FUNC=""
+ENV EXTRACTION_MODULE=""
+
+# install any packages
+ONBUILD COPY packages.* Dockerfile /home/clowder/
+ONBUILD RUN if [ -e packages.apt ]; then \
+                apt-get -q -q update \
+                && xargs apt-get -y install --no-install-recommends < packages.apt \
+                && rm -rf /var/lib/apt/lists/*; \
+            fi
+
+# install any python packages
+ONBUILD COPY requirements.txt* Dockerfile /home/clowder/
+ONBUILD RUN if [ -e requirements.txt ]; then \
+                pip install --no-cache-dir -r requirements.txt; \
+            fi
+
+# copy all files
+ONBUILD ADD . /home/clowder/
+
+CMD python -c "from simple_extractor import SimpleExtractor; from ${EXTRACTION_MODULE} import *; SimpleExtractor(${EXTRACTION_FUNC}).start()"
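Review note (not part of the commit): the CMD above builds a Python one-liner from the two environment variables, so `from ${EXTRACTION_MODULE} import *` followed by `SimpleExtractor(${EXTRACTION_FUNC})` is how the user's function gets resolved at container start. The same lookup can be sketched with `importlib`. This is only an illustration: `load_extraction_func` is a hypothetical helper that does not exist in pyclowder, and the stdlib `json` module with its `dumps` function stands in for a user module and function so the snippet runs anywhere.

```python
import importlib
import os


def load_extraction_func(module_name=None, func_name=None):
    """Resolve the extraction function by module and function name.

    Hypothetical helper for illustration only; not part of pyclowder.
    Falls back to the EXTRACTION_MODULE / EXTRACTION_FUNC environment
    variables when the arguments are not given.
    """
    module_name = module_name or os.environ.get("EXTRACTION_MODULE", "")
    func_name = func_name or os.environ.get("EXTRACTION_FUNC", "")
    if not module_name or not func_name:
        raise ValueError("EXTRACTION_MODULE and EXTRACTION_FUNC must both be set")

    module = importlib.import_module(module_name)
    func = getattr(module, func_name)  # raises AttributeError if the name is missing
    if not callable(func):
        raise TypeError("%s.%s is not callable" % (module_name, func_name))
    return func


# Stand-in demo: "json" / "dumps" play the role of a user module and function.
dumps = load_extraction_func("json", "dumps")
print(dumps({"lines": 3}))
```

Compared with the `import *` one-liner, an explicit lookup like this fails with a clear error when either variable is left at its empty default.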
diff --git a/sample-extractors/simple-extractor/README.md b/sample-extractors/simple-extractor/README.md
@@ -0,0 +1,39 @@
+# Simple Extractor
+
+The goal of the simple extractor is to make writing of an extractor as easy as possible. It wraps almost all of the 
+complexities in itself and exposes only two environment variables: ```EXTRACTION_FUNC``` and ```EXTRACTION_MODULE```. Environment 
+variable ```EXTRACTION_FUNC``` is the name of the method that needs to be called when this extractor receives a message from 
+the message broker. The other environment variable ```EXTRACTION_MODULE``` is the name of the Python module in which the ```EXTRACTION_FUNC``` function is defined.
+ 
+# When to Use This
+
+1. This simple extractor is meant to be used in those situations when there is already some Python code available that 
+needs to be wrapped as an extractor as quickly as possible.
+2. This extractor CURRENTLY outputs JSON format metadata or a list of preview files. If your extractor generates 
+any additional information like generated files, datasets, collections, thumbnails, etc., this method cannot be used and
+you have to write your extractor the normal way using [PyClowder2](https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/browse).
+3. [Docker](https://www.docker.com/) is the recommended way of developing / wrapping your code using the Simple Extractor.
+
+## Steps for Writing an Extractor Using the Simple Extractor
+
+To write an extractor using the Simple Extractor, you need to have your Python program available. The main function of 
+this Python program is supposed to take an input file path as its parameter. It needs to return a Python dictionary that 
+can contain either metadata information ("metadata"), details about file previews ("previews") or both. For example:
+
+``` python
+{
+    "metadata": {},
+    "previews": []
+}
+```
+
+1. Let's call your main Python program file ```your_python_program.py``` and the main function ```your_main_function```.
+
+2. Let's create a Dockerfile for your extractor. Dockerfile contents need to be:
+
+        FROM clowder/extractors-simple-extractor:onbuild
+        ENV EXTRACTION_FUNC="your_main_function"
+        ENV EXTRACTION_MODULE="your_python_program"
+
+
+
diff --git a/sample-extractors/simple-extractor/simple_extractor.py b/sample-extractors/simple-extractor/simple_extractor.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python
+
+import logging
+from pyclowder.extractors import Extractor
+import pyclowder.files
+
+
+class SimpleExtractor(Extractor):
+    def __init__(self, extraction):
+        Extractor.__init__(self)
+        self.extraction = extraction
+        self.setup()
+        # set up logging for the extractor
+        logging.getLogger('pyclowder').setLevel(logging.INFO)
+        self.logger = logging.getLogger('__main__')
+        self.logger.setLevel(logging.INFO)
+
+    def process_message(self, connector, host, secret_key, resource, parameters):
+        input_file = resource["local_paths"][0]
+        file_id = resource['id']
+        result = self.extraction(input_file)
+        if 'metadata' in result.keys():
+            metadata = self.get_metadata(result.get('metadata'), 'file', file_id, host)
+            self.logger.info("upload metadata")
+            self.logger.debug(metadata)
+            pyclowder.files.upload_metadata(connector, host, secret_key, file_id, metadata)
+        if 'previews' in result.keys():
+            self.logger.info("upload previews")
+            for preview in result['previews']:
+                if isinstance(preview, basestring):
+                    preview = {'file': preview}
+                elif not isinstance(preview, dict):
+                    continue
+                self.logger.info("upload preview")
+                pyclowder.files.upload_preview(connector, host, secret_key, file_id, preview.get('file'),
+                                               preview.get('metadata'), preview.get('mimetype'))
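Review note (not part of the commit): the README documents three shapes for a `previews` entry (a bare filename, `{'file': ...}`, and a dict with optional `metadata` and `mimetype`), so each entry has to be reduced to a dict with a `file` key before `upload_preview` is called. A minimal sketch of that normalization, written to run on both Python 2 and 3. The names `normalize_preview` and `string_types` are hypothetical, and the file names are made up for the demo.

```python
try:
    string_types = (basestring,)  # Python 2
except NameError:
    string_types = (str,)  # Python 3 has no basestring


def normalize_preview(preview):
    """Return a dict with a 'file' key, or None if the entry is unusable.

    Hypothetical helper for illustration only; not part of pyclowder.
    """
    if isinstance(preview, string_types):
        return {'file': preview}
    if isinstance(preview, dict) and preview.get('file'):
        return preview
    return None


# Demo: two usable entries, two that should be skipped.
previews = [
    'thumb.png',
    {'file': 'plot.jpg', 'mimetype': 'image/jpeg'},
    42,
    {'metadata': {}},
]
usable = [p for p in (normalize_preview(x) for x in previews) if p is not None]
print(usable)
```

Keeping the normalization in one function makes it easy to test the accepted shapes without a running Clowder instance.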
diff --git a/sample-extractors/wordcount-simple-extractor/Dockerfile b/sample-extractors/wordcount-simple-extractor/Dockerfile
@@ -0,0 +1,4 @@
+FROM clowder/extractors-simple-extractor:onbuild
+
+ENV EXTRACTION_FUNC="wordcount"
+ENV EXTRACTION_MODULE="wordcount"
diff --git a/sample-extractors/wordcount-simple-extractor/extractor_info.json b/sample-extractors/wordcount-simple-extractor/extractor_info.json
@@ -0,0 +1,28 @@
+{
+  "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
+  "name": "ncsa.wordcount",
+  "version": "1.0",
+  "description": "WordCount simple extractor. Counts the number of characters, words and lines in the text file that was uploaded.",
+  "author": "Bing Zhang <bing@illinois.edu>",
+  "contributors": [],
+  "contexts": [
+    {
+      "lines": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#lines",
+      "words": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#words",
+      "characters": "http://clowder.ncsa.illinois.edu/metadata/ncsa.wordcount#characters"
+    }
+  ],
+  "repository": {
+    "repType": "git",
+    "repUrl": "https://opensource.ncsa.illinois.edu/stash/scm/cats/pyclowder.git"
+  },
+  "process": {
+    "file": [
+      "text/*",
+      "application/json"
+    ]
+  },
+  "external_services": [],
+  "dependencies": [],
+  "bibtex": []
+}
diff --git a/sample-extractors/wordcount-simple-extractor/wordcount.py b/sample-extractors/wordcount-simple-extractor/wordcount.py
@@ -0,0 +1,15 @@
+import subprocess
+
+
+def wordcount(input_file):
+    result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
+    (lines, words, characters, _) = result.split()
+    metadata = {
+        'lines': lines,
+        'words': words,
+        'characters': characters
+    }
+    result = {
+        'metadata': metadata
+    }
+    return result
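Review note (not part of the commit): `wc` prints the file name as its last field, so the four-way unpacking in `wordcount` assumes a path without whitespace, and under Python 3 `check_output` returns `bytes`, so the counts would be byte strings rather than numbers. One way to avoid both is to count in Python and return integers. The sketch below is an assumption about what a replacement could look like, not the author's method: `wordcount_py` is a hypothetical name, and its numbers can differ from `wc` (it counts decoded characters rather than bytes, and it counts a final line that has no trailing newline).

```python
import io
import os
import tempfile


def wordcount_py(input_file):
    """Count lines, words and characters without shelling out to `wc`.

    Hypothetical alternative for illustration only.
    """
    lines = words = characters = 0
    # errors='replace' keeps the extractor from failing on invalid UTF-8 input
    with io.open(input_file, encoding='utf-8', errors='replace') as handle:
        for line in handle:
            lines += 1
            words += len(line.split())
            characters += len(line)
    return {
        'metadata': {
            'lines': lines,
            'words': words,
            'characters': characters
        }
    }


# Demo on a temporary file so the snippet is self-contained.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('hello world\nsecond line here\n')
result = wordcount_py(path)
os.remove(path)
print(result)
```

The return value has the same `{'metadata': {...}}` shape as the committed function, so it could be used with the same `EXTRACTION_FUNC` / `EXTRACTION_MODULE` settings.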
