Merge branch 'master' into extractor-key-support

max-zilla · max-zilla · commit 7682ebf9ffb5 · 2022-12-07T08:04:15.000-06:00
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,7 +1,7 @@
 # Change Log
 All notable changes to this project will be documented in this file.
 
-The format is based on [Keep a Changelog](http://keepachangelog.com/) 
+The format is based on [Keep a Changelog](http://keepachangelog.com/)
 and this project adheres to [Semantic Versioning](http://semver.org/).
 
 ## Unreleased
@@ -10,6 +10,37 @@ and this project adheres to [Semantic Versioning](http://semver.org/).
 - Add support for `EXTRACTOR_KEY` and `CLOWDER_EMAIL` environment variables to register
 an extractor for just one user.
 
+## 2.6.0 - 2022-06-14
+
+This will change how clowder sees the extractors. If you have an extractor and you specify
+the queue name (either as command line argument or environment variable), the name of the
+extractor shown in clowder will be the name of the queue.
+
+### Fixed
+- both heartbeat and max_retry need to be converted to int, not string
+
+### Changed
+- when you set the RABBITMQ_QUEUE it will change the name of the extractor as well in the
+  extractor_info document. [#47](https://github.com/clowder-framework/pyclowder/issues/47)
+- environment variable CLOWDER_MAX_RETRY is now MAX_RETRY
+
+## 2.5.1 - 2022-03-04
+
+### Changed
+- updated pypi documentation
+
+## 2.5.0 - 2022-03-04
+
+### Fixed
+- extractor would fail on empty dataset download [#36](https://github.com/clowder-framework/pyclowder/issues/36)
+
+### Added
+- ability to set the heartbeat for an extractor [#42](https://github.com/clowder-framework/pyclowder/issues/42)
+
+### Changed
+- update wordcount extractor to not use docker image
+- using piptools for requirements
+
 ## 2.4.1 - 2021-07-21
 
 ### Added
@@ -43,13 +74,13 @@ an extractor for just one user.
 ## 2.3.2 - 2020-09-24
 
 ### Fixed
-- When rabbitmq restarts the extractor would not stop and restart, resulting 
+- When rabbitmq restarts the extractor would not stop and restart, resulting
   in the extractor no longer receiving any messages. #17
 
 ### Added
 - Can specify url to use for extractor downloads, this is helpful for instances
   that have access to the internal URL for clowder, for example in docker/kubernetes.
-  
+
 ### Removed
 - Removed ability to run multiple connectors in the same python process. If
   parallelism is needed, use multiple processes (or containers).
@@ -135,7 +166,7 @@ install pyclowder.
 
 ### Fixed
 - Error decoding json body from Clowder when filename had special characters
-  [CATSPYC-18] (https://opensource.ncsa.illinois.edu/jira/browse/CATSPYC-18) 
+  [CATSPYC-18] (https://opensource.ncsa.illinois.edu/jira/browse/CATSPYC-18)
 - RABBITMQ_QUEUE variable/flag was ignored when set and would connect
   to default queue.
 
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ create new extractors.
 Install using pip (for most recent versions see: https://pypi.org/project/pyclowder/):
 
 ```
-pip install pyclowder==2.4.1
+pip install pyclowder==2.6.0
 ```
 
 Install pyClowder on your system by cloning this repo:
@@ -25,13 +25,16 @@ git clone https://github.com/clowder-framework/pyclowder.git
 cd pyclowder
 pip install -r requirements.txt
 python setup.py install
-
 ```
 or directly from GitHub:
 ```
 pip install -r https://raw.githubusercontent.com/clowder-framework/pyclowder/master/requirements.txt git+https://github.com/clowder-framework/pyclowder.git
 ```
 
+## Quickstart example
+
+See the [README](https://github.com/clowder-framework/pyclowder/tree/master/sample-extractors/wordcount#readme) in `sample-extractors/wordcount`. Using Docker, no install is required.
+
 ## Example Extractor
 
 Following is an example of the WordCount extractor. This example will allow the user to specify from the command line
@@ -213,8 +216,13 @@ to a file that is read with the configuration options.
 
 # Dockerfile
 
-We recommend using the pyclowder:onbuild to easily convert your extractor into a docker container. If you build the
-extractor as commented above, you will only need the following Dockerfile
+We recommend following the instructions at [clowder/generator](https://github.com/clowder-framework/generator) to build a Docker image from your Simple Extractor.
+
+You can also use the pyclowder:onbuild Docker image to easily convert your extractor into a docker container. This image is no longer maintained, so it is recommended to either use the clowder/generator linked above or build your own Dockerfile by choosing your own base image and installing pyClowder as described below.
+
+
+**This is deprecated and the onbuild image is no longer maintained**
+If you build the extractor using the pyclowder:onbuild image, you will only need the following Dockerfile
 
 ```
 FROM clowder/pyclowder:onbuild
@@ -287,7 +295,7 @@ def wordcount(input_file):
     return result
 ```
 
-To build wordcount as a Simpel extractor docker image, users just simply assign two environment variables in Dockerfile shown below. EXTRACTION_FUNC is environment variable and has to be assigned as extraction function, where in wordcount.py, the extraction function is `wordcount`. Environment variable EXTRACTION_MODULE is the name of module file containing the definition of extraction function.
+To build wordcount as an extractor docker image, users simply assign two environment variables in the Dockerfile shown below. EXTRACTION_FUNC must be set to the name of the extraction function; in wordcount.py, the extraction function is `wordcount`. EXTRACTION_MODULE is the name of the module file containing the definition of the extraction function.
 ```markdown
 FROM clowder/extractors-simple-extractor:onbuild
 
diff --git a/description.rst b/description.rst
@@ -1,25 +1,39 @@
-This package provides standard functions for interacting with the
-Clowder open source data management system. Clowder is designed
-to allow researchers to build customized catalogs in the clouds
-to help you manage research data.
+This package provides standard functions for interacting with the Clowder
+open source data management system. Clowder is designed to allow researchers
+to build customized catalogs in the clouds to help you manage research data.
+
+One of the most interesting aspects of Clowder is the ability to extract
+metadata from any file. This ability is created using extractors. To make it
+easy to create these extractors in python we have created a module called
+clowder. Besides wrapping often used api calls in convenient python calls, we
+have also added some code to make it easy to create new extractors.
 
 Installation
 ------------
 
-The easiest way install pyclowder is using pip and pulling from PyPI.
-Use the following command to install::
+Install using pip (for most recent versions see: https://pypi.org/project/pyclowder/):
+
+```
+pip install pyclowder==2.6.0
+```
 
-    pip install pyclowder
+Install pyClowder on your system by cloning this repo:
 
-Because this system is still under rapid development, you may want to
-install by cloning the repo using the following commands::
+```
+git clone https://github.com/clowder-framework/pyclowder.git
+cd pyclowder
+pip install -r requirements.txt
+python setup.py install
+```
 
-    git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/pyclowder.git
-    cd pyclowder
-    pip install -r requirements.txt
-    python setup.py install
+or directly from GitHub:
 
-Or you can install directly from NCSA's Bitbucket::
+```
+pip install -r https://raw.githubusercontent.com/clowder-framework/pyclowder/master/requirements.txt git+https://github.com/clowder-framework/pyclowder.git
+```
 
-    pip install -r https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/raw/requirements.txt git+https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/pyclowder.git
+Quickstart example
+------------------
 
+See the [README](https://github.com/clowder-framework/pyclowder/tree/master/sample-extractors/wordcount#readme)
+in `sample-extractors/wordcount`. Using Docker, no install is required.
diff --git a/docs/source/conf.py b/docs/source/conf.py
@@ -57,9 +57,9 @@
 # built documents.
 #
 # The short X.Y version.
-version = u'2.4'
+version = u'2.6'
 # The full version, including alpha/beta/rc tags.
-release = u'2.4.1'
+release = u'2.6.0'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.
diff --git a/pyclowder/connectors.py b/pyclowder/connectors.py
@@ -295,6 +295,9 @@ def _download_file_metadata(self, host, secret_key, fileid, filepath):
         return (md_dir, md_file)
 
     def _prepare_dataset(self, host, secret_key, resource):
+        logger = logging.getLogger(__name__)
+
+        file_paths = []
         located_files = []
         missing_files = []
         tmp_files_created = []
@@ -356,10 +359,13 @@ def _prepare_dataset(self, host, secret_key, resource):
 
         # If we didn't find any files locally, download dataset .zip as normal
         else:
-            inputzip = pyclowder.datasets.download(self, host, secret_key, resource["id"])
-            file_paths = pyclowder.utils.extract_zip_contents(inputzip)
-            tmp_files_created += file_paths
-            tmp_files_created.append(inputzip)
+            try:
+                inputzip = pyclowder.datasets.download(self, host, secret_key, resource["id"])
+                file_paths = pyclowder.utils.extract_zip_contents(inputzip)
+                tmp_files_created += file_paths
+                tmp_files_created.append(inputzip)
+            except Exception as e:
+                logger.exception("No files found and download failed")
 
         return (file_paths, tmp_files_created, tmp_dirs_created)
 
@@ -656,7 +662,7 @@ def __init__(self, extractor_name, extractor_info,
         self.consumer_tag = None
         self.worker = None
         self.announcer = None
-        self.heartbeat = heartbeat
+        self.heartbeat = float(heartbeat)
 
     def connect(self):
         """connect to rabbitmq using URL parameters"""
diff --git a/pyclowder/extractors.py b/pyclowder/extractors.py
@@ -75,7 +75,8 @@ def __init__(self):
         connector_default = "RabbitMQ"
         if os.getenv('LOCAL_PROCESSING', "False").lower() == "true":
             connector_default = "Local"
-        max_retry = os.getenv('CLOWDER_MAX_RETRY', 10)
+        max_retry = int(os.getenv('MAX_RETRY', 10))
+        heartbeat = int(os.getenv('HEARTBEAT', 5*60))
 
         # create the actual extractor
         self.parser = argparse.ArgumentParser(description=self.extractor_info['description'])
@@ -118,7 +119,9 @@ def __init__(self):
         self.parser.add_argument('--no-bind', dest="nobind", action='store_true',
                                  help='instance will bind itself to RabbitMQ by name but NOT file type')
         self.parser.add_argument('--max-retry', dest='max_retry', default=max_retry,
-                                 help='Maximum number of retries if an error happens in the extractor')
+                                 help='Maximum number of retries if an error happens in the extractor (default=%d)' % max_retry)
+        self.parser.add_argument('--heartbeat', dest='heartbeat', default=heartbeat,
+                                 help='Time in seconds between extractor heartbeats (default=%d)' % heartbeat)
 
     def setup(self):
         """Parse command line arguments and so some setup
@@ -128,6 +131,10 @@ def setup(self):
         """
         self.args = self.parser.parse_args()
 
+        # fix extractor_info based on the queue name
+        if self.args.rabbitmq_queuename and self.extractor_info['name'] != self.args.rabbitmq_queuename:
+            self.extractor_info['name'] = self.args.rabbitmq_queuename
+
         # use command line option for ssl_verify
         if 'sslverify' in self.args:
             self.ssl_verify = self.args.sslverify
@@ -174,6 +181,7 @@ def start(self):
                                               mounted_paths=json.loads(self.args.mounted_paths),
                                               clowder_url=self.args.clowder_url,
                                               max_retry=self.args.max_retry,
+                                              heartbeat=self.args.heartbeat,
                                               extractor_key=self.args.extractor_key,
                                               clowder_email=self.args.clowder_email)
                 connector.connect()
diff --git a/requirements.txt b/requirements.txt
@@ -1,5 +1,24 @@
-enum34==1.1.10
+#
+# This file is autogenerated by pip-compile with python 3.9
+# To update, run:
+#
+#    pip-compile
+#
+certifi==2021.10.8
+    # via requests
+charset-normalizer==2.0.10
+    # via requests
+idna==3.3
+    # via requests
 pika==1.2.0
-PyYAML==5.4.1
+    # via pyclowder (setup.py)
+pyyaml==5.4.1
+    # via pyclowder (setup.py)
 requests==2.26.0
+    # via
+    #   pyclowder (setup.py)
+    #   requests-toolbelt
 requests-toolbelt==0.9.1
+    # via pyclowder (setup.py)
+urllib3==1.26.8
+    # via requests
diff --git a/sample-extractors/wordcount/Dockerfile b/sample-extractors/wordcount/Dockerfile
@@ -1,4 +1,8 @@
-ARG PYCLOWDER_PYTHON=""
-FROM clowder/pyclowder${PYCLOWDER_PYTHON}:onbuild
+FROM python:3.8
 
-ENV MAIN_SCRIPT="wordcount.py"
+WORKDIR /extractor
+COPY requirements.txt ./
+RUN pip install -r requirements.txt
+
+COPY wordcount.py extractor_info.json ./
+CMD python wordcount.py
diff --git a/sample-extractors/wordcount/README.md b/sample-extractors/wordcount/README.md
@@ -2,20 +2,42 @@ A simple extractor that counts the number of characters, words and lines in a te
 
 # Docker
 
-This extractor is ready to be run as a docker container. To build the docker container run:
+This extractor is ready to be run as a docker container; the only dependency is a running Clowder instance. Simply build and run.
+
+1. Start Clowder. For help starting Clowder, see our [getting started guide](https://github.com/clowder-framework/clowder/blob/develop/doc/src/sphinx/userguide/installing_clowder.rst).
+
+2. Next, build the extractor Docker image:
 
 ```
+# from this directory, run:
+
 docker build -t clowder_wordcount .
 ```
 
-To run the docker containers use:
+3. Finally run the extractor:
 
 ```
-docker run -t -i --rm -e "RABBITMQ_URI=amqp://rabbitmqserver/clowder" clowder_wordcount
-docker run -t -i --rm --link clowder_rabbitmq_1:rabbitmq clowder_wordcount
+docker run -t -i --rm --net clowder_clowder -e "RABBITMQ_URI=amqp://guest:guest@rabbitmq:5672/%2f" --name "wordcount" clowder_wordcount
 ```
 
-The RABBITMQ_URI and RABBITMQ_EXCHANGE environment variables can be used to control what RabbitMQ server and exchange it will bind itself to, you can also use the --link option to link the extractor to a RabbitMQ container.
+Then open the Clowder web app and run the wordcount extractor on a .txt file (or similar)! Done.
+
+### Python and Docker details
+
+You may use any version of Python 3. Simply edit the first line of the `Dockerfile`; by default it uses `FROM python:3.8`.
+
+Docker flags:
+
+- `--net` links the extractor to the Clowder Docker network (run `docker network ls` to identify your own).
+- `-e RABBITMQ_URI=` sets the environment variable that controls which RabbitMQ server the extractor connects to; `RABBITMQ_EXCHANGE` can likewise be set to control which exchange it binds to.
+  - You can also use `--link` to link the extractor to a RabbitMQ container.
+- `--name` assigns the container a name visible in Docker Desktop.
+
+## Troubleshooting
+
+**If you run into _any_ trouble**, please reach out on our Clowder Slack in the [#pyclowder channel](https://clowder-software.slack.com/archives/CNC2UVBCP).
+
+Alternate methods of running extractors are below.
 
 # Commandline Execution
 
diff --git a/sample-extractors/wordcount/requirements.txt b/sample-extractors/wordcount/requirements.txt
@@ -0,0 +1 @@
+pyclowder==2.6.0
diff --git a/setup.py b/setup.py
diff --git a/version.sh b/version.sh
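
As a rough illustration of the configuration changes to `pyclowder/extractors.py` in this commit, the sketch below shows one way environment variables can seed integer command line defaults, and how a queue name can override the extractor name. The names `build_parser`, `apply_queue_name`, and the `env` parameter are hypothetical and not part of pyclowder. Passing `type=int` to argparse is an assumption that goes beyond what the diff does.

```python
import argparse
import os


def build_parser(env=None):
    """Build a parser whose defaults come from MAX_RETRY and HEARTBEAT."""
    # `env` is a hypothetical hook so the sketch can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env

    # Environment values arrive as strings, so convert them to int up front.
    max_retry = int(env.get("MAX_RETRY", 10))
    heartbeat = int(env.get("HEARTBEAT", 5 * 60))

    parser = argparse.ArgumentParser(description="sketch of extractor options")
    # type=int (an assumption, not in the diff) also converts values that
    # are supplied on the command line rather than via the environment.
    parser.add_argument("--max-retry", dest="max_retry", type=int,
                        default=max_retry,
                        help="Maximum number of retries (default=%d)" % max_retry)
    parser.add_argument("--heartbeat", dest="heartbeat", type=int,
                        default=heartbeat,
                        help="Seconds between heartbeats (default=%d)" % heartbeat)
    return parser


def apply_queue_name(extractor_info, queue_name):
    """Return a copy of extractor_info renamed after the queue, if one is set."""
    info = dict(extractor_info)
    if queue_name and info.get("name") != queue_name:
        info["name"] = queue_name
    return info
```

Calling `build_parser().parse_args()` reads the defaults from the real environment, and an explicit flag such as `--heartbeat 60` takes precedence over `HEARTBEAT`.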