Commit 7682ebf

Merge branch 'master' into extractor-key-support
2 parents a8c7262 + 21a902a commit 7682ebf

12 files changed, +221 -77 lines changed

CHANGELOG.md

Lines changed: 35 additions & 4 deletions
@@ -1,7 +1,7 @@
 # Change Log
 All notable changes to this project will be documented in this file.

-The format is based on [Keep a Changelog](http://keepachangelog.com/)
+The format is based on [Keep a Changelog](http://keepachangelog.com/)
 and this project adheres to [Semantic Versioning](http://semver.org/).

 ## Unreleased
@@ -10,6 +10,37 @@ and this project adheres to [Semantic Versioning](http://semver.org/).
 - Add support for `EXTRACTOR_KEY` and `CLOWDER_EMAIL` environment variables to register
 an extractor for just one user.

+## 2.6.0 - 2022-06-14
+
+This will change how clowder sees the extractors. If you have an extractor and you specify
+the queue name (either as a command line argument or an environment variable), the name of the
+extractor shown in clowder will be the name of the queue.
+
+### Fixed
+- both heartbeat and max_retry need to be converted to int, not string
+
+### Changed
+- when you set the RABBITMQ_QUEUE it will change the name of the extractor as well in the
+extractor_info document. [#47](https://github.com/clowder-framework/pyclowder/issues/47)
+- environment variable CLOWDER_MAX_RETRY is now MAX_RETRY
+
+## 2.5.1 - 2022-03-04
+
+### Changed
+- updated pypi documentation
+
+## 2.5.0 - 2022-03-04
+
+### Fixed
+- extractor would fail on empty dataset download [#36](https://github.com/clowder-framework/pyclowder/issues/36)
+
+### Added
+- ability to set the heartbeat for an extractor [#42](https://github.com/clowder-framework/pyclowder/issues/42)
+
+### Changed
+- update wordcount extractor to not use docker image
+- using piptools for requirements
+
 ## 2.4.1 - 2021-07-21

 ### Added
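
The 2.6.0 entry in this hunk says that a configured queue name replaces the extractor name shown in Clowder. A minimal sketch of that rule follows; `effective_extractor_info` and the sample names are hypothetical and only illustrate the behaviour described, they are not pyclowder's API.

```python
def effective_extractor_info(extractor_info, queue_name=None):
    """Return a copy of extractor_info whose name follows the queue name, if one is set.

    Hypothetical helper: the real change edits self.extractor_info in place.
    """
    info = dict(extractor_info)
    if queue_name and info.get("name") != queue_name:
        info["name"] = queue_name
    return info


info = {"name": "ncsa.wordcount", "version": "2.0"}
print(effective_extractor_info(info)["name"])                   # no queue name set
print(effective_extractor_info(info, "wordcount.dev")["name"])  # queue name wins
```

Returning a copy keeps the sketch free of side effects; the commit itself overwrites the name on the extractor object.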
@@ -43,13 +74,13 @@ an extractor for just one user.
 ## 2.3.2 - 2020-09-24

 ### Fixed
-- When rabbitmq restarts the extractor would not stop and restart, resulting
+- When rabbitmq restarts the extractor would not stop and restart, resulting
 in the extractor no longer receiving any messages. #17

 ### Added
 - Can specify url to use for extractor downloads, this is helpful for instances
 that have access to the internal URL for clowder, for example in docker/kubernetes.
-
+
 ### Removed
 - Removed ability to run multiple connectors in the same python process. If
 parallelism is needed, use multiple processes (or containers).
@@ -135,7 +166,7 @@ install pyclowder.

 ### Fixed
 - Error decoding json body from Clowder when filename had special characters
-[CATSPYC-18] (https://opensource.ncsa.illinois.edu/jira/browse/CATSPYC-18)
+[CATSPYC-18] (https://opensource.ncsa.illinois.edu/jira/browse/CATSPYC-18)
 - RABBITMQ_QUEUE variable/flag was ignored when set and would connect
 to default queue.

README.md

Lines changed: 13 additions & 5 deletions
@@ -15,7 +15,7 @@ create new extractors.
 Install using pip (for most recent versions see: https://pypi.org/project/pyclowder/):

 ```
-pip install pyclowder==2.4.1
+pip install pyclowder==2.6.0
 ```

 Install pyClowder on your system by cloning this repo:
@@ -25,13 +25,16 @@ git clone https://github.com/clowder-framework/pyclowder.git
 cd pyclowder
 pip install -r requirements.txt
 python setup.py install
-
 ```
 or directly from GitHub:
 ```
 pip install -r https://raw.githubusercontent.com/clowder-framework/pyclowder/master/requirements.txt git+https://github.com/clowder-framework/pyclowder.git
 ```

+## Quickstart example
+
+See the [README](https://github.com/clowder-framework/pyclowder/tree/master/sample-extractors/wordcount#readme) in `sample-extractors/wordcount`. Using Docker, no install is required.
+
 ## Example Extractor

 Following is an example of the WordCount extractor. This example will allow the user to specify from the command line
@@ -213,8 +216,13 @@ to a file that is read with the configuration options.

 # Dockerfile

-We recommend using the pyclowder:onbuild to easily convert your extractor into a docker container. If you build the
-extractor as commented above, you will only need the following Dockerfile
+We recommend following the instructions at [clowder/generator](https://github.com/clowder-framework/generator) to build a Docker image from your Simple Extractor.
+
+You can also use the pyclowder:onbuild Docker image to easily convert your extractor into a docker container. This image is no longer maintained, so it is recommended to either use the clowder/generator linked above or build your own Dockerfile by choosing your own base image and installing pyClowder as described below.
+
+
+**This is deprecated and the onbuild image is no longer maintained**
+If you build the extractor using the pyclowder:onbuild image, you will only need the following Dockerfile

 ```
 FROM clowder/pyclowder:onbuild
@@ -287,7 +295,7 @@ def wordcount(input_file):
     return result
 ```

-To build wordcount as a Simpel extractor docker image, users just simply assign two environment variables in Dockerfile shown below. EXTRACTION_FUNC is environment variable and has to be assigned as extraction function, where in wordcount.py, the extraction function is `wordcount`. Environment variable EXTRACTION_MODULE is the name of module file containing the definition of extraction function.
+To build wordcount as an extractor docker image, users simply assign two environment variables in the Dockerfile shown below. The EXTRACTION_FUNC environment variable must be set to the name of the extraction function; in wordcount.py, the extraction function is `wordcount`. The EXTRACTION_MODULE environment variable is the name of the module file containing the definition of the extraction function.
 ```markdown
 FROM clowder/extractors-simple-extractor:onbuild

description.rst

Lines changed: 29 additions & 15 deletions
@@ -1,25 +1,39 @@
-This package provides standard functions for interacting with the
-Clowder open source data management system. Clowder is designed
-to allow researchers to build customized catalogs in the clouds
-to help you manage research data.
+This package provides standard functions for interacting with the Clowder
+open source data management system. Clowder is designed to allow researchers
+to build customized catalogs in the clouds to help you manage research data.
+
+One of the most interesting aspects of Clowder is the ability to extract
+metadata from any file. This ability is created using extractors. To make it
+easy to create these extractors in python we have created a module called
+clowder. Besides wrapping often used api calls in convenient python calls, we
+have also added some code to make it easy to create new extractors.

 Installation
 ------------

-The easiest way install pyclowder is using pip and pulling from PyPI.
-Use the following command to install::
+Install using pip (for most recent versions see: https://pypi.org/project/pyclowder/):
+
+```
+pip install pyclowder==2.6.0
+```

-    pip install pyclowder
+Install pyClowder on your system by cloning this repo:

-Because this system is still under rapid development, you may want to
-install by cloning the repo using the following commands::
+```
+git clone https://github.com/clowder-framework/pyclowder.git
+cd pyclowder
+pip install -r requirements.txt
+python setup.py install
+```

-    git clone https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/pyclowder.git
-    cd pyclowder
-    pip install -r requirements.txt
-    python setup.py install
+or directly from GitHub:

-Or you can install directly from NCSA's Bitbucket::
+```
+pip install -r https://raw.githubusercontent.com/clowder-framework/pyclowder/master/requirements.txt git+https://github.com/clowder-framework/pyclowder.git
+```

-    pip install -r https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder/raw/requirements.txt git+https://opensource.ncsa.illinois.edu/bitbucket/scm/cats/pyclowder.git
+Quickstart example
+------------------

+See the [README](https://github.com/clowder-framework/pyclowder/tree/master/sample-extractors/wordcount#readme)
+in `sample-extractors/wordcount`. Using Docker, no install is required.

docs/source/conf.py

Lines changed: 2 additions & 2 deletions
@@ -57,9 +57,9 @@
 # built documents.
 #
 # The short X.Y version.
-version = u'2.4'
+version = u'2.6'
 # The full version, including alpha/beta/rc tags.
-release = u'2.4.1'
+release = u'2.6.0'

 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

pyclowder/connectors.py

Lines changed: 11 additions & 5 deletions
@@ -295,6 +295,9 @@ def _download_file_metadata(self, host, secret_key, fileid, filepath):
         return (md_dir, md_file)

     def _prepare_dataset(self, host, secret_key, resource):
+        logger = logging.getLogger(__name__)
+
+        file_paths = []
         located_files = []
         missing_files = []
         tmp_files_created = []
@@ -356,10 +359,13 @@ def _prepare_dataset(self, host, secret_key, resource):

         # If we didn't find any files locally, download dataset .zip as normal
         else:
-            inputzip = pyclowder.datasets.download(self, host, secret_key, resource["id"])
-            file_paths = pyclowder.utils.extract_zip_contents(inputzip)
-            tmp_files_created += file_paths
-            tmp_files_created.append(inputzip)
+            try:
+                inputzip = pyclowder.datasets.download(self, host, secret_key, resource["id"])
+                file_paths = pyclowder.utils.extract_zip_contents(inputzip)
+                tmp_files_created += file_paths
+                tmp_files_created.append(inputzip)
+            except Exception as e:
+                logger.exception("No files found and download failed")

         return (file_paths, tmp_files_created, tmp_dirs_created)
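
The hunk above initialises `file_paths` before any branch runs and wraps the download in `try`/`except`, so a failed or empty dataset download still returns a well-formed result. The sketch below reproduces that pattern in isolation; `prepare_file_paths`, `failing_download` and the callables passed in are hypothetical stand-ins for `pyclowder.datasets.download` and `pyclowder.utils.extract_zip_contents`, not the library's own functions.

```python
import logging

logger = logging.getLogger(__name__)


def prepare_file_paths(download, extract):
    """Hypothetical stand-in for the download branch of _prepare_dataset."""
    file_paths = []         # initialised up front so the return is always valid
    tmp_files_created = []
    try:
        inputzip = download()
        file_paths = extract(inputzip)
        tmp_files_created += file_paths
        tmp_files_created.append(inputzip)
    except Exception:
        # Log the traceback and fall through with whatever was collected.
        logger.exception("No files found and download failed")
    return file_paths, tmp_files_created


def failing_download():
    raise IOError("empty dataset")


print(prepare_file_paths(failing_download, lambda zip_path: []))
print(prepare_file_paths(lambda: "ds.zip", lambda zip_path: ["a.txt", "b.txt"]))
```

In the failing case the caller receives two empty lists instead of an unhandled exception, which matches the changelog entry about empty dataset downloads.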
@@ -656,7 +662,7 @@ def __init__(self, extractor_name, extractor_info,
         self.consumer_tag = None
         self.worker = None
         self.announcer = None
-        self.heartbeat = heartbeat
+        self.heartbeat = float(heartbeat)

     def connect(self):
         """connect to rabbitmq using URL parameters"""

pyclowder/extractors.py

Lines changed: 10 additions & 2 deletions
@@ -75,7 +75,8 @@ def __init__(self):
         connector_default = "RabbitMQ"
         if os.getenv('LOCAL_PROCESSING', "False").lower() == "true":
             connector_default = "Local"
-        max_retry = os.getenv('CLOWDER_MAX_RETRY', 10)
+        max_retry = int(os.getenv('MAX_RETRY', 10))
+        heartbeat = int(os.getenv('HEARTBEAT', 5*60))

         # create the actual extractor
         self.parser = argparse.ArgumentParser(description=self.extractor_info['description'])
@@ -118,7 +119,9 @@ def __init__(self):
         self.parser.add_argument('--no-bind', dest="nobind", action='store_true',
                                  help='instance will bind itself to RabbitMQ by name but NOT file type')
         self.parser.add_argument('--max-retry', dest='max_retry', default=max_retry,
-                                 help='Maximum number of retries if an error happens in the extractor')
+                                 help='Maximum number of retries if an error happens in the extractor (default=%d)' % max_retry)
+        self.parser.add_argument('--heartbeat', dest='heartbeat', default=heartbeat,
+                                 help='Time in seconds between extractor heartbeats (default=%d)' % heartbeat)

     def setup(self):
         """Parse command line arguments and do some setup
@@ -128,6 +131,10 @@ def setup(self):
         """
         self.args = self.parser.parse_args()

+        # fix extractor_info based on the queue name
+        if self.args.rabbitmq_queuename and self.extractor_info['name'] != self.args.rabbitmq_queuename:
+            self.extractor_info['name'] = self.args.rabbitmq_queuename
+
         # use command line option for ssl_verify
         if 'sslverify' in self.args:
             self.ssl_verify = self.args.sslverify
@@ -174,6 +181,7 @@ def start(self):
                                   mounted_paths=json.loads(self.args.mounted_paths),
                                   clowder_url=self.args.clowder_url,
                                   max_retry=self.args.max_retry,
+                                  heartbeat=self.args.heartbeat,
                                   extractor_key=self.args.extractor_key,
                                   clowder_email=self.args.clowder_email)
         connector.connect()
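
The change above reads `MAX_RETRY` and `HEARTBEAT` from the environment, converts them to `int`, and uses them as argparse defaults. A small standalone sketch of that flow is shown here. `build_parser` and its `env` mapping are hypothetical, and the `type=int` arguments are an addition in this sketch: the diff does not pass `type=`, so argparse would hand a value typed on the command line to the caller as a string.

```python
import argparse


def build_parser(env):
    """Build a parser whose defaults come from an environment-like mapping."""
    max_retry = int(env.get("MAX_RETRY", 10))
    heartbeat = int(env.get("HEARTBEAT", 5 * 60))

    parser = argparse.ArgumentParser(description="hypothetical extractor")
    # type=int is an assumption of this sketch, not part of the commit.
    parser.add_argument("--max-retry", dest="max_retry", type=int, default=max_retry,
                        help="Maximum number of retries (default=%d)" % max_retry)
    parser.add_argument("--heartbeat", dest="heartbeat", type=int, default=heartbeat,
                        help="Seconds between heartbeats (default=%d)" % heartbeat)
    return parser


# Environment supplies MAX_RETRY; the command line overrides the heartbeat.
args = build_parser({"MAX_RETRY": "3"}).parse_args(["--heartbeat", "30"])
print(args.max_retry, args.heartbeat)
```

Passing the environment as a mapping rather than calling `os.getenv` directly keeps the sketch easy to exercise without touching the real process environment.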

requirements.txt

Lines changed: 21 additions & 2 deletions
@@ -1,5 +1,24 @@
-enum34==1.1.10
+#
+# This file is autogenerated by pip-compile with python 3.9
+# To update, run:
+#
+#    pip-compile
+#
+certifi==2021.10.8
+    # via requests
+charset-normalizer==2.0.10
+    # via requests
+idna==3.3
+    # via requests
 pika==1.2.0
-PyYAML==5.4.1
+    # via pyclowder (setup.py)
+pyyaml==5.4.1
+    # via pyclowder (setup.py)
 requests==2.26.0
+    # via
+    #   pyclowder (setup.py)
+    #   requests-toolbelt
 requests-toolbelt==0.9.1
+    # via pyclowder (setup.py)
+urllib3==1.26.8
+    # via requests
Lines changed: 7 additions & 3 deletions
@@ -1,4 +1,8 @@
-ARG PYCLOWDER_PYTHON=""
-FROM clowder/pyclowder${PYCLOWDER_PYTHON}:onbuild
+FROM python:3.8

-ENV MAIN_SCRIPT="wordcount.py"
+WORKDIR /extractor
+COPY requirements.txt ./
+RUN pip install -r requirements.txt
+
+COPY wordcount.py extractor_info.json ./
+CMD python wordcount.py

sample-extractors/wordcount/README.md

Lines changed: 27 additions & 5 deletions
@@ -2,20 +2,42 @@ A simple extractor that counts the number of characters, words and lines in a te

 # Docker

-This extractor is ready to be run as a docker container. To build the docker container run:
+This extractor is ready to be run as a docker container; the only dependency is a running Clowder instance. Simply build and run.
+
+1. Start Clowder. For help starting Clowder, see our [getting started guide](https://github.com/clowder-framework/clowder/blob/develop/doc/src/sphinx/userguide/installing_clowder.rst).
+
+2. Build the extractor Docker container:

 ```
+# from this directory, run:
+
 docker build -t clowder_wordcount .
 ```

-To run the docker containers use:
+3. Finally, run the extractor:

 ```
-docker run -t -i --rm -e "RABBITMQ_URI=amqp://rabbitmqserver/clowder" clowder_wordcount
-docker run -t -i --rm --link clowder_rabbitmq_1:rabbitmq clowder_wordcount
+docker run -t -i --rm --net clowder_clowder -e "RABBITMQ_URI=amqp://guest:guest@rabbitmq:5672/%2f" --name "wordcount" clowder_wordcount
 ```

-The RABBITMQ_URI and RABBITMQ_EXCHANGE environment variables can be used to control what RabbitMQ server and exchange it will bind itself to, you can also use the --link option to link the extractor to a RabbitMQ container.
+Then open the Clowder web app and run the wordcount extractor on a .txt file (or similar)! Done.
+
+### Python and Docker details
+
+You may use any version of Python 3. Simply edit the first line of the `Dockerfile`; by default it uses `FROM python:3.8`.
+
+Docker flags:
+
+- `--net` links the extractor to the Clowder Docker network (run `docker network ls` to identify your own).
+- `-e RABBITMQ_URI=` sets the environment variable that controls which RabbitMQ server the extractor connects to. Setting `RABBITMQ_EXCHANGE` to choose the exchange it binds to may also help.
+- You can also use `--link` to link the extractor to a RabbitMQ container.
+- `--name` assigns the container a name visible in Docker Desktop.
+
+## Troubleshooting
+
+**If you run into _any_ trouble**, please reach out on our Clowder Slack in the [#pyclowder channel](https://clowder-software.slack.com/archives/CNC2UVBCP).
+
+Alternate methods of running extractors are below.

 # Commandline Execution
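
The `RABBITMQ_URI` value used in the `docker run` command packs host, port, credentials and virtual host into one string, and the trailing `%2f` is the percent-encoded form of the default virtual host `/`. The sketch below breaks the URI apart with Python's standard URL parser to show what each piece means; `describe_amqp_uri` is a hypothetical helper for illustration and is not how pyclowder itself parses the value.

```python
from urllib.parse import unquote, urlparse


def describe_amqp_uri(uri):
    """Split an AMQP URI into its parts using generic URL parsing."""
    parts = urlparse(uri)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "password": parts.password,
        # Everything after the first "/" is the percent-encoded virtual host.
        "vhost": unquote(parts.path[1:]),
    }


print(describe_amqp_uri("amqp://guest:guest@rabbitmq:5672/%2f"))
```

The host name `rabbitmq` only resolves because `--net` attaches the container to the same Docker network as the RabbitMQ service.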

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+pyclowder==2.6.0
