Commit f9a29aa

Added quickstart example with new, isolated, Dockerfile
1 parent 98c1fea commit f9a29aa

4 files changed

+73
-41
lines changed

README.md

Lines changed: 42 additions & 33 deletions
@@ -25,13 +25,18 @@ git clone https://github.com/clowder-framework/pyclowder.git
2525
cd pyclowder
2626
pip install -r requirements.txt
2727
python setup.py install
28-
2928
```
29+
3030
or directly from GitHub:
31+
3132
```
3233
pip install -r https://raw.githubusercontent.com/clowder-framework/pyclowder/master/requirements.txt git+https://github.com/clowder-framework/pyclowder.git
3334
```
3435

36+
## Quickstart example
37+
38+
See the [README](https://github.com/clowder-framework/pyclowder/tree/master/sample-extractors/wordcount#readme) in `sample-extractors/wordcount`. Using Docker, no install is required.
39+
3540
## Example Extractor
3641

3742
Following is an example of the WordCount extractor. This example will allow the user to specify from the command line
@@ -157,7 +162,7 @@ extractor_info.json, and instead bind only by extractor name. Assuming no other
157162
extractor instance will then only be triggered via manual or direct messages (i.e. using extractor name), and not by
158163
upload events in Clowder.
159164

160-
Note however that if any other instances of the extractor are running on the same RabbitMQ queue without --no-bind,
165+
Note however that if any other instances of the extractor are running on the same RabbitMQ queue without --no-bind,
161166
they will still bind by file type as normal regardless of previously existing instances with --no-bind, so use caution
162167
when running multiple instances of one extractor while using --no-bind.
163168

@@ -174,8 +179,8 @@ process_message.
174179
The RabbitMQ connector connects to a RabbitMQ instance, creates a queue and binds itself to that queue. Any message in
175180
the queue will be fetched and passed to the check_message and process_message. This connector takes three parameters:
176181

177-
* rabbitmq_uri [REQUIRED] : the uri of the RabbitMQ server
178-
* rabbitmq_exchange [OPTIONAL] : the exchange to which to bind the queue
182+
- rabbitmq_uri [REQUIRED] : the uri of the RabbitMQ server
183+
- rabbitmq_exchange [OPTIONAL] : the exchange to which to bind the queue
179184

180185
## HPCConnector
181186

@@ -184,18 +189,18 @@ Once all pickle files are processed the extractor will stop. The pickle file is
184189
argument, the logfile that is being monitored to send feedback back to clowder. This connector takes a single argument
185190
(which can be a list):
186191

187-
* picklefile [REQUIRED] : a single file, or list of files that are the pickled messages to be processed.
192+
- picklefile [REQUIRED] : a single file, or list of files that are the pickled messages to be processed.
188193

189194
## LocalConnector
190195

191-
The Local connector will execute an extractor as a standalone program. This can be used to process files that are
192-
present in a local hard drive. After extracting the metadata, it stores the generated metadata in an output file in the
196+
The Local connector will execute an extractor as a standalone program. This can be used to process files that are
197+
present in a local hard drive. After extracting the metadata, it stores the generated metadata in an output file in the
193198
local drive. This connector takes two arguments:
194199

195-
* --input-file-path [REQUIRED] : Full path of the local input file that needs to be processed.
196-
* --output-file-path [OPTIONAL] : Full path of the output file (.json) to store the generated metadata. If no output
197-
file path is provided, it will create a new file with the name <input_file_with_extension>.json in the same directory
198-
as that of the input file.
200+
- --input-file-path [REQUIRED] : Full path of the local input file that needs to be processed.
201+
- --output-file-path [OPTIONAL] : Full path of the output file (.json) to store the generated metadata. If no output
202+
file path is provided, it will create a new file with the name <input_file_with_extension>.json in the same directory
203+
as that of the input file.
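
The default described above can be sketched in a few lines. This is only an illustration of the naming rule, not pyclowder's own code; `resolve_output_path` is a hypothetical helper name.

```python
def resolve_output_path(input_file_path, output_file_path=None):
    """Return the path where the generated metadata would be written.

    Hypothetical helper: if no output path is given, append ".json" to the
    full input file name (extension included), in the same directory.
    """
    if output_file_path:
        return output_file_path
    return input_file_path + '.json'


print(resolve_output_path('/data/notes.txt'))
print(resolve_output_path('/data/notes.txt', '/tmp/out.json'))
```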
199204

200205
# Clowder API wrappers
201206

@@ -250,49 +255,53 @@ COPY <MY.CODE>.py extractor_info.json /home/clowder/
250255
# Command to be run when container is run
251256
CMD python3 <MY.CODE>.py
252257
```
258+
253259
## SimpleExtractor
260+
254261
Motivation: the Simple Extractor bridges the gap between Python developers and knowledge of the PyClowder library. It takes little effort for Python developers to wrap their Python code as a Clowder extractor.
255262

256263
A Simple Extractor takes a developer-defined main function as an input parameter to perform the extraction. It then parses the function's output, packs it into the metadata structure defined by the Simple Extractor, and submits it back to Clowder.
257264

258265
The user's function must return a `dict` object containing metadata and previews.
266+
259267
```python
260268
result = {
261-
'metadata': {},
262-
'previews': [
263-
'filename',
264-
{'file': 'filename'},
265-
{'file': 'filename', 'metadata': {}, 'mimetype': 'image/jpeg'}
266-
]}
269+
  'metadata': {},
270+
  'previews': [
271+
    'filename',
272+
    {'file': 'filename'},
273+
    {'file': 'filename', 'metadata': {}, 'mimetype': 'image/jpeg'}
274+
  ]}
267275
```
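
The structure above allows three shapes for a preview entry: a bare filename, a dict with only `file`, and a dict with `file` plus optional `metadata` and `mimetype`. As a rough sketch of how those shapes relate, the hypothetical helper below (not part of pyclowder) maps each of them to a dict form.

```python
def normalize_preview(entry):
    """Hypothetical helper: turn any of the three preview shapes into a dict."""
    if isinstance(entry, str):
        # Bare filename, e.g. 'thumb.jpg'
        return {'file': entry}
    if isinstance(entry, dict) and 'file' in entry:
        # Dict form, optionally carrying 'metadata' and 'mimetype'
        return dict(entry)
    raise ValueError('preview must be a filename or a dict with a "file" key')


result = {
    'metadata': {},
    'previews': [
        'thumb.jpg',
        {'file': 'thumb.jpg'},
        {'file': 'thumb.jpg', 'metadata': {}, 'mimetype': 'image/jpeg'},
    ],
}
normalized = [normalize_preview(p) for p in result['previews']]
```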
268276

269-
### Example:
277+
### Example:
278+
270279
`wordcount-simpleextractor` is the simplest example to illustrate how to wrap existing Python code as a Simple Extractor.
271280

272281
wordcount.py is a regular Python file defined and provided by the Python developer. In the code, `wordcount` invokes the `wc` command on the input file to count its lines, words, and characters, and packs the results into a Python dict as metadata.
282+
273283
```python
274284
import subprocess
275-
276-
def wordcount(input_file):
277-
result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
278-
(lines, words, characters, _) = result.split()
279-
metadata = {
280-
'lines': lines,
281-
'words': words,
282-
'characters': characters
283-
}
284-
result = {
285-
'metadata': metadata
286-
}
287-
return result
285+
286+
def wordcount(input_file):
287+
    result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
288+
    (lines, words, characters, _) = result.split()
289+
    metadata = {
290+
        'lines': lines,
291+
        'words': words,
292+
        'characters': characters
293+
    }
294+
    result = {
295+
        'metadata': metadata
296+
    }
297+
    return result
288298
```
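
Under Python 3, `subprocess.check_output` returns bytes unless text mode is requested, so the counts in the snippet above come back as byte strings. A possible variant that returns integers is sketched below. It assumes a POSIX `wc` is on the PATH and is an illustration, not the version shipped in the repository.

```python
import subprocess


def wordcount(input_file):
    # text=True (Python 3.7+) makes check_output return str instead of bytes.
    output = subprocess.check_output(['wc', input_file], text=True)
    # wc prints "<lines> <words> <bytes> <filename>"; keep the first three
    # fields so a filename containing spaces does not break the unpacking.
    lines, words, characters = (int(value) for value in output.split()[:3])
    return {
        'metadata': {
            'lines': lines,
            'words': words,
            # By default wc reports a byte count, which equals the character
            # count only for single-byte encodings.
            'characters': characters,
        }
    }
```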
289299

290300
To build wordcount as a Simple Extractor Docker image, users only need to set two environment variables in the Dockerfile shown below. EXTRACTION_FUNC must be set to the name of the extraction function; in wordcount.py, that function is `wordcount`. EXTRACTION_MODULE is the name of the module file that contains the definition of the extraction function.
301+
291302
```dockerfile
292303
FROM clowder/extractors-simple-extractor:onbuild
293304

294305
ENV EXTRACTION_FUNC="wordcount"
295306
ENV EXTRACTION_MODULE="wordcount"
296307
```
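
To make the role of the two variables concrete, here is a minimal sketch of how a module name and a function name taken from the environment could be resolved to a callable. This is an assumption about the general technique, not the Simple Extractor's actual implementation; `load_extraction_func` is a hypothetical name.

```python
import importlib
import os


def load_extraction_func(environ=os.environ):
    """Hypothetical loader: import EXTRACTION_MODULE and return EXTRACTION_FUNC."""
    module_name = environ['EXTRACTION_MODULE']
    func_name = environ['EXTRACTION_FUNC']
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError('%s.%s is not callable' % (module_name, func_name))
    return func


# Illustration with a standard-library module instead of wordcount.py.
dumps = load_extraction_func({'EXTRACTION_MODULE': 'json', 'EXTRACTION_FUNC': 'dumps'})
```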
297-
298-
Lines changed: 7 additions & 3 deletions
@@ -1,4 +1,8 @@
1-
ARG PYCLOWDER_PYTHON=""
2-
FROM clowder/pyclowder${PYCLOWDER_PYTHON}:onbuild
1+
FROM python:3.8
32

4-
ENV MAIN_SCRIPT="wordcount.py"
3+
WORKDIR /extractor
4+
COPY requirements.txt ./
5+
RUN pip install -r requirements.txt
6+
7+
COPY wordcount.py extractor_info.json ./
8+
CMD python wordcount.py

sample-extractors/wordcount/README.md

Lines changed: 23 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -2,20 +2,38 @@ A simple extractor that counts the number of characters, words and lines in a te
22

33
# Docker
44

5-
This extractor is ready to be run as a docker container. To build the docker container run:
5+
This extractor is ready to be run as a Docker container; the only dependency is a running Clowder instance. Simply build and run.
6+
7+
1. Start Clowder. For help starting Clowder, see our [getting started guide](https://github.com/clowder-framework/clowder/blob/develop/doc/src/sphinx/userguide/installing_clowder.rst).
8+
9+
2. Build the extractor Docker image:
610

711
```
12+
# from this directory, run:
13+
814
docker build -t clowder_wordcount .
915
```
1016

11-
To run the docker containers use:
17+
3. Finally run the extractor:
1218

1319
```
14-
docker run -t -i --rm -e "RABBITMQ_URI=amqp://rabbitmqserver/clowder" clowder_wordcount
15-
docker run -t -i --rm --link clowder_rabbitmq_1:rabbitmq clowder_wordcount
20+
docker run -t -i --rm --net clowder_clowder -e "RABBITMQ_URI=amqp://guest:guest@rabbitmq:5672/%2f" --name "wordcount" clowder_wordcount
1621
```
1722

18-
The RABBITMQ_URI and RABBITMQ_EXCHANGE environment variables can be used to control what RabbitMQ server and exchange it will bind itself to, you can also use the --link option to link the extractor to a RabbitMQ container.
23+
Then open the Clowder web app and run the wordcount extractor on a .txt file (or similar)! Done.
24+
25+
### Details
26+
27+
- `--net` links the extractor to the Clowder Docker network (run `docker network ls` to identify your own.)
28+
- `-e RABBITMQ_URI=` sets the environment variable that controls which RabbitMQ server the extractor connects to. The `RABBITMQ_EXCHANGE` environment variable controls which exchange it binds to, and setting it may also help.
29+
- You can also use `--link` to link the extractor to a RabbitMQ container.
30+
- `--name` assigns the container a name visible in Docker Desktop.
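
If the connection fails, it can help to check how the `RABBITMQ_URI` value breaks down. The sketch below uses only the Python standard library; `describe_amqp_uri` is a hypothetical helper and is not part of pyclowder. Note that the trailing `%2f` is the URL-encoded form of `/`, RabbitMQ's default virtual host.

```python
from urllib.parse import unquote, urlsplit


def describe_amqp_uri(uri):
    """Hypothetical helper: split an AMQP URI into its connection parts."""
    parts = urlsplit(uri)
    return {
        'host': parts.hostname,
        'port': parts.port,
        'username': parts.username,
        'password': parts.password,
        # Drop the leading "/" of the path, then decode the virtual host name.
        'vhost': unquote(parts.path[1:]),
    }


info = describe_amqp_uri('amqp://guest:guest@rabbitmq:5672/%2f')
```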
31+
32+
## Troubleshooting
33+
34+
**If you run into _any_ trouble**, please reach out on our Clowder Slack in the [#pyclowder channel](https://clowder-software.slack.com/archives/CNC2UVBCP).
35+
36+
Alternate methods of running extractors are below.
1937

2038
# Commandline Execution
2139

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
pyclowder==2.4.0
