See the [README](https://github.com/clowder-framework/pyclowder/tree/master/sample-extractors/wordcount#readme) in `sample-extractors/wordcount`. Using Docker, no install is required.
+
 ## Example Extractor

 Following is an example of the WordCount extractor. This example will allow the user to specify from the command line
@@ -157,7 +162,7 @@ extractor_info.json, and instead bind only by extractor name. Assuming no other
 extractor instance will then only be triggered via manual or direct messages (i.e. using extractor name), and not by
 upload events in Clowder.

-Note however that if any other instances of the extractor are running on the same RabbitMQ queue without --no-bind,
+Note however that if any other instances of the extractor are running on the same RabbitMQ queue without --no-bind,
 they will still bind by file type as normal regardless of previously existing instances with --no-bind, so use caution
 when running multiple instances of one extractor while using --no-bind.

@@ -174,8 +179,8 @@ process_message.
 The RabbitMQ connector connects to a RabbitMQ instance, creates a queue and binds itself to that queue. Any message in
 the queue will be fetched and passed to the check_message and process_message. This connector takes three parameters:

-* rabbitmq_uri [REQUIRED] : the uri of the RabbitMQ server
-* rabbitmq_exchange [OPTIONAL] : the exchange to which to bind the queue
+- rabbitmq_uri [REQUIRED] : the uri of the RabbitMQ server
+- rabbitmq_exchange [OPTIONAL] : the exchange to which to bind the queue

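To make the `rabbitmq_uri` parameter concrete, here is a minimal sketch of how an AMQP URI breaks down into host, port, credentials and virtual host. It uses only the Python standard library; the helper name `describe_amqp_uri` is a hypothetical illustration and not part of pyclowder, and the default ports are the conventional AMQP ones rather than anything the connector is known to enforce.

```python
from urllib.parse import unquote, urlparse


def describe_amqp_uri(uri):
    """Split an AMQP URI into its parts (hypothetical helper, not a pyclowder API)."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("amqp", "amqps"):
        raise ValueError("expected an amqp:// or amqps:// URI, got %r" % uri)
    return {
        "host": parsed.hostname,
        # 5672 is the conventional AMQP port; 5671 is conventional for TLS.
        "port": parsed.port or (5671 if parsed.scheme == "amqps" else 5672),
        "username": parsed.username,
        "password": parsed.password,
        # The path carries the virtual host, percent-encoded (%2f means "/").
        "vhost": unquote(parsed.path[1:]),
    }


print(describe_amqp_uri("amqp://rabbitmqserver/clowder"))
```

Checking the URI this way before starting an extractor can help separate a malformed URI from a genuinely unreachable RabbitMQ server.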
 ## HPCConnector

@@ -184,18 +189,18 @@ Once all pickle files are processed the extractor will stop. The pickle file is
 argument, the logfile that is being monitored to send feedback back to clowder. This connector takes a single argument
 (which can be list):

-* picklefile [REQUIRED] : a single file, or list of files that are the pickled messages to be processed.
+- picklefile [REQUIRED] : a single file, or list of files that are the pickled messages to be processed.

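The "single file, or list of files" convention for `picklefile` can be sketched as follows. This is only an illustration of the idea under stated assumptions: `load_pickled_messages` is a hypothetical helper, and the real HPCConnector may read and dispatch its pickle files differently.

```python
import pickle


def load_pickled_messages(picklefile):
    """Yield (path, message) pairs from one pickle file or a list of them.

    Hypothetical helper for illustration; not part of pyclowder.
    """
    paths = [picklefile] if isinstance(picklefile, str) else list(picklefile)
    for path in paths:
        with open(path, "rb") as handle:
            # Only unpickle files you trust: pickle can execute arbitrary code.
            yield path, pickle.load(handle)
```

Normalizing the argument to a list up front keeps the processing loop identical for both call styles.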
 ## LocalConnector

-The Local connector will execute an extractor as a standalone program. This can be used to process files that are
-present in a local hard drive. After extracting the metadata, it stores the generated metadata in an output file in the
+The Local connector will execute an extractor as a standalone program. This can be used to process files that are
+present in a local hard drive. After extracting the metadata, it stores the generated metadata in an output file in the
 local drive. This connector takes two arguments:

-* --input-file-path [REQUIRED] : Full path of the local input file that needs to be processed.
-* --output-file-path [OPTIONAL] : Full path of the output file (.json) to store the generated metadata. If no output
-file path is provided, it will create a new file with the name <input_file_with_extension>.json in the same directory
-as that of the input file.
+- --input-file-path [REQUIRED] : Full path of the local input file that needs to be processed.
+- --output-file-path [OPTIONAL] : Full path of the output file (.json) to store the generated metadata. If no output
+file path is provided, it will create a new file with the name <input_file_with_extension>.json in the same directory
+as that of the input file.
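The default described for `--output-file-path` (the input file name, extension included, with `.json` appended, in the input file's directory) can be sketched like this. The function name `default_output_path` is a hypothetical illustration; the connector's own implementation may differ.

```python
def default_output_path(input_file_path, output_file_path=None):
    """Return the explicit output path, or <input_file_with_extension>.json."""
    if output_file_path:
        return output_file_path
    # Appending ".json" to the full input path keeps the original extension
    # and places the output file next to the input file.
    return input_file_path + ".json"


print(default_output_path("/data/notes.txt"))
```

Keeping the original extension in the output name means that `notes.txt` and `notes.csv` in the same directory do not overwrite each other's metadata files.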
 Motivation: provide a simple extractor that lets Python developers use Clowder without detailed knowledge of the PyClowder library. Wrapping existing Python code as a Clowder extractor then takes little effort.

 A Simple Extractor takes a developer-defined main function as an input parameter, runs it to perform the extraction, packs the function's output into the metadata structure defined by the Simple Extractor, and submits it back to Clowder.

 The user's function must return a `dict` object containing metadata and previews.
 `wordcount-simpleextractor` is the simplest example to illustrate how to wrap existing Python code as a Simple Extractor.

 `wordcount.py` is a regular Python file written and provided by the Python developer. In it, `wordcount` invokes the `wc` command on the input file to count its lines, words and characters, and packs the results into a Python dict as metadata.
+
 ```markdown
 import subprocess
-
-def wordcount(input_file):
-    result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
-    (lines, words, characters, _) = result.split()
-    metadata = {
-        'lines': lines,
-        'words': words,
-        'characters': characters
-    }
-    result = {
-        'metadata': metadata
-    }
-    return result
+
+def wordcount(input_file):
+    result = subprocess.check_output(['wc', input_file], stderr=subprocess.STDOUT)
+    (lines, words, characters, _) = result.split()
+    metadata = {
+        'lines': lines,
+        'words': words,
+        'characters': characters
+    }
+    result = {
+        'metadata': metadata
+    }
+    return result
 ```

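One detail worth noting about the sample above: under Python 3, `subprocess.check_output` returns `bytes`, so the counts would be stored as bytes objects rather than numbers. The following is a hedged variant, not the repository's code, that decodes the output and converts the counts to integers. The helper name `parse_wc_output` is an assumption introduced here so that the parsing step can be exercised without running `wc`.

```python
import subprocess


def parse_wc_output(raw):
    """Turn the bytes printed by `wc <file>` into a metadata dict of ints."""
    # `wc` prints "<lines> <words> <bytes> <filename>". Split at most three
    # times so a file name containing spaces stays in one piece.
    text = raw.decode("utf-8", errors="replace")
    lines, words, characters, _filename = text.split(None, 3)
    return {"lines": int(lines), "words": int(words), "characters": int(characters)}


def wordcount(input_file):
    """Return the same shape as the sample: {'metadata': {...}}."""
    # stderr is left alone so warnings from `wc` do not end up in the parsed output.
    raw = subprocess.check_output(["wc", input_file])
    return {"metadata": parse_wc_output(raw)}
```

By default `wc` reports a byte count in its third column, which equals the character count only for single-byte encodings.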
 To build wordcount as a Simple Extractor Docker image, users only need to set two environment variables in the Dockerfile shown below. `EXTRACTION_FUNC` must be set to the name of the extraction function, which in wordcount.py is `wordcount`. `EXTRACTION_MODULE` is the name of the module file containing the definition of the extraction function.
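As a rough illustration of what these two variables make possible, a wrapper could resolve the user's function at start-up along the following lines. This is a sketch under assumptions, not the actual Simple Extractor code: `load_extraction_function` is a hypothetical name, and it assumes `EXTRACTION_MODULE` holds an importable module name (for example `wordcount`, without the `.py` suffix).

```python
import importlib
import os


def load_extraction_function(environ=os.environ):
    """Resolve EXTRACTION_MODULE / EXTRACTION_FUNC to a callable.

    Illustrative only: the real Simple Extractor wrapper may do this differently.
    """
    module_name = environ["EXTRACTION_MODULE"]  # e.g. "wordcount"
    func_name = environ["EXTRACTION_FUNC"]      # e.g. "wordcount"
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError("%s.%s is not callable" % (module_name, func_name))
    return func
```

Because the function is looked up by name at run time, the same wrapper image can serve any module that exposes a function returning the expected `dict`.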
sample-extractors/wordcount/README.md: 23 additions & 5 deletions
@@ -2,20 +2,38 @@ A simple extractor that counts the number of characters, words and lines in a te

 # Docker

-This extractor is ready to be run as a docker container. To build the docker container run:
+This extractor is ready to be run as a Docker container; the only dependency is a running Clowder instance. Simply build and run.
+
+1. Start Clowder. For help starting Clowder, see our [getting started guide](https://github.com/clowder-framework/clowder/blob/develop/doc/src/sphinx/userguide/installing_clowder.rst).
+
+2. Build the extractor Docker image:

 ```
+# from this directory, run:
+
 docker build -t clowder_wordcount .
 ```

-To run the docker containers use:
+3. Finally run the extractor:

 ```
-docker run -t -i --rm -e "RABBITMQ_URI=amqp://rabbitmqserver/clowder" clowder_wordcount
-docker run -t -i --rm --link clowder_rabbitmq_1:rabbitmq clowder_wordcount
 ```
The RABBITMQ_URI and RABBITMQ_EXCHANGE environment variables control which RabbitMQ server and exchange the extractor binds to. You can also use the --link option to link the extractor to a RabbitMQ container.
+Then open the Clowder web app and run the wordcount extractor on a .txt file (or similar)! Done.
+
+### Details
+
+- `--net` links the extractor to the Clowder Docker network (run `docker network ls` to identify your own).
+- `-e RABBITMQ_URI=` sets the environment variable that controls which RabbitMQ server the extractor binds to. Setting `RABBITMQ_EXCHANGE` to choose the exchange may also help.
+- You can also use `--link` to link the extractor to a RabbitMQ container.
+- `--name` assigns the container a name visible in Docker Desktop.
+
+## Troubleshooting
+
+**If you run into _any_ trouble**, please reach out on our Clowder Slack in the [#pyclowder channel](https://clowder-software.slack.com/archives/CNC2UVBCP).
+
+Alternate methods of running extractors are below.