Commit c0d846a: Update extractor docs (#255)

Authored by Michael Johnson; co-authored by Max Burnette and Luigi Marini.

* adding additional extractor documentation
* fixing anchors and links
* fixing code blocks
* formatting python examples
* updating extractor pngs: replacing images to use ncsa csvheader extractor
* updating image: removing secret key
* replacing csv extractor URL
* fixed images in extractors.rst

Co-authored-by: Michael Johnson <[email protected]>
Co-authored-by: Max Burnette <[email protected]>
Co-authored-by: Luigi Marini <[email protected]>

1 parent fa651f1 commit c0d846a
File tree: 5 files changed, +270 -3 lines (four PNG screenshots plus the file below)

doc/src/sphinx/develop/extractors.rst

Lines changed: 270 additions & 3 deletions
@@ -2,13 +2,20 @@

Extractors
==============

* :ref:`Overview`
* :ref:`Building and Deploying Extractors`
* :ref:`Testing Locally with Clowder`
* :ref:`A Quick Note on Debugging`
* :ref:`Additional pyClowder Examples`

Overview
########

One of the major features of Clowder is the ability to deploy custom extractors that run when files are uploaded to the system.
A list of extractors is available on `GitHub <https://github.com/clowder-framework>`_; a full list is available in `Bitbucket <https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS>`_.

To write new extractors, `pyClowder <https://github.com/clowder-framework/pyclowder>`_ is a good starting point.
It provides a simple Python library for writing new extractors in Python. Please see the
`sample extractors <https://github.com/clowder-framework/pyclowder/sample-extractors>`_ directory for examples.
That said, extractors can be written in any language that supports HTTP, JSON, and AMQP
(ideally one for which a `RabbitMQ client library <https://www.rabbitmq.com/>`_ is available).

@@ -27,4 +34,264 @@ The current list of supported events is:

* Metadata removed from dataset
* File/Dataset manual submission to extractor

Building and Deploying Extractors
###################################

To create and deploy an extractor to your Clowder instance you'll need several pieces: your own code, Clowder wrapper code that helps you integrate your code with Clowder, an extractor metadata file, and, usually, a Dockerfile for deploying your extractor. With these pieces in place, a user can search for the extractor, submit files to it, and have any metadata it returns stored, all within Clowder.

Although the main intent of an extractor is to process a file within Clowder and save metadata associated with that file, Clowder's ability to interact with files makes extractors flexible enough to go beyond that scope. For instance, an extractor could read a file and push data to another application, modify the file, or create derived inputs within Clowder.

To learn more about extractor basics, please refer to the following `documentation <https://opensource.ncsa.illinois.edu/confluence/display/CATS/Extractors#Extractors-Extractorbasics>`_.

For general API documentation refer `here <https://clowderframework.org/swagger/?url=https://clowder.ncsa.illinois.edu/clowder/swagger>`_. API documentation for your particular instance of Clowder can be found under Help -> API.

1. User code

This is code written by you that takes one or more files as input and returns metadata associated with those files.

2. Clowder code

We've created Clowder packages in Python and Java that make it easier to write extractors. These packages wrap your code so that your extractor can be recognized and run within your Clowder instance. Details on building an extractor can be found at the following links:

* `jClowder <https://github.com/clowder-framework/jclowder>`_
* `pyClowder <https://github.com/clowder-framework/pyclowder>`_
* From scratch, using:

  * a RabbitMQ client library
  * HTTP/JSON client libraries

3. extractor_info.json

The extractor_info.json file contains metadata about your extractor; it is how Clowder "knows" about your extractor. Refer `here <https://opensource.ncsa.illinois.edu/confluence/display/CATS/extractor_info.json>`_ for more information on the extractor_info.json file.
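For illustration, a minimal extractor_info.json might look like the sketch below. The field values are placeholders, and the exact schema is described on the page linked above; the ``process`` section tells Clowder which MIME types trigger the extractor.

.. code-block:: json

    {
        "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
        "name": "ncsa.myextractor",
        "version": "1.0",
        "description": "Example extractor that returns metadata for uploaded files.",
        "author": "Jane Doe",
        "process": {
            "file": ["text/csv"]
        }
    }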
4. Docker

To deploy your extractor within Clowder you need to create a Docker container. Docker packages your code with all its dependencies, allowing your code to be deployed and run on any system that has Docker installed. To learn more about Docker containers refer to `docker.com <https://www.docker.com/resources/what-container>`_. For a useful tutorial on Docker containers refer to `katacoda.com <https://www.katacoda.com/courses/docker>`_. Installing Docker itself requires minimal computer skills, though the exact steps depend on the type of machine you are using.

To see specific examples of Dockerfiles, refer to the Clowder code links above or browse existing extractors at the following links:

- `Clowder GitHub <https://github.com/clowder-framework>`_
- `Clowder Bitbucket <https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS>`_

If you are creating a simple Python extractor, a Dockerfile can be generated for you by following the instructions in the `clowder/generator <https://github.com/clowder-framework/generator>`_ repository.
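As a sketch of what such a Dockerfile often looks like for a simple pyClowder extractor (the file names ``extractor.py``, ``extractor_info.json``, and ``requirements.txt`` are assumptions about your project layout, not requirements):

.. code-block:: dockerfile

    FROM python:3.8-slim

    WORKDIR /extractor

    # install pyclowder plus any extractor-specific dependencies
    COPY requirements.txt ./
    RUN pip install --no-cache-dir -r requirements.txt

    # copy the extractor code and its metadata description
    COPY extractor_info.json extractor.py ./

    # the RABBITMQ_URI environment variable is supplied at run time
    CMD ["python", "extractor.py"]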
Testing Locally with Clowder
##############################

While building your extractor, it is useful to test it within a Clowder instance. Testing locally before deploying to development or production clusters can help you debug issues quickly. Below are instructions for deploying a local instance of Clowder together with your extractor for quick testing. The following docker commands should be executed from a terminal window. They should work on a Linux system with Docker installed, or on macOS and Windows with `Docker Desktop <https://docs.docker.com/desktop>`_ installed.

1. Build your Docker image. Run the following in the same directory as your Dockerfile:

.. code-block:: bash

    docker build -t myimage:tag .

2. Once your Docker image is built, it can be deployed within Clowder:

.. code-block:: bash

    docker-compose -f docker-compose.yml -f docker-compose.extractors.yml up -d

Below are examples of each file:

* `docker-compose.yml <https://github.com/clowder-framework/clowder/blob/develop/docker-compose.yml>`_

  * This file sets up Clowder and its dependencies, such as MongoDB and RabbitMQ. You should not have to modify it.

* `docker-compose.override.yml <https://github.com/clowder-framework/clowder/blob/develop/docker-compose.override.example.yml>`_

  * This file overrides defaults and can be used to customize Clowder. When downloading the file, make sure to rename it to docker-compose.override.yml. In this case it exposes the Clowder, MongoDB, and RabbitMQ ports on localhost.

* `docker-compose.extractors.yml <https://github.com/clowder-framework/clowder/blob/develop/docker-compose.extractors.yml>`_

  * This file deploys your extractor to Clowder. You will have to update it to reflect your extractor's name, your Docker image name and version tag, and any other requirements such as environment variables. See below:

.. code-block:: yaml

    version: '3.5'

    services:
      myextractor:
        image: myextractor_imagename:mytag
        restart: unless-stopped
        networks:
          - clowder
        depends_on:
          - rabbitmq
          - clowder
        environment:
          - RABBITMQ_URI=${RABBITMQ_URI:-amqp://guest:guest@rabbitmq/%2F}
          # Add any additional environment variables your code may need here
      # Add multiple extractors below, following the template above

3. Initialize Clowder. The commands below assume that you are running them in a folder called tests, hence the network name tests_clowder. If you ran the docker-compose command in a folder called clowder, the network would be clowder_clowder.

.. code-block:: bash

    docker run -ti --rm --network tests_clowder clowder/mongo-init

4. Enter email, first name, last name, password, and admin: true when prompted.

5. Navigate to localhost:9000 and log in with the credentials you created in step 4.

6. Create a test space and dataset. Then click 'Select Files' and upload a file. (If the file stays in CREATED and never moves to PROCESSED, you may need to change the permissions on the data folder and run ``docker run -ti --rm --network tests_clowder clowder/mongo-init`` again.)

7. Click on the file and submit it for extraction.

8. It may take a few minutes for the extractors to become visible within Clowder.

9. Eventually you should see your extractor in the list; click Submit.

10. Navigate back to the file and click on Metadata.

11. If everything worked, you should see your metadata present.

A Quick Note on Debugging
##########################

To check the status of your extraction, navigate to the file within Clowder and click on the "Extractions" tab. This shows a list of the extractions that have been submitted. Any error messages will show up here if your extractor did not run successfully.

.. image:: /_static/ug_extractors-1.png

You can expand the tab to see all submissions of the extractor and any error messages associated with each submission:

.. image:: /_static/ug_extractors-2.png

If your extractor failed, if the error message is not helpful, or if you do not see metadata in the "Metadata" tab for the file, you can check the logs of your extractor's Docker container by executing the following:

.. code-block:: bash

    docker logs tests_myextractor_1

Replace "myextractor" with whatever name you gave your extractor in the docker-compose.extractors.yml file.

If you want to follow the logs while your extractor is running, add the -f flag:

.. code-block:: bash

    docker logs -f tests_myextractor_1

.. image:: /_static/ug_extractors-4.png

You can print debugging information from your extractor to the Docker logs by using the logging object within your code. The following example is for pyClowder:

.. code-block:: python

    logging.info("Uploaded metadata %s", metadata)

In the screenshot above you can see the lines printed by logging.info; each such line starts with INFO:

.. code-block:: text

    2021-04-27 16:47:49,995 [MainThread ] INFO

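The timestamped line above is produced by Python's standard logging module. As a standalone sketch (the exact format string pyClowder uses is an assumption here, reconstructed from the sample line), you can produce similar output like this:

```python
import logging

# Format string approximating the console output shown above:
# timestamp, padded thread name, level, then the message.
formatter = logging.Formatter(
    "%(asctime)s [%(threadName)-12.12s] %(levelname)-5.5s : %(message)s")

handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# prints a line similar to:
# 2021-04-27 16:47:49,995 [MainThread  ] INFO  : Uploaded metadata {...}
logger.info("Uploaded metadata %s", {"headers": ["a", "b"]})
```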
Additional pyClowder Examples
##############################

For a simple example of an extractor, please refer to `extractors-csv <https://github.com/clowder-framework/extractors-csv>`_. This extractor is submitted on a CSV file and returns the headers as metadata.

.. image:: /_static/ug_extractors-3.png

Specifying multiple inputs
***************************

This example assumes the second input file is within the same dataset.

.. code-block:: python

    #!/usr/bin/env python3

    import logging

    from pyclowder.extractors import Extractor
    import pyclowder.files
    import pyclowder.datasets


    class MyExtractor(Extractor):
        def __init__(self):
            Extractor.__init__(self)
            logging.getLogger('pyclowder').setLevel(logging.DEBUG)
            logging.getLogger('__main__').setLevel(logging.DEBUG)

            # Add an argument to pass the second filename, with a default value
            self.parser.add_argument('--secondfile', default="my_default_second_file.csv")
            self.setup()

        def process_message(self, connector, host, secret_key, resource, parameters):
            # grab the input file path
            inputfile = resource["local_paths"][0]

            # get the list of files in the dataset
            filelist = pyclowder.datasets.get_file_list(connector, host, secret_key, parameters['datasetId'])

            # loop through the dataset and grab the id of the file whose
            # filename matches the desired filename
            secondfileid = None
            for file_dict in filelist:
                if file_dict['filename'] == self.args.secondfile:
                    secondfileid = file_dict['id']

            # or, a more pythonic way to do the above loop:
            # secondfileid = [f['id'] for f in filelist if f['filename'] == self.args.secondfile][0]

            # download the second file locally so the extractor can operate on it
            secondfilepath = pyclowder.files.download(connector, host, secret_key, secondfileid)

            # Execute your function/code here to operate on inputfile and
            # secondfilepath, producing a metadata dict named "my_metadata".

            # upload the metadata your code produced as "my_metadata"
            metadata = self.get_metadata(my_metadata, 'file', parameters['id'], host)
            pyclowder.files.upload_metadata(connector, host, secret_key, parameters['id'], metadata)


    if __name__ == "__main__":
        extractor = MyExtractor()
        extractor.start()

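The filename-to-id lookup in the example above can be exercised on its own with plain dictionaries, independent of pyClowder (the helper name ``find_file_id`` is ours for illustration, not part of the library):

```python
def find_file_id(filelist, filename):
    """Return the id of the first file whose filename matches, or None."""
    return next((f["id"] for f in filelist if f["filename"] == filename), None)

# a file list shaped like the output of pyclowder.datasets.get_file_list()
files = [
    {"id": "a1b2", "filename": "data.csv"},
    {"id": "c3d4", "filename": "labels.csv"},
]

find_file_id(files, "labels.csv")   # -> "c3d4"
find_file_id(files, "missing.csv")  # -> None
```

Returning None for a missing file (instead of indexing into a possibly empty list, as in the commented one-liner above) avoids an IndexError when the second file is absent from the dataset.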
Renaming files
*******************

.. code-block:: python

    import json
    import logging

    from pyclowder.extractors import Extractor
    import pyclowder.files


    class MyExtractor(Extractor):
        def __init__(self):
            Extractor.__init__(self)
            logging.getLogger('pyclowder').setLevel(logging.DEBUG)
            logging.getLogger('__main__').setLevel(logging.DEBUG)

            # Add an argument to pass the new filename
            self.parser.add_argument('--filename')
            self.setup()

        def rename_file(self, connector, host, key, fileid, filename):
            # build the rename endpoint URL for this file
            rename_url = '%sapi/files/%s/filename' % (host, fileid)

            payload = json.dumps({"name": filename})

            connector.put(rename_url,
                          data=payload,
                          headers={"Content-Type": "application/json", "X-API-KEY": key},
                          verify=connector.ssl_verify if connector else True)

        def process_message(self, connector, host, secret_key, resource, parameters):
            # grab the input file path
            inputfile = resource["local_paths"][0]

            if self.args.filename:
                # rename the file to the requested name
                self.rename_file(connector, host, secret_key, parameters['id'], self.args.filename)

            # upload any metadata your code produced as "my_metadata"
            metadata = self.get_metadata(my_metadata, 'file', parameters['id'], host)
            pyclowder.files.upload_metadata(connector, host, secret_key, parameters['id'], metadata)


    if __name__ == "__main__":
        extractor = MyExtractor()
        extractor.start()

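The ``connector.put`` call above ultimately issues a plain HTTP PUT against the Clowder API. As a standard-library sketch of that request (``build_rename_request`` is our illustrative helper, not a pyClowder function, and the host, key, and file id are placeholders):

```python
import json
import urllib.request

def build_rename_request(host, key, fileid, new_name):
    """Build the PUT request that renames a file via the Clowder API.
    host must end with a slash, e.g. "http://localhost:9000/"."""
    url = "%sapi/files/%s/filename" % (host, fileid)
    body = json.dumps({"name": new_name}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "X-API-KEY": key},
    )

req = build_rename_request("http://localhost:9000/", "mykey", "abc123", "renamed.csv")
# pass req to urllib.request.urlopen() to actually send it
```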
