Extractors
==============

* :ref:`Overview`
* :ref:`Building and Deploying Extractors`
* :ref:`Testing Locally with Clowder`
* :ref:`A Quick Note on Debugging`
* :ref:`Additional pyClowder Examples`

Overview
########

One of the major features of Clowder is the ability to deploy custom extractors that run when files are uploaded to the system.
A list of extractors is available on `GitHub <https://github.com/clowder-framework>`_; a full list is available in `Bitbucket <https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS>`_.

To write new extractors, `pyClowder <https://github.com/clowder-framework/pyclowder>`_ is a good starting point.
It provides a simple Python library for writing new extractors. Please see the
`sample extractors <https://github.com/clowder-framework/pyclowder/tree/master/sample-extractors>`_ directory for examples.

That being said, extractors can be written in any language that supports HTTP, JSON and AMQP
(ideally a `RabbitMQ client library <https://www.rabbitmq.com/>`_ is available for it).

The current list of supported events is:

* Metadata removed from dataset
* File/Dataset manual submission to extractor

Building and Deploying Extractors
###################################

To create and deploy an extractor to your Clowder instance you'll need several pieces: user code, Clowder wrapper code that helps you integrate your code into Clowder, an extractor metadata file, and possibly a Dockerfile for deploying your extractor. With these pieces in place, a user is able to search for your extractor, submit files to it, and have any metadata it returns stored, all within Clowder.

Although the main purpose of an extractor is to process a file within Clowder and save metadata associated with that file, extractors are flexible enough to be used beyond this intended scope. For instance, a user could write extractor code that reads a file and pushes data to another application, modifies the file, or creates derived files within Clowder.

To learn more about extractor basics, please refer to the following `documentation <https://opensource.ncsa.illinois.edu/confluence/display/CATS/Extractors#Extractors-Extractorbasics>`_.

For general API documentation, refer `here <https://clowderframework.org/swagger/?url=https://clowder.ncsa.illinois.edu/clowder/swagger>`_. API documentation for your particular instance of Clowder can be found under Help -> API.

1. User code

This is code written by you that takes one or more files as input and returns metadata associated with those files.

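For illustration only (the function name and metadata fields below are made up for this sketch), user code can be as simple as a plain function that maps an input file to a metadata dictionary:

.. code-block:: python

   # A hypothetical piece of "user code": pure logic, no Clowder dependencies.
   def count_words(input_path):
       """Count lines, words and characters in a text file."""
       lines = words = characters = 0
       with open(input_path, 'r') as f:
           for line in f:
               lines += 1
               words += len(line.split())
               characters += len(line)
       return {'lines': lines, 'words': words, 'characters': characters}
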
2. Clowder Code

We've created Clowder packages in Python and Java that make it easier for you to write extractors. These packages help wrap your code so that your extractor can be recognized and run within your Clowder instance (see the sketch below). Details on building an extractor can be found at the following links:

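As a rough sketch of the Python route (modeled on the pyClowder sample extractors; the class name and the reuse of the ``count_words`` function from step 1 are placeholders), an extractor subclasses ``Extractor``, processes the file it is handed, and uploads the resulting metadata:

.. code-block:: python

   import logging

   import pyclowder.files
   from pyclowder.extractors import Extractor


   class WordCountExtractor(Extractor):
       """Minimal extractor: wraps the user code from step 1 with pyClowder."""

       def __init__(self):
           Extractor.__init__(self)
           self.setup()  # parses command-line arguments and environment variables
           logging.getLogger('__main__').setLevel(logging.DEBUG)

       def process_message(self, connector, host, secret_key, resource, parameters):
           # pyClowder has already downloaded the file and hands us a local path
           input_file = resource['local_paths'][0]
           file_id = resource['id']

           result = count_words(input_file)  # user code from the sketch in step 1

           # wrap the result in the envelope Clowder expects and upload it
           metadata = self.get_metadata(result, 'file', file_id, host)
           pyclowder.files.upload_metadata(connector, host, secret_key, file_id, metadata)


   if __name__ == '__main__':
       WordCountExtractor().start()
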
3. extractor_info.json

The extractor_info.json file includes metadata about your extractor; it is what allows Clowder to “know” about your extractor. Refer `here <https://opensource.ncsa.illinois.edu/confluence/display/CATS/extractor_info.json>`_ for more information on the extractor_info.json file.

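A minimal extractor_info.json, adapted from the pyClowder samples (the name, version, author, and MIME types below are placeholders you would change), looks roughly like this:

.. code-block:: json

   {
       "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
       "name": "ncsa.wordcount",
       "version": "1.0",
       "description": "Counts lines, words and characters in a text file.",
       "author": "Jane Doe <jdoe@example.com>",
       "contributors": [],
       "contexts": [],
       "repository": [],
       "process": {
           "file": ["text/*"]
       },
       "external_services": [],
       "dependencies": [],
       "bibtex": []
   }

The process section tells Clowder which events and MIME types should trigger the extractor.
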
4. Docker

To deploy your extractor within Clowder you need to create a Docker container. Docker packages your code with all its dependencies, allowing your code to be deployed and run on any system that has Docker installed. To learn more about Docker containers refer to `docker.com <https://www.docker.com/resources/what-container>`_. For a useful tutorial on Docker containers refer to `katacoda.com <https://www.katacoda.com/courses/docker>`_. Installing Docker requires minimal computer skills, depending on the type of machine that you are using.

To see specific examples of Dockerfiles, refer to the Clowder Code links above or peruse existing extractors at the following links:

If you are creating a simple Python extractor, a Dockerfile can be generated for you by following the instructions in the `clowder/generator <https://github.com/clowder-framework/generator>`_ repository.

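As a rough sketch (the base image, file names, and dependencies are placeholders; your extractor may need extra system or Python packages), a Dockerfile for a simple Python extractor can be as small as:

.. code-block:: docker

   FROM python:3.8

   WORKDIR /extractor

   # install the Clowder wrapper library plus anything your user code needs
   RUN pip install --no-cache-dir pyclowder

   # copy the extractor metadata and code into the image
   COPY extractor_info.json wordcount.py ./

   # on startup, connect to RabbitMQ and wait for extraction messages
   CMD ["python", "wordcount.py"]
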
Testing Locally with Clowder
##############################

While building your extractor, it is useful to test it within a Clowder instance. Prior to deploying your extractor on development or production clusters, testing locally can help you debug issues quickly. Below are instructions on how to deploy a local instance of Clowder and run your extractor against it for quick testing. The following Docker commands should be executed from a terminal window. They should work on a Linux system with Docker installed, or on Mac and Windows with `Docker Desktop <https://docs.docker.com/desktop>`_ installed.

1. Build your Docker image: run the following in the same directory as your Dockerfile.

.. code-block:: bash

   docker build -t myimage:tag .

2. Once your Docker image is built, it can be deployed within Clowder.

.. code-block:: bash

   docker-compose -f docker-compose.yml -f docker-compose.extractors.yml up -d

* docker-compose.override.yml: This file overrides defaults, and can be used to customize Clowder. When downloading the file, make sure to rename it to docker-compose.override.yml. In this case it will expose the clowder, mongo and rabbitmq ports to the localhost.
* docker-compose.extractors.yml: This file deploys your extractor to Clowder. You will have to update this file to reflect your extractor's name, Docker image name and version tag, and any other requirements like environment variables. See the sketch below.

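A minimal docker-compose.extractors.yml might look like the following sketch; the service name myextractor and the image myimage:tag are placeholders matching the build step above, and the RABBITMQ_URI default may differ in your Clowder release:

.. code-block:: yaml

   version: '3.5'

   services:
     myextractor:
       image: myimage:tag
       restart: unless-stopped
       environment:
         RABBITMQ_URI: "amqp://guest:guest@rabbitmq:5672/%2F"
         # Add any additional environment variables your code may need here

     # Add multiple extractors below following template above
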
3. Initialize Clowder. All the commands below assume that you are running this in a folder called tests, hence the network name tests_clowder. If you ran the docker-compose command in a folder called clowder, the network would be clowder_clowder.

.. code-block:: bash

   docker run -ti --rm --network tests_clowder clowder/mongo-init

4. Enter email, first name, last name, password, and admin: true when prompted.

5. Navigate to localhost:9000 and log in with the credentials you created in step 4.

6. Create a test space and dataset. Then click 'Select Files' and upload a file (if the file stays in CREATED and never moves to PROCESSED, you might need to change the permissions on the data folder using docker run -ti --rm --network tests_clowder clowder/mongo-init).

7. Click on the file and submit it for extraction.

8. It may take a few minutes for the extractors to show up as available within Clowder.

9. Eventually you should see your extractor in the list; click submit.

10. Navigate back to the file and click on the metadata tab.

11. If everything worked successfully, you should see your metadata there.

A Quick Note on Debugging
##########################

To check the status of your extraction, navigate to the file within Clowder and click on the “Extractions” tab. This will give you a list of extractions that have been submitted. Any error messages will show up here if your extractor did not run successfully.

.. image:: /_static/ug_extractors-1.png

You can expand the tab to see all submissions of the extractor and any error messages associated with each submission:

.. image:: /_static/ug_extractors-2.png

If your extractor failed, the error message is not helpful, or you do not see metadata in the “Metadata” tab for the file, you can check the logs your extractor writes to its Docker container by executing the following:

.. code-block:: bash

   docker logs tests_myextractor_1

Replace “myextractor” with whatever name you gave your extractor in the docker-compose.extractors.yml file.

If you want to watch the logs as your extractor is running, you can type:

.. code-block:: bash

   docker logs -f tests_myextractor_1

.. image:: /_static/ug_extractors-4.png

You can print any debugging information within your extractor to the Docker logs by utilizing the logging object within your code. The following example is for pyClowder:

.. code-block:: python

   logging.info("Uploaded metadata %s", metadata)

In the screenshot above you can see the lines printed by logging.info; each such line starts with INFO:

.. code-block:: text

   2021-04-27 16:47:49,995 [MainThread ] INFO

Additional pyClowder Examples
##############################

For a simple example of an extractor, please refer to `extractor-csv <https://github.com/clowder-framework/extractors-csv>`_. This extractor runs on a CSV file and returns the column headers as metadata.

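Roughly, the core logic of such an extractor (a paraphrase for illustration, not the repository's exact code) fits in a few lines:

.. code-block:: python

   import csv

   def extract_headers(input_path):
       """Return the column headers of a CSV file as metadata."""
       with open(input_path, 'r', newline='') as f:
           headers = next(csv.reader(f), [])
       return {'headers': headers}
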
.. image:: /_static/ug_extractors-3.png

Specifying multiple inputs
***************************

This example assumes the data files are within the same dataset.

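As a hedged sketch of the idea (the resource fields and helper signatures shown here follow pyClowder conventions but may differ in your version), an extractor triggered on one file can locate and download a sibling file from the same dataset:

.. code-block:: python

   import pyclowder.datasets
   import pyclowder.files

   # inside your Extractor subclass:
   def process_message(self, connector, host, secret_key, resource, parameters):
       first_input = resource['local_paths'][0]
       dataset_id = resource['parent']['id']  # dataset containing the triggering file

       # list the other files in the same dataset and pick a second input
       files = pyclowder.datasets.get_file_list(connector, host, secret_key, dataset_id)
       sibling = next((f for f in files if f['id'] != resource['id']), None)
       if sibling is not None:
           second_input = pyclowder.files.download(connector, host, secret_key, sibling['id'])
           # ... process first_input and second_input together ...
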