# Introduction

Clowder is an open-source research data management system that supports curation of long-tail data and metadata across
multiple research domains and diverse data types. It uses a metadata extraction bus to perform data curation. Extractors
are software programs that extract specific metadata from a file or dataset (a group of related files).
The Simple Extractor Wrapper is a piece of software being developed to make the process of developing an extractor
easier. This document describes how to write an extractor program using the Simple Extractor Wrapper.

# Goals of the Simple Extractor Wrapper

An extractor can be written in any programming language as long as it can communicate with Clowder using a simple HTTP
web service API and RabbitMQ. It can be hard to develop an extractor from scratch when you also consider the code that
is needed for this communication. To reduce this effort and to avoid code duplication, we created libraries written in
Python (PyClowder) and Java (JClowder) that make the process of writing extractors easy in these languages. We chose
these languages since they are among the most popular ones and are likely to remain so. Even so, there is still some
overhead in developing an extractor using these libraries. To make the process of writing extractors even easier, we
created the Simple Extractor Wrapper, which wraps around your existing source code and converts your code into an
extractor. As the name says, the extractor itself needs to be simple in nature: it processes a file and generates
metadata in JSON format and/or creates a file preview. Other Clowder API endpoints are not currently available through
the Simple Extractor; for anything beyond this, the developer has to fall back to PyClowder, JClowder, or writing the
extractor from scratch.

# Creating an Extractor

The main function of your program needs to accept the path of the input file as a string. It also needs to return an
object containing metadata information ("metadata"), details about file previews ("previews"), or both, in the
following format:

```json
{
    "metadata": {},
    "previews": []
}
```

The metadata sub-document contains the metadata that is uploaded back to Clowder and associated with the file. The
previews array is a list of filenames of previews that will be uploaded to Clowder and associated with the file. Once
the previews are uploaded, they are removed from disk.

When writing the code for the extractor you don't have to worry about the interaction with Clowder and its
subcomponents: you can test your code locally in your development environment by calling the function that processes
the file and checking that the result matches the output described above.

# Using the Extractor in Clowder

Once you are done with the extractor and have tested your code, you can wrap the extractor in a docker image and test
this image in the full Clowder environment. To do this you will need to create a Dockerfile and an extractor_info.json
file, as well as some optional additional files needed by the docker build process. Once you have these files you can
build your image using `docker build -t extractor-example .`. This will build the docker image and tag it with the name
extractor-example (you should replace this with a better name).

The Dockerfile has two environment variables that need to be set:
- R_SCRIPT : the path on disk to the file that needs to be sourced for the function. This can be left blank if no file
  needs to be sourced (for example, when the function is installed as part of a package).
- R_FUNCTION : the name of the function to be called; it takes a file path as input and returns an object that
  contains the data described above.

There are two additional files that can be used when creating the docker image:
- packages.apt : a list of Ubuntu packages that need to be installed from the default Ubuntu repositories.
- docker.R : an R script that is run during the docker build process. This can be used to install any required R
  packages. Another option is to use it to install your own code if it is provided as an R package.
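
For example, a docker.R that installs CRAN packages during the build could look like the following sketch; the
jsonlite package here is only a placeholder, install whatever packages your function actually needs.

```r
# docker.R -- executed once while the docker image is being built.
# Install the CRAN packages that the extractor function depends on
# (the package listed here is only an example).
install.packages(c("jsonlite"), repos = "https://cloud.r-project.org")
```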

An example of the Dockerfile is:

```Dockerfile
FROM clowder/extractors-simple-r-extractor:onbuild

ENV R_SCRIPT="wordcount.R" \
    R_FUNCTION="process_file"
```

There also has to be an extractor_info.json file, which contains information about the extractor and is used by the
extractor framework to initialize the extractor as well as to upload information about the extractor to Clowder.

```json
{
    "@context": "<context root URL>",
    "name": "<extractor name>",
    "version": "<version number>",
    "description": "<extractor description>",
    "author": "<first name> <last name> <<email address>>",
    "contributors": [
        "<first name> <last name> <<email address>>",
        "<first name> <last name> <<email address>>"
    ],
    "contexts": [
        {
            "<metadata term 1>": "<URL definition of metadata term 1>",
            "<metadata term 2>": "<URL definition of metadata term 2>"
        }
    ],
    "repository": [
        {
            "repType": "git",
            "repUrl": "<source code URL>"
        }
    ],
    "process": {
        "file": [
            "<MIME type/subtype>",
            "<MIME type/subtype>"
        ]
    },
    "external_services": [],
    "dependencies": [],
    "bibtex": []
}
```
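
Filled in for a hypothetical word-count extractor, the file might look like the example below. All names and URLs are
illustrative placeholders; the @context value shown is the context document commonly used by Clowder extractors, but
verify it against your own deployment.

```json
{
    "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
    "name": "ncsa.wordcount",
    "version": "1.0",
    "description": "Counts the number of words in a text file",
    "author": "Jane Doe <jane.doe@example.com>",
    "contributors": [],
    "contexts": [
        {
            "wordcount": "http://example.com/terms/wordcount"
        }
    ],
    "repository": [
        {
            "repType": "git",
            "repUrl": "http://example.com/wordcount-extractor.git"
        }
    ],
    "process": {
        "file": [
            "text/plain"
        ]
    },
    "external_services": [],
    "dependencies": [],
    "bibtex": []
}
```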

Once the image with the extractor is built you can test the extractor in the Clowder environment. To do this you will
need to start Clowder first. This can be done using a single [docker-compose file](https://opensource.ncsa.illinois.edu/bitbucket/projects/CATS/repos/pyclowder2/raw/docker-compose.yml).
You can start the full Clowder stack using `docker-compose -p clowder up` in the same folder where you downloaded the
docker-compose file. After some time you will have an instance of Clowder running that you can access at
http://localhost:9000/ (if you use docker with VirtualBox the URL will probably be http://192.168.99.100:9000/).

If this is the first time you have started Clowder you will need to create an account. You will be asked to enter an
email address (use [email protected]). If you look at the console where you started Clowder using docker-compose you
will see some text and a URL of the form http://localhost:9000/signup/57d93076-7eca-418e-be7e-4a06c06f3259. If you
follow this URL you will be able to create an account for Clowder. If you used the [email protected] email address,
this account will have admin privileges.

Once you have the full Clowder stack running, you can start your extractor using
`docker run --rm -ti --network clowder_clowder extractor-example`. This will start the extractor and show its output
on the command line. Once the extractor has started successfully, you can upload an appropriate file and you should
see that it is being processed by the extractor. At this point you have successfully created an extractor and
deployed it in Clowder.