Commit da060b0

Initial framework to test extractors
1 parent b0a94e2 commit da060b0

File tree

10 files changed: +378 -0 lines changed
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
FROM python:3.8

WORKDIR /extractor
COPY requirements.txt ./
RUN pip install -r requirements.txt

COPY test-dataset-extractor.py extractor_info.json ./
CMD python test-dataset-extractor.py
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
A simple test extractor that verifies the dataset functions in pyclowder.

# Docker

This extractor is ready to run as a Docker container; the only dependency is a running Clowder instance. Simply build and run.

1. Start Clowder V2. For help starting Clowder V2, see our [getting started guide](https://github.com/clowder-framework/clowder2/blob/main/README.md).

2. Build the extractor Docker image:

```
# from this directory, run:

docker build -t test-dataset-extractor .
```

3. Finally, run the extractor:

```
docker run -t -i --rm --net clowder_clowder -e "RABBITMQ_URI=amqp://guest:guest@rabbitmq:5672/%2f" --name "test-dataset-extractor" test-dataset-extractor
```

Then open the Clowder web app and run this test extractor on a dataset! Done.

### Python and Docker details

You may use any version of Python 3. Simply edit the first line of the `Dockerfile`; by default it uses `FROM python:3.8`.

Docker flags:

- `--net` attaches the extractor to the Clowder Docker network (run `docker network ls` to identify your own).
- `-e RABBITMQ_URI=` sets the environment variable that controls which RabbitMQ server and exchange the extractor binds to. Setting `RABBITMQ_EXCHANGE` may also help.
- You can also use `--link` to link the extractor to a RabbitMQ container.
- `--name` assigns the container a name visible in Docker Desktop.
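The `%2f` at the end of `RABBITMQ_URI` above is the URL-encoded default vhost `/`. If your RabbitMQ vhost differs, the URI can be assembled like this (a minimal sketch; the helper name is ours, not part of pyclowder):

```python
from urllib.parse import quote

def rabbitmq_uri(user, password, host, port=5672, vhost="/"):
    # Percent-encode the vhost so "/" becomes "%2F" (safe='' also encodes slashes)
    return f"amqp://{user}:{password}@{host}:{port}/{quote(vhost, safe='')}"

print(rabbitmq_uri("guest", "guest", "rabbitmq"))  # amqp://guest:guest@rabbitmq:5672/%2F
```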

## Troubleshooting

**If you run into _any_ trouble**, please reach out on our Clowder Slack in the [#pyclowder channel](https://clowder-software.slack.com/archives/CNC2UVBCP).

Alternate methods of running extractors are below.

# Commandline Execution

To execute the extractor from the command line you will need the required packages installed. It is highly recommended to use a Python virtual environment for this: create the virtual environment first, then activate it, and finally install all required packages.

```
Step 1 - Start the Clowder docker-compose stack

Step 2 - Start the heartbeat listener
virtualenv clowder2-python (or try pipenv)
source clowder2-python/bin/activate

Step 3 - Run heartbeat_listener_sync.py to register the new extractor (this step will likely not be needed in future)
cd ~/Git/clowder2/backend
pip install email_validator
copy heartbeat_listener_sync.py to /backend from /backend/app/rabbitmq
python heartbeat_listener_sync.py

Step 4 - Install the pyclowder branch & run the extractor
source ~/clowder2-python/bin/activate
pip uninstall pyclowder

# the pyclowder Git repo should have Todd's branch checked out (50-clowder20-submit-file-to-extractor)
pip install -e ~/Git/pyclowder

cd ~/Git/pyclowder/sample-extractors/test-dataset-extractor
export CLOWDER_VERSION=2
export CLOWDER_URL=http://localhost:8000/

python test-dataset-extractor.py

Step 5 - Post a particular file ID (text file) to the new extractor
POST http://localhost:3002/api/v2/files/639b31754241665a4fc3e513/extract?extractorName=ncsa.test-dataset-extractor

Or, go to the Clowder UI and submit a file for extraction.
```
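Step 5 can also be scripted. A minimal sketch of building the request URL from the route above (the file ID is the example from Step 5; how you authenticate is an assumption, so the actual POST is left as a comment):

```python
CLOWDER_URL = "http://localhost:3002"

def extract_url(file_id, extractor_name):
    # Build the v2 extraction route shown in Step 5
    return f"{CLOWDER_URL}/api/v2/files/{file_id}/extract?extractorName={extractor_name}"

url = extract_url("639b31754241665a4fc3e513", "ncsa.test-dataset-extractor")
print(url)
# e.g. requests.post(url, headers={"X-API-KEY": "<your key>"})  # hypothetical auth header
```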

# Run the extractor from PyCharm

You can run heartbeat_listener_sync.py and test-dataset-extractor.py from PyCharm.
Create a pipenv (generally PyCharm prompts you to create one when you first open the file). To run test-dataset-extractor.py,
add 'CLOWDER_VERSION=2' to the environment variables in the run configuration.
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
{
  "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
  "name": "ncsa.test-dataset-extractor",
  "version": "2.0",
  "description": "Test Dataset extractor. Test to verify all functionalities of dataset in pyclowder.",
  "author": "Dipannita Dey <[email protected]>",
  "contributors": [],
  "contexts": [
    {
      "lines": "http://clowder.ncsa.illinois.edu/metadata/sample_metadata#lines",
      "words": "http://clowder.ncsa.illinois.edu/metadata/sample_metadata#words",
      "characters": "http://clowder.ncsa.illinois.edu/metadata/sample_metadata#characters"
    }
  ],
  "repository": [
    {
      "repType": "git",
      "repUrl": "https://opensource.ncsa.illinois.edu/stash/scm/cats/pyclowder.git"
    }
  ],
  "process": {
    "dataset": [
      "*"
    ]
  },
  "external_services": [],
  "dependencies": [],
  "bibtex": []
}
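Before registering, `extractor_info.json` can be sanity-checked with a few lines of Python (a sketch; the required-key list is our assumption based on the fields above, not a Clowder-defined schema):

```python
import json

REQUIRED_KEYS = {"@context", "name", "version", "description", "process"}

def check_extractor_info(text):
    # Parse the JSON and return any missing top-level keys
    info = json.loads(text)
    return REQUIRED_KEYS - info.keys()

sample = '{"@context": "x", "name": "ncsa.test-dataset-extractor", "version": "2.0", "description": "d", "process": {"dataset": ["*"]}}'
print(check_extractor_info(sample))  # set() -> nothing missing
```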
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
pyclowder==2.6.0
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
#!/usr/bin/env python

"""Example extractor based on the clowder code."""

import logging

from pyclowder.extractors import Extractor
import pyclowder.datasets
import pyclowder.files


class TestDatasetExtractor(Extractor):
    """Test the functionalities of an extractor."""
    def __init__(self):
        Extractor.__init__(self)

        # add any additional arguments to parser
        # self.parser.add_argument('--max', '-m', type=int, nargs='?', default=-1,
        #                          help='maximum number (default=-1)')

        # parse command line and load default logging configuration
        self.setup()

        # setup logging for the extractor
        logging.getLogger('pyclowder').setLevel(logging.DEBUG)
        logging.getLogger('__main__').setLevel(logging.DEBUG)

    def process_message(self, connector, host, secret_key, resource, parameters):
        # Process the dataset and upload the results

        logger = logging.getLogger(__name__)
        dataset_id = resource['id']

        # Local path to the file you want to upload to the dataset
        file_path = 'a7.txt'

        # Upload a new file to the dataset
        file_id = pyclowder.files.upload_to_dataset(connector, host, secret_key, dataset_id, file_path, True)
        if file_id is None:
            logger.error("Error uploading file")
        else:
            logger.info("File uploaded successfully")

        # Get the file list under the dataset
        file_list = pyclowder.datasets.get_file_list(connector, host, secret_key, dataset_id)
        logger.info("File list : %s", file_list)
        if file_id in [file['id'] for file in file_list]:
            logger.info("File uploading and retrieving file list succeeded")
        else:
            logger.error("File uploading/retrieving file list didn't succeed")

        # Download info of the dataset
        dataset_info = pyclowder.datasets.get_info(connector, host, secret_key, dataset_id)
        logger.info("Dataset info: %s", dataset_info)
        if dataset_id == dataset_info['id']:
            logger.info("Success in downloading dataset info")
        else:
            logger.error("Error in downloading dataset info")

        # Download metadata of the dataset
        dataset_metadata = pyclowder.datasets.download_metadata(connector, host, secret_key, dataset_id)
        if dataset_metadata is None:
            logger.info("No metadata found for dataset %s", dataset_id)
        else:
            logger.info("Metadata: %s", dataset_metadata)


if __name__ == "__main__":
    extractor = TestDatasetExtractor()
    extractor.start()
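The upload-then-verify check in `process_message` is pure list logic and can be exercised without a Clowder server (a hypothetical helper mirroring the membership check above):

```python
def uploaded_file_visible(file_id, file_list):
    # Mirrors the check in process_message: the uploaded file's id
    # must appear in the dataset's file list returned by the server
    return file_id in (f['id'] for f in file_list)

print(uploaded_file_visible("abc", [{"id": "abc"}, {"id": "def"}]))  # True
```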
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
FROM python:3.8

WORKDIR /extractor
COPY requirements.txt ./
RUN pip install -r requirements.txt

COPY test-file-extractor.py extractor_info.json ./
CMD python test-file-extractor.py
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
A simple test extractor that verifies the file functions in pyclowder.

# Docker

This extractor is ready to run as a Docker container; the only dependency is a running Clowder instance. Simply build and run.

1. Start Clowder V2. For help starting Clowder V2, see our [getting started guide](https://github.com/clowder-framework/clowder2/blob/main/README.md).

2. Build the extractor Docker image:

```
# from this directory, run:

docker build -t test-file-extractor .
```

3. Finally, run the extractor:

```
docker run -t -i --rm --net clowder_clowder -e "RABBITMQ_URI=amqp://guest:guest@rabbitmq:5672/%2f" --name "test-file-extractor" test-file-extractor
```

Then open the Clowder web app and run this test extractor on a .txt file (or similar)! Done.

### Python and Docker details

You may use any version of Python 3. Simply edit the first line of the `Dockerfile`; by default it uses `FROM python:3.8`.

Docker flags:

- `--net` attaches the extractor to the Clowder Docker network (run `docker network ls` to identify your own).
- `-e RABBITMQ_URI=` sets the environment variable that controls which RabbitMQ server and exchange the extractor binds to. Setting `RABBITMQ_EXCHANGE` may also help.
- You can also use `--link` to link the extractor to a RabbitMQ container.
- `--name` assigns the container a name visible in Docker Desktop.
## Troubleshooting

**If you run into _any_ trouble**, please reach out on our Clowder Slack in the [#pyclowder channel](https://clowder-software.slack.com/archives/CNC2UVBCP).

Alternate methods of running extractors are below.

# Commandline Execution

To execute the extractor from the command line you will need the required packages installed. It is highly recommended to use a Python virtual environment for this: create the virtual environment first, then activate it, and finally install all required packages.

```
Step 1 - Start the Clowder docker-compose stack

Step 2 - Start the heartbeat listener
virtualenv clowder2-python (or try pipenv)
source clowder2-python/bin/activate

Step 3 - Run heartbeat_listener_sync.py to register the new extractor (this step will likely not be needed in future)
cd ~/Git/clowder2/backend
pip install email_validator
copy heartbeat_listener_sync.py to /backend from /backend/app/rabbitmq
python heartbeat_listener_sync.py

Step 4 - Install the pyclowder branch & run the extractor
source ~/clowder2-python/bin/activate
pip uninstall pyclowder

# the pyclowder Git repo should have Todd's branch checked out (50-clowder20-submit-file-to-extractor)
pip install -e ~/Git/pyclowder

cd ~/Git/pyclowder/sample-extractors/test-file-extractor
export CLOWDER_VERSION=2
export CLOWDER_URL=http://localhost:8000/

python test-file-extractor.py

Step 5 - Post a particular file ID (text file) to the new extractor
POST http://localhost:3002/api/v2/files/639b31754241665a4fc3e513/extract?extractorName=ncsa.test-file-extractor

Or, go to the Clowder UI and submit a file for extraction.
```

# Run the extractor from PyCharm

You can run heartbeat_listener_sync.py and test-file-extractor.py from PyCharm.
Create a pipenv (generally PyCharm prompts you to create one when you first open the file). To run test-file-extractor.py,
add 'CLOWDER_VERSION=2' to the environment variables in the run configuration.
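
For quick local experiments, the same environment expected by the run configuration (and by Step 4's `export` lines) can be filled in from Python before starting the extractor; the variable names come from the steps above, the helper is a sketch:

```python
import os

def default_clowder_env(env=None):
    # Fill in the Step 4 / run-configuration variables only if unset
    env = os.environ if env is None else env
    env.setdefault("CLOWDER_VERSION", "2")
    env.setdefault("CLOWDER_URL", "http://localhost:8000/")
    return env

print(default_clowder_env({}))
```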
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
{
  "@context": "http://clowder.ncsa.illinois.edu/contexts/extractors.jsonld",
  "name": "ncsa.test-file-extractor",
  "version": "2.0",
  "description": "Test File extractor. Test to verify all functionalities of file in pyclowder.",
  "author": "Dipannita Dey <[email protected]>",
  "contributors": [],
  "contexts": [
    {
      "lines": "http://clowder.ncsa.illinois.edu/metadata/sample_metadata#lines",
      "words": "http://clowder.ncsa.illinois.edu/metadata/sample_metadata#words",
      "characters": "http://clowder.ncsa.illinois.edu/metadata/sample_metadata#characters"
    }
  ],
  "repository": [
    {
      "repType": "git",
      "repUrl": "https://opensource.ncsa.illinois.edu/stash/scm/cats/pyclowder.git"
    }
  ],
  "process": {
    "file": [
      "text/*",
      "application/json"
    ]
  },
  "external_services": [],
  "dependencies": [],
  "bibtex": []
}
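The `process.file` entries above are MIME-type globs; how such patterns behave can be illustrated with Python's `fnmatch` (an illustration only, not Clowder's actual routing code):

```python
from fnmatch import fnmatch

PROCESS_FILE = ["text/*", "application/json"]

def extractor_accepts(mime_type, patterns=PROCESS_FILE):
    # A file would be routed to this extractor if its MIME type matches any glob
    return any(fnmatch(mime_type, p) for p in patterns)

print(extractor_accepts("text/plain"))  # True
print(extractor_accepts("image/png"))   # False
```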
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
pyclowder==2.6.0
Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
#!/usr/bin/env python

"""Example extractor based on the clowder code."""

import logging

from pyclowder.extractors import Extractor
import pyclowder.files


class TestFileExtractor(Extractor):
    """Test the functionalities of an extractor."""
    def __init__(self):
        Extractor.__init__(self)

        # add any additional arguments to parser
        # self.parser.add_argument('--max', '-m', type=int, nargs='?', default=-1,
        #                          help='maximum number (default=-1)')

        # parse command line and load default logging configuration
        self.setup()

        # setup logging for the extractor
        logging.getLogger('pyclowder').setLevel(logging.DEBUG)
        logging.getLogger('__main__').setLevel(logging.DEBUG)

    def process_message(self, connector, host, secret_key, resource, parameters):
        # Process the file and upload the results

        logger = logging.getLogger(__name__)
        file_id = resource['id']

        # Sample metadata
        sample_metadata = {
            'lines': 10,
            'words': 20,
            'characters': 30
        }
        metadata = self.get_metadata(sample_metadata, 'file', file_id, host)

        # Normal logs will appear in the extractor log, but NOT in the Clowder UI.
        logger.debug(metadata)

        # Upload metadata to the original file
        pyclowder.files.upload_metadata(connector, host, secret_key, file_id, metadata)

        # Download metadata of the file
        downloaded_metadata = pyclowder.files.download_metadata(connector, host, secret_key, file_id)
        logger.info("Downloaded metadata : %s", downloaded_metadata)
        if sample_metadata in (md['contents'] for md in downloaded_metadata):
            logger.info("Success in uploading and downloading file metadata")
        else:
            logger.error("Error in uploading/downloading file metadata")

        # Download info of the file
        file = pyclowder.files.download_info(connector, host, secret_key, file_id)
        logger.info("File info: %s", file)
        if file_id == file[0]['id']:
            logger.info("Success in downloading file info")
        else:
            logger.error("Error in downloading file info")


if __name__ == "__main__":
    extractor = TestFileExtractor()
    extractor.start()
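The `sample_metadata` values above are hard-coded; a real word-count-style extractor would compute them from the file contents, for example (a hypothetical helper matching the lines/words/characters keys declared in extractor_info.json):

```python
def text_counts(text):
    # Compute the lines/words/characters triple used as sample_metadata
    return {
        "lines": len(text.splitlines()),
        "words": len(text.split()),
        "characters": len(text),
    }

print(text_counts("hello world\nsecond line"))  # {'lines': 2, 'words': 4, 'characters': 23}
```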
