Getting Started
#Architecture
Polar Deep Insights has two major components:
- Insight Generator
- Insight Visualizer
##Insight Generator

    cd ./insight-generator

The insight generator is a Python library that provides an interface for extracting entities, locations, file metadata, and measurements from documents.
Given a root path as an argument, the main.py script recurses down the directory tree, extracts the metadata listed above from each file, and saves it to an Elasticsearch index.
The extraction library can also process files stored on S3 or HDFS, provided they are mounted on the local file system.
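
The traversal described above can be sketched roughly as follows. This is a minimal illustration and not the project's `main.py`: `iter_files`, `extract_stub`, and `collect_metadata` are hypothetical names, the stub records only each file's path and size, and the Elasticsearch indexing step is left out.

```python
import os


def iter_files(root):
    """Yield the path of every regular file under root, recursively."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def extract_stub(path):
    """Hypothetical stand-in for the real extractor: path and size only."""
    return {"path": path, "size_bytes": os.path.getsize(path)}


def collect_metadata(root, extract=extract_stub):
    """Apply extract to each file under root and return the records as a list."""
    return [extract(path) for path in iter_files(root)]
```

Calling `collect_metadata("/tmp/dump")` would return one record per file. In a real run, the `extract` argument would be replaced by the library's extractor and each record would be sent to the index.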

    # Syntax
    python main.py [ ROOT-PATH ] [ ELASTIC-SEARCH-INDEX ]

    # Example
    python main.py "/tmp/dump" "http://104.236.190.155:9200"

####Extending the extraction interface
Users can build custom implementations to handle the extracted metadata.
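
    from extractors.base import InformationExtractor
    from util.dir_tree import DirTreeTraverser

    BASE_PATH = "/tmp/dump"  # root directory to traverse (example value)

    def customProcessor(metadata):
        # Do something with the extracted metadata
        pass

    def process(PATH):
        # Extract metadata from a single file and pass it to the custom processor
        md = InformationExtractor(PATH).extract()
        customProcessor(md)

    # Traverse the tree rooted at BASE_PATH, running process along the way
    DirTreeTraverser(BASE_PATH).iterateAndPerform(process)

####Standalone Extractor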

    # Syntax
    python extract.py [ FILE-PATH ]

    # Example
    python extract.py /tmp/dump/test.html | python -m json.tool

###Metadata Extraction Interface
We use the following extraction tools.

| Tool | Data type |
|---|---|
| Apache Tika | content and file metadata |
| Stanford CoreNLP / NER | dates and locations |
| Python regex | entities |
| Grobid Quantities | measurements |
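
As an illustration of the regex row in the table above, here is a small sketch of pattern-based entity extraction. The patterns used by the project are not shown in this document, so the example assumes, purely for illustration, that the entities of interest are email addresses and URLs. `EMAIL_RE`, `URL_RE`, and `extract_entities` are hypothetical names.

```python
import re

# Hypothetical patterns, chosen for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_entities(text):
    """Return de-duplicated, sorted matches for each illustrative entity type."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "urls": sorted(set(URL_RE.findall(text))),
    }
```

Precompiling the patterns keeps repeated calls cheap when the function is applied to every file in a large directory tree.
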
Information Retrieval and Data Science (IRDS) research group, University of Southern California.