
Getting Started

Nithin Krishna edited this page Aug 10, 2016 · 9 revisions

#Architecture

Polar Deep Insights consists of two major components:

  • Insight Generator
  • Insight Visualizer


##Insight Generator

cd ./insight-generator

The insight generator is a Python library that provides an interface to extract entities, locations, file metadata, and measurements from documents.

Given a root path as an argument, the main.py script recurses down the directory tree, extracts the metadata listed above from each file, and saves it to an Elasticsearch index.

The extraction library can process files stored on S3 or HDFS, provided they are mounted onto the local file system.

# Syntax
python main.py [ ROOT-PATH ] [ ELASTIC-SEARCH-INDEX ]

# Example
python main.py "/tmp/dump" "http://104.236.190.155:9200"
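
The traversal described above can be sketched roughly as follows. This is a minimal illustration rather than the actual main.py: `extract_stub` and `walk_and_index` are hypothetical names, the stub records only the path and size instead of calling the real extractors, and a plain list stands in for the Elasticsearch index. It assumes Python 3.

```python
import os
import tempfile

def extract_stub(path):
    # Hypothetical stand-in for the real extractor: file-system facts only.
    return {"path": path, "size_bytes": os.path.getsize(path)}

def walk_and_index(root_path, index_doc):
    """Visit every file below root_path and pass its metadata to index_doc."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root_path):
        for name in sorted(filenames):
            index_doc(extract_stub(os.path.join(dirpath, name)))
            count += 1
    return count

# Demo on a throwaway directory; a list stands in for the Elasticsearch index.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "sub"))
for rel in ("a.txt", os.path.join("sub", "b.html")):
    with open(os.path.join(demo_root, rel), "w") as fh:
        fh.write("hello")

indexed = []
total = walk_and_index(demo_root, indexed.append)
print(total, sorted(os.path.basename(doc["path"]) for doc in indexed))
```

Passing the sink as a callback keeps the walk independent of where the metadata ends up, which mirrors the extension point described in the next section.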

####Extending the extraction interface

Users can build custom implementations to handle the extracted metadata.

from extractors.base import InformationExtractor
from util.dir_tree import DirTreeTraverser

# Root directory to traverse (example value)
BASE_PATH = "/tmp/dump"

def customProcessor(metadata):
  # Do something with the extracted metadata
  pass

def process(PATH):
  # Extract the metadata for one file and pass it to the custom handler
  md = InformationExtractor(PATH).extract()
  customProcessor(md)

DirTreeTraverser(BASE_PATH).iterateAndPerform(process)
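
As one possible custom handler, the sketch below appends each metadata record to a JSON Lines file. It assumes the extracted metadata is a JSON-serializable dict, which the wiki does not state explicitly. `make_jsonl_processor` and the sample records are hypothetical and only illustrate the shape of such a handler (Python 3).

```python
import json
import os
import tempfile

def make_jsonl_processor(out_path):
    """Return a processor that appends each metadata record as one JSON line."""
    def processor(metadata):
        with open(out_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(metadata, sort_keys=True) + "\n")
    return processor

# Hypothetical records; the real structure comes from InformationExtractor.
out_file = os.path.join(tempfile.mkdtemp(), "metadata.jsonl")
custom_processor = make_jsonl_processor(out_file)
custom_processor({"path": "/tmp/dump/test.html", "locations": ["Alaska"]})
custom_processor({"path": "/tmp/dump/other.pdf", "locations": []})

# Read the file back to confirm one record per line.
with open(out_file, encoding="utf-8") as fh:
    records = [json.loads(line) for line in fh]
print(len(records))
```

A function built this way could be used in place of `customProcessor` in the snippet above.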

####Standalone Extractor

# Syntax
python extract.py [ FILE-PATH ]

# Example
python extract.py /tmp/dump/test.html | python -mjson.tool
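
Because the example pipes the output through json.tool, the standalone extractor presumably prints JSON on stdout. Under that assumption, the output could be consumed from Python roughly as shown here. `run_and_parse` is a hypothetical helper, and the `title` field is invented for the demo; a stand-in command replaces extract.py so the sketch runs without the repository.

```python
import json
import subprocess
import sys

def run_and_parse(command):
    """Run a command that prints JSON on stdout and return the parsed value."""
    completed = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    return json.loads(completed.stdout.decode("utf-8"))

# Stand-in command; in practice this might be
# ["python", "extract.py", "/tmp/dump/test.html"].
fake_extract = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'title': 'test'}))",
]
parsed = run_and_parse(fake_extract)
print(sorted(parsed.keys()))
```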

###Metadata Extraction Interface

We use the following extraction tools.

| Tool | Data type |
| --- | --- |
| Apache Tika | content and file metadata |
| Stanford CoreNLP / NER | dates and locations |
| Python regex | entities |
| Grobid Quantities | measurements |
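
To give a feel for the regex-based row of the table, here is a small sketch of pattern-driven entity extraction. The wiki does not say which entities the project extracts or which patterns it uses. The email and URL patterns, and the `extract_entities` function, are assumptions chosen only for illustration.

```python
import re

# Illustrative patterns only; the project's actual patterns are not documented here.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def extract_entities(text):
    """Return de-duplicated, sorted matches for each illustrative entity type."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "urls": sorted(set(URL_RE.findall(text))),
    }

sample = "Contact ops@example.org or see http://example.org/docs for details."
entities = extract_entities(sample)
print(entities)
```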