
Getting Started

Nithin Krishna edited this page Aug 10, 2016 · 9 revisions

#Architecture


Polar Deep Insights consists of two major components:

  • Insight Generator
  • Insight Visualizer

##Insight Generator

cd ./insight-generator

The Insight Generator is a Python library that provides an interface for extracting entities, locations, file metadata, and measurements from documents.

Given a root path as its first argument, the main.py script recurses down the directory tree, extracts the metadata listed above from each file, and saves it to an Elasticsearch index.

# Syntax
python main.py [ ROOT-PATH ] [ ELASTIC-SEARCH-INDEX ]

# Example
python main.py "/tmp/dump" "http://104.236.190.155:9200"
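The traversal described above can be sketched roughly as follows. This is a minimal illustration and not the actual main.py: `extract_metadata` and `index_tree` are hypothetical names, the stand-in extractor only records the file size, and the `sink` callable takes the place of whatever writes each record to the Elasticsearch index.

```python
import os


def extract_metadata(path):
    # Hypothetical stand-in for the real extractor: records only basic file info.
    return {"path": path, "size_bytes": os.path.getsize(path)}


def index_tree(root, sink, extractor=extract_metadata):
    """Walk `root` recursively, extract metadata for each file, and pass it to `sink`.

    Returns the number of files processed.
    """
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            sink(extractor(os.path.join(dirpath, name)))
            count += 1
    return count
```

For a quick dry run, `index_tree("/tmp/dump", print)` would print one record per file instead of indexing it.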

###Extending Information Extractor

Users can build custom implementations to handle the extracted metadata.

from extractors.base import InformationExtractor
from util.dir_tree import DirTreeTraverser

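BASE_PATH = "/tmp/dump"  # example root directory, as in the main.py example

def customProcessor(metadata):
  # Do something with the extracted metadata
  pass

def process(PATH):
  # Extract the metadata for a single file and hand it to the custom handler
  md = InformationExtractor(PATH).extract()
  customProcessor(md)

# Apply process() to every file under BASE_PATH
DirTreeTraverser(BASE_PATH).iterateAndPerform(process)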
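As one possible custom handler, the sketch below appends each metadata record to a JSON Lines file. It assumes the extracted metadata is a dictionary that is mostly JSON-serializable, which the source does not state explicitly; `make_jsonl_processor` and the output file name are hypothetical.

```python
import json


def make_jsonl_processor(out_path):
    """Return a handler that appends each metadata record to `out_path` as one JSON line."""
    def processor(metadata):
        with open(out_path, "a", encoding="utf-8") as f:
            # default=str keeps the sketch from failing on values such as dates
            f.write(json.dumps(metadata, default=str) + "\n")
    return processor
```

A handler built this way, for example `customProcessor = make_jsonl_processor("metadata.jsonl")`, could be called from `process` in place of the empty function.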

###Standalone File Extractor

# Syntax
python extract.py [ FILE-PATH ]

# Example
python extract.py /tmp/dump/test.html | python -mjson.tool
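Because the example pipes the output through json.tool, extract.py presumably prints a single JSON document to standard output. Under that assumption, the output could be consumed from another Python 3 script along these lines; `run_json_command` is a hypothetical helper and not part of the library.

```python
import json
import subprocess


def run_json_command(cmd):
    """Run `cmd` (a list of arguments) and parse its standard output as JSON."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

For instance, `run_json_command(["python", "extract.py", "/tmp/dump/test.html"])` would return the parsed metadata, and would raise an error if the script exits with a non-zero status or prints something that is not valid JSON.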

This Python library can also process files stored on S3 or HDFS, provided they are first mounted onto the local file system.
