
EDEN

(Anomalous) Event Detection in News

  • Analyse a stream of news articles to find anomalous events
  • NLP modules to represent data
  • Apply Document Clustering algorithms for Event Detection
  • Characterise event-centric clusters and use statistical threshold models for Anomaly Detection

Installation

To use this with an existing dataset, we recommend setting up the repository in the following way.

app/
datasets/
  eval/
    1.txt
    2.txt
    ...
  raw-data/
    1.json
    2.json
    ...
  word2vec_signal/
    word2vec_signal.p
pipeline/
  io/

Part of this structure is set up when the repository is cloned.

  • git clone https://github.com/jonathanmanfield/EDEN
  • cd EDEN
  • cp -r path/to/datasets .

Data Pipeline

The pipeline is implemented with Python Luigi.
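
For orientation, the sketch below shows the rough shape of a task in this pipeline, using Luigi's standard Task API. It is a minimal sketch, not the repository's actual code: the file paths and class body are illustrative.

  # Minimal sketch of a pipeline task (illustrative; the real tasks live in pipeline/eden.py)
  import luigi

  class ReadData(luigi.Task):
      # 'fn' arrives as a comma-separated string of filenames, e.g. '35,30'
      fn = luigi.Parameter()

      def output(self):
          # intermediate files go to pipeline/io
          return luigi.LocalTarget('io/ReadData_%s.txt' % self.fn)

      def run(self):
          docs = []
          for name in self.fn.split(','):
              with open('../datasets/raw-data/%s.json' % name) as f:
                  docs.append(f.read())
          with self.output().open('w') as out:
              out.write('\n'.join(docs))

  if __name__ == '__main__':
      luigi.run()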

Requirements

TODO: list all the libraries. (From the sections below: Luigi for the pipeline; scikit-learn for clustering; Flask, Flask-RESTful and Flask-CORS for the app.)

Documentation

ReadData
  • Dependencies: {}
  • Parameters: {'fn': list}
  • Description: Reads the respective data (news articles) from 'fn', a list of filenames, and returns a concatenation of all files. Pass 'fn' as a comma-separated string (e.g., '1,2,3').

PreprocessData
  • Dependencies: {ReadData}
  • Parameters: {'fn': list, 'method': string}
  • Description: Applies NLP techniques to the unstructured data (stop-word removal, lower-casing, Porter stemming) and converts it to the structured Vector Space Model format. 'method' selects the representation: 'ltc' (tf-idf on the entire content), 'ltc-ent' (tf-idf on named entities), or 'word2vec' (pre-trained Google News vectors).

ClusterData
  • Dependencies: {PreprocessData}
  • Parameters: {'fn': list, 'method': string, 'algo': string, 'params': dict}
  • Description: Runs the Document Clustering algorithm selected by 'algo', with hyperparameters 'params', on the preprocessed data. 'algo' is one of {'kmeans', 'dbscan', 'meanshift', 'birch', 'gac', 'gactemporal'}. 'params' is the string representation of a Python dictionary (e.g., '{"n_clusters": 50}'). For each algorithm's parameters see the corresponding scikit-learn documentation, except for 'gac' and 'gactemporal', which accept 'b=10.0, p=0.5, s=0.8, t=100': b is a factor (stopping criterion), s is the minimum similarity threshold (stopping criterion), t is the bucket size, and p is a reduction factor. 'gactemporal' additionally accepts 're=5', the number of iterations to perform normally before re-bucketing.

Evaluate
  • Dependencies: {PreprocessData, ClusterData}
  • Parameters: {'fn': list, 'method': string, 'algo': string, 'params': dict}
  • Description: Evaluates the performance of the Document Clustering algorithm using external criteria and labelled data, in the style of the TDT Pilot Study.

CrossValidate
  • Dependencies: {PreprocessData}
  • Parameters: {'fn': list, 'method': string, 'algo': string, 'params': dict, 'train': list, 'test': list}
  • Description: Performs a grid search across a range of hyperparameters to optimise Document Clustering algorithms. 'train' and 'test' are comma-separated lists of filenames, like 'fn' (e.g., '3,5,6').

AnomalyDetection
  • Dependencies: {PreprocessData, ClusterData}
  • Parameters: {'tau': string, 't': integer, 'k': float}
  • Description: Creates a statistical threshold model that classifies event-centric clusters as anomalous if the product of their cohesiveness (cosine similarity) and highest burst in publishing falls outside a range.
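
The AnomalyDetection rule reduces to a simple product test. The following is a minimal sketch of that rule only; the function and argument names are illustrative, not the task's actual code, and cohesiveness and burst are computed as described above.

  # Illustrative sketch of the threshold rule (not the task's actual code):
  # a cluster is anomalous when cohesiveness * peak burst leaves [lower, upper]
  def is_anomalous(cohesiveness, peak_burst, lower, upper):
      score = cohesiveness * peak_burst
      return score < lower or score > upper

  # e.g. a very cohesive cluster with an extreme publishing spike
  print(is_anomalous(0.9, 120, lower=5.0, upper=100.0))  # True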

Prompts

Each task can be run separately and parameterised. All intermediate files are saved to pipeline/io. Note that all the source Python files are located in 'pipeline', i.e. you need to cd pipeline first.

  • ReadData: python eden.py ReadData --local-scheduler --fn '35,30'
  • PreprocessData: python eden.py PreprocessData --local-scheduler --fn '35,30' --method 'ltc'
  • ClusterData:
    • python eden.py ClusterData --local-scheduler --fn '35,30' --method 'ltc' --algo 'kmeans' --params '{"n_clusters": 8}'
    • python eden.py ClusterData --local-scheduler --fn '35,30' --method 'ltc' --algo 'gac' --params '{"b": 10, "s":0.9, "p":0.9}'
    • python eden.py ClusterData --local-scheduler --fn '35,30' --method 'ltc' --algo 'gactemporal' --params '{"b": 10, "s":0.9, "p":0.9, "re": 5}'
  • Evaluate:
    • python eden.py Evaluate --local-scheduler --fn '35,30' --method 'ltc' --algo 'gactemporal' --params '{"b": 10, "s":0.9, "p":0.9, "re": 5}'
  • AnomalyDetection: python eden.py AnomalyDetection --local-scheduler --fn '35,30' --method 'ltc' --algo 'gactemporal' --params '{"b":10, "s": 0.9, "p":0.9, "re":5}'

Alternatively, the entire pipeline can be run at once; of course, it can be parameterised at this stage too.

  • python eden.py Evaluate --local-scheduler
  • python eden.py AnomalyDetection --local-scheduler
  • python eden.py CrossValidate --local-scheduler

Separately, there exists a cross-validation function to find the best hyperparameters and see how well they generalize to test data.

  • python eden.py CrossValidate --local-scheduler --fn '35,30' --method 'ltc' --algo 'kmeans' --params '{"n_clusters": [5,10,15,20]}' --train '35' --test '30'
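
The same tasks can also be launched from Python instead of the shell, via Luigi's standard luigi.build interface. This is a sketch; it assumes the task classes in pipeline/eden.py can be imported as shown.

  # Programmatic alternative to the CLI (sketch; assumes eden.py is importable)
  import luigi
  from eden import ClusterData

  luigi.build(
      [ClusterData(fn='35,30', method='ltc', algo='kmeans',
                   params='{"n_clusters": 8}')],
      local_scheduler=True)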

App (Under Construction)

Architecture

  • Python 2.7

Elasticsearch

News articles are indexed in a running instance of Elasticsearch.

  • Elasticsearch should be running (http://localhost:9200)
  • Mappings should match those in use by Signal
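
As a quick sanity check that the instance is up and the mappings look right, something like the following works with the elasticsearch-py client. The index name 'signal' is an assumption; use whatever your Signal 1M-Sample index is called.

  # Sanity-check the local Elasticsearch instance (sketch; 'signal' is an assumed index name)
  from elasticsearch import Elasticsearch

  es = Elasticsearch('http://localhost:9200')
  print(es.count(index='signal')['count'])       # number of indexed articles
  print(es.indices.get_mapping(index='signal'))  # compare against Signal's mappings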

Back-end: Python Flask

The back-end of the application is powered by Python Flask.

  • Flask-RESTful powers the API
  • Flask-CORS handles cross-origin resource sharing, mainly for cross-origin AJAX (see the sketch below)
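
A minimal sketch of this back-end shape, assuming Flask-RESTful and Flask-CORS as listed above; the route, port, and index name are illustrative, not the app's actual code.

  # Minimal sketch of the back-end shape (illustrative, not the app's actual code)
  from flask import Flask
  from flask_restful import Api, Resource
  from flask_cors import CORS
  from elasticsearch import Elasticsearch

  app = Flask(__name__)
  CORS(app)   # allow cross-origin AJAX from the Angular front-end
  api = Api(app)
  es = Elasticsearch('http://localhost:9200')

  class ArticleCount(Resource):
      def get(self):
          # 'signal' is an assumed index name
          return {'count': es.count(index='signal')['count']}

  api.add_resource(ArticleCount, '/articles/count')

  if __name__ == '__main__':
      app.run(port=5000)  # the port is an assumption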

Front-end: Angular UI

  • Angular 1.4.3
  • Twitter Bootstrap
  • Routes exist to count the number of articles

Installation (and operation)

Installation:

  • Run ElasticSearch with Signal 1M-Sample (See Elasticsearch section)
  • git clone https://github.com/jonathanmanfield/EDEN
  • cd EDEN/app
  • virtualenv venv
  • source ./venv/bin/activate
  • pip install -r requirements.txt (Needs testing)

Operation (in separate terminal tabs):

Back-end:

  • source ./venv/bin/activate
  • make backend

Front-end:

  • source ./venv/bin/activate
  • make frontend

Access (web application):

  • visit http://localhost:8000
  • Test count articles by visiting http://localhost:8000/#/articles/count

Notebooks (Coming soon)

  • Data Visualisation (with dimensionality reduction of article vectors and plot characterisations; see the sketch below)
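
As a rough sketch of what such a notebook might contain, assuming scikit-learn and matplotlib; the random arrays stand in for the real article vectors and cluster labels produced by the pipeline.

  # Sketch of the planned visualisation: project article vectors to 2-D and
  # colour by cluster label (random stand-in data; illustrative only)
  import numpy as np
  from sklearn.manifold import TSNE
  import matplotlib.pyplot as plt

  vectors = np.random.rand(100, 50)       # stand-in for preprocessed article vectors
  labels = np.random.randint(0, 5, 100)   # stand-in for cluster assignments

  coords = TSNE(n_components=2).fit_transform(vectors)
  plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=10)
  plt.title('Article vectors by cluster')
  plt.show()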

Credits
