Architecture Planning

Overall High Level Layout

Diagram

There will be four components to GenderTracker:

DataManager
  - Importer
  - Parser
    - Article
MetricRunner
  - Metric
StorageManager
OutputManager

Importer

Responsible for handling the import of data. It should have the following methods:

fetch (gets the data from the source)
parse (parses the data into the appropriate form)

There will be some basic importers:

URLImporter
FileImporter
StringImporter

Note that an Importer doesn't do any actual article parsing: it just passes the elements it extracts back to whoever instantiated it, so that they can be passed to a Parser.
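
As a rough illustration of the Importer interface described above, here is a minimal Python sketch. The `run` helper, the constructor arguments, and the `{"raw": ...}` return shape are assumptions made for this example, not decisions recorded in this plan.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from urllib.request import urlopen


class Importer(ABC):
    """Fetches raw data and hands back the extracted elements (no article parsing)."""

    @abstractmethod
    def fetch(self) -> str:
        """Get the raw data from the source."""

    def parse(self, raw: str) -> dict:
        # Assumption: the "appropriate form" is a plain dict that a Parser consumes later.
        return {"raw": raw}

    def run(self) -> dict:
        # Hypothetical convenience method: fetch, then put the data into generic form.
        return self.parse(self.fetch())


class StringImporter(Importer):
    def __init__(self, text: str):
        self.text = text

    def fetch(self) -> str:
        return self.text


class FileImporter(Importer):
    def __init__(self, path, encoding: str = "utf-8"):
        self.path = Path(path)
        self.encoding = encoding

    def fetch(self) -> str:
        return self.path.read_text(encoding=self.encoding)


class URLImporter(Importer):
    def __init__(self, url: str, timeout: float = 10.0):
        self.url = url
        self.timeout = timeout

    def fetch(self) -> str:
        # Decode with the charset the server declares, falling back to UTF-8.
        with urlopen(self.url, timeout=self.timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
```

Whoever instantiates an importer would call `run()` (or `fetch()` and `parse()` separately) and pass the result on to a Parser.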

Parser

Converts the data that comes from the Importers into an Article.

A new parser will need to be created for each source. I'm envisioning:

NYTimesParser
GlobeParser
GuardianParser, etc.
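
To make the Parser role concrete, here is a minimal Python sketch. It does not attempt any real publication's markup: `PlainTextParser` and its input layout (title line, then a "By <author>" line, then the body) are invented for illustration, and the small `Article` class is only a stand-in so the example is self-contained.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Article:
    """Minimal stand-in for the Article described in this document."""
    title: str
    body: str
    author: str
    _original_text: str


class Parser(ABC):
    """Converts raw imported data into an Article; one subclass per source."""

    @abstractmethod
    def parse(self, raw: str) -> Article:
        ...


class PlainTextParser(Parser):
    """Hypothetical parser for an invented layout, not one of the planned source parsers."""

    def parse(self, raw: str) -> Article:
        lines = raw.strip().splitlines()
        title = lines[0].strip() if lines else ""
        author = ""
        body_start = 1
        # An optional byline on the second line, e.g. "By Jane Doe".
        if len(lines) > 1 and lines[1].lower().startswith("by "):
            author = lines[1][3:].strip()
            body_start = 2
        body = "\n".join(lines[body_start:]).strip()
        return Article(title=title, body=body, author=author, _original_text=raw)
```

A source-specific parser such as NYTimesParser would presumably follow the same shape, with `parse` tailored to that source's format.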

Article

An Article is a generic, structured version of the input text content. It will have some fixed properties, such as:

title
body
author
_original_text (the text prior to stripping any markup and so on)

Additional properties can be added to an Article as needed (publication date, publication name, ...?).
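
One possible shape for Article, sketched as a Python dataclass. Keeping the open-ended properties in an `extra` dict is an assumption made for this example; the plan does not say how additional properties should be stored, and the sample values are invented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Article:
    # Fixed properties named in this plan.
    title: str
    body: str
    author: str = ""
    _original_text: str = ""  # text before any markup was stripped
    # Assumption: additional, source-dependent properties live in a dict.
    extra: Dict[str, Any] = field(default_factory=dict)


article = Article(
    title="Big News",
    body="She won.",
    author="Jane Doe",
    _original_text="<h1>Big News</h1><p>She won.</p>",
)
article.extra["publication_name"] = "Example Daily"  # invented sample value
```

Using `default_factory` gives every Article its own `extra` dict, so properties added to one article do not leak into another.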

MetricRunner

The MetricRunner is responsible for knowing:

What metrics should be computed on an article
What metrics are currently running (maintain a job queue)
Where each metric should report its result (or what should happen with the result)

MetricRunner needs to report the results of every job. These results either:

Get written to a database
Get stored in memory
Get compiled into a report of some form.
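
The responsibilities above could be sketched roughly as follows. This is a synchronous toy version: the sink classes, the `register`/`submit`/`pending`/`run` method names, and the table layout are all assumptions for illustration, and metrics are represented as plain callables.

```python
import json
import sqlite3
from collections import deque


class MemorySink:
    """Keeps results in memory."""

    def __init__(self):
        self.results = []

    def report(self, metric_name, article_id, result):
        self.results.append((metric_name, article_id, result))


class SQLiteSink:
    """Writes results to a database (an in-memory SQLite database by default)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS results (metric TEXT, article_id TEXT, result TEXT)"
        )

    def report(self, metric_name, article_id, result):
        self.conn.execute(
            "INSERT INTO results VALUES (?, ?, ?)",
            (metric_name, article_id, json.dumps(result)),
        )
        self.conn.commit()


class MetricRunner:
    def __init__(self):
        self._metrics = {}     # metric name -> (compute callable, sink)
        self._queue = deque()  # jobs waiting to run

    def register(self, name, compute, sink):
        """Record which metric to compute and where its result should go."""
        self._metrics[name] = (compute, sink)

    def submit(self, article_id, article):
        """Queue one job per registered metric for this article."""
        for name in self._metrics:
            self._queue.append((name, article_id, article))

    def pending(self):
        return [(name, article_id) for name, article_id, _ in self._queue]

    def run(self):
        """Run queued jobs in order and report each result to its sink."""
        while self._queue:
            name, article_id, article = self._queue.popleft()
            compute, sink = self._metrics[name]
            sink.report(name, article_id, compute(article))
```

A report-style output could be a third sink with the same `report` method, which is one way to keep storage and reporting separate without changing the runner.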

Questions

Not sure if we should separate storage from the reporting aspect... Not even sure the reporting aspect is necessary.

Metric

A Metric takes in a single Article. It is a unit of computation that returns a result indicating how gender-biased a specific article is (although it doesn't have to be this specific, if we want to make it more general).

A Metric has:

A type ("name", "pronoun", etc.)
scores
  - male [0,1] (does the range need to extend to -1?)
  - female [0,1] (does the range need to extend to -1?)
threshold (how high does the result need to be to return a positive indication in either direction?)
compute method (performs the actual computation)
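
A minimal sketch of what a "pronoun" Metric might look like in Python. The scoring rule (each score is that gender's share of all gendered pronouns found), the word lists, the default threshold, and the `MetricResult` container are assumptions chosen for illustration; the plan does not define how scores are computed. For simplicity, `compute` takes the article body as a string.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResult:
    type: str
    male: float                # score in [0, 1]
    female: float              # score in [0, 1]
    indication: Optional[str]  # "male", "female", or None if below threshold


class PronounMetric:
    type = "pronoun"
    MALE = {"he", "him", "his", "himself"}
    FEMALE = {"she", "her", "hers", "herself"}

    def __init__(self, threshold: float = 0.6):
        # Assumption: a score at or above the threshold counts as a positive indication.
        self.threshold = threshold

    def compute(self, body: str) -> MetricResult:
        words = re.findall(r"[a-z']+", body.lower())
        male = sum(w in self.MALE for w in words)
        female = sum(w in self.FEMALE for w in words)
        total = male + female
        if total == 0:
            # No gendered pronouns: no evidence in either direction.
            return MetricResult(self.type, 0.0, 0.0, None)
        male_score, female_score = male / total, female / total
        indication = None
        if male_score >= self.threshold:
            indication = "male"
        elif female_score >= self.threshold:
            indication = "female"
        return MetricResult(self.type, male_score, female_score, indication)
```

With a threshold above 0.5 at most one direction can be indicated; at 0.5 or below, this sketch checks the male score first, which is an arbitrary choice that a real implementation would need to settle.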

Questions

What about metrics that need to know the metrics of other articles, for example if we build a classifier? I think this assumes the classifier was built outside the context of this system.