This project consists of a series of extract-transform-load (ETL) pipelines for adding common sense knowledge triples to the MOWGLI Common Sense Knowledge Graph (CSKG).
The CSKG is used by downstream applications such as question answering systems and knowledge graph browsers. The graph consists of nodes and edges serialized in KGTK edge format, which is a specialization of the general KGTK format.
ConceptNet serves as the core of the CSKG, and other sources such as Wikidata are linked to it. The majority of the predicates/relations in the CSKG are reused from ConceptNet.
From the current directory:
python3 -m venv venv
On Unix:
source venv/bin/activate
On Windows:
venv\Scripts\activate
pip install -r requirements.txt
The framework uses LevelDB for whole-graph operations such as duplicate checking.
OS X:
brew install leveldb
CFLAGS=-I$(brew --prefix)/include LDFLAGS=-L$(brew --prefix)/lib pip install plyvel
Linux:
pip install plyvel
The RDF loader can use the rdflib "Sleepycat" store if the bsddb3 module is present.
Linux:
pip install bsddb3
Activate the virtual environment as above, then run:
pytest
Activate the virtual environment as above, then run:
python3 -m mowgli_etl.cli etl rpi_combined
to run all of the available pipelines as well as combine their output.
The extract, transform, and load stages of the pipelines write data to the data directory. (The path to this directory can be changed on the command line.) The structure of the data directory is data/<pipeline id>/<stage>. For example, data/swow/loaded holds the final products of the swow pipeline.
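The directory layout can be expressed with pathlib. This is a minimal sketch; the helper name stage_dir_path is hypothetical and is not part of mowgli_etl.

```python
from pathlib import Path


def stage_dir_path(data_dir_path: Path, pipeline_id: str, stage: str) -> Path:
    """Build data/<pipeline id>/<stage> (hypothetical helper, not part of mowgli_etl)."""
    return data_dir_path / pipeline_id / stage


# The final products of the swow pipeline live under the "loaded" stage.
print(stage_dir_path(Path("data"), "swow", "loaded").as_posix())  # data/swow/loaded
```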
The rpi_combined pipeline "loads" the outputs of the other pipelines into its data/rpi_combined/loaded directory in the CSKG CSV format.
The mowgli-etl code base consists of:
- a minimal bespoke framework for implementing ETL pipelines
- pipeline implementations for different data sources, such as the swow pipeline for the Small World of Words word association lexicon
A pipeline consists of:
- an extractor, inheriting from the _Extractor abstract base class
- a transformer, inheriting from the _Transformer abstract base class
- an optional loader (_Loader subclass), which is usually not explicitly specified by pipelines; a default is provided instead
- a _Pipeline subclass that ties everything together
Running a pipeline with a command such as
python3 -m mowgli_etl.cli etl swow
initiates the following process, where swow is the pipeline id.
- Instantiate the pipeline by
  - finding a module named exactly mowgli_etl.pipeline.swow.swow_pipeline (or adapted from another pipeline id)
  - finding a subclass of _Pipeline declared in that module
  - instantiating that subclass with a few arguments from the command line as constructor parameters
- Call the extract method of the extractor on the pipeline. See the docstring of _Extractor.extract for information on the contract of extract.
- Call the transform method of the transformer on the pipeline, passing in a **kwds dictionary returned by extract. See the docstring of _Transformer.transform for more information.
- The transform method is a generator for a sequence of models, typically KgEdges and KgNodes to add to the CSKG. This generator is passed to the loader, which iterates over it, loading data as it goes. For example, the default KGTK loader buffers nodes and appends edge rows to an output KGTK file. This loading process does not usually need to be handled by the pipeline implementations, most of which rely on the default loader.
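The process above can be sketched in a few lines. This is an illustration of the control flow only, not the framework's actual driver: DemoExtractor, DemoTransformer, DemoLoader, run_pipeline, and pipeline_module_name are hypothetical names, the models are plain tuples rather than KgEdge instances, and the module naming rule is inferred from the swow example.

```python
from typing import Any, Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]


def pipeline_module_name(pipeline_id: str) -> str:
    # Assumption: module names follow the pattern of the swow example.
    return f"mowgli_etl.pipeline.{pipeline_id}.{pipeline_id}_pipeline"


class DemoExtractor:
    def extract(self, *, force: bool) -> Dict[str, Any]:
        # The returned dictionary becomes the keyword arguments of transform.
        return {"lines": ["cat\tanimal", "dog\tanimal"]}


class DemoTransformer:
    def transform(self, *, lines: List[str]) -> Iterator[Triple]:
        # A generator: nothing runs until the loader starts iterating.
        for line in lines:
            subject, object_ = line.split("\t")
            yield (subject, "/r/IsA", object_)


class DemoLoader:
    def __init__(self) -> None:
        self.rows: List[Triple] = []

    def load(self, models: Iterable[Triple]) -> None:
        # Consume the generator one model at a time.
        for model in models:
            self.rows.append(model)


def run_pipeline(extractor, transformer, loader, *, force: bool = False) -> None:
    extract_kwds = extractor.extract(force=force)
    models = transformer.transform(**extract_kwds)
    loader.load(models)


loader = DemoLoader()
run_pipeline(DemoExtractor(), DemoTransformer(), loader)
print(loader.rows)
```

Because transform is a generator, the loader controls the pace of the work, so a large source never has to be held in memory as a full list of models.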
- Generators
- Type hints and the typing module, especially NamedTuple
- dataclasses
- Keyword-only arguments (def f(*, x, y)), and **kwds keyword variadic arguments
- Abstract base classes, abstract methods, and the abc module
- The pytest framework for unit testing
- The pathlib module
- Class methods
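For readers who want a quick refresher, the following sketch touches most of the features listed above in one place. Every class and function name in it (Pair, Source, Reader, FixedReader, read_all) is made up for illustration and does not exist in mowgli_etl.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, List, NamedTuple


class Pair(NamedTuple):
    cue: str
    response: str


@dataclass(frozen=True)
class Source:
    name: str
    path: Path

    @classmethod
    def from_name(cls, name: str) -> "Source":
        # Class method used as an alternative constructor.
        return cls(name=name, path=Path("data") / name)


class Reader(ABC):
    @abstractmethod
    def read(self, *, source: Source, limit: int) -> Iterator[Pair]:
        """Yield at most `limit` pairs; arguments are keyword-only."""


class FixedReader(Reader):
    def read(self, *, source: Source, limit: int) -> Iterator[Pair]:
        pairs = [Pair("cat", "dog"), Pair("sun", "moon"), Pair("up", "down")]
        # Generator: pairs are produced lazily, one per iteration.
        for pair in pairs[:limit]:
            yield pair


def read_all(reader: Reader, **kwds) -> List[Pair]:
    # **kwds forwards arbitrary keyword arguments to read().
    return list(reader.read(**kwds))


print(read_all(FixedReader(), source=Source.from_name("demo"), limit=2))
```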
We follow PEP8 and the Google Python Style Guide, preferring the former where the two are inconsistent.
We encourage using an IDE such as PyCharm. Please format your code with Black before committing it. Black can be integrated into most editors to format on save.
Most code should be part of a class. There should be one class per file, and the file should be named after the class (SomeClass as some_class.py).
The swow pipeline is the best model for new pipelines.
Extractors typically work in one of two ways:
- Using pre-downloaded data that is committed to the per-pipeline data subdirectory. This is the best approach for smaller data sets that change infrequently.
- Downloading source data when the extract method is called. The data can be cached in the per-pipeline data subdirectory and reused if force is not specified. Cached data should be .gitignored. Use an implementation of the EtlHttpClient rather than using urllib, requests, or another HTTP client directly. This makes it easier to mock the HTTP client in unit tests.
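The download-and-cache approach might look like the sketch below. It assumes an injected client object with a single get_bytes method; that method name, StubHttpClient, extract_cached, and the URL are all hypothetical, and the real EtlHttpClient interface may differ.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Dict


class StubHttpClient:
    """Hypothetical test double; the real EtlHttpClient interface may differ."""

    def __init__(self, responses: Dict[str, bytes]) -> None:
        self.responses = responses
        self.calls = 0

    def get_bytes(self, url: str) -> bytes:
        self.calls += 1
        return self.responses[url]


def extract_cached(*, http_client, url: str, cache_dir_path: Path, force: bool) -> Path:
    """Download url into cache_dir_path unless a cached copy exists and force is False."""
    cache_file_path = cache_dir_path / "source.txt"
    if force or not cache_file_path.exists():
        cache_dir_path.mkdir(parents=True, exist_ok=True)
        cache_file_path.write_bytes(http_client.get_bytes(url))
    return cache_file_path


url = "http://example.invalid/data"
client = StubHttpClient({url: b"cue,response\n"})
with TemporaryDirectory() as temp_dir:
    for force in (False, False, True):
        extract_cached(http_client=client, url=url, cache_dir_path=Path(temp_dir), force=force)
print(client.calls)  # 2: the second call reused the cached file
```

Passing the client in as an argument is what makes the stub possible: a unit test never touches the network.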
The extract method receives a storage parameter that points to a PipelineStorage instance, which has the path to the appropriate subdirectory of data. Extractors should use this path (storage.extracted_data_dir_path) rather than trying to locate data directly, since the path to data can be changed on the command line.
Once the data is available, the extractor must pass it to the transformer by returning a **kwds dictionary. This is typically done in one of two ways:
- Returning {"path_to_file": Path("the/file/path")} from extract, so that transform is def transform(self, *, path_to_file: Path). This is the preferred approach for large files.
- Reading the file in the extractor and returning {"file_data": "..."}, in which case transform is def transform(self, *, file_data: str) or similar. This is acceptable for small data.
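The two hand-off styles can be compared side by side. In this sketch the class names are hypothetical, the extractors omit the force and storage parameters for brevity, and the transformers yield plain strings instead of KgEdge and KgNode models.

```python
from pathlib import Path
from tempfile import TemporaryDirectory
from typing import Dict, Iterator


class PathExtractor:
    def __init__(self, file_path: Path) -> None:
        self._file_path = file_path

    def extract(self) -> Dict[str, Path]:
        # Style 1: hand over only the path; the transformer opens the file.
        return {"path_to_file": self._file_path}


class PathTransformer:
    def transform(self, *, path_to_file: Path) -> Iterator[str]:
        # Streams the file line by line, so large files stay out of memory.
        with path_to_file.open() as source_file:
            for line in source_file:
                yield line.strip()


class DataExtractor:
    def __init__(self, file_path: Path) -> None:
        self._file_path = file_path

    def extract(self) -> Dict[str, str]:
        # Style 2: read the whole file up front and hand over its contents.
        return {"file_data": self._file_path.read_text()}


class DataTransformer:
    def transform(self, *, file_data: str) -> Iterator[str]:
        yield from file_data.splitlines()


with TemporaryDirectory() as temp_dir:
    file_path = Path(temp_dir) / "pairs.txt"
    file_path.write_text("cat,dog\nsun,moon\n")
    streamed = list(PathTransformer().transform(**PathExtractor(file_path).extract()))
    in_memory = list(DataTransformer().transform(**DataExtractor(file_path).extract()))
print(streamed == in_memory)
```

In both styles the dictionary keys must match the keyword-only parameter names of transform, because the framework unpacks the dictionary with **.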
Given extracted data in one of the forms listed above, the transformer's task is to:
- parse the data in its source format
- create a sequence of KgEdge and KgNode models that capture the data
- yield those models
Transformers can be implemented in a variety of ways, as long as they conform to the _Transformer abstract base class. For example, in many implementations the top-level transform method delegates to multiple private helper methods or helper classes. It is easier to test the code if the logic of the transformer is broken up into relatively small methods that can be tested individually, rather than one large transform method with many branches.
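A transformer split into small helpers could be shaped like the sketch below. Node and Edge are hypothetical stand-ins for KgNode and KgEdge (the real models have different fields), and the demo: id prefix and the choice of /r/RelatedTo are assumptions made for the example.

```python
from typing import Iterator, NamedTuple, Optional, Tuple, Union


class Node(NamedTuple):
    """Hypothetical stand-in for KgNode."""
    id: str
    label: str


class Edge(NamedTuple):
    """Hypothetical stand-in for KgEdge."""
    subject_id: str
    predicate: str
    object_id: str


class WordAssociationTransformer:
    def transform(self, *, file_data: str) -> Iterator[Union[Node, Edge]]:
        # The top-level method only orchestrates; the helpers do the work.
        for line in file_data.splitlines():
            parsed = self._parse_line(line)
            if parsed is None:
                continue  # skip blank or malformed lines
            cue, response = parsed
            yield self._make_node(cue)
            yield self._make_node(response)
            yield self._make_edge(cue, response)

    @staticmethod
    def _parse_line(line: str) -> Optional[Tuple[str, str]]:
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 2 or not all(parts):
            return None
        return parts[0], parts[1]

    @staticmethod
    def _make_node(word: str) -> Node:
        return Node(id=f"demo:{word}", label=word)

    @staticmethod
    def _make_edge(cue: str, response: str) -> Edge:
        return Edge(
            subject_id=f"demo:{cue}",
            predicate="/r/RelatedTo",
            object_id=f"demo:{response}",
        )
```

Each helper can then be covered by a short pytest function, for example asserting that _parse_line returns None for a line without a comma, without having to run a whole pipeline.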
Note that KgEdge and KgNode have legacy factory classmethods (.legacy in both cases) corresponding to an older data model. These should not be used in new code. New code should instantiate the models directly or use one of the other factory classmethods as a convenience.
The swow pipeline tests in tests/mowgli_etl_test/pipeline/swow can be used as a model for how to test a pipeline. Familiarity with the pytest framework is necessary.
We use the GitHub flow with feature branches on this code. Branches should be named after (e.g., GH-###) or otherwise linked to an issue in the issue tracker. Please tag a staff person for code reviews, and re-tag when you have addressed the staff person's comments in the code and rebutted the comments in the PR. See the Google Code Review Developer Guide for more information on code reviews.
We use CircleCI for continuous integration. CircleCI runs the tests in tests/ on every push to origin. Merging a feature branch is contingent on having adequate tests and all tests passing. We encourage test-driven development.
- conceptnet.io and the ConceptNet paper for understanding common sense knowledge graphs
- The Storks et al. survey "Recent Advances in Natural Language Inference"
- The Missing Semester of Your CS Education