
# Commit f9cfae4

**Improve READMEs, fix dependencies for release (#95)**

## What is the goal of this PR?

Improve install instructions in the main README, and diagram explanations in the KGCN README, ready for release.

## What are the changes implemented in this PR?

- README edits
- Fix the dependencies ready for release
- Lock the bazel and rbe install scripts to a specific build-tools commit

Parent: a5d6298

6 files changed (+133 additions, -25 deletions)

## .circleci/config.yml (2 additions, 2 deletions)

```diff
@@ -3,9 +3,9 @@ commands:
   install-bazel-linux-rbe:
     steps:
-      - run: curl -OL https://raw.githubusercontent.com/graknlabs/build-tools/master/ci/install-bazel-linux.sh
+      - run: curl -OL https://raw.githubusercontent.com/graknlabs/build-tools/04c69fbe5277bf2ed9e2baf5e9a53ac3c9ebee80/ci/install-bazel-linux.sh
       - run: bash ./install-bazel-linux.sh && rm ./install-bazel-linux.sh
-      - run: curl -OL https://raw.githubusercontent.com/graknlabs/build-tools/master/ci/install-bazel-rbe.sh
+      - run: curl -OL https://raw.githubusercontent.com/graknlabs/build-tools/04c69fbe5277bf2ed9e2baf5e9a53ac3c9ebee80/ci/install-bazel-rbe.sh
       - run: bash ./install-bazel-rbe.sh && rm ./install-bazel-rbe.sh

   run-grakn-server:
```

## README.md (29 additions, 18 deletions)
```diff
@@ -8,9 +8,9 @@

 [Grakn](https://github.com/graknlabs/grakn) lets us create Knowledge Graphs from our data. But what challenges do we encounter where querying alone won’t cut it? What library can address these challenges?

-To respond to these scenarios, KGLIB is the centre of all research projects conducted at Grakn Labs. In particular, its focus is on the integration of machine learning with the Grakn knowledge graph.
+To respond to these scenarios, KGLIB is the centre of all research projects conducted at Grakn Labs. In particular, its focus is on the integration of machine learning with the Grakn Knowledge Graph. More on this below, in [*Knowledge Graph Tasks*](https://github.com/graknlabs/kglib#knowledge-graph-tasks).

-At present this repo contains one project: [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn).
+At present this repo contains one project: [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn). Go there for more info on getting started with a working example.

 ## Quickstart
 **Requirements**
```
````diff
@@ -21,23 +21,38 @@ At present this repo contains one project: [*Knowledge Graph Convolutional Netwo

 - The [latest release of Grakn Core](https://github.com/graknlabs/grakn/releases/latest) or [Grakn KGMS](https://dev.grakn.ai/docs/cloud-deployment/kgms) running

+**Run**
+Take a look at [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn) to see a walkthrough of how to use the library.
+
 **Building from source**

-To test that all targets can be built:
+Clone KGLIB:
+
+```
+git clone git@github.com:graknlabs/kglib.git
+```
+
+`cd` into the project:
+
+```
+cd kglib
+```
+
+To build all targets:

-```bash
+```
 bazel build //...
 ```

-To run all tests:
+To run all tests (requires Python 3.6+):

-```bash
-bazel test //... --test_output=streamed --spawn_strategy=standalone --python_version PY3 --python_path $(which python3)
+```
+bazel test //kglib/... --test_output=streamed --spawn_strategy=standalone --python_version PY3 --python_path $(which python3)
 ```

 To build the pip distribution (find the output in `bazel-bin`):

-```bash
+```
 bazel build //:assemble-pip
 ```
````

```diff
@@ -76,7 +91,7 @@ Here we term any task which creates new facts for the KG as *Knowledge Graph Com

 #### Relation Prediction (a.k.a. Link prediction)

-We often want to find new connections in our Knowledge Graphs. Often, we need to understand how two concepts are connected. This is the case of binary Relation prediction, which all existing literature concerns itself with. Grakn is a [Hypergraph](https://en.wikipedia.org/wiki/Hypergraph), where Relations are [Hyperedges](https://en.wikipedia.org/wiki/Glossary_of_graph_theory_terms#hyperedge). Therefore, in general, the Relations we may want to predict may be **ternary** (3-way) or even **[N-ary](https://en.wikipedia.org/wiki/N-ary_group)** (N-way), which goes beyond the research we have seen in this domain.
+We often want to find new connections in our Knowledge Graphs. Often, we need to understand how two concepts are connected. This is the case of **binary** Relation prediction, which all existing literature concerns itself with. Grakn is a [Hypergraph](https://en.wikipedia.org/wiki/Hypergraph), where Relations are [Hyperedges](https://en.wikipedia.org/wiki/Glossary_of_graph_theory_terms#hyperedge). Therefore, in general, the Relations we may want to predict may be **ternary** (3-way) or even **[N-ary](https://en.wikipedia.org/wiki/N-ary_group)** (N-way), which goes beyond the research we have seen in this domain.

 When predicting Relations, there are several scenarios we may have. When predicting binary Relations between the members of one set and the members of another set, we may need to predict them as:
```

```diff
@@ -88,21 +103,17 @@ When predicting Relations, there are several scenarios we may have. When predict

 *Examples:* The problem of predicting which disease(s) a patient has is a one-to-many problem. Whereas, predicting which drugs in the KG treat which diseases is a many-to-many problem.

-We anticipate that solutions working well for the one-to-one case will also be applicable (at least to some extent) to the one-to-many case and cascade also to the many-to-many case.
-
-***In KGLIB*** [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn) can help us with one-to-one binary Relation prediction. This requires extra implementation, for which two approaches are apparent:
-
-- Create two KGCNs, one for each of the two Roleplayers in the binary Relation. Extend the neural network to compare the embeddings of each Roleplayer, and classify the pairing according to whether a Relation should exist or not.
+Notice also that recommender systems are one use case of one-to-many binary Relation prediction.

-- Feed Relations directly to a KGCN, and classify their existence. (KGCNs can accept Relations as the Things of interest just as well as Entities). To do this we also need to create hypothetical Relations, labelled as negative examples, and feed them to the KGCN alongside the positively labelled known Relations. Note that this extends well to ternary and N-ary Relations.
+We anticipate that solutions working well for the one-to-one case will also be applicable (at least to some extent) to the one-to-many case and cascade also to the many-to-many case.

-Notice also that recommender systems are one use case of one-to-many binary Relation prediction.
+***In KGLIB*** [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn) performs Relation prediction using an approach based on [Graph Networks](https://github.com/deepmind/graph_nets) from DeepMind. This can be used to predict **binary**, **ternary**, or **N-ary** relations. This is well-supported for the one-to-one case and the one-to-many case.

 #### Attribute Prediction

 We would like to predict one or more Attributes of a Thing, which may include also prediction of whether that Attribute should even be present at all.

-***In KGLIB*** [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn) can be used to directly learn Attributes for any Thing. Attribute prediction is already fully supported.
+***In KGLIB*** [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn) can be used to directly learn Attributes for any Thing. This requires some minor additional functionality to be added (we intend to build this imminently).

 #### Subgraph Prediction
```
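The relation-prediction approach this hunk adopts (feeding Relations directly to the model, alongside negative candidate Relations) is easiest to see on a concrete graph. Below is a minimal sketch of that reified-relation representation using `networkx`; the type names (`person`, `disease`, `diagnosis`) and attribute keys are made up for illustration and are not kglib's actual API:

```python
# A minimal sketch of representing Relations as nodes ("reification"),
# so that binary, ternary and N-ary Relations are handled uniformly.
# All names here are illustrative only, not kglib's API.
import networkx as nx

graph = nx.MultiDiGraph()

# Entities (Things) in the subgraph.
graph.add_node("person-1", type="person")
graph.add_node("disease-1", type="disease")
graph.add_node("disease-2", type="disease")

# A known Relation, reified as a node with one role edge per roleplayer.
# An N-ary Relation is simply a relation node with N role edges.
graph.add_node("diagnosis-1", type="diagnosis", exists=True)
graph.add_edge("diagnosis-1", "person-1", type="patient")
graph.add_edge("diagnosis-1", "disease-1", type="diagnosed-disease")

# A hypothetical candidate Relation, labelled as a negative example.
graph.add_node("diagnosis-2", type="diagnosis", exists=False)
graph.add_edge("diagnosis-2", "person-1", type="patient")
graph.add_edge("diagnosis-2", "disease-2", type="diagnosed-disease")

# The learner is then trained to classify relation nodes by existence.
for node, data in graph.nodes(data=True):
    if data["type"] == "diagnosis":
        print(node, "positive" if data["exists"] else "negative")
```

Because each Relation is itself a node, adding a third role edge is all it takes to represent a ternary Relation.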

```diff
@@ -114,7 +125,7 @@ Embeddings of Things and/or Types are universally useful for performing other do
 These vectors are easy to ingest into other ML pipelines.
 The benefit of building general-purpose embeddings is therefore to make use of them in multiple other pipelines. This reduces the expense of traversing the Knowledge Graph, since this task can be performed once and the output re-used more than once.

-***In KGLIB*** [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn) can be used to build general-purpose embeddings. This requires additional functionality, since a generic loss function is required in order to train the model. At its simplest, this can be achieved by measuring the shortest distance across the KG between two Things. This can be achieved trivially in Grakn using [`compute path`](https://dev.grakn.ai/docs/query/compute-query#compute-the-shortest-path).
+***In KGLIB*** [*Knowledge Graph Convolutional Networks* (KGCNs)](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn) can be used to build general-purpose embeddings. This requires additional functionality, since a generic loss function is required in order to train the model in an unsupervised fashion. At its simplest, this can be achieved by measuring the shortest distance across the KG between two Things. This can be achieved trivially in Grakn using [`compute path`](https://dev.grakn.ai/docs/query/compute-query#compute-the-shortest-path).

 #### Rule Mining (a.k.a. Association Rule Learning)
```
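The generic loss mentioned in this hunk could, at its simplest, supervise embedding distances with the shortest-path lengths that `compute path` returns. A sketch of that idea in plain numpy, with a hard-coded distance matrix standing in for Grakn's output; this illustrates the concept only and is not kglib functionality:

```python
# A sketch of the distance-based embedding loss described above.
# Path lengths between Things would come from Grakn's `compute path`;
# here they are hard-coded for illustration.
import numpy as np

rng = np.random.default_rng(0)
num_things, dim = 4, 8
embeddings = rng.normal(size=(num_things, dim))

# kg_distance[i, j]: shortest-path length between Thing i and Thing j.
kg_distance = np.array([[0, 1, 2, 3],
                        [1, 0, 1, 2],
                        [2, 1, 0, 1],
                        [3, 2, 1, 0]], dtype=float)

# Distance in embedding space should correlate with distance in the KG.
diffs = embeddings[:, None, :] - embeddings[None, :, :]
embedding_distance = np.linalg.norm(diffs, axis=-1)

# A simple squared-error loss between the two distance matrices;
# minimising this (e.g. by gradient descent) trains the embeddings.
loss = np.mean((embedding_distance - kg_distance) ** 2)
print(loss)
```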

## dependencies/graknlabs/dependencies.bzl (2 additions, 2 deletions)

```diff
@@ -4,7 +4,7 @@ def graknlabs_build_tools():
     git_repository(
         name = "graknlabs_build_tools",
         remote = "https://github.com/graknlabs/build-tools",
-        commit = "f50e7a618045c99862bed78f813b1cfbb25a6016", # sync-marker: do not remove this comment, this is used for sync-dependencies by @graknlabs_build_tools
+        commit = "04c69fbe5277bf2ed9e2baf5e9a53ac3c9ebee80", # sync-marker: do not remove this comment, this is used for sync-dependencies by @graknlabs_build_tools
     )

@@ -19,5 +19,5 @@ def graknlabs_client_python():
     git_repository(
         name = "graknlabs_client_python",
         remote = "https://github.com/graknlabs/client-python",
-        commit = "4f03fc79fba71f216a28a4bc412c084fcef099a0" # sync-marker: do not remove this comment, this is used for sync-dependencies by @graknlabs_client_python
+        tag = "1.5.4" # sync-marker: do not remove this comment, this is used for sync-dependencies by @graknlabs_client_python
     )
```
## Images (binary)

- image file, 154 KB
- kglib/kgcn/.images/learning.png, 62.5 KB

## kglib/kgcn/README.md (100 additions, 3 deletions)

```diff
@@ -1,3 +1,5 @@
+
+
 # Knowledge Graph Convolutional Networks

 This project introduces a novel model: the *Knowledge Graph Convolutional Network* (KGCN). This project is in its second major iteration since its inception.
```
````diff
@@ -31,10 +33,105 @@ See the [full example](https://github.com/graknlabs/kglib/tree/master/kglib/kgcn
 Once you have installed kglib via pip (as above) you can run the example as follows:

 1. Start a Grakn server
+
 2. Load [the schema](kglib/utils/grakn/synthetic/examples/diagnosis/schema.gql) for the example into Grakn. The template for the command is `./grakn console -k diagnosis -f path/to/schema.gql`
+
 3. Run the example: `python -m kglib.kgcn.examples.diagnosis.diagnosis`
+
 4. You should observe console output to indicate that the pipeline is running and that the model is learning. Afterwards two plots should be created to visualise the training process and examples of the predictions made.

+## Output
+
+### Console
+
+During training, the console will output metrics for the performance on the training and test sets.
+
+You should see output such as this for the diagnosis example:
+```
+# (iteration number), T (elapsed seconds), Ltr (training loss), Lge (test/generalization loss), Ctr (training fraction nodes/edges labeled correctly), Str (training fraction examples solved correctly), Cge (test/generalization fraction nodes/edges labeled correctly), Sge (test/generalization fraction examples solved correctly)
+# 00000, T 8.7, Ltr 2.4677, Lge 2.3044, Ctr 0.2749, Str 0.0000, Cge 0.2444, Sge 0.0000
+# 00050, T 11.3, Ltr 0.5098, Lge 0.4571, Ctr 0.8924, Str 0.0000, Cge 0.8983, Sge 0.0000
+# 00100, T 14.0, Ltr 0.3694, Lge 0.3340, Ctr 0.8924, Str 0.0000, Cge 0.8983, Sge 0.0000
+# 00150, T 16.6, Ltr 0.3309, Lge 0.3041, Ctr 0.9010, Str 0.0000, Cge 0.8919, Sge 0.0000
+# 00200, T 19.2, Ltr 0.3125, Lge 0.2940, Ctr 0.9010, Str 0.0000, Cge 0.8919, Sge 0.0000
+# 00250, T 21.8, Ltr 0.2975, Lge 0.2790, Ctr 0.9254, Str 0.2000, Cge 0.9178, Sge 0.4333
+# 00300, T 24.4, Ltr 0.2761, Lge 0.2641, Ctr 0.9332, Str 0.6000, Cge 0.9243, Sge 0.4333
+# 00350, T 27.0, Ltr 0.2653, Lge 0.2534, Ctr 0.9340, Str 0.6000, Cge 0.9243, Sge 0.4333
+# 00400, T 29.7, Ltr 0.2866, Lge 0.2709, Ctr 0.9332, Str 0.6000, Cge 0.9178, Sge 0.0000
+# 00450, T 32.3, Ltr 0.2641, Lge 0.2609, Ctr 0.9324, Str 0.6000, Cge 0.9178, Sge 0.4333
+# 00500, T 34.9, Ltr 0.2601, Lge 0.2544, Ctr 0.9324, Str 0.6000, Cge 0.9178, Sge 0.4333
+# 00550, T 37.5, Ltr 0.2571, Lge 0.2501, Ctr 0.9332, Str 0.6000, Cge 0.9243, Sge 0.4333
+# 00600, T 40.1, Ltr 0.2530, Lge 0.2404, Ctr 0.9348, Str 0.6000, Cge 0.9373, Sge 0.4333
+# 00650, T 42.7, Ltr 0.2508, Lge 0.2363, Ctr 0.9356, Str 0.6000, Cge 0.9438, Sge 0.4333
+# 00700, T 45.3, Ltr 0.2500, Lge 0.2340, Ctr 0.9372, Str 0.7333, Cge 0.9503, Sge 0.4333
+# 00750, T 48.0, Ltr 0.2493, Lge 0.2307, Ctr 0.9372, Str 0.7333, Cge 0.9567, Sge 0.8000
+# 00800, T 50.7, Ltr 0.2488, Lge 0.2284, Ctr 0.9372, Str 0.7333, Cge 0.9567, Sge 0.8000
+```
+
+Take note of the key:
+
+- \# - iteration number
+- T - elapsed seconds
+- Ltr - training loss
+- Lge - test/generalization loss
+- Ctr - training fraction nodes/edges labeled correctly
+- Str - training fraction examples solved correctly
+- Cge - test/generalization fraction nodes/edges labeled correctly
+- Sge - test/generalization fraction examples solved correctly
+
+The element we are most interested in is `Sge`, the proportion of subgraphs where all elements of the subgraph were classified correctly. This therefore represents an entirely correctly predicted example.
+
+### Diagrams
+
+#### Training Metrics
+
+Upon running the example you will also get plots from matplotlib saved to your working directory.
+
+You will see plots of metrics for the training process (training iteration on the x-axis) for the training set (solid line), and test set (dotted line). From left to right:
+
+- The absolute loss across all of the elements in the dataset
+- The fraction of all graph elements predicted correctly across the dataset
+- The fraction of completely solved examples (subgraphs extracted from Grakn)
+
+![learning metrics](.images/learning.png)
+
+#### Visualise Predictions
+
+We also receive a plot of some of the predictions made on the test set.
+
+**Blue box:** Ground Truth
+
+- Preexisting (known) graph elements are shown in blue
+- Relations and role edges that **should be predicted to exist** are shown in green
+- Candidate relations and role edges that **should not be predicted to exist** are shown faintly in red
+
+**Black boxes:** Model Predictions at certain message-passing steps
+
+These use the same colour scheme as above, but opacity indicates a probability given by the model.
+
+The learner predicts three classes for each graph element. These are:
+
+```
+[
+Element already existed in the graph (we wish to ignore these elements),
+Element does not exist in the graph,
+Element does exist in the graph
+]
+```
+
+In this way we perform relation prediction by proposing negative candidate relations (Grakn's rules help us with this). Then we train the learner to classify these negative candidates as **does not exist** and the correct relations as **does exist**.
+
+These boxes show the score assigned to the class **does exist**.
+
+Therefore, for good predictions we want to see no blue elements, and we want the red elements to fade out as more messages are passed, with the green elements becoming more certain.
+
+![predictions made on test set](.images/graph_snippet.png)
+
+This visualisation has some flaws, and will be improved in the future.
+
 ## Methodology

 The methodology that this implementation uses for Relation prediction is as follows:
````
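To make the three-class output described in the new *Output* section concrete: each graph element receives a 3-way softmax, and the prediction plots read off the probability of the class *does exist* to set opacity. A small numpy sketch with made-up logits, purely for illustration:

```python
# Sketch of the per-element 3-class output described above:
# class 0 = preexisting element (ignored), 1 = does not exist, 2 = does exist.
import numpy as np

# Hypothetical logits for 4 graph elements (rows) over 3 classes (columns).
logits = np.array([[4.0, 0.1, 0.2],   # preexisting element
                   [0.1, 3.0, 0.4],   # candidate: likely does not exist
                   [0.2, 0.3, 2.5],   # candidate: likely does exist
                   [0.1, 1.2, 1.0]])  # uncertain candidate

# Softmax over classes for each element.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)

# The prediction plots show the probability of class 2, "does exist",
# which drives the opacity of each drawn element.
print(probs[:, 2])
```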
```diff
@@ -43,7 +140,7 @@ In the case of the diagnosis example, we aim to predict `diagnosis` Relations. W

 We then teach the KGCN to distinguish between the positive and negative examples.

-###Examples == Subgraphs
+### Examples == Subgraphs

 We do this by creating *examples*, where each example is a subgraph extracted from a Grakn knowledge Graph. These subgraphs contain positive and negative instances of the relation to be predicted.
```
```diff
@@ -74,11 +171,11 @@ A single subgraph is extracted from Grakn by making these queries and combining

 We can visualise such a subgraph by running these two queries in Grakn Workbase:

-![](.images/queried_subgraph.png)
+![queried subgraph](.images/queried_subgraph.png)

 You can get the relevant version of Workbase from the Assets of the [latest release](https://github.com/graknlabs/workbase/releases/latest).

-###Learning
+### Learning

 A KGCN is a learned message-passing graph algorithm. Neural network components are learned, and are used to transform signals that are passed around the graph. This approach is convolutional due to the fact that the same transformation is applied to all edges and another is applied to all nodes. It may help your understanding to analogise this to convolution over images, where the same transformation is applied over all pixel neighbourhoods.
```
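The *Learning* paragraph above can be summarised in a few lines of code: one transformation is applied along every edge, and another to every node. A toy numpy version of a single message-passing step, with random matrices standing in for the learned networks; this is a sketch of the general technique, not kglib's implementation:

```python
# Toy single message-passing step, analogous to the Learning section above.
# The same "network" (here just a weight matrix) is applied to every edge,
# and another to every node - hence "convolutional".
import numpy as np

rng = np.random.default_rng(1)
num_nodes, dim = 5, 4
node_states = rng.normal(size=(num_nodes, dim))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]  # (sender, receiver)

W_edge = rng.normal(size=(dim, dim))  # stands in for the learned edge network
W_node = rng.normal(size=(dim, dim))  # stands in for the learned node network

# 1. Compute a message along every edge from its sender's state.
# 2. Aggregate incoming messages at each receiver.
incoming = np.zeros_like(node_states)
for sender, receiver in edges:
    incoming[receiver] += np.tanh(node_states[sender] @ W_edge)

# 3. Update every node from its old state plus its aggregated messages.
node_states = np.tanh((node_states + incoming) @ W_node)
print(node_states.shape)  # (5, 4)
```

A full KGCN stacks several such steps, so information can propagate multiple hops across the subgraph.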
