Commit dd32b5e

init graph_pattern_learner submodule
moved out of Associations repo @ f2af3af1097f76017b920643c961c72da5da3cdd

100 files changed: +49708 −0 lines changed
README.md

Graph Pattern Learner
=====================

(Work in progress...)

In this repository you find the code for a graph pattern learner. Given a list
of source-target pairs and a SPARQL endpoint, it will try to learn SPARQL
patterns. Given a source, the learned patterns will try to lead you to the right
target.
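To make "SPARQL pattern" a little more concrete: a learned pattern is a SPARQL graph pattern connecting a source entity to candidate targets. The sketch below is purely illustrative; the `wikiPageWikiLink` predicate and the pattern shape are assumptions for the example, not actual learner output:

```python
# Illustrative sketch only: a hand-written SPARQL pattern of the *kind* the
# learner searches for. The predicate here is an assumption, not learner output.
PATTERN = """SELECT ?target WHERE {{
    <{source}> <http://dbpedia.org/ontology/wikiPageWikiLink> ?target .
}}"""

def query_for(source_uri):
    """Instantiate the pattern for a concrete source entity."""
    return PATTERN.format(source=source_uri)

print(query_for("http://dbpedia.org/resource/Bacon"))
```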
The algorithm was first developed on a list of human associations that had been
mapped to DBpedia entities, as can be seen in [data/gt_associations.csv]:

| source                            | target                            |
| --------------------------------- | --------------------------------- |
| http://dbpedia.org/resource/Bacon | http://dbpedia.org/resource/Egg   |
| http://dbpedia.org/resource/Baker | http://dbpedia.org/resource/Bread |
| http://dbpedia.org/resource/Crow  | http://dbpedia.org/resource/Bird  |
| http://dbpedia.org/resource/Elm   | http://dbpedia.org/resource/Tree  |
| http://dbpedia.org/resource/Gull  | http://dbpedia.org/resource/Bird  |
| ...                               | ...                               |
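For illustration, such a pair list can be read with Python's standard `csv` module. A minimal sketch, assuming a comma-separated file with a `source,target` header row (the real data file may differ in details):

```python
import csv
import io

# A few rows in the same shape as data/gt_associations.csv (header assumed).
SAMPLE = """source,target
http://dbpedia.org/resource/Bacon,http://dbpedia.org/resource/Egg
http://dbpedia.org/resource/Crow,http://dbpedia.org/resource/Bird
"""

def load_pairs(fileobj):
    """Read (source, target) URI pairs from a CSV file object."""
    return [(row["source"], row["target"]) for row in csv.DictReader(fileobj)]

pairs = load_pairs(io.StringIO(SAMPLE))
```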
As you can immediately see, associations don't only follow a single pattern. Our
algorithm is designed to deal with this: it will try to learn several patterns
which in combination model your input list of source-target pairs. If your list
of source-target pairs is less complicated, the algorithm will happily terminate
earlier.

You can find more information about the algorithm and learning patterns for
human associations at [https://w3id.org/associations]. The page also includes
publications, as well as the resulting patterns learned for human associations
from a local DBpedia endpoint including wikilinks.
Installation
------------

Currently the suggested installation method is via git clone (this also allows
easier contributions):

    git clone git@github.com:RDFLib/graph-pattern-learner.git
    cd graph-pattern-learner

Afterwards, to set up the virtual environment and install all dependencies in it:

    virtualenv venv &&
    . venv/bin/activate &&
    pip install -r requirements.txt &&
    deactivate
Running the learner
-------------------

Before actually running the evolutionary algorithm, please consider that it will
issue a lot of queries to the endpoint you're specifying. Please don't run this
against public endpoints without asking the providers first. It is likely that
you will disrupt their service or get blacklisted. I suggest running against
your own local endpoint filled with the datasets you're interested in. If you
really want to run this against public endpoints, at least don't run the
multi-process version, but restrict yourself to one process.

Always feel free to reach out for help or feedback via the issue tracker or via
associations at joernhees de. We might even run the learner for you ;)
Before running, make sure to activate the virtual environment:

    . venv/bin/activate

To get a list of all available options run:

    python run.py --help

Don't be scared by the length: most options use sane defaults, but it's nice to
be able to change things once you become more familiar with your data and the
learner.
The options you will definitely be interested in are:

    --associations_filename (defaults to ./data/gt_associations.csv)
    --sparql_endpoint (defaults to http://dbpedia.org/sparql)
To run the algorithm, you might want to invoke it like this:

    ./clean_logs.sh
    PYTHONIOENCODING=utf-8 python \
        run.py --associations_filename=... --sparql_endpoint=... \
        2>&1 | tee >(gzip > logs/main.log.gz)

If you want to speed things up you can (and should) run with SCOOP in parallel:

    ./clean_logs.sh
    PYTHONIOENCODING=utf-8 python \
        -m scoop -n8 run.py --associations_filename=... --sparql_endpoint=... \
        2>&1 | tee >(gzip > logs/main.log.gz)

SCOOP will then run the graph pattern learner distributed over 8 cores (the -n
option).
The algorithm will by default randomly split your input list of source-target
pairs into a training and a test set. If you want to see how well the learned
patterns generalise, you can run:

    ./run_create_bundle.sh ./results/bundle_name sparql_endpoint \
        --associations_filename=...

The script will then first learn patterns and visualise them in
`./results/bundle_name/visualise`, before evaluating predictions, first on the
training set and then on the test set.
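The random train/test split described above can be sketched as follows. This is a simplified stand-in for whatever run.py does internally; the 50/50 ratio and the fixed seed are assumptions for the example:

```python
import random

def split_pairs(pairs, train_fraction=0.5, seed=42):
    """Shuffle the source-target pairs and split them into train and test sets."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

pairs = [("s%d" % i, "t%d" % i) for i in range(10)]
train, test = split_pairs(pairs)
```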

__init__.py

Whitespace-only changes.

clean_logs.sh

    #!/usr/bin/env bash
    cd "$(dirname "$0")"
    rm -r logs/*.log* logs/error_logs_*/ || exit 0
