Commit 105d101

Merge pull request #1 from kuefmz/updates
update readme and remove unused files
2 parents 207f6cb + e7e2540 commit 105d101

15 files changed

+1036603
-44
lines changed

README.md

Lines changed: 62 additions & 10 deletions
@@ -1,40 +1,92 @@
11
# Define a small taxonomy (~200 terms) for the AI/ML domain
22

3+
4+
## Overview
5+
6+
37
The data folder is incomplete because some of the sources are too large for GitHub.
48

59
Collections that are considered:
610
- **OpenAlex**: data accessed on 14th February 2024
711
- **OpenAIRE**: data accessed on 15th February 2024
812

9-
## Method/Usage
1013

14+
## Methodology
15+
16+
The methodology is shown in the image below.
17+
18+
19+
![methodology](./plots/methodology.png)
20+
21+
22+
## Usage
23+
24+
Below are the commands that need to be run to execute the methodology.
25+
26+
### Step 1
1127
First we need to collect papers and their categories from OpenAlex and OpenAIRE and align them.
1228

1329
To do this, run the script:
1430

15-
16-
# Step1
31+
```
1732
python3 src/collect_initial_data.py
33+
```
1834

1935

2036
Some of the collected categories are noisy; this script also cleans them up.
2137

22-
Once the data is collected we need to look for the best match for each category from each collection because we want to map them
38+
### Step 2
2339

24-
For this run the script:
40+
Once the data is collected, we need to find the best match for each category across the two collections so that the categories can be mapped. Matches are selected based on semantic similarity.
2541

42+
To do this, run the script:
2643

27-
# Step2
44+
```
2845
python3 src/select_the_best_match_from_category_pairs.py
46+
```
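The matching step described above can be pictured with a small sketch. It is not the code in `src/select_the_best_match_from_category_pairs.py`: the README does not say how semantic similarity is computed, so the sketch assumes each category already has an embedding vector and uses cosine similarity. The function names and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_matches(left, right):
    """For each category in `left`, return (best category in `right`, similarity)."""
    result = {}
    for name, vec in left.items():
        best = max(right, key=lambda other: cosine(vec, right[other]))
        result[name] = (best, cosine(vec, right[best]))
    return result


# Toy embeddings, invented for illustration only (assumption, not project data).
openalex = {
    "Machine learning": [0.9, 0.1, 0.0],
    "Computer vision": [0.1, 0.9, 0.2],
}
openaire = {
    "machine-learning": [0.8, 0.2, 0.1],
    "image processing": [0.0, 1.0, 0.3],
}

matches = best_matches(openalex, openaire)
for category, (match, score) in matches.items():
    print(f"{category} -> {match} ({score:.2f})")
```

In a real run the vectors would come from whatever embedding model the project uses; only the selection of the highest-scoring counterpart is illustrated here.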
47+
48+
### Step 3
49+
50+
Now that the papers with their mapped categories and the category similarities are ready, it is time to upload them to a Neo4j database. This can take several days, depending on the machine hosting the database and the amount of data collected.
51+
52+
53+
```
54+
python3 src/store_data_into_neo4j.py
55+
```
56+
57+
### Step 4
58+
59+
Once the data is stored in the database, we run a query that selects all category pairs, together with the supporting papers for both categories and the similarity of each pair.
60+
61+
```
62+
python3 src/run_ne04j_query.py
63+
```
64+
65+
### Step 5
66+
67+
68+
Now that we have the pairs, we can run another script to transform the query results to a CSV and also calculate the missing 'Agreement' metric for the analysis.
69+
70+
```
71+
python3 src/transform_result_from_json_to_csv.py
72+
```
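As a rough illustration of this step, the sketch below converts query results from JSON to CSV and adds an agreement column. The README does not define the 'Agreement' metric or the JSON layout, so both are assumptions here: agreement is stood in for by the Jaccard overlap of the two sets of supporting papers, and the field names (`category_a`, `papers_a`, and so on) are hypothetical.

```python
import csv
import io
import json


def agreement(papers_a, papers_b):
    """Stand-in metric (assumption): shared papers divided by all papers of the pair."""
    a, b = set(papers_a), set(papers_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def records_to_csv(records, out):
    """Write one CSV row per category pair, including the computed agreement."""
    fields = ["category_a", "category_b", "similarity", "agreement"]
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({
            "category_a": record["category_a"],
            "category_b": record["category_b"],
            "similarity": record["similarity"],
            "agreement": agreement(record["papers_a"], record["papers_b"]),
        })


# Hypothetical query result with invented paper identifiers.
raw = json.loads("""
[{"category_a": "Machine learning", "category_b": "machine-learning",
  "papers_a": ["p1", "p2", "p3"], "papers_b": ["p2", "p3", "p4"],
  "similarity": 0.98}]
""")

buffer = io.StringIO()
records_to_csv(raw, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained; passing an open file object instead would produce the CSV on disk.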
73+
74+
### Step 6
75+
76+
This generates the final data, on which the analysis is performed. The goal of the analysis is to produce a set of category mappings that align the different knowledge graphs.
77+
78+
### Step 7
2979

80+
The mapping produced in the previous steps should be manually validated.
3081

82+
Assuming that the generated mapping is stored in a CSV, the following script collects all the papers from the collected data that belong to both of the aligned categories.
3183

32-
To get all the best pair combinations run:
3384

85+
```
86+
python3 src/validation.py
87+
```
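The selection that the validation script performs can be sketched as follows. This is an illustration under stated assumptions rather than the contents of the script itself: the CSV column names (`openalex_category`, `openaire_category`) and the paper record layout are hypothetical.

```python
import csv
import io


def papers_for_validation(mapping_rows, papers):
    """For each aligned category pair, list the ids of papers tagged with both categories."""
    selected = {}
    for row in mapping_rows:
        pair = (row["openalex_category"], row["openaire_category"])
        selected[pair] = [
            paper["id"]
            for paper in papers
            if pair[0] in paper["openalex_categories"]
            and pair[1] in paper["openaire_categories"]
        ]
    return selected


# Hypothetical mapping CSV and paper records, invented for illustration.
mapping_csv = "openalex_category,openaire_category\nMachine learning,machine-learning\n"
mapping_rows = list(csv.DictReader(io.StringIO(mapping_csv)))

papers = [
    {"id": "p1", "openalex_categories": ["Machine learning"],
     "openaire_categories": ["machine-learning"]},
    {"id": "p2", "openalex_categories": ["Machine learning"],
     "openaire_categories": ["statistics"]},
    {"id": "p3", "openalex_categories": ["Machine learning", "Computer vision"],
     "openaire_categories": ["machine-learning", "image processing"]},
]

to_validate = papers_for_validation(mapping_rows, papers)
print(to_validate)
```

Only papers carrying both categories of a pair are returned, which gives domain experts a list of concrete cases to check for each proposed alignment.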
3488

35-
# Step3.1
89+
These papers should be manually validated by domain experts.
3690

37-
python3 src/get_all_the_pair_combinations.py
3891

3992

40-
To be continued.

plots/agreement_distplot.png

8.54 KB

plots/agreement_distplot2.png

36.8 KB
43.6 KB

plots/barplot.png

35.7 KB

plots/categories_vs_threshold.png

46.8 KB

plots/mapping_vs_agreement.png

31.7 KB

plots/method

Lines changed: 14 additions & 14 deletions
@@ -1,6 +1,6 @@
1-
<mxfile host="app.diagrams.net" modified="2024-03-14T14:59:21.439Z" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" etag="_qYvj7vXzO1W5TcUiSkq" version="24.0.4" type="device">
1+
<mxfile host="app.diagrams.net" modified="2024-04-08T16:41:30.263Z" agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" etag="8G2JoxbQfByx-54JaGIW" version="24.2.2" type="device">
22
<diagram name="1 oldal" id="1S5QFZVV1yIsBkyI2sdZ">
3-
<mxGraphModel dx="1434" dy="798" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="1169" pageHeight="1654" math="0" shadow="0">
3+
<mxGraphModel dx="1364" dy="913" grid="1" gridSize="10" guides="1" tooltips="1" connect="1" arrows="1" fold="1" page="1" pageScale="1" pageWidth="1169" pageHeight="1654" math="0" shadow="0">
44
<root>
55
<mxCell id="0" />
66
<mxCell id="1" parent="0" />
@@ -13,7 +13,7 @@
1313
<mxCell id="KIzjvLAE8B5wqseCv-Lj-1" value="&lt;font style=&quot;font-size: 14px;&quot;&gt;1. Data collection&lt;/font&gt;&lt;div style=&quot;font-size: 14px;&quot;&gt;&lt;font style=&quot;font-size: 14px;&quot;&gt;(papers, categories)&lt;/font&gt;&lt;/div&gt;" style="rounded=0;whiteSpace=wrap;html=1;fillColor=#f5f5f5;strokeColor=#666666;fontColor=#333333;container=0;" parent="1" vertex="1">
1414
<mxGeometry x="50" y="220" width="130.00000000000003" height="60" as="geometry" />
1515
</mxCell>
16-
<mxCell id="2AmnZrVYBXidMmyjmQXf-23" value="" style="endArrow=classic;html=1;rounded=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0.854;entryY=0.033;entryDx=0;entryDy=0;entryPerimeter=0;" edge="1" parent="1" source="KIzjvLAE8B5wqseCv-Lj-8" target="KIzjvLAE8B5wqseCv-Lj-9">
16+
<mxCell id="2AmnZrVYBXidMmyjmQXf-23" value="" style="endArrow=classic;html=1;rounded=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;entryX=0.854;entryY=0.033;entryDx=0;entryDy=0;entryPerimeter=0;" parent="1" source="KIzjvLAE8B5wqseCv-Lj-8" target="KIzjvLAE8B5wqseCv-Lj-9" edge="1">
1717
<mxGeometry width="50" height="50" relative="1" as="geometry">
1818
<mxPoint x="1080" y="190" as="sourcePoint" />
1919
<mxPoint x="780.0000000000002" y="320" as="targetPoint" />
@@ -22,9 +22,9 @@
2222
</Array>
2323
</mxGeometry>
2424
</mxCell>
25-
<mxCell id="2AmnZrVYBXidMmyjmQXf-27" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Calculate similariy&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;value of category paires&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" vertex="1" connectable="0" parent="2AmnZrVYBXidMmyjmQXf-23">
25+
<mxCell id="2AmnZrVYBXidMmyjmQXf-27" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Calculate similariy&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;value of category paires&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" parent="2AmnZrVYBXidMmyjmQXf-23" vertex="1" connectable="0">
2626
<mxGeometry x="0.04" y="-2" relative="1" as="geometry">
27-
<mxPoint x="-65" y="8" as="offset" />
27+
<mxPoint x="-62" y="-2" as="offset" />
2828
</mxGeometry>
2929
</mxCell>
3030
<mxCell id="KIzjvLAE8B5wqseCv-Lj-7" value="&lt;font style=&quot;font-size: 14px;&quot;&gt;2. Align papers from different KGs&lt;/font&gt;" style="rounded=0;whiteSpace=wrap;html=1;fillColor=#f5f5f5;strokeColor=#666666;fontColor=#333333;container=0;" parent="1" vertex="1">
@@ -33,13 +33,13 @@
3333
<mxCell id="KIzjvLAE8B5wqseCv-Lj-8" value="&lt;font style=&quot;font-size: 14px;&quot;&gt;3. Preprocessing&lt;/font&gt;" style="rounded=0;whiteSpace=wrap;html=1;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;container=0;" parent="1" vertex="1">
3434
<mxGeometry x="520" y="220" width="130.00000000000003" height="60" as="geometry" />
3535
</mxCell>
36-
<mxCell id="KIzjvLAE8B5wqseCv-Lj-13" value="" style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" parent="1" target="KIzjvLAE8B5wqseCv-Lj-7" edge="1" source="KIzjvLAE8B5wqseCv-Lj-1">
36+
<mxCell id="KIzjvLAE8B5wqseCv-Lj-13" value="" style="endArrow=classic;html=1;rounded=0;entryX=0;entryY=0.5;entryDx=0;entryDy=0;exitX=1;exitY=0.5;exitDx=0;exitDy=0;" parent="1" source="KIzjvLAE8B5wqseCv-Lj-1" target="KIzjvLAE8B5wqseCv-Lj-7" edge="1">
3737
<mxGeometry width="50" height="50" relative="1" as="geometry">
3838
<mxPoint x="227.4418604651163" y="250" as="sourcePoint" />
3939
<mxPoint x="299.5348837209302" y="200" as="targetPoint" />
4040
</mxGeometry>
4141
</mxCell>
42-
<mxCell id="2AmnZrVYBXidMmyjmQXf-13" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Finalize&amp;nbsp;&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;collected&amp;nbsp;&lt;/font&gt;&lt;/div&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;data&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" vertex="1" connectable="0" parent="KIzjvLAE8B5wqseCv-Lj-13">
42+
<mxCell id="2AmnZrVYBXidMmyjmQXf-13" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Finalize&amp;nbsp;&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;collected&amp;nbsp;&lt;/font&gt;&lt;/div&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;data&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" parent="KIzjvLAE8B5wqseCv-Lj-13" vertex="1" connectable="0">
4343
<mxGeometry x="-0.3684" y="4" relative="1" as="geometry">
4444
<mxPoint x="10" y="4" as="offset" />
4545
</mxGeometry>
@@ -50,7 +50,7 @@
5050
<mxPoint x="299.5348837209302" y="260" as="targetPoint" />
5151
</mxGeometry>
5252
</mxCell>
53-
<mxCell id="2AmnZrVYBXidMmyjmQXf-14" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Finalize&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;paper&amp;nbsp;&lt;/font&gt;&lt;div style=&quot;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;alignments&lt;/font&gt;&lt;/div&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" vertex="1" connectable="0" parent="KIzjvLAE8B5wqseCv-Lj-14">
53+
<mxCell id="2AmnZrVYBXidMmyjmQXf-14" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Finalize&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;paper&amp;nbsp;&lt;/font&gt;&lt;div style=&quot;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;alignments&lt;/font&gt;&lt;/div&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" parent="KIzjvLAE8B5wqseCv-Lj-14" vertex="1" connectable="0">
5454
<mxGeometry x="-0.2421" y="1" relative="1" as="geometry">
5555
<mxPoint x="9" as="offset" />
5656
</mxGeometry>
@@ -65,7 +65,7 @@
6565
</Array>
6666
</mxGeometry>
6767
</mxCell>
68-
<mxCell id="2AmnZrVYBXidMmyjmQXf-1" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Collect more data&lt;/font&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" vertex="1" connectable="0" parent="cRcOPCo_2h5ppIKabsQA-2">
68+
<mxCell id="2AmnZrVYBXidMmyjmQXf-1" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Collect more data&lt;/font&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" parent="cRcOPCo_2h5ppIKabsQA-2" vertex="1" connectable="0">
6969
<mxGeometry x="0.2778" y="2" relative="1" as="geometry">
7070
<mxPoint x="25" y="-2" as="offset" />
7171
</mxGeometry>
@@ -80,7 +80,7 @@
8080
</Array>
8181
</mxGeometry>
8282
</mxCell>
83-
<mxCell id="2AmnZrVYBXidMmyjmQXf-2" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Align preprocessed papers&lt;/font&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" vertex="1" connectable="0" parent="cRcOPCo_2h5ppIKabsQA-3">
83+
<mxCell id="2AmnZrVYBXidMmyjmQXf-2" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Align preprocessed papers&lt;/font&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" parent="cRcOPCo_2h5ppIKabsQA-3" vertex="1" connectable="0">
8484
<mxGeometry x="0.0889" y="-1" relative="1" as="geometry">
8585
<mxPoint x="8" as="offset" />
8686
</mxGeometry>
@@ -91,7 +91,7 @@
9191
<mxCell id="KIzjvLAE8B5wqseCv-Lj-9" value="&lt;font style=&quot;font-size: 14px;&quot;&gt;4. Propose candidate mapping&lt;/font&gt;" style="rounded=0;whiteSpace=wrap;html=1;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;container=0;" parent="1" vertex="1">
9292
<mxGeometry x="720" y="380" width="130" height="60" as="geometry" />
9393
</mxCell>
94-
<mxCell id="2AmnZrVYBXidMmyjmQXf-22" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" edge="1" parent="1" source="KIzjvLAE8B5wqseCv-Lj-10" target="KIzjvLAE8B5wqseCv-Lj-11">
94+
<mxCell id="2AmnZrVYBXidMmyjmQXf-22" value="" style="edgeStyle=orthogonalEdgeStyle;rounded=0;orthogonalLoop=1;jettySize=auto;html=1;" parent="1" source="KIzjvLAE8B5wqseCv-Lj-10" target="KIzjvLAE8B5wqseCv-Lj-11" edge="1">
9595
<mxGeometry relative="1" as="geometry" />
9696
</mxCell>
9797
<mxCell id="KIzjvLAE8B5wqseCv-Lj-10" value="&lt;font style=&quot;font-size: 14px;&quot;&gt;5. Evaluate candidate mapping&lt;/font&gt;" style="rounded=0;whiteSpace=wrap;html=1;fillColor=#f5f5f5;fontColor=#333333;strokeColor=#666666;container=0;" parent="1" vertex="1">
@@ -109,7 +109,7 @@
109109
<mxPoint x="146.55172413793105" y="420" as="targetPoint" />
110110
</mxGeometry>
111111
</mxCell>
112-
<mxCell id="2AmnZrVYBXidMmyjmQXf-24" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Calculate&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;metrics&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" vertex="1" connectable="0" parent="KIzjvLAE8B5wqseCv-Lj-16">
112+
<mxCell id="2AmnZrVYBXidMmyjmQXf-24" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Calculate&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;metrics&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" parent="KIzjvLAE8B5wqseCv-Lj-16" vertex="1" connectable="0">
113113
<mxGeometry x="0.15" y="-3" relative="1" as="geometry">
114114
<mxPoint x="12" y="3" as="offset" />
115115
</mxGeometry>
@@ -120,7 +120,7 @@
120120
<mxPoint x="367.2413793103448" y="420" as="targetPoint" />
121121
</mxGeometry>
122122
</mxCell>
123-
<mxCell id="2AmnZrVYBXidMmyjmQXf-25" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Define&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;threshold&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" vertex="1" connectable="0" parent="KIzjvLAE8B5wqseCv-Lj-17">
123+
<mxCell id="2AmnZrVYBXidMmyjmQXf-25" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Define&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;threshold&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" parent="KIzjvLAE8B5wqseCv-Lj-17" vertex="1" connectable="0">
124124
<mxGeometry x="0.1" y="1" relative="1" as="geometry">
125125
<mxPoint x="10" y="1" as="offset" />
126126
</mxGeometry>
@@ -131,7 +131,7 @@
131131
<mxPoint x="700" y="410" as="targetPoint" />
132132
</mxGeometry>
133133
</mxCell>
134-
<mxCell id="2AmnZrVYBXidMmyjmQXf-26" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Select&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;papers&lt;/font&gt;&lt;/div&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;to validate&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" vertex="1" connectable="0" parent="KIzjvLAE8B5wqseCv-Lj-18">
134+
<mxCell id="2AmnZrVYBXidMmyjmQXf-26" value="&lt;font style=&quot;font-size: 13px;&quot;&gt;Select&lt;/font&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;papers&lt;/font&gt;&lt;/div&gt;&lt;div style=&quot;font-size: 13px;&quot;&gt;&lt;font style=&quot;font-size: 13px;&quot;&gt;to validate&lt;/font&gt;&lt;/div&gt;" style="edgeLabel;html=1;align=center;verticalAlign=middle;resizable=0;points=[];container=0;" parent="KIzjvLAE8B5wqseCv-Lj-18" vertex="1" connectable="0">
135135
<mxGeometry x="-0.175" relative="1" as="geometry">
136136
<mxPoint x="-9" as="offset" />
137137
</mxGeometry>

plots/methodology.png

4.21 MB

plots/similarity_vs_agreement.png

71.8 KB
