UASFRA-MS-KnowledgeGraph is a data science project and part of the requirements for the Master's program (M.Sc.) in Computer Science at the Frankfurt University of Applied Sciences.
The project's goal is to create a NEO4J Knowledge Graph ("KG") populated with the ESG data that companies must report under the European Sustainability Reporting Standards (ESRS) legislation.
The reported ESG data is extracted from XBRL-files as part of this project.
The programs presented here create and query such a NEO4J Knowledge Graph using Python.
The Knowledge Graph can be queried with Python functions or an OpenAI-attached chat bot.
The documentation and presentation for this project are available here and here.
There are additional README-files concerning the respective sections:
All programs can be executed from the "main.py"-script in the root folder of this project. The "main.py"-script makes use of the following modules in the "src"-folder:
main.py
src
- A_read_xbrl.py: Converts company's XBRL-files into JSON-files to later import the data into the KG
- B_rdf_graph.py: Creates cypher queries based on the provided ontology.ttl-file and constructs the KG schema
- C_read_data.py: Creates templates for importing the data from the JSON-files created earlier
- D_graph_construction: Imports data from the JSON-files into the KG and loads additional data from wikidata and dbpedia
- E_embeddings.py: Converts selected text properties of Nodes into LLM embeddings for later similarity search
- F_graph_bot.py: Formulates questions in relation to data in the KG in human-readable form for a KB bot to answer them
- G_graph_queries.py: Formulates questions in relation to data in the KG and gets results from Python functions
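The similarity search mentioned for "E_embeddings.py" usually compares embedding vectors by cosine similarity. The following is only a minimal sketch of that idea in plain Python, not the project's implementation, which may rely on NEO4J's own vector functionality instead. The function names and the toy three-dimensional vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the key of the candidate vector closest to the query vector."""
    return max(candidates, key=lambda k: cosine_similarity(query, candidates[k]))


# Toy vectors standing in for real LLM embeddings (hypothetical values).
embeddings = {"emissions": [0.9, 0.1, 0.0], "revenue": [0.0, 0.2, 0.9]}
best = most_similar([1.0, 0.0, 0.1], embeddings)
```

Real embeddings have hundreds or thousands of dimensions, but the comparison works the same way.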
In order to run the functions in "main.py", some settings must be adjusted and NEO4J and related software need to be installed first.
# PATHS
path_base = pathlib.Path("C:/your/path/to/the/root-folder/")
path_data = pathlib.Path(path_base, "src/data/")
path_models = pathlib.Path(path_base, "src/models/")
path_ontos = pathlib.Path(path_models, "Ontologies")
Only adjust the "path_base"-value to the path where this project (root folder) is located on your system. Leave all other paths untouched unless you want to change the location of these folders.
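To illustrate why only "path_base" has to be changed, the sketch below rebuilds the same path settings inside a helper and lists the folders that do not exist yet. The helper name "build_paths" and the dictionary keys are hypothetical and not part of the project; the path expressions are copied from the settings above.

```python
import pathlib


def build_paths(base):
    """Derive the project folders from the root folder (hypothetical helper)."""
    path_base = pathlib.Path(base)
    path_data = pathlib.Path(path_base, "src/data/")
    path_models = pathlib.Path(path_base, "src/models/")
    path_ontos = pathlib.Path(path_models, "Ontologies")
    return {
        "base": path_base,
        "data": path_data,
        "models": path_models,
        "ontos": path_ontos,
    }


paths = build_paths("C:/your/path/to/the/root-folder/")

# Names of the folders that cannot be found on this system.
missing = [name for name, folder in paths.items() if not folder.is_dir()]
```

With the placeholder path, every folder is reported as missing; after setting the real root folder, the list should be empty.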
There are different options to install the NEO4J database on your system. We recommend the following option:
Graph Database Self-Managed / NEO4J Server (Community or Enterprise)
- Check the system requirements to install NEO4J: NEO4J system requirements
- Linux installation instructions: NEO4J Linux installation
- Windows installation instructions: NEO4J Windows installation
Under Windows, the jar-files of the plugins (neosemantics, apoc, graph-data-science) need to be downloaded and placed in the "$NEO4J_HOME/plugins" sub-folder of the "$NEO4J_HOME"-folder on your system. Some files in "$NEO4J_HOME/conf" also need to be adjusted. Please refer to the installation instructions here:
- NEO4J neosemantics installation instructions
- NEO4J apoc installation instructions
- NEO4J graph-data-science installation instructions
In order to use the programs, some Python libraries need to be installed first. Please install (e.g. with: pip install ...) all the libraries listed under [packages] in the Pipfile of the root folder:
Pipfile:
[packages]
- neo4j
- pandas
- python-dotenv
- etc.
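A quick way to see which libraries are still missing is to ask Python whether it can locate them. The sketch below checks only the three packages named above; the function name "find_missing" is hypothetical, and the list would need to be extended with the remaining entries of the Pipfile.

```python
import importlib.util

# Import names of the three packages listed above. Note that the
# "python-dotenv" distribution is imported under the name "dotenv".
REQUIRED_MODULES = ["neo4j", "pandas", "dotenv"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("missing:", missing if missing else "none")
```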
- Make sure you have adjusted the "$NEO4J_HOME/conf"-files as described in the NEO4J plugins installation instructions.
- Open http://localhost:7474 in your web browser.
- Connect using the username "neo4j" with the default password "neo4j". You might be prompted to change your password.
- Go to the file "secrets_template.env" in the root folder of this project. Change the name of this file from "secrets_template.env" to "secrets.env". Set the NEO4J username and password from the web browser as your "NEO4J_USER" and "NEO4J_PW" there. You might also want to insert your "OPENAI_API_KEY" there if you want to use the graph bot later.
- To see if NEO4J and its plugins were installed correctly and can be used in "main.py", please run the following "test_installation.py"-script in the root folder:
test_installation.py:
""" This script's purpose is to check if the installation of NEO4J and the import of Python libraries succeeded."""
from src.G_graph_queries import GraphQueries
gq = GraphQueries()
df = gq._query_df(query="SHOW functions")
apoc = df.name.str.startswith('apoc').any()
n10s = df.name.str.startswith('n10s').any()
gds = df.name.str.startswith('gds').any()
if __name__ == '__main__':
    print(f"""INSTALLATIONS:
apoc: {apoc}
n10s: {n10s}
graph-data-science: {gds}""")
- You should now see "True" printed for all three prefixes:
- n10s.* (for neosemantics functions)
- apoc.* (for apoc functions)
- gds.* (for graph-data-science functions)
- If you get an error or any of these three prefixes is missing ("False" in the printout), please go back and check/redo the settings and the installation.
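When the printout shows "False", it can help to see which plugin is affected by name. The following sketch applies the same prefix check to a plain list of function names, so it runs without a database connection. The helper "missing_plugins" and the sample names are hypothetical; in practice the names would come from the "name" column returned by "SHOW functions".

```python
# Prefix of the function names mapped to the plugin that provides them.
PLUGIN_PREFIXES = {
    "apoc": "apoc",
    "n10s": "neosemantics",
    "gds": "graph-data-science",
}


def missing_plugins(function_names):
    """Return the plugins for which no function with the expected prefix exists."""
    names = list(function_names)
    return [
        plugin
        for prefix, plugin in PLUGIN_PREFIXES.items()
        if not any(name.startswith(prefix + ".") for name in names)
    ]


# Hypothetical sample: no function starts with "n10s.".
sample = ["apoc.coll.sum", "gds.version", "abs"]
result = missing_plugins(sample)
```

Here "result" names neosemantics as the plugin whose installation should be rechecked.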
From the "main.py"-file in the root folder, you can now run the following functions by uncommenting the # CODE BLOCK below the desired function description:
main.py
0. Read XBRL-file into JSON-file. Please see: README-data.md-file.
# CODE BLOCK
1. Load ontology and show schema of knowledge graph in browser. Please see: README-models.md-file.
# CODE BLOCK
2. Load JSON-files/Company data into the NEO4J Knowledge-Graph. Please see: README-data.md-file.
# CODE BLOCK
3. Enrich NEO4J Knowledge-Graph with external data from wikidata.
# CODE BLOCK
4. Enrich NEO4J Knowledge-Graph with external data from dbpedia.
# CODE BLOCK
5. Create text embedding for one of the text properties.
# CODE BLOCK
6. GraphBot: RAG (Retrieval Augmented Generation) with NEO4J Graph.
# CODE BLOCK
7. GraphQueries: Query NEO4J Graph with Python functions.
# CODE BLOCK
onto_file_path_or_url: str = path_ontos.as_posix() + "/onto4/Ontology4.ttl"
Most of the parameters to be passed to these functions are Python "Enums". For instance, for the function ...
execute_graph_queries()
""" 7. GraphQueries: Query NEO4J Graph with Python functions. """
execute_graph_queries(esrs_1=ESRS.EmissionsToAirByPollutant,
company=Company.Adidas,
periods=['2023', '2022'],
return_df=True,
stat=Stats.SUM,
esrs_2=ESRS.NetRevenue,
comp_prop=CompProp.Industries,
print_queries=False)
These Enums make it easy to select a valid value: after typing "Company.", your IDE should list all possible values, such as "Adidas".
These Enum values are:
ESRS:
The ESRS-values refer to the 21 exemplary ESRS data points that you populated the KG with if you have (at least) run the functions 1. through 5. from "main.py". Please refer to the README-data.md-file in "/src/data/" for further details:
AbsoluteValueOfTotalGHGEmissionsReduction
AssetsAtMaterialPhysicalRiskBeforeClimateChangeAdaptationActions
AssetsAtMaterialTransitionRiskBeforeClimateMitigationActions
EmissionsToAirByPollutant
EmissionsToSoilByPollutant
EmissionsToWaterByPolllutant
FinancialResourcesAllocatedToActionPlanCapEx
FinancialResourcesAllocatedToActionPlanOpEx
GrossLocationBasedScope2GHGEmissions
GrossMarketBasedScope2GHGEmissions
GrossScope1GHGEmissions
GrossScope3GHGEmissions
NetRevenue
NetRevenueUsedToCalculateGHGIntensity
TotalAmountOfSubstancesOfConcernGenerated
TotalEnergyConsumptionFromFossilSources
TotalEnergyConsumptionFromNuclearSources
TotalEnergyConsumptionFromRenewableSources
TotalGHGEmissions
TotalUseOfLandArea
TotalWaterConsumption
Company:
The Company-values refer to the 3 exemplary companies "Adidas", "BASF" and "Puma" that you populated the KG with. Please refer to the README-data.md-file in "/src/data/" for further details:
Adidas
BASF
Puma
Stats:
MIN
MAX
AVG
SUM
CompProp:
The two CompProp-values refer to Node properties of the "Company" Node which come from external sources such as wikidata or dbpedia. Aggregates and single data points can be calculated according to these values. Please refer to the functions in "G_graph_queries.py":
Country
Industries
Please note that the sample JSON-files loaded into the KG only contain data for the periods 2022 and 2023.
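The Enum-based parameters and the period restriction can be sketched as follows. This is an illustration only: the member names are taken from the lists above, but the member values, the "AVAILABLE_PERIODS" constant and the "check_arguments" helper are assumptions, and the project's real Enum definitions may differ.

```python
from enum import Enum


class Company(Enum):
    # Member values are assumed for this sketch.
    Adidas = "Adidas"
    BASF = "BASF"
    Puma = "Puma"


class Stats(Enum):
    MIN = "min"
    MAX = "max"
    AVG = "avg"
    SUM = "sum"


# The sample JSON-files only cover these two periods.
AVAILABLE_PERIODS = {"2022", "2023"}


def check_arguments(company, periods, stat):
    """Validate arguments before a query is sent (hypothetical helper)."""
    if not isinstance(company, Company):
        raise TypeError("company must be a member of Company")
    if not isinstance(stat, Stats):
        raise TypeError("stat must be a member of Stats")
    unknown = sorted(set(periods) - AVAILABLE_PERIODS)
    if unknown:
        raise ValueError(f"no sample data for periods: {unknown}")
    return True


ok = check_arguments(Company.Adidas, ["2023", "2022"], Stats.SUM)
```

Because the parameters are Enum members rather than free strings, a misspelled company name fails immediately in the editor or at import time instead of silently returning an empty result.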
The next section is: Research