
Commit b8c30ab

Update BUILD.org
Add instructions on importing the ista RDF output into Neo4j.

BUILD.org

Lines changed: 125 additions & 3 deletions
@@ -78,8 +78,10 @@ should be needed). You'll notice that it creates two output files
that are used while populating the ontology.

*** Drugbank
To download the Academic DrugBank datasets, you must first create a
free DrugBank account and verify your email address. After you verify
your email address, DrugBank may request additional information about
your account, such as a description of how you plan to use DrugBank, a
description of your organization, who is sponsoring the research, and
what its end goal is. In our experience, account approval can take
anywhere from several business days to a few weeks.

After your access has been approved, navigate to the Academic Download
page on the DrugBank website (linked above) by selecting the
"Download" tab and then "Academic Download". Select the "External
Links" tab. In the table titled "External Drug Links", click the
"Download" button on the row labeled "All". This will download a zip
file. Extract the contents of that zip file, and make sure it is named
=drug_links.csv= (some versions use a
@@ -119,6 +121,8 @@ directory, which should deposit two filtered data files in the
and used when you run the ontology population script, along with the
unmodified =curated_disease_gene_associations.tsv= file.

Next, create a directory that will hold all of the raw data files. It
can be =D:\data\= or any other location you prefer. Within it, create
one folder for each third-party database; the individual CSV/TSV/TXT
files from each source go inside the corresponding folder.
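For example, with =D:\data\= as the root, the layout could look like
the following. Each folder name must match the database name used in
the corresponding ista parser configuration (the =drugbank= folder name
here is illustrative; the files shown are ones referenced in this
guide):
#+begin_example
D:\data\
  hetionet\
    hetionet-v1.0-nodes.tsv
  disgenet\
    CUSTOM\
      disease_mappings_to_attributes_alzheimer.tsv
  drugbank\
    drug_links.csv
#+end_example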
** SQL data sources
If you don't already have MySQL installed, install it. We recommend
using either a package manager (if one is available on your OS), or
@@ -252,8 +256,71 @@ define mappings using these parser objects. We won't replicate every
mapping in this guide for brevity, but you can see all of them in the
full AlzKB build script.
*** Configuration for 'flat file' (e.g., CSV) data sources
#+begin_src python
hetionet.parse_node_type(
    node_type="Symptom",
    source_filename="hetionet-v1.0-nodes.tsv",
    fmt="tsv",
    parse_config={
        "iri_column_name": "name",
        "headers": True,
        "filter_column": "kind",    # keep only rows where...
        "filter_value": "Symptom",  # ...kind == "Symptom"
        "data_transforms": {
            # Strip the "Symptom::" prefix from Hetionet IDs, leaving the MeSH ID
            "id": lambda x: x.split("::")[-1]
        },
        "data_property_map": {
            "id": onto.xrefMeSH,
            "name": onto.commonName
        }
    },
    merge=False,
    skip=False
)
#+end_src
This block indicates that the third-party database is Hetionet and the
source file is =hetionet-v1.0-nodes.tsv=, so the file ista will look
for is =D:\data\hetionet\hetionet-v1.0-nodes.tsv=.

Some of the configuration blocks have a =CUSTOM/= prefix on the
filename. This means the file was created by us manually, and it needs
to be stored in a =CUSTOM= subdirectory of that database's folder. For
example:
#+begin_src python
disgenet.parse_node_type(
    node_type="Disease",
    source_filename="CUSTOM/disease_mappings_to_attributes_alzheimer.tsv",  # Filtered for just Alzheimer disease
    fmt="tsv-pandas",
    parse_config={
        "iri_column_name": "diseaseId",
        "headers": True,
        "data_property_map": {
            "diseaseId": onto.xrefUmlsCUI,
            "name": onto.commonName,
        }
    },
    merge=False,
    skip=False
)
#+end_src
This file will be
=D:\data\disgenet\CUSTOM\disease_mappings_to_attributes_alzheimer.tsv=.

*** Configuration for SQL server data sources
#+begin_src python
aopdb.parse_node_type(
    node_type="Drug",
    source_table="chemical_info",
    parse_config={
        "iri_column_name": "DTX_id",
        "data_property_map": {"ChemicalID": onto.xrefMeSH},
        # With merge=True, rows are merged into existing Drug nodes by
        # matching DTX_id values against the onto.xrefDTXSID property
        "merge_column": {
            "source_column_name": "DTX_id",
            "data_property": onto.xrefDTXSID
        }
    },
    merge=True,
    skip=False
)
#+end_src
This block indicates that the third-party database is AOP-DB and the
source table is =chemical_info=. Since this is a SQL source, ista
queries the MySQL server directly rather than reading a file.

** Mapping data sources to ontology components
Every flat file or SQL table from a third-party data source can be
@@ -269,6 +336,9 @@ Each mapping is defined using a method call in the =ista= Python
script.

** Running =ista=
Now that you have set the locations of the data sources and the
ontology and defined the mappings, run =populate_ontology.py=.
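For orientation, here is a minimal sketch of how the pieces of
=populate_ontology.py= fit together, assuming the
=FlatFileDatabaseParser= and =MySQLDatabaseParser= classes provided by
ista; the ontology path and MySQL credentials below are placeholders,
so adapt them to your setup:
#+begin_src python
import owlready2
from ista import FlatFileDatabaseParser, MySQLDatabaseParser

# Load the (unpopulated) AlzKB ontology; adjust the path to your checkout
onto = owlready2.get_ontology("file://alzkb.rdf").load()

data_dir = "D:\\data\\"  # root directory holding one folder per database

# One parser object per third-party data source
hetionet = FlatFileDatabaseParser("hetionet", onto, data_dir)
disgenet = FlatFileDatabaseParser("disgenet", onto, data_dir)
mysql_config = {"host": "localhost", "user": "root", "passwd": "hunter2"}
aopdb = MySQLDatabaseParser("aopdb", onto, mysql_config)

# ...node and relationship mappings, e.g. the parse_node_type() calls
# shown earlier, go here...

# Serialize the populated ontology for import into Neo4j
onto.save(file="alzkb-populated.rdf", format="rdfxml")
#+end_src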

The output of this step, =alzkb-populated.rdf=, will be used to build
the Neo4j graph database.

* 3.: Converting the ontology into a Neo4j graph database

@@ -306,6 +376,58 @@ the contents of AlzKB. In Neo4j Community, this can be done as follows:
- Uncomment the line containing
  =dbms.security.procedures.allowlist=apoc.coll.*,apoc.load.*,gds.*=
  to activate it.
- Add =n10s.*,apoc.cypher.*,apoc.help= to that allowlist, so the line
  reads as shown in the example after this list.
- Click the "Apply" button, then "Close".
- Click "Start" to start the graph database.
** Importing the =ista= RDF output into Neo4j
- Open the Neo4j Browser and run the following Cypher statements to
  import the RDF data:
#+begin_src cypher
// Remove any existing nodes and relationships
MATCH (n) DETACH DELETE n
#+end_src

#+begin_src cypher
// Create the uniqueness constraint required by neosemantics (n10s)
CREATE CONSTRAINT n10s_unique_uri FOR (r:Resource) REQUIRE r.uri IS UNIQUE
#+end_src

#+begin_src cypher
// Create a graph configuration
CALL n10s.graphconfig.init();
CALL n10s.graphconfig.set({applyNeo4jNaming: true, handleVocabUris: 'IGNORE'});
#+end_src

#+begin_src cypher
// Import the RDF file
CALL n10s.rdf.import.fetch( "file://D:\\data\\alzkb-populated.rdf", "RDF/XML")
#+end_src
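If the import succeeds, =n10s.rdf.import.fetch= returns a row with a
non-zero =triplesLoaded= count. As an extra sanity check, you can count
the imported nodes; n10s attaches the =Resource= label to every node it
creates (the cleanup step below removes it):
#+begin_src cypher
// Sanity check: count the nodes created by the RDF import
MATCH (n:Resource) RETURN count(n)
#+end_src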

- Run the Cypher statements below to clean up the imported nodes:
#+begin_src cypher
MATCH (n:Resource) REMOVE n:Resource;
MATCH (n:NamedIndividual) REMOVE n:NamedIndividual;
MATCH (n:AllDisjointClasses) REMOVE n:AllDisjointClasses;
MATCH (n:AllDisjointProperties) REMOVE n:AllDisjointProperties;
MATCH (n:DatatypeProperty) REMOVE n:DatatypeProperty;
MATCH (n:FunctionalProperty) REMOVE n:FunctionalProperty;
MATCH (n:ObjectProperty) REMOVE n:ObjectProperty;
MATCH (n:AnnotationProperty) REMOVE n:AnnotationProperty;
MATCH (n:SymmetricProperty) REMOVE n:SymmetricProperty;
MATCH (n:_GraphConfig) REMOVE n:_GraphConfig;
MATCH (n:Ontology) REMOVE n:Ontology;
MATCH (n:Restriction) REMOVE n:Restriction;
MATCH (n:Class) REMOVE n:Class;
MATCH (n) WHERE size(labels(n)) = 0 DETACH DELETE n; // Removes nodes without labels
#+end_src

You have now built AlzKB from scratch. You can count the nodes and
relationships of each type with:
#+begin_src cypher
// Node counts per label
CALL db.labels() YIELD label
CALL apoc.cypher.run('MATCH (:`'+label+'`) RETURN count(*) as count',{}) YIELD value
RETURN label, value.count ORDER BY label
#+end_src
#+begin_src cypher
// Relationship counts per type
CALL db.relationshipTypes() YIELD relationshipType as type
CALL apoc.cypher.run('MATCH ()-[:`'+type+'`]->() RETURN count(*) as count',{}) YIELD value
RETURN type, value.count ORDER BY type
#+end_src
