Commit cb7c4de: Get GRMA to work

1 parent 2f2fb30 commit cb7c4de

28 files changed: +6314 -17 lines changed

.gitignore

Lines changed: 7 additions & 0 deletions

```diff
@@ -134,3 +134,10 @@ dmypy.json
 # behave
 pretty.output
 allure_report/
+
+# graph output dirs
+output/
+results/
+
+lol_graph.c
+cutils.c
```

Makefile

Lines changed: 3 additions & 1 deletion

```diff
@@ -89,12 +89,14 @@ docker: docker-build ## build a docker image and run the service
 
 install: clean ## install the package to the active Python's site-packages
 	pip install --upgrade pip
-	python setup.py install
+	pip install git+https://github.com/nmdp-bioinformatics/py-graph-imputation
 	pip install -r requirements.txt
 	pip install -r requirements-tests.txt
 	pip install -r requirements-dev.txt
 	pip install -r requirements-deploy.txt
 	pre-commit install
+	python setup.py build_ext --inplace
+	python setup.py install
 
 venv: ## creates a Python3 virtualenv environment in venv
 	python3 -m venv venv --prompt $(PROJECT_NAME)-venv
```

README.md

Lines changed: 178 additions & 10 deletions

````diff
@@ -1,15 +1,183 @@
-My Project Template
+py-graph-match
 ===================
 
+Matching with Graphs
 
-How to use the template:
+`grma` is a package for finding HLA matches using a graph-based approach.
+The matching is based on [grim's](https://github.com/nmdp-bioinformatics/py-graph-imputation) imputation.
 
-1. Create a template by clicking on the "Use this template" button. Make sure to select all branches
-   This will create a new repository with the given name e.g. `urban-potato`
+
+## Pre-requisites
+
+### Data Directory Structure
+
+```
+data
+├── donors_dir
+│   └── donors.txt
+├── hpf.csv
+└── patients.txt
+```
+
+### conf Directory Structure
+
+```
+conf
+└── minimal-configuration.json
+```
+
+Follow these steps for finding matches:
+
+Set up a virtual environment (venv) and run:
+```
+make install
+```
+
+## Quick Getting Started
+
+Get started with a built-in example.
+
+### Build the Donors' Graph
+
+```
+python test_build_donors_graph.py
+```
+
+### Find Matches
+
+Use the grma algorithm to find matches efficiently by running `test_matching.py`:
+```
+python test_matching.py
+```
+
+Find the match results in the `results` directory.
+
+# Full Walkthrough
+### Building the Donors' Graph
+
+The donors' graph is a graph that contains all the donors (the search space). It is implemented using a LOL (List of Lists) representation written in Cython for better time and memory efficiency.
+Building it can take a lot of memory and time, so it's recommended to save the graph to a pickle file.
+
+Before building the donors' graph, all the donors' HLAs must be imputed using `grim`.
+Then all the imputation files must be saved under the same directory.
+
+```python
+import os
+from grma.donorsgraph.build_donors_graph import BuildMatchingGraph
+
+PATH_TO_DONORS_DIR = "data/donors_dir"
+PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"
+
+os.makedirs("output", exist_ok=True)
+
+build_matching = BuildMatchingGraph(PATH_TO_DONORS_DIR)
+graph = build_matching.graph  # access the donors' graph
+
+build_matching.to_pickle(PATH_TO_DONORS_GRAPH)  # save the donors' graph to pickle
+```
+
+### Search & Match Before Imputation of Patients
+The function `matching` finds matches with up to 3 mismatches and returns a `pandas.DataFrame` of the matches, sorted by number of mismatches and their score.
+
+The function takes these parameters:
+* match_graph: a grma donors' graph object - `grma.match.Graph`
+* grim_config_file: a path to the `grim` configuration file
+
+```python
+from grma.match import Graph, matching
+
+PATH_TO_DONORS_GRAPH = "data/donors_graph.pkl"
+PATH_CONFIG_FILE = "conf/minimal-configuration.json"
+
+# The donors' graph we built earlier
+donors_graph = Graph.from_pickle(PATH_TO_DONORS_GRAPH)
+
+# matching_results is a dict - {patient_id: the patient's result dataframe}
+matching_results = matching(donors_graph, PATH_CONFIG_FILE, search_id=1, donors_info=[],
+                            threshold=0.1, cutof=100, save_to_csv=True, output_dir="results")
+```
+
+`matching` takes some optional parameters, which you might want to change:
+
+* search_id: An integer identifier for the search. Default is 0.
+* donors_info: An iterable of fields from the database to include in the results. Default is None.
+* threshold: Minimal score value for a valid match. Default is 0.1.
+* cutof: Maximum number of matches to return. Default is 50.
+* verbose: A boolean flag for whether to print verbose output. Default is False.
+* save_to_csv: A boolean flag for whether to save the matching results to a CSV file. Default is False. If set to True, the function generates a directory named `search_1` upon completion.
+* `output_dir`: Output directory to write the match results file to.
+
+### Search & Match After Imputation of Patients
+
+The function `find_matches` finds matches with up to 3 mismatches and returns a `pandas.DataFrame` of the matches,
+sorted by number of mismatches and their score.
+
+It takes these parameters:
+* imputation_filename: a path to the file with the patients' typings.
+* match_graph: a grma donors' graph object - `grma.match.Graph`
+
+```python
+from grma.match import Graph, find_matches
+
+PATH_TO_PATIENTS_FILE = "data/patients_file.txt"
+PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"
+
+# The donors' graph we built earlier
+donors_graph = Graph.from_pickle(PATH_TO_DONORS_GRAPH)
+matching_results = find_matches(PATH_TO_PATIENTS_FILE, donors_graph)
+
+# matching_results is a dict - {patient_id: the patient's result dataframe}
+
+for patient, df in matching_results.items():
+    # Use the dataframe 'df' with the results for 'patient' here
+    print(patient, df)
+```
+
+`find_matches` takes some optional parameters, which you might want to change:
+* search_id: An integer identifier for the search. Default is 0.
+* donors_info: An iterable of fields from the database to include in the results. Default is None.
+* threshold: Minimal score value for a valid match. Default is 0.1.
+* cutof: Maximum number of matches to return. Default is 50.
+* verbose: A boolean flag for whether to print verbose output. Default is False.
+* save_to_csv: A boolean flag for whether to save the matching results to a CSV file. Default is False.
+If set to True, the function generates a directory named `Matching_Results_1` upon completion.
+* calculate_time: A boolean flag for whether to return the matching time per patient. Default is False.
+In case `calculate_time=True`, the output will be a dict like this: `{patient_id: (results_dataframe, time)}`
+* `output_dir`: Output directory to write the match results file to.
+
+### Set Database
+To include more donor information in the matching results than the matching fields alone,
+one can set a database that holds all the donors' information.
+The database must be a `pandas.DataFrame` whose index holds the donors' IDs.
+
+After setting the database, when calling one of the matching functions,
+you may pass in the `donors_info` parameter a `list` of the column names you want joined to the result dataframe from the database.
+
+Example of setting the database:
+
+```python
+import pandas as pd
+from grma.match import set_database
+
+donors = [0, 1, 2]
+database = pd.DataFrame([[30], [32], [25]], columns=["Age"], index=donors)
+
+set_database(database)
+```
+
+
+# How to contribute:
+
+1. Fork the repository: https://github.com/nmdp-bioinformatics/py-graph-match.git
 2. Clone the repository locally
 ```shell
-    git clone git@github.com:pbashyal-nmdp/urban-potato.git
-    cd urban-potato
+    git clone https://github.com/<Your-Github-ID>/py-graph-match.git
+    cd py-graph-match
 ```
 3. Make a virtual environment and activate it, run `make venv`
 ```shell
@@ -58,18 +226,18 @@ How to use the template:
 | |-- HLA_alleles.py
 | `-- SLUG_match.py
 `-- unit
-    `-- test_my_project_template.py
+    `-- test_py-graph-match.py
 ```
-8. Package Module files go in the `my_project_template` directory.
+8. Package Module files go in the `py-graph-match` directory.
 ```
-my_project_template
+py-graph-match
 |-- __init__.py
 |-- algorithm
 |   `-- match.py
 |-- model
 |   |-- allele.py
 |   `-- slug.py
-`-- my_project_template.py
+`-- py-graph-match.py
 ```
 9. Run all tests with `make test` or different tests with `make behave` or `make pytest`. `make behave` will generate report files and open the browser to the report.
 10. Use `python app.py` to run the Flask service app in debug mode. Service will be available at http://localhost:8080/
````
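The README above describes the match results as per-patient DataFrames sorted by number of mismatches and then by score, filtered by `threshold` and capped by `cutof`. As a rough illustration of that ordering only (a sketch with hypothetical match records and invented field names, not grma's actual result schema or internal logic):

```python
# Hypothetical match records; "donor_id", "mismatches", "score" are illustrative
# names, not necessarily the columns grma produces.
matches = [
    {"donor_id": 13, "mismatches": 1, "score": 0.82},
    {"donor_id": 12, "mismatches": 0, "score": 0.95},
    {"donor_id": 14, "mismatches": 1, "score": 0.91},
]

threshold = 0.1  # minimal score for a valid match (cf. the `threshold` parameter)
cutoff = 50      # maximum number of matches to keep (cf. the `cutof` parameter)

# Sort by fewest mismatches first, then by highest score, as the README describes.
ranked = sorted(
    (m for m in matches if m["score"] >= threshold),
    key=lambda m: (m["mismatches"], -m["score"]),
)[:cutoff]

print([m["donor_id"] for m in ranked])  # -> [12, 14, 13]
```

The perfect match (0 mismatches) comes first regardless of score; ties on mismatch count are broken by score.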

conf/minimal-configuration.json

Lines changed: 55 additions & 0 deletions

```diff
@@ -0,0 +1,55 @@
+{
+    "populations": [
+        "CAU"
+    ],
+    "freq_trim_threshold": 1e-5,
+    "priority": {
+        "alpha": 0.4999999,
+        "eta": 0,
+        "beta": 1e-7,
+        "gamma": 1e-7,
+        "delta": 0.4999999
+    },
+    "UNK_priors": "SR",
+    "FULL_LOCI": "ABCQR",
+    "loci_map": {
+        "A": 1,
+        "B": 2,
+        "C": 3,
+        "DQB1": 4,
+        "DRB1": 5
+    },
+
+    "factor_missing_data": 0.0001,
+    "Plan_B_Matrix": [
+        [[1, 2, 3, 4, 5]],
+        [[1, 2, 3], [4, 5]],
+        [[1], [2, 3], [4, 5]],
+        [[1, 2, 3], [4], [5]],
+        [[1], [2, 3], [4], [5]],
+        [[1], [2], [3], [4], [5]]
+    ],
+    "planb": true,
+    "number_of_options_threshold": 100000,
+    "epsilon": 1e-3,
+    "number_of_results": 10,
+    "number_of_pop_results": 100,
+    "output_MUUG": true,
+    "output_haplotypes": true,
+    "freq_data_dir": "data/freqs",
+    "freq_file": "data/hpf.csv",
+    "graph_files_path": "output/csv/",
+    "node_csv_file": "nodes.csv",
+    "edges_csv_file": "edges.csv",
+    "info_node_csv_file": "info_node.csv",
+    "top_links_csv_file": "top_links.csv",
+    "imputation_in_file": "data/patients.txt",
+    "imputation_out_umug_freq_filename": "don.umug",
+    "imputation_out_umug_pops_filename": "don.umug.pops",
+    "imputation_out_hap_freq_filename": "don.pmug",
+    "imputation_out_hap_pops_filename": "don.pmug.pops",
+    "imputation_out_miss_filename": "don.miss",
+    "imputation_out_problem_filename": "don.problem",
+    "max_haplotypes_number_in_phase": 100,
+    "imuptation_out_path": "output"
+}
```
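This configuration file is the one passed to `matching` as `grim_config_file`. A minimal sketch of loading it and reading the locus mapping with the standard library (the config fragment is inlined here for illustration; in practice you would `json.load` the file at `conf/minimal-configuration.json`, and the sanity check is my own, not part of grim):

```python
import json

# Trimmed copy of the configuration shown above, inlined for illustration.
raw = """
{
    "populations": ["CAU"],
    "freq_trim_threshold": 1e-5,
    "loci_map": {"A": 1, "B": 2, "C": 3, "DQB1": 4, "DRB1": 5},
    "planb": true,
    "imputation_in_file": "data/patients.txt"
}
"""

conf = json.loads(raw)

# loci_map assigns each locus a 1-based position; the Plan_B_Matrix entries in
# the full file group loci by these numbers.
loci = sorted(conf["loci_map"], key=conf["loci_map"].get)
print(loci)  # -> ['A', 'B', 'C', 'DQB1', 'DRB1']
```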

data/donors_dir/donors.txt

Lines changed: 3 additions & 0 deletions

```diff
@@ -0,0 +1,3 @@
+12,A*01:01+A*01:01^B*07:02+B*57:01^C*06:02+C*07:02^DQB1*03:03+DQB1*06:02^DRB1*07:01+DRB1*15:01,1,0
+13,A*01:02+A*01:01^B*07:02+B*57:01^C*06:02+C*07:02^DQB1*03:03+DQB1*06:02^DRB1*07:01+DRB1*15:01,1,0
+14,A*01:02+A*02:02^B*07:02+B*57:01^C*01:02+C*07:02^DQB1*03:03+DQB1*06:02^DRB1*07:01+DRB1*15:01,1,0
```
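Each donor line above is a comma-separated record whose second field is a GL-string-style typing: loci separated by `^`, and the two alleles at a locus joined by `+`. A small parsing sketch, as my own illustration of the format (grma/grim provide their own parsers; `parse_donor_line` is a hypothetical helper):

```python
def parse_donor_line(line: str):
    """Split one donors.txt record into (donor_id, {locus: [allele1, allele2]})."""
    fields = line.strip().split(",")
    donor_id, typing = fields[0], fields[1]
    genotype = {}
    for locus_block in typing.split("^"):  # loci are separated by '^'
        alleles = locus_block.split("+")   # the two alleles at a locus by '+'
        locus = alleles[0].split("*")[0]   # e.g. 'A*01:01' -> 'A'
        genotype[locus] = alleles
    return donor_id, genotype

line = ("12,A*01:01+A*01:01^B*07:02+B*57:01^C*06:02+C*07:02"
        "^DQB1*03:03+DQB1*06:02^DRB1*07:01+DRB1*15:01,1,0")
donor_id, genotype = parse_donor_line(line)
print(donor_id, genotype["DRB1"])  # -> 12 ['DRB1*07:01', 'DRB1*15:01']
```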
