|
1 | | -My Project Template |
| 1 | +py-graph-match |
2 | 2 | =================== |
3 | 3 |
|
| 4 | +Matching with Graph |
4 | 5 |
|
5 | | -How to use the template: |
| 6 | +`grma` is a package for finding HLA matches using a graph-based approach.
| 7 | +The matching is based on [grim's](https://github.com/nmdp-bioinformatics/py-graph-imputation) imputation. |
6 | 8 |
|
7 | | -1. Create a template by clicking on the "Use this template" button. Make sure to select all branches |
8 | | - This will create a new repository with the given name e.g. `urban-potato` |
| 9 | + |
| 10 | +## Pre-requisites |
| 11 | + |
| 12 | +### Data Directory Structure |
| 13 | + |
| 14 | +``` |
| 15 | +data |
| 16 | +├── donors_dir |
| 17 | +│ └── donors.txt |
| 18 | +├── hpf.csv |
| 19 | +└── patients.txt |
| 20 | +``` |
| 21 | + |
| 22 | +### conf Directory Structure |
| 23 | + |
| 24 | +``` |
| 25 | +conf |
| 26 | +└── minimal-configuration.json |
| 27 | +``` |
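
Before running anything, it can help to confirm that the input files are where the two trees above say they are. The following is a minimal sketch using only the standard library; the helper `missing_inputs` is a hypothetical name introduced here, not part of `grma`, and it assumes the paths are relative to the project root.

```python
from pathlib import Path

# Paths taken from the two directory trees above, relative to the project root.
EXPECTED_PATHS = [
    "data/donors_dir/donors.txt",
    "data/hpf.csv",
    "data/patients.txt",
    "conf/minimal-configuration.json",
]


def missing_inputs(root="."):
    """Return the expected input paths that do not exist under `root`.

    Hypothetical helper for illustration only; grma does not provide it.
    """
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).is_file()]


if __name__ == "__main__":
    missing = missing_inputs()
    if missing:
        print("Missing input files:", ", ".join(missing))
    else:
        print("All expected input files are present.")
```

An empty list means the layout matches the trees shown above.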
| 28 | + |
| 29 | +Follow these steps for finding matches: |
| 30 | + |
| 31 | +Set up a virtual environment (venv) and run:
| 32 | +``` |
| 33 | +make install |
| 34 | +``` |
| 35 | + |
| 36 | +## Quick Getting Started |
| 37 | + |
| 38 | +Get started with a built-in example.
| 39 | + |
| 40 | +### Build 'Donors Graph' |
| 41 | + |
| 42 | +``` |
| 43 | +python test_build_donors_graph.py |
| 44 | +``` |
| 45 | + |
| 46 | +### Find Matches |
| 47 | + |
| 48 | +Use the grma algorithm to find matches efficiently by running the file `test_matching.py`:
| 49 | +``` |
| 50 | +python test_matching.py |
| 51 | +``` |
| 52 | + |
| 53 | +The match results are written to the `results` directory.
| 54 | + |
| 55 | +# Full Walkthrough
| 56 | +### Building The Donors' Graph: |
| 57 | + |
| 58 | +The donors' graph is a graph that contains all the donors (the search space). It is implemented using a LOL (List of Lists) representation written in Cython for better time and memory efficiency.
| 59 | +Building the graph can take a lot of memory and time, so it is recommended to save the graph to a pickle file.
| 60 | + |
| 61 | +Before building the donors' graph, all the donors' HLAs must be imputed using `grim`. |
| 62 | +Then all the imputation files must be saved under the same directory. |
| 63 | + |
| 64 | +```python |
| 65 | +import os |
| 66 | +from grma.donorsgraph.build_donors_graph import BuildMatchingGraph |
| 67 | + |
| 68 | +PATH_TO_DONORS_DIR = "data/donors_dir" |
| 69 | +PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl" |
| 70 | + |
| 71 | +os.makedirs("output", exist_ok=True)  # make sure the output directory exists
| 72 | + |
| 73 | +build_matching = BuildMatchingGraph(PATH_TO_DONORS_DIR) |
| 74 | +graph = build_matching.graph # access the donors' graph |
| 75 | + |
| 76 | +build_matching.to_pickle(PATH_TO_DONORS_GRAPH) # save the donors' graph to pickle |
| 77 | +``` |
| 78 | + |
| 79 | +### Search & Match before imputation to patients |
| 80 | +The function `matching` finds matches with up to 3 mismatches and returns, for each patient, a `pandas.DataFrame` of the matches sorted by number of mismatches and their score.
| 81 | +
| 82 | +The function takes these parameters:
| 83 | +* match_graph: a grma donors' graph object - `grma.match.Graph` |
| 84 | +* grim_config_file: a path to `grim` configuration file |
| 85 | + |
| 86 | + |
| 87 | +```python |
| 88 | +from grma.match import Graph, matching |
| 89 | + |
| 90 | +PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl"  # the pickle saved in the previous step
| 91 | +PATH_CONFIG_FILE = "conf/minimal-configuration.json"
| 92 | +
| 93 | +
| 94 | +# The donors' graph we built earlier
| 95 | +donors_graph = Graph.from_pickle(PATH_TO_DONORS_GRAPH)
| 96 | +
| 97 | +
| 98 | +# matching_results is a dict - {patient_id: the patient's result dataframe}
| 99 | +matching_results = matching(donors_graph, PATH_CONFIG_FILE, search_id=1, donors_info=[],
| 100 | + threshold=0.1, cutof=100, save_to_csv=True, output_dir="results")
| 101 | + |
| 102 | +``` |
| 103 | + |
| 104 | +`matching` takes some optional parameters, which you might want to change: |
| 105 | + |
| 106 | +* search_id: An integer identification of the search. default is 0. |
| 107 | +* donors_info: An iterable of fields from the database to include in the results. default is None. |
| 108 | +* threshold: Minimal score value for a valid match. default is 0.1. |
| 109 | +* cutof: Maximum number of matches to return. default is 50. |
| 110 | +* verbose: A boolean flag for whether to print the documentation. default is False |
| 111 | +* save_to_csv: A boolean flag for whether to save the matching results into a csv file. default is False. If the field is set to True, upon completion of the function, it will generate a directory named `search_1` |
| 112 | +* `output_dir`: output directory to write match results file to |
| 113 | + |
| 114 | +### Search & Match after imputation to patients |
| 115 | + |
| 116 | +The function `find_matches` finds matches with up to 3 mismatches and returns, for each patient, a `pandas.DataFrame` of the matches
| 117 | + sorted by number of mismatches and their score.
| 118 | +
| 119 | +It takes these parameters:
| 120 | +* imputation_filename: a path to the file of the patients' typing. |
| 121 | +* match_graph: a grma donors' graph object - `grma.match.Graph` |
| 122 | + |
| 123 | +```python |
| 124 | +from grma.match import Graph, find_matches |
| 125 | + |
| 126 | +PATH_TO_PATIENTS_FILE = "data/patients_file.txt" |
| 127 | +PATH_TO_DONORS_GRAPH = "output/donors_graph.pkl" |
| 128 | + |
| 129 | +# The donors' graph we built earlier |
| 130 | +donors_graph = Graph.from_pickle(PATH_TO_DONORS_GRAPH) |
| 131 | +matching_results = find_matches(PATH_TO_PATIENTS_FILE, donors_graph) |
| 132 | + |
| 133 | +# matching_results is a dict - {patient_id: the patient's result dataframe} |
| 134 | + |
| 135 | +for patient, df in matching_results.items(): |
| 136 | + # Use here the dataframe 'df' with the results for 'patient' |
| 137 | + print(patient, df) |
| 138 | +``` |
| 139 | + |
| 140 | +`find_matches` takes some optional parameters, which you might want to change: |
| 141 | +* search_id: An integer identification of the search. default is 0. |
| 142 | +* donors_info: An iterable of fields from the database to include in the results. default is None. |
| 143 | +* threshold: Minimal score value for a valid match. default is 0.1. |
| 144 | +* cutof: Maximum number of matches to return. default is 50. |
| 145 | +* verbose: A boolean flag for whether to print the documentation. default is False |
| 146 | +* save_to_csv: A boolean flag for whether to save the matching results into a csv file. default is False. |
| 147 | +If the field is set to True, upon completion of the function, it will generate a directory named `Matching_Results_1`.
| 149 | +* calculate_time: A boolean flag for whether to return the matching time for each patient. default is False.
| 150 | + In case `calculate_time=True` the output will be a dict like this: `{patient_id: (results_dataframe, time)}`
| 151 | +* `output_dir`: output directory to write match results file to |
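
With `calculate_time=True` each value in the output is a `(results_dataframe, time)` tuple rather than a bare dataframe, so downstream code has to unpack it. The sketch below shows one way to do that. It uses placeholder data in place of a real `find_matches` call; the helper name, the column name, the patient IDs, and the time values are all made up for illustration, and the unit of `time` is not stated in this README.

```python
import pandas as pd


def split_results_and_times(timed_results):
    """Split {patient_id: (dataframe, time)} into two dicts keyed by patient ID.

    Hypothetical helper for illustration only; grma does not provide it.
    """
    frames = {pid: df for pid, (df, _) in timed_results.items()}
    times = {pid: t for pid, (_, t) in timed_results.items()}
    return frames, times


# Placeholder data standing in for find_matches(..., calculate_time=True).
# The column name and all values are invented, not real grma output.
timed_results = {
    101: (pd.DataFrame({"placeholder_column": [1, 2]}), 0.42),
    102: (pd.DataFrame({"placeholder_column": [3]}), 0.17),
}

frames, times = split_results_and_times(timed_results)

# Patient whose search took the longest, according to the reported times.
slowest = max(times, key=times.get)
```

Keeping the dataframes and the timings in separate dicts lets the loop from the example above run unchanged on `frames`.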
| 152 | + |
| 153 | +### Set Database |
| 154 | +To include more information about the donors in the matching results than the matching information alone,
| 155 | +you can set a database that holds all the donors' information.
| 156 | +The database must be a `pandas.DataFrame` whose index holds the donors' IDs.
| 157 | +
| 158 | +After setting the database, when calling one of the matching functions,
| 159 | +you may pass in the `donors_info` parameter a `list` with the names of the database columns you want to join to the result dataframe.
| 160 | + |
| 161 | +Example of setting the database: |
| 162 | + |
| 163 | +```python |
| 164 | +import pandas as pd |
| 165 | +from grma.match import set_database |
| 166 | + |
| 167 | +donors = [0, 1, 2] |
| 168 | +database = pd.DataFrame([[30], [32], [25]], columns=["Age"], index=donors) |
| 169 | + |
| 170 | +set_database(database) |
| 171 | +``` |
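
To illustrate why the database must be indexed by donor ID, here is a pandas-only sketch of joining selected database columns onto a result table. This README does not show how `grma` performs the join internally, so the code is an assumption about the general idea, and the result columns `donor_id` and `mismatches` are hypothetical names.

```python
import pandas as pd

# Donor database in the required shape: one row per donor, index = donor ID.
database = pd.DataFrame([[30], [32], [25]], columns=["Age"], index=[0, 1, 2])

# Hypothetical stand-in for a match result table (column names are made up).
results = pd.DataFrame({"donor_id": [2, 0], "mismatches": [0, 1]})

# Columns requested from the database, as one would list them in donors_info.
donors_info = ["Age"]

# DataFrame.join matches the "donor_id" column against the database index.
enriched = results.join(database[donors_info], on="donor_id")
print(enriched)
```

Because the lookup goes through the index, a donor ID that is missing from the database would yield `NaN` in the joined columns in this sketch.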
| 172 | + |
| 173 | + |
| 174 | +# How to contribute: |
| 175 | + |
| 176 | +1. Fork the repository: https://github.com/nmdp-bioinformatics/py-graph-match.git |
9 | 177 | 2. Clone the repository locally |
10 | 178 | ```shell |
11 | | - git clone git@github.com:pbashyal-nmdp/urban-potato.git |
12 | | - cd urban-potato |
| 179 | + git clone https://github.com/<Your-Github-ID>/py-graph-match.git |
| 180 | + cd py-graph-match |
13 | 181 | ``` |
14 | 182 | 3. Make a virtual environment and activate it, run `make venv` |
15 | 183 | ```shell |
@@ -58,18 +226,18 @@ How to use the template: |
58 | 226 | | |-- HLA_alleles.py |
59 | 227 | | `-- SLUG_match.py |
60 | 228 | `-- unit |
61 | | - `-- test_my_project_template.py |
| 229 | + `-- test_py-graph-match.py |
62 | 230 | ``` |
63 | | -8. Package Module files go in the `my_project_template` directory. |
| 231 | +8. Package Module files go in the `py-graph-match` directory. |
64 | 232 | ``` |
65 | | - my_project_template |
| 233 | + py-graph-match |
66 | 234 | |-- __init__.py |
67 | 235 | |-- algorithm |
68 | 236 | | `-- match.py |
69 | 237 | |-- model |
70 | 238 | | |-- allele.py |
71 | 239 | | `-- slug.py |
72 | | - `-- my_project_template.py |
| 240 | + `-- py-graph-match.py |
73 | 241 | ``` |
74 | 242 | 9. Run all tests with `make test` or different tests with `make behave` or `make pytest`. `make behave` will generate report files and open the browser to the report. |
75 | 243 | 10. Use `python app.py` to run the Flask service app in debug mode. Service will be available at http://localhost:8080/ |
|