Commit 3c57bb7
Refactor MAC Codes file
- Don't save temporary zip and csv files
- Don't expand the individual alleles for each code in the dict. Read the zip file into a
dictionary without any temp files. This saves memory and disk space.
```
(venv) --- /tmp/py-ard » ls -lah mac.pickle*
-rw-r--r-- 1 pbashyal wheel 278M Oct 6 10:19 mac.pickle
-rw-r--r-- 1 pbashyal wheel 365M Sep 30 14:45 mac.pickle-old
```
- Refactor complex lambdas into functions
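The "don't expand" bullet above can be illustrated with a minimal sketch. It assumes the dictionary stores each code's alleles as the raw `/`-delimited string and splits it only on lookup; the helper name `alleles_for` and the sample entries are hypothetical and not taken from the commit.

```python
# Hypothetical sample: codes mapped to unexpanded allele strings.
mac_dict = {"AB": "01/02", "AC": "01/03", "ABTVE": "02/110/140"}


def alleles_for(code, macs):
    """Split the stored string only when a code is actually looked up."""
    return macs[code].split("/")


expanded = alleles_for("ABTVE", mac_dict)
print(expanded)
```

Storing one string per code instead of one list per code is a plausible source of the smaller pickle shown above, though the commit does not break the saving down.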
# Notes
Notes taken while reviewing the MAC code.
## MAC file
```python
mac_file = data_dir + "/mac.txt"
```
File: 'mac.txt'
The MAC list is published in two different versions:
> when they’re printed, the first is better for encoding and the second is better for decoding
The entire list was maintained both as an Excel spreadsheet and as a Sybase database table.
The Excel version was the one that was printed and distributed, and it was rife with typos.
**==> numer.v3.txt <==**
Sorted by the length and then the values in the list
```
"LAST UPDATED: 09/30/20"
CODE SUBTYPE
AB 01/02
AC 01/03
AD 01/04
AE 01/05
AG 01/06
AH 01/07
AJ 01/08
```
**==> alpha.v3.txt <==**
Sorted by the code
```
"LAST UPDATED: 10/01/20"
* CODE SUBTYPE
AA 01/02/03/05
AB 01/02
AC 01/03
AD 01/04
AE 01/05
AF 01/09
AG 01/06
```
Function `all_macs` downloads `https://hml.nmdp.org/mac/files/numer.v3.zip`
to the file `numeric.v3.zip` and unzips it to `out_file = data_dir + "/numer.v3.txt"`.
The first 3 lines are skipped, and the rest is read into a pandas DataFrame:
```
Code Alleles
0 AB 01/02
1 AC 01/03
2 AD 01/04
3 AE 01/05
4 AG 01/06
```
The DataFrame is then written out to `'/tmp/3290/mac.txt'` as a CSV file 851603 lines long.
The `Alleles` column is expanded by splitting on `/`:
```
Code Alleles
0 AB [01, 02]
1 AC [01, 03]
2 AD [01, 04]
3 AE [01, 05]
4 AG [01, 06]
... ... ...
9995 ABTVE [02, 110, 140]
9996 AUYAN [02, 110, 145]
9997 AAFFK [02, 110, 146]
9998 ACTAX [02, 110, 176]
9999 CKBTE [02, 110, 308]
```
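The expansion step above can be reproduced with pandas' string accessor. This is a minimal sketch on a few hypothetical rows copied from the listing; the actual `all_macs` code may do the split differently.

```python
import pandas as pd

# Hypothetical sample rows taken from the listing above.
df_mac = pd.DataFrame(
    {
        "Code": ["AB", "AC", "ABTVE"],
        "Alleles": ["01/02", "01/03", "02/110/140"],
    }
)

# Turn each "/"-delimited string into a list of allele fields.
df_mac["Alleles"] = df_mac["Alleles"].str.split("/")

print(df_mac)
```

Each cell now holds a Python list, which is what makes the expanded form larger in memory than the original strings.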
Comments:
There is no need to download the zip file and save it in an uncompressed format. Just read the zip file into a
dictionary without any temp files.
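To show that no temp files are needed, here is a standard-library sketch that reads a zip archive held in memory straight into a dictionary. The helper `read_mac_zip`, the sample archive, and the assumption that the three skipped lines are the date line, the column header, and a blank line are all illustrative and not taken from the commit.

```python
import io
import zipfile


def read_mac_zip(zip_bytes, skip_lines=3):
    """Map each MAC code to its unexpanded allele string, entirely in memory."""
    mac_dict = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Assumes the archive holds a single tab-separated text member.
        with zf.open(zf.namelist()[0]) as member:
            lines = io.TextIOWrapper(member, encoding="utf-8")
            for line_no, line in enumerate(lines):
                if line_no < skip_lines:
                    continue  # skip the preamble lines
                fields = line.rstrip("\r\n").split("\t")
                if len(fields) < 2:
                    continue  # ignore blank or malformed lines
                mac_dict[fields[0]] = fields[1]
    return mac_dict


# Build a tiny stand-in archive in memory (hypothetical content).
sample = '"LAST UPDATED: 09/30/20"\nCODE\tSUBTYPE\n\nAB\t01/02\nAC\t01/03\n'
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("numer.v3.txt", sample)

mac_dict = read_mac_zip(buffer.getvalue())
print(mac_dict)
```

With a real download, the bytes of the HTTP response body would take the place of `buffer.getvalue()`.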
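```python
import pandas as pd

url = 'https://hml.nmdp.org/mac/files/numer.v3.zip'
# Read the zipped, tab-separated file directly from the URL, skipping the 3 preamble lines.
df_mac = pd.read_csv(url, sep='\t', compression='zip', skiprows=3, names=['Code', 'Alleles'])
# Series.to_dict() takes no orient argument; it maps each Code to its Alleles string.
mac_dict = df_mac.set_index("Code")["Alleles"].to_dict()
```
1 parent 52d90e1 commit 3c57bb7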
3 files changed, +90 -82 lines changed