Commit b76c7b8

Use pyard-reduce-csv command to reduce a CSV file based on a JSON config file.
1 parent 2db41b9 commit b76c7b8

6 files changed: +356 -294 lines changed

extras/README.md

Lines changed: 103 additions & 89 deletions
@@ -1,100 +1,114 @@
11
# Extras
22

3-
# Batch Script for CSV File
3+
# Script to batch process a CSV File
44

55
**Example Scripts to batch reduce HLA typings from a CSV File**
66

7-
`reduce_csv.py` and `conf.py` scripts can be used to take a CSV file with HLA
8-
typing data and reduce certain columns and produce a new CSV and Excel file.
9-
10-
For most use case, installing `py-ard`, specifying the changes in `conf.py` file
11-
and running `python reduce_csv.py` will produce result based on the configuration
12-
in the `conf.py`.
13-
14-
15-
```python
16-
#
17-
# configurations for processing CSV files
18-
#
19-
20-
# The column names that are in CSV
21-
# The output file will have these columns
22-
all_columns_in_csv = [
23-
"nmdp_id", "r_a_typ1", "r_a_typ2", "r_b_typ1", "r_b_typ2", "r_c_typ1", "r_c_typ2", "r_drb1_typ1", "r_drb1_typ2",
24-
"r_dpb1_typ1", "r_dpb1_typ2"
25-
]
26-
27-
#
28-
# List of columns which have typing information and need to be reduced.
29-
# The locus is the 2nd term in the column name
30-
# Eg: For column R_DRB1_type1, DPB1 is the locus name
31-
#
32-
columns_to_reduce_in_csv = [
33-
"r_a_typ1", "r_a_typ2", "r_b_typ1", "r_b_typ2", "r_c_typ1", "r_c_typ2", "r_drb1_typ1", "r_drb1_typ2", "r_dpb1_typ1",
7+
The `pyard-reduce-csv` command, used with a config file (that describes how
8+
to reduce the file), takes a CSV file with HLA typing data, reduces
9+
certain columns and produces a new CSV or Excel file.
10+
11+
Install `py-ard`, specify the changes in a JSON config file and run
12+
`pyard-reduce-csv -c <config-file>` to produce results based
13+
on the configuration in the config file.
14+
15+
16+
See [Example JSON config file](reduce_conf.json).
17+
18+
19+
### Input CSV filename
20+
`in_csv_filename` Directory path and file name of the Input CSV file
21+
22+
### Output CSV filename
23+
`out_csv_filename` Directory path and file name of the Reduced Output CSV file
24+
25+
### CSV Columns to read
26+
`columns_from_csv` The column names to read from CSV file
27+
28+
```json
29+
[
30+
"nmdp_id",
31+
"r_a_typ1",
32+
"r_a_typ2",
33+
"r_b_typ1",
34+
"r_b_typ2",
35+
"r_c_typ1",
36+
"r_c_typ2",
37+
"r_drb1_typ1",
38+
"r_drb1_typ2",
39+
"r_dpb1_typ1",
40+
"r_dpb1_typ2"
41+
]
42+
```
43+
44+
### CSV Columns to reduce
45+
`columns_to_reduce_in_csv` List of columns which have typing information and need to be reduced.
46+
47+
**NOTE**: The locus is the 2nd term in the column name
48+
E.g., for column `R_DRB1_type1`, `DRB1` is the locus name
49+
50+
```json
51+
[
52+
"r_a_typ1",
53+
"r_a_typ2",
54+
"r_b_typ1",
55+
"r_b_typ2",
56+
"r_c_typ1",
57+
"r_c_typ2",
58+
"r_drb1_typ1",
59+
"r_drb1_typ2",
60+
"r_dpb1_typ1",
3461
"r_dpb1_typ2"
35-
]
36-
37-
#
38-
# Configuration options to ARD reduction of a CSV file
39-
#
40-
ard_config = {
41-
# All Columns in the CSV file
42-
"csv_in_column_names": all_columns_in_csv,
43-
44-
# Columns to check for typings
45-
"columns_to_check": columns_to_reduce_in_csv,
46-
47-
# How should the typings be reduced
48-
# Valid Options:
49-
# - G
50-
# - lg
51-
# - lgx
52-
"redux_type": "lgx",
53-
54-
# Input CSV filename
55-
"in_csv_filename": "sample.csv",
56-
57-
# Output CSV filename
58-
"out_csv_filename": 'clean_sample.csv',
59-
60-
# Use compression
61-
# Valid options
62-
# - 'gzip'
63-
# - 'zip'
64-
# - None
65-
"apply_compression": 'gzip',
66-
67-
# Show verbose log
68-
# Valid options:
69-
# - True
70-
# - False
71-
"verbose_log": True,
72-
73-
# What to reduce ?
74-
"reduce_serology": False,
75-
"reduce_v2": True,
76-
"reduce_3field": True,
77-
"reduce_P": True,
78-
"reduce_XX": False,
79-
"reduce_MAC": True,
80-
81-
# Is locus name present in allele
82-
# Eg. A*01:01 vs 01:01
83-
"locus_in_allele_name": False,
84-
85-
# Format
86-
# Valid options:
87-
# - csv
88-
# - xlsx
89-
"output_file_format": 'csv',
90-
91-
# Add a separate column for processed column
92-
"new_column_for_redux": False,
93-
}
62+
]
9463
```
9564

96-
The included sample CSV file `sample.csv` can be processed using the script.
9765

98-
```shell
66+
### Redux Options
67+
`redux_type` Reduction Type
68+
69+
Valid Options: `G`, `lg` and `lgx`
70+
71+
### Compression Options
72+
`apply_compression` Compression to use for output file
9973

74+
Valid options: `"gzip"`, `"zip"` or `null`
75+
76+
### Verbose log Options
77+
`verbose_log` Show verbose log?
78+
79+
Valid options: `true` or `false`
80+
81+
### Types of typings to reduce
82+
```json
83+
"verbose_log": true,
84+
"reduce_serology": false,
85+
"reduce_v2": true,
86+
"reduce_3field": true,
87+
"reduce_P": true,
88+
"reduce_XX": false,
89+
"reduce_MAC": true,
10090
```
91+
Valid options: `true` or `false`
92+
93+
94+
### Locus Name in Allele
95+
`locus_in_allele_name`
96+
Is the locus name present in the allele? E.g., `A*01:01` vs `01:01`
97+
98+
Valid options: `true` or `false`
99+
100+
### Output Format
101+
`output_file_format` Format of the output file
102+
103+
Valid options: `csv` or `xlsx`
104+
105+
### Create New Column
106+
`new_column_for_redux` Whether to add a separate column for each processed column (`true`) or replace
107+
the current column (`false`). Adding creates a `reduced_` version of the column.
108+
109+
Valid options: `true` or `false`
110+
111+
### Map to DRBX
112+
`map_drb345_to_drbx` Map to DRBX Typings based on DRB3, DRB4 and DRB5 typings.
113+
114+
Valid options: `true` or `false`
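
The options documented in this README can be checked before running the command. The following is a minimal sketch, not part of `py-ard`: `check_config` and `locus_from_column` are hypothetical helpers, the allowed values are copied from the README text, and the rules that the locus is upper-cased and that every reduced column must also be listed in `columns_from_csv` are assumptions.

```python
# Hypothetical helpers for illustration; py-ard does not ship these functions.

# Allowed values as listed in the README (None corresponds to JSON null).
VALID_CHOICES = {
    "redux_type": ("G", "lg", "lgx"),
    "apply_compression": ("gzip", "zip", None),
    "output_file_format": ("csv", "xlsx"),
}

# Options the README documents as `true` or `false`.
BOOLEAN_KEYS = (
    "verbose_log", "reduce_serology", "reduce_v2", "reduce_3field",
    "reduce_P", "reduce_XX", "reduce_MAC", "locus_in_allele_name",
    "new_column_for_redux", "map_drb345_to_drbx",
)


def locus_from_column(column_name):
    """Return the 2nd underscore-separated term, upper-cased (assumed convention)."""
    return column_name.split("_")[1].upper()


def check_config(config):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key, allowed in VALID_CHOICES.items():
        if key in config and config[key] not in allowed:
            problems.append(f"{key}: {config[key]!r} is not one of {allowed}")
    for key in BOOLEAN_KEYS:
        if key in config and not isinstance(config[key], bool):
            problems.append(f"{key}: expected true or false")
    # Assumption: a column can only be reduced if it is also read from the CSV.
    columns_read = set(config.get("columns_from_csv", []))
    for column in config.get("columns_to_reduce_in_csv", []):
        if column not in columns_read:
            problems.append(f"{column}: not listed in columns_from_csv")
    return problems
```

For example, `locus_from_column("r_drb1_typ1")` gives `"DRB1"`, and `check_config({"redux_type": "xyz"})` reports a single problem.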

extras/conf.py

Lines changed: 0 additions & 78 deletions
This file was deleted.
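
For anyone still holding an `ard_config` dictionary in the style of the deleted `conf.py`, a conversion to the JSON layout might look like the sketch below. The key renames are an assumption inferred by comparing the old dictionary with `reduce_conf.json`; `to_json_config` is a hypothetical helper, not something the commit provides.

```python
import json

# Assumed old-to-new key names, inferred from the two files in this commit.
RENAMED_KEYS = {
    "csv_in_column_names": "columns_from_csv",
    "columns_to_check": "columns_to_reduce_in_csv",
}


def to_json_config(ard_config):
    """Rename the assumed keys and serialize the dictionary as JSON text."""
    converted = {RENAMED_KEYS.get(key, key): value for key, value in ard_config.items()}
    # json.dumps writes Python True/False/None as JSON true/false/null.
    return json.dumps(converted, indent=2)


old_style = {
    "csv_in_column_names": ["nmdp_id", "r_a_typ1"],
    "columns_to_check": ["r_a_typ1"],
    "redux_type": "lgx",
    "apply_compression": None,
    "verbose_log": True,
}
json_text = to_json_config(old_style)
```

Options that exist only in the JSON file, such as `map_drb345_to_drbx`, would still need to be added by hand.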

extras/reduce_conf.json

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
1+
{
2+
"in_csv_filename": "sample.csv",
3+
"out_csv_filename": "clean_sample.csv",
4+
"columns_from_csv": [
5+
"nmdp_id",
6+
"r_a_typ1",
7+
"r_a_typ2",
8+
"r_b_typ1",
9+
"r_b_typ2",
10+
"r_c_typ1",
11+
"r_c_typ2",
12+
"r_drb1_typ1",
13+
"r_drb1_typ2",
14+
"r_dpb1_typ1",
15+
"r_dpb1_typ2"
16+
],
17+
"columns_to_reduce_in_csv": [
18+
"r_a_typ1",
19+
"r_a_typ2",
20+
"r_b_typ1",
21+
"r_b_typ2",
22+
"r_c_typ1",
23+
"r_c_typ2",
24+
"r_drb1_typ1",
25+
"r_drb1_typ2",
26+
"r_dpb1_typ1",
27+
"r_dpb1_typ2"
28+
],
29+
"redux_type": "lgx",
30+
"apply_compression": "gzip",
31+
"reduce_serology": false,
32+
"reduce_v2": true,
33+
"reduce_3field": true,
34+
"reduce_P": true,
35+
"reduce_XX": false,
36+
"reduce_MAC": true,
37+
"locus_in_allele_name": false,
38+
"keep_locus_in_allele_name": false,
39+
"output_file_format": "csv",
40+
"new_column_for_redux": false,
41+
"map_drb345_to_drbx": false,
42+
"verbose_log": true
43+
}
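
A config like the one above can also be generated and run from Python. This is a rough sketch under stated assumptions: only the `pyard-reduce-csv -c <config-file>` invocation comes from the README; `write_config`, `build_command` and `run_reduction` are hypothetical helpers, and running from the config's directory so that relative file names such as `sample.csv` resolve there is an assumption.

```python
import json
import shutil
import subprocess
from pathlib import Path


def write_config(config, directory):
    """Write the config dictionary to reduce_conf.json in the given directory."""
    config_path = Path(directory) / "reduce_conf.json"
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


def build_command(config_path):
    """Build the command line documented in the README."""
    return ["pyard-reduce-csv", "-c", str(config_path)]


def run_reduction(config, directory):
    """Write the config and run the command; return its exit code, or None if not installed."""
    command = build_command(write_config(config, directory))
    if shutil.which(command[0]) is None:
        return None  # py-ard is not installed or not on PATH
    # Assumption: relative paths in the config are resolved from the working directory.
    return subprocess.run(command, cwd=directory, check=False).returncode
```

Calling `run_reduction(config, ".")` with the dictionary form of `reduce_conf.json` would then be equivalent to running `pyard-reduce-csv -c reduce_conf.json` in the current directory.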
