
Commit 5a6725f
Update README, improve preprocessing
1 parent e464b12

8 files changed: 202 additions, 92 deletions


Dockerfile

Lines changed: 3 additions & 1 deletion
```diff
@@ -13,9 +13,11 @@ RUN pip3 install pandas keras h5py zmq
 RUN pip3 install tqdm
 RUN pip3 install tensorflow-datasets
 
-RUN pip install tensorflow-addons==0.13.0
+RUN pip install tensorflow-addons
 
 RUN pip install scipy
+RUN pip install scikit-learn
+RUN pip install notebook
 
 RUN pip install seaborn
 
```

README.md

Lines changed: 80 additions & 8 deletions
````diff
@@ -53,11 +53,44 @@ tensorflow-datasets
 tensorflow-addons==0.13.0
 scipy
 seaborn
+scikit-learn
+notebook
 ```
 
 A GPU is recommended (with all necessary drivers installed), and a moderate amount of RAM will be required to run the data preprocessing and model training.
 
 
+### Downloading Data
+
+The full dataset is stored on Zenodo at the following URL: https://zenodo.org/record/8220494
+
+These can be downloaded from the site directly, but the following script may be preferable due to the large file size:
+```bash
+#!/bin/bash
+
+for i in $(seq -w 0 5 165); do
+    printf -v j "%03d" $((${i#0} + 4))
+    wget https://zenodo.org/records/8220494/files/data_${i}_${j}.tar.gz
+done
+```
+
+> [!WARNING]
+> These files are very large (4.0GB each, 135.4GB total).
+> Ensure you have enough disk space before downloading.
+
+To extract the files:
+```bash
+#!/bin/bash
+
+for i in $(seq -w 0 5 165); do
+    printf -v j "%03d" $((${i#0} + 4))
+    tar xzf data_${i}_${j}.tar.gz
+done
+```
+
+See the instructions below on processing the resulting files for use.
+
+
 ## Usage
 
 ### TensorFlow Container
@@ -101,8 +134,14 @@ Change the `docker-compose.yml` to ensure the device is mounted in the container
 The scripts in the `preprocessing` directory process the database file(s) into NumPy files, and then TFRecord datasets.
 It is recommended to run these scripts from within the TensorFlow container described above.
 
-Please note that these scripts load the full datasets into memory, and will consume large amounts of RAM.
-It is recommended that you run them on a machine with at least 128GB of RAM.
+> [!NOTE]
+> Converting databases to NumPy files and filtering is only necessary if you are doing your own data collection.
+> If the provided dataset on Zenodo is used, only the `np-to-tfrecord.py` script is needed.
+
+> [!IMPORTANT]
+> Please note that these scripts load the full datasets into memory, and will consume large amounts of RAM.
+> It is recommended that you run them on a machine with at least 128GB of RAM.
+
 
 #### db-to-np-multiple.py
 
@@ -116,6 +155,7 @@ python3 db-to-np-multiple.py
 
 The resulting files will be placed in `code/processed` (ensure this directory already exists).
 
+
 #### np-filter.py
 
 This script normalizes the IQ samples, and filters out unusable data.
@@ -128,20 +168,52 @@
 
 The resulting files will be placed in `code/filtered` (ensure this directory already exists).
 
+
 #### np-to-tfrecord.py
 
 This script converts NumPy files into the TFRecord format, for use in model training.
-To run, `path_base` and `suffixes` are once again set as above.
-The `chunk_size`, `shuffle`, `by_id`, and `id_counts` options may also be set to adjust how the dataset is generated -- the default options should be fine, unless alternative datasets (e.g. with transmitters removed) are required.
+To run this script, ensure your data has been processed into NumPy files with the following format:
+- `samples_<suffix>.npy`
+- `ra_sat_<suffix>.npy`
+- `ra_cell_<suffix>.npy`
 
-The script runs with no arguments:
+> [!NOTE]
+> The `db-to-np-multiple.py` script will produce files in this format.
+> The dataset available from Zenodo is also in this format.
+
+The script can be used as follows:
 ```bash
-python3 np-filter.py
+python3 np-to-tfrecord.py --path-in <INPUT PATH> --path-out <OUTPUT PATH>
 ```
 
-The resulting files will be placed in `code/tfrecord` (ensure this directory already exists).
+There are also the following optional parameters:
+- `--chunk-size <CHUNK SIZE>`: number of records in each chunk. Default is 50000; set to a smaller value for smaller files.
+- `-v`, `--verbose`: display progress.
+- `--max-files <MAX FILES>`: stop after processing the specified number of input files.
+- `--skip-files <SKIP FILES>`: skip a specified number of input files.
+- `--no-shuffle`: do not shuffle the data.
+- `--by-id`: see below.
+
+The `--by-id` option creates 9 datasets.
+The first of these contains only the most common 10% of transmitter IDs.
+The second contains 20%, and so on.
+Be careful using this option, as it creates a much larger number of files, and takes significantly longer to run.
+
+> [!WARNING]
+> This script in particular will use a large amount of RAM, since it loads the entire dataset into memory at once.
+> Processing may be done in batches by using the `--max-files` and `--skip-files` command-line arguments.
+
+
+#### sqlite3-compress.py
+
+This script converts database files directly into NumPy arrays in the same format as provided in the Zenodo dataset.
+This includes all columns provided by the data collection pipeline.
+
+The script can be used as follows:
+```bash
+python3 sqlite3-compress.py <INPUT PATH> <OUTPUT PATH>
+```
 
-Please note that this script in particular will use a large amount of RAM.
 
 #### Noise
````
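The `--by-id` behaviour described in the README above (keeping only the most common slices of transmitter IDs) can be sketched roughly as follows. The function `filter_most_common` and its exact selection rule are illustrative assumptions for clarity, not the script's actual internals:

```python
import numpy as np

def filter_most_common(ids_array, samples_array, percent):
    # Count occurrences of each transmitter ID and rank them,
    # most common first.
    ids, counts = np.unique(ids_array, return_counts=True)
    order = np.argsort(counts)[::-1]
    # Keep the top `percent` of distinct IDs (at least one).
    keep_n = max(1, int(len(ids) * percent / 100))
    keep_ids = ids[order[:keep_n]]
    # Select only the records whose ID survived the cut.
    mask = np.isin(ids_array, keep_ids)
    return samples_array[mask], ids_array[mask]

# Example: three transmitters with different message counts.
ids = np.array([7, 7, 7, 7, 3, 3, 9])
samples = np.arange(7)
s, i = filter_most_common(ids, samples, 40)  # keep top 40% of IDs
```

Repeating this for 10%, 20%, ..., 90% would yield the 9 datasets the option produces.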

preprocessing/db-to-np-multiple.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -10,9 +10,9 @@
 file_db = f"db-{db_index}.sqlite3"
 
 out_dir = os.path.join(path_base, "processed")
-file_samples = f"samples-{db_index}.npy"
-file_ids = f"ids-{db_index}.npy"
-file_cells = f"cells-{db_index}.npy"
+file_samples = f"samples_{db_index}.npy"
+file_ids = f"ra_sat_{db_index}.npy"
+file_cells = f"ra_cell_{db_index}.npy"
 
 db = Database(os.path.join(path_base, file_db), num_samples)
```
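This commit moves the output files from hyphenated names to the underscore `ra_sat`/`ra_cell` convention used by the Zenodo dataset. If NumPy files were already generated under the old scheme, a small helper along these lines could map them over; the mapping is taken from the diff, but the helper itself is a hypothetical sketch, not part of the repository:

```python
# Old-prefix -> new-prefix mapping, as renamed in this commit.
# The helper function is illustrative only.
RENAMES = {
    "samples-": "samples_",
    "ids-": "ra_sat_",
    "cells-": "ra_cell_",
    "magnitudes-": "magnitudes_",
    "noises-": "noises_",
    "levels-": "levels_",
    "confidences-": "confidences_",
}

def new_name(filename):
    # Translate an old-convention filename to the new convention;
    # filenames that don't match any old prefix pass through unchanged.
    for old, new in RENAMES.items():
        if filename.startswith(old):
            return new + filename[len(old):]
    return filename
```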

preprocessing/noise/db-to-np-multiple.py

Lines changed: 7 additions & 7 deletions
```diff
@@ -13,13 +13,13 @@
 
 file_db = f"db-{db_index}.sqlite3"
 
-file_samples = f"samples-{db_index}.npy"
-file_ids = f"ids-{db_index}.npy"
-file_cells = f"cells-{db_index}.npy"
-file_magnitudes = f"magnitudes-{db_index}.npy"
-file_noises = f"noises-{db_index}.npy"
-file_levels = f"levels-{db_index}.npy"
-file_confidences = f"confidences-{db_index}.npy"
+file_samples = f"samples_{db_index}.npy"
+file_ids = f"ra_sat_{db_index}.npy"
+file_cells = f"ra_cell_{db_index}.npy"
+file_magnitudes = f"magnitudes_{db_index}.npy"
+file_noises = f"noises_{db_index}.npy"
+file_levels = f"levels_{db_index}.npy"
+file_confidences = f"confidences_{db_index}.npy"
 
 db = Database(os.path.join(path_base, file_db), num_samples)
```

preprocessing/noise/np-filter.py

Lines changed: 14 additions & 14 deletions
```diff
@@ -11,13 +11,13 @@
 suffixes = ["a", "b", "c"]
 
 def save_dataset(path, suffix, samples_array, ids_array, cells_array, magnitudes_array, noises_array, levels_array, confidences_array):
-    file_samples = os.path.join(path, "samples-{}.npy".format(suffix))
-    file_ids = os.path.join(path, "ids-{}.npy".format(suffix))
-    file_cells = os.path.join(path, "cells-{}.npy".format(suffix))
-    file_magnitudes = os.path.join(path, "magnitudes-{}.npy".format(suffix))
-    file_noises = os.path.join(path, "noises-{}.npy".format(suffix))
-    file_levels = os.path.join(path, "levels-{}.npy".format(suffix))
-    file_confidences = os.path.join(path, "confidences-{}.npy".format(suffix))
+    file_samples = os.path.join(path, "samples_{}.npy".format(suffix))
+    file_ids = os.path.join(path, "ra_sat_{}.npy".format(suffix))
+    file_cells = os.path.join(path, "ra_cell_{}.npy".format(suffix))
+    file_magnitudes = os.path.join(path, "magnitudes_{}.npy".format(suffix))
+    file_noises = os.path.join(path, "noises_{}.npy".format(suffix))
+    file_levels = os.path.join(path, "levels_{}.npy".format(suffix))
+    file_confidences = os.path.join(path, "confidences_{}.npy".format(suffix))
 
     np.save(file_samples, samples_array)
     np.save(file_ids, ids_array)
@@ -31,13 +31,13 @@ def save_dataset(path, suffix, samples_array, ids_array, cells_array, magnitudes
 def process(path_in, path_out, suffix):
     print("Processing dataset {}".format(suffix))
 
-    file_samples = os.path.join(path_in, "samples-{}.npy".format(suffix))
-    file_ids = os.path.join(path_in, "ids-{}.npy".format(suffix))
-    file_cells = os.path.join(path_in, "cells-{}.npy".format(suffix))
-    file_magnitudes = os.path.join(path_in, "magnitudes-{}.npy".format(suffix))
-    file_noises = os.path.join(path_in, "noises-{}.npy".format(suffix))
-    file_levels = os.path.join(path_in, "levels-{}.npy".format(suffix))
-    file_confidences = os.path.join(path_in, "confidences-{}.npy".format(suffix))
+    file_samples = os.path.join(path_in, "samples_{}.npy".format(suffix))
+    file_ids = os.path.join(path_in, "ra_sat_{}.npy".format(suffix))
+    file_cells = os.path.join(path_in, "ra_cell_{}.npy".format(suffix))
+    file_magnitudes = os.path.join(path_in, "magnitudes_{}.npy".format(suffix))
+    file_noises = os.path.join(path_in, "noises_{}.npy".format(suffix))
+    file_levels = os.path.join(path_in, "levels_{}.npy".format(suffix))
+    file_confidences = os.path.join(path_in, "confidences_{}.npy".format(suffix))
 
     print("Loading ArrayDataset")
     ds = NoiseArrayDataset.from_files(
```

preprocessing/noise/np-to-tfrecord.py

Lines changed: 45 additions & 30 deletions
```diff
@@ -3,37 +3,20 @@
 import os
 import tensorflow as tf
 
-path_base = "/data"
-path_in = os.path.join(path_base, "filtered")
-path_out = os.path.join(path_base, "tfrecord-magnitude")
-
-suffixes = ["a", "b", "c"]
-
-chunk_size = 50000
-shuffle = True
+import argparse
 
 # Percentages to keep
-magnitude_percentages = [
-    10,
-    20,
-    30,
-    40,
-    50,
-    60,
-    70,
-    80,
-    90
-]
+magnitude_percentages = list(range(10, 100, 10))
 
 # Get a unique ID for the given id/cell pair
 def get_id_cell(sat_id, sat_cell, num_cells=63):
     return (sat_id * num_cells) + sat_cell
 
 def load_dataset(path, suffix):
-    file_samples = os.path.join(path, "samples-{}.npy".format(suffix))
-    file_ids = os.path.join(path, "ids-{}.npy".format(suffix))
-    file_cells = os.path.join(path, "cells-{}.npy".format(suffix))
-    file_magnitudes = os.path.join(path, "magnitudes-{}.npy".format(suffix))
+    file_samples = os.path.join(path, "samples_{}.npy".format(suffix))
+    file_ids = os.path.join(path, "ra_sat_{}.npy".format(suffix))
+    file_cells = os.path.join(path, "ra_cell_{}.npy".format(suffix))
+    file_magnitudes = os.path.join(path, "magnitudes_{}.npy".format(suffix))
 
     samples_array = np.load(file_samples)
     ids_array = np.load(file_ids)
@@ -64,6 +47,10 @@ def save_dataset(path, suffix, samples_array, ids_array, cells_array):
 def save_dataset_batches(path, chunk_size, samples_array, ids_array, cells_array, verbose):
     chunk_count = 0
 
+    # Create directory if it doesn't exist
+    if not os.path.exists(path):
+        os.makedirs(path)
+
     while samples_array.shape[0] >= chunk_size:
         if verbose:
             print(f"Saving chunk {chunk_count}...")
@@ -81,20 +68,31 @@ def save_dataset_batches(path, chunk_size, samples_array, ids_array, cells_array
         save_dataset(path, str(chunk_count), s, i, c)
         chunk_count += 1
 
-    if verbose:
-        print(f"Saving chunk {chunk_count}...")
-        print(f"Samples remaining: {samples_array.shape[0]}")
-    save_dataset(path, str(chunk_count), samples_array, ids_array, cells_array)
-    chunk_count += 1
+    if samples_array.shape[0] > 0:
+        if verbose:
+            print(f"Saving chunk {chunk_count}...")
+            print(f"Samples remaining: {samples_array.shape[0]}")
+        save_dataset(path, str(chunk_count), samples_array, ids_array, cells_array)
+        chunk_count += 1
 
-def process_all(chunk_size=50000, verbose=False):
+def process_all(chunk_size, path_in, path_out, max_files=None, skip_files=0, verbose=False, shuffle=True):
     samples_array = None
     ids_array = None
     cells_array = None
     magnitudes_array = None
 
     message_count = 0
 
+    # Check path_in for files of the form samples_{suffix}.npy
+    suffixes = [f for f in os.listdir(path_in) if f.startswith("samples_") and f.endswith(".npy")]
+    suffixes.sort()
+    suffixes = [f[8:-4] for f in suffixes]
+    suffixes = suffixes[skip_files:]
+    if max_files is not None:
+        suffixes = suffixes[:max_files]
+
+    if verbose:
+        print("Loading data...")
     for suffix in tqdm(suffixes, disable=not verbose):
         s, i, c, m = load_dataset(path_in, suffix)
         message_count += s.shape[0]
@@ -164,4 +162,21 @@ def process_all(chunk_size=50000, verbose=False):
     print(f"Done")
 
 if __name__ == "__main__":
-    process_all(chunk_size=chunk_size, verbose=True)
+    path_base = "/data"
+    path_in = path_base
+    path_out = os.path.join(path_base, "tfrecord")
+
+    parser = argparse.ArgumentParser(description="Process NumPy files into TFRecord datasets.")
+    parser.add_argument("--chunk-size", type=int, default=50000, help="Number of records in each chunk.")
+    parser.add_argument("--path-in", type=str, default=path_in, help="Input directory.")
+    parser.add_argument("--path-out", type=str, default=path_out, help="Output directory.")
+    parser.add_argument("--max-files", type=int, default=None, help="Maximum number of input files to process.")
+    parser.add_argument("--skip-files", type=int, default=0, help="Number of input files to skip.")
+    parser.add_argument("--no-shuffle", action='store_true', help="Do not shuffle data.")
+    parser.add_argument("-v", "--verbose", action='store_true', help="Display progress.")
+    args = parser.parse_args()
+
+    shuffle = not args.no_shuffle
+
+    process_all(args.chunk_size, args.path_in, args.path_out, args.max_files, args.skip_files, verbose=args.verbose, shuffle=shuffle)
```
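The `get_id_cell` helper retained in this diff packs a satellite ID and cell index into a single label by treating the cell as a digit in base `num_cells`, which also makes the pair recoverable with `divmod`. A brief usage sketch:

```python
# Function as it appears in preprocessing/noise/np-to-tfrecord.py:
# encodes an (id, cell) pair into one unique label, assuming at most
# num_cells cells per satellite.
def get_id_cell(sat_id, sat_cell, num_cells=63):
    return (sat_id * num_cells) + sat_cell

label = get_id_cell(4, 17)        # 4 * 63 + 17 = 269
sat_id, sat_cell = divmod(label, 63)  # recovers (4, 17)
```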

preprocessing/np-filter.py

Lines changed: 6 additions & 6 deletions
```diff
@@ -11,18 +11,18 @@
 suffixes = ["a", "b", "c"]
 
 def save_dataset(path, suffix, samples_array, ids_array, cells_array):
-    file_samples = os.path.join(path, "samples-{}.npy".format(suffix))
-    file_ids = os.path.join(path, "ids-{}.npy".format(suffix))
-    file_cells = os.path.join(path, "cells-{}.npy".format(suffix))
+    file_samples = os.path.join(path, "samples_{}.npy".format(suffix))
+    file_ids = os.path.join(path, "ra_sat_{}.npy".format(suffix))
+    file_cells = os.path.join(path, "ra_cell_{}.npy".format(suffix))
 
     np.save(file_samples, samples_array)
     np.save(file_ids, ids_array)
     np.save(file_cells, cells_array)
 
 def process(path_in, path_out, suffix):
-    file_samples = os.path.join(path_in, "samples-{}.npy".format(suffix))
-    file_ids = os.path.join(path_in, "ids-{}.npy".format(suffix))
-    file_cells = os.path.join(path_in, "cells-{}.npy".format(suffix))
+    file_samples = os.path.join(path_in, "samples_{}.npy".format(suffix))
+    file_ids = os.path.join(path_in, "ra_sat_{}.npy".format(suffix))
+    file_cells = os.path.join(path_in, "ra_cell_{}.npy".format(suffix))
 
     print("Loading ArrayDataset")
     ds = ArrayDataset.from_files(
```
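The README describes `np-filter.py` as normalizing the IQ samples and filtering out unusable data. One plausible form of such a normalization, scaling each message to unit peak magnitude, is sketched below; this is an illustrative assumption, and the script's actual normalization and filtering criteria live in the repository:

```python
import numpy as np

def normalize_iq(samples):
    # Scale each row (message) so its largest-magnitude sample is 1.
    mags = np.abs(samples).max(axis=-1, keepdims=True)
    mags[mags == 0] = 1.0  # leave all-zero rows untouched
    return samples / mags

x = np.array([[1.0, -2.0, 0.5],
              [0.0, 0.0, 0.0]])
y = normalize_iq(x)
```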
