README.md: 21 additions & 1 deletion
@@ -201,7 +201,27 @@ Be careful using this option, as it creates a much larger number of files, and t
  > [!WARNING]
  > This script in particular will use a large amount of RAM, since it loads the entire dataset into memory at once.
- > Processing may be done in batches by using the `--max-files` and `--skip-files` command-line arguments.
+ > Processing may be done in batches by using the `--max-files` and `--skip-files` command-line arguments, or the script below.
+
+ ##### np-to-tfrecord-parallel.sh
+
+ This script can run multiple instances of `np-to-tfrecord.py` in parallel, allowing preprocessing to be sped up and/or less RAM to be used.
+
+ Usage:
+ ```bash
+ np-to-tfrecord-parallel.sh <NUM PROCESSES> <FILES PER PROCESS> <INPUT PATH> <OUTPUT PATH>
+ ```
+ Where:
+ - `INPUT PATH` contains your `.npy` files, as above.
+ - `OUTPUT PATH` is the desired output directory.
+ - `NUM PROCESSES` is the number of CPU cores to use.
+ - `FILES PER PROCESS` is the number of files each process should load at once.
+
+ Ensure that `NUM_PROCESSES * FILES_PER_PROCESS` input files can fit comfortably in RAM.
+
+ > [!NOTE]
+ > Shuffling is disabled by default in this script; if shuffled data is desired, remove the `--no-shuffle` flag from the script.
+ > Even with the flag removed, shuffling is only done per process: each process shuffles the files it has loaded, but the dataset as a whole is not shuffled.
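The wrapper script itself is not included in this diff. A minimal dry-run sketch of how such a driver could batch the work with `--max-files` and `--skip-files` is below; the flag names come from the diff, but the loop structure, the `in/`/`out/` paths, and the launch-then-`wait` strategy are assumptions, not the actual contents of `np-to-tfrecord-parallel.sh`:

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the np-to-tfrecord.py invocations a parallel driver
# could launch, batching with --max-files / --skip-files. A real script would
# run these commands in the background and `wait` after each group of
# num_procs jobs, bounding RAM to num_procs * files_per_proc input files.
num_procs=4          # <NUM PROCESSES>      (example value)
files_per_proc=8     # <FILES PER PROCESS>  (example value)
total_files=20       # in a real run: count of .npy files in <INPUT PATH>

skip=0
batch=0
while [ "$skip" -lt "$total_files" ]; do
  echo "python np-to-tfrecord.py in/ out/ --max-files $files_per_proc --skip-files $skip --no-shuffle &"
  skip=$((skip + files_per_proc))
  batch=$((batch + 1))
  # After num_procs background jobs, block until they finish.
  if [ $((batch % num_procs)) -eq 0 ]; then echo "wait"; fi
done
echo "wait"
```

With 20 input files and 8 files per process, this prints three invocations (skipping 0, 8, and 16 files) followed by a final `wait`.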
  parser.add_argument("--max-files", type=int, default=None, help="Maximum number of input files to process.")
- parser.add_argument("--skip-files", type=int, default=0, help="Number of input files to skip.")
+ parser.add_argument("--skip-files", type=int, default=None, help="Number of input files to skip.")
  parser.add_argument("--no-shuffle", action='store_true', help="Do not shuffle data.")
  parser.add_argument("--by-id", action='store_true', help="Create datasets with different percentages of the most common IDs. WARNING: This will create a lot of datasets, and take a long time!")
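The changed `--skip-files` default can be exercised in isolation. This standalone snippet mirrors the argument definitions from the diff; the parser description and the rest of the CLI are assumptions:

```python
import argparse

# Mirror of the argument definitions touched by this diff; only these three
# options are reproduced, the surrounding CLI is assumed.
parser = argparse.ArgumentParser(description="np-to-tfrecord preprocessing (excerpt)")
parser.add_argument("--max-files", type=int, default=None, help="Maximum number of input files to process.")
parser.add_argument("--skip-files", type=int, default=None, help="Number of input files to skip.")
parser.add_argument("--no-shuffle", action='store_true', help="Do not shuffle data.")

# Example batch invocation, as in the README's batching advice.
args = parser.parse_args(["--max-files", "8", "--skip-files", "16", "--no-shuffle"])
print(args.max_files, args.skip_files, args.no_shuffle)  # → 8 16 True
```

After this change, an omitted `--skip-files` yields `None` rather than `0`, so downstream code can distinguish "not given" from "skip zero files".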