Fixes swapping of data and feature dimension to work in the general case. (#214)

kshiteejm · web-flow · commit 0f24e63167ac · 2023-04-12T07:19:06.000-07:00
The previous implementation was broken because `np.transpose(data_list, [1, 0])` assumes that `data_list` is a 2D array.

However, when all the feature value arrays have the same length, `data_list` can be a 3D array, because the call to
`data_list = np.array(list(dataset.as_numpy_iterator()), dtype=object)`
merges the inner np arrays and converts `data_list` into one big 3D array.
`swapaxes(0, 1)` exchanges only the first two axes, so it works for both the 2D and the 3D case.
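
A minimal sketch of the failure mode, using made-up records in place of the real TFRecord dataset (the names `ragged_records` and `uniform_records` are hypothetical and only serve the illustration):

```python
import numpy as np

# Three records, two features each. Feature arrays differ in length,
# so NumPy keeps them as objects inside a 2D (records, features) array.
ragged_records = [
    [np.array([1, 2]), np.array([3])],
    [np.array([4]), np.array([5, 6, 7])],
    [np.array([8, 9]), np.array([10])],
]
ragged = np.array(ragged_records, dtype=object)

# Same layout, but every feature array has length 2, so NumPy merges
# the inner arrays and produces a 3D (records, features, values) array.
uniform_records = [
    [np.array([1, 2]), np.array([3, 4])],
    [np.array([5, 6]), np.array([7, 8])],
    [np.array([9, 10]), np.array([11, 12])],
]
uniform = np.array(uniform_records, dtype=object)

# The old call only names two axes, so it fails on the 3D array.
try:
    np.transpose(uniform, [1, 0])
    transpose_failed = False
except ValueError:
    transpose_failed = True

# swapaxes exchanges axes 0 and 1 and leaves any further axes alone.
ragged_swapped = ragged.swapaxes(0, 1)
uniform_swapped = uniform.swapaxes(0, 1)
```

In both cases the leading axis after the swap indexes features rather than records, which is the layout the old transpose was meant to produce.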
diff --git a/compiler_opt/tools/sparse_bucket_generator.py b/compiler_opt/tools/sparse_bucket_generator.py
@@ -170,7 +170,7 @@ def main(_) -> None:
   parser_fn = create_tfrecord_parser_fn(sequence_features)
   dataset = dataset.map(parser_fn, num_parallel_calls=tf.data.AUTOTUNE)
   data_list = np.array(list(dataset.as_numpy_iterator()), dtype=object)
-  data_list = np.transpose(data_list, [1, 0])
+  data_list = data_list.swapaxes(0, 1)
 
   with mp.Pool(FLAGS.parallelism) as pool:
     feature_names = list(sorted(sequence_features))