This client provides an easy way to interact with an AIS cluster to create TensorFlow datasets.
```console
$ ./setup.sh
$ source venv/bin/activate
$ ais create bucket $BUCKET
...
Put small tars from gsutil ls gs://lpr-gtc2020 into $BUCKET
and adjust imagenet.py with your $BUCKET and objects template
...
$ python examples/imagenet_in_memory.py
```

def Dataset(bucket_name, proxy_url, conversions, selections, remote_exec)

Creates a Dataset object.
bucket_name - string - name of an AIS bucket
proxy_url - string - URL of the AIS cluster proxy
conversions - (optional) list of Conversions from tar2tf.ops. Describes transformations made on a tar-record. See the tar2tf.ops section for more.
selections - (optional) list of length 2 of Selections from tar2tf.ops. Describes how to transform a tar-record entry into a datapoint. See the tar2tf.ops section for more.
remote_exec - (optional) bool - specifies whether conversions and selections should be executed in the cluster.
If remote_exec == True but remote execution of one of the conversions is not supported, remote_exec becomes disabled.
If remote_exec is not provided, whether remote execution is possible will be detected automatically.
def load(template, **kwargs)

Transforms tars of images from AIS into a TensorFlow-compatible format.
template - string - object names of the tars. Bash range syntax like {0..10} is supported.
output_shapes - list of tf.TensorShape - resulting objects' shapes
output_types - list of tf.DType - resulting objects' types
num_workers - number - number of workers concurrently downloading objects from the AIS cluster
path - string or string generator - destination where a TFRecord file or multiple files should be saved.
If path is provided, remote execution is not enabled.
Accepted: a string, a string with a "{}" format template, or a generator.
If max_shard_size is specified, multiple file destinations might be needed.
If path is a string, default path indexing will be applied.
If path is a string with "{}", consecutive numbers starting from 1 will be substituted into path.
If path is a generator, consecutive yielded values will be used.
The paths of the generated TFRecord files are returned by load.
If path is empty or None, all operations are performed in memory or executed remotely, and a tf.data.Dataset is returned.
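The generator form of path can be sketched in plain Python. Note that shard_paths below is a hypothetical helper written for illustration, not part of the client:

```python
def shard_paths(prefix):
    """Yield consecutive TFRecord shard destinations:
    train-1.record, train-2.record, ..."""
    i = 1
    while True:
        yield f"{prefix}-{i}.record"
        i += 1

# load would draw as many destinations as it needs shards;
# here we take the first three.
gen = shard_paths("train")
first_three = [next(gen) for _ in range(3)]
```

This produces the same sequence as the "{}" string form `"train-{}.record"`, but a generator lets you place shards in arbitrary locations.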
record_to_example - (optional) function - specifies how to translate a tar record.
The argument of this function is the representation of a single tar record: a Python dict.
A tar record is an abstraction for multiple files with exactly the same path but different extensions.
The argument dict will have a __key__ entry whose value is the path to the record without an extension.
For each extension e, the dict will have an entry e whose value is the contents of the relevant file.
If the default record_to_example was used, the default_record_parser function should be used to
parse the TFRecord into the tf.Dataset interface.
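To make the tar-record abstraction concrete, here is a sketch of the dict a record_to_example function receives. The record contents and the function below are hypothetical; the exact return contract of record_to_example is not shown here, only the input shape:

```python
# A hypothetical tar record for the files train/0001.jpg and train/0001.cls:
tar_record = {
    "__key__": "train/0001",      # path to the record, without an extension
    "jpg": b"<raw image bytes>",  # contents of train/0001.jpg
    "cls": b"3",                  # contents of train/0001.cls
}

# A custom record_to_example-style function might pick out a value and a label:
def my_record_to_example(record):
    return record["jpg"], int(record["cls"])

value, label = my_record_to_example(tar_record)  # label == 3
```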
The ops module is used to describe the tar-record to datapoint transformation.
Conversions are transformations applied to each tar record.
tar2tf.ops.Convert(ext_name, dst_type)
Converts the inner type of the ext_name image entry into dst_type.
Remote execution supported.
tar2tf.ops.Decode(ext_name)
Decodes an image in BMP, JPEG, or PNG format. Fails for other formats.
Remote execution supported.
tar2tf.ops.Resize(ext_name, dst_size)
Resizes the ext_name image to the new size dst_size.
Remote execution supported.
tar2tf.ops.Rotate(ext_name, [angle])
Rotates the ext_name image angle degrees clockwise. If angle == 0 or not provided, a random rotation is applied.
Remote execution supported.
tar2tf.ops.Func(f)
The most versatile operation in tar2tf.ops. Takes a function f and calls it with the tar record.
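A function passed to Func receives the tar-record dict described above. The function below is a hypothetical example, assuming a record layout with a "cls" entry holding label bytes:

```python
# Hypothetical function for Func: parse the "cls" bytes into an int label
# and return the modified tar record.
def f(tar_record):
    tar_record["cls"] = int(tar_record["cls"])
    return tar_record

record = {"__key__": "train/0001", "jpg": b"<raw image bytes>", "cls": b"7"}
out = f(record)  # out["cls"] == 7
```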
Selections select entries from a tar record to be either values or labels in the dataset.
tar2tf.ops.Select(ext_name)
The simplest of tar2tf.ops. Returns the value from the tar record under the ext_name key.
tar2tf.ops.SelectJSON(ext_name, nested_path)
Similar to Select, but able to extract a deeply nested value from JSON.
nested_path can be either a string/int (for first-level values) or a list of strings/ints (for deeply nested values).
Reads the value under ext_name, treats it as JSON, and returns the value under nested_path.
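The lookup SelectJSON performs can be sketched in plain Python. select_json below is an illustrative stand-in, not the client's implementation:

```python
import json

def select_json(record, ext_name, nested_path):
    """Sketch of SelectJSON's lookup: parse record[ext_name] as JSON,
    then walk nested_path (a single key/index, or a list of them)."""
    value = json.loads(record[ext_name])
    if not isinstance(nested_path, list):
        nested_path = [nested_path]
    for key in nested_path:
        value = value[key]
    return value

record = {"json": '{"labels": [{"name": "cat"}, {"name": "dog"}]}'}
result = select_json(record, "json", ["labels", 1, "name"])  # "dog"
```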
tar2tf.ops.SelectList(list of Selection)
Returns an object which is a list of the provided Selections.
tar2tf.ops.SelectDict(dict of Selection)
Returns an object which is a dict of the provided Selections.
Create a TensorFlow dataset with conversions and selections executed remotely in the cluster:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL, [Decode("jpg"), Resize("jpg", (32, 32))], ["jpg", "cls"])
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
    remote_exec=True,
).shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

The same, letting the client detect automatically whether remote execution is possible:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL, [Decode("jpg"), Resize("jpg", (32, 32))], ["jpg", "cls"])
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
).shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create an in-memory TensorFlow dataset:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
train_dataset = dataset.load("train-{0..3}.tar.gz").shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create an in-memory TensorFlow dataset with multiple download workers and remote execution disabled:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
    num_workers=4,
    remote_exec=False,
).shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load(
    "train-{4..7}.tar.gz",
    num_workers=4,
    remote_exec=False,
).batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create a TensorFlow dataset with an intermediate TFRecord file stored in the filesystem:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
records = dataset.load(
    "train-{0..3}.tar.gz",
    path="train.record",
)
train_dataset = (
    tf.data.TFRecordDataset(filenames=records)
    .map(default_record_parser)
    .shuffle(buffer_size=1024)
    .batch(BATCH_SIZE)
)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create a TensorFlow dataset with intermediate TFRecord files stored in the filesystem, with a limited TFRecord shard size:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
filenames = dataset.load(
    "train-{0..3}.tar.gz",
    path="train-{}.record",
    max_shard_size="100MB",
)
train_dataset = (
    tf.data.TFRecordDataset(filenames=filenames)
    .map(default_record_parser)
    .shuffle(buffer_size=1024)
    .batch(BATCH_SIZE)
)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create an in-memory TensorFlow dataset that decodes and resizes "jpg", applies a function f, and takes the datapoint value from "jpg" and the label from "cls":

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL, [Decode("jpg"), Resize("jpg", (32, 32)), Func(f)], ["jpg", "cls"])
train_dataset = dataset.load("train-{0..3}.tar.gz").shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

The same, with the Dataset arguments laid out one per line:

```python
dataset = Dataset(
    BUCKET_NAME,
    PROXY_URL,
    [Decode("jpg"), Resize("jpg", (32, 32)), Func(f)],
    ["jpg", "cls"],
)
train_dataset = dataset.load("train-{0..3}.tar.gz").shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```