This client provides an easy way to interact with an AIS cluster to create TensorFlow datasets.
```console
$ ./setup.sh
$ source venv/bin/activate
$ ais create bucket $BUCKET
...
Put small tars from gsutil ls gs://lpr-gtc2020 into $BUCKET
and adjust imagenet.py with your $BUCKET and objects template
...
$ python examples/imagenet_in_memory.py
```

def Dataset(bucket_name, proxy_url, conversions, selections, remote_exec)

Creates a Dataset object.
bucket_name - string - name of an AIS bucket
proxy_url - string - URL of the AIS cluster proxy
conversions - (optional) list of Conversions from tar2tf.ops. Describes transformations made on a tar-record. See the tar2tf.ops section for more.
selections - (optional) list of length 2 of Selections from tar2tf.ops. Describes how to transform a tar-record entry into a datapoint. See the tar2tf.ops section for more.
remote_exec - (optional) bool - specifies whether conversions and selections should be executed in the cluster.
If remote_exec == True but remote execution of one of the conversions is not supported, remote_exec becomes disabled.
If remote_exec is not provided, whether remote execution is possible will be detected automatically.
def load(template, **kwargs)

Transforms tars of images from AIS into a TensorFlow-compatible format.
template - string - object names of the tars. Bash range syntax like {0..10} is supported.
output_shapes - list of tf.TensorShape - resulting objects' shapes
output_types - list of tf.DType - resulting objects' types
num_workers - number - number of workers concurrently downloading objects from the AIS cluster
path - string or string generator - destination where a TFRecord file or multiple files should be saved.
If path is provided, remote execution is not enabled.
Accepted: a string, a string with a "{}" format template, or a generator.
If max_shard_size is specified, multiple file destinations might be needed.
If path is a string, default path indexing will be applied.
If path is a string with "{}", consecutive numbers starting from 1 will be substituted into path.
If path is a generator, consecutive yielded values will be used.
The paths of the generated TFRecord files are returned by load.
If path is empty or None, all operations are performed in memory or executed remotely, and a tf.data.Dataset is returned.
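The generator form of path can be sketched in plain Python. Note that shard_paths below is a hypothetical helper written for illustration, not part of the client:

```python
def shard_paths(prefix):
    """Yield consecutive TFRecord shard destinations:
    train-1.record, train-2.record, ..."""
    i = 1
    while True:
        yield f"{prefix}-{i}.record"
        i += 1

# load would draw as many destinations as it needs shards;
# here we take the first three.
gen = shard_paths("train")
first_three = [next(gen) for _ in range(3)]
```

This produces the same sequence as the "{}" string form `"train-{}.record"`, but a generator lets you place shards in arbitrary locations.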
record_to_example - (optional) function - specifies how to translate a tar record.
The argument of this function is the representation of a single tar record: a Python dict.
A tar record is an abstraction for multiple files with exactly the same path but different extensions.
The argument dict will have a __key__ entry whose value is the path to the record without an extension.
For each extension e, the dict will have an entry e whose value is the contents of the relevant file.
If the default record_to_example was used, the default_record_parser function should be used to
parse the TFRecord into the tf.Dataset interface.
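To make the tar-record abstraction concrete, here is a sketch of the dict a record_to_example function receives. The record contents and the function below are hypothetical; the exact return contract of record_to_example is not shown here, only the input shape:

```python
# A hypothetical tar record for the files train/0001.jpg and train/0001.cls:
tar_record = {
    "__key__": "train/0001",      # path to the record, without an extension
    "jpg": b"<raw image bytes>",  # contents of train/0001.jpg
    "cls": b"3",                  # contents of train/0001.cls
}

# A custom record_to_example-style function might pick out a value and a label:
def my_record_to_example(record):
    return record["jpg"], int(record["cls"])

value, label = my_record_to_example(tar_record)  # label == 3
```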
The ops module is used to describe the tar-record to datapoint transformation.
Conversions are transformations applied to each tar record.
tar2tf.ops.Convert(ext_name, dst_type)
Converts the inner type of the ext_name image entry into dst_type.
Remote execution supported.
tar2tf.ops.Decode(ext_name)
Decodes an image in BMP, JPEG, or PNG format. Fails for other formats.
Remote execution supported.
tar2tf.ops.Resize(ext_name, dst_size)
Resizes the ext_name image to the new size dst_size.
Remote execution supported.
tar2tf.ops.Rotate(ext_name, [angle])
Rotates the ext_name image angle degrees clockwise. If angle == 0 or not provided, a random rotation is applied.
Remote execution supported.
tar2tf.ops.Func(f)
The most versatile operation in tar2tf.ops. Takes a function f and calls it with the tar record.
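A function passed to Func receives the tar-record dict described above. The function below is a hypothetical example, assuming a record layout with a "cls" entry holding label bytes:

```python
# Hypothetical function for Func: parse the "cls" bytes into an int label
# and return the modified tar record.
def f(tar_record):
    tar_record["cls"] = int(tar_record["cls"])
    return tar_record

record = {"__key__": "train/0001", "jpg": b"<raw image bytes>", "cls": b"7"}
out = f(record)  # out["cls"] == 7
```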
Selections select entries from a tar record to be either values or labels in the dataset.
tar2tf.ops.Select(ext_name)
The simplest of tar2tf.ops. Returns the value from the tar record under the ext_name key.
tar2tf.ops.SelectJSON(ext_name, nested_path)
Similar to Select, but able to extract a deeply nested value from JSON.
nested_path can be either a string/int (for first-level values) or a list of strings/ints (for deeply nested values).
Reads the value under ext_name, treats it as JSON, and returns the value under nested_path.
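The lookup SelectJSON performs can be sketched in plain Python. select_json below is an illustrative stand-in, not the client's implementation:

```python
import json

def select_json(record, ext_name, nested_path):
    """Sketch of SelectJSON's lookup: parse record[ext_name] as JSON,
    then walk nested_path (a single key/index, or a list of them)."""
    value = json.loads(record[ext_name])
    if not isinstance(nested_path, list):
        nested_path = [nested_path]
    for key in nested_path:
        value = value[key]
    return value

record = {"json": '{"labels": [{"name": "cat"}, {"name": "dog"}]}'}
result = select_json(record, "json", ["labels", 1, "name"])  # "dog"
```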
tar2tf.ops.SelectList(list of Selection)
Returns an object which is a list of the provided Selections.
tar2tf.ops.SelectDict(dict of Selection)
Returns an object which is a dict of the provided Selections.
Create a TensorFlow dataset with conversions and selections executed remotely in the cluster:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL, [Decode("jpg"), Resize("jpg", (32, 32))], ["jpg", "cls"])
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
    remote_exec=True,
).shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

The same, letting the client detect automatically whether remote execution is possible:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL, [Decode("jpg"), Resize("jpg", (32, 32))], ["jpg", "cls"])
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
).shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create an in-memory TensorFlow dataset:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
train_dataset = dataset.load("train-{0..3}.tar.gz").shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create an in-memory TensorFlow dataset with multiple download workers and remote execution disabled:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
train_dataset = dataset.load(
    "train-{0..3}.tar.gz",
    num_workers=4,
    remote_exec=False,
).shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load(
    "train-{4..7}.tar.gz",
    num_workers=4,
    remote_exec=False,
).batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create a TensorFlow dataset with an intermediate TFRecord file stored in the filesystem:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
records = dataset.load(
    "train-{0..3}.tar.gz",
    path="train.record",
)
train_dataset = (
    tf.data.TFRecordDataset(filenames=records)
    .map(default_record_parser)
    .shuffle(buffer_size=1024)
    .batch(BATCH_SIZE)
)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create a TensorFlow dataset with intermediate TFRecord files stored in the filesystem, with a limited TFRecord shard size:

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL)
filenames = dataset.load(
    "train-{0..3}.tar.gz",
    path="train-{}.record",
    max_shard_size="100MB",
)
train_dataset = (
    tf.data.TFRecordDataset(filenames=filenames)
    .map(default_record_parser)
    .shuffle(buffer_size=1024)
    .batch(BATCH_SIZE)
)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

Create an in-memory TensorFlow dataset that decodes and resizes "jpg", applies a function f, and takes the datapoint value from "jpg" and the label from "cls":

```python
dataset = Dataset(BUCKET_NAME, PROXY_URL, [Decode("jpg"), Resize("jpg", (32, 32)), Func(f)], ["jpg", "cls"])
train_dataset = dataset.load("train-{0..3}.tar.gz").shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```

The same, with the Dataset arguments laid out one per line:

```python
dataset = Dataset(
    BUCKET_NAME,
    PROXY_URL,
    [Decode("jpg"), Resize("jpg", (32, 32)), Func(f)],
    ["jpg", "cls"],
)
train_dataset = dataset.load("train-{0..3}.tar.gz").shuffle().batch(BATCH_SIZE)
test_dataset = dataset.load("train-{4..7}.tar.gz").batch(BATCH_SIZE)
# ...
model.fit(train_dataset, epochs=EPOCHS)
```