This document covers two key components of machine learning data operations:
- Data Caching - for optimizing data access and reducing compute overhead.
- Data Lineage - for providing observability and repeatability of pipeline runs and experiments with regard to the data they use.
The techniques and guidelines mentioned here focus on the domain of computer vision, but may have applications in other machine learning use cases.
Running Machine Learning pipelines and experiments over large datasets on GPU requires paying attention to resource bottlenecks that can reduce utilization and increase overall cost.
When using Azure Machine Learning with a GPU compute cluster and data in Azure Storage, the connection between virtual machines and storage accounts may quickly become a factor that slows down training pipelines, effectively making expensive GPU nodes wait for data to be delivered. This is why we recommend using Premium Storage Accounts as primary datastores for running Azure Machine Learning pipelines.
This strategy is particularly effective in training computer vision models because large amounts of image data can be stored cost effectively in Azure Blob Storage, and archived in cool tiers over time. Meanwhile, prior to running Azure Machine Learning pipelines, data can be copied to Premium storage for fast access by the training cluster.
When identifying files to be copied between storages for caching, we usually deal with one of two cases:
- Whole dataset transfer from the long-term storage location to the cache location.
- Partial transfer based on a list of files to copy.
The second case arises when the actual set of files to use is a significantly smaller part of the dataset, which is typical for specific experiment pipelines or integration tests. While it may optimize premium storage utilization and caching time, maintaining lists of files to copy requires additional investment, such as annotations processing. For example, if annotations are stored in Pascal VOC format, there may be a code component that builds a list of labeled images so that only these files are cached.
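For the partial-transfer case, here is a minimal sketch of building the copy list from Pascal VOC annotations, assuming one `.xml` annotation file per labeled image with the image name in the `filename` element (the function name is illustrative):

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def build_copy_list(annotations_dir: str) -> list:
    """Collect image file names referenced by Pascal VOC annotation files,
    so that only labeled images are copied to the cache."""
    files = set()
    for xml_path in Path(annotations_dir).glob("*.xml"):
        root = ET.parse(xml_path).getroot()
        # Pascal VOC stores the image name in the <filename> element
        filename = root.findtext("filename")
        if filename:
            files.add(filename)
    return sorted(files)
```

The resulting list can then drive a per-file copy to the cache container instead of a whole-dataset transfer.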
If images are mostly shared between pipelines and experiments in a single cache storage, caching the whole dataset is the optimal choice, as it doesn't add complexity to data management.
Model reproducibility is a critical component of production-grade machine learning pipelines. Without it, data scientists risk having limited visibility into what causes changes in model performance. The variability may appear to be caused by an adjustment of one parameter, but may actually be caused by hidden sources of randomness. Reproducibility reduces or eliminates variations when rerunning failed jobs or prior experiments, making it essential in the context of fault tolerance and iterative refinement of models. Running ML pipelines in the cloud across multiple compute nodes not only multiplies the sources of non-determinism but also increases the need for both fault tolerance and iterative model development.
There are five major components that define a machine learning model:
- Model architecture
- Model hyperparameters
- Training code
- Data augmentation code and parameters
- Data and annotations
The first four components are tracked between the pipeline's Git repository and Azure Machine Learning tracking storage. Therefore, it's relatively easy to achieve model reproducibility up to the point of which data was used (as long as sources of randomness are parameterized as well).
Data and annotations are much harder to track. There are tools and processes to maintain data versioning that can provide a great level of observability on data. However, with Azure Machine Learning pipelines, the same level of observability may be achieved with significantly lower effort. For example, here's what we can do when training computer vision models.
- Maintain snapshots of annotation files as they are expected to change. Every pipeline run has a snapshot of annotation files in the run's output storage.
- Maintain a single shared cache of data files (images) as they are expected to stay unmodified once added to the storage.
- Model training code uses dataset annotation data to get references to data files (images). As data files are added over time, previous experiments are still fully reproducible based on the stored annotation snapshots.
Note that image preprocessing is assumed to happen in a separate step. Raw images are stored in a separate location under their own retention policy. The preprocessed images are stored in long-term storage, where the following strategy applies.
For computer vision models trained on Azure Machine Learning pipelines, we recommend the following strategy.
- Every pipeline run performs differential caching of the entire set of data files (images) in a Premium Blob Storage container (cache container).
- Every pipeline run takes a snapshot of dataset annotations/labels and stores it as the pipeline run's output. Each snapshot is attributed with a unique snapshot name.
- Pipeline runs must support two modes: running on the latest data, and reproduction mode.
- When running on the latest data, a new snapshot of annotations/labels should be copied to the cache container, to a run-specific location defined by the snapshot name. The same files are uploaded to the run's output folder.
- When running in the reproduction mode, annotations/labels are downloaded from the output storage of the run that is being reproduced. The location of the snapshot in the cache container is defined by the snapshot name of the run that is being reproduced.
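The snapshot-name resolution behind the two modes can be sketched as follows, assuming the snapshot name is recorded in the run's tags under a hypothetical `snapshot_name` key (the function name and tag key are illustrative):

```python
import uuid

# Hypothetical tag key; the actual key name is an implementation choice.
SNAPSHOT_TAG = "snapshot_name"

def resolve_snapshot_name(reproduced_run=None) -> str:
    """Return the snapshot name for this pipeline run.

    - Latest-data mode: generate a fresh unique name.
    - Reproduction mode: reuse the snapshot name recorded in the tags
      of the run that is being reproduced (azureml.core.Run.get_tags()).
    """
    if reproduced_run is not None:
        return reproduced_run.get_tags()[SNAPSHOT_TAG]
    return uuid.uuid4().hex.upper()
```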
Different locations (prefixes) for the shared cache and the annotation snapshots in a Premium Blob Storage container should be used so that different lifecycle management policies (TTL) can be configured: annotation snapshots don't need to be preserved as long as the shared cache.
Computer vision pipelines on Azure Machine Learning require transferring data from long-term storage in Azure Blob to a Premium Azure Blob container used as a cache. There are four major options for performing such a data transfer in Azure ML pipelines:
- PythonScriptStep with custom code on top of the Azure Storage SDK or AzCopy.
- Durable Azure Functions with custom code on top of the Azure Storage SDK or AzCopy.
- DataTransferStep, a specialized Azure Machine Learning step that utilizes Azure Data Factory (Copy Activity).
- PythonScriptStep directly initiating an Azure Data Factory pipeline with Copy Activity.
| Option | Pros | Cons |
|---|---|---|
| PythonScriptStep with AzCopy | Fast and efficient, provides required features | Storage credentials must be exposed to the step (directly, via Key Vault, or to AML compute's identity) |
| Durable Functions | Fast and efficient, provides almost all required features, no need to expose storage credentials to pipelines | Significantly increases run orchestration complexity, adds manageability overhead, no access to AML run output |
| DataTransferStep | Lowest manageability overhead, storage credentials are managed within AML Datastores | Doesn't support incremental copying |
| Azure Data Factory | No need to expose storage credentials to pipelines | Same as DataTransferStep, plus manageability overhead |
Option "PythonScriptStep with AzCopy" is a clear winner for the following reasons:
- It offers greater flexibility, and AzCopy provides a highly efficient data transfer mechanism.
- It runs within AML pipelines, and has access to all pipeline's and run's subsystems.
- Storage credentials exposure can be mitigated (see below).
When a certain code component running within an AML pipeline requires direct access to Azure Storage, there are multiple ways to provide it:
- Pass Account Key or SAS-token as AML Pipeline Parameters (not recommended).
- Store storage access credentials in the AML workspace's Key Vault.
- Isolate code that requires direct storage access into a separate PythonScriptStep (ParallelRunStep) running on an AML compute cluster with a managed identity. Such an identity must be given appropriate access to the storage resources. This option should be used once Managed Identity support in Azure ML reaches General Availability status.
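For the Key Vault option, a step can read the secret through the workspace's default Key Vault, using `azureml-core`'s `Workspace.get_default_keyvault()` and `Keyvault.get_secret()`. A minimal sketch (the helper name is an assumption; inside a PythonScriptStep the `run` object would come from `Run.get_context()`):

```python
def get_storage_credential(run, secret_name: str) -> str:
    """Read a storage account key / SAS token from the workspace Key Vault.

    `run` is an azureml.core.Run; the secret is expected to be
    pre-provisioned in the workspace's default Key Vault under
    `secret_name` (e.g. the cache datastore's SAS token).
    """
    keyvault = run.experiment.workspace.get_default_keyvault()
    return keyvault.get_secret(secret_name)
```

This keeps credentials out of pipeline parameters and run history.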
Data caching and lineage can be implemented for computer vision as a PythonScriptStep configured with the following variables:
- `dataset_prefix_name` - a dataset-specific name that uniquely identifies a dataset-specific location in the cache container.
- `source_datastore_name` - the source data store, an Azure ML Datastore (Blob, ADLS Gen2, or File Shares) where the dataset is stored long-term.
- `source_datastore_data_path` - a directory/prefix in the source data store where the data files (images) are stored.
- `source_datastore_annotations_path` - a directory/prefix in the source data store where annotation files are stored.
- `cache_datastore_name` - the cache data store, an Azure ML Datastore configured on a premium storage container in Blob, ADLS Gen2, or File Shares.
- `cache_data_dir_name` - name of the last component of the data cache path (e.g. `data`).
- `cache_annotations_dir_name` - name of the last component of the annotation snapshot cache path (e.g. `annotations`).
Additional variables are expected to be assigned within the pipeline configuration process:
- `snapshot_name` - a unique name generated for a run. If reproducing another run, this name should be taken from the tags of the run that is being reproduced.
- `source_datastore_url` - URL of the source datastore container.
- `source_datastore_secret_name` - name of the workspace Key Vault secret where the source datastore account key or SAS token is stored.
- `cache_datastore_url` - URL of the cache datastore container.
- `cache_datastore_secret_name` - name of the workspace Key Vault secret where the cache datastore account key or SAS token is stored.
- `cache_datastore_data_path` - built as `{source_datastore_name}/{dataset_prefix_name}/{cache_data_dir_name}`
- `cache_datastore_annotations_path` - built as `{source_datastore_name}/{dataset_prefix_name}/{cache_annotations_dir_name}/{snapshot_name}`
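The derived cache paths can be built from these variables; a minimal sketch (the function name and default directory names are illustrative):

```python
import uuid
from typing import Optional

def build_cache_paths(source_datastore_name: str,
                      dataset_prefix_name: str,
                      cache_data_dir_name: str = "data",
                      cache_annotations_dir_name: str = "annotations",
                      snapshot_name: Optional[str] = None) -> dict:
    """Derive the run-specific cache locations from the pipeline variables.

    A fresh snapshot_name is generated unless one is supplied (e.g. taken
    from the tags of a run that is being reproduced).
    """
    snapshot_name = snapshot_name or uuid.uuid4().hex.upper()
    base = f"{source_datastore_name}/{dataset_prefix_name}"
    return {
        "snapshot_name": snapshot_name,
        "cache_datastore_data_path": f"{base}/{cache_data_dir_name}",
        "cache_datastore_annotations_path":
            f"{base}/{cache_annotations_dir_name}/{snapshot_name}",
    }
```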
The annotation snapshot is always downloaded to a temporary directory `annotations_local_dir`.

If the run is reproducing another run, the snapshot of annotations is downloaded to `annotations_local_dir` from the output folder of the run that is being reproduced (using the Azure ML `Run.download_files` function). It is then copied to the snapshot location in the cache datastore:

```shell
azcopy cp "{annotations_local_dir}/*" "{cache_datastore_url}/{cache_datastore_annotations_path}"
```

If this is a new run, the annotation snapshot is made by copying directly from the source datastore. At the same time, annotations are also downloaded to `annotations_local_dir` so they can be uploaded to the run's output later:

```shell
azcopy cp "{source_datastore_url}/{source_datastore_annotations_path}/*" "{cache_datastore_url}/{cache_datastore_annotations_path}"
azcopy cp "{cache_datastore_url}/{cache_datastore_annotations_path}/*" "{annotations_local_dir}"
```

Next, the data files (images) are differentially copied to the cache. The `overwrite` option of AzCopy is set to `ifSourceNewer`, which makes the process much faster when files in the cache are not older than the ones in the source datastore:

```shell
azcopy cp "{source_datastore_url}/{source_datastore_data_path}/*" "{cache_datastore_url}/{cache_datastore_data_path}" --overwrite=ifSourceNewer
```

Finally, the Azure ML `Run.upload_folder` function is used to upload the annotation snapshot to the current run's output folder, constructed as `{DATAPREP_PREFIX}.{dataset_prefix_name}.{_ANNOTATIONS_COPY_NAME}`, where:

- `DATAPREP_PREFIX` is `mldl.dataprep`
- `_ANNOTATIONS_COPY_NAME` is `annotations`
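The AzCopy copy operations can be invoked from the PythonScriptStep via `subprocess`; a sketch, assuming AzCopy is available on the compute node and the URLs already carry SAS tokens (the function name and `dry_run` flag are illustrative):

```python
import subprocess

def azcopy_cp(source: str, destination: str,
              overwrite: str = "true", dry_run: bool = False) -> list:
    """Build and optionally run an `azcopy cp` command.

    overwrite="ifSourceNewer" gives the differential behavior used for
    the shared data cache.
    """
    cmd = ["azcopy", "cp", source, destination,
           "--recursive", f"--overwrite={overwrite}"]
    if not dry_run:
        # check=True raises CalledProcessError if AzCopy reports a failure
        subprocess.run(cmd, check=True)
    return cmd
```

For the data cache, the call would pass the source data prefix and the cache container URL with `overwrite="ifSourceNewer"`.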
The following final cache structure will be provisioned in the Premium storage cache container.
```
[{source_datastore_name}]
└── [{dataset_prefix_name}]
    ├── [{cache_data_dir_name}]
    └── [{cache_annotations_dir_name}]
        └── [{snapshot_name}]
```
As an example, let's use the following values:
- `source_datastore_name` is `longterm-datastore`
- `dataset_prefix_name` is `coco` (any name can be assigned here)
- `cache_data_dir_name` is `data`
- `cache_annotations_dir_name` is `annotations`
- `snapshot_name` is `9D299F782E35485F9A4B86A8EA0A93B7` (a generated random value)
The cache structure would be the following:
```
[longterm-datastore]
└── [coco]
    ├── [data]
    └── [annotations]
        └── [9D299F782E35485F9A4B86A8EA0A93B7]
```
Based on the cache structure, data and annotations must be passed to training steps as two separate Azure ML File Datasets in the mount mode:
- File dataset made on `{source_datastore_name}/{dataset_prefix_name}/{cache_data_dir_name}`
- File dataset made on `{source_datastore_name}/{dataset_prefix_name}/{cache_annotations_dir_name}/{snapshot_name}`
Using the same example, it would be:
- File dataset made on `longterm-datastore/coco/data`
- File dataset made on `longterm-datastore/coco/annotations/9D299F782E35485F9A4B86A8EA0A93B7`
When passed to a training step as `as_mount()` results, they are resolved into two parameters containing filesystem paths where the corresponding cache storage locations are mounted by the Azure ML compute engine.
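As a sketch of how the two File Datasets could be created with `azureml-core` (`Dataset.File.from_files` and `as_named_input(...).as_mount()` are the SDK calls involved; the helper names and input names are assumptions):

```python
from typing import Tuple

def dataset_paths(source_datastore_name: str, dataset_prefix_name: str,
                  snapshot_name: str,
                  cache_data_dir_name: str = "data",
                  cache_annotations_dir_name: str = "annotations") -> Tuple[str, str]:
    """Paths, relative to the cache datastore, for the two File Datasets."""
    base = f"{source_datastore_name}/{dataset_prefix_name}"
    return (f"{base}/{cache_data_dir_name}",
            f"{base}/{cache_annotations_dir_name}/{snapshot_name}")

def make_mount_inputs(workspace, cache_datastore_name: str,
                      data_path: str, annotations_path: str):
    """Create the two File Datasets as mount inputs (requires azureml-core)."""
    from azureml.core import Datastore, Dataset
    cache = Datastore.get(workspace, cache_datastore_name)
    data_ds = Dataset.File.from_files(path=(cache, data_path))
    ann_ds = Dataset.File.from_files(path=(cache, annotations_path))
    return (data_ds.as_named_input("images").as_mount(),
            ann_ds.as_named_input("annotations").as_mount())
```

The two returned objects are then passed as arguments of the training PythonScriptStep.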
In machine learning pipelines, the same data files (images) are used across multiple datasets and ML models. At the same time, we may have multiple different data collection and labeling processes that include pre-processing, merging, and structuring the storage for different purposes. In order to make the process flexible and efficient, we propose the following structure for the long-term dataset storage.
Images are stored within a structure that reflects the data collection process. For example, if images are collected from pre-defined locations and known data collection devices (cameras), the following structure can be leveraged:
```
[images]
└── [{location_name}]
    └── [{camera_id}]
        └── [{year}]
            └── [{month}]
                ├── image1.jpg
                ├── image2.jpg
                └── ...
```
This approach makes image collection independent of the labeling process.
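A small helper can derive the blob path for a newly collected image under this structure (a sketch; the function name and the zero-padded month are assumptions):

```python
from datetime import date

def image_blob_path(location_name: str, camera_id: str,
                    captured: date, file_name: str) -> str:
    """Place an image in the collection-oriented structure:
    images/{location}/{camera}/{year}/{month}/{file}."""
    return (f"images/{location_name}/{camera_id}/"
            f"{captured.year}/{captured.month:02d}/{file_name}")
```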
Following modern annotation/labeling practices for computer vision data, labeling datasets that change over time has two traits:
- Labeled images are referenced as they are stored in data storage.
- There is always a label "merging" step for grouping images into a single labeling set. It either happens as part of the labeling itself, or during the data preparation steps.
Our practice shows that regardless of the labeling type (image tags for classification, object identification with bounding boxes, instance segmentation, etc.), what makes labels across multiple sets be grouped and used together is the notion of a dataset. A single image may be used across many datasets, but a single label will most likely be used only within a single dataset. Even when a label is leveraged across multiple datasets, it will most likely be augmented with dataset-specific attributes, and is therefore better copied to a dataset-specific location. A dataset is what defines a group of labeled images/files.
Labels define the dataset structure and the purposes of different image groups such as training, fine-tuning, testing, and validation. To manage all labels, we propose to store them by groups as follows. The training code should be able to process the labels folder structure.
- Labels are stored under a dataset name.
- Every label file is named as `{location_name}.{camera_id}.{unique_file_id}`. `unique_file_id` must be a unique value within a group of files with the same `location_name` and `camera_id`.
- When possible, label files are placed according to their primary purpose within the training process (e.g. `train`, `test`).
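The naming convention can be enforced with a small helper (illustrative; the `.csv` extension and the dot-free component check are assumptions):

```python
def label_file_name(location_name: str, camera_id: str,
                    unique_file_id: str, extension: str = "csv") -> str:
    """Compose a {location_name}.{camera_id}.{unique_file_id} label file
    name. unique_file_id must be unique within a (location_name,
    camera_id) group."""
    for part in (location_name, camera_id, unique_file_id):
        # Dots are the separators, so components must not contain them
        if "." in part:
            raise ValueError("name components must not contain '.'")
    return f"{location_name}.{camera_id}.{unique_file_id}.{extension}"
```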
If random runtime shuffling/splitting is used, then for the purpose of experiment reproducibility the split information (the file list for every group) must be stored in the run's output storage and reused in derived runs.
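A minimal sketch of a seeded runtime split whose record can be written to the run's output (e.g. as a JSON file uploaded with `Run.upload_folder`) and reused in derived runs; the function name and fractions are illustrative:

```python
import random

def split_and_record(label_files, seed: int,
                     test_frac: float = 0.1, val_frac: float = 0.1) -> dict:
    """Randomly split label files into train/test/validation and return a
    JSON-serializable record to store in the run's output for reproduction."""
    rng = random.Random(seed)        # seeded for determinism
    files = sorted(label_files)      # stable order regardless of input order
    rng.shuffle(files)
    n_test = int(len(files) * test_frac)
    n_val = int(len(files) * val_frac)
    return {
        "seed": seed,
        "test": files[:n_test],
        "validation": files[n_test:n_test + n_val],
        "train": files[n_test + n_val:],
    }

# e.g. json.dump(split_and_record(files, seed), open("outputs/split.json", "w"))
```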
Here are a few examples with the dataset name `object_detection` and labels in `.csv` format.
Example 1 - annotations are stored by the data category
```
[object_detection]
├── [train]
│   ├── {location_name}.{camera_id}.029347623462.csv
│   ├── {location_name}.{camera_id}.029347346341.csv
│   └── {location_name}.{camera_id}.235235994566.csv
├── [test]
│   ├── {location_name}.{camera_id}.029347623410.csv
│   ├── {location_name}.{camera_id}.029347346395.csv
│   └── {location_name}.{camera_id}.235232302300.csv
└── [validation]
    ├── {location_name}.{camera_id}.029347678464.csv
    ├── {location_name}.{camera_id}.462747346320.csv
    └── {location_name}.{camera_id}.556435994562.csv
```
Example 2 - annotations are stored by the model stage category
```
[object_detection]
├── [train]          <- for runtime train/test/val split
│   ├── {location_name}.{camera_id}.029347623462.csv
│   ├── {location_name}.{camera_id}.029347346341.csv
│   └── {location_name}.{camera_id}.235235994566.csv
└── [qa_validation]  <- for QA validation tests
    ├── {location_name}.{camera_id}.029347623410.csv
    ├── {location_name}.{camera_id}.029347346395.csv
    └── {location_name}.{camera_id}.235232302300.csv
```
Example 3 - annotations are stored by the model stage category and time
This is the recommended label storage structure.
```
[object_detection]
├── [train]          <- for runtime train/test/val split
│   └── [{year}.{month}]
│       ├── {location_name}.{camera_id}.029347623462.csv
│       ├── {location_name}.{camera_id}.029347346341.csv
│       └── {location_name}.{camera_id}.235235994566.csv
└── [qa_validation]  <- for QA validation tests
    ├── {location_name}.{camera_id}.029347623410.csv
    ├── {location_name}.{camera_id}.029347346395.csv
    └── {location_name}.{camera_id}.235232302300.csv
```
There are labels that are yet to be verified, such as those made using transfer learning or another auto-labeling technique. For these labels, the storage structure needs to reflect the state of the data collection and verification processes. On the data collection side, the source (location, camera id) should be represented in the structure. On the verification side, the state of the labels should be represented: for example, if we are doing object detection for trucks or binary classification, the images can be auto-labeled as 'positive', auto-labeled as 'negative', or 'verified'. All verified labels must be further processed/converted and transferred to the train (default) or qa_validation subsets.
```
[object_detection]  <- dataset name
├── [unverified]
│   ├── [positive]
│   │   └── [{location_name}]
│   │       └── [{camera_id}]
│   │           ├── {location_name}.{camera_id}.029347623462.csv
│   │           ├── {location_name}.{camera_id}.029347346341.csv
│   │           └── {location_name}.{camera_id}.235235994566.csv
│   └── [negative]  <- same as in positive
└── [verified]
    └── [{year}.{month}]
        ├── {location_name}.{camera_id}.029347623410.csv
        ├── {location_name}.{camera_id}.029347346395.csv
        └── {location_name}.{camera_id}.235232302300.csv
```