Main talk (25 minutes + 5 minutes for Q&A)
Speaker: Artur Chakhvadze
Title: Data as API: principled data storage approach for Machine Learning projects. Case study from Snapchat 3D tracking team
Abstract: Data storage is often a pain point for large machine learning projects, where training and validation data is used by multiple teams with different requirements and access patterns. The typical approach is to store data as a set of files on a distributed filesystem and write a separate ad-hoc dataloader for each new application.
I argue that it is better to hide the underlying storage details behind a simple, well-documented API. This eliminates bugs, simplifies cross-team collaboration, makes onboarding easier, and makes it possible to switch between underlying storage systems without changing anything on the consumer side.
I present a case study from the Snapchat 3D tracking team, where the transition to a single data storage API improved the performance of training pipelines, shortened iteration times, eliminated bugs, and simplified both GDPR compliance and the onboarding of new team members.
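To make the "data as API" idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not the interface used by the Snapchat team: the names `SampleStore`, `InMemoryStore`, `JsonDirStore`, `label_counts`, and `write_samples`, as well as the sample layout with a `label` field, are all hypothetical. The point it shows is that consumer code depends only on the API, so the storage backend can be swapped without touching the consumer.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Protocol


class SampleStore(Protocol):
    """Hypothetical consumer-facing API: the only thing consumers depend on."""

    def ids(self) -> Iterator[str]: ...

    def get(self, sample_id: str) -> dict: ...


class InMemoryStore:
    """Backend 1 (assumed): samples held in a plain dictionary."""

    def __init__(self, samples: dict[str, dict]) -> None:
        self._samples = dict(samples)

    def ids(self) -> Iterator[str]:
        return iter(sorted(self._samples))

    def get(self, sample_id: str) -> dict:
        return self._samples[sample_id]


class JsonDirStore:
    """Backend 2 (assumed): one JSON file per sample in a directory."""

    def __init__(self, root: Path) -> None:
        self._root = Path(root)

    def ids(self) -> Iterator[str]:
        return iter(sorted(p.stem for p in self._root.glob("*.json")))

    def get(self, sample_id: str) -> dict:
        return json.loads((self._root / f"{sample_id}.json").read_text())


def label_counts(store: SampleStore) -> dict[str, int]:
    """Consumer code: written once against the API, unaware of the backend."""
    counts: dict[str, int] = {}
    for sample_id in store.ids():
        label = store.get(sample_id)["label"]
        counts[label] = counts.get(label, 0) + 1
    return counts


def write_samples(root: Path, samples: dict[str, dict]) -> None:
    """Helper that lays out samples on disk in the JsonDirStore format."""
    for sample_id, sample in samples.items():
        (root / f"{sample_id}.json").write_text(json.dumps(sample))


# Made-up toy data, for illustration only.
SAMPLES = {"a": {"label": "face"}, "b": {"label": "hand"}, "c": {"label": "face"}}

memory_counts = label_counts(InMemoryStore(SAMPLES))
with tempfile.TemporaryDirectory() as tmp:
    write_samples(Path(tmp), SAMPLES)
    file_counts = label_counts(JsonDirStore(Path(tmp)))

# The same consumer yields the same result on both backends.
print(memory_counts == file_counts)
```

In this sketch, moving from files to another storage system would mean adding one more class with `ids` and `get`; `label_counts` and any other consumer would stay unchanged.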
Recording consent: No
Publishing slides consent: Yes
Availability:
Special requirements:
Submitted 10/07/2024 13:46:36 via PyData London - Submit a Talk