[RMP] Create a separate Merlin package for the dataloaders #394

@karlhigley

Problem:

A number of customers want to use only our dataloaders. The dataloaders are a thin wedge that we can use to drive Merlin adoption among teams.

The PyTorch recommendations framework TorchRec would like to make the Merlin dataloader their default without depending on all of Merlin Models. They'd like to publish blog posts about the framework, which creates an opportunity to co-promote one part of the Merlin ecosystem.

The Spark team wants to use our dataloaders to accelerate their TensorFlow workflows and has coordinated with the Horovod team to make it an optional dataloader that's natively included with Horovod.

Goal:

  • Publish a Merlin package specifically for the dataloaders, separate from Merlin Models and NVT, that both MM and TorchRec can then depend on.
  • Integrate the dataloader into as many upstream packages as possible.

Scope:

Publish Merlin dataloaders under a new 'dataloader' package

Constraints:

  • TorchRec is using the old (deprecated) version of the dataloaders from NVT
  • TorchRec isn't set up to use the new version of the dataloaders in Merlin Models
  • Merlin Models still needs to maintain access to the dataloaders too
  • Horovod container size was a concern. They're interested in a pip install of RAPIDS (coming in 22.10).

Blockers

Create a new repo from the Merlin repo template
Note: Julio is blocked on this. Ben has to create the new 'dataloader' repo.

Starting Point:

v22.08

v22.09
