
[FEATURE] Transform custom dataset to deeplake dataset/database/vectorstore conveniently using DDP #2602

@ChawDoe

Description

Here is my use case:
I have 4 GPU nodes on AWS for training (including computing tensors).
I want to save the pre-computed tensors to Deep Lake (Dataset/database/vector store), so that the next training run does not have to recompute them.
I use Accelerate as my distributed parallel framework.
So my workflow looks like this:

import deeplake
import torch

# One dataset shard per process (current_process_index comes from Accelerate).
deeplake_path = 'dataset_{}'.format(current_process_index)
ds = deeplake.dataset(deeplake_path, overwrite=False)

for index, data_dict in enumerate(my_pytorch_dataloader):
  with torch.no_grad():
    # Pre-compute tensors with the frozen networks.
    a = net_a_frozen(data_dict['a'])
    b = net_b_frozen(data_dict['b'])
  # loss = net_c_training(a, b)  # the loss is only used in training
  save_dict = {'data_dict': data_dict,
               'a': a.detach().cpu().numpy(),
               'b': b.detach().cpu().numpy()}
  append_to_deeplake(deeplake_path, save_dict)  # my own helper
  if index % 100 == 0:
    commit_to_deeplake(deeplake_path)           # my own helper

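To make the cost of the loop above concrete, the append/commit pattern I want can be sketched in pure Python: buffer samples in memory and commit in batches instead of per sample. The `ShardWriter` class and `flush_every` parameter below are hypothetical stand-ins, not Deep Lake API:

```python
# Sketch of buffering appends and committing in batches instead of per
# sample. ShardWriter is a hypothetical stand-in for one Deep Lake shard;
# it only illustrates the buffer-then-flush pattern.
class ShardWriter:
    def __init__(self, flush_every=100):
        self.flush_every = flush_every
        self.buffer = []        # samples not yet flushed
        self.committed = []     # samples flushed to "storage"
        self.commits = 0

    def append(self, sample):
        self.buffer.append(sample)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.buffer:
            self.committed.extend(self.buffer)
            self.buffer.clear()
            self.commits += 1   # one commit per batch, not per sample

writer = ShardWriter(flush_every=100)
for index in range(250):                      # stand-in for the dataloader loop
    writer.append({'index': index})
writer.flush()                                # flush the tail
print(writer.commits, len(writer.committed))  # 3 250
```

With this pattern, 250 samples cost 3 commits rather than 250; the open question for me is whether Deep Lake can do this batching internally.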
Note that after the Deep Lake dataset is constructed, the next training run can read the tensors from it instead of computing them again.
The problems include:

  1. I have to assign a different Deep Lake dataset to each process, but I need to merge them into a single dataset afterwards.
  2. I need to design a proper for-loop/parallel workflow for the dataset construction.
  3. The frequent append and commit calls take a lot of time.
  4. The .detach() and .cpu() calls take a lot of time.
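For point 1, what I have in mind is roughly the following pure-Python sketch: each rank writes its own shard, and afterwards one process concatenates the shards in rank order. `merge_shards` is a hypothetical helper, not an existing Deep Lake function, and the shards are plain lists standing in for the per-rank datasets:

```python
# Hypothetical sketch of merging per-process shards into one dataset.
# Each shard is represented as a plain list of samples; in the real use
# case each would be a Deep Lake dataset at 'dataset_{rank}'.
def merge_shards(shards):
    """Concatenate shards in rank order so the sample order is deterministic."""
    merged = []
    for rank in sorted(shards):  # iterate ranks 0, 1, 2, ...
        merged.extend(shards[rank])
    return merged

# Four ranks, each having written a few samples.
shards = {rank: ['rank{}_sample{}'.format(rank, i) for i in range(3)]
          for rank in range(4)}
merged = merge_shards(shards)
print(len(merged))            # 12
print(merged[0], merged[-1])  # rank0_sample0 rank3_sample2
```

The point is that the merge should be a cheap metadata operation rather than re-copying every tensor.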

So, is there any feature to transform a custom dataset into a Deep Lake dataset?
It would help if there were functions that worked like this:

ds.distributed_append_gpu_tensor_and_auto_commit(data_tensor)
ds.auto_transform_pytorch_dataset(my_pytorch_dataloader)

Or could you give me a standard workflow to solve this?
I don't know which method is best for this scenario.
The documentation does not cover this problem. #2596 also points at this problem.
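Roughly, the imagined auto-transform function would iterate the dataloader, append every batch, and commit periodically. A minimal pure-Python sketch of that behavior (the `auto_transform_dataloader` helper is hypothetical, and the dataloader is faked as a list of dicts):

```python
# Hypothetical sketch of what an auto-transform helper could look like:
# iterate a dataloader, append every batch, and commit periodically.
def auto_transform_dataloader(dataloader, append_fn, commit_fn, commit_every=100):
    """Append each batch via append_fn; call commit_fn every commit_every batches."""
    for index, batch in enumerate(dataloader):
        append_fn(batch)
        if (index + 1) % commit_every == 0:
            commit_fn()
    commit_fn()  # final commit for the tail

# Fake dataloader: a list of batch dicts standing in for my_pytorch_dataloader.
dataloader = [{'a': i, 'b': i * 2} for i in range(10)]
appended, commits = [], []
auto_transform_dataloader(dataloader,
                          append_fn=appended.append,
                          commit_fn=lambda: commits.append(len(appended)),
                          commit_every=4)
print(len(appended), len(commits))  # 10 3
```

In the real feature, append_fn and commit_fn would be Deep Lake's own append and commit, and the helper would ideally also handle the per-process sharding and GPU-to-CPU transfer internally.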

Use Cases

Distributed parallel computation and saving the results to Deep Lake.

Metadata

Labels: enhancement (New feature or request)