Description
Here is my use case:
I have 4 GPU nodes on AWS for training, which includes computing intermediate tensors.
I want to save these pre-computed tensors to Deep Lake (as a dataset / database / vector store), so the next training run can skip the computation and save a lot of time.
I use Accelerate as my distributed parallel framework.
So my framework works like this:
```python
deeplake_path = 'dataset_{}'.format(current_process_index)
ds = deeplake.dataset(deeplake_path, overwrite=False)

for index, data_dict in enumerate(my_pytorch_dataloader):
    with torch.no_grad():
        a = net_a_frozen(data_dict['a'])
        b = net_b_frozen(data_dict['b'])
        # loss = net_c_training(a, b)
        # the loss is only used in training
    save_dict = {'data_dict': data_dict,
                 'a': a.detach().cpu().numpy(),
                 'b': b.detach().cpu().numpy()}
    append_to_deeplake(deeplake_path, save_dict)  # my own helper
    if index % 100 == 0:
        commit_to_deeplake(deeplake_path)  # my own helper
```
Note that after the dataset is constructed, I can read the tensors back from Deep Lake in the next training run instead of computing them again.
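For the read-back step, I am assuming something like the sketch below works (`ds.pytorch` is the loader I found in the docs; the tensor names `a` and `b` match my schema above):

```python
import deeplake

# Load the previously constructed dataset (path from the run above).
ds = deeplake.load('dataset_0')

# Wrap it as a PyTorch dataloader so training consumes the
# pre-computed tensors directly instead of recomputing them.
loader = ds.pytorch(num_workers=2, batch_size=32, shuffle=True)

for batch in loader:
    a, b = batch['a'], batch['b']  # already computed, frozen nets not needed
```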
The problems include:
- I have to assign a different Deep Lake dataset to each process, but afterwards I need to merge them into a single dataset (see the merge sketch after this list).
- I need to design a proper for-loop / parallel workflow for constructing the Deep Lake dataset.
- The frequent `append` and `commit` calls take a lot of time.
- The `detach()` and `.cpu()` calls take a lot of time.
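This is roughly how I imagine the merge step; the per-sample append loop is my own workaround, since I have not found a dedicated merge API (the tensor names `a` and `b` match my schema above):

```python
import deeplake

# Merge the per-process datasets into one (my own sketch,
# not a documented merge feature).
merged = deeplake.empty('dataset_merged', overwrite=True)
merged.create_tensor('a')
merged.create_tensor('b')

for rank in range(4):  # one shard per GPU process
    shard = deeplake.load('dataset_{}'.format(rank))
    # The `with` context batches writes and defers the flush,
    # which should also reduce the per-append overhead I am seeing.
    with merged:
        for sample in shard:
            merged.append({'a': sample['a'].numpy(),
                           'b': sample['b'].numpy()})

merged.commit('merged shards from 4 processes')
```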
So, is there any feature to transform a custom dataset into a Deep Lake dataset? It would help if there were functions that worked like this:

```python
ds.distributed_append_gpu_tensor_and_auto_commit(data_tensor)
ds.auto_transform_pytorch_dataset(my_pytorch_dataloader)
```
Or could you give me a standard workflow to solve this? I don't know which method is best for this scenario.
The documentation does not cover this problem, and #2596 points at the same issue.
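The closest thing I found is the `@deeplake.compute` transform API; below is a minimal sketch of how I would try it, but I don't know whether it can run the frozen nets on GPU across distributed processes (the `populate` function and tensor names are my own assumptions):

```python
import deeplake
import torch

@deeplake.compute
def populate(data_dict, sample_out):
    # Hypothetical per-sample ingestion; I am not sure this can
    # run net_a_frozen / net_b_frozen on GPU inside the workers.
    with torch.no_grad():
        a = net_a_frozen(data_dict['a'])
        b = net_b_frozen(data_dict['b'])
    sample_out.append({'a': a.detach().cpu().numpy(),
                       'b': b.detach().cpu().numpy()})

ds = deeplake.empty('dataset_all', overwrite=True)
ds.create_tensor('a')
ds.create_tensor('b')

# eval() parallelizes over the input samples with num_workers processes.
populate().eval(list(my_pytorch_dataset), ds, num_workers=4)
```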
Use Cases
Distributed parallel computation with results saved to Deep Lake.