Create a large dataset loader #3360
-
Hello, I am currently looking for a way to deal with medium-to-large datasets of geometric point clouds. The idea would be to create 1-meter blocks, just like in PointNet, where each 1-m block is a sample (which is what is done in the S3DIS torch_geometric dataset implementation, if I am not wrong?). I saw that there are InMemoryDataset and Dataset classes for this purpose. Although I don't use them for the moment (I am not yet comfortable with them), my problem arises earlier, in what would be the 'process' part, I think. I use as input a folder where the clouds are located (like in the Semantic3D dataset, for instance). This is the part I do for each file (cloud):

    import numpy as np
    import torch
    from torch_geometric.data import Data
    from torch_geometric.nn import voxel_grid

    datas = []
    print("creating data")
    # pos, feats, labels were loaded from the i-th file
    data = Data(pos=pos, x=feats, y=labels, batch=i * torch.ones(pos.shape[0]))  # Data object of the i-th cloud/file
    print("Voxelizing")
    id_grid = voxel_grid(data.pos, data.batch, size=1.5)  # voxelize into 1.5-meter blocks
    print("batching")
    for cluster in np.unique(id_grid):
        ids = id_grid == cluster  # boolean mask of the points falling into this voxel
        sample = Data()
        for k in data.keys:
            sample[k] = data[k][ids]
        datas.append(sample)

The for loop, in which I want to build one Data object for each 1.5-meter block, is very slow (about 20 minutes for a 30M-point cloud). I feel like I am not doing it in the most efficient way, as I need to get, for each voxel, the points that belong to it and then build a Data object with all the attributes of these specific points. I am planning to add all these Data objects to the dataset. Is there a way to do this more efficiently, for instance? Thx!
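For reference, here is a rough sketch of how I imagine plugging this per-block splitting into an InMemoryDataset later on; the class name BlockDataset, the file names, and the read_cloud loader are just placeholders for my own code, not something from torch_geometric:

```python
import torch
from torch_geometric.data import Data, InMemoryDataset
from torch_geometric.nn import voxel_grid


class BlockDataset(InMemoryDataset):  # placeholder name
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['cloud_0.txt']  # placeholder: the raw point cloud files

    @property
    def processed_file_names(self):
        return ['data.pt']

    def process(self):
        data_list = []
        for path in self.raw_paths:
            pos, feats, labels = read_cloud(path)  # placeholder loader for one cloud
            batch = torch.zeros(pos.size(0), dtype=torch.long)  # single cloud -> one batch id
            id_grid = voxel_grid(pos, size=1.5, batch=batch)  # 1.5-meter blocks
            for cluster in id_grid.unique():
                mask = id_grid == cluster
                data_list.append(Data(pos=pos[mask], x=feats[mask], y=labels[mask]))
        torch.save(self.collate(data_list), self.processed_paths[0])
```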
Replies: 1 comment 1 reply
-
Your code looks correct to me, although there are probably more efficient ways to implement this. Currently, you are checking via masking whether a point is contained in a single cluster, which is very expensive to do on large-scale point clouds. While there exist sophisticated data structures for this use case, e.g., kd-trees, a probably easier way is to convert id_grid into a list of tensors that hold the indices of the nodes belonging to each cluster, and to use those for indexing the global point cloud:

    deg = torch_geometric.utils.degree(id_grid, dtype=torch.long)  # points per cluster (int, so it can be fed to split)
    indices = id_grid.argsort().split(deg.tolist())  # one index tensor per cluster
    for index in indices:
        pos = data.pos[index]

Let me know if this speeds up things :)
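To make this concrete, here is a rough sketch of how the full per-cluster loop could look with this indexing scheme, assuming data and id_grid from your snippet above; the single argsort replaces one boolean-mask pass per cluster, so the cost drops from roughly O(N × #clusters) to O(N log N):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.utils import degree

# Number of points per cluster, as integers so that split() accepts them:
deg = degree(id_grid, dtype=torch.long)

# Sort the point indices by cluster id and cut them into one chunk per cluster:
indices = id_grid.argsort().split(deg.tolist())

datas = []
for index in indices:
    if index.numel() == 0:  # clusters with no points yield empty chunks
        continue
    datas.append(Data(pos=data.pos[index], x=data.x[index], y=data.y[index]))
```

If the values in id_grid are not consecutive, relabelling them first via id_grid.unique(return_inverse=True) avoids the empty chunks altogether.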