Create a large dataset loader #3360
-
Hello, I am currently looking for a way to deal with medium-to-large datasets of geometric point clouds. The idea would be to create 1-meter blocks, just like in PointNet, where each 1-m block is a sample (which is what is done in the S3DIS torch_geometric dataset implementation, if I am not wrong?). I saw that there are InMemoryDataset and Dataset classes for this purpose. Although I don't use them for the moment (I am not yet comfortable with them), my problem arises earlier, in what would be the 'process' part, I think. I use as input a folder where the clouds are located (like in the Semantic3D dataset, for instance). This is the part I do for each file (cloud):

    import numpy as np
    import torch
    from torch_geometric.data import Data
    from torch_geometric.nn import voxel_grid

    datas = []
    print("creating data")
    # pos, feats, labels were loaded from the i-th file
    data = Data(pos=pos, x=feats, y=labels, batch=i * torch.ones(pos.shape[0]))  # Data object of the i-th cloud/file
    print("Voxelizing")
    id_grid = voxel_grid(data.pos, data.batch, size=1.5)  # voxelize into 1.5-meter blocks
    print("batching")
    for cluster in np.unique(id_grid):
        ids = id_grid == cluster  # boolean mask of the points falling into this voxel
        sample = Data()
        for k in data.keys:
            sample[k] = data[k][ids]
        datas.append(sample)

The for loop, in which I want to build one Data object for each 1.5-meter block, is very slow (about 20 minutes for a 30M-point cloud). I feel like I am not doing it in the most efficient way, as I need to get, for each voxel, the points that belong to it and then build a Data object with all the attributes of these specific points. I am planning to add all these Data objects to the dataset. Is there a way to do this more efficiently, for instance? Thx!
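For reference, here is a rough sketch of how I imagine plugging this per-block splitting into an InMemoryDataset later on; the class name BlockDataset, the file names, and the read_cloud loader are just placeholders for my own code, not something from torch_geometric:

```python
import torch
from torch_geometric.data import Data, InMemoryDataset
from torch_geometric.nn import voxel_grid


class BlockDataset(InMemoryDataset):  # placeholder name
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['cloud_0.txt']  # placeholder: the raw point cloud files

    @property
    def processed_file_names(self):
        return ['data.pt']

    def process(self):
        data_list = []
        for path in self.raw_paths:
            pos, feats, labels = read_cloud(path)  # placeholder loader for one cloud
            batch = torch.zeros(pos.size(0), dtype=torch.long)  # single cloud -> one batch id
            id_grid = voxel_grid(pos, size=1.5, batch=batch)  # 1.5-meter blocks
            for cluster in id_grid.unique():
                mask = id_grid == cluster
                data_list.append(Data(pos=pos[mask], x=feats[mask], y=labels[mask]))
        torch.save(self.collate(data_list), self.processed_paths[0])
```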
Replies: 1 comment 1 reply
-
Your code looks correct to me, although there are probably more efficient ways to implement this. Currently, you are checking via masking whether a point is contained in a single cluster, which is very expensive to do on large-scale point clouds. While there exist sophisticated data structures for this use case, e.g., kd-trees, a probably easier way is to convert id_grid into a list of tensors that hold the indices of the nodes belonging to each cluster, and to use those for indexing the global point cloud:

    deg = torch_geometric.utils.degree(id_grid, dtype=torch.long)  # points per cluster (int, so it can be fed to split)
    indices = id_grid.argsort().split(deg.tolist())  # one index tensor per cluster
    for index in indices:
        pos = data.pos[index]

Let me know if this speeds up things :)
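To make this concrete, here is a rough sketch of how the full per-cluster loop could look with this indexing scheme, assuming data and id_grid from your snippet above; the single argsort replaces one boolean-mask pass per cluster, so the cost drops from roughly O(N × #clusters) to O(N log N):

```python
import torch
from torch_geometric.data import Data
from torch_geometric.utils import degree

# Number of points per cluster, as integers so that split() accepts them:
deg = degree(id_grid, dtype=torch.long)

# Sort the point indices by cluster id and cut them into one chunk per cluster:
indices = id_grid.argsort().split(deg.tolist())

datas = []
for index in indices:
    if index.numel() == 0:  # clusters with no points yield empty chunks
        continue
    datas.append(Data(pos=data.pos[index], x=data.x[index], y=data.y[index]))
```

If the values in id_grid are not consecutive, relabelling them first via id_grid.unique(return_inverse=True) avoids the empty chunks altogether.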