-
Good evening, fellas. First of all, I want to clarify that I'm not an expert at all in PyG and its structure or conventions, so bear with me a little bit. OK, on to the issue:

The Data
I'm trying to build a custom dataset for research purposes. The data in question is somewhat unconventional; I haven't found anything similar to it in my research. I want to represent a radiography dataset as an undirected graph where the vertices are LBP histograms (42 decimal features) and the edge values are the cosine similarity between these histograms. I want to test whether there is any advantage in representing this data in a graph format.

The Problem
Due to the nature of the data, as expected, the number of edges is huge. I'm talking millions, or even a couple of hundred million. Since even a simple dataset (like the one I'm working on) contains some tens of thousands of images, the maximum number of edges is naturally that number squared. Using the COO format, I managed to curb the memory problem a little bit, but some layers simply do not support sparse operations (e.g. the MinCut pooling layer). I looked into batching, but I'm having trouble understanding it.

So, what I'm looking for is some guidance on how to properly structure this data following torch conventions, in a manner that my computer doesn't explode. Any tip or hint would be greatly appreciated. Thank you in advance.

Specs
OS -> Fedora 39

Edit: System specs
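The post doesn't include code, but one common way to keep the edge count from reaching N squared is to threshold the similarity so only strong pairs become edges, building the COO arrays directly. Here is a minimal NumPy sketch of that idea; the function name `cosine_threshold_edges`, the threshold value, and the toy data are illustrative assumptions, not from the discussion.

```python
import numpy as np

def cosine_threshold_edges(features, threshold=0.9):
    """Return COO edge arrays (row, col, weight) for pairs of rows in
    `features` whose cosine similarity exceeds `threshold`."""
    # L2-normalise rows so a plain dot product equals cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    # NOTE: for tens of thousands of rows, compute `sim` in row blocks
    # instead of all at once -- a full (N, N) matrix is itself the
    # memory problem this sketch is trying to avoid.
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)              # exclude self-loops
    row, col = np.nonzero(sim > threshold)  # keep only strong edges
    return row, col, sim[row, col]

# Toy usage: 6 "LBP histograms" with 42 features each.
rng = np.random.default_rng(0)
feats = rng.random((6, 42))
row, col, w = cosine_threshold_edges(feats, threshold=0.8)
```

Because the similarity matrix is symmetric, every kept pair appears in both directions, which matches an undirected graph in COO form.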
-
Is this a single graph or multiple graphs? Is this a node or graph classification dataset? You are right that some layers do not support this scale (such as dense pooling layers). Happy to give you more advice if you share some more details on what you are trying to do.
In this case, please take a look at `NeighborLoader` or `LinkNeighborLoader`, which you can use to scale node-level or link-level tasks on a single graph.