Device Assignment Problem for model.eval() in TGN #7008
-
I augmented this TGN example to work on my own dataset, which is composed of 36 CSVs of 37.3 MB each. The original TGN example runs on CUDA without issues, and my augmented network runs fine on CPU. I would keep it on CPU, but training and testing for just 10 epochs would take over 18 hours, so I need to use the NVIDIA RTX A6000 GPUs (48 GB) available through my university.

Unfortunately, when I switch the device to CUDA, I get the error below. I have searched the internet high and low for how to place all components on the CUDA device and have tried many fixes, to no avail. I now have many redundant device assignments in my code, but I still cannot get the devices to match. Could the error be in how I created the datasets? (See the "2. Temporal Dataset creation:" code block.) Since I have separate datasets for training, testing, and validation, I also moved the code that calls [...]

I am happy to provide any further information. This is my first time posting a help discussion, so any feedback on how clearly I have asked the question is also welcome. Thank you in advance!
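For context, the sketch below shows the general shape of what I am trying to do: build a TemporalData from one CSV and place it, together with the TGN memory, on the GPU. It is based on the bundled tgn.py example rather than my actual code; the file path, column names, and dimensions are placeholders.

```python
# Simplified sketch (placeholders, not my real code): load one CSV into a
# TemporalData and move the data plus the TGN memory onto the same CUDA device.
import pandas as pd
import torch
from torch_geometric.data import TemporalData
from torch_geometric.nn import TGNMemory
from torch_geometric.nn.models.tgn import IdentityMessage, LastAggregator

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

df = pd.read_csv('events.csv')  # placeholder path
data = TemporalData(
    src=torch.tensor(df['src'].values, dtype=torch.long),
    dst=torch.tensor(df['dst'].values, dtype=torch.long),
    t=torch.tensor(df['t'].values, dtype=torch.long),
    msg=torch.tensor(df[['f0', 'f1']].values, dtype=torch.float),  # edge features
).to(device)  # event tensors must end up on the same device as the model

memory_dim = time_dim = 100
memory = TGNMemory(
    data.num_nodes,
    data.msg.size(-1),
    memory_dim,
    time_dim,
    message_module=IdentityMessage(data.msg.size(-1), memory_dim, time_dim),
    aggregator_module=LastAggregator(),
).to(device)  # .to(device) after construction so the internal buffers move as well
```

The embedding GNN and link predictor from the example are moved with `.to(device)` in the same way.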
Replies: 1 comment 1 reply
-
I pushed a fix here: #7028

With this, you should be able to run:

```python
memory = TGNMemory(...).to(device)
memory.reset_state()
memory.eval()
```
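For reference, a minimal, self-contained version of that pattern (the node count and dimensions here are arbitrary, and it assumes a PyG build that includes the linked fix):

```python
# Minimal sketch: a TGNMemory placed on the GPU, then reset and switched to
# eval mode -- the sequence that previously raised the device-mismatch error.
import torch
from torch_geometric.nn import TGNMemory
from torch_geometric.nn.models.tgn import IdentityMessage, LastAggregator

device = torch.device('cuda')
num_nodes, raw_msg_dim, memory_dim, time_dim = 1000, 16, 32, 32

memory = TGNMemory(
    num_nodes,
    raw_msg_dim,
    memory_dim,
    time_dim,
    message_module=IdentityMessage(raw_msg_dim, memory_dim, time_dim),
    aggregator_module=LastAggregator(),
).to(device)

memory.reset_state()  # clear the memory and message store
memory.eval()         # switch to evaluation mode before running the test split
```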