Problems creating my own dataset & loading it (for graph-classification) #6074

elisagdelope · 2022-11-26T11:28:13Z

I'm working on a graph classification problem, so I wanted to create a Dataset object from a list (e.g. data_list) of 578 Data() objects. This works just fine when I save the list with torch.save(data_list, self.processed_paths[0]) in the process() method: it is saved, and it loads as a list of Data() objects that I can access without problems.
However, the PyG documentation suggests collating data_list (the list of Data() objects) and then saving the resulting data and slices:

```python
data, slices = self.collate(data_list)
torch.save((data, slices), self.processed_paths[0])
```

When I do this, my dataset has length 1, which is unexpected: I would expect it to contain the 578 Data() objects. Surprisingly, dataset[0] is a Data() object that contains the information from all 578 of my Data() objects. How do I access the individual Data() objects?

The ideal outcome when loading the dataset would be a dataset object containing 578 Data() objects, each representing a graph that I can access. I don't think I understand how slices work. I don't even know where they are stored or how to get them from the dataset object...
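As a conceptual aid for the question about slices, here is a minimal sketch in plain Python of the bookkeeping that collating implies: all graphs are concatenated into one big object, and a list of offsets records where each graph starts and ends. The names `collate_lists` and `get_graph` are hypothetical and are not PyG API; plain lists stand in for tensors. In the class from this post, the corresponding objects would be the ones assigned by `self.data, self.slices = torch.load(...)`, so the offsets should be reachable as `dataset.slices`.

```python
from itertools import accumulate


def collate_lists(graphs):
    """Concatenate per-graph values and record the offsets of each graph."""
    flat = [value for g in graphs for value in g]
    # Graph i occupies flat[slices[i]:slices[i + 1]].
    slices = [0] + list(accumulate(len(g) for g in graphs))
    return flat, slices


def get_graph(flat, slices, idx):
    """Recover graph `idx` from the concatenated storage."""
    return flat[slices[idx]:slices[idx + 1]]


# Three toy "graphs" with 3, 2 and 4 node values respectively.
graphs = [[0.1, 0.2, 0.3], [1.0, 1.1], [2.0, 2.1, 2.2, 2.3]]
flat, slices = collate_lists(graphs)
num_graphs = len(slices) - 1

print(slices)                       # [0, 3, 5, 9]
print(num_graphs)                   # 3
print(get_graph(flat, slices, 1))   # [1.0, 1.1]
```

The point of the sketch is that the number of graphs comes from the offsets, not from the number of saved files, and that a single stored object can still be split back into its individual graphs.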

This is how I create the dataset:

```python
import os.path as osp

import pandas as pd
import torch
from torch_geometric.data import Data, InMemoryDataset
from tqdm import tqdm


class MyOmicsDataset(InMemoryDataset):

    def __init__(self, root, X_file, graph_file, labels_file, transform=None, pre_transform=None, pre_filter=None):
        self.X_file = X_file
        self.graph_file = graph_file
        self.labels_file = labels_file
        super().__init__(root, transform, pre_transform, pre_filter)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        """If these files exist in raw_dir, the download is not triggered.
        (The download function is not implemented here.)
        """
        return [self.X_file, self.graph_file, self.labels_file]

    @property
    def processed_file_names(self):
        """If these files are found in processed_dir, processing is skipped."""
        return ['data.pt']

    def download(self):
        """Download to `self.raw_dir`. Not implemented here."""
        # path = download_url(url, self.raw_dir)
        pass

    def process(self):
        # load node attributes: gene expression
        X_df = pd.read_csv(self.raw_paths[0], index_col=0)  # index is geneid

        # load graph: ppi
        ppi = pd.read_csv(self.raw_paths[1])

        # load labels
        labels = pd.read_csv(self.raw_paths[2], index_col=0)
        labels_dict = dict(zip(labels.index, labels[labels.columns[0]]))

        # convert df into tensors
        # map gene id in X_df to its row position -> {geneid: index} in node features
        X_mapping = {index: i for i, index in enumerate(X_df.index.unique())}
        src = [X_mapping[index] for index in ppi.iloc[:, 0]]  # source nodes from first column
        dst = [X_mapping[index] for index in ppi.iloc[:, 1]]  # destination nodes from second column
        edge_index = torch.tensor([src, dst])

        # edge attributes from third column
        edge_attr = torch.tensor(ppi.iloc[:, 2].values, dtype=torch.int)

        # create data objects
        data_list = []
        for subject in tqdm(X_df.columns.tolist()[1:]):
            # take the column corresponding to each subject (X_df: genes x samples)
            ft = torch.tensor(X_df[subject].values, dtype=torch.float64)
            # take the label for each subject (labels_dict is {subject: label})
            label = labels_dict[subject]
            # edge_index values correspond to the row positions in the X_df matrix
            data = Data(x=ft, edge_index=edge_index, edge_attr=edge_attr, y=label, subject=subject)
            data_list.append(data)

        # apply the functions specified in pre_filter and pre_transform
        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        # save processed data
        data, slices = self.collate(data_list)  # torch.save(data_list, self.processed_paths[0]) -> this works!
        torch.save((data, slices), self.processed_paths[0])

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        data = torch.load(osp.join(self.processed_dir, 'data.pt'))
        return data
```
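One likely explanation for the length of 1, offered as an assumption rather than a confirmed diagnosis: `InMemoryDataset` already implements `len()` and `get()` on top of `self.slices`, so the two overrides at the end of the class replace that behaviour. With the overrides in place, `len()` counts processed files (one, `data.pt`) and `get()` reloads the entire saved object regardless of `idx`. The toy model below reproduces the effect in plain Python; the class names are hypothetical and nothing here uses PyG itself.

```python
class ToyInMemoryDataset:
    """Stand-in for a base class whose len()/get() rely on slice offsets."""

    def __init__(self, flat, slices):
        self.flat, self.slices = flat, slices

    def len(self):
        # One graph per consecutive pair of offsets.
        return len(self.slices) - 1

    def get(self, idx):
        return self.flat[self.slices[idx]:self.slices[idx + 1]]


class OverridingDataset(ToyInMemoryDataset):
    """Mirrors the overrides in the post: counts files, returns everything."""

    processed_file_names = ['data.pt']

    def len(self):
        return len(self.processed_file_names)

    def get(self, idx):
        # Ignores idx and hands back the whole collated storage.
        return self.flat


flat, slices = [10, 11, 12, 20, 21, 30], [0, 3, 5, 6]
inherited = ToyInMemoryDataset(flat, slices)
overridden = OverridingDataset(flat, slices)

print(inherited.len(), inherited.get(1))    # 3 [20, 21]
print(overridden.len(), overridden.get(0))  # 1 [10, 11, 12, 20, 21, 30]
```

If this is indeed the cause, removing the `len` and `get` methods from `MyOmicsDataset` and keeping the collated `(data, slices)` save should let `len(dataset)` report 578 and `dataset[i]` return a single graph.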

And this is how I instantiate the dataset:

```python
dataset = MyOmicsDataset(root="../data", X_file="rnaseq.csv", graph_file="ppi_score.csv", labels_file="labels.csv")
```
