Replies: 1 comment · 3 replies
Standardization and pre-processing are currently something we expect the user to handle. For example, you would first need to gather the per-column statistics `mean = dataset.data.x.mean(dim=0, keepdim=True)` and `std = dataset.data.x.std(dim=0, keepdim=True)`, and then use them to re-scale your features. We can also add support for this, similar to …
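A minimal sketch of the manual approach, assuming a homogeneous `InMemoryDataset` whose collated node features live in `dataset.data.x` (for `HeteroData` you would repeat this for every node and edge type you want to scale):

```python
# `dataset` is an already-processed InMemoryDataset; `dataset.data.x`
# stacks the node features of ALL graphs, so the statistics are global.
mean = dataset.data.x.mean(dim=0, keepdim=True)  # per-column mean
std = dataset.data.x.std(dim=0, keepdim=True)    # per-column std

# Column-wise standardization; clamp `std` so constant features
# do not cause a division by zero.
dataset.data.x = (dataset.data.x - mean) / std.clamp(min=1e-8)
```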
---
Hi PyG Community,
I was initially doing feature scaling and standardization (column-wise MinMax scaling) of my node and edge attributes in the `process()` method when creating my `InMemoryDataset` of `HeteroData`, back when it was going to be just one `data.pt` object with a single graph, accessed via `data[0]`. Now I'm planning to split my data and use the larger `Dataset` class, since the size is huge, so it is going to be a list of `HeteroData` objects, `MyDataset(30)`. This makes batching, loading, and the train/test split very easy for me. However, my concern is with the built-in transforms:
- `NormalizeFeatures`: normalizes row-wise, not column-wise.
- `NormalizeScale`: doesn't seem like what I need; it works on node positions.
- `GCNNorm`: not entirely sure, but it might be what I need based on a few examples? Please let me know if it is.

Also, looking at the `BaseTransform` description, it seems I can do the scaling explicitly, but I was looking for a way to do it as a transform while still considering all edges in all graphs when computing the scaling statistics (see the sketch below). I'm not sure an explicit per-graph approach does that.

I'm not sure why I haven't seen a lot of this (maybe the public datasets are already preprocessed), but is it not necessary to standardize your data for GNNs? I would love to know if it can be avoided, and whether normalizing instead makes no difference.
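To make that concrete, the explicit version I currently have in mind is roughly this (just a sketch; the edge type `('node_a', 'to', 'node_b')` is a placeholder for one of mine):

```python
import torch

# `MyDataset` is my own Dataset subclass holding 30 HeteroData objects.
dataset = MyDataset(root='data/')

# Concatenate the edge attributes of one edge type across ALL graphs,
# so the min/max are global rather than per-graph.
all_edge_attr = torch.cat(
    [data['node_a', 'to', 'node_b'].edge_attr for data in dataset], dim=0)

edge_min = all_edge_attr.min(dim=0, keepdim=True).values
edge_max = all_edge_attr.max(dim=0, keepdim=True).values
```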
Another thing I would like some confirmation on is custom transforms: can I create a custom `Transform`? For instance, my `edge_index` is `[2, 1000]` and my `edge_attr` is `[1000, 45]`. I wanted to know whether such scaling can be done through a transform, so that whenever I create a new input for a data object I can pass it through the transform, like we do in sklearn, and it takes care of the data processing.
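Roughly, I imagine the custom transform looking like this (a sketch only, not working code; `edge_min`/`edge_max` would be the global statistics computed above, and the edge type is again a placeholder):

```python
from torch_geometric.transforms import BaseTransform

class ColumnMinMaxScale(BaseTransform):
    """Column-wise min-max scaling of the edge attributes of one edge
    type, using min/max precomputed over all graphs in the dataset."""

    def __init__(self, edge_min, edge_max,
                 edge_type=('node_a', 'to', 'node_b')):
        self.edge_min = edge_min
        self.edge_max = edge_max
        self.edge_type = edge_type

    def __call__(self, data):
        store = data[self.edge_type]
        span = (self.edge_max - self.edge_min).clamp(min=1e-8)  # avoid /0
        store.edge_attr = (store.edge_attr - self.edge_min) / span
        return data

# Applied lazily on every access, like the built-in transforms:
# dataset = MyDataset(root='data/',
#                     transform=ColumnMinMaxScale(edge_min, edge_max))
```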
I would be grateful for any comments, suggestions, or advice on these two issues.