Commit 20c591c

don't automatically batch graphs inside vector when using getobs (#183)

* don't automatically batch graphs inside a vector when using `getobs`
* add tests
* bump version
* nice error

1 parent afd80e8 · commit 20c591c
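In short: indexing a `Vector{<:GNNGraph}` through `getobs` no longer batches the selected graphs implicitly; batching becomes an explicit step, or is delegated to the `DataLoader` via `collate=true`. A minimal before/after sketch (illustrative only, not part of the diff; assumes toy graphs from `rand_graph`):

```julia
using GraphNeuralNetworks, MLUtils
using Flux.Data: DataLoader

graphs = [rand_graph(10, 20) for _ in 1:32]   # toy Vector{GNNGraph}

# Before this commit, multi-observation indexing silently collated:
#   getobs(graphs, 1:4)  returned one batched GNNGraph with num_graphs == 4
# After this commit, it returns the plain sub-vector:
obs = getobs(graphs, 1:4)             # Vector{<:GNNGraph} of length 4

# Batching is now explicit ...
gbatch = MLUtils.batch(graphs[1:4])   # one GNNGraph, gbatch.num_graphs == 4

# ... or handled by the loader itself:
loader = DataLoader(graphs; batchsize=4, shuffle=true, collate=true)
first(loader)                         # a single batched GNNGraph per iteration
```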

File tree

10 files changed: +85 −61 lines changed

Project.toml

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 name = "GraphNeuralNetworks"
 uuid = "cffab07f-9bc2-4db1-8861-388f63bf7694"
 authors = ["Carlo Lucibello and contributors"]
-version = "0.4.5"
+version = "0.5.0"

 [deps]
 Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"

docs/src/index.md

Lines changed: 10 additions & 6 deletions

@@ -26,7 +26,7 @@ Usage examples on real datasets can be found in the [examples](https://github.co
 We create a dataset consisting in multiple random graphs and associated data features.

 ```julia
-using GraphNeuralNetworks, Graphs, Flux, CUDA, Statistics
+using GraphNeuralNetworks, Graphs, Flux, CUDA, Statistics, MLUtils
 using Flux.Data: DataLoader

 all_graphs = GNNGraph[]

@@ -60,13 +60,17 @@ opt = Adam(1f-4)
 ### Training

 Finally, we use a standard Flux training pipeline to fit our dataset.
-Flux's `DataLoader` iterates over mini-batches of graphs
-(batched together into a `GNNGraph` object).
+We use Flux's `DataLoader` to iterate over mini-batches of graphs
+that are glued together into a single `GNNGraph` using the [`MLUtils.batch`](@ref) method. This is what happens under the hood when creating a `DataLoader` with the
+`collate=true` option.

 ```julia
-train_size = round(Int, 0.8 * length(all_graphs))
-train_loader = DataLoader(all_graphs[1:train_size], batchsize=32, shuffle=true)
-test_loader = DataLoader(all_graphs[train_size+1:end], batchsize=32, shuffle=false)
+train_graphs, test_graphs = MLUtils.split(all_graphs, at=0.8)
+
+train_loader = DataLoader(train_graphs,
+                          batchsize=32, shuffle=true, collate=true)
+test_loader = DataLoader(test_graphs,
+                         batchsize=32, shuffle=false, collate=true)

 loss(g::GNNGraph) = mean((vec(model(g, g.ndata.x)) - g.gdata.y).^2)
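Since `collate=true` makes each iteration yield one pre-batched `GNNGraph`, the training loop in this document can consume the loader directly. A hedged sketch using only names the page itself defines (`model`, `opt`, `loss`, `train_loader`); the epoch count is arbitrary:

```julia
ps = Flux.params(model)
for epoch in 1:100
    for g in train_loader            # g is already one batched GNNGraph
        grad = gradient(() -> loss(g), ps)
        Flux.Optimise.update!(opt, ps, grad)
    end
end
```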

docs/src/tutorials/graph_classification_pluto.jl

Lines changed: 42 additions & 34 deletions

@@ -1,5 +1,5 @@
 ### A Pluto.jl notebook ###
-# v0.19.5
+# v0.19.6

 #> [frontmatter]
 #> title = "Graph Classification with Graph Neural Networks"

@@ -13,12 +13,13 @@ using InteractiveUtils
 begin
     using Pkg
     Pkg.activate(; temp=true)
-    packages = [
+    Pkg.add([
         PackageSpec(; path=joinpath(@__DIR__,"..","..","..")),
         PackageSpec(; name="Flux", version="0.13"),
         PackageSpec(; name="MLDatasets", version="0.7"),
-    ]
-    Pkg.add(packages)
+        PackageSpec(; name="MLUtils"),
+    ])
+    Pkg.develop("GraphNeuralNetworks")
 end

 # ╔═╡ 361e0948-d91a-11ec-2d95-2db77435a0c1
@@ -29,6 +30,7 @@ begin
     using Flux.Data: DataLoader
     using GraphNeuralNetworks
     using MLDatasets
+    using MLUtils
     using LinearAlgebra, Random, Statistics
     ENV["DATADEPS_ALWAYS_ACCEPT"] = "true" # don't ask for dataset download confirmation
     Random.seed!(17) # for reproducibility

@@ -73,8 +75,11 @@ This dataset provides **188 different graphs**, and the task is to classify each
 By inspecting the first graph object of the dataset, we can see that it comes with **17 nodes** and **38 edges**.
 It also comes with exactly **one graph label**, and provides additional node labels (7 classes) and edge labels (4 classes).
 However, for the sake of simplicity, we will not make use of edge labels.
+"""

-We have some useful utilities for working with graph datasets, *e.g.*, we can shuffle the dataset and use the first 150 graphs as training graphs, while using the remaining ones for testing:
+# ╔═╡ 7f7750ff-b7fa-4fe2-a5a8-6c9c26c479bb
+md"""
+We now convert the MLDatasets.jl graph types to our `GNNGraph`s and we also onehot encode both the node labels (which will be used as input features) and the graph labels (what we want to predict):
 """

 # ╔═╡ 936c09f6-ee62-4bc2-a0c6-749a66080fd2
@@ -84,19 +89,27 @@ begin
         ndata=Float32.(onehotbatch(g.ndata.targets, 0:6)),
         edata=nothing)
         for g in graphs]
+    y = onehotbatch(dataset.graph_data.targets, [-1, 1])
 end

+# ╔═╡ 2c6ccfdd-cf11-415b-b398-95e5b0b2bbd4
+md"""We have some useful utilities for working with graph datasets, *e.g.*, we can shuffle the dataset and use the first 150 graphs as training graphs, while using the remaining ones for testing:
+"""
+
 # ╔═╡ 519477b2-8323-4ece-a7eb-141e9841117c
+train_data, test_data = splitobs((graphs, y), at=150, shuffle=true) |> getobs
+
+# ╔═╡ 3c3d5038-0ef6-47d7-a1b7-50880c5f3a0b
 begin
-    shuffled_idxs = randperm(length(graphs))
-    train_idxs = shuffled_idxs[1:150]
-    test_idxs = shuffled_idxs[151:end]
-    train_graphs = graphs[train_idxs]
-    test_graphs = graphs[test_idxs]
-    ytrain = onehotbatch(dataset.graph_data.targets[train_idxs], [-1, 1])
-    ytest = onehotbatch(dataset.graph_data.targets[test_idxs], [-1, 1])
+    train_loader = DataLoader(train_data, batchsize=64, shuffle=true)
+    test_loader = DataLoader(test_data, batchsize=64, shuffle=false)
 end

+# ╔═╡ f7778e2d-2e2a-4fc8-83b0-5242e4ec5eb4
+md"""
+Here, we opt for a `batch_size` of 64, leading to 3 (randomly shuffled) mini-batches, containing all ``2 \cdot 64+22 = 150`` graphs.
+"""
+
 # ╔═╡ 2a1c501e-811b-4ddd-887b-91e8c929c8b7
 md"""
 ## Mini-batching of graphs

@@ -114,35 +127,27 @@ This procedure has some crucial advantages over other batching procedures:

 2. There is no computational or memory overhead since adjacency matrices are saved in a sparse fashion holding only non-zero entries, *i.e.*, the edges.

-GNN.jl can **batch multiple graphs into a single giant graph** with the help of Flux's `DataLoader`:
+GNN.jl can **batch multiple graphs into a single giant graph**:
 """


-# ╔═╡ c202e3b7-1f39-496a-98e7-e03ada53b5c7
-begin
-    train_loader = DataLoader((train_graphs, ytrain), batchsize=64, shuffle=true)
-    test_loader = DataLoader((test_graphs, ytest), batchsize=64, shuffle=false)
-end
-
 # ╔═╡ a142610a-d862-42a9-88af-c8d8b6825650
-first(train_loader)
+vec_gs, _ = first(train_loader)

 # ╔═╡ 6faaf637-a0ff-468c-86b5-b0a7250258d6
-collect(train_loader)
-
-# ╔═╡ 6cc5e766-ddcd-4547-b69c-6435428caf44
-first(train_loader)[1]
+MLUtils.batch(vec_gs)

-# ╔═╡ ac69571a-998b-4630-afd6-f3d405618bc5
+# ╔═╡ e314b25f-e904-4c39-bf60-24cddf91fe9d
 md"""
-Here, we opt for a `batch_size` of 64, leading to 3 (randomly shuffled) mini-batches, containing all ``2 \cdot 64+22 = 150`` graphs.
-
-Furthermore, each batched graph object is equipped with a **`graph_indicator` vector**, which maps each node to its respective graph in the batch:
+Each batched graph object is equipped with a **`graph_indicator` vector**, which maps each node to its respective graph in the batch:

 ```math
 \textrm{graph-indicator} = [1, \ldots, 1, 2, \ldots, 2, 3, \ldots ]
 ```
+"""

+# ╔═╡ ac69571a-998b-4630-afd6-f3d405618bc5
+md"""
 ## Training a Graph Neural Network (GNN)

 Training a GNN for graph classification usually follows a simple recipe:

@@ -186,7 +191,7 @@ function eval_loss_accuracy(model, data_loader, device)
     acc = 0.
     ntot = 0
     for (g, y) in data_loader
-        g, y = g |> device, y |> device
+        g, y = MLUtils.batch(g) |> device, y |> device
         n = length(y)
         ŷ = model(g, g.ndata.x)
         loss += logitcrossentropy(ŷ, y) * n

@@ -214,7 +219,7 @@ function train!(model; epochs=200, η=1e-2, infotime=10)
     report(0)
     for epoch in 1:epochs
         for (g, y) in train_loader
-            g, y = g |> device, y |> device
+            g, y = MLUtils.batch(g) |> device, y |> device
             gs = Flux.gradient(ps) do
                 ŷ = model(g, g.ndata.x)
                 logitcrossentropy(ŷ, y)

@@ -266,22 +271,25 @@ You have learned how graphs can be batched together for better GPU utilization,
 """

 # ╔═╡ Cell order:
-# ╟─c97a0002-2253-45b6-9266-017189dbb6fe
+# ╠═c97a0002-2253-45b6-9266-017189dbb6fe
 # ╠═361e0948-d91a-11ec-2d95-2db77435a0c1
 # ╟─15136fd8-f9b2-4841-9a95-9de7b8969687
 # ╠═f6e86958-e96f-4c77-91fc-c72d8967575c
 # ╠═24f76360-8599-46c8-a49f-4c31f02eb7d8
 # ╠═5d5e5152-c860-4158-8bc7-67ee1022f9f8
 # ╠═33163dd2-cb35-45c7-ae5b-d4854d141773
 # ╠═a8d6a133-a828-4d51-83c4-fb44f9d5ede1
-# ╟─3b3e0a79-264b-47d7-8bda-2a6db7290828
+# ╠═3b3e0a79-264b-47d7-8bda-2a6db7290828
+# ╠═7f7750ff-b7fa-4fe2-a5a8-6c9c26c479bb
 # ╠═936c09f6-ee62-4bc2-a0c6-749a66080fd2
+# ╟─2c6ccfdd-cf11-415b-b398-95e5b0b2bbd4
 # ╠═519477b2-8323-4ece-a7eb-141e9841117c
+# ╠═3c3d5038-0ef6-47d7-a1b7-50880c5f3a0b
+# ╟─f7778e2d-2e2a-4fc8-83b0-5242e4ec5eb4
 # ╟─2a1c501e-811b-4ddd-887b-91e8c929c8b7
-# ╠═c202e3b7-1f39-496a-98e7-e03ada53b5c7
 # ╠═a142610a-d862-42a9-88af-c8d8b6825650
 # ╠═6faaf637-a0ff-468c-86b5-b0a7250258d6
-# ╠═6cc5e766-ddcd-4547-b69c-6435428caf44
+# ╟─e314b25f-e904-4c39-bf60-24cddf91fe9d
 # ╟─ac69571a-998b-4630-afd6-f3d405618bc5
 # ╠═04402032-18a4-42b5-ad04-19b286bd29b7
 # ╟─2313fd8d-6e84-4bde-bacc-fb697dc33cbb
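For readers skimming the diff, the `graph_indicator` bookkeeping that the tutorial text describes is easy to see on a toy batch (an illustrative sketch, not part of the commit):

```julia
using GraphNeuralNetworks, MLUtils

gs = [rand_graph(4, 6), rand_graph(3, 4), rand_graph(5, 8)]
gbatch = MLUtils.batch(gs)

gbatch.num_graphs        # 3
gbatch.num_nodes         # 12 == 4 + 3 + 5
gbatch.graph_indicator   # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
```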

examples/Project.toml

Lines changed: 2 additions & 1 deletion

@@ -6,13 +6,14 @@ Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
 GraphNeuralNetworks = "cffab07f-9bc2-4db1-8861-388f63bf7694"
 Graphs = "86223c79-3864-5bf0-83f7-82e725a168b6"
 MLDatasets = "eb30cadb-4394-5ae3-aed4-317e484a6458"
+MLUtils = "f1d291b0-491e-4a28-83b9-f70985020b54"
 NNlib = "872c559c-99b0-510c-b3b7-b6c96a88d5cd"
 NNlibCUDA = "a00861dc-f156-4864-bf3c-e6376f28a68d"

 [compat]
 DiffEqFlux = "1.45"
 Flux = "0.13"
+GraphNeuralNetworks = "0.5"
 Graphs = "1"
-GraphNeuralNetworks = "0.4"
 MLDatasets = "0.6, 0.7"
 julia = "1.7"

examples/graph_classification_tudataset.jl

Lines changed: 7 additions & 9 deletions

@@ -15,9 +15,8 @@ function eval_loss_accuracy(model, data_loader, device)
     loss = 0.
     acc = 0.
     ntot = 0
-    for (graphs, y) in data_loader
-        g = Flux.batch(graphs) |> device
-        y = y |> device
+    for (g, y) in data_loader
+        g, y = (g, y) |> device
         n = length(y)
         ŷ = model(g, g.ndata.x) |> vec
         loss += logitbinarycrossentropy(ŷ, y) * n

@@ -66,10 +65,10 @@ function train(; kws...)
     NUM_TRAIN = 150

     dataset = getdataset()
-    train_data, test_data = splitobs(dataset, at=NUM_TRAIN/numobs(dataset), shuffle=true)
+    train_data, test_data = splitobs(dataset, at=NUM_TRAIN, shuffle=true)

-    train_loader = DataLoader(train_data, batchsize=args.batchsize, shuffle=true)
-    test_loader = DataLoader(test_data, batchsize=args.batchsize, shuffle=false)
+    train_loader = DataLoader(train_data; args.batchsize, shuffle=true, collate=true)
+    test_loader = DataLoader(test_data; args.batchsize, shuffle=false, collate=true)

     # DEFINE MODEL

@@ -96,9 +95,8 @@ function train(; kws...)

     report(0)
     for epoch in 1:args.epochs
-        for (graphs, y) in train_loader
-            g = Flux.batch(graphs) |> device
-            y = y |> device
+        for (g, y) in train_loader
+            g, y = (g, y) |> device
             gs = Flux.gradient(ps) do
                 ŷ = model(g, g.ndata.x) |> vec
                 logitbinarycrossentropy(ŷ, y)
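The simplification in this script follows from `collate=true`: the loader now yields an already-collated `(g, y)` tuple, so a single `|> device` moves both pieces at once (Flux's `gpu`/`cpu` map over tuples). Illustrative, using the script's own `train_loader`:

```julia
g, y = first(train_loader)    # g::GNNGraph (already batched), y labels
g, y = (g, y) |> Flux.gpu     # one call moves graph and labels together
```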

src/GNNGraphs/gnngraph.jl

Lines changed: 0 additions & 6 deletions

@@ -236,12 +236,6 @@
 MLUtils.numobs(g::GNNGraph) = g.num_graphs
 MLUtils.getobs(g::GNNGraph, i) = getgraph(g, i)

-# DataLoader compatibility passing a vector of graphs and
-# effectively using `batch` as a collated function.
-MLUtils.numobs(data::Vector{<:GNNGraph}) = length(data)
-MLUtils.getobs(data::Vector{<:GNNGraph}, i::Int) = data[i]
-MLUtils.getobs(data::Vector{<:GNNGraph}, i) = Flux.batch(data[i])
-

 #########################
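With these specialized methods deleted, a `Vector{<:GNNGraph}` falls back to MLUtils' generic array handling, which indexes without batching. Roughly (a paraphrase of the fallback behavior, not the exact MLUtils source):

```julia
# numobs(data::AbstractVector)     -> length(data)
# getobs(data::AbstractVector, i)  -> data[i]   (a sub-vector; no batching)
```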

src/GNNGraphs/transform.jl

Lines changed: 3 additions & 0 deletions

@@ -432,6 +432,9 @@ function Flux.batch(gs::AbstractVector{<:GNNGraph{T}}) where T<:COO_T
     )
 end

+Flux.batch(g::GNNGraph) =
+    throw(ArgumentError("Cannot batch a `GNNGraph` (containing $(g.num_graphs) graphs). Pass a vector of `GNNGraph`s instead."))
+
 """
     unbatch(g::GNNGraph)
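The new method turns a likely user mistake, calling `batch` on an already-batched graph, into a clear error. What a user would now see (sketch based on the error string added above):

```julia
g = MLUtils.batch([rand_graph(4, 6), rand_graph(5, 8)])
Flux.batch(g)
# ERROR: ArgumentError: Cannot batch a `GNNGraph` (containing 2 graphs).
# Pass a vector of `GNNGraph`s instead.
```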

src/layers/basic.jl

Lines changed: 6 additions & 0 deletions

@@ -11,6 +11,12 @@ abstract type GNNLayer end
 # To be specialized by layers also needing edge features as input (e.g. NNConv).
 (l::GNNLayer)(g::GNNGraph) = GNNGraph(g, ndata=l(g, node_features(g)))

+function (l::GNNLayer)(g::AbstractVector{<:GNNGraph}, args...; kws...)
+    @warn "Passing an array of graphs to a `GNNLayer` is discouraged.
+           Explicitely call `MLUtils.batch(graphs)` first instead." maxlog=1
+    return l(batch(g), args...; kws...)
+end
+

 """
     WithGraph(model, g::GNNGraph; traingraph=false)
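This fallback keeps vector inputs working (with a one-time warning) while steering users toward the explicit call. A usage sketch mirroring the new test added in test/layers/basic.jl below:

```julia
gs = [rand_graph(5, 6, ndata=rand(Float32, 2, 5)) for _ in 1:4]
l = GCNConv(2 => 3)
x = rand(Float32, 2, 20)   # 4 graphs × 5 nodes = 20 feature columns

y1 = l(gs, x)                   # warns once, batches internally
y2 = l(MLUtils.batch(gs), x)    # preferred explicit form; y1 == y2
# both of size (3, 20)
```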

test/GNNGraphs/gnngraph.jl

Lines changed: 5 additions & 4 deletions

@@ -269,13 +269,14 @@
         @test first(d) == getgraph(g, 1:2)
     end

-    @testset "pass to dataloader and collate" begin
-        @test MLUtils.getobs(data, 3) == getgraph(g, 3)
-        @test MLUtils.getobs(data, 3:5) == getgraph(g, 3:5)
+    @testset "pass to dataloader and no automatic collation" begin
+        @test MLUtils.getobs(data, 3) == data[3]
+        @test MLUtils.getobs(data, 3:5) isa Vector{<:GNNGraph}
+        @test MLUtils.getobs(data, 3:5) == [data[3], data[4], data[5]]
         @test MLUtils.numobs(data) == g.num_graphs

         d = Flux.Data.DataLoader(data, batchsize=2, shuffle=false)
-        @test first(d) == getgraph(g, 1:2)
+        @test first(d) == [data[1], data[2]]
     end
 end

test/layers/basic.jl

Lines changed: 9 additions & 0 deletions

@@ -93,5 +93,14 @@
         params, restructure = Flux.destructure(chain)
         @test restructure(params) isa GNNChain
     end
+    @testset "GNNGraph array input" begin
+        gs = [rand_graph(5, 6, ndata=rand(2, 5), graph_type=GRAPH_T) for _ in 1:4]
+        l = GCNConv(2 => 3)
+        y = l(gs, rand(2, 20))
+        @test size(y) == (3, 20)
+
+        gout = l(gs)
+        @test size(gout.ndata.x) == (3, 20)
+    end
 end
