"Before training any GNN, let's actually look at what the graph looks like. The idea is that fraud nodes might have different structural properties (higher degree, different centrality, etc.) that the GNN can pick up on."
"PyG is great for training but networkx is way easier for structural analysis. Let's build a bipartite graph (users <-> merchants) and tag edges with fraud labels."
"Plotting the whole graph would be a mess with 1200 nodes. Let's grab a small subgraph around some fraud nodes and see if we can spot any patterns visually."
"Interesting -- fraud users tend to have slightly higher degree on average. Makes sense because in the synthetic data, fraud transactions are spread across users randomly so users with more transactions have a higher chance of being tagged. In real data the pattern might be different (e.g., fraud rings using many accounts)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Node feature distributions\n",
"\n",
"Let's look at the computed node features from the graph builder and see if fraud-adjacent nodes look different."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the node features are stored in pyg_data.x\n",
"# first num_users rows are users, rest are merchants\n",
"plt.suptitle('User Node Features: Fraud vs Legit', y=1.02)\n",
"plt.tight_layout()"
]
},
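A minimal sketch of that user/merchant split and per-group comparison, with random numbers standing in for `pyg_data.x` and invented feature names (the real graph builder's features may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for pyg_data.x: rows 0..num_users-1 are users, the rest merchants;
# feature names are assumptions mirroring the notebook's discussion
num_users, num_merchants = 1000, 200
feature_names = ["degree", "avg_amount", "max_amount"]
x = rng.random((num_users + num_merchants, len(feature_names)))
fraud_mask = rng.random(num_users) < 0.05  # per-user fraud label

user_x = x[:num_users]        # user rows only
fraud_x = user_x[fraud_mask]
legit_x = user_x[~fraud_mask]

# compare each feature's mean across the two groups
for j, name in enumerate(feature_names):
    print(f"{name:>10}: fraud mean {fraud_x[:, j].mean():.3f}  "
          f"legit mean {legit_x[:, j].mean():.3f}")
```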
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `max_amount` and `avg_amount` features clearly separate fraud users -- that's expected since we multiply fraud amounts by 3x in the synthetic generator. Good to confirm the graph builder is capturing this."
]
},
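To see why a 3x multiplier shows up so cleanly in the amount features, here's a toy version of that amount rule (the lognormal parameters are invented, not the generator's actual ones):

```python
import numpy as np

rng = np.random.default_rng(1)

# sketch of the synthetic amount rule: legit amounts from some base
# distribution, fraud amounts scaled by the 3x multiplier mentioned above
legit = rng.lognormal(mean=3.0, sigma=0.5, size=10_000)
fraud = 3.0 * rng.lognormal(mean=3.0, sigma=0.5, size=500)

# the scaling shifts the whole fraud distribution, so per-user
# avg_amount and max_amount inherit the gap
print(f"legit mean {legit.mean():.1f}, fraud mean {fraud.mean():.1f}")
```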
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Centrality analysis\n",
"\n",
"Let's check a few centrality metrics. This takes a bit on the full graph so we'll use the user projection."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# project to user-only graph (two users connected if they share a merchant)\n",
"print(f' Fraud mean: {np.mean([deg_cent.get(u, 0) for u in users if u in all_fraud_users]):.4f}')\n",
"print(f' Legit mean: {np.mean([deg_cent.get(u, 0) for u in users if u not in all_fraud_users]):.4f}')\n",
"\n",
"print('\\n=== Clustering Coefficient ===')\n",
"print(f' Fraud mean: {np.mean([clustering.get(u, 0) for u in users if u in all_fraud_users]):.4f}')\n",
"print(f' Legit mean: {np.mean([clustering.get(u, 0) for u in users if u not in all_fraud_users]):.4f}')\n",
"\n",
"print('\\n=== Betweenness Centrality ===')\n",
"print(f' Fraud mean: {np.mean([betweenness.get(u, 0) for u in users if u in all_fraud_users]):.4f}')\n",
"print(f' Legit mean: {np.mean([betweenness.get(u, 0) for u in users if u not in all_fraud_users]):.4f}')"
]
},
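A self-contained sketch of the projection step on a toy bipartite graph, using `networkx.algorithms.bipartite.projected_graph` (the node names and edges here are made up):

```python
import networkx as nx
from networkx.algorithms import bipartite

# toy bipartite graph: users u0..u5, merchants m0..m2
B = nx.Graph()
B.add_edges_from([
    ("u0", "m0"), ("u1", "m0"), ("u2", "m0"),
    ("u2", "m1"), ("u3", "m1"),
    ("u3", "m2"), ("u4", "m2"), ("u5", "m2"),
])
users = [n for n in B if n.startswith("u")]

# project onto users: two users connected if they share a merchant
P = bipartite.projected_graph(B, users)

# the same three metrics the cell above prints
deg_cent = nx.degree_centrality(P)
clustering = nx.clustering(P)
betweenness = nx.betweenness_centrality(P)
for u in sorted(users):
    print(f"{u}: deg={deg_cent[u]:.2f} "
          f"clust={clustering[u]:.2f} btw={betweenness[u]:.2f}")
```

On a dense real graph, `bipartite.weighted_projected_graph` also records how many merchants each user pair shares, which can be a useful edge weight.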
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The centrality differences aren't huge, which makes sense -- the synthetic data generates fraud uniformly at random. In real-world fraud you'd expect to see fraud rings (dense subgraphs) and higher betweenness for money-mule accounts. The GNN should still be able to learn such structural signals where they exist, in combination with the node/edge features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Takeaways\n",
"\n",
"- Graph is bipartite (users <-> merchants), pretty well connected since we have 10k transactions over 1200 nodes\n",
"- Fraud users have slightly higher degree and higher amount features (expected from the synthetic generator)\n",
"- Centrality metrics show small differences -- the GNN will need to combine topology with features\n",
"- Next step: train GraphSAGE and see if message passing actually helps vs a flat MLP baseline"