Commit a99c100

add graph structure analysis notebook
1 parent 9bbe9d3 commit a99c100

File tree

1 file changed

+382
-0
lines changed


notebooks/01_graph_analysis.ipynb

Lines changed: 382 additions & 0 deletions
@@ -0,0 +1,382 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Graph Structure Analysis\n",
"\n",
"Before training any GNN, it's worth inspecting the graph itself. The hypothesis is that fraud nodes have distinctive structural properties (higher degree, different centrality, etc.) that the GNN can pick up on."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sys\n",
"sys.path.insert(0, '..')\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import networkx as nx\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from src.data.dataset import create_synthetic_fraud_data\n",
"from src.data.graph_builder import TransactionGraphBuilder\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# generate synthetic transactions\n",
"df = create_synthetic_fraud_data(\n",
"    num_users=1000,\n",
"    num_merchants=200,\n",
"    num_transactions=10000,\n",
"    fraud_rate=0.05,\n",
")\n",
"\n",
"# build the PyG graph\n",
"builder = TransactionGraphBuilder()\n",
"pyg_data = builder.build_graph(df)\n",
"\n",
"print(f'Nodes: {pyg_data.num_nodes} ({pyg_data.num_users} users + {pyg_data.num_merchants} merchants)')\n",
"print(f'Edges: {pyg_data.edge_index.shape[1]} (bidirectional, so {pyg_data.edge_index.shape[1] // 2} unique)')\n",
"print(f'Node feature dim: {pyg_data.x.shape[1]}')\n",
"print(f'Fraud rate: {df[\"is_fraud\"].mean():.2%}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build a NetworkX graph for analysis\n",
"\n",
"PyG is great for training, but NetworkX is much easier to use for structural analysis. We build a bipartite graph (users <-> merchants) and tag each edge with a fraud label."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"G = nx.Graph()\n",
"\n",
"users = df['user_id'].unique()\n",
"merchants = df['merchant_id'].unique()\n",
"\n",
"G.add_nodes_from(users, bipartite=0, node_type='user')\n",
"G.add_nodes_from(merchants, bipartite=1, node_type='merchant')\n",
"\n",
"# add edges with a fraud attribute\n",
"for _, row in df.iterrows():\n",
"    if G.has_edge(row['user_id'], row['merchant_id']):\n",
"        # multiple transactions between the same user-merchant pair: keep max fraud\n",
"        G[row['user_id']][row['merchant_id']]['fraud'] = max(\n",
"            G[row['user_id']][row['merchant_id']]['fraud'], row['is_fraud']\n",
"        )\n",
"    else:\n",
"        G.add_edge(row['user_id'], row['merchant_id'], fraud=row['is_fraud'])\n",
"\n",
"print(f'NetworkX graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges')\n",
"print(f'Connected components: {nx.number_connected_components(G)}')"
]
},
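{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aside: the `iterrows` loop above does Python-level work per transaction. A vectorized sketch of the same construction (assuming `is_fraud` is numeric 0/1) aggregates fraud per user-merchant pair first:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# vectorized equivalent: one edge per user-merchant pair,\n",
"# fraud = max over their transactions (any fraudulent tx flags the edge)\n",
"edge_df = df.groupby(['user_id', 'merchant_id'])['is_fraud'].max().reset_index()\n",
"\n",
"G2 = nx.Graph()\n",
"G2.add_nodes_from(users, bipartite=0, node_type='user')\n",
"G2.add_nodes_from(merchants, bipartite=1, node_type='merchant')\n",
"G2.add_edges_from(\n",
"    (r.user_id, r.merchant_id, {'fraud': r.is_fraud})\n",
"    for r in edge_df.itertuples()\n",
")\n",
"\n",
"# sanity check: same edge set as the loop-built graph\n",
"assert G2.number_of_edges() == G.number_of_edges()"
]
},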
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Subgraph visualization\n",
"\n",
"Plotting the whole graph would be unreadable with ~1,200 nodes. Instead, grab a small subgraph around some fraud nodes and see if any patterns stand out visually."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# users involved in fraud\n",
"fraud_users = df[df['is_fraud'] == 1]['user_id'].unique()[:15]\n",
"\n",
"# grab their 1-hop neighborhood\n",
"subgraph_nodes = set(fraud_users)\n",
"for u in fraud_users:\n",
"    subgraph_nodes.update(G.neighbors(u))\n",
"\n",
"# also add some legit users for contrast\n",
"legit_users = df[df['is_fraud'] == 0]['user_id'].unique()[:20]\n",
"for u in legit_users:\n",
"    if u in G:\n",
"        subgraph_nodes.add(u)\n",
"        subgraph_nodes.update(list(G.neighbors(u))[:3])  # just a few neighbors\n",
"\n",
"sub = G.subgraph(subgraph_nodes)\n",
"print(f'Subgraph: {sub.number_of_nodes()} nodes, {sub.number_of_edges()} edges')\n",
"\n",
"# color nodes by type and fraud involvement\n",
"node_colors = []\n",
"for n in sub.nodes():\n",
"    if G.nodes[n].get('node_type') == 'merchant':\n",
"        node_colors.append('#999999')  # gray for merchants\n",
"    elif n in fraud_users:\n",
"        node_colors.append('#e74c3c')  # red for fraud users\n",
"    else:\n",
"        node_colors.append('#3498db')  # blue for legit users\n",
"\n",
"fig, ax = plt.subplots(figsize=(12, 8))\n",
"pos = nx.spring_layout(sub, seed=42, k=0.5)\n",
"nx.draw_networkx(\n",
"    sub, pos, ax=ax,\n",
"    node_color=node_colors,\n",
"    node_size=40,\n",
"    with_labels=False,\n",
"    edge_color='#cccccc',\n",
"    alpha=0.8,\n",
"    width=0.5,\n",
")\n",
"ax.set_title('Subgraph: red=fraud users, blue=legit users, gray=merchants')\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Degree distribution\n",
"\n",
"In the fraud-detection literature, fraudulent nodes sometimes show unusual connectivity patterns. Let's check."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"user_degrees = {u: G.degree(u) for u in users}\n",
"merchant_degrees = {m: G.degree(m) for m in merchants}\n",
"\n",
"fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
"\n",
"axes[0].hist(list(user_degrees.values()), bins=30, color='steelblue', alpha=0.7)\n",
"axes[0].set_xlabel('Degree')\n",
"axes[0].set_ylabel('Count')\n",
"axes[0].set_title('User Degree Distribution')\n",
"\n",
"axes[1].hist(list(merchant_degrees.values()), bins=30, color='coral', alpha=0.7)\n",
"axes[1].set_xlabel('Degree')\n",
"axes[1].set_ylabel('Count')\n",
"axes[1].set_title('Merchant Degree Distribution')\n",
"\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fraud vs legit node degree\n",
"\n",
"The real question: do users involved in fraud have a different degree distribution?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# tag users by whether they've been involved in any fraud\n",
"all_fraud_users = set(df[df['is_fraud'] == 1]['user_id'].unique())\n",
"\n",
"fraud_degrees = [user_degrees[u] for u in users if u in all_fraud_users]\n",
"legit_degrees = [user_degrees[u] for u in users if u not in all_fraud_users]\n",
"\n",
"print(f'Fraud users: {len(fraud_degrees)}, avg degree: {np.mean(fraud_degrees):.1f}')\n",
"print(f'Legit users: {len(legit_degrees)}, avg degree: {np.mean(legit_degrees):.1f}')\n",
"\n",
"fig, ax = plt.subplots(figsize=(8, 4))\n",
"ax.hist(legit_degrees, bins=20, alpha=0.6, label='Legit', color='steelblue', density=True)\n",
"ax.hist(fraud_degrees, bins=20, alpha=0.6, label='Fraud', color='indianred', density=True)\n",
"ax.set_xlabel('Degree (unique merchants)')\n",
"ax.set_ylabel('Density')\n",
"ax.set_title('User Degree: Fraud vs Legit')\n",
"ax.legend()\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Interesting: fraud users have a slightly higher average degree. That makes sense because the synthetic generator assigns fraud to transactions uniformly at random, so users with more transactions are more likely to be tagged. In real data the pattern could differ (e.g., fraud rings spreading activity across many accounts)."
]
},
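{
"cell_type": "markdown",
"metadata": {},
"source": [
"To check that the degree gap is more than noise, a rank-based test is a quick sketch (assumes `scipy` is installed; it isn't imported above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import mannwhitneyu  # assumption: scipy is available\n",
"\n",
"stat, p = mannwhitneyu(fraud_degrees, legit_degrees, alternative='two-sided')\n",
"print(f'Mann-Whitney U: stat={stat:.0f}, p={p:.3g}')\n",
"# a small p-value suggests the two degree distributions differ;\n",
"# with uniformly random fraud labels, expect only a mild shift"
]
},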
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Node feature distributions\n",
"\n",
"Now look at the node features computed by the graph builder and check whether fraud-adjacent nodes look different."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# node features live in pyg_data.x:\n",
"# the first num_users rows are users, the rest are merchants\n",
"feature_names = [\n",
"    'tx_count', 'avg_amount', 'std_amount', 'max_amount',\n",
"    'unique_merchants', 'merch_tx_count', 'merch_avg_amount', 'merch_unique_users'\n",
"]\n",
"\n",
"user_feats = pyg_data.x[:pyg_data.num_users].numpy()\n",
"\n",
"# map the fraud flag to users\n",
"user_list = df['user_id'].unique()  # same order as builder.user_mapping\n",
"is_fraud_user = np.array([1 if u in all_fraud_users else 0 for u in user_list])\n",
"\n",
"fig, axes = plt.subplots(2, 2, figsize=(10, 8))\n",
"for idx, ax in enumerate(axes.flat):\n",
"    feat = user_feats[:, idx]\n",
"    ax.hist(feat[is_fraud_user == 0], bins=25, alpha=0.6, label='Legit', color='steelblue', density=True)\n",
"    ax.hist(feat[is_fraud_user == 1], bins=25, alpha=0.6, label='Fraud', color='indianred', density=True)\n",
"    ax.set_title(feature_names[idx])\n",
"    ax.legend(fontsize=8)\n",
"\n",
"plt.suptitle('User Node Features: Fraud vs Legit', y=1.02)\n",
"plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `max_amount` and `avg_amount` features clearly separate fraud users. That's expected, since the synthetic generator scales fraud amounts by 3x; it confirms the graph builder is capturing this signal."
]
},
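{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to quantify how well each raw feature separates the classes on its own is a per-feature ROC AUC (a rough sketch; assumes `scikit-learn` is installed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import roc_auc_score  # assumption: scikit-learn is available\n",
"\n",
"# score each user feature individually; 0.5 means no separation\n",
"for idx in range(4):  # the four user features plotted above\n",
"    auc = roc_auc_score(is_fraud_user, user_feats[:, idx])\n",
"    print(f'{feature_names[idx]:>16}: AUC = {auc:.3f}')"
]
},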
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Centrality analysis\n",
"\n",
"Check a few centrality metrics. These are slow on the full bipartite graph, so work on the user projection instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# project to a user-only graph (two users are connected if they share a merchant)\n",
"user_proj = nx.bipartite.projected_graph(G, users)\n",
"print(f'User projection: {user_proj.number_of_nodes()} nodes, {user_proj.number_of_edges()} edges')\n",
"\n",
"# degree centrality\n",
"deg_cent = nx.degree_centrality(user_proj)\n",
"\n",
"# clustering coefficient\n",
"clustering = nx.clustering(user_proj)\n",
"\n",
"# exact betweenness is slow on large graphs, so approximate with k sampled sources\n",
"betweenness = nx.betweenness_centrality(user_proj, k=100, seed=42)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, axes = plt.subplots(1, 3, figsize=(14, 4))\n",
"\n",
"metrics = [\n",
"    ('Degree Centrality', deg_cent),\n",
"    ('Clustering Coeff', clustering),\n",
"    ('Betweenness Centrality', betweenness),\n",
"]\n",
"\n",
"for ax, (name, metric) in zip(axes, metrics):\n",
"    fraud_vals = [metric.get(u, 0) for u in users if u in all_fraud_users]\n",
"    legit_vals = [metric.get(u, 0) for u in users if u not in all_fraud_users]\n",
"\n",
"    ax.hist(legit_vals, bins=25, alpha=0.6, label='Legit', color='steelblue', density=True)\n",
"    ax.hist(fraud_vals, bins=25, alpha=0.6, label='Fraud', color='indianred', density=True)\n",
"    ax.set_title(name)\n",
"    ax.legend(fontsize=8)\n",
"\n",
"plt.tight_layout()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# quick summary stats\n",
"print('=== Degree Centrality ===')\n",
"print(f'  Fraud mean: {np.mean([deg_cent.get(u, 0) for u in users if u in all_fraud_users]):.4f}')\n",
"print(f'  Legit mean: {np.mean([deg_cent.get(u, 0) for u in users if u not in all_fraud_users]):.4f}')\n",
"\n",
"print('\\n=== Clustering Coefficient ===')\n",
"print(f'  Fraud mean: {np.mean([clustering.get(u, 0) for u in users if u in all_fraud_users]):.4f}')\n",
"print(f'  Legit mean: {np.mean([clustering.get(u, 0) for u in users if u not in all_fraud_users]):.4f}')\n",
"\n",
"print('\\n=== Betweenness Centrality ===')\n",
"print(f'  Fraud mean: {np.mean([betweenness.get(u, 0) for u in users if u in all_fraud_users]):.4f}')\n",
"print(f'  Legit mean: {np.mean([betweenness.get(u, 0) for u in users if u not in all_fraud_users]):.4f}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The centrality differences are small, which makes sense: the synthetic data assigns fraud uniformly at random. In real-world fraud you'd expect fraud rings (dense subgraphs) and higher betweenness for money-mule accounts. The GNN should still be able to combine these structural signals with the node/edge features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Takeaways\n",
"\n",
"- The graph is bipartite (users <-> merchants) and fairly well connected, with 10k transactions over ~1,200 nodes\n",
"- Fraud users have slightly higher degree and larger amount features (expected from the synthetic generator)\n",
"- Centrality metrics show only small differences, so the GNN will need to combine topology with features\n",
"- Next step: train GraphSAGE and see whether message passing actually helps over a flat MLP baseline"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.10.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
