pagerank-cuda-dynamic/aa-appendix.tex at main · merferry/pagerank-cuda-dynamic · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
\section{Appendix}

\ignore{\subsection{Our ND PageRank implementation}}

\ignore{Algorithm \ref{alg:naive} outlines the psuedocode of our GPU-based Naive-dynamic PageRank. It takes as input the transpose of the current graph snapshot $G^{t'}$ and the previous rank vector $R^{t-1}$, and returns the updated rank vector $R$. We start by initializing the rank vectors $R$ and $R_{new}$ with the previous rank vector $R^{t-1}$ (line \ref{alg:frontier--initialize}). The rest of the algorithm is similar to Static PageRank (Section \ref{sec:static-impl}).}

\ignore{\input{src/alg-naive}}
\ignore{\input{src/alg-traversal}}


\subsection{Updating ranks of vertices in parallel}
\label{sec:update}

Algorithm \ref{alg:update} provides a psuedocode for updating the ranks of vertices in parallel. Here, the function \texttt{updateRanks()} takes as input the set of affected vertices $\delta_V$, affected neighbor vertices $\delta_N$, the previous and current rank vectors $R$ and $R_{new}$, respectively, the current input graph $G^t$, partitioned vertex IDs $P'$ (low in-degree vertices come first), and the number of vertices with low in-degree $N'_P$. It also requires parameters such as the damping factor $\alpha$, frontier tolerance $\tau_f$, and prune tolerance $\tau_p$.

The algorithm operates in two phases: a thread-per-vertex kernel (for low degree vertices) and a block-per-vertex kernel (for high degree vertices). In the \textit{thread-per-vertex kernel} (lines \ref{alg:update--thread-begin}-\ref{alg:update--thread-end}), we use each thread to process each low degree vertex in parallel, iterating over the partitioned vertex IDs ($P'$). For each vertex $v$, if it's not marked as affected, we skip it (line \ref{alg:update--affected}). Otherwise, we compute the new rank $r$ based on the incoming edges and ranks of neighboring vertices (lines \ref{alg:update--rank-begin}-\ref{alg:update--rank-end}). Depending on whether DF or DF-P PageRank is selected, we compute ranks using either Equation \ref{eq:pr} or \ref{eq:pr-prune}, respectively. Next, we compute the change in rank $\Delta r$ of the current vertex $v$ from its previous iteration (line \ref{alg:update--change}). If the relative change in rank of $v$, i.e., $\Delta r / \max(r, R[v])$, is within the prune tolerance $\tau_p$, we perform pruning by marking $v$ and no longer affected. However, if the relative change in rank of $v$ is above the frontier tolerance $\tau_f$, we mark the vertices whose neighbors must be incrementally marked as affected (the incremental marking of affected vertices is performed at a later point of time, using the \texttt{expandAffected()} function (Algorithm \ref{alg:affected}). Finally, we update the rank of vertex $v$ in the $R$ vector. In the \textit{block-per-vertex kernel} (lines \ref{alg:update--block-begin}-\ref{alg:update--block-end}), we use each thread block to process each high degree vertex in parallel, utilizing block-level parallelism. It involves similar operations as the thread-per-vertex kernel, but uses block-reduce operations and shared memory.

\input{src/alg-update}


\subsection{Parallel vertex partitioning by degree}
\label{sec:partition}

Algorithm \ref{alg:partition} outlines the psuedocode for parallel vertex partitioning by degree. It aims to split the vertices of the input graph $G(V, E)$ into two groups based on their degree: low-degree vertices and high-degree vertices. The algorithm provides partitioned vertex IDs $P$ with low-degree vertices being listed first, along with the number of vertices with low degree $N_P$, as its output.

In the function \texttt{partition()}, we start by initializing an empty buffer $B_k$ and the set of partitioned vertex IDs $P$ (line \ref{alg:partition--initialize}). We then proceed to populate $B_k$ with boolean values indicating whether each vertex has a degree less than or equal to a specified threshold $D_P$ (lines \ref{alg:partition--less-begin}-\ref{alg:partition--less-end}). Afterward, we perform an exclusive prefix sum operation on $B_k$ to determine the number of low-degree vertices $N_P$ (lines \ref{alg:partition--lscan-begin}-\ref{alg:partition--lscan-end}). In the subsequent loop, we assign low-degree vertices to the appropriate positions in the partitioned vertex IDs array $P$ (lines \ref{alg:partition--lpopulate-begin}-\ref{alg:partition--lpopulate-end}). We then repeat a similar process for high-degree vertices. We populate $B_k$ with boolean values indicating whether each vertex has a degree greater than $D_P$ (lines \ref{alg:partition--greater-begin}-\ref{alg:partition--greater-end}), perform another exclusive prefix sum operation on $B_k$ (line \ref{alg:partition--gscan}), and assign high-degree vertices to the appropriate positions in $P$ (lines \ref{alg:partition--gpopulate-begin}-\ref{alg:partition--gpopulate-end}). Finally, we return the partitioned vertex IDs $P$ along with the number of low-degree vertices $N_P$ (line \ref{alg:partition--return}).

\input{src/alg-partition}


\subsection{Parallel marking of affected vertices}
\label{sec:affected}

Algorithm \ref{alg:affected} presents the psuedocode for parallel marking of affected vertices, consisting of two functions: \texttt{initialAffected()} and \texttt{expandAffected()}.

The \texttt{initialAffected()} function performs an initial marking step of DF and DF-P PageRank. It takes as input the current graph snapshot $G^t$ and the sets of edge deletions $\Delta^{t-}$ and insertions $\Delta^{t+}$. Here, we first initialize two arrays, $\delta_V$ and $\delta_N$, which represent whether each vertex and its neighbors are affected, respectively (line \ref{alg:affected--iinitialize}). Next, for each edge deletion in $\Delta^{t-}$, we mark both the source and target vertices as affected (lines \ref{alg:affected--idel-begin}-\ref{alg:affected--idel-end}). Similarly, for each edge insertion in $\Delta^{t+}$, we mark the source vertex as affected (lines \ref{alg:affected--iins-begin}-\ref{alg:affected--iins-end}). Finally, we return $\delta_V$ and $\delta_N$ (line \ref{alg:affected--ireturn}).

The \texttt{expandAffected()} function propagates the affected status to neighboring vertices. It takes as input flags indicating whether each vertex is affected $\delta_V$ or its neighbors are affected $\delta_N$, the current graph snapshot $G^t$, partitioned vertex IDs $P$ with low degree vertices placed first, and the number of vertices with low degree $N_P$. This algorithm also operates in two phases: a thread-per-vertex kernel (for low degree vertices) and a block-per-vertex kernel (for high degree vertices). In the \textit{thread-per-vertex kernel} (lines \ref{alg:affected--ethread-begin}-\ref{alg:affected--ethread-end}), we use each thread to process each low degree vertex in parallel by iterating through the partitioned vertex IDs array $P$. For each vertex $u$ marked as affected in $\delta_N$, we invoke the \texttt{markNeighbors()} function to mark its neighbors as affected in $\delta_V$ (line \ref{alg:affected--etmark}). In the \textit{block-per-vertex kernel} (lines \ref{alg:affected--eblock-begin}-\ref{alg:affected--eblock-end}), we use each thread block to process each vertex in parallel, utilizing block-level parallelism. It involves similar operations as the thread-per-vertex kernel.

\input{src/alg-affected}


\subsection{Indirect Comparison with State-of-the-art PageRank Implementations (Static)}
\label{sec:static-comparison-indirect}

We now indirectly compare the performance of our GPU implementation of Static PageRank with other similar state-of-the-art implementations, listed in Table \ref{tab:compare-large}. Wang et al. \cite{wang2021grus} present Grus, a unified memory efficient high-performance graph processing framework for GPUs. They achieve a $1.2\times$ speedup over Gunrock \cite{wang2016gunrock} on the \textit{uk-2005} graph (refer to Table $4$ in their paper \cite{wang2021grus}). On the same graph, Wang et al. observe that Gunrock achieves a speedup of $4.6\times$ over Tigr \cite{nodehi2018tigr}. However, our implementation achieves an $8.6\times$ speedup over Gunrock on the same graph (Figure \ref{fig:compare--speedup}). Chen et al. \cite{chen2022atos} introduce Atos, a task-parallel GPU scheduler for graph analytics. They achieve a $3.2\times$ speedup over Gunrock on the \textit{indochina-2004} graph (refer to Table $1$ in their paper \cite{chen2022atos}). However, our implementation achieves a $24.4\times$ speedup over Gunrock on the same graph (Figure \ref{fig:compare--speedup}). In a subsequent work, Chen et al. \cite{chen2022scalable} extend their Atos dynamic scheduling framework to multi-node GPU systems supporting Partitioned Global Address Space (PGAS) style lightweight one-sided memory operations within and between nodes. Even with $4$ GPUs, they fail to surpass our speedup with respect to Gunrock on the \textit{indochina-2004} graph (refer to Table $4$ in their paper \cite{chen2022scalable}). On the same graph, Chen et al. observe that they achieve a speedup of $57.5\times$ over Galois with Gluon \cite{dathathri2018gluon} on a single GPU (see Table $5$ in their paper \cite{chen2022scalable}). Yang et al. \cite{yang2022graphblast} introduce GraphBLAST, a high-performance linear algebra-based graph framework for GPUs. On the \textit{indochina-2004} graph, they achieve a $2.2\times$/$1.2\times$ speedup over Gunrock (see Table $12$/$13$ in their paper \cite{yang2022graphblast}). Nonetheless, our implementation achieves a $24.4\times$ speedup over Gunrock on the same graph (Figure \ref{fig:compare--speedup}). Concessao et al. \cite{concessao2023meerkat} propose Meerkat, a library-based framework for dynamic graph algorithms that utilizes a GPU-tailored graph representation and exploits the warp-cooperative execution model. The PageRank implementation of Meerkat performs, on average, $1.7\times$ faster than Hornet \cite{busato2018hornet}. However, our Static PageRank outperforms Hornet by an average speedup of $31\times$ (Figure \ref{fig:compare--speedup}).

\ignore{\clearpage}

\input{src/fig-temporal-compare}
\input{src/fig-8020-runtime-compare}
\input{src/fig-8020-error-compare}
\input{src/fig-temporal-sx-mathoverflow}
\input{src/fig-temporal-sx-askubuntu}
\input{src/fig-temporal-sx-superuser}
\input{src/fig-temporal-wiki-talk-temporal}
\input{src/fig-temporal-sx-stackoverflow}