\subsection{Static PageRank}
One of the earliest GPU implementations of PageRank is by Wu et al. \cite{rank-wu10}. They implement Static PageRank on AMD GPUs using OpenCL. For this, they develop a Sparse Matrix-Vector Multiplication (SpMV) routine as a primitive. As a pre-processing step, they sort the rows of the PageRank sparse matrix by the number of non-zero elements, while keeping track of the original row numbers. They then employ three distinct kernels to process the rows, depending on the number of non-zero elements in each: one thread per row (1T1R), one quarter wavefront per row (16T1R), and one wavefront per row (1W1R). Cevahir et al. \cite{cevahir2010efficient} conduct PageRank computation on an NVIDIA GPU cluster (TSUBAME 1.2), with each node housing two Tesla GPUs. They partition the graph into chunks, assigning one chunk to each node. Each chunk is further split into two portions, one per GPU, each sized to fit in that GPU's device memory. They allocate one MPI process per GPU, and utilize an SpMV kernel on the GPU for computation. Rungsawang and Manaskasemsak \cite{rank-rungsawang12} improve upon the work of Cevahir et al. \cite{cevahir2010efficient} by devising an algorithm that does not require large graphs to fit within the limited device memory. They achieve this by partitioning the graph between nodes and then further dividing each partition into chunks using the CPU at each node. These chunks are sized to fit into the device memory of the Tesla GPU at each node, which processes them one after the other. Duong et al. \cite{rank-duong12} introduce a push-based GPU implementation of Static PageRank, employing atomic operations. They utilize a single thread per vertex for rank computation, and handle the global teleport rank contribution (dangling value) with atomic operations. Additionally, they propose a multi-GPU implementation where the input graph is partitioned among GPUs, allowing independent computation of PageRank scores on each GPU.
After each iteration, the computed ranks are synchronized on the CPU, where they are combined and redistributed among the GPUs\ignore{for subsequent iterations}.
However, Wu et al. \cite{rank-wu10}, Cevahir et al. \cite{cevahir2010efficient}, and Rungsawang and Manaskasemsak \cite{rank-rungsawang12} must first build the PageRank matrix before performing PageRank computation using an SpMV kernel. Our GPU implementation of PageRank does not require building any sparse matrix --- we directly utilize the adjacency list (stored in CSR format) of the graph for PageRank computation. Rungsawang and Manaskasemsak \cite{rank-rungsawang12} simply assign a thread to process each vertex in the graph, which fails to achieve good load balancing and suffers from thread divergence, while Wu et al. \cite{rank-wu10} achieve load balancing by sorting the rows by the number of non-zero elements and dispatching them to three different kernels, though the sort itself can be expensive. In contrast, we partition the vertices of the graph into low- and high-degree vertex sets and process them with two kernels.
Duong et al. \cite{rank-duong12} use a push-based PageRank computation, relying on atomic operations per edge, and also compute the global teleport contribution due to dead ends using atomic add operations. This can result in substantial memory contention among GPU threads. In contrast, our pull-based approach requires only one write per vertex, and by eliminating dead ends during graph loading we avoid the need for finding the global teleport contribution. Duong et al. \cite{rank-duong12} also do not partition vertices into low and high-degree sets.
We now discuss a number of graph processing frameworks which include a GPU implementation of PageRank. Wang et al. \cite{wang2016gunrock} present the Gunrock multi-GPU CUDA library, which provides a high-level vertex/edge frontier-based bulk-synchronous / asynchronous API, and a number of high-performance primitives. The PageRank implementation of Gunrock adopts a push-based approach, executing atomic adds per edge, and employs a parallel for loop (using thrust) over the range of vertex IDs, utilizing either a thread- or a block-per-vertex approach. Further, it computes the global teleport contribution to each vertex attributable to dead ends. Busato et al. \cite{busato2018hornet} introduce Hornet, a platform-independent data structure optimized for dynamic sparse graphs and matrices. Hornet accommodates large-scale data growth without necessitating any data reallocation or re-initialization throughout the dynamic evolution process. The GPU-based graph processing framework based on the Hornet data structure is known as cuHornet. In its PageRank implementation, Hornet follows a push-based approach, akin to Gunrock. It calculates the rank contribution of each vertex separately, stores it in a distinct vector, and utilizes an additional kernel to compute ranks from these contributions. Similar to Gunrock, Hornet employs a parallel for loop (using a custom kernel) over all vertices, adopting a thread-per-vertex approach.
We now discuss issues with the PageRank implementations of Gunrock \cite{wang2016gunrock} and Hornet \cite{busato2018hornet}. Both utilize push-based PageRank computation involving atomic add operations per edge, leading to significant memory contention among GPU threads. Our approach, in contrast, is pull-based, and necessitates only a single write per vertex. Hornet computes vertex rank contributions separately and employs an additional kernel for rank computation, whereas we calculate rank contributions on the fly and utilize only a kernel pair for rank computation. Gunrock performs a parallel for (using thrust) over the range of vertex IDs, while Hornet performs a parallel for (using a custom kernel) over all vertices with a thread per vertex. Unlike Gunrock and Hornet, we partition vertices into low and high in-degree sets and employ both thread- and block-per-vertex kernels. Gunrock uses a kernel to compute the global teleport contribution due to dead ends, while we preemptively eliminate dead ends during graph loading. Finally, Hornet uses a naive rank-vector norm computation with atomic operations, while we implement a more efficient parallel reduce operation.
Wang et al. \cite{wang2021grus} introduce Grus, a GPU-based graph processing framework optimized for Unified Memory (UM) efficiency. Their focus lies in minimizing data migration, reducing page faults, and mitigating page migration overhead. Their PageRank implementation is push-based, and utilizes an adaptive UM policy that assigns high priority to the frontier and rank data, medium priority to the Compressed Sparse Row (CSR) index array, and low priority to the CSR edges array. They use a bitmap-directed frontier, based on an 8-bit integer array alongside a queue, which eliminates the need for atomic operations. Load balancing is achieved with a warp-centric approach. Chen et al. \cite{chen2022atos} critique existing frameworks like Gunrock for launching each graph frontier as a separate GPU kernel in the Bulk Synchronous Parallel (BSP) model, leading to potential issues with parallelism, finish times, and kernel launch overhead, especially for small frontiers. To address this, they propose Atos, a persistent task scheduler designed to minimize kernel launch overhead and support asynchronous execution. For PageRank, they present a push-based asynchronous implementation, employing a queue-based frontier to track vertices for the next iteration, based on Adaptive PageRank by Kamvar et al. \cite{kamvar2004adaptive}. In another study, Chen et al. \cite{chen2022scalable} broaden their Atos dynamic scheduling framework to multi-node GPU systems, accommodating Partitioned Global Address Space (PGAS) style lightweight one-sided memory operations within and between nodes. Yang et al. \cite{yang2022graphblast} introduce GraphBLAST, a high-performance linear algebra-based graph framework for GPUs. They address the lack of efficient GraphBLAS implementations for GPUs, highlighting the performance gap between existing implementations, such as the GraphBLAS Template Library (GBTL), and state-of-the-art GPU graph frameworks.
They focus on exploiting input and output sparsity to simplify algorithm development and improve performance. For PageRank, they employ an edge-balanced (merge-based) load balancing approach with segmented scan, alongside a heuristic to switch between push- and pull-based approaches, favoring an early switch akin to Ligra \cite{shun2013ligra}. Concessao et al. \cite{concessao2023meerkat} introduce Meerkat, a library-based framework for dynamic graph algorithms optimized for GPU architectures. Leveraging a GPU-tailored graph representation and the warp-cooperative execution model, Meerkat enhances performance by exploiting common iteration patterns, such as iterating over vertex neighborhoods efficiently. The framework supports dynamic edge additions and deletions, including batched versions. In its implementation of PageRank, Meerkat first computes the rank contribution of each vertex before calculating the rank itself\ignore{(similar to Hornet \cite{busato2018hornet})}. It employs a pull-based warp-per-vertex approach for parallel computation, where threads collaborate within a warp to compute vertex ranks, and achieves coalesced writes for rank updates.
We now note some concerns regarding the GPU-based PageRank implementations in existing frameworks. Grus \cite{wang2021grus} and Atos \cite{chen2022atos} adopt a push-based PageRank computation method. This involves atomic add operations per edge, and can introduce needless memory contention among GPU threads. Our approach is pull-based, and requires only a single write per vertex. Atos additionally employs a queue-based frontier for tracking vertices to be processed in the next iteration. This requires atomic operations, and is not conducive to parallelism. While Grus and Meerkat \cite{concessao2023meerkat} utilize warp-centric load balancing, they do not partition vertices into low and high in-degree sets. We partition vertices into such sets for load balancing, and to minimize thread divergence. Finally, Meerkat computes the rank contribution of each vertex, followed by a separate kernel for computing the ranks. It then uses another kernel to compute the common teleport contribution with atomics. However, as mentioned earlier, we calculate rank contributions on the fly, and employ only a pair of kernels for rank computation.
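The degree-based partitioning used for load balancing can be sketched as follows. The function name and the threshold are illustrative assumptions (a warp width of 32 is a natural choice), not the exact parameters of our implementation.

```cpp
#include <vector>
#include <utility>
using namespace std;

// Partition vertex IDs by in-degree, read off the CSR offsets, so that a
// thread-per-vertex kernel can process the low-degree set and a
// block-per-vertex kernel the high-degree set. The threshold is an assumed
// tuning knob; names are illustrative.
pair<vector<int>, vector<int>> partitionByInDegree(const vector<int>& xoff,
                                                   int threshold) {
  int N = xoff.size() - 1;
  vector<int> low, high;
  for (int v = 0; v < N; ++v) {
    int deg = xoff[v+1] - xoff[v];            // in-degree of v in CSR form
    if (deg < threshold) low.push_back(v);    // few in-edges: thread per vertex
    else high.push_back(v);                   // many in-edges: block per vertex
  }
  return make_pair(low, high);
}
```

Grouping similar-degree vertices this way keeps the work per thread (or per block) uniform, which is what minimizes thread divergence.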
\subsection{Dynamic PageRank}
Early work in dynamic graph algorithms in the sequential setting includes the sparsification method proposed by Eppstein et al. \cite{graph-eppstein97} and Ramalingam's bounded incremental computation approach \cite{incr-ramalingam96}.\ignore{The latter advocates measuring the work done as part of the update in proportion to the effect the update has on the computation.} Several approaches have been suggested for incremental computation of approximate PageRank values in a dynamic or evolving graph. Chien et al. \cite{rank-chien01} identify a small region near updated vertices in the graph and represent the rest of the graph as a single vertex in a smaller graph. PageRanks are computed for this reduced graph and then transferred back to the original graph. Chen et al. \cite{chen2004local} propose various methods to estimate the PageRank score of a webpage using a small subgraph of the entire web, by expanding backwards from the target node along reverse hyperlinks. Bahmani et al. \cite{bahmani2010fast} analyze the efficiency of Monte Carlo methods for incremental PageRank computation. Zhan et al. \cite{zhan2019fast} introduce a Monte Carlo-based algorithm for PageRank tracking on dynamic networks, maintaining $R$ random walks starting from each node. Pashikanti et al. \cite{rank-pashikanti22} also employ a similar Monte Carlo-based approach for updating PageRank scores upon vertex and edge insertions/deletions.
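The Monte Carlo estimation technique these works build on can be sketched as follows. This is a simplified serial illustration of the general idea (visit frequencies of damped random walks approximate PageRank), with hypothetical names, not any cited paper's exact algorithm.

```cpp
#include <vector>
#include <random>
using namespace std;

// Monte Carlo estimation of PageRank: run R random walks from every vertex,
// continuing at each step with probability d (else terminating), and take
// visit frequencies as the rank estimate. Simplified serial illustration.
vector<double> pagerankMonteCarlo(const vector<vector<int>>& adj, int R,
                                  double d, unsigned seed) {
  int N = adj.size();
  mt19937 rng(seed);
  uniform_real_distribution<double> coin(0.0, 1.0);
  vector<long long> visits(N, 0);
  long long total = 0;
  for (int s = 0; s < N; ++s)
    for (int r = 0; r < R; ++r) {
      int u = s;
      for (;;) {
        ++visits[u]; ++total;                 // count the visit to u
        if (coin(rng) >= d || adj[u].empty()) break;  // terminate or dead end
        u = adj[u][rng() % adj[u].size()];    // hop to a random out-neighbor
      }
    }
  vector<double> rank(N);
  for (int v = 0; v < N; ++v) rank[v] = double(visits[v]) / total;
  return rank;
}
```

Incremental variants, such as those of Bahmani et al. and Zhan et al., maintain the stored walks and adjust only those passing through updated edges.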
A few approaches have been devised to update exact PageRank scores on dynamic graphs. Zhang \cite{rank-zhang17} uses a simple incremental PageRank computation system for dynamic graphs, which we refer to as the \textit{Naive-dynamic (ND)} approach, on hybrid CPU and GPU platforms\ignore{ --- employing the Update-Gather-Apply-Scatter (UGAS) computation model}. Additionally, Ohsaka et al. \cite{ohsaka2015efficient} propose a method for locally updating PageRank using the Gauss-Southwell method, prioritizing the vertex with the greatest residual for initial updating; however, their algorithm is inherently sequential. A widely adopted approach for updating PageRank \cite{rank-desikan05, kim2015incremental, rank-giri20, sahu2022dynamic} is based on the observation that changes in the out-degree of a node do not influence its PageRank score, adhering to the first-order Markov property. The portion of the graph undergoing updates, involving edge insertions or deletions, is used to identify the affected region of the graph in a preprocessing step. This is typically accomplished through Breadth-First Search (BFS) or Depth-First Search (DFS) traversal from vertices connected to the inserted or deleted edges. Subsequently, PageRanks are computed solely for this region. Desikan et al. \cite{rank-desikan05} originally proposed this approach, which we term the \textit{Dynamic Traversal (DT)} approach in this report. Kim and Choi \cite{kim2015incremental} apply this approach with an asynchronous PageRank implementation, while Giri et al. \cite{rank-giri20} utilize it with collaborative executions on multi-core CPUs and massively parallel GPUs. Sahu et al. \cite{sahu2022dynamic} employ this strategy on a Strongly Connected Component (SCC)-based graph decomposition to limit computation to SCCs reachable from updated vertices, on multi-core CPUs and GPUs. Sallinen et al. \cite{sallinen2023real} recently proposed an event-based approach for asynchronously updating PageRank scores of vertices in dynamic graphs.\ignore{In the paper, each vertex stores its present rank, and computes its total rank contribution to its out-going neighbors, and thus the value it has previously distributed to its out-neighbors. For each edge addition, the vertex computes the previous value it sent, then sends a negative value representing the difference of the prior value to the new. For the new edge, it simply sends the adjusted flow as a single delta. For edge deletion, the algorithm is similar; negative flow is passed across the removed edge, redirecting positive flow to the remaining edges.} However, their per-thread event processing and termination detection methods are relatively complicated.
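The DT preprocessing step described above can be sketched as follows. This is a serial illustration with our own (hypothetical) names, not any cited paper's exact implementation.

```cpp
#include <vector>
#include <queue>
#include <utility>
using namespace std;

// Dynamic Traversal (DT) preprocessing sketch: mark as affected every vertex
// reachable via out-edges from the endpoints of inserted or deleted edges;
// PageRank is then recomputed only for this region. Serial illustration.
vector<bool> findAffectedDT(const vector<vector<int>>& adj,
                            const vector<pair<int,int>>& changedEdges) {
  vector<bool> affected(adj.size(), false);
  queue<int> q;
  for (const auto& e : changedEdges)          // seed the BFS at edge endpoints
    for (int s : {e.first, e.second})
      if (!affected[s]) { affected[s] = true; q.push(s); }
  while (!q.empty()) {                        // standard BFS over out-edges
    int u = q.front(); q.pop();
    for (int w : adj[u])
      if (!affected[w]) { affected[w] = true; q.push(w); }
  }
  return affected;
}
```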
In our prior research \cite{sahu2024df}, we introduced two parallel approaches, Dynamic Frontier (DF) and Dynamic Frontier with Pruning (DF-P), for updating PageRank on dynamic graphs using the power-iteration method \cite{rank-page99}\ignore{, achieving notable performance on multicore processors}.\ignore{Our findings indicated that DF/DF-P PageRank outperformed Static and DT PageRank by $5.2\times$/$15.2\times$ and $1.3\times$/$3.5\times$ respectively on real-world dynamic graphs, and by $7.2\times$/$9.6\times$ and $4.0\times$/$5.6\times$ on large static graphs with random batch updates.} In that report, we recommended DF-P PageRank for real-world dynamic graphs, suggesting a switch to DF PageRank if higher error rates were observed. For large graphs with random batch updates, we recommended DF-P PageRank, except for web graphs, where we suggested choosing DF PageRank.\ignore{In this report, we migrate these algorithms to the GPU, apply GPU-specific optimizations, and observe the performance characteristics of the two proposed dynamic PageRank algorithms.}
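A simplified reading of the frontier-expansion idea behind DF is sketched below. The names and the tolerance are hypothetical, and this is an illustration of the general idea only, not the exact DF/DF-P algorithm of \cite{sahu2024df}.

```cpp
#include <vector>
#include <cmath>
using namespace std;

// Frontier-expansion sketch: after an iteration, any affected vertex whose
// rank moved by more than a frontier tolerance marks its out-neighbors as
// affected, so only they need to be processed in the next iteration.
// Hypothetical names and tolerance; not the exact DF/DF-P algorithm.
void expandFrontier(const vector<vector<int>>& adj, vector<bool>& affected,
                    const vector<double>& rank, const vector<double>& rankNew,
                    double frontierTol) {
  for (size_t u = 0; u < adj.size(); ++u)
    if (affected[u] && fabs(rankNew[u] - rank[u]) > frontierTol)
      for (int v : adj[u]) affected[v] = true;  // incrementally mark neighbors
}
```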
\ignore{Several open-source graph processing frameworks, such as Hornet \cite{busato2018hornet} (also known as cuHornet) and Gunrock \cite{wang2016gunrock}, offer GPU implementations of the PageRank algorithm. Hornet utilizes the Hornet/cuSTINGER data structure, providing efficient support for various graph operations, including edge insertions, deletions, and traversals, while Gunrock employs a high-level, bulk-synchronous / asynchronous, data-centric abstraction, offering high-performance GPU computing primitives and a programmer-friendly model for the development of graph algorithms.}
\ignore{Further, Bahmani et al. \cite{rank-bahmani12} introduce an algorithm for selectively crawling a small section of the web to estimate the true PageRank of the graph at a given moment, while Berberich et al. \cite{rank-berberich07} propose a method to compute normalized PageRank scores that remain robust against non-local changes in the graph. These approaches diverge from our improved \textit{Dynamic Frontier} approach, which concentrates on computing the PageRank vector itself rather than on the tasks of web crawling or maintaining normalized scores.}
%% COMPRE
%% ------
% Note that as originally conceived, the PageRank model does not factor a web browser's \textbf{back button} into a surfer's hyperlinking possibilities. Surfers in one class, if teleporting, may be much more likely to jump to pages about sports, while surfers in another class may be much more likely to jump to pages pertaining to news and current events. Such differing teleportation tendencies can be captured in two different \textbf{personalization vectors}. However, it makes the once query-independent, user independent PageRankings, user-dependent and more calculation-laden. Nevertheless, it seems this little personalization vector has had more significant side effects. This personalization vector, along with a \textbf{non-uniform/weighted} version of PageRank \cite{pr-dubey16} can help control spamming done by the so-called link farms \cite{pr-deeper01}.
% Techniques to optimize the PageRank algorithm usually fall in two categories. One is to try \textbf{reducing the work per iteration}, and the other is to try \textbf{reducing the number of iterations}. Often, these goals are at odds against each other. The \textbf{adapting PageRank technique} "locks" vertices which have converged, and saves iteration time by skipping their computation \cite{pr-deeper01}. Identical nodes, which have the same in-links, can be removed to reduce duplicate computations and thus reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance. Final ranks of chain nodes can be easily calculated. This reduces both the iteration time, and the number of iterations. If a graph has no dangling nodes, PageRank of each strongly connected component can be computed in topological order. This helps reduce the iteration time, number of iterations, and also enable concurrency in PageRank computation. The combination of all of the above methods is the \textbf{STIC-D algorithm} (see Figure \ref{fig:about-pagerank-sticd}) \cite{pr-sticd16}. A somewhat similar aggregation algorithm is \textbf{BlockRank} which computes the PageRank of hosts, local PageRank of pages within hosts independently, and aggregates them with weights for the final rank vector. The ranks of vertices for the entire graph can be found efficiently by computing the sub-PageRank of each connected component, and then using the sub-PageRanks together to form the global PageRank (Avrachenkov et. al. \cite{pr-avrachenkov04}). These methods exploit the inherent reducibility in the graph. The \textbf{Jacobi method} can also be used to compute the PageRank vector (Bianchini et. al. \cite{pr-bianchini05}) \cite{pr-deeper01}. \textbf{Monte Carlo} based PageRank methods consider several random walks on the input graph to get approximate PageRanks. 
% Its optimizations for distributed PageRank computation (specially for undirected graphs) \cite{compute-frey13}, map-reduce algorithm for personalized PageRank \cite{pr-bahmani11}, and reordering strategy (to reduce space and compute complexity on GPU) for local PageRank \cite{pr-lai17} are present.
% The time per iteration can be reduced further by taking note of the fact that the traditional algorithm is \textbf{not computationally bound} and \textbf{generates fine granularity random accesses} (it exhibits irregular parallelism). This causes poor memory bandwidth and compute utilization, and the extent of this is quite dependent upon the graph structure \cite{compute-hunold15} \cite{pr-lakhotia18}. \textit{Four strategies for neighbour iteration} were attempted, to help reason about the \textit{expected impact of a graph's structure} on the performance of each strategy \cite{compute-hunold15}. CPUs/GPUs are generally designed optimized to \textbf{load memory} in blocks (cache-lines in CPUs, coalesced memory reads in GPUs), and not for fine-grained accesses. Being able to adjust this behaviour depending upon application (PageRank) can lead to performance improvements. Techniques like \textit{prefetching to SRAM, using a high-performance shuffle network} \cite{graph-wang15}, \textit{indirect memory prefetcher (of the form $A[B[i]]$), partial cache line accessing mechanisms} \cite{memory-yu15}, \textit{adjusting data layout} \cite{pr-lakhotia18} \textit{(for sequential DRAM access} \cite{pr-zhou15} \textit{), and branch avoidance mechanisms (with partitioning)} \cite{pr-lakhotia18} are used. Large graphs can be \textbf{partitioned }or decomposed into subgraphs to help reduce cross-partition data access that helps both in distributed, as well as shared memory systems (by reducing random accesses). Techniques like \textit{chunk partitioning} \cite{pr-rungsawang12}, \textit{cache/propagation blocking} \cite{pr-beamer17}, \textit{partition-centric processing with gather-apply-scatter model} \cite{pr-lakhotia18}, \textit{edge-centric scatter-gather with non-overlapping vertex-set} \cite{pr-zhou17}, \textit{exploiting node-score sparsity} \cite{pr-li21}, and even \textit{personalized PageRank based partitioning} \cite{pr-mazaheri19} have been used. 
% Graph/subgraph \textbf{compression} can also help reduce memory bottlenecks \cite{pr-rungsawang12} \cite{pr-guoqiang20}, and enable processing of larger graphs in memory. A number of techniques can be used to compress adjacency lists, such as, \textit{delta encoding of edge/neighbour ids} \cite{graph-bharat98}, and \textit{referring sets of edges in edge lists} \cite{graph-adler01} \cite{graph-raghavan03} (hard to find reference vertices though) \cite{pr-deeper01}. Since the rank vector (possibly even including certain additional page-importance estimates) must reside entirely in main memory, a few compression techniques have been attempted. These include \textit{lossy encoding schemes based on scalar quantization} seeking to minimize the distortion of search-result rankings \cite{pr-haveliwala02} \cite{pr-deeper01}, and using \textit{custom half-precision floating-point formats} \cite{pr-molahosseini20}.
% As new software/hardware \textbf{platforms} appear on the horizon, researchers have been eager to test the performance of PageRank on the hardware. This is because each platform offers its own unique architecture and engineering choices, and also because PageRank often serves as a good benchmark for the capability of the platform to handle various other graph algorithms. Attempts have been made on distributed frameworks like \textit{Hadoop} \cite{pr-abdullah10}, and even \textit{RDBMS} \cite{compute-barolli21}. A number of implementations have been done on \textit{standard multicores} \cite{compute-barolli21}, \textit{Cell BE} \cite{compute-buehrer08} \cite{pr-zhou17}, \textit{AMD GPUs} \cite{pr-wu10}, \textit{NVIDIA/CUDA GPUs} \cite{pr-bisson16} \cite{pr-zhou17} \cite{graph-seo15}, \textit{GPU clusters} \cite{pr-rungsawang12}, \textit{FPGAs} \cite{pr-mughrabi21} \cite{graph-wang15} \cite{pr-guoqiang20}, \textit{CPU-FPGA hybrids} \cite{pr-hassan21} \cite{pr-usta21} \cite{pr-li21}, and even on SpMV \textit{ASICs} \cite{pr-sadi18}.
% PageRank algorithm is a \textbf{live algorithm} which means that an ongoing computation can be paused during graph update, and simply be resumed afterwards (instead of restarting it). The first \textbf{updating} paper by Chien et al. \cite{pr-chien01} identifies a tiny region of the graph near the changed vertices/edges and model the remainder of the graph as a single vertex in a new, much smaller graph. PageRanks are computed for the small graph and then transferred to the original graph \cite{pr-deeper01}. A common technique used for dynamic PageRank algorithm, given a small change to the input graph, is to find the affected region in the preprocessing step with breadth-first search (BFS) or depth-first search (DFS) traversal from the vertices connecting the edges that were inserted or deleted \cite{pr-desikan05} \cite{pr-giri20}.
% 1 &
% Prasanna Desikan, Nishith Pathak, Jaideep Srivastava, and Vipin Kumar. 2005. \textbf{Incremental page rank computation on evolving graphs}. In Special interest tracks and posters of the 14th international conference on World Wide Web (WWW '05). Association for Computing Machinery, New York, NY, USA, 1094–1095.\linebreak
% DOI: https://doi.org/10.1145/1062745.1062885 &
% 54 \\ \hline
% \multicolumn{3}{|p{14cm}|}{This paper describes a simple method for computing dynamic pagerank, based on the fact that change of out-degree of a node does not affect its pagerank (first order markov property). The part of the graph which is updated (edge additions / edge deletions / weight changes) is used to find the affected partition of the graph using BFS. The unaffected partition is simply scaled, and pagerank computation is done only for the affected partition. \footnote{https://gist.github.com/wolfram77/f0a7534d49d5c07d4479ec3966c5d635}} \\ \hline
% 2 &
% Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2002. \textbf{I/O-efficient techniques for computing pagerank}. In Proceedings of the eleventh international conference on Information and knowledge management (CIKM '02). Association for Computing Machinery, New York, NY, USA, 549–557.\linebreak
% DOI: https://doi.org/10.1145/584792.584882 &
% 33 \\ \hline
% \multicolumn{3}{|p{14cm}|}{This paper describes a technique to partition the link file of the whole file into blocks of a range of destination nodes, with partial source nodes, so that it is possible to run power iteration of pagerank of massive graphs which do not fit in memory. The graphs must be stored on disk, and partitions of the graphs are scanned in every iteration until the ranks converge. Unlike Haveliwala's technique, this is similar to pull based pagerank. Both methods have similarities with join techniques in database systems. Topic-sensitive pagerank is also discussed which finds pagerank of graphs related to a specific keywords beforehand, and merges them together based upon the query (might return better results than global pagerank). This requires small adjustments to the random jump probability factor $(1-d)$. \footnote{https://gist.github.com/wolfram77/925cede0214aa0f391f34fa8ce137290}} \\ \hline
% 3 &
% Paritosh Garg and Kishore Kothapalli. 2016. \textbf{STIC-D: algorithmic techniques for efficient parallel pagerank computation on real-world graphs}. In Proceedings of the 17th International Conference on Distributed Computing and Networking (ICDCN '16). Association for Computing Machinery, New York, NY, USA, Article 15, 1–10.\linebreak
% DOI: https://doi.org/10.1145/2833312.2833322 &
% 7 \\ \hline
% \multicolumn{3}{|p{14cm}|}{In this paper, the authors exploit the reducibility of dead-end free graphs to compute PageRanks. They first split the vertices into strongly connected components (SCCs) and represent each SCC as a vertex in a block-graph. Each SCC is then processed as per its topological order in the block-graph. This enables them to reduce the operating memory requirement, thanks to the smaller size of SCCs that are processed in one go. As SCCs are processed in topological order, unnecessary iterations on vertices that are unlikely to converge are avoided. Processing vertices grouped into SCCs also improves performance due to inherent spatial locality within an SCC. In addition, this method allows SCCs residing on the same level in the block-graph to be processed independently of each other. This is demonstrated by the authors by processing each such SCC in parallel with OpenMP. They also present three algorithmic techniques for eliminating redundancies in PageRank computation, namely skipping repeated computation on in-identical vertices (minimize redundant computation), short circuiting chain vertices (help accelerate convergence), and skipping computation on vertices that appear to have converged (minimize redundant computation).The suitability of these techniques depend upon the nature of input graph. They study the techniques on four classes of real-world graphs: web graphs, social networks, citation and collaboration networks, and road networks. Their implementation achieves an average speedup of 32\% compared to a baseline implementation. \footnote{https://gist.github.com/wolfram77/bb09968cc0e592583c4b180243697d5a}} \\ \hline
% 4 &
% Stergios Stergiou. 2020. \textbf{Scaling PageRank to 100 Billion Pages}. Proceedings of The Web Conference 2020. Association for Computing Machinery, New York, NY, USA, 2761–2767.\linebreak
% DOI: https://doi.org/10.1145/3366423.3380035 &
% 1 \\ \hline
% \multicolumn{3}{|p{14cm}|}{In this paper, the author exploits the fact the communication required between iterations is identical. He uses this to develop a new communication format that allows significant reduction in bandwidth requirement. He experiments on massive web graphs with up to 38 billion vertices and 3.1 trillion edges, requiring a per-iteration time of 34.4 seconds, which is more than an order of magnitude improvement over the state-of-the-art. \footnote{https://gist.github.com/wolfram77/10964cd26f11f7a7299e7b74a0be7e7e}} \\ \hline
%% ------
% Coarse PageRank approaches: Arasu et al. \cite{rank-arasu02} proposed HostRank, where the web is aggregated at the level of host names. BlockRank \cite{kamvar2003exploiting} uses HostRank to initialize PageRank, while DirRank \cite{rank-eiron04} forms aggregation at the level of directories of websites.
%% We use an asynchronous approach:
% Real-Time PageRank on Dynamic Graphs (2023): In this paper, Sallinen et al. \cite{sallinen2023real} compute PageRank asynchronously for real-time, on demand PageRank computation with arbitrary granularity. They model PageRank as a flow problem, where mass is absorbed by the page, and the rest is distributed to neighbors. This is done by sending delta values of probability mass depending on edge deletion or insertions by adjustment upon earlier values. Sink/dangling vertices (dead ends) are handled as usual (teleport).
%% Interesting approach:
% PageRank Algorithm Based on Dynamic Damping Factor (2023): Existing methods often set the damping factor empirically, overlooking the relevance of web visitors’ topics. HaoLin et al. \cite{haolin2023pagerank} propose an adaptive dynamic damping factor based on the web browsing context, and demonstrate that it effectively mitigates the impact of noisy web pages on query results and improves the convergence speed.
%% Sliding window approach.
% Time-Aware Ranking in Dynamic Citation Networks (2011): In this paper, Ghosh et al. \cite{ghosh2011time} consider the temporal order of links and chains of links connecting to a node with some temporal decay factors, and show that it is more appropriate for predicting future citations and PageRank scores with regard to new citations.
%% Other interesting approach:
% A Dynamical System for PageRank with Time-Dependent Teleportation (2014): In this paper, Gleich and Rossi \cite{gleich2014dynamical} propose a time-dependent teleportation to the PageRank score. The magnitude of the deviation from a static PageRank vector is given by a PageRank problem with complex-valued teleportation parameters. They demonstrate the utility of dynamic teleportation on both the article graph of Wikipedia, where the external interest information is given by the number of hourly visitors to each page, and the Twitter social network, where external interest is the number of tweets per month. They show that using information from the dynamical system helps improve a prediction task and identify trends in the data.
%% Other interesting approach:
% Temporal PageRank (2016): In this paper, Rozenshtein and Gionis \cite{rozenshtein2016temporal} propose an extension of static PageRank to temporal PageRank, so that only temporal walks are considered instead of all possible walks. Computing temporal PageRank requires processing the sequence of interactions and calculating the weighted number of temporal walks; their algorithm counts the weighted number of temporal walks explicitly.
%% Similar to STIC-D:
% Divide and conquer approach for efficient pagerank computation (2006): In this paper, Desikan et al. \cite{desikan2006divide} propose a graph-partitioning technique for PageRank, such that computation can be performed independently on each partition.
%% Similar to STIC-D:
% A componentwise PageRank algorithm (2015): In this paper, Engstrom et al. \cite{engstrom2015componentwise} propose two PageRank algorithms: one similar to the Lumping algorithm of Qing et al., which handles certain types of vertices faster, and another which handles more types of vertices, as well as strongly connected components, more effectively. This is similar to the work of Garg et al.
%% Streaming PageRank:
% Estimating PageRank on graph streams (2011): In this paper, Sarma et al. \cite{rank-sarma11} study the streaming model for PageRank, which uses a small amount of memory (preferably sub-linear in the number of nodes $n$). They compute approximate PageRank values in $\tilde{O}(nM^{-1/4})$ space and $\tilde{O}(M^{3/4})$ passes. They also give another approach that approximates the PageRank values in just $\tilde{O}(1)$ passes, although this requires $\tilde{O}(nM)$ space.
%% Applications of PageRank:
% PageRank Tracker: From Ranking to Tracking (2013): PageRank has been used by Gong et al. \cite{gong2013pagerank} in video object tracking to improve its robustness, i.e., to address difficulties with adaptation to environmental or target change. Determining the target is equivalent to finding the unlabeled sample that is the most associated with the existing labeled set.
%% Applications of PageRank:
% Abstracting PageRank To Dynamic Asset Valuation (2006): In this paper, Sawilla \cite{sawilla2006abstracting} uses (weighted) PageRank to quickly and dynamically calculate a relative value for all assets in an organization in any context in which dependencies may be specified. Their scheme works in general and will provide asset valuation in any context, be it confidentiality, integrity, availability, or even political capital.
%% For introduction, also a bit here:
% Adaptive Implementation to Update Page Rank on Dynamic Networks (2021): In this oral presentation, Srinivasan \cite{srinivasan2021adaptive} notes that while many attempts have been made to parallelize the PageRank algorithm for static networks, only a few compute PageRank on dynamic networks. As networks change with time, computing or updating PageRank is an expensive operation, so previous attempts have only approximated the metric to avoid recomputation. He introduces a framework that updates the PageRank of vertices as the network changes. The framework takes as input a set of networks, initial PageRank values for all vertices, and a set of batches containing the changesets. As the batches are processed in parallel, affected vertices are identified and marked for an update; once a batch is processed, the PageRank values of the affected vertices are recomputed. The framework is implemented on a shared-memory system, and experiments on real-world and synthetic networks show good scalability. The main contribution is that the framework avoids recomputation over all vertices, recomputing only a few, and avoids approximation in order to provide accurate values.
% PageRank is fundamentally an \textbf{SpMV} problem, with irregular memory accesses and a low \textbf{computation to communication} ratio \cite{usta2020accelerating}. Usta uses the \textbf{propagation blocking (PB)} technique, dividing the execution into binning and accumulation phases, carried out on the FPGA and CPU, respectively \cite{usta2020accelerating}. FPGAs can execute complex \textbf{combinational operations (dataflow?)}, while GPUs excel at running multiple threads in parallel \cite{usta2020accelerating}. Many real-world graphs are irregular and scale-free --- hard to partition/re-order vertices. \textbf{Cache-blocking techniques} can improve performance --- you want to improve locality \cite{usta2020accelerating}.
% The advantage of the pull direction over the push direction manifests itself in the case of multiple threads. In the push direction, there will be multiple writes to the same index of the result vector, requiring atomic operations to prevent data races. On the other hand, when using the push direction, one can benefit from the active-set optimization, where at each iteration, instead of traversing all the vertices, the algorithm can drop the ones that have already reached equilibrium \cite{usta2020accelerating}. However, there has been a recent proposal in this area that uses a different storage format, which enables the active-set optimization in the pull direction as well \cite{grossman2018making, grossman2019new}. Moreover, there are also existing proposals for a more hybrid approach, where the choice of direction can change between iterations \cite{besta2017push}.
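% The push/pull contrast above can be sketched as follows (a minimal sequential Python sketch; function and variable names are illustrative, not taken from any cited implementation):

```python
# Sketch contrasting pull- and push-direction PageRank iterations.
# out_edges[u] is the list of out-neighbors of vertex u; names are illustrative.

def pagerank_pull(out_edges, n, ranks, damping=0.85):
    """Pull direction: each vertex reads contributions from its in-neighbors.
    Each vertex writes only its own slot, so no atomics are needed in parallel."""
    # Build in-edges once (in practice the graph is stored in CSC/pull form).
    in_edges = {v: [] for v in range(n)}
    for u, outs in out_edges.items():
        for v in outs:
            in_edges[v].append(u)
    new_ranks = [0.0] * n
    for v in range(n):
        contrib = sum(ranks[u] / len(out_edges[u]) for u in in_edges[v])
        new_ranks[v] = (1 - damping) / n + damping * contrib
    return new_ranks

def pagerank_push(out_edges, n, ranks, damping=0.85):
    """Push direction: each vertex scatters its rank to its out-neighbors.
    Concurrent threads would contend on new_ranks[v], requiring atomics."""
    new_ranks = [(1 - damping) / n] * n
    for u, outs in out_edges.items():
        share = damping * ranks[u] / len(outs)
        for v in outs:
            new_ranks[v] += share  # the conflicting write in a parallel setting
    return new_ranks
```

% Both directions compute the same iteration; in a parallel setting, the pull variant lets each thread own one slot of the result vector, while the push variant's accumulating write is the one that would require an atomic operation.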
% To improve locality and reduce random-memory-access bottlenecks, \textbf{two-phase algorithms} have been proposed by various groups with different optimizations \cite{buono2016optimizing, sadi2018pagerank, beamer2017reducing}.
% Grutzmacher et al. \cite{grutzmacher2018high}.
% Grutzmacher et al. \cite{grutzmacher2020acceleration} propose a variable-precision strategy for computing PageRank using their customized floating-point format, i.e., Customized Precision based on Mantissa Segmentation (CPMS). They observe that compared to a standard IEEE double-precision implementation, their CPMS-based PageRank completes about $10\%$ faster if high-accuracy output is needed, and about $30\%$ faster if reduced output accuracy is acceptable.
% Duong et al. \cite{duong2012parallel} propose a parallel GPU implementation of the PageRank algorithm (static), where they use a push-based approach for PageRank computation (where the rank contribution of each vertex is pushed to its out-going neighbors). For parallelizing the computation, they assign one thread to each vertex in the graph, i.e., thread-per-vertex, and use \texttt{atomicAdd()} to update the ranks of out-going neighbors (for the next iteration). They do not add self-loops to dangling vertices, and thus compute the global teleport contribution of each vertex using atomic operations as well. This is inefficient!
% Piccinotti et al. \cite{piccinotti2019solving} present an approach to systematically handle write conflicts at the software level, without explicit thread synchronization or serialization of write operations. They show that their approach can successfully be applied to summation of partial results on the same memory location, even when dedicated hardware solutions are not present.
% Piccinotti et al. \cite{piccinotti2019solving} present a solution to handle write conflicts in GPU computations exploiting high level of parallelism. They state that floating point arithmetic operations lack commutative property due to machine precision errors. Non-commutativity becomes an issue when performing parallel computation on the same data, since the summation of results on the same memory location happens in a non-repeatable order. Even worse, rounding error introduced by floating point operations accumulates over iterations and, when analyzing large graphs, can disrupt the correctness of results and prevent algorithms’ termination. They analyze the applicability of interleaved reduction of arrays using warp atomic methods and develop a solution to the accumulating error generated by reduction methods.
% Concessao et al. \cite{concessao2023meerkat} use Naive-dynamic PageRank, while handling dead-ends as usual.
% Beamer et al. \cite{rank-beamer17} propose propagation blocking, an optimization to improve spatial locality, and demonstrate its application to PageRank. In contrast to cache blocking, which partitions the graph, they partition the data transfers between vertices (propagations). Their approach reduces communication more than conventional cache blocking if the input graph is sufficiently sparse, or if the number of vertices is sufficiently large relative to the cache size.
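% The two-phase structure of propagation blocking can be sketched as follows (a simplified sequential Python sketch under assumed names and bin sizes, not Beamer et al.'s implementation):

```python
# Simplified sketch of propagation blocking for one PageRank iteration.
# Instead of scattering contributions directly (random writes across the whole
# rank vector), the binning phase appends (destination, contribution) pairs to
# per-partition bins with sequential writes; the accumulation phase then applies
# each bin, touching only one cache-sized range of the rank vector at a time.

def pagerank_iteration_pb(out_edges, n, ranks, damping=0.85, bin_width=2):
    num_bins = (n + bin_width - 1) // bin_width
    bins = [[] for _ in range(num_bins)]
    # Binning phase: append propagations to the bin owning the destination.
    for u, outs in out_edges.items():
        share = damping * ranks[u] / len(outs)
        for v in outs:
            bins[v // bin_width].append((v, share))
    # Accumulation phase: each bin updates a contiguous slice of new_ranks.
    new_ranks = [(1 - damping) / n] * n
    for b in bins:
        for v, share in b:
            new_ranks[v] += share
    return new_ranks
```

% The bin width stands in for a cache-sized partition of the rank vector; the payoff is that both phases make mostly sequential or range-local memory accesses, at the cost of materializing the propagations.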
% Lakhotia et al. \cite{rank-lakhotia18} present a novel Partition-Centric Processing Methodology (PCPM) to compute PageRank, that drastically reduces the amount of DRAM communication while achieving high sustained memory bandwidth. PCPM uses a Partition-centric abstraction coupled with the Gather-Apply-Scatter (GAS) programming model. They develop a new data layout that significantly reduces communication and random DRAM accesses, and branch avoidance mechanisms to get rid of unpredictable data-dependent branches.
% Chen and Chung \cite{chen2021hipa} present HiPa, a hierarchical partitioning methodology to accelerate PageRank by utilizing the memory-cache architecture of the multicore system. For the shared memory, HiPa subdivides the graph based on the NUMA characteristics to reduce remote memory access while ensuring workload balance. For the private cache, HiPa further splits the graph into cache-able partitions to promote in-core computing and cache locality. Based on the partitioning strategy, systematical optimizations are proposed, such as thread management and new data layout.
% Vandromme et al. \cite{vandromme2022scaling} develop and implement a distributed parallel version of PageRank using the power iteration method, instead of computing an approximation of the PageRank scores. Their application is deployed on the supercomputer Fugaku, using up to one thousand compute nodes to assess scalability on random stochastic matrices. These large-scale experiments show that the network-on-chip of the A64FX processor acts as an additional level of computation, in between nodes and cores.
% Kang et al. \cite{kang2020computing} implement PageRank on NVIDIA's newly released DGX A100 cluster (with 32 A100 GPUs) and compare it with the ShenTu and Hronos frameworks on the Common Crawl dataset. They further extend their implementation to a cluster of 2048 A100 GPUs on the Selene supercomputer, using their updated 2D partitioning scheme and a partially doubly compressed sparse column (PDCSC) format --- a hybrid data structure that combines the CSC (compressed sparse column) and DCSC (doubly compressed sparse column) formats \cite{kang2022analyzing}.
% Huang et al. \cite{huang2020accelerating} present Accelerated Parallel PageRank (APPR), which uses the push-based approach for PageRank computation, and minimizes write conflicts by heuristically assigning vertices with common out-neighbors to the same partition, while keeping the partitions load-balanced, by exploiting properties of social network graphs.
% Kim et al. \cite{kim2023multi} present a multi-mode SpMV accelerator for half-to-single transprecision PageRank, which uses half-precision (FP16) initially and later changes its precision to single-precision (FP32). They observe a low $0$-$4\%$ error rate and $1.3\times$-$1.9\times$ speedup compared to single-precision PageRank computation without transprecision.
% Mughrabi et al. \cite{rank-mughrabi21} design a vertex-centric shared-memory accelerator for the PageRank algorithm, using a cache-coherent shared memory interface, known as Coherent Accelerator Processor Interface (CAPI), on FPGA hardware (which is used for acceleration). They also introduce PageRank quantization using 32-bit fixed-point values for improved performance.
% Hassan and Athanas \cite{rank-hassan21} propose a framework, called Graph Analytics on Hybrid Systems (GAHS), for workload partitioning and scheduling of graph analytics on hybrid CPU-FPGA systems.
% Usta \cite{usta2020accelerating} proposes a heterogeneous implementation of the PageRank algorithm with the Propagation Blocking (PB) technique, where the binning phase is executed on the FPGA while accumulation is done on the CPU. His design can handle graphs of any size with no need for on-board FPGA memory.
% Sallinen et al. \cite{sallinen2023real} propose an event-based approach for asynchronously updating PageRank scores of vertices in dynamic graphs. Each vertex stores its present rank and computes its total rank contribution to its out-going neighbors, and thus the value it has previously distributed to them. For each edge addition, the vertex computes the value it previously sent, then sends a negative value representing the difference between the prior value and the new one. For the new edge, it simply sends the adjusted flow as a single delta. For deletion, the algorithm is similar; negative flow is passed across the removed edge, redirecting positive flow to the remaining edges.
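% The delta messages for a single edge insertion might look as follows (a loose Python illustration of the flow idea; the function name and data layout are assumptions, not Sallinen et al.'s event-driven implementation):

```python
# Illustrative sketch: the delta messages vertex u emits when edge (u, w) is
# inserted. Previously u sent damping*ranks[u]/old_degree across each old edge;
# afterwards it should send damping*ranks[u]/new_degree across each edge, so it
# emits only the corrections rather than recomputing any ranks from scratch.

def edge_insert_deltas(u, w, out_edges, ranks, damping=0.85):
    """Return {neighbor: delta} flow adjustments for inserting edge (u, w)."""
    old_outs = out_edges[u]
    old_share = damping * ranks[u] / len(old_outs)
    new_share = damping * ranks[u] / (len(old_outs) + 1)
    deltas = {}
    for v in old_outs:
        deltas[v] = new_share - old_share  # negative correction to prior flow
    deltas[w] = new_share  # the full adjusted flow across the new edge
    return deltas
```

% Note that the deltas sum to zero: the vertex's total outgoing flow is unchanged by the insertion, only its split across edges.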
% Graph algorithms are challenging to implement due to their varying topology and irregular access patterns. Real-world graphs are dynamic in nature and routinely undergo edge and vertex additions, as well as deletions. Typical examples of dynamic graphs are social networks, collaboration networks, and road networks. Applying static algorithms repeatedly on dynamic graphs is inefficient. Further, due to the rapid growth of unstructured and semi-structured data, graph algorithms demand efficient parallel processing. Unfortunately, we know only a little about how to efficiently process dynamic graphs on massively parallel architectures such as GPUs. Existing approaches to represent and process dynamic graphs are either not general or inefficient \cite{concessao2023meerkat}.