louvain-communities/aa-appendix.tex at main · merferry/louvain-communities · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
\subsection{Indirect Comparison with State-of-the-art Louvain Implementations}
\label{sec:comparison-indirect}

We now conduct an indirect comparison of the performance of our multicore implementation of the Louvain algorithm with other similar state-of-the-art implementations, as listed in Table \ref{tab:compare}. Mohammadi et al. \cite{com-mohammadi20} introduce the Adaptive CUDA Louvain Method (ACLM), which employs GPU acceleration. For a computational setup featuring a $12$-core AMD Opteron 6344 CPU clocked at $2.60$ GHz alongside an NVIDIA Tesla K20Xm GPU, Mohammadi et al. report a runtime of $0.47$ seconds for ACLM when applied to the \textit{in-2004} graph with $13.6$ million edges (refer to Table 6 in \cite{com-mohammadi20}). Moreover, they document runtimes of $3.71$, $0.60$, and $0.95$ seconds for PLM \cite{staudt2015engineering}, APLM \cite{com-fazlali17}, and Rundemanen \cite{com-naim17} respectively (also presented in Table 6 of \cite{com-mohammadi20}). In contrast, leveraging our system equipped with two $16$-core Intel Xeon Gold 6226R CPUs operating at $2.90$ GHz (delivering up to $3.0\times$ higher performance), we process the \textit{indochina-2004} graph boasting $341$ million edges ($25.1\times$ larger) in merely $0.65$ seconds. Consequently, our Louvain implementation outperforms ACLM, PLM, APLM, and Rundemanen by approximately $6.0\times$, $48\times$, $7.7\times$, and $12.2\times$ respectively\ignore{ (all achieved without GPU)}.

Sattar and Arifuzzaman \cite{sattar2022scalable} propose their Distributed Parallel Louvain Algorithm with Load-balancing (DPLAL). On a graph containing $1M$ vertices, utilizing $40$ compute nodes of the Louisiana Optical Network Infrastructure (LONI) QB2 cluster, where each node is equipped with two $10$ core $2.80$ GHz Intel Xeon E5-2680v2 CPUs, they achieve community detection within approximately $80$ seconds (refer to Figure 4 in their paper \cite{sattar2022scalable}). In contrast, employing our system\ignore{featuring two $16$ core Intel Xeon Gold 6226R CPUs operating at $2.90$ GHz} (which is $6.0\times$ slower, assuming DPLAL utilizes each node with only $25\%$ efficiency), we process the \textit{indochina-2004} graph with $7.41M$ vertices ($7.41\times$ larger) in only $0.65$ seconds. Consequently, our Louvain implementation is roughly $5472\times$ faster\ignore{ than DPLAL}\ignore{ exhibits a speedup of roughly $5472\times$ compared to DPLAL}.

Qie et al. \cite{qie2022isolate} introduce a graph partitioning algorithm designed to minimize inter-partition communication delay and avoid community swaps. Their experiments on an $8$ core Intel i9 9900K CPU running at $3.60$ GHz, using the \textit{com-LiveJournal} graph, reveal a runtime of approximately $1050$ seconds (as depicted in Figure 3 of their paper \cite{qie2022isolate}). In contrast, on our system\ignore{ leveraging our system equipped with two $16$ core Intel Xeon Gold 6226R CPUs running at $2.90$ GHz} (roughly $3.2\times$ faster), we process the same graph in merely $1.2$ seconds. Consequently, our Louvain implementation achieves a speedup of about $273\times$.

Bhowmick et al. \cite{com-bhowmik19} present HyDetect, a community detection algorithm tailored for hybrid CPU-GPU systems, employing a divide-and-conquer strategy. Their experimentation on a system comprising a six-core Intel Xeon E5-2620 CPU operating at $2.00$ GHz paired with an NVIDIA Kepler K40M GPU yields a runtime of $1123$ seconds when applied to the \textit{it-2004} graph (as documented in Table 5 of their paper \cite{com-bhowmik19}). In contrast, leveraging our system equipped with two $16$ core Intel Xeon Gold 6226R CPUs running at $2.90$ GHz (approximately $7.7\times$ faster), we process the same graph in $2.7$ seconds. Consequently, our Louvain implementation achieves a speedup of around $54\times$ compared to HyDetect, without using a GPU.

Bhowmick et al. \cite{com-bhowmick22} later introduce a multi-node multi-GPU Louvain community detection algorithm. Their experimentation conducted on a CrayXC40 system, where each node comprises one $12$ core $2.40$ GHz Intel Xeon Ivybridge E5-2695 v2 CPU and one NVIDIA Tesla K40 GPU, demonstrates the capability to identify communities within approximately $32$ seconds on the \textit{uk-2005} graph utilizing $8$ compute nodes (refer to Figure 8 in their paper \cite{com-bhowmick22}). Furthermore, they observe a speedup of about $1.7\times$ compared to Cheong et al. \cite{com-cheong13} on the \textit{uk-2005} graph (as depicted in Figure 11 of their paper), and a speedup of approximately $1.2\times$ relative to Ghost et al. \cite{com-ghosh18} on the \textit{sk-2005} graph (as shown in Figure 10 of their paper). In contrast, on our system featuring two $16$ core Intel Xeon Gold 6226R CPUs operating at $2.90$ GHz (up to $1.6\times$ faster, assuming Bhowmick et al. utilize each compute node with only $25\%$ efficiency), we process the \textit{uk-2005} graph in a mere $4.1$ seconds. Thus, our Louvain implementation achieves at least $4.9\times$, $8.3\times$, and $5.9\times$ speedup compared to Multi-node HyDetect \cite{com-bhowmick22}, Cheong et al. \cite{com-cheong13}, and Ghost et al. \cite{com-ghosh18}, respectively, without\ignore{requiring} the use of a\ignore{single} GPU.

Chou and Ghosh \cite{chou2022batched} introduce Nido, a batched clustering method designed for GPUs. Their experimentation conducted on a system equipped with two $128$ core AMD EPYC 7742 CPUs running at $2.25$ GHz and eight NVIDIA A100 GPUs demonstrates a (geometric) mean speedup of $2.4\times$ compared to Grappolo \cite{com-halappanavar17}, utilizing $128$ threads (as shown in Table 3 of their paper \cite{chou2022batched}).\ignore{Additionally, they find that cuGraph \cite{hricik2020using} offers a mean speedup of $7.7\times$ relative to Grappolo (also depicted in Table 3 of their paper).} In contrast, our Louvain implementation achieves a mean speedup of $22\times$ compared to Grappolo. Consequently, it surpasses Chou and Ghosh by $9.2\times$\ignore{and cuGraph by $2.9\times$, again without employing a single GPU}.

Gawande et al. \cite{com-gawande22} introduce cuVite, an ongoing project focusing on distributed Louvain for heterogeneous systems. They evaluate cuVite, cuGraph, and Rundemanen on a single-GPU of NVIDIA DGX-2, i.e., NVIDIA Tesla V100, and with two $24$ core Intel Xeon Platinum 8168 CPUs running at $2.7$ GHz. Further, they evaluate Grappolo on a $224$ core system with eight $28$ core Intel Xeon Platinum 8276M CPUs running at $2.20$ GHz. They observe that, on average,\ignore{cuGraph,} Rundemanen and cuVite exhibit speedups of\ignore{$3.3\times$,} $3.0\times$ and $3.28\times$ respectively compared to Grappolo. \ignore{Further analysis on the \textit{webbase-2001} graph indicates that cuVite (32 nodes) achieves a speedup of $1.3\times$ (including remapping) or $2.0\times$ faster (excluding remapping) relative to Grappolo (128 threads), while Vite (32 nodes) demonstrates a speedup of 2.3 times. Notably, cuVite does not significantly outperform Vite, and in some cases, it exhibits slower performance.} Consequently, our Louvain implementation achieves a speedup of $6.7\times$ compared to\ignore{both cuGraph and} cuVite, and a speedup of $7.3\times$ relative to Rundemanen\ignore{, all achieved without the utilization of a GPU}.