graph-csr-openmp/05-evaluation.tex at main · merferry/graph-csr-openmp · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
\subsection{Experimental Setup}
\label{sec:setup}

We use a server that has two $16$-core x86-based Intel Xeon Gold 6226R processors running at $2.90$ GHz. Each core has an L1 cache of $1$ MB, an L2 cache of $16$ MB, and a shared L3 cache of $22$ MB. $12$ Seagate Exos 10e2400 $2.4$ TB SAS-12Gbps 2.5inch HDDs (DL2400MM0159), each with a rotational speed of $10k$ rpm, a sustained transfer rate of $241$ (outer diameter) to $117$ Mbps (inner diameter), and an average latency of $2.9$ ms, are installed on the server --- connected with a MegaRAID SAS-3 3008 SAS+SATA Controller. The machine has $512$ GB of system memory, and runs on CentOS Stream 8. We use GCC 8.5 and OpenMP 4.5. Table \ref{tab:dataset} shows the graphs we use in our experiments. All of them are obtained from the SuiteSparse Matrix Collection \cite{kolodziej2019suitesparse}.

\input{src/tab-dataset}


\subsection{Performance Comparison for Reading CSR}

We now compare the performance of GVEL for reading CSR, i.e. reading Edgelist and converting it to CSR, with Hornet, Gunrock, and PIGO. For Hornet and Gunrock, we modify an example and/or source code of the framework to measure time taken to load the graph. For PIGO we include the released header-only file and write a program for measure CSR loading time. For each graph, we measure the CSR reading with each framework five times, for averaging. We also attempt to compare with GraphOne, but it appears to crash for all graphs except \textit{indochina-2004}. Figure \ref{fig:compare-large} shows the runtimes of Hornet, Gunrock, PIGO, and GVEL for reading CSR on each graph in the dataset. GVEL is on average $78\times$, $112\times$, and $1.8\times$ faster than Hornet, Gunrock, and PIGO respectively. The CSR reading time for Hornet is not shown for the graphs \textit{uk-2002}, \textit{it-2004}, and \textit{sk-2005} as it crashes while loading. Note that the runtimes of PIGO and GVEL are not visible on the scale of Figure \ref{fig:compare-large} as they are significantly faster than Hornet and Gunrock. Figure \ref{fig:compare-small} accordingly only shows the runtimes of PIGO and GVEL. Here, we observe that GVEL excels on web graphs, which are characterized by power law distributions and high average degrees. This advantage stems from our adoption of a multi-stage approach in constructing a CSR from the Edgelists, effectively minimizing contention among threads, particularly evident in high-degree graphs. Conversely, on graphs with low average degrees such as road networks and protein k-mer graphs, thread contention is minimal due to the low likelihood of simultaneous edge addition to the same vertex in the CSR targets array. Consequently, GVEL does not perform better than PIGO on such graphs. Further, PIGO only loads the upper-triangular part of undirected graphs (social networks, road networks, and protein k-mer graphs), i.e., it only loads half the number of edges in the graphs. This results in lower runtime being reported for PIGO.

\input{src/fig-compare-large}
\input{src/fig-compare-small}

% The efficiency of reading an edgelist stands out, significantly outpacing the speed of reading a graph stored as CSR, which necessitates an additional conversion step. Surprisingly, the average cost of converting an edgelist to CSR is three times higher than simply reading an edgelist from a file. This underscores the notable speed of modern I/O processes. In the sk-2005 dataset, our approach achieves an impressive reading speed of 1.9 billion edges per second. Notably, it's essential to clarify that the table represents edges as directed, not undirected. To visually represent the edge reading performance across different graphs, we propose the inclusion of a plot illustrating this metric. The sluggishness in CSR reading is justified by its nature as a shared data structure, leading to high contention when multiple threads concurrently attempt to add an edge to the same vertex. The process of adding to CSR structures with small degrees is hampered by false sharing, introducing cache coherency issues that contribute to decreased performance.


\subsection{Performance Comparison for Reading Edgelist}

In Figure \ref{fig:compare-el}, we compare the performance of GVEL for reading Edgelist with PIGO. As before, we measure the Edgelist reading time with each framework five times, for averaging. GVEL is on average $2.6\times$ faster than PIGO - with little skew, i.e., GVEL is more or less equivalently faster than PIGO on all graphs. This is more probably than not, due to the reading of Edgelist being \textit{pleasingly parallel}. On the \textit{sk-2005} graph, GVEL achieves a read performance of $1.9$ billion edges/s. Figure \ref{fig:runtime} illustrates the time GVEL requires to read the edge-list and convert it to CSR for each graph in the dataset, presented separately.

\input{src/fig-compare-el}
\input{src/fig-runtime}


\subsection{Strong Scaling of GVEL}

Finally, we measure the strong scaling performance of GVEL. To this end, we adjust the number of thread from $1$ to $64$ in multiples of $2$ for each input graph, and measure the time taken for reading Edgelist and reading CSR (reading Edgelist + converting to CSR). As for each experiment above, for each graph, and for each thread count, we perform graph loading five times for averaging. The results are shown in Figure \ref{fig:strong-scaling}. With 32 threads, reading Edgelist with GVEL obtains a $25.0\times$ speedup compared to running with a single thread, i.e., the performance of Edgelist reading increases by $1.9\times$ for every doubling of threads. Further, with 32 threads, reading CSR with GVEL obtains a speedup of $13.9\times$ with respect to a single threaded execution, i.e., the CSR reading performance increases by $1.7\times$ for every doubling of threads. This drop in increase of speedup is due that fact that CSR is a shared data structure, leading to high contention when multiple threads concurrently attempt to add an edge to the same vertex, especially for vertices with high degrees. Further, the process of adding edges to vertices with small degrees is hampered by false sharing of cache lines, which contribute to decreased performance. At 64 threads, both reading Edgelist and reading CSR are impacted by NUMA effects, and offer speedups of only $27.2\times$ and $16.4\times$ respectively.

\input{src/fig-strong-scaling}


% \textbf{Figure 3}:


% We observe that:
% - reading edgelist is significantly faster than reading a graph as CSR (this requires and additional conversion step)
% - in fact, on average, converting an edgelist to CSR is 3 times more costly than reading and edgelist from file (which is quite surprising indded). This shows that modern IO is fastr (dhooom).
% - We are able to read 1.9B edges per second (the table put edges as directed not undirected) in sk-2005.
% - Can we have a plot on edge reading performance per graph
% - (Justify why CSR reading is slow) converting to CSR is slow becuase it is a shared data structure resulting in high contention beteen thread, when multiple threads attempt to add an edge to the same vertex.
% - Also adding to CSR with small degrees suffer from false sharing (cache coherency issue).

% \textbf{Figure 4}:


% Our observations reveal:

% Reading edgelists exhibits impressive scalability, achieving a remarkable 25x scale on 32 threads.
% However, at 64 threads, performance is impacted by NUMA effects.
% Notably, reading CSR does not scale as well, primarily due to issues such as false sharing and contention.
% The influence of NUMA effects is also discernible in CSR reading performance.
% The scaling pattern follows multiples of 2 (linear Y-axis, logarithmic X-axis).

% There is a paper called "Efficient Memory Mapped File I/O for In-Memory File Systems" on the topic - where Choi et al. working at Samsung say that mmap() is a good interface for fast IO (in contrast to file streams) and propose async map-ahead based madvise() for low-latency NVM storage devices. The also have a good slides on this - where their (modified) extended madvice obtains ~2.5x better performance than default mmap() by minimizing the number of page faults.
% I tried parsing integers from text and saving into per-thread integer lists. To over-allocate memory for per-thread integer-lists i use sufficient size mmap() instead of malloc(). Even if i over-allocate, due to virtual memory, only memory as much i need is actually used.