ssh -A -t fvanmaele@mp-login.ziti.uni-heidelberg.de \
ssh -A fvanmaele@mp-headnode.ziti.uni-heidelberg.de
export UPCXX_INSTALL=$HOME/personalSoftware
module use $HOME/personalSoftware/share/modulefiles
module load upcxx # sets UPCXX_INSTALL to <install directory>, prepends <install directory>/bin to PATH
upcxx <file>.cpp
GASNET_PSHM_NODES=<num processes> ./a.out
int *local = /* ... */;                     // private buffer on this rank
upcxx::rget(remote, local, count).wait();   // remote (global_ptr) -> local
upcxx::rput(local, remote, count).wait();   // local -> remote (global_ptr)
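
A minimal, self-contained sketch of the above (assumes one shared array allocated on rank 0 and broadcast to all ranks; names and sizes are illustrative, not from the exercise code):

#include <upcxx/upcxx.hpp>
#include <vector>
#include <cstdio>

int main() {
    upcxx::init();
    const std::size_t count = 8;

    // rank 0 allocates an array in its shared segment; the global pointer is broadcast to all ranks
    upcxx::global_ptr<int> remote = nullptr;
    if (upcxx::rank_me() == 0)
        remote = upcxx::new_array<int>(count);
    remote = upcxx::broadcast(remote, 0).wait();

    std::vector<int> local(count, upcxx::rank_me()); // private buffer on each rank

    if (upcxx::rank_me() == upcxx::rank_n() - 1)
        upcxx::rput(local.data(), remote, count).wait(); // local -> remote
    upcxx::barrier();
    upcxx::rget(remote, local.data(), count).wait();     // remote -> local
    std::printf("rank %d sees %d\n", upcxx::rank_me(), local[0]);

    upcxx::barrier();
    if (upcxx::rank_me() == 0)
        upcxx::delete_array(remote);
    upcxx::finalize();
}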
---
collectives
dist_object
local_team()
initial value
new_array
!collective
---
reduction
  one array per NUMA node
    upcxx::dist_object(arg, local_team())
    offset: upcxx::rank_me() * block_size
  one array per process
    upcxx::broadcast
    divided heap
  one array per program
    upcxx::dist_object(arg, world())
    offset: upcxx::rank_me() * block_size
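
A sketch of the per-process partial-sum layout (the simplest of the three above); the block size is illustrative, and upcxx::reduce_all is used here to combine the partials instead of hand-rolled communication:

#include <upcxx/upcxx.hpp>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    upcxx::init();
    const std::size_t block_size = 1 << 20;

    // one block per process; each rank owns the slice starting at
    // offset upcxx::rank_me() * block_size of the global array
    std::vector<double> block(block_size, 1.0);
    double partial = std::accumulate(block.begin(), block.end(), 0.0);

    // combine the per-process partial sums; reduce_all delivers the result on every rank
    double total = upcxx::reduce_all(partial, upcxx::op_fast_add).wait();

    if (upcxx::rank_me() == 0)
        std::printf("total = %f\n", total);
    upcxx::finalize();
}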
---
check results against a serial implementation (sketch below)
  randomly generated numbers, fixed seed
  use the same seed for the serial case
  compare results
multiple repetitions
  check for race conditions
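
A sketch of the verification idea (fixed seed so the serial reference and the parallel run see the same input; helper names and the tolerance are assumptions, and the parallel result is a stand-in here):

#include <random>
#include <vector>
#include <numeric>
#include <cmath>
#include <cstdio>

// reproducible input: the same seed yields the same numbers for the
// serial reference and for the parallel run
std::vector<double> make_input(std::size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> v(n);
    for (double &x : v) x = dist(gen);
    return v;
}

int main() {
    const auto v = make_input(1 << 20);
    const double serial = std::accumulate(v.begin(), v.end(), 0.0);
    const double parallel = serial; // stand-in for the UPC++ reduction result
    const bool ok = std::fabs(serial - parallel) <= 1e-9 * std::fabs(serial);
    std::printf("verification: %s\n", ok ? "OK" : "MISMATCH");
}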
---
Questions:
dist_object is a "universal name" across a team of processes, each holding a local value; for a reduction (with one partial-sum value per process or per node), broadcast seems more suitable;
how to retrieve values from multiple nodes? (see the dist_object::fetch sketch after this list)
SSH errors
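
A sketch answering the "retrieve values from multiple nodes" question with dist_object::fetch (the values are illustrative):

#include <upcxx/upcxx.hpp>
#include <cstdio>

int main() {
    upcxx::init();

    // a dist_object is a universal name over a team; each rank stores its own
    // local value (here: a partial sum) behind that name
    upcxx::dist_object<double> partial(1.0 * upcxx::rank_me());
    upcxx::barrier(); // ensure every rank has constructed its dist_object

    if (upcxx::rank_me() == 0) {
        double sum = *partial;
        for (int r = 1; r < upcxx::rank_n(); ++r)
            sum += partial.fetch(r).wait(); // one RPC per remote rank
        std::printf("sum of partials = %f\n", sum);
    }

    upcxx::barrier();
    upcxx::finalize();
}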
---
$ kinit # create kerberos ticket
$ GASNET_SSH_SERVERS='mp-media1 mp-media2 mp-media3 mp-media4' GASNET_SPAWNFN=C GASNET_CSPAWN_CMD='srun -w mp-media[1-4] -n %N %C' upcxx-run -N 4 -n 16 ./upcxx
https://gasnet.lbl.gov/dist/udp-conduit/
---
demonstration of calling "local()" on a remote global pointer (obtained via broadcast); this fails:
//////////////////////////////////////////////////////////////////////
UPC++ assertion failure:
on process 1 (mp-media1.ziti.uni-heidelberg.de)
at /home/nfs/fvanmaele/personalSoftware/upcxx.debug.gasnet_seq.udp/include/upcxx/backend.hpp:187
in function: void* upcxx::backend::localize_memory_nonnull(upcxx::intrank_t, uintptr_t)
Rank 0 is not local with current rank (1).
To have UPC++ freeze during these errors so you can attach a debugger,
rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////////////////////////
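
A guarded sketch of what the error above points at: local() is only valid when the owning rank shares physical memory with the calling rank (same local_team); otherwise fall back to rget (illustrative code, not the exercise program):

#include <upcxx/upcxx.hpp>
#include <cstdio>

int main() {
    upcxx::init();
    upcxx::global_ptr<int> p = nullptr;
    if (upcxx::rank_me() == 0) {
        p = upcxx::new_array<int>(1);
        *p.local() = 42; // the owner can always use local()
    }
    p = upcxx::broadcast(p, 0).wait();
    upcxx::barrier();

    int value = 0;
    if (p.is_local()) {
        value = *p.local();               // direct load, no communication
    } else {
        upcxx::rget(p, &value, 1).wait(); // remote rank: one-sided get instead
    }
    std::printf("rank %d read %d\n", upcxx::rank_me(), value);

    upcxx::barrier();
    if (upcxx::rank_me() == 0)
        upcxx::delete_array(p);
    upcxx::finalize();
}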
---
main overhead is in the communication of partial sums / barrier / finalize
---
Questions:
  std::chrono (see the timing sketch after this list)
    ignore initialization time (?)
      may be done if the array is shared per node (instead of per process)
      limits possible approaches
    otherwise, take the end time after upcxx::finalize?
    begin time: on process 0?
  command-line interface
    time from upcxx-run
      may be unstable due to network delays etc.
  openmp
    "parallel for reduction" seems to have no impact on the reduction (even with icc)
    vectorization already applied from -march=skylake-avx512
  srun / upcxx-run
    maximum number of processes seems to be 16 over 4 servers (with 8 processors each)
    otherwise a "could not allocate" error
  numa openmp example
    does it only work on NUMA architectures with a single operating system? (numactl etc.)
  uneven number of processes / processors
    required to handle? (dist_object, e.g. with new_array, assumes arrays of equal size)
  Serial verification of the symmetrized matrix / exercise
    serialize the matrix (process by process)
  Throughput:
    load + store over the lower/upper triangle of floats
    -> 2 * dim * (dim-1) * sizeof(float) * 1e-9 / time  [GB/s]
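
A sketch of the std::chrono timing discussed above (barriers around the timed region so initialization is excluded and the slowest rank is counted; the throughput line uses the formula above, with an illustrative problem size):

#include <upcxx/upcxx.hpp>
#include <chrono>
#include <cstdio>

int main() {
    upcxx::init();
    const std::size_t dim = 1 << 14; // illustrative problem size
    // ... allocation / initialization, excluded from the measurement ...

    upcxx::barrier(); // all ranks start the clock together
    const auto t0 = std::chrono::steady_clock::now();

    // ... timed kernel (e.g. the symmetrize loop) ...

    upcxx::barrier(); // wait for the slowest rank before stopping the clock
    const auto t1 = std::chrono::steady_clock::now();

    if (upcxx::rank_me() == 0) {
        const double seconds = std::chrono::duration<double>(t1 - t0).count();
        const double gbps = 2.0 * dim * (dim - 1) * sizeof(float) * 1e-9 / seconds;
        std::printf("time %.3f s, throughput %.2f GB/s\n", seconds, gbps);
    }
    upcxx::finalize();
}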
---
SYMMETRIZE - RUNTIMES
Following Gustafson's law, matrix dimensions were chosen sufficiently large (1 << 16, 1 << 16).
The runtimes below are for the compiled program, including initialization.
* 3 runs were done, taking the minimum value.
* All programs were compiled with -O3 -march=knl -std=c++17 and GCC 10.2.
X Y nodes processes (total) conduit runtime (seconds)
---------------------------------------------------------------
1<<16 1<<16 1 1 - 114.907s <---
1<<16 1<<16 1 2 UPCXX_NETWORK=smp 79.053s
1<<16 1<<16 1 2 OpenMP 59.419s
---------------------------------------------------------------
1<<16 1<<16 4 4 UPCXX_NETWORK=udp 42.104s
1<<16 1<<16 1 4 UPCXX_NETWORK=smp 47.688s
1<<16 1<<16 1 4 OpenMP 32.064s
---------------------------------------------------------------
1<<16 1<<16 4 8 UPCXX_NETWORK=udp 27.767s <---
1<<16 1<<16 1 8 UPCXX_NETWORK=smp 31.139s <---
1<<16 1<<16 1 8 OpenMP 21.281s
---------------------------------------------------------------
1<<16 1<<16 4 16 UPCXX_NETWORK=udp 22.845s
1<<16 1<<16 1 16 UPCXX_NETWORK=smp 22.732s
1<<16 1<<16 1 16 OpenMP 14.582s
---------------------------------------------------------------
1<<16 1<<16 4 32 UPCXX_NETWORK=udp 21.962s
1<<16 1<<16 1 32 UPCXX_NETWORK=smp 19.939s
1<<16 1<<16 1 32 OpenMP 13.187s
---------------------------------------------------------------
1<<16 1<<16 4 64 UPCXX_NETWORK=udp 21.681s
1<<16 1<<16 1 64 UPCXX_NETWORK=smp 19.331s
1<<16 1<<16 1 64 OpenMP 12.363s
---------------------------------------------------------------
Conclusion:
* Symmetrization was implemented as a SAXPY operation, so arithmetic intensity is low and the speedup is limited (here ~6x).
* On a single node, UPC++ has significant overhead compared to OpenMP.
* When there is no communication between processes, the udp and smp UPC++ conduits have similar overhead.
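
For reference, a sketch of what the OpenMP kernel might look like (assuming a row-major dim x dim float matrix stored in a flat vector; this is not the exercise code):

#include <vector>
#include <cstddef>

// symmetrize in place: A(i,j) = A(j,i) = (A(i,j) + A(j,i)) / 2,
// walking the lower triangle only
void symmetrize(std::vector<float> &A, std::size_t dim) {
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 1; i < dim; ++i) {
        for (std::size_t j = 0; j < i; ++j) {
            const float s = 0.5f * (A[i * dim + j] + A[j * dim + i]);
            A[i * dim + j] = s;
            A[j * dim + i] = s;
        }
    }
}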
---
symmetrize benchmarks:
For the KNL cluster, the number of processes was chosen as large as the problem size allows. In particular:
1 << 10 up to 1 << 14: 256
1 << 9: 128
1 << 8: 64
1 << 7: 32
1 << 6: 16
1 << 5: 8
For a single KNL node, 64 processes were used for dimensions (1 << 8) up to (1 << 14). For the media cluster, 16 processes were used from (1 << 6) up to (1 << 14), and 8 processes for (1 << 5). For a single media node, 8 processes were used for all dimensions.
--- latency hiding, agglomeration
volume (computation) increases cubically with the block dimension; the halo (communication) increases quadratically
-> surface-to-volume ratio
communication/computation: which cost dominates
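(worked example, assuming a cubic block of side n and a halo of width 1: computation ~ n^3 points, communication ~ 6*n^2 points, so the ratio ~ 6/n shrinks as the block grows; agglomerating tasks into larger blocks therefore reduces the relative communication cost)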
--- mapping
place communicating tasks on the same processor, or on processors close to each other, to increase locality
---