Commit beff32f

Merge pull request #630 from bgoglin/memtiers
Rework the heuristics to better guess NUMA node subtypes, and add CXL subtypes (CXL-DRAM and CXL-NVM only so far). Build "tiers" internally, based on hw info and performance. Expose the tier "rank" as MemoryTier=X in info attr. Support numa[tier=1] in command-line tools.
2 parents a1b9391 + a474591 commit beff32f

26 files changed

+1114
-216
lines changed

doc/hwloc.doxy

Lines changed: 224 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,7 @@
3939
<li> \ref miscobjs
4040
<li> \ref attributes
4141
<li> \ref topoattrs
42+
<li> \ref heteromem
4243
<li> \ref xml
4344
<li> \ref synthetic
4445
<li> \ref interoperability
@@ -1117,6 +1118,31 @@ following environment variables.
11171118
Bit 2 enables the use of target/initiator information.
11181119
</dd>
11191120

1121+
<dt>HWLOC_MEMTIERS_GUESS=none</dt>
1122+
<dt>HWLOC_MEMTIERS_GUESS=all</dt>
1123+
<dd>Disable or enable all heuristics to guess memory subtypes and tiers.
1124+
By default, hwloc only uses heuristics that are likely correct
1125+
and disables those that are unlikely.
1126+
</dd>
1127+
<!-- since 2.10, not stable yet, hence not documented
1128+
HWLOC_MEMTIERS_GUESS=spm_is_hbm,node0_is_dram
1129+
assume all SPM nodes are HBM, assume node0 is in the DRAM tier
1130+
-->
1131+
1132+
<dt>HWLOC_MEMTIERS=0x0f=HBM;0xf=DRAM</dt>
1133+
<dd>Enforce the memory tiers from the given semi-colon separated list.
1134+
Each entry specifies a bitmask (nodeset) of NUMA nodes and their subtype.
1135+
Nodes not listed in any entry are not placed in any tier.
1136+
1137+
If an empty value or <tt>none</tt> is given, tiers are entirely disabled.
1138+
</dd>
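
The entry format described above (a nodeset bitmask, an equals sign, then a subtype, with entries separated by semi-colons) can be illustrated with a small standalone parser. This is only a sketch of the syntax, not the code hwloc uses internally: <tt>struct tier_entry</tt> and <tt>parse_memtiers()</tt> are hypothetical names, and the nodeset is assumed to fit in a single <tt>unsigned long</tt>, which real hwloc bitmasks are not limited to.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for one "nodeset=subtype" entry; not an hwloc type. */
struct tier_entry {
  unsigned long nodeset; /* bitmask of NUMA node OS indexes */
  char subtype[16];      /* e.g. "HBM" or "DRAM" */
};

/* Parse a semi-colon separated list such as "0x0f=HBM;0xf0=DRAM".
 * Returns the number of entries stored in out[], or -1 on malformed input.
 * An empty string or "none" yields 0 entries (tiers disabled). */
int parse_memtiers(const char *value, struct tier_entry *out, int max)
{
  int n = 0;
  assert(value != NULL && out != NULL);
  if (!*value || !strcmp(value, "none"))
    return 0;
  while (*value) {
    char *end;
    const char *sep;
    size_t len;
    if (n == max)
      return -1;
    /* base 16 accepts an optional "0x" prefix */
    out[n].nodeset = strtoul(value, &end, 16);
    if (end == value || *end != '=')
      return -1;
    value = end + 1;
    /* the subtype runs up to the next ';' or the end of the string */
    sep = strchr(value, ';');
    len = sep ? (size_t)(sep - value) : strlen(value);
    if (len == 0 || len >= sizeof(out[n].subtype))
      return -1;
    memcpy(out[n].subtype, value, len);
    out[n].subtype[len] = '\0';
    n++;
    value += len;
    if (*value == ';')
      value++;
  }
  return n;
}
```

Calling <tt>parse_memtiers("0x0f=HBM;0xf0=DRAM", entries, 4)</tt> fills two entries; nodes whose bit appears in neither mask would be left out of every tier, as stated above.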
1139+
1140+
<dt>HWLOC_MEMTIERS_REFRESH=1</dt>
1141+
<dd>Force the rebuilding of memory tiers.
1142+
This is mostly useful when importing an XML topology from an old hwloc
1143+
version which was not able to guess memory subtypes and tiers.
1144+
</dd>
1145+
11201146
<dt>HWLOC_GROUPING=1</dt>
11211147
<dd>enables or disables objects grouping based on distances.
11221148
By default, hwloc uses distance matrices between objects (either read
@@ -1925,8 +1951,13 @@ subtype <tt>DRAM</tt> (for usual main memory),
19251951
<tt>HBM</tt> (high-bandwidth memory),
19261952
<tt>SPM</tt> (specific-purpose memory, usually reserved for some custom applications),
19271953
<tt>NVM</tt> (non-volatile memory when used as main memory),
1928-
<tt>MCDRAM</tt> (on KNL)
1929-
or <tt>GPUMemory</tt> (on POWER architecture with NVIDIA GPU memory shared over NVLink).
1954+
<tt>MCDRAM</tt> (on KNL),
1955+
<tt>GPUMemory</tt> (on POWER architecture with NVIDIA GPU memory shared over NVLink),
1956+
<tt>CXL-DRAM</tt> or <tt>CXL-NVM</tt> for CXL DRAM or non-volatile memory.
1957+
Note that some of these subtypes are guessed by the library;
1958+
they might be missing or slightly wrong in some corner cases.
1959+
See \ref heteromem for details, and HWLOC_MEMTIERS and HWLOC_MEMTIERS_GUESS
1960+
in \ref envvar for tuning these.
19301961
</li>
19311962
<li>Groups:
19321963
subtype <tt>Cluster</tt>, <tt>Module</tt>, <tt>Tile</tt>, <tt>Compute Unit</tt>,
@@ -2258,6 +2289,12 @@ and GID #1 of port #3.
22582289
These info attributes are attached to objects specified in parentheses.
22592290

22602291
<dl>
2292+
<dt>MemoryTier (NUMA Nodes)</dt>
2293+
<dd>The rank of the memory tier of this node.
2294+
Ranks start from 0 for highest bandwidth nodes.
2295+
The attribute is only set if multiple tiers are found.
2296+
See \ref heteromem.
2297+
</dd>
22612298
<dt>CXLDevice (NUMA Nodes or DAX Memory OS devices)</dt>
22622299
<dd>The PCI/CXL bus ID of a device whose CXL Type-3 memory is exposed here.
22632300
If multiple devices are interleaved, their bus IDs are separated by commas,
@@ -2465,6 +2502,11 @@ The memory attributes API is located in hwloc/memattrs.h,
24652502
see \ref hwlocality_memattrs and \ref hwlocality_memattrs_manage for details.
24662503
See also an example in doc/examples/memory-attributes.c in the source tree.
24672504

2505+
Memory attributes are the low-level solution to selecting target
2506+
memory. hwloc uses them internally to build Memory Tiers which provide
2507+
an easy way to distinguish NUMA nodes of different kinds, as explained
2508+
in \ref heteromem.
2509+
24682510

24692511
\htmlonly
24702512
</div><div class="section" id="topoattrs_cpukinds">
@@ -2524,6 +2566,186 @@ See \ref hwlocality_cpukinds for details.
25242566

25252567

25262568

2569+
\page heteromem Heterogeneous Memory
2570+
2571+
\htmlonly
2572+
<div class="section">
2573+
\endhtmlonly
2574+
2575+
Heterogeneous memory hardware exposes different NUMA nodes for
2576+
different memory technologies.
2577+
On the image below, a dual-socket server has both HBM (high bandwidth
2578+
memory) and usual DRAM connected to each socket, as well as some
2579+
CXL memory connected to the entire machine.
2580+
2581+
\image html heteromem.png
2582+
\image latex heteromem.png "" width=\textwidth
2583+
2584+
The hardware usually exposes "normal" memory first because it is
2585+
where "normal" data buffers should be allocated by default.
2586+
However there is no guarantee about whether HBM, NVM, CXL will appear
2587+
second.
2588+
Hence memory technologies and performance need to be made explicit
2589+
to help users decide where to allocate.
2590+
2591+
\htmlonly
2592+
</div><div class="section" id="heteromem_memtiers">
2593+
\endhtmlonly
2594+
\section heteromem_memtiers Memory Tiers
2595+
2596+
hwloc builds <i>Memory Tiers</i> to identify different kinds of
2597+
NUMA nodes.
2598+
On the above machine, the first tier would contain both HBM NUMA nodes
2599+
(L\#1 and L\#3), while the second tier would contain both DRAM nodes
2600+
(L\#0 and L\#2), and the CXL memory (L\#4) would be in the third tier.
2601+
NUMA nodes are then annotated accordingly:
2602+
<ul>
2603+
<li> Each node object has its <tt>subtype</tt> field set to <tt>HBM</tt>,
2604+
<tt>DRAM</tt> or <tt>CXL-DRAM</tt>
2605+
(see other possible values in \ref attributes_normal).
2606+
<li> Each node also has a string info attribute with name
2607+
<tt>MemoryTier</tt> and value <tt>0</tt> for the first tier,
2608+
<tt>1</tt> for the second, etc.
2609+
</ul>
2610+
2611+
Tiers are built using two kinds of information:
2612+
<ul>
2613+
<li>First hwloc looks into operating system information to find out
2614+
whether a node is non-volatile, CXL, special-purpose, etc.
2615+
<li>Then it combines that knowledge with performance metrics exposed
2616+
by the hardware to guess what's actually DRAM, HBM, etc.
2617+
These metrics are also exposed in hwloc Memory Attributes, for
2618+
instance bandwidth and latency, for read and write.
2619+
See \ref topoattrs_memattrs and \ref hwlocality_memattrs for more details.
2620+
</ul>
2621+
2622+
Once nodes with similar or different characteristics are identified,
2623+
they are placed in tiers.
2624+
Tiers are then sorted by bandwidth so that the highest bandwidth
2625+
is ranked first, etc.
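
The ranking step described above can be sketched in a few lines of plain C. This is an illustration under stated assumptions, not hwloc's internal implementation: <tt>struct tier</tt> and <tt>rank_tiers()</tt> are hypothetical names, and each tier is assumed to carry a single representative bandwidth value.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tier record for illustration; not an hwloc structure. */
struct tier {
  const char *subtype;     /* "DRAM", "HBM", "CXL-DRAM", ... */
  unsigned long bandwidth; /* representative bandwidth, arbitrary unit */
  int rank;                /* filled in by rank_tiers() */
};

/* qsort comparator: higher bandwidth sorts first. */
static int cmp_bandwidth_desc(const void *a, const void *b)
{
  const struct tier *ta = a, *tb = b;
  if (ta->bandwidth > tb->bandwidth)
    return -1;
  if (ta->bandwidth < tb->bandwidth)
    return 1;
  return 0;
}

/* Sort tiers by decreasing bandwidth and number them from 0. */
void rank_tiers(struct tier *tiers, size_t n)
{
  size_t i;
  assert(tiers != NULL);
  qsort(tiers, n, sizeof(*tiers), cmp_bandwidth_desc);
  for (i = 0; i < n; i++)
    tiers[i].rank = (int)i;
}
```

With made-up bandwidths where HBM is fastest and CXL-DRAM slowest, the ranks come out as 0 for HBM, 1 for DRAM and 2 for CXL-DRAM, which matches the <tt>MemoryTier</tt> values of the example platform.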
2626+
2627+
If hwloc fails to build tiers properly, see <tt>HWLOC_MEMTIERS</tt>
2628+
and <tt>HWLOC_MEMTIERS_GUESS</tt> in \ref envvar.
2629+
2630+
2631+
\htmlonly
2632+
</div><div class="section" id="heteromem_use_cli">
2633+
\endhtmlonly
2634+
\section heteromem_use_cli Using Heterogeneous Memory from the command-line
2635+
2636+
Tiers may be specified in location filters when using NUMA nodes
2637+
in hwloc command-line tools.
2638+
For instance, binding memory on the first HBM node (<tt>numa[hbm]:0</tt>)
2639+
is actually equivalent to binding on the second node (<tt>numa:1</tt>)
2640+
on our example platform:
2641+
\verbatim
2642+
$ hwloc-bind --membind 'numa[hbm]:0' -- myprogram
2643+
$ hwloc-bind --membind 'numa:1' -- myprogram
2644+
\endverbatim
2645+
To count DRAM nodes in the first CPU package, or all nodes:
2646+
\verbatim
2647+
$ hwloc-calc -N 'numa[dram]' package:0
2648+
1
2649+
$ hwloc-calc -N 'numa' package:0
2650+
2
2651+
\endverbatim
2652+
To list all the physical indexes of Tier-0 NUMA nodes (HBM P\#2 and P\#3 not shown on the figure):
2653+
\verbatim
2654+
$ hwloc-calc -I 'numa[tier=0]' -p all
2655+
2,3
2656+
\endverbatim
2657+
2658+
hwloc-calc and hwloc-bind also have options such as
2659+
<tt>\--local-memory</tt> and <tt>\--best-memattr</tt>
2660+
to select the best NUMA node among the local ones.
2661+
For instance, the following command-lines say that,
2662+
among nodes near node:0 (DRAM L\#0),
2663+
the best one for latency is itself
2664+
while the best one for bandwidth is node:1 (HBM L\#1).
2665+
\verbatim
2666+
$ hwloc-calc --best-memattr latency node:0
2667+
0
2668+
$ hwloc-calc --best-memattr bandwidth node:0
2669+
1
2670+
\endverbatim
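
The selection rule behind these two commands can be sketched as follows: bandwidth is a "higher is better" attribute while latency is a "lower is better" one. The structure, the function name and any performance values passed to it are hypothetical; a real program would query hwloc memory attributes instead of filling in numbers by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one local NUMA node; not an hwloc structure. */
struct local_node {
  unsigned logical_index;
  unsigned long bandwidth; /* higher is better */
  unsigned long latency;   /* lower is better */
};

/* Return the logical index of the best node among nodes[0..n-1],
 * comparing latency if by_latency is non-zero, bandwidth otherwise.
 * Returns -1 if the list is empty. */
int best_local_node(const struct local_node *nodes, size_t n, int by_latency)
{
  size_t i, best = 0;
  assert(nodes != NULL || n == 0);
  if (n == 0)
    return -1;
  for (i = 1; i < n; i++) {
    if (by_latency ? nodes[i].latency < nodes[best].latency
                   : nodes[i].bandwidth > nodes[best].bandwidth)
      best = i;
  }
  return (int)nodes[best].logical_index;
}
```

If DRAM L\#0 is given the lower latency and HBM L\#1 the higher bandwidth (invented values), the function returns 0 for latency and 1 for bandwidth, as in the command-lines above.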
2671+
2672+
2673+
\htmlonly
2674+
</div><div class="section" id="heteromem_use_api">
2675+
\endhtmlonly
2676+
\section heteromem_use_api Using Heterogeneous Memory from the C API
2677+
2678+
There are two major changes introduced by heterogeneous memory
2679+
when looking at the hierarchical tree of objects.
2680+
<ul>
2681+
<li> First, there may be multiple memory children attached at the same
2682+
place.
2683+
For instance, each Package in the above image has two memory children,
2684+
one for the DRAM NUMA node, and another one for the HBM node.
2685+
<li> Second, memory children may be attached at different levels.
2686+
In the above image, CXL memory is attached to the root Machine object
2687+
instead of below a Package.
2688+
</ul>
2689+
2690+
Hence, applications may have to rethink the way they select NUMA nodes.
2691+
2692+
\subsection heteromem_use_api_iterate Iterating over the list of (heterogeneous) NUMA nodes
2693+
2694+
A common need consists in iterating over the list of NUMA nodes
2695+
(e.g. using hwloc_get_next_obj_by_type()).
2696+
This is useful for counting some domains before partitioning a job,
2697+
or for finding a node that is local to some objects.
2698+
With heterogeneous memory, one should remember that multiple nodes may
2699+
now have the same locality (HBM and DRAM above) or overlapping localities
2700+
(e.g. DRAM and CXL above).
2701+
Checking NUMA node subtype or tier attributes is a good way to avoid
2702+
this issue by ignoring nodes of different kinds.
2703+
2704+
Another solution consists in ignoring nodes whose cpuset overlaps the
2705+
previously selected ones.
2706+
For instance, in the above example, one could first select DRAM L\#0
2707+
but ignore HBM L\#1 (because it overlaps with DRAM L\#0),
2708+
then select DRAM L\#2 but ignore HBM L\#3 and CXL L\#4
2709+
(overlap with DRAM L\#2).
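
The overlap-based selection can be sketched with plain bitmasks. This is a simplified model: <tt>struct numa_view</tt> and <tt>select_non_overlapping()</tt> are hypothetical names, and cpusets are assumed to fit in an <tt>unsigned long</tt>. Real code would walk hwloc NUMA node objects and compare their cpusets with hwloc_bitmap_intersects().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of a NUMA node; not an hwloc structure. */
struct numa_view {
  unsigned logical_index;
  unsigned long cpuset; /* one bit per PU */
};

/* Walk nodes in order and keep those whose cpuset does not overlap
 * the cpusets of the nodes already kept.
 * Stores the kept logical indexes in selected[] (at least n slots)
 * and returns how many were kept. */
size_t select_non_overlapping(const struct numa_view *nodes, size_t n,
                              unsigned *selected)
{
  unsigned long covered = 0;
  size_t count = 0, i;
  assert(nodes != NULL && selected != NULL);
  for (i = 0; i < n; i++) {
    if (nodes[i].cpuset & covered)
      continue; /* overlaps a previously selected node */
    covered |= nodes[i].cpuset;
    selected[count++] = nodes[i].logical_index;
  }
  return count;
}
```

Fed with the five nodes of the example in logical order (cpusets 0x0f, 0x0f, 0xf0, 0xf0 and 0xff), it keeps only DRAM L\#0 and DRAM L\#2. The result depends on iteration order: whichever node comes first for a given locality is the one that gets kept.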
2710+
2711+
<br/>
2712+
2713+
It is also possible to iterate over the memory parents (e.g. Packages
2714+
in our example) and select only one memory child for each of them.
2715+
hwloc_get_memory_parents_depth() may be used to find the depth
2716+
of these parents.
2717+
However this method only works if all memory parents are at the same level.
2718+
It would fail in our example: the root Machine object
2719+
also has a memory child (CXL), hence hwloc_get_memory_parents_depth()
2720+
would return ::HWLOC_TYPE_DEPTH_MULTIPLE.
2721+
2722+
2723+
\subsection heteromem_use_api_vertical Iterating over local (heterogeneous) NUMA nodes
2724+
2725+
Another common need is to find NUMA nodes that are local to some
2726+
objects (e.g. a Core).
2727+
A basic solution consists in looking at the Core nodeset and iterating
2728+
over NUMA nodes to select those whose nodeset is included.
2729+
A nicer solution is to walk up the tree to find ancestors with a
2730+
memory child.
2731+
With heterogeneous memory, multiple such ancestors may exist
2732+
(Package and Machine in our example) and they may have multiple memory
2733+
children.
2734+
2735+
Both these methods may be replaced with hwloc_get_local_numanode_objs()
2736+
which provides a convenient and flexible way to retrieve local NUMA nodes.
2737+
One may then iterate over the returned array to select the appropriate one(s)
2738+
depending on their subtype, tier or performance attributes.
2739+
2740+
<br>
2741+
2742+
hwloc_memattr_get_best_target() is also a convenient way to select
2743+
the best local NUMA node according to performance metrics.
2744+
See also \ref hwlocality_memattrs.
2745+
2746+
2747+
2748+
25272749
\page xml Importing and exporting topologies from/to XML files
25282750

25292751
\htmlonly

doc/images/HACKING

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -10,3 +10,7 @@ done
1010
for f in ppc64-without-smt ppc64-with-smt ppc64-full-with-smt ; do
1111
LANG=C lstopo -i ${f}.xml --horiz --no-legend --logical --no-index --index=pu --index=numa --index=core --index=pack --no-factorize -f ${f}.png ;
1212
done
13+
14+
for f in heteromem ; do
15+
LANG=C lstopo -i ${f}.xml --horiz --no-legend --logical --no-index --ignore pu --index=numa --index=core --index=pack --no-factorize -f ${f}.png ;
16+
done

doc/images/heteromem.png

9.71 KB

doc/images/heteromem.xml

Lines changed: 62 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,62 @@
1+
<?xml version="1.0" encoding="UTF-8"?>
2+
<!DOCTYPE topology SYSTEM "hwloc2.dtd">
3+
<topology version="3.0">
4+
<object type="Machine" os_index="0" cpuset="0x000000ff" complete_cpuset="0x000000ff" allowed_cpuset="0x000000ff" nodeset="0x0000001f" complete_nodeset="0x0000001f" allowed_nodeset="0x0000001f" gp_index="1" id="obj1">
5+
<object type="NUMANode" os_index="4" cpuset="0x000000ff" complete_cpuset="0x000000ff" nodeset="0x00000010" complete_nodeset="0x00000010" gp_index="24" id="obj24" local_memory="1073741824" subtype="CXL-DRAM">
6+
<page_type size="4096" count="262144"/>
7+
<info name="MemoryTier" value="2"/>
8+
</object>
9+
<object type="Package" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="10" id="obj10">
10+
<object type="NUMANode" os_index="0" cpuset="0x0000000f" complete_cpuset="0x0000000f" nodeset="0x00000001" complete_nodeset="0x00000001" gp_index="11" id="obj11" local_memory="1073741824" subtype="DRAM">
11+
<page_type size="4096" count="262144"/>
12+
<info name="MemoryTier" value="1"/>
13+
</object>
14+
<object type="NUMANode" os_index="1" cpuset="0x0000000f" complete_cpuset="0x0000000f" nodeset="0x00000002" complete_nodeset="0x00000002" gp_index="12" id="obj12" local_memory="1073741824" subtype="HBM">
15+
<page_type size="4096" count="262144"/>
16+
<info name="MemoryTier" value="0"/>
17+
</object>
18+
<object type="Core" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="3" id="obj3">
19+
<object type="PU" os_index="0" cpuset="0x00000001" complete_cpuset="0x00000001" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="2" id="obj2"/>
20+
</object>
21+
<object type="Core" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="5" id="obj5">
22+
<object type="PU" os_index="1" cpuset="0x00000002" complete_cpuset="0x00000002" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="4" id="obj4"/>
23+
</object>
24+
<object type="Core" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="7" id="obj7">
25+
<object type="PU" os_index="2" cpuset="0x00000004" complete_cpuset="0x00000004" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="6" id="obj6"/>
26+
</object>
27+
<object type="Core" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="9" id="obj9">
28+
<object type="PU" os_index="3" cpuset="0x00000008" complete_cpuset="0x00000008" nodeset="0x00000013" complete_nodeset="0x00000013" gp_index="8" id="obj8"/>
29+
</object>
30+
</object>
31+
<object type="Package" os_index="1" cpuset="0x000000f0" complete_cpuset="0x000000f0" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="21" id="obj21">
32+
<object type="NUMANode" os_index="2" cpuset="0x000000f0" complete_cpuset="0x000000f0" nodeset="0x00000004" complete_nodeset="0x00000004" gp_index="22" id="obj22" local_memory="1073741824" subtype="DRAM">
33+
<page_type size="4096" count="262144"/>
34+
<info name="MemoryTier" value="1"/>
35+
</object>
36+
<object type="NUMANode" os_index="3" cpuset="0x000000f0" complete_cpuset="0x000000f0" nodeset="0x00000008" complete_nodeset="0x00000008" gp_index="23" id="obj23" local_memory="1073741824" subtype="HBM">
37+
<page_type size="4096" count="262144"/>
38+
<info name="MemoryTier" value="0"/>
39+
</object>
40+
<object type="Core" os_index="4" cpuset="0x00000010" complete_cpuset="0x00000010" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="14" id="obj14">
41+
<object type="PU" os_index="4" cpuset="0x00000010" complete_cpuset="0x00000010" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="13" id="obj13"/>
42+
</object>
43+
<object type="Core" os_index="5" cpuset="0x00000020" complete_cpuset="0x00000020" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="16" id="obj16">
44+
<object type="PU" os_index="5" cpuset="0x00000020" complete_cpuset="0x00000020" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="15" id="obj15"/>
45+
</object>
46+
<object type="Core" os_index="6" cpuset="0x00000040" complete_cpuset="0x00000040" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="18" id="obj18">
47+
<object type="PU" os_index="6" cpuset="0x00000040" complete_cpuset="0x00000040" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="17" id="obj17"/>
48+
</object>
49+
<object type="Core" os_index="7" cpuset="0x00000080" complete_cpuset="0x00000080" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="20" id="obj20">
50+
<object type="PU" os_index="7" cpuset="0x00000080" complete_cpuset="0x00000080" nodeset="0x0000001c" complete_nodeset="0x0000001c" gp_index="19" id="obj19"/>
51+
</object>
52+
</object>
53+
</object>
54+
<support name="discovery.pu"/>
55+
<support name="discovery.numa"/>
56+
<support name="discovery.numa_memory"/>
57+
<support name="custom.exported_support"/>
58+
<info name="Backend" value="Synthetic"/>
59+
<info name="SyntheticDescription" value="[numa] pack:2 [numa] [numa] core:4 1"/>
60+
<info name="hwlocVersion" value="3.0.0a1-git"/>
61+
<info name="ProcessName" value="lstopo"/>
62+
</topology>
