Heterogeneous Memory
Although hwloc has supported heterogeneous memory since v2.0, a lot of work remains in the firmware, the operating system and hwloc itself to properly expose it to applications. It matters at least for the following platforms:
- Knights Landing with MCDRAM (high bandwidth) and DDR (low bandwidth).
- Platforms mixing DDR and non-volatile memory DIMMs (slower), for instance Intel Xeon (since Cascade Lake, e.g. 62xx) with Optane DCPMM (Data Center Persistent Memory Modules).
- POWER9 servers with NVLink-connected NVIDIA V100 GPUs. The GPU memory is exposed as additional NUMA nodes.
A dedicated API ([1] for hwloc 2.3) is being designed to ease the selection of appropriate NUMA nodes for allocating a buffer, for instance by saying "I care about latency, what are the local NUMA nodes and which one is better for me, and what's its capacity?".
[1] Draft of API for querying memory node performance attributes at https://github.com/bgoglin/hwloc/blob/mmms/include/hwloc/memattr.h
For now (hwloc 2.1-2.2), here's how applications may find out if a given NUMA node is DDR, MCDRAM or NVDIMMs:
On Knights Landing, there are DDR NUMA nodes and MCDRAM NUMA nodes. MCDRAM NUMA nodes have a "MCDRAM" subtype.
   /* subtype is NULL when no subtype is set, so check it before comparing */
   if (obj->subtype && !strcmp(obj->subtype, "MCDRAM"))
     /* this is a high-bandwidth memory */
In practice, the MCDRAM node is also the second NUMA node attached to each location:
- In alltoall or quadrant mode, there are 2 NUMA nodes total, both attached to the entire machine. The first one is DDR, the second one is MCDRAM.
- In SNC-4 mode, there are 8 nodes total, 2 per SNC group, with DDR first and MCDRAM second.
On POWER9 servers with NVLink-connected GPUs, GPU NUMA nodes are physically numbered down from 255 (250 to 255 if you have 6 GPUs). hwloc hides these NUMA nodes by default to avoid allocating there by mistake (especially to prevent interleaving policies from allocating on both DDR and GPU memories).
They may be unignored by setting HWLOC_KEEP_NVIDIA_GPU_NUMA_NODES=1 in the environment. Once unignored, they are attached to their local CPU package, right after the local DDR. They have a "GPUMemory" subtype.
   /* subtype is NULL when no subtype is set, so check it before comparing */
   if (obj->subtype && !strcmp(obj->subtype, "GPUMemory"))
     /* this is a GPU memory node */
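The physical numbering described above can be turned into a quick sanity check, for instance when looking at node indexes outside of hwloc. This is only a sketch: it assumes the GPU nodes occupy a contiguous range ending at 255, as in the 6-GPU example, and `first_gpu_node_index` and `is_gpu_node_index` are hypothetical helpers, not hwloc functions.

```c
/* Highest physical index used for GPU NUMA nodes, per the numbering above. */
#define GPU_NODE_LAST_INDEX 255

/* Lowest physical index expected for GPU NUMA nodes when gpu_count GPUs
 * are numbered down from 255 (assumption: the range is contiguous). */
int first_gpu_node_index(int gpu_count)
{
    return GPU_NODE_LAST_INDEX + 1 - gpu_count;
}

/* Return 1 if a physical NUMA node index falls in the expected GPU range,
 * 0 otherwise. */
int is_gpu_node_index(int os_index, int gpu_count)
{
    if (gpu_count <= 0)
        return 0;
    return os_index >= first_gpu_node_index(gpu_count)
        && os_index <= GPU_NODE_LAST_INDEX;
}
```

When hwloc is available, checking the "GPUMemory" subtype is the more reliable test, since it does not hardwire the numbering scheme.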
For NVDIMMs, the problem only occurs when they are configured in AppDirect mode and then exposed as "volatile" memory through the kmem dax driver (daxctl reconfigure-device in "system-ram" mode). These nodes have a "DAXDevice" info attribute.
   if (hwloc_obj_get_info_by_name(obj, "DAXDevice") != NULL)
     /* this is non-volatile memory exposed as normal "volatile" memory */
Note that we cannot guarantee that those DAXDevice nodes will actually be slower NUMA nodes, because there are ways to make DDR appear like this. Fortunately, this shouldn't ever occur on production platforms.
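The three checks above can be combined into a single classification step. The following is a minimal sketch, not part of hwloc: `enum mem_kind` and `classify_numa_node` are hypothetical names, and the function works on plain strings so that it does not depend on any particular hwloc version.

```c
#include <stddef.h>
#include <string.h>

/* Memory kinds that the checks described on this page can tell apart.
 * Hypothetical enum, not defined by hwloc. */
enum mem_kind {
    MEM_DEFAULT,       /* no special marking, usually DDR */
    MEM_MCDRAM,        /* "MCDRAM" subtype */
    MEM_GPU,           /* "GPUMemory" subtype */
    MEM_NVDIMM_AS_RAM  /* has a "DAXDevice" info attribute */
};

/* Classify a NUMA node from its subtype string and the value of its
 * "DAXDevice" info attribute. Either argument may be NULL, which is
 * the usual case for plain DDR nodes. */
enum mem_kind classify_numa_node(const char *subtype, const char *dax_device)
{
    if (subtype != NULL && strcmp(subtype, "MCDRAM") == 0)
        return MEM_MCDRAM;
    if (subtype != NULL && strcmp(subtype, "GPUMemory") == 0)
        return MEM_GPU;
    if (dax_device != NULL)
        return MEM_NVDIMM_AS_RAM;
    return MEM_DEFAULT;
}
```

With hwloc, the two arguments would typically be `obj->subtype` and `hwloc_obj_get_info_by_name(obj, "DAXDevice")` for each NUMA node object. As noted above, a `MEM_NVDIMM_AS_RAM` result is a hint about the node, not a guarantee that it is slower.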
Linux kernel developers do not plan to explicitly let us know which node is DDR or NVDIMM because application developers would then hardwire the fact that NVDIMMs are slower than DDR (which is often true but not always). They would rather have us use performance attributes to select the right target NUMA nodes. Modern platforms are supposed to provide an HMAT ACPI table with bandwidth/latency, and kernel 5.2+ exposes some of it to userspace. That's what hwloc 2.3 will use in the aforementioned API.
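To illustrate what selecting by performance attributes rather than by memory kind could look like, here is a minimal sketch. It does not use the hwloc API, which was still a draft at the time of writing: `struct node_perf` and `pick_lowest_latency` are hypothetical, and the caller is assumed to have filled in latency, bandwidth and capacity for each local NUMA node from whatever source is available.

```c
#include <stddef.h>

/* Hypothetical description of one candidate NUMA node.
 * This is NOT an hwloc structure. */
struct node_perf {
    unsigned os_index;                 /* physical NUMA node index */
    unsigned long long latency_ns;     /* lower is better */
    unsigned long long bandwidth_mbs;  /* higher is better, unused here */
    unsigned long long capacity_bytes; /* memory available on the node */
};

/* Return the position in nodes[] of the lowest-latency node that can hold
 * needed_bytes, or -1 if no node is large enough. Ties keep the first
 * matching node. */
int pick_lowest_latency(const struct node_perf *nodes, size_t count,
                        unsigned long long needed_bytes)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (nodes[i].capacity_bytes < needed_bytes)
            continue; /* too small for the requested buffer */
        if (best < 0 || nodes[i].latency_ns < nodes[(size_t)best].latency_ns)
            best = (int)i;
    }
    return best;
}
```

The point of this design is that the application states what it cares about (latency here, with a capacity constraint) and never needs to know whether the chosen node is DDR, MCDRAM or NVDIMM.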
Optane DCPMM may be used as volatile memory (see above) but also as different kinds of storage. Only the volatile case above makes them appear as NUMA nodes where you may allocate "normal" memory buffers. The other kinds are totally different: they are storage that is either managed manually ("dax0.0", i.e. a single large file) or through a filesystem ("pmem0", i.e. a block device). Those devices are not exposed as NUMA nodes but as hwloc Block OS devices (possibly with a NVDIMM subtype).