@@ -154,6 +154,90 @@ that no processes or threads escape the cgroups. This sync is
154 154 done via a pipe (specified in the runtime section below) that the container's
155 155 init process will block waiting for the parent to finish setup.
156 156
157+ ### IntelRdt
158+
159+ Intel platforms with newer Xeon CPUs support Intel Resource Director Technology
160+ (RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT that
161+ currently supports L3 cache resource allocation.
162+
163+ This feature lets software restrict cache allocation to a defined 'subset'
164+ of the L3 cache, which may overlap with other 'subsets'. The subsets are
165+ identified by a class of service (CLOS), and each CLOS has a capacity
166+ bitmask (CBM).
167+
168+ It can be used to handle L3 cache resource allocation for containers if the
169+ hardware and kernel support Intel RDT/CAT.
170+
171+ In Linux kernel 4.10 or newer, the interface is defined and exposed via the
172+ "resource control" filesystem, which is a "cgroup-like" interface.
173+
174+ Compared with cgroups, it has a similar process management lifecycle and
175+ similar interfaces in a container. But unlike the cgroups hierarchy, it has a
176+ single-level filesystem layout.
177+
178+ Intel RDT "resource control" filesystem hierarchy:
179+ ```
180+ mount -t resctrl resctrl /sys/fs/resctrl
181+ tree /sys/fs/resctrl
182+ /sys/fs/resctrl/
183+ |-- info
184+ |   |-- L3
185+ |       |-- cbm_mask
186+ |       |-- min_cbm_bits
187+ |       |-- num_closids
188+ |-- cpus
189+ |-- schemata
190+ |-- tasks
191+ |-- <container_id>
192+     |-- cpus
193+     |-- schemata
194+     |-- tasks
195+
196+ ```
197+
198+ For runc, we can make use of the `tasks` and `schemata` files to configure L3
199+ cache resource constraints.
200+
201+ The file `tasks` has a list of tasks that belong to this group (e.g., the
202+ "<container_id>" group). Tasks can be added to a group by writing the task ID
203+ to the `tasks` file (which will automatically remove them from the previous
204+ group to which they belonged). New tasks created by fork(2) and clone(2) are
205+ added to the same group as their parent. If a pid is not in any sub-group, it
206+ is in the root group.
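
As a rough illustration of the paragraph above, moving a task into a group amounts to one write to that group's `tasks` file. This is a minimal sketch, not runc's implementation: the function name `add_task_to_group` and its `resctrl_root` parameter are hypothetical, and it assumes the filesystem is already mounted at `/sys/fs/resctrl` and that the caller has permission to write there.

```python
import os


def add_task_to_group(pid, group, resctrl_root="/sys/fs/resctrl"):
    """Write `pid` to <resctrl_root>/<group>/tasks.

    Hypothetical helper for illustration only. On a real resctrl mount the
    kernel moves the task out of its previous group as a side effect.
    """
    tasks_path = os.path.join(resctrl_root, group, "tasks")
    # One task ID per write, as described in the text above.
    with open(tasks_path, "w") as f:
        f.write(str(pid))
```

For example, `add_task_to_group(os.getpid(), "<container_id>")` would place the calling process into that container's group, assuming the group directory already exists.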
207+
208+ The file `schemata` has the allocation masks/values for the L3 cache on each
209+ socket. Each entry contains an L3 cache id and a capacity bitmask (CBM).
210+ ```
211+ Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
212+ ```
213+ For example, on a two-socket machine, the L3 schema line could be `L3:0=ff;1=c0`,
214+ which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
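
The schema format above can be read mechanically. The following is a small sketch, assuming only the `L3:<cache_id>=<cbm>;...` form shown here; `parse_l3_schema` is a hypothetical name, not a runc or kernel API.

```python
def parse_l3_schema(line):
    """Parse 'L3:<id0>=<cbm0>;<id1>=<cbm1>;...' into {cache_id: cbm}.

    Hypothetical helper for illustration; CBMs are hexadecimal without a
    '0x' prefix, as in the schemata file.
    """
    resource, _, body = line.strip().partition(":")
    if resource != "L3" or not body:
        raise ValueError("not an L3 schema line: %r" % line)
    masks = {}
    for entry in body.split(";"):
        cache_id, _, cbm = entry.partition("=")
        masks[int(cache_id)] = int(cbm, 16)
    return masks
```

With the example from the text, `parse_l3_schema("L3:0=ff;1=c0")` yields a mapping from cache id 0 to 0xff and from cache id 1 to 0xc0.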
215+
216+ A valid L3 cache CBM is a *contiguous set of bits*, and the number of bits that
217+ can be set cannot exceed the maximum number of CBM bits. The maximum number of
218+ CBM bits varies among supported Intel Xeon platforms. In the Intel RDT "resource
219+ control" filesystem layout, the CBM in a group should be a subset of the CBM in
220+ root. The kernel checks validity on write. For example, 0xfffff in root indicates
221+ that the CBM has at most 20 bits, which map to the entire L3 cache capacity. Some
222+ valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00, etc.
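
The two rules above (contiguous bits, subset of the root CBM) can be checked before writing to `schemata`. This is only a sketch of those two rules: `is_valid_cbm` is a hypothetical name, the kernel remains the authority, and platform limits such as `min_cbm_bits` are deliberately not modeled here.

```python
def is_valid_cbm(cbm, root_cbm):
    """Return True if `cbm` is non-empty, contiguous, and a subset of `root_cbm`.

    Hypothetical pre-check for illustration; it does not replace the
    kernel's own validation.
    """
    if cbm <= 0:
        return False
    # Subset rule: no bit set outside the root mask.
    if cbm & ~root_cbm:
        return False
    # Contiguity rule: drop trailing zeros, then the rest must be all ones,
    # i.e. of the form 2**k - 1, for which x & (x + 1) == 0.
    while cbm & 1 == 0:
        cbm >>= 1
    return cbm & (cbm + 1) == 0
```

Against a root CBM of 0xfffff, the values 0xf, 0xf0, 0x3ff and 0x1f00 pass, while a mask with a gap such as 0x5 or one wider than 20 bits does not.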
223+
224+ For more information about the Intel RDT/CAT kernel interface, see:
225+ https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
226+
227+ An example for runc:
228+ ```
229+ Consider a two-socket machine with two L3 caches where the default CBM is
230+ 0xfffff and the max CBM length is 20 bits. With this configuration, tasks
231+ inside the container only have access to the "upper" 80% of L3 cache id 0 and
232+ the "lower" 50% of L3 cache id 1:
233+
234+ "linux": {
235+     "intelRdt": {
236+         "l3CacheSchema": "L3:0=ffff0;1=3ff"
237+     }
238+ }
239+ ```
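
The percentages in the example follow from counting set bits: each CBM bit stands for an equal slice of the cache. A small sketch of that arithmetic, where `cbm_share` is a hypothetical helper and the 20-bit root mask is the one assumed in the example:

```python
def cbm_share(cbm, root_cbm=0xfffff):
    """Fraction of the L3 cache covered by `cbm`, relative to `root_cbm`.

    Hypothetical helper for illustration; assumes every CBM bit maps to an
    equally sized portion of the cache.
    """
    return bin(cbm).count("1") / bin(root_cbm).count("1")


share_cache0 = cbm_share(0xffff0)  # 16 of 20 bits set
share_cache1 = cbm_share(0x3ff)    # 10 of 20 bits set
```

Here 0xffff0 has 16 of 20 bits set (80%, with the lowest four bits cleared, hence "upper"), and 0x3ff has the lowest 10 of 20 bits set (50%, hence "lower").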
240+
157 241 ### Security
158 242
159 243 The standard set of Linux capabilities that are set in a container