@@ -338,10 +338,272 @@ Statistics

Zones
=====
+ As we have mentioned, each zone in memory is described by a ``struct zone``,
+ which is an element of the ``node_zones`` array of the node it belongs to.
+ ``struct zone`` is the core data structure of the page allocator. A zone
+ represents a range of physical memory and may have holes.
+
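+ For illustration, the zones of a node are reached through its ``node_zones``
+ array; ``populated_zone()`` (from ``include/linux/mmzone.h``) tests whether a
+ zone has any present pages. The counting function itself is hypothetical:
+
+ .. code-block:: c
+
+    /* Hypothetical: count the populated zones of one node. */
+    static int nr_populated_zones(pg_data_t *pgdat)
+    {
+            int i, nr = 0;
+
+            for (i = 0; i < MAX_NR_ZONES; i++)
+                    if (populated_zone(&pgdat->node_zones[i]))
+                            nr++;
+
+            return nr;
+    }
+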
+ The page allocator uses the GFP flags (see :ref:`mm-api-gfp-flags`) specified
+ by a memory allocation to determine the highest zone in a node from which the
+ memory allocation can allocate memory. The page allocator first allocates
+ memory from that zone. If it cannot allocate the requested amount of memory
+ from that zone, it allocates memory from the next lower zone in the node, and
+ the process continues down to and including the lowest zone. For example, if
+ a node contains ``ZONE_DMA32``, ``ZONE_NORMAL`` and ``ZONE_MOVABLE`` and the
+ highest zone of a memory allocation is ``ZONE_MOVABLE``, the order of the
+ zones from which the page allocator allocates memory is ``ZONE_MOVABLE`` >
+ ``ZONE_NORMAL`` > ``ZONE_DMA32``.
+
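+ As a rough illustration, the mapping from GFP zone bits to the highest usable
+ zone can be sketched as below. This is a simplification: the real kernel
+ computes this with a precomputed lookup table in ``gfp_zone()``, and the
+ helper shown here is hypothetical.
+
+ .. code-block:: c
+
+    /* Hypothetical sketch of gfp_zone()'s effect: pick the highest zone
+     * an allocation with the given flags may use.
+     */
+    static enum zone_type highest_zone_for(gfp_t flags)
+    {
+            if (flags & __GFP_DMA)
+                    return ZONE_DMA;
+            if (flags & __GFP_DMA32)
+                    return ZONE_DMA32;
+            if ((flags & __GFP_HIGHMEM) && (flags & __GFP_MOVABLE))
+                    return ZONE_MOVABLE;
+            if (flags & __GFP_HIGHMEM)
+                    return ZONE_HIGHMEM;
+            return ZONE_NORMAL;
+    }
+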
+ At runtime, free pages in a zone are either in the Per-CPU Pagesets (PCP) or
+ in the free areas of the zone. The Per-CPU Pagesets are a vital mechanism in
+ the kernel's memory management system. By handling the most frequent
+ allocations and frees locally on each CPU, the Per-CPU Pagesets improve
+ performance and scalability, especially on systems with many cores. The page
+ allocator in the kernel employs a two-step strategy for memory allocation,
+ starting with the Per-CPU Pagesets before falling back to the buddy
+ allocator. Pages are transferred between the Per-CPU Pagesets and the global
+ free areas (managed by the buddy allocator) in batches. This minimizes the
+ overhead of frequent interactions with the global buddy allocator.
+
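+ A minimal sketch of the two-step strategy for an order-0 allocation; the
+ helper functions here are hypothetical, standing in for the PCP and buddy
+ paths in ``mm/page_alloc.c``:
+
+ .. code-block:: c
+
+    /* Conceptual order-0 fast path: try the per-CPU list first and only
+     * fall back to the buddy free areas (batched, under zone->lock) on a
+     * miss.
+     */
+    static struct page *alloc_order0(struct zone *zone)
+    {
+            struct page *page;
+
+            page = take_from_pcp(zone);     /* hypothetical, no zone->lock */
+            if (page)
+                    return page;
+
+            refill_pcp_from_buddy(zone);    /* hypothetical, moves a batch */
+            return take_from_pcp(zone);
+    }
+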
+ Architecture specific code calls free_area_init() to initialize the zones.
+
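+ For example, an architecture typically fills in an array with the maximum PFN
+ of each zone and passes it to free_area_init(). This sketch is modeled on
+ common architecture setup code; the exact zone limits are architecture
+ specific:
+
+ .. code-block:: c
+
+    static void __init zone_sizes_init(void)
+    {
+            unsigned long max_zone_pfns[MAX_NR_ZONES] = { 0 };
+
+    #ifdef CONFIG_ZONE_DMA32
+            /* cap ZONE_DMA32 at the 4 GiB boundary */
+            max_zone_pfns[ZONE_DMA32] = min(max_pfn, 1UL << (32 - PAGE_SHIFT));
+    #endif
+            max_zone_pfns[ZONE_NORMAL] = max_pfn;
+
+            free_area_init(max_zone_pfns);
+    }
+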
+ Zone structure
+ --------------
+ The zone structure ``struct zone`` is defined in ``include/linux/mmzone.h``.
+ Here we briefly describe fields of this structure:

- .. admonition:: Stub
+ General
+ ~~~~~~~

-    This section is incomplete. Please list and describe the appropriate fields.
+ ``_watermark``
+   The watermarks for this zone. When the amount of free pages in a zone is
+   below the min watermark, boosting is ignored, an allocation may trigger
+   direct reclaim and direct compaction, and the watermark is also used to
+   throttle direct reclaim. When the amount of free pages in a zone is below
+   the low watermark, kswapd is woken up. When the amount of free pages in a
+   zone is above the high watermark, kswapd stops reclaiming (the zone is
+   balanced) when the ``NUMA_BALANCING_MEMORY_TIERING`` bit of
+   ``sysctl_numa_balancing_mode`` is not set. The promo watermark is used for
+   memory tiering and NUMA balancing. When the amount of free pages in a zone
+   is above the promo watermark, kswapd stops reclaiming when the
+   ``NUMA_BALANCING_MEMORY_TIERING`` bit of ``sysctl_numa_balancing_mode`` is
+   set. The watermarks are set by ``__setup_per_zone_wmarks()``. The min
+   watermark is calculated according to the ``vm.min_free_kbytes`` sysctl. The
+   other three watermarks are set according to the distance between two
+   watermarks. The distance itself is calculated taking the
+   ``vm.watermark_scale_factor`` sysctl into account.
+
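+ The watermarks are read through accessor macros such as
+ ``min_wmark_pages()``, ``low_wmark_pages()`` and ``high_wmark_pages()`` from
+ ``include/linux/mmzone.h``, which fold in any active watermark boost. A
+ sketch of a low-watermark test (the surrounding function is hypothetical):
+
+ .. code-block:: c
+
+    /* Hypothetical: would this zone want kswapd to be woken? */
+    static bool zone_below_low_wmark(struct zone *zone)
+    {
+            unsigned long free = zone_page_state(zone, NR_FREE_PAGES);
+
+            return free < low_wmark_pages(zone);
+    }
+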
+ ``watermark_boost``
+   The number of pages used to boost watermarks, to increase reclaim pressure
+   and reduce the likelihood of future fallbacks, and to wake kswapd now, as
+   the node may be balanced overall and kswapd would not wake naturally.
+
+ ``nr_reserved_highatomic``
+   The number of pages which are reserved for high-order atomic allocations.
+
+ ``nr_free_highatomic``
+   The number of free pages in reserved highatomic pageblocks.
+
+ ``lowmem_reserve``
+   The array of the amounts of memory reserved in this zone for memory
+   allocations. For example, if the highest zone a memory allocation can
+   allocate memory from is ``ZONE_MOVABLE``, the amount of memory reserved in
+   this zone for this allocation is ``lowmem_reserve[ZONE_MOVABLE]`` when
+   attempting to allocate memory from this zone. This is a mechanism the page
+   allocator uses to prevent allocations which could use ``highmem`` from
+   using too much ``lowmem``. For some specialised workloads on ``highmem``
+   machines, it is dangerous for the kernel to allow process memory to be
+   allocated from the ``lowmem`` zone. This is because that memory could then
+   be pinned via the ``mlock()`` system call, or by unavailability of
+   swapspace. The ``vm.lowmem_reserve_ratio`` sysctl determines how aggressive
+   the kernel is in defending these lower zones. This array is recalculated by
+   ``setup_per_zone_lowmem_reserve()`` at runtime if the
+   ``vm.lowmem_reserve_ratio`` sysctl changes.
+
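+ Conceptually, the reservation is applied on top of the watermark during the
+ allocation-time check; a simplified sketch of the test performed by the
+ kernel's ``__zone_watermark_ok()``:
+
+ .. code-block:: c
+
+    /* Simplified: the zone passes only if it can keep both the watermark
+     * and the reservation free against allocations whose highest usable
+     * zone is highest_zoneidx.
+     */
+    static bool zone_watermark_ok_sketch(struct zone *z,
+                                         unsigned long free_pages,
+                                         unsigned long mark,
+                                         enum zone_type highest_zoneidx)
+    {
+            return free_pages > mark + z->lowmem_reserve[highest_zoneidx];
+    }
+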
+ ``node``
+   The index of the node this zone belongs to. Available only when
+   ``CONFIG_NUMA`` is enabled, because there is only one node in a UMA system.
+
+ ``zone_pgdat``
+   Pointer to the ``struct pglist_data`` of the node this zone belongs to.
+
+ ``per_cpu_pageset``
+   Pointer to the Per-CPU Pagesets (PCP) allocated and initialized by
+   ``setup_zone_pageset()``. By handling the most frequent allocations and
+   frees locally on each CPU, the PCP improves performance and scalability on
+   systems with many cores.
+
+ ``pageset_high_min``
+   Copied to the ``high_min`` of the Per-CPU Pagesets for faster access.
+
+ ``pageset_high_max``
+   Copied to the ``high_max`` of the Per-CPU Pagesets for faster access.
+
+ ``pageset_batch``
+   Copied to the ``batch`` of the Per-CPU Pagesets for faster access. The
+   ``batch``, ``high_min`` and ``high_max`` of the Per-CPU Pagesets are used
+   to calculate the number of elements the Per-CPU Pagesets obtain from the
+   buddy allocator under a single hold of the lock, for efficiency. They are
+   also used to decide whether the Per-CPU Pagesets return pages to the buddy
+   allocator in the page free process.
+
+ ``pageblock_flags``
+   The pointer to the flags for the pageblocks in the zone (see
+   ``include/linux/pageblock-flags.h`` for the list of flags). The memory is
+   allocated in ``setup_usemap()``. Each pageblock occupies
+   ``NR_PAGEBLOCK_BITS`` bits. Defined only when ``CONFIG_FLATMEM`` is
+   enabled; the flags are stored in ``mem_section`` when ``CONFIG_SPARSEMEM``
+   is enabled.
+
+ ``zone_start_pfn``
+   The start pfn of the zone. It is initialized by
+   ``calculate_node_totalpages()``.
+
+ ``managed_pages``
+   The present pages managed by the buddy system, which is calculated as:
+   ``managed_pages`` = ``present_pages`` - ``reserved_pages``, where
+   ``reserved_pages`` includes pages allocated by the memblock allocator. It
+   should be used by the page allocator and vm scanner to calculate all kinds
+   of watermarks and thresholds. It is accessed using ``atomic_long_xxx()``
+   functions. It is initialized in ``free_area_init_core()`` and then is
+   reinitialized when the memblock allocator frees pages into the buddy
+   system.
+
+ ``spanned_pages``
+   The total pages spanned by the zone, including holes, which is calculated
+   as: ``spanned_pages`` = ``zone_end_pfn`` - ``zone_start_pfn``. It is
+   initialized by ``calculate_node_totalpages()``.
+
+ ``present_pages``
+   The physical pages existing within the zone, which is calculated as:
+   ``present_pages`` = ``spanned_pages`` - ``absent_pages`` (pages in holes).
+   It may be used by memory hotplug or memory power management logic to figure
+   out unmanaged pages by checking (``present_pages`` - ``managed_pages``).
+   Write access to ``present_pages`` at runtime should be protected by
+   ``mem_hotplug_begin/done()``. Any reader who can't tolerate drift of
+   ``present_pages`` should use ``get_online_mems()`` to get a stable value.
+   It is initialized by ``calculate_node_totalpages()``.
+
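+ The three page counts therefore always satisfy ``spanned_pages`` >=
+ ``present_pages`` >= ``managed_pages``. ``managed_pages`` is read through its
+ accessor; for example (the surrounding function is hypothetical):
+
+ .. code-block:: c
+
+    /* Hypothetical: pages of a zone the buddy allocator can hand out. */
+    static unsigned long zone_buddy_pages(struct zone *zone)
+    {
+            /* zone_managed_pages() wraps atomic_long_read() */
+            return zone_managed_pages(zone);
+    }
+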
+ ``present_early_pages``
+   The present pages existing within the zone located on memory available
+   since early boot, excluding hotplugged memory. Defined only when
+   ``CONFIG_MEMORY_HOTPLUG`` is enabled and initialized by
+   ``calculate_node_totalpages()``.
+
+ ``cma_pages``
+   The pages reserved for CMA use. These pages behave like ``ZONE_MOVABLE``
+   pages when they are not used for CMA. Defined only when ``CONFIG_CMA`` is
+   enabled.
+
+ ``name``
+   The name of the zone. It is a pointer to the corresponding element of the
+   ``zone_names`` array.
+
+ ``nr_isolate_pageblock``
+   The number of isolated pageblocks. It is used to solve an incorrect
+   freepage counting problem due to racy retrieval of the migratetype of a
+   pageblock. Protected by ``zone->lock``. Defined only when
+   ``CONFIG_MEMORY_ISOLATION`` is enabled.
+
+ ``span_seqlock``
+   The seqlock to protect ``zone_start_pfn`` and ``spanned_pages``. It is a
+   seqlock because it has to be read outside of ``zone->lock``, and this is
+   done in the main allocator path. However, the seqlock is written quite
+   infrequently. Defined only when ``CONFIG_MEMORY_HOTPLUG`` is enabled.
+
+ ``initialized``
+   The flag indicating if the zone is initialized. Set by
+   ``init_currently_empty_zone()`` during boot.
+
+ ``free_area``
+   The array of free areas, where each element corresponds to a specific
+   order, which is a power of two. The buddy allocator uses this structure to
+   manage free memory efficiently. When allocating, it tries to find the
+   smallest sufficient block. If the smallest sufficient block is larger than
+   the requested size, it will be recursively split into the next smaller
+   blocks until the required size is reached. When a page is freed, it may be
+   merged with its buddy to form a larger block. It is initialized by
+   ``zone_init_free_lists()``.
+
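+ The buddy of an order-``n`` block is found by flipping bit ``n`` of the
+ block's PFN; this mirrors the computation done by the kernel's
+ ``__find_buddy_pfn()`` (the helper name here is hypothetical):
+
+ .. code-block:: c
+
+    /* PFN of the buddy of the order-sized block starting at pfn; for
+     * example, at order 2 the blocks at PFNs 0 and 4 are buddies.
+     */
+    static unsigned long buddy_pfn_of(unsigned long pfn, unsigned int order)
+    {
+            return pfn ^ (1UL << order);
+    }
+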
+ ``unaccepted_pages``
+   The list of pages to be accepted. All pages on the list are of order
+   ``MAX_PAGE_ORDER``. Defined only when ``CONFIG_UNACCEPTED_MEMORY`` is
+   enabled.
+
+ ``flags``
+   The zone flags. The least significant three bits are used and defined by
+   ``enum zone_flags``. ``ZONE_BOOSTED_WATERMARK`` (bit 0): the zone recently
+   boosted watermarks; cleared when kswapd is woken. ``ZONE_RECLAIM_ACTIVE``
+   (bit 1): kswapd may be scanning the zone. ``ZONE_BELOW_HIGH`` (bit 2): the
+   zone is below the high watermark.
+
+ ``lock``
+   The main lock that protects the internal data structures of the page
+   allocator specific to the zone, especially ``free_area``.
+
+ ``percpu_drift_mark``
+   When free pages are below this point, additional steps are taken when
+   reading the number of free pages, to avoid per-cpu counter drift allowing
+   watermarks to be breached. It is updated in
+   ``refresh_zone_stat_thresholds()``.
+
+ Compaction control
+ ~~~~~~~~~~~~~~~~~~
+
+ ``compact_cached_free_pfn``
+   The PFN where the compaction free scanner should start in the next scan.
+
+ ``compact_cached_migrate_pfn``
+   The PFNs where the compaction migration scanner should start in the next
+   scan. This array has two elements: the first one is used in
+   ``MIGRATE_ASYNC`` mode, and the other one is used in ``MIGRATE_SYNC`` mode.
+
+ ``compact_init_migrate_pfn``
+   The initial migration PFN, which is initialized to 0 at boot time, and to
+   the first pageblock with migratable pages in the zone after a full
+   compaction finishes. It is used to check if a scan is a whole zone scan or
+   not.
+
+ ``compact_init_free_pfn``
+   The initial free PFN, which is initialized to 0 at boot time and to the
+   last pageblock with free ``MIGRATE_MOVABLE`` pages in the zone. It is used
+   to check if it is the start of a scan.
+
+ ``compact_considered``
+   The number of compactions attempted since the last failure. It is reset in
+   ``defer_compaction()`` when a compaction fails to result in a page
+   allocation success, and it is increased by 1 in ``compaction_deferred()``
+   when a compaction should be skipped. ``compaction_deferred()`` is called
+   before ``compact_zone()`` is called; ``compaction_defer_reset()`` is called
+   when ``compact_zone()`` returns ``COMPACT_SUCCESS``; ``defer_compaction()``
+   is called when ``compact_zone()`` returns ``COMPACT_PARTIAL_SKIPPED`` or
+   ``COMPACT_COMPLETE``.
+
+ ``compact_defer_shift``
+   The number of compactions skipped before trying again is
+   ``1 << compact_defer_shift``. It is increased by 1 in
+   ``defer_compaction()`` and reset in ``compaction_defer_reset()`` when a
+   direct compaction results in a page allocation success. Its maximum value
+   is ``COMPACT_MAX_DEFER_SHIFT``.
+
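+ Together, ``compact_considered`` and ``compact_defer_shift`` implement an
+ exponential backoff between compaction attempts. A simplified sketch of the
+ deferral test (the real logic lives in ``compaction_deferred()`` in
+ ``mm/compaction.c``):
+
+ .. code-block:: c
+
+    /* Simplified: skip compaction while inside the backoff window. */
+    static bool compaction_deferred_sketch(struct zone *zone, int order)
+    {
+            unsigned long limit = 1UL << zone->compact_defer_shift;
+
+            if (order < zone->compact_order_failed)
+                    return false;   /* lower orders are not deferred */
+
+            if (++zone->compact_considered >= limit)
+                    return false;   /* window exhausted: try again */
+
+            return true;
+    }
+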
+ ``compact_order_failed``
+   The minimum compaction failed order. It is set in
+   ``compaction_defer_reset()`` when a compaction succeeds and in
+   ``defer_compaction()`` when a compaction fails to result in a page
+   allocation success.
+
+ ``compact_blockskip_flush``
+   Set to true when the compaction migration scanner and the free scanner
+   meet, which means the ``PB_migrate_skip`` bits should be cleared.
+
+ ``contiguous``
+   Set to true when the zone is contiguous (in other words, has no holes).
+
+ Statistics
+ ~~~~~~~~~~
+
+ ``vm_stat``
+   VM statistics for the zone. The items tracked are defined by
+   ``enum zone_stat_item``.
+
+ ``vm_numa_event``
+   VM NUMA event statistics for the zone. The items tracked are defined by
+   ``enum numa_stat_item``.
+
+ ``per_cpu_zonestats``
+   Per-CPU VM statistics for the zone. It records VM statistics and VM NUMA
+   event statistics on a per-CPU basis. It reduces updates to the global
+   ``vm_stat`` and ``vm_numa_event`` fields of the zone to improve
+   performance.
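+
+ A zone statistic is read with ``zone_page_state()``; per-CPU deltas
+ accumulated in ``per_cpu_zonestats`` are periodically folded into the global
+ counters. For example (the surrounding function is hypothetical):
+
+ .. code-block:: c
+
+    /* Hypothetical: report the number of free pages in a zone. */
+    static unsigned long zone_free_pages(struct zone *zone)
+    {
+            return zone_page_state(zone, NR_FREE_PAGES);
+    }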

.. _pages: