Skip to content

Conversation

@miaoqing-quic
Copy link
Contributor

No description provided.

chenhuacai and others added 30 commits November 10, 2025 08:37
Now we use virtual addresses to fill CSR_MERRENTRY/CSR_TLBRENTRY, but
hardware hope physical addresses. Now it works well because the high
bits are ignored above PA_BITS (48 bits), but explicitly use physical
addresses can avoid potential bugs. So fix it.

Cc: [email protected]
Signed-off-by: Huacai Chen <[email protected]>
1. Use phys_addr_t instead of u64, which can work for both 32/64 bits.
2. Check whether the input physical address is above TO_PHYS_MASK (and
   return NULL if yes) for the DMW version.

Note: In theory early_ioremap() also need the TO_PHYS_MASK checking, but
the UEFI BIOS pass some DMW virtual addresses.

Cc: [email protected]
Signed-off-by: Jiaxun Yang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
Now there 5 places which calculate max_pfn & max_low_pfn:
1. in fdt_setup() for FDT systems;
2. in memblock_init() for ACPI systems;
3. in init_numa_memory() for NUMA systems;
4. in arch_mem_init() to recalculate for "mem=" cmdline;
5. in paging_init() to recalculate for NUMA systems.

Since memblock_init() is called both for ACPI and FDT systems, move the
calculation out of the for_each_efi_memory_desc() loop can eliminate the
first case. The last case is very questionable (may be derived from the
MIPS/Loongson code) and breaks the "mem=" cmdline, so should be removed.
And then the NUMA version of paging_init() can be also eliminated.

After consolidation there are 3 places of calculation:
1. in memblock_init() for both ACPI and FDT systems;
2. in init_numa_memory() to recalculate for NUMA systems;
3. in arch_mem_init() to recalculate for the "mem=" cmdline.

For all cases the calculation is:
max_pfn = PFN_DOWN(memblock_end_of_DRAM());
max_low_pfn = min(PFN_DOWN(HIGHMEM_START), max_pfn);

Cc: [email protected]
Signed-off-by: Huacai Chen <[email protected]>
Now if the PTE/PMD is dirty with _PAGE_DIRTY but without _PAGE_MODIFIED,
after {pte,pmd}_modify() we lose _PAGE_DIRTY, then {pte,pmd}_dirty()
return false and lead to data loss. This can happen in certain scenarios
such as HW PTW doesn't set _PAGE_MODIFIED automatically, so here we need
_PAGE_MODIFIED to record the dirty status (_PAGE_DIRTY).

The new modification involves checking whether the original PTE/PMD has
the _PAGE_DIRTY flag. If it exists, the _PAGE_MODIFIED bit is also set,
ensuring that the {pte,pmd}_dirty() interface can always return accurate
information.

Cc: [email protected]
Co-developed-by: Liupu Wang <[email protected]>
Signed-off-by: Liupu Wang <[email protected]>
Signed-off-by: Tianyang Zhang <[email protected]>
Remove the unnecessary __GFP_HIGHMEM masking in pud_alloc_one(), which
was introduced with commit 3827397 ("loongarch: convert various
functions to use ptdescs"). GFP_KERNEL doesn't contain __GFP_HIGHMEM.

Signed-off-by: Vishal Moola (Oracle) <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
(1) Use the existing CPUCFG6_PMNUM_SHIFT macro definition instead of
the magic value 4 to get the PMU number.

(2) Detect the value of PMU bits via CPUCFG instruction according to
the ISA manual instead of hard-coded as 64, because the value may be
different for various micro-architectures.

Link: https://loongson.github.io/LoongArch-Documentation/LoongArch-Vol1-EN.html#_cpucfg
Signed-off-by: Tiezhu Yang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
CSR.FWPC and CSR.MWPC are 32bit registers, so use csr_read32() rather
than csr_read64() to read the values of FWPC/MWPC.

Cc: [email protected]
Fixes: edffa33 ("LoongArch: Add hardware breakpoints/watchpoints support")
Signed-off-by: Huacai Chen <[email protected]>
The kexec_buf structure was previously declared without initialization.
commit bf454ec ("kexec_file: allow to place kexec_buf randomly")
added a field that is always read but not consistently populated by all
architectures. This un-initialized field will contain garbage.

This is also triggering a UBSAN warning when the uninitialized data is
accessed:

        ------------[ cut here ]------------
        UBSAN: invalid-load in ./include/linux/kexec.h:210:10
        load of value 252 is not a valid value for type '_Bool'

Zero-initializing kexec_buf at declaration ensures all fields are
cleanly set, preventing future instances of uninitialized memory being
used.

Fixes: bf454ec ("kexec_file: allow to place kexec_buf randomly")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Youling Tang <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
When specifying '-d' for kexec_file_load interface, loaded locations of
kernel/initrd/cmdline etc can be printed out to help debug.

Commit eb7622d ("kexec_file, riscv: print out debugging message if
required") fixes the same issue on RISC-V.

So, remove kexec_image_info() because the content has been printed out
in generic code.

And on Loongson-3A5000, the printed messages look like below:

 kexec_file: kernel: 00000000d9aad283 kernel_size: 0x2e77f30
 kexec_file(EFI): No LoongArch PE image header.
 kexec_file: Loaded initrd at 0x80000000 bufsz=0x1637cd0 memsz=0x1638000
 kexec_file(ELF): Loaded kernel at 0x9c20000 bufsz=0x27f1800 memsz=0x2950000
 kexec_file: nr_segments = 2
 kexec_file: segment[0]: buf=0x00000000cc3e6c33 bufsz=0x27f1800 mem=0x9c20000 memsz=0x2950000
 kexec_file: segment[1]: buf=0x00000000bb75a541 bufsz=0x1637cd0 mem=0x80000000 memsz=0x1638000
 kexec_file: kexec_file_load: type:0, start:0xb15d000 head:0x18db60002 flags:0x8

Signed-off-by: Qiang Ma <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
With secondary MMU page table, if there is a read page fault, the page's
write attribute will not set even if it is writable from master MMU page
table. This logic only works if dirty tracking is enabled, so page table
should be set with _PAGE_WRITE if dirty tracking is disabled.

It reduces extra page fault on secondary MMU page table if a VM finishes
migration, when the master MMU page table is ready and the secondary MMU
page is fresh.

Signed-off-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
When timer is fired in oneshot mode, CSR.TVAL will stop with value -1
rather than 0. However when the register CSR.TVAL is restored, it will
continue to count down rather than stop there.

Now the method is to write 0 to CSR.TVAL, wait to count down for 1 cycle
at least, which is 10ns with a timer freq 100MHz, and then retore timer
interrupt status. Here add 2 cycles delay to assure that timer interrupt
is injected.

With this patch, timer selftest case passes to run always.

Cc: [email protected]
Signed-off-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
On LoongArch system, guest PMU hardware is shared by guest and host but
PMU interrupt is separated. PMU is pass-through to VM, and there is PMU
context switch when exit to host and return to guest.

There is optimiation to check whether PMU is enabled by guest. If not,
it is not necessary to return to guest. However, if it is enabled, PMU
context for guest need switch on. Now KVM_REQ_PMU notification is set
on vCPU context switch, but it is missing if there is no vCPU context
switch while PMU is used by guest VM, so fix it.

Cc: <[email protected]>
Fixes: f4e40ea ("LoongArch: KVM: Add PMU support for guest")
Signed-off-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
PMU hardware about VM is switched on VM exit to host rather than vCPU
context sched off, PMU is checked and restored on return to VM. It is
not necessary to check PMU on vCPU context sched on callback, since the
request is made on the VM exit entry or VM PMU CSR access abort routine
already.

Signed-off-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
VM fails to boot with 256 vCPUs, the detailed command is

  qemu-system-loongarch64 -smp 256

and there is an error reported as follows:

  KVM_LOONGARCH_EXTIOI_INIT_NUM_CPU failed: Invalid argument

There is typo issue in function kvm_eiointc_ctrl_access() when set
max supported vCPUs.

Cc: [email protected]
Fixes: 47256c4 ("LoongArch: KVM: Avoid copy_*_user() with lock hold in kvm_eiointc_ctrl_access()")
Signed-off-by: Bibo Mao <[email protected]>
Signed-off-by: Huacai Chen <[email protected]>
Page cache folios from a file system that support large block size (LBS)
can have minimal folio order greater than 0, thus a high order folio might
not be able to be split down to order-0.  Commit e220917 ("mm: split
a folio in minimum folio order chunks") bumps the target order of
split_huge_page*() to the minimum allowed order when splitting a LBS
folio.  This causes confusion for some split_huge_page*() callers like
memory failure handling code, since they expect after-split folios all
have order-0 when split succeeds but in reality get min_order_for_split()
order folios and give warnings.

Fix it by failing a split if the folio cannot be split to the target
order.  Rename try_folio_split() to try_folio_split_to_order() to reflect
the added new_order parameter.  Remove its unused list parameter.

[The test poisons LBS folios, which cannot be split to order-0 folios, and
also tries to poison all memory.  The non split LBS folios take more
memory than the test anticipated, leading to OOM.  The patch fixed the
kernel warning and the test needs some change to avoid OOM.]

Link: https://lkml.kernel.org/r/[email protected]
Fixes: e220917 ("mm: split a folio in minimum folio order chunks")
Signed-off-by: Zi Yan <[email protected]>
Reported-by: [email protected]
Closes: https://lore.kernel.org/all/[email protected]/
Reviewed-by: Luis Chamberlain <[email protected]>
Reviewed-by: Pankaj Raghav <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Mariano Pache <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Patch series "KHO: kfence + KHO memory corruption fix", v3.

This series fixes a memory corruption bug in KHO that occurs when KFENCE
is enabled.

The root cause is that KHO metadata, allocated via kzalloc(), can be
randomly serviced by kfence_alloc().  When a kernel boots via KHO, the
early memblock allocator is restricted to a "scratch area".  This forces
the KFENCE pool to be allocated within this scratch area, creating a
conflict.  If KHO metadata is subsequently placed in this pool, it gets
corrupted during the next kexec operation.

Google is using KHO and have had obscure crashes due to this memory
corruption, with stacks all over the place.  I would prefer this fix to be
properly backported to stable so we can also automatically consume it once
we switch to the upstream KHO.

Patch 1/3 introduces a debug-only feature (CONFIG_KEXEC_HANDOVER_DEBUG)
that adds checks to detect and fail any operation that attempts to place
KHO metadata or preserved memory within the scratch area.  This serves as
a validation and diagnostic tool to confirm the problem without affecting
production builds.

Patch 2/3 Increases bitmap to PAGE_SIZE, so buddy allocator can be used.

Patch 3/3 Provides the fix by modifying KHO to allocate its metadata
directly from the buddy allocator instead of slab.  This bypasses the
KFENCE interception entirely.


This patch (of 3):

It is invalid for KHO metadata or preserved memory regions to be located
within the KHO scratch area, as this area is overwritten when the next
kernel is loaded, and used early in boot by the next kernel.  This can
lead to memory corruption.

Add checks to kho_preserve_* and KHO's internal metadata allocators
(xa_load_or_alloc, new_chunk) to verify that the physical address of the
memory does not overlap with any defined scratch region.  If an overlap is
detected, the operation will fail and a WARN_ON is triggered.  To avoid
performance overhead in production kernels, these checks are enabled only
when CONFIG_KEXEC_HANDOVER_DEBUG is selected.

[[email protected]: fix KEXEC_HANDOVER_DEBUG Kconfig dependency]
  Link: https://lkml.kernel.org/r/[email protected]
[[email protected]: build fix]
  Link: https://lkml.kernel.org/r/CA+CK2bBnorfsTymKtv4rKvqGBHs=y=MjEMMRg_tE-RME6n-zUw@mail.gmail.com
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: fc33e4b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <[email protected]>
Signed-off-by: Mike Rapoport <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Pratyush Yadav <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Samiullah Khawaja <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
KHO memory preservation metadata is preserved in 512 byte chunks which
requires their allocation from slab allocator.  Slabs are not safe to be
used with KHO because of kfence, and because partial slabs may lead leaks
to the next kernel.  Change the size to be PAGE_SIZE.

The kfence specifically may cause memory corruption, where it randomly
provides slab objects that can be within the scratch area.  The reason for
that is that kfence allocates its objects prior to KHO scratch is marked
as CMA region.

While this change could potentially increase metadata overhead on systems
with sparsely preserved memory, this is being mitigated by ongoing work to
reduce sparseness during preservation via 1G guest pages.  Furthermore,
this change aligns with future work on a stateless KHO, which will also
use page-sized bitmaps for its radix tree metadata.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: fc33e4b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: Pratyush Yadav <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Samiullah Khawaja <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
KHO allocates metadata for its preserved memory map using the slab
allocator via kzalloc().  This metadata is temporary and is used by the
next kernel during early boot to find preserved memory.

A problem arises when KFENCE is enabled.  kzalloc() calls can be randomly
intercepted by kfence_alloc(), which services the allocation from a
dedicated KFENCE memory pool.  This pool is allocated early in boot via
memblock.

When booting via KHO, the memblock allocator is restricted to a "scratch
area", forcing the KFENCE pool to be allocated within it.  This creates a
conflict, as the scratch area is expected to be ephemeral and
overwriteable by a subsequent kexec.  If KHO metadata is placed in this
KFENCE pool, it leads to memory corruption when the next kernel is loaded.

To fix this, modify KHO to allocate its metadata directly from the buddy
allocator instead of slab.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: fc33e4b ("kexec: enable KHO support for memory preservation")
Signed-off-by: Pasha Tatashin <[email protected]>
Reviewed-by: Pratyush Yadav <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: David Matlack <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Jason Gunthorpe <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Masahiro Yamada <[email protected]>
Cc: Miguel Ojeda <[email protected]>
Cc: Randy Dunlap <[email protected]>
Cc: Samiullah Khawaja <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The order check and fallback loop is updating the index value on every
loop.  This will cause the index to be wrongly aligned by a larger value
while the loop shrinks the order.

This may result in inserting and returning a folio of the wrong index and
cause data corruption with some userspace workloads [1].

[[email protected]: introduce a temporary variable to improve code]
  Link: https://lkml.kernel.org/r/[email protected]
  Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/ [1]
Fixes: e7a2ab7 ("mm: shmem: add mTHP support for anonymous shmem")
Closes: https://lore.kernel.org/linux-mm/CAMgjq7DqgAmj25nDUwwu1U2cSGSn8n4-Hqpgottedy0S6YYeUw@mail.gmail.com/
Signed-off-by: Kairui Song <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Zi Yan <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: Barry Song <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
If no stack depot is allocated yet, due to masking out __GFP_RECLAIM flags
kmsan called from kmalloc cannot allocate stack depot.  kmsan fails to
record origin and report issues.  This may result in KMSAN failing to
report issues.

Reusing flags from kmalloc without modifying them should be safe for kmsan.
For example, such chain of calls is possible:
test_uninit_kmalloc -> kmalloc -> __kmalloc_cache_noprof ->
slab_alloc_node -> slab_post_alloc_hook ->
kmsan_slab_alloc -> kmsan_internal_poison_memory.

Only when it is called in a context without flags present should
__GFP_RECLAIM flags be masked.

With this change all kmsan tests start working reliably.

Eric reported:

: Yes, KMSAN seems to be at least partially broken currently.  Besides the
: fact that the kmsan KUnit test is currently failing (which I reported at
: https://lore.kernel.org/r/20250911175145.GA1376@sol), I've confirmed that
: the poly1305 KUnit test causes a KMSAN warning with Aleksei's patch
: applied but does not cause a warning without it.  The warning did get
: reached via syzbot somehow
: (https://lore.kernel.org/r/751b3d80293a6f599bb07770afcef24f623c7da0.1761026343.git.xiaopei01@kylinos.cn/),
: so KMSAN must still work in some cases.  But it didn't work for me.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/20251022030213.GA35717@sol
Fixes: 97769a5 ("mm, bpf: Introduce try_alloc_pages() for opportunistic page allocation")
Signed-off-by: Aleksei Nikiforov <[email protected]>
Reviewed-by: Alexander Potapenko <[email protected]>
Tested-by: Eric Biggers <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Dmitriy Vyukov <[email protected]>
Cc: Ilya Leoshkevich <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
…_item

Currently, scan_get_next_rmap_item() walks every page address in a VMA to
locate mergeable pages.  This becomes highly inefficient when scanning
large virtual memory areas that contain mostly unmapped regions, causing
ksmd to use large amount of cpu without deduplicating much pages.

This patch replaces the per-address lookup with a range walk using
walk_page_range().  The range walker allows KSM to skip over entire
unmapped holes in a VMA, avoiding unnecessary lookups.  This problem was
previously discussed in [1].

Consider the following test program which creates a 32 TiB mapping in the
virtual address space but only populates a single page:

#include <unistd.h>
#include <stdio.h>
#include <sys/mman.h>

/* 32 TiB */
const size_t size = 32ul * 1024 * 1024 * 1024 * 1024;

int main() {
        char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_NORESERVE | MAP_PRIVATE | MAP_ANON, -1, 0);

        if (area == MAP_FAILED) {
                perror("mmap() failed\n");
                return -1;
        }

        /* Populate a single page such that we get an anon_vma. */
        *area = 0;

        /* Enable KSM. */
        madvise(area, size, MADV_MERGEABLE);
        pause();
        return 0;
}

$ ./ksm-sparse  &
$ echo 1 > /sys/kernel/mm/ksm/run 

Without this patch ksmd uses 100% of the cpu for a long time (more then 1
hour in my test machine) scanning all the 32 TiB virtual address space
that contain only one mapped page.  This makes ksmd essentially deadlocked
not able to deduplicate anything of value.  With this patch ksmd walks
only the one mapped page and skips the rest of the 32 TiB virtual address
space, making the scan fast using little cpu.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lore.kernel.org/linux-mm/[email protected]/ [1]
Fixes: 31dbd01 ("ksm: Kernel SamePage Merging")
Signed-off-by: Pedro Demarchi Gomes <[email protected]>
Co-developed-by: David Hildenbrand <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: craftfever <[email protected]>
Closes: https://lkml.kernel.org/r/[email protected]
Suggested-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Chengming Zhou <[email protected]>
Cc: xu xin <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
…order

folio split clears PG_has_hwpoisoned, but the flag should be preserved in
after-split folios containing pages with PG_hwpoisoned flag if the folio
is split to >0 order folios.  Scan all pages in a to-be-split folio to
determine which after-split folios need the flag.

An alternatives is to change PG_has_hwpoisoned to PG_maybe_hwpoisoned to
avoid the scan and set it on all after-split folios, but resulting false
positive has undesirable negative impact.  To remove false positive,
caller of folio_test_has_hwpoisoned() and folio_contain_hwpoisoned_page()
needs to do the scan.  That might be causing a hassle for current and
future callers and more costly than doing the scan in the split code. 
More details are discussed in [1].

This issue can be exposed via:
1. splitting a has_hwpoisoned folio to >0 order from debugfs interface;
2. truncating part of a has_hwpoisoned folio in
   truncate_inode_partial_folio().

And later accesses to a hwpoisoned page could be possible due to the
missing has_hwpoisoned folio flag.  This will lead to MCE errors.

Link: https://lore.kernel.org/all/CAHbLzkoOZm0PXxE9qwtF4gKR=cpRXrSrJ9V9Pm2DJexs985q4g@mail.gmail.com/ [1]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: c010d47 ("mm: thp: split huge page to any lower order pages")
Signed-off-by: Zi Yan <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Lance Yang <[email protected]>
Reviewed-by: Miaohe Lin <[email protected]>
Reviewed-by: Baolin Wang <[email protected]>
Reviewed-by: Wei Yang <[email protected]>
Cc: Pankaj Raghav <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Jane Chu <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Luis Chamberalin <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Pde is erased from subdir rbtree through rb_erase(), but not set the node
to EMPTY, which may result in uaf access.  We should use RB_CLEAR_NODE()
set the erased node to EMPTY, then pde_subdir_next() will return NULL to
avoid uaf access.

We found an uaf issue while using stress-ng testing, need to run testcase
getdent and tun in the same time.  The steps of the issue is as follows:

1) use getdent to traverse dir /proc/pid/net/dev_snmp6/, and current
   pde is tun3;

2) in the [time windows] unregister netdevice tun3 and tun2, and erase
   them from rbtree.  erase tun3 first, and then erase tun2.  the
   pde(tun2) will be released to slab;

3) continue to getdent process, then pde_subdir_next() will return
   pde(tun2) which is released, it will case uaf access.

CPU 0                                      |    CPU 1
-------------------------------------------------------------------------
traverse dir /proc/pid/net/dev_snmp6/      |   unregister_netdevice(tun->dev)   //tun3 tun2
sys_getdents64()                           |
  iterate_dir()                            |
    proc_readdir()                         |
      proc_readdir_de()                    |     snmp6_unregister_dev()
        pde_get(de);                       |       proc_remove()
        read_unlock(&proc_subdir_lock);    |         remove_proc_subtree()
                                           |           write_lock(&proc_subdir_lock);
        [time window]                      |           rb_erase(&root->subdir_node, &parent->subdir);
                                           |           write_unlock(&proc_subdir_lock);
        read_lock(&proc_subdir_lock);      |
        next = pde_subdir_next(de);        |
        pde_put(de);                       |
        de = next;    //UAF                |

rbtree of dev_snmp6
                        |
                    pde(tun3)
                     /    \
                  NULL  pde(tun2)

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Wei Yang <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: wangzijie <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Patch series "Fix SIGBUS semantics with large folios", v3.

Accessing memory within a VMA, but beyond i_size rounded up to the next
page size, is supposed to generate SIGBUS.

Darrick reported[1] an xfstests regression in v6.18-rc1.  generic/749
failed due to missing SIGBUS.  This was caused by my recent changes that
try to fault in the whole folio where possible:

        19773df ("mm/fault: try to map the entire file folio in finish_fault()")
        357b927 ("mm/filemap: map entire large folio faultaround")

These changes did not consider i_size when setting up PTEs, leading to
xfstest breakage.

However, the problem has been present in the kernel for a long time -
since huge tmpfs was introduced in 2016.  The kernel happily maps
PMD-sized folios as PMD without checking i_size.  And huge=always tmpfs
allocates PMD-size folios on any writes.

I considered this corner case when I implemented a large tmpfs, and my
conclusion was that no one in their right mind should rely on receiving a
SIGBUS signal when accessing beyond i_size.  I cannot imagine how it could
be useful for the workload.

But apparently filesystem folks care a lot about preserving strict SIGBUS
semantics.

Generic/749 was introduced last year with reference to POSIX, but no real
workloads were mentioned.  It also acknowledged the tmpfs deviation from
the test case.

POSIX indeed says[3]:

        References within the address range starting at pa and
        continuing for len bytes to whole pages following the end of an
        object shall result in delivery of a SIGBUS signal.

The patchset fixes the regression introduced by recent changes as well as
more subtle SIGBUS breakage due to split failure on truncation.


This patch (of 2):

Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
supposed to generate SIGBUS.

Recent changes attempted to fault in full folio where possible.  They did
not respect i_size, which led to populating PTEs beyond i_size and
breaking SIGBUS semantics.

Darrick reported generic/749 breakage because of this.

However, the problem existed before the recent changes.  With huge=always
tmpfs, any write to a file leads to PMD-size allocation.  Following the
fault-in of the folio will install PMD mapping regardless of i_size.

Fix filemap_map_pages() and finish_fault() to not install:
  - PTEs beyond i_size;
  - PMD mappings across i_size;

Make an exception for shmem/tmpfs that for long time intentionally
mapped with PMDs across i_size.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kiryl Shutsemau <[email protected]>
Fixes: 6795801 ("xfs: Support large folios")
Reported-by: "Darrick J. Wong" <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Accesses within VMA, but beyond i_size rounded up to PAGE_SIZE are
supposed to generate SIGBUS.

This behavior might not be respected on truncation.

During truncation, the kernel splits a large folio in order to reclaim
memory.  As a side effect, it unmaps the folio and destroys PMD mappings
of the folio.  The folio will be refaulted as PTEs and SIGBUS semantics
are preserved.

However, if the split fails, PMD mappings are preserved and the user will
not receive SIGBUS on any accesses within the PMD.

Unmap the folio on split failure.  It will lead to refault as PTEs and
preserve SIGBUS semantics.

Make an exception for shmem/tmpfs that for long time intentionally mapped
with PMDs across i_size.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b9a8a41 ("truncate,shmem: Handle truncates that split large folios")
Signed-off-by: Kiryl Shutsemau <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
When emitting the order of the allocation for a hash table,
alloc_large_system_hash() unconditionally subtracts PAGE_SHIFT from log
base 2 of the allocation size.  This is not correct if the allocation size
is smaller than a page, and yields a negative value for the order as seen
below:

TCP established hash table entries: 32 (order: -4, 256 bytes, linear) TCP
bind hash table entries: 32 (order: -2, 1024 bytes, linear)

Use get_order() to compute the order when emitting the hash table
information to correctly handle cases where the allocation size is smaller
than a page:

TCP established hash table entries: 32 (order: 0, 256 bytes, linear) TCP
bind hash table entries: 32 (order: 0, 1024 bytes, linear)

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 1da177e ("Linux-2.6.12-rc2")
Signed-off-by: Isaac J. Manjarres <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Using gcov on kernels compiled with GCC 15 results in truncated 16-byte
long .gcda files with no usable data.  To fix this, update GCOV_COUNTERS
to match the value defined by GCC 15.

Tested with GCC 14.3.0 and GCC 15.2.0.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Peter Oberparleiter <[email protected]>
Reported-by: Matthieu Baerts <[email protected]>
Closes: linux-test-project/lcov#445
Tested-by: Matthieu Baerts <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Currently mremap folio pte batch ignores the writable bit during figuring
out a set of similar ptes mapping the same folio.  Suppose that the first
pte of the batch is writable while the others are not - set_ptes will end
up setting the writable bit on the other ptes, which is a violation of
mremap semantics.  Therefore, use FPB_RESPECT_WRITE to check the writable
bit while determining the pte batch.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Dev Jain <[email protected]>
Fixes: f822a9a ("mm: optimize mremap() by PTE batching")
Reported-by: David Hildenbrand <[email protected]>
Debugged-by: David Hildenbrand <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Pedro Falcato <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>	[6.17+]
Signed-off-by: Andrew Morton <[email protected]>
…or slabobj_ext

When alloc_slab_obj_exts() fails and then later succeeds in allocating a
slab extension vector, it calls handle_failed_objexts_alloc() to mark all
objects in the vector as empty.  As a result all objects in this slab
(slabA) will have their extensions set to CODETAG_EMPTY.

Later on if this slabA is used to allocate a slabobj_ext vector for
another slab (slabB), we end up with the slabB->obj_exts pointing to a
slabobj_ext vector that itself has a non-NULL slabobj_ext equal to
CODETAG_EMPTY.  When slabB gets freed, free_slab_obj_exts() is called to
free slabB->obj_exts vector.  

free_slab_obj_exts() calls mark_objexts_empty(slabB->obj_exts) which will
generate a warning because it expects slabobj_ext vectors to have a NULL
obj_ext, not CODETAG_EMPTY.

Modify mark_objexts_empty() to skip the warning and setting the obj_ext
value if it's already set to CODETAG_EMPTY.


To quickly detect this WARN, I modified the code from
WARN_ON(slab_exts[offs].ref.ct) to BUG_ON(slab_exts[offs].ref.ct == 1);

We then obtained this message:

[21630.898561] ------------[ cut here ]------------
[21630.898596] kernel BUG at mm/slub.c:2050!
[21630.898611] Internal error: Oops - BUG: 00000000f2000800 [qualcomm-linux#1] SMP
[21630.900372] Modules linked in: squashfs isofs vfio_iommu_type1 
vhost_vsock vfio vhost_net vmw_vsock_virtio_transport_common vhost tap 
vhost_iotlb iommufd vsock binfmt_misc nfsv3 nfs_acl nfs lockd grace 
netfs tls rds dns_resolver tun brd overlay ntfs3 exfat btrfs 
blake2b_generic xor xor_neon raid6_pq loop sctp ip6_udp_tunnel 
udp_tunnel nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib 
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 
nf_tables rfkill ip_set sunrpc vfat fat joydev sg sch_fq_codel nfnetlink 
virtio_gpu sr_mod cdrom drm_client_lib virtio_dma_buf drm_shmem_helper 
drm_kms_helper drm ghash_ce backlight virtio_net virtio_blk virtio_scsi 
net_failover virtio_console failover virtio_mmio dm_mirror 
dm_region_hash dm_log dm_multipath dm_mod fuse i2c_dev virtio_pci 
virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring autofs4 
aes_neon_bs aes_ce_blk [last unloaded: hwpoison_inject]
[21630.909177] CPU: 3 UID: 0 PID: 3787 Comm: kylin-process-m Kdump: 
loaded Tainted: G        W           6.18.0-rc1+ qualcomm-linux#74 PREEMPT(voluntary)
[21630.910495] Tainted: [W]=WARN
[21630.910867] Hardware name: QEMU KVM Virtual Machine, BIOS unknown 
2/2/2022
[21630.911625] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS 
BTYPE=--)
[21630.912392] pc : __free_slab+0x228/0x250
[21630.912868] lr : __free_slab+0x18c/0x250[21630.913334] sp : 
ffff8000a02f73e0
[21630.913830] x29: ffff8000a02f73e0 x28: fffffdffc43fc800 x27: 
ffff0000c0011c40
[21630.914677] x26: ffff0000c000cac0 x25: ffff00010fe5e5f0 x24: 
ffff000102199b40
[21630.915469] x23: 0000000000000003 x22: 0000000000000003 x21: 
ffff0000c0011c40
[21630.916259] x20: fffffdffc4086600 x19: fffffdffc43fc800 x18: 
0000000000000000
[21630.917048] x17: 0000000000000000 x16: 0000000000000000 x15: 
0000000000000000
[21630.917837] x14: 0000000000000000 x13: 0000000000000000 x12: 
ffff70001405ee66
[21630.918640] x11: 1ffff0001405ee65 x10: ffff70001405ee65 x9 : 
ffff800080a295dc
[21630.919442] x8 : ffff8000a02f7330 x7 : 0000000000000000 x6 : 
0000000000003000
[21630.920232] x5 : 0000000024924925 x4 : 0000000000000001 x3 : 
0000000000000007
[21630.921021] x2 : 0000000000001b40 x1 : 000000000000001f x0 : 
0000000000000001
[21630.921810] Call trace:
[21630.922130]  __free_slab+0x228/0x250 (P)
[21630.922669]  free_slab+0x38/0x118
[21630.923079]  free_to_partial_list+0x1d4/0x340
[21630.923591]  __slab_free+0x24c/0x348
[21630.924024]  ___cache_free+0xf0/0x110
[21630.924468]  qlist_free_all+0x78/0x130
[21630.924922]  kasan_quarantine_reduce+0x114/0x148
[21630.925525]  __kasan_slab_alloc+0x7c/0xb0
[21630.926006]  kmem_cache_alloc_noprof+0x164/0x5c8
[21630.926699]  __alloc_object+0x44/0x1f8
[21630.927153]  __create_object+0x34/0xc8
[21630.927604]  kmemleak_alloc+0xb8/0xd8
[21630.928052]  kmem_cache_alloc_noprof+0x368/0x5c8
[21630.928606]  getname_flags.part.0+0xa4/0x610
[21630.929112]  getname_flags+0x80/0xd8
[21630.929557]  vfs_fstatat+0xc8/0xe0
[21630.929975]  __do_sys_newfstatat+0xa0/0x100
[21630.930469]  __arm64_sys_newfstatat+0x90/0xd8
[21630.931046]  invoke_syscall+0xd4/0x258
[21630.931685]  el0_svc_common.constprop.0+0xb4/0x240
[21630.932467]  do_el0_svc+0x48/0x68
[21630.932972]  el0_svc+0x40/0xe0
[21630.933472]  el0t_64_sync_handler+0xa0/0xe8
[21630.934151]  el0t_64_sync+0x1ac/0x1b0
[21630.934923] Code: aa1803e0 97ffef2b a9446bf9 17ffff9c (d4210000)
[21630.936461] SMP: stopping secondary CPUs
[21630.939550] Starting crashdump kernel...
[21630.940108] Bye!

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 09c4656 ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations")
Signed-off-by: Hao Ge <[email protected]>
Reviewed-by: Suren Baghdasaryan <[email protected]>
Cc: Christoph Lameter (Ampere) <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: gehao <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
maple_tree tracepoints contain pointers to function names. Such a pointer
is saved when a tracepoint logs an event. There's no guarantee that it's
still valid when the event is parsed later and the pointer is dereferenced.

The kernel warns about these unsafe pointers.

	event 'ma_read' has unsafe pointer field 'fn'
	WARNING: kernel/trace/trace.c:3779 at ignore_event+0x1da/0x1e4

Mark the function names as tracepoint_string() to fix the events.

One case that doesn't work without my patch would be trace-cmd record
to save the binary ringbuffer and trace-cmd report to parse it in
userspace.  The address of __func__ can't be dereferenced from
userspace but tracepoint_string will add an entry to
/sys/kernel/tracing/printk_formats

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 54a611b ("Maple Tree: add new data structure")
Signed-off-by: Martin Kaiser <[email protected]>
Acked-by: Liam R. Howlett <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
torvalds and others added 29 commits November 14, 2025 12:50
…el/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "A collection of small fixes. All changes are device-specific, and
  nothing stands out.

   - A regression fix for HD-audio HDMI probe

   - USB-audio hardening patches for issues spotted by fuzzers

   - ASoC fixes for TAS278x, SoundWire and Cirrus

   - Usual HD-audio and USB-audio quirks"

* tag 'sound-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: usb-audio: Add native DSD quirks for PureAudio DAC series
  ASoC: rsnd: fix OF node reference leak in rsnd_ssiu_probe()
  ALSA: hda/tas2781: Correct the wrong project ID
  ALSA: usb-audio: Fix NULL pointer dereference in snd_usb_mixer_controls_badd
  ASoC: SDCA: bug fix while parsing mipi-sdca-control-cn-list
  ALSA: usb-audio: Fix potential overflow of PCM transfer buffer
  ALSA: hda/tas2781: Add new quirk for HP new projects
  ASoC: tas2781: fix getting the wrong device number
  ASoC: codecs: va-macro: fix resource leak in probe error path
  ASoC: tas2783A: Fix issues in firmware parsing
  ASoC: sdw_utils: fix device reference leak in is_sdca_endpoint_present()
  ASoC: cs4271: Fix regulator leak on probe failure
  ALSA: hda/hdmi: Fix breakage at probing nvhdmi-mcp driver
  ASoC: da7213: Use component driver suspend/resume
  ALSA: usb-audio: add min_mute quirk for SteelSeries Arctis
  ASoC: doc: cs35l56: Update firmware filename description for B0 silicon
…inux/kernel/git/broonie/regulator

Pull regulator fix from Mark Brown:
 "One simple fix for a GPIO descriptor leak in the probe error handling
  for the fixed regulator"

* tag 'regulator-fix-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
  regulator: fixed: fix GPIO descriptor leak on register failure
…ernel/git/broonie/spi

Pull spi fixes from Mark Brown:
 "A few standard fixes here, plus one more interesting one from Hans
  which addresses an issue where a move in when we requested GPIOs on
  ACPI systems caused us to stop doing pinmuxing and leave things
  floating that we'd really rather not have floating"

* tag 'spi-fix-v6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
  spi: Add TODO comment about ACPI GPIO setup
  spi: xilinx: increase number of retries before declaring stall
  spi: imx: keep dma request disabled before dma transfer setup
  spi: Try to get ACPI GPIO IRQ earlier
…kernel/git/cxl/cxl

Pull cxl fixes from Dave Jiang:

 - Fix incorrect device handle check for Generic Initiator

 - Fix offset calculation for extended linear cache poison injection

 - Fix lockdep warning for hmem_register_resource()

* tag 'cxl-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  acpi/hmat: Fix lockdep warning for hmem_register_resource()
  cxl: Adjust offset calculation for poison injection
  acpi,srat: Fix incorrect device handle check for Generic Initiator
…kernel/git/ulfh/linux-pm

Pull pmdomain fixes from Ulf Hansson:

 - imx: Fix reference count leak in ->remove()

 - samsung: Rework legacy splash-screen handover workaround

 - samsung: Fix potential memleak during ->probe()

 - arm: Fix genpd leak on provider registration failure for scmi

* tag 'pmdomain-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/linux-pm:
  pmdomain: imx: Fix reference count leak in imx_gpc_remove
  pmdomain: samsung: Rework legacy splash-screen handover workaround
  pmdomain: arm: scmi: Fix genpd leak on provider registration failure
  pmdomain: samsung: plug potential memleak during probe
…l/git/ulfh/mmc

Pull MMC fixes from Ulf Hansson:

 - dw_mmc-rockchip: Fix internal phase calculation

 - pxamci: Simplify and fix ->probe() error handling

 - sdhci-of-dwcmshc: Fix strbin signal delay

 - wmt-sdmmc: Fix compile test default

* tag 'mmc-v6.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
  mmc: dw_mmc-rockchip: Fix wrong internal phase calculate
  mmc: pxamci: Simplify pxamci_probe() error handling using devm APIs
  mmc: sdhci-of-dwcmshc: Change DLL_STRBIN_TAPNUM_DEFAULT to 0x4
  mmc: wmt-sdmmc: fix compile test default
…m/kernel

Pull drm fixes from Dave Airlie:
 "Weekly fixes, amdgpu and vmwgfx making up the most of it, along with
  panthor and i915/xe.

  Seems about right for this time of development, nothing major
  outstanding.

  client:
   - Fix description of module parameter

  panthor:
   - Flush writes before mapping buffers

  vmwgfx:
   - Improve command validation
   - Improve ref counting
   - Fix cursor-plane support

  amdgpu:
   - Disallow P2P DMA for GC 12 DCC surfaces
   - ctx error handling fix
   - UserQ fixes
   - VRR fix
   - ISP fix
   - JPEG 5.0.1 fix

  amdkfd:
   - Save area check fix
   - Fix GPU mappings for APU after prefetch

  i915:
   - Fix PSR's pipe to vblank conversion
   - Disable Panel Replay on MST links

  xe:
   - New HW workarounds affecting PTL and WCL platforms

* tag 'drm-fixes-2025-11-15' of https://gitlab.freedesktop.org/drm/kernel:
  drm/client: fix MODULE_PARM_DESC string for "active"
  drm/i915/dp_mst: Disable Panel Replay
  drm/amdkfd: Fix GPU mappings for APU after prefetch
  drm/amdkfd: relax checks for over allocation of save area
  drm/amdgpu/jpeg: Add parse_cs for JPEG5_0_1
  drm/amd/amdgpu: Ensure isp_kernel_buffer_alloc() creates a new BO
  drm/amd/display: Allow VRR params change if unsynced with the stream
  drm/amdgpu: fix lock warning in amdgpu_userq_fence_driver_process
  drm/amdgpu: jump to the correct label on failure
  drm/amdgpu: disable peer-to-peer access for DCC-enabled GC12 VRAM surfaces
  drm/xe/xe3lpg: Extend Wa_15016589081 for xe3lpg
  drm/xe/xe3: Extend wa_14023061436
  drm/xe/xe3: Add WA_14024681466 for Xe3_LPG
  drm/i915/psr: fix pipe to vblank conversion
  drm/panthor: Flush shmem writes before mapping buffers CPU-uncached
  drm/vmwgfx: Restore Guest-Backed only cursor plane support
  drm/vmwgfx: Use kref in vmw_bo_dirty
  drm/vmwgfx: Validate command header size against SVGA_CMD_MAX_DATASIZE
…inux-nfs

Pull NFS client fixes from Anna Schumaker:

 - Various fixes when using NFS with TLS

 - Localio direct-IO fixes

 - Fix error handling in nfs_atomic_open_v23()

 - Fix sysfs memory leak when nfs_client kobject add fails

 - Fix an incorrect parameter when calling nfs4_call_sync()

 - Fix a failing LTP test when using delegated timestamps

* tag 'nfs-for-6.18-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFS: Fix LTP test failures when timestamps are delegated
  NFSv4: Fix an incorrect parameter when calling nfs4_call_sync()
  NFS: sysfs: fix leak when nfs_client kobject add fails
  NFSv2/v3: Fix error handling in nfs_atomic_open_v23()
  nfs/localio: do not issue misaligned DIO out-of-order
  nfs/localio: Ensure DIO WRITE's IO on stable storage upon completion
  nfs/localio: backfill missing partial read support for misaligned DIO
  nfs/localio: add refcounting for each iocb IO associated with NFS pgio header
  nfs/localio: remove unecessary ENOTBLK handling in DIO WRITE support
  NFS: Check the TLS certificate fields in nfs_match_client()
  pnfs: Set transport security policy to RPC_XPRTSEC_NONE unless using TLS
  pnfs: Fix TLS logic in _nfs4_pnfs_v4_ds_connect()
  pnfs: Fix TLS logic in _nfs4_pnfs_v3_ds_connect()
…ernel/git/ojeda/linux

Pull Rust fix from Miguel Ojeda:

 - Fix a Rust 1.91.0 build issue due to 'bindings.o' not containing
   DWARF debug information anymore by teaching gendwarfksyms to skip
   object files without exports

* tag 'rust-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
  gendwarfksyms: Skip files with no exports
…t/bpf/bpf

Pull bpf fixes from Alexei Starovoitov:

 - Fix interaction between livepatch and BPF fexit programs (Song Liu)
   With Steven and Masami acks.

 - Fix stack ORC unwind from BPF kprobe_multi (Jiri Olsa)
   With Steven and Masami acks.

 - Fix out of bounds access in widen_imprecise_scalars() in the verifier
   (Eduard Zingerman)

 - Fix conflicts between MPTCP and BPF sockmap (Jiayuan Chen)

 - Fix net_sched storage collision with BPF data_meta/data_end (Eric
   Dumazet)

 - Add _impl suffix to BPF kfuncs with implicit args to avoid breaking
   them in bpf-next when KF_IMPLICIT_ARGS is added (Mykyta Yatsenko)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  selftests/bpf: Test widen_imprecise_scalars() with different stack depth
  bpf: account for current allocated stack depth in widen_imprecise_scalars()
  bpf: Add bpf_prog_run_data_pointers()
  selftests/bpf: Add mptcp test with sockmap
  mptcp: Fix proto fallback detection with BPF
  mptcp: Disallow MPTCP subflows from sockmap
  selftests/bpf: Add stacktrace ips test for raw_tp
  selftests/bpf: Add stacktrace ips test for kprobe_multi/kretprobe_multi
  x86/fgraph,bpf: Fix stack ORC unwind from kprobe_multi return probe
  Revert "perf/x86: Always store regs->ip in perf_callchain_kernel()"
  bpf: add _impl suffix for bpf_stream_vprintk() kfunc
  bpf:add _impl suffix for bpf_task_work_schedule* kfuncs
  selftests/bpf: Add tests for livepatch + bpf trampoline
  ftrace: bpf: Fix IPMODIFY + DIRECT in modify_ftrace_direct()
  ftrace: Fix BPF fexit with livepatch
…ernel/git/pci/pci

Pull pci fixes from Bjorn Helgaas:

 - Cache the ASPM L0s/L1 Supported bits early so quirks can override
   them if necessary (Bjorn Helgaas)

 - Add quirks for PA Semi and Freescale Root Ports and a HiSilicon Wi-Fi
   device that are reported to have broken L0s and L1 (Shawn Lin, Bjorn
   Helgaas)

* tag 'pci-v6.18-fixes-5' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  PCI/ASPM: Avoid L0s and L1 on Hi1105 [19e5:1105] Wi-Fi
  PCI/ASPM: Avoid L0s and L1 on PA Semi [1959:a002] Root Ports
  PCI/ASPM: Avoid L0s and L1 on Freescale [1957:0451] Root Ports
  PCI/ASPM: Convert quirks to override advertised link states
  PCI/ASPM: Add pcie_aspm_remove_cap() to override advertised link states
  PCI/ASPM: Cache L0s/L1 Supported so advertised link states can be overridden
…nux/kernel/git/tip/tip

Pull core fix from Ingo Molnar:
 "Fix a broken #ifndef in the <linux/entry-virt.h> header.

  It hasn't caused problems upstream yet because no arch overrides
  arch_xfer_to_guest_mode_handle_work() at this moment"

* tag 'core-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Fix ifndef around arch_xfer_to_guest_mode_handle_work() stub
…ux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
 "Fix an irqchip driver release bug in the riscv-intc irqchip driver"

* tag 'irq-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/riscv-intc: Add missing free() callback in riscv_intc_domain_ops
…linux/kernel/git/tip/tip

Pull timer fix from Ingo Molnar:
 "Fix a memory leak in the posix timer creation logic"

* tag 'timers-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-timers: Plug potential memory leak in do_timer_create()
…ux/kernel/git/tip/tip

Pull x86 fixes from Ingo Molnar:

 - Update the list of AMD microcode minimum Entrysign revisions

 - Add additional fixed AMD RDSEED microcode revisions

 - Update the language transliteration for Kiryl Shutsemau's name
   in the MAINTAINERS entry

* tag 'x86-urgent-2025-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/microcode/AMD: Add Zen5 model 0x44, stepping 0x1 minrev
  x86/CPU/AMD: Add additional fixed RDSEED microcode revisions
  MAINTAINERS: Update name spelling
…git/s390/linux

Pull s390 fix from Heiko Carstens:

 - Fix a bug in the __ptep_rdp() inline assembly which may lead to
   missing TLB flushes

* tag 's390-6.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/mm: Fix __ptep_rdp() inline assembly
In the past, CONFIG_ARCH_HAS_GIGANTIC_PAGE indicated that we support
runtime allocation of gigantic hugetlb folios.  In the meantime it evolved
into a generic way for the architecture to state that it supports gigantic
hugetlb folios.

In commit fae7d83 ("mm: add __dump_folio()") we started using
CONFIG_ARCH_HAS_GIGANTIC_PAGE to decide MAX_FOLIO_ORDER: whether we could
have folios larger than what the buddy can handle.  In the context of that
commit, we started using MAX_FOLIO_ORDER to detect page corruptions when
dumping tail pages of folios.  Before that commit, we assumed that we
cannot have folios larger than the highest buddy order, which was
obviously wrong.

In commit 7b4f21f ("mm/hugetlb: check for unreasonable folio sizes
when registering hstate"), we used MAX_FOLIO_ORDER to detect
inconsistencies, and in fact, we found some now.

Powerpc allows for configs that can allocate gigantic folio during boot
(not at runtime), that do not set CONFIG_ARCH_HAS_GIGANTIC_PAGE and can
exceed PUD_ORDER.

To fix it, let's make powerpc select CONFIG_ARCH_HAS_GIGANTIC_PAGE with
hugetlb on powerpc, and increase the maximum folio size with hugetlb to 16
GiB on 64bit (possible on arm64 and powerpc) and 1 GiB on 32 bit
(powerpc).  Note that on some powerpc configurations, whether we actually
have gigantic pages depends on the setting of CONFIG_ARCH_FORCE_MAX_ORDER,
but there is nothing really problematic about setting it unconditionally:
we just try to keep the value small so we can better detect problems in
__dump_folio() and inconsistencies around the expected largest folio in
the system.

Ideally, we'd have a better way to obtain the maximum hugetlb folio size
and detect ourselves whether we really end up with gigantic folios.  Let's
defer bigger changes and fix the warnings first.

While at it, handle gigantic DAX folios more clearly: DAX can only end up
creating gigantic folios with HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD.

Add a new Kconfig option HAVE_GIGANTIC_FOLIOS to make both cases clearer. 
In particular, worry about ARCH_HAS_GIGANTIC_PAGE only with HUGETLB_PAGE.

Note: with enabling CONFIG_ARCH_HAS_GIGANTIC_PAGE on powerpc, we will now
also allow for runtime allocations of folios in some more powerpc configs.
I don't think this is a problem, but if it is we could handle it through
__HAVE_ARCH_GIGANTIC_PAGE_RUNTIME_SUPPORTED.

While __dump_page()/__dump_folio was also problematic (not handling
dumping of tail pages of such gigantic folios correctly), it doesn't seem
critical enough to mark it as a fix.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 7b4f21f ("mm/hugetlb: check for unreasonable folio sizes when registering hstate")
Reported-by: Christophe Leroy <[email protected]>
Closes: https://lore.kernel.org/r/[email protected]/
Reported-by: Sourabh Jain <[email protected]>
Closes: https://lore.kernel.org/r/[email protected]/
Signed-off-by: David Hildenbrand (Red Hat) <[email protected]>
Cc: Ritesh Harjani (IBM) <[email protected]>
Cc: Madhavan Srinivasan <[email protected]>
Cc: Donet Tom <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Nicholas Piggin <[email protected]>
Cc: Christophe Leroy <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: "Liam R. Howlett" <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mike Rapoport <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Nathan Chancellor <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
When crashkernel is configured with a high reservation, shrinking its
value below the low crashkernel reservation causes two issues:

1. Invalid crashkernel resource objects
2. Kernel crash if crashkernel shrinking is done twice

For example, with crashkernel=200M,high, the kernel reserves 200MB of high
memory and some default low memory (say 256MB).  The reservation appears
as:

cat /proc/iomem | grep -i crash
af000000-beffffff : Crash kernel
433000000-43f7fffff : Crash kernel

If crashkernel is then shrunk to 50MB (echo 52428800 >
/sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved:
af000000-beffffff : Crash kernel

Instead, it should show 50MB:
af000000-b21fffff : Crash kernel

Further shrinking crashkernel to 40MB causes a kernel crash with the
following trace (x86):

BUG: kernel NULL pointer dereference, address: 0000000000000038
PGD 0 P4D 0
Oops: 0000 [qualcomm-linux#1] PREEMPT SMP NOPTI
<snip...>
Call Trace: <TASK>
? __die_body.cold+0x19/0x27
? page_fault_oops+0x15a/0x2f0
? search_module_extables+0x19/0x60
? search_bpf_extables+0x5f/0x80
? exc_page_fault+0x7e/0x180
? asm_exc_page_fault+0x26/0x30
? __release_resource+0xd/0xb0
release_resource+0x26/0x40
__crash_shrink_memory+0xe5/0x110
crash_shrink_memory+0x12a/0x190
kexec_crash_size_store+0x41/0x80
kernfs_fop_write_iter+0x141/0x1f0
vfs_write+0x294/0x460
ksys_write+0x6d/0xf0
<snip...>

This happens because __crash_shrink_memory()/kernel/crash_core.c
incorrectly updates the crashk_res resource object even when
crashk_low_res should be updated.

Fix this by ensuring the correct crashkernel resource object is updated
when shrinking crashkernel memory.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 16c6006 ("kexec: enable kexec_crash_size to support two crash kernel regions")
Signed-off-by: Sourabh Jain <[email protected]>
Acked-by: Baoquan He <[email protected]>
Cc: Zhen Lei <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Switch to kernel.org email address as I will be leaving Red Hat.  The old
address will remain active until end of January 2026, so performing the
change now should make sure that most mails will reach me.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Signed-off-by: David Hildenbrand (Red Hat) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Both uniform and non uniform split check missed the check to prevent
splitting anon folios in swapcache to non-zero order.

Splitting anon folios in swapcache to non-zero order can cause data
corruption since swapcache only support PMD order and order-0 entries. 
This can happen when one use split_huge_pages under debugfs to split
anon folios in swapcache.

In-tree callers do not perform such an illegal operation.  Only debugfs
interface could trigger it.  I will put adding a test case on my TODO
list.

Fix the check.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 58729c0 ("mm/huge_memory: add buddy allocator like (non-uniform) folio_split()")
Signed-off-by: Zi Yan <[email protected]>
Reported-by: "David Hildenbrand (Red Hat)" <[email protected]>
Closes: https://lore.kernel.org/all/[email protected]/
Acked-by: David Hildenbrand (Red Hat) <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Dev Jain <[email protected]>
Cc: Lance Yang <[email protected]>
Cc: Liam Howlett <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Nico Pache <[email protected]>
Cc: Ryan Roberts <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
We must check whether KHO is enabled prior to issuing KHO commands,
otherwise KHO internal data structures are not initialized.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b753522 ("kho: add test for kexec handover")
Signed-off-by: Pasha Tatashin <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Reviewed-by: Pratyush Yadav <[email protected]>
Reviewed-by: Mike Rapoport (Microsoft) <[email protected]>
Cc: Alexander Graf <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
… perf_test

Accessing 'reg.write_index' directly triggers a -Waddress-of-packed-member
warning due to potential unaligned pointer access:

perf_test.c:239:38: warning: taking address of packed member 'write_index'
of class or structure 'user_reg' may result in an unaligned pointer value
[-Waddress-of-packed-member]
  239 |         ASSERT_NE(-1, write(self->data_fd, &reg.write_index,
      |                                             ^~~~~~~~~~~~~~~

Since write(2) works with any alignment. Casting '&reg.write_index'
explicitly to 'void *' to suppress this warning.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 42187bd ("selftests/user_events: Add perf self-test for empty arguments events")
Signed-off-by: Ankit Khushwaha <[email protected]>
Cc: Beau Belgrave <[email protected]>
Cc: "Masami Hiramatsu (Google)" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: sunliming <[email protected]>
Cc: Wei Yang <[email protected]>
Cc: Shuah Khan <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Since commit 78524b0 ("mm, swap: avoid redundant swap device
pinning"), the common helper for allocating and preparing a folio in the
swap cache layer no longer tries to get a swap device reference
internally, because all callers of __read_swap_cache_async are already
holding a swap entry reference.  The repeated swap device pinning isn't
needed on the same swap device.

Caller of VMA readahead is also holding a reference to the target entry's
swap device, but VMA readahead walks the page table, so it might encounter
swap entries from other devices, and call __read_swap_cache_async on
another device without holding a reference to it.

So it is possible to cause a UAF when swapoff of device A raced with
swapin on device B, and VMA readahead tries to read swap entries from
device A.  It's not easy to trigger, but in theory, it could cause real
issues.

Make VMA readahead try to get the device reference first if the swap
device is a different one from the target entry.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 78524b0 ("mm, swap: avoid redundant swap device pinning")
Suggested-by: Huang Ying <[email protected]>
Signed-off-by: Kairui Song <[email protected]>
Acked-by: Chris Li <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Barry Song <[email protected]>
Cc: Kemeng Shi <[email protected]>
Cc: Nhat Pham <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
The generation field of topology map is updated after initialized by zero.
The updated value of generation field is always zero, and is against
specification.

This commit fixes the bug.

Fixes: 7d138cb ("firewire: core: use spin lock specific to topology map")
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Sakamoto <[email protected]>
…/linux/kernel/git/ras/ras

Pull EDAC fixes from Borislav Petkov:

 - In Versalnet, handle the reporting of non-standard hw errors whose
   information can come in more than one remote processor message.

 - Explicitly reenable ECC checking after a warm reset in Altera OCRAM
   as those registers are reset to default otherwise

 - Fix single-bit error injection in Altera EDAC to not inject errors
   directly in ECC RAM and thus lead to false double-bit errors due to
   same ECC RAM being in concurrent use

* tag 'edac_urgent_for_v6.18_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
  EDAC/altera: Use INTTEST register for Ethernet and USB SBE injection
  EDAC/altera: Handle OCRAM ECC enable after warm reset
  EDAC/versalnet: Handle split messages for non-standard errors
…inux/kernel/git/ieee1394/linux1394

Pull firewire fixes from Takashi Sakamoto:
 "This includes some fixes for the topology map, newly introduced in
  v6.18 kernel"

* tag 'firewire-fixes-6.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
  firewire: core: fix to update generation field in topology map
  firewire: core: Initialize topology_map.lock
…rg/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "7 hotfixes.  5 are cc:stable, 4 are against mm/

  All are singletons - please see the respective changelogs for details"

* tag 'mm-hotfixes-stable-2025-11-16-10-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm, swap: fix potential UAF issue for VMA readahead
  selftests/user_events: fix type cast for write_index packed member in perf_test
  lib/test_kho: check if KHO is enabled
  mm/huge_memory: fix folio split check for anon folios in swapcache
  MAINTAINERS: update David Hildenbrand's email address
  crash: fix crashkernel resource shrink
  mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
…el.org/pub/scm/linux/kernel/git/perf/perf-tools

Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Fix writing bpf_prog (infos|btfs)_cnt to data file, to not generate
   invalid perf.data files in some corner cases.

 - Fix 'perf top' segfault by ensuring libbfd is initialized. This is an
   opt-in feature due to license incompatibilities.

 - Fix segfault in 'perf lock' due to missing kernel map.

 - Fix 'perf lock contention' test.

 - Don't fail fast path detection if binutils-devel isn't available.

 - Sync KVM's vmx.h with the kernel to pick SEAMCALL exit reason.

* tag 'perf-tools-fixes-for-v6.18-2-2025-11-16' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools:
  perf libbfd: Ensure libbfd is initialized prior to use
  perf test: Fix lock contention test
  perf lock: Fix segfault due to missing kernel map
  tools headers UAPI: Sync KVM's vmx.h with the kernel to pick SEAMCALL exit reason
  perf build: Don't fail fast path feature detection when binutils-devel is not available
  perf header: Write bpf_prog (infos|btfs)_cnt to data file
@shashim-quic
Copy link
Collaborator

For rebase you dont need to raise a PR for topic branch you maintain. You can simply rebase in your local branch and push it to repo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.