Commit Graph

3102 Commits

Author SHA1 Message Date
Linus Torvalds
4ba05b0e85 Merge tag 'tpmdd-next-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd
Pull tpm fixes from Jarkko Sakkinen:
 "Two bug fixes for TPM bus encryption (the remaining reported issues in
  the feature)"

* tag 'tpmdd-next-6.12-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/jarkko/linux-tpmdd:
  tpm: Disable TPM on tpm2_create_primary() failure
  tpm: Opt-in in disable PCR integrity protection
2024-11-13 13:28:58 -08:00
Jarkko Sakkinen
27184f8905 tpm: Opt-in in disable PCR integrity protection
The initial HMAC session feature added TPM bus encryption and/or integrity
protection to various in-kernel TPM operations. This can cause performance
bottlenecks with IMA, as it heavily utilizes PCR extend operations.

In order to mitigate this performance issue, introduce a kernel
command-line parameter to the TPM driver for disabling the integrity
protection for PCR extend operations (i.e. TPM2_PCR_Extend).

Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Link: https://lore.kernel.org/linux-integrity/20241015193916.59964-1-zohar@linux.ibm.com/
Fixes: 6519fea6fd ("tpm: add hmac checks to tpm2_pcr_extend()")
Tested-by: Mimi Zohar <zohar@linux.ibm.com>
Co-developed-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Co-developed-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2024-11-13 21:10:45 +02:00
Barry Song
e7ac4daeed mm: count zeromap read and set for swapout and swapin
When the proportion of folios from the zeromap is small, missing their
accounting may not significantly impact profiling.  However, it's easy to
construct a scenario where this becomes an issue—for example, allocating
1 GB of memory, writing zeros from userspace, followed by MADV_PAGEOUT,
and then swapping it back in.  In this case, the swap-out and swap-in
counts seem to vanish into a black hole, potentially causing semantic
ambiguity.

On the other hand, Usama reported that zero-filled pages can exceed 10% in
workloads utilizing zswap, while Hailong noted that some app in Android
have more than 6% zero-filled pages.  Before commit 0ca0c24e32 ("mm:
store zero pages to be swapped out in a bitmap"), both zswap and zRAM
implemented similar optimizations, leading to these optimized-out pages
being counted in either zswap or zRAM counters (with pswpin/pswpout also
increasing for zRAM).  With zeromap functioning prior to both zswap and
zRAM, userspace will no longer detect these swap-out and swap-in actions.

We have three ways to address this:

1. Introduce a dedicated counter specifically for the zeromap.

2. Use pswpin/pswpout accounting, treating the zero map as a standard
   backend.  This approach aligns with zRAM's current handling of
   same-page fills at the device level.  However, it would mean losing the
   optimized-out page counters previously available in zRAM and would not
   align with systems using zswap.  Additionally, as noted by Nhat Pham,
   pswpin/pswpout counters apply only to I/O done directly to the backend
   device.

3. Count zeromap pages under zswap, aligning with system behavior when
   zswap is enabled.  However, this would not be consistent with zRAM, nor
   would it align with systems lacking both zswap and zRAM.

Given the complications with options 2 and 3, this patch selects
option 1.

We can find these counters from /proc/vmstat (counters for the whole
system) and memcg's memory.stat (counters for the interested memcg).

For example:

$ grep -E 'swpin_zero|swpout_zero' /proc/vmstat
swpin_zero 1648
swpout_zero 33536

$ grep -E 'swpin_zero|swpout_zero' /sys/fs/cgroup/system.slice/memory.stat
swpin_zero 3905
swpout_zero 3985

This patch does not address any specific zeromap bug, but the missing
swpout and swpin counts for zero-filled pages can be highly confusing and
may mislead user-space agents that rely on changes in these counters as
indicators.  Therefore, we add a Fixes tag to encourage the inclusion of
this counter in any kernel versions with zeromap.

Many thanks to Kanchana for the contribution of changing
count_objcg_event() to count_objcg_events() to support large folios[1],
which has now been incorporated into this patch.

[1] https://lkml.kernel.org/r/20241001053222.6944-5-kanchana.p.sridhar@intel.com

Link: https://lkml.kernel.org/r/20241107011246.59137-1-21cnbao@gmail.com
Fixes: 0ca0c24e32 ("mm: store zero pages to be swapped out in a bitmap")
Co-developed-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Hailong Liu <hailong.liu@oppo.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-11 00:00:37 -08:00
Maíra Canal
652e1a5146 mm: fix docs for the kernel parameter `thp_anon=`
If we add ``thp_anon=32,64K:always`` to the kernel command line, we
will see the following error:

[    0.000000] huge_memory: thp_anon=32,64K:always: error parsing string, ignoring setting

This happens because the correct format isn't ``thp_anon=<size>,<size>[KMG]:<state>```,
as [KMG] must follow each number to especify its unit. So, the correct
format is ``thp_anon=<size>[KMG],<size>[KMG]:<state>```.

Therefore, adjust the documentation to reflect the correct format of the
parameter ``thp_anon=``.

Link: https://lkml.kernel.org/r/20241101165719.1074234-3-mcanal@igalia.com
Fixes: dd4d30d1cd ("mm: override mTHP "enabled" defaults at kernel cmdline")
Signed-off-by: Maíra Canal <mcanal@igalia.com>
Acked-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:14:59 -08:00
Christian Loehle
29dcbea924 cpufreq: docs: Reflect latency changes in docs
There were two changes related to transition latency recently.
Namely commit e13aa799c2 ("cpufreq: Change default transition delay
to 2ms") and
commit 37c6dccd68 ("cpufreq: Remove LATENCY_MULTIPLIER").

Both changed the defaults / maximums so let the documentation
reflect that.

Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Link: https://patch.msgid.link/46853b6e-bad5-4ace-9b23-ff157f234ae3@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2024-10-21 13:20:03 +02:00
Luca Boccassi
02e2f9aa33 ipe: allow secondary and platform keyrings to install/update policies
The current policy management makes it impossible to use IPE
in a general purpose distribution. In such cases the users are not
building the kernel, the distribution is, and access to the private
key included in the trusted keyring is, for obvious reason, not
available.
This means that users have no way to enable IPE, since there will
be no built-in generic policy, and no access to the key to sign
updates validated by the trusted keyring.

Just as we do for dm-verity, kernel modules and more, allow the
secondary and platform keyrings to also validate policies. This
allows users enrolling their own keys in UEFI db or MOK to also
sign policies, and enroll them. This makes it sensible to enable
IPE in general purpose distributions, as it becomes usable by
any user wishing to do so. Keys in these keyrings can already
load kernels and kernel modules, so there is no security
downgrade.

Add a kconfig each, like dm-verity does, but default to enabled if
the dependencies are available.

Signed-off-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
[FW: fixed some style issues]
Signed-off-by: Fan Wu <wufan@kernel.org>
2024-10-17 11:46:10 -07:00
Luca Boccassi
5ceecb301e ipe: also reject policy updates with the same version
Currently IPE accepts an update that has the same version as the policy
being updated, but it doesn't make it a no-op nor it checks that the
old and new policyes are the same. So it is possible to change the
content of a policy, without changing its version. This is very
confusing from userspace when managing policies.
Instead change the update logic to reject updates that have the same
version with ESTALE, as that is much clearer and intuitive behaviour.

Signed-off-by: Luca Boccassi <bluca@debian.org>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Fan Wu <wufan@kernel.org>
2024-10-17 11:38:15 -07:00
Linus Torvalds
3efc57369a Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull x86 kvm updates from Paolo Bonzini:
 "x86:

   - KVM currently invalidates the entirety of the page tables, not just
     those for the memslot being touched, when a memslot is moved or
     deleted.

     This does not traditionally have particularly noticeable overhead,
     but Intel's TDX will require the guest to re-accept private pages
     if they are dropped from the secure EPT, which is a non starter.

     Actually, the only reason why this is not already being done is a
     bug which was never fully investigated and caused VM instability
     with assigned GeForce GPUs, so allow userspace to opt into the new
     behavior.

   - Advertise AVX10.1 to userspace (effectively prep work for the
     "real" AVX10 functionality that is on the horizon)

   - Rework common MSR handling code to suppress errors on userspace
     accesses to unsupported-but-advertised MSRs

     This will allow removing (almost?) all of KVM's exemptions for
     userspace access to MSRs that shouldn't exist based on the vCPU
     model (the actual cleanup is non-trivial future work)

   - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC)
     splits the 64-bit value into the legacy ICR and ICR2 storage,
     whereas Intel (APICv) stores the entire 64-bit value at the ICR
     offset

   - Fix a bug where KVM would fail to exit to userspace if one was
     triggered by a fastpath exit handler

   - Add fastpath handling of HLT VM-Exit to expedite re-entering the
     guest when there's already a pending wake event at the time of the
     exit

   - Fix a WARN caused by RSM entering a nested guest from SMM with
     invalid guest state, by forcing the vCPU out of guest mode prior to
     signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the
     nested guest)

   - Overhaul the "unprotect and retry" logic to more precisely identify
     cases where retrying is actually helpful, and to harden all retry
     paths against putting the guest into an infinite retry loop

   - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping
     rmaps in the shadow MMU

   - Refactor pieces of the shadow MMU related to aging SPTEs in
     prepartion for adding multi generation LRU support in KVM

   - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is
     enabled, i.e. when the CPU has already flushed the RSB

   - Trace the per-CPU host save area as a VMCB pointer to improve
     readability and cleanup the retrieval of the SEV-ES host save area

   - Remove unnecessary accounting of temporary nested VMCB related
     allocations

   - Set FINAL/PAGE in the page fault error code for EPT violations if
     and only if the GVA is valid. If the GVA is NOT valid, there is no
     guest-side page table walk and so stuffing paging related metadata
     is nonsensical

   - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit
     instead of emulating posted interrupt delivery to L2

   - Add a lockdep assertion to detect unsafe accesses of vmcs12
     structures

   - Harden eVMCS loading against an impossible NULL pointer deref
     (really truly should be impossible)

   - Minor SGX fix and a cleanup

   - Misc cleanups

  Generic:

   - Register KVM's cpuhp and syscore callbacks when enabling
     virtualization in hardware, as the sole purpose of said callbacks
     is to disable and re-enable virtualization as needed

   - Enable virtualization when KVM is loaded, not right before the
     first VM is created

     Together with the previous change, this simplifies a lot the logic
     of the callbacks, because their very existence implies
     virtualization is enabled

   - Fix a bug that results in KVM prematurely exiting to userspace for
     coalesced MMIO/PIO in many cases, clean up the related code, and
     add a testcase

   - Fix a bug in kvm_clear_guest() where it would trigger a buffer
     overflow _if_ the gpa+len crosses a page boundary, which thankfully
     is guaranteed to not happen in the current code base. Add WARNs in
     more helpers that read/write guest memory to detect similar bugs

  Selftests:

   - Fix a goof that caused some Hyper-V tests to be skipped when run on
     bare metal, i.e. NOT in a VM

   - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES
     guest

   - Explicitly include one-off assets in .gitignore. Past Sean was
     completely wrong about not being able to detect missing .gitignore
     entries

   - Verify userspace single-stepping works when KVM happens to handle a
     VM-Exit in its fastpath

   - Misc cleanups"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
  Documentation: KVM: fix warning in "make htmldocs"
  s390: Enable KVM_S390_UCONTROL config in debug_defconfig
  selftests: kvm: s390: Add VM run test case
  KVM: SVM: let alternatives handle the cases when RSB filling is required
  KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
  KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
  KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
  KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
  KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
  KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
  KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
  KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
  KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
  KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn
  KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
  KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
  KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
  KVM: x86: Update retry protection fields when forcing retry on emulation failure
  KVM: x86: Apply retry protection to "unprotect on failure" path
  KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
  ...
2024-09-28 09:20:14 -07:00
Linus Torvalds
e477dba544 Merge tag 'for-6.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mikulas Patocka:

 - Misc VDO fixes

 - Remove unused declarations dm_get_rq_mapinfo() and dm_zone_map_bio()

 - Dm-delay: Improve kernel documentation

 - Dm-crypt: Allow to specify the integrity key size as an option

 - Dm-bufio: Remove pointless NULL check

 - Small code cleanups: Use ERR_CAST; remove unlikely() around IS_ERR;
   use __assign_bit

 - Dm-integrity: Fix gcc 5 warning; convert comma to semicolon; fix
   smatch warning

 - Dm-integrity: Support recalculation in the 'I' mode

 - Revert "dm: requeue IO if mapping table not yet available"

 - Dm-crypt: Small refactoring to make the code more readable

 - Dm-cache: Remove pointless error check

 - Dm: Fix spelling errors

 - Dm-verity: Restart or panic on an I/O error if restart or panic was
   requested

 - Dm-verity: Fallback to platform keyring also if key in trusted
   keyring is rejected

* tag 'for-6.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits)
  dm verity: fallback to platform keyring also if key in trusted keyring is rejected
  dm-verity: restart or panic on an I/O error
  dm: fix spelling errors
  dm-cache: remove pointless error check
  dm vdo: handle unaligned discards correctly
  dm vdo indexer: Convert comma to semicolon
  dm-crypt: Use common error handling code in crypt_set_keyring_key()
  dm-crypt: Use up_read() together with key_put() only once in crypt_set_keyring_key()
  Revert "dm: requeue IO if mapping table not yet available"
  dm-integrity: check mac_size against HASH_MAX_DIGESTSIZE in sb_mac()
  dm-integrity: support recalculation in the 'I' mode
  dm integrity: Convert comma to semicolon
  dm integrity: fix gcc 5 warning
  dm: Make use of __assign_bit() API
  dm integrity: Remove extra unlikely helper
  dm: Convert to use ERR_CAST()
  dm bufio: Remove NULL check of list_entry()
  dm-crypt: Allow to specify the integrity key size as option
  dm: Remove unused declaration and empty definition "dm_zone_map_bio"
  dm delay: enhance kernel documentation
  ...
2024-09-27 09:12:51 -07:00
Linus Torvalds
abf2050f51 Merge tag 'media/v6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:

 - New CEC driver: Extron DA HD 4K Plus

 - Lots of driver fixes, cleanups and improvements

* tag 'media/v6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (179 commits)
  media: atomisp: Use clamp() in ia_css_eed1_8_vmem_encode()
  media: atomisp: Fix eed1_8 code assigning signed values to an unsigned variable
  media: atomisp: set lock before calling vb2_queue_init()
  media: atomisp: Improve binary finding debug logging
  media: atomisp: Drop dev_dbg() calls from hmm_[alloc|free]()
  media: atomisp: csi2-bridge: Add DMI quirk for t4ka3 on Xiaomi Mipad2
  media: atomisp: add missing wait_prepare/finish ops
  media: atomisp: Remove unused declaration
  media: atomisp: use clamp() in compute_coring()
  media: atomisp: use clamp() in ia_css_eed1_8_encode()
  media: atomisp: Simplify ia_css_pipe_create_cas_scaler_desc_single_output()
  media: atomisp: Replace rarely used macro from math_support.h
  media: atomisp: Remove duplicated leftover, i.e. sh_css_dvs_info.h
  media: atomisp: bnr: fix trailing statement
  media: atomisp: move trailing */ to separate lines
  media: atomisp: move trailing statement to next line.
  media: atomisp: Fix trailing statement in ia_css_de.host.c
  media: atomisp: Fix spelling mistakes in atomisp.h
  media: atomisp: Fix spelling mistakes in atomisp_platform.h
  media: atomisp: Fix spelling mistake in csi_rx_public.h
  ...
2024-09-23 15:27:58 -07:00
Linus Torvalds
af9c191ac2 Merge tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull ring-buffer updates from Steven Rostedt:

 - tracing/ring-buffer: persistent buffer across reboots

   This allows for the tracing instance ring buffer to stay persistent
   across reboots. The way this is done is by adding to the kernel
   command line:

     trace_instance=boot_map@0x285400000:12M

   This will reserve 12 megabytes at the address 0x285400000, and then
   map the tracing instance "boot_map" ring buffer to that memory. This
   will appear as a normal instance in the tracefs system:

     /sys/kernel/tracing/instances/boot_map

   A user could enable tracing in that instance, and on reboot or kernel
   crash, if the memory is not wiped by the firmware, it will recreate
   the trace in that instance. For example, if one was debugging a
   shutdown of a kernel reboot:

     # cd /sys/kernel/tracing
     # echo function > instances/boot_map/current_tracer
     # reboot
     [..]
     # cd /sys/kernel/tracing
     # tail instances/boot_map/trace
           swapper/0-1       [000] d..1.   164.549800: restore_boot_irq_mode <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549801: native_restore_boot_irq_mode <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549802: disconnect_bsp_APIC <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549811: hpet_disable <-native_machine_shutdown
           swapper/0-1       [000] d..1.   164.549812: iommu_shutdown_noop <-native_machine_restart
           swapper/0-1       [000] d..1.   164.549813: native_machine_emergency_restart <-__do_sys_reboot
           swapper/0-1       [000] d..1.   164.549813: tboot_shutdown <-native_machine_emergency_restart
           swapper/0-1       [000] d..1.   164.549820: acpi_reboot <-native_machine_emergency_restart
           swapper/0-1       [000] d..1.   164.549821: acpi_reset <-acpi_reboot
           swapper/0-1       [000] d..1.   164.549822: acpi_os_write_port <-acpi_reboot

   On reboot, the buffer is examined to make sure it is valid. The
   validation check even steps through every event to make sure the meta
   data of the event is correct. If any test fails, it will simply reset
   the buffer, and the buffer will be empty on boot.

 - Allow the tracing persistent boot buffer to use the "reserve_mem"
   option

   Instead of having the admin find a physical address to store the
   persistent buffer, which can be very tedious if they have to
   administrate several different machines, allow them to use the
   "reserve_mem" option that will find a location for them. It is not as
   reliable because of KASLR, as the loading of the kernel in different
   locations can cause the memory allocated to be inconsistent. Booting
   with "nokaslr" can make reserve_mem more reliable.

 - Have function graph tracer handle offsets from a previous boot.

   The ring buffer output from a previous boot may have different
   addresses due to kaslr. Have the function graph tracer handle these
   by using the delta from the previous boot to the new boot address
   space.

 - Only reset the saved meta offset when the buffer is started or reset

   In the persistent memory meta data, it holds the previous address
   space information, so that it can calculate the delta to have
   function tracing work. But this gets updated after being read to hold
   the new address space. But if the buffer isn't used for that boot, on
   reboot, the delta is now calculated from the previous boot and not
   the boot that holds the data in the ring buffer. This causes the
   functions not to be shown. Do not save the address space information
   of the current kernel until it is being recorded.

 - Add a magic variable to test the valid meta data

   Add a magic variable in the meta data that can also be used for
   validation. The validator of the previous buffer doesn't need this
   magic data, but it can be used if the meta data is changed by a new
   kernel, which may have the same format that passes the validator but
   is used differently. This magic number can also be used as a
   "versioning" of the meta data.

 - Align user space mapped ring buffer sub buffers to improve TLB
   entries

   Linus mentioned that the mapped ring buffer sub buffers were
   misaligned between the meta page and the sub-buffers, so that if the
   sub-buffers were bigger than PAGE_SIZE, it wouldn't allow the TLB to
   use bigger entries.

 - Add new kernel command line "traceoff" to disable tracing on boot for
   instances

   If tracing is enabled for a boot instance, there needs a way to be
   able to disable it on boot so that new events do not get entered into
   the ring buffer and be mixed with events from a previous boot, as
   that can be confusing.

 - Allow trace_printk() to go to other instances

   Currently, trace_printk() can only go to the top level instance. When
   debugging with a persistent buffer, it is really useful to be able to
   add trace_printk() to go to that buffer, so that you have access to
   them after a crash.

 - Do not use "bin_printk()" for traces to a boot instance

   The bin_printk() saves only a pointer to the printk format in the
   ring buffer, as the reader of the buffer can still have access to it.
   But this is not the case if the buffer is from a previous boot. If
   the trace_printk() is going to a "persistent" buffer, it will use the
   slower version that writes the printk format into the buffer.

 - Add command line option to allow trace_printk() to go to an instance

   Allow the kernel command line to define which instance the
   trace_printk() goes to, instead of forcing the admin to set it for
   every boot via the tracefs options.

 - Start a document that explains how to use tracefs to debug the kernel

 - Add some more kernel selftests to test user mapped ring buffer

* tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (28 commits)
  selftests/ring-buffer: Handle meta-page bigger than the system
  selftests/ring-buffer: Verify the entire meta-page padding
  tracing/Documentation: Start a document on how to debug with tracing
  tracing: Add option to set an instance to be the trace_printk destination
  tracing: Have trace_printk not use binary prints if boot buffer
  tracing: Allow trace_printk() to go to other instance buffers
  tracing: Add "traceoff" flag to boot time tracing instances
  ring-buffer: Align meta-page to sub-buffers for improved TLB usage
  ring-buffer: Add magic and struct size to boot up meta data
  ring-buffer: Don't reset persistent ring-buffer meta saved addresses
  tracing/fgraph: Have fgraph handle previous boot function addresses
  tracing: Allow boot instances to use reserve_mem boot memory
  tracing: Fix ifdef of snapshots to not prevent last_boot_info file
  ring-buffer: Use vma_pages() helper function
  tracing: Fix NULL vs IS_ERR() check in enable_instances()
  tracing: Add last boot delta offset for stack traces
  tracing: Update function tracing output for previous boot buffer
  tracing: Handle old buffer mappings for event strings and functions
  tracing/ring-buffer: Add last_boot_info file to boot instance
  ring-buffer: Save text and data locations in mapped meta data
  ...
2024-09-22 09:47:16 -07:00
Linus Torvalds
7856a56541 Merge tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull non-MM updates from Andrew Morton:
 "Many singleton patches - please see the various changelogs for
  details.

  Quite a lot of nilfs2 work this time around.

  Notable patch series in this pull request are:

   - "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with
     assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64()
     to provide (much) more accurate results. The current implementation
     was causing Uwe some issues in the PWM drivers.

   - "xz: Updates to license, filters, and compression options" from
     Lasse Collin. Miscellaneous maintenance and kinor feature work to
     the xz decompressor.

   - "Fix some GDB command error and add some GDB commands" from
     Kuan-Ying Lee. Fixes and enhancements to the gdb scripts.

   - "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff
     Johnson. Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of
     warnings about this.

   - "nilfs2: add support for some common ioctls" from Ryusuke Konishi.
     Adds various commonly-available ioctls to nilfs2.

   - "This series fixes a number of formatting issues in kernel doc
     comments" from Ryusuke Konishi does that.

   - "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke
     Konishi. Fix issues where -ENOENT was being unintentionally and
     inappropriately returned to userspace.

   - "nilfs2: assorted cleanups" from Huang Xiaojia.

   - "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke
     Konishi fixes some issues which can occur on corrupted nilfs2
     filesystems.

   - "scripts/decode_stacktrace.sh: improve error reporting and
     usability" from Luca Ceresoli does those things"

* tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (103 commits)
  list: test: increase coverage of list_test_list_replace*()
  list: test: fix tests for list_cut_position()
  proc: use __auto_type more
  treewide: correct the typo 'retun'
  ocfs2: cleanup return value and mlog in ocfs2_global_read_info()
  nilfs2: remove duplicate 'unlikely()' usage
  nilfs2: fix potential oob read in nilfs_btree_check_delete()
  nilfs2: determine empty node blocks as corrupted
  nilfs2: fix potential null-ptr-deref in nilfs_btree_insert()
  user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation
  tools/mm: rm thp_swap_allocator_test when make clean
  squashfs: fix percpu address space issues in decompressor_multi_percpu.c
  lib: glob.c: added null check for character class
  nilfs2: refactor nilfs_segctor_thread()
  nilfs2: use kthread_create and kthread_stop for the log writer thread
  nilfs2: remove sc_timer_task
  nilfs2: do not repair reserved inode bitmap in nilfs_new_inode()
  nilfs2: eliminate the shared counter and spinlock for i_generation
  nilfs2: separate inode type information from i_state field
  nilfs2: use the BITS_PER_LONG macro
  ...
2024-09-21 08:20:50 -07:00
Linus Torvalds
617a814f14 Merge tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
 "Along with the usual shower of singleton patches, notable patch series
  in this pull request are:

   - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds
     consistency to the APIs and behaviour of these two core allocation
     functions. This also simplifies/enables Rustification.

   - "Some cleanups for shmem" from Baolin Wang. No functional changes -
     mode code reuse, better function naming, logic simplifications.

   - "mm: some small page fault cleanups" from Josef Bacik. No
     functional changes - code cleanups only.

   - "Various memory tiering fixes" from Zi Yan. A small fix and a
     little cleanup.

   - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and
     simplifications and .text shrinkage.

   - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel
     Butt. This is a feature, it adds new feilds to /proc/vmstat such as

       $ grep kstack /proc/vmstat
       kstack_1k 3
       kstack_2k 188
       kstack_4k 11391
       kstack_8k 243
       kstack_16k 0

     which tells us that 11391 processes used 4k of stack while none at
     all used 16k. Useful for some system tuning things, but
     partivularly useful for "the dynamic kernel stack project".

   - "kmemleak: support for percpu memory leak detect" from Pavel
     Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory.

   - "mm: memcg: page counters optimizations" from Roman Gushchin. "3
     independent small optimizations of page counters".

   - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from
     David Hildenbrand. Improves PTE/PMD splitlock detection, makes
     powerpc/8xx work correctly by design rather than by accident.

   - "mm: remove arch_make_page_accessible()" from David Hildenbrand.
     Some folio conversions which make arch_make_page_accessible()
     unneeded.

   - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David
     Finkel. Cleans up and fixes our handling of the resetting of the
     cgroup/process peak-memory-use detector.

   - "Make core VMA operations internal and testable" from Lorenzo
     Stoakes. Rationalizaion and encapsulation of the VMA manipulation
     APIs. With a view to better enable testing of the VMA functions,
     even from a userspace-only harness.

   - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix
     issues in the zswap global shrinker, resulting in improved
     performance.

   - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill
     in some missing info in /proc/zoneinfo.

   - "mm: replace follow_page() by folio_walk" from David Hildenbrand.
     Code cleanups and rationalizations (conversion to folio_walk())
     resulting in the removal of follow_page().

   - "improving dynamic zswap shrinker protection scheme" from Nhat
     Pham. Some tuning to improve zswap's dynamic shrinker. Significant
     reductions in swapin and improvements in performance are shown.

   - "mm: Fix several issues with unaccepted memory" from Kirill
     Shutemov. Improvements to the new unaccepted memory feature,

   - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on
     DAX PUDs. This was missing, although nobody seems to have notied
     yet.

   - "Introduce a store type enum for the Maple tree" from Sidhartha
     Kumar. Cleanups and modest performance improvements for the maple
     tree library code.

   - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move
     more cgroup v1 remnants away from the v2 memcg code.

   - "memcg: initiate deprecation of v1 features" from Shakeel Butt.
     Adds various warnings telling users that memcg v1 features are
     deprecated.

   - "mm: swap: mTHP swap allocator base on swap cluster order" from
     Chris Li. Greatly improves the success rate of the mTHP swap
     allocation.

   - "mm: introduce numa_memblks" from Mike Rapoport. Moves various
     disparate per-arch implementations of numa_memblk code into generic
     code.

   - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly
     improves the performance of munmap() of swap-filled ptes.

   - "support large folio swap-out and swap-in for shmem" from Baolin
     Wang. With this series we no longer split shmem large folios into
     simgle-page folios when swapping out shmem.

   - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice
     performance improvements and code reductions for gigantic folios.

   - "support shmem mTHP collapse" from Baolin Wang. Adds support for
     khugepaged's collapsing of shmem mTHP folios.

   - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect()
     performance regression due to the addition of mseal().

   - "Increase the number of bits available in page_type" from Matthew
     Wilcox. Increases the number of bits available in page_type!

   - "Simplify the page flags a little" from Matthew Wilcox. Many legacy
     page flags are now folio flags, so the page-based flags and their
     accessors/mutators can be removed.

   - "mm: store zero pages to be swapped out in a bitmap" from Usama
     Arif. An optimization which permits us to avoid writing/reading
     zero-filled zswap pages to backing store.

   - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race
     window which occurs when a MAP_FIXED operqtion is occurring during
     an unrelated vma tree walk.

   - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of
     the vma_merge() functionality, making ot cleaner, more testable and
     better tested.

   - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park.
     Minor fixups of DAMON selftests and kunit tests.

   - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang.
     Code cleanups and folio conversions.

   - "Shmem mTHP controls and stats improvements" from Ryan Roberts.
     Cleanups for shmem controls and stats.

   - "mm: count the number of anonymous THPs per size" from Barry Song.
     Expose additional anon THP stats to userspace for improved tuning.

   - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more
     folio conversions and removal of now-unused page-based APIs.

   - "replace per-quota region priorities histogram buffer with
     per-context one" from SeongJae Park. DAMON histogram
     rationalization.

   - "Docs/damon: update GitHub repo URLs and maintainer-profile" from
     SeongJae Park. DAMON documentation updates.

   - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and
     improve related doc and warn" from Jason Wang: fixes usage of page
     allocator __GFP_NOFAIL and GFP_ATOMIC flags.

   - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy.
     This was overprovisioning THPs in sparsely accessed memory areas.

   - "zram: introduce custom comp backends API" frm Sergey Senozhatsky.
     Add support for zram run-time compression algorithm tuning.

   - "mm: Care about shadow stack guard gap when getting an unmapped
     area" from Mark Brown. Fix up the various arch_get_unmapped_area()
     implementations to better respect guard areas.

   - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability
     of mem_cgroup_iter() and various code cleanups.

   - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge
     pfnmap support.

   - "resource: Fix region_intersects() vs add_memory_driver_managed()"
     from Huang Ying. Fix a bug in region_intersects() for systems with
     CXL memory.

   - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches
     a couple more code paths to correctly recover from the encountering
     of poisoned memry.

   - "mm: enable large folios swap-in support" from Barry Song. Support
     the swapin of mTHP memory into appropriately-sized folios, rather
     than into single-page folios"

* tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits)
  zram: free secondary algorithms names
  uprobes: turn xol_area->pages[2] into xol_area->page
  uprobes: introduce the global struct vm_special_mapping xol_mapping
  Revert "uprobes: use vm_special_mapping close() functionality"
  mm: support large folios swap-in for sync io devices
  mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
  mm: fix swap_read_folio_zeromap() for large folios with partial zeromap
  mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries
  set_memory: add __must_check to generic stubs
  mm/vma: return the exact errno in vms_gather_munmap_vmas()
  memcg: cleanup with !CONFIG_MEMCG_V1
  mm/show_mem.c: report alloc tags in human readable units
  mm: support poison recovery from copy_present_page()
  mm: support poison recovery from do_cow_fault()
  resource, kunit: add test case for region_intersects()
  resource: make alloc_free_mem_region() works for iomem_resource
  mm: z3fold: deprecate CONFIG_Z3FOLD
  vfio/pci: implement huge_fault support
  mm/arm64: support large pfn mappings
  mm/x86: support large pfn mappings
  ...
2024-09-21 07:29:05 -07:00
Linus Torvalds
056f8c437d Merge tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
 "Lots of cleanups and bug fixes this cycle, primarily in the block
  allocation, extent management, fast commit, and journalling"

* tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (93 commits)
  ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C
  ext4: check stripe size compatibility on remount as well
  ext4: fix i_data_sem unlock order in ext4_ind_migrate()
  ext4: remove the special buffer dirty handling in do_journal_get_write_access
  ext4: fix a potential assertion failure due to improperly dirtied buffer
  ext4: hoist ext4_block_write_begin and replace the __block_write_begin
  ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers
  ext4: dax: keep orphan list before truncate overflow allocated blocks
  ext4: fix error message when rejecting the default hash
  ext4: save unnecessary indentation in ext4_ext_create_new_leaf()
  ext4: make some fast commit functions reuse extents path
  ext4: refactor ext4_swap_extents() to reuse extents path
  ext4: get rid of ppath in convert_initialized_extent()
  ext4: get rid of ppath in ext4_ext_handle_unwritten_extents()
  ext4: get rid of ppath in ext4_ext_convert_to_initialized()
  ext4: get rid of ppath in ext4_convert_unwritten_extents_endio()
  ext4: get rid of ppath in ext4_split_convert_extents()
  ext4: get rid of ppath in ext4_split_extent()
  ext4: get rid of ppath in ext4_force_split_extent_at()
  ext4: get rid of ppath in ext4_split_extent_at()
  ...
2024-09-20 19:26:45 -07:00
Linus Torvalds
84bbfe6b64 Merge tag 'platform-drivers-x86-v6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform drivers updates from Hans de Goede:

 - asus-wmi: Add support for vivobook fan profiles

 - dell-laptop: Add knobs to change battery charge settings

 - lg-laptop: Add operation region support

 - intel-uncore-freq: Add support for efficiency latency control

 - intel/ifs: Add SBAF test support

 - intel/pmc: Ignore all LTRs during suspend

 - platform/surface: Support for arm64 based Surface devices

 - wmi: Pass event data directly to legacy notify handlers

 - x86/platform/geode: switch GPIO buttons and LEDs to software
   properties

 - bunch of small cleanups, fixes, hw-id additions, etc.

* tag 'platform-drivers-x86-v6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (65 commits)
  MAINTAINERS: adjust file entry in INTEL MID PLATFORM
  platform/x86: x86-android-tablets: Adjust Xiaomi Pad 2 bottom bezel touch buttons LED
  platform/mellanox: mlxbf-pmc: fix lockdep warning
  platform/x86/amd: pmf: Add quirk for TUF Gaming A14
  platform/x86: touchscreen_dmi: add nanote-next quirk
  platform/x86: asus-wmi: don't fail if platform_profile already registered
  platform/x86: asus-wmi: add debug print in more key places
  platform/x86: intel_scu_wdt: Move intel_scu_wdt.h to x86 subfolder
  platform/x86: intel_scu_ipc: Move intel_scu_ipc.h out of arch/x86/include/asm
  MAINTAINERS: Add Intel MID section
  platform/x86: panasonic-laptop: Add support for programmable buttons
  platform/olpc: Remove redundant null pointer checks in olpc_ec_setup_debugfs()
  platform/x86: intel/pmc: Ignore all LTRs during suspend
  platform/x86: wmi: Call both legacy and WMI driver notify handlers
  platform/x86: wmi: Merge get_event_data() with wmi_get_notify_data()
  platform/x86: wmi: Remove wmi_get_event_data()
  platform/x86: wmi: Pass event data directly to legacy notify handlers
  platform/x86: thinkpad_acpi: Fix uninitialized symbol 's' warning
  platform/x86: x86-android-tablets: Fix spelling in the comments
  platform/x86: ideapad-laptop: Make the scope_guard() clear of its scope
  ...
2024-09-19 09:16:04 +02:00
Linus Torvalds
eec91e22fe Merge tag 'iommu-updates-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Joerg Roedel:
 "Core changes:
   - Allow ATS on VF when parent device is identity mapped
   - Optimize unmap path on ARM io-pagetable implementation
   - Use of_property_present()

  ARM-SMMU changes:
   - SMMUv2:
       - Devicetree binding updates for Qualcomm MMU-500 implementations
       - Extend workarounds for broken Qualcomm hypervisor to avoid
         touching features that are not available (e.g. 16KiB page
         support, reserved context banks)
   - SMMUv3:
       - Support for NVIDIA's custom virtual command queue hardware
       - Fix Stage-2 stall configuration and extend tests to cover this
         area
       - A bunch of driver cleanups, including simplification of the
         master rbtree code
   - Minor cleanups and fixes across both drivers

  Intel VT-d changes:
   - Retire si_domain and convert to use static identity domain
   - Batched IOTLB/dev-IOTLB invalidation
   - Small code refactoring and cleanups

  AMD-Vi changes:
   - Cleanup and refactoring of io-pagetable code
   - Add parameter to limit the used io-pagesizes
   - Other cleanups and fixes"

* tag 'iommu-updates-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (77 commits)
  dt-bindings: arm-smmu: Add compatible for QCS8300 SoC
  iommu/amd: Test for PAGING domains before freeing a domain
  iommu/amd: Fix argument order in amd_iommu_dev_flush_pasid_all()
  iommu/amd: Add kernel parameters to limit V1 page-sizes
  iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg
  iommu/arm-smmu-v3: Add types for each level of the CD table
  iommu/arm-smmu-v3: Shrink the cdtab l1_desc array
  iommu/arm-smmu-v3: Do not use devm for the cd table allocations
  iommu/arm-smmu-v3: Remove strtab_base/cfg
  iommu/arm-smmu-v3: Reorganize struct arm_smmu_strtab_cfg
  iommu/arm-smmu-v3: Add types for each level of the 2 level stream table
  iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx()
  iommu/arm-smmu-qcom: apply num_context_bank fixes for SDM630 / SDM660
  iommu/arm-smmu-v3: Use the new rb tree helpers
  dt-bindings: arm-smmu: document the support on SA8255p
  iommu/tegra241-cmdqv: Do not allocate vcmdq until dma_set_mask_and_coherent
  iommu/tegra241-cmdqv: Drop static at local variable
  iommu/tegra241-cmdqv: Fix ioremap() error handling in probe()
  iommu/amd: Do not set the D bit on AMD v2 table entries
  iommu/amd: Correct the reported page sizes from the V1 table
  ...
2024-09-18 12:45:52 +02:00
Linus Torvalds
7c9026b2b0 Merge tag 'pstore-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore updates from Kees Cook:

 - ramoops: Fix .rst typo (Steven Rostedt)

 - pstore: replace spinlock_t by raw_spinlock_t (Wen Yang)

* tag 'pstore-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  pstore: replace spinlock_t by raw_spinlock_t
  pstore/ramoops: Fix typo as there is no "reserver"
2024-09-18 11:47:03 +02:00
Linus Torvalds
067610ebaa Merge tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux
Pull RCU updates from Neeraj Upadhyay:
 "Context tracking:
   - rename context tracking state related symbols and remove references
     to "dynticks" in various context tracking state variables and
     related helpers
   - force context_tracking_enabled_this_cpu() to be inlined to avoid
     leaving a noinstr section

  CSD lock:
   - enhance CSD-lock diagnostic reports
   - add an API to provide an indication of ongoing CSD-lock stall

  nocb:
   - update and simplify RCU nocb code to handle (de-)offloading of
     callbacks only for offline CPUs
   - fix RT throttling hrtimer being armed from offline CPU

  rcutorture:
   - remove redundant rcu_torture_ops get_gp_completed fields
   - add SRCU ->same_gp_state and ->get_comp_state functions
   - add generic test for NUM_ACTIVE_*RCU_POLL* for testing RCU and SRCU
     polled grace periods
   - add CFcommon.arch for arch-specific Kconfig options
   - print number of update types in rcu_torture_write_types()
   - add rcutree.nohz_full_patience_delay testing to the TREE07 scenario
   - add a stall_cpu_repeat module parameter to test repeated CPU stalls
   - add argument to limit number of CPUs a guest OS can use in
     torture.sh

  rcustall:
   - abbreviate RCU CPU stall warnings during CSD-lock stalls
   - Allow dump_cpu_task() to be called without disabling preemption
   - defer printing stall-warning backtrace when holding rcu_node lock

  srcu:
   - make SRCU gp seq wrap-around faster
   - add KCSAN checks for concurrent updates to ->srcu_n_exp_nodelay and
     ->reschedule_count which are used in heuristics governing
     auto-expediting of normal SRCU grace periods and
     grace-period-state-machine delays
   - mark idle SRCU-barrier callbacks to help identify stuck
     SRCU-barrier callback

  rcu tasks:
   - remove RCU Tasks Rude asynchronous APIs as they are no longer used
   - stop testing RCU Tasks Rude asynchronous APIs
   - fix access to non-existent percpu regions
   - check processor-ID assumptions during chosen CPU calculation for
     callback enqueuing
   - update description of rtp->tasks_gp_seq grace-period sequence
     number
   - add rcu_barrier_cb_is_done() to identify whether a given
     rcu_barrier callback is stuck
   - mark idle Tasks-RCU-barrier callbacks
   - add *torture_stats_print() functions to print detailed diagnostics
     for Tasks-RCU variants
   - capture start time of rcu_barrier_tasks*() operation to help
     distinguish a hung barrier operation from a long series of barrier
     operations

  refscale:
   - add a TINY scenario to support tests of Tiny RCU and Tiny
     SRCU
   - optimize process_durations() operation

  rcuscale:
   - dump stacks of stalled rcu_scale_writer() instances and
     grace-period statistics when rcu_scale_writer() stalls
   - mark idle RCU-barrier callbacks to identify stuck RCU-barrier
     callbacks
   - print detailed grace-period and barrier diagnostics on
     rcu_scale_writer() hangs for Tasks-RCU variants
   - warn if async module parameter is specified for RCU implementations
     that do not have async primitives such as RCU Tasks Rude
   - make all writer tasks report upon hang
   - tolerate repeated GFP_KERNEL failure in rcu_scale_writer()
   - use special allocator for rcu_scale_writer()
   - NULL out top-level pointers to heap memory to avoid double-free
     bugs on modprobe failures
   - maintain per-task instead of per-CPU callbacks count to avoid any
     issues with migration of either tasks or callbacks
   - constify struct ref_scale_ops

  Fixes:
   - use system_unbound_wq for kfree_rcu work to avoid disturbing
     isolated CPUs

  Misc:
   - warn on unexpected rcu_state.srs_done_tail state
   - better define "atomic" for list_replace_rcu() and
     hlist_replace_rcu() routines
   - annotate struct kvfree_rcu_bulk_data with __counted_by()"

* tag 'rcu.release.v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: (90 commits)
  rcu: Defer printing stall-warning backtrace when holding rcu_node lock
  rcu/nocb: Remove superfluous memory barrier after bypass enqueue
  rcu/nocb: Conditionally wake up rcuo if not already waiting on GP
  rcu/nocb: Fix RT throttling hrtimer armed from offline CPU
  rcu/nocb: Simplify (de-)offloading state machine
  context_tracking: Tag context_tracking_enabled_this_cpu() __always_inline
  context_tracking, rcu: Rename rcu_dyntick trace event into rcu_watching
  rcu: Update stray documentation references to rcu_dynticks_eqs_{enter, exit}()
  rcu: Rename rcu_momentary_dyntick_idle() into rcu_momentary_eqs()
  rcu: Rename rcu_implicit_dynticks_qs() into rcu_watching_snap_recheck()
  rcu: Rename dyntick_save_progress_counter() into rcu_watching_snap_save()
  rcu: Rename struct rcu_data .exp_dynticks_snap into .exp_watching_snap
  rcu: Rename struct rcu_data .dynticks_snap into .watching_snap
  rcu: Rename rcu_dynticks_zero_in_eqs() into rcu_watching_zero_in_eqs()
  rcu: Rename rcu_dynticks_in_eqs_since() into rcu_watching_snap_stopped_since()
  rcu: Rename rcu_dynticks_in_eqs() into rcu_watching_snap_in_eqs()
  rcu: Rename rcu_dynticks_eqs_online() into rcu_watching_online()
  context_tracking, rcu: Rename rcu_dynticks_curr_cpu_in_eqs() into rcu_is_watching_curr_cpu()
  context_tracking, rcu: Rename rcu_dynticks_task*() into rcu_task*()
  refscale: Constify struct ref_scale_ops
  ...
2024-09-18 07:52:24 +02:00
Linus Torvalds
85a77db95a Merge tag 'wq-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
 "Nothing major:

   - workqueue.panic_on_stall boot param added

   - alloc_workqueue_lockdep_map() added (used by DRM)

   - Other cleanusp and doc updates"

* tag 'wq-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
  kernel/workqueue.c: fix DEFINE_PER_CPU_SHARED_ALIGNED expansion
  workqueue: Fix another htmldocs build warning
  workqueue: fix null-ptr-deref on __alloc_workqueue() error
  workqueue: Don't call va_start / va_end twice
  workqueue: Fix htmldocs build warning
  workqueue: Add interface for user-defined workqueue lockdep map
  workqueue: Change workqueue lockdep map to pointer
  workqueue: Split alloc_workqueue into internal function and lockdep init
  Documentation: kernel-parameters: add workqueue.panic_on_stall
  workqueue: add cmdline parameter workqueue.panic_on_stall
2024-09-18 06:59:44 +02:00
Linus Torvalds
78567e2bc7 Merge tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:

 - cpuset isolation improvements

 - cpuset cgroup1 support is split into its own file behind the new
   config option CONFIG_CPUSET_V1. This makes it the second controller
   which makes cgroup1 support optional after memcg

 - Handling of unavailable v1 controller handling improved during
   cgroup1 mount operations

 - union_find applied to cpuset. It makes code simpler and more
   efficient

 - Reduce spurious events in pids.events

 - Cleanups and other misc changes

 - Contains a merge of cgroup/for-6.11-fixes to receive cpuset fixes
   that further changes build upon

* tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (34 commits)
  cgroup: Do not report unavailable v1 controllers in /proc/cgroups
  cgroup: Disallow mounting v1 hierarchies without controller implementation
  cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
  cgroup/cpuset: Move cpu.h include to cpuset-internal.h
  cgroup/cpuset: add sefltest for cpuset v1
  cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1
  cgroup/cpuset: rename functions shared between v1 and v2
  cgroup/cpuset: move v1 interfaces to cpuset-v1.c
  cgroup/cpuset: move validate_change_legacy to cpuset-v1.c
  cgroup/cpuset: move legacy hotplug update to cpuset-v1.c
  cgroup/cpuset: add callback_lock helper
  cgroup/cpuset: move memory_spread to cpuset-v1.c
  cgroup/cpuset: move relax_domain_level to cpuset-v1.c
  cgroup/cpuset: move memory_pressure to cpuset-v1.c
  cgroup/cpuset: move common code to cpuset-internal.h
  cgroup/cpuset: introduce cpuset-v1.c
  selftest/cgroup: Make test_cpuset_prs.sh deal with pre-isolated CPUs
  cgroup/cpuset: Account for boot time isolated CPUs
  cgroup/cpuset: remove use_parent_ecpus of cpuset
  cgroup/cpuset: remove fetch_xcpus
  ...
2024-09-18 06:39:03 +02:00
Paolo Bonzini
c09dd2bb57 Merge branch 'kvm-redo-enable-virt' into HEAD
Register KVM's cpuhp and syscore callbacks when enabling virtualization in
hardware, as the sole purpose of said callbacks is to disable and re-enable
virtualization as needed.

The primary motivation for this series is to simplify dealing with enabling
virtualization for Intel's TDX, which needs to enable virtualization
when kvm-intel.ko is loaded, i.e. long before the first VM is created.

That said, this is a nice cleanup on its own.  By registering the callbacks
on-demand, the callbacks themselves don't need to check kvm_usage_count,
because their very existence implies a non-zero count.

Patch 1 (re)adds a dedicated lock for kvm_usage_count.  This avoids a
lock ordering issue between cpus_read_lock() and kvm_lock.  The lock
ordering issue still exist in very rare cases, and will be fixed for
good by switching vm_list to an (S)RCU-protected list.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-09-17 11:38:20 -04:00
Linus Torvalds
d58db3f3a0 Merge tag 'docs-6.12' of git://git.lwn.net/linux
Pull documentation update from Jonathan Corbet:
 "Another relatively mundane cycle for docs:

   - The beginning of an EEVDF scheduler document

   - More Chinese translations

   - A rethrashing of our bisection documentation

  ...plus the usual array of smaller fixes, and more than the usual
  number of typo fixes"

* tag 'docs-6.12' of git://git.lwn.net/linux: (48 commits)
  Remove duplicate "and" in 'Linux NVMe docs.
  docs:filesystems: fix spelling and grammar mistakes
  docs:filesystem: fix mispelled words on autofs page
  docs:mm: fixed spelling and grammar mistakes on vmalloc kernel stack page
  Documentation: PCI: fix typo in pci.rst
  docs/zh_CN: add the translation of kbuild/gcc-plugins.rst
  docs/process: fix typos
  docs:mm: fix spelling mistakes in heterogeneous memory management page
  accel/qaic: Fix a typo
  docs/zh_CN: update the translation of security-bugs
  docs: block: Fix grammar and spelling mistakes in bfq-iosched.rst
  Documentation: Fix spelling mistakes
  Documentation/gpu: Fix typo in Documentation/gpu/komeda-kms.rst
  scripts: sphinx-pre-install: remove unnecessary double check for $cur_version
  Loongarch: KVM: Add KVM hypercalls documentation for LoongArch
  Documentation: Document the kernel flag bdev_allow_write_mounted
  docs: scheduler: completion: Update member of struct completion
  docs: kerneldoc-preamble.sty: Suppress extra spaces in CJK literal blocks
  docs: submitting-patches: Advertise b4
  docs: update dev-tools/kcsan.rst url about KTSAN
  ...
2024-09-17 16:44:08 +02:00
Linus Torvalds
9ea925c806 Merge tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "Core:

   - Overhaul of posix-timers in preparation of removing the workaround
     for periodic timers which have signal delivery ignored.

   - Remove the historical extra jiffie in msleep()

     msleep() adds an extra jiffie to the timeout value to ensure
     minimal sleep time. The timer wheel ensures minimal sleep time
     since the large rewrite to a non-cascading wheel, but the extra
     jiffie in msleep() remained unnoticed. Remove it.

   - Make the timer slack handling correct for realtime tasks.

     The procfs interface is inconsistent and does neither reflect
     reality nor conforms to the man page. Show the correct 0 slack for
     real time tasks and enforce it at the core level instead of having
     inconsistent individual checks in various timer setup functions.

   - The usual set of updates and enhancements all over the place.

  Drivers:

   - Allow the ACPI PM timer to be turned off during suspend

   - No new drivers

   - The usual updates and enhancements in various drivers"

* tag 'timers-core-2024-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
  ntp: Make sure RTC is synchronized when time goes backwards
  treewide: Fix wrong singular form of jiffies in comments
  cpu: Use already existing usleep_range()
  timers: Rename next_expiry_recalc() to be unique
  platform/x86:intel/pmc: Fix comment for the pmc_core_acpi_pm_timer_suspend_resume function
  clocksource/drivers/jcore: Use request_percpu_irq()
  clocksource/drivers/cadence-ttc: Add missing clk_disable_unprepare in ttc_setup_clockevent
  clocksource/drivers/asm9260: Add missing clk_disable_unprepare in asm9260_timer_init
  clocksource/drivers/qcom: Add missing iounmap() on errors in msm_dt_timer_init()
  clocksource/drivers/ingenic: Use devm_clk_get_enabled() helpers
  platform/x86:intel/pmc: Enable the ACPI PM Timer to be turned off when suspended
  clocksource: acpi_pm: Add external callback for suspend/resume
  clocksource/drivers/arm_arch_timer: Using for_each_available_child_of_node_scoped()
  dt-bindings: timer: rockchip: Add rk3576 compatible
  timers: Annotate possible non critical data race of next_expiry
  timers: Remove historical extra jiffie for timeout in msleep()
  hrtimer: Use and report correct timerslack values for realtime tasks
  hrtimer: Annotate hrtimer_cpu_base_.*_expiry() for sparse.
  timers: Add sparse annotation for timer_sync_wait_running().
  signal: Replace BUG_ON()s
  ...
2024-09-17 07:25:37 +02:00
Linus Torvalds
a430d95c5e Merge tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm
Pull lsm updates from Paul Moore:

 - Move the LSM framework to static calls

   This transitions the vast majority of the LSM callbacks into static
   calls. Those callbacks which haven't been converted were left as-is
   due to the general ugliness of the changes required to support the
   static call conversion; we can revisit those callbacks at a future
   date.

 - Add the Integrity Policy Enforcement (IPE) LSM

   This adds a new LSM, Integrity Policy Enforcement (IPE). There is
   plenty of documentation about IPE in this patches, so I'll refrain
   from going into too much detail here, but the basic motivation behind
   IPE is to provide a mechanism such that administrators can restrict
   execution to only those binaries which come from integrity protected
   storage, e.g. a dm-verity protected filesystem. You will notice that
   IPE requires additional LSM hooks in the initramfs, dm-verity, and
   fs-verity code, with the associated patches carrying ACK/review tags
   from the associated maintainers. We couldn't find an obvious
   maintainer for the initramfs code, but the IPE patchset has been
   widely posted over several years.

   Both Deven Bowers and Fan Wu have contributed to IPE's development
   over the past several years, with Fan Wu agreeing to serve as the IPE
   maintainer moving forward. Once IPE is accepted into your tree, I'll
   start working with Fan to ensure he has the necessary accounts, keys,
   etc. so that he can start submitting IPE pull requests to you
   directly during the next merge window.

 - Move the lifecycle management of the LSM blobs to the LSM framework

   Management of the LSM blobs (the LSM state buffers attached to
   various kernel structs, typically via a void pointer named "security"
   or similar) has been mixed, some blobs were allocated/managed by
   individual LSMs, others were managed by the LSM framework itself.

   Starting with this pull we move management of all the LSM blobs,
   minus the XFRM blob, into the framework itself, improving consistency
   across LSMs, and reducing the amount of duplicated code across LSMs.
   Due to some additional work required to migrate the XFRM blob, it has
   been left as a todo item for a later date; from a practical
   standpoint this omission should have little impact as only SELinux
   provides a XFRM LSM implementation.

 - Fix problems with the LSM's handling of F_SETOWN

   The LSM hook for the fcntl(F_SETOWN) operation had a couple of
   problems: it was racy with itself, and it was disconnected from the
   associated DAC related logic in such a way that the LSM state could
   be updated in cases where the DAC state would not. We fix both of
   these problems by moving the security_file_set_fowner() hook into the
   same section of code where the DAC attributes are updated. Not only
   does this resolve the DAC/LSM synchronization issue, but as that code
   block is protected by a lock, it also resolve the race condition.

 - Fix potential problems with the security_inode_free() LSM hook

   Due to use of RCU to protect inodes and the placement of the LSM hook
   associated with freeing the inode, there is a bit of a challenge when
   it comes to managing any LSM state associated with an inode. The VFS
   folks are not open to relocating the LSM hook so we have to get
   creative when it comes to releasing an inode's LSM state.
   Traditionally we have used a single LSM callback within the hook that
   is triggered when the inode is "marked for death", but not actually
   released due to RCU.

   Unfortunately, this causes problems for LSMs which want to take an
   action when the inode's associated LSM state is actually released; so
   we add an additional LSM callback, inode_free_security_rcu(), that is
   called when the inode's LSM state is released in the RCU free
   callback.

 - Refactor two LSM hooks to better fit the LSM return value patterns

   The vast majority of the LSM hooks follow the "return 0 on success,
   negative values on failure" pattern, however, there are a small
   handful that have unique return value behaviors which has caused
   confusion in the past and makes it difficult for the BPF verifier to
   properly vet BPF LSM programs. This includes patches to
   convert two of these"special" LSM hooks to the common 0/-ERRNO pattern.

 - Various cleanups and improvements

   A handful of patches to remove redundant code, better leverage the
   IS_ERR_OR_NULL() helper, add missing "static" markings, and do some
   minor style fixups.

* tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (40 commits)
  security: Update file_set_fowner documentation
  fs: Fix file_set_fowner LSM hook inconsistencies
  lsm: Use IS_ERR_OR_NULL() helper function
  lsm: remove LSM_COUNT and LSM_CONFIG_COUNT
  ipe: Remove duplicated include in ipe.c
  lsm: replace indirect LSM hook calls with static calls
  lsm: count the LSMs enabled at compile time
  kernel: Add helper macros for loop unrolling
  init/main.c: Initialize early LSMs after arch code, static keys and calls.
  MAINTAINERS: add IPE entry with Fan Wu as maintainer
  documentation: add IPE documentation
  ipe: kunit test for parser
  scripts: add boot policy generation program
  ipe: enable support for fs-verity as a trust provider
  fsverity: expose verified fsverity built-in signatures to LSMs
  lsm: add security_inode_setintegrity() hook
  ipe: add support for dm-verity as a trust provider
  dm-verity: expose root hash digest and signature data to LSMs
  block,lsm: add LSM blob and new LSM hooks for block devices
  ipe: add permissive toggle
  ...
2024-09-16 18:19:47 +02:00
Linus Torvalds
e8fc317dfc Merge tag 'vfs-6.12.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull procfs updates from Christian Brauner:
 "This contains the following changes for procfs:

   - Add config options and parameters to block forcing memory writes.

     This adds a Kconfig option and boot param to allow removing the
     FOLL_FORCE flag from /proc/<pid>/mem write calls as this can be
     used in various attacks.

     The traditional forcing behavior is kept as default because it can
     break GDB and some other use cases.

     This is the simpler version that you had requested.

   - Restrict overmounting of ephemeral entities.

     It is currently possible to mount on top of various ephemeral
     entities in procfs. This specifically includes magic links. To
     recap, magic links are links of the form /proc/<pid>/fd/<nr>. They
     serve as references to a target file and during path lookup they
     cause a jump to the target path. Such magic links disappear if the
     corresponding file descriptor is closed.

     Currently it is possible to overmount such magic links. This is
     mostly interesting for an attacker that wants to somehow trick a
     process into e.g., reopening something that it didn't intend to
     reopen or to hide a malicious file descriptor.

     But also it risks leaking mounts for long-running processes. When
     overmounting a magic link like above, the mount will not be
     detached when the file descriptor is closed. Only the target
     mountpoint will disappear. Which has the consequence of making it
     impossible to unmount that mount afterwards. So the mount will
     stick around until the process exits and the /proc/<pid>/ directory
     is cleaned up during proc_flush_pid() when the dentries are pruned
     and invalidated.

     That in turn means it's possible for a program to accidentally leak
     mounts and it's also possible to make a task leak mounts without
     it's knowledge if the attacker just keeps overmounting things under
     /proc/<pid>/fd/<nr>.

     Disallow overmounting of such ephemeral entities.

   - Cleanup the readdir method naming in some procfs file operations.

   - Replace kmalloc() and strcpy() with a simple kmemdup() call"

* tag 'vfs-6.12.procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  proc: fold kmalloc() + strcpy() into kmemdup()
  proc: block mounting on top of /proc/<pid>/fdinfo/*
  proc: block mounting on top of /proc/<pid>/fd/*
  proc: block mounting on top of /proc/<pid>/map_files/*
  proc: add proc_splice_unmountable()
  proc: proc_readfdinfo() -> proc_fdinfo_iterate()
  proc: proc_readfd() -> proc_fd_iterate()
  proc: add config & param to block forcing mem writes
2024-09-16 09:36:59 +02:00
Linus Torvalds
02824a5fd1 Merge tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
 "By the number of new lines of code, the most visible change here is
  the addition of hybrid CPU capacity scaling support to the
  intel_pstate driver. Next are the amd-pstate driver changes related to
  the calculation of the AMD boost numerator and preferred core
  detection.

  As far as new hardware support is concerned, the intel_idle driver
  will now handle Granite Rapids Xeon processors natively, the
  intel_rapl power capping driver will recognize family 1Ah of AMD
  processors and Intel ArrowLake-U chipos, and intel_pstate will handle
  Granite Rapids and Sierra Forest chips in the out-of-band (OOB) mode.

  Apart from the above, there is a usual collection of assorted fixes
  and code cleanups in many places and there are tooling updates.

  Specifics:

   - Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef)

   - Add support for Granite Rapids and Sierra Forest in OOB mode to the
     intel_pstate cpufreq driver (Srinivas Pandruvada)

   - Add basic support for CPU capacity scaling on x86 and make the
     intel_pstate driver set asymmetric CPU capacity on hybrid systems
     without SMT (Rafael Wysocki)

   - Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
     driver (Jeff Johnson)

   - Several OF related cleanups in cpufreq drivers (Rob Herring)

   - Enable COMPILE_TEST for ARM drivers (Rob Herrring)

   - Introduce quirks for syscon failures and use socinfo to get
     revision for TI cpufreq driver (Dhruva Gole, Nishanth Menon)

   - Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
     Ugwekar)

   - Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
     (Danila Tikhonov, Huacai Chen, and Liu Jing)

   - Make amd-pstate validate return of any attempt to update EPP
     limits, which fixes the masking hardware problems (Mario
     Limonciello)

   - Move the calculation of the AMD boost numerator outside of
     amd-pstate, correcting acpi-cpufreq on systems with preferred cores
     (Mario Limonciello)

   - Harden preferred core detection in amd-pstate to avoid potential
     false positives (Mario Limonciello)

   - Add extra unit test coverage for mode state machine (Mario
     Limonciello)

   - Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang
     Liu)

   - Add Granite Rapids Xeon support to intel_idle (Artem Bityutskiy)

   - Disable promotion to C1E on Jasper Lake and Elkhart Lake in
     intel_idle (Kai-Heng Feng)

   - Use scoped device node handling to fix missing of_node_put() and
     simplify walking OF children in the riscv-sbi cpuidle driver
     (Krzysztof Kozlowski)

   - Remove dead code from cpuidle_enter_state() (Dhruva Gole)

   - Change an error pointer to NULL to fix error handling in the
     intel_rapl power capping driver (Dan Carpenter)

   - Fix off by one in get_rpi() in the intel_rapl power capping driver
     (Dan Carpenter)

   - Add support for ArrowLake-U to the intel_rapl power capping driver
     (Sumeet Pawnikar)

   - Fix the energy-pkg event for AMD CPUs in the intel_rapl power
     capping driver (Dhananjay Ugwekar)

   - Add support for AMD family 1Ah processors to the intel_rapl power
     capping driver (Dhananjay Ugwekar)

   - Remove unused stub for saveable_highmem_page() and remove
     deprecated macros from power management documentation (Andy
     Shevchenko)

   - Use ysfs_emit() and sysfs_emit_at() in "show" functions in the PM
     sysfs interface (Xueqin Luo)

   - Update the maintainers information for the
     operating-points-v2-ti-cpu DT binding (Dhruva Gole)

   - Drop unnecessary of_match_ptr() from ti-opp-supply (Rob Herring)

   - Add missing MODULE_DESCRIPTION() macros to devfreq governors (Jeff
     Johnson)

   - Use devm_clk_get_enabled() in the exynos-bus devfreq driver (Anand
     Moon)

   - Use of_property_present() instead of of_get_property() in the
     imx-bus devfreq driver (Rob Herring)

   - Update directory handling and installation process in the pm-graph
     Makefile and add .gitignore to ignore sleepgraph.py artifacts to
     pm-graph (Amit Vadhavana, Yo-Jung Lin)

   - Make cpupower display residency value in idle-info (Aboorva
     Devarajan)

   - Add missing powercap_set_enabled() stub function to cpupower (John
     B. Wyatt IV)

   - Add SWIG support to cpupower (John B. Wyatt IV)"

* tag 'pm-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (62 commits)
  cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
  cpufreq/amd-pstate-ut: Add test case for mode switches
  cpufreq/amd-pstate: Export symbols for changing modes
  amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
  cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
  cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
  cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
  x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
  x86/amd: Move amd_get_highest_perf() out of amd-pstate
  ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
  ACPI: CPPC: Drop check for non zero perf ratio
  x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
  ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
  x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
  PM: hibernate: Remove unused stub for saveable_highmem_page()
  pm:cpupower: Add error warning when SWIG is not installed
  MAINTAINERS: Add Maintainers for SWIG Python bindings
  pm:cpupower: Include test_raw_pylibcpupower.py
  pm:cpupower: Add SWIG bindings files for libcpupower
  pm:cpupower: Add missing powercap_set_enabled() stub function
  ...
2024-09-16 07:47:50 +02:00
Linus Torvalds
114143a595 Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
 "The highlights are support for Arm's "Permission Overlay Extension"
  using memory protection keys, support for running as a protected guest
  on Android as well as perf support for a bunch of new interconnect
  PMUs.

  Summary:

  ACPI:
   - Enable PMCG erratum workaround for HiSilicon HIP10 and 11
     platforms.
   - Ensure arm64-specific IORT header is covered by MAINTAINERS.

  CPU Errata:
   - Enable workaround for hardware access/dirty issue on Ampere-1A
     cores.

  Memory management:
   - Define PHYSMEM_END to fix a crash in the amdgpu driver.
   - Avoid tripping over invalid kernel mappings on the kexec() path.
   - Userspace support for the Permission Overlay Extension (POE) using
     protection keys.

  Perf and PMUs:
   - Add support for the "fixed instruction counter" extension in the
     CPU PMU architecture.
   - Extend and fix the event encodings for Apple's M1 CPU PMU.
   - Allow LSM hooks to decide on SPE permissions for physical
     profiling.
   - Add support for the CMN S3 and NI-700 PMUs.

  Confidential Computing:
   - Add support for booting an arm64 kernel as a protected guest under
     Android's "Protected KVM" (pKVM) hypervisor.

  Selftests:
   - Fix vector length issues in the SVE/SME sigreturn tests
   - Fix build warning in the ptrace tests.

  Timers:
   - Add support for PR_{G,S}ET_TSC so that 'rr' can deal with
     non-determinism arising from the architected counter.

  Miscellaneous:
   - Rework our IPI-based CPU stopping code to try NMIs if regular IPIs
     don't succeed.
   - Minor fixes and cleanups"

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (94 commits)
  perf: arm-ni: Fix an NULL vs IS_ERR() bug
  arm64: hibernate: Fix warning for cast from restricted gfp_t
  arm64: esr: Define ESR_ELx_EC_* constants as UL
  arm64: pkeys: remove redundant WARN
  perf: arm_pmuv3: Use BR_RETIRED for HW branch event if enabled
  MAINTAINERS: List Arm interconnect PMUs as supported
  perf: Add driver for Arm NI-700 interconnect PMU
  dt-bindings/perf: Add Arm NI-700 PMU
  perf/arm-cmn: Improve format attr printing
  perf/arm-cmn: Clean up unnecessary NUMA_NO_NODE check
  arm64/mm: use lm_alias() with addresses passed to memblock_free()
  mm: arm64: document why pte is not advanced in contpte_ptep_set_access_flags()
  arm64: Expose the end of the linear map in PHYSMEM_END
  arm64: trans_pgd: mark PTEs entries as valid to avoid dead kexec()
  arm64/mm: Delete __init region from memblock.reserved
  perf/arm-cmn: Support CMN S3
  dt-bindings: perf: arm-cmn: Add CMN S3
  perf/arm-cmn: Refactor DTC PMU register access
  perf/arm-cmn: Make cycle counts less surprising
  perf/arm-cmn: Improve build-time assertion
  ...
2024-09-16 06:55:07 +02:00
Linus Torvalds
963d0d60d6 Merge tag 'x86_bugs_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 hw mitigation updates from Borislav Petkov:

 - Add CONFIG_ option for every hw CPU mitigation. The intent is to
   support configurations and scenarios where the mitigations code is
   irrelevant

 - Other small fixlets and improvements

* tag 'x86_bugs_for_v6.12_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Fix handling when SRSO mitigation is disabled
  x86/bugs: Add missing NO_SSB flag
  Documentation/srso: Document a method for checking safe RET operates properly
  x86/bugs: Add a separate config for GDS
  x86/bugs: Remove GDS Force Kconfig option
  x86/bugs: Add a separate config for SSB
  x86/bugs: Add a separate config for Spectre V2
  x86/bugs: Add a separate config for SRBDS
  x86/bugs: Add a separate config for Spectre v1
  x86/bugs: Add a separate config for RETBLEED
  x86/bugs: Add a separate config for L1TF
  x86/bugs: Add a separate config for MMIO Stable Data
  x86/bugs: Add a separate config for TAA
  x86/bugs: Add a separate config for MDS
2024-09-16 06:48:38 +02:00
Joerg Roedel
97162f6093 Merge branches 'fixes', 'arm/smmu', 'intel/vt-d', 'amd/amd-vi' and 'core' into next 2024-09-13 12:53:05 +02:00
Rafael J. Wysocki
415dff1c96 Merge branch 'pm-cpufreq'
Merge cpufreq updates for 6.12-rc1:

 - Remove LATENCY_MULTIPLIER from cpufreq (Qais Yousef).

 - Add support for Granite Rapids and Sierra Forest in OOB mode to the
   intel_pstate cpufreq driver (Srinivas Pandruvada).

 - Add basic support for CPU capacity scaling on x86 and make the
   intel_pstate driver set asymmetric CPU capacity on hybrid systems
   without SMT (Rafael Wysocki).

 - Add missing MODULE_DESCRIPTION() macros to the powerpc cpufreq
   driver (Jeff Johnson).

 - Several OF related cleanups in cpufreq drivers (Rob Herring).

 - Enable COMPILE_TEST for ARM drivers (Rob Herrring).

 - Introduce quirks for syscon failures and use socinfo to get revision
   for TI cpufreq driver (Dhruva Gole, Nishanth Menon).

 - Minor cleanups in amd-pstate driver (Anastasia Belova, Dhananjay
   Ugwekar).

 - Minor cleanups for loongson, cpufreq-dt and powernv cpufreq drivers
   (Danila Tikhonov, Huacai Chen, and Liu Jing).

 - Make amd-pstate validate return of any attempt to update EPP limits,
   which fixes the masking hardware problems (Mario Limonciello).

 - Move the calculation of the AMD boost numerator outside of amd-pstate,
   correcting acpi-cpufreq on systems with preferred cores (Mario
   Limonciello).

 - Harden preferred core detection in amd-pstate to avoid potential
   false positives (Mario Limonciello).

 - Add extra unit test coverage for mode state machine (Mario
   Limonciello).

 - Fix an "Uninitialized variables" issue in amd-pstste (Qianqiang Liu).

* pm-cpufreq: (35 commits)
  cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
  cpufreq/amd-pstate-ut: Add test case for mode switches
  cpufreq/amd-pstate: Export symbols for changing modes
  amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
  cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
  cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
  cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
  x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
  x86/amd: Move amd_get_highest_perf() out of amd-pstate
  ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
  ACPI: CPPC: Drop check for non zero perf ratio
  x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
  ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
  x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
  cpufreq/amd-pstate: Catch failures for amd_pstate_epp_update_limit()
  cpufreq: ti-cpufreq: Use socinfo to get revision in AM62 family
  cpufreq: Fix the cacography in powernv-cpufreq.c
  cpufreq: ti-cpufreq: Introduce quirks to handle syscon fails appropriately
  cpufreq: loongson3: Use raw_smp_processor_id() in do_service_request()
  cpufreq: amd-pstate: add check for cpufreq_cpu_get's return value
  ...
2024-09-11 18:25:54 +02:00
Rafael J. Wysocki
9bcf30348f Merge tag 'amd-pstate-v6.12-2024-09-11' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Merge the second round of amd-pstate changes for 6.12 from Mario
Limonciello:

"* Move the calculation of the AMD boost numerator outside of
   amd-pstate, correcting acpi-cpufreq on systems with preferred cores
 * Harden preferred core detection to avoid potential false positives
 * Add extra unit test coverage for mode state machine"

* tag 'amd-pstate-v6.12-2024-09-11' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux:
  cpufreq/amd-pstate-ut: Fix an "Uninitialized variables" issue
  cpufreq/amd-pstate-ut: Add test case for mode switches
  cpufreq/amd-pstate: Export symbols for changing modes
  amd-pstate: Add missing documentation for `amd_pstate_prefcore_ranking`
  cpufreq: amd-pstate: Add documentation for `amd_pstate_hw_prefcore`
  cpufreq: amd-pstate: Optimize amd_pstate_update_limits()
  cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
  x86/amd: Detect preferred cores in amd_get_boost_ratio_numerator()
  x86/amd: Move amd_get_highest_perf() out of amd-pstate
  ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn
  ACPI: CPPC: Drop check for non zero perf ratio
  x86/amd: Rename amd_get_highest_perf() to amd_get_boost_ratio_numerator()
  ACPI: CPPC: Adjust return code for inline functions in !CONFIG_ACPI_CPPC_LIB
  x86/amd: Move amd_get_highest_perf() from amd.c to cppc.c
2024-09-11 18:22:23 +02:00
Mario Limonciello
15a2b764ea amd-pstate: Add missing documentation for amd_pstate_prefcore_ranking
`amd_pstate_prefcore_ranking` reflects the dynamic rankings of a CPU
core based on platform conditions.  Explicitly include it in the
documentation.

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11 10:23:23 -05:00
Mario Limonciello
b96b82d1af cpufreq: amd-pstate: Add documentation for amd_pstate_hw_prefcore
Explain that the sysfs file represents both preferred core being
enabled by the user and supported by the hardware.

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11 10:23:23 -05:00
Mario Limonciello
ad4caad58d cpufreq: amd-pstate: Merge amd_pstate_highest_perf_set() into amd_get_boost_ratio_numerator()
The special case in amd_pstate_highest_perf_set() is the value used
for calculating the boost numerator.  Merge this into
amd_get_boost_ratio_numerator() and then use that to calculate boost
ratio.

This allows dropping more special casing of the highest perf value.

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
2024-09-11 10:23:23 -05:00
Thomas Gleixner
2f7eedca6c Merge branch 'linus' into timers/core
To update with the latest fixes.
2024-09-10 13:49:53 +02:00
Joerg Roedel
f0295913c4 iommu/amd: Add kernel parameters to limit V1 page-sizes
Add two new kernel command line parameters to limit the page-sizes
used for v1 page-tables:

	nohugepages     - Limits page-sizes to 4KiB

	v2_pgsizes_only - Limits page-sizes to 4Kib/2Mib/1GiB; The
	                  same as the sizes used with v2 page-tables

This is needed for multiple scenarios. When assigning devices to
SEV-SNP guests the IOMMU page-sizes need to match the sizes in the RMP
table, otherwise the device will not be able to access all shared
memory.

Also, some ATS devices do not work properly with arbitrary IO
page-sizes as supported by AMD-Vi, so limiting the sizes used by the
driver is a suitable workaround.

All-in-all, these parameters are only workarounds until the IOMMU core
and related APIs gather the ability to negotiate the page-sizes in a
better way.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20240905072240.253313-1-joro@8bytes.org
2024-09-10 11:48:57 +02:00
Sergey Senozhatsky
e899007a5e zram: support priority parameter in recompression
recompress device attribute supports alg=NAME parameter so that we can
specify only one particular algorithm we want to perform recompression
with.  However, with algo params we now can have several exactly same
secondary algorithms but each with its own params tuning (e.g.  priority 1
configured to use more aggressive level, and priority 2 configured to use
a pre-trained dictionary).  Support priority=NUM parameter so that we can
correctly determine which secondary algorithm we want to use.

Link: https://lkml.kernel.org/r/20240902105656.1383858-25-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:12 -07:00
Sergey Senozhatsky
97ee4842f2 Documentation/zram: add documentation for algorithm parameters
Document brief description of compression algorithms' parameters:
compression level and pre-trained dictionary.

[senozhatsky@chromium.org: trivial fixup]
  Link: https://lkml.kernel.org/r/20240903063722.1603592-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240902105656.1383858-24-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:11 -07:00
Sergey Senozhatsky
4eac932103 zram: introduce algorithm_params device attribute
This attribute is used to setup compression algorithms' parameters, so we
can tweak algorithms' characteristics.  At this point only 'level' is
supported (to be extended in the future).

Each call sets up parameters for one particular algorithm, which should be
specified either by the algorithm's priority or algo name.  This is
expected to be called after corresponding algorithm is selected via
comp_algorithm or recomp_algorithm.

 echo "priority=0 level=1" > /sys/block/zram0/algorithm_params
or
 echo "algo=zstd level=1" > /sys/block/zram0/algorithm_params

Link: https://lkml.kernel.org/r/20240902105656.1383858-16-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:09 -07:00
Sergey Senozhatsky
917a59e81c zram: introduce custom comp backends API
Moving to custom backends implementation gives us ability to have our own
minimalistic and extendable API, and algorithms tunings becomes possible.

The list of compression backends is empty at this point, we will add
backends in the followup patches.

Link: https://lkml.kernel.org/r/20240902105656.1383858-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:07 -07:00
Usama Arif
81d3ff3c6f mm: add sysfs entry to disable splitting underused THPs
If disabled, THPs faulted in or collapsed will not be added to
_deferred_list, and therefore won't be considered for splitting under
memory pressure if underused.

Link: https://lkml.kernel.org/r/20240830100438.3623486-7-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Cc: Alexander Zhu <alexlzhu@fb.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuang Zhai <zhais@google.com>
Cc: Shuang Zhai <szhai2@cs.rochester.edu>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:04 -07:00
Usama Arif
dafff3f4c8 mm: split underused THPs
This is an attempt to mitigate the issue of running out of memory when THP
is always enabled.  During runtime whenever a THP is being faulted in
(__do_huge_pmd_anonymous_page) or collapsed by khugepaged
(collapse_huge_page), the THP is added to _deferred_list.  Whenever memory
reclaim happens in linux, the kernel runs the deferred_split shrinker
which goes through the _deferred_list.

If the folio was partially mapped, the shrinker attempts to split it.  If
the folio is not partially mapped, the shrinker checks if the THP was
underused, i.e.  how many of the base 4K pages of the entire THP were
zero-filled.  If this number goes above a certain threshold (decided by
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the
shrinker will attempt to split that THP.  Then at remap time, the pages
that were zero-filled are mapped to the shared zeropage, hence saving
memory.

Link: https://lkml.kernel.org/r/20240830100438.3623486-6-usamaarif642@gmail.com
Signed-off-by: Usama Arif <usamaarif642@gmail.com>
Suggested-by: Rik van Riel <riel@surriel.com>
Co-authored-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexander Zhu <alexlzhu@fb.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Shuang Zhai <zhais@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Shuang Zhai <szhai2@cs.rochester.edu>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:04 -07:00
SeongJae Park
23a425aab0 Docs/damon: use damonitor GitHub organization instead of awslabs
Patch series "Docs/damon: update GitHub repo URLs and maintainer-profile".

Replace GitHub URLS on DAMON documents for none-kernel parts DAMON repos
with new ones[1] via the first patch.  With following two patches,
wordsmith maitnainer-profile for better readability, and document the
Google clendsar for bi-weekly meetups, respectively.

[1] https://lore.kernel.org/20240813232158.83903-1-sj@kernel.org


This patch (of 3):

GitHub repos for non-kernel parts of DAMON project including 'damo',
'damon-tests' and 'damoos' will be moved[1] from 'awslabs' org to
'damonitor', by 2024-09-05.  Update related URLs in kernel tree.

[1] https://lore.kernel.org/20240813232158.83903-1-sj@kernel.org

Link: https://lkml.kernel.org/r/20240826015741.80707-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20240826015741.80707-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Alex Shi <alexs@kernel.org>
Cc: Hu Haowen <2023002089@link.tyut.edu.cn>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:39:00 -07:00
Barry Song
8175ebfd30 mm: count the number of partially mapped anonymous THPs per size
When a THP is added to the deferred_list due to partially mapped, its
partial pages are unused, leading to wasted memory and potentially
increasing memory reclamation pressure.

Detailing the specifics of how unmapping occurs is quite difficult and not
that useful, so we adopt a simple approach: each time a THP enters the
deferred_list, we increment the count by 1; whenever it leaves for any
reason, we decrement the count by 1.

Link: https://lkml.kernel.org/r/20240824010441.21308-3-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuai Yuan <yuanshuai@oppo.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:38:57 -07:00
Barry Song
5d65c8d758 mm: count the number of anonymous THPs per size
Patch series "mm: count the number of anonymous THPs per size", v4.

Knowing the number of transparent anon THPs in the system is crucial
for performance analysis. It helps in understanding the ratio and
distribution of THPs versus small folios throughout the system.

Additionally, partial unmapping by userspace can lead to significant waste
of THPs over time and increase memory reclamation pressure. We need this
information for comprehensive system tuning.


This patch (of 2):

Let's track for each anonymous THP size, how many of them are currently
allocated.  We'll track the complete lifespan of an anon THP, starting
when it becomes an anon THP ("large anon folio") (->mapping gets set),
until it gets freed (->mapping gets cleared).

Introduce a new "nr_anon" counter per THP size and adjust the
corresponding counter in the following cases:
* We allocate a new THP and call folio_add_new_anon_rmap() to map
   it the first time and turn it into an anon THP.
* We split an anon THP into multiple smaller ones.
* We migrate an anon THP, when we prepare the destination.
* We free an anon THP back to the buddy.

Note that AnonPages in /proc/meminfo currently tracks the total number of
*mapped* anonymous *pages*, and therefore has slightly different
semantics.  In the future, we might also want to track "nr_anon_mapped"
for each THP size, which might be helpful when comparing it to the number
of allocated anon THPs (long-term pinning, stuck in swapcache, memory
leaks, ...).

Further note that for now, we only track anon THPs after they got their
->mapping set, for example via folio_add_new_anon_rmap().  If we would
allocate some in the swapcache, they will only show up in the statistics
for now after they have been mapped to user space the first time, where we
call folio_add_new_anon_rmap().

[akpm@linux-foundation.org: documentation fixups, per David]
  Link: https://lkml.kernel.org/r/3e8add35-e26b-443b-8a04-1078f4bc78f6@redhat.com
Link: https://lkml.kernel.org/r/20240824010441.21308-1-21cnbao@gmail.com
Link: https://lkml.kernel.org/r/20240824010441.21308-2-21cnbao@gmail.com
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lance Yang <ioworker0@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuai Yuan <yuanshuai@oppo.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-09 16:38:57 -07:00
Anna-Maria Behnsen
bd7c8ff9fe treewide: Fix wrong singular form of jiffies in comments
There are several comments all over the place, which uses a wrong singular
form of jiffies.

Replace 'jiffie' by 'jiffy'. No functional change.

Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Link: https://lore.kernel.org/all/20240904-devel-anna-maria-b4-timers-flseep-v1-3-e98760256370@linutronix.de
2024-09-08 20:47:40 +02:00
Neeraj Upadhyay
355debb83b Merge branches 'context_tracking.15.08.24a', 'csd.lock.15.08.24a', 'nocb.09.09.24a', 'rcutorture.14.08.24a', 'rcustall.09.09.24a', 'srcu.12.08.24a', 'rcu.tasks.14.08.24a', 'rcu_scaling_tests.15.08.24a', 'fixes.12.08.24a' and 'misc.11.08.24a' into next.09.09.24a 2024-09-09 00:09:47 +05:30
Robin Murphy
4d5a7680f2 perf: Add driver for Arm NI-700 interconnect PMU
The Arm NI-700 Network-on-Chip Interconnect has a relatively
straightforward design with a hierarchy of voltage, power, and clock
domains, where each clock domain then contains a number of interface
units and a PMU which can monitor events thereon. As such, it begets a
relatively straightforward driver to interface those PMUs with perf.

Even more so than with arm-cmn, users will require detailed knowledge of
the wider system topology in order to meaningfully analyse anything,
since the interconnect itself cannot know what lies beyond the boundary
of each inscrutably-numbered interface. Given that, for now they are
also expected to refer to the NI-700 documentation for the relevant
event IDs to provide as well. An identifier is implemented so we can
come back and add jevents if anyone really wants to.

Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/9933058d0ab8138c78a61cd6852ea5d5ff48e393.1725470837.git.robin.murphy@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
2024-09-06 12:58:28 +01:00
Guilherme G. Piccoli
e04eb52bfa Documentation: Document the kernel flag bdev_allow_write_mounted
Commit ed5cc702d3 ("block: Add config option to not allow writing to mounted
devices") added a Kconfig option along with a kernel command-line tuning to
control writes to mounted block devices, as a means to deal with fuzzers like
Syzkaller, that provokes kernel crashes by directly writing on block devices
bypassing the filesystem (so the FS has no awareness and cannot cope with that).

The patch just missed adding such kernel command-line option to the kernel
documentation, so let's fix that.

Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Link: https://lore.kernel.org/r/20240828145045.309835-1-gpiccoli@igalia.com
2024-09-05 14:18:28 -06:00
Jonathan Corbet
d224338aa1 Merge tag 'v6.11-rc6' into docs-mw
This is done primarily to get a docs build fix merged via another tree so
that "make htmldocs" stops failing.
2024-09-05 14:01:38 -06:00