Remove the not needed "vm/allocate_pgste" sysctl. It has no effect
anymore. However this is a user space visible change. It shouldn't cause
any problems, however if it does this needs to be partially reverted.
Note that some distributions set
vm/allocate_pgste=1
in one of the various sysctl configuration files. Besides a warning about
the (now) non-existent procfs file this doesn't cause any problems.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.
Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.
Created this by running an spatch followed by a sed command:
Spatch:
virtual patch
@
depends on !(file in "net")
disable optional_qualifier
@
identifier table_name != {
watchdog_hardlockup_sysctl,
iwcm_ctl_table,
ucma_ctl_table,
memory_allocation_profiling_sysctls,
loadpin_sysctl_table
};
@@
+ const
struct ctl_table table_name [] = { ... };
sed:
sed --in-place \
-e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
kernel/utsname_sysctl.c
Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
Commit 78966b550289 ("s390: pgtable: add statistics for PUD and P4D level
page table") misses the call to pagetable_p4d_ctor() against a newly
allocated P4D table in crst_table_upgrade();
Commit 68c601de75d8 ("mm: introduce ctor/dtor at PGD level") misses the
call to pagetable_pgd_ctor() against a newly allocated PGD and the call to
pagetable_dtor() against a newly allocated P4D that is about to be freed
on crst_table_upgrade() PGD upgrade fail path.
The missed constructors and destructor break (at least) the page table
accounting when a process memory space is upgraded.
Link: https://lkml.kernel.org/r/20250123160349.200154-1-agordeev@linux.ibm.com
Fixes: 78966b550289 ("s390: pgtable: add statistics for PUD and P4D level page table")
Fixes: 68c601de75d8 ("mm: introduce ctor/dtor at PGD level")
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reported-by: Heiko Carstens <hca@linux.ibm.com>
Closes: https://lore.kernel.org/all/20250122074954.8685-A-hca@linux.ibm.com/
Suggested-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Acked-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull s390 updates from Vasily Gorbik:
- Remove restrictions on PAI NNPA and crypto counters, enabling
concurrent per-task and system-wide sampling and counting events
- Switch to GENERIC_CPU_DEVICES by setting up the CPU present mask in
the architecture code and letting the generic code handle CPU
bring-up
- Add support for the diag204 busy indication facility to prevent
undesirable blocking during hypervisor logical CPU utilization
queries. Implement results caching
- Improve the handling of Store Data SCLP events by suppressing
unnecessary warning, preventing buffer release in I/O during
failures, and adding timeout handling for Store Data requests to
address potential firmware issues
- Provide optimized __arch_hweight*() implementations
- Remove the unnecessary CPU KOBJ_CHANGE uevents generated during
topology updates, as they are unused and also not present on other
architectures
- Cleanup atomic_ops, optimize __atomic_set() for small values and
__atomic_cmpxchg_bool() for compilers supporting flag output
constraint
- Couple of cleanups for KVM:
- Move and improve KVM struct definitions for DAT tables from
gaccess.c to a new header
- Pass the asce as parameter to sie64a()
- Make the crdte() and cspg() page table handling wrappers return a
boolean to indicate success, like the other existing "compare and
swap" wrappers
- Add documentation for HWCAP flags
- Switch to obtaining total RAM pages from memblock instead of
totalram_pages() during mm init, to ensure correct calculation of
zero page size, when defer_init is enabled
- Refactor lowcore access and switch to using the get_lowcore()
function instead of the S390_lowcore macro
- Cleanups for PG_arch_1 and folio handling in UV and hugetlb code
- Add missing MODULE_DESCRIPTION() macros
- Fix VM_FAULT_HWPOISON handling in do_exception()
* tag 's390-6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (54 commits)
s390/mm: Fix VM_FAULT_HWPOISON handling in do_exception()
s390/kvm: Move bitfields for dat tables
s390/entry: Pass the asce as parameter to sie64a()
s390/sthyi: Use cached data when diag is busy
s390/sthyi: Move diag operations
s390/hypfs_diag: Diag204 busy loop
s390/diag: Add busy-indication-facility requirements
s390/diag: Diag204 add busy return errno
s390/diag: Return errno's from diag204
s390/sclp: Diag204 busy indication facility detection
s390/atomic_ops: Make use of flag output constraint
s390/atomic_ops: Improve __atomic_set() for small values
s390/atomic_ops: Use symbolic names
s390/smp: Switch to GENERIC_CPU_DEVICES
s390/hwcaps: Add documentation for HWCAP flags
s390/pgtable: Make crdte() and cspg() return a value
s390/topology: Remove CPU KOBJ_CHANGE uevents
s390/sclp: Add timeout to Store Data requests
s390/sclp: Prevent release of buffer in I/O
s390/sclp: Suppress unnecessary Store Data warning
...
crst_table_free() used to work with NULL pointers before the conversion
to ptdescs. Since crst_table_free() can be called with a NULL pointer
(error handling in crst_table_upgrade() add an explicit check.
Also add the same check to base_crst_free() for consistency reasons.
In real life this should not happen, since order two GFP_KERNEL
allocations will not fail, unless FAIL_PAGE_ALLOC is enabled and used.
Reported-by: Yunseong Kim <yskelg@gmail.com>
Fixes: 6326c26c15 ("s390: convert various pgalloc functions to use ptdescs")
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CRSTs always have size of four pages, while 2KB-size page tables
always occupy a single page. Use that information to distinguish
page tables from CRSTs.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Cease using 4KB pages to host two 2KB PTEs. That greatly
simplifies the memory management code at the expense of
page tables memory footprint.
Instead of two PTEs per 4KB page use only upper half of
the parent page for a single PTE. With that the list of
half-used pages pgtable_list becomes unneeded.
Further, the upper byte of the parent page _refcount
counter does not need to be used for fragments tracking
and could be left alone.
Commit 8211dad627 ("s390: add pte_free_defer() for
pgtables sharing page") introduced the use of PageActive
flag to coordinate a deferred free with 2KB page table
fragments tracking. Since there is no tracking anymore,
there is no need for using PageActive flag.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
In order to be usable for early boot code move the simple
arch_set_page_dat() function to header file, and add its counter-part
arch_set_page_nodat(). Also change the parameters, and the function name
slightly.
This is required since there aren't any struct pages available in early
boot code, and renaming of functions is done to make sure that all users
are converted to the new API.
Instead of a pointer to a struct page a virtual address is passed, and
instead of an order the number of pages for which the page state needs be
set.
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Commit 6326c26c15 ("s390: convert various pgalloc functions
to use ptdescs") missed to convert tlb_remove_table() into
tlb_remove_ptdesc() in few locations.
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Pull s390 updates from Vasily Gorbik:
- Get rid of private VM_FAULT flags
- Add word-at-a-time implementation
- Add DCACHE_WORD_ACCESS support
- Cleanup control register handling
- Disallow CPU hotplug of CPU 0 to simplify its handling complexity,
following a similar restriction in x86
- Optimize pai crypto map allocation
- Update the list of crypto express EP11 coprocessor operation modes
- Fixes and improvements for secure guests AP pass-through
- Several fixes to address incorrect page marking for address
translation with the "cmma no-dat" feature, preventing potential
incorrect guest TLB flushes
- Fix early IPI handling
- Several virtual vs physical address confusion fixes
- Various small fixes and improvements all over the code
* tag 's390-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (74 commits)
s390/cio: replace deprecated strncpy with strscpy
s390/sclp: replace deprecated strncpy with strtomem
s390/cio: fix virtual vs physical address confusion
s390/cio: export CMG value as decimal
s390: delete the unused store_prefix() function
s390/cmma: fix handling of swapper_pg_dir and invalid_pg_dir
s390/cmma: fix detection of DAT pages
s390/sclp: handle default case in sclp memory notifier
s390/pai_crypto: remove per-cpu variable assignement in event initialization
s390/pai: initialize event count once at initialization
s390/pai_crypto: use PERF_ATTACH_TASK define for per task detection
s390/mm: add missing arch_set_page_dat() call to gmap allocations
s390/mm: add missing arch_set_page_dat() call to vmem_crst_alloc()
s390/cmma: fix initial kernel address space page table walk
s390/diag: add missing virt_to_phys() translation to diag224()
s390/mm,fault: move VM_FAULT_ERROR handling to do_exception()
s390/mm,fault: remove VM_FAULT_BADMAP and VM_FAULT_BADACCESS
s390/mm,fault: remove VM_FAULT_SIGNAL
s390/mm,fault: remove VM_FAULT_BADCONTEXT
s390/mm,fault: simplify kfence fault handling
...
If the cmma no-dat feature is available all pages that are not used for
dynamic address translation are marked as "no-dat" with the ESSA
instruction. This information is visible to the hypervisor, so that the
hypervisor can optimize purging of guest TLB entries. This also means that
pages which are used for dynamic address translation must not be marked as
"no-dat", since the hypervisor may then incorrectly not purge guest TLB
entries.
Region, segment, and page tables allocated within the gmap code are
incorrectly marked as "no-dat", since an explicit call to
arch_set_page_dat() is missing, which would remove the "no-dat" mark.
In order to fix this add a new gmap_alloc_crst() function which should
be used to allocate region and segment tables, and which also calls
arch_set_page_dat().
Also add the arch_set_page_dat() call to page_table_alloc_pgste().
Cc: <stable@vger.kernel.org>
Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels) which will
reduce the overall build time size of the kernel and run time memory
bloat by ~64 bytes per sentinel (further information Link :
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/)
Remove the sentinel element from appldata_table, s390dbf_table,
topology_ctl_table, cmm_table and page_table_sysctl. Reduced the memory
allocation in appldata_register_ops by 1 effectively removing the
sentinel from ops->ctl_table.
This removal is safe because register_sysctl_sz and register_sysctl use
the array size in addition to checking for the sentinel.
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Convert all single control register usages of __local_ctl_load() and
__local_ctl_store() to local_ctl_load() and local_ctl_store().
This also requires to change the type of some struct lowcore members
from __u64 to unsigned long.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Add local and system prefix to some functions to clarify they change
control register contents on either the local CPU or the on all CPUs.
This results in the following API:
Two defines which load and save multiple control registers.
The defines correlate with the following C prototypes:
void __local_ctl_load(unsigned long *, unsigned int cr_low, unsigned int cr_high);
void __local_ctl_store(unsigned long *, unsigned int cr_low, unsigned int cr_high);
Two functions which locally set or clear one bit for a specified
control register:
void local_ctl_set_bit(unsigned int cr, unsigned int bit);
void local_ctl_clear_bit(unsigned int cr, unsigned int bit);
Two functions which set or clear one bit for a specified control
register on all CPUs:
void system_ctl_set_bit(unsigned int cr, unsigned int bit);
void system_ctl_clear_bit(unsigend int cr, unsigned int bit);
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Make use of atomic_fetch_xor() instead of an atomic_cmpxchg() loop to
implement atomic_xor_bits() (aka atomic_xor_return()). This makes the C
code more readable and in addition generates better code, since for z196
and newer a single lax instruction is generated instead of a cmpxchg()
loop.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Use CRST_ALLOC_ORDER to make it more obvious what the order means,
and also to be consistent with other code, e.g. the vmemmap code.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
When CONFIG_DEBUG_VM is defined check that pending remove
and tracking nibbles (bits 31-24 of the page refcount) are
cleared. Should the earlier stages of the page lifespan
have a race or logical error, such check could help in
exposing the issue.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Explicitly encode immediate value of pending remove nibble
(bits 31-28) and tracking nibble (bits 27-24) of the page
refcount whenever these nibbles are tested or changed, for
better readability. Also, add some comments describing how
the fragments are handled.
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
There is a race on concurrent 2KB-pgtables release paths when
both upper and lower halves of the containing parent page are
freed, one via page_table_free_rcu() + __tlb_remove_table(),
and the other via page_table_free(). The race might lead to a
corruption as result of remove of list item in page_table_free()
concurrently with __free_page() in __tlb_remove_table().
Let's assume first the lower and next the upper 2KB-pgtables are
freed from a page. Since both halves of the page are allocated
the tracking byte (bits 24-31 of the page _refcount) has value
of 0x03 initially:
CPU0 CPU1
---- ----
page_table_free_rcu() // lower half
{
// _refcount[31..24] == 0x03
...
atomic_xor_bits(&page->_refcount,
0x11U << (0 + 24));
// _refcount[31..24] <= 0x12
...
table = table | (1U << 0);
tlb_remove_table(tlb, table);
}
...
__tlb_remove_table()
{
// _refcount[31..24] == 0x12
mask = _table & 3;
// mask <= 0x01
...
page_table_free() // upper half
{
// _refcount[31..24] == 0x12
...
atomic_xor_bits(
&page->_refcount,
1U << (1 + 24));
// _refcount[31..24] <= 0x10
// mask <= 0x10
...
atomic_xor_bits(&page->_refcount,
mask << (4 + 24));
// _refcount[31..24] <= 0x00
// mask <= 0x00
...
if (mask != 0) // == false
break;
fallthrough;
...
if (mask & 3) // == false
...
else
__free_page(page); list_del(&page->lru);
^^^^^^^^^^^^^^^^^^ RACE! ^^^^^^^^^^^^^^^^^^^^^
} ...
}
The problem is page_table_free() releases the page as result of
lower nibble unset and __tlb_remove_table() observing zero too
early. With this update page_table_free() will use the similar
logic as page_table_free_rcu() + __tlb_remove_table(), and mark
the fragment as pending for removal in the upper nibble until
after the list_del().
In other words, the parent page is considered as unreferenced and
safe to release only when the lower nibble is cleared already and
unsetting a bit in upper nibble results in that nibble turned zero.
Cc: stable@vger.kernel.org
Suggested-by: Vlastimil Babka <vbabka@suse.com>
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
After adding the missing __va()/__pa() calls to the base asce
functions there are even more casts in the code than before. Make the
code more readable by passing and using pointers to page tables,
instead of using unsigned values for the same purpose.
This allows to get rid of nearly all casts within the code.
Suggested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The base asce functions create/free page tables open-coded to make
sure that the returned asce and page tables do not make use of any
enhanced DAT features like e.g. large pages. This is required for some
I/O functions that use an asce, like e.g. some service call requests.
Handling of virtual vs physical addresses is missing; therefore add
that now.
Note: this currently doesn't fix a real bug, since virtual addresses
are indentical to physical ones.
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
The physical address of page tables is passed around and
used as virtual address in various locations.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Remove set_fs support from s390. With doing this rework address space
handling and simplify it. As a result address spaces are now setup
like this:
CPU running in | %cr1 ASCE | %cr7 ASCE | %cr13 ASCE
----------------------------|-----------|-----------|-----------
user space | user | user | kernel
kernel, normal execution | kernel | user | kernel
kernel, kvm guest execution | gmap | user | kernel
To achieve this the getcpu vdso syscall is removed in order to avoid
secondary address mode and a separate vdso address space in for user
space. The getcpu vdso syscall will be implemented differently with a
subsequent patch.
The kernel accesses user space always via secondary address space.
This happens in different ways:
- with mvcos in home space mode and directly read/write to secondary
address space
- with mvcs/mvcp in primary space mode and copy from primary space to
secondary space or vice versa
- with e.g. cs in secondary space mode and access secondary space
Switching translation modes happens with sacf before and after
instructions which access user space, like before.
Lazy handling of control register reloading is removed in the hope to
make everything simpler, but at the cost of making kernel entry and
exit a bit slower. That is: on kernel entry the primary asce is always
changed to contain the kernel asce, and on kernel exit the primary
asce is changed again so it contains the user asce.
In kernel mode there is only one exception to the primary asce: when
kvm guests are executed the primary asce contains the gmap asce (which
describes the guest address space). The primary asce is reset to
kernel asce whenever kvm guest execution is interrupted, so that this
doesn't has to be taken into account for any user space accesses.
Reviewed-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
Pull s390 fix from Christian Borntraeger:
"Fix a race between page table upgrade and uaccess on s390.
This fixes CVE-2020-11884 which allows for a local kernel crash or
code execution"
* tag 'cve-2020-11884' from emailed bundle:
s390/mm: fix page table upgrade vs 2ndary address mode accesses
A page table upgrade in a kernel section that uses secondary address
mode will mess up the kernel instructions as follows:
Consider the following scenario: two threads are sharing memory.
On CPU1 thread 1 does e.g. strnlen_user(). That gets to
old_fs = enable_sacf_uaccess();
len = strnlen_user_srst(src, size);
and
" la %2,0(%1)\n"
" la %3,0(%0,%1)\n"
" slgr %0,%0\n"
" sacf 256\n"
"0: srst %3,%2\n"
in strnlen_user_srst(). At that point we are in secondary space mode,
control register 1 points to kernel page table and instruction fetching
happens via c1, rather than usual c13. Interrupts are not disabled, for
obvious reasons.
On CPU2 thread 2 does MAP_FIXED mmap(), forcing the upgrade of page table
from 3-level to e.g. 4-level one. We'd allocated new top-level table,
set it up and now we hit this:
notify = 1;
spin_unlock_bh(&mm->page_table_lock);
}
if (notify)
on_each_cpu(__crst_table_upgrade, mm, 0);
OK, we need to actually change over to use of new page table and we
need that to happen in all threads that are currently running. Which
happens to include the thread 1. IPI is delivered and we have
static void __crst_table_upgrade(void *arg)
{
struct mm_struct *mm = arg;
if (current->active_mm == mm)
set_user_asce(mm);
__tlb_flush_local();
}
run on CPU1. That does
static inline void set_user_asce(struct mm_struct *mm)
{
S390_lowcore.user_asce = mm->context.asce;
OK, user page table address updated...
__ctl_load(S390_lowcore.user_asce, 1, 1);
... and control register 1 set to it.
clear_cpu_flag(CIF_ASCE_PRIMARY);
}
IPI is run in home space mode, so it's fine - insns are fetched
using c13, which always points to kernel page table. But as soon
as we return from the interrupt, previous PSW is restored, putting
CPU1 back into secondary space mode, at which point we no longer
get the kernel instructions from the kernel mapping.
The fix is to only fixup the control registers that are currently in use
for user processes during the page table update. We must also disable
interrupts in enable_sacf_uaccess to synchronize the cr and
thread.mm_segment updates against the on_each-cpu.
Fixes: 0aaba41b58 ("s390: remove all code using the access register mode")
Cc: stable@vger.kernel.org # 4.15+
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
References: CVE-2020-11884
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
This update consolidates page table handling code. Because
there are hardly any 31-bit binaries left we do not need to
optimize for that.
No extra efforts are needed to ensure that a compat task does
not map anything above 2GB. The TASK_SIZE limit for 31-bit
tasks is 2GB already and the generic code does check that a
resulting map address would not surpass that limit.
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
There is a maximum of two new tables allocated on page table
upgrade. Because we know that a loop the current implementation
is based on could be unrolled with some improvements:
* upgrade from 3 to 5 levels happens in one go - without an
unnecessary re-take of page_table_lock in-between;
* page tables initialization moved out of the atomic code;
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
The naming of pgtable_page_{ctor,dtor}() seems to have confused a few
people, and until recently arm64 used these erroneously/pointlessly for
other levels of page table.
To make it incredibly clear that these only apply to the PTE level, and to
align with the naming of pgtable_pmd_page_{ctor,dtor}(), let's rename them
to pgtable_pte_page_{ctor,dtor}().
These changes were generated with the following shell script:
----
git grep -lw 'pgtable_page_.tor' | while read FILE; do
sed -i '{s/pgtable_page_ctor/pgtable_pte_page_ctor/}' $FILE;
sed -i '{s/pgtable_page_dtor/pgtable_pte_page_dtor/}' $FILE;
done
----
... with the documentation re-flowed to remain under 80 columns, and
whitespace fixed up in macros to keep backslashes aligned.
There should be no functional change as a result of this patch.
Link: http://lkml.kernel.org/r/20190722141133.3116-1-mark.rutland@arm.com
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> [m68k]
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit eec4844fae ("proc/sysctl: add shared variables for range
check") special shared variables are available for sysctl range check.
Reuse them for /proc/sys/vm/allocate_pgste proc handler.
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Pull RCU changes from Paul E. McKenney:
- Convert RCU's BUG_ON() and similar calls to WARN_ON() and similar.
- Replace calls of RCU-bh and RCU-sched update-side functions
to their vanilla RCU counterparts. This series is a step
towards complete removal of the RCU-bh and RCU-sched update-side
functions.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- Documentation updates, including a number of flavor-consolidation
updates from Joel Fernandes.
- Miscellaneous fixes.
- Automate generation of the initrd filesystem used for
rcutorture testing.
- Convert spin_is_locked() assertions to instead use lockdep.
( Note that some of these conversions are going upstream via their
respective maintainers. )
- SRCU updates, especially including a fix from Dennis Krein
for a bag-on-head-class bug.
- RCU torture-test updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The downgrade of a page table from 3 levels to 2 levels for a 31-bit compat
process removes a pmd table which has to be counted against pgtable_bytes.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Now that call_rcu()'s callback is not invoked until after all
preempt-disable regions of code have completed (in addition to explicitly
marked RCU read-side critical sections), call_rcu() can be used in place
of call_rcu_sched(). This commit therefore makes that change.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <linux-s390@vger.kernel.org>