Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
principal actor for passing mempolicy choice down to __alloc_pages(),
rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).
vma_alloc_folio() and alloc_pages() remain, but as wrappers around
alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the
additional args to policy_nodemask(), which subsumes policy_node().
Cleanup throughout, cutting out some unhelpful "helpers".
It would all be much simpler without MPOL_INTERLEAVE, but that adds a
dynamic to the constant mpol: complicated by v3.6 commit 09c231cb8b
("tmpfs: distribute interleave better across nodes"), which added ino bias
to the interleave, hidden from mm/mempolicy.c until this commit.
Hence "ilx" throughout, the "interleave index". Originally I thought it
could be done just with nid, but that's wrong: the nodemask may come from
the shared policy layer below a shmem vma, or it may come from the task
layer above a shmem vma; and without the final nodemask then nodeid cannot
be decided. And how ilx is applied depends also on page order.
The interleave index is almost always irrelevant unless MPOL_INTERLEAVE:
with one exception in alloc_pages_mpol(), where the NO_INTERLEAVE_INDEX
passed down from vma-less alloc_pages() is also used as hint not to use
THP-style hugepage allocation - to avoid the overhead of a hugepage arg
(though I don't understand why we never just added a GFP bit for THP - if
it actually needs a different allocation strategy from other pages of the
same order). vma_alloc_folio() still carries its hugepage arg here, but
it is not used, and should be removed when agreed.
get_vma_policy() no longer allows a NULL vma: over time I believe we've
eradicated all the places which used to need it e.g. swapoff and madvise
used to pass NULL vma to read_swap_cache_async(), but now know the vma.
[hughd@google.com: handle NULL mpol being passed to __read_swap_cache_async()]
Link: https://lkml.kernel.org/r/ea419956-4751-0102-21f7-9c93cb957892@google.com
Link: https://lkml.kernel.org/r/74e34633-6060-f5e3-aee-7040d43f2e93@google.com
Link: https://lkml.kernel.org/r/1738368e-bac0-fd11-ed7f-b87142a939fe@google.com
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Tejun heo <tj@kernel.org>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Domenico Cerasuolo <mimmocerasuolo@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Per-VMA lock support for swap and userfaults", v7.
When per-VMA locks were introduced in [1] several types of page faults
would still fall back to mmap_lock to keep the patchset simple. Among
them are swap and userfault pages. The main reason for skipping those
cases was the fact that mmap_lock could be dropped while handling these
faults and that required additional logic to be implemented. Implement
the mechanism to allow per-VMA locks to be dropped for these cases.
First, change handle_mm_fault to drop per-VMA locks when returning
VM_FAULT_RETRY or VM_FAULT_COMPLETED to be consistent with the way
mmap_lock is handled. Then change folio_lock_or_retry to accept vm_fault
and return vm_fault_t which simplifies later patches. Finally allow swap
and uffd page faults to be handled under per-VMA locks by dropping per-VMA
and retrying, the same way it's done under mmap_lock. Naturally, once VMA
lock is dropped that VMA should be assumed unstable and can't be used.
This patch (of 6):
Commit [1] introduced IO polling support duding swapin to reduce swap read
latency for block devices that can be polled. However later commit [2]
removed polling support. Therefore it seems safe to remove do_poll
parameter in read_swap_cache_async and always call swap_readpage with
synchronous=false waiting for IO completion in folio_lock_or_retry.
[1] commit 23955622ff ("swap: add block io poll in swapin path")
[2] commit 9650b453a3 ("block: ignore RWF_HIPRI hint for sync dio")
Link: https://lkml.kernel.org/r/20230630211957.1341547-1-surenb@google.com
Link: https://lkml.kernel.org/r/20230630211957.1341547-2-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Suggested-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Minchan Kim <minchan@google.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yu Zhao <yuzhao@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shadow_nodes is for shadow nodes reclaiming of workingset handling, it is
updated when page cache add or delete since long time ago workingset only
supported page cache. But when workingset supports anonymous page
detection, we missied updating shadow nodes for it. This caused that
shadow nodes of anonymous page will never be reclaimd by
scan_shadow_nodes() even they use much memory and system memory is tense.
So update shadow_nodes of anonymous page when swap cache is add or delete
by calling xas_set_update(..workingset_update_node).
Link: https://lkml.kernel.org/r/202301182013032211005@zte.com.cn
Fixes: aae466b005 ("mm/swap: implement workingset detection for anonymous LRU")
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
All its callers either already hold a reference to, or lock the swap
device while calling this function. There is only one exception in
shmem_swapin_folio, just make this caller also hold a reference of the
swap device, so this helper can be simplified and saves a few cycles.
This also provides finer control of error handling in shmem_swapin_folio,
on race (with swap off), it can just try again. For invalid swap entry,
it can fail with a proper error code.
Link: https://lkml.kernel.org/r/20221219185840.25441-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
Convert all callers to pass a folio. Most have the folio
already available. Switch all users from aops->migratepage to
aops->migrate_folio. Also turn the documentation into kerneldoc.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Sterba <dsterba@suse.com>
swap_readpage() is given one page at a time, but may be called repeatedly
in succession.
For block-device swap-space, the blk_plug functionality allows the
multiple pages to be combined together at lower layers. That cannot be
used for SWP_FS_OPS as blk_plug may not exist - it is only active when
CONFIG_BLOCK=y. Consequently all swap reads over NFS are single page
reads.
With this patch we pass in a pointer-to-pointer when swap_readpage can
store state between calls - much like the effect of blk_plug. After
calling swap_readpage() some number of times, the state will be passed to
swap_read_unplug() which can submit the combined request.
Link: https://lkml.kernel.org/r/164859778127.29473.14059420492644907783.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
folios that are written to swap are owned by the MM subsystem - not any
filesystem.
When such a folio is passed to a filesystem to be written out to a
swap-file, the filesystem handles the data, but the folio itself does not
belong to the filesystem. So calling the filesystem's ->dirty_folio()
address_space operation makes no sense. This is for folios in the given
address space, and a folio to be written to swap does not exist in the
given address space.
So drop swap_dirty_folio() which calls the address-space's
->dirty_folio(), and always use noop_dirty_folio(), which is appropriate
for folios being swapped out.
Link: https://lkml.kernel.org/r/164859778123.29473.6900942583784889976.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "MM changes to improve swap-over-NFS support".
Assorted improvements for swap-via-filesystem.
This is a resend of these patches, rebased on current HEAD. The only
substantial changes is that swap_dirty_folio has replaced
swap_set_page_dirty.
Currently swap-via-fs (SWP_FS_OPS) doesn't work for any filesystem. It
has previously worked for NFS but that broke a few releases back. This
series changes to use a new ->swap_rw rather than ->readpage and
->direct_IO. It also makes other improvements.
There is a companion series already in linux-next which fixes various
issues with NFS. Once both series land, a final patch is needed which
changes NFS over to use ->swap_rw.
This patch (of 10):
Many functions declared in include/linux/swap.h are only used within mm/
Create a new "mm/swap.h" and move some of these declarations there.
Remove the redundant 'extern' from the function declarations.
[akpm@linux-foundation.org: mm/memory-failure.c needs mm/swap.h]
Link: https://lkml.kernel.org/r/164859751830.29473.5309689752169286816.stgit@noble.brown
Link: https://lkml.kernel.org/r/164859778120.29473.11725907882296224053.stgit@noble.brown
Signed-off-by: NeilBrown <neilb@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull filesystem folio updates from Matthew Wilcox:
"Primarily this series converts some of the address_space operations to
take a folio instead of a page.
Notably:
- a_ops->is_partially_uptodate() takes a folio instead of a page and
changes the type of the 'from' and 'count' arguments to make it
obvious they're bytes.
- a_ops->invalidatepage() becomes ->invalidate_folio() and has a
similar type change.
- a_ops->launder_page() becomes ->launder_folio()
- a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
address_space as an argument.
There are a couple of other misc changes up front that weren't worth
separating into their own pull request"
* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
fs: Remove aops ->set_page_dirty
fb_defio: Use noop_dirty_folio()
fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
fs: Convert __set_page_dirty_buffers to block_dirty_folio
nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
mm: Convert swap_set_page_dirty() to swap_dirty_folio()
ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
btrfs: Convert extent_range_redirty_for_io() to use folios
fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
btrfs: Convert from set_page_dirty to dirty_folio
fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
fs: Add aops->dirty_folio
fs: Remove aops->launder_page
orangefs: Convert launder_page to launder_folio
nfs: Convert from launder_page to launder_folio
fuse: Convert from launder_page to launder_folio
...
In our testing, a livelock task was found. Through sysrq printing, same
stack was found every time, as follows:
__swap_duplicate+0x58/0x1a0
swapcache_prepare+0x24/0x30
__read_swap_cache_async+0xac/0x220
read_swap_cache_async+0x58/0xa0
swapin_readahead+0x24c/0x628
do_swap_page+0x374/0x8a0
__handle_mm_fault+0x598/0xd60
handle_mm_fault+0x114/0x200
do_page_fault+0x148/0x4d0
do_translation_fault+0xb0/0xd4
do_mem_abort+0x50/0xb0
The reason for the livelock is that swapcache_prepare() always returns
EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
cannot jump out of the loop. We suspect that the task that clears the
SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
priority of the task stuck in a livelock so that the task that clears
the SWAP_HAS_CACHE flag will run. The results show that the system
returns to normal after the priority is lowered.
In our testing, multiple real-time tasks are bound to the same core, and
the task in the livelock is the highest priority task of the core, so
the livelocked task cannot be preempted.
Although cond_resched() is used by __read_swap_cache_async, it is an
empty function in the preemptive system and cannot achieve the purpose
of releasing the CPU. A high-priority task cannot release the CPU
unless preempted by a higher-priority task. But when this task is
already the highest priority task on this core, other tasks will not be
able to be scheduled. So we think we should replace cond_resched() with
schedule_timeout_uninterruptible(1), schedule_timeout_interruptible will
call set_current_state first to set the task state, so the task will be
removed from the running queue, so as to achieve the purpose of giving
up the CPU and prevent it from running in kernel mode for too long.
(akpm: ugly hack becomes uglier. But it fixes the issue in a
backportable-to-stable fashion while we hopefully work on something
better)
Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
Signed-off-by: Guo Ziliang <guo.ziliang@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: Jiang Xuexin <jiang.xuexin@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roger Quadros <rogerq@kernel.org>
Cc: Ziliang Guo <guo.ziliang@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This nets us 178 bytes of savings from removing calls to compound_head.
The three callers all grow a little, but each of them will be converted
to use folios soon, so that's fine.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
With commit 09854ba94c ("mm: do_wp_page() simplification"), after COW,
the idle swap cache page (neither the page nor the corresponding swap
entry is mapped by any process) will be left in the LRU list, even if it's
in the active list or the head of the inactive list. So, the page
reclaimer may take quite some overhead to reclaim these actually unused
pages.
To help the page reclaiming, in this patch, after COW, the idle swap cache
page will be tried to be freed. To avoid to introduce much overhead to
the hot COW code path,
a) there's almost zero overhead for non-swap case via checking
PageSwapCache() firstly.
b) the page lock is acquired via trylock only.
To test the patch, we used pmbench memory accessing benchmark with
working-set larger than available memory on a 2-socket Intel server with a
NVMe SSD as swap device. Test results shows that the pmbench score
increases up to 23.8% with the decreased size of swap cache and swapin
throughput.
Link: https://lkml.kernel.org/r/20210601053143.1380078-1-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org> [use free_swap_cache()]
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>