Because memory-less node's ->node_present_pages and its zone's
->present_pages are all 0, the judgement before calling node_set_state()
to set N_MEMORY, N_HIGH_MEMORY, N_NORMAL_MEMORY for node is enough to skip
memory-less node. The 'continue;' statement inside for_each_node() loop
of free_area_init() is gilding the lily.
Here, remove the special handling to make memory-less node share the same
code flow as normal node.
And also rephrase the code comments above the 'continue' statement
and move them above above line 'if (pgdat->node_present_pages)'.
[bhe@redhat.com: redo code comments, per Mike]
Link: https://lkml.kernel.org/r/ZhYJAVQRYJSTKZng@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20240326061134.1055295-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, in free_area_init_core(), when initialize zone's field, a rough
value is set to zone->managed_pages. That value is calculated by
(zone->present_pages - memmap_pages).
In the meantime, add the value to nr_all_pages and nr_kernel_pages which
represent all free pages of system (only low memory or including HIGHMEM
memory separately). Both of them are gonna be used in
alloc_large_system_hash().
However, the rough calculation and setting of zone->managed_pages is
meaningless because
a) memmap pages are allocated on units of node in sparse_init() or
alloc_node_mem_map(pgdat); The simple (zone->present_pages -
memmap_pages) is too rough to make sense for zone;
b) the set zone->managed_pages will be zeroed out and reset with
acutal value in mem_init() via memblock_free_all(). Before the
resetting, no buddy allocation request is issued.
Here, remove the meaningless and complicated calculation of
(zone->present_pages - memmap_pages), directly set zone->managed_pages as
zone->present_pages for now. It will be adjusted in mem_init().
And also remove the assignment of nr_all_pages and nr_kernel_pages in
free_area_init_core(). Instead, call the newly added
calc_nr_kernel_pages() to count up all free but not reserved memory in
memblock and assign to nr_all_pages and nr_kernel_pages. The counting
excludes memmap_pages, and other kernel used data, which is more accurate
than old way and simpler, and can also cover the ppc required
arch_reserved_kernel_pages() case.
And also clean up the outdated code comment above free_area_init_core().
And free_area_init_core() is easy to understand now, no need to add words
to explain.
[bhe@redhat.com: initialize zone->managed_pages as zone->present_pages for now]
Link: https://lkml.kernel.org/r/ZgU0bsJ2FEjykvju@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20240325145646.1044760-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This is a preparation to calculate nr_kernel_pages and nr_all_pages, both
of which will be used later in alloc_large_system_hash().
nr_all_pages counts up all free but not reserved memory in memblock
allocator, including HIGHMEM memory. While nr_kernel_pages counts up all
free but not reserved low memory in memblock allocator, excluding HIGHMEM
memory.
Link: https://lkml.kernel.org/r/20240325145646.1044760-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The new boot flow when it comes to initialization of gigantic pages is as
follows:
- At boot time, for a gigantic page during __alloc_bootmem_hugepage, the
region after the first struct page is marked as noinit.
- This results in only the first struct page to be initialized in
reserve_bootmem_region. As the tail struct pages are not initialized at
this point, there can be a significant saving in boot time if HVO
succeeds later on.
- Later on in the boot, the head page is prepped and the first
HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page) - 1 tail struct pages
are initialized.
- HVO is attempted. If it is not successful, then the rest of the tail
struct pages are initialized. If it is successful, no more tail struct
pages need to be initialized saving significant boot time.
The WARN_ON for increased ref count in gather_bootmem_prealloc was changed
to a VM_BUG_ON. This is OK as there should be no speculative references
this early in boot process. The VM_BUG_ON's are there just in case such
code is introduced.
[akpm@linux-foundation.org: make it nicer for 80 cols]
Link: https://lkml.kernel.org/r/20230913105401.519709-5-usama.arif@bytedance.com
Signed-off-by: Usama Arif <usama.arif@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Punit Agrawal <punit.agrawal@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
For system with kernelcore=mirror enabled while no mirrored memory is
reported by efi. This could lead to kernel OOM during startup since all
memory beside zone DMA are in the movable zone and this prevents the
kernel to use it.
Zone DMA/DMA32 initialization is independent of mirrored memory and their
max pfn is set in zone_sizes_init(). Since kernel can fallback to zone
DMA/DMA32 if there is no memory in zone Normal, these zones are seen as
mirrored memory no mather their memory attributes are.
To solve this problem, disable kernelcore=mirror when there is no real
mirrored memory exists.
Link: https://lkml.kernel.org/r/20230802072328.2107981-1-mawupeng1@huawei.com
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Suggested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Levi Yun <ppbuk5246@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pull mm updates from Andrew Morton:
- Yosry Ahmed brought back some cgroup v1 stats in OOM logs
- Yosry has also eliminated cgroup's atomic rstat flushing
- Nhat Pham adds the new cachestat() syscall. It provides userspace
with the ability to query pagecache status - a similar concept to
mincore() but more powerful and with improved usability
- Mel Gorman provides more optimizations for compaction, reducing the
prevalence of page rescanning
- Lorenzo Stoakes has done some maintanance work on the
get_user_pages() interface
- Liam Howlett continues with cleanups and maintenance work to the
maple tree code. Peng Zhang also does some work on maple tree
- Johannes Weiner has done some cleanup work on the compaction code
- David Hildenbrand has contributed additional selftests for
get_user_pages()
- Thomas Gleixner has contributed some maintenance and optimization
work for the vmalloc code
- Baolin Wang has provided some compaction cleanups,
- SeongJae Park continues maintenance work on the DAMON code
- Huang Ying has done some maintenance on the swap code's usage of
device refcounting
- Christoph Hellwig has some cleanups for the filemap/directio code
- Ryan Roberts provides two patch series which yield some
rationalization of the kernel's access to pte entries - use the
provided APIs rather than open-coding accesses
- Lorenzo Stoakes has some fixes to the interaction between pagecache
and directio access to file mappings
- John Hubbard has a series of fixes to the MM selftesting code
- ZhangPeng continues the folio conversion campaign
- Hugh Dickins has been working on the pagetable handling code, mainly
with a view to reducing the load on the mmap_lock
- Catalin Marinas has reduced the arm64 kmalloc() minimum alignment
from 128 to 8
- Domenico Cerasuolo has improved the zswap reclaim mechanism by
reorganizing the LRU management
- Matthew Wilcox provides some fixups to make gfs2 work better with the
buffer_head code
- Vishal Moola also has done some folio conversion work
- Matthew Wilcox has removed the remnants of the pagevec code - their
functionality is migrated over to struct folio_batch
* tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (380 commits)
mm/hugetlb: remove hugetlb_set_page_subpool()
mm: nommu: correct the range of mmap_sem_read_lock in task_mem()
hugetlb: revert use of page_cache_next_miss()
Revert "page cache: fix page_cache_next/prev_miss off by one"
mm/vmscan: fix root proactive reclaim unthrottling unbalanced node
mm: memcg: rename and document global_reclaim()
mm: kill [add|del]_page_to_lru_list()
mm: compaction: convert to use a folio in isolate_migratepages_block()
mm: zswap: fix double invalidate with exclusive loads
mm: remove unnecessary pagevec includes
mm: remove references to pagevec
mm: rename invalidate_mapping_pagevec to mapping_try_invalidate
mm: remove struct pagevec
net: convert sunrpc from pagevec to folio_batch
i915: convert i915_gpu_error to use a folio_batch
pagevec: rename fbatch_count()
mm: remove check_move_unevictable_pages()
drm: convert drm_gem_put_pages() to use a folio_batch
i915: convert shmem_sg_free_table() to use a folio_batch
scatterlist: add sg_set_folio()
...
early_pfn_to_nid() is called frequently in init_reserved_page(), it
returns the node id of the PFN. These PFN are probably from the same
memory region, they have the same node id. It's not necessary to call
early_pfn_to_nid() for each PFN.
Pass nid to reserve_bootmem_region() and drop the call to
early_pfn_to_nid() in init_reserved_page(). Also, set nid on all reserved
pages before doing this, as some reserved memory regions may not be set
nid.
The most beneficial function is memmap_init_reserved_pages() if
CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled.
The following data was tested on an x86 machine with 190GB of RAM.
before:
memmap_init_reserved_pages() 67ms
after:
memmap_init_reserved_pages() 20ms
Link: https://lkml.kernel.org/r/20230619023406.424298-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, no matter whether a node actually has memory or not,
calculate_node_totalpages() is used to account number of pages in
zone/node. However, for node without memory, these unnecessary
calculations can be skipped. All the zone/node page counts can be set to
0 directly. So introduce reset_memoryless_node_totalpages() to perform
this action.
Furthermore, calculate_node_totalpages() only gets called for the node
with memory.
Link: https://lkml.kernel.org/r/20230526085251.1977-1-haifeng.xu@shopee.com
Signed-off-by: Haifeng Xu <haifeng.xu@shopee.com>
Suggested-by: Mike Rapoport <rppt@kernel.org>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
UEFI Specification version 2.9 introduces the concept of memory
acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
SEV-SNP, require memory to be accepted before it can be used by the
guest. Accepting happens via a protocol specific to the Virtual Machine
platform.
There are several ways the kernel can deal with unaccepted memory:
1. Accept all the memory during boot. It is easy to implement and it
doesn't have runtime cost once the system is booted. The downside is
very long boot time.
Accept can be parallelized to multiple CPUs to keep it manageable
(i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
memory bandwidth and does not scale beyond the point.
2. Accept a block of memory on the first use. It requires more
infrastructure and changes in page allocator to make it work, but
it provides good boot time.
On-demand memory accept means latency spikes every time kernel steps
onto a new memory block. The spikes will go away once workload data
set size gets stabilized or all memory gets accepted.
3. Accept all memory in background. Introduce a thread (or multiple)
that gets memory accepted proactively. It will minimize time the
system experience latency spikes on memory allocation while keeping
low boot time.
This approach cannot function on its own. It is an extension of #2:
background memory acceptance requires functional scheduler, but the
page allocator may need to tap into unaccepted memory before that.
The downside of the approach is that these threads also steal CPU
cycles and memory bandwidth from the user's workload and may hurt
user experience.
Implement #1 and #2 for now. #2 is the default. Some workloads may want
to use #1 with accept_memory=eager in kernel command line. #3 can be
implemented later based on user's demands.
Support of unaccepted memory requires a few changes in core-mm code:
- memblock accepts memory on allocation. It serves early boot memory
allocations and doesn't limit them to pre-accepted pool of memory.
- page allocator accepts memory on the first allocation of the page.
When kernel runs out of accepted memory, it accepts memory until the
high watermark is reached. It helps to minimize fragmentation.
EFI code will provide two helpers if the platform supports unaccepted
memory:
- accept_memory() makes a range of physical addresses accepted.
- range_contains_unaccepted_memory() checks anything within the range
of physical addresses requires acceptance.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mike Rapoport <rppt@linux.ibm.com> # memblock
Link: https://lore.kernel.org/r/20230606142637.5171-2-kirill.shutemov@linux.intel.com
clang produces a build failure on x86 for some randconfig builds after a
change that moves around code to mm/mm_init.c:
Cannot find symbol for section 2: .text.
mm/mm_init.o: failed
I have not been able to figure out why this happens, but the __weak
annotation on arch_has_descending_max_zone_pfns() is the trigger here.
Removing the weak function in favor of an open-coded Kconfig option check
avoids the problem and becomes clearer as well as better to optimize by
the compiler.
[arnd@arndb.de: fix logic bug]
Link: https://lkml.kernel.org/r/20230415081904.969049-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20230414080418.110236-1-arnd@kernel.org
Fixes: 9420f89db2 ("mm: move most of core MM initialization to mm/mm_init.c")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: SeongJae Park <sj@kernel.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>