Commit Graph

8528 Commits

Author SHA1 Message Date
Linus Torvalds
40f92e79b0 Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - MD changes via Yu:
     - fix UAF due to stack memory used for bio mempool (Jinchao)
     - fix raid10/raid1 nowait IO error path (Nigel and Qixing)
     - fix kernel crash from reading bitmap sysfs entry (Håkon)

 - Fix for a UAF in the nbd connect error path

 - Fix for blocksize being bigger than pagesize, if THP isn't enabled

* tag 'block-6.16-20250710' of git://git.kernel.dk/linux:
  block: reject bs > ps block devices when THP is disabled
  nbd: fix uaf in nbd_genl_connect() error path
  md/md-bitmap: fix GPF in bitmap_get_stats()
  md/raid1,raid10: strip REQ_NOWAIT from member bios
  raid10: cleanup memleak at raid10_make_request
  md/raid1: Fix stack memory use after return in raid1_reshape
2025-07-11 10:35:54 -07:00
Zheng Qixing
aa9552438e nbd: fix uaf in nbd_genl_connect() error path
There is a use-after-free issue in nbd:

block nbd6: Receive control failed (result -104)
block nbd6: shutting down sockets
==================================================================
BUG: KASAN: slab-use-after-free in recv_work+0x694/0xa80 drivers/block/nbd.c:1022
Write of size 4 at addr ffff8880295de478 by task kworker/u33:0/67

CPU: 2 UID: 0 PID: 67 Comm: kworker/u33:0 Not tainted 6.15.0-rc5-syzkaller-00123-g2c89c1b655c0 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
Workqueue: nbd6-recv recv_work
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:94 [inline]
 dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
 print_address_description mm/kasan/report.c:408 [inline]
 print_report+0xc3/0x670 mm/kasan/report.c:521
 kasan_report+0xe0/0x110 mm/kasan/report.c:634
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0xef/0x1a0 mm/kasan/generic.c:189
 instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
 atomic_dec include/linux/atomic/atomic-instrumented.h:592 [inline]
 recv_work+0x694/0xa80 drivers/block/nbd.c:1022
 process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238
 process_scheduled_works kernel/workqueue.c:3319 [inline]
 worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400
 kthread+0x3c2/0x780 kernel/kthread.c:464
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>

nbd_genl_connect() does not properly stop the device on certain
error paths after nbd_start_device() has been called. The error path
then puts nbd->config while recv_work continues to use the config,
leading to a use-after-free in recv_work.

This patch moves nbd_start_device() after the backend file creation.
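
A minimal sketch of the reordering described above (illustrative only, not
the actual drivers/block/nbd.c diff; nbd_backend_file_create() is a
hypothetical stand-in for the backend file creation step):

    /* Before: nbd_start_device() ran first, so any later failure had to
     * unwind a device whose recv_work was already using nbd->config.
     * After: start the device only once every fallible step succeeded. */
    ret = nbd_backend_file_create(nbd);     /* hypothetical helper name */
    if (ret)
            goto out;                       /* nothing started yet, easy unwind */
    ret = nbd_start_device(nbd);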

Reported-by: syzbot+48240bab47e705c53126@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/68227a04.050a0220.f2294.00b5.GAE@google.com/T/
Fixes: 6497ef8df5 ("nbd: provide a way for userspace processes to identify device backends")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250612132405.364904-1-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-07 11:58:08 -06:00
Linus Torvalds
1880df2cf4 Merge tag 'block-6.16-20250704' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - NVMe fixes via Christoph:
     - fix incorrect cdw15 value in passthru error logging (Alok Tiwari)
     - fix memory leak of bio integrity in nvmet (Dmitry Bogdanov)
     - refresh visible attrs after being checked (Eugen Hristev)
     - fix suspicious RCU usage warning in the multipath code (Geliang Tang)
     - correctly account for namespace head reference counter (Nilay Shroff)

 - Fix for a regression introduced in ublk in this cycle, where it would
   attempt to queue a canceled request.

 - brd RCU sleeping fix, also introduced in this cycle. Bare bones fix,
   should be improved upon for the next release.

* tag 'block-6.16-20250704' of git://git.kernel.dk/linux:
  brd: fix sleeping function called from invalid context in brd_insert_page()
  ublk: don't queue request if the associated uring_cmd is canceled
  nvme-multipath: fix suspicious RCU usage warning
  nvme-pci: refresh visible attrs after being checked
  nvmet: fix memory leak of bio integrity
  nvme: correctly account for namespace head reference counter
  nvme: Fix incorrect cdw15 value in passthru error logging
2025-07-04 09:33:59 -07:00
Yu Kuai
0d519bb0de brd: fix sleeping function called from invalid context in brd_insert_page()
__xa_cmpxchg() is called with rcu_read_lock() held, yet it may allocate
memory if necessary, and that allocation can sleep.

Fix the problem by moving rcu_read_lock() to after __xa_cmpxchg(). It still
must be taken before xa_unlock(), to prevent the returned page from being
freed by a concurrent discard.
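
A sketch of the resulting ordering (illustrative only, not the actual brd
code; the gfp flag is a placeholder):

    xa_lock(&brd->brd_pages);
    old = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, GFP_NOWAIT);
    rcu_read_lock();                /* taken only after the (possibly allocating) cmpxchg */
    xa_unlock(&brd->brd_pages);
    /* ... use the returned page; RCU keeps it from being freed by discard ... */
    rcu_read_unlock();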

Fixes: bbcacab2e8 ("brd: avoid extra xarray lookups on first write")
Reported-by: syzbot+ea4c8fd177a47338881a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/685ec4c9.a00a0220.129264.000c.GAE@google.com/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250630112828.421219-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01 08:14:01 -06:00
Ming Lei
01ed88aea5 ublk: don't queue request if the associated uring_cmd is canceled
Commit 524346e9d7 ("ublk: build batch from IOs in same io_ring_ctx and io task")
need to dereference `io->cmd` for checking if the IO can be added to current
batch, see ublk_belong_to_same_batch() and io_uring_cmd_ctx_handle(). However,
`io->cmd` may become invalid after the uring_cmd is canceled.

Fixes it by only allowing to queue this IO in case that ublk_prep_req()
returns `BLK_STS_OK`, when 'io->cmd' is guaranteed to be valid.
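
A hedged sketch of the check (argument lists abbreviated; not the actual
ublk_queue_rqs() code, and the requeue call below is only illustrative):

    if (ublk_prep_req(ubq, req) == BLK_STS_OK) {
            rq_list_add_tail(&batch, req);          /* io->cmd is known to be valid */
    } else {
            blk_mq_requeue_request(req, false);     /* don't touch io->cmd */
    }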

Reported-by: Changhui Zhong <czhong@redhat.com>
Fixes: 524346e9d7 ("ublk: build batch from IOs in same io_ring_ctx and io task")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250701072325.1458109-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-07-01 07:54:35 -06:00
Linus Torvalds
e540341508 Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - Fixes for ublk:
      - fix C++ narrowing warnings in the uapi header
      - update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
      - fix for the ublk ->queue_rqs() implementation, limiting a batch
        to just the specific task AND ring
      - ublk_get_data() error handling fix
      - sanity check more arguments in ublk_ctrl_add_dev()
      - selftest addition

 - NVMe pull request via Christoph:
      - reset delayed remove_work after reconnect
      - fix atomic write size validation

 - Fix for a warning introduced in bdev_count_inflight_rw() in this
   merge window

* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
  block: fix false warning in bdev_count_inflight_rw()
  ublk: sanity check add_dev input for underflow
  nvme: fix atomic write size validation
  nvme: refactor the atomic write unit detection
  nvme: reset delayed remove_work after reconnect
  ublk: setup ublk_io correctly in case of ublk_get_data() failure
  ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
  ublk: fix narrowing warnings in UAPI header
  selftests: ublk: don't take same backing file for more than one ublk devices
  ublk: build batch from IOs in same io_ring_ctx and io task
2025-06-27 09:02:33 -07:00
Ronnie Sahlberg
969127bf07 ublk: sanity check add_dev input for underflow
Add additional checks that queue depth and number of queues are
non-zero.
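
A minimal sketch of such a check, using the field names from the ublk UAPI
struct ublksrv_ctrl_dev_info (surrounding error handling omitted):

    if (!info->nr_hw_queues || !info->queue_depth)
            return -EINVAL;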

Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250626022046.235018-1-ronniesahlberg@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-26 07:31:24 -06:00
Ming Lei
4c8a951787 ublk: setup ublk_io correctly in case of ublk_get_data() failure
If ublk_get_data() fails, -EIOCBQUEUED is returned and the current command
becomes ASYNC. The only reason for the failure is that data mapping can't
make progress, because there aren't enough pages or a signal is pending, so
the current ublk request has to be requeued.

Once the request needs to be requeued, `ublk_io` has to be set up correctly,
including io->cmd and flags, otherwise the request may not be forwarded to
the ublk server successfully.

Fixes: 9810362a57 ("ublk: don't call ublk_dispatch_req() for NEED_GET_DATA")
Reported-by: Changhui Zhong <czhong@redhat.com>
Closes: https://lore.kernel.org/linux-block/CAGVVp+VN9QcpHUz_0nasFf5q9i1gi8H8j-G-6mkBoqa3TyjRHA@mail.gmail.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Changhui Zhong <czhong@redhat.com>
Link: https://lore.kernel.org/r/20250624104121.859519-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24 20:45:31 -06:00
Ming Lei
524346e9d7 ublk: build batch from IOs in same io_ring_ctx and io task
ublk_queue_cmd_list() dispatches the whole batch list by scheduling task
work via the tail request's io_uring_cmd. This is fine even when more
than one io_ring_ctx is involved in the batch, since there is still just
one running context.

However, the task work handler ublk_cmd_list_tw_cb() uses the `issue_flags`
of the tail uring_cmd's io_ring_ctx for completing all commands, which is
wrong if any uring_cmd was issued from a different io_ring_ctx.

Fix it by always building a batch from IOs in the same io_ring_ctx and io
task, because ublk_dispatch_req() does validate the task context, and the
IO needs to be aborted when running from the fallback task work context.

For a typical per-queue or per-io daemon implementation this shouldn't make
a difference from a performance viewpoint, because a single io_ring_ctx is
used in each daemon for the normal use case.

Fixes: d796cea7b9 ("ublk: implement ->queue_rqs()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250625022554.883571-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-24 20:44:52 -06:00
Linus Torvalds
75f5f23f87 Merge tag 'block-6.16-20250619' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - Two fixes for aoe which fixes issues dating back to when this driver
   was converted to blk-mq

 - Fix for ublk, checking for valid queue depth and count values before
   setting up a device

* tag 'block-6.16-20250619' of git://git.kernel.dk/linux:
  ublk: sanitize the arguments from userspace when adding a device
  aoe: defer rexmit timer downdev work to workqueue
  aoe: clean device rq_list in aoedev_downdev()
2025-06-19 23:29:35 -07:00
Ronnie Sahlberg
8c84728558 ublk: sanitize the arguments from userspace when adding a device
Sanity check the values for queue depth and number of queues
we get from userspace when adding a device.

Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Fixes: 71f28f3136 ("ublk_drv: add io_uring based userspace block driver")
Fixes: 62fe99cef9 ("ublk: add read()/write() support for ublk char device")
Link: https://lore.kernel.org/r/20250619021031.181340-1-ronniesahlberg@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-19 07:53:24 -06:00
Justin Sanders
cffc873d68 aoe: defer rexmit timer downdev work to workqueue
When aoe's rexmit_timer() notices that an aoe target fails to respond to
commands for more than aoe_deadsecs, it calls aoedev_downdev() which
cleans the outstanding aoe and block queues. This can involve sleeping,
such as in blk_mq_freeze_queue(), which should not occur in irq context.

This patch defers that aoedev_downdev() call to the aoe device's
workqueue.
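
A generic sketch of the deferral pattern (the work item, workqueue and check
below are illustrative names, not the actual aoe fields):

    static void dev_timer_fn(struct timer_list *t)
    {
            struct aoedev *d = timer_container_of(d, t, timer);

            if (target_unresponsive(d))                /* hypothetical check */
                    queue_work(aoe_wq, &d->downdev_work);  /* defer; no sleeping here */
    }

    static void downdev_workfn(struct work_struct *work)
    {
            struct aoedev *d = container_of(work, struct aoedev, downdev_work);

            aoedev_downdev(d);      /* may sleep, e.g. in blk_mq_freeze_queue() */
    }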

Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665
Signed-off-by: Justin Sanders <jsanders.devel@gmail.com>
Link: https://lore.kernel.org/r/20250610170600.869-2-jsanders.devel@gmail.com
Tested-By: Valentin Kleibel <valentin@vrvis.at>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-17 06:12:35 -06:00
Justin Sanders
7f90d45e57 aoe: clean device rq_list in aoedev_downdev()
An aoe device's rq_list contains accepted block requests that are
waiting to be transmitted to the aoe target. This queue was added as
part of the conversion to blk-mq. However, the queue was not cleaned out
when an aoe device was downed, which caused blk_mq_freeze_queue() to sleep
indefinitely waiting for those requests to complete, resulting in a hang. This
fix cleans out the queue before calling blk_mq_freeze_queue().

Link: https://bugzilla.kernel.org/show_bug.cgi?id=212665
Fixes: 3582dd2917 ("aoe: convert aoeblk to blk-mq")
Signed-off-by: Justin Sanders <jsanders.devel@gmail.com>
Link: https://lore.kernel.org/r/20250610170600.869-1-jsanders.devel@gmail.com
Tested-By: Valentin Kleibel <valentin@vrvis.at>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-17 06:12:30 -06:00
Linus Torvalds
f713ffa363 Merge tag 'block-6.16-20250614' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - Fix for a deadlock on queue freeze with zoned writes

 - Fix for zoned append emulation

 - Two bio folio fixes, for sparsemem and for very large folios

 - Fix for a performance regression introduced in 6.13 when plug
   insertion was changed

 - Fix for NVMe passthrough handling for polled IO

 - Document the ublk auto registration feature

 - loop lockdep warning fix

* tag 'block-6.16-20250614' of git://git.kernel.dk/linux:
  nvme: always punt polled uring_cmd end_io work to task_work
  Documentation: ublk: Separate UBLK_F_AUTO_BUF_REG fallback behavior sublists
  block: Fix bvec_set_folio() for very large folios
  bio: Fix bio_first_folio() for SPARSEMEM without VMEMMAP
  block: use plug request list tail for one-shot backmerge attempt
  block: don't use submit_bio_noacct_nocheck in blk_zone_wplug_bio_work
  block: Clear BIO_EMULATES_ZONE_APPEND flag on BIO completion
  ublk: document auto buffer registration(UBLK_F_AUTO_BUF_REG)
  loop: move lo_set_size() out of queue freeze
2025-06-14 09:25:22 -07:00
Ming Lei
4bb08cf974 loop: move lo_set_size() out of queue freeze
It isn't necessary to freeze the queue to update the disk size, given that
submit_bio() doesn't grab the queue usage counter for checking EOD.

Also, many drivers don't freeze the queue when calling set_capacity_and_notify().

Move lo_set_size() out of the queue freeze to fix the many lockdep warning
reports.
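
A sketch of the reordering (illustrative only; the surrounding loop driver
code is abbreviated):

    blk_mq_freeze_queue(lo->lo_queue);
    /* ... only the changes that genuinely need a frozen queue ... */
    blk_mq_unfreeze_queue(lo->lo_queue);

    lo_set_size(lo, size);      /* capacity update moved outside the freeze */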

Link: https://lore.kernel.org/linux-block/67ea99e0.050a0220.3c3d88.0042.GAE@google.com/
Reported-by: syzbot+9dd7dbb1a4b915dee638@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250611084938.108829-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-06-11 06:41:45 -06:00
Ingo Molnar
41cb08555c treewide, timers: Rename from_timer() to timer_container_of()
Move this API to the canonical timer_*() namespace.
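
The mechanical change for a typical timer callback looks like this (struct
foo and its timer field are placeholders):

    /* before */
    struct foo *f = from_timer(f, t, timer);
    /* after */
    struct foo *f = timer_container_of(f, t, timer);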

[ tglx: Redone against pre rc1 ]

Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com
2025-06-08 09:07:37 +02:00
Linus Torvalds
6d8854216e Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:

 - NVMe pull request via Christoph:
      - TCP error handling fix (Shin'ichiro Kawasaki)
      - TCP I/O stall handling fixes (Hannes Reinecke)
      - fix command limits status code (Keith Busch)
      - support vectored buffers also for passthrough (Pavel Begunkov)
      - spelling fixes (Yi Zhang)

 - MD pull request via Yu:
      - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
      - fix max_write_behind setting for dm-raid
      - some minor cleanups

 - Integrity data direction fix and cleanup

 - bcache NULL pointer fix

 - Fix for loop missing write start/end handling

 - Decouple hardware queues and IO threads in ublk

 - Slew of ublk selftests additions and updates

* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
  block: drop direction param from bio_integrity_copy_user()
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  ...
2025-06-06 13:12:50 -07:00
Linus Torvalds
3719a04a80 Merge tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci updates from Bjorn Helgaas:
 "Enumeration:

   - Print the actual delay time in pci_bridge_wait_for_secondary_bus()
     instead of assuming it was 1000ms (Wilfred Mallawa)

   - Revert 'iommu/amd: Prevent binding other PCI drivers to IOMMU PCI
     devices', which broke resume from system sleep on AMD platforms and
     has been fixed by other commits (Lukas Wunner)

  Resource management:

   - Remove mtip32xx use of pcim_iounmap_regions(), which is deprecated
     and unnecessary (Philipp Stanner)

   - Remove pcim_iounmap_regions() and pcim_request_region_exclusive()
     and related flags since all uses have been removed (Philipp
     Stanner)

   - Rework devres 'request' functions so they are no longer 'hybrid',
     i.e., their behavior no longer depends on whether
     pcim_enable_device() or pci_enable_device() was used, and remove
     related code (Philipp Stanner)

   - Warn (not BUG()) about failure to assign optional resources (Ilpo
     Järvinen)

  Error handling:

   - Log the DPC Error Source ID only when it's actually valid (when
     ERR_FATAL or ERR_NONFATAL was received from a downstream device)
     and decode into bus/device/function (Bjorn Helgaas)

   - Determine AER log level once and save it so all related messages
     use the same level (Karolina Stolarek)

   - Use KERN_WARNING, not KERN_ERR, when logging PCIe Correctable
     Errors (Karolina Stolarek)

   - Ratelimit PCIe Correctable and Non-Fatal error logging, with sysfs
     controls on interval and burst count, to avoid flooding logs and
     RCU stall warnings (Jon Pan-Doh)

  Power management:

   - Increment PM usage counter when probing reset methods so we don't
     try to read config space of a powered-off device (Alex Williamson)

   - Set all devices to D0 during enumeration to ensure ACPI opregion is
     connected via _REG (Mario Limonciello)

  Power control:

   - Rename pwrctrl Kconfig symbols from 'PWRCTL' to 'PWRCTRL' to match
     the filename paths. Retain old deprecated symbols for
     compatibility, except for the pwrctrl slot driver
     (PCI_PWRCTRL_SLOT) (Johan Hovold)

   - When unregistering pwrctrl, cancel outstanding rescan work before
     cleaning up data structures to avoid use-after-free issues (Brian
     Norris)

  Bandwidth control:

   - Simplify link bandwidth controller by replacing the count of Link
     Bandwidth Management Status (LBMS) events with a PCI_LINK_LBMS_SEEN
     flag (Ilpo Järvinen)

   - Update the Link Speed after retraining, since the Link Speed may
     have changed (Ilpo Järvinen)

  PCIe native device hotplug:

   - Ignore Presence Detect Changed caused by DPC.

     pciehp already ignores Link Down/Up events caused by DPC, but on
     slots using in-band presence detect, DPC causes a spurious Presence
     Detect Changed event (Lukas Wunner)

   - Ignore Link Down/Up caused by Secondary Bus Reset.

     On hotplug ports using in-band presence detect, the reset causes a
     Presence Detect Changed event, which mistakenly caused teardown and
     re-enumeration of the device. Drivers may need to annotate code
     that resets their device (Lukas Wunner)

  Virtualization:

   - Add an ACS quirk for Loongson Root Ports that don't advertise ACS
     but don't allow peer-to-peer transactions between Root Ports; the
     quirk allows each Root Port to be in a separate IOMMU group (Huacai
     Chen)

  Endpoint framework:

   - For fixed-size BARs, retain both the actual size and the possibly
     larger size allocated to accommodate iATU alignment requirements
     (Jerome Brunet)

   - Simplify ctrl/SPAD space allocation and avoid allocating more space
     than needed (Jerome Brunet)

   - Correct MSI-X PBA offset calculations for DesignWare and Cadence
     endpoint controllers (Niklas Cassel)

   - Align the return value (number of interrupts) encoding for
     pci_epc_get_msi()/pci_epc_ops::get_msi() and
     pci_epc_get_msix()/pci_epc_ops::get_msix() (Niklas Cassel)

   - Align the nr_irqs parameter encoding for
     pci_epc_set_msi()/pci_epc_ops::set_msi() and
     pci_epc_set_msix()/pci_epc_ops::set_msix() (Niklas Cassel)

  Common host controller library:

   - Convert pci-host-common to a library so platforms that don't need
     native host controller drivers don't need to include these helper
     functions (Manivannan Sadhasivam)

  Apple PCIe controller driver:

   - Extract ECAM bridge creation helper from pci_host_common_probe() to
     separate driver-specific things like MSI from PCI things (Marc
     Zyngier)

   - Dynamically allocate RID-to-SID bitmap to prepare for SoCs with
     varying capabilities (Marc Zyngier)

   - Skip ports disabled in DT when setting up ports (Janne Grunau)

   - Add t6020 compatible string (Alyssa Rosenzweig)

   - Add T602x PCIe support (Hector Martin)

   - Directly set/clear INTx mask bits because T602x dropped the
     accessors that could do this without locking (Marc Zyngier)

   - Move port PHY registers to their own reg items to accommodate
     T602x, which moves them around; retain default offsets for existing
     DTs that lack phy%d entries with the reg offsets (Hector Martin)

   - Stop polling for core refclk, which doesn't work on T602x and the
     bootloader has already done anyway (Hector Martin)

   - Use gpiod_set_value_cansleep() when asserting PERST# in probe
     because we're allowed to sleep there (Hector Martin)

  Cadence PCIe controller driver:

   - Drop a runtime PM 'put' to resolve a runtime atomic count underflow
     (Hans Zhang)

   - Make the cadence core buildable as a module (Kishon Vijay Abraham I)

   - Add cdns_pcie_host_disable() and cdns_pcie_ep_disable() for use by
     loadable drivers when they are removed (Siddharth Vadapalli)

  Freescale i.MX6 PCIe controller driver:

   - Apply link training workaround only on IMX6Q, IMX6SX, IMX6SP
     (Richard Zhu)

   - Remove redundant dw_pcie_wait_for_link() from
     imx_pcie_start_link(); since the DWC core does this, imx6 only
     needs it when retraining for a faster link speed (Richard Zhu)

   - Toggle i.MX95 core reset to align with PHY powerup (Richard Zhu)

   - Set SYS_AUX_PWR_DET to work around i.MX95 ERR051624 erratum: in
     some cases, the controller can't exit 'L23 Ready' through Beacon or
     PERST# deassertion (Richard Zhu)

   - Clear GEN3_ZRXDC_NONCOMPL to work around i.MX95 ERR051586 erratum:
     controller can't meet 2.5 GT/s ZRX-DC timing when operating at 8
     GT/s, causing timeouts in L1 (Richard Zhu)

   - Wait for i.MX95 PLL lock before enabling controller (Richard Zhu)

   - Save/restore i.MX95 LUT for suspend/resume (Richard Zhu)

  Mobiveil PCIe controller driver:

   - Return bool (not int) for link-up check in
     mobiveil_pab_ops.link_up() and layerscape-gen4, mobiveil (Hans
     Zhang)

  NVIDIA Tegra194 PCIe controller driver:

   - Create debugfs directory for 'aspm_state_cnt' only when
     CONFIG_PCIEASPM is enabled, since there are no other entries (Hans
     Zhang)

  Qualcomm PCIe controller driver:

   - Add OF support for parsing DT 'eq-presets-<N>gts' property for lane
     equalization presets (Krishna Chaitanya Chundru)

   - Read Maximum Link Width from the Link Capabilities register if DT
     lacks 'num-lanes' property (Krishna Chaitanya Chundru)

   - Add Physical Layer 64 GT/s Capability ID and register offsets for
     8, 32, and 64 GT/s lane equalization registers (Krishna Chaitanya
     Chundru)

   - Add generic dwc support for configuring lane equalization presets
     (Krishna Chaitanya Chundru)

   - Add DT and driver support for PCIe on IPQ5018 SoC (Nitheesh Sekar)

  Renesas R-Car PCIe controller driver:

   - Describe endpoint BAR 4 as being fixed size (Jerome Brunet)

   - Document how to obtain R-Car V4H (r8a779g0) controller firmware
     (Yoshihiro Shimoda)

  Rockchip PCIe controller driver:

   - Reorder rockchip_pci_core_rsts because
     reset_control_bulk_deassert() deasserts in reverse order, to fix a
     link training regression (Jensen Huang)

   - Mark RK3399 as being capable of raising INTx interrupts (Niklas
     Cassel)

  Rockchip DesignWare PCIe controller driver:

   - Check only PCIE_LINKUP, not LTSSM status, to determine whether the
     link is up (Shawn Lin)

   - Increase N_FTS (used in L0s->L0 transitions) and enable ASPM L0s
     for Root Complex and Endpoint modes (Shawn Lin)

   - Hide the broken ATS Capability in rockchip_pcie_ep_init() instead
     of rockchip_pcie_ep_pre_init() so it stays hidden after PERST#
     resets non-sticky registers (Shawn Lin)

   - Call phy_power_off() before phy_exit() in rockchip_pcie_phy_deinit()
     (Diederik de Haas)

  Synopsys DesignWare PCIe controller driver:

   - Set PORT_LOGIC_LINK_WIDTH to one lane to make initial link training
     more robust; this will not affect the intended link width if all
     lanes are functional (Wenbin Yao)

   - Return bool (not int) for link-up check in dw_pcie_ops.link_up()
     and armada8k, dra7xx, dw-rockchip, exynos, histb, keembay,
     keystone, kirin, meson, qcom, qcom-ep, rcar_gen4, spear13xx,
     tegra194, uniphier, visconti (Hans Zhang)

   - Add debugfs support for exposing DWC device-specific PTM context
     (Manivannan Sadhasivam)

  TI J721E PCIe driver:

   - Make j721e buildable as a loadable and removable module (Siddharth
     Vadapalli)

   - Fix j721e host/endpoint dependencies that result in link failures
     in some configs (Arnd Bergmann)

  Device tree bindings:

   - Add qcom DT binding for 'global' interrupt (PCIe controller and
     link-specific events) for ipq8074, ipq8074-gen3, ipq6018, sa8775p,
     sc7280, sc8180x, sdm845, sm8150, sm8250, sm8350 (Manivannan
     Sadhasivam)

   - Add qcom DT binding for 8 MSI SPI interrupts for msm8998, ipq8074,
     ipq8074-gen3, ipq6018 (Manivannan Sadhasivam)

   - Add dw rockchip DT binding for rk3576 and rk3562 (Kever Yang)

   - Correct indentation and style of examples in brcm,stb-pcie,
     cdns,cdns-pcie-ep, intel,keembay-pcie-ep, intel,keembay-pcie,
     microchip,pcie-host, rcar-pci-ep, rcar-pci-host, xilinx-versal-cpm
     (Krzysztof Kozlowski)

   - Convert Marvell EBU (dove, kirkwood, armada-370, armada-xp) and
     armada8k from text to schema DT bindings (Rob Herring)

   - Remove obsolete .txt DT bindings for content that has been moved to
     schemas (Rob Herring)

   - Add qcom DT binding for MHI registers in IPQ5332, IPQ6018, IPQ8074
     and IPQ9574 (Varadarajan Narayanan)

   - Convert v3,v360epc-pci from text to DT schema binding (Rob Herring)

   - Change microchip,pcie-host DT binding to be 'dma-noncoherent' since
     PolarFire may be configured that way (Conor Dooley)

  Miscellaneous:

   - Drop 'pci' suffix from intel_mid_pci.c filename to match similar
     files (Andy Shevchenko)

   - All platforms with PCI have an MMU, so add PCI Kconfig dependency
     on MMU to simplify build testing and avoid inadvertent build
     regressions (Arnd Bergmann)

   - Update Krzysztof Wilczyński's email address in MAINTAINERS
     (Krzysztof Wilczyński)

   - Update Manivannan Sadhasivam's email address in MAINTAINERS
     (Manivannan Sadhasivam)"

* tag 'pci-v6.16-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (147 commits)
  MAINTAINERS: Update Manivannan Sadhasivam email address
  PCI: j721e: Fix host/endpoint dependencies
  PCI: j721e: Add support to build as a loadable module
  PCI: cadence-ep: Introduce cdns_pcie_ep_disable() helper for cleanup
  PCI: cadence-host: Introduce cdns_pcie_host_disable() helper for cleanup
  PCI: cadence: Add support to build pcie-cadence library as a kernel module
  MAINTAINERS: Update Krzysztof Wilczyński email address
  PCI: Remove unnecessary linesplit in __pci_setup_bridge()
  PCI: WARN (not BUG()) when we fail to assign optional resources
  PCI: Remove unused pci_printk()
  PCI: qcom: Replace PERST# sleep time with proper macro
  PCI: dw-rockchip: Replace PERST# sleep time with proper macro
  PCI: host-common: Convert to library for host controller drivers
  PCI/ERR: Remove misleading TODO regarding kernel panic
  PCI: cadence: Remove duplicate message code definitions
  PCI: endpoint: Align pci_epc_set_msix(), pci_epc_ops::set_msix() nr_irqs encoding
  PCI: endpoint: Align pci_epc_set_msi(), pci_epc_ops::set_msi() nr_irqs encoding
  PCI: endpoint: Align pci_epc_get_msix(), pci_epc_ops::get_msix() return value encoding
  PCI: endpoint: Align pci_epc_get_msi(), pci_epc_ops::get_msi() return value encoding
  PCI: cadence-ep: Correct PBA offset in .set_msix() callback
  ...
2025-06-04 11:26:17 -07:00
Linus Torvalds
fd1f847350 Merge tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:

 - "zram: support algorithm-specific parameters" from Sergey Senozhatsky
   adds infrastructure for passing algorithm-specific parameters into
   zram. A single parameter `winbits' is implemented at this time.

 - "memcg: nmi-safe kmem charging" from Shakeel Butt makes memcg
   charging nmi-safe, which is required by BPF, which can operate in NMI
   context.

 - "Some random fixes and cleanup to shmem" from Kemeng Shi implements
   small fixes and cleanups in the shmem code.

 - "Skip mm selftests instead when kernel features are not present" from
   Zi Yan fixes some issues in the MM selftest code.

 - "mm/damon: build-enable essential DAMON components by default" from
   SeongJae Park reworks DAMON Kconfig to make it easier to enable
   CONFIG_DAMON.

 - "sched/numa: add statistics of numa balance task migration" from Libo
   Chen adds more info into sysfs and procfs files to improve visibility
   into the NUMA balancer's task migration activity.

 - "selftests/mm: cow and gup_longterm cleanups" from Mark Brown
   provides various updates to some of the MM selftests to make them
   play better with the overall containing framework.

* tag 'mm-stable-2025-06-01-14-06' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (43 commits)
  mm/khugepaged: clean up refcount check using folio_expected_ref_count()
  selftests/mm: fix test result reporting in gup_longterm
  selftests/mm: report unique test names for each cow test
  selftests/mm: add helper for logging test start and results
  selftests/mm: use standard ksft_finished() in cow and gup_longterm
  selftests/damon/_damon_sysfs: skip testcases if CONFIG_DAMON_SYSFS is disabled
  sched/numa: add statistics of numa balance task
  sched/numa: fix task swap by skipping kernel threads
  tools/testing: check correct variable in open_procmap()
  tools/testing/vma: add missing function stub
  mm/gup: update comment explaining why gup_fast() disables IRQs
  selftests/mm: two fixes for the pfnmap test
  mm/khugepaged: fix race with folio split/free using temporary reference
  mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order
  mmu_notifiers: remove leftover stub macros
  selftests/mm: deduplicate test names in madv_populate
  kcov: rust: add flags for KCOV with Rust
  mm: rust: make CONFIG_MMU ifdefs more narrow
  mmu_gather: move tlb flush for VM_PFNMAP/VM_MIXEDMAP vmas into free_pgtables()
  mm/damon/Kconfig: enable CONFIG_DAMON by default
  ...
2025-06-02 16:00:26 -07:00
Sergey Senozhatsky
dc75a0d93b zram: support deflate-specific params
Introduce support for algorithm-specific parameters in the algorithm_params
device attribute.  The expected format is algorithm.param=value.

For starters, add support for deflate.winbits parameter.

Link: https://lkml.kernel.org/r/20250514024825.1745489-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:07 -07:00
Sergey Senozhatsky
a5ade2e9fa zram: rename ZCOMP_PARAM_NO_LEVEL
Patch series "zram: support algorithm-specific parameters".

This patchset adds support for algorithm-specific parameters.  For now,
only deflate-specific winbits can be configured, which fixes deflate
support on some s390 setups.


This patch (of 2):

Use a more generic name because this will be the default "un-set"
value for more params in the future.

Link: https://lkml.kernel.org/r/20250514024825.1745489-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20250514024825.1745489-2-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-31 22:46:07 -07:00
Linus Torvalds
00c010e130 Merge tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:

 - "Add folio_mk_pte()" from Matthew Wilcox simplifies the act of
   creating a pte which addresses the first page in a folio and reduces
   the amount of plumbing which architectures must implement to provide
   this.

 - "Misc folio patches for 6.16" from Matthew Wilcox is a shower of
   largely unrelated folio infrastructure changes which clean things up
   and better prepare us for future work.

 - "memory,x86,acpi: hotplug memory alignment advisement" from Gregory
   Price adds early-init code to prevent x86 from leaving physical
   memory unused when physical address regions are not aligned to memory
   block size.

 - "mm/compaction: allow more aggressive proactive compaction" from
   Michal Clapinski provides some tuning of the (sadly, hard-coded (more
   sadly, not auto-tuned)) thresholds for our invocation of proactive
   compaction. In a simple test case, the reduction of a guest VM's
   memory consumption was dramatic.

 - "Minor cleanups and improvements to swap freeing code" from Kemeng
   Shi provides some code cleaups and a small efficiency improvement to
   this part of our swap handling code.

 - "ptrace: introduce PTRACE_SET_SYSCALL_INFO API" from Dmitry Levin
   adds the ability for a ptracer to modify syscall arguments. At this
   time we can alter only the "system call information that are used by
   strace system call tampering, namely, syscall number, syscall
   arguments, and syscall return value".

   This series should have been incorporated into mm.git's "non-MM"
   branch, but I goofed.

 - "fs/proc: extend the PAGEMAP_SCAN ioctl to report guard regions" from
   Andrei Vagin extends the info returned by the PAGEMAP_SCAN ioctl
   against /proc/pid/pagemap. This permits CRIU to more efficiently get
   at the info about guard regions.

 - "Fix parameter passed to page_mapcount_is_type()" from Gavin Shan
   implements that fix. No runtime effect is expected because
   validate_page_before_insert() happens to fix up this error.

 - "kernel/events/uprobes: uprobe_write_opcode() rewrite" from David
   Hildenbrand basically brings uprobe text poking into the current
   decade. Remove a bunch of hand-rolled implementation in favor of
   using more current facilities.

 - "mm/ptdump: Drop assumption that pxd_val() is u64" from Anshuman
   Khandual provides enhancements and generalizations to the pte dumping
   code. This might be needed when 128-bit Page Table Descriptors are
   enabled for ARM.

 - "Always call constructor for kernel page tables" from Kevin Brodsky
   ensures that the ctor/dtor is always called for kernel pgtables, as
   it already is for user pgtables.

   This permits the addition of more functionality such as "insert hooks
   to protect page tables". This change does result in various
   architectures performing unnecessary work, but this is fixed up where
   it is anticipated to occur.

 - "Rust support for mm_struct, vm_area_struct, and mmap" from Alice
   Ryhl adds plumbing to permit Rust access to core MM structures.

 - "fix incorrectly disallowed anonymous VMA merges" from Lorenzo
   Stoakes takes advantage of some VMA merging opportunities which we've
   been missing for 15 years.

 - "mm/madvise: batch tlb flushes for MADV_DONTNEED and MADV_FREE" from
   SeongJae Park optimizes process_madvise()'s TLB flushing.

   Instead of flushing each address range in the provided iovec, we
   batch the flushing across all the iovec entries. The syscall's cost
   was approximately halved with a microbenchmark which was designed to
   load this particular operation.

 - "Track node vacancy to reduce worst case allocation counts" from
   Sidhartha Kumar makes the maple tree smarter about its node
   preallocation.

   stress-ng mmap performance increased by single-digit percentages and
   the amount of unnecessarily preallocated memory was dramatically
   reduced.

 - "mm/gup: Minor fix, cleanup and improvements" from Baoquan He removes
   a few unnecessary things which Baoquan noted when reading the code.

 - ""Enhance sysfs handling for memory hotplug in weighted interleave"
   from Rakie Kim "enhances the weighted interleave policy in the memory
   management subsystem by improving sysfs handling, fixing memory
   leaks, and introducing dynamic sysfs updates for memory hotplug
   support". Fixes things on error paths which we are unlikely to hit.

 - "mm/damon: auto-tune DAMOS for NUMA setups including tiered memory"
   from SeongJae Park introduces new DAMOS quota goal metrics which
   eliminate the manual tuning which is required when utilizing DAMON
   for memory tiering.

 - "mm/vmalloc.c: code cleanup and improvements" from Baoquan He
   provides cleanups and small efficiency improvements which Baoquan
   found via code inspection.

 - "vmscan: enforce mems_effective during demotion" from Gregory Price
   changes reclaim to respect cpuset.mems_effective during demotion when
   possible. Presently, reclaim explicitly ignores cpuset.mems_effective
   when demoting, which may cause the cpuset settings to be violated.

   This is useful for isolating workloads on a multi-tenant system from
   certain classes of memory more consistently.

 - "Clean up split_huge_pmd_locked() and remove unnecessary folio
   pointers" from Gavin Guo provides minor cleanups and efficiency gains
   in the huge page splitting and migrating code.

 - "Use kmem_cache for memcg alloc" from Huan Yang creates a slab cache
   for `struct mem_cgroup', yielding improved memory utilization.

 - "add max arg to swappiness in memory.reclaim and lru_gen" from
   Zhongkun He adds a new "max" argument to the "swappiness=" argument
   for memory.reclaim and MGLRU's lru_gen.

   This directs proactive reclaim to reclaim from only anon folios
   rather than file-backed folios.

 - "kexec: introduce Kexec HandOver (KHO)" from Mike Rapoport is the
   first step on the path to permitting the kernel to maintain existing
   VMs while replacing the host kernel via file-based kexec. At this
   time only memblock's reserve_mem is preserved.

 - "mm: Introduce for_each_valid_pfn()" from David Woodhouse provides
   and uses a smarter way of looping over a pfn range, by skipping
   ranges of invalid pfns.

 - "sched/numa: Skip VMA scanning on memory pinned to one NUMA node via
   cpuset.mems" from Libo Chen removes a lot of pointless VMA scanning
   when a task is pinned to a single NUMA node.

   Dramatic performance benefits were seen in some real world cases.

 - "JFS: Implement migrate_folio for jfs_metapage_aops" from Shivank
   Garg addresses a warning which occurs during memory compaction when
   using JFS.

 - "move all VMA allocation, freeing and duplication logic to mm" from
   Lorenzo Stoakes moves some VMA code from kernel/fork.c into the more
   appropriate mm/vma.c.

 - "mm, swap: clean up swap cache mapping helper" from Kairui Song
   provides code consolidation and cleanups related to the folio_index()
   function.

 - "mm/gup: Cleanup memfd_pin_folios()" from Vishal Moola does that.

 - "memcg: Fix test_memcg_min/low test failures" from Waiman Long
   addresses some bogus failures which are being reported by the
   test_memcontrol selftest.

 - "eliminate mmap() retry merge, add .mmap_prepare hook" from Lorenzo
   Stoakes commences the deprecation of file_operations.mmap() in favor
   of the new file_operations.mmap_prepare().

   The latter is more restrictive and prevents drivers from messing with
   things in ways which, amongst other problems, may defeat VMA merging.

 - "memcg: decouple memcg and objcg stocks"" from Shakeel Butt decouples
   the per-cpu memcg charge cache from the objcg's one.

   This is a step along the way to making memcg and objcg charging
   NMI-safe, which is a BPF requirement.

 - "mm/damon: minor fixups and improvements for code, tests, and
   documents" from SeongJae Park is yet another batch of miscellaneous
   DAMON changes. Fix and improve minor problems in code, tests and
   documents.

 - "memcg: make memcg stats irq safe" from Shakeel Butt converts memcg
   stats to be irq safe. Another step along the way to making memcg
   charging and stats updates NMI-safe, a BPF requirement.

 - "Let unmap_hugepage_range() and several related functions take folio
   instead of page" from Fan Ni provides folio conversions in the
   hugetlb code.

* tag 'mm-stable-2025-05-31-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (285 commits)
  mm: pcp: increase pcp->free_count threshold to trigger free_high
  mm/hugetlb: convert use of struct page to folio in __unmap_hugepage_range()
  mm/hugetlb: refactor __unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: refactor unmap_hugepage_range() to take folio instead of page
  mm/hugetlb: pass folio instead of page to unmap_ref_private()
  memcg: objcg stock trylock without irq disabling
  memcg: no stock lock for cpu hot-unplug
  memcg: make __mod_memcg_lruvec_state re-entrant safe against irqs
  memcg: make count_memcg_events re-entrant safe against irqs
  memcg: make mod_memcg_state re-entrant safe against irqs
  memcg: move preempt disable to callers of memcg_rstat_updated
  memcg: memcg_rstat_updated re-entrant safe against irqs
  mm: khugepaged: decouple SHMEM and file folios' collapse
  selftests/eventfd: correct test name and improve messages
  alloc_tag: check mem_profiling_support in alloc_tag_init
  Docs/damon: update titles and brief introductions to explain DAMOS
  selftests/damon/_damon_sysfs: read tried regions directories in order
  mm/damon/tests/core-kunit: add a test for damos_set_filters_default_reject()
  mm/damon/paddr: remove unused variable, folio_list, in damon_pa_stat()
  mm/damon/sysfs-schemes: fix wrong comment on damons_sysfs_quota_goal_metric_strs
  ...
2025-05-31 15:44:16 -07:00
Uday Shankar
ab03a61c66 ublk: have a per-io daemon instead of a per-queue daemon
Currently, ublk_drv associates to each hardware queue (hctx) a unique
task (called the queue's ubq_daemon) which is allowed to issue
COMMIT_AND_FETCH commands against the hctx. If any other task attempts
to do so, the command fails immediately with EINVAL. When considered
together with the block layer architecture, the result is that for each
CPU C on the system, there is a unique ublk server thread which is
allowed to handle I/O submitted on CPU C. This can lead to suboptimal
performance under imbalanced load generation. For an extreme example,
suppose all the load is generated on CPUs mapping to a single ublk
server thread. Then that thread may be fully utilized and become the
bottleneck in the system, while other ublk server threads are totally
idle.

This issue can also be addressed directly in the ublk server without
kernel support by having threads dequeue I/Os and pass them around to
ensure even load. But this solution requires inter-thread communication
at least twice for each I/O (submission and completion), which is
generally a bad pattern for performance. The problem gets even worse
with zero copy, as more inter-thread communication would be required to
have the buffer register/unregister calls come from the correct
thread.

Therefore, address this issue in ublk_drv by allowing each I/O to have
its own daemon task. Two I/Os in the same queue are now allowed to be
serviced by different daemon tasks - this was not possible before.
Imbalanced load can then be balanced across all ublk server threads by
having the ublk server threads issue FETCH_REQs in a round-robin manner.
As a small toy example, consider a system with a single ublk device
having 2 queues, each of depth 4. A ublk server having 4 threads could
issue its FETCH_REQs against this device as follows (where each entry is
the qid,tag pair that the FETCH_REQ targets):

ublk server thread:	T0	T1	T2	T3
			0,0	0,1	0,2	0,3
			1,3	1,0	1,1	1,2

This setup allows for load that is concentrated on one hctx/ublk_queue
to be spread out across all ublk server threads, alleviating the issue
described above.
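
A small self-contained sketch that reproduces the toy assignment above with
a simple round-robin rule (one possible policy; the ublk server is free to
pick any mapping):

    #include <stdio.h>

    int main(void)
    {
            const int nr_queues = 2, depth = 4, nr_threads = 4;

            for (int q = 0; q < nr_queues; q++)
                    for (int tag = 0; tag < depth; tag++)
                            /* thread (q + tag) % nr_threads issues the FETCH_REQ for (q, tag) */
                            printf("FETCH_REQ q=%d tag=%d -> T%d\n",
                                   q, tag, (q + tag) % nr_threads);
            return 0;
    }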

Add the new UBLK_F_PER_IO_DAEMON feature to ublk_drv, which ublk servers
can use to essentially test for the presence of this change and tailor
their behavior accordingly.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20250529-ublk_task_per_io-v8-1-e9d3b119336a@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-31 14:38:21 -06:00
Ming Lei
39d86db34e loop: add file_start_write() and file_end_write()
file_start_write() and file_end_write() should be added around ->write_iter().

Recently we switched to ->write_iter() from vfs_iter_write(), and the
implied file_start_write() and file_end_write() were lost.

We also never added them for the dio code path, so add them back to cover
both.
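
A minimal sketch of the bracketing described above (not the actual loop
driver code):

    file_start_write(file);                         /* sb_start_write() accounting */
    ret = call_write_iter(file, &kiocb, &iter);     /* invokes file->f_op->write_iter */
    file_end_write(file);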

Cc: Jeff Moyer <jmoyer@redhat.com>
Fixes: f2fed441c6 ("loop: stop using vfs_iter_{read,write} for buffered I/O")
Fixes: bc07c10a36 ("block: loop: support DIO & AIO")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250527153405.837216-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 10:14:26 -06:00
Linus Torvalds
6f59de9bc0 Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux
Pull block updates from Jens Axboe:

 - ublk updates:
      - Add support for updating the size of a ublk instance
      - Zero-copy improvements
      - Auto-registering of buffers for zero-copy
      - Series simplifying and improving GET_DATA and request lookup
      - Series adding quiesce support
      - Lots of selftests additions
      - Various cleanups

 - NVMe updates via Christoph:
      - add per-node DMA pools and use them for PRP/SGL allocations
        (Caleb Sander Mateos, Keith Busch)
      - nvme-fcloop refcounting fixes (Daniel Wagner)
      - support delayed removal of the multipath node and optionally
        support the multipath node for private namespaces (Nilay Shroff)
      - support shared CQs in the PCI endpoint target code (Wilfred
        Mallawa)
      - support admin-queue only authentication (Hannes Reinecke)
      - use the crc32c library instead of the crypto API (Eric Biggers)
      - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
        Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

 - MD updates via Yu:
      - Fix that normal IO can be starved by sync IO, found by mkfs on
        newly created large raid5, with some clean up patches for bdev
        inflight counters

 - Clean up brd, getting rid of atomic kmaps and bvec poking

 - Add loop driver specifically for zoned IO testing

 - Eliminate blk-rq-qos calls with a static key, if not enabled

 - Improve hctx locking for when a plug has IO for multiple queues
   pending

 - Remove block layer bouncing support, which in turn means we can
   remove the per-node bounce stat as well

 - Improve blk-throttle support

 - Improve delay support for blk-throttle

 - Improve brd discard support

 - Unify IO scheduler switching. This should also fix a bunch of lockdep
   warnings we've been seeing, after enabling lockdep support for queue
   freezing/unfreezing

 - Add support for block write streams via FDP (flexible data placement)
   on NVMe

 - Add a bunch of block helpers, facilitating the removal of a bunch of
   duplicated boilerplate code

 - Remove obsolete BLK_MQ pci and virtio Kconfig options

 - Add atomic/untorn write support to blktrace

 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregistering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26 11:39:36 -07:00
Ming Lei
b465ae7b25 ublk: add feature UBLK_F_QUIESCE
Add feature UBLK_F_QUIESCE, which adds the control command `UBLK_U_CMD_QUIESCE_DEV`
for quiescing the device; the device state can then finally become
`UBLK_S_DEV_QUIESCED` or `UBLK_S_DEV_FAIL_IO` from ublk_ch_release() with
ublk server cooperation.

This feature helps to support upgrading the ublk server application by
shutting down the ublk server gracefully, while keeping the ublk block
device persistent during the upgrade period.

The feature is only available together with UBLK_F_USER_RECOVERY.

Suggested-by: Yoav Cohen <yoav@nvidia.com>
Link: https://lore.kernel.org/linux-block/DM4PR12MB632807AB7CDCE77D1E5AB7D0A9B92@DM4PR12MB6328.namprd12.prod.outlook.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250522163523.406289-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-23 09:42:12 -06:00
Linus Torvalds
a11a722298 Merge tag 'block-6.15-20250522' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - Fix for a regression with setting up loop on a file system
   without ->write_iter()

 - Fix for an nvme sysfs regression

* tag 'block-6.15-20250522' of git://git.kernel.dk/linux:
  nvme: avoid creating multipath sysfs group under namespace path devices
  loop: don't require ->write_iter for writable files in loop_configure
2025-05-22 13:08:21 -07:00
Ming Lei
914e0dc508 ublk: run auto buf unregistering in same io_ring_ctx with registering
UBLK_F_AUTO_BUF_REG requires that the buffer registered automatically
is unregistered in the same `io_ring_ctx`, so check it explicitly.

Document this requirement for UBLK_F_AUTO_BUF_REG.

Drop the WARN_ON_ONCE() which can be triggered from the userspace code path.

Fixes: 99c1e4eb6a ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250522152043.399824-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-22 10:03:55 -06:00
Caleb Sander Mateos
5234f2c3e3 ublk: remove io argument from ublk_auto_buf_reg_fallback()
The argument has been unused since the function was added, so remove it.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250521160720.1893326-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-22 05:01:14 -06:00
Ming Lei
9172dbf3a6 ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
If ublk_set_auto_buf_reg() fails, we need to unlock and return,
otherwise `ub->mutex` is leaked.
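
A generic sketch of the fix pattern (not the actual ublk_fetch() body; every
error path taken after the lock must reach the unlock):

    mutex_lock(&ub->mutex);
    ret = ublk_set_auto_buf_reg(...);   /* arguments omitted */
    if (ret)
            goto out_unlock;            /* previously returned here with the mutex held */
    /* ... */
    out_unlock:
            mutex_unlock(&ub->mutex);
            return ret;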

Fixes: 99c1e4eb6a ("ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG")
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250521025502.71041-2-ming.lei@redhat.com
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-21 14:30:28 -06:00
Ming Lei
53f427e794 ublk: support UBLK_AUTO_BUF_REG_FALLBACK
For UBLK_F_AUTO_BUF_REG, the buffer is registered to the uring_cmd context
automatically with the provided buffer index. The user may provide a wrong
buffer index, or the specified buffer may already be registered by the
application.

Add UBLK_AUTO_BUF_REG_FALLBACK to support falling back from auto buffer
registration: complete the uring_cmd and report the registration failure to
the ublk server via UBLK_AUTO_BUF_REG_FALLBACK, so the ublk server can still
register the buffer from userspace.

This provides a reliable way to support auto buffer registration.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250520045455.515691-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20 10:24:45 -06:00
Ming Lei
99c1e4eb6a ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
Add UBLK_F_AUTO_BUF_REG to support registering the buffer automatically
to the local io_uring context with the provided buffer index.

Add the UAPI structure `struct ublk_auto_buf_reg` for holding the user
parameters for automatic request buffer registration. One 'flags' field is
defined, and there are still 32 bits available for future extension, such as
adding an io_ring FD field for registering the buffer to an external io_uring.

`struct ublk_auto_buf_reg` is populated from the ublk uring_cmd's sqe->addr;
all existing ublk commands are data-less, so it is fine to reuse
sqe->addr for this purpose.
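
A hedged sketch of how such an 8-byte descriptor can travel in sqe->addr.
The field layout below is an assumption derived from the description (a
buffer index, one 'flags' field, ~32 bits reserved), not a copy of the UAPI
header:

    struct ublk_auto_buf_reg_sketch {           /* assumed layout, 8 bytes total */
            __u16 index;                        /* io_uring buffer index to use */
            __u8  flags;
            __u8  reserved0;
            __u32 reserved1;                    /* room for e.g. an external ring FD */
    };

    /* pack into the otherwise unused sqe->addr of a data-less ublk uring_cmd */
    union {
            struct ublk_auto_buf_reg_sketch reg;
            __u64 addr;
    } u = { .reg = { .index = buf_index } };
    sqe->addr = u.addr;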

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250520045455.515691-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20 10:24:45 -06:00
Ming Lei
9e6b4756b3 ublk: prepare for supporting to register request buffer automatically
UBLK_F_SUPPORT_ZERO_COPY requires the ublk server to issue explicit buffer
register/unregister uring_cmds for each IO. This is not only inefficient,
but also introduces a dependency between the buffer consumer and the buffer
register/unregister uring_cmds; see tools/testing/selftests/ublk/stripe.c,
in which backing file IO has to be issued one by one via IOSQE_IO_LINK.

Prepare for adding the feature UBLK_F_AUTO_BUF_REG to address the existing
zero copy limitation:

- register the request buffer automatically to the ublk uring_cmd's io_uring
  context before delivering the io command to the ublk server

- unregister the request buffer automatically from the ublk uring_cmd's
  io_uring context when completing the request

- io_uring will unregister the buffer automatically when the ring is
  exiting, so we needn't worry about an accidental exit

To use this feature, the ublk server has to create one sparse buffer table.

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250520045455.515691-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20 10:24:45 -06:00
Ming Lei
b1c3b4695a ublk: convert to refcount_t
Convert to refcount_t in preparation for registering the bvec buffer
automatically, which requires initializing the reference counter to 2.
kref does not provide such an interface, hence the conversion.
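
A minimal sketch of the difference (names are illustrative):

  #include <linux/refcount.h>

  /* kref_init() can only start a counter at 1; the auto-registration
   * case needs the request reference held twice up front (once by the
   * driver, once by the registered buffer), which refcount_t allows. */
  static inline void ublk_io_ref_init_sketch(refcount_t *ref)
  {
      refcount_set(ref, 2);
  }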

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250520045455.515691-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20 10:24:45 -06:00
Christoph Hellwig
355341e435 loop: don't require ->write_iter for writable files in loop_configure
Block devices can be opened read-write even if they can't be written to,
for historic reasons.  Remove the check in loop_configure that requires
file->f_op->write_iter when the block device was opened read-write. The
call to loop_check_backing_file just below ensures that ->write_iter is
present for backing files opened for writing, which is the only check
that is actually needed.

Fixes: f5c84eff63 ("loop: Add sanity check for read/write_iter")
Reported-by: Christian Hesse <mail@eworm.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250520135420.1177312-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-20 09:16:23 -06:00
Linus Torvalds
6462c247b2 Merge tag 'block-6.15-20250515' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
      - fixes for atomic writes (Alan Adamson)
      - fixes for polled CQs in nvmet-epf (Damien Le Moal)
      - fix for polled CQs in nvme-pci (Keith Busch)
      - fix compile on odd configs that need to be forced to inline
        (Kees Cook)
      - one more quirk (Ilya Guterman)

 - Fix for missing allocation of an integrity buffer for some cases

 - Fix for a regression with ublk command cancelation

* tag 'block-6.15-20250515' of git://git.kernel.dk/linux:
  ublk: fix dead loop when canceling io command
  nvme-pci: add NVME_QUIRK_NO_DEEPEST_PS quirk for SOLIDIGM P44 Pro
  nvme: all namespaces in a subsystem must adhere to a common atomic write size
  nvme: multipath: enable BLK_FEAT_ATOMIC_WRITES for multipathing
  nvmet: pci-epf: remove NVMET_PCI_EPF_Q_IS_SQ
  nvmet: pci-epf: improve debug message
  nvmet: pci-epf: cleanup nvmet_pci_epf_raise_irq()
  nvmet: pci-epf: do not fall back to using INTX if not supported
  nvmet: pci-epf: clear completion queue IRQ flag on delete
  nvme-pci: acquire cq_poll_lock in nvme_poll_irqdisable
  nvme-pci: make nvme_pci_npages_prp() __always_inline
  block: always allocate integrity buffer when required
2025-05-16 10:21:25 -07:00
Ming Lei
dd24f87f65 ublk: fix dead loop when canceling io command
Commit:

f40139fde5 ("ublk: fix race between io_uring_cmd_complete_in_task and
		ublk_cancel_cmd")

adds a request state check in ublk_cancel_cmd(), and if the request is
started, skips canceling this uring_cmd.

However, the current uring_cmd may be in ACTIVE state without any block
request having reached it. Meanwhile, if the cached request in
tag_set.tags[tag] has already been delivered to the ublk server and
recycled, then this uring_cmd can't be canceled.

ublk requests are aborted in the ublk char device release handler, which
depends on canceling all ACTIVE uring_cmds. So this causes a dead loop.

Fix this issue by not taking a stale request into account when canceling
uring_cmd in ublk_cancel_cmd().

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Closes: https://lore.kernel.org/linux-block/mruqwpf4tqenkbtgezv5oxwq7ngyq24jzeyqy4ixzvivatbbxv@4oh2wzz4e6qn/
Fixes: f40139fde5 ("ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250515162601.77346-1-ming.lei@redhat.com
[axboe: rewording of commit message]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-15 10:53:41 -06:00
Christoph Hellwig
bbcacab2e8 brd: avoid extra xarray lookups on first write
The xarray can return the previous entry at a location.  Use this
fact to simplify the brd code when there is no existing page at
a location.  This also slightly improves the handling of racy
discards as we now always have a page under RCU protection by the
time we are ready to copy the data.
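
A sketch of the idea (simplified, not the exact brd code): install a
freshly allocated page, and if another writer got there first,
xa_cmpxchg() hands back the existing page in the same lookup:

  static struct page *insert_or_reuse_page(struct xarray *pages, pgoff_t idx,
                                           struct page *new_page, gfp_t gfp)
  {
      struct page *old = xa_cmpxchg(pages, idx, NULL, new_page, gfp);

      if (!old)
          return new_page;            /* our page was installed */
      if (xa_is_err(old)) {
          __free_page(new_page);
          return NULL;                /* xarray node allocation failed */
      }
      __free_page(new_page);          /* lost the race, reuse the old page */
      return old;
  }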

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250507060700.3929430-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-14 05:44:13 -06:00
Sergey Senozhatsky
cf42d4cccf zram: modernize writeback interface
The writeback interface supports a page_index=N parameter which performs
writeback of the given page.  Since we rarely need to write back just one
single page, the typical use case involves a number of writeback calls,
each performing writeback of one page:

  echo page_index=100 > zram0/writeback
  ...
  echo page_index=200 > zram0/writeback
  echo page_index=500 > zram0/writeback
  ...
  echo page_index=700 > zram0/writeback

One obvious downside of this is that it increases the number of syscalls. 
Less obvious, but a significantly more important downside, is that when
given only one page to post-process, zram cannot perform an optimal target
selection.  This becomes a critical limitation when writeback_limit is
enabled, because under writeback_limit we want to guarantee the highest
memory savings, hence we first need to write back pages that release the
highest amount of zsmalloc pool memory.

This patch adds page_indexes=LOW-HIGH parameter to the writeback
interface:

  echo page_indexes=100-200 page_indexes=500-700 > zram0/writeback

This gives zram a chance to apply an optimal target selection strategy on
each iteration of the writeback loop.

We also now permit multiple page_index parameters per call (previously
zram would recognize only one page_index) and a mix of single pages and
page ranges:

  echo page_index=42 page_index=99 page_indexes=100-200 \
       page_indexes=500-700 > zram0/writeback

Apart from that, the patch also unifies parameter passing to resemble
other "modern" zram device attributes (e.g. recompression), whereas the
old interface used a mixed scheme: value-less parameters for the mode and
a key=value format for page_index.  We still support the "old" value-less
format for compatibility reasons.

[senozhatsky@chromium.org: simplify parse_page_index() range checks, per Brian]
  Link: https://lkml.kernel.org/r/20250404015327.2427684-1-senozhatsky@chromium.org
[senozhatsky@chromium.org: fix uninitialized variable in zram_writeback_slots(), per Dan]
  Link: https://lkml.kernel.org/r/20250409112611.1154282-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20250327015818.4148660-1-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Richard Chang <richardycc@google.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 17:48:09 -07:00
Nhat Pham
56e5a103a7 zsmalloc: prefer the original page's node for compressed data
Currently, zsmalloc, zswap's and zram's backend memory allocator, does not
enforce any policy for the allocation of memory for the compressed data,
instead just adopting the memory policy of the task entering reclaim, or
the default policy (prefer local node) if no such policy is specified. 
This can lead to several pathological behaviors in multi-node NUMA
systems:

1. Systems with CXL-based memory tiering can encounter the following
   inversion with zswap/zram: the coldest pages demoted to the CXL tier
   can return to the high tier when they are reclaimed to compressed swap,
   creating memory pressure on the high tier.

2. Consider a direct reclaimer scanning nodes in order of allocation
   preference.  If it ventures into remote nodes, the memory it compresses
   there should stay there.  Trying to shift those contents over to the
   reclaiming thread's preferred node further *increases* its local
   pressure and provokes more spills.  The remote node is also the most
   likely to refault this data again.  This undesirable behavior was
   pointed out by Johannes Weiner in [1].

3. For zswap writeback, the zswap entries are organized in
   node-specific LRUs, based on the node placement of the original pages,
   allowing for targeted zswap writeback for specific nodes.

   However, the compressed data of a zswap entry can be placed on a
   different node from the LRU it is placed on.  This means that reclaim
   targeted at one node might not free up memory used for zswap entries in
   that node, but instead reclaim memory on a different node.

All of these issues will be resolved if the compressed data go to the same
node as the original page.  This patch encourages this behavior by having
zswap and zram pass the node of the original page to zsmalloc, and have
zsmalloc prefer the specified node if we need to allocate new (zs)pages
for the compressed data.
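
The core of the idea, as a simplified sketch (the helper name is
illustrative, not the actual zsmalloc interface): derive the preferred node
from the page being compressed and allocate with a node preference rather
than a hard binding:

  static struct page *alloc_zspage_near_origin(struct page *origin, gfp_t gfp)
  {
      int nid = page_to_nid(origin);

      /* no __GFP_THISNODE: prefer nid, but allow fallback to other nodes */
      return alloc_pages_node(nid, gfp, 0);
  }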

Note that we are not strictly binding the allocation to the preferred
node.  We still allow the allocation to fall back to other nodes when the
preferred node is full, or if we have zspages with slots available on a
different node.  This is OK, and still a strict improvement over the
status quo:

1. On a system with demotion enabled, we will generally prefer
   demotions over compressed swapping, and only swap when pages have
   already gone to the lowest tier.  This patch should achieve the desired
   effect for the most part.

2. If the preferred node is out of memory, letting the compressed data
   go to other nodes can be better than the alternative (OOMs, keeping
   cold memory unreclaimed, disk swapping, etc.).

3. If the allocation goes to a separate node because we have a zspage
   with slots available, at least we're not creating extra immediate
   memory pressure (since the space is already allocated).

4. While there can be mixing, we generally reclaim pages in same-node
   batches, which encourages zspage grouping that is more likely to go to
   the right node.

5. A strict binding would require partitioning zsmalloc by node, which
   is more complicated, and more prone to regression, since it reduces the
   storage density of zsmalloc.  We need to evaluate the tradeoff and
   benchmark carefully before adopting such an involved solution.

[1]: https://lore.kernel.org/linux-mm/20250331165306.GC2110528@cmpxchg.org/

[senozhatsky@chromium.org: coding-style fixes]
  Link: https://lkml.kernel.org/r/mnvexa7kseswglcqbhlot4zg3b3la2ypv2rimdl5mh5glbmhvz@wi6bgqn47hge
Link: https://lkml.kernel.org/r/20250402204416.3435994-1-nphamcs@gmail.com
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Gregory Price <gourry@gourry.net>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Sergey Senozhatsky <senozhatsky@chromium.org>	[zram, zsmalloc]
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosry.ahmed@linux.dev>	[zswap/zsmalloc]
Cc: "Huang, Ying" <ying.huang@linux.alibaba.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 17:48:06 -07:00
Linus Torvalds
cc9f0629ca Merge tag 'block-6.15-20250509' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - Fix for a regression in this series for loop and read/write iterator
   handling

 - zone append block update tweak

 - remove a broken IO priority test

 - NVMe pull request via Christoph:
      - unblock ctrl state transition for firmware update (Daniel
        Wagner)

* tag 'block-6.15-20250509' of git://git.kernel.dk/linux:
  block: remove test of incorrect io priority level
  nvme: unblock ctrl state transition for firmware update
  block: only update request sector if needed
  loop: Add sanity check for read/write_iter
2025-05-09 10:34:50 -07:00
Christoph Hellwig
a216081323 rnbd-srv: use bio_add_virt_nofail
Use the bio_add_virt_nofail helper to add a single kernel virtual address
to a bio, as that can't fail.
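
Illustrative call shape (buffer names are placeholders):

  /* a single contiguous kernel-virtual buffer always fits in one
   * segment here, so there is no return value to check */
  bio_add_virt_nofail(bio, reply_buf, reply_len);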

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Christoph Hellwig
af78428ed3 block: remove the q argument from blk_rq_map_kern
Remove the q argument from blk_rq_map_kern and the internal helpers
called by it as the queue can trivially be derived from the request.
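
Illustrative call-site change (prototypes simplified; the queue is
reachable as rq->q, so callers no longer pass it):

  /* before */
  ret = blk_rq_map_kern(q, rq, buf, len, GFP_NOIO);

  /* after */
  ret = blk_rq_map_kern(rq, buf, len, GFP_NOIO);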

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Yu Kuai
a26a339a65 brd: fix discard end sector
brd_do_discard() only aligned the start sector to the page boundary; this
can only work if the discard size is at least one page. For example:

blkdiscard /dev/ram0 -o 5120 -l 1024

In this case, size = 1024 - (8192 - 5120) = -2048, which underflows to a
huge unsigned value.

Fix the problem by rounding the end sector down with round_down(), as in
the sketch below.
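
A simplified sketch of the corrected bounds (not the exact brd code):
round the start up and the end down to page boundaries, so a sub-page
discard covers no whole page at all:

  static void brd_discard_bounds_sketch(sector_t sector, size_t size)
  {
      const sector_t sectors_per_page = PAGE_SIZE >> SECTOR_SHIFT;
      sector_t start = round_up(sector, sectors_per_page);
      sector_t end = round_down(sector + (size >> SECTOR_SHIFT),
                                sectors_per_page);

      while (start < end) {
          /* free the page backing [start, start + sectors_per_page) */
          start += sectors_per_page;
      }
  }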

Fixes: 9ead7efc6f ("brd: implement discard support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250506061756.2970934-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 07:42:27 -06:00
Yu Kuai
d4099f8893 brd: fix aligned_sector from brd_do_discard()
The aligned_sector calculation is just wrong; fix it with round_up().

Fixes: 9ead7efc6f ("brd: implement discard support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250506061756.2970934-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 07:42:27 -06:00
Yu Kuai
0e8acffc1b brd: protect page with rcu
Currently, after fetching the page by xa_load() in the IO path, there is
no protection and the page can be freed concurrently by discard:

cpu0
brd_submit_bio
 brd_do_bvec
  page = brd_lookup_page
                          cpu1
                          brd_submit_bio
                           brd_do_discard
                            page = __xa_erase()
                            __free_page()
  // page UAF

Fix the problem by protecting page with rcu.

Meanwhile, if the page has already been freed, also prevent the BUG_ON()
by skipping the write; the user will read back zero data later if there
is no page.
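
A simplified sketch of the read-side pattern (not the exact brd code; the
discard side must correspondingly defer the actual page free until after
an RCU grace period):

  static void brd_read_page_sketch(struct xarray *pages, pgoff_t idx,
                                   void *dst, size_t len, size_t off)
  {
      struct page *page;

      rcu_read_lock();
      page = xa_load(pages, idx);
      if (page)
          memcpy_from_page(dst, page, off, len);
      else
          memset(dst, 0, len);    /* page discarded: read back zeroes */
      rcu_read_unlock();
  }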

Fixes: 9ead7efc6f ("brd: implement discard support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250506061756.2970934-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 07:42:27 -06:00
Caleb Sander Mateos
e96ee7e1de ublk: consolidate UBLK_IO_FLAG_OWNED_BY_SRV checks
Every ublk I/O command except UBLK_IO_FETCH_REQ checks that the ublk_io
has UBLK_IO_FLAG_OWNED_BY_SRV set. Consolidate the separate checks into
a single one in __ublk_ch_uring_cmd(), analogous to those for
UBLK_IO_FLAG_ACTIVE and UBLK_IO_FLAG_NEED_GET_DATA.
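
Illustrative shape of the consolidated check (error code and surrounding
context simplified):

  /* in __ublk_ch_uring_cmd(): every command except FETCH_REQ requires
   * the io to be owned by the ublk server */
  if (cmd_op != UBLK_IO_FETCH_REQ &&
      !(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV))
      return -EINVAL;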

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250505172624.1121839-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-06 07:40:18 -06:00
Lizhi Xu
f5c84eff63 loop: Add sanity check for read/write_iter
Some file systems do not support read_iter/write_iter, such as selinuxfs
in this issue. So before calling them, first confirm that the interface
is supported, and only then call it.

This is relevant in that vfs_iter_read/write have this check, and the
removal of their use allowed syzbot to hit this issue.
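
A minimal sketch of such a check (helper name is illustrative, not
necessarily the exact loop code):

  static int loop_check_iter_ops_sketch(struct file *file, bool writable)
  {
      if (!file->f_op->read_iter)
          return -EINVAL;
      if (writable && !file->f_op->write_iter)
          return -EINVAL;
      return 0;
  }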

Fixes: f2fed441c6 ("loop: stop using vfs_iter__{read,write} for buffered I/O")
Reported-by: syzbot+6af973a3b8dfd2faefdc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6af973a3b8dfd2faefdc
Signed-off-by: Lizhi Xu <lizhi.xu@windriver.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250428143626.3318717-1-lizhi.xu@windriver.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05 07:18:05 -06:00
Jens Axboe
c595b5402f Merge branch 'block-6.15' into for-6.16/block
Merge 6.15 block fixes in, once again, to resolve conflicts with the
fixes for ublk that went into mainline and the 6.16 ublk updates.

* block-6.15:
  nvmet-auth: always free derived key data
  nvmet-tcp: don't restore null sk_state_change
  nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
  nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
  nvme-tcp: fix premature queue removal and I/O failover
  nvme-pci: add quirks for WDC Blue SN550 15b7:5009
  nvme-pci: add quirks for device 126f:1001
  nvme-pci: fix queue unquiesce check on slot_reset
  ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
  ublk: enhance check for register/unregister io buffer command
  ublk: decouple zero copy from user copy
  selftests: ublk: fix UBLK_F_NEED_GET_DATA

Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-05 07:13:44 -06:00
Linus Torvalds
e205ff48fa Merge tag 'block-6.15-20250502' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:

 - NVMe pull request via Christoph:
     - fix queue unquiesce check on PCI slot_reset (Keith Busch)
     - fix premature queue removal and I/O failover in nvme-tcp (Michael
       Liang)
     - don't restore null sk_state_change (Alistair Francis)
     - select CONFIG_TLS where needed (Alistair Francis)
     - always free derived key data (Hannes Reinecke)
     - more quirks (Wentao Guan)

 - ublk zero copy fix

 - ublk selftest fix for UBLK_F_NEED_GET_DATA

* tag 'block-6.15-20250502' of git://git.kernel.dk/linux:
  nvmet-auth: always free derived key data
  nvmet-tcp: don't restore null sk_state_change
  nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
  nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
  nvme-tcp: fix premature queue removal and I/O failover
  nvme-pci: add quirks for WDC Blue SN550 15b7:5009
  nvme-pci: add quirks for device 126f:1001
  nvme-pci: fix queue unquiesce check on slot_reset
  ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
  ublk: enhance check for register/unregister io buffer command
  ublk: decouple zero copy from user copy
  selftests: ublk: fix UBLK_F_NEED_GET_DATA
2025-05-02 10:24:37 -07:00