Commit Graph

338 Commits

Author SHA1 Message Date
Linus Torvalds
69f637c335 Merge tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "Nothing major in here:

   - NVMe pull request from Christoph:
        - nvmet passthrough improvements (Chaitanya Kulkarni)
        - fcloop error injection support (James Smart)
        - read-only support for zoned namespaces without Zone Append
          (Javier González)
        - improve some error message (Minwoo Im)
        - reject I/O to offline fabrics namespaces (Victor Gladkov)
        - PCI queue allocation cleanups (Niklas Schnelle)
        - remove an unused allocation in nvmet (Amit Engel)
        - a Kconfig spelling fix (Colin Ian King)
        - nvme_req_qid simplication (Baolin Wang)

   - MD pull request from Song:
        - Fix race condition in md_ioctl() (Dae R. Jeong)
        - Initialize read_slot properly for raid10 (Kevin Vigor)
        - Code cleanup (Pankaj Gupta)
        - md-cluster resync/reshape fix (Zhao Heming)

   - Move null_blk into its own directory (Damien Le Moal)

   - null_blk zone and discard improvements (Damien Le Moal)

   - bcache race fix (Dongsheng Yang)

   - Set of rnbd fixes/improvements (Gioh Kim, Guoqing Jiang, Jack Wang,
     Lutz Pogrell, Md Haris Iqbal)

   - lightnvm NULL pointer deref fix (tangzhenhao)

   - sr in_interrupt() removal (Sebastian Andrzej Siewior)

   - FC endpoint security support for s390/dasd (Jan Höppner, Sebastian
     Ott, Vineeth Vijayan). From the s390 arch guys, arch bits included
     as it made it easier for them to funnel the feature through the
     block driver tree.

   - Follow up fixes (Colin Ian King)"

* tag 'for-5.11/drivers-2020-12-14' of git://git.kernel.dk/linux-block: (64 commits)
  block: drop dead assignments in loop_init()
  sr: Remove in_interrupt() usage in sr_init_command().
  sr: Switch the sector size back to 2048 if sr_read_sector() changed it.
  cdrom: Reset sector_size back it is not 2048.
  drivers/lightnvm: fix a null-ptr-deref bug in pblk-core.c
  null_blk: Move driver into its own directory
  null_blk: Allow controlling max_hw_sectors limit
  null_blk: discard zones on reset
  null_blk: cleanup discard handling
  null_blk: Improve implicit zone close
  null_blk: improve zone locking
  block: Align max_hw_sectors to logical blocksize
  null_blk: Fail zone append to conventional zones
  null_blk: Fix zone size initialization
  bcache: fix race between setting bdev state to none and new write request direct to backing
  block/rnbd: fix a null pointer dereference on dev->blk_symlink_name
  block/rnbd-clt: Dynamically alloc buffer for pathname & blk_symlink_name
  block/rnbd: call kobject_put in the failure path
  Documentation/ABI/rnbd-srv: add document for force_close
  block/rnbd-srv: close a mapped device from server side.
  ...
2020-12-16 13:09:32 -08:00
Lukas Bulwahn
aeb2b0b1a3 block: drop dead assignments in loop_init()
Commit 8410d38c25 ("loop: use __register_blkdev to allocate devices on
demand") simplified loop_init(); so computing the range of the block region
is not required anymore and can be dropped.

Drop dead assignments in loop_init().

As compilers will detect these unneeded assignments and optimize this,
the resulting object code is identical before and after this change.

No functional change. No change in object code.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-12 11:18:56 -07:00
Christoph Hellwig
a782483cc1 block: remove the nr_sects field in struct hd_struct
Now that the hd_struct always has a block device attached to it, there is
no need for having two size field that just get out of sync.

Additionally the field in hd_struct did not use proper serialization,
possibly allowing for torn writes.  By only using the block_device field
this problem also gets fixed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Coly Li <colyli@suse.de>			[bcache]
Acked-by: Chao Yu <yuchao0@huawei.com>			[f2fs]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:40 -07:00
Christoph Hellwig
37c3fc9abb block: simplify the block device claiming interface
Stop passing the whole device as a separate argument given that it
can be trivially deducted and cleanup the !holder debug check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
a954ea8120 block: remove ->bd_contains
Now that each hd_struct has a reference to the corresponding
block_device, there is no need for the bd_contains pointer.  Add
a bdev_whole() helper to look up the whole device block_device
struture instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
4e7b5671c6 block: remove i_bdev
Switch the block device lookup interfaces to directly work with a dev_t
so that struct block_device references are only acquired by the
blkdev_get variants (and the blk-cgroup special case).  This means that
we now don't need an extra reference in the inode and can generally
simplify handling of struct block_device to keep the lookups contained
in the core block layer code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de>		[bcache]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
f46f2a3198 loop: do not call set_blocksize
set_blocksize is used by file systems to use their preferred buffer cache
block size.  Block drivers should not set it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-12-01 14:53:39 -07:00
Christoph Hellwig
449f4ec989 block: remove the update_bdev parameter to set_capacity_revalidate_and_notify
The update_bdev argument is always set to true, so remove it.  Also
rename the function to the slighly less verbose set_capacity_and_notify,
as propagating the disk size to the block device isn't really
revalidation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:14 -07:00
Christoph Hellwig
3b4f85d02a loop: let set_capacity_revalidate_and_notify update the bdev size
There is no good reason to call revalidate_disk_size separately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:34:14 -07:00
Christoph Hellwig
8410d38c25 loop: use __register_blkdev to allocate devices on demand
Use the simpler mechanism attached to major_name to allocate a brd device
when a currently unregistered minor is accessed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:30 -07:00
Christoph Hellwig
7a2f0ce19f loop: use set_disk_ro
Use set_disk_ro instead of set_device_ro to match all other block
drivers and to ensure all partitions mirror the read-only flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-16 08:14:29 -07:00
Petr Vorel
c01a21b777 loop: Fix occasional uevent drop
Commit 716ad0986c ("loop: Switch to set_capacity_revalidate_and_notify")
causes an occasional drop of loop device uevent, which are no longer
triggered in loop_set_size() but in a different part of code.

Bug is reproducible with LTP test uevent01 [1]:

i=0; while true; do
    i=$((i+1)); echo "== $i =="
    lsmod |grep -q loop && rmmod -f loop
    ./uevent01 || break
done

Put back triggering through code called in loop_set_size().

Fix required to add yet another parameter to
set_capacity_revalidate_and_notify().

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/uevents/uevent01.c

[hch: rebased on a different change to the prototype of
 set_capacity_revalidate_and_notify]

Cc: stable@vger.kernel.org # v5.9
Fixes: 716ad0986c ("loop: Switch to set_capacity_revalidate_and_notify")
Reported-by: <ltp@lists.linux.it>
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-11-12 13:59:04 -07:00
Christoph Hellwig
611bee526b block: replace bd_set_size with bd_set_nr_sectors
Replace bd_set_size with a version that takes the number of sectors
instead, as that fits most of the current and future callers much better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-01 16:49:25 -06:00
Linus Torvalds
4d41ead6ea Merge tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - nbd timeout fix (Hou)

 - device size fix for loop LOOP_CONFIGURE (Martijn)

 - MD pull from Song with raid5 stripe size fix (Yufen)

* tag 'block-5.9-2020-08-28' of git://git.kernel.dk/linux-block:
  md/raid5: make sure stripe_size as power of two
  loop: Set correct device size when using LOOP_CONFIGURE
  nbd: restore default timeout when setting it to zero
2020-08-28 16:38:29 -07:00
Martijn Coenen
79e5dc59e2 loop: Set correct device size when using LOOP_CONFIGURE
The device size calculation was done before processing the loop
configuration, which meant that the we set the size on the underlying
block device incorrectly in case lo_offset/lo_sizelimit were set in the
configuration. Delay computing the size until we've setup the device
parameters correctly.

Fixes: 3448914e8cc5("loop: Add LOOP_CONFIGURE ioctl")
Reported-by: Lennart Poettering <mzxreary@0pointer.de>
Tested-by: Yang Xu <xuyang2018.jy@cn.fujitsu.com>
Signed-off-by: Martijn Coenen <maco@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-26 09:30:31 -06:00
Linus Torvalds
c41c3ec4a2 Merge tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:

 - NVMe pull request from Sagi:
       - nvme completion rework from Christoph and Chao that mostly came
         from a bit of divergence of how we classify errors related to
         pathing/retry etc.
       - nvmet passthru fixes from Chaitanya
       - minor nvmet fixes from Amit and I
       - mpath round-robin path selection fix from Martin
       - ignore noiob for zoned devices from Keith
       - minor nvme-fc fix from Tianjia"

 - BFQ cgroup leak fix (Dmitry)

 - block layer MAINTAINERS addition (Geert)

 - fix null_blk FUA checking (Hou)

 - get_max_io_size() size fix (Keith)

 - fix block page_is_mergeable() for compound pages (Matthew)

 - discard granularity fixes (Ming)

 - IO scheduler ordering fix (Ming)

 - misc fixes

* tag 'io_uring-5.9-2020-08-23' of git://git.kernel.dk/linux-block: (31 commits)
  null_blk: fix passing of REQ_FUA flag in null_handle_rq
  nvmet: Disable keep-alive timer when kato is cleared to 0h
  nvme: redirect commands on dying queue
  nvme: just check the status code type in nvme_is_path_error
  nvme: refactor command completion
  nvme: rename and document nvme_end_request
  nvme: skip noiob for zoned devices
  nvme-pci: fix PRP pool size
  nvme-pci: Use u32 for nvme_dev.q_depth and nvme_queue.q_depth
  nvme: Use spin_lock_irq() when taking the ctrl->lock
  nvmet: call blk_mq_free_request() directly
  nvmet: fix oops in pt cmd execution
  nvmet: add ns tear down label for pt-cmd handling
  nvme: multipath: round-robin: eliminate "fallback" variable
  nvme: multipath: round-robin: fix single non-optimized path case
  nvme-fc: Fix wrong return value in __nvme_fc_init_request()
  nvmet-passthru: Reject commands with non-sgl flags set
  nvmet: fix a memory leak
  blkcg: fix memleak for iolatency
  MAINTAINERS: Add missing header files to BLOCK LAYER section
  ...
2020-08-24 11:53:15 -07:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
Ming Lei
bcb21c8cc9 block: loop: set discard granularity and alignment for block device backed loop
In case of block device backend, if the backend supports write zeros, the
loop device will set queue flag of QUEUE_FLAG_DISCARD. However,
limits.discard_granularity isn't setup, and this way is wrong,
see the following description in Documentation/ABI/testing/sysfs-block:

	A discard_granularity of 0 means that the device does not support
	discard functionality.

Especially 9b15d109a6 ("block: improve discard bio alignment in
__blkdev_issue_discard()") starts to take q->limits.discard_granularity
for computing max discard sectors. And zero discard granularity may cause
kernel oops, or fail discard request even though the loop queue claims
discard support via QUEUE_FLAG_DISCARD.

Fix the issue by setup discard granularity and alignment.

Fixes: c52abf5630 ("loop: Better discard support for block devices")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@suse.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Evan Green <evgreen@chromium.org>
Cc: Gwendal Grignou <gwendal@chromium.org>
Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Cc: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-17 06:58:36 -07:00
Lennart Poettering
fe6a8fc5ed loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE
When LOOP_CONFIGURE is used with LO_FLAGS_PARTSCAN we need to propagate
this into the GENHD_FL_NO_PART_SCAN. LOOP_SETSTATUS does this,
LOOP_CONFIGURE doesn't so far. Effect is that setting up a loopback
device with partition scanning doesn't actually work when LOOP_CONFIGURE
is issued, though it works fine with LOOP_SETSTATUS.

Let's correct that and propagate the flag in LOOP_CONFIGURE too.

Fixes: 3448914e8cc5("loop: Add LOOP_CONFIGURE ioctl")

Signed-off-by: Lennart Poettering <lennart@poettering.net>
Acked-by: Martijn Coenen <maco@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-08-11 07:13:06 -06:00
Christoph Hellwig
ecbe6bc000 block: use bd_prepare_to_claim directly in the loop driver
The arcane magic in bd_start_claiming is only needed to be able to claim
a block_device that hasn't been fully set up.  Switch the loop driver
that claims from the ioctl path with a fully set up struct block_device
to just use the much simpler bd_prepare_to_claim directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-07-16 09:35:44 -06:00
Luis Chamberlain
200f933772 loop: be paranoid on exit and prevent new additions / removals
Be pedantic on removal as well and hold the mutex.
This should prevent uses of addition while we exit.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Christoph Hellwig
15f73f5b3e blk-mq: move failure injection out of blk_mq_complete_request
Move the call to blk_should_fake_timeout out of blk_mq_complete_request
and into the drivers, skipping call sites that are obvious error
handlers, and remove the now superflous blk_mq_force_complete_rq helper.
This ensures we don't keep injecting errors into completions that just
terminate the Linux request after the hardware has been reset or the
command has been aborted.

Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:57 -06:00
Zheng Bin
f4bd34b139 loop: replace kill_bdev with invalidate_bdev
When a filesystem is mounted on a loop device and on a loop ioctl
LOOP_SET_STATUS64, because of kill_bdev, buffer_head mappings are getting
destroyed.
kill_bdev
  truncate_inode_pages
    truncate_inode_pages_range
      do_invalidatepage
        block_invalidatepage
          discard_buffer  -->clear BH_Mapped flag

sb_bread
  __bread_gfp
  bh = __getblk_gfp
  -->discard_buffer clear BH_Mapped flag
  __bread_slow
    submit_bh
      submit_bh_wbc
        BUG_ON(!buffer_mapped(bh))  --> hit this BUG_ON

Fixes: 5db470e229 ("loop: drop caches if offset or block_size are changed")
Signed-off-by: Zheng Bin <zhengbin13@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-18 09:24:35 -06:00
Linus Torvalds
a58dfea297 Merge tag 'block-5.8-2020-06-11' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
 "Some followup fixes for this merge window. In particular:

   - Seqcount write missing preemption disable for stats (Ahmed)

   - blktrace fixes (Chaitanya)

   - Redundant initializations (Colin)

   - Various small NVMe fixes (Chaitanya, Christoph, Daniel, Max,
     Niklas, Rikard)

   - loop flag bug regression fix (Martijn)

   - blk-mq tagging fixes (Christoph, Ming)"

* tag 'block-5.8-2020-06-11' of git://git.kernel.dk/linux-block:
  umem: remove redundant initialization of variable ret
  pktcdvd: remove redundant initialization of variable ret
  nvmet: fail outstanding host posted AEN req
  nvme-pci: use simple suspend when a HMB is enabled
  nvme-fc: don't call nvme_cleanup_cmd() for AENs
  nvmet-tcp: constify nvmet_tcp_ops
  nvme-tcp: constify nvme_tcp_mq_ops and nvme_tcp_admin_mq_ops
  nvme: do not call del_gendisk() on a disk that was never added
  blk-mq: fix blk_mq_all_tag_iter
  blk-mq: split out a __blk_mq_get_driver_tag helper
  blktrace: fix endianness for blk_log_remap()
  blktrace: fix endianness in get_pdu_int()
  blktrace: use errno instead of bi_status
  block: nr_sects_write(): Disable preemption on seqcount write
  block: remove the error argument to the block_bio_complete tracepoint
  loop: Fix wrong masking of status flags
  block/bio-integrity: don't free 'buf' if bio_integrity_add_page() failed
2020-06-11 16:07:33 -07:00
Martijn Coenen
6ac92fb5cd loop: Fix wrong masking of status flags
In faf1d25440, loop_set_status() now assigns lo_status directly from
the passed in lo_flags, but then fixes it up by masking out flags that
can't be set by LOOP_SET_STATUS; unfortunately the mask was negated.

Re-ran all ltp ioctl_loop tests, and they all passed.

Pass run of the previously failing one:

tst_test.c:1247: INFO: Timeout per run is 0h 05m 00s
tst_device.c:88: INFO: Found free device 0 '/dev/loop0'
ioctl_loop01.c:49: PASS: /sys/block/loop0/loop/partscan = 0
ioctl_loop01.c:50: PASS: /sys/block/loop0/loop/autoclear = 0
ioctl_loop01.c:51: PASS: /sys/block/loop0/loop/backing_file =
'/tmp/ZRJ6H4/test.img'
ioctl_loop01.c:65: PASS: get expected lo_flag 12
ioctl_loop01.c:67: PASS: /sys/block/loop0/loop/partscan = 1
ioctl_loop01.c:68: PASS: /sys/block/loop0/loop/autoclear = 1
ioctl_loop01.c:77: PASS: access /dev/loop0p1 succeeds
ioctl_loop01.c:83: PASS: access /sys/block/loop0/loop0p1 succeeds

Summary:
passed   8
failed   0
skipped  0
warnings 0

Fixes: faf1d25440 ("loop: Clean up LOOP_SET_STATUS lo_flags handling")
Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Martijn Coenen <maco@android.com>
Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-04 21:13:45 -06:00
Linus Torvalds
96ed320d52 Merge tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull DAX updates part one from Darrick Wong:
 "After many years of LKML-wrangling about how to enable programs to
  query and influence the file data access mode (DAX) when a filesystem
  resides on storage devices such as persistent memory, Ira Weiny has
  emerged with a proposed set of standard behaviors that has not been
  shot down by anyone! We're more or less standardizing on the current
  XFS behavior and adapting ext4 to do the same.

  This is the first of a handful pull requests that will make ext4 and
  XFS present a consistent interface for user programs that care about
  DAX. We add a statx attribute that programs can check to see if DAX is
  enabled on a particular file. Then, we update the DAX documentation to
  spell out the user-visible behaviors that filesystems will guarantee
  (until the next storage industry shakeup). The on-disk inode flag has
  been in XFS for a few years now.

  Summary:

   - Clean up io_is_direct.

   - Add a new statx flag to indicate when file data access is being
     done via DAX (as opposed to the page cache).

   - Update the documentation for how system administrators and
     application programmers can take advantage of the (still
     experimental DAX) feature"

Link: https://lore.kernel.org/lkml/20200505002016.1085071-1-ira.weiny@intel.com/

* tag 'vfs-5.8-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  Documentation/dax: Update Usage section
  fs/stat: Define DAX statx attribute
  fs: Remove unneeded IS_DAX() check in io_is_direct()
2020-06-02 19:45:12 -07:00
Linus Torvalds
bce159d734 Merge tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block
Pull block driver updates from Jens Axboe:
 "On top of the core changes, here are the block driver changes for this
  merge window:

   - NVMe changes:
        - NVMe over Fibre Channel protocol updates, which also reach
          over to drivers/scsi/lpfc (James Smart)
        - namespace revalidation support on the target (Anthony
          Iliopoulos)
        - gcc zero length array fix (Arnd Bergmann)
        - nvmet cleanups (Chaitanya Kulkarni)
        - misc cleanups and fixes (me, Keith Busch, Sagi Grimberg)
        - use a SRQ per completion vector (Max Gurtovoy)
        - fix handling of runtime changes to the queue count (Weiping
          Zhang)
        - t10 protection information support for nvme-rdma and
          nvmet-rdma (Israel Rukshin and Max Gurtovoy)
        - target side AEN improvements (Chaitanya Kulkarni)
        - various fixes and minor improvements all over, icluding the
          nvme part of the lpfc driver"

   - Floppy code cleanup series (Willy, Denis)

   - Floppy contention fix (Jiri)

   - Loop CONFIGURE support (Martijn)

   - bcache fixes/improvements (Coly, Joe, Colin)

   - q->queuedata cleanups (Christoph)

   - Get rid of ioctl_by_bdev (Christoph, Stefan)

   - md/raid5 allocation fixes (Coly)

   - zero length array fixes (Gustavo)

   - swim3 task state fix (Xu)"

* tag 'for-5.8/drivers-2020-06-01' of git://git.kernel.dk/linux-block: (166 commits)
  bcache: configure the asynchronous registertion to be experimental
  bcache: asynchronous devices registration
  bcache: fix refcount underflow in bcache_device_free()
  bcache: Convert pr_<level> uses to a more typical style
  bcache: remove redundant variables i and n
  lpfc: Fix return value in __lpfc_nvme_ls_abort
  lpfc: fix axchg pointer reference after free and double frees
  lpfc: Fix pointer checks and comments in LS receive refactoring
  nvme: set dma alignment to qword
  nvmet: cleanups the loop in nvmet_async_events_process
  nvmet: fix memory leak when removing namespaces and controllers concurrently
  nvmet-rdma: add metadata/T10-PI support
  nvmet: add metadata support for block devices
  nvmet: add metadata/T10-PI support
  nvme: add Metadata Capabilities enumerations
  nvmet: rename nvmet_check_data_len to nvmet_check_transfer_len
  nvmet: rename nvmet_rw_len to nvmet_rw_data_len
  nvmet: add metadata characteristics for a namespace
  nvme-rdma: add metadata/T10-PI support
  nvme-rdma: introduce nvme_rdma_sgl structure
  ...
2020-06-02 15:37:03 -07:00
Linus Torvalds
750a02ab8d Merge tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
 "Core block changes that have been queued up for this release:

   - Remove dead blk-throttle and blk-wbt code (Guoqing)

   - Include pid in blktrace note traces (Jan)

   - Don't spew I/O errors on wouldblock termination (me)

   - Zone append addition (Johannes, Keith, Damien)

   - IO accounting improvements (Konstantin, Christoph)

   - blk-mq hardware map update improvements (Ming)

   - Scheduler dispatch improvement (Salman)

   - Inline block encryption support (Satya)

   - Request map fixes and improvements (Weiping)

   - blk-iocost tweaks (Tejun)

   - Fix for timeout failing with error injection (Keith)

   - Queue re-run fixes (Douglas)

   - CPU hotplug improvements (Christoph)

   - Queue entry/exit improvements (Christoph)

   - Move DMA drain handling to the few drivers that use it (Christoph)

   - Partition handling cleanups (Christoph)"

* tag 'for-5.8/block-2020-06-01' of git://git.kernel.dk/linux-block: (127 commits)
  block: mark bio_wouldblock_error() bio with BIO_QUIET
  blk-wbt: rename __wbt_update_limits to wbt_update_limits
  blk-wbt: remove wbt_update_limits
  blk-throttle: remove tg_drain_bios
  blk-throttle: remove blk_throtl_drain
  null_blk: force complete for timeout request
  blk-mq: drain I/O when all CPUs in a hctx are offline
  blk-mq: add blk_mq_all_tag_iter
  blk-mq: open code __blk_mq_alloc_request in blk_mq_alloc_request_hctx
  blk-mq: use BLK_MQ_NO_TAG in more places
  blk-mq: rename BLK_MQ_TAG_FAIL to BLK_MQ_NO_TAG
  blk-mq: move more request initialization to blk_mq_rq_ctx_init
  blk-mq: simplify the blk_mq_get_request calling convention
  blk-mq: remove the bio argument to ->prepare_request
  nvme: force complete cancelled requests
  blk-mq: blk-mq: provide forced completion method
  block: fix a warning when blkdev.h is included for !CONFIG_BLOCK builds
  block: blk-crypto-fallback: remove redundant initialization of variable err
  block: reduce part_stat_lock() scope
  block: use __this_cpu_add() instead of access by smp_processor_id()
  ...
2020-06-02 15:29:19 -07:00
NeilBrown
a37b0715dd mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE
PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled.  The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics.  This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi
for an NFS filesystem or loop block device etc) do not affect the
throttling of the daemon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi.  This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
 - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
 - removes the 25% bonus that that flag gives, and
 - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
   global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads.  This patch does *not* change the behvaiour
for real-time threads, so it is now different from the behaviour of nfsd
and loop tasks.  I don't know what is wanted for realtime.

[akpm@linux-foundation.org: coding style fixes]
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
Cc: Christoph Hellwig <hch@lst.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:08 -07:00
Ming Lei
bf0beec060 blk-mq: drain I/O when all CPUs in a hctx are offline
Most of blk-mq drivers depend on managed IRQ's auto-affinity to setup
up queue mapping. Thomas mentioned the following point[1]:

"That was the constraint of managed interrupts from the very beginning:

 The driver/subsystem has to quiesce the interrupt line and the associated
 queue _before_ it gets shutdown in CPU unplug and not fiddle with it
 until it's restarted by the core when the CPU is plugged in again."

However, current blk-mq implementation doesn't quiesce hw queue before
the last CPU in the hctx is shutdown.  Even worse, CPUHP_BLK_MQ_DEAD is a
cpuhp state handled after the CPU is down, so there isn't any chance to
quiesce the hctx before shutting down the CPU.

Add new CPUHP_AP_BLK_MQ_ONLINE state to stop allocating from blk-mq hctxs
where the last CPU goes away, and wait for completion of in-flight
requests.  This guarantees that there is no inflight I/O before shutting
down the managed IRQ.

Add a BLK_MQ_F_STACKING and set it for dm-rq and loop, so we don't need
to wait for completion of in-flight requests from these drivers to avoid
a potential dead-lock. It is safe to do this for stacking drivers as those
do not use interrupts at all and their I/O completions are triggered by
underlying devices I/O completion.

[1] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/

[hch: different retry mechanism, merged two patches, minor cleanups]

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-29 10:23:25 -06:00
Colin Ian King
d29b92f57e loop: remove redundant assignment to variable error
The variable error is being assigned a value that is never
read so the assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-24 11:13:36 -06:00
Martijn Coenen
3448914e8c loop: Add LOOP_CONFIGURE ioctl
This allows userspace to completely setup a loop device with a single
ioctl, removing the in-between state where the device can be partially
configured - eg the loop device has a backing file associated with it,
but is reading from the wrong offset.

Besides removing the intermediate state, another big benefit of this
ioctl is that LOOP_SET_STATUS can be slow; the main reason for this
slowness is that LOOP_SET_STATUS(64) calls blk_mq_freeze_queue() to
freeze the associated queue; this requires waiting for RCU
synchronization, which I've measured can take about 15-20ms on this
device on average.

In addition to doing what LOOP_SET_STATUS can do, LOOP_CONFIGURE can
also be used to:
- Set the correct block size immediately by setting
  loop_config.block_size (avoids LOOP_SET_BLOCK_SIZE)
- Explicitly request direct I/O mode by setting LO_FLAGS_DIRECT_IO
  in loop_config.info.lo_flags (avoids LOOP_SET_DIRECT_IO)
- Explicitly request read-only mode by setting LO_FLAGS_READ_ONLY
  in loop_config.info.lo_flags

Here's setting up ~70 regular loop devices with an offset on an x86
Android device, using LOOP_SET_FD and LOOP_SET_STATUS:

vsoc_x86:/system/apex # time for i in `seq 30 100`;
do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done
    0m03.40s real     0m00.02s user     0m00.03s system

Here's configuring ~70 devices in the same way, but using a modified
losetup that uses the new LOOP_CONFIGURE ioctl:

vsoc_x86:/system/apex # time for i in `seq 30 100`;
do losetup -r -o 4096 /dev/block/loop$i com.android.adbd.apex; done
    0m01.94s real     0m00.01s user     0m00.01s system

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:35 -06:00
Martijn Coenen
faf1d25440 loop: Clean up LOOP_SET_STATUS lo_flags handling
LOOP_SET_STATUS(64) will actually allow some lo_flags to be modified; in
particular, LO_FLAGS_AUTOCLEAR can be set and cleared, whereas
LO_FLAGS_PARTSCAN can be set to request a partition scan. Make this
explicit by updating the UAPI to include the flags that can be
set/cleared using this ioctl.

The implementation can then blindly take over the passed in flags,
and use the previous flags for those flags that can't be set / cleared
using LOOP_SET_STATUS.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:35 -06:00
Martijn Coenen
571fae6e29 loop: Rework lo_ioctl() __user argument casting
In preparation for a new ioctl that needs to copy_from_user(); makes the
code easier to read as well.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:35 -06:00
Martijn Coenen
62ab466ca8 loop: Move loop_set_status_from_info() and friends up
So we can use it without forward declaration. This is a separate commit
to make it easier to verify that this is just a move, without functional
modifications.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:35 -06:00
Martijn Coenen
0c3796c244 loop: Factor out configuring loop from status
Factor out this code into a separate function, so it can be reused by
other code more easily.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:34 -06:00
Martijn Coenen
0a6ed1b5ff loop: Remove figure_loop_size()
This function was now only used by loop_set_capacity(). Just open code
the remaining code in the caller instead.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:34 -06:00
Martijn Coenen
b0bd158dd6 loop: Refactor loop_set_status() size calculation
figure_loop_size() calculates the loop size based on the passed in
parameters, but at the same time it updates the offset and sizelimit
parameters in the loop device configuration. That is a somewhat
unexpected side effect of a function with this name, and it is only only
needed by one of the two callers of this function - loop_set_status().

Move the lo_offset and lo_sizelimit assignment back into loop_set_status(),
and use the newly factored out functions to validate and apply the newly
calculated size. This allows us to get rid of figure_loop_size() in a
follow-up commit.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:34 -06:00
Martijn Coenen
716ad0986c loop: Switch to set_capacity_revalidate_and_notify()
This was recently added to block/genhd.c, and takes care of both
updating the capacity and notifying userspace of the new size.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:34 -06:00
Martijn Coenen
5795b6f560 loop: Factor out setting loop device size
This code is used repeatedly.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:34 -06:00
Martijn Coenen
083a6a5078 loop: Remove sector_t truncation checks
sector_t is now always u64, so we don't need to check for truncation.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:34 -06:00
Martijn Coenen
7c5014b098 loop: Call loop_config_discard() only after new config is applied
loop_set_status() calls loop_config_discard() to configure discard for
the loop device; however, the discard configuration depends on whether
the loop device uses encryption, and when we call it the encryption
configuration has not been updated yet. Move the call down so we apply
the correct discard configuration based on the new configuration.

Signed-off-by: Martijn Coenen <maco@android.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-05-21 08:20:34 -06:00
Ira Weiny
efbe3c2493 fs: Remove unneeded IS_DAX() check in io_is_direct()
Remove the check because DAX now has it's own read/write methods and
file systems which support DAX check IS_DAX() prior to IOCB_DIRECT on
their own.  Therefore, it does not matter if the file state is DAX when
the iocb flags are created.

Also remove io_is_direct() as it is just a simple flag check.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04 08:49:39 -07:00
Evan Green
c52abf5630 loop: Better discard support for block devices
If the backing device for a loop device is itself a block device,
then mirror the "write zeroes" capabilities of the underlying
block device into the loop device. Copy this capability into both
max_write_zeroes_sectors and max_discard_sectors of the loop device.

The reason for this is that REQ_OP_DISCARD on a loop device translates
into blkdev_issue_zeroout(), rather than blkdev_issue_discard(). This
presents a consistent interface for loop devices (that discarded data
is zeroed), regardless of the backing device type of the loop device.
There should be no behavior change for loop devices backed by regular
files.

This change fixes blktest block/003, and removes an extraneous
error print in block/013 when testing on a loop device backed
by a block device that does not support discard.

Signed-off-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
[used updated version of Evan's comment in loop_config_discard()]
[moved backingq to local scope, removed redundant braces]
Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03 13:44:22 -06:00
Evan Green
8cd55087dc loop: Report EOPNOTSUPP properly
Properly plumb out EOPNOTSUPP from loop driver operations, which may
get returned when for instance a discard operation is attempted but not
supported by the underlying block device. Before this change, everything
was reported in the log as an I/O error, which is scary and not
helpful in debugging.

Signed-off-by: Evan Green <evgreen@chromium.org>
Reviewed-by: Gwendal Grignou <gwendal@chromium.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Andrzej Pietrasiewicz <andrzej.p@collabora.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-04-03 13:44:20 -06:00
Martijn Coenen
0fbcf57982 loop: Only freeze block queue when needed.
__loop_update_dio() can be called as a part of loop_set_fd(), when the
block queue is not yet up and running; avoid freezing the block queue in
that case, since that is an expensive operation.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Martijn Coenen <maco@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 14:10:43 -06:00
Martijn Coenen
7e81f99afd loop: Only change blocksize when needed.
Return early in loop_set_block_size() if the requested block size is
identical to the one we already have; this avoids expensive calls to
freeze the block queue.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martijn Coenen <maco@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 14:10:41 -06:00
Christoph Hellwig
f0b870df80 block: remove (__)blkdev_reread_part as an exported API
In general drivers should never mess with partition tables directly.
Unfortunately s390 and loop do for somewhat historic reasons, but they
can use bdev_disk_changed directly instead when we export it as they
satisfy the sanity checks we have in __blkdev_reread_part.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>	[dasd]
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-14 07:43:59 -07:00
Darrick J. Wong
efcfec579f loop: fix no-unmap write-zeroes request behavior
Currently, if the loop device receives a WRITE_ZEROES request, it asks
the underlying filesystem to punch out the range.  This behavior is
correct if unmapping is allowed.  However, a NOUNMAP request means that
the caller doesn't want us to free the storage backing the range, so
punching out the range is incorrect behavior.

To satisfy a NOUNMAP | WRITE_ZEROES request, loop should ask the
underlying filesystem to FALLOC_FL_ZERO_RANGE, which is (according to
the fallocate documentation) required to ensure that the entire range is
backed by real storage, which suffices for our purposes.

Fixes: 19372e2769 ("loop: implement REQ_OP_WRITE_ZEROES")
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-11-01 08:43:20 -06:00
Martijn Coenen
85560117d0 loop: change queue block size to match when using DIO
The loop driver assumes that if the passed in fd is opened with
O_DIRECT, the caller wants to use direct I/O on the loop device.
However, if the underlying block device has a different block size than
the loop block queue, direct I/O can't be enabled. Instead of requiring
userspace to manually change the blocksize and re-enable direct I/O,
just change the queue block sizes to match, as well as the io_min size.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Martijn Coenen <maco@android.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-10-01 09:36:01 -06:00