Commit Graph

1030501 Commits

Author SHA1 Message Date
Sagi Grimberg
0866200ed7 nvme: Have NVME_FABRICS select NVME_CORE instead of transport drivers
Transport drivers need both core and fabrics modules, instead of
selecting both, have the selection transitive such that NVME_FABRICS
selects NVME_CORE and transport drivers select NVME_FABRICS.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 21:03:18 +02:00
Amit Engel
e19e9f47f3 nvmet: check that host sqsize does not exceed ctrl MQES
Check that host sqsize is not greater-than Maximum Queue Entries
Supported (MQES) value supported by the controller.

Signed-off-by: Amit Engel <amit.engel@dell.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:25 +02:00
Amit Engel
b71df12605 nvmet: avoid duplicate qid in connect cmd
According to the NVMe specification, if the host sends a Connect command
specifying a queue id which has already been created, a status value of
NVME_SC_CMD_SEQ_ERROR is returned.

Signed-off-by: Amit Engel <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:25 +02:00
Amit Engel
e804d5abe2 nvmet: pass back cntlid on successful completion
According to the NVMe specification, the response dword 0 value of the
Connect command is based on status code: return cntlid for successful
compeltion return IPO and IATTR for connect invalid parameters.  Fix
a missing error information for a zero sized queue, and return the
cntlid also for I/O queue Connect commands.

Signed-off-by: Amit Engel <amit.engel@dell.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Ruozhu Li
85032874f8 nvme-rdma: don't update queue count when failing to set io queues
We update ctrl->queue_count and schedule another reconnect when io queue
count is zero.But we will never try to create any io queue in next reco-
nnection, because ctrl->queue_count already set to zero.We will end up
having an admin-only session in Live state, which is exactly what we try
to avoid in the original patch.
Update ctrl->queue_count after queue_count zero checking to fix it.

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Ruozhu Li
664227fde6 nvme-tcp: don't update queue count when failing to set io queues
We update ctrl->queue_count and schedule another reconnect when io queue
count is zero.But we will never try to create any io queue in next reco-
nnection, because ctrl->queue_count already set to zero.We will end up
having an admin-only session in Live state, which is exactly what we try
to avoid in the original patch.
Update ctrl->queue_count after queue_count zero checking to fix it.

Signed-off-by: Ruozhu Li <liruozhu@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Keith Busch
d48f92cd27 nvme-tcp: pair send_mutex init with destroy
Each mutex_init() should have a corresponding mutex_destroy().

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Keith Busch
a5df5e79c4 nvme: allow user toggling hmb usage
The NVMe host memory buffer may consume a non-negligable amount of
memory. Controllers are required to function without the host memory
buffer enabled, but with possibly degraded performance. Export a sysfs
property to toggle this feature on a per-device granularity so users may
choose to reclaim memory at the expense of storage performance.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Keith Busch
e5ad96f388 nvme-pci: disable hmb on idle suspend
An idle suspend may or may not disable host memory access from devices
placed in low power mode. Either way, it should always be safe to
disable the host memory buffer prior to entering the low power mode, and
this should also always be faster than a full device shutdown.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:24 +02:00
Colin Ian King
ad0e9a80ba nvmet: remove redundant assignments of variable status
There are two occurrances where variable status is being assigned a
value that is never read and it is being re-assigned a new value
almost immediately afterwards on an error exit path. The assignments
are redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:23 +02:00
Hou Pu
8d84f9de69 nvmet: add set feature tracing support
A nvme connect command produces following trace from the target side.

Before:
    kworker/0:1H-56      [000] ....  9012.155139: nvmet_req_init: nvmet1: qid=0, cmdid=16, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
    kworker/0:1H-56      [000] ....  9012.872272: nvmet_req_init: nvmet1: qid=0, cmdid=13, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)

cmdline:/sys/kernel/debug/tracing# cat trace | grep feature
    kworker/0:1H-56      [000] ....   203.493914: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0x7, sv=0x0, cdw11=0x70007)
    kworker/0:1H-56      [000] ....   204.197079: nvmet_req_init: nvmet1: qid=0, cmdid=29, nsid=0, flags=0x40, meta=0x0, cmd=(nvme_admin_set_features, fid=0xb, sv=0x0, cdw11=0x900)

Using ',' to separate different field like others in
nvmet_trace_admin_get_features.

Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:23 +02:00
Hou Pu
a7b5e8d864 nvme: add set feature tracing support
A nvme connect command produces following trace.

Before:
/sys/kernel/debug/tracing# cat trace | grep feature
    kworker/5:1H-98      [005] ....  3221.294844: nvme_setup_cmd: nvme0: qid=0, cmdid=25, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=07 00 00 00 07 00 07 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)
    kworker/4:1H-124     [004] ....  3222.009186: nvme_setup_cmd: nvme0: qid=0, cmdid=17, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features cdw10=0b 00 00 00 00 09 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00)

After:
/sys/kernel/debug/tracing# cat trace | grep feature
    kworker/0:1H-253     [000] ....   196.060509: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0x7, sv=0x0, cdw11=0x70007)
    kworker/0:1H-253     [000] ....   196.763947: nvme_setup_cmd: nvme0: qid=0, cmdid=29, nsid=0, flags=0x0, meta=0x0, cmd=(nvme_admin_set_features fid=0xb, sv=0x0, cdw11=0x900)

Using ',' to separate different field like others in
nvmet_trace_admin_get_features.

Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:23 +02:00
Hou Pu
e23439e977 nvme-fabrics: remove superfluous nvmf_host_put in nvmf_parse_options
Opts->host is NULL there. It is checked just before. So remove
nvmf_host_put. It is introduced by commit 59a2f3f00f ("nvme: fix
potential memory leak in option parsing").

Signed-off-by: Hou Pu <houpu.main@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:23 +02:00
Keith Busch
1751e97aa9 nvme-pci: cmb sysfs: one file, one value
An attribute should only be exporting one value as recommended in
Documentation/filesystems/sysfs.rst. Implement CMB attributes this way.
The old attribute will remain for backward compatibility.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:23 +02:00
Keith Busch
0521905e85 nvme-pci: use attribute group for cmb sysfs
Appending sysfs files to the controller kobject is a bit clunky and
becomes a maintenance problem as more attributes are added. The
attribute group infrastructure handles this better, so use that.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Sagi Grimberg
e7006de6c2 nvme: code command_id with a genctr for use-after-free validation
We cannot detect a (perhaps buggy) controller that is sending us
a completion for a request that was already completed (for example
sending a completion twice), this phenomenon was seen in the wild
a few times.

So to protect against this, we use the upper 4 msbits of the nvme sqe
command_id to use as a 4-bit generation counter and verify it matches
the existing request generation that is incrementing on every execution.

The 16-bit command_id structure now is constructed by:
| xxxx | xxxxxxxxxxxx |
  gen    request tag

This means that we are giving up some possible queue depth as 12 bits
allow for a maximum queue depth of 4095 instead of 65536, however we
never create such long queues anyways so no real harm done.

Suggested-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Sagi Grimberg
3b01a9d0ca nvme-tcp: don't check blk_mq_tag_to_rq when receiving pdu data
We already validate it when receiving the c2hdata pdu header
and this is not changing so this is a redundant check.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Sagi Grimberg
27453b45e6 nvme-pci: limit maximum queue depth to 4095
We are going to use the upper 4-bits of the command_id for a generation
counter, so enforce the new queue depth upper limit. As we enforce
both min and max queue depth, use param_set_uint_minmax istead of
open coding it.

Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Sagi Grimberg
2a14c9ae15 params: lift param_set_uint_minmax to common code
It is a useful helper hence move it to common code so others can enjoy
it.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-08-16 14:42:22 +02:00
Christoph Hellwig
9ea9b9c483 remove the lightnvm subsystem
Lightnvm supports the OCSSD 1.x and 2.0 specs which were early attempts
to produce Open Channel SSDs and never made it into the NVMe spec
proper.  They have since been superceeded by NVMe enhancements such
as ZNS support.  Remove the support per the deprecation schedule.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210812132308.38486-1-hch@lst.de
Reviewed-by: Matias Bjørling <mb@lightnvm.io>
Reviewed-by: Javier González <javier@javigon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-14 15:54:09 -06:00
Christoph Hellwig
6e4df4c648 nbd: reduce the nbd_index_mutex scope
nbd_index_mutex is currently held over add_disk and inside ->open, which
leads to lock order reversals.  Refactor the device creation code path
so that nbd_dev_add is called without nbd_index_mutex lock held and
only takes it for the IDR insertation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-7-hch@lst.de
[axboe: fix whitespace]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 14:17:41 -06:00
Christoph Hellwig
6177b56c96 nbd: refactor device search and allocation in nbd_genl_connect
Use idr_for_each_entry instead of the awkward callback to find an
existing device for the index == -1 case, and de-duplicate the device
allocation if no existing device was found.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-6-hch@lst.de
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 14:17:39 -06:00
Christoph Hellwig
7bdc00cf7e nbd: return the allocated nbd_device from nbd_dev_add
Return the device we just allocated instead of doing an extra search for
it in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-5-hch@lst.de
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 14:17:37 -06:00
Christoph Hellwig
327b501b1d nbd: remove nbd_del_disk
Fold nbd_del_disk and remove the pointless NULL check on ->disk given
that it is always set for a successfully allocated nbd_device structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-4-hch@lst.de
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 14:17:36 -06:00
Christoph Hellwig
3f74e0645c nbd: refactor device removal
Share common code for the synchronous and workqueue based device removal,
and remove the pointless use of refcount_dec_and_mutex_lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-3-hch@lst.de
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 14:17:32 -06:00
Hou Tao
68c9417b19 nbd: do del_gendisk() asynchronously for NBD_DESTROY_ON_DISCONNECT
Now open_mutex is used to synchronize partition operations (e.g,
blk_drop_partitions() and blkdev_reread_part()), however it makes
nbd driver broken, because nbd may call del_gendisk() in nbd_release()
or nbd_genl_disconnect() if NBD_CFLAG_DESTROY_ON_DISCONNECT is enabled,
and deadlock occurs, as shown below:

// AB-BA dead-lock
nbd_genl_disconnect            blkdev_open
  nbd_disconnect_and_put
                                 lock bd_mutex
  // last ref
  nbd_put
    lock nbd_index_mutex
      del_gendisk
                                   nbd_open
                                     try lock nbd_index_mutex
        try lock bd_mutex

 or

// AA dead-lock
nbd_release
  lock bd_mutex
    nbd_put
      try lock bd_mutex

Instead of fixing block layer (e.g, introduce another lock), fixing
the nbd driver to call del_gendisk() in a kworker when
NBD_DESTROY_ON_DISCONNECT is enabled. When NBD_DESTROY_ON_DISCONNECT
is disabled, nbd device will always be destroy through module removal,
and there is no risky of deadlock.

To ensure the reuse of nbd index succeeds, moving the calling of
idr_remove() after del_gendisk(), so if the reused index is not found
in nbd_index_idr, the old disk must have been deleted. And reusing
the existing destroy_complete mechanism to ensure nbd_genl_connect()
will wait for the completion of del_gendisk().

Also adding a new workqueue for nbd removal, so nbd_cleanup()
can ensure all removals complete before exits.

Reported-by: syzbot+0fe7752e52337864d29b@syzkaller.appspotmail.com
Fixes: c76f48eb5c ("block: take bd_mutex around delete_partitions in del_gendisk")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210811124428.2368491-2-hch@lst.de
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 14:17:16 -06:00
Baokun Li
fad7cd3310 nbd: add the check to prevent overflow in __nbd_ioctl()
If user specify a large enough value of NBD blocks option, it may trigger
signed integer overflow which may lead to nbd->config->bytesize becomes a
large or small value, zero in particular.

UBSAN: Undefined behaviour in drivers/block/nbd.c:325:31
signed integer overflow:
1024 * 4611686155866341414 cannot be represented in type 'long long int'
[...]
Call trace:
[...]
 handle_overflow+0x188/0x1dc lib/ubsan.c:192
 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
 nbd_size_set drivers/block/nbd.c:325 [inline]
 __nbd_ioctl drivers/block/nbd.c:1342 [inline]
 nbd_ioctl+0x998/0xa10 drivers/block/nbd.c:1395
 __blkdev_driver_ioctl block/ioctl.c:311 [inline]
[...]

Although it is not a big deal, still silence the UBSAN by limit
the input value.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210804021212.990223-1-libaokun1@huawei.com
[axboe: dropped unlikely()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-13 10:09:26 -06:00
Colin Ian King
da20b58d5b xen-blkfront: Remove redundant assignment to variable err
The variable err is being assigned a value that is never read, the
assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Link: https://lore.kernel.org/r/20210806110601.11386-1-colin.king@canonical.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-09 20:04:46 -06:00
Md Haris Iqbal
3087b335b5 block/rnbd: Use sysfs_emit instead of s*printf function for sysfs show
sysfs_emit function was added to be aware of the PAGE_SIZE maximum of
the temporary buffer used for outputting sysfs content, so there is no
possible overruns. So replace the uses of any s*printf functions for
the sysfs show functions with sysfs_emit.

Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20210726115950.470543-3-jinpu.wang@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:40 -06:00
Gioh Kim
94dace8c85 block/rnbd-clt: Use put_cpu_ptr after get_cpu_ptr
This patch replaces put_cpu_var with put_cpu_ptr because
get_cpu_ptr should be paired with put_cpu_ptr.

Signed-off-by: Gioh Kim <gi-oh.kim@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20210726115950.470543-2-jinpu.wang@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:40 -06:00
Damien Le Moal
2bc1f6e442 block: remove blk-mq-sysfs dead code
In block/blk-mq-sysfs.c, struct blk_mq_ctx_sysfs_entry is not used to
define any attribute since the "mq" sysfs directory contains only
sub-directories (no attribute files). As a result, blk_mq_sysfs_show(),
blk_mq_sysfs_store(), and struct sysfs_ops blk_mq_sysfs_ops are all
unused and unnecessary. Remove all this unused code.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Link: https://lore.kernel.org/r/20210713081837.524422-1-damien.lemoal@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:29 -06:00
Matteo Croce
9f65c489b6 loop: raise media_change event
Make the loop device raise a DISK_MEDIA_CHANGE event on attach or detach.

	# udevadm monitor -up |grep -e DISK_MEDIA_CHANGE -e DEVNAME &

	# losetup -f zero
	[    7.454235] loop0: detected capacity change from 0 to 16384
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop0
	DEVNAME=/dev/loop0
	DEVNAME=/dev/loop0

	# losetup -f zero
	[   10.205245] loop1: detected capacity change from 0 to 16384
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop1
	DEVNAME=/dev/loop1
	DEVNAME=/dev/loop1

	# losetup -f zero2
	[   13.532368] loop2: detected capacity change from 0 to 40960
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop2
	DEVNAME=/dev/loop2

	# losetup -D
	DEVNAME=/dev/loop1
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop1
	DEVNAME=/dev/loop2
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop2
	DEVNAME=/dev/loop0
	DISK_MEDIA_CHANGE=1
	DEVNAME=/dev/loop0

Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-7-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:29 -06:00
Matteo Croce
e6138dc12d block: add a helper to raise a media changed event
Refactor disk_check_events() and move some code into disk_event_uevent().
Then add disk_force_media_change(), a helper which will be used by
devices to force issuing a DISK_EVENT_MEDIA_CHANGE event.

Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-6-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Matteo Croce
13927b31b1 block: export diskseq in sysfs
Add a new sysfs handle to export the new diskseq value.
Place it in <sysfs>/block/<disk>/diskseq and document it.

    $ grep . /sys/class/block/*/diskseq
    /sys/class/block/loop0/diskseq:13
    /sys/class/block/loop1/diskseq:14
    /sys/class/block/loop2/diskseq:5
    /sys/class/block/loop3/diskseq:6
    /sys/class/block/ram0/diskseq:1
    /sys/class/block/ram1/diskseq:2
    /sys/class/block/vda/diskseq:7

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-5-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Matteo Croce
7957d93bf3 block: add ioctl to read the disk sequence number
Add a new BLKGETDISKSEQ ioctl which retrieves the disk sequence number
from the genhd structure.

    # ./getdiskseq /dev/loop*
    /dev/loop0:     13
    /dev/loop0p1:   13
    /dev/loop0p2:   13
    /dev/loop0p3:   13
    /dev/loop1:     14
    /dev/loop1p1:   14
    /dev/loop1p2:   14
    /dev/loop2:     5
    /dev/loop3:     6

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-4-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Matteo Croce
87eb710747 block: export the diskseq in uevents
Export the newly introduced diskseq in uevents:

    $ udevadm info /sys/class/block/* |grep -e DEVNAME -e DISKSEQ
    E: DEVNAME=/dev/loop0
    E: DISKSEQ=1
    E: DEVNAME=/dev/loop1
    E: DISKSEQ=2
    E: DEVNAME=/dev/loop2
    E: DISKSEQ=3
    E: DEVNAME=/dev/loop3
    E: DISKSEQ=4
    E: DEVNAME=/dev/loop4
    E: DISKSEQ=5
    E: DEVNAME=/dev/loop5
    E: DISKSEQ=6
    E: DEVNAME=/dev/loop6
    E: DISKSEQ=7
    E: DEVNAME=/dev/loop7
    E: DISKSEQ=8
    E: DEVNAME=/dev/nvme0n1
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p1
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p2
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p3
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p4
    E: DISKSEQ=9
    E: DEVNAME=/dev/nvme0n1p5
    E: DISKSEQ=9
    E: DEVNAME=/dev/sda
    E: DISKSEQ=10
    E: DEVNAME=/dev/sda1
    E: DISKSEQ=10
    E: DEVNAME=/dev/sda2
    E: DISKSEQ=10

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-3-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Matteo Croce
cf17994855 block: add disk sequence number
Associating uevents with block devices in userspace is difficult and racy:
the uevent netlink socket is lossy, and on slow and overloaded systems
has a very high latency.
Block devices do not have exclusive owners in userspace, any process can
set one up (e.g. loop devices). Moreover, device names can be reused
(e.g. loop0 can be reused again and again). A userspace process setting
up a block device and watching for its events cannot thus reliably tell
whether an event relates to the device it just set up or another earlier
instance with the same name.

Being able to set a UUID on a loop device would solve the race conditions.
But it does not allow to derive orderings from uevents: if you see a
uevent with a UUID that does not match the device you are waiting for,
you cannot tell whether it's because the right uevent has not arrived yet,
or it was already sent and you missed it. So you cannot tell whether you
should wait for it or not.

Associating a unique, monotonically increasing sequential number to the
lifetime of each block device, which can be retrieved with an ioctl
immediately upon setting it up, allows to solve the race conditions with
uevents, and also allows userspace processes to know whether they should
wait for the uevent they need or if it was dropped and thus they should
move on.

Additionally, increment the disk sequence number when the media change,
i.e. on DISK_EVENT_MEDIA_CHANGE event.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Tested-by: Luca Boccassi <bluca@debian.org>
Link: https://lore.kernel.org/r/20210712230530.29323-2-mcroce@linux.microsoft.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
2164877c7f block: remove cmdline-parser.c
cmdline-parser.c is only used by the cmdline faux partition format,
so merge the code into that and avoid an indirect call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210728053756.409654-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
abd2864a3e block: remove disk_name()
Remove the disk_name function now that all users are gone.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
1d7035478f block: simplify disk name formatting in check_partition
disk_name for partition 0 just copies out the disk_name field.  Replace
the call to disk_name with a %s format specifier.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
453b8ab696 block: simplify printing the device names disk_stack_limits
Printk ->disk_name directly for the disk and use the %pg format specifier
for the block device, which is equivalent to a bdevname call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
a291bb43e5 block: use the %pg format specifier in show_partition
Simplify printing the partition name by using the %pg format specifier
that is equivalent to a bdevname call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
a9e7bc3de4 block: use the %pg format specifier in printk_all_partitions
Simplify printing the partition name by using the %pg format specifier
that is equivalent to a bdevname call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Abd-Alrhman Masalkhi
26e2d7a362 block: reduce stack usage in diskstats_show
I have compiled the kernel with a cross compiler "hppa-linux-gnu-" v9.3.0
on x86-64 host machine. I got the following warning:

block/genhd.c: In function ‘diskstats_show’:
block/genhd.c:1227:1: warning: the frame size of 1688 bytes is larger
than 1280 bytes [-Wframe-larger-than=]
 1227  |  }

By Reduced the stack footprint by using the %pg printk specifier instead
of disk_name to remove the need for the on-stack buffer.

Signed-off-by: Abd-Alrhman Masalkhi <abd.masalkhi@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20210727062518.122108-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
2f4731dcd0 block: remove bdput
Now that we've stopped using inode references for anything meaninful
in the block layer get rid of the helper to put it and just open code
the call to iput on the block_device inode.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Link: https://lore.kernel.org/r/20210722075402.983367-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
14cf1dbb55 block: remove bdgrab
All callers are gone, and no one should grab a pure inode reference to
a block device anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20210722075402.983367-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
4b2731226d loop: don't grab a reference to the block device
The whole device block device won't be removed while the disk is still
alive, so don't bother to grab a reference to it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Ming Lei <ming.lei@rehat.com>
Reviewed-by: Chaitanya Kulkarni <ckulkarnilinux@gmail.com>
Link: https://lore.kernel.org/r/20210722075402.983367-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
9d3b881389 block: change the refcounting for partitions
Instead of acquiring an inode reference on open make sure partitions
always hold device model references to the disk while alive, and switch
open to grab only a device model reference to the opened block device.
If that is a partition the disk reference is transitively held by the
partition already.

Link: https://lore.kernel.org/r/20210722075402.983367-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
0468c53234 block: allocate bd_meta_info later in add_partitions
Move the allocation of bd_meta_info after initializing the struct device
to avoid the special bdput error handling path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210722075402.983367-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00
Christoph Hellwig
d7a66574b3 block: unhash the whole device inode earlier
Unhash the whole device inode early in del_gendisk.  This allows to
remove the first GENHD_FL_UP check in the open path as we simply
won't find a just removed inode.  The second non-racy check after
taking open_mutex is still kept.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210722075402.983367-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-08-02 13:37:28 -06:00