Commit Graph

767578 Commits

Christoph Hellwig
793c7cfce0 nvmet: track and limit the number of namespaces per subsystem
TP 4004 introduces a new 'Maximum Number of Allocated Namespaces' field
in the Identify controller data to help the host size resources.  Put
an upper limit on the supported namespaces to be able to support this
value, as supporting 32 bits' worth of namespaces would lead to very
large buffers.  The limit is completely arbitrary at this point.
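
A minimal sketch of the resulting check; NVMET_MAX_NAMESPACES and the
per-subsystem counter are illustrative names, not necessarily the exact
ones in the patch:

  /* arbitrary cap, see above */
  #define NVMET_MAX_NAMESPACES	1024

  mutex_lock(&subsys->lock);
  ret = -ENOSPC;
  if (subsys->nr_namespaces >= NVMET_MAX_NAMESPACES)
          goto out_unlock;
  subsys->nr_namespaces++;
  mutex_unlock(&subsys->lock);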

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:13:01 +02:00
Christoph Hellwig
4ee4328048 nvmet: keep a port pointer in nvmet_ctrl
This will be needed for the ANA AEN code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:12:52 +02:00
Christoph Hellwig
0d0b660f21 nvme: add ANA support
Add support for Asymmetric Namespace Access as specified in NVMe 1.3
TP 4004.  With ANA each namespace attached to a controller belongs to an
ANA group that describes the characteristics of accessing the namespaces
through this controller.  In the optimized and non-optimized states
namespaces can be accessed regularly, although in a multi-pathing
environment we should always prefer to access a namespace through a
controller where an optimized relationship exists.  Namespaces in the
Inaccessible, Persistent-Loss or Change state for a given controller
should not be accessed.

The states are updated through reading the ANA log page, which is read
once during controller initialization, whenever the ANA change notice
AEN is received, or when one of the ANA specific status codes that
signal a state change is received on a command.

The ANA state is kept in the nvme_ns structure, which makes the checks in
the fast path very simple.  Updating the ANA state when reading the log
page is also very simple; the only downside is that finding the initial
ANA state when scanning for namespaces is a bit cumbersome.
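
A sketch of what the fast-path check boils down to, using the TP 4004
state values (the helper name is illustrative):

  static inline bool nvme_ana_state_usable(enum nvme_ana_state state)
  {
          switch (state) {
          case NVME_ANA_OPTIMIZED:
          case NVME_ANA_NONOPTIMIZED:
                  return true;    /* regular access is fine */
          default:
                  return false;   /* inaccessible, persistent loss, change */
          }
  }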

The gendisk for a ns_head is only registered once a live path for it
exists.  Without that the kernel would hang during partition scanning.

Includes fixes and improvements from Hannes Reinecke.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:12:08 +02:00
Christoph Hellwig
8decf5d5b9 nvme: remove nvme_req_needs_failover
Now that we just call out to blk_path_error, there isn't really any good
reason not to merge it into the only caller.
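
After the merge, the failover decision in nvme_complete_rq() reduces to
roughly the following:

  blk_status_t status = nvme_error_status(req);

  if (unlikely(status != BLK_STS_OK && nvme_req_needs_retry(req))) {
          if ((req->cmd_flags & REQ_NVME_MPATH) && blk_path_error(status)) {
                  nvme_failover_req(req);   /* retry on another path */
                  return;
          }
          /* otherwise fall through to the normal retry logic */
  }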

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:12:05 +02:00
Christoph Hellwig
0e98719b0e nvme: simplify the API for getting log pages
Merge nvme_get_log and nvme_get_log_ext into a single helper, which takes
a plain nsid instead of the nvme_ns pointer.  Also add support for the
log specific field while we're at it.
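
The merged helper ends up with a shape along these lines (parameter
names approximate):

  int nvme_get_log(struct nvme_ctrl *ctrl, u32 nsid, u8 log_page,
                   u8 lsp, void *log, size_t size, u64 offset);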

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:12:01 +02:00
Christoph Hellwig
1a37621658 nvme.h: add ANA definitions
Add various definitions from NVMe 1.3 TP 4004.
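
Among them are the ANA state codes, which per TP 4004 look like this:

  enum nvme_ana_state {
          NVME_ANA_OPTIMIZED              = 0x01,
          NVME_ANA_NONOPTIMIZED           = 0x02,
          NVME_ANA_INACCESSIBLE           = 0x03,
          NVME_ANA_PERSISTENT_LOSS        = 0x04,
          NVME_ANA_CHANGE                 = 0x0f,
  };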

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:11:52 +02:00
Christoph Hellwig
9b89bc3857 nvme.h: add support for the log specific field
NVMe 1.3 added a new log specific field to the get log page CQ
definition; add it to our get_log_page SQ structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
2018-07-27 19:11:46 +02:00
Greg Edwards
cdcdcaae84 scsi: virtio_scsi: fix pi_bytes{out,in} on 4 KiB block size devices
When the underlying device is a 4 KiB logical block size device with a
protection interval exponent of 0, i.e. 4096 bytes data + 8 bytes PI, the
driver miscalculates the pi_bytes{out,in} by a factor of 8x (64 bytes).

This leads to errors on all reads and writes on 4 KiB logical block size
devices when CONFIG_BLK_DEV_INTEGRITY is enabled and the
VIRTIO_SCSI_F_T10_PI feature bit has been negotiated.
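
The corrected accounting scales by the protection interval instead of
assuming one tuple per 512-byte sector; roughly, for the write
direction (rq here is sc->request):

  struct blk_integrity *bi = blk_get_integrity(rq->rq_disk);

  if (sc->sc_data_direction == DMA_TO_DEVICE)
          cmd->req.cmd_pi.pi_bytesout = cpu_to_virtio32(vdev,
                          bio_integrity_bytes(bi, blk_rq_sectors(rq)));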

Fixes: e6dc783a38 ("virtio-scsi: Enable DIF/DIX modes in SCSI host LLD")
Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 15:49:43 -06:00
Greg Edwards
359f642700 block: move bio_integrity_{intervals,bytes} into blkdev.h
This allows bio_integrity_bytes() to be called from drivers instead of
open coding it.
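
The helpers being moved are small; they amount to:

  static inline unsigned int bio_integrity_intervals(struct blk_integrity *bi,
                                                     unsigned int sectors)
  {
          return sectors >> (bi->interval_exp - 9);
  }

  static inline unsigned int bio_integrity_bytes(struct blk_integrity *bi,
                                                 unsigned int sectors)
  {
          return bio_integrity_intervals(bi, sectors) * bi->tuple_size;
  }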

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Greg Edwards <gedwards@ddn.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-26 15:49:41 -06:00
Juergen Gross
d3df0ac096 xen/blkfront: remove unused macros
Remove some macros not used anywhere.

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-25 08:49:24 -06:00
Jens Axboe
eca53cb63f Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block
Pull NVMe updates from Christoph:

"Highlights:

 - massively improved tracepoints (Keith Busch)
 - support for larger inline data in the RDMA host and target
   (Steve Wise)
 - RDMA setup/teardown path fixes and refactor (Sagi Grimberg)
 - Command Supported and Effects log support for the NVMe target
   (Chaitanya Kulkarni)
 - buffered I/O support for the NVMe target (Chaitanya Kulkarni)

 plus the usual set of cleanups and small enhancements."

* 'nvme-4.19' of git://git.infradead.org/nvme:
  nvmet: don't use uuid_le type
  nvmet: check fileio lba range access boundaries
  nvmet: fix file discard return status
  nvme-rdma: centralize admin/io queue teardown sequence
  nvme-rdma: centralize controller setup sequence
  nvme-rdma: unquiesce queues when deleting the controller
  nvme-rdma: mark expected switch fall-through
  nvme: add disk name to trace events
  nvme: add controller name to trace events
  nvme: use hw qid in trace events
  nvme: cache struct nvme_ctrl reference to struct nvme_request
  nvmet-rdma: add an error flow for post_recv failures
  nvmet-rdma: add unlikely check in the fast path
  nvmet-rdma: support max(16KB, PAGE_SIZE) inline data
  nvme-rdma: support up to 4 segments of inline data
  nvmet: add buffered I/O support for file backed ns
  nvmet: add commands supported and effects log page
  nvme: move init of keep_alive work item to controller initialization
  nvme.h: resync with nvme-cli
2018-07-25 08:43:30 -06:00
Mike Snitzer
42c9cdfe1e block: allow max_discard_segments to be stacked
Set max_discard_segments to USHRT_MAX in blk_set_stacking_limits() so
that blk_stack_limits() can stack up this limit for stacked devices.

before:

$ cat /sys/block/nvme0n1/queue/max_discard_segments
256
$ cat /sys/block/dm-0/queue/max_discard_segments
1

after:

$ cat /sys/block/nvme0n1/queue/max_discard_segments
256
$ cat /sys/block/dm-0/queue/max_discard_segments
256
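
In code terms the change is essentially (sketch):

  /* blk_set_stacking_limits(): start permissive so limits can stack */
  lim->max_discard_segments = USHRT_MAX;

  /* blk_stack_limits(): take the minimum of the component limits */
  t->max_discard_segments = min_not_zero(t->max_discard_segments,
                                         b->max_discard_segments);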

Fixes: 1e739730c5 ("block: optionally merge discontiguous discard bios into a single request")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:46:39 -06:00
Christoph Hellwig
c55183c9aa block: unexport bio_clone_bioset
Now only used by the bounce code, so move it there and mark the function
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:26 -06:00
Christoph Hellwig
3ed122e68b md: remove a bogus comment
The function name mentioned doesn't exist, and the code next to it
doesn't match the description either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:24 -06:00
Christoph Hellwig
071f52fbce block: remove bio_clone_kmalloc
Unused now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:22 -06:00
Christoph Hellwig
076ff2f0b8 exofs: use bio_clone_fast in _write_mirror
The mirroring code never changes the bio data or biovecs.  This means
we can reuse the biovec allocation easily instead of duplicating it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Boaz Harrosh <ooo@electrozaur.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:20 -06:00
Christoph Hellwig
c8b27acc77 bcache: don't clone bio in bch_data_verify
We immediately overwrite the biovec array, so instead just allocate
a new bio and copy over the disk, sector and size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:43:19 -06:00
Christoph Hellwig
3bb5098310 block: bio_set_pages_dirty can't see NULL bv_page in a valid bio_vec
So don't bother handling it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:39:28 -06:00
Christoph Hellwig
24d5493f20 block: simplify bio_check_pages_dirty
bio_check_pages_dirty currently violates the invariant that bv_page of
a bio_vec inside bi_vcnt shouldn't be NULL, and that is going to become
really annoying with multipage biovecs.  Fortunately there isn't all
that good a reason for it - once we decide to defer freeing the bio
to a workqueue, holding onto a few additional pages isn't really an
issue anymore.  So just check if there is a clean page that needs
dirtying in the first pass, and do a second pass to free them if there
was none, while the cache is still hot.
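
A sketch of the resulting two-pass structure (the page-release step is
abbreviated):

  struct bio_vec *bvec;
  int i;

  bio_for_each_segment_all(bvec, bio, i) {
          if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
                  goto defer;     /* found a clean page: punt to workqueue */
  }
  /* second pass, while the cache is still hot: all pages were dirty */
  bio_release_pages(bio);
  bio_put(bio);
  return;
  defer:
  /* queue the bio for bio_dirty_fn to dirty and release the pages */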

Also use the chance to micro-optimize bio_dirty_fn a bit by not saving
irq state - we know we are called from a workqueue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 14:39:27 -06:00
Bart Van Assche
76f17d8ba1 block: Rename the null_blk_mod kernel module back into null_blk
Commit ca4b2a0119 ("null_blk: add zone support") breaks several
blktests scripts because it renamed the null_blk kernel module into
null_blk_mod. Hence rename null_blk_mod back into null_blk.

Fixes: ca4b2a0119 ("null_blk: add zone support")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Matias Bjorling <matias.bjorling@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-24 09:54:36 -06:00
Andy Shevchenko
1b0d274523 nvmet: don't use uuid_le type
Don't use sizeof(uuid_le) where none of the parameters is of type uuid_le.
Since both arguments are u8 [16], use the size of the destination there.

Moreover, uuid_le is a deprecated type, and nvmet is using uuid_t
already.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:51 +02:00
Sagi Grimberg
9c891c1398 nvmet: check fileio lba range access boundaries
Fail out-of-bounds with a proper status code.

Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:51 +02:00
Sagi Grimberg
1b72b71fac nvmet: fix file discard return status
If nvmet_copy_from_sgl failed, we falsely return a successful
completion status.

Fixes: d5eff33ee6 ("nvmet: add simple file backed ns support")
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:50 +02:00
Sagi Grimberg
75862c7232 nvme-rdma: centralize admin/io queue teardown sequence
We follow the same queue teardown sequence in delete, reset and error
recovery. Centralize the logic.  This patch does not change any
functionality.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:50 +02:00
Sagi Grimberg
c66e2998c8 nvme-rdma: centralize controller setup sequence
Centralize the controller setup sequence into a single routine that
correctly cleans up after failures instead of having multiple appearances
in several flows (create, reset, reconnect).

One thing that we also gain here is that the sanity/boundary checks now
also run when connecting back to a dynamic controller.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:49 +02:00
Sagi Grimberg
90140624e8 nvme-rdma: unquiesce queues when deleting the controller
If the controller is going away, we need to unquiesce the IO queues so
that all pending requests can fail gracefully before moving forward with
controller deletion. Do that before we destroy the IO queues so
blk_cleanup_queue won't block in freeze.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:49 +02:00
Gustavo A. R. Silva
249090f901 nvme-rdma: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:48 +02:00
Keith Busch
6268953e89 nvme: add disk name to trace events
This will print the disk name to the nvme event trace for IO requests so
a user can better distinguish traffic to different disks.  This can be
used to create disk-based filters.  For example, to see only nvme0n2 traffic:

  echo "disk == \"nvme0n2\"" > /sys/kernel/debug/tracing/events/nvme/filter

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
[hch: turned __assign_disk_name into an inline function]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:48 +02:00
Keith Busch
b80a55e246 nvme: add controller name to trace events
This appends the controller instance to the nvme trace buffer to
distinguish which controller is dispatching and completing a command.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-24 15:55:45 +02:00
Keith Busch
5d87eb94d9 nvme: use hw qid in trace events
We cannot match a command to its completion based on the command
id alone.  We need the submitting queue identifier to pair with the
completion, so this patch adds that to the trace buffer.

This patch also collapses the admin and IO submission traces into a
single one, so we don't need to duplicate this and create unnecessary
code branches: we know whether the command is admin vs IO based on the qid.

And since we're here, the patch fixes code formatting in the area.
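
The helper that derives the hw qid looks roughly like this:

  static inline u16 nvme_req_qid(struct request *req)
  {
          if (!req->rq_disk)
                  return 0;       /* admin queue */
          return blk_mq_unique_tag_to_hwq(blk_mq_unique_tag(req)) + 1;
  }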

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
[hch: move the qid helper to nvme.h and made it an inline function]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:19 +02:00
Sagi Grimberg
59e29ce66b nvme: cache struct nvme_ctrl reference to struct nvme_request
We will need to reference the controller at setup and completion time
for tracing and future traffic-based keep alive support.
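
Concretely, struct nvme_request grows a controller back-pointer, roughly:

  struct nvme_request {
          struct nvme_command     *cmd;
          union nvme_result       result;
          u8                      retries;
          u8                      flags;
          u16                     status;
          struct nvme_ctrl        *ctrl;  /* cached at request setup time */
  };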

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:18 +02:00
Max Gurtovoy
202093848c nvmet-rdma: add an error flow for post_recv failures
The post receive buffer operation can fail, thus we should make
sure to have an error flow during the initialization phase.  While
we're here, add a debug print in case of a failure.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:17 +02:00
Max Gurtovoy
2fc464e216 nvmet-rdma: add unlikely check in the fast path
The ib_post_send operation should succeed unless something unusual
happened to the ib device.

Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:16 +02:00
Steve Wise
0d5ee2b2ab nvmet-rdma: support max(16KB, PAGE_SIZE) inline data
The patch enables inline data sizes using up to 4 recv sges, and capping
the size at 16KB or at least 1 page size.  So on a 4K page system, up to
16KB is supported, and for a 64K page system 1 page of 64KB is supported.

We avoid > 0 order page allocations for the inline buffers by using
multiple recv sges, one for each page.  If the device cannot support
the configured inline data size due to lack of enough recv sges, then
log a warning and reduce the inline size.

Add a new configfs port attribute, called param_inline_data_size,
to allow configuring the size of inline data for a given nvmf port.
The maximum size allowed is still enforced by nvmet-rdma with
NVMET_RDMA_MAX_INLINE_DATA_SIZE, which is now max(16KB, PAGE_SIZE).
And the default size, if not specified via configfs, is still PAGE_SIZE.
This preserves the existing behavior, but allows larger inline sizes
for small page systems.  If the configured inline data size exceeds
NVMET_RDMA_MAX_INLINE_DATA_SIZE, a warning is logged and the size is
reduced.  If param_inline_data_size is set to 0, then inline data is
disabled for that nvmf port.
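
For example, to configure a 16KB inline data size on port 1, assuming
the usual nvmet configfs layout:

  $ echo 16384 > /sys/kernel/config/nvmet/ports/1/param_inline_data_size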

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:16 +02:00
Steve Wise
64a741c1ea nvme-rdma: support up to 4 segments of inline data
Allow up to 4 segments of inline data for NVMF WRITE operations. This
reduces latency for small WRITEs by removing the need for the target to
issue a READ WR for IB, or a REG_MR + READ WR chain for iWarp.

Also cap the inline segments used based on the limitations of the
device.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:15 +02:00
Chaitanya Kulkarni
55eb942eda nvmet: add buffered I/O support for file backed ns
Add a new "buffered_io" attribute, which disabled direct I/O and thus
enables page cache based caching when enabled.   The attribute can only
be changed when the namespace is disabled as the file has to be reopend
for the change to take effect.

The possibly blocking read/write are deferred to a newly introduced
global workqueue.
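
Usage sketch (with the namespace disabled; paths follow the usual nvmet
configfs layout):

  $ echo 1 > /sys/kernel/config/nvmet/subsystems/<nqn>/namespaces/<nsid>/buffered_io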

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:14 +02:00
Chaitanya Kulkarni
0866bf0c37 nvmet: add commands supported and effects log page
This patch adds support for Commands Supported and Effects log page
(Log Identifier 05h) for NVMeOF. This also makes it easier to find
which commands are supported, e.g. :-

subnqn    : testnqn1
Admin Command Set
ACS2     [Get Log Page                    ] 00000001
ACS6     [Identify                        ] 00000001
ACS8     [Abort                           ] 00000001
ACS9     [Set Features                    ] 00000001
ACS10    [Get Features                    ] 00000001
ACS12    [Asynchronous Event Request      ] 00000001
ACS24    [Keep Alive                      ] 00000001

NVM Command Set
IOCS0    [Flush                           ] 00000001
IOCS1    [Write                           ] 00000001
IOCS2    [Read                            ] 00000001
IOCS8    [Write Zeroes                    ] 00000001
IOCS9    [Dataset Management              ] 00000001

This particular functionality can be used from the host side to examine
which NVMeOF ctrl commands are supported.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:13 +02:00
James Smart
230f1f9e04 nvme: move init of keep_alive work item to controller initialization
Currently, the code initializes the keep alive work item whenever
nvme_start_keep_alive() is called.  However, this routine is called
several times while reconnecting, etc.  Although it's hoped that keep
alive is always disabled and not scheduled when start is called,
re-initializing it while it is scheduled or completing can have very bad
side effects.  There's no need for re-initialization.

Move the keep_alive work item and cmd struct initialization to
controller init.
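
In other words, the initialization happens exactly once, along these
lines (sketch):

  /* in controller initialization, not in nvme_start_keep_alive() */
  INIT_DELAYED_WORK(&ctrl->ka_work, nvme_keep_alive_work);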

Signed-off-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:13 +02:00
Revanth Rajashekar
40c6f9c28e nvme.h: resync with nvme-cli
Added some feature ids that are present in nvme-cli but not in the kernel.

Signed-off-by: Revanth Rajashekar <revanth.rajashekar@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2018-07-23 09:35:12 +02:00
Ming Lei
8824f62246 blk-mq: fail the request in case issue failure
Inside blk_mq_try_issue_list_directly(), if issuing the request fails,
we shouldn't try to issue it again; otherwise the warning in
blk_mq_start_request() will be triggered.  This change aligns with the
behaviour of the other ways of request issue & dispatch.
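
The list-issue loop then handles a failed issue roughly like this
(sketch):

  ret = blk_mq_request_issue_directly(rq);
  if (ret != BLK_STS_OK) {
          if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
                  /* out of resources: stop issuing, keep the request */
                  list_add(&rq->queuelist, list);
                  break;
          }
          blk_mq_end_request(rq, ret);    /* real failure: fail it */
  }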

Fixes: 6ce3dd6eec ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: kernel test robot <rong.a.chen@intel.com>
Cc: LKP <lkp@01.org>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-22 17:31:18 -06:00
Josef Bacik
22f17952c7 blk-rq-qos: make depth comparisons unsigned
With the change to use UINT_MAX I broke the depth check, as any value of
inflight (i.e. 0) would be less than (int)UINT_MAX.  Fix this by changing
everything to unsigned int to match the depth.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-22 11:30:53 -06:00
Tejun Heo
636620b66d blkcg: Track DISCARD statistics and output them in cgroup io.stat
Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat.  Two
fields, dbytes and dios, are added to count the total bytes and
number of discards respectively.
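
With this, an io.stat line gains two fields, e.g. (values illustrative):

  $ cat /sys/fs/cgroup/<group>/io.stat
  8:0 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0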

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:23 -06:00
Michael Callahan
bdca3c87fb block: Track DISCARD statistics and output them in stat and diskstat
Add tracking of REQ_OP_DISCARD ios to the partition statistics and
append them to the various stat files in /sys as well as
/proc/diskstats.  These are tracked with the same four stats as reads
and writes:

Number of discard ios completed
Number of discard ios merged
Number of discard sectors completed
Milliseconds spent on discard requests

This is done via adding a new STAT_DISCARD define to genhd.h and then
using it to index that stat field for discard requests.

tj: Refreshed on top of v4.17 and other previous updates.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:22 -06:00
Michael Callahan
ddcf35d397 block: Add and use op_stat_group() for indexing disk_stat fields.
Add and use a new op_stat_group() function for indexing partition stat
fields rather than indexing them by rq_data_dir() or bio_data_dir().
This function works similarly to op_is_sync() in that it takes the
request::cmd_flags or bio::bi_opf flags and determines which stats
should get updated.
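
As introduced here the mapping covers reads and writes (the discard
case arrives with the later patch above); roughly:

  static inline int op_stat_group(unsigned int op)
  {
          return op_is_write(op) ? STAT_WRITE : STAT_READ;
  }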

In addition, the second parameter to generic_start_io_acct() and
generic_end_io_acct() is now a REQ_OP rather than simply a read or
write bit and it uses op_stat_group() on the parameter to determine
the stat group.

Note that the partition in_flight counts are not part of the per-cpu
statistics and as such are not indexed via this function.  They are now
indexed by op_is_write().

tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Joshua Morris <josh.h.morris@us.ibm.com>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Matias Bjorling <mb@lightnvm.io>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Alasdair Kergon <agk@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:20 -06:00
Michael Callahan
dbae2c5513 block: Define and use STAT_READ and STAT_WRITE
Add defines for STAT_READ and STAT_WRITE for indexing the partition
stat entries. This clarifies some fs/ code which has hardcoded 1 for
STAT_WRITE and will make it easier to extend the stats with additional
fields.

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:18 -06:00
Michael Callahan
59767fbd49 block: Add part_stat_read_accum to read across field entries.
Add a part_stat_read_accum macro to genhd.h to read and sum across
field entries, for example to sum up the number of read and write
sectors completed.  In addition to being a reasonable cleanup by
itself this will make it easier to add new stat fields in the future.
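
The macro amounts to roughly the following (later extended when
STAT_DISCARD was added):

  #define part_stat_read_accum(part, field)                       \
          (part_stat_read(part, field[STAT_READ]) +               \
           part_stat_read(part, field[STAT_WRITE]))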

tj: Refreshed on top of v4.17.

Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:16 -06:00
Tejun Heo
3f289dcb4b block: make bdev_ops->rw_page() take a REQ_OP instead of bool
c11f0c0b5b ("block/mm: make bdev_ops->rw_page() take a bool for
read/write") replaced @op with boolean @is_write, which limited the
amount of information going into ->rw_page() and more importantly
page_endio(), which removed the need to expose block internals to mm.

Unfortunately, we want to track discards separately and @is_write
isn't enough information.  This patch updates bdev_ops->rw_page() to
take REQ_OP instead but leaves page_endio() to take bool @is_write.
This allows the block part of operations to have enough information
while not leaking it to mm.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-18 08:44:14 -06:00
RAGHU Halharvi
ada94973f1 pktcdvd: remove assignment in if condition
* Remove checkpatch errors caused by an assignment operation in an if
condition

Signed-off-by: RAGHU Halharvi <raghuhack78@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-17 16:17:04 -06:00
Ming Lei
6ce3dd6eec blk-mq: issue directly if hw queue isn't busy in case of 'none'
In the case of the 'none' io scheduler, when the hw queue isn't busy it
isn't necessary to enqueue the request to the sw queue and dequeue it
from the sw queue again, because the request can be submitted to the hw
queue asap without extra cost; meanwhile there shouldn't be many requests
in the sw queue, and we don't need to worry about the effect on IO merge.
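
A sketch of the dispatch decision (the busy test is named
illustratively here):

  if (!q->elevator && !hctx_busy(hctx))   /* hypothetical busy check */
          blk_mq_try_issue_directly(hctx, rq, &cookie);
  else
          blk_mq_sched_insert_request(rq, false, true, true);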

There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas, ...)
which may connect high performance devices, so 'none' is often required
for obtaining good performance.

This patch improves IOPS and decreases CPU utilization on megaraid_sas,
per Kashyap's test.

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-17 16:04:00 -06:00
Josef Bacik
71e9690b59 blk-iolatency: truncate our current time
In our longer tests we noticed that some boxes would degrade to the
point of uselessness.  This is because we truncate the current time when
saving it in our bio, but I was using the raw current time to subtract
from.  So once the box had been up a certain amount of time it would
appear as if our IOs were taking several years to complete.  Fix this
by truncating the current time so it matches the issue time.  Verified
this worked by running with this patch for a week on our test tier.
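
The fix amounts to truncating "now" the same way the stored issue time
is truncated, roughly:

  u64 now = ktime_to_ns(ktime_get());

  /* mask to the same precision the bio issue time is stored with */
  now = __bio_issue_time(now);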

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-16 10:15:19 -06:00