linux

iv/linux

Author	SHA1	Message	Date
Keith Busch	921920ab32	NVMe: Unbind driver on failure Instead of removing the PCI device from the kernel's topology on controller failure, this patch simply requests unbinding the device from the driver. This avoids concurrently running pci removal with the hot plug event, which has been reported to be problematic when multiple surprise events occur near simultaneously. The other benefit is that we will have PCI config and memory space available to poke around for debugging a failed controller, assuming the device was not physically removed. The down side occurs if the platform and/or kernel do not support any type of surprise hot removal. The device will remain visible through sysfs (and therefore lspci), and some manual work is necessary to get the logical topology corrected. But if your platform and/or kernel don't support surprise removal, you probably shouldn't be doing that anyway. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	014a0d609e	NVMe: Delete only created queues Use the online queue count instead of the number of allocated queues. The controller should just return an invalid queue identifier error to the commands if a queue wasn't created. While it's not harmful, it's still not correct. Reported-by: Saar Gross <saar@annapurnalabs.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Keith Busch	2800b8e7d9	NVMe: Allocate queues only for online cpus The driver previously requested allocating queues for the total possible number of CPUs so that blk-mq could rebalance these if CPUs were added after initialization. The number of hardware contexts can now be changed at runtime, so we only need to allocate the number of online queues since we can add more later. Suggested-by: Jeff Lien <jeff.lien@hgst.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-17 17:14:21 -06:00
Linus Torvalds	24b9f0cf00	Merge branch 'for-4.7/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "On top of the core pull request, this is the drivers pull request for this merge window. This contains: - Switch drivers to the new write back cache API, and kill off the flush flags. From me. - Kill the discard support for the STEC pci-e flash driver. It's trivially broken, and apparently unmaintained, so it's safer to just remove it. From Jeff Moyer. - A set of lightnvm updates from the usual suspects (Matias/Javier, and Simon), and fixes from Arnd, Jeff Mahoney, Sagi, and Wenwei Tao. - A set of updates for NVMe: - Turn the controller state management into a proper state machine. From Christoph. - Shuffling of code in preparation for NVMe-over-fabrics, also from Christoph. - Cleanup of the command prep part from Ming Lin. - Rewrite of the discard support from Ming Lin. - Deadlock fix for namespace removal from Ming Lin. - Use the now exported blk-mq tag helper for IO termination. From Sagi. - Various little fixes from Christoph, Guilherme, Keith, Ming Lin, Wang Sheng-Hui. - Convert mtip32xx to use the now exported blk-mq tag iter function, from Keith" * 'for-4.7/drivers' of git://git.kernel.dk/linux-block: (74 commits) lightnvm: reserved space calculation incorrect lightnvm: rename nr_pages to nr_ppas on nvm_rq lightnvm: add is_cached entry to struct ppa_addr lightnvm: expose gennvm_mark_blk to targets lightnvm: remove mgt targets on mgt removal lightnvm: pass dma address to hardware rather than pointer lightnvm: do not assume sequential lun alloc. nvme/lightnvm: Log using the ctrl named device lightnvm: rename dma helper functions lightnvm: enable metadata to be sent to device lightnvm: do not free unused metadata on rrpc lightnvm: fix out of bound ppa lun id on bb tbl lightnvm: refactor set_bb_tbl for accepting ppa list lightnvm: move responsibility for bad blk mgmt to target lightnvm: make nvm_set_rqd_ppalist() aware of vblks lightnvm: remove struct factory_blks lightnvm: refactor device ops->get_bb_tbl() lightnvm: introduce nvm_for_each_lun_ppa() macro lightnvm: refactor dev->online_target to global nvm_targets lightnvm: rename nvm_targets to nvm_tgt_type ...	2016-05-17 16:03:32 -07:00
Keith Busch	87c3207781	NVMe: Fix reset/remove race This fixes a scenario where device is present and being reset, but a request to unbind the driver occurs. A previous patch series addressing a device failure removal scenario flushed reset_work after controller disable to unblock reset_work waiting on a completion that wouldn't occur. This isn't safe as-is. The broken scenario can potentially be induced with: modprobe nvme && modprobe -r nvme To fix, the reset work is flushed immediately after setting the controller removing flag, and any subsequent reset will not proceed with controller initialization if the flag is set. The controller status must be polled while active, so the watchdog timer is also left active until the controller is disabled to cleanup requests that may be stuck during namespace removal. [Fixes: `ff23a2a15a`] Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-03 14:00:29 -06:00
Ming Lin	6904242db1	nvme: add helper nvme_cleanup_cmd() This hides command cleanup into nvme.h and fabrics drivers will also use it. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:11:58 -06:00
Christoph Hellwig	f866fc4282	nvme: move AER handling to common code The transport driver still needs to do the actual submission, but all the higher level code can be shared. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:26 -06:00
Christoph Hellwig	5955be2144	nvme: move namespace scanning to core Move the scan work item and surrounding code to the common code. For now we need a new finish_scan method to allow the PCI driver to set the irq affinity hints, but I have plans in the works to obsolete this as well. Note that this moves the namespace scanning from nvme_wq to the system workqueue, but as we don't rely on namespace scanning to finish from reset or I/O this should be fine. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by Jon Derrick: <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:24 -06:00
Christoph Hellwig	92911a55d4	nvme: tighten up state check for namespace scanning We only should be scanning namespaces if the controller is live. Currently we call the function just before setting it live, so fix the code up to move the call to nvme_queue_scan to just below the state change. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Acked-by Jon Derrick: <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:23 -06:00
Christoph Hellwig	bb8d261e08	nvme: introduce a controller state machine Replace the adhoc flags in the PCI driver with a state machine in the core code. Based on code from Sagi Grimberg for the Fabrics driver. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Acked-by Jon Derrick: <jonathan.derrick@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:21 -06:00
Christoph Hellwig	04a934d4c7	nvme: remove the io_incapable method It's unused since "NVMe: Move error handling to failed reset handler". Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:09:20 -06:00
Keith Busch	3b24774e1f	NVMe: Fix check_flush_dependency warning If the controller fails and is degraded after a reset, we need to kill off all requests queues before removing the inaccessble namespaces. This will prevent del_gendisk from syncing dirty data, which we can't due from a WQ_MEM_RECLAIM work queue. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-05-02 09:03:06 -06:00
Keith Busch	a5229050b6	NVMe: Always use MSI/MSI-x interrupts Multiple users have reported device initialization failure due the driver not receiving legacy PCI interrupts. This is not unique to any particular controller, but has been observed on multiple platforms. There have been no issues reported or observed when with message signaled interrupts, so this patch attempts to use MSI-x during initialization, falling back to MSI. If that fails, legacy would become the default. The setup_io_queues error handling had to change as a result: the admin queue's msix_entry used to be initialized to the legacy IRQ. The case where nr_io_queues is 0 would fail request_irq when setting up the admin queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving the admin queue's msix_entry invalid. Instead, return success immediately. Reported-by: Tim Muhlemmer <muhlemmer@gmail.com> Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-14 14:04:50 -06:00
Guilherme G. Piccoli	c875a7093f	nvme: Avoid reset work on watchdog timer function during error recovery This patch adds a check on nvme_watchdog_timer() function to avoid the call to reset_work() when an error recovery process is ongoing on controller. The check is made by looking at pci_channel_offline() result. If we don't check for this on nvme_watchdog_timer(), error recovery mechanism can't recover well, because reset_work() won't be able to do its job (since we're in the middle of an error) and so the controller is removed from the system before error recovery mechanism can perform slot reset (which would allow the adapter to recover). In this patch we also have split the huge condition expression on nvme_watchdog_timer() by introducing an auxiliary function to help make the code more readable. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-13 08:15:39 -06:00
Jens Axboe	7e19793096	NVMe: silence warning about unused 'dev' Depending on options, we might not be using dev in nvme_cancel_io(): drivers/nvme/host/pci.c: In function ‘nvme_cancel_io’: drivers/nvme/host/pci.c:970:19: warning: unused variable ‘dev’ [-Wunused-variable] struct nvme_dev *dev = data; ^ So get rid of it, and just cast for the dev_dbg_ratelimited() call. Fixes: `82b4552b91` ("nvme: Use blk-mq helper for IO termination") Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 16:11:11 -06:00
Sagi Grimberg	82b4552b91	nvme: Use blk-mq helper for IO termination blk-mq offers a tagset iterator so let's use that instead of using nvme_clear_queues. Note, we changed nvme_queue_cancel_ios name to nvme_cancel_io as there is no concept of a queue now in this function (we also lost the print). Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 15:07:15 -06:00
Keith Busch	21f033f7c7	NVMe: Skip async events for degraded controllers If the controller is degraded, the driver should stay out of the way so the user can recover the drive. This patch skips driver initiated async event requests when the drive is in this state. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Ming Lin	8093f7ca73	nvme: add helper nvme_setup_cmd() This moves nvme_setup_{flush,discard,rw} calls into a common nvme_setup_cmd() helper. So we can eventually hide all the command setup in the core module and don't even need to update the fabrics drivers for any specific command type. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Ming Lin	03b5929ebb	nvme: rewrite discard support This rewrites nvme_setup_discard() with blk_add_request_payload(). It allocates only the necessary amount(16 bytes) for the payload. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Ming Lin	58b4560275	nvme: add helper nvme_map_len() The helper returns the number of bytes that need to be mapped using PRPs/SGL entries. Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Ming Lin	2e39e0f608	nvme: add missing lock nesting notation When unloading driver, nvme_disable_io_queues() calls nvme_delete_queue() that sends nvme_admin_delete_cq command to admin sq. So when the command completed, the lock acquired by nvme_irq() actually belongs to admin queue. While the lock that nvme_del_cq_end() trying to acquire belongs to io queue. So it will not deadlock. This patch adds lock nesting notation to fix following report. [ 109.840952] ============================================= [ 109.846379] [ INFO: possible recursive locking detected ] [ 109.851806] 4.5.0+ #180 Tainted: G E [ 109.856533] --------------------------------------------- [ 109.861958] swapper/0/0 is trying to acquire lock: [ 109.866771] (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820bc6>] nvme_del_cq_end+0x26/0x70 [nvme] [ 109.876535] [ 109.876535] but task is already holding lock: [ 109.882398] (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820c2b>] nvme_irq+0x1b/0x50 [nvme] [ 109.891547] [ 109.891547] other info that might help us debug this: [ 109.898107] Possible unsafe locking scenario: [ 109.898107] [ 109.904056] CPU0 [ 109.906515] ---- [ 109.908974] lock(&(&nvmeq->q_lock)->rlock); [ 109.913381] lock(&(&nvmeq->q_lock)->rlock); [ 109.917787] [ 109.917787] * DEADLOCK * [ 109.917787] [ 109.923738] May be due to missing lock nesting notation [ 109.923738] [ 109.930558] 1 lock held by swapper/0/0: [ 109.934413] #0: (&(&nvmeq->q_lock)->rlock){-.....}, at: [<ffffffffc0820c2b>] nvme_irq+0x1b/0x50 [nvme] [ 109.944010] [ 109.944010] stack backtrace: [ 109.948389] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 4.5.0+ #180 [ 109.955734] Hardware name: Dell Inc. OptiPlex 7010/0YXT71, BIOS A15 08/12/2013 [ 109.962989] 0000000000000000 ffff88011e203c38 ffffffff81383d9c ffffffff81c13540 [ 109.970478] ffffffff826711d0 ffff88011e203ce8 ffffffff810bb429 0000000000000046 [ 109.977964] 0000000000000046 0000000000000000 0000000000b2e597 ffffffff81f4cb00 [ 109.985453] Call Trace: [ 109.987911] <IRQ> [<ffffffff81383d9c>] dump_stack+0x85/0xc9 [ 109.993711] [<ffffffff810bb429>] __lock_acquire+0x19b9/0x1c60 [ 109.999575] [<ffffffff810b6d1d>] ? trace_hardirqs_off+0xd/0x10 [ 110.005524] [<ffffffff810b386d>] ? complete+0x3d/0x50 [ 110.010688] [<ffffffff810bb760>] lock_acquire+0x90/0xf0 [ 110.016029] [<ffffffffc0820bc6>] ? nvme_del_cq_end+0x26/0x70 [nvme] [ 110.022418] [<ffffffff81772afb>] _raw_spin_lock_irqsave+0x4b/0x60 [ 110.028632] [<ffffffffc0820bc6>] ? nvme_del_cq_end+0x26/0x70 [nvme] [ 110.035019] [<ffffffffc0820bc6>] nvme_del_cq_end+0x26/0x70 [nvme] [ 110.041232] [<ffffffff8135b485>] blk_mq_end_request+0x35/0x60 [ 110.047095] [<ffffffffc0821ad8>] nvme_complete_rq+0x68/0x190 [nvme] [ 110.053481] [<ffffffff8135b53f>] __blk_mq_complete_request+0x8f/0x130 [ 110.060043] [<ffffffff8135b611>] blk_mq_complete_request+0x31/0x40 [ 110.066343] [<ffffffffc08209e3>] __nvme_process_cq+0x83/0x240 [nvme] [ 110.072818] [<ffffffffc0820c35>] nvme_irq+0x25/0x50 [nvme] [ 110.078419] [<ffffffff810cdb66>] handle_irq_event_percpu+0x36/0x110 [ 110.084804] [<ffffffff810cdc77>] handle_irq_event+0x37/0x60 [ 110.090491] [<ffffffff810d0ea3>] handle_edge_irq+0x93/0x150 [ 110.096180] [<ffffffff81012306>] handle_irq+0xa6/0x130 [ 110.101431] [<ffffffff81011abe>] do_IRQ+0x5e/0x120 [ 110.106333] [<ffffffff8177384c>] common_interrupt+0x8c/0x8c Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Keith Busch	788e15abbb	NVMe: Always use MSI/MSI-x interrupts Multiple users have reported device initialization failure due the driver not receiving legacy PCI interrupts. This is not unique to any particular controller, but has been observed on multiple platforms. There have been no issues reported or observed when with message signaled interrupts, so this patch attempts to use MSI-x during initialization, falling back to MSI. If that fails, legacy would become the default. The setup_io_queues error handling had to change as a result: the admin queue's msix_entry used to be initialized to the legacy IRQ. The case where nr_io_queues is 0 would fail request_irq when setting up the admin queue's interrupt since re-enabling MSI-x fails with 0 vectors, leaving the admin queue's msix_entry invalid. Instead, return success immediately. Reported-by: Tim Muhlemmer <muhlemmer@gmail.com> Reported-by: Jon Derrick <jonathan.derrick@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-12 13:44:00 -06:00
Keith Busch	9bf2b972af	NVMe: Fix reset/remove race This fixes a scenario where device is present and being reset, but a request to unbind the driver occurs. A previous patch series addressing a device failure removal scenario flushed reset_work after controller disable to unblock reset_work waiting on a completion that wouldn't occur. This isn't safe as-is. The broken scenario can potentially be induced with: modprobe nvme && modprobe -r nvme To fix, the reset work is flushed immediately after setting the controller removing flag, and any subsequent reset will not proceed with controller initialization if the flag is set. The controller status must be polled while active, so the watchdog timer is also left active until the controller is disabled to cleanup requests that may be stuck during namespace removal. [Fixes: `ff23a2a15a`] Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-04-11 10:00:04 -06:00
Marta Rybczynska	d783e0bd02	nvme: avoid cqe corruption when update at the same time as read Make sure the CQE phase (validity) is read before the rest of the structure. The phase bit is the highest address and the CQE read will happen on most platforms from lower to upper addresses and will be done by multiple non-atomic loads. If the structure is updated by PCI during the reads from the processor, the processor may get a corrupted copy. The addition of the new nvme_cqe_valid function that verifies the validity bit also allows refactoring of the other CQE read sequences. Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-22 10:27:29 -06:00
Linus Torvalds	237045fc3c	Merge branch 'for-4.6/drivers' of git://git.kernel.dk/linux-block Pull block driver updates from Jens Axboe: "This is the block driver pull request for this merge window. It sits on top of for-4.6/core, that was just sent out. This contains: - A set of fixes for lightnvm. One from Alan, fixing an overflow, and the rest from the usual suspects, Javier and Matias. - A set of fixes for nbd from Markus and Dan, and a fixup from Arnd for correct usage of the signed 64-bit divider. - A set of bug fixes for the Micron mtip32xx, from Asai. - A fix for the brd discard handling from Bart. - Update the maintainers entry for cciss, since that hardware has transferred ownership. - Three bug fixes for bcache from Eric Wheeler. - Set of fixes for xen-blk{back,front} from Jan and Konrad. - Removal of the cpqarray driver. It has been disabled in Kconfig since 2013, and we were initially scheduled to remove it in 3.15. - Various updates and fixes for NVMe, with the most important being: - Removal of the per-device NVMe thread, replacing that with a watchdog timer instead. From Christoph. - Exposing the namespace WWID through sysfs, from Keith. - Set of cleanups from Ming Lin. - Logging the controller device name instead of the underlying PCI device name, from Sagi. - And a bunch of fixes and optimizations from the usual suspects in this area" * 'for-4.6/drivers' of git://git.kernel.dk/linux-block: (49 commits) NVMe: Expose ns wwid through single sysfs entry drivers:block: cpqarray clean up brd: Fix discard request processing cpqarray: remove it from the kernel cciss: update MAINTAINERS NVMe: Remove unused sq_head read in completion path bcache: fix cache_set_flush() NULL pointer dereference on OOM bcache: cleaned up error handling around register_cache() bcache: fix race of writeback thread starting before complete initialization NVMe: Create discard zero quirk white list nbd: use correct div_s64 helper mtip32xx: remove unneeded variable in mtip_cmd_timeout() lightnvm: generalize rrpc ppa calculations lightnvm: remove struct nvm_dev->total_blocks lightnvm: rename ->nr_pages to ->nr_sects lightnvm: update closed list outside of intr context xen/blback: Fit the important information of the thread in 17 characters lightnvm: fold get bb tbl when using dual/quad plane mode lightnvm: fix up nonsensical configure overrun checking xen-blkback: advertise indirect segment support earlier ...	2016-03-18 17:13:31 -07:00
Jon Derrick	48c7823f42	NVMe: Remove unused sq_head read in completion path Signed-off-by: Jon Derrick <jonathan.derrick@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-08 15:01:06 -07:00
Keith Busch	08095e7078	NVMe: Create discard zero quirk white list The NVMe specification does not require discarded blocks return zeroes on read, but provides that behavior as a possibility. Some applications more efficiently use an SSD if reads on discarded blocks were deterministically zero, based on the "discard_zeroes_data" queue attribute. There is no specification defined way to determine device behavior on discarded blocks, so the driver always left the queue setting disabled. We can only know behavior based on individual device models, so this patch adds a flag to the NVMe "quirk" list that vendors may set if they know their controller works that way. The patch also sets the new flag for one such known device. Signed-off-by: Keith Busch <keith.busch@intel.com> Suggested-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-08 08:32:40 -07:00
Keith Busch	69d9a99c25	NVMe: Move error handling to failed reset handler This moves failed queue handling out of the namespace removal path and into the reset failure path, fixing a hanging condition if the controller fails or link down during del_gendisk. Previously the driver had to see the controller as degraded prior to calling del_gendisk to setup the queues to fail. But, if the controller happened to fail after this, there was no task to end outstanding requests. On failure, all namespace states are set to dead. This has capacity revalidate to 0, and ends all new requests with error status. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:50 -07:00
Keith Busch	f58944e265	NVMe: Simplify device reset failure A reset failure schedules the device to unbind from the driver through the pci driver's remove. This cleans up all intialization, so there is no need to duplicate the potentially racy cleanup. To help understand why a reset failed, the status is logged with the existing warning message. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:49 -07:00
Keith Busch	646017a612	NVMe: Fix namespace removal deadlock This patch makes nvme namespace removal lockless. It is up to the caller to ensure no active namespace scanning is occuring. To ensure no scan work occurs, the nvme pci driver adds a removing state to the controller device to avoid queueing scan work during removal. The work is flushed after setting the state, so no new scan work can be queued. The lockless removal allows the driver to cleanup a namespace request_queue if the controller fails during removal. Previously this could deadlock trying to acquire the namespace mutex in order to handle such events. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:49 -07:00
Keith Busch	b00a726a9f	NVMe: Don't unmap controller registers on reset Unmapping the registers on reset or shutdown is not necessary. Keeping the mapping simplifies reset handling. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-03-03 14:42:49 -07:00
Christoph Hellwig	1cb3cce5eb	nvme: return the whole CQE through the request passthrough interface Both LighNVM and NVMe over Fabrics need to look at more than just the status and result field. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matias Bj?rling <m@bjorling.me> Reviewed-by: Jay Freyensee <james.p.freyensee@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 08:47:17 -07:00
Christoph Hellwig	2d55cd5f51	nvme: replace the kthread with a per-device watchdog timer The only work left in the kthread is the periodic health check for each controller. There is no need to run this from process context or keep a thread context around for it, so replace it with a simpler timer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 08:47:16 -07:00
Christoph Hellwig	79f2b358c9	nvme: don't poll the CQ from the kthread There is no reason to do unconditional polling of CQs per the NVMe spec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 08:47:14 -07:00
Christoph Hellwig	9396dec916	nvme: use a work item to submit async event requests Use a dedicated work item to submit async event requests instead of the global kthread. This simplifies the code and reduces the latencies to resubmit a request once an even notification happened. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-29 08:47:13 -07:00
Keith Busch	f8e68a7c9a	NVMe: Rate limit nvme IO warnings We don't need to spam the kernel logs with thousands of IO cancelling messages. We can infer all IO's are being cancelled with fewer, or even none at all. This patch rate limits the message and uses the debug log level as it is mainly used for testing purposes. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-12 08:10:31 -07:00
Keith Busch	ff23a2a15a	NVMe: Poll device while still active during remove A device failure or link down wouldn't have been detected during namespace removal. This patch keeps the device in the list for polling so that the thread may see such failure and initiate a reset. The device is removed from the list after disable, so we can safely flush the reset work as it can't be requeued when disable completes. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-12 08:10:16 -07:00
Keith Busch	ae1fba2001	NVMe: Requeue requests on suspended queues It's possible a request may get to the driver after the nvme queue was disabled. This has the request requeue if that happens. Note the request is still "started" by the driver, but requeuing will clear the start state for timeout handling. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-12 08:10:08 -07:00
Ming Lin	576d55d625	nvme: split pci module out of core module NVMe over Fabrics drivers are going to reuse the core, so splits nvme.ko into 2 modules: nvme-core.ko: the core part nvme.ko: the PCI driver Export symbols from nvme-core.ko. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 14:22:38 -07:00
Ming Lin	9f2482b91b	nvme: split dev_list_lock Split dev_list_lock into one in the core and one in the PCI driver. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 14:22:36 -07:00
Ming Lin	ba0ba7d3e5	nvme: move timeout variables to core.c These variables are used by PCI driver and will also be used in the forthcoming NVMe over Fabrics drivers. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 14:22:34 -07:00
Sagi Grimberg	e439bb12e7	nvme/host: reference the fabric module for each bdev open callout We don't want to be able to unload the fabric driver when we have openened referenced to our namespaces. Thus, for each nvme_open we take a reference on the fabric driver and put it in nvme_release. This behavior is consistent with the scsi model. This resolves the panic when unloading a fabric module with mpath holders. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ian Bakshan <ianb@mellanox.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 14:22:32 -07:00
Sagi Grimberg	1b3c47c182	nvme: Log the ctrl device name instead of the underlying pci device name Having the ctrl name "nvmeX" seems much more friendly than the underlying device name. Also, with other nvme transports such as the soon to come nvme-loop we don't have an underlying device so it doesn't makes sense to make up one. In order to help matching an instance name to a pci function, we add a info print in nvme_probe. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Acked-by: Keith Busch <keith.busch@intel.com> Manually fixed up the hunk in nvme_cancel_queue_ios(). Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-10 08:51:15 -07:00
Keith Busch	949928c1c7	NVMe: Fix possible queue use after freed This notifies blk-mq when the tag set contains a different number of queues prior to freeing unused ones that the request queue points to. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-02-09 12:42:35 -07:00
Linus Torvalds	3e1e21c7bf	Merge branch 'for-4.5/nvme' of git://git.kernel.dk/linux-block Pull NVMe updates from Jens Axboe: "Last branch for this series is the nvme changes. It's in a separate branch to avoid splitting too much between core and NVMe changes, since NVMe is still helping drive some blk-mq changes. That said, not a huge amount of core changes in here. The grunt of the work is the continued split of the code" * 'for-4.5/nvme' of git://git.kernel.dk/linux-block: (67 commits) uapi: update install list after nvme.h rename NVMe: Export NVMe attributes to sysfs group NVMe: Shutdown controller only for power-off NVMe: IO queue deletion re-write NVMe: Remove queue freezing on resets NVMe: Use a retryable error code on reset NVMe: Fix admin queue ring wrap nvme: make SG_IO support optional nvme: fixes for NVME_IOCTL_IO_CMD on the char device nvme: synchronize access to ctrl->namespaces nvme: Move nvme_freeze/unfreeze_queues to nvme core PCI/AER: include header file NVMe: Export namespace attributes to sysfs NVMe: Add pci error handlers block: remove REQ_NO_TIMEOUT flag nvme: merge iod and cmd_info nvme: meta_sg doesn't have to be an array nvme: properly free resources for cancelled command nvme: simplify completion handling nvme: special case AEN requests ...	2016-01-21 19:58:02 -08:00
Linus Torvalds	7c24d9f3b2	Merge branch 'for-4.5/core' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "We don't have a lot of core changes this time around, it's mostly in drivers, which will come in a subsequent pull. The cores changes include: - blk-mq - Prep patch from Christoph, changing blk_mq_alloc_request() to take flags instead of just using gfp_t for sleep/nosleep. - Doc patch from me, clarifying the difference between legacy and blk-mq for timer usage. - Fixes from Raghavendra for memory-less numa nodes, and a reuse of CPU masks. - Cleanup from Geliang Tang, using offset_in_page() instead of open coding it. - From Ilya, rename request_queue slab to it reflects what it holds, and a fix for proper use of bdgrab/put. - A real fix for the split across stripe boundaries from Keith. We yanked a broken version of this from 4.4-rc final, this one works. - From Mike Krinkin, emit a trace message when we split. - From Wei Tang, two small cleanups, not explicitly clearing memory that is already cleared" * 'for-4.5/core' of git://git.kernel.dk/linux-block: block: use bd{grab,put}() instead of open-coding block: split bios to max possible length block: add call to split trace point blk-mq: Avoid memoryless numa node encoded in hctx numa_node blk-mq: Reuse hardware context cpumask for tags blk-mq: add a flags parameter to blk_mq_alloc_request Revert "blk-flush: Queue through IO scheduler when flush not required" block: clarify blk_add_timer() use case for blk-mq bio: use offset_in_page macro block: do not initialise statics to 0 or NULL block: do not initialise globals to 0 or NULL block: rename request_queue slab cache	2016-01-19 15:03:34 -08:00
Keith Busch	a5cdb68c2c	NVMe: Shutdown controller only for power-off We don't need to shutdown a controller for a reset. A controller in a shutdown state may take longer to become ready than one that was simply disabled. This patch has the driver shut down a controller only if the device is about to be powered off or being removed. When taking the controller down for a reset reason, the controller will be disabled instead. Function names have been updated in this patch to reflect their changed semantics. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 14:47:59 -07:00
Keith Busch	db3cbfff5b	NVMe: IO queue deletion re-write The nvme driver deletes IO queues asynchronously since this operation may potentially take an undesirable amount of time with a large number of queues if done serially. The driver used to manage coordinating asynchronous deletions. This patch simplifies that by leveraging the block layer rather than using kthread workers and chaining more complicated callbacks. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 14:47:56 -07:00
Keith Busch	25646264e1	NVMe: Remove queue freezing on resets NVMe submits all commands through the block layer now. This means we can let requests queue at the blk-mq hardware context since there is no path that bypasses this anymore so we don't need to freeze the queues anymore. The driver can simply stop the h/w queues from running during a reset instead. This also fixes a WARN in percpu_ref_reinit when the queue was unfrozen with requeued requests. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:33:36 -07:00
Keith Busch	1d49c38c48	NVMe: Use a retryable error code on reset A negative status has the "do not retry" bit set, which makes it not retryable. Use a fake status that can potentially be retried on reset. An aborted command's status is overridden by the timeout handler so that it won't be retried, which is necessary to keep initialization from getting into a reset loop. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2016-01-12 13:33:35 -07:00

1 2 3

129 Commits