linux

iv/linux

Author	SHA1	Message	Date
Keith Busch	3fd5940cb2	blk-mq: Wake tasks entering queue on dying When the queue is set to dying, wake up tasks that are waiting on frozen queue so they realize it is dying and abandon their request. Signed-off-by: Keith Busch <keith.busch@intel.com> Modified by me to add a code comment on the need for the wakeup. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-08 08:53:56 -07:00
Borislav Petkov	26e022727f	efi: Rename efi_guid_unparse to efi_guid_to_str Call it what it does - "unparse" is plain-misleading. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>	2015-01-07 19:07:44 -08:00
Jens Axboe	17ded32070	blk-mq: get rid of ->cmd_size in the hardware queue We store it in the tag set, we don't need it in the hardware queue. While removing cmd_size, place ->queue_num further down to avoid a hole on 64-bit archs. It's not used in any fast paths, so we can safely move it. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-07 10:44:04 -07:00
Jens Axboe	c761d96b07	blk-mq: export blk_mq_freeze_queue() Commit `b4c6a02877` exported the start and unfreeze, but we need the regular blk_mq_freeze_queue() for the loop conversion. Signed-off-by: Jens Axboe <axboe@fb.com>	2015-01-02 15:05:12 -07:00
Jens Axboe	aed3ea94bd	block: wake up waiters when a queue is marked dying If it's dying, we can't expect new request to complete and come in an wake up other tasks waiting for requests. So after we have marked it as dying, wake up everybody currently waiting for a request. Once they wake, they will retry their allocation and fail appropriately due to the state of the queue. Tested-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-31 09:39:16 -07:00
Keith Busch	b4c6a02877	blk-mq: Export freeze/unfreeze functions Let drivers prevent entering a queue that isn't available. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-20 10:34:15 -07:00
Keith Busch	c76541a932	blk-mq: Exit queue on alloc failure Fixes usage counter when a request could not be allocated. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-20 10:33:53 -07:00
Jens Axboe	35d37c6635	Revert "blk-mq: Micro-optimize bt_get()" This reverts commit `52f7eb945f`. The optimization is only really safe for a single queue, otherwise 'bs' and 'bt' can indeed change, and if we don't do a finish_wait() for each loop, we'll potentially change the wait structure and corrupt task wait list. Reported-by: Jan Kara <jack@suse.cz>	2014-12-15 08:30:26 -07:00
Linus Torvalds	caf292ae5b	Merge branch 'for-3.19/core' of git://git.kernel.dk/linux-block Pull block driver core update from Jens Axboe: "This is the pull request for the core block IO changes for 3.19. Not a huge round this time, mostly lots of little good fixes: - Fix a bug in sysfs blktrace interface causing a NULL pointer dereference, when enabled/disabled through that API. From Arianna Avanzini. - Various updates/fixes/improvements for blk-mq: - A set of updates from Bart, mostly fixing buts in the tag handling. - Cleanup/code consolidation from Christoph. - Extend queue_rq API to be able to handle batching issues of IO requests. NVMe will utilize this shortly. From me. - A few tag and request handling updates from me. - Cleanup of the preempt handling for running queues from Paolo. - Prevent running of unmapped hardware queues from Ming Lei. - Move the kdump memory limiting check to be in the correct location, from Shaohua. - Initialize all software queues at init time from Takashi. This prevents a kobject warning when CPUs are brought online that weren't online when a queue was registered. - Single writeback fix for I_DIRTY clearing from Tejun. Queued with the core IO changes, since it's just a single fix. - Version X of the __bio_add_page() segment addition retry from Maurizio. Hope the Xth time is the charm. - Documentation fixup for IO scheduler merging from Jan. - Introduce (and use) generic IO stat accounting helpers for non-rq drivers, from Gu Zheng. - Kill off artificial limiting of max sectors in a request from Christoph" * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits) bio: modify __bio_add_page() to accept pages that don't start a new segment blk-mq: Fix uninitialized kobject at CPU hotplugging blktrace: don't let the sysfs interface remove trace from running list blk-mq: Use all available hardware queues blk-mq: Micro-optimize bt_get() blk-mq: Fix a race between bt_clear_tag() and bt_get() blk-mq: Avoid that __bt_get_word() wraps multiple times blk-mq: Fix a use-after-free blk-mq: prevent unmapped hw queue from being scheduled blk-mq: re-check for available tags after running the hardware queue blk-mq: fix hang in bt_get() blk-mq: move the kdump check to blk_mq_alloc_tag_set blk-mq: cleanup tag free handling blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map blk: introduce generic io stat accounting help function blk-mq: handle the single queue case in blk_mq_hctx_next_cpu genhd: check for int overflow in disk_expand_part_tbl() blk-mq: add blk_mq_free_hctx_request() blk-mq: export blk_mq_free_request() blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable ...	2014-12-13 14:14:23 -08:00
Maurizio Lombardi	fcbf6a087a	bio: modify __bio_add_page() to accept pages that don't start a new segment The original behaviour is to refuse to add a new page if the maximum number of segments has been reached, regardless of the fact the page we are going to add can be merged into the last segment or not. Unfortunately, when the system runs under heavy memory fragmentation conditions, a driver may try to add multiple pages to the last segment. The original code won't accept them and EBUSY will be reported to userspace. This patch modifies the function so it refuses to add a page only in case the latter starts a new segment and the maximum number of segments has already been reached. The bug can be easily reproduced with the st driver: 1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16 2) modprobe st buffer_kbs=1024 3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10 dd: error writing `/dev/st0': Device or resource busy Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Cc: Jet Chen <jet.chen@intel.com> Cc: Tomas Henzl <thenzl@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-11 09:11:52 -07:00
Linus Torvalds	92a578b064	ACPI and power management updates for 3.19-rc1 This time we have some more new material than we used to have during the last couple of development cycles. The most important part of it to me is the introduction of a unified interface for accessing device properties provided by platform firmware. It works with Device Trees and ACPI in a uniform way and drivers using it need not worry about where the properties come from as long as the platform firmware (either DT or ACPI) makes them available. It covers both devices and "bare" device node objects without struct device representation as that turns out to be necessary in some cases. This has been in the works for quite a few months (and development cycles) and has been approved by all of the relevant maintainers. On top of that, some drivers are switched over to the new interface (at25, leds-gpio, gpio_keys_polled) and some additional changes are made to the core GPIO subsystem to allow device drivers to manipulate GPIOs in the "canonical" way on platforms that provide GPIO information in their ACPI tables, but don't assign names to GPIO lines (in which case the driver needs to do that on the basis of what it knows about the device in question). That also has been approved by the GPIO core maintainers and the rfkill driver is now going to use it. Second is support for hardware P-states in the intel_pstate driver. It uses CPUID to detect whether or not the feature is supported by the processor in which case it will be enabled by default. However, it can be disabled entirely from the kernel command line if necessary. Next is support for a platform firmware interface based on ACPI operation regions used by the PMIC (Power Management Integrated Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms. That interface is used for manipulating power resources and for thermal management: sensor temperature reporting, trip point setting and so on. Also the ACPI core is now going to support the _DEP configuration information in a limited way. Basically, _DEP it supposed to reflect off-the-hierarchy dependencies between devices which may be very indirect, like when AML for one device accesses locations in an operation region handled by another device's driver (usually, the device depended on this way is a serial bus or GPIO controller). The support added this time is sufficient to make the ACPI battery driver work on Asus T100A, but it is general enough to be able to cover some other use cases in the future. Finally, we have a new cpufreq driver for the Loongson1B processor. In addition to the above, there are fixes and cleanups all over the place as usual and a traditional ACPICA update to a recent upstream release. As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for Intel platforms should be able to handle power management of the DMA engine correctly, the cpufreq-dt driver should interact with the thermal subsystem in a better way and the ACPI backlight driver should handle some more corner cases, among other things. On top of the ACPICA update there are fixes for race conditions in the ACPICA's interrupt handling code which might lead to some random and strange looking failures on some systems. In the cleanups department the most visible part is the series of commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration option. That was triggered by a discussion regarding the generic power domains code during which we realized that trying to support certain combinations of PM config options was painful and not really worth it, because nobody would use them in production anyway. For this reason, we decided to make CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and that lead to the conclusion that the latter became redundant and CONFIG_PM could be used instead of it. The material here makes that replacement in a major part of the tree, but there will be at least one more batch of that in the second part of the merge window. Specifics: - Support for retrieving device properties information from ACPI _DSD device configuration objects and a unified device properties interface for device drivers (and subsystems) on top of that. As stated above, this works with Device Trees and ACPI and allows device drivers to be written in a platform firmware (DT or ACPI) agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are now going to use this new interface and the GPIO subsystem is additionally modified to allow device drivers to assign names to GPIO resources returned by ACPI _CRS objects (in case _DSD is not present or does not provide the expected data). The changes in this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron Lu, and Darren Hart with some fixes from others (Fabio Estevam, Geert Uytterhoeven). - Support for Hardware Managed Performance States (HWP) as described in Volume 3, section 14.4, of the Intel SDM in the intel_pstate driver. CPUID is used to detect whether or not the feature is supported by the processor. If supported, it will be enabled automatically unless the intel_pstate=no_hwp switch is present in the kernel command line. From Dirk Brandewie. - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie). - Support for firmware interface based on ACPI operation regions used by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR platforms for power resource control and thermal management (Aaron Lu). - Limited support for retrieving off-the-hierarchy dependencies between devices from ACPI _DEP device configuration objects and deferred probing support for the ACPI battery driver based on the _DEP information to make that driver work on Asus T100A (Lan Tianyu). - New cpufreq driver for the Loongson1B processor (Kelvin Cheung). - ACPICA update to upstream revision 20141107 which only affects tools (Bob Moore). - Fixes for race conditions in the ACPICA's interrupt handling code and in the ACPI code related to system suspend and resume (Lv Zheng and Rafael J Wysocki). - ACPI core fix for an RCU-related issue in the ioremap() regions management code that slowed down significantly after CPUs had been allowed to enter idle states even if they'd had RCU callbakcs queued and triggered some problems in certain proprietary graphics driver (and elsewhere). The fix replaces synchronize_rcu() in that code with synchronize_rcu_expedited() which makes the issue go away. From Konstantin Khlebnikov. - ACPI LPSS (Low-Power Subsystem) driver fix to handle power management of the DMA engine included into the LPSS correctly. The problem is that the DMA engine doesn't have ACPI PM support of its own and it simply is turned off when the last LPSS device having ACPI PM support goes into D3cold. To work around that, the PM domain used by the ACPI LPSS driver is redesigned so at least one device with ACPI PM support will be on as long as the DMA engine is in use. From Andy Shevchenko. - ACPI backlight driver fix to avoid using it on "Win8-compatible" systems where it doesn't work and where it was used by default by mistake (Aaron Lu). - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki, Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin Chaugule (mostly related to the upcoming ARM64 support). - Intel RAPL (Running Average Power Limit) power capping driver fixes and improvements including new processor IDs (Jacob Pan). - Generic power domains modification to power up domains after attaching devices to them to meet the expectations of device drivers and bus types assuming devices to be accessible at probe time (Ulf Hansson). - Preliminary support for controlling device clocks from the generic power domains core code and modifications of the ARM/shmobile platform to use that feature (Ulf Hansson). - Assorted minor fixes and cleanups of the generic power domains core code (Ulf Hansson, Geert Uytterhoeven). - Assorted minor fixes and cleanups of the device clocks control code in the PM core (Geert Uytterhoeven, Grygorii Strashko). - Consolidation of device power management Kconfig options by making CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter which is now redundant (Rafael J Wysocki and Kevin Hilman). That is the first batch of the changes needed for this purpose. - Core device runtime power management support code cleanup related to the execution of callbacks (Andrzej Hajda). - cpuidle ARM support improvements (Lorenzo Pieralisi). - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and Bartlomiej Zolnierkiewicz). - New cpufreq driver callback (->ready) to be executed when the cpufreq core is ready to use a given policy object and cpufreq-dt driver modification to use that callback for cooling device registration (Viresh Kumar). - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James Geboski, Tomeu Vizoso). - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate, cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao, Stefan Wahren, Petr Cvek). - OPP (Operating Performance Points) framework modification to allow OPPs to be removed too and update of a few cpufreq drivers (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added during initialization) on driver removal (Viresh Kumar). - Hibernation core fixes and cleanups (Tina Ruchandani and Markus Elfring). - PM Kconfig fix related to CPU power management (Pankaj Dubey). - cpupower tool fix (Prarit Bhargava). / -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQIcBAABCAAGBQJUhj6JAAoJEILEb/54YlRxTM4P/j5g5SfqvY0QKsn7sR7MGZ6v nsgCBhJAqTw3ocNC7EAs8z9h2GWy1KbKpakKYWAh9Fs1yZoey7tFSlcv/Rgjlp70 uU5sDQHtpE9mHKiymdsowiQuWgpl962L4k+k8hUslhlvgk1PvVbpajR6OqG8G+pD asuIW9eh1APNkLyXmRJ3ZPomzs0VmRdZJ0NEs0lKX9mJskqEvxPIwdaxq3iaJq9B Fo0J345zUDcJnxWblDRdHlOigCimglElfN5qJwaC4KpwUKuBvLRKbp4f69+wfT0c kYFiR29X5KjJ2kLfP/wKsLyuDCYYXRq3tCia5M1tAqOjZ+UA89H/GDftx/5lntmv qUlBa35VfdS1SX4HyApZitOHiLgo+It/hl8Z9bJnhyVw66NxmMQ8JYN2imb8Lhqh XCLR7BxLTah82AapLJuQ0ZDHPzZqMPG2veC2vAzRMYzVijict/p4Y2+qBqONltER 4rs9uRVn+hamX33lCLg8BEN8zqlnT3rJFIgGaKjq/wXHAU/zpE9CjOrKMQcAg9+s t51XMNPwypHMAYyGVhEL89ImjXnXxBkLRuquhlmEpvQchIhR+mR3dLsarGn7da44 WPIQJXzcsojXczcwwfqsJCR4I1FTFyQIW+UNh02GkDRgRovQqo+Jk762U7vQwqH+ LBdhvVaS1VW4v+FWXEoZ =5dox -----END PGP SIGNATURE----- Merge tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management updates from Rafael Wysocki: "This time we have some more new material than we used to have during the last couple of development cycles. The most important part of it to me is the introduction of a unified interface for accessing device properties provided by platform firmware. It works with Device Trees and ACPI in a uniform way and drivers using it need not worry about where the properties come from as long as the platform firmware (either DT or ACPI) makes them available. It covers both devices and "bare" device node objects without struct device representation as that turns out to be necessary in some cases. This has been in the works for quite a few months (and development cycles) and has been approved by all of the relevant maintainers. On top of that, some drivers are switched over to the new interface (at25, leds-gpio, gpio_keys_polled) and some additional changes are made to the core GPIO subsystem to allow device drivers to manipulate GPIOs in the "canonical" way on platforms that provide GPIO information in their ACPI tables, but don't assign names to GPIO lines (in which case the driver needs to do that on the basis of what it knows about the device in question). That also has been approved by the GPIO core maintainers and the rfkill driver is now going to use it. Second is support for hardware P-states in the intel_pstate driver. It uses CPUID to detect whether or not the feature is supported by the processor in which case it will be enabled by default. However, it can be disabled entirely from the kernel command line if necessary. Next is support for a platform firmware interface based on ACPI operation regions used by the PMIC (Power Management Integrated Circuit) chips on the Intel Baytrail-T and Baytrail-T-CR platforms. That interface is used for manipulating power resources and for thermal management: sensor temperature reporting, trip point setting and so on. Also the ACPI core is now going to support the _DEP configuration information in a limited way. Basically, _DEP it supposed to reflect off-the-hierarchy dependencies between devices which may be very indirect, like when AML for one device accesses locations in an operation region handled by another device's driver (usually, the device depended on this way is a serial bus or GPIO controller). The support added this time is sufficient to make the ACPI battery driver work on Asus T100A, but it is general enough to be able to cover some other use cases in the future. Finally, we have a new cpufreq driver for the Loongson1B processor. In addition to the above, there are fixes and cleanups all over the place as usual and a traditional ACPICA update to a recent upstream release. As far as the fixes go, the ACPI LPSS (Low-power Subsystem) driver for Intel platforms should be able to handle power management of the DMA engine correctly, the cpufreq-dt driver should interact with the thermal subsystem in a better way and the ACPI backlight driver should handle some more corner cases, among other things. On top of the ACPICA update there are fixes for race conditions in the ACPICA's interrupt handling code which might lead to some random and strange looking failures on some systems. In the cleanups department the most visible part is the series of commits targeted at getting rid of the CONFIG_PM_RUNTIME configuration option. That was triggered by a discussion regarding the generic power domains code during which we realized that trying to support certain combinations of PM config options was painful and not really worth it, because nobody would use them in production anyway. For this reason, we decided to make CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and that lead to the conclusion that the latter became redundant and CONFIG_PM could be used instead of it. The material here makes that replacement in a major part of the tree, but there will be at least one more batch of that in the second part of the merge window. Specifics: - Support for retrieving device properties information from ACPI _DSD device configuration objects and a unified device properties interface for device drivers (and subsystems) on top of that. As stated above, this works with Device Trees and ACPI and allows device drivers to be written in a platform firmware (DT or ACPI) agnostic way. The at25, leds-gpio and gpio_keys_polled drivers are now going to use this new interface and the GPIO subsystem is additionally modified to allow device drivers to assign names to GPIO resources returned by ACPI _CRS objects (in case _DSD is not present or does not provide the expected data). The changes in this set are mostly from Mika Westerberg, Rafael J Wysocki, Aaron Lu, and Darren Hart with some fixes from others (Fabio Estevam, Geert Uytterhoeven). - Support for Hardware Managed Performance States (HWP) as described in Volume 3, section 14.4, of the Intel SDM in the intel_pstate driver. CPUID is used to detect whether or not the feature is supported by the processor. If supported, it will be enabled automatically unless the intel_pstate=no_hwp switch is present in the kernel command line. From Dirk Brandewie. - New Intel Broadwell-H ID for intel_pstate (Dirk Brandewie). - Support for firmware interface based on ACPI operation regions used by the PMIC chips on the Intel Baytrail-T and Baytrail-T-CR platforms for power resource control and thermal management (Aaron Lu). - Limited support for retrieving off-the-hierarchy dependencies between devices from ACPI _DEP device configuration objects and deferred probing support for the ACPI battery driver based on the _DEP information to make that driver work on Asus T100A (Lan Tianyu). - New cpufreq driver for the Loongson1B processor (Kelvin Cheung). - ACPICA update to upstream revision 20141107 which only affects tools (Bob Moore). - Fixes for race conditions in the ACPICA's interrupt handling code and in the ACPI code related to system suspend and resume (Lv Zheng and Rafael J Wysocki). - ACPI core fix for an RCU-related issue in the ioremap() regions management code that slowed down significantly after CPUs had been allowed to enter idle states even if they'd had RCU callbakcs queued and triggered some problems in certain proprietary graphics driver (and elsewhere). The fix replaces synchronize_rcu() in that code with synchronize_rcu_expedited() which makes the issue go away. From Konstantin Khlebnikov. - ACPI LPSS (Low-Power Subsystem) driver fix to handle power management of the DMA engine included into the LPSS correctly. The problem is that the DMA engine doesn't have ACPI PM support of its own and it simply is turned off when the last LPSS device having ACPI PM support goes into D3cold. To work around that, the PM domain used by the ACPI LPSS driver is redesigned so at least one device with ACPI PM support will be on as long as the DMA engine is in use. From Andy Shevchenko. - ACPI backlight driver fix to avoid using it on "Win8-compatible" systems where it doesn't work and where it was used by default by mistake (Aaron Lu). - Assorted minor ACPI core fixes and cleanups from Tomasz Nowicki, Sudeep Holla, Huang Rui, Hanjun Guo, Fabian Frederick, and Ashwin Chaugule (mostly related to the upcoming ARM64 support). - Intel RAPL (Running Average Power Limit) power capping driver fixes and improvements including new processor IDs (Jacob Pan). - Generic power domains modification to power up domains after attaching devices to them to meet the expectations of device drivers and bus types assuming devices to be accessible at probe time (Ulf Hansson). - Preliminary support for controlling device clocks from the generic power domains core code and modifications of the ARM/shmobile platform to use that feature (Ulf Hansson). - Assorted minor fixes and cleanups of the generic power domains core code (Ulf Hansson, Geert Uytterhoeven). - Assorted minor fixes and cleanups of the device clocks control code in the PM core (Geert Uytterhoeven, Grygorii Strashko). - Consolidation of device power management Kconfig options by making CONFIG_PM_SLEEP select CONFIG_PM_RUNTIME and removing the latter which is now redundant (Rafael J Wysocki and Kevin Hilman). That is the first batch of the changes needed for this purpose. - Core device runtime power management support code cleanup related to the execution of callbacks (Andrzej Hajda). - cpuidle ARM support improvements (Lorenzo Pieralisi). - cpuidle cleanup related to the CPUIDLE_FLAG_TIME_VALID flag and a new MAINTAINERS entry for ARM Exynos cpuidle (Daniel Lezcano and Bartlomiej Zolnierkiewicz). - New cpufreq driver callback (->ready) to be executed when the cpufreq core is ready to use a given policy object and cpufreq-dt driver modification to use that callback for cooling device registration (Viresh Kumar). - cpufreq core fixes and cleanups (Viresh Kumar, Vince Hsu, James Geboski, Tomeu Vizoso). - Assorted fixes and cleanups in the cpufreq-pcc, intel_pstate, cpufreq-dt, pxa2xx cpufreq drivers (Lenny Szubowicz, Ethan Zhao, Stefan Wahren, Petr Cvek). - OPP (Operating Performance Points) framework modification to allow OPPs to be removed too and update of a few cpufreq drivers (cpufreq-dt, exynos5440, imx6q, cpufreq) to remove OPPs (added during initialization) on driver removal (Viresh Kumar). - Hibernation core fixes and cleanups (Tina Ruchandani and Markus Elfring). - PM Kconfig fix related to CPU power management (Pankaj Dubey). - cpupower tool fix (Prarit Bhargava)" * tag 'pm+acpi-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (120 commits) i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM tools: cpupower: fix return checks for sysfs_get_idlestate_count() drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM leds: leds-gpio: Fix multiple instances registration without 'label' property iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hwrandom / exynos / PM: Use CONFIG_PM in #ifdef block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM USB / PM: Drop CONFIG_PM_RUNTIME from the USB core PM: Merge the SET*_RUNTIME_PM_OPS() macros ...	2014-12-10 21:17:00 -08:00
Takashi Iwai	06a41a99d1	blk-mq: Fix uninitialized kobject at CPU hotplugging When a CPU is hotplugged, the current blk-mq spews a warning like: kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong. CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014 0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8 ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58 ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007 Call Trace: [<ffffffff81005306>] dump_trace+0x86/0x330 [<ffffffff81005644>] show_stack_log_lvl+0x94/0x170 [<ffffffff81006d21>] show_stack+0x21/0x50 [<ffffffff81605f07>] dump_stack+0x41/0x51 [<ffffffff8132c7a0>] kobject_add+0xa0/0xb0 [<ffffffff8130aee1>] blk_mq_register_hctx+0x91/0xb0 [<ffffffff8130b82e>] blk_mq_sysfs_register+0x3e/0x60 [<ffffffff81309298>] blk_mq_queue_reinit_notify+0xf8/0x190 [<ffffffff8107cfdc>] notifier_call_chain+0x4c/0x70 [<ffffffff8105fd23>] cpu_notify+0x23/0x50 [<ffffffff81060037>] _cpu_up+0x157/0x170 [<ffffffff810600d9>] cpu_up+0x89/0xb0 [<ffffffff815fa5b5>] cpu_subsys_online+0x35/0x80 [<ffffffff814323cd>] device_online+0x5d/0xa0 [<ffffffff81432485>] online_store+0x75/0x80 [<ffffffff81236a5a>] kernfs_fop_write+0xda/0x150 [<ffffffff811c5532>] vfs_write+0xb2/0x1f0 [<ffffffff811c5f42>] SyS_write+0x42/0xb0 [<ffffffff8160c4ed>] system_call_fastpath+0x16/0x1b [<00007f0132fb24e0>] 0x7f0132fb24e0 This is indeed because of an uninitialized kobject for blk_mq_ctx. The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but it goes loop over hctx_for_each_ctx(), i.e. it initializes only for online CPUs. Thus, when a CPU is hotplugged, the ctx for the newly onlined CPU is registered without initialization. This patch fixes the issue by initializing the all ctx kobjects belonging to each queue. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794 Cc: <stable@vger.kernel.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-10 08:57:31 -07:00
Bart Van Assche	959f5f5b2f	blk-mq: Use all available hardware queues Suppose that a system has two CPU sockets, three cores per socket, that it does not support hyperthreading and that four hardware queues are provided by a block driver. With the current algorithm this will lead to the following assignment of CPU cores to hardware queues: HWQ 0: 0 1 HWQ 1: 2 3 HWQ 2: 4 5 HWQ 3: (none) This patch changes the queue assignment into: HWQ 0: 0 1 HWQ 1: 2 HWQ 2: 3 4 HWQ 3: 5 In other words, this patch has the following three effects: - All four hardware queues are used instead of only three. - CPU cores are spread more evenly over hardware queues. For the above example the range of the number of CPU cores associated with a single HWQ is reduced from [0..2] to [1..2]. - If the number of HWQ's is a multiple of the number of CPU sockets it is now guaranteed that all CPU cores associated with a single HWQ reside on the same CPU socket. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:08:21 -07:00
Bart Van Assche	52f7eb945f	blk-mq: Micro-optimize bt_get() Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr() calls into a single call. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:07:28 -07:00
Bart Van Assche	c38d185d4a	blk-mq: Fix a race between bt_clear_tag() and bt_get() What we need is the following two guarantees: * Any thread that observes the effect of the test_and_set_bit() by __bt_get_word() also observes the preceding addition of 'current' to the appropriate wait list. This is guaranteed by the semantics of the spin_unlock() operation performed by prepare_and_wait(). Hence the conversion of test_and_set_bit_lock() into test_and_set_bit(). * The wait lists are examined by bt_clear() after the tag bit has been cleared. clear_bit_unlock() guarantees that any thread that observes that the bit has been cleared also observes the store operations preceding clear_bit_unlock(). However, clear_bit_unlock() does not prevent that the wait lists are examined before that the tag bit is cleared. Hence the addition of a memory barrier between clear_bit() and the wait list examination. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:07:16 -07:00
Bart Van Assche	9e98e9d7cf	blk-mq: Avoid that __bt_get_word() wraps multiple times If __bt_get_word() is called with last_tag != 0, if the first find_next_zero_bit() fails, if after wrap-around the test_and_set_bit() call fails and find_next_zero_bit() succeeds, if the next test_and_set_bit() call fails and subsequently find_next_zero_bit() does not find a zero bit, then another wrap-around will occur. Avoid this by introducing an additional local variable. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:07:14 -07:00
Bart Van Assche	45a9c9d909	blk-mq: Fix a use-after-free blk-mq users are allowed to free the memory request_queue.tag_set points at after blk_cleanup_queue() has finished but before blk_release_queue() has started. This can happen e.g. in the SCSI core. The SCSI core namely embeds the tag_set structure in a SCSI host structure. The SCSI host structure is freed by scsi_host_dev_release(). This function is called after blk_cleanup_queue() finished but can be called before blk_release_queue(). This means that it is not safe to access request_queue.tag_set from inside blk_release_queue(). Hence remove the blk_sync_queue() call from blk_release_queue(). This call is not necessary - outstanding requests must have finished before blk_release_queue() is called. Additionally, move the blk_mq_free_queue() call from blk_release_queue() to blk_cleanup_queue() to avoid that struct request_queue.tag_set gets accessed after it has been freed. This patch avoids that the following kernel oops can be triggered when deleting a SCSI host for which scsi-mq was enabled: Call Trace: [<ffffffff8109a7c4>] lock_acquire+0xc4/0x270 [<ffffffff814ce111>] mutex_lock_nested+0x61/0x380 [<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180 [<ffffffff8124d654>] blk_release_queue+0x84/0xd0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff81245895>] blk_put_queue+0x15/0x20 [<ffffffff8125c409>] disk_release+0x99/0xd0 [<ffffffff8133d056>] device_release+0x36/0xb0 [<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0 [<ffffffff8126c140>] kobject_put+0x30/0x70 [<ffffffff8125a78a>] put_disk+0x1a/0x20 [<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0 [<ffffffff811d56a0>] blkdev_put+0x50/0x160 [<ffffffff81199eb4>] kill_block_super+0x44/0x70 [<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60 [<ffffffff8119a87e>] deactivate_super+0x4e/0x70 [<ffffffff811b9833>] cleanup_mnt+0x43/0x90 [<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20 [<ffffffff8107252c>] task_work_run+0xac/0xe0 [<ffffffff81002c01>] do_notify_resume+0x61/0xa0 [<ffffffff814d2c58>] int_signal+0x12/0x17 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Robert Elliott <elliott@hp.com> Cc: Ming Lei <ming.lei@canonical.com> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: <stable@vger.kernel.org> # v3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-09 09:07:13 -07:00
Linus Torvalds	f3f62a38ce	SCSI for-linus on 20141208 This patch is the usual mix of driver updates (srp, ipr, scsi_debug, NCR5380, fnic, 53c974, ses, wd719x, hpsa, megaraid_sas). Of those, wd7a9x is new and 53c974 is a rewrite of the old tmscsim driver and the extensive work by Finn Thain rewrites all the NCR5380 based drivers. There's also extensive infrastructure updates: a new logging infrastructure for sense information and a rewrite of the tagged command queue API and an assortment of minor updates. Signed-off-by: James Bottomley <JBottomley@Parallels.com> -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABAgAGBQJUhgazAAoJEDeqqVYsXL0MxIYH/2wCs9sne5cfDTjfufLJDdu1 WLoJgNMxv+14OukknJfG2Kk1WlHgLRM5+TVWIGiG0mmjXuFyShzIqEOHKTDWqxnE tBH4wLi/+XYqZAmAeim4/2zhvf+cUVVlIu01VERR5uwaBWYM8BeLSwjdnAAvEEwb iV74p1WV6frXo4guADplgtkjD0YxI4MTUZ1figRMlLO6WLFFyQ+95UfY8jFs+eQv zk63y7Mm7dQNd57/Wl3i89lw0kqlaJSZNl8Ovj1axy4rDYzT1wXhY8mEwD8fI8Ym wahldjFE5vXgj0NpO3tB3Z+UDP2YmQduMyzTxkPNnrEPKiOCsnQo42XR6vb92cQ= =Y+DU -----END PGP SIGNATURE----- Merge tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi Pull SCSI updates from James Bottomley: "This patch is the usual mix of driver updates (srp, ipr, scsi_debug, NCR5380, fnic, 53c974, ses, wd719x, hpsa, megaraid_sas). Of those, wd7a9x is new and 53c974 is a rewrite of the old tmscsim driver and the extensive work by Finn Thain rewrites all the NCR5380 based drivers. There's also extensive infrastructure updates: a new logging infrastructure for sense information and a rewrite of the tagged command queue API and an assortment of minor updates" * tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (183 commits) scsi: set fmt to NULL scsi_extd_sense_format() by default libsas: remove task_collector mode wd719x: remove dma_cache_sync call scsi_debug: add Report supported opcodes+tmfs; Compare and write scsi_debug: change SCSI command parser to table driven scsi_debug: add Capacity Changed Unit Attention scsi_debug: append inject error flags onto scsi_cmnd object scsi_debug: pinpoint invalid field in sense data wd719x: Add firmware documentation wd719x: Introduce Western Digital WD7193/7197/7296 PCI SCSI card driver eeprom-93cx6: Add (read-only) support for 8-bit mode esas2r: fix an oversight in setting return value esas2r: fix an error path in esas2r_ioctl_handler esas2r: fir error handling in do_fm_api scsi: add SPC-3 command definitions scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 scsi: remove scsi_driver owner field scsi: move scsi_dispatch_cmd to scsi_lib.c scsi: stop passing a gfp_mask argument down the command setup path scsi: remove scsi_next_command ...	2014-12-08 21:19:19 -08:00
Ming Lei	19c66e59ce	blk-mq: prevent unmapped hw queue from being scheduled When one hardware queue has no mapped software queues, it shouldn't have been scheduled. Otherwise WARNING or OOPS can triggered. blk_mq_hw_queue_mapped() helper is introduce for fixing the problem. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-08 21:37:08 -07:00
Rafael J. Wysocki	e3d857e1ae	Merge branch 'pm-runtime' * pm-runtime: (25 commits) i2c-omap / PM: Drop CONFIG_PM_RUNTIME from i2c-omap.c dmaengine / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM drivers: sh / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM e1000e / igb / PM: Eliminate CONFIG_PM_RUNTIME MMC / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM MFD / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM misc / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM media / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM input / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM iio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hsi / OMAP / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM i2c-hid / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM drm / exynos / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM gpio / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM hwrandom / exynos / PM: Use CONFIG_PM in #ifdef block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM USB / PM: Drop CONFIG_PM_RUNTIME from the USB core PM: Merge the SET*_RUNTIME_PM_OPS() macros PM / Kconfig: Do not select PM directly from Kconfig files PCI / PM: Drop CONFIG_PM_RUNTIME from the PCI core ...	2014-12-08 20:00:44 +01:00
Jens Axboe	080ff35114	blk-mq: re-check for available tags after running the hardware queue If we run out of tags and have to sleep, we run the hardware queue to kick pending IO into gear. During that run, we may have completed requests, so re-check if we have free tags before going to sleep. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-08 08:49:06 -07:00
Bart Van Assche	b32232073e	blk-mq: fix hang in bt_get() Avoid that if there are fewer hardware queues than CPU threads that bt_get() can hang. The symptoms of the hang were as follows: * All tags allocated for a particular hardware queue. * (nr_tags) pending commands for that hardware queue. * No pending commands for the software queues associated with that hardware queue. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-08 08:46:34 -07:00
James Bottomley	dc843ef00e	Merge remote-tracking branch 'scsi-queue/core-for-3.19' into for-linus	2014-12-08 07:40:20 -08:00
Rafael J. Wysocki	47fafbc701	block / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM After commit `b2b49ccbdd` (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is selected) PM_RUNTIME is always set if PM is set, so #ifdef blocks depending on CONFIG_PM_RUNTIME may now be changed to depend on CONFIG_PM. Replace CONFIG_PM_RUNTIME with CONFIG_PM in the block device core. Reviewed-by: Aaron Lu <aaron.lu@intel.com> Acked-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2014-12-04 01:00:23 +01:00
Darrick J. Wong	594416a720	block: fix regression where bio_integrity_process uses wrong bio_vec iterator bio integrity handling is broken on a system with LVM layered atop a DIF/DIX SCSI drive because device mapper clones the bio, modifies the clone, and sends the clone to the lower layers for processing. However, the clone bio has bi_vcnt == 0, which means that when the sd driver calls bio_integrity_process to attach DIX data, the for_each_segment_all() call (which uses bi_vcnt) returns immediately and random garbage is sent to the disk on a disk write. The disk of course returns an error. Therefore, teach bio_integrity_process() to use bio_for_each_segment() to iterate the bio_vecs, since the per-bio iterator tracks which bio_vecs are associated with that particular bio. The integrity handling code is effectively part of the "driver" (it's not the bio owner), so it must use the correct iterator function. v2: Fix a compiler warning about abandoned local variables. This patch supersedes "block: bio_integrity_process uses wrong bio_vec iterator". Patch applies against 3.18-rc6. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-12-02 08:15:21 -07:00
Shaohua Li	6637fadf25	blk-mq: move the kdump check to blk_mq_alloc_tag_set We call blk_mq_alloc_tag_set() first then blk_mq_init_queue(). The requests are allocated in the former function. So the kdump check should be moved to there to really save memory. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-30 18:35:39 -07:00
Jens Axboe	70114c393c	blk-mq: cleanup tag free handling We only call __blk_mq_put_tag() and __blk_mq_put_reserved_tag() from blk_mq_put_tag(), so just inline the two calls instead of having them as separate functions. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 15:52:30 -07:00
Jens Axboe	a33c1ba291	blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq <-> cpu map We currently use num_possible_cpus(), but that breaks on sparc64 where the CPU ID space is discontig. Use nr_cpu_ids as the highest CPU ID instead, so we don't end up reading from invalid memory. Cc: stable@kernel.org # 3.13+ Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 15:02:42 -07:00
Hannes Reinecke	eb846d9f14	scsi: rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 SPC-3 defines SERVICE ACTION IN(12) and SERVICE ACTION IN(16). So rename SERVICE_ACTION_IN to SERVICE_ACTION_IN_16 to be consistent with SPC and to allow for better distinction. Signed-off-by: Hannes Reinecke <hare@suse.de> Tested-by: Robert Elliott <elliott@hp.com> Reviewed-by: Robert Elliott <elliott@hp.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2014-11-24 20:01:40 +01:00
Gu Zheng	394ffa503b	blk: introduce generic io stat accounting help function Many block drivers accounting io stat based on bio (e.g. NVMe...), the blk_account_io_start/end() which is based on request does not make sense to them, so here we introduce the similar help function named generic_start/end_io_acct base on raw sectors, and it can simplify some driver's open io accounting code. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:04:44 -07:00
Christoph Hellwig	b657d7e632	blk-mq: handle the single queue case in blk_mq_hctx_next_cpu Don't duplicate the code to handle the not cpu bounce case in the caller, do it inside blk_mq_hctx_next_cpu instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-24 08:02:07 -07:00
Jens Axboe	5fabcb4c33	genhd: check for int overflow in disk_expand_part_tbl() We can get here from blkdev_ioctl() -> blkpg_ioctl() -> add_partition() with a user passed in partno value. If we pass in 0x7fffffff, the new target in disk_expand_part_tbl() overflows the 'int' and we access beyond the end of ptbl->part[] and even write to it when we do the rcu_assign_pointer() to assign the new partition. Reported-by: David Ramos <daramos@stanford.edu> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-19 13:09:07 -07:00
Jens Axboe	7c7f2f2bc9	blk-mq: add blk_mq_free_hctx_request() It's silly to use blk_mq_free_request() which in turn maps the request to the hardware queue, for places where we already know what the hardware queue is. This saves us an extra mapping of a hardware queue on request completion, if the caller knows this information already. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-17 10:41:57 -07:00
Jens Axboe	1a3b595a28	blk-mq: export blk_mq_free_request() Drivers that know they are blk-mq should just use this function instead of calling through blk_put_request(). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-17 10:40:48 -07:00
Christoph Hellwig	125c99bc8b	scsi: add new scsi-command flag for tagged commands Currently scsi piggy backs on the block layer to define the concept of a tagged command. But we want to be able to have block-level host-wide tags assigned even for untagged commands like the initial INQUIRY, so add a new SCSI-level flag for commands that are tagged at the scsi level, so that even commands without that set can have tags assigned to them. Note that this alredy is the case for the blk-mq code path, and this just lets the old path catch up with it. We also set this flag based upon sdev->simple_tags instead of the block queue flag, so that it is entirely independent of the block layer tagging, and thus always correct even if a driver doesn't use block level tagging yet. Also remove the old blk_rq_tagged; it was only used by SCSI drivers, and removing it forces them to look for the proper replacement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de>	2014-11-12 11:19:40 +01:00
Bart Van Assche	205fb5f5ba	blk-mq: add blk_mq_unique_tag() The queuecommand() callback functions in SCSI low-level drivers need to know which hardware context has been selected by the block layer. Since this information is not available in the request structure, and since passing the hctx pointer directly to the queuecommand callback function would require modification of all SCSI LLDs, add a function to the block layer that allows to query the hardware context index. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>	2014-11-12 11:16:09 +01:00
Ming Lei	7f60dcaaf9	block: blk-merge: fix blk_recount_segments() For cloned bio, bio->bi_vcnt can't be used at all, and we have resort to bio_segments() to figure out how many segment there are in the bio. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-11 16:24:15 -07:00
Paolo Bonzini	2a90d4aae5	blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable blk-mq is using preempt_disable/enable in order to ensure that the queue runners are placed on the right CPU. This does not work with the RT patches, because __blk_mq_run_hw_queue takes a non-raw spinlock with the preemption-disabled region. If there is contention on the lock, this violates the rules for preemption-disabled regions. While this should be easily fixable within the RT patches just by doing migrate_disable/enable, we can do better and document _why_ this particular region runs with disabled preemption. After the previous patch, it is trivial to switch it to get/put_cpu; the RT patches then can change it to get_cpu_light, which lets virtio-blk run under RT kernels. Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Clark Williams <williams@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-11 11:04:49 -07:00
Paolo Bonzini	398205b839	blk_mq: call preempt_disable/enable in blk_mq_run_hw_queue, and only if needed preempt_disable/enable surrounds every call to blk_mq_run_hw_queue, except the one in blk-flush.c. In fact that one is always asynchronous, and it does not need smp_processor_id(). We can do the same for all other calls, avoiding preempt_disable when async is true. This avoids peppering blk-mq.c with preemption-disabled regions. Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-by: Clark Williams <williams@redhat.com> Tested-by: Clark Williams <williams@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-11 11:04:47 -07:00
Tony Battersby	92697dc947	scsi: Fix more error handling in SCSI_IOCTL_SEND_COMMAND Fix an error path in SCSI_IOCTL_SEND_COMMAND that calls blk_put_request(rq) on an invalid IS_ERR(rq) pointer. Fixes: `a492f07545` ("block,scsi: fixup blk_get_request dead queue scenarios") Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-10 15:41:47 -07:00
Tejun Heo	f3af020b9a	blk-mq: make mq_queue_reinit_notify() freeze queues in parallel q->mq_usage_counter is a percpu_ref which is killed and drained when the queue is frozen. On a CPU hotplug event, blk_mq_queue_reinit() which involves freezing the queue is invoked on all existing queues. Because percpu_ref killing and draining involve a RCU grace period, doing the above on one queue after another may take a long time if there are many queues on the system. This patch splits out initiation of freezing and waiting for its completion, and updates blk_mq_queue_reinit_notify() so that the queues are frozen in parallel instead of one after another. Note that freezing and unfreezing are moved from blk_mq_queue_reinit() to blk_mq_queue_reinit_notify(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-11-04 14:49:31 -07:00
Jan Kara	ece9c72acc	block: Fix computation of merged request priority Priority of a merged request is computed by ioprio_best(). If one of the requests has undefined priority (IOPRIO_CLASS_NONE) and another request has priority from IOPRIO_CLASS_BE, the function will return the undefined priority which is wrong. Fix the function to properly return priority of a request with the defined priority. Fixes: `d58cdfb89c` CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-31 08:30:43 -06:00
Jens Axboe	e167dfb53c	blk-mq: add BLK_MQ_F_DEFER_ISSUE support flag Drivers can now tell blk-mq if they take advantage of the deferred issue through 'last' or not. If they do, don't do queue-direct for sync IO. This is a preparation patch for the nvme conversion. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-29 11:18:26 -06:00
Jens Axboe	74c450521d	blk-mq: add a 'list' parameter to ->queue_rq() Since we have the notion of a 'last' request in a chain, we can use this to have the hardware optimize the issuing of requests. Add a list_head parameter to queue_rq that the driver can use to temporarily store hw commands for issue when 'last' is true. If we are doing a chain of requests, pass in a NULL list for the first request to force issue of that immediately, then batch the remainder for deferred issue until the last request has been sent. Instead of adding yet another argument to the hot ->queue_rq path, encapsulate the passed arguments in a blk_mq_queue_data structure. This is passed as a constant, and has been tested as faster than passing 4 (or even 3) args through ->queue_rq. Update drivers for the new ->queue_rq() prototype. There are no functional changes in this patch for drivers - if they don't use the passed in list, then they will just queue requests individually like before. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-29 11:14:52 -06:00
Sudip Mukherjee	d32f6b5752	block: fix wrong error return in elevator_init() while compiling integer err was showing as a set but unused variable. elevator_init_fn can be either cfq_init_queue or deadline_init_queue or noop_init_queue. all three of these functions are returning -ENOMEM if they fail to allocate the queue. so we should actually be returning the error code rather than returning 0 always. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-23 12:35:42 -06:00
Jan Kara	84ce0f0e94	scsi: Fix error handling in SCSI_IOCTL_SEND_COMMAND When sg_scsi_ioctl() fails to prepare request to submit in blk_rq_map_kern() we jump to a label where we just end up copying (luckily zeroed-out) kernel buffer to userspace instead of reporting error. Fix the problem by jumping to the right label. CC: Jens Axboe <axboe@kernel.dk> CC: linux-scsi@vger.kernel.org CC: stable@vger.kernel.org Coverity-id: 1226871 Signed-off-by: Jan Kara <jack@suse.cz> Fixed up the, now unused, out label. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-22 20:13:39 -06:00
Ming Lei	76d8137a31	blk-merge: recaculate segment if it isn't less than max segments The problem is introduced by commit 764f612c6c3c231b(blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio), and merge is needed if number of current segment isn't less than max segments. Strictly speaking, bio->bi_vcnt shouldn't be used here since it may not be accurate in cases of both cloned bio or bio cloned from, but bio_segments() is a bit expensive, and bi_vcnt is still the biggest number, so the approach should work. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-21 19:00:32 -06:00
Christoph Hellwig	34b48db66e	block: remove artifical max_hw_sectors cap Set max_sectors to the value the drivers provides as hardware limit by default. Linux had proper I/O throttling for a long time and doesn't rely on a artifically small maximum I/O size anymore. By not limiting the I/O size by default we remove an annoying tuning step required for most Linux installation. Note that both the user, and if absolutely required the driver can still impose a limit for FS requests below max_hw_sectors_kb. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-21 14:02:54 -06:00
Linus Torvalds	d3dc366bba	Merge branch 'for-3.18/core' of git://git.kernel.dk/linux-block Pull core block layer changes from Jens Axboe: "This is the core block IO pull request for 3.18. Apart from the new and improved flush machinery for blk-mq, this is all mostly bug fixes and cleanups. - blk-mq timeout updates and fixes from Christoph. - Removal of REQ_END, also from Christoph. We pass it through the ->queue_rq() hook for blk-mq instead, freeing up one of the request bits. The space was overly tight on 32-bit, so Martin also killed REQ_KERNEL since it's no longer used. - blk integrity updates and fixes from Martin and Gu Zheng. - Update to the flush machinery for blk-mq from Ming Lei. Now we have a per hardware context flush request, which both cleans up the code should scale better for flush intensive workloads on blk-mq. - Improve the error printing, from Rob Elliott. - Backing device improvements and cleanups from Tejun. - Fixup of a misplaced rq_complete() tracepoint from Hannes. - Make blk_get_request() return error pointers, fixing up issues where we NULL deref when a device goes bad or missing. From Joe Lawrence. - Prep work for drastically reducing the memory consumption of dm devices from Junichi Nomura. This allows creating clone bio sets without preallocating a lot of memory. - Fix a blk-mq hang on certain combinations of queue depths and hardware queues from me. - Limit memory consumption for blk-mq devices for crash dump scenarios and drivers that use crazy high depths (certain SCSI shared tag setups). We now just use a single queue and limited depth for that" * 'for-3.18/core' of git://git.kernel.dk/linux-block: (58 commits) block: Remove REQ_KERNEL blk-mq: allocate cpumask on the home node bio-integrity: remove the needless fail handle of bip_slab creating block: include func name in __get_request prints block: make blk_update_request print prefix match ratelimited prefix blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio block: fix alignment_offset math that assumes io_min is a power-of-2 blk-mq: Make bt_clear_tag() easier to read blk-mq: fix potential hang if rolling wakeup depth is too high block: add bioset_create_nobvec() block: use bio_clone_fast() in blk_rq_prep_clone() block: misplaced rq_complete tracepoint sd: Honor block layer integrity handling flags block: Replace strnicmp with strncasecmp block: Add T10 Protection Information functions block: Don't merge requests if integrity flags differ block: Integrity checksum flag block: Relocate bio integrity flags block: Add a disk flag to block integrity profile block: Add prefix to block integrity profile flags ...	2014-10-18 11:53:51 -07:00
Jens Axboe	a86073e48a	blk-mq: allocate cpumask on the home node All other allocs are done on the specific node, somehow the cpumask for hw queue runs was missed. Fix that by using zalloc_cpumask_var_node() in blk_mq_init_queue(). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-13 15:41:54 -06:00
Gu Zheng	b65c7491cb	bio-integrity: remove the needless fail handle of bip_slab creating bip_slab is created with SLAB_PANIC, so the fail handler is unneeded. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-13 15:09:38 -06:00
Robert Elliott	7b2b10e0e2	block: include func name in __get_request prints In __get_request calls to printk_ratelimited, include the function name so the callbacks suppressed message matches the messages that are printed, and add "dev" before the device name so it matches other block layer messages. Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-13 08:34:23 -06:00
Robert Elliott	ef3ecb66bc	block: make blk_update_request print prefix match ratelimited prefix In blk_update_request, change the printk_ratelimited prefix from end_request to blk_update_request so it matches the name printed if rate limiting occurs. Old: [10234.933106] blk_update_request: 174 callbacks suppressed [10234.934940] end_request: critical target error, dev sdr, sector 16 [10234.949788] end_request: critical target error, dev sdr, sector 16 New: [16863.445173] blk_update_request: 398 callbacks suppressed [16863.447029] blk_update_request: critical target error, dev sdr, sector 1442066176 [16863.449383] blk_update_request: critical target error, dev sdr, sector 802802888 [16863.451680] blk_update_request: critical target error, dev sdr, sector 1609535456 Signed-off-by: Robert Elliott <elliott@hp.com> Reviewed-by: Webb Scales <webbnh@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-13 08:34:21 -06:00
Linus Torvalds	c798360cd1	Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "A lot of activities on percpu front. Notable changes are... - percpu allocator now can take @gfp. If @gfp doesn't contain GFP_KERNEL, it tries to allocate from what's already available to the allocator and a work item tries to keep the reserve around certain level so that these atomic allocations usually succeed. This will replace the ad-hoc percpu memory pool used by blk-throttle and also be used by the planned blkcg support for writeback IOs. Please note that I noticed a bug in how @gfp is interpreted while preparing this pull request and applied the fix `6ae833c7fe` ("percpu: fix how @gfp is interpreted by the percpu allocator") just now. - percpu_ref now uses longs for percpu and global counters instead of ints. It leads to more sparse packing of the percpu counters on 64bit machines but the overhead should be negligible and this allows using percpu_ref for refcnting pages and in-memory objects directly. - The switching between percpu and single counter modes of a percpu_ref is made independent of putting the base ref and a percpu_ref can now optionally be initialized in single or killed mode. This allows avoiding percpu shutdown latency for cases where the refcounted objects may be synchronously created and destroyed in rapid succession with only a fraction of them reaching fully operational status (SCSI probing does this when combined with blk-mq support). It's also planned to be used to implement forced single mode to detect underflow more timely for debugging. There's a separate branch percpu/for-3.18-consistent-ops which cleans up the duplicate percpu accessors. That branch causes a number of conflicts with s390 and other trees. I'll send a separate pull request w/ resolutions once other branches are merged" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits) percpu: fix how @gfp is interpreted by the percpu allocator blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky percpu_ref: add PERCPU_REF_INIT_* flags percpu_ref: decouple switching to percpu mode and reinit percpu_ref: decouple switching to atomic mode and killing percpu_ref: add PCPU_REF_DEAD percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch percpu_ref: replace pcpu_ prefix with percpu_ percpu_ref: minor code and comment updates percpu_ref: relocate percpu_ref_reinit() Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" Revert "percpu: free percpu allocation info for uniprocessor system" percpu-refcount: make percpu_ref based on longs instead of ints percpu-refcount: improve WARN messages percpu: fix locking regression in the failure path of pcpu_alloc() percpu-refcount: add @gfp to percpu_ref_init() proportions: add @gfp to init functions percpu_counter: add @gfp to percpu_counter_init() percpu_counter: make percpu_counters_lock irq-safe ...	2014-10-10 07:26:02 -04:00
Ming Lei	764f612c6c	blk-merge: don't compute bi_phys_segments from bi_vcnt for cloned bio It isn't correct to figure out req->bi_phys_segments from bio->bi_vcnt if the bio is cloned. Signed-off-by: Ming Lei <ming.lei@canonical.com> Tested-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-09 13:11:44 -06:00
Mike Snitzer	b8839b8c55	block: fix alignment_offset math that assumes io_min is a power-of-2 The math in both blk_stack_limits() and queue_limit_alignment_offset() assume that a block device's io_min (aka minimum_io_size) is always a power-of-2. Fix the math such that it works for non-power-of-2 io_min. This issue (of alignment_offset != 0) became apparent when testing dm-thinp with a thinp blocksize that matches a RAID6 stripesize of 1280K. Commit `fdfb4c8c1` ("dm thin: set minimum_io_size to pool's data block size") unlocked the potential for alignment_offset != 0 due to the dm-thin-pool's io_min possibly being a non-power-of-2. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-09 09:41:40 -06:00
Linus Torvalds	28596c9722	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull "trivial tree" updates from Jiri Kosina: "Usual pile from trivial tree everyone is so eagerly waiting for" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Remove MN10300_PROC_MN2WS0038 mei: fix comments treewide: Fix typos in Kconfig kprobes: update jprobe_example.c for do_fork() change Documentation: change "&" to "and" in Documentation/applying-patches.txt Documentation: remove obsolete pcmcia-cs from Changes Documentation: update links in Changes Documentation: Docbook: Fix generated DocBook/kernel-api.xml score: Remove GENERIC_HAS_IOMAP gpio: fix 'CONFIG_GPIO_IRQCHIP' comments tty: doc: Fix grammar in serial/tty dma-debug: modify check_for_stack output treewide: fix errors in printk genirq: fix reference in devm_request_threaded_irq comment treewide: fix synchronize_rcu() in comments checkstack.pl: port to AArch64 doc: queue-sysfs: minor fixes init/do_mounts: better syntax description MIPS: fix comment spelling powerpc/simpleboot: fix comment ...	2014-10-07 21:16:26 -04:00
Bart Van Assche	9d8f0bcca6	blk-mq: Make bt_clear_tag() easier to read Eliminate a backwards goto statement from bt_clear_tag(). Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-07 08:45:21 -06:00
Jens Axboe	abab13b5c4	blk-mq: fix potential hang if rolling wakeup depth is too high We currently divide the queue depth by 4 as our batch wakeup count, but we split the wakeups over BT_WAIT_QUEUES number of wait queues. This defaults to 8. If the product of the resulting batch wake count and BT_WAIT_QUEUES is higher than the device queue depth, we can get into a situation where a task goes to sleep waiting for a request, but never gets woken up. Reported-by: Bart Van Assche <bvanassche@acm.org> Fixes: `4bb659b156` Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-07 08:39:20 -06:00
Junichi Nomura	d8f429e166	block: add bioset_create_nobvec() Users of bio_clone_fast() do not want bios with their own bvecs. Allocating a bvec mempool as part of the bioset intended for such users is a waste of memory. bioset_create_nobvec() creates a bioset that doesn't have the bvec mempool. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-03 15:28:18 -06:00
Junichi Nomura	11dfce509e	block: use bio_clone_fast() in blk_rq_prep_clone() Request cloning clones bios in the request to track the completion of each bio. For that purpose, we can use bio_clone_fast() instead of bio_clone() to avoid unnecessary allocation and copy of bvecs. This patch reduces memory footprint of request-based device-mapper (about 1-4KB for each request) and is a preparation for further reduction of memory usage by removing unused bvec mempool. Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-03 15:28:16 -06:00
Hannes Reinecke	4a0efdc933	block: misplaced rq_complete tracepoint The rq_complete tracepoint was never issued for empty requests, causing the resulting blktrace information to never show any completion for those request. Signed-off-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-10-01 08:17:42 -06:00
Rasmus Villemoes	582940508b	block: Replace strnicmp with strncasecmp The kernel used to contain two functions for length-delimited, case-insensitive string comparison, strnicmp with correct semantics and a slightly buggy strncasecmp. The latter is the POSIX name, so strnicmp was renamed to strncasecmp, and strnicmp made into a wrapper for the new strncasecmp to avoid breaking existing users. To allow the compat wrapper strnicmp to be removed at some point in the future, and to avoid the extra indirection cost, do s/strnicmp/strncasecmp/g. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 16:48:55 -06:00
Martin K. Petersen	2341c2f8c3	block: Add T10 Protection Information functions The T10 Protection Information format is also used by some devices that do not go through the SCSI layer (virtual block devices, NVMe). Relocate the relevant functions to a block layer library that can be used without involving SCSI. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:59 -06:00
Martin K. Petersen	4eaf99bead	block: Don't merge requests if integrity flags differ We'd occasionally merge requests with conflicting integrity flags. Introduce a merge helper which checks that the requests have compatible integrity payloads. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:57 -06:00
Martin K. Petersen	aae7df5019	block: Integrity checksum flag Make the choice of checksum a per-I/O property by introducing a flag that can be inspected by the SCSI layer. There are several reasons for this: 1. It allows us to switch choice of checksum without unloading and reloading the HBA driver. 2. During error recovery we need to be able to tell the HBA that checksums read from disk should not be verified and converted to IP checksums. 3. For error injection purposes we need to be able to write a bad guard tag to storage. Since the storage device only supports T10 CRC we need to be able to disable IP checksum conversion on the HBA. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:55 -06:00
Martin K. Petersen	b1f0138857	block: Relocate bio integrity flags Move flags affecting the integrity code out of the bio bi_flags and into the block integrity payload. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:54 -06:00
Martin K. Petersen	3aec2f41a8	block: Add a disk flag to block integrity profile So far we have relied on the app tag size to determine whether a disk has been formatted with T10 protection information or not. However, not all target devices provide application tag storage. Add a flag to the block integrity profile that indicates whether the disk has been formatted with protection information. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:51 -06:00
Martin K. Petersen	8288f496eb	block: Add prefix to block integrity profile flags Add a BLK_ prefix to the integrity profile flags. Also rename the flags to be more consistent with the generate/verify terminology in the rest of the integrity code. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:51 -06:00
Martin K. Petersen	1859308853	block: Clean up the code used to generate and verify integrity metadata Instead of the "operate" parameter we pass in a seed value and a pointer to a function that can be used to process the integrity metadata. The generation function is changed to have a return value to fit into this scheme. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:51 -06:00
Martin K. Petersen	5a2aa87305	block: Make protection interval calculation generic Now that the protection interval has been detached from the sector size we need to be able to handle sizes that are different from 4K and 512. Make the interval calculation generic. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:50 -06:00
Martin K. Petersen	3be91c4a3d	block: Deprecate the use of the term sector in the context of block integrity The protection interval is not necessarily tied to the logical block size of a block device. Stop using the terms "sector" and "sectors". Going forward we will use the term "seed" to describe the initial reference tag value for a given I/O. "Interval" will be used to describe the portion of the data buffer that a given piece of protection information is associated with. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:50 -06:00
Martin K. Petersen	5f9378fa9c	block: Remove bip_buf bip_buf is not really needed so we can remove it. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:50 -06:00
Martin K. Petersen	8492b68bc4	block: Remove integrity tagging functions None of the filesystems appear interested in using the integrity tagging feature. Potentially because very few storage devices actually permit using the application tag space. Remove the tagging functions. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:50 -06:00
Martin K. Petersen	180b2f95dd	block: Replace bi_integrity with bi_special For commands like REQ_COPY we need a way to pass extra information along with each bio. Like integrity metadata this information must be available at the bottom of the stack so bi_private does not suffice. Rename the existing bi_integrity field to bi_special and make it a union so we can have different bio extensions for each class of command. We previously used bi_integrity != NULL as a way to identify whether a bio had integrity metadata or not. Introduce a REQ_INTEGRITY to be the indicator now that bi_special can contain different things. In addition, bio_integrity(bio) will now return a pointer to the integrity payload (when applicable). Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:46 -06:00
Martin K. Petersen	e7258c1a26	block: Get rid of bdev_integrity_enabled() bdev_integrity_enabled() is only used by bio_integrity_enabled(). Combine these two functions. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-27 09:14:44 -06:00
Ming Lei	f70ced0917	blk-mq: support per-distpatch_queue flush machinery This patch supports to run one single flush machinery for each blk-mq dispatch queue, so that: - current init_request and exit_request callbacks can cover flush request too, then the buggy copying way of initializing flush request's pdu can be fixed - flushing performance gets improved in case of multi hw-queue In fio sync write test over virtio-blk(4 hw queues, ioengine=sync, iodepth=64, numjobs=4, bs=4K), it is observed that througput gets increased a lot over my test environment: - throughput: +70% in case of virtio-blk over null_blk - throughput: +30% in case of virtio-blk over SSD image The multi virtqueue feature isn't merged to QEMU yet, and patches for the feature can be found in below tree: git://kernel.ubuntu.com/ming/qemu.git v2.1.0-mq.4 And simply passing 'num_queues=4 vectors=5' should be enough to enable multi queue(quad queue) feature for QEMU virtio-blk. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:45 -06:00
Ming Lei	e97c293cdf	block: introduce 'blk_mq_ctx' parameter to blk_get_flush_queue This patch adds 'blk_mq_ctx' parameter to blk_get_flush_queue(), so that this function can find the corresponding blk_flush_queue bound with current mq context since the flush queue will become per hw-queue. For legacy queue, the parameter can be simply 'NULL'. For multiqueue case, the parameter should be set as the context from which the related request is originated. With this context info, the hw queue and related flush queue can be found easily. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:44 -06:00
Ming Lei	0bae352da5	block: flush: avoid to figure out flush queue unnecessarily Just figuring out flush queue at the entry of kicking off flush machinery and request's completion handler, then pass it through. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:42 -06:00
Ming Lei	ba483388e3	block: remove blk_init_flush() and its pair Now mission of the two helpers is over, and just call blk_alloc_flush_queue() and blk_free_flush_queue() directly. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:41 -06:00
Ming Lei	7c94e1c157	block: introduce blk_flush_queue to drive flush machinery This patch introduces 'struct blk_flush_queue' and puts all flush machinery related fields into this structure, so that - flush implementation details aren't exposed to driver - it is easy to convert to per dispatch-queue flush machinery This patch is basically a mechanical replacement. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:40 -06:00
Ming Lei	7ddab5de5b	block: avoid to use q->flush_rq directly This patch trys to use local variable to access flush request, so that we can convert to per-queue flush machinery a bit easier. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:38 -06:00
Ming Lei	3c09676c12	block: move flush initialization to blk_flush_init These fields are always used with the flush request, so initialize them together. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:37 -06:00
Ming Lei	f355265571	block: introduce blk_init_flush and its pair These two temporary functions are introduced for holding flush initialization and de-initialization, so that we can introduce 'flush queue' easier in the following patch. And once 'flush queue' and its allocation/free functions are ready, they will be removed for sake of code readability. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:35 -06:00
Ming Lei	1bcb1eada4	blk-mq: allocate flush_rq in blk_mq_init_flush() It is reasonable to allocate flush req in blk_mq_init_flush(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:34 -06:00
Ming Lei	08e98fc601	blk-mq: handle failure path for initializing hctx Failure of initializing one hctx isn't handled, so this patch introduces blk_mq_init_hctx() and its pair to handle it explicitly. Also this patch makes code cleaner. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-25 15:22:32 -06:00
Tejun Heo	17497acbdc	blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode blk-mq uses percpu_ref for its usage counter which tracks the number of in-flight commands and used to synchronously drain the queue on freeze. percpu_ref shutdown takes measureable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measureable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take above ten seconds when scsi-mq is used. [ 0.949892] scsi host0: Virtio SCSI HBA [ 1.007864] scsi 0:0:0:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5 [ 1.021299] scsi 0:0:1:0: Direct-Access QEMU QEMU HARDDISK 1.1. PQ: 0 ANSI: 5 [ 1.520356] tsc: Refined TSC clocksource calibration: 2491.910 MHz <stall> [ 16.186549] sd 0:0:0:0: Attached scsi generic sg0 type 0 [ 16.190478] sd 0:0:1:0: Attached scsi generic sg1 type 0 [ 16.194099] osd: LOADED open-osd 0.2.1 [ 16.203202] sd 0:0:0:0: [sda] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB) [ 16.208478] sd 0:0:0:0: [sda] Write Protect is off [ 16.211439] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA [ 16.218771] sd 0:0:1:0: [sdb] 31457280 512-byte logical blocks: (16.1 GB/15.0 GiB) [ 16.223264] sd 0:0:1:0: [sdb] Write Protect is off [ 16.225682] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA This is also the reason why request_queues start in bypass mode which is ended on blk_register_queue() as shutting down a fully functional queue also involves a RCU grace period and the queues for non-existent SCSI devices never reach registration. blk-mq basically needs to do the same thing - start the mq in a degraded mode which is faster to shut down and then make it fully functional only after the queue reaches registration. percpu_ref recently grew facilities to force atomic operation until explicitly switched to percpu mode, which can be used for this purpose. This patch makes blk-mq initialize q->mq_usage_counter in atomic mode and switch it to percpu mode only once blk_register_queue() is reached. Note that this issue was previously worked around by `0a30288da1` ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") for v3.17. The temp fix was reverted in preparation of adding persistent atomic mode to percpu_ref by `9eca80461a` ("Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe""). This patch and the prerequisite percpu_ref changes will be merged during v3.18 devel cycle. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: `add703fda9` ("blk-mq: use percpu_ref for mq usage count") Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-09-24 13:37:21 -04:00
Tejun Heo	2aad2a86f6	percpu_ref: add PERCPU_REF_INIT_* flags With the recent addition of percpu_ref_reinit(), percpu_ref now can be used as a persistent switch which can be turned on and off repeatedly where turning off maps to killing the ref and waiting for it to drain; however, there currently isn't a way to initialize a percpu_ref in its off (killed and drained) state, which can be inconvenient for certain persistent switch use cases. Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic selection of operation mode; however, currently a newly initialized percpu_ref is always in percpu mode making it impossible to avoid the latency overhead of switching to atomic mode. This patch adds @flags to percpu_ref_init() and implements the following flags. * PERCPU_REF_INIT_ATOMIC : start ref in atomic mode * PERCPU_REF_INIT_DEAD : start ref killed and drained These flags should be able to serve the above two use cases. v2: target_core_tpg.c conversion was missing. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-09-24 13:31:50 -04:00
Tejun Heo	9eca80461a	Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" This reverts commit `0a30288da1`, which was a temporary fix for SCSI blk-mq stall issue. The following patches will fix the issue properly by introducing atomic mode to percpu_ref. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>	2014-09-24 13:07:33 -04:00
Tejun Heo	d06efebf0c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18 This is to receive `0a30288da1` ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit reverted and patches to implement proper fix will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>	2014-09-24 13:00:21 -04:00
Tejun Heo	0a30288da1	blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe blk-mq uses percpu_ref for its usage counter which tracks the number of in-flight commands and used to synchronously drain the queue on freeze. percpu_ref shutdown takes measureable wallclock time as it involves a sched RCU grace period. This means that draining a blk-mq takes measureable wallclock time. One would think that this shouldn't matter as queue shutdown should be a rare event which takes place asynchronously w.r.t. userland. Unfortunately, SCSI probing involves synchronously setting up and then tearing down a lot of request_queues back-to-back for non-existent LUNs. This means that SCSI probing may take more than ten seconds when scsi-mq is used. This will be properly fixed by implementing a mechanism to keep q->mq_usage_counter in atomic mode till genhd registration; however, that involves rather big updates to percpu_ref which is difficult to apply late in the devel cycle (v3.17-rc6 at the moment). As a stop-gap measure till the proper fix can be implemented in the next cycle, this patch introduces __percpu_ref_kill_expedited() and makes blk_mq_freeze_queue() use it. This is heavy-handed but should work for testing the experimental SCSI blk-mq implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Christoph Hellwig <hch@infradead.org> Link: http://lkml.kernel.org/g/20140919113815.GA10791@lst.de Fixes: `add703fda9` ("blk-mq: use percpu_ref for mq usage count") Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Tested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-24 08:29:36 -06:00
Jens Axboe	46f341ffcf	genhd: fix leftover might_sleep() in blk_free_devt() Commit `2da78092` changed the locking from a mutex to a spinlock, so we now longer sleep in this context. But there was a leftover might_sleep() in there, which now triggers since we do the final free from an RCU callback. Get rid of it. Reported-by: Pontus Fuchs <pontus.fuchs@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 14:45:45 -06:00
Christoph Hellwig	9041583765	block: fix blk_abort_request on blk-mq Signed-off-by: Christoph Hellwig <hch@lst.de> Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:08 -06:00
Ming Lei	5e940aaa59	blk-timeout: fix blk_add_timer Commit 8cb34819cdd5d(blk-mq: unshared timeout handler) introduces blk-mq's own timeout handler, and removes following line: blk_queue_rq_timed_out(q, blk_mq_rq_timed_out); which then causes blk_add_timer() to bypass adding the timer, since blk-mq no longer has q->rq_timed_out_fn defined. This patch fixes the problem by bypassing the check for blk-mq, so that both request deadlines are still set and the rolling timer updated. Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:08 -06:00
Jens Axboe	aedcd72f6c	blk-mq: limit memory consumption if a crash dump is active It's not uncommon for crash dump kernels to be limited to 128MB or something low in that area. This is normally not a problem for devices as we don't use that much memory, but for some shared SCSI setups with huge queue depths, it can potentially fill most of memory with tons of request allocations. blk-mq does scale back when it fails to allocate memory, but it scales back just enough so that blk-mq succeeds. This could still leave the system with not enough memory to make any real progress. Check if we are in a kdump environment and limit the hardware queues and tag depth. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:08 -06:00
Ming Lei	2edd2c740b	blk-mq: remove unnecessary blk_clear_rq_complete() This patch removes two unnecessary blk_clear_rq_complete(), the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(), so: - The blk_clear_rq_complete() in blk_flush_restore_request() needn't because the request will be freed later, and clearing it here may open a small race window with timeout. - The blk_clear_rq_complete() in blk_mq_requeue_request() isn't necessary too, even though REQ_ATOM_STARTED is cleared in __blk_mq_requeue_request(), in theory it still may cause a small race window with timeout since the two clear_bit() may be reordered. Signed-off-by: Ming Lei <ming.lei@canoical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	0152fb6b57	blk-mq: pass a reserved argument to the timeout handler Allow blk-mq to pass an argument to the timeout handler to indicate if we're timing out a reserved or regular command. For many drivers those need to be handled different. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	46f92d42ee	blk-mq: unshared timeout handler Duplicate the (small) timeout handler in blk-mq so that we can pass arguments more easily to the driver timeout handler. This enables the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	81481eb423	blk-mq: fix and simplify tag iteration for the timeout handler Don't do a kmalloc from timer to handle timeouts, chances are we could be under heavy load or similar and thus just miss out on the timeouts. Fortunately it is very easy to just iterate over all in use tags, and doing this properly actually cleans up the blk_mq_busy_iter API as well, and prepares us for the next patch by passing a reserved argument to the iterator. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	c8a446ad69	blk-mq: rename blk_mq_end_io to blk_mq_end_request Now that we've changed the driver API on the submission side use the opportunity to fix up the name on the completion side to fit into the general scheme. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	e2490073cd	blk-mq: call blk_mq_start_request from ->queue_rq When we call blk_mq_start_request from the core blk-mq code before calling into ->queue_rq there is a racy window where the timeout handler can hit before we've fully set up the driver specific part of the command. Move the call to blk_mq_start_request into the driver so the driver can start the request only once it is fully set up. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Christoph Hellwig	bf57229745	blk-mq: remove REQ_END Pass an explicit parameter for the last request in a batch to ->queue_rq instead of using a request flag. Besides being a cleaner and non-stateful interface this is also required for the next patch, which fixes the blk-mq I/O submission code to not start a time too early. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 12:00:07 -06:00
Jens Axboe	6d11fb454b	Merge branch 'for-linus' into for-3.18/core Moving patches from for-linus to 3.18 instead, pull in this changes that will go to Linus today.	2014-09-22 11:57:32 -06:00
Jens Axboe	8b95741569	blk-mq: use blk_mq_start_hw_queues() when running requeue work When requests are retried due to hw or sw resource shortages, we often stop the associated hardware queue. So ensure that we restart the queues when running the requeue work, otherwise the queue run will be a no-op. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:55:56 -06:00
Jens Axboe	6b55e1f2d0	blk-mq: fix potential oops on out-of-memory in __blk_mq_alloc_rq_maps() __blk_mq_alloc_rq_maps() can be invoked multiple times, if we scale back the queue depth if we are low on memory. So don't clear set->tags when we fail, this is handled directly in the parent function, blk_mq_alloc_tag_set(). Reported-by: Robert Elliott <Elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:55:23 -06:00
Christoph Hellwig	a57a178a49	blk-mq: avoid infinite recursion with the FUA flag We should not insert requests into the flush state machine from blk_mq_insert_request. All incoming flush requests come through blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait should only be called for BLOCK_PC requests. All other callers deal with requests that already went through the flush statemchine and shouldn't be reinserted into it. Reported-by: Robert Elliott <Elliott@hp.com> Debugged-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:55:19 -06:00
David Hildenbrand	683d0e1262	blk-mq: Avoid race condition with uninitialized requests This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249. We have to initialize at least the atomic_flags and the cmd_flags when allocating storage for the requests. Otherwise blk_mq_timeout_check() might dereference uninitialized pointers when racing with the creation of a request. Also move the reset of cmd_flags for the initializing code to the point where a request is freed. So we will never end up with pending flush request indicators that might trigger dereferences of invalid pointers in blk_mq_timeout_check(). Cc: stable@vger.kernel.org Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com> Reported-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com> Tested-by: Paulo De Rezende Pinatti <ppinatti@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:55:14 -06:00
Jens Axboe	538b753418	blk-mq: request deadline must be visible before marking rq as started When we start the request, we set the deadline and flip the bits marking the request as started and non-complete. However, it's important that the deadline store is ordered before flipping the bits, otherwise we could have a small window where the request is marked started but with an invalid deadline. This can confuse the timeout handling. Suggested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-22 11:54:04 -06:00
Jens Axboe	b207892b06	Merge branch 'for-linus' into for-3.18/core A bit of churn on the for-linus side that would be nice to have in the core bits for 3.18, so pull it in to catch us up and make forward progress easier. Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/scsi_ioctl.c	2014-09-11 09:31:18 -06:00
Jens Axboe	a516440542	blk-mq: scale depth and rq map appropriate if low on memory If we are running in a kdump environment, resources are scarce. For some SCSI setups with a huge set of shared tags, we run out of memory allocating what the drivers is asking for. So implement a scale back logic to reduce the tag depth for those cases, allowing the driver to successfully load. We should extend this to detect low memory situations, and implement a sane fallback for those (1 queue, 64 tags, or something like that). Tested-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-10 09:02:03 -06:00
Alan Stern	df35c7c912	Block: fix unbalanced bypass-disable in blk_register_queue When a queue is registered, the block layer turns off the bypass setting (because bypass is enabled when the queue is created). This doesn't work well for queues that are unregistered and then registered again; we get a WARNING because of the unbalanced calls to blk_queue_bypass_end(). This patch fixes the problem by making blk_register_queue() call blk_queue_bypass_end() only the first time the queue is registered. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: Tejun Heo <tj@kernel.org> CC: James Bottomley <James.Bottomley@HansenPartnership.com> CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-09 10:44:24 -06:00
Masanari Iida	da3dae54e4	Documentation: Docbook: Fix generated DocBook/kernel-api.xml This patch fix spelling typo found in DocBook/kernel-api.xml. It is because the file is generated from the source comments, I have to fix the comments in source codes. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>	2014-09-09 10:34:56 +02:00
Tejun Heo	ff9ea32381	block, bdi: an active gendisk always has a request_queue associated with it bdev_get_queue() returns the request_queue associated with the specified block_device. blk_get_backing_dev_info() makes use of bdev_get_queue() to determine the associated bdi given a block_device. All the callers of bdev_get_queue() including blk_get_backing_dev_info() assume that bdev_get_queue() may return NULL and implement NULL handling; however, bdev_get_queue() requires the passed in block_device is opened and attached to its gendisk. Because an active gendisk always has a valid request_queue associated with it, bdev_get_queue() can never return NULL and neither can blk_get_backing_dev_info(). Make it clear that neither of the two functions can return NULL and remove NULL handling from all the callers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Chris Mason <clm@fb.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-08 10:00:35 -06:00
Tejun Heo	f4da80727c	blkcg: remove blkcg->id blkcg->id is a unique id given to each blkcg; however, the cgroup_subsys_state which each blkcg embeds already has ->serial_nr which can be used for the same purpose. Drop blkcg->id and replace its uses with blkcg->css.serial_nr. Rename cfq_cgroup->blkcg_id to ->blkcg_serial_nr and @id in check_blkcg_changed() to @serial_nr for consistency. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-08 09:55:37 -06:00
Tejun Heo	a34375ef9e	percpu-refcount: add @gfp to percpu_ref_init() Percpu allocator now supports allocation mask. Add @gfp to percpu_ref_init() so that !GFP_KERNEL allocation masks can be used with percpu_refs too. This patch doesn't make any functional difference. v2: blk-mq conversion was missing. Updated. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <koverstreet@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Li Zefan <lizefan@huawei.com> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Jens Axboe <axboe@kernel.dk>	2014-09-08 09:51:30 +09:00
Keith Busch	2da78092dd	block: Fix dev_t minor allocation lifetime Releases the dev_t minor when all references are closed to prevent another device from acquiring the same major/minor. Since the partition's release may be invoked from call_rcu's soft-irq context, the ext_dev_idr's mutex had to be replaced with a spinlock so as not so sleep. Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-03 15:01:02 -06:00
Robert Elliott	5676e7b6db	blk-mq: cleanup after blk_mq_init_rq_map failures In blk-mq.c blk_mq_alloc_tag_set, if: set->tags = kmalloc_node() succeeds, but one of the blk_mq_init_rq_map() calls fails, goto out_unwind; needs to free set->tags so the caller is not obligated to do so. None of the current callers (null_blk, virtio_blk, virtio_blk, or the forthcoming scsi-mq) do so. set->tags needs to be set to NULL after doing so, so other tag cleanup logic doesn't try to free a stale pointer later. Also set it to NULL in blk_mq_free_tag_set. Tested with error injection on the forthcoming scsi-mq + hpsa combination. Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-03 10:44:15 -06:00
Ming Lei	0738854939	blk-merge: fix blk_recount_segments QUEUE_FLAG_NO_SG_MERGE is set at default for blk-mq devices, so bio->bi_phys_segment computed may be bigger than queue_max_segments(q) for blk-mq devices, then drivers will fail to handle the case, for example, BUG_ON() in virtio_queue_rq() can be triggerd for virtio-blk: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1359146 This patch fixes the issue by ignoring the QUEUE_FLAG_NO_SG_MERGE flag if the computed bio->bi_phys_segment is bigger than queue_max_segments(q), and the regression is caused by commit 05f1dd53152173(block: add queue flag for disabling SG merging). Reported-by: Kick In <pierre-andre.morey@canonical.com> Tested-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-09-02 10:25:12 -06:00
Jens Axboe	55872c5a3c	bsg: fix potential error pointer dereference Dan writes: block/bsg.c:327 bsg_map_hdr() error: 'next_rq' dereferencing possible ERR_PTR(). Fix this by setting next_rq to NULL, for the case where it can be != NULL but an error pointer. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-29 08:34:14 -06:00
Joe Lawrence	a492f07545	block,scsi: fixup blk_get_request dead queue scenarios The blk_get_request function may fail in low-memory conditions or during device removal (even if __GFP_WAIT is set). To distinguish between these errors, modify the blk_get_request call stack to return the appropriate ERR_PTR. Verify that all callers check the return status and consider IS_ERR instead of a simple NULL pointer check. For consistency, make a similar change to the blk_mq_alloc_request leg of blk_get_request. It may fail if the queue is dead, or the caller was unwilling to wait. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Acked-by: Boaz Harrosh <bharrosh@panasas.com> [for osd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-28 10:03:46 -06:00
Toshiaki Makita	7b5af5cffc	cfq-iosched: Add comments on update timing of weight Explain that weight has to be updated on activation. This complements previous fix `e15693ef18` ("cfq-iosched: Fix wrong children_weight calculation"). Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-28 08:16:29 -06:00
Joe Lawrence	eb571eeade	block,scsi: verify return pointer from blk_get_request The blk-core dead queue checks introduce an error scenario to blk_get_request that returns NULL if the request queue has been shutdown. This affects the behavior for __GFP_WAIT callers, who should verify the return value before dereferencing. Signed-off-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Jiri Kosina <jkosina@suse.cz> [for pktdvd] Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-26 15:20:23 -06:00
Toshiaki Makita	e15693ef18	cfq-iosched: Fix wrong children_weight calculation cfq_group_service_tree_add() is applying new_weight at the beginning of the function via cfq_update_group_weight(). This actually allows weight to change between adding it to and subtracting it from children_weight, and triggers WARN_ON_ONCE() in cfq_group_service_tree_del(), or even causes oops by divide error during vfr calculation in cfq_group_service_tree_add(). The detailed scenario is as follows: 1. Create blkio cgroups X and Y as a child of X. Set X's weight to 500 and perform some I/O to apply new_weight. This X's I/O completes before starting Y's I/O. 2. Y starts I/O and cfq_group_service_tree_add() is called with Y. 3. cfq_group_service_tree_add() walks up the tree during children_weight calculation and adds parent X's weight (500) to children_weight of root. children_weight becomes 500. 4. Set X's weight to 1000. 5. X starts I/O and cfq_group_service_tree_add() is called with X. 6. cfq_group_service_tree_add() applies its new_weight (1000). 7. I/O of Y completes and cfq_group_service_tree_del() is called with Y. 8. I/O of X completes and cfq_group_service_tree_del() is called with X. 9. cfq_group_service_tree_del() subtracts X's weight (1000) from children_weight of root. children_weight becomes -500. This triggers WARN_ON_ONCE(). 10. Set X's weight to 500. 11. X starts I/O and cfq_group_service_tree_add() is called with X. 12. cfq_group_service_tree_add() applies its new_weight (500) and adds it to children_weight of root. children_weight becomes 0. Calcularion of vfr triggers oops by divide error. weight should be updated right before adding it to children_weight. Reported-by: Ruki Sekiya <sekiya.ruki@lab.ntt.co.jp> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-26 10:17:30 -06:00
Sabrina Dubroca	d19d744685	block: fix error handling in sg_io Before commit `2cada584b2` ("block: cleanup error handling in sg_io"), we had ret = 0 before entering the last big if block of sg_io. Since `2cada584b2`, ret = -EFAULT, which breaks hdparm: /dev/sda: setting Advanced Power Management level to 0xc8 (200) HDIO_DRIVE_CMD failed: Bad address APM_level = 128 Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Fixes: `2cada584b2` ("block: cleanup error handling in sg_io") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-26 08:20:01 -06:00
Tony Battersby	2ba136daa3	fix regression in SCSI_IOCTL_SEND_COMMAND blk_rq_set_block_pc() memsets rq->cmd to 0, so it should come immediately after blk_get_request() to avoid overwriting the user-supplied CDB. Also check for failure to allocate rq. Fixes: `f27b087b81` ("block: add blk_rq_set_block_pc()") Cc: <stable@vger.kernel.org> # 3.16.x Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-22 15:04:33 -05:00
Tony Battersby	6f4a16266f	scsi-mq: fix requests that use a separate CDB buffer This patch fixes code such as the following with scsi-mq enabled: rq = blk_get_request(...); blk_rq_set_block_pc(rq); rq->cmd = my_cmd_buffer; /* separate CDB buffer */ blk_execute_rq_nowait(...); Code like this appears in e.g. sg_start_req() in drivers/scsi/sg.c (for large CDBs only). Without this patch, scsi_mq_prep_fn() will set rq->cmd back to rq->__cmd, causing the wrong CDB to be sent to the device. Signed-off-by: Tony Battersby <tonyb@cybernetics.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-22 15:04:31 -05:00
Christoph Hellwig	a57821cac6	block: support > 16 byte CDBs for SG_IO Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Boaz Harrosh <boaz@plexistor.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:42:01 -05:00
Christoph Hellwig	2cada584b2	block: cleanup error handling in sg_io Make sure we always clean up through the out label and just have a single place to put the request. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:42:01 -05:00
Tejun Heo	cddd5d1764	blk-mq: blk_mq_freeze_queue() should allow nesting While converting to percpu_ref for freezing, `add703fda9` ("blk-mq: use percpu_ref for mq usage count") incorrectly made blk_mq_freeze_queue() misbehave when freezing is nested due to percpu_ref_kill() being invoked on an already killed ref. Fix it by making blk_mq_freeze_queue() kill and kick the queue only for the outermost freeze attempt. All the nested ones can simply wait for the ref to reach zero. While at it, remove unnecessary @wake initialization from blk_mq_unfreeze_queue(). Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:51 -05:00
Jens Axboe	a68aafa5b2	blk-mq: correct a few wrong/bad comments Just grammar or spelling errors, nothing major. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:49 -05:00
Sagi Grimberg	16f408dc6b	block: Fix BUG_ON when pi errors occur When getting a pi error we get to bio_integrity_end_io with bi_remaining already decremented to 0 where we will eventually need to call bio_endio with restored original bio completion handler. Calling bio_endio invokes a BUG_ON(). We should call bio_endio_nodec instead, like what is done in bio_integrity_verify_fn. Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:47 -05:00
Jens Axboe	274a5843ff	blk-mq: don't allow merges if turned off for the queue blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time, to determine whether it should merge IO or not. However, this could also be disabled by the admin, if merging is switched off through sysfs. So check the general queue state as well before attempting to merge IO. Reported-by: Rob Elliott <Elliott@hp.com> Tested-by: Rob Elliott <Elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-21 20:37:45 -05:00
Ming Lei	dd84008708	blk-mq: fix WARNING "percpu_ref_kill() called more than once!" Before doing queue release, the queue has been freezed already by blk_cleanup_queue(), so needn't to freeze queue for deleting tag set. This patch fixes the WARNING of "percpu_ref_kill() called more than once!" which is triggered during unloading block driver. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-15 12:38:20 -06:00
Linus Torvalds	ba368991f6	. Allow the thin target to paired with any size external origin; also allow thin snapshots to be larger than the external origin. . Add support for quickly loading a repetitive pattern into the dm-switch target. . Use per-bio data in the dm-crypt target instead of always using a mempool for each allocation. Required switching to kmalloc alignment for the bio slab. . Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag . Fix the dm-cache and dm-thin targets' export of the minimum_io_size to match the data block size -- this fixes an issue where mkfs.xfs would improperly infer raid striping was in place on the underlying storage. . Small cleanups in dm-io, dm-mpath and dm-cache -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 iQEcBAABAgAGBQJT64yEAAoJEMUj8QotnQNatjQH/2mqm8EtPuZas70zHVDzjMlE ZyV8xgHpU0MBmiBi+JhUBv9iKX4sVa+C25559WkKtxRVMnZmI1WDry4TagiqrhnK 9o/uvdWigJMR+uwahwe4UErEtKscOQJD30a8taN/suJ6Z2C7XJJRUZPsyL4a3Vov w+UIi7aYDEGp/2VQ8mvTTxjdF5x5km4wKsjBTs03uTrrkEJ+bIUndl2I1X+X4bsw kiWYOQwmcnD8GwYkSrthJYLsS3Hjur/J/My7KZwXc00ANLOexqHdKfRDwH8b36+m olKXv3swCd8vi+jJYEYzuW9213ACsSEGP7h8NFVZ/+2FeDsSzB/C7zjW9okIUIw= =y/3r -----END PGP SIGNATURE----- Merge tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper changes from Mike Snitzer: - Allow the thin target to paired with any size external origin; also allow thin snapshots to be larger than the external origin. - Add support for quickly loading a repetitive pattern into the dm-switch target. - Use per-bio data in the dm-crypt target instead of always using a mempool for each allocation. Required switching to kmalloc alignment for the bio slab. - Fix DM core to properly stack the QUEUE_FLAG_NO_SG_MERGE flag - Fix the dm-cache and dm-thin targets' export of the minimum_io_size to match the data block size -- this fixes an issue where mkfs.xfs would improperly infer raid striping was in place on the underlying storage. - Small cleanups in dm-io, dm-mpath and dm-cache * tag 'dm-3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm table: propagate QUEUE_FLAG_NO_SG_MERGE dm switch: efficiently support repetitive patterns dm switch: factor out switch_region_table_read dm cache: set minimum_io_size to cache's data block size dm thin: set minimum_io_size to pool's data block size dm crypt: use per-bio data block: use kmalloc alignment for bio slab dm table: make dm_table_supports_discards static dm cache metadata: use dm-space-map-metadata.h defined size limits dm cache: fail migrations in the do_worker error path dm cache: simplify deferred set reference count increments dm thin: relax external origin size constraints dm thin: switch to an atomic_t for tracking pending new block preparations dm mpath: eliminate pg_ready() wrapper dm io: simplify dec_count and sync_io	2014-08-14 09:17:56 -06:00
Linus Torvalds	d429a3639c	Merge branch 'for-3.17/drivers' of git://git.kernel.dk/linux-block Pull block driver changes from Jens Axboe: "Nothing out of the ordinary here, this pull request contains: - A big round of fixes for bcache from Kent Overstreet, Slava Pestov, and Surbhi Palande. No new features, just a lot of fixes. - The usual round of drbd updates from Andreas Gruenbacher, Lars Ellenberg, and Philipp Reisner. - virtio_blk was converted to blk-mq back in 3.13, but now Ming Lei has taken it one step further and added support for actually using more than one queue. - Addition of an explicit SG_FLAG_Q_AT_HEAD for block/bsg, to compliment the the default behavior of adding to the tail of the queue. From Douglas Gilbert" * 'for-3.17/drivers' of git://git.kernel.dk/linux-block: (86 commits) bcache: Drop unneeded blk_sync_queue() calls bcache: add mutex lock for bch_is_open bcache: Correct printing of btree_gc_max_duration_ms bcache: try to set b->parent properly bcache: fix memory corruption in init error path bcache: fix crash with incomplete cache set bcache: Fix more early shutdown bugs bcache: fix use-after-free in btree_gc_coalesce() bcache: Fix an infinite loop in journal replay bcache: fix crash in bcache_btree_node_alloc_fail tracepoint bcache: bcache_write tracepoint was crashing bcache: fix typo in bch_bkey_equal_header bcache: Allocate bounce buffers with GFP_NOWAIT bcache: Make sure to pass GFP_WAIT to mempool_alloc() bcache: fix uninterruptible sleep in writeback thread bcache: wait for buckets when allocating new btree root bcache: fix crash on shutdown in passthrough mode bcache: fix lockdep warnings on shutdown bcache allocator: send discards with correct size bcache: Fix to remove the rcu_sched stalls. ...	2014-08-14 09:10:21 -06:00
Linus Torvalds	4a319a490c	Merge branch 'for-3.17/core' of git://git.kernel.dk/linux-block Pull block core bits from Jens Axboe: "Small round this time, after the massive blk-mq dump for 3.16. This pull request contains: - Fixes for max_sectors overflow in ioctls from Akinoby Mita. - Partition off-by-one bug fix in aix partitions from Dan Carpenter. - Various small partition cleanups from Fabian Frederick. - Fix for the block integrity code sometimes returning the wrong vector count from Gu Zheng. - Cleanup an re-org of the blk-mq queue enter/exit percpu counters from Tejun. Dependent on the percpu pull for 3.17 (which was in the block tree too), that you have already pulled in. - A blkcg oops fix, also from Tejun" * 'for-3.17/core' of git://git.kernel.dk/linux-block: partitions: aix.c: off by one bug blkcg: don't call into policy draining if root_blkg is already gone Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment" bio: modify __bio_add_page() to accept pages that don't start a new segment block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX block/partitions/efi.c: kerneldoc fixing block/partitions/msdos.c: code clean-up block/partitions/amiga.c: replace nolevel printk by pr_err block/partitions/aix.c: replace count*size kzalloc by kcalloc bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload blk-mq: use percpu_ref for mq usage count blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue() blk-mq: decouble blk-mq freezing from generic bypassing block, blk-mq: draining can't be skipped even if bypass_depth was non-zero blk-mq: fix a memory ordering bug in blk_mq_queue_enter()	2014-08-14 09:07:02 -06:00
Dan Carpenter	d97a86c170	partitions: aix.c: off by one bug The lvip[] array has "state->limit" elements so the condition here should be >= instead of >. Fixes: `6ceea22bbb` ('partitions: add aix lvm partition support files') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Philippe De Muyter <phdm@macqel.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-08-05 13:13:24 -06:00
Linus Torvalds	47dfe4037e	Merge branch 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Mostly changes to get the v2 interface ready. The core features are mostly ready now and I think it's reasonable to expect to drop the devel mask in one or two devel cycles at least for a subset of controllers. - cgroup added a controller dependency mechanism so that block cgroup can depend on memory cgroup. This will be used to finally support IO provisioning on the writeback traffic, which is currently being implemented. - The v2 interface now uses a separate table so that the interface files for the new interface are explicitly declared in one place. Each controller will explicitly review and add the files for the new interface. - cpuset is getting ready for the hierarchical behavior which is in the similar style with other controllers so that an ancestor's configuration change doesn't change the descendants' configurations irreversibly and processes aren't silently migrated when a CPU or node goes down. All the changes are to the new interface and no behavior changed for the multiple hierarchies" * 'for-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (29 commits) cpuset: fix the WARN_ON() in update_nodemasks_hier() cgroup: initialize cgrp_dfl_root_inhibit_ss_mask from !->dfl_files test cgroup: make CFTYPE_ONLY_ON_DFL and CFTYPE_NO_ internal to cgroup core cgroup: distinguish the default and legacy hierarchies when handling cftypes cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes cgroup: split cgroup_base_files[] into cgroup_{dfl\|legacy}_base_files[] cpuset: export effective masks to userspace cpuset: allow writing offlined masks to cpuset.cpus/mems cpuset: enable onlined cpu/node in effective masks cpuset: refactor cpuset_hotplug_update_tasks() cpuset: make cs->{cpus, mems}_allowed as user-configured masks cpuset: apply cs->effective_{cpus,mems} cpuset: initialize top_cpuset's configured masks at mount cpuset: use effective cpumask to build sched domains cpuset: inherit ancestor's masks if effective_{cpus, mems} becomes empty cpuset: update cs->effective_{cpus, mems} when config changes cpuset: update cpuset->effective_{cpus,mems} at hotplug cpuset: add cs->effective_cpus and cs->effective_mems cgroup: clean up sane_behavior handling ...	2014-08-04 10:11:28 -07:00
Mikulas Patocka	6a24148361	block: use kmalloc alignment for bio slab Various subsystems can ask the bio subsystem to create a bio slab cache with some free space before the bio. This free space can be used for any purpose. Device mapper uses this per-bio-data feature to place some target-specific and device-mapper specific data before the bio, so that the target-specific data doesn't have to be allocated separately. This per-bio-data mechanism is used in place of kmalloc, so we need the allocated slab to have the same memory alignment as memory allocated with kmalloc. Change bio_find_or_create_slab() so that it uses ARCH_KMALLOC_MINALIGN alignment when creating the slab cache. This is needed so that dm-crypt can use per-bio-data for encryption - the crypto subsystem assumes this data will have the same alignment as kmalloc'ed memory. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Jens Axboe <axboe@fb.com>	2014-08-01 12:30:34 -04:00
Tejun Heo	2a1b4cf233	blkcg: don't call into policy draining if root_blkg is already gone While a queue is being destroyed, all the blkgs are destroyed and its ->root_blkg pointer is set to NULL. If someone else starts to drain while the queue is in this state, the following oops happens. NULL pointer dereference at 0000000000000028 IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 PGD `e4a1067` PUD b773067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched] CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000 RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001 R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450 R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28 FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0 Stack: ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80 ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58 ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450 Call Trace: [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60 [<ffffffff81427641>] __blk_drain_queue+0x71/0x180 [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0 [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120 [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50 [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40 [<ffffffff8142d476>] blk_release_queue+0x26/0xd0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff81427505>] blk_put_queue+0x15/0x20 [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0 [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0 [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20 [<ffffffff817930e2>] device_release+0x32/0xa0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff817934d7>] put_device+0x17/0x20 [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0 [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40 [<ffffffff817d1257>] sdev_store_delete+0x27/0x30 [<ffffffff81792ca8>] dev_attr_store+0x18/0x30 [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50 [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0 [<ffffffff811f69bd>] SyS_write+0x4d/0xc0 [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b `776687bce4` ("block, blk-mq: draining can't be skipped even if bypass_depth was non-zero") made it easier to trigger this bug by making blk_queue_bypass_start() drain even when it loses the first bypass test to blk_cleanup_queue(); however, the bug has always been there even before the commit as blk_queue_bypass_start() could race against queue destruction, win the initial bypass test but perform the actual draining after blk_cleanup_queue() already destroyed all blkgs. Fix it by skippping calling into policy draining if all the blkgs are already gone. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Shirish Pargaonkar <spargaonkar@suse.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Jet Chen <jet.chen@intel.com> Cc: stable@vger.kernel.org Tested-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-16 09:52:03 +02:00
Tejun Heo	2cf669a58d	cgroup: replace cgroup_add_cftypes() with cgroup_add_legacy_cftypes() Currently, cftypes added by cgroup_add_cftypes() are used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype addition functions and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch adds cgroup_add_legacy_cftypes() which currently is a simple wrapper around cgroup_add_cftypes() and replaces all cgroup_add_cftypes() usages with it. While at it, this patch drops a completely spurious return from __hugetlb_cgroup_file_init(). This patch doesn't introduce any functional differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:09 -04:00
Tejun Heo	5577964e64	cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes Currently, cgroup_subsys->base_cftypes is used for both the unified default hierarchy and legacy ones and subsystems can mark each file with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear only on one of them. This is quite hairy and error-prone. Also, we may end up exposing interface files to the default hierarchy without thinking it through. cgroup_subsys will grow two separate cftype arrays and apply each only on the hierarchies of the matching type. This will allow organizing cftypes in a lot clearer way and encourage subsystems to scrutinize the interface which is being exposed in the new default hierarchy. In preparation, this patch renames cgroup_subsys->base_cftypes to cgroup_subsys->legacy_cftypes. This patch is pure rename. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>	2014-07-15 11:05:09 -04:00
Jens Axboe	26a337944e	Revert "bio: modify __bio_add_page() to accept pages that don't start a new segment" This reverts commit `254c4407cb`. It causes crashes with cryptsetup, even after a few iterations and updates. Drop it for now.	2014-07-14 22:04:47 +02:00
Mikulas Patocka	3b3a1814d1	block: provide compat ioctl for BLKZEROOUT This patch provides the compat BLKZEROOUT ioctl. The argument is a pointer to two uint64_t values, so there is no need to translate it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org # 3.7+ Acked-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-14 12:27:20 +02:00
Tejun Heo	0b462c89e3	blkcg: don't call into policy draining if root_blkg is already gone While a queue is being destroyed, all the blkgs are destroyed and its ->root_blkg pointer is set to NULL. If someone else starts to drain while the queue is in this state, the following oops happens. NULL pointer dereference at 0000000000000028 IP: [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 PGD `e4a1067` PUD b773067 PMD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: cfq_iosched(-) [last unloaded: cfq_iosched] CPU: 1 PID: 537 Comm: bash Not tainted 3.16.0-rc3-work+ #2 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 task: ffff88000e222250 ti: ffff88000efd4000 task.ti: ffff88000efd4000 RIP: 0010:[<ffffffff8144e944>] [<ffffffff8144e944>] blk_throtl_drain+0x84/0x230 RSP: 0018:ffff88000efd7bf0 EFLAGS: 00010046 RAX: 0000000000000000 RBX: ffff880015091450 RCX: 0000000000000001 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88000efd7c10 R08: 0000000000000000 R09: 0000000000000001 R10: ffff88000e222250 R11: 0000000000000000 R12: ffff880015091450 R13: ffff880015092e00 R14: ffff880015091d70 R15: ffff88001508fc28 FS: 00007f1332650740(0000) GS:ffff88001fa80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000028 CR3: 0000000009446000 CR4: 00000000000006e0 Stack: ffffffff8144e8f6 ffff880015091450 0000000000000000 ffff880015091d80 ffff88000efd7c28 ffffffff8144ae2f ffff880015091450 ffff88000efd7c58 ffffffff81427641 ffff880015091450 ffffffff82401f00 ffff880015091450 Call Trace: [<ffffffff8144ae2f>] blkcg_drain_queue+0x1f/0x60 [<ffffffff81427641>] __blk_drain_queue+0x71/0x180 [<ffffffff81429b3e>] blk_queue_bypass_start+0x6e/0xb0 [<ffffffff814498b8>] blkcg_deactivate_policy+0x38/0x120 [<ffffffff8144ec44>] blk_throtl_exit+0x34/0x50 [<ffffffff8144aea5>] blkcg_exit_queue+0x35/0x40 [<ffffffff8142d476>] blk_release_queue+0x26/0xd0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff81427505>] blk_put_queue+0x15/0x20 [<ffffffff817d07bb>] scsi_device_dev_release_usercontext+0x16b/0x1c0 [<ffffffff810bc339>] execute_in_process_context+0x89/0xa0 [<ffffffff817d064c>] scsi_device_dev_release+0x1c/0x20 [<ffffffff817930e2>] device_release+0x32/0xa0 [<ffffffff81454968>] kobject_cleanup+0x38/0x70 [<ffffffff81454848>] kobject_put+0x28/0x60 [<ffffffff817934d7>] put_device+0x17/0x20 [<ffffffff817d11b9>] __scsi_remove_device+0xa9/0xe0 [<ffffffff817d121b>] scsi_remove_device+0x2b/0x40 [<ffffffff817d1257>] sdev_store_delete+0x27/0x30 [<ffffffff81792ca8>] dev_attr_store+0x18/0x30 [<ffffffff8126f75e>] sysfs_kf_write+0x3e/0x50 [<ffffffff8126ea87>] kernfs_fop_write+0xe7/0x170 [<ffffffff811f5e9f>] vfs_write+0xaf/0x1d0 [<ffffffff811f69bd>] SyS_write+0x4d/0xc0 [<ffffffff81d24692>] system_call_fastpath+0x16/0x1b `776687bce4` ("block, blk-mq: draining can't be skipped even if bypass_depth was non-zero") made it easier to trigger this bug by making blk_queue_bypass_start() drain even when it loses the first bypass test to blk_cleanup_queue(); however, the bug has always been there even before the commit as blk_queue_bypass_start() could race against queue destruction, win the initial bypass test but perform the actual draining after blk_cleanup_queue() already destroyed all blkgs. Fix it by skippping calling into policy draining if all the blkgs are already gone. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Shirish Pargaonkar <spargaonkar@suse.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Jet Chen <jet.chen@intel.com> Cc: stable@vger.kernel.org Tested-by: Shirish Pargaonkar <spargaonkar@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-12 17:55:10 +02:00
Tejun Heo	aa6ec29bee	cgroup: remove sane_behavior support on non-default hierarchies sane_behavior has been used as a development vehicle for the default unified hierarchy. Now that the default hierarchy is in place, the flag became redundant and confusing as its usage is allowed on all hierarchies. There are gonna be either the default hierarchy or legacy ones. Let's make that clear by removing sane_behavior support on non-default hierarchies. This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of cgroup_on_dfl() with sane_behavior specific part dropped. On the default and legacy hierarchies w/o sane_behavior, this shouldn't cause any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz>	2014-07-09 10:08:08 -04:00
Tejun Heo	1ced953b17	blkcg, memcg: make blkcg depend on memcg on the default hierarchy Currently, the blkio subsystem attributes all of writeback IOs to the root. One of the issues is that there's no way to tell who originated a writeback IO from block layer. Those IOs are usually issued asynchronously from a task which didn't have anything to do with actually generating the dirty pages. The memory subsystem, when enabled, already keeps track of the ownership of each dirty page and it's desirable for blkio to piggyback instead of adding its own per-page tag. cgroup now has a mechanism to express such dependency - cgroup_subsys->depends_on. This patch declares that blkcg depends on memcg so that memcg is enabled automatically on the default hierarchy when available. Future changes will make blkcg map the memcg tag to find out the cgroup to blame for writeback IOs. As this means that a memcg may be made invisible, this patch also implements css_reset() for memcg which resets its basic configurations. This implementation will probably need to be expanded to cover other states which are used in the default hierarchy. v2: blkcg's dependency on memcg is wrapped with CONFIG_MEMCG to avoid build failure. Reported by kbuild test robot. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk>	2014-07-08 18:02:57 -04:00
Christoph Hellwig	d45b3279a5	block: don't assume last put of shared tags is for the host There is no inherent reason why the last put of a tag structure must be the one for the Scsi_Host, as device model objects can be held for arbitrary periods. Merge blk_free_tags and __blk_free_tags into a single funtion that just release a references and get rid of the BUG() when the host reference wasn't the last. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-08 12:25:28 +02:00
Maurizio Lombardi	254c4407cb	bio: modify __bio_add_page() to accept pages that don't start a new segment The original behaviour is to refuse to add a new page if the maximum number of segments has been reached, regardless of the fact the page we are going to add can be merged into the last segment or not. Unfortunately, when the system runs under heavy memory fragmentation conditions, a driver may try to add multiple pages to the last segment. The original code won't accept them and EBUSY will be reported to userspace. This patch modifies the function so it refuses to add a page only in case the latter starts a new segment and the maximum number of segments has already been reached. The bug can be easily reproduced with the st driver: 1) set CONFIG_SCSI_MPT2SAS_MAX_SGE or CONFIG_SCSI_MPT3SAS_MAX_SGE to 16 2) modprobe st buffer_kbs=1024 3) #dd if=/dev/zero of=/dev/st0 bs=1M count=10 dd: error writing `/dev/st0': Device or resource busy [ming.lei@canonical.com: update bi_iter.bi_size before recounting segments] Signed-off-by: Maurizio Lombardi <mlombard@redhat.com> Signed-off-by: Ming Lei <ming.lei@canonical.com> Tested-by: Dongsu Park <dongsu.park@profitbricks.com> Tested-by: Jet Chen <jet.chen@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:55:15 -06:00
Douglas Gilbert	d15156138d	block SG_IO: add SG_FLAG_Q_AT_HEAD flag After the SG_IO ioctl was copied into the block layer and later into the bsg driver, subtle differences emerged. One difference is the way injected commands are queued through the block layer (i.e. this is not SCSI device queueing nor SATA NCQ). Summarizing: - SG_IO on block layer device: blk_exec*(at_head=false) - sg device SG_IO: at_head=true - bsg device SG_IO: at_head=true Some time ago Boaz Harrosh introduced a sg v4 flag called BSG_FLAG_Q_AT_TAIL to override the bsg driver default. A recent patch titled: "sg: add SG_FLAG_Q_AT_TAIL flag" allowed the sg driver default to be overridden. This patch allows a SG_IO ioctl sent to a block layer device to have its default overridden. ChangeLog: - introduce SG_FLAG_Q_AT_HEAD flag in sg.h to cause commands that are injected via a block layer device SG_IO ioctl to set at_head=true - make comments clearer about queueing in sg.h since the header is used both by the sg device and block layer device implementations of the SG_IO ioctl. - introduce BSG_FLAG_Q_AT_HEAD in bsg.h for compatibility (it does nothing) and update comments. Signed-off-by: Douglas Gilbert <dgilbert@interlog.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <michaelc@cs.wisc.edu> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:48:05 -06:00
Akinobu Mita	9b4231bf99	block: fix SG_[GS]ET_RESERVED_SIZE ioctl when max_sectors is huge SG_GET_RESERVED_SIZE and SG_SET_RESERVED_SIZE ioctls access a reserved buffer in bytes as int type. The value needs to be capped at the request queue's max_sectors. But integer overflow is not correctly handled in the calculation when converting max_sectors from sectors to bytes. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: linux-scsi@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:43:09 -06:00
Akinobu Mita	63f2649659	block: fix BLKSECTGET ioctl when max_sectors is greater than USHRT_MAX BLKSECTGET ioctl loads the request queue's max_sectors as unsigned short value to the argument pointer. So if the max_sector is greater than USHRT_MAX, the upper 16 bits of that is just discarded. In such case, USHRT_MAX is more preferable than the lower 16 bits of max_sectors. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Douglas Gilbert <dgilbert@interlog.com> Cc: linux-scsi@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:43:07 -06:00
Fabian Frederick	16e1556526	block/partitions/efi.c: kerneldoc fixing Adding function documentation and fixing kerneldoc warnings ('field: description' uniformization). Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:40:05 -06:00
Fabian Frederick	dce14c239a	block/partitions/msdos.c: code clean-up checkpatch fixing: WARNING: Missing a blank line after declarations WARNING: space prohibited between function name and open parenthesis '(' ERROR: spaces required around that '<' (ctx:VxV) Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:40:03 -06:00
Fabian Frederick	600ffc5ead	block/partitions/amiga.c: replace nolevel printk by pr_err Also add no prefix pr_fmt to avoid any future default format update Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:40:02 -06:00
Fabian Frederick	472d5e2af2	block/partitions/aix.c: replace countsize kzalloc by kcalloc kcalloc manages countsizeof overflow. Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:40:00 -06:00
Gu Zheng	cbcd1054a1	bio-integrity: add "bip_max_vcnt" into struct bio_integrity_payload Commit `08778795` ("block: Fix nr_vecs for inline integrity vectors") from Martin introduces the function bip_integrity_vecs(get the useful vectors) to fix the issue about nr_vecs for inline integrity vectors that reported by David Milburn. But it seems that bip_integrity_vecs() will return the wrong number if the bio is not based on any bio_set for some reason(bio->bi_pool == NULL), because in that case, the bip_inline_vecs[0] is malloced directly. So here we add the bip_max_vcnt to record the count of vector slots, and cleanup the function bip_integrity_vecs(). Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:36:47 -06:00
Tejun Heo	add703fda9	blk-mq: use percpu_ref for mq usage count Currently, blk-mq uses a percpu_counter to keep track of how many usages are in flight. The percpu_counter is drained while freezing to ensure that no usage is left in-flight after freezing is complete. blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this per-cpu gating mechanism. This type of code has relatively high chance of subtle bugs which are extremely difficult to trigger and it's way too hairy to be open coded in blk-mq. percpu_ref can serve the same purpose after the recent changes. This patch replaces the open-coded per-cpu usage counting and draining mechanism with percpu_ref. blk_mq_queue_enter() performs tryget_live on the ref and exit() performs put. blk_mq_freeze_queue() kills the ref and waits until the reference count reaches zero. blk_mq_unfreeze_queue() revives the ref and wakes up the waiters. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:34:38 -06:00
Tejun Heo	72d6f02a8d	blk-mq: collapse __blk_mq_drain_queue() into blk_mq_freeze_queue() Keeping __blk_mq_drain_queue() as a separate function doesn't buy us anything and it's gonna be further simplified. Let's flatten it into its caller. This patch doesn't make any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:33:02 -06:00
Tejun Heo	780db2071a	blk-mq: decouble blk-mq freezing from generic bypassing blk_mq freezing is entangled with generic bypassing which bypasses blkcg and io scheduler and lets IO requests fall through the block layer to the drivers in FIFO order. This allows forward progress on IOs with the advanced features disabled so that those features can be configured or altered without worrying about stalling IO which may lead to deadlock through memory allocation. However, generic bypassing doesn't quite fit blk-mq. blk-mq currently doesn't make use of blkcg or ioscheds and it maps bypssing to freezing, which blocks request processing and drains all the in-flight ones. This causes problems as bypassing assumes that request processing is online. blk-mq works around this by conditionally allowing request processing for the problem case - during queue initialization. Another weirdity is that except for during queue cleanup, bypassing started on the generic side prevents blk-mq from processing new requests but doesn't drain the in-flight ones. This shouldn't break anything but again highlights that something isn't quite right here. The root cause is conflating blk-mq freezing and generic bypassing which are two different mechanisms. The only intersecting purpose that they serve is during queue cleanup. Let's properly separate blk-mq freezing from generic bypassing and simply use it where necessary. * request_queue->mq_freeze_depth is added and blk_mq_[un]freeze_queue() now operate on this counter instead of ->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added but the counter is tested directly. This will be further updated by later changes. * blk_mq_drain_queue() is dropped and "__" prefix is dropped from blk_mq_freeze_queue(). Queue cleanup path now calls blk_mq_freeze_queue() directly. * blk_queue_enter()'s fast path condition is simplified to simply check @q->mq_freeze_depth. Previously, the condition was !blk_queue_dying(q) && (!blk_queue_bypass(q) \|\| !blk_queue_init_done(q)) mq_freeze_depth is incremented right after dying is set and blk_queue_init_done() exception isn't necessary as blk-mq doesn't start frozen, which only leaves the blk_queue_bypass() test which can be replaced by @q->mq_freeze_depth test. This change simplifies the code and reduces confusion in the area. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:31:13 -06:00
Tejun Heo	776687bce4	block, blk-mq: draining can't be skipped even if bypass_depth was non-zero Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue() skip queue draining if bypass_depth was already above zero. The assumption is that the one which bumped the bypass_depth should have performed draining already; however, there's nothing which prevents a new instance of bypassing/freezing from starting before the previous one finishes draining. The current code may allow the later bypassing/freezing instances to complete while there still are in-flight requests which haven't finished draining. Fix it by draining regardless of bypass_depth. We still skip draining from blk_queue_bypass_start() while the queue is initializing to avoid introducing excessive delays during boot. INIT_DONE setting is moved above the initial blk_queue_bypass_end() so that bypassing attempts can't slip inbetween. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:29:17 -06:00
Tejun Heo	531ed6261e	blk-mq: fix a memory ordering bug in blk_mq_queue_enter() blk-mq uses a percpu_counter to keep track of how many usages are in flight. The percpu_counter is drained while freezing to ensure that no usage is left in-flight after freezing is complete. blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this per-cpu gating mechanism; unfortunately, it contains a subtle bug - smp_wmb() in blk_mq_queue_enter() doesn't prevent prevent the cpu from fetching @q->bypass_depth before incrementing @q->mq_usage_counter and if freezing happens inbetween the caller can slip through and freezing can be complete while there are active users. Use smp_mb() instead so that bypass_depth and mq_usage_counter modifications and tests are properly interlocked. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nicholas A. Bellinger <nab@linux-iscsi.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-07-01 10:27:06 -06:00
Linus Torvalds	3493860c76	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A small collection of fixes/changes for the current series. This contains: - Removal of dead code from Gu Zheng. - Revert of two bad fixes that went in earlier in this round, marking things as __init that were not purely used from init. - A fix for blk_mq_start_hw_queue() using the __blk_mq_run_hw_queue(), which could place us wrongly. Make it use the non __ variant, which handles cases where we are called from the wrong CPU set. From me. - A fix for drbd, which allocates discard requests without room for the SCSI payload. From Lars Ellenberg. - A fix for user-after-free in the blkcg code from Tejun. - Addition of limiting gaps in SG lists, if the hardware needs it. This is the last pre-req patch for blk-mq to enable the full NVMe conversion. Could wait until 3.17, but it's simple enough so would be nice to have everything we need for the NVMe port in the 3.17 release. From me" * 'for-linus' of git://git.kernel.dk/linux-block: drbd: fix NULL pointer deref in blk_add_request_payload blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue() block: add support for limiting gaps in SG lists bio: remove unused macro bip_vec_idx() Revert "block: add __init to elv_register" Revert "block: add __init to blkcg_policy_register" blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t floppy: format block0 read error message properly	2014-06-26 13:06:13 -07:00
Jens Axboe	0ffbce80c2	blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue() Currently it calls __blk_mq_run_hw_queue(), which depends on the CPU placement being correct. This means it's not possible to call blk_mq_start_hw_queues(q) from a context that is correct for all queues, leading to triggering the WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)); in __blk_mq_run_hw_queue(). Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-25 08:22:34 -06:00
Jens Axboe	66cb45aa41	block: add support for limiting gaps in SG lists Another restriction inherited for NVMe - those devices don't support SG lists that have "gaps" in them. Gaps refers to cases where the previous SG entry doesn't end on a page boundary. For NVMe, all SG entries must start at offset 0 (except the first) and end on a page boundary (except the last). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-24 16:22:24 -06:00
Jens Axboe	e567bf7112	Revert "block: add __init to elv_register" This reverts commit `b5097e956a`. The original commit is buggy, we do use the registration functions at runtime, for instance when loading IO schedulers through sysfs. Reported-by: Damien Wyart <damien.wyart@gmail.com>	2014-06-22 16:34:11 -06:00
Jens Axboe	d5bf02914e	Revert "block: add __init to blkcg_policy_register" This reverts commit `a2d445d440`. The original commit is buggy, we do use the registration functions at runtime for modular builds.	2014-06-22 16:34:11 -06:00
Tejun Heo	a5049a8ae3	blkcg: fix use-after-free in __blkg_release_rcu() by making blkcg_gq refcnt an atomic_t Hello, So, this patch should do. Joe, Vivek, can one of you guys please verify that the oops goes away with this patch? Jens, the original thread can be read at http://thread.gmane.org/gmane.linux.kernel/1720729 The fix converts blkg->refcnt from int to atomic_t. It does some overhead but it should be minute compared to everything else which is going on and the involved cacheline bouncing, so I think it's highly unlikely to cause any noticeable difference. Also, the refcnt in question should be converted to a perpcu_ref for blk-mq anyway, so the atomic_t is likely to go away pretty soon anyway. Thanks. ------- 8< ------- __blkg_release_rcu() may be invoked after the associated request_queue is released with a RCU grace period inbetween. As such, the function and callbacks invoked from it must not dereference the associated request_queue. This is clearly indicated in the comment above the function. Unfortunately, while trying to fix a different issue, `2a4fd070ee` ("blkcg: move bulk of blkcg_gq release operations to the RCU callback") ignored this and added [un]locking of @blkg->q->queue_lock to __blkg_release_rcu(). This of course can cause oops as the request_queue may be long gone by the time this code gets executed. general protection fault: 0000 [#1] SMP CPU: 21 PID: 30 Comm: rcuos/21 Not tainted 3.15.0 #1 Hardware name: Stratus ftServer 6400/G7LAZ, BIOS BIOS Version 6.3:57 12/25/2013 task: ffff880854021de0 ti: ffff88085403c000 task.ti: ffff88085403c000 RIP: 0010:[<ffffffff8162e9e5>] [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP: 0018:ffff88085403fdf0 EFLAGS: 00010086 RAX: 0000000000020000 RBX: 0000000000000010 RCX: 0000000000000000 RDX: 000060ef80008248 RSI: 0000000000000286 RDI: 6b6b6b6b6b6b6b6b RBP: ffff88085403fdf0 R08: 0000000000000286 R09: 0000000000009f39 R10: 0000000000020001 R11: 0000000000020001 R12: ffff88103c17a130 R13: ffff88103c17a080 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88107fca0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000006e5ab8 CR3: 000000000193d000 CR4: 00000000000407e0 Stack: ffff88085403fe18 ffffffff812cbfc2 ffff88103c17a130 0000000000000000 ffff88103c17a130 ffff88085403fec0 ffffffff810d1d28 ffff880854021de0 ffff880854021de0 ffff88107fcaec58 ffff88085403fe80 ffff88107fcaec30 Call Trace: [<ffffffff812cbfc2>] __blkg_release_rcu+0x72/0x150 [<ffffffff810d1d28>] rcu_nocb_kthread+0x1e8/0x300 [<ffffffff81091d81>] kthread+0xe1/0x100 [<ffffffff8163813c>] ret_from_fork+0x7c/0xb0 Code: ff 47 04 48 8b 7d 08 be 00 02 00 00 e8 55 48 a4 ff 5d c3 0f 1f 00 66 66 66 66 90 55 48 89 e5 +fa 66 66 90 66 66 90 b8 00 00 02 00 <f0> 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f +b7 RIP [<ffffffff8162e9e5>] _raw_spin_lock_irq+0x15/0x60 RSP <ffff88085403fdf0> The request_queue locking was added because blkcg_gq->refcnt is an int protected with the queue lock and __blkg_release_rcu() needs to put the parent. Let's fix it by making blkcg_gq->refcnt an atomic_t and dropping queue locking in the function. Given the general heavy weight of the current request_queue and blkcg operations, this is unlikely to cause any noticeable overhead. Moreover, blkcg_gq->refcnt is likely to be converted to percpu_ref in the near future, so whatever (most likely negligible) overhead it may add is temporary. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Joe Lawrence <joe.lawrence@stratus.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/g/alpine.DEB.2.02.1406081816540.17948@jlaw-desktop.mno.stratus.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-22 16:30:52 -06:00
Linus Torvalds	f1d702487b	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: "A smaller collection of fixes for the block core that would be nice to have in -rc2. This pull request contains: - Fixes for races in the wait/wakeup logic used in blk-mq from Alexander. No issues have been observed, but it is definitely a bit flakey currently. Alternatively, we may drop the cyclic wakeups going forward, but that needs more testing. - Some cleanups from Christoph. - Fix for an oops in null_blk if queue_mode=1 and softirq completions are used. From me. - A fix for a regression caused by the chunk size setting. It inadvertently used max_hw_sectors instead of max_sectors, which is incorrect, and causes hangs on btrfs multi-disk setups (where hw sectors apparently isn't set). From me. - Removal of WQ_POWER_EFFICIENT in the kblockd creation. This was a recent addition as well, but it actually breaks blk-mq which relies on strict scheduling. If the workqueue power_efficient mode is turned on, this breaks blk-mq. From Matias. - null_blk module parameter description fix from Mike" * 'for-linus' of git://git.kernel.dk/linux-block: blk-mq: bitmap tag: fix races in bt_get() function blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt blk-mq: bitmap tag: fix races on shared ::wake_index fields block: blk_max_size_offset() should check ->max_sectors null_blk: fix softirq completions for queue_mode == 1 blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue blk-mq: properly drain stopped queues block: remove WQ_POWER_EFFICIENT from kblockd null_blk: fix name and description of 'queue_mode' module parameter block: remove elv_abort_queue and blk_abort_flushes	2014-06-19 17:56:43 -10:00
Alexander Gordeev	86fb5c56cf	blk-mq: bitmap tag: fix races in bt_get() function This update fixes few issues in bt_get() function: - list_empty(&wait.task_list) check is not protected; - was_empty check is always true which results in every thread entering the loop resets bt_wait_state::wait_cnt counter rather than every bt->wake_cnt'th thread; - 'bt_wait_state::wait_cnt' counter update is redundant, since it also gets reset in bt_clear_tag() function; Cc: Christoph Hellwig <hch@infradead.org> Cc: Ming Lei <tom.leiming@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-17 22:13:08 -07:00
Alexander Gordeev	2971c35f35	blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt This piece of code in bt_clear_tag() function is racy: bs = bt_wake_ptr(bt); if (bs && atomic_dec_and_test(&bs->wait_cnt)) { atomic_set(&bs->wait_cnt, bt->wake_cnt); wake_up(&bs->wait); } Since nothing prevents bt_wake_ptr() from returning the very same 'bs' address on multiple CPUs, the following scenario is possible: CPU1 CPU2 ---- ---- 0. bs = bt_wake_ptr(bt); bs = bt_wake_ptr(bt); 1. atomic_dec_and_test(&bs->wait_cnt) 2. atomic_dec_and_test(&bs->wait_cnt) 3. atomic_set(&bs->wait_cnt, bt->wake_cnt); If the decrement in [1] yields zero then for some amount of time the decrement in [2] results in a negative/overflow value, which is not expected. The follow-up assignment in [3] overwrites the invalid value with the batch value (and likely prevents the issue from being severe) which is still incorrect and should be a lesser. Cc: Ming Lei <tom.leiming@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-17 22:13:05 -07:00
Alexander Gordeev	8537b12034	blk-mq: bitmap tag: fix races on shared ::wake_index fields Fix racy updates of shared blk_mq_bitmap_tags::wake_index and blk_mq_hw_ctx::wake_index fields. Cc: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-17 22:12:35 -07:00
Linus Torvalds	b55b390202	Merge git://git.infradead.org/users/willy/linux-nvme Pull NVMe update from Matthew Wilcox: "Mostly bugfixes again for the NVMe driver. I'd like to call out the exported tracepoint in the block layer; I believe Keith has cleared this with Jens. We've had a few reports from people who're really pounding on NVMe devices at scale, hence the timeout changes (and new module parameters), hotplug cpu deadlock, tracepoints, and minor performance tweaks" [ Jens hadn't seen that tracepoint thing, but is ok with it - it will end up going away when mq conversion happens ] * git://git.infradead.org/users/willy/linux-nvme: (22 commits) NVMe: Fix START_STOP_UNIT Scsi->NVMe translation. NVMe: Use Log Page constants in SCSI emulation NVMe: Define Log Page constants NVMe: Fix hot cpu notification dead lock NVMe: Rename io_timeout to nvme_io_timeout NVMe: Use last bytes of f/w rev SCSI Inquiry NVMe: Adhere to request queue block accounting enable/disable NVMe: Fix nvme get/put queue semantics NVMe: Delete NVME_GET_FEAT_TEMP_THRESH NVMe: Make admin timeout a module parameter NVMe: Make iod bio timeout a parameter NVMe: Prevent possible NULL pointer dereference NVMe: Fix the buffer size passed in GetLogPage(CDW10.NUMD) NVMe: Update data structures for NVMe 1.2 NVMe: Enable BUILD_BUG_ON checks NVMe: Update namespace and controller identify structures to the 1.1a spec NVMe: Flush with data support NVMe: Configure support for block flush NVMe: Add tracepoints NVMe: Protect against badly formatted CQEs ...	2014-06-15 15:58:03 -10:00
Christoph Hellwig	95ed068165	blk-mq: merge blk_mq_drain_queue and __blk_mq_drain_queue Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-13 12:17:40 -06:00
Christoph Hellwig	8f5280f4ee	blk-mq: properly drain stopped queues If we need to drain a queue we need to run all queues, even if they are marked stopped to make sure the driver has a chance to error out on all queued requests. This fixes surprise removal with scsi-mq. Reported-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-13 12:17:38 -06:00
Matias Bjørling	28747fcd22	block: remove WQ_POWER_EFFICIENT from kblockd blk-mq issues async requests through kblockd. To issue a work request on a specific CPU, kblockd_schedule_delayed_work_on is used. However, the specific CPU choice may not be honored, if the power_efficient option for workqueues is set. blk-mq requires that we have strict per-cpu scheduling, so it wont work properly if kblockd is marked POWER_EFFICIENT and power_efficient is set. Remove the kblockd WQ_POWER_EFFICIENT flag to prevent this behavior. This essentially reverts part of commit `695588f945`, which added the WQ_POWER_EFFICIENT marker to kblockd. Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-11 15:53:39 -06:00
Christoph Hellwig	2940474af7	block: remove elv_abort_queue and blk_abort_flushes elv_abort_queue has no callers, and blk_abort_flushes is only called by elv_abort_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-11 15:31:21 -06:00
Linus Torvalds	23d4ed53b7	Merge branch 'for-linus' of git://git.kernel.dk/linux-block Pull block layer fixes from Jens Axboe: "Final small batch of fixes to be included before -rc1. Some general cleanups in here as well, but some of the blk-mq fixes we need for the NVMe conversion and/or scsi-mq. The pull request contains: - Support for not merging across a specified "chunk size", if set by the driver. Some NVMe devices perform poorly for IO that crosses such a chunk, so we need to support it generically as part of request merging avoid having to do complicated split logic. From me. - Bump max tag depth to 10Ki tags. Some scsi devices have a huge shared tag space. Before we failed with EINVAL if a too large tag depth was specified, now we truncate it and pass back the actual value. From me. - Various blk-mq rq init fixes from me and others. - A fix for enter on a dying queue for blk-mq from Keith. This is needed to prevent oopsing on hot device removal. - Fixup for blk-mq timer addition from Ming Lei. - Small round of performance fixes for mtip32xx from Sam Bradshaw. - Minor stack leak fix from Rickard Strandqvist. - Two __init annotations from Fabian Frederick" * 'for-linus' of git://git.kernel.dk/linux-block: block: add __init to blkcg_policy_register block: add __init to elv_register block: ensure that bio_add_page() always accepts a page for an empty bio blk-mq: add timer in blk_mq_start_request blk-mq: always initialize request->start_time block: blk-exec.c: Cleaning up local variable address returnd mtip32xx: minor performance enhancements blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init() blk-mq: don't allow queue entering for a dying queue blk-mq: bump max tag depth to 10K tags block: add blk_rq_set_block_pc() block: add notion of a chunk size for request merging	2014-06-11 08:41:17 -07:00
Fabian Frederick	a2d445d440	block: add __init to blkcg_policy_register blkcg_policy_register is only called by __init functions: __init cfq_init __init throtl_init Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-10 13:13:12 -06:00
Fabian Frederick	b5097e956a	block: add __init to elv_register elv_register is only called by elevator init functions: __init cfq_init __init deadline_init __init noop_init Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-10 13:13:11 -06:00
Jens Axboe	58a4915ad2	block: ensure that bio_add_page() always accepts a page for an empty bio With commit `762380ad93` added support for chunk sizes and no merging across them, it broke the rule of always allowing adding of a single page to an empty bio. So relax the restriction a bit to allow for that, similarly to what we have always done. This fixes a crash with mkfs.xfs and 512b sector sizes on NVMe. Reported-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-10 12:53:56 -06:00
Linus Torvalds	14208b0ec5	Merge branch 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup updates from Tejun Heo: "A lot of activities on cgroup side. Heavy restructuring including locking simplification took place to improve the code base and enable implementation of the unified hierarchy, which currently exists behind a __DEVEL__ mount option. The core support is mostly complete but individual controllers need further work. To explain the design and rationales of the the unified hierarchy Documentation/cgroups/unified-hierarchy.txt is added. Another notable change is css (cgroup_subsys_state - what each controller uses to identify and interact with a cgroup) iteration update. This is part of continuing updates on css object lifetime and visibility. cgroup started with reference count draining on removal way back and is now reaching a point where csses behave and are iterated like normal refcnted objects albeit with some complexities to allow distinguishing the state where they're being deleted. The css iteration update isn't taken advantage of yet but is planned to be used to simplify memcg significantly" * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits) cgroup: disallow disabled controllers on the default hierarchy cgroup: don't destroy the default root cgroup: disallow debug controller on the default hierarchy cgroup: clean up MAINTAINERS entries cgroup: implement css_tryget() device_cgroup: use css_has_online_children() instead of has_children() cgroup: convert cgroup_has_live_children() into css_has_online_children() cgroup: use CSS_ONLINE instead of CGRP_DEAD cgroup: iterate cgroup_subsys_states directly cgroup: introduce CSS_RELEASED and reduce css iteration fallback window cgroup: move cgroup->serial_nr into cgroup_subsys_state cgroup: link all cgroup_subsys_states in their sibling lists cgroup: move cgroup->sibling and ->children into cgroup_subsys_state cgroup: remove cgroup->parent device_cgroup: remove direct access to cgroup->children memcg: update memcg_has_children() to use css_next_child() memcg: remove tasks/children test from mem_cgroup_force_empty() cgroup: remove css_parent() cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css cgroup: use cgroup->self.refcnt for cgroup refcnting ...	2014-06-09 15:03:33 -07:00
Ming Lei	2b8393b43e	blk-mq: add timer in blk_mq_start_request This way will become consistent with non-mq case, also avoid to update rq->deadline twice for mq. The comment said: "We do this early, to ensure we are on the right CPU.", but no percpu stuff is used in blk_add_timer(), so it isn't necessary. Even when inserting from plug list, there is no such guarantee at all. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-09 10:20:06 -06:00
Jens Axboe	3ee3237239	blk-mq: always initialize request->start_time The blk-mq core only initializes this if io stats are enabled, since blk-mq only reads the field in that case. But drivers could potentially use it internally, so ensure that we always set it to the current time when the request is allocated. Reported-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-09 09:36:53 -06:00
Rickard Strandqvist	de83953f9d	block: blk-exec.c: Cleaning up local variable address returnd Address of local variable assigned to a function parameter This was partly found using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-08 19:51:31 -06:00
Mitchel Humpherys	b1de0d139c	mm: convert some level-less printks to pr_* printk is meant to be used with an associated log level. There are some instances of printk scattered around the mm code where the log level is missing. Add a log level and adhere to suggestions by scripts/checkpatch.pl by moving to the pr_* macros. Also add the typical pr_fmt definition so that print statements can be easily traced back to the modules where they occur, correlated one with another, etc. This will require the removal of some (now redundant) prefixes on a few print statements. Signed-off-by: Mitchel Humpherys <mitchelh@codeaurora.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-06-06 16:08:18 -07:00
Jens Axboe	f6be4fb4bc	blk-mq: ->timeout should be cleared in blk_mq_rq_ctx_init() It'll be used in blk_mq_start_request() to set a potential timeout for the request, so clear it to zero at alloc time to ensure that we know if someone has set it or not. Fixes random early timeouts on NVMe testing. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 11:05:25 -06:00
Keith Busch	3b632cf0ea	blk-mq: don't allow queue entering for a dying queue If the queue is going away, don't let new allocs or queueing happen on it. Go through the normal wait process, and exit with ENODEV in that case. Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 10:40:03 -06:00
Jens Axboe	a4391c6465	blk-mq: bump max tag depth to 10K tags For some scsi-mq cases, the tag map can be huge. So increase the max number of tags we support. Additionally, don't fail with EINVAL if a user requests too many tags. Warn that the tag depth has been adjusted down, and store the new value inside the tag_set passed in. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 08:04:46 -06:00
Jens Axboe	f27b087b81	block: add blk_rq_set_block_pc() With the optimizations around not clearing the full request at alloc time, we are leaving some of the needed init for REQ_TYPE_BLOCK_PC up to the user allocating the request. Add a blk_rq_set_block_pc() that sets the command type to REQ_TYPE_BLOCK_PC, and properly initializes the members associated with this type of request. Update callers to use this function instead of manipulating rq->cmd_type directly. Includes fixes from Christoph Hellwig <hch@lst.de> for my half-assed attempt. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-06 07:57:37 -06:00
Jens Axboe	762380ad93	block: add notion of a chunk size for request merging Some drivers have different limits on what size a request should optimally be, depending on the offset of the request. Similar to dividing a device into chunks. Add a setting that allows the driver to inform the block layer of such a chunk size. The block layer will then prevent merging across the chunks. This is needed to optimally support NVMe with a non-zero stripe size. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-05 13:38:39 -06:00
Ming Lei	14b83e172f	block: mq flush: clear flush_rq's tag in flush_end_io() blk_mq_tag_to_rq() needs to be able to tell if it should return the original request, or the flush request if we are doing a flush sequence. Clear the flush tag when IO completes for a flush, since that is what we are comparing against. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-04 10:40:16 -06:00
Jens Axboe	0e62f51f87	blk-mq: let blk_mq_tag_to_rq() take blk_mq_tags as the main parameter We currently pass in the hardware queue, and get the tags from there. But from scsi-mq, with a shared tag space, it's a lot more convenient to pass in the blk_mq_tags instead as the hardware queue isn't always directly available. So instead of having to re-map to a given hardware queue from rq->mq_ctx, just pass in the tags structure. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-04 10:23:49 -06:00
Jens Axboe	f899fed442	blk-mq: fix regression from commit `624dbe4754` When the code was collapsed to avoid duplication, the recent patch for ensuring that a queue is idled before free was dropped, which was added by commit `19c5d84f14`. Add back the blk_mq_tag_idle(), to ensure we don't leak a reference to an active queue when it is freed. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-04 09:11:53 -06:00
Jens Axboe	ff87bcec19	blk-mq: handle NULL req return from blk_map_request in single queue mode blk_mq_map_request() can return NULL if we fail entering the queue (dying, or removed), in which case it has already ended IO on the bio. So nothing more to do, except just return. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-03 21:04:39 -06:00
Ming Lei	e6cdb0929f	blk-mq: fix sparse warning on missed __percpu annotation 'struct blk_mq_ctx' is __percpu, so add the annotation and fix the sparse warning reported from Fengguang: [block:for-linus 2/3] block/blk-mq.h:75:16: sparse: incorrect type in initializer (different address spaces) Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-03 21:04:39 -06:00
Ming Lei	cb96a42cc1	blk-mq: fix schedule from atomic context blk_mq_put_ctx() has to be called before io_schedule() in bt_get(). This patch fixes the problem by taking similar approach from percpu_ida allocation for the situation. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-03 21:04:39 -06:00
Ming Lei	1aecfe4887	blk-mq: move blk_mq_get_ctx/blk_mq_put_ctx to mq private header The blk-mq tag code need these helpers. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-06-03 21:04:38 -06:00
Linus Torvalds	776edb5931	Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next Pull core locking updates from Ingo Molnar: "The main changes in this cycle were: - reduced/streamlined smp_mb__() interface that allows more usecases and makes the existing ones less buggy, especially in rarer architectures - add rwsem implementation comments - bump up lockdep limits" 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits) rwsem: Add comments to explain the meaning of the rwsem's count field lockdep: Increase static allocations arch: Mass conversion of smp_mb__() arch,doc: Convert smp_mb__() arch,xtensa: Convert smp_mb__() arch,x86: Convert smp_mb__() arch,tile: Convert smp_mb__() arch,sparc: Convert smp_mb__() arch,sh: Convert smp_mb__() arch,score: Convert smp_mb__() arch,s390: Convert smp_mb__() arch,powerpc: Convert smp_mb__() arch,parisc: Convert smp_mb__() arch,openrisc: Convert smp_mb__() arch,mn10300: Convert smp_mb__() arch,mips: Convert smp_mb__() arch,metag: Convert smp_mb__() arch,m68k: Convert smp_mb__() arch,m32r: Convert smp_mb__() arch,ia64: Convert smp_mb__() ...	2014-06-03 12:57:53 -07:00
Linus Torvalds	80081ec309	Merge branch 'for-3.16/drivers' of git://git.kernel.dk/linux-block into next Pull block driver changes from Jens Axboe: "Now that the core bits are in, here's the pull request for the driver related changes for 3.16. Nothing out of the ordinary here, mostly business as usual. There are a few pulls of for-3.16/core into this branch, which were done when the blk-mq was modified after the mtip32xx conversion was put in. The pull request contains: - skd and cciss converted to use pci_enable_msix_exact(). From Alexander Gordeev. - A few mtip32xx fixes from Asai @ Micron. - The conversion of mtip32xx from make_request_fn to blk-mq, and a later small fix for that conversion on quiescing for non-queued IO. From me. - A fix for bsg to use an exported function to check whether this driver is request based or not. Needed updating for blk-mq, which is request based, but does not have a request_fn hook. From me. - Small floppy bug fix from Jiri. - A series of cleanups for the cdrom uniform layer from Joe Perches. Gets rid of various old ugly macros, making the code conform more to the modern coding style. - A series of patches for drbd from the drbd crew (Lars Ellenberg and Philipp Reisner). - A use-after-free fix for null_blk from Ming Lei. - Also from Ming Lei is a performance patch for virtio-blk, which can net us a 3x win on kvm platforms where world notification is expensive. - Ming Lei also fixed a stall issue in virtio-blk, due to a race between queue start/stop and resource limits. - A small batch of fixes for xen-blk{back,front} from Olaf Hering and Valentin Priescu" * 'for-3.16/drivers' of git://git.kernel.dk/linux-block: (54 commits) block: virtio_blk: don't hold spin lock during world switch xen-blkback: defer freeing blkif to avoid blocking xenwatch xen blkif.h: fix comment typo in discard-alignment xen/blkback: disable discard feature if requested by toolstack xen-blkfront: remove type check from blkfront_setup_discard floppy: do not corrupt bio.bi_flags when reading block 0 mtip32xx: move error handling to service thread virtio_blk: fix race between start and stop queue mtip32xx: stop block hardware queues before quiescing IO mtip32xx: blk_mq_init_queue() returns an ERR_PTR mtip32xx: convert to use blk-mq cdrom: Remove unnecessary prototype for cdrom_get_disc_info cdrom: Remove unnecessary prototype for cdrom_mrw_exit cdrom: Remove cdrom_count_tracks prototype cdrom: Remove cdrom_get_next_writeable prototype cdrom: Remove cdrom_get_last_written prototype cdrom: Move mmc_ioctls above cdrom_ioctl to remove unnecessary prototype cdrom: Remove unnecessary sanitize_format prototype cdrom: Remove unnecessary check_for_audio_disc prototype cdrom: Remove prototype for open_for_data ...	2014-06-02 13:57:01 -07:00
Linus Torvalds	681a289548	Merge branch 'for-3.16/core' of git://git.kernel.dk/linux-block into next Pull block core updates from Jens Axboe: "It's a big(ish) round this time, lots of development effort has gone into blk-mq in the last 3 months. Generally we're heading to where 3.16 will be a feature complete and performant blk-mq. scsi-mq is progressing nicely and will hopefully be in 3.17. A nvme port is in progress, and the Micron pci-e flash driver, mtip32xx, is converted and will be sent in with the driver pull request for 3.16. This pull request contains: - Lots of prep and support patches for scsi-mq have been integrated. All from Christoph. - API and code cleanups for blk-mq from Christoph. - Lots of good corner case and error handling cleanup fixes for blk-mq from Ming Lei. - A flew of blk-mq updates from me: * Provide strict mappings so that the driver can rely on the CPU to queue mapping. This enables optimizations in the driver. * Provided a bitmap tagging instead of percpu_ida, which never really worked well for blk-mq. percpu_ida relies on the fact that we have a lot more tags available than we really need, it fails miserably for cases where we exhaust (or are close to exhausting) the tag space. * Provide sane support for shared tag maps, as utilized by scsi-mq * Various fixes for IO timeouts. * API cleanups, and lots of perf tweaks and optimizations. - Remove 'buffer' from struct request. This is ancient code, from when requests were always virtually mapped. Kill it, to reclaim some space in struct request. From me. - Remove 'magic' from blk_plug. Since we store these on the stack and since we've never caught any actual bugs with this, lets just get rid of it. From me. - Only call part_in_flight() once for IO completion, as includes two atomic reads. Hopefully we'll get a better implementation soon, as the part IO stats are now one of the more expensive parts of doing IO on blk-mq. From me. - File migration of block code from {mm,fs}/ to block/. This includes bio.c, bio-integrity.c, bounce.c, and ioprio.c. From me, from a discussion on lkml. That should describe the meat of the pull request. Also has various little fixes and cleanups from Dave Jones, Shaohua Li, Duan Jiong, Fengguang Wu, Fabian Frederick, Randy Dunlap, Robert Elliott, and Sam Bradshaw" * 'for-3.16/core' of git://git.kernel.dk/linux-block: (100 commits) blk-mq: push IPI or local end_io decision to __blk_mq_complete_request() blk-mq: remember to start timeout handler for direct queue block: ensure that the timer is always added blk-mq: blk_mq_unregister_hctx() can be static blk-mq: make the sysfs mq/ layout reflect current mappings blk-mq: blk_mq_tag_to_rq should handle flush request block: remove dead code in scsi_ioctl:blk_verify_command blk-mq: request initialization optimizations block: add queue flag for disabling SG merging block: remove 'magic' from struct blk_plug blk-mq: remove alloc_hctx and free_hctx methods blk-mq: add file comments and update copyright notices blk-mq: remove blk_mq_alloc_request_pinned blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request blk-mq: remove blk_mq_wait_for_tags blk-mq: initialize request in __blk_mq_alloc_request blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request blk-mq: add helper to insert requests from irq context blk-mq: remove stale comment for blk_mq_complete_request() blk-mq: allow non-softirq completions ...	2014-06-02 09:29:34 -07:00
Jens Axboe	ed851860b4	blk-mq: push IPI or local end_io decision to __blk_mq_complete_request() We have callers outside of the blk-mq proper (like timeouts) that want to call __blk_mq_complete_request(), so rename the function and put the decision code for whether to use ->softirq_done_fn or blk_mq_endio() into __blk_mq_complete_request(). This also makes the interface more logical again. blk_mq_complete_request() attempts to atomically mark the request completed, and calls __blk_mq_complete_request() if successful. __blk_mq_complete_request() then just ends the request. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 21:20:50 -06:00
Jens Axboe	feff689412	blk-mq: remember to start timeout handler for direct queue Commit `07068d5b8e` added a direct-to-hw-queue mode, but this mode needs to remember to add the request timeout handler as well. Without it, we don't track timeouts for these requests. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 15:42:56 -06:00
Jens Axboe	c7bca4183f	block: ensure that the timer is always added Commit `f793aa5378` relaxed the timer addition a little too much. If the timer isn't pending, we always need to add it. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 15:41:39 -06:00
Fengguang Wu	ee3c5db089	blk-mq: blk_mq_unregister_hctx() can be static CC: Jens Axboe <axboe@kernel.dk> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 10:31:13 -06:00
Jens Axboe	67aec14ce8	blk-mq: make the sysfs mq/ layout reflect current mappings Currently blk-mq registers all the hardware queues in sysfs, regardless of whether it uses them (e.g. they have CPU mappings) or not. The unused hardware queues lack the cpux/ directories, and the other sysfs entries (like active, pending, etc) are all zeroes. Change this so that sysfs correctly reflects the current mappings of the hardware queues. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 08:25:36 -06:00
Jens Axboe	f89ca16646	Merge branch 'for-3.16/core' into for-3.16/drivers Pulled in for the blk_mq_tag_to_rq() change, which impacts mtip32xx. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 08:11:50 -06:00
Shaohua Li	2230237500	blk-mq: blk_mq_tag_to_rq should handle flush request flush request is special, which borrows the tag from the parent request. Hence blk_mq_tag_to_rq needs special handling to return the flush request from the tag. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-30 08:06:42 -06:00
Dave Jones	da52f22fa9	block: remove dead code in scsi_ioctl:blk_verify_command filter gets assigned the address of blk_default_cmd_filter on entry to this function, so the !filter condition can never be true. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-29 13:38:50 -06:00
Jens Axboe	4b570521be	blk-mq: request initialization optimizations We currently clear a lot more than we need to, so make that a bit more clever. Make some of the init dependent on features, like only setting start_time if we are going to use it. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-29 11:00:11 -06:00
Jens Axboe	05f1dd5315	block: add queue flag for disabling SG merging If devices are not SG starved, we waste a lot of time potentially collapsing SG segments. Enough that 1.5% of the CPU time goes to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE, which just returns the number of vectors in a bio instead of looping over all segments and checking for collapsible ones. Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg merging, if they so desire. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-29 09:53:32 -06:00
Jens Axboe	4d92a9beb3	block: remove 'magic' from struct blk_plug I don't think we've ever caught any bugs with this, and there's the list poisoning for the plug lists to catch uninitialized cases. So remove the magic member and save 8 bytes in the struct. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-29 08:09:00 -06:00
Jens Axboe	0fb662e225	Merge branch 'for-3.16/core' into for-3.16/drivers Pull in core changes (again), since we got rid of the alloc/free hctx mq_ops hooks and mtip32xx then needed updating again. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 10:18:51 -06:00
Christoph Hellwig	cdef54dd85	blk-mq: remove alloc_hctx and free_hctx methods There is no need for drivers to control hardware context allocation now that we do the context to node mapping in common code. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 10:18:31 -06:00
Jens Axboe	75bb4625bb	blk-mq: add file comments and update copyright notices None of the blk-mq files have an explanatory comment at the top for what that particular file does. Add that and add appropriate copyright notices as well. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 10:15:41 -06:00
Jens Axboe	6178976500	Merge branch 'for-3.16/core' into for-3.16/drivers mtip32xx uses blk_mq_alloc_reserved_request(), so pull in the core changes so we have a properly merged end result. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:50:26 -06:00
Christoph Hellwig	d852564f8c	blk-mq: remove blk_mq_alloc_request_pinned We now only have one caller left and can open code it there in a cleaner way. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:27 -06:00
Christoph Hellwig	793597a6a9	blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request We already do a non-blocking allocation in blk_mq_map_request, no need to repeat it. Just call __blk_mq_alloc_request to wait directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:25 -06:00
Christoph Hellwig	a3bd77567c	blk-mq: remove blk_mq_wait_for_tags The current logic for blocking tag allocation is rather confusing, as we first allocated and then free again a tag in blk_mq_wait_for_tags, just to attempt a non-blocking allocation and then repeat if someone else managed to grab the tag before us. Instead change blk_mq_alloc_request_pinned to simply do a blocking tag allocation itself and use the request we get back from it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:23 -06:00
Christoph Hellwig	5dee857720	blk-mq: initialize request in __blk_mq_alloc_request Both callers if __blk_mq_alloc_request want to initialize the request, so lift it into the common path. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:21 -06:00
Christoph Hellwig	4ce01dd1a0	blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request Instead of having two almost identical copies of the same code just let the callers pass in the reserved flag directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 09:49:19 -06:00
Christoph Hellwig	6fca6a611c	blk-mq: add helper to insert requests from irq context Both the cache flush state machine and the SCSI midlayer want to submit requests from irq context, and the current per-request requeue_work unfortunately causes corruption due to sharing with the csd field for flushes. Replace them with a per-request_queue list of requests to be requeued. Based on an earlier test by Ming Lei. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Ming Lei <tom.leiming@gmail.com> Tested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-28 08:08:02 -06:00
Jens Axboe	95f0968499	blk-mq: allow non-softirq completions Right now we export two ways of completing a request: 1) blk_mq_complete_request(). This uses an IPI (if needed) and completes through q->softirq_done_fn(). It also works with timeouts. 2) blk_mq_end_io(). This completes inline, and ignores any timeout state of the request. Let blk_mq_complete_request() handle non-softirq_done_fn completions as well, by just completing inline. If a driver has enough completion ports to place completions correctly, it need not define a mq_ops->complete() and we can avoid an indirect function call by doing the completion inline. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 17:46:48 -06:00
Jens Axboe	f14bbe77a9	blk-mq: pass in suggested NUMA node to ->alloc_hctx() Drivers currently have to figure this out on their own, and they are missing information to do it properly. The ones that did attempt to do it, do it wrong. So just pass in the suggested node directly to the alloc function. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 12:06:53 -06:00
Ming Lei	3d2936f457	block: only allocate/free mq_usage_counter in blk-mq The percpu counter is only used for blk-mq, so move its allocation and free inside blk-mq, and don't allocate it for legacy queue device. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 09:37:08 -06:00
Ming Lei	624dbe4754	blk-mq: avoid code duplication blk_mq_exit_hw_queues() and blk_mq_free_hw_queues() are introduced to avoid code duplication. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 09:37:06 -06:00
Ming Lei	1f9f07e917	blk-mq: fix leak of hctx->ctx_map hctx->ctx_map should have been freed inside blk_mq_free_queue(). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-27 08:34:45 -06:00
Fabian Frederick	35086784ca	block/blk-lib.c: make __blkdev_issue_zeroout static __blkdev_issue_zeroout is only used in blk-lib.c Cc: Jens Axboe <axboe@kernel.dk> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-26 17:39:09 -06:00
Christoph Hellwig	19c5d84f14	blk-mq: idle all hardware contexts before freeing a queue Without this we can leak the active_queues reference if a command is freed while it is considered active. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-26 17:21:51 -06:00
Jens Axboe	c22d9d8a60	blk-mq: allow setting of per-request timeouts Currently blk-mq uses the queue timeout for all requests. But for some commands, drivers may want to set a specific timeout for special requests. Allow this to be passed in through request->timeout, and use it if set. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-23 14:14:57 -06:00
Sam Bradshaw	edf866b380	blk-mq: export blk_mq_tag_busy_iter Export the blk-mq in-flight tag iterator for driver consumption. This is particularly useful in exception paths or SRSI where in-flight IOs need to be cancelled and/or reissued. The NVMe driver conversion will use this. Signed-off-by: Sam Bradshaw <sbradshaw@micron.com> Signed-off-by: Matias Bjørling <m@bjorling.me> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-23 13:30:16 -06:00
Jens Axboe	07068d5b8e	blk-mq: split make request handler for multi and single queue We want slightly different behavior from them: - On single queue devices, we currently use the per-process plug for deferred IO and for merging. - On multi queue devices, we don't use the per-process plug, but we want to go straight to hardware for SYNC IO. Split blk_mq_make_request() into a blk_sq_make_request() for single queue devices, and retain blk_mq_make_request() for multi queue devices. Then we don't need multiple checks for q->nr_hw_queues in the request mapping. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-22 10:43:07 -06:00
Jens Axboe	484b4061e6	blk-mq: save memory by freeing requests on unused hardware queues Depending on the topology of the machine and the number of queues exposed by a device, we can end up in a situation where some of the hardware queues are unused (as in, they don't map to any software queues). For this case, free up the memory used by the request map, as we will not use it. This can be a substantial amount of memory, depending on the number of queues vs CPUs and the queue depth of the device. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-21 14:01:15 -06:00
Jens Axboe	e814e71ba4	blk-mq: allow the hctx cpu hotplug notifier to return errors Prepare this for the next patch which adds more smarts in the plugging logic, so that we can save some memory. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-21 13:59:08 -06:00
Robert Elliott	da41a589f5	blk-mq: Micro-optimize blk_queue_nomerges() check In blk_mq_make_request(), do the blk_queue_nomerges() check outside the call to blk_attempt_plug_merge() to eliminate function call overhead when nomerges=2 (disabled) Signed-off-by: Robert Elliott <elliott@hp.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 15:49:03 -06:00
Jens Axboe	eba7176826	blk-mq: initialize q->nr_requests after calling blk_queue_make_request() blk_queue_make_requests() overwrites our set value for q->nr_requests, turning it into the default of 128. Set this appropriately after initializing queue values in blk_queue_make_request(). Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 15:17:27 -06:00
Jens Axboe	e3a2b3f931	blk-mq: allow changing of queue depth through sysfs For request_fn based devices, the block layer exports a 'nr_requests' file through sysfs to allow adjusting of queue depth on the fly. Currently this returns -EINVAL for blk-mq, since it's not wired up. Wire this up for blk-mq, so that it now also always dynamic adjustments of the allowed queue depth for any given block device managed by blk-mq. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-20 11:49:02 -06:00
Jens Axboe	719c555f44	block: move mm/bounce.c to block/ Continue moving some of the block files that are scattered around. bounce.c contains only code for bouncing the contents of a bio. It's block proper code, not mm code. Suggested-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 20:01:52 -06:00
Jens Axboe	39a9f97e5e	Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core Signed-off-by: Jens Axboe <axboe@fb.com> Conflicts: block/blk-mq-tag.c	2014-05-19 11:52:35 -06:00
Jens Axboe	1429d7c946	blk-mq: switch ctx pending map to the sparser blk_align_bitmap Each hardware queue has a bitmap of software queues with pending requests. When new IO is queued on a software queue, the bit is set, and when IO is pruned on a hardware queue run, the bit is cleared. This causes a lot of traffic. Switch this from the regular BITS_PER_LONG bitmap to a sparser layout, similarly to what was done for blk-mq tagging. 20% performance increase was observed for single threaded IO, and about 15% performanc increase on multiple threads driving the same device. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:47 -06:00
Jens Axboe	e93ecf602b	blk-mq: move the cache friendly bitmap type of out blk-mq-tag We will use it for the pending list in blk-mq core as well. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:47 -06:00
Jens Axboe	2667bcbbd5	block: move ioprio.c from fs/ to block/ Like commit `f9c78b2b`, move this block related file outside of fs/ and into the core block directory, block/. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 11:02:18 -06:00
Jens Axboe	f9c78b2be2	block: move bio.c and bio-integrity.c from fs/ to block/ They really belong in block/, especially now since it's not in drivers/block/ anymore. Additionally, the get_maintainer script gets it wrong when in fs/. Suggested-by: Christoph Hellwig <hch@infradead.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-19 08:34:46 -06:00
Tejun Heo	5c9d535b89	cgroup: remove css_parent() cgroup in general is moving towards using cgroup_subsys_state as the fundamental structural component and css_parent() was introduced to convert from using cgroup->parent to css->parent. It was quite some time ago and we're moving forward with making css more prominent. This patch drops the trivial wrapper css_parent() and let the users dereference css->parent. While at it, explicitly mark fields of css which are public and immutable. v2: New usage from device_cgroup.c converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Johannes Weiner <hannes@cmpxchg.org>	2014-05-16 13:22:48 -04:00
Jens Axboe	0d2602ca30	blk-mq: improve support for shared tags maps This adds support for active queue tracking, meaning that the blk-mq tagging maintains a count of active users of a tag set. This allows us to maintain a notion of fairness between users, so that we can distribute the tag depth evenly without starving some users while allowing others to try unfair deep queues. If sharing of a tag set is detected, each hardware queue will track the depth of its own queue. And if this exceeds the total depth divided by the number of active queues, the user is actively throttled down. The active queue count is done lazily to avoid bouncing that data between submitter and completer. Each hardware queue gets marked active when it allocates its first tag, and gets marked inactive when 1) the last tag is cleared, and 2) the queue timeout grace period has passed. Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-13 15:10:52 -06:00
Tejun Heo	451af504df	cgroup: replace cftype->write_string() with cftype->write() Convert all cftype->write_string() users to the new cftype->write() which maps directly to kernfs write operation and has full access to kernfs and cgroup contexts. The conversions are mostly mechanical. * @css and @cft are accessed using of_css() and of_cft() accessors respectively instead of being specified as arguments. * Should return @nbytes on success instead of 0. * @buf is not trimmed automatically. Trim if necessary. Note that blkcg and netprio don't need this as the parsers already handle whitespaces. cftype->write_string() has no user left after the conversions and removed. While at it, remove unnecessary local variable @p in cgroup_subtree_control_write() and stale comment about CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c. This patch doesn't introduce any visible behavior changes. v2: netprio was missing from conversion. Converted. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Aristeu Rozanski <arozansk@redhat.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: "David S. Miller" <davem@davemloft.net>	2014-05-13 12:16:21 -04:00
Tejun Heo	ec903c0c85	cgroup: rename css_tryget() to css_tryget_online() Unlike the more usual refcnting, what css_tryget() provides is the distinction between online and offline csses instead of protection against upping a refcnt which already reached zero. cgroup is planning to provide actual tryget which fails if the refcnt already reached zero. Let's rename the existing trygets so that they clearly indicate that they're onliness. I thought about keeping the existing names as-are and introducing new names for the planned actual tryget; however, given that each controller participates in the synchronization of the online state, it seems worthwhile to make it explicit that these functions are about on/offline state. Rename css_tryget() to css_tryget_online() and css_tryget_from_dir() to css_tryget_online_from_dir(). This is pure rename. v2: cgroup_freezer grew new usages of css_tryget(). Update accordingly. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>	2014-05-13 12:11:01 -04:00
Jens Axboe	acb12e0a9c	Merge branch 'for-3.16/blk-mq-tagging' into for-3.16/core	2014-05-10 15:44:42 -06:00
Ming Lei	1f236ab22c	blk-mq: bitmap tag: cleanup blk_mq_init_tags Both nr_cache and nr_tags arn't needed for bitmap tag anymore. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:44:00 -06:00
Ming Lei	9d3d21aeb4	blk-mq: bitmap tag: select random tag betweet 0 and (depth - 1) The selected tag should be selected at random between 0 and (depth - 1) with probability 1/depth, instead between 0 and (depth - 2) with probability 1/(depth - 1). Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>	2014-05-10 15:43:14 -06:00

... 3 4 5 6 7 ...

2607 Commits