linux

iv/linux

Author	SHA1	Message	Date
Ming Lei	d18d91740a	block: introduce bio_for_each_bvec() and rq_for_each_bvec() bio_for_each_bvec() is used for iterating over multi-page bvec for bio split & merge code. rq_for_each_bvec() can be used for drivers which may handle the multi-page bvec directly, so far loop is one perfect use case. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:10 -07:00
Ming Lei	3d75ca0ade	block: introduce multi-page bvec helpers This patch introduces helpers of 'mp_bvec_iter_' for multi-page bvec support. The introduced helpers treate one bvec as real multi-page segment, which may include more than one pages. The existed helpers of bvec_iter_ are interfaces for supporting current bvec iterator which is thought as single-page by drivers, fs, dm and etc. These introduced helpers will build single-page bvec in flight, so this way won't break current bio/bvec users, which needn't any change. Follows some multi-page bvec background: - bvecs stored in bio->bi_io_vec is always multi-page style - bvec(struct bio_vec) represents one physically contiguous I/O buffer, now the buffer may include more than one page after multi-page bvec is supported, and all these pages represented by one bvec is physically contiguous. Before multi-page bvec support, at most one page is included in one bvec, we call it single-page bvec. - .bv_page of the bvec points to the 1st page in the multi-page bvec - .bv_offset of the bvec is the offset of the buffer in the bvec The effect on the current drivers/filesystem/dm/bcache/...: - almost everyone supposes that one bvec only includes one single page, so we keep the sp interface not changed, for example, bio_for_each_segment() still returns single-page bvec - bio_for_each_segment_all() will return single-page bvec too - during iterating, iterator variable(struct bvec_iter) is always updated in multi-page bvec style, and bvec_iter_advance() is kept not changed - returned(copied) single-page bvec is built in flight by bvec helpers from the stored multi-page bvec Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:10 -07:00
Ming Lei	19d62f6d00	block: remove bvec_iter_rewind() Commit `7759eb23fd` ("block: remove bio_rewind_iter()") removes bio_rewind_iter(), then no one uses bvec_iter_rewind() any more, so remove it. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:10 -07:00
Ming Lei	1a67356e9a	block: don't use bio->bi_vcnt to figure out segment number It is wrong to use bio->bi_vcnt to figure out how many segments there are in the bio even though CLONED flag isn't set on this bio, because this bio may be splitted or advanced. So always use bio_segments() in blk_recount_segments(), and it shouldn't cause any performance loss now because the physical segment number is figured out in blk_queue_split() and BIO_SEG_VALID is set meantime since `bdced438ac` ("block: setup bi_phys_segments after splitting"). Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Fixes: `76d8137a31` ("blk-merge: recaculate segment if it isn't less than max segments") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:10 -07:00
Christoph Hellwig	8a2ee44a37	btrfs: look at bi_size for repair decisions bio_readpage_error currently uses bi_vcnt to decide if it is worth retrying an I/O. But the vector count is mostly an implementation artifact - it really should figure out if there is more than a single sector worth retrying. Use bi_size for that and shift by PAGE_SHIFT. This really should be blocks/sectors, but given that btrfs doesn't support a sector size different from the PAGE_SIZE using the page size keeps the changes to a minimum. Reviewed-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-15 08:40:10 -07:00
Aleksei Zakharov	fbd72127c9	block: avoid setting none scheduler if it's already none There's no reason to freeze queue and remove scheduler if there's no scheduler already. Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:21:40 -07:00
Aleksei Zakharov	b7143fe67b	block: avoid setting wbt_lat_usec to current value There's no reason to set wbt min lat and freeze request queue if current value is the same. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Aleksei Zakharov <zakharov.a.g@yandex.ru> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:20:14 -07:00
Heiner Litz	0586942f03	lightnvm: pblk: fix race condition on GC This patch fixes a race condition where a write is mapped to the last sectors of a line. The write is synced to the device but the L2P is not updated yet. When the line is garbage collected before the L2P update is performed, the sectors are ignored by the GC logic and the line is freed before all sectors are moved. When the L2P is finally updated, it contains a mapping to a freed line, subsequent reads of the corresponding LBAs fail. This patch introduces a per line counter specifying the number of sectors that are synced to the device but have not been updated in the L2P. Lines with a counter of greater than zero will not be selected for GC. Signed-off-by: Heiner Litz <hlitz@ucsc.edu> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:18:08 -07:00
Javier González	b4cdc4260e	lightnvm: pblk: prevent stall due to wb threshold In order to respect mw_cuinits, pblk's write buffer maintains a backpointer to protect data not yet persisted; when writing to the write buffer, this backpointer defines a threshold that pblk's rate-limiter enforces. On small PU configurations, the following scenarios might take place: (i) the threshold is larger than the write buffer and (ii) the threshold is smaller than the write buffer, but larger than the maximun allowed split bio - 256KB at this moment (Note that writes are not always split - we only do this when we the size of the buffer is smaller than the buffer). In both cases, pblk's rate-limiter prevents the I/O to be written to the buffer, thus stalling. This patch fixes the original backpointer implementation by considering the threshold both on buffer creation and on the rate-limiters path, when bio_split is triggered (case (ii) above). Fixes: `766c8ceb16` ("lightnvm: pblk: guarantee that backpointer is respected on writer stall") Signed-off-by: Javier González <javier@javigon.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:18:08 -07:00
Hans Holmberg	aa8759d80a	lightnvm: pblk: extend line wp balance check pblk stripes writes of minimal write size across all non-offline chunks in a line, which means that the maximum write pointer delta should not exceed the minimal write size. Extend the line write pointer balance check to cover this case, and ignore the offline chunk wps. This will render us a warning during recovery if something unexpected has happened to the chunk write pointers (i.e. powerloss, a spurious chunk reset, ..). Reported-by: Zhoujie Wu <zjwu@marvell.com> Tested-by: Zhoujie Wu <zjwu@marvell.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:18:07 -07:00
Masahiro Yamada	b7fce8f79d	lightnvm: pblk: fix TRACE_INCLUDE_PATH As the comment block in include/trace/define_trace.h says, TRACE_INCLUDE_PATH should be a relative path to the define_trace.h ../../drivers/lightnvm is the correct relative path. ../../../drivers/lightnvm is working by coincidence because the top Makefile adds -I$(srctree)/arch/$(SRCARCH)/include as a header search path, but we should not rely on it. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:18:07 -07:00
Andy Shevchenko	7e0a0847ed	lightnvm: pblk: Switch to use new generic UUID API There are new types and helpers that are supposed to be used in new code. As a preparation to get rid of legacy types and API functions do the conversion here. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:18:07 -07:00
Andy Shevchenko	e74ecf63ef	lightnvm: Use u64 instead of __le64 for CPU visible side Sparse complains about using strict data types: drivers/lightnvm/pblk-read.c:254:43: warning: incorrect type in assignment (different base types) drivers/lightnvm/pblk-read.c:254:43: expected restricted __le64 <noident> drivers/lightnvm/pblk-read.c:254:43: got unsigned long long [unsigned] [usertype] <noident> drivers/lightnvm/pblk-read.c:255:29: warning: cast from restricted __le64 drivers/lightnvm/pblk-read.c:268:29: warning: cast from restricted __le64 drivers/lightnvm/pblk-read.c:328:41: warning: incorrect type in assignment (different base types) drivers/lightnvm/pblk-read.c:328:41: expected restricted __le64 <noident> drivers/lightnvm/pblk-read.c:328:41: got unsigned long long [unsigned] [usertype] <noident> In the code it seems explicit that lba_list_mem and lba_list_media members of struct pblk_pr_ctx are used on CPU side, which means they should not be of strict types. Change types of lba_list_mem and lba_list_media members to be u64. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:18:07 -07:00
Hans Holmberg	6916cf5426	lightnvm: pblk: use vfree to free metadata on error path As chunk metadata is allocated using vmalloc, we need to free it using vfree. Fixes: `090ee26fd5` ("lightnvm: use internal allocation for chunk log page") Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:18:07 -07:00
Hans Holmberg	f9324980d7	lightnvm: pblk: stop taking the free lock in in pblk_lines_free pblk_line_meta_free might sleep (it can end up calling vfree, depending on how we allocate lba lists), and this can lead to a BUG() if we wake up on a different cpu and release the lock. As there is no point of grabbing the free lock when pblk has shut down, remove the lock. Signed-off-by: Hans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: Matias Bjørling <mb@lightnvm.io> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-11 08:18:07 -07:00
Linus Torvalds	d13937116f	Linux 5.0-rc6	2019-02-10 14:42:20 -08:00
Linus Torvalds	68d94a8424	dmaengine-fix-5.0-rc6 dmaengine fixes for v5.0-rc6 - Fix in at_xdmac fr wrongful channel state - Fix for imx driver for wrong callback invocation - Fix to bcm driver for interrupt race & transaction abort. - Fix in dmatest to abort in mapping error -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJcYD8SAAoJEHwUBw8lI4NHqc0P/At5/XRhS60fnsibTEnGH3ee I3TQY3DrGXHXUAq8LqVIjaql7NUQ+I07bRtP6JMX9GYrTguUVVDPau6E4CtWt571 +TnKXQefgY/UlP9f5K127Dal5YNaYsBZqbEvhE2xgYKrsFStiIYDFIPGFeIE841E J0cm8mjM+dQxZOXCCf/vSxA9h2Aqj7Fa4RRY3Sj4r/CWahaKpZt+OM6bdDpmUxmq PzDLSE4DdHLWZSTcvAFQQtwlMc4Dq8a7cs2MmCvLC2jwGq0IgLr0bfzj6pZOqcuN O05TNcyhQHZ5YRnBHTgJUSgR2abhKuNnBWN/pA1Tbsn7+OISsvt4erjITAFv6fwR mkWgyAzH3tZ1PnO1YO43KEfy/nrT6Veg1xePTzWaaYAsPJ9FJeQJuZ5YxPysOLcn lux2/B19HeWVb02vredn5eUXH7b8C0WQhfEax4hgbbPcaDX+y3WFN0n2gxOAFQ8H xloKu511Iuu0Uteov4wys6LucqX61vPPank9Ma2N7RLdZ4x2Ya8gDjG/kfQOFHDa jRbbjzw8Q5tuU2++GXwSpWq55LOKYnaXQbUIRWxrrdwAJahKAQFW++KR2ehB1kvS pCaIWwpSHXmCbat/A1JfcYMagAtw6QSgbJlNwru1B7OYujFMLml+GXtqeA/kJ8yo 3+oF8PB0xA+++o83USaa =NTGY -----END PGP SIGNATURE----- Merge tag 'dmaengine-fix-5.0-rc6' of git://git.infradead.org/users/vkoul/slave-dma Pull dmaengine fixes from Vinod Koul: - Fix in at_xdmac fr wrongful channel state - Fix for imx driver for wrong callback invocation - Fix to bcm driver for interrupt race & transaction abort. - Fix in dmatest to abort in mapping error * tag 'dmaengine-fix-5.0-rc6' of git://git.infradead.org/users/vkoul/slave-dma: dmaengine: dmatest: Abort test in case of mapping error dmaengine: bcm2835: Fix abort of transactions dmaengine: bcm2835: Fix interrupt race on RT dmaengine: imx-dma: fix wrong callback invoke dmaengine: at_xdmac: Fix wrongfull report of a channel as in use	2019-02-10 10:39:37 -08:00
Linus Torvalds	aadaa80611	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "A handful of fixes: - Fix an MCE corner case bug/crash found via MCE injection testing - Fix 5-level paging boot crash - Fix MCE recovery cache invalidation bug - Fix regression on Xen guests caused by a recent PMD level mremap speedup optimization" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Make set_pmd_at() paravirt aware x86/mm/cpa: Fix set_mce_nospec() x86/boot/compressed/64: Do not corrupt EDX on EFER.LME=1 setting x86/MCE: Initialize mce.bank in the case of a fatal error in mce_no_way_out()	2019-02-10 09:57:42 -08:00
Linus Torvalds	73a4c52184	Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fixes from Ingo Molnar: "irqchip driver fixes: most of them are race fixes for ARM GIC (General Interrupt Controller) variants, but also a fix for the ARM MMP (Marvell PXA168 et al) irqchip affecting OLPC keyboards" * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: irqchip/gic-v3-its: Fix ITT_entry_size accessor irqchip/mmp: Only touch the PJ4 IRQ & FIQ bits on enable/disable irqchip/gic-v3-its: Gracefully fail on LPI exhaustion irqchip/gic-v3-its: Plug allocation race for devices sharing a DevID irqchip/gic-v4: Fix occasional VLPI drop	2019-02-10 09:54:19 -08:00
Linus Torvalds	212146f080	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "A couple of kernel side fixes: - Fix the Intel uncore driver on certain hardware configurations - Fix a CPU hotplug related memory allocation bug - Remove a spurious WARN() ... plus also a handful of perf tooling fixes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf script python: Add Python3 support to tests/attr.py perf trace: Support multiple "vfs_getname" probes perf symbols: Filter out hidden symbols from labels perf symbols: Add fallback definitions for GELF_ST_VISIBILITY() tools headers uapi: Sync linux/in.h copy from the kernel sources perf clang: Do not use 'return std::move(something)' perf mem/c2c: Fix perf_mem_events to support powerpc perf tests evsel-tp-sched: Fix bitwise operator perf/core: Don't WARN() for impossible ring-buffer sizes perf/x86/intel: Delay memory deallocation until x86_pmu_dead_cpu() perf/x86/intel/uncore: Add Node ID mask	2019-02-10 09:48:18 -08:00
Linus Torvalds	d2a6aae99f	Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fixes from Ingo Molnar: "An rtmutex (PI-futex) deadlock scenario fix, plus a locking documentation fix" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: futex: Handle early deadlock return correctly futex: Fix barrier comment	2019-02-10 09:44:52 -08:00
Marcos Paulo de Souza	1e93642837	blk-sysfs: Rework documention of __blk_release_queue The Notes section of the comment was removed, because now blk_release_queue can only be executed from blk_cleanup_queue (being called when the q->kobj reaches zero), and also blk_init_queue was removed in `a1ce35fa49`. Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-10 10:23:29 -07:00
Marcos Paulo de Souza	7585d5082e	blk-cgroup: Fix doc related to blkcg_exit_queue Since `4cf6324b17`, a portion of function blk_cleanup_queue was moved to a newly created function called blk_exit_queue, including the call of blkcg_exit_queue. So, adjust the documenation according. Reviewed-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-10 08:24:08 -07:00
Juergen Gross	20e55bc17d	x86/mm: Make set_pmd_at() paravirt aware set_pmd_at() calls native_set_pmd() unconditionally on x86. This was fine as long as only huge page entries were written via set_pmd_at(), as Xen pv guests don't support those. Commit `2c91bd4a4e` ("mm: speed up mremap by 20x on large regions") introduced a usage of set_pmd_at() possible on pv guests, leading to failures like: BUG: unable to handle kernel paging request at ffff888023e26778 #PF error: [PROT] [WRITE] RIP: e030:move_page_tables+0x7c1/0xae0 move_vma.isra.3+0xd1/0x2d0 __se_sys_mremap+0x3c6/0x5b0 do_syscall_64+0x49/0x100 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Make set_pmd_at() paravirt aware by just letting it use set_pmd(). Fixes: `2c91bd4a4e` ("mm: speed up mremap by 20x on large regions") Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: boris.ostrovsky@oracle.com Cc: sstabellini@kernel.org Cc: hpa@zytor.com Cc: bp@alien8.de Cc: torvalds@linux-foundation.org Link: https://lkml.kernel.org/r/20190210074056.11842-1-jgross@suse.com	2019-02-10 08:47:12 +01:00
Jens Axboe	eca7abf31a	block: queue flag cleanup We have QUEUE_FLAG_DEFAULT defined, but it's not used anymore since the legacy IO stack is gone. Kill it. Sanitize the queue flags in general, they use spaces (for some reason), and the space is pretty sparse. With the flags renumbered, we can more clearly see how many we have available. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 15:42:07 -07:00
Jens Axboe	d11a399898	block: kill QUEUE_FLAG_FLUSH_NQ We have various helpers for setting/clearing this flag, and also a helper to check if the queue supports queueable flushes or not. But nobody uses them anymore, kill it with fire. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 15:40:24 -07:00
Linus Torvalds	df3865f8f5	Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "One PM related driver bugfix and a MAINTAINERS update" * 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: MAINTAINERS: Update the ocores i2c bus driver maintainer, etc i2c: omap: Use noirq system sleep pm ops to idle device for suspend	2019-02-09 13:43:12 -08:00
Linus Torvalds	e8b50608f6	A batch of MIPS fixes for 5.0, nothing too scary. - A workaround for a Loongson 3 CPU bug is the biggest change, but still fairly straightforward. It adds extra memory barriers (sync instructions) around atomics to avoid a CPU bug that can break atomicity. - Loongson64 also sees a fix for powering off some systems which would incorrectly reboot rather than waiting for the power down sequence to complete. - We have DT fixes for the Ingenic JZ4740 SoC & the JZ4780-based Ci20 board, and a DT warning fix for the Nexsys4/MIPSfpga board. - The Cavium Octeon platform sees a further fix to the behaviour of the pcie_disable command line argument that was introduced in v3.3. - The VDSO, introduced in v4.4, sees build fixes for configurations of GCC that were built using the --with-fp-32= flag to specify a default 32-bit floating point ABI. - get_frame_info() sees a fix for configurations with CONFIG_KALLSYMS=n, for which it previously always returned an error. - If the MIPS Coherence Manager (CM) reports an error then we'll now clear that error correctly so that the GCR_ERROR_CAUSE register will be updated with information about any future errors. -----BEGIN PGP SIGNATURE----- iIsEABYIADMWIQRgLjeFAZEXQzy86/s+p5+stXUA3QUCXF8rDhUccGF1bC5idXJ0 b25AbWlwcy5jb20ACgkQPqefrLV1AN0PngEAyEGqu+OyUjLGZApcM49LDMcr5RHq V6ED2HApfNbxsAYBAN+Uv6fm4H1LAbU+sIoHUp7DGQ1f1oFsbij4V/bqP4IA =nv8r -----END PGP SIGNATURE----- Merge tag 'mips_fixes_5.0_3' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux Pull MIPS fixes from Paul Burton: "A batch of MIPS fixes for 5.0, nothing too scary. - A workaround for a Loongson 3 CPU bug is the biggest change, but still fairly straightforward. It adds extra memory barriers (sync instructions) around atomics to avoid a CPU bug that can break atomicity. - Loongson64 also sees a fix for powering off some systems which would incorrectly reboot rather than waiting for the power down sequence to complete. - We have DT fixes for the Ingenic JZ4740 SoC & the JZ4780-based Ci20 board, and a DT warning fix for the Nexsys4/MIPSfpga board. - The Cavium Octeon platform sees a further fix to the behaviour of the pcie_disable command line argument that was introduced in v3.3. - The VDSO, introduced in v4.4, sees build fixes for configurations of GCC that were built using the --with-fp-32= flag to specify a default 32-bit floating point ABI. - get_frame_info() sees a fix for configurations with CONFIG_KALLSYMS=n, for which it previously always returned an error. - If the MIPS Coherence Manager (CM) reports an error then we'll now clear that error correctly so that the GCR_ERROR_CAUSE register will be updated with information about any future errors" * tag 'mips_fixes_5.0_3' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux: mips: cm: reprime error cause mips: loongson64: remove unreachable(), fix loongson_poweroff(). MIPS: Remove function size check in get_frame_info() MIPS: Use lower case for addresses in nexys4ddr.dts MIPS: Loongson: Introduce and use loongson_llsc_mb() MIPS: VDSO: Include $(ccflags-vdso) in o32,n32 .lds builds MIPS: VDSO: Use same -m%-float cflag as the kernel proper MIPS: OCTEON: don't set octeon_dma_bar_type if PCI is disabled DTS: CI20: Fix bugs in ci20's device tree. MIPS: DTS: jz4740: Correct interrupt number of DMA core	2019-02-09 12:41:14 -08:00
Linus Torvalds	e5a8a11632	for-linus-20190209 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlxfARoQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjsgEACP8vQzbvsOZOxHKi9Vcd8ziwyjyBebNh4F cKOx2Blgv0ReVAqLOVp9VJOJQoVQumV1btaA2YrmevxnCMpNUBpbP6G02tAqe9Z+ D75FSpZXy4UvcMSlhfc/iB/RMI06benI9LnuL7zbzIQtrbtu+OFRnO6fpQOVGLxT Qa1wt/Rgahc48L4aHnIgPn0nyBRsEvuhC6FjI2D8akDaNiaHzwtGbpx7yDdmLNml fCzC2uSRJ31bXsO/5/fJorinaJ56r5N8aHaINYwXDv8zd8i94nQZhITAasXub1Km 0nyuAg/fSzIdkrGmPINTKFaGYsOfRwpS4C4vagreBhzjfolPY0z9sQEQ63gZzDrd mAjHPxLTd165OLlR/RxoMC8AjZCZ0/YQaucxUOPkaIHfth5/dy5BFaCkWyA/I7/Z VnAyq0SqeL4hgIOGxZM0HeehKx+palNdJNZTcY7vF/7MVPuh5WM6z/FWsFa8k+ss B9YN4wchh7I8EVbLmfz9s/eqabRWF3Agh1dE+yAKwt1KIWHaMXWZTRQnj/69fs2e s3pwVMiiSz6K/Xnoe12nmQ4K0XeyKNROO78IIGY/Oa0Pe/hzCAaJMRMDsLp5EcJj dxpoi1OfGHMGoqYhL6tx6Atq5f6CMDrS28k/D44DHfO7T1qQGVy1A9SY7ZCfM5+c HKxTuRh8mg== =tuL6 -----END PGP SIGNATURE----- Merge tag 'for-linus-20190209' of git://git.kernel.dk/linux-block Pull block fixes from Jens Axboe: - NVMe pull request from Christoph, fixing namespace locking when dealing with the effects log, and a rapid add/remove issue (Keith) - blktrace tweak, ensuring requests with -1 sectors are shown (Jan) - link power management quirk for a Smasung SSD (Hans) - m68k nfblock dynamic major number fix (Chengguang) - series fixing blk-iolatency inflight counter issue (Liu) - ensure that we clear ->private when setting up the aio kiocb (Mike) - __find_get_block_slow() rate limit print (Tetsuo) * tag 'for-linus-20190209' of git://git.kernel.dk/linux-block: blk-mq: remove duplicated definition of blk_mq_freeze_queue Blk-iolatency: warn on negative inflight IO counter blk-iolatency: fix IO hang due to negative inflight counter blktrace: Show requests without sector fs: ratelimit __find_get_block_slow() failure message. m68k: set proper major_num when specifying module param major_num libata: Add NOLPM quirk for SAMSUNG MZ7TE512HMHP-000L1 SSD nvme-pci: fix rapid add remove sequence nvme: lock NS list changes while handling command effects aio: initialize kiocb private in case any filesystems expect it.	2019-02-09 10:26:09 -08:00
Linus Torvalds	5610789ad0	- Fix a problem with the imx28 ECC engine - Remove a debug trace introduced in `2b6f0090a3` ("mtd: Check add_mtd_device() ret code") - Make sure partitions of size 0 can be registered - Fix kernel-doc warning in the rawnand core - Fix the error path of spinand_init() (missing manufacturer cleanup in a few places) - Address a problem with the SPI NAND PROGRAM LOAD operation which does not work as expected on some parts. -----BEGIN PGP SIGNATURE----- iQIcBAABAgAGBQJcXqZKAAoJEGXtNgF+CLcA01sQAJSSf/dnaie76vt9icUezM7b ulHdnGJEDazEYj5aIbFbT8S8TMpPA8FRR1zOsW1DbNA+khw82p8n2wykFzl2IYHh /qWzWcNGhr/s1tTwdS4lpR+lOyKJ3izk+AkL/vYwK1XlHu8uR+1/uazEiE/qNSKO grq+jIy5z5n8iVyhd6vjKumbkmWQlueTLVLGWyQeFIlSxYXyig1x3xhcvCcSBmGL +LpO+vB8T7G60nOAN5MTM0SDEoeg3R6clKeWWbVfZ5tkFoD8SvXagn3fYHIA6lDA pqGWUk6/HmUxec3DxLYZqfoaqi9b3u182H5sY6gOKApPF0HiBGJslo/tiplfnZVe 3cKm2Mf+LJ4fkqdmCT/UcSm6GuP19CF0rSWPW9Xdq2jgOr5RJ5FlUWAucqM0QPmp q3LfQIiwegHh0PQ2dRPAFG1J6btnCxX+pnitNwskbU9XbnDvpPWnaUHcg/qXwhTt nImdqBr401d+bv9NgiPSgTQPltmSPZ1t+SI8AE6G1tdzarB2KQ0tOQL+CCMftGZ0 /+eEY1+U05JrWzJhn7VnqszkNy9kvcszWqVkxomw+ZWQiiN7Nboaadzh/TJ8Sxog lOg3EGU3BWBSG3gD/G6v4DuuK9RpKuIzxBP/7ROJJsJRM9N+I4iI1babPdTVbUc0 MX70tM0a9R5/y0EscaK0 =QL6/ -----END PGP SIGNATURE----- Merge tag 'mtd/fixes-for-5.0-rc6' of git://git.infradead.org/linux-mtd Pull mtd fixes from Boris Brezillon: - Fix a problem with the imx28 ECC engine - Remove a debug trace introduced in `2b6f0090a3` ("mtd: Check add_mtd_device() ret code") - Make sure partitions of size 0 can be registered - Fix kernel-doc warning in the rawnand core - Fix the error path of spinand_init() (missing manufacturer cleanup in a few places) - Address a problem with the SPI NAND PROGRAM LOAD operation which does not work as expected on some parts. * tag 'mtd/fixes-for-5.0-rc6' of git://git.infradead.org/linux-mtd: mtd: rawnand: gpmi: fix MX28 bus master lockup problem mtd: Make sure mtd->erasesize is valid even if the partition is of size 0 mtd: Remove a debug trace in mtdpart.c mtd: rawnand: fix kernel-doc warnings mtd: spinand: Fix the error/cleanup path in spinand_init() mtd: spinand: Handle the case where PROGRAM LOAD does not reset the cache	2019-02-09 10:17:01 -08:00
Linus Torvalds	3e5e692fcd	xen: fixes for 5.0-rc6 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCXF7kVAAKCRCAXGG7T9hj vqnqAP9eokv+M7IIhKMyOPg9L1H8pspaFvtvI3VDckJOv1yCqQD/VFi0aITkGdMU 0W7LCawec41dRwCMQ2bg6XB16ArfLww= =Stew -----END PGP SIGNATURE----- Merge tag 'for-linus-5.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Two very minor fixes: one remove of a #include for an unused header and a fix of the xen ML address in MAINTAINERS" * tag 'for-linus-5.0-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: MAINTAINERS: unify reference to xen-devel list arch/arm/xen: Remove duplicate header	2019-02-09 09:44:08 -08:00
Coly Li	dc7292a5bc	bcache: use (REQ_META\|REQ_PRIO) to indicate bio for metadata In 'commit `752f66a75a` ("bcache: use REQ_PRIO to indicate bio for metadata")' REQ_META is replaced by REQ_PRIO to indicate metadata bio. This assumption is not always correct, e.g. XFS uses REQ_META to mark metadata bio other than REQ_PRIO. This is why Nix noticed that bcache does not cache metadata for XFS after the above commit. Thanks to Dave Chinner, he explains the difference between REQ_META and REQ_PRIO from view of file system developer. Here I quote part of his explanation from mailing list, REQ_META is used for metadata. REQ_PRIO is used to communicate to the lower layers that the submitter considers this IO to be more important that non REQ_PRIO IO and so dispatch should be expedited. IOWs, if the filesystem considers metadata IO to be more important that user data IO, then it will use REQ_PRIO \| REQ_META rather than just REQ_META. Then it seems bios with REQ_META or REQ_PRIO should both be cached for performance optimation, because they are all probably low I/O latency demand by upper layer (e.g. file system). So in this patch, when we want to decide whether to bypass the cache, REQ_META and REQ_PRIO are both checked. Then both metadata and high priority I/O requests will be handled properly. Reported-by: Nix <nix@esperi.org.uk> Signed-off-by: Coly Li <colyli@suse.de> Reviewed-by: Andre Noll <maan@tuebingen.mpg.de> Tested-by: Nix <nix@esperi.org.uk> Cc: stable@vger.kernel.org Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:33 -07:00
Coly Li	a91fbda49f	bcache: fix input overflow to cache set sysfs file io_error_halflife Cache set sysfs entry io_error_halflife is used to set c->error_decay. c->error_decay is in type unsigned int, and it is converted by strtoul_or_return(), therefore overflow to c->error_decay is possible for a large input value. This patch fixes the overflow by using strtoul_safe_clamp() to convert input string to an unsigned long value in range [0, UINT_MAX], then divides by 88 and set it to c->error_decay. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:33 -07:00
Coly Li	b15008403b	bcache: fix input overflow to cache set io_error_limit c->error_limit is in type unsigned int, it is set via cache set sysfs file io_error_limit. Inside the bcache code, input string is converted by strtoul_or_return() and set the converted value to c->error_limit. Because the converted value is unsigned long, and c->error_limit is unsigned int, if the input is large enought, overflow will happen to c->error_limit. This patch uses sysfs_strtoul_clamp() to convert input string, and set the range in [0, UINT_MAX] to avoid the potential overflow. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	453745fbbe	bcache: fix input overflow to journal_delay_ms c->journal_delay_ms is in type unsigned short, it is set via sysfs interface and converted by sysfs_strtoul() from input string to unsigned short value. Therefore overflow to unsigned short might be happen when the converted value exceed USHRT_MAX. e.g. writing 65536 into sysfs file journal_delay_ms, c->journal_delay_ms is set to 0. This patch uses sysfs_strtoul_clamp() to convert the input string and limit value range in [0, USHRT_MAX], to avoid the input overflow. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	dab71b2db9	bcache: fix input overflow to writeback_rate_minimum dc->writeback_rate_minimum is type unsigned integer variable, it is set via sysfs interface, and converte from input string to unsigned integer by d_strtoul_nonzero(). When the converted input value is larger than UINT_MAX, overflow to unsigned integer happens. This patch fixes the overflow by using sysfs_strotoul_clamp() to convert input string and limit the value in range [1, UINT_MAX], then the overflow can be avoided. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	5b5fd3c94e	bcache: fix potential div-zero error of writeback_rate_p_term_inverse Current code already uses d_strtoul_nonzero() to convert input string to an unsigned integer, to make sure writeback_rate_p_term_inverse won't be zero value. But overflow may happen when converting input string to an unsigned integer value by d_strtoul_nonzero(), then dc->writeback_rate_p_term_inverse can still be set to 0 even if the sysfs file input value is not zero, e.g. 4294967296 (a.k.a UINT_MAX+1). If dc->writeback_rate_p_term_inverse is set to 0, it might cause a dev-zero error in following code from __update_writeback_rate(), int64_t proportional_scaled = div_s64(error, dc->writeback_rate_p_term_inverse); This patch replaces d_strtoul_nonzero() by sysfs_strtoul_clamp() and limit the value range in [1, UINT_MAX]. Then the unsigned integer overflow and dev-zero error can be avoided. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	c3b75a2199	bcache: fix potential div-zero error of writeback_rate_i_term_inverse dc->writeback_rate_i_term_inverse can be set via sysfs interface. It is in type unsigned int, and convert from input string by d_strtoul(). The problem is d_strtoul() does not check valid range of the input, if 4294967296 is written into sysfs file writeback_rate_i_term_inverse, an overflow of unsigned integer will happen and value 0 is set to dc->writeback_rate_i_term_inverse. In writeback.c:__update_writeback_rate(), there are following lines of code, integral_scaled = div_s64(dc->writeback_rate_integral, dc->writeback_rate_i_term_inverse); If dc->writeback_rate_i_term_inverse is set to 0 via sysfs interface, a div-zero error might be triggered in the above code. Therefore we need to add a range limitation in the sysfs interface, this is what this patch does, use sysfs_stroul_clamp() to replace d_strtoul() and restrict the input range in [1, UINT_MAX]. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	369d21a73a	bcache: fix input overflow to writeback_delay Sysfs file writeback_delay is used to configure dc->writeback_delay which is type unsigned int. But bcache code uses sysfs_strtoul() to convert the input string, therefore it might be overflowed if the input value is too large. E.g. input value is 4294967296 but indeed 0 is set to dc->writeback_delay. This patch uses sysfs_strtoul_clamp() to convert the input string and set the result value range in [0, UINT_MAX] to avoid such unsigned integer overflow. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	f5c0b95d2e	bcache: use sysfs_strtoul_bool() to set bit-field variables When setting bcache parameters via sysfs, there are some variables are defined as bit-field value. Current bcache code in sysfs.c uses either d_strtoul() or sysfs_strtoul() to convert the input string to unsigned integer value and set it to the corresponded bit-field value. The problem is, the bit-field value only takes the lowest bit of the converted value. If input is 2, the expected value (like bool value) of the bit-field value should be 1, but indeed it is 0. The following sysfs files for bit-field variables have such problem, bypass_torture_test, for dc->bypass_torture_test writeback_metadata, for dc->writeback_metadata writeback_running, for dc->writeback_running verify, for c->verify key_merging_disabled, for c->key_merging_disabled gc_always_rewrite, for c->gc_always_rewrite btree_shrinker_disabled,for c->shrinker_disabled copy_gc_enabled, for c->copy_gc_enabled This patch uses sysfs_strtoul_bool() to set such bit-field variables, then if the converted value is non-zero, the bit-field variables will be set to 1, like setting a bool value like expensive_debug_checks. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	e4db37fb69	bcache: add sysfs_strtoul_bool() for setting bit-field variables When setting bool values via sysfs interface, e.g. writeback_metadata, if writing 1 into writeback_metadata file, dc->writeback_metadata is set to 1, but if writing 2 into the file, dc->writeback_metadata is 0. This is misleading, a better result should be 1 for all non-zero input value. It is because dc->writeback_metadata is a bit-field variable, and current code simply use d_strtoul() to convert a string into integer and takes the lowest bit value. To fix such error, we need a routine to convert the input string into unsigned integer, and set target variable to 1 if the converted integer is non-zero. This patch introduces a new macro called sysfs_strtoul_bool(), it can be used to convert input string into bool value, we can use it to set bool value for bit-field vairables. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	8c27a3953e	bcache: fix input overflow to sequential_cutoff People may set sequential_cutoff of a cached device via sysfs file, but current code does not check input value overflow. E.g. if value 4294967295 (UINT_MAX) is written to file sequential_cutoff, its value is 4GB, but if 4294967296 (UINT_MAX + 1) is written into, its value will be 0. This is an unexpected behavior. This patch replaces d_strtoi_h() by sysfs_strtoul_clamp() to convert input string to unsigned integer value, and limit its range in [0, UINT_MAX]. Then the input overflow can be fixed. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:32 -07:00
Coly Li	f54478c6e2	bcache: fix input integer overflow of congested threshold Cache set congested threshold values congested_read_threshold_us and congested_write_threshold_us can be set via sysfs interface. These two values are 'unsigned int' type, but sysfs interface uses strtoul to convert input string. So if people input a large number like 9999999999, the value indeed set is 1410065407, which is not expected behavior. This patch replaces sysfs_strtoul() by sysfs_strtoul_clamp() when convert input string to unsigned int value, and set value range in [0, UINT_MAX], to avoid the above integer overflow errors. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:31 -07:00
Coly Li	596b5a5dd1	bcache: improve sysfs_strtoul_clamp() Currently sysfs_strtoul_clamp() is defined as, 82 #define sysfs_strtoul_clamp(file, var, min, max) \ 83 do { \ 84 if (attr == &sysfs_ ## file) \ 85 return strtoul_safe_clamp(buf, var, min, max) \ 86 ?: (ssize_t) size; \ 87 } while (0) The problem is, if bit width of var is less then unsigned long, min and max may not protect var from integer overflow, because overflow happens in strtoul_safe_clamp() before checking min and max. To fix such overflow in sysfs_strtoul_clamp(), to make min and max take effect, this patch adds an unsigned long variable, and uses it to macro strtoul_safe_clamp() to convert an unsigned long value in range defined by [min, max]. Then assign this value to var. By this method, if bit width of var is less than unsigned long, integer overflow won't happen before min and max are checking. Now sysfs_strtoul_clamp() can properly handle smaller data type like unsigned int, of cause min and max should be defined in range of unsigned int too. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:31 -07:00
Tang Junhui	58ac323084	bcache: treat stale && dirty keys as bad keys Stale && dirty keys can be produced in the follow way: After writeback in write_dirty_finish(), dirty keys k1 will replace by clean keys k2 ==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key); ==>btree_insert_fn(struct btree_op b_op, struct btree b) ==>static int bch_btree_insert_node(struct btree b, struct btree_op op, struct keylist insert_keys, atomic_t journal_ref, Then two steps: A) update k1 to k2 in btree node memory; bch_btree_insert_keys(b, op, insert_keys, replace_key) B) Write the bset(contains k2) to cache disk by a 30s delay work bch_btree_leaf_dirty(b, journal_ref). But before the 30s delay work write the bset to cache device, these things happened: A) GC works, and reclaim the bucket k2 point to; B) Allocator works, and invalidate the bucket k2 point to, and increase the gen of the bucket, and place it into free_inc fifo; C) Until now, the 30s delay work still does not finish work, so in the disk, the key still is k1, it is dirty and stale (its gen is smaller than the gen of the bucket). and then the machine power off suddenly happens; D) When the machine power on again, after the btree reconstruction, the stale dirty key appear. In bch_extent_bad(), when expensive_debug_checks is off, it would treat the dirty key as good even it is stale keys, and it would cause bellow probelms: A) In read_dirty() it would cause machine crash: BUG_ON(ptr_stale(dc->disk.c, &w->key, 0)); B) It could be worse when reads hits stale dirty keys, it would read old incorrect data. This patch tolerate the existence of these stale && dirty keys, and treat them as bad key in bch_extent_bad(). (Coly Li: fix indent which was modified by sender's email client) Signed-off-by: Tang Junhui <tang.junhui.linux@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:31 -07:00
Colin Ian King	e8cf978dff	bcache: fix indentation issue, remove tabs on a hunk of code There is a hunk of code that is indented one level too deep, fix this by removing the extra tabs. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:31 -07:00
Coly Li	d4610456cf	bcache: export backing_dev_uuid via sysfs When there are multiple bcache devices, after a reboot the name of bcache devices may change (e.g. current /dev/bcache1 was /dev/bcache0 before reboot). Therefore we need the backing device UUID (sb.uuid) to identify each bcache device. Backing device uuid can be found by program bcache-super-show, but directly exporting backing_dev_uuid by sysfs file /sys/block/bcache<?>/bcache/backing_dev_uuid is a much simpler method. With backing_dev_uuid, and partition uuids from /dev/disk/by-partuuid/, now we can identify each bcache device and its partitions conveniently. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:31 -07:00
Coly Li	926d19465b	bcache: export backing_dev_name via sysfs This patch export dc->backing_dev_name to sysfs file /sys/block/bcache<?>/bcache/backing_dev_name, then people or user space tools may know the backing device name of this bcache device. Of cause it can be done by parsing sysfs links, but this method can be much simpler to find the link between bcache device and backing device. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:31 -07:00
Coly Li	83ff9318c4	bcache: not use hard coded memset size in bch_cache_accounting_clear() In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is used in memset(). It is because in struct cache_stats, there are 7 atomic_t type members. This is not good when new members added into struct stats, the hard coded number will only clear part of memory. This patch replaces 'sizeof(unsigned long) * 7' by more generic 'sizeof(struct cache_stats))', to avoid potential error if new member added into struct cache_stats. Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:31 -07:00
Daniel Axtens	9951379b0c	bcache: never writeback a discard operation Some users see panics like the following when performing fstrim on a bcached volume: [ 529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008 [ 530.183928] #PF error: [normal kernel read fault] [ 530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0 [ 530.750887] Oops: 0000 [#1] SMP PTI [ 530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3 [ 531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015 [ 531.693137] RIP: 0010:blk_queue_split+0x148/0x620 [ 531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48 8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3 [ 532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246 [ 533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000 [ 533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000 [ 533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000 [ 534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000 [ 534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020 [ 534.833319] FS: 00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000 [ 535.224098] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0 [ 535.851759] Call Trace: [ 535.970308] ? mempool_alloc_slab+0x15/0x20 [ 536.174152] ? bch_data_insert+0x42/0xd0 [bcache] [ 536.403399] blk_mq_make_request+0x97/0x4f0 [ 536.607036] generic_make_request+0x1e2/0x410 [ 536.819164] submit_bio+0x73/0x150 [ 536.980168] ? submit_bio+0x73/0x150 [ 537.149731] ? bio_associate_blkg_from_css+0x3b/0x60 [ 537.391595] ? _cond_resched+0x1a/0x50 [ 537.573774] submit_bio_wait+0x59/0x90 [ 537.756105] blkdev_issue_discard+0x80/0xd0 [ 537.959590] ext4_trim_fs+0x4a9/0x9e0 [ 538.137636] ? ext4_trim_fs+0x4a9/0x9e0 [ 538.324087] ext4_ioctl+0xea4/0x1530 [ 538.497712] ? _copy_to_user+0x2a/0x40 [ 538.679632] do_vfs_ioctl+0xa6/0x600 [ 538.853127] ? __do_sys_newfstat+0x44/0x70 [ 539.051951] ksys_ioctl+0x6d/0x80 [ 539.212785] __x64_sys_ioctl+0x1a/0x20 [ 539.394918] do_syscall_64+0x5a/0x110 [ 539.568674] entry_SYSCALL_64_after_hwframe+0x44/0xa9 We have observed it where both: 1) LVM/devmapper is involved (bcache backing device is LVM volume) and 2) writeback cache is involved (bcache cache_mode is writeback) On one machine, we can reliably reproduce it with: # echo writeback > /sys/block/bcache0/bcache/cache_mode (not sure whether above line is required) # mount /dev/bcache0 /test # for i in {0..10}; do file="$(mktemp /test/zero.XXX)" dd if=/dev/zero of="$file" bs=1M count=256 sync rm $file done # fstrim -v /test Observing this with tracepoints on, we see the following writes: fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4260112 + 196352 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4456464 + 262144 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 4718608 + 81920 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5324816 + 180224 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5505040 + 262144 hit 0 bypass 1 fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 5767184 + 81920 hit 0 bypass 1 fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0 DS 6373392 + 180224 hit 1 bypass 0 <crash> Note the final one has different hit/bypass flags. This is because in should_writeback(), we were hitting a case where the partial stripe condition was returning true and so should_writeback() was returning true early. If that hadn't been the case, it would have hit the would_skip test, and as would_skip == s->iop.bypass == true, should_writeback() would have returned false. Looking at the git history from 'commit `72c270612b` ("bcache: Write out full stripes")', it looks like the idea was to optimise for raid5/6: * If a stripe is already dirty, force writes to that stripe to writeback mode - to help build up full stripes of dirty data To fix this issue, make sure that should_writeback() on a discard op never returns true. More details of debugging: https://www.spinics.net/lists/linux-bcache/msg06996.html Previous reports: - https://bugzilla.kernel.org/show_bug.cgi?id=201051 - https://bugzilla.kernel.org/show_bug.cgi?id=196103 - https://www.spinics.net/lists/linux-bcache/msg06885.html (Coly Li: minor modification to follow maximum 75 chars per line rule) Cc: Kent Overstreet <koverstreet@google.com> Cc: stable@vger.kernel.org Fixes: `72c270612b` ("bcache: Write out full stripes") Signed-off-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Coly Li <colyli@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>	2019-02-09 07:18:31 -07:00

1 2 3 4 5 ...

812150 Commits