6455 Commits

Author SHA1 Message Date
Linus Torvalds
7cd4ecd917 drivers-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EYWYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpsCgD/9Izy/mbiQMmcBPBuQFds2b2SwPAoB4RVcU
 NU7pcI3EbAlcj7xDF08Z74Sr6MKyg+JhGid15iw47o+qFq6cxDKiESYLIrFmb70R
 lUDkPr9J4OLNDSZ6hpM4sE6Qg9bzDPhRbAceDQRtVlqjuQdaOS2qZAjNG4qjO8by
 3PDO7XHCW+X4HhXiu2PDCKuwyDlHxggYzhBIFZNf58US2BU8+tLn2gvTSvmTb27F
 w0s5WU1Q5Q0W9RLrp4YTQi4SIIOq03BTSqpRjqhomIzhSQMieH95XNKGRitLjdap
 2mFNJ+5I+DTB/TW2BDBrBRXnoV/QNBJsR0DDFnUZsHEejjXKEVt5BRCpSQC9A0WW
 XUyVE1K+3GwgIxSI8tjPtyPEGzzhnqJjzHPq4LJLGlQje95v9JZ6bpODB7HHtZQt
 rbNp8IoVQ0n01nIvkkt/vnzCE9VFbWFFQiiu5/+x26iKZXW0pAF9Dnw46nFHoYZi
 llYvbKDcAUhSdZI8JuqnSnKhi7sLRNPnApBxs52mSX8qaE91sM2iRFDewYXzaaZG
 NjijYCcUtopUvojwxYZaLnIpnKWG4OZqGTNw1IdgzUtfdxoazpg6+4wAF9vo7FEP
 AePAUTKrfkGBm95uAP4bRvXBzS9UhXJvBrFW3grzRZybMj617F01yAR4N0xlMXeN
 jMLrGe7sWA==
 =xE9E
 -----END PGP SIGNATURE-----

Merge tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block driver updates from Jens Axboe:
 "Here are the driver updates for 5.10.

  A few SCSI updates in here too, in coordination with Martin as they
  depend on core block changes for the shared tag bitmap.

  This contains:

   - NVMe pull requests via Christoph:
      - fix keep alive timer modification (Amit Engel)
      - order the PCI ID list more sensibly (Andy Shevchenko)
      - cleanup the open by controller helper (Chaitanya Kulkarni)
      - use an xarray for the CSE log lookup (Chaitanya Kulkarni)
      - support ZNS in nvmet passthrough mode (Chaitanya Kulkarni)
      - fix nvme_ns_report_zones (Christoph Hellwig)
      - add a sanity check to nvmet-fc (James Smart)
      - fix interrupt allocation when too many polled queues are
        specified (Jeffle Xu)
      - small nvmet-tcp optimization (Mark Wunderlich)
      - fix a controller refcount leak on init failure (Chaitanya
        Kulkarni)
      - misc cleanups (Chaitanya Kulkarni)
      - major refactoring of the scanning code (Christoph Hellwig)

   - MD updates via Song:
      - Bug fixes in bitmap code, from Zhao Heming
      - Fix a work queue check, from Guoqing Jiang
      - Fix raid5 oops with reshape, from Song Liu
      - Clean up unused code, from Jason Yan
      - Discard improvements, from Xiao Ni
      - raid5/6 page offset support, from Yufen Yu

   - Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap,
     Hannes)

   - null_blk open/active zone limit support (Niklas)

   - Set of bcache updates (Coly, Dongsheng, Qinglang)"

* tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits)
  md/raid5: fix oops during stripe resizing
  md/bitmap: fix memory leak of temporary bitmap
  md: fix the checking of wrong work queue
  md/bitmap: md_bitmap_get_counter returns wrong blocks
  md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks
  md/raid0: remove unused function is_io_in_chunk_boundary()
  nvme-core: remove extra condition for vwc
  nvme-core: remove extra variable
  nvme: remove nvme_identify_ns_list
  nvme: refactor nvme_validate_ns
  nvme: move nvme_validate_ns
  nvme: query namespace identifiers before adding the namespace
  nvme: revalidate zone bitmaps in nvme_update_ns_info
  nvme: remove nvme_update_formats
  nvme: update the known admin effects
  nvme: set the queue limits in nvme_update_ns_info
  nvme: remove the 0 lba_shift check in nvme_update_ns_info
  nvme: clean up the check for too large logic block sizes
  nvme: freeze the queue over ->lba_shift updates
  nvme: factor out a nvme_configure_metadata helper
  ...
2020-10-13 13:04:41 -07:00
Linus Torvalds
3ad11d7ac8 block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl
 u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1
 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf
 fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x
 e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy
 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z
 WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943
 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf
 c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl
 FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC
 YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm
 ZwfpDh2+Tg==
 =LzyE
 -----END PGP SIGNATURE-----

Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - Series of merge handling cleanups (Baolin, Christoph)

 - Series of blk-throttle fixes and cleanups (Baolin)

 - Series cleaning up BDI, seperating the block device from the
   backing_dev_info (Christoph)

 - Removal of bdget() as a generic API (Christoph)

 - Removal of blkdev_get() as a generic API (Christoph)

 - Cleanup of is-partition checks (Christoph)

 - Series reworking disk revalidation (Christoph)

 - Series cleaning up bio flags (Christoph)

 - bio crypt fixes (Eric)

 - IO stats inflight tweak (Gabriel)

 - blk-mq tags fixes (Hannes)

 - Buffer invalidation fixes (Jan)

 - Allow soft limits for zone append (Johannes)

 - Shared tag set improvements (John, Kashyap)

 - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

 - DM no-wait support (Mike, Konstantin)

 - Request allocation improvements (Ming)

 - Allow md/dm/bcache to use IO stat helpers (Song)

 - Series improving blk-iocost (Tejun)

 - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
   Xianting, Yang, Yufen, yangerkun)

* tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
  block: fix uapi blkzoned.h comments
  blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
  blk-mq: get rid of the dead flush handle code path
  block: get rid of unnecessary local variable
  block: fix comment and add lockdep assert
  blk-mq: use helper function to test hw stopped
  block: use helper function to test queue register
  block: remove redundant mq check
  block: invoke blk_mq_exit_sched no matter whether have .exit_sched
  percpu_ref: don't refer to ref->data if it isn't allocated
  block: ratelimit handle_bad_sector() message
  blk-throttle: Re-use the throtl_set_slice_end()
  blk-throttle: Open code __throtl_de/enqueue_tg()
  blk-throttle: Move service tree validation out of the throtl_rb_first()
  blk-throttle: Move the list operation after list validation
  blk-throttle: Fix IO hang for a corner case
  blk-throttle: Avoid tracking latency if low limit is invalid
  blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
  blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
  block: Remove redundant 'return' statement
  ...
2020-10-13 12:12:44 -07:00
Linus Torvalds
ca1b66922a * Extend the recovery from MCE in kernel space also to processes which
encounter an MCE in kernel space but while copying from user memory by
 sending them a SIGBUS on return to user space and umapping the faulty
 memory, by Tony Luck and Youquan Song.
 
 * memcpy_mcsafe() rework by splitting the functionality into
 copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
 support for new hardware which can recover from a machine check
 encountered during a fast string copy and makes that the default and
 lets the older hardware which does not support that advance recovery,
 opt in to use the old, fragile, slow variant, by Dan Williams.
 
 * New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.
 
 * Do not use MSR-tracing accessors in #MC context and flag any fault
 while accessing MCA architectural MSRs as an architectural violation
 with the hope that such hw/fw misdesigns are caught early during the hw
 eval phase and they don't make it into production.
 
 * Misc fixes, improvements and cleanups, as always.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAl+EIpUACgkQEsHwGGHe
 VUouoBAAgwb+NkWZtIqGImV4f+LOyFjhTR/r/7ZyiijXdbhOIuAdc/jQM31mQxug
 sX2jxaRYnf1n6SLA0ggX99gwr2deRQ/hsNf5Abw55GC+Z1dOxpGL0k59A3ELl1IR
 H9KYmCAFQIHvzfk38qcdND73XHcgthQoXFBOG9wAPAdgDWnaiWt6lcLAq8OiJTmp
 D8pInAYhcnL8YXwMGyQQ1KkFn9HwydoWDsK5Ff2shaw2/+dMQqd1zetenbVtjhLb
 iNYGvV7Bi/RQ8PyMbzmtTWa4kwQJAHC2gptkGxty//2ADGVBbqUQdqF9TjIWCNy5
 V6Ldv5zo0/1s7DOzji3htzqkSs/K1Ea6d2LtZjejkJipHKV5x068UC6Fu+PlfS2D
 VZfcICeapU4G2F3Zvks2DlZ7dVTbHCvoI78Qi7bBgczPUVmk6iqah4xuQaiHyBJc
 kTFDA4Nnf/026GpoWRiFry9vqdnHBZyLet5A6Y+SoWF0FbhYnCVPpq4MnussYoav
 lUIi9ZZav6X2RZp9DDM1f9d5xubtKq0DKt93wvzqAhjK0T2DikckJ+riOYkI6N8t
 fHCBNUkdfgyMzJUTBPAzYQ7RmjbjKWJi7xWP0oz6+GqOJkQfSTVC5/2yEffbb3ya
 whYRS6iklbl7yshzaOeecXsZcAeK2oGPfoHg34WkHFgXdF5mNgA=
 =u1Wg
 -----END PGP SIGNATURE-----

Merge tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull RAS updates from Borislav Petkov:

 - Extend the recovery from MCE in kernel space also to processes which
   encounter an MCE in kernel space but while copying from user memory
   by sending them a SIGBUS on return to user space and umapping the
   faulty memory, by Tony Luck and Youquan Song.

 - memcpy_mcsafe() rework by splitting the functionality into
   copy_mc_to_user() and copy_mc_to_kernel(). This, as a result, enables
   support for new hardware which can recover from a machine check
   encountered during a fast string copy and makes that the default and
   lets the older hardware which does not support that advance recovery,
   opt in to use the old, fragile, slow variant, by Dan Williams.

 - New AMD hw enablement, by Yazen Ghannam and Akshay Gupta.

 - Do not use MSR-tracing accessors in #MC context and flag any fault
   while accessing MCA architectural MSRs as an architectural violation
   with the hope that such hw/fw misdesigns are caught early during the
   hw eval phase and they don't make it into production.

 - Misc fixes, improvements and cleanups, as always.

* tag 'ras_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce: Allow for copy_mc_fragile symbol checksum to be generated
  x86/mce: Decode a kernel instruction to determine if it is copying from user
  x86/mce: Recover from poison found while copying from user space
  x86/mce: Avoid tail copy when machine check terminated a copy from user
  x86/mce: Add _ASM_EXTABLE_CPY for copy user access
  x86/mce: Provide method to find out the type of an exception handler
  x86/mce: Pass pointer to saved pt_regs to severity calculation routines
  x86/copy_mc: Introduce copy_mc_enhanced_fast_string()
  x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
  x86/mce: Drop AMD-specific "DEFERRED" case from Intel severity rule list
  x86/mce: Add Skylake quirk for patrol scrub reported errors
  RAS/CEC: Convert to DEFINE_SHOW_ATTRIBUTE()
  x86/mce: Annotate mce_rd/wrmsrl() with noinstr
  x86/mce/dev-mcelog: Do not update kflags on AMD systems
  x86/mce: Stop mce_reign() from re-computing severity for every CPU
  x86/mce: Make mce_rdmsrl() panic on an inaccessible MSR
  x86/mce: Increase maximum number of banks to 64
  x86/mce: Delay clearing IA32_MCG_STATUS to the end of do_machine_check()
  x86/MCE/AMD, EDAC/mce_amd: Remove struct smca_hwid.xec_bitmap
  RAS/CEC: Fix cec_init() prototype
2020-10-12 10:14:38 -07:00
Song Liu
b44c018cdf md/raid5: fix oops during stripe resizing
KoWei reported crash during raid5 reshape:

[ 1032.252932] Oops: 0002 [#1] SMP PTI
[...]
[ 1032.252943] RIP: 0010:memcpy_erms+0x6/0x10
[...]
[ 1032.252947] RSP: 0018:ffffba1ac0c03b78 EFLAGS: 00010286
[ 1032.252949] RAX: 0000784ac0000000 RBX: ffff91bec3d09740 RCX: 0000000000001000
[ 1032.252951] RDX: 0000000000001000 RSI: ffff91be6781c000 RDI: 0000784ac0000000
[ 1032.252953] RBP: ffffba1ac0c03bd8 R08: 0000000000001000 R09: ffffba1ac0c03bf8
[ 1032.252954] R10: 0000000000000000 R11: 0000000000000000 R12: ffffba1ac0c03bf8
[ 1032.252955] R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000000000
[ 1032.252958] FS:  0000000000000000(0000) GS:ffff91becf500000(0000) knlGS:0000000000000000
[ 1032.252959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1032.252961] CR2: 0000784ac0000000 CR3: 000000031780a002 CR4: 00000000001606e0
[ 1032.252962] Call Trace:
[ 1032.252969]  ? async_memcpy+0x179/0x1000 [async_memcpy]
[ 1032.252977]  ? raid5_release_stripe+0x8e/0x110 [raid456]
[ 1032.252982]  handle_stripe_expansion+0x15a/0x1f0 [raid456]
[ 1032.252988]  handle_stripe+0x592/0x1270 [raid456]
[ 1032.252993]  handle_active_stripes.isra.0+0x3cb/0x5a0 [raid456]
[ 1032.252999]  raid5d+0x35c/0x550 [raid456]
[ 1032.253002]  ? schedule+0x42/0xb0
[ 1032.253006]  ? schedule_timeout+0x10e/0x160
[ 1032.253011]  md_thread+0x97/0x160
[ 1032.253015]  ? wait_woken+0x80/0x80
[ 1032.253019]  kthread+0x104/0x140
[ 1032.253022]  ? md_start_sync+0x60/0x60
[ 1032.253024]  ? kthread_park+0x90/0x90
[ 1032.253027]  ret_from_fork+0x35/0x40

This is because cache_size_mutex was unlocked too early in resize_stripes,
which races with grow_one_stripe() that grow_one_stripe() allocates a
stripe with wrong pool_size.

Fix this issue by unlocking cache_size_mutex after updating pool_size.

Cc: <stable@vger.kernel.org> # v4.4+
Reported-by: KoWei Sung <winders@amazon.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08 22:38:10 -07:00
Zhao Heming
1383b347a8 md/bitmap: fix memory leak of temporary bitmap
Callers of get_bitmap_from_slot() are responsible to free the bitmap.

Suggested-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08 22:37:39 -07:00
Guoqing Jiang
cf0b9b4821 md: fix the checking of wrong work queue
It should check md_rdev_misc_wq instead of md_misc_wq.

Fixes: cc1ffe61c026 ("md: add new workqueue for delete rdev")
Cc: <stable@vger.kernel.org> # v5.8+
Signed-off-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08 22:36:30 -07:00
Zhao Heming
d837f7277f md/bitmap: md_bitmap_get_counter returns wrong blocks
md_bitmap_get_counter() has code:

```
    if (bitmap->bp[page].hijacked ||
        bitmap->bp[page].map == NULL)
        csize = ((sector_t)1) << (bitmap->chunkshift +
                      PAGE_COUNTER_SHIFT - 1);
```

The minus 1 is wrong, this branch should report 2048 bits of space.
With "-1" action, this only report 1024 bit of space.

This bug code returns wrong blocks, but it doesn't inflence bitmap logic:
1. Most callers focus this function return value (the counter of offset),
   not the parameter blocks.
2. The bug is only triggered when hijacked is true or map is NULL.
   the hijacked true condition is very rare.
   the "map == null" only true when array is creating or resizing.
3. Even the caller gets wrong blocks, current code makes caller just to
   call md_bitmap_get_counter() one more time.

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08 22:31:29 -07:00
Zhao Heming
a913096dec md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks
The patched code is used to get chunks number, should use round-up div
to replace current sector_div. The same code is in md_bitmap_resize():
```
chunks = DIV_ROUND_UP_SECTOR_T(blocks, 1 << chunkshift);
```

Signed-off-by: Zhao Heming <heming.zhao@suse.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08 22:31:29 -07:00
Jason Yan
d7a1c483f7 md/raid0: remove unused function is_io_in_chunk_boundary()
This function is no longger needed after commit 20d0189b1012 ("block:
Introduce new bio_split()").

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-10-08 22:31:29 -07:00
Dan Williams
ec6347bb43 x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()
In reaction to a proposal to introduce a memcpy_mcsafe_fast()
implementation Linus points out that memcpy_mcsafe() is poorly named
relative to communicating the scope of the interface. Specifically what
addresses are valid to pass as source, destination, and what faults /
exceptions are handled.

Of particular concern is that even though x86 might be able to handle
the semantics of copy_mc_to_user() with its common copy_user_generic()
implementation other archs likely need / want an explicit path for this
case:

  On Fri, May 1, 2020 at 11:28 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
  >
  > On Thu, Apr 30, 2020 at 6:21 PM Dan Williams <dan.j.williams@intel.com> wrote:
  > >
  > > However now I see that copy_user_generic() works for the wrong reason.
  > > It works because the exception on the source address due to poison
  > > looks no different than a write fault on the user address to the
  > > caller, it's still just a short copy. So it makes copy_to_user() work
  > > for the wrong reason relative to the name.
  >
  > Right.
  >
  > And it won't work that way on other architectures. On x86, we have a
  > generic function that can take faults on either side, and we use it
  > for both cases (and for the "in_user" case too), but that's an
  > artifact of the architecture oddity.
  >
  > In fact, it's probably wrong even on x86 - because it can hide bugs -
  > but writing those things is painful enough that everybody prefers
  > having just one function.

Replace a single top-level memcpy_mcsafe() with either
copy_mc_to_user(), or copy_mc_to_kernel().

Introduce an x86 copy_mc_fragile() name as the rename for the
low-level x86 implementation formerly named memcpy_mcsafe(). It is used
as the slow / careful backend that is supplanted by a fast
copy_mc_generic() in a follow-on patch.

One side-effect of this reorganization is that separating copy_mc_64.S
to its own file means that perf no longer needs to track dependencies
for its memcpy_64.S benchmarks.

 [ bp: Massage a bit. ]

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: <stable@vger.kernel.org>
Link: http://lore.kernel.org/r/CAHk-=wjSqtXAqfUJxFtWNwmguFASTgB0dz1dT3V-78Quiezqbg@mail.gmail.com
Link: https://lkml.kernel.org/r/160195561680.2163339.11574962055305783722.stgit@dwillia2-desk3.amr.corp.intel.com
2020-10-06 11:18:04 +02:00
Eric Biggers
07560151db block: make bio_crypt_clone() able to fail
bio_crypt_clone() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, bio_crypt_clone() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via
kcryptd_io_read() in drivers/md/dm-crypt.c.

Neither case is currently reachable with a bio that actually has an
encryption context.  However, it's fragile to rely on this.  Just make
bio_crypt_clone() able to fail, analogous to bio_integrity_clone().

Reported-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-05 10:47:43 -06:00
Coly Li
4a784266c6 bcache: remove embedded struct cache_sb from struct cache_set
Since bcache code was merged into mainline kerrnel, each cache set only
as one single cache in it. The multiple caches framework is here but the
code is far from completed. Considering the multiple copies of cached
data can also be stored on e.g. md raid1 devices, it is unnecessary to
support multiple caches in one cache set indeed.

The previous preparation patches fix the dependencies of explicitly
making a cache set only have single cache. Now we don't have to maintain
an embedded partial super block in struct cache_set, the in-memory super
block can be directly referenced from struct cache.

This patch removes the embedded struct cache_sb from struct cache_set,
and fixes all locations where the superb lock was referenced from this
removed super block by referencing the in-memory super block of struct
cache.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
6f9414e0f6 bcache: check and set sync status on cache's in-memory super block
Currently the cache's sync status is checked and set on cache set's in-
memory partial super block. After removing the embedded struct cache_sb
from cache set and reference cache's in-memory super block from struct
cache_set, the sync status can set and check directly on cache's super
block.

This patch checks and sets the cache sync status directly on cache's
in-memory super block. This is a preparation for later removing embedded
struct cache_sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
ebaa1ac12b bcache: remove can_attach_cache()
After removing the embedded struct cache_sb from struct cache_set, cache
set will directly reference the in-memory super block of struct cache.
It is unnecessary to compare block_size, bucket_size and nr_in_set from
the identical in-memory super block in can_attach_cache().

This is a preparation patch for latter removing cache_set->sb from
struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
08a1782881 bcache: don't check seq numbers in register_cache_set()
In order to update the partial super block of cache set, the seq numbers
of cache and cache set are checked in register_cache_set(). If cache's
seq number is larger than cache set's seq number, cache set must update
its partial super block from cache's super block. It is unncessary when
the embedded struct cache_sb is removed from struct cache set.

This patch removed the seq numbers checking from register_cache_set(),
because later there will be no such partial super block in struct cache
set, the cache set will directly reference in-memory super block from
struct cache. This is a preparation patch for removing embedded struct
cache_sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
63a96c05cd bcache: only use bucket_bytes() on struct cache
Because struct cache_set and struct cache both have struct cache_sb,
macro bucket_bytes() currently are used on both of them. When removing
the embedded struct cache_sb from struct cache_set, this macro won't be
used on struct cache_set anymore.

This patch unifies all bucket_bytes() usage only on struct cache, this is
one of the preparation to remove the embedded struct cache_sb from
struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
3c4fae2982 bcache: remove useless bucket_pages()
It seems alloc_bucket_pages() is the only user of bucket_pages().
Considering alloc_bucket_pages() is removed from bcache code, it is safe
to remove the useless macro bucket_pages() now.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
421cf1c573 bcache: remove useless alloc_bucket_pages()
Now no one uses alloc_bucket_pages() anymore, remove it from bcache.h.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
4e1ebae3ee bcache: only use block_bytes() on struct cache
Because struct cache_set and struct cache both have struct cache_sb,
therefore macro block_bytes() can be used on both of them. When removing
the embedded struct cache_sb from struct cache_set, this macro won't be
used on struct cache_set anymore.

This patch unifies all block_bytes() usage only on struct cache, this is
one of the preparation to remove the embedded struct cache_sb from
struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
1132e56e78 bcache: add set_uuid in struct cache_set
This patch adds a separated set_uuid[16] in struct cache_set, to store
the uuid of the cache set. This is the preparation to remove the
embedded struct cache_sb from struct cache_set.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:30 -06:00
Coly Li
08fdb2cddb bcache: remove for_each_cache()
Since now each cache_set explicitly has single cache, for_each_cache()
is unnecessary. This patch removes this macro, and update all locations
where it is used, and makes sure all code logic still being consistent.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Coly Li
697e23495c bcache: explicitly make cache_set only have single cache
Currently although the bcache code has a framework for multiple caches
in a cache set, but indeed the multiple caches never completed and users
use md raid1 for multiple copies of the cached data.

This patch does the following change in struct cache_set, to explicitly
make a cache_set only have single cache,
- Change pointer array "*cache[MAX_CACHES_PER_SET]" to a single pointer
  "*cache".
- Remove pointer array "*cache_by_alloc[MAX_CACHES_PER_SET]".
- Remove "caches_loaded".

Now the code looks as exactly what it does in practic: only one cache is
used in the cache set.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Coly Li
17e4aed830 bcache: remove 'int n' from parameter list of bch_bucket_alloc_set()
The parameter 'int n' from bch_bucket_alloc_set() is not cleared
defined. From the code comments n is the number of buckets to alloc, but
from the code itself 'n' is the maximum cache to iterate. Indeed all the
locations where bch_bucket_alloc_set() is called, 'n' is alwasy 1.

This patch removes the confused and unnecessary 'int n' from parameter
list of  bch_bucket_alloc_set(), and explicitly allocates only 1 bucket
for its caller.

Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Qinglang Miao
84e5d1363c bcache: Convert to DEFINE_SHOW_ATTRIBUTE
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

As inode->iprivate equals to third parameter of
debugfs_create_file() which is NULL. So it's equivalent
to original code logic.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Dongsheng Yang
7e59c506c3 bcache: check c->root with IS_ERR_OR_NULL() in mca_reserve()
In mca_reserve(c) macro, we are checking root whether is NULL or not.
But that's not enough, when we read the root node in run_cache_set(),
if we got an error in bch_btree_node_read_done(), we will return
ERR_PTR(-EIO) to c->root.

And then we will go continue to unregister, but before calling
unregister_shrinker(&c->shrink), there is a possibility to call
bch_mca_count(), and we would get a crash with call trace like that:

[ 2149.876008] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000b5
... ...
[ 2150.598931] Call trace:
[ 2150.606439]  bch_mca_count+0x58/0x98 [escache]
[ 2150.615866]  do_shrink_slab+0x54/0x310
[ 2150.624429]  shrink_slab+0x248/0x2d0
[ 2150.632633]  drop_slab_node+0x54/0x88
[ 2150.640746]  drop_slab+0x50/0x88
[ 2150.648228]  drop_caches_sysctl_handler+0xf0/0x118
[ 2150.657219]  proc_sys_call_handler.isra.18+0xb8/0x110
[ 2150.666342]  proc_sys_write+0x40/0x50
[ 2150.673889]  __vfs_write+0x48/0x90
[ 2150.681095]  vfs_write+0xac/0x1b8
[ 2150.688145]  ksys_write+0x6c/0xd0
[ 2150.695127]  __arm64_sys_write+0x24/0x30
[ 2150.702749]  el0_svc_handler+0xa0/0x128
[ 2150.710296]  el0_svc+0x8/0xc

Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Coly Li
a58e88bfdc bcache: share register sysfs with async register
Previously the experimental async registration uses a separate sysfs
file register_async. Now the async registration code seems working well
for a while, we can do furtuher testing with it now.

This patch changes the async bcache registration shares the same sysfs
file /sys/fs/bcache/register (and register_quiet). Async registration
will be default behavior if BCACHE_ASYNC_REGISTRATION is set in kernel
configure. By default, BCACHE_ASYNC_REGISTRATION is not configured yet.

Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-10-02 14:25:29 -06:00
Konstantin Khlebnikov
6abc49468e dm: add support for REQ_NOWAIT and enable it for linear target
Add DM target feature flag DM_TARGET_NOWAIT which advertises that
target works with REQ_NOWAIT bios.

Add dm_table_supports_nowait() and update dm_table_set_restrictions()
to set/clear QUEUE_FLAG_NOWAIT accordingly.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:20:03 -06:00
Christoph Hellwig
4245e52d25 md: don't detour through bd_contains for the gendisk
bd_disk is set on all block devices, including those for partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:57 -06:00
Christoph Hellwig
61a27e1f52 md: compare bd_disk instead of bd_contains
To check for partitions of the same disk bd_contains works as well, but
bd_disk is way more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:57 -06:00
Christoph Hellwig
fa01b1e973 block: add a bdev_is_partition helper
Add a littler helper to make the somewhat arcane bd_contains checks a
little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-25 08:18:57 -06:00
Xiao Ni
d3ee2d8415 md/raid10: improve discard request for far layout
For far layout, the discard region is not continuous on disks. So it needs
far copies r10bio to cover all regions. It needs a way to know all r10bios
have finish or not. Similar with raid10_sync_request, only the first r10bio
master_bio records the discard bio. Other r10bios master_bio record the
first r10bio. The first r10bio can finish after other r10bios finish and
then return the discard bio.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Xiao Ni
bcc90d2804 md/raid10: improve raid10 discard request
Now the discard request is split by chunk size. So it takes a long time
to finish mkfs on disks which support discard function. This patch improve
handling raid10 discard request. It uses the similar way with patch
29efc390b (md/md0: optimize raid0 discard handling).

But it's a little complex than raid0. Because raid10 has different layout.
If raid10 is offset layout and the discard request is smaller than stripe
size. There are some holes when we submit discard bio to underlayer disks.

For example: five disks (disk1 - disk5)
D01 D02 D03 D04 D05
D05 D01 D02 D03 D04
D06 D07 D08 D09 D10
D10 D06 D07 D08 D09
The discard bio just wants to discard from D03 to D10. For disk3, there is
a hole between D03 and D08. For disk4, there is a hole between D04 and D09.
D03 is a chunk, raid10_write_request can handle one chunk perfectly. So
the part that is not aligned with stripe size is still handled by
raid10_write_request.

If reshape is running when discard bio comes and the discard bio spans the
reshape position, raid10_write_request is responsible to handle this
discard bio.

I did a test with this patch set.
Without patch:
time mkfs.xfs /dev/md0
real4m39.775s
user0m0.000s
sys0m0.298s

With patch:
time mkfs.xfs /dev/md0
real0m0.105s
user0m0.000s
sys0m0.007s

nvme3n1           259:1    0   477G  0 disk
└─nvme3n1p1       259:10   0    50G  0 part
nvme4n1           259:2    0   477G  0 disk
└─nvme4n1p1       259:11   0    50G  0 part
nvme5n1           259:6    0   477G  0 disk
└─nvme5n1p1       259:12   0    50G  0 part
nvme2n1           259:9    0   477G  0 disk
└─nvme2n1p1       259:15   0    50G  0 part
nvme0n1           259:13   0   477G  0 disk
└─nvme0n1p1       259:14   0    50G  0 part

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Xiao Ni
f046f5d0d7 md/raid10: pull codes that wait for blocked dev into one function
The following patch will reuse these logics, so pull the same codes into
one function.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Xiao Ni
8650a88901 md/raid10: extend r10bio devs to raid disks
Now it allocs r10bio->devs[conf->copies]. Discard bio needs to submit
to all member disks and it needs to use r10bio. So extend to
r10bio->devs[geo.raid_disks].

Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Xiao Ni
2628089b74 md: add md_submit_discard_bio() for submitting discard bio
Move these logic from raid0.c to md.c, so that we can also use it in
raid10.c.

Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Zhen Lei
e287308b83 md: Simplify code with existing definition RESYNC_SECTORS in raid10.c
#define RESYNC_SECTORS (RESYNC_BLOCK_SIZE >> 9)

"RESYNC_BLOCK_SIZE/512" is equal to "RESYNC_BLOCK_SIZE >> 9", replace it
with RESYNC_SECTORS.

Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Yufen Yu
3891258443 md/raid5: reallocate page array after setting new stripe_size
When try to resize stripe_size, we also need to free old
shared page array and allocate new.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Yufen Yu
f16acaf328 md/raid5: resize stripe_head when reshape array
When reshape array, we try to reuse shared pages of old stripe_head,
and allocate more for the new one if needed.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:45 -07:00
Yufen Yu
046169f048 md/raid5: let multiple devices of stripe_head share page
In current implementation, grow_buffers() uses alloc_page() to
allocate the buffers for each stripe_head, i.e. allocate a page
for each dev[i] in stripe_head.

After setting stripe_size as a configurable value by writing
sysfs entry, it means that we always allocate 64K buffers, but
just use 4K of them when stripe_size is 4K in 64KB arm64.

To avoid wasting memory, we try to let multiple sh->dev share
one real page. That means, multiple sh->dev[i].page will point
to the only page with different offset. Example of 64K PAGE_SIZE
and 4K stripe_size as following:

                    64K PAGE_SIZE
          +---+---+---+---+------------------------------+
          |   |   |   |   |
          |   |   |   |   |
          +-+-+-+-+-+-+-+-+------------------------------+
            ^   ^   ^   ^
            |   |   |   +----------------------------+
            |   |   |                                |
            |   |   +-------------------+            |
            |   |                       |            |
            |   +----------+            |            |
            |              |            |            |
            +-+            |            |            |
              |            |            |            |
        +-----+-----+------+-----+------+-----+------+------+
sh      | offset(0) | offset(4K) | offset(8K) | offset(12K) |
 +      +-----------+------------+------------+-------------+
 +----> dev[0].page  dev[1].page  dev[2].page  dev[3].page

A new 'pages' array will be added into stripe_head to record shared
page used by this stripe_head. Allocate them when grow_buffers()
and free them when shrink_buffers().

After trying to share page, the users of sh->dev[i].page need to take
care of the related page offset: page of issued bio and page passed
to xor compution functions. But thanks for previous different page offset
supported. Here, we just need to set correct dev[i].offset.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
4f86ff5580 md/raid6: let async recovery function support different page offset
For now, asynchronous raid6 recovery calculate functions are require
common offset for pages. But, we expect them to support different page
offset after introducing stripe shared page. Do that by simplily adding
page offset where each page address are referred. Then, replace the
old interface with the new ones in raid6 and raid6test.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
d69454bc9f md/raid6: let syndrome computor support different page offset
For now, syndrome compute functions require common offset in the pages
array. However, we expect them to support different offset when try to
use shared page in the following. Simplily covert them by adding page
offset where each page address are referred.

Since the only caller of async_gen_syndrome() and async_syndrome_val()
are in raid6, we don't want to reserve the old interface but modify the
interface directly. After that, replacing old interfaces with new ones
for raid6 and raid6test.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
a7c224a820 md/raid5: convert to new xor compution interface
We try to replace async_xor() and async_xor_val() with the new
introduced interface async_xor_offs() and async_xor_val_offs()
for raid456.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
248728dd04 md/raid5: make async_copy_data() to support different page offset
ops_run_biofill() and ops_run_biodrain() will call async_copy_data()
to copy sh->dev[i].page from or to bio page. For now, it implies the
offset of dev[i].page is 0. But we want to support different page offset
in the following.

Thus, pass page offset to these functions and replace 'page_offset'
with 'page_offset + poff'.

No functional change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Yufen Yu
7aba13b715 md/raid5: add a new member of offset into r5dev
Add a new member of offset into struct r5dev. It indicates the
offset of related dev[i].page. For now, since each device have a
privated page, the value is always 0. Thus, we set offset as 0
when allcate page in grow_buffers() and resize_stripes().

To support following different page offset, we try to use the page
offset rather than '0' directly for async_memcpy() and ops_run_io().

We try to support different page offset for xor compution functions
in the following. To avoid repeatly allocate a new array each time,
we add a memory region into scribble buffer to record offset.

No functional change.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Xianting Tian
313b825fa2 md: only calculate blocksize once and use i_blocksize()
We alreday has the interface i_blocksize(), which can be used
to get blocksize, so use it.
Only calculate blocksize once and use it within read_page().

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
2020-09-24 16:44:44 -07:00
Jens Axboe
ac8f7a0264 Merge branch 'for-5.10/block' into for-5.10/drivers
* for-5.10/block: (140 commits)
  bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
  bdi: invert BDI_CAP_NO_ACCT_WB
  bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
  mm: use SWP_SYNCHRONOUS_IO more intelligently
  bdi: remove BDI_CAP_SYNCHRONOUS_IO
  bdi: remove BDI_CAP_CGROUP_WRITEBACK
  block: lift setting the readahead size into the block layer
  md: update the optimal I/O size on reshape
  bdi: initialize ->ra_pages and ->io_pages in bdi_init
  aoe: set an optimal I/O size
  bcache: inherit the optimal I/O size
  drbd: remove dead code in device_to_statistics
  fs: remove the unused SB_I_MULTIROOT flag
  block: mark blkdev_get static
  PM: mm: cleanup swsusp_swap_check
  mm: split swap_type_of
  PM: rewrite is_hibernate_resume_dev to not require an inode
  mm: cleanup claim_swapfile
  ocfs2: cleanup o2hb_region_dev_store
  dasd: cleanup dasd_scan_partitions
  ...
2020-09-24 13:44:39 -06:00
Christoph Hellwig
1cb039f3dc bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we an't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
also is writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
c2e4cd57cf block: lift setting the readahead size into the block layer
Drivers shouldn't really mess with the readahead size, as that is a VM
concept.  Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk.  Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors.  To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
16ef510139 md: update the optimal I/O size on reshape
The raid5 and raid10 drivers currently update the read-ahead size,
but not the optimal I/O size on reshape.  To prepare for deriving the
read-ahead size from the optimal I/O size make sure it is updated
as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:39 -06:00
Christoph Hellwig
5d4ce78b25 bcache: inherit the optimal I/O size
Inherit the optimal I/O size setting just like the readahead window,
as any reason to do larger I/O does not apply to just readahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24 13:43:38 -06:00