11913 Commits

Author SHA1 Message Date
Jonathan Toppins
75dddef325 mm: ratelimit PFNs busy info message
The RDMA subsystem can generate several thousand of these messages per
second eventually leading to a kernel crash.  Ratelimit these messages
to prevent this crash.

Doug said:
 "I've been carrying a version of this for several kernel versions. I
  don't remember when they started, but we have one (and only one) class
  of machines: Dell PE R730xd, that generate these errors. When it
  happens, without a rate limit, we get rcu timeouts and kernel oopses.
  With the rate limit, we just get a lot of annoying kernel messages but
  the machine continues on, recovers, and eventually the memory
  operations all succeed"

And:
 "> Well... why are all these EBUSY's occurring? It sounds inefficient
  > (at least) but if it is expected, normal and unavoidable then
  > perhaps we should just remove that message altogether?

  I don't have an answer to that question. To be honest, I haven't
  looked real hard. We never had this at all, then it started out of the
  blue, but only on our Dell 730xd machines (and it hits all of them),
  but no other classes or brands of machines. And we have our 730xd
  machines loaded up with different brands and models of cards (for
  instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
  ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
  meant it wasn't tied to any particular brand/model of RDMA hardware.
  To me, it always smelled of a hardware oddity specific to maybe the
  CPUs or mainboard chipsets in these machines, so given that I'm not an
  mm expert anyway, I never chased it down.

  A few other relevant details: it showed up somewhere around 4.8/4.9 or
  thereabouts. It never happened before, but the prinkt has been there
  since the 3.18 days, so possibly the test to trigger this message was
  changed, or something else in the allocator changed such that the
  situation started happening on these machines?

  And, like I said, it is specific to our 730xd machines (but they are
  all identical, so that could mean it's something like their specific
  ram configuration is causing the allocator to hit this on these
  machine but not on other machines in the cluster, I don't want to say
  it's necessarily the model of chipset or CPU, there are other bits of
  identicalness between these machines)"

Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
Signed-off-by: Jonathan Toppins <jtoppins@redhat.com>
Reviewed-by: Doug Ledford <dledford@redhat.com>
Tested-by: Doug Ledford <dledford@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:06 -07:00
Johannes Weiner
d507e2ebd2 mm: fix global NR_SLAB_.*CLAIMABLE counter reads
As Tetsuo points out:
 "Commit 385386cff4c6 ("mm: vmstat: move slab statistics from zone to
  node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
  0kB"

In addition to /proc/meminfo, this problem also affects the slab
counters OOM/allocation failure info dumps, can cause early -ENOMEM from
overcommit protection, and miscalculate image size requirements during
suspend-to-disk.

This is because the patch in question switched the slab counters from
the zone level to the node level, but forgot to update the global
accessor functions to read the aggregate node data instead of the
aggregate zone data.

Use global_node_page_state() to access the global slab counters.

Fixes: 385386cff4c6 ("mm: vmstat: move slab statistics from zone to node counters")
Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Stefan Agner <stefan@agner.ch>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-10 15:54:06 -07:00
Peter Zijlstra
d92a8cfcb3 locking/lockdep: Rework FS_RECLAIM annotation
A while ago someone, and I cannot find the email just now, asked if we
could not implement the RECLAIM_FS inversion stuff with a 'fake' lock
like we use for other things like workqueues etc. I think this should
be possible which allows reducing the 'irq' states and will reduce the
amount of __bfs() lookups we do.

Removing the 1 IRQ state results in 4 less __bfs() walks per
dependency, improving lockdep performance. And by moving this
annotation out of the lockdep code it becomes easier for the mm people
to extend.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Byungchul Park <byungchul.park@lge.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nikolay Borisov <nborisov@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: akpm@linux-foundation.org
Cc: boqun.feng@gmail.com
Cc: iamjoonsoo.kim@lge.com
Cc: kernel-team@lge.com
Cc: kirill@shutemov.name
Cc: npiggin@gmail.com
Cc: walken@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:29:03 +02:00
Peter Zijlstra
8b1b436dd1 mm, locking: Rework {set,clear,mm}_tlb_flush_pending()
Commit:

  af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")

added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
can solve the same problem without this barrier.

If instead we mandate that mm_tlb_flush_pending() is used while
holding the PTL we're guaranteed to observe prior
set_tlb_flush_pending() instances.

For this to work we need to rework migrate_misplaced_transhuge_page()
a little and move the test up into do_huge_pmd_numa_page().

NOTE: this relies on flush_tlb_range() to guarantee:

   (1) it ensures that prior page table updates are visible to the
       page table walker and
   (2) it ensures that subsequent memory accesses are only made
       visible after the invalidation has completed

This is required for architectures that implement TRANSPARENT_HUGEPAGE
(arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
mm_tlb_flush_pending() in their page-table operations (arm, arm64,
x86).

This appears true for:

 - arm (DSB ISB before and after),
 - arm64 (DSB ISHST before, and DSB ISH after),
 - powerpc (PTESYNC before and after),
 - s390 and x86 TLB invalidate are serializing instructions

But I failed to understand the situation for:

 - arc, mips, sparc

Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
inside the PTL. It still needs to guarantee the PTL unlock happens
_after_ the invalidate completes.

Vineet, Ralf and Dave could you guys please have a look?

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:29:01 +02:00
Ingo Molnar
388f8e1273 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-10 12:20:53 +02:00
Heiko Carstens
167d0f258f mm: take memory hotplug lock within numa_zonelist_order_handler()
Andre Wild reported the following warning:

  WARNING: CPU: 2 PID: 1205 at kernel/cpu.c:240 lockdep_assert_cpus_held+0x4c/0x60
  Modules linked in:
  CPU: 2 PID: 1205 Comm: bash Not tainted 4.13.0-rc2-00022-gfd2b2c57ec20 #10
  Hardware name: IBM 2964 N96 702 (z/VM 6.4.0)
  task: 00000000701d8100 task.stack: 0000000073594000
  Krnl PSW : 0704f00180000000 0000000000145e24 (lockdep_assert_cpus_held+0x4c/0x60)
  ...
  Call Trace:
   lockdep_assert_cpus_held+0x42/0x60)
   stop_machine_cpuslocked+0x62/0xf0
   build_all_zonelists+0x92/0x150
   numa_zonelist_order_handler+0x102/0x150
   proc_sys_call_handler.isra.12+0xda/0x118
   proc_sys_write+0x34/0x48
   __vfs_write+0x3c/0x178
   vfs_write+0xbc/0x1a0
   SyS_write+0x66/0xc0
   system_call+0xc4/0x2b0
   locks held by bash/1205:
   #0:  (sb_writers#4){.+.+.+}, at: vfs_write+0xa6/0x1a0
   #1:  (zl_order_mutex){+.+...}, at: numa_zonelist_order_handler+0x44/0x150
   #2:  (zonelists_mutex){+.+...}, at: numa_zonelist_order_handler+0xf4/0x150
  Last Breaking-Event-Address:
    lockdep_assert_cpus_held+0x48/0x60

This can be easily triggered with e.g.

    echo n > /proc/sys/vm/numa_zonelist_order

In commit 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu
rwsem") memory hotplug locking was changed to fix a potential deadlock.

This also switched the stop_machine() invocation within
build_all_zonelists() to stop_machine_cpuslocked() which now expects
that online cpus are locked when being called.

This assumption is not true if build_all_zonelists() is being called
from numa_zonelist_order_handler().

In order to fix this simply add a mem_hotplug_begin()/mem_hotplug_done()
pair to numa_zonelist_order_handler().

Link: http://lkml.kernel.org/r/20170726111738.38768-1-heiko.carstens@de.ibm.com
Fixes: 3f906ba23689a ("mm/memory-hotplug: switch locking to a percpu rwsem")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reported-by: Andre Wild <wild@linux.vnet.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 17:16:11 -07:00
Tetsuo Handa
b0ba2d0faf mm/page_io.c: fix oops during block io poll in swapin path
When a thread is OOM-killed during swap_readpage() operation, an oops
occurs because end_swap_bio_read() is calling wake_up_process() based on
an assumption that the thread which called swap_readpage() is still
alive.

  Out of memory: Kill process 525 (polkitd) score 0 or sacrifice child
  Killed process 525 (polkitd) total-vm:528128kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
  oom_reaper: reaped process 525 (polkitd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
  Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp ppdev pcspkr vmw_balloon sg shpchp vmw_vmci parport_pc parport i2c_piix4 ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi vmwgfx ahci libahci drm_kms_helper ata_piix syscopyarea sysfillrect sysimgblt fb_sys_fops mptspi scsi_transport_spi ttm e1000 mptscsih drm mptbase i2c_core libata serio_raw
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.13.0-rc2-next-20170725 #129
  Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
  task: ffffffffb7c16500 task.stack: ffffffffb7c00000
  RIP: 0010:__lock_acquire+0x151/0x12f0
  Call Trace:
   <IRQ>
   lock_acquire+0x59/0x80
   _raw_spin_lock_irqsave+0x3b/0x4f
   try_to_wake_up+0x3b/0x410
   wake_up_process+0x10/0x20
   end_swap_bio_read+0x6f/0xf0
   bio_endio+0x92/0xb0
   blk_update_request+0x88/0x270
   scsi_end_request+0x32/0x1c0
   scsi_io_completion+0x209/0x680
   scsi_finish_command+0xd4/0x120
   scsi_softirq_done+0x120/0x140
   __blk_mq_complete_request_remote+0xe/0x10
   flush_smp_call_function_queue+0x51/0x120
   generic_smp_call_function_single_interrupt+0xe/0x20
   smp_trace_call_function_single_interrupt+0x22/0x30
   smp_call_function_single_interrupt+0x9/0x10
   call_function_single_interrupt+0xa7/0xb0
   </IRQ>
  RIP: 0010:native_safe_halt+0x6/0x10
   default_idle+0xe/0x20
   arch_cpu_idle+0xa/0x10
   default_idle_call+0x1e/0x30
   do_idle+0x187/0x200
   cpu_startup_entry+0x6e/0x70
   rest_init+0xd0/0xe0
   start_kernel+0x456/0x477
   x86_64_start_reservations+0x24/0x26
   x86_64_start_kernel+0xf7/0x11a
   secondary_startup_64+0xa5/0xa5
  Code: c3 49 81 3f 20 9e 0b b8 41 bc 00 00 00 00 44 0f 45 e2 83 fe 01 0f 87 62 ff ff ff 89 f0 49 8b 44 c7 08 48 85 c0 0f 84 52 ff ff ff <f0> ff 80 98 01 00 00 8b 3d 5a 49 c4 01 45 8b b3 18 0c 00 00 85
  RIP: __lock_acquire+0x151/0x12f0 RSP: ffffa01f39e03c50
  ---[ end trace 6c441db499169b1e ]---
  Kernel panic - not syncing: Fatal exception in interrupt
  Kernel Offset: 0x36000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
  ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Fix it by holding a reference to the thread.

[akpm@linux-foundation.org: add comment]
Fixes: 23955622ff8d231b ("swap: add block io poll in swapin path")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Shaohua Li <shli@fb.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 17:16:11 -07:00
Minchan Kim
3189c82056 zram: do not free pool->size_class
Mike reported kernel goes oops with ltp:zram03 testcase.

  zram: Added device: zram0
  zram0: detected capacity change from 0 to 107374182400
  BUG: unable to handle kernel paging request at 0000306d61727a77
  IP: zs_map_object+0xb9/0x260
  PGD 0
  P4D 0
  Oops: 0000 [#1] SMP
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in: zram(E) xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) loop(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) x_tables(E) af_packet(E) br_netfilter(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) nls_iso8859_1(E) nls_cp437(E) vfat(E) fat(E) intel_powerclamp(E) coretemp(E) cdc_ether(E) kvm_intel(E) usbnet(E) mii(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) crc32c_intel(E) iTCO_wdt(E) ghash_clmulni_intel(E) bnx2(E) iTCO_vendor_support(E) pcbc(E) ioatdma(E) ipmi_ssif(E) aesni_intel(E) i5500_temp(E) i2c_i801(E) aes_x86_64(E) lpc_ich(E) shpchp(E) mfd_core(E) crypto_simd(E) i7core_edac(E) dca(E) glue_helper(E) cryptd(E) ipmi_si(E) button(E) acpi_cpufreq(E) ipmi_devintf(E) pcspkr(E) ipmi_msghandler(E)
   nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) ext4(E) crc16(E) mbcache(E) jbd2(E) sd_mod(E) ata_generic(E) i2c_algo_bit(E) ata_piix(E) drm_kms_helper(E) ahci(E) syscopyarea(E) sysfillrect(E) libahci(E) sysimgblt(E) fb_sys_fops(E) uhci_hcd(E) ehci_pci(E) ttm(E) ehci_hcd(E) libata(E) drm(E) megaraid_sas(E) usbcore(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E) autofs4(E) [last unloaded: zram]
  CPU: 6 PID: 12356 Comm: swapon Tainted: G            E   4.13.0.g87b2c3f-default #194
  Hardware name: IBM System x3550 M3 -[7944K3G]-/69Y5698     , BIOS -[D6E150AUS-1.10]- 12/15/2010
  task: ffff880158d2c4c0 task.stack: ffffc90001680000
  RIP: 0010:zs_map_object+0xb9/0x260
  Call Trace:
   zram_bvec_rw.isra.26+0xe8/0x780 [zram]
   zram_rw_page+0x6e/0xa0 [zram]
   bdev_read_page+0x81/0xb0
   do_mpage_readpage+0x51a/0x710
   mpage_readpages+0x122/0x1a0
   blkdev_readpages+0x1d/0x20
   __do_page_cache_readahead+0x1b2/0x270
   ondemand_readahead+0x180/0x2c0
   page_cache_sync_readahead+0x31/0x50
   generic_file_read_iter+0x7e7/0xaf0
   blkdev_read_iter+0x37/0x40
   __vfs_read+0xce/0x140
   vfs_read+0x9e/0x150
   SyS_read+0x46/0xa0
   entry_SYSCALL_64_fastpath+0x1a/0xa5
  Code: 81 e6 00 c0 3f 00 81 fe 00 00 16 00 0f 85 9f 01 00 00 0f b7 13 65 ff 05 5e 07 dc 7e 66 c1 ea 02 81 e2 ff 01 00 00 49 8b 54 d4 08 <8b> 4a 48 41 0f af ce 81 e1 ff 0f 00 00 41 89 c9 48 c7 c3 a0 70
  RIP: zs_map_object+0xb9/0x260 RSP: ffffc90001683988
  CR2: 0000306d61727a77

He bisected the problem is [1].

After commit cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size
handling"), zram doesn't use double pointer for pool->size_class any
more in zs_create_pool so counter function zs_destroy_pool don't need to
free it, either.

Otherwise, it does kfree wrong address and then, kernel goes Oops.

Link: http://lkml.kernel.org/r/20170725062650.GA12134@bbox
Fixes: cf8e0fedf078 ("mm/zsmalloc: simplify zs_max_alloc_size handling")
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reported-by: Mike Galbraith <efault@gmx.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Jerome Marchand <jmarchan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 16:34:47 -07:00
Arnd Bergmann
e7701557bf kasan: avoid -Wmaybe-uninitialized warning
gcc-7 produces this warning:

  mm/kasan/report.c: In function 'kasan_report':
  mm/kasan/report.c:351:3: error: 'info.first_bad_addr' may be used uninitialized in this function [-Werror=maybe-uninitialized]
     print_shadow_for_address(info->first_bad_addr);
     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  mm/kasan/report.c:360:27: note: 'info.first_bad_addr' was declared here

The code seems fine as we only print info.first_bad_addr when there is a
shadow, and we always initialize it in that case, but this is relatively
hard for gcc to figure out after the latest rework.

Adding an intialization to the most likely value together with the other
struct members shuts up that warning.

Fixes: b235b9808664 ("kasan: unify report headers")
Link: https://patchwork.kernel.org/patch/9641417/
Link: http://lkml.kernel.org/r/20170725152739.4176967-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Alexander Potapenko <glider@google.com>
Suggested-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 16:34:46 -07:00
Mike Rapoport
b228237193 userfaultfd: non-cooperative: notify about unmap of destination during mremap
When mremap is called with MREMAP_FIXED it unmaps memory at the
destination address without notifying userfaultfd monitor.

If the destination were registered with userfaultfd, the monitor has no
way to distinguish between the old and new ranges and to properly relate
the page faults that would occur in the destination region.

Fixes: 897ab3e0c49e ("userfaultfd: non-cooperative: add event for memory unmaps")
Link: http://lkml.kernel.org/r/1500276876-3350-1-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 16:34:46 -07:00
Mel Gorman
3ea277194d mm, mprotect: flush TLB if potentially racing with a parallel reclaim leaving stale TLB entries
Nadav Amit identified a theoritical race between page reclaim and
mprotect due to TLB flushes being batched outside of the PTL being held.

He described the race as follows:

        CPU0                            CPU1
        ----                            ----
                                        user accesses memory using RW PTE
                                        [PTE now cached in TLB]
        try_to_unmap_one()
        ==> ptep_get_and_clear()
        ==> set_tlb_ubc_flush_pending()
                                        mprotect(addr, PROT_READ)
                                        ==> change_pte_range()
                                        ==> [ PTE non-present - no flush ]

                                        user writes using cached RW PTE
        ...

        try_to_unmap_flush()

The same type of race exists for reads when protecting for PROT_NONE and
also exists for operations that can leave an old TLB entry behind such
as munmap, mremap and madvise.

For some operations like mprotect, it's not necessarily a data integrity
issue but it is a correctness issue as there is a window where an
mprotect that limits access still allows access.  For munmap, it's
potentially a data integrity issue although the race is massive as an
munmap, mmap and return to userspace must all complete between the
window when reclaim drops the PTL and flushes the TLB.  However, it's
theoritically possible so handle this issue by flushing the mm if
reclaim is potentially currently batching TLB flushes.

Other instances where a flush is required for a present pte should be ok
as either the page lock is held preventing parallel reclaim or a page
reference count is elevated preventing a parallel free leading to
corruption.  In the case of page_mkclean there isn't an obvious path
that userspace could take advantage of without using the operations that
are guarded by this patch.  Other users such as gup as a race with
reclaim looks just at PTEs.  huge page variants should be ok as they
don't race with reclaim.  mincore only looks at PTEs.  userfault also
should be ok as if a parallel reclaim takes place, it will either fault
the page back in or read some of the data before the flush occurs
triggering a fault.

Note that a variant of this patch was acked by Andy Lutomirski but this
was for the x86 parts on top of his PCID work which didn't make the 4.13
merge window as expected.  His ack is dropped from this version and
there will be a follow-on patch on top of PCID that will include his
ack.

[akpm@linux-foundation.org: tweak comments]
[akpm@linux-foundation.org: fix spello]
Link: http://lkml.kernel.org/r/20170717155523.emckq2esjro6hf3z@suse.de
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: <stable@vger.kernel.org>	[v4.4+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 16:34:46 -07:00
Daniel Jordan
2be7cfed99 mm/hugetlb.c: __get_user_pages ignores certain follow_hugetlb_page errors
Commit 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when
FOLL_HWPOISON is specified") causes __get_user_pages to ignore certain
errors from follow_hugetlb_page.  After such error, __get_user_pages
subsequently calls faultin_page on the same VMA and start address that
follow_hugetlb_page failed on instead of returning the error immediately
as it should.

In follow_hugetlb_page, when hugetlb_fault returns a value covered under
VM_FAULT_ERROR, follow_hugetlb_page returns it without setting nr_pages
to 0 as __get_user_pages expects in this case, which causes the
following to happen in __get_user_pages: the "while (nr_pages)" check
succeeds, we skip the "if (!vma..." check because we got a VMA the last
time around, we find no page with follow_page_mask, and we call
faultin_page, which calls hugetlb_fault for the second time.

This issue also slightly changes how __get_user_pages works.  Before, it
only returned error if it had made no progress (i = 0).  But now,
follow_hugetlb_page can clobber "i" with an error code since its new
return path doesn't check for progress.  So if "i" is nonzero before a
failing call to follow_hugetlb_page, that indication of progress is lost
and __get_user_pages can return error even if some pages were
successfully pinned.

To fix this, change follow_hugetlb_page so that it updates nr_pages,
allowing __get_user_pages to fail immediately and restoring the "error
only if no progress" behavior to __get_user_pages.

Tested that __get_user_pages returns when expected on error from
hugetlb_fault in follow_hugetlb_page.

Fixes: 9a291a7c9428 ("mm/hugetlb: report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified")
Link: http://lkml.kernel.org/r/1500406795-58462-1-git-send-email-daniel.m.jordan@oracle.com
Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: James Morse <james.morse@arm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: zhong jiang <zhongjiang@huawei.com>
Cc: <stable@vger.kernel.org>	[4.12.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-08-02 16:34:46 -07:00
Jeff Layton
ffb959bbdf mm: remove optimizations based on i_size in mapping writeback waits
Marcelo added this i_size based optimization with a patch in 2004
(commitid is from the linux-history tree):

    commit 765dad09b4ac101a32d87af2bb793c3060497d3c
    Author: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
    Date:   Tue Sep 7 17:51:17 2004 -0700

	small wait_on_page_writeback_range() optimization

	filemap_fdatawait() calls wait_on_page_writeback_range() with -1
	as "end" parameter.  This is not needed since we know the EOF
	from the inode.  Use that instead.

There may be races here, particularly with clustered or network
filesystems. It also seems like a bit of a layering violation since
we're operating on an address_space here, not an inode.

Finally, it's also questionable whether this optimization really helps
on workloads that we care about. Should we be optimizing for writeback
vs. truncate races in a codepath where we expect to wait anyway? It
doesn't seem worth the risk.

Remove this optimization from the filemap_fdatawait codepaths. This
means that filemap_fdatawait becomes a trivial wrapper around
filemap_fdatawait_range.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-08-01 08:39:29 -04:00
Jeff Layton
a823e4589e mm: add file_fdatawait_range and file_write_and_wait
Necessary now for gfs2_fsync and sync_file_range, but there will
eventually be other callers.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-31 19:12:26 -04:00
Jeff Layton
9326c9b20d mm: consolidate dax / non-dax checks for writeback
We have this complex conditional copied to several places. Turn it into
a helper function.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-29 09:01:02 -04:00
Dennis Zhou (Facebook)
5e81ee3e6a percpu: update header to contain bitmap allocator explanation.
The other patches contain a lot of information, so adding this
information in a separate patch. It adds my copyright and a brief
explanation of how the bitmap allocator works. There is a minor typo as
well in the prior explanation so that is fixed.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:06 -04:00
Dennis Zhou (Facebook)
b4c2116cfa percpu: update pcpu_find_block_fit to use an iterator
The simple, and expensive, way to find a free area is to iterate over
the entire bitmap until an area is found that fits the allocation size
and alignment. This patch makes use of an iterate that find an area to
check by using the block level contig hints. It will only return an area
that can fit the size and alignment request. If the request can fit
inside a block, it returns the first_free bit to start checking from to
see if it can be fulfilled prior to the contig hint. The pcpu_alloc_area
check has a bound of a block size added in case it is wrong.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:06 -04:00
Dennis Zhou (Facebook)
525ca84dae percpu: use metadata blocks to update the chunk contig hint
The largest free region will either be a block level contig hint or an
aggregate over the left_free and right_free areas of blocks. This is a
much smaller set of free areas that need to be checked than a full
traverse.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:06 -04:00
Dennis Zhou (Facebook)
b185cd0dc6 percpu: update free path to take advantage of contig hints
The bitmap allocator must keep metadata consistent. The easiest way is
to scan after every allocation for each affected block and the entire
chunk. This is rather expensive.

The free path can take advantage of current contig hints to prevent
scanning within the start and end block.  If a scan is needed, it can
be done by scanning backwards from the start and forwards from the end
to identify the entire free area this can be combined with. The blocks
can then be updated by some basic checks rather than complete block
scans.

A chunk scan happens when the freed area makes a page free, a block
free, or spans across blocks. This is necessary as the contig hint at
this point could span across blocks. The check uses the minimum of page
size and the block size to allow for variable sized blocks. There is a
tradeoff here with not updating after every free. It is possible a
contig hint in one block can be merged with the contig hint in the next
block. This means the contig hint can be off by up to a page. However,
if the chunk's contig hint is contained in one block, the contig hint
will be accurate.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:06 -04:00
Dennis Zhou (Facebook)
fc3043345a percpu: update alloc path to only scan if contig hints are broken
Metadata is kept per block to keep track of where the contig hints are.
Scanning can be avoided when the contig hints are not broken. In that
case, left and right contigs have to be managed manually.

This patch changes the allocation path hint updating to only scan when
contig hints are broken.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:06 -04:00
Dennis Zhou (Facebook)
268625a6f9 percpu: keep track of the best offset for contig hints
This patch makes the contig hint starting offset optimization from the
previous patch as honest as it can be. For both chunk and block starting
offsets, make sure it keeps the starting offset with the best alignment.

The block skip optimization is added in a later patch when the
pcpu_find_block_fit iterator is swapped in.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:05 -04:00
Dennis Zhou (Facebook)
13f966373f percpu: skip chunks if the alloc does not fit in the contig hint
This patch adds chunk->contig_bits_start to keep track of the contig
hint's offset and the check to skip the chunk if it does not fit. If
the chunk's contig hint starting offset cannot satisfy an allocation,
the allocator assumes there is enough memory pressure in this chunk to
either use a different chunk or create a new one. This accepts a less
tight packing for a smoother latency curve.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:05 -04:00
Dennis Zhou (Facebook)
86b442fbce percpu: add first_bit to keep track of the first free in the bitmap
This patch adds first_bit to keep track of the first free bit in the
bitmap. This hint helps prevent scanning of fully allocated blocks.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:05 -04:00
Dennis Zhou (Facebook)
ca460b3c96 percpu: introduce bitmap metadata blocks
This patch introduces the bitmap metadata blocks and adds the skeleton
of the code that will be used to maintain these blocks.  Each chunk's
bitmap is made up of full metadata blocks. These blocks maintain basic
metadata to help prevent scanning unnecssarily to update hints. Full
scanning methods are used for the skeleton and will be replaced in the
coming patches. A number of helper functions are added as well to do
conversion of pages to blocks and manage offsets. Comments will be
updated as the final version of each function is added.

There exists a relationship between PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE,
the region size, and unit_size. Every chunk's region (including offsets)
is page aligned at the beginning to preserve alignment. The end is
aligned to LCM(PAGE_SIZE, PCPU_BITMAP_BLOCK_SIZE) to ensure that the end
can fit with the populated page map which is by page and every metadata
block is fully accounted for. The unit_size is already page aligned, but
must also be aligned with PCPU_BITMAP_BLOCK_SIZE to ensure full metadata
blocks.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:05 -04:00
Dennis Zhou (Facebook)
40064aeca3 percpu: replace area map allocator with bitmap
The percpu memory allocator is experiencing scalability issues when
allocating and freeing large numbers of counters as in BPF.
Additionally, there is a corner case where iteration is triggered over
all chunks if the contig_hint is the right size, but wrong alignment.

This patch replaces the area map allocator with a basic bitmap allocator
implementation. Each subsequent patch will introduce new features and
replace full scanning functions with faster non-scanning options when
possible.

Implementation:
This patchset removes the area map allocator in favor of a bitmap
allocator backed by metadata blocks. The primary goal is to provide
consistency in performance and memory footprint with a focus on small
allocations (< 64 bytes). The bitmap removes the heavy memmove from the
freeing critical path and provides a consistent memory footprint. The
metadata blocks provide a bound on the amount of scanning required by
maintaining a set of hints.

In an effort to make freeing fast, the metadata is updated on the free
path if the new free area makes a page free, a block free, or spans
across blocks. This causes the chunk's contig hint to potentially be
smaller than what it could allocate by up to the smaller of a page or a
block. If the chunk's contig hint is contained within a block, a check
occurs and the hint is kept accurate. Metadata is always kept accurate
on allocation, so there will not be a situation where a chunk has a
later contig hint than available.

Evaluation:
I have primarily done testing against a simple workload of allocation of
1 million objects (2^20) of varying size. Deallocation was done by in
order, alternating, and in reverse. These numbers were collected after
rebasing ontop of a80099a152. I present the worst-case numbers here:

  Area Map Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        310      |     4770
             16B    |        557      |     1325
             64B    |        436      |      273
            256B    |        776      |      131
           1024B    |       3280      |      122

  Bitmap Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        490      |       70
             16B    |        515      |       75
             64B    |        610      |       80
            256B    |        950      |      100
           1024B    |       3520      |      200

This data demonstrates the inability for the area map allocator to
handle less than ideal situations. In the best case of reverse
deallocation, the area map allocator was able to perform within range
of the bitmap allocator. In the worst case situation, freeing took
nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
dramatically improves the consistency of the free path. The small
allocations performed nearly identical regardless of the freeing
pattern.

While it does add to the allocation latency, the allocation scenario
here is optimal for the area map allocator. The area map allocator runs
into trouble when it is allocating in chunks where the latter half is
full. It is difficult to replicate this, so I present a variant where
the pages are second half filled. Freeing was done sequentially. Below
are the numbers for this scenario:

  Area Map Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |       4118      |     4892
             16B    |       1651      |     1163
             64B    |        598      |      285
            256B    |        771      |      158
           1024B    |       3034      |      160

  Bitmap Allocator:

        Object Size | Alloc Time (ms) | Free Time (ms)
        ----------------------------------------------
              4B    |        481      |       67
             16B    |        506      |       69
             64B    |        636      |       75
            256B    |        892      |       90
           1024B    |       3262      |      147

The data shows a parabolic curve of performance for the area map
allocator. This is due to the memmove operation being the dominant cost
with the lower object sizes as more objects are packed in a chunk and at
higher object sizes, the traversal of the chunk slots is the dominating
cost. The bitmap allocator suffers this problem as well. The above data
shows the inability to scale for the allocation path with the area map
allocator and that the bitmap allocator demonstrates consistent
performance in general.

The second problem of additional scanning can result in the area map
allocator completing in 52 minutes when trying to allocate 1 million
4-byte objects with 8-byte alignment. The same workload takes
approximately 16 seconds to complete for the bitmap allocator.

V2:
Fixed a bug in pcpu_alloc_first_chunk end_offset was setting the bitmap
using bytes instead of bits.

Added a comment to pcpu_cnt_pop_pages to explain bitmap_weight.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 17:41:05 -04:00
Jeff Layton
3acdfd280f errseq: rename __errseq_set to errseq_set
Nothing calls this wrapper anymore, so just remove it and rename the
old function to get rid of the double underscore prefix.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-26 12:24:36 -04:00
Dennis Zhou (Facebook)
91e914c5a4 percpu: generalize bitmap (un)populated iterators
The area map allocator only used a bitmap for the backing page state.
The new bitmap allocator will use bitmaps to manage the allocation
region in addition to this.

This patch generalizes the bitmap iterators so they can be reused with
the bitmap allocator.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:53 -04:00
Dennis Zhou (Facebook)
d2f3c38494 percpu: increase minimum percpu allocation size and align first regions
This patch increases the minimum allocation size of percpu memory to
4-bytes. This change will help minimize the metadata overhead
associated with the bitmap allocator. The assumption is that most
allocations will be of objects or structs greater than 2 bytes with
integers or longs being used rather than shorts.

The first chunk regions are now aligned with the minimum allocation
size. The reserved region is expected to be set as a multiple of the
minimum allocation size. The static region is aligned up and the delta
is removed from the dynamic size. This works because the dynamic size is
increased to be page aligned. If the static size is not minimum
allocation size aligned, then there must be a gap that is added to the
dynamic size. The dynamic size will never be smaller than the set value.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:53 -04:00
Dennis Zhou (Facebook)
0cecf50cf0 percpu: introduce nr_empty_pop_pages to help empty page accounting
pcpu_nr_empty_pop_pages is used to ensure there are a handful of free
pages around to serve atomic allocations. A new field, nr_empty_pop_pages,
is added to the pcpu_chunk struct to keep track of the number of empty
pages. This field is needed as the number of empty populated pages is
globally tracked and deltas are used to update in the bitmap allocator.
Pages that contain a hidden area are not considered to be empty. This
new field is exposed in percpu_stats.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:53 -04:00
Dennis Zhou (Facebook)
8ab16c43ea percpu: change the number of pages marked in the first_chunk pop bitmap
The populated bitmap represents the state of the pages the chunk serves.
Prior, the bitmap was marked completely used as the first chunk was
allocated and immutable. This is misleading because the first chunk may
not be completely filled. Additionally, with moving the base_addr up in
the previous patch, the population check no longer corresponds to what
was being checked.

This patch modifies the population map to be only the number of pages
the region serves and to make what it was checking correspond correctly
again. The change is to remove any misunderstanding between the size of
the populated bitmap and the actual size of it. The work function page
iterators now use nr_pages for the check rather than pcpu_unit_pages
because nr_populated is now chunk specific. Without this, the work
function would try to populate the remainder of these chunks despite it
not serving any more than nr_pages when nr_pages is set less than
pcpu_unit_pages.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:53 -04:00
Dennis Zhou (Facebook)
560f2c2366 percpu: combine percpu address checks
The percpu address checks for the reserved and dynamic region chunks are
now specific to each region. The address checking logic can be combined
taking advantage of the global references to the dynamic and static
region chunks.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:53 -04:00
Dennis Zhou (Facebook)
c0ebfdc3fe percpu: modify base_addr to be region specific
Originally, the first chunk was served by one or two chunks, each
given a region they are responsible for. Despite this, the arithmetic
was based off of the true base_addr of the chunk making it be overly
inclusive.

This patch moves the base_addr of chunks that are responsible for the
first chunk. The base_addr must remain page aligned to keep the
address alignment correct, so it is the beginning of the region served
page aligned down. start_offset holds where the region served begins
from this new base_addr.

The corresponding percpu address checks are modified to be more specific
as a result. The first chunk considers only the dynamic region and both
first chunk and reserved chunk checks ignore the static region. The
static region addresses should never be passed into the allocator. There
is no impact here besides distinguishing the first chunk and making the
checks specific.

The percpu pointer to physical address is left intact as addresses are
not given out in the non-allocated portion of percpu memory.

nr_pages is added to pcpu_chunk to keep track of the size of the entire
region served containing both start_offset and end_offset. This variable
will be used to manage the bitmap allocator.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:52 -04:00
Dennis Zhou (Facebook)
0c4169c3d1 percpu: setup_first_chunk rename schunk/dchunk to chunk
There is no need to have the static chunk and dynamic chunk be named
separately as the allocations are sequential. This preemptively solves
the misnomer problem with the base_addrs being moved up in the following
patch. It also removes a ternary operation deciding the first chunk.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:52 -04:00
Dennis Zhou (Facebook)
6b9d7c8e8e percpu: end chunk area maps page aligned for the populated bitmap
The area map allocator manages the first chunk area by hiding all but
the region it is responsible for serving in the area map. To align this
with the populated page bitmap, end_offset is introduced to keep track
of the delta to end page aligned. The area map is appended with the
page aligned end when necessary to be in line with how the bitmap
allocator requires the ending to be aligned with the LCM of PAGE_SIZE
and the size of each bitmap block. percpu_stats is updated to ignore
this region when present.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:52 -04:00
Dennis Zhou (Facebook)
10edf5b0b6 percpu: unify allocation of schunk and dchunk
Create a common allocator for first chunk initialization,
pcpu_alloc_first_chunk. Comments for this function will be added in a
later patch once the bitmap allocator is added.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:52 -04:00
Dennis Zhou (Facebook)
b9c39442ce percpu: setup_first_chunk remove dyn_size and consolidate logic
There is logic for setting variables in the static chunk init code that
could be consolidated with the dynamic chunk init code. This combines
this logic to setup for combining the allocation paths. reserved_size is
used as the conditional as a dynamic region will always exist.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:52 -04:00
Dennis Zhou (Facebook)
4af1e6fbd8 percpu: remove has_reserved from pcpu_chunk
Prior this variable was used to manage statistics when the first chunk
had a reserved region. The previous patch introduced start_offset to
keep track of the offset by value rather than boolean. Therefore,
has_reserved can be removed.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:51 -04:00
Dennis Zhou (Facebook)
e226670566 percpu: introduce start_offset to pcpu_chunk
The reserved chunk arithmetic uses a global variable
pcpu_reserved_chunk_limit that is set in the first chunk init code to
hide a portion of the area map. The bitmap allocator to come will
eventually move the base_addr up and require both the reserved chunk
and static chunk to maintain this offset. pcpu_reserved_chunk_limit is
removed and start_offset is added.

The first chunk that is circulated and is pcpu_first_chunk serves the
dynamic region, the region following the reserved region. The reserved
chunk address check will temporarily use the first chunk to identify its
address range. A following patch will increase the base_addr and remove
this. If there is no reserved chunk, this will check the static region
and return false because those values should never be passed into the
allocator.

Lastly, when linking in the first chunk, make sure to count the right
free region for the number of empty populated pages.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:51 -04:00
Dennis Zhou (Facebook)
fb29a2cc6b percpu: setup_first_chunk enforce dynamic region must exist
The first chunk is handled as a special case as it is composed of the
static, reserved, and dynamic regions. The code handles each case
individually. The next several patches will merge these code paths and
lay the foundation for the bitmap allocator.

This patch modifies logic to enforce that a dynamic region exists and
changes the area map to account for that. This brings the logic closer
to the dynamic chunk's init logic.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Reviewed-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-26 10:23:51 -04:00
Dmitry Vyukov
f06e8c584f kasan: Allow kasan_check_read/write() to accept pointers to volatiles
Currently kasan_check_read/write() accept 'const void*', make them
accept 'const volatile void*'. This is required for instrumentation
of atomic operations and there is just no reason to not allow that.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kasan-dev@googlegroups.com
Cc: linux-mm@kvack.org
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/33e5ec275c1ee89299245b2ebbccd63709c6021f.1498140838.git.dvyukov@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-26 13:08:54 +02:00
Tejun Heo
bc2fb7ed08 cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS
css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, it walks all tasks as before.  When asserted, the
iterator only walks the group leaders.

For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.

Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-21 11:14:51 -04:00
Tom Lendacky
8f716c9b5f x86/mm: Add support to access boot related data in the clear
Boot data (such as EFI related data) is not encrypted when the system is
booted because UEFI/BIOS does not run with SME active. In order to access
this data properly it needs to be mapped decrypted.

Update early_memremap() to provide an arch specific routine to modify the
pagetable protection attributes before they are applied to the new
mapping. This is used to remove the encryption mask for boot related data.

Update memremap() to provide an arch specific routine to determine if RAM
remapping is allowed.  RAM remapping will cause an encrypted mapping to be
generated. By preventing RAM remapping, ioremap_cache() will be used
instead, which will provide a decrypted mapping of the boot related data.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/81fb6b4117a5df6b9f2eda342f81bbef4b23d2e5.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-18 11:38:02 +02:00
Tom Lendacky
f88a68facd x86/mm: Extend early_memremap() support with additional attrs
Add early_memremap() support to be able to specify encrypted and
decrypted mappings with and without write-protection. The use of
write-protection is necessary when encrypting data "in place". The
write-protect attribute is considered cacheable for loads, but not
stores. This implies that the hardware will never give the core a
dirty line with this memtype.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Toshimitsu Kani <toshi.kani@hpe.com>
Cc: kasan-dev@googlegroups.com
Cc: kvm@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/479b5832c30fae3efa7932e48f81794e86397229.1500319216.git.thomas.lendacky@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-07-18 11:38:00 +02:00
Dennis Zhou (Facebook)
9c01516278 percpu: update the header comment and pcpu_build_alloc_info comments
The header comment for percpu memory is a little hard to parse and is
not super clear about how the first chunk is managed. This adds a
little more clarity to the situation.

There is also quite a bit of tricky logic in the pcpu_build_alloc_info.
This adds a restructure of a comment to add a little more information.
Unfortunately, you will still have to piece together a handful of other
comments too, but should help direct you to the meaningful comments.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17 10:53:59 -04:00
Dennis Zhou (Facebook)
6b9b6f3994 percpu: expose pcpu_nr_empty_pop_pages in pcpu_stats
Percpu memory holds a minimum threshold of pages that are populated
in order to serve atomic percpu memory requests. This change makes it
easier to verify that there are a minimum number of populated pages
lying around.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17 10:46:58 -04:00
Dennis Zhou (Facebook)
02459164a2 percpu: change the format for percpu_stats output
This makes the debugfs output for percpu_stats a little easier
to read by changing the spacing of the output to be consistent.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17 10:45:46 -04:00
Dennis Zhou (Facebook)
cd6a884d09 percpu: pcpu-stats change void buffer to int buffer
Changes the use of a void buffer to an int buffer for clarity.

Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2017-07-17 10:45:42 -04:00
Linus Torvalds
78dcf73421 Merge branch 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ->s_options removal from Al Viro:
 "Preparations for fsmount/fsopen stuff (coming next cycle). Everything
  gets moved to explicit ->show_options(), killing ->s_options off +
  some cosmetic bits around fs/namespace.c and friends. Basically, the
  stuff needed to work with fsmount series with minimum of conflicts
  with other work.

  It's not strictly required for this merge window, but it would reduce
  the PITA during the coming cycle, so it would be nice to have those
  bits and pieces out of the way"

* 'work.mount' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  isofs: Fix isofs_show_options()
  VFS: Kill off s_options and helpers
  orangefs: Implement show_options
  9p: Implement show_options
  isofs: Implement show_options
  afs: Implement show_options
  affs: Implement show_options
  befs: Implement show_options
  spufs: Implement show_options
  bpf: Implement show_options
  ramfs: Implement show_options
  pstore: Implement show_options
  omfs: Implement show_options
  hugetlbfs: Implement show_options
  VFS: Don't use save/replace_mount_options if not using generic_show_options
  VFS: Provide empty name qstr
  VFS: Make get_filesystem() return the affected filesystem
  VFS: Clean up whitespace in fs/namespace.c and fs/super.c
  Provide a function to create a NUL-terminated string from unterminated data
2017-07-15 12:00:42 -07:00
Helge Deller
37511fb5c9 mm: fix overflow check in expand_upwards()
Jörn Engel noticed that the expand_upwards() function might not return
-ENOMEM in case the requested address is (unsigned long)-PAGE_SIZE and
if the architecture didn't defined TASK_SIZE as multiple of PAGE_SIZE.

Affected architectures are arm, frv, m68k, blackfin, h8300 and xtensa
which all define TASK_SIZE as 0xffffffff, but since none of those have
an upwards-growing stack we currently have no actual issue.

Nevertheless let's fix this just in case any of the architectures with
an upward-growing stack (currently parisc, metag and partly ia64) define
TASK_SIZE similar.

Link: http://lkml.kernel.org/r/20170702192452.GA11868@p100.box
Fixes: bd726c90b6b8 ("Allow stack to grow up to address space limit")
Signed-off-by: Helge Deller <deller@gmx.de>
Reported-by: Jörn Engel <joern@purestorage.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-14 15:05:12 -07:00
Nikolay Borisov
3e8f399da4 writeback: rework wb_[dec|inc]_stat family of functions
Currently the writeback statistics code uses a percpu counters to hold
various statistics.  Furthermore we have 2 families of functions - those
which disable local irq and those which doesn't and whose names begin
with double underscore.  However, they both end up calling
__add_wb_stats which in turn calls percpu_counter_add_batch which is
already irq-safe.

Exploiting this fact allows to eliminated the __wb_* functions since
they don't add any further protection than we already have.
Furthermore, refactor the wb_* function to call __add_wb_stat directly
without the irq-disabling dance.  This will likely result in better
runtime of code which deals with modifying the stat counters.

While at it also document why percpu_counter_add_batch is in fact
preempt and irq-safe since at least 3 people got confused.

Link: http://lkml.kernel.org/r/1498029937-27293-1-git-send-email-nborisov@suse.com
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-12 16:26:05 -07:00