39774 Commits

Author SHA1 Message Date
Matthew Wilcox (Oracle)
fcb72a585a perf: use VMA iterator
The VMA iterator is faster than the linked list and removing the linked
list will shrink the vm_area_struct.

Link: https://lkml.kernel.org/r/20220906194824.2110408-48-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:22 -07:00
Matthew Wilcox (Oracle)
160c820023 acct: use VMA iterator instead of linked list
The VMA iterator is faster than the linked list.

Link: https://lkml.kernel.org/r/20220906194824.2110408-47-Liam.Howlett@oracle.com
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:22 -07:00
Liam R. Howlett
7964cf8caa mm: remove vmacache
By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code.  Remove the vmacache
to reduce the work in keeping it up to date and code complexity.

Link: https://lkml.kernel.org/r/20220906194824.2110408-26-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:18 -07:00
Liam R. Howlett
524e00b36e mm: remove rb tree.
Remove the RB tree and start using the maple tree for vm_area_struct
tracking.

Drop validate_mm() calls in expand_upwards() and expand_downwards() as the
lock is not held.

Link: https://lkml.kernel.org/r/20220906194824.2110408-18-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:16 -07:00
Liam R. Howlett
c9dbe82cb9 kernel/fork: use maple tree for dup_mmap() during forking
The maple tree was already tracking VMAs in this function by an earlier
commit, but the rbtree iterator was being used to iterate the list.
Change the iterator to use a maple tree native iterator and switch to the
maple tree advanced API to avoid multiple walks of the tree during insert
operations.  Unexport the now-unused vma_store() function.

For performance reasons we bulk allocate the maple tree nodes.  The node
calculations are done internally to the tree and use the VMA count and
assume the worst-case node requirements.  The VM_DONT_COPY flag does not
allow for the most efficient copy method of the tree and so a bulk loading
algorithm is used.

Link: https://lkml.kernel.org/r/20220906194824.2110408-15-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:16 -07:00
Liam R. Howlett
d4af56c5c7 mm: start tracking VMAs with maple tree
Start tracking the VMAs with the new maple tree structure in parallel with
the rb_tree.  Add debug and trace events for maple tree operations and
duplicate the rb_tree that is created on forks into the maple tree.

The maple tree is added to the mm_struct including the mm_init struct,
added support in required mm/mmap functions, added tracking in kernel/fork
for process forking, and used to find the unmapped_area and checked
against what the rbtree finds.

This also moves the mmap_lock() in exit_mmap() since the oom reaper call
does walk the VMAs.  Otherwise lockdep will be unhappy if oom happens.

When splitting a vma fails due to allocations of the maple tree nodes,
the error path in __split_vma() calls new->vm_ops->close(new).  The page
accounting for hugetlb is actually in the close() operation,  so it
accounts for the removal of 1/2 of the VMA which was not adjusted.  This
results in a negative exit value.  To avoid the negative charge, set
vm_start = vm_end and vm_pgoff = 0.

There is also a potential accounting issue in special mappings from
insert_vm_struct() failing to allocate, so reverse the charge there in
the failure scenario.

Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:14 -07:00
Aneesh Kumar K.V
467b171af8 mm/demotion: update node_is_toptier to work with memory tiers
With memory tier support we can have memory only NUMA nodes in the top
tier from which we want to avoid promotion tracking NUMA faults.  Update
node_is_toptier to work with memory tiers.  All NUMA nodes are by default
top tier nodes.  With lower(slower) memory tiers added we consider all
memory tiers above a memory tier having CPU NUMA nodes as a top memory
tier

[sj@kernel.org: include missed header file, memory-tiers.h]
  Link: https://lkml.kernel.org/r/20220820190720.248704-1-sj@kernel.org
[akpm@linux-foundation.org: mm/memory.c needs linux/memory-tiers.h]
[aneesh.kumar@linux.ibm.com: make toptier_distance inclusive upper bound of toptiers]
  Link: https://lkml.kernel.org/r/20220830081457.118960-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20220818131042.113280-10-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Acked-by: Wei Xu <weixugc@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Hesham Almatary <hesham.almatary@huawei.com>
Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:12 -07:00
Yu Zhao
354ed59744 mm: multi-gen LRU: kill switch
Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
can be disabled include:
  0x0001: the multi-gen LRU core
  0x0002: walking page table, when arch_has_hw_pte_young() returns
          true
  0x0004: clearing the accessed bit in non-leaf PMD entries, when
          CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
  [yYnN]: apply to all the components above
E.g.,
  echo y >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0007
  echo 5 >/sys/kernel/mm/lru_gen/enabled
  cat /sys/kernel/mm/lru_gen/enabled
  0x0005

NB: the page table walks happen on the scale of seconds under heavy memory
pressure, in which case the mmap_lock contention is a lesser concern,
compared with the LRU lock contention and the I/O congestion.  So far the
only well-known case of the mmap_lock contention happens on Android, due
to Scudo [1] which allocates several thousand VMAs for merely a few
hundred MBs.  The SPF and the Maple Tree also have provided their own
assessments [2][3].  However, if walking page tables does worsen the
mmap_lock contention, the kill switch can be used to disable it.  In this
case the multi-gen LRU will suffer a minor performance degradation, as
shown previously.

Clearing the accessed bit in non-leaf PMD entries can also be disabled,
since this behavior was not tested on x86 varieties other than Intel and
AMD.

[1] https://source.android.com/devices/tech/debug/scudo
[2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
[3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:10 -07:00
Yu Zhao
bd74fdaea1 mm: multi-gen LRU: support page table walks
To further exploit spatial locality, the aging prefers to walk page tables
to search for young PTEs and promote hot pages.  A kill switch will be
added in the next patch to disable this behavior.  When disabled, the
aging relies on the rmap only.

NB: this behavior has nothing similar with the page table scanning in the
2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
to swapcache and unmaps them.

To avoid confusion, the term "iteration" specifically means the traversal
of an entire mm_struct list; the term "walk" will be applied to page
tables and the rmap, as usual.

An mm_struct list is maintained for each memcg, and an mm_struct follows
its owner task to the new memcg when this task is migrated.  Given an
lruvec, the aging iterates lruvec_memcg()->mm_list and calls
walk_page_range() with each mm_struct on this list to promote hot pages
before it increments max_seq.

When multiple page table walkers iterate the same list, each of them gets
a unique mm_struct; therefore they can run concurrently.  Page table
walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
pages it left in the previous memcg will not be promoted when its current
memcg is under reclaim.  Similarly, page table walkers will not promote
pages from nodes other than the one under reclaim.

This patch uses the following optimizations when walking page tables:
1. It tracks the usage of mm_struct's between context switches so that
   page table walkers can skip processes that have been sleeping since
   the last iteration.
2. It uses generational Bloom filters to record populated branches so
   that page table walkers can reduce their search space based on the
   query results, e.g., to skip page tables containing mostly holes or
   misplaced pages.
3. It takes advantage of the accessed bit in non-leaf PMD entries when
   CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
4. It does not zigzag between a PGD table and the same PMD table
   spanning multiple VMAs. IOW, it finishes all the VMAs within the
   range of the same PMD table before it returns to a PGD table. This
   improves the cache performance for workloads that have large
   numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.

Server benchmark results:
  Single workload:
    fio (buffered I/O): no change

  Single workload:
    memcached (anon): +[8, 10]%
                Ops/sec      KB/sec
      patch1-7: 1147696.57   44640.29
      patch1-8: 1245274.91   48435.66

  Configurations:
    no change

Client benchmark results:
  kswapd profiles:
    patch1-7
      48.16%  lzo1x_1_do_compress (real work)
       8.20%  page_vma_mapped_walk (overhead)
       7.06%  _raw_spin_unlock_irq
       2.92%  ptep_clear_flush
       2.53%  __zram_bvec_write
       2.11%  do_raw_spin_lock
       2.02%  memmove
       1.93%  lru_gen_look_around
       1.56%  free_unref_page_list
       1.40%  memset

    patch1-8
      49.44%  lzo1x_1_do_compress (real work)
       6.19%  page_vma_mapped_walk (overhead)
       5.97%  _raw_spin_unlock_irq
       3.13%  get_pfn_folio
       2.85%  ptep_clear_flush
       2.42%  __zram_bvec_write
       2.08%  do_raw_spin_lock
       1.92%  memmove
       1.44%  alloc_zspage
       1.36%  memset

  Configurations:
    no change

Thanks to the following developers for their efforts [3].
  kernel test robot <lkp@intel.com>

[1] https://lwn.net/Articles/23732/
[2] https://llvm.org/docs/ScudoHardenedAllocator.html
[3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/

Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:09 -07:00
Yu Zhao
ac35a49023 mm: multi-gen LRU: minimal implementation
To avoid confusion, the terms "promotion" and "demotion" will be applied
to the multi-gen LRU, as a new convention; the terms "activation" and
"deactivation" will be applied to the active/inactive LRU, as usual.

The aging produces young generations.  Given an lruvec, it increments
max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
hot pages to the youngest generation when it finds them accessed through
page tables; the demotion of cold pages happens consequently when it
increments max_seq.  Promotion in the aging path does not involve any LRU
list operations, only the updates of the gen counter and
lrugen->nr_pages[]; demotion, unless as the result of the increment of
max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
aging has the complexity O(nr_hot_pages), since it is only interested in
hot pages.

The eviction consumes old generations.  Given an lruvec, it increments
min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
A feedback loop modeled after the PID controller monitors refaults over
anon and file types and decides which type to evict when both types are
available from the same generation.

The protection of pages accessed multiple times through file descriptors
takes place in the eviction path.  Each generation is divided into
multiple tiers.  A page accessed N times through file descriptors is in
tier order_base_2(N).  Tiers do not have dedicated lrugen->lists[], only
bits in folio->flags.  The aforementioned feedback loop also monitors
refaults over all tiers and decides when to protect pages in which tiers
(N>1), using the first tier (N=0,1) as a baseline.  The first tier
contains single-use unmapped clean pages, which are most likely the best
choices.  In contrast to promotion in the aging path, the protection of a
page in the eviction path is achieved by moving this page to the next
generation, i.e., min_seq+1, if the feedback loop decides so.  This
approach has the following advantages:

1. It removes the cost of activation in the buffered access path by
   inferring whether pages accessed multiple times through file
   descriptors are statistically hot and thus worth protecting in the
   eviction path.
2. It takes pages accessed through page tables into account and avoids
   overprotecting pages accessed multiple times through file
   descriptors. (Pages accessed through page tables are in the first
   tier, since N=0.)
3. More tiers provide better protection for pages accessed more than
   twice through file descriptors, when under heavy buffered I/O
   workloads.

Server benchmark results:
  Single workload:
    fio (buffered I/O): +[30, 32]%
                IOPS         BW
      5.19-rc1: 2673k        10.2GiB/s
      patch1-6: 3491k        13.3GiB/s

  Single workload:
    memcached (anon): -[4, 6]%
                Ops/sec      KB/sec
      5.19-rc1: 1161501.04   45177.25
      patch1-6: 1106168.46   43025.04

  Configurations:
    CPU: two Xeon 6154
    Mem: total 256G

    Node 1 was only used as a ram disk to reduce the variance in the
    results.

    patch drivers/block/brd.c <<EOF
    99,100c99,100
    < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
    < 	page = alloc_page(gfp_flags);
    ---
    > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
    > 	page = alloc_pages_node(1, gfp_flags, 0);
    EOF

    cat >>/etc/systemd/system.conf <<EOF
    CPUAffinity=numa
    NUMAPolicy=bind
    NUMAMask=0
    EOF

    cat >>/etc/memcached.conf <<EOF
    -m 184320
    -s /var/run/memcached/memcached.sock
    -a 0766
    -t 36
    -B binary
    EOF

    cat fio.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkfs.ext4 /dev/ram0
    mount -t ext4 /dev/ram0 /mnt

    mkdir /sys/fs/cgroup/user.slice/test
    echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
    echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
    fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
      --buffered=1 --ioengine=io_uring --iodepth=128 \
      --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
      --rw=randread --random_distribution=random --norandommap \
      --time_based --ramp_time=10m --runtime=5m --group_reporting

    cat memcached.sh
    modprobe brd rd_nr=1 rd_size=113246208
    swapoff -a
    mkswap /dev/ram0
    swapon /dev/ram0

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
      --ratio 1:0 --pipeline 8 -d 2000

    memtier_benchmark -S /var/run/memcached/memcached.sock \
      -P memcache_binary -n allkeys --key-minimum=1 \
      --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
      --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed

Client benchmark results:
  kswapd profiles:
    5.19-rc1
      40.33%  page_vma_mapped_walk (overhead)
      21.80%  lzo1x_1_do_compress (real work)
       7.53%  do_raw_spin_lock
       3.95%  _raw_spin_unlock_irq
       2.52%  vma_interval_tree_iter_next
       2.37%  folio_referenced_one
       2.28%  vma_interval_tree_subtree_search
       1.97%  anon_vma_interval_tree_iter_first
       1.60%  ptep_clear_flush
       1.06%  __zram_bvec_write

    patch1-6
      39.03%  lzo1x_1_do_compress (real work)
      18.47%  page_vma_mapped_walk (overhead)
       6.74%  _raw_spin_unlock_irq
       3.97%  do_raw_spin_lock
       2.49%  ptep_clear_flush
       2.48%  anon_vma_interval_tree_iter_first
       1.92%  folio_referenced_one
       1.88%  __zram_bvec_write
       1.48%  memmove
       1.31%  vma_interval_tree_iter_next

  Configurations:
    CPU: single Snapdragon 7c
    Mem: total 4G

    ChromeOS MemoryPressure [1]

[1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/

Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:09 -07:00
Yu Zhao
ec1c86b25f mm: multi-gen LRU: groundwork
Evictable pages are divided into multiple generations for each lruvec.
The youngest generation number is stored in lrugen->max_seq for both
anon and file types as they are aged on an equal footing. The oldest
generation numbers are stored in lrugen->min_seq[] separately for anon
and file types as clean file pages can be evicted regardless of swap
constraints. These three variables are monotonically increasing.

Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
in order to fit into the gen counter in folio->flags. Each truncated
generation number is an index to lrugen->lists[]. The sliding window
technique is used to track at least MIN_NR_GENS and at most
MAX_NR_GENS generations. The gen counter stores a value within [1,
MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
stores 0.

There are two conceptually independent procedures: "the aging", which
produces young generations, and "the eviction", which consumes old
generations.  They form a closed-loop system, i.e., "the page reclaim". 
Both procedures can be invoked from userspace for the purposes of working
set estimation and proactive reclaim.  These techniques are commonly used
to optimize job scheduling (bin packing) in data centers [1][2].

To avoid confusion, the terms "hot" and "cold" will be applied to the
multi-gen LRU, as a new convention; the terms "active" and "inactive" will
be applied to the active/inactive LRU, as usual.

The protection of hot pages and the selection of cold pages are based
on page access channels and patterns. There are two access channels:
one through page tables and the other through file descriptors. The
protection of the former channel is by design stronger because:
1. The uncertainty in determining the access patterns of the former
   channel is higher due to the approximation of the accessed bit.
2. The cost of evicting the former channel is higher due to the TLB
   flushes required and the likelihood of encountering the dirty bit.
3. The penalty of underprotecting the former channel is higher because
   applications usually do not prepare themselves for major page
   faults like they do for blocked I/O. E.g., GUI applications
   commonly use dedicated I/O threads to avoid blocking rendering
   threads.

There are also two access patterns: one with temporal locality and the
other without.  For the reasons listed above, the former channel is
assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
present; the latter channel is assumed to follow the latter pattern unless
outlying refaults have been observed [3][4].

The next patch will address the "outlying refaults".  Three macros, i.e.,
LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
this patch to make the entire patchset less diffy.

A page is added to the youngest generation on faulting.  The aging needs
to check the accessed bit at least twice before handing this page over to
the eviction.  The first check takes care of the accessed bit set on the
initial fault; the second check makes sure this page has not been used
since then.  This protocol, AKA second chance, requires a minimum of two
generations, hence MIN_NR_GENS.

[1] https://dl.acm.org/doi/10.1145/3297858.3304053
[2] https://dl.acm.org/doi/10.1145/3503222.3507731
[3] https://lwn.net/Articles/495543/
[4] https://lwn.net/Articles/815342/

Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:09 -07:00
Yang Yang
aa1cf99b87 delayacct: support re-entrance detection of thrashing accounting
Once upon a time, we only support accounting thrashing of page cache. 
Then Joonsoo introduced workingset detection for anonymous pages and we
gained the ability to account thrashing of them[1].

For page cache thrashing accounting, there is no suitable place to do it
in fs level likes swap_readpage().  So we have to do it in
folio_wait_bit_common().

Then for anonymous pages thrashing accounting, we have to do it in both
swap_readpage() and folio_wait_bit_common().  This likes PSI, so we should
let thrashing accounting supports re-entrance detection.

This patch is to prepare complete thrashing accounting, and is based on
patch "filemap: make the accounting of thrashing more consistent".

[1] commit aae466b0052e ("mm/swap: implement workingset detection for anonymous LRU")

Link: https://lkml.kernel.org/r/20220815071134.74551-1-yang.yang29@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reviewed-by: wangyong <wang.yong12@zte.com.cn>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-26 19:46:07 -07:00
Huang Ying
c959924b0d memory tiering: adjust hot threshold automatically
The promotion hot threshold is workload and system configuration
dependent.  So in this patch, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.  If the
hint page fault latency of a page is less than the hot threshold, we will
try to promote the page, and the page is called the candidate promotion
page.

If the number of the candidate promotion pages in the statistics interval
is much more than the promotion rate limit, the hot threshold will be
decreased to reduce the number of the candidate promotion pages. 
Otherwise, the hot threshold will be increased to increase the number of
the candidate promotion pages.

To make the above method works, in each statistics interval, the total
number of the pages to check (on which the hint page faults occur) and the
hot/cold distribution need to be stable.  Because the page tables are
scanned linearly in NUMA balancing, but the hot/cold distribution isn't
uniform along the address usually, the statistics interval should be
larger than the NUMA balancing scan period.  So in the patch, the max scan
period is used as statistics interval and it works well in our tests.

Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: osalvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:54 -07:00
Huang Ying
c6833e1000 memory tiering: rate limit NUMA migration throughput
In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.

A choice is to promote/demote as fast as possible.  But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput.  It will take longer to finish the promoting/demoting.  But
the workload latency will be better.  This is implemented in this patch as
the page promotion rate limit mechanism.

The number of the candidate pages to be promoted to the fast memory node
via NUMA balancing is counted, if the count exceeds the limit specified by
the users, the NUMA balancing promotion will be stopped until the next
second.

A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added
for the users to specify the limit.

Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: osalvador <osalvador@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Wei Xu <weixugc@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:54 -07:00
Huang Ying
33024536ba memory tiering: hot page selection with hint page fault latency
Patch series "memory tiering: hot page selection", v4.

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory nodes need to be identified. 
Essentially, the original NUMA balancing implementation selects the mostly
recently accessed (MRU) pages to promote.  But this isn't a perfect
algorithm to identify the hot pages.  Because the pages with quite low
access frequency may be accessed eventually given the NUMA balancing page
table scanning period could be quite long (e.g.  60 seconds).  So in this
patchset, we implement a new hot page identification algorithm based on
the latency between NUMA balancing page table scanning and hint page
fault.  Which is a kind of mostly frequently accessed (MFU) algorithm.

In NUMA balancing memory tiering mode, if there are hot pages in slow
memory node and cold pages in fast memory node, we need to promote/demote
hot/cold pages between the fast and cold memory nodes.

A choice is to promote/demote as fast as possible.  But the CPU cycles and
memory bandwidth consumed by the high promoting/demoting throughput will
hurt the latency of some workload because of accessing inflating and slow
memory bandwidth contention.

A way to resolve this issue is to restrict the max promoting/demoting
throughput.  It will take longer to finish the promoting/demoting.  But
the workload latency will be better.  This is implemented in this patchset
as the page promotion rate limit mechanism.

The promotion hot threshold is workload and system configuration
dependent.  So in this patchset, a method to adjust the hot threshold
automatically is implemented.  The basic idea is to control the number of
the candidate promotion pages to match the promotion rate limit.

We used the pmbench memory accessing benchmark tested the patchset on a
2-socket server system with DRAM and PMEM installed.  The test results are
as follows,

		pmbench score		promote rate
		 (accesses/s)			MB/s
		-------------		------------
base		  146887704.1		       725.6
hot selection     165695601.2		       544.0
rate limit	  162814569.8		       165.2
auto adjustment	  170495294.0                  136.9

From the results above,

With hot page selection patch [1/3], the pmbench score increases about
12.8%, and promote rate (overhead) decreases about 25.0%, compared with
base kernel.

With rate limit patch [2/3], pmbench score decreases about 1.7%, and
promote rate decreases about 69.6%, compared with hot page selection
patch.

With threshold auto adjustment patch [3/3], pmbench score increases about
4.7%, and promote rate decrease about 17.1%, compared with rate limit
patch.

Baolin helped to test the patchset with MySQL on a machine which contains
1 DRAM node (30G) and 1 PMEM node (126G).

sysbench /usr/share/sysbench/oltp_read_write.lua \
......
--tables=200 \
--table-size=1000000 \
--report-interval=10 \
--threads=16 \
--time=120

The tps can be improved about 5%.


This patch (of 3):

To optimize page placement in a memory tiering system with NUMA balancing,
the hot pages in the slow memory node need to be identified.  Essentially,
the original NUMA balancing implementation selects the mostly recently
accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
identify the hot pages.  Because the pages with quite low access frequency
may be accessed eventually given the NUMA balancing page table scanning
period could be quite long (e.g.  60 seconds).  The most frequently
accessed (MFU) algorithm is better.

So, in this patch we implemented a better hot page selection algorithm. 
Which is based on NUMA balancing page table scanning and hint page fault
as follows,

- When the page tables of the processes are scanned to change PTE/PMD
  to be PROT_NONE, the current time is recorded in struct page as scan
  time.

- When the page is accessed, hint page fault will occur.  The scan
  time is gotten from the struct page.  And The hint page fault
  latency is defined as

    hint page fault time - scan time

The shorter the hint page fault latency of a page is, the higher the
probability of their access frequency to be higher.  So the hint page
fault latency is a better estimation of the page hot/cold.

It's hard to find some extra space in struct page to hold the scan time. 
Fortunately, we can reuse some bits used by the original NUMA balancing.

NUMA balancing uses some bits in struct page to store the page accessing
CPU and PID (referring to page_cpupid_xchg_last()).  Which is used by the
multi-stage node selection algorithm to avoid to migrate pages shared
accessed by the NUMA nodes back and forth.  But for pages in the slow
memory node, even if they are shared accessed by multiple NUMA nodes, as
long as the pages are hot, they need to be promoted to the fast memory
node.  So the accessing CPU and PID information are unnecessary for the
slow memory pages.  We can reuse these bits in struct page to record the
scan time.  For the fast memory pages, these bits are used as before.

For the hot threshold, the default value is 1 second, which works well in
our performance test.  All pages with hint page fault latency < hot
threshold will be considered hot.

It's hard for users to determine the hot threshold.  So we don't provide a
kernel ABI to set it, just provide a debugfs interface for advanced users
to experiment.  We will continue to work on a hot threshold automatic
adjustment mechanism.

The downside of the above method is that the response time to the workload
hot spot changing may be much longer.  For example,

- A previous cold memory area becomes hot

- The hint page fault will be triggered.  But the hint page fault
  latency isn't shorter than the hot threshold.  So the pages will
  not be promoted.

- When the memory area is scanned again, maybe after a scan period,
  the hint page fault latency measured will be shorter than the hot
  threshold and the pages will be promoted.

To mitigate this, if there are enough free space in the fast memory node,
the hot threshold will not be used, all pages will be promoted upon the
hint page fault for fast response.

Thanks Zhong Jiang reported and tested the fix for a bug when disabling
memory tiering mode dynamically.

Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Wei Xu <weixugc@google.com>
Cc: osalvador <osalvador@suse.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-09-11 20:25:54 -07:00
Linus Torvalds
b467192ec7 Seventeen hotfixes. Mostly memory management things. Ten patches are
cc:stable, addressing pre-6.0 issues.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYwvgrAAKCRDdBJ7gKXxA
 jlweAQC9dzE08Elxl4F7Uvxe+62JWVeflBRrT7sJ6jU1Gu3QcQEAhhI1Xit3/MGq
 pRytDBObGADxlA67c9eNq6J5pCT/7gE=
 =pD67
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull more hotfixes from Andrew Morton:
 "Seventeen hotfixes.  Mostly memory management things.

  Ten patches are cc:stable, addressing pre-6.0 issues"

* tag 'mm-hotfixes-stable-2022-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  .mailmap: update Luca Ceresoli's e-mail address
  mm/mprotect: only reference swap pfn page if type match
  squashfs: don't call kmalloc in decompressors
  mm/damon/dbgfs: avoid duplicate context directory creation
  mailmap: update email address for Colin King
  asm-generic: sections: refactor memory_intersects
  bootmem: remove the vmemmap pages from kmemleak in put_page_bootmem
  ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown
  Revert "memcg: cleanup racy sum avoidance code"
  mm/zsmalloc: do not attempt to free IS_ERR handle
  binder_alloc: add missing mmap_lock calls when using the VMA
  mm: re-allow pinning of zero pfns (again)
  vmcoreinfo: add kallsyms_num_syms symbol
  mailmap: update Guilherme G. Piccoli's email addresses
  writeback: avoid use-after-free after removing device
  shmem: update folio if shmem_replace_page() updates the page
  mm/hugetlb: avoid corrupting page->mapping in hugetlb_mcopy_atomic_pte
2022-08-28 14:49:59 -07:00
Stephen Brennan
f09bddbd86 vmcoreinfo: add kallsyms_num_syms symbol
The rest of the kallsyms symbols are useless without knowing the number of
symbols in the table.  In an earlier patch, I somehow dropped the
kallsyms_num_syms symbol, so add it back in.

Link: https://lkml.kernel.org/r/20220808205410.18590-1-stephen.s.brennan@oracle.com
Fixes: 5fd8fea935a1 ("vmcoreinfo: include kallsyms symbols")
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-28 14:02:44 -07:00
Linus Torvalds
17b28d4267 audit/stable-6.0 PR 20220826
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmMJG04UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXOBtBAAkfUY6U8EtMvrPPu6kMyREPdU/9Zh
 wCBrKjY59fMWOl1RT8zYqyZaCZRvSc/Wd73XLvU2r0pf83N3i6sH7CozVhQyhM8H
 icNSzFRcZetaaOu2VKvfp5sSHR0ulLlYy26+zud6Syl/F7AJVwID0wsyHLVMuLs0
 PVb+oOoOoHzLdAxY6GlwHFHww3NgDPuYTo2v/19AAQ9f9HHHbr8iMwso4kBPA3TX
 x6tS/0YNKdAKAEtzwBmLQ7d8rFsjuBVActzoIOHjSluH5hg7UrrY4OwSOK1tp0bY
 r+tnpa4M1bBBqxgNlHY9CHlpveNNzDtiDNjxOA/EsGHyNPrjkna017MEc9kGO7Bn
 uwu0ytGoLt/IWeWdn3edmlDJtg782JmGI5YS3ihCE6vrqjd1sDh6QUVGMMy29Cm2
 dSPp1WY+I7IW9zTD1RzsdqDWdtnuN2XL591VxPW8WyvcU4QS5bBXQmUT+T8Ribkr
 jsZHiG4GqozF7bzuN38iw+MO2dV7TFvrzTQmqbji/8cDC68QANagdBaqUx8dGZ1w
 itW6UDZiUeSN8XUNJgDNX2b7jxnVPpEBQ1a0Ncbo6ykfZ4NKKujGE2kv7GMJ2d7x
 vYP/MxQdw15hQsSlT3vhmCQq6OpchpLUIywIsT3uTYATb5dMHDaWW7RtUg55/yNv
 xxiKWBMeALHGE9w=
 =j67g
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20220826' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "Another small audit patch, this time to fix a bug where the return
  codes were not properly set before the audit filters were run,
  potentially resulting in missed audit records"

* tag 'audit-pr-20220826' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: move audit_return_fixup before the filters
2022-08-27 15:31:12 -07:00
Mikulas Patocka
8238b45798 wait_on_bit: add an acquire memory barrier
There are several places in the kernel where wait_on_bit is not followed
by a memory barrier (for example, in drivers/md/dm-bufio.c:new_read).

On architectures with weak memory ordering, it may happen that memory
accesses that follow wait_on_bit are reordered before wait_on_bit and
they may return invalid data.

Fix this class of bugs by introducing a new function "test_bit_acquire"
that works like test_bit, but has acquire memory ordering semantics.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-26 09:30:25 -07:00
Richard Guy Briggs
d4fefa4801 audit: move audit_return_fixup before the filters
The success and return_code are needed by the filters.  Move
audit_return_fixup() before the filters.  This was causing syscall
auditing events to be missed.

Link: https://github.com/linux-audit/audit-kernel/issues/138
Cc: stable@vger.kernel.org
Fixes: 12c5e81d3fd0 ("audit: prepare audit_context for use in calling contexts beyond syscalls")
Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
[PM: manual merge required]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-08-25 17:25:08 -04:00
Linus Torvalds
3f5c20055a 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock()
deadlock") in the previous fix pull required cgroup core to grab
 cpus_read_lock() before invoking ->attach(). Unfortunately, it missed adding
 cpus_read_lock() in cgroup_attach_task_all(). Fix it.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYwe0GA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGee0AP9jrsUgnmis/PzqyyPlkD95rRSDyyUNjMjfHnJe
 HW+YbgD/XcEo1eJvijqP1g/ZJhRKQl6vA1JSMgnL9obc3wNpGg8=
 =7LzT
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.0-rc2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull another cgroup fix from Tejun Heo:
 "Commit 4f7e7236435c ("cgroup: Fix threadgroup_rwsem <->
  cpus_read_lock() deadlock") required the cgroup
  core to grab cpus_read_lock() before invoking ->attach().

  Unfortunately, it missed adding cpus_read_lock() in
  cgroup_attach_task_all(). Fix it"

* tag 'cgroup-for-6.0-rc2-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
2022-08-25 10:52:16 -07:00
Tetsuo Handa
43626dade3 cgroup: Add missing cpus_read_lock() to cgroup_attach_task_all()
syzbot is hitting percpu_rwsem_assert_held(&cpu_hotplug_lock) warning at
cpuset_attach() [1], for commit 4f7e7236435ca0ab ("cgroup: Fix
threadgroup_rwsem <-> cpus_read_lock() deadlock") missed that
cpuset_attach() is also called from cgroup_attach_task_all().
Add cpus_read_lock() like what cgroup_procs_write_start() does.

Link: https://syzkaller.appspot.com/bug?extid=29d3a3b4d86c8136ad9e [1]
Reported-by: syzbot <syzbot+29d3a3b4d86c8136ad9e@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Fixes: 4f7e7236435ca0ab ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock")
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-25 07:36:30 -10:00
Linus Torvalds
a86766c49e Tracing: Fix for 6.0-rc2
- Fix build warning for when MODULES and FTRACE_WITH_DIRECT_CALLS are not
   set. A warning happens with ops_references_rec() defined but not used.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYwTkGhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qtUTAP4tOmf0I0c+GsWTzpecvv7fa+9rxmZa
 SfBuoPqzC/TBqAEArqaf91+57aehCrJC3X5HaE7OJisW9nd2Epnvrpxk4QY=
 =0yZV
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:

 - Fix build warning for when MODULES and FTRACE_WITH_DIRECT_CALLS are
   not set. A warning happens with ops_references_rec() defined but not
   used.

* tag 'trace-v6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ftrace: Fix build warning for ops_references_rec() not used
2022-08-24 10:43:34 -07:00
Linus Torvalds
c40e8341e3 cgroup fixes for v6.0-rc2
Contains fixes for the following issues:
 
 * psi data structure was changed to be allocated dynamically but it wasn't
   being cleared leading to reporting garbage values and triggering spurious
   oom kills.
 
 * A deadlock involving cpuset and cpu hotplug.
 
 * When a controller is moved across cgroup hierarchies, css->rstat_css_node
   didn't get RCU drained properly from the previous list.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYIACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCYwVmRg4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGb/eAP44dr9/OQtapKm63H/qmLF39LWE6nC99RYHECl5
 ncuZvwD/XIkZt212nr/qC1C0ggB5qCGG7tIZG6tIgkS+J5huqg4=
 =CC/Y
 -----END PGP SIGNATURE-----

Merge tag 'cgroup-for-6.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:

 - The psi data structure was changed to be allocated dynamically but
   it wasn't being cleared leading to it reporting garbage values and
   triggering spurious oom kills.

 - A deadlock involving cpuset and cpu hotplug.

 - When a controller is moved across cgroup hierarchies,
   css->rstat_css_node didn't get RCU drained properly from the previous
   list.

* tag 'cgroup-for-6.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: Fix race condition at rebind_subsystems()
  cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock
  sched/psi: Remove redundant cgroup_psi() when !CONFIG_CGROUPS
  sched/psi: Remove unused parameter nbytes of psi_trigger_create()
  sched/psi: Zero the memory of struct psi_group
2022-08-23 19:33:28 -07:00
Linus Torvalds
072c92b1b1 audit/stable-6.0 PR 20220823
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmMFJkQUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXONqQ/+L06aobl3pPbFlatTW0YXgmYKxXBb
 Vf329u1P1pcmHYuUf/c4pCGxxbMbbEHwsmdtp6YKLnh97gP0GYUNTNI9WMbU7he9
 ZNrBe1gUhUHNi0ZL1OPVxokfeV2UK+hsyGQuR1wXHwjTTbONsghGCvdy1LEw4DMe
 dGRPNkxzoKJ5K7SnScplUhSBAoVtLLBQB1+HKd5mILV22TTWWzTwcde0RSIkAX1s
 /VM4P77DSEw5DX4fYgIt85yHZ/c8MUUyECFkALph/VUkkLvEWrISTXIzoOdJXfJO
 Ock88Gz2HAj3L+4b0CL0zW67sERks1H5udmhtI+ymGObklMjfJh88QL44K+n8f9w
 2ap9Hlgl4b2TjE2KK313ixX6Om7xxvH62IAMr0/x8y5tk+qZTNvbQsT8TpkRXxtt
 vHxp9x4qqeRL2Si/5A6rvyKvEaZI26hOmNTQIKzJvUIs5tyV0UySABgOheVp42PS
 VeF9/lUc7XOXI9CffhSw636I0WZYmp0bSIcDvRpeTuxobCG6SpOup+ODPoRdA+0A
 8jdwQlJWO36H0qQnqrN/dfDKqcfUu2epKFrP46TxEtC60WztLnS1Nw1ZLFhcEIUw
 MiKEirj1PjVYmES4/aW8l3OPQqMcYHXp5Lj3qBrcOEpi3tPIzOzYwnmdfASxfIvE
 QXE2QHEBjtx8lHE=
 =j0DY
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20220823' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit fix from Paul Moore:
 "A single fix for a potential double-free on a fsnotify error path"

* tag 'audit-pr-20220823' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: fix potential double free on error path from fsnotify_add_inode_mark
2022-08-23 19:26:48 -07:00
Linus Torvalds
95607ad99b Thirteen fixes, almost all for MM. Seven of these are cc:stable and the
remainder fix up the changes which went into this -rc cycle.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYwQZcgAKCRDdBJ7gKXxA
 jnCxAQCk8L6PPm0L2KvKr5Vu3M/T0o9SvfxfM5yho80zM68fHQD/eLxz+nd3m+N5
 K7Mdbcb2u6F46qQaS+S5RialEWKpsw8=
 =WtBo
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Thirteen fixes, almost all for MM.

  Seven of these are cc:stable and the remainder fix up the changes
  which went into this -rc cycle"

* tag 'mm-hotfixes-stable-2022-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  kprobes: don't call disarm_kprobe() for disabled kprobes
  mm/shmem: shmem_replace_page() remember NR_SHMEM
  mm/shmem: tmpfs fallocate use file_modified()
  mm/shmem: fix chattr fsflags support in tmpfs
  mm/hugetlb: support write-faults in shared mappings
  mm/hugetlb: fix hugetlb not supporting softdirty tracking
  mm/uffd: reset write protection when unregister with wp-mode
  mm/smaps: don't access young/dirty bit if pte unpresent
  mm: add DEVICE_ZONE to FOR_ALL_ZONES
  kernel/sys_ni: add compat entry for fadvise64_64
  mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW
  Revert "zram: remove double compression logic"
  get_maintainer: add Alan to .get_maintainer.ignore
2022-08-23 13:33:08 -07:00
Linus Torvalds
6234806f8c linux-kselftest-kunit-fixes-6.0-rc3
This KUnit fixes update for Linux 6.0-rc3 consists of fixes to mmc
 test and fix to load .kunit_test_suites section when CONFIG_KUNIT=m,
 and not just when KUnit is built-in.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEPZKym/RZuOCGeA/kCwJExA0NQxwFAmMD/D8ACgkQCwJExA0N
 QxzK8A/7B5VP2K2BYVeXZqK8E7Y+GuFAQc0oD+b9Sgfvv2QkeRO8xAsd19DmvIsi
 MscFjIhScqoU50nrVwxOEiU5Wzeg+BuG8ghPWIM8mo/heeuD27UNE1uEOv7jlOZp
 hlKD9SqCf2HV3YLre4I83fwrK6hilQT4R55yQeXsl/EWvRCwBb1axBp2NCt5Jh5Q
 PUK066pNy3KozumKTL65RLzCKwkoxqNZ5+XTa105heh9lhPqrRZ9+kR+eVfskPI2
 q8F3NcCpGYV+YcHa899MMS4R98nDpB9GYK/sbJxVAIeWxcUT+9fZIJnr4oYi6z0N
 sZDoOsg864R29JUt/rhZkisOuOMjJ94vLVJY62dTVskEVGK6YiARaIfwI39sMWoH
 4ATYCcQW50WLZDC/zc0X9Cm2Bp4Dv/WRL2xTWYpH2P5caxRcnoVm89ggMXsTHE8U
 QApjQW8e9STn+8vzio4KdityIOZ9EsQNpsSilq/Zq2iL2B0ZOIXWSf+JwplAjJdA
 Or9N7EzIZU9PSxdxE9Xwjq0f/bvDC+DH8h9/X3Sy0WeLuA9KUlgVSbFLwSp8jz5Q
 CwoWPase+BAWrAXMEZSqnjrgwIfa1FOXy5jcWNaJWsPVNKirmW0j0n9K1Zj4QFCC
 lYj/H+FgkYXIE6GQRNuHEbnNmExWEDdEdolr+vi+p8Xs3PVyVls=
 =HQ19
 -----END PGP SIGNATURE-----

Merge tag 'linux-kselftest-kunit-fixes-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull KUnit fixes from Shuah Khan:
 "Fix for a mmc test and to load .kunit_test_suites section when
  CONFIG_KUNIT=m, and not just when KUnit is built-in"

* tag 'linux-kselftest-kunit-fixes-6.0-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  module: kunit: Load .kunit_test_suites section when CONFIG_KUNIT=m
  mmc: sdhci-of-aspeed: test: Fix dependencies when KUNIT=m
2022-08-23 13:23:07 -07:00
Jing-Ting Wu
763f4fb76e cgroup: Fix race condition at rebind_subsystems()
Root cause:
The rebind_subsystems() is no lock held when move css object from A
list to B list,then let B's head be treated as css node at
list_for_each_entry_rcu().

Solution:
Add grace period before invalidating the removed rstat_css_node.

Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/
Acked-by: Mukesh Ojha <quic_mojha@quicinc.com>
Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1")
Cc: stable@vger.kernel.org # v5.13+
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-23 08:11:06 -10:00
Gaosheng Cui
ad982c3be4 audit: fix potential double free on error path from fsnotify_add_inode_mark
Audit_alloc_mark() assign pathname to audit_mark->path, on error path
from fsnotify_add_inode_mark(), fsnotify_put_mark will free memory
of audit_mark->path, but the caller of audit_alloc_mark will free
the pathname again, so there will be double free problem.

Fix this by resetting audit_mark->path to NULL pointer on error path
from fsnotify_add_inode_mark().

Cc: stable@vger.kernel.org
Fixes: 7b1293234084d ("fsnotify: Add group pointer in fsnotify_init_mark()")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2022-08-22 18:50:06 -04:00
Wang Jingjin
123d645577 ftrace: Fix build warning for ops_references_rec() not used
The change that made IPMODIFY and DIRECT ops work together needed access
to the ops_references_ip() function, which it pulled out of the module
only code. But now if both CONFIG_MODULES and
CONFIG_DYNAMIC_FTRACE_WITH_DIRECT_CALLS is not set, we get the below
warning:

    ‘ops_references_rec’ defined but not used.

Since ops_references_rec() only calls ops_references_ip() replace the
usage of ops_references_rec() with ops_references_ip() and encompass the
function with an #ifdef of DIRECT_CALLS || MODULES being defined.

Link: https://lkml.kernel.org/r/20220801084745.1187987-1-wangjingjin1@huawei.com

Fixes: 53cd885bc5c3 ("ftrace: Allow IPMODIFY and DIRECT ops on the same function")
Signed-off-by: Wang Jingjin <wangjingjin1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-22 09:41:12 -04:00
Linus Torvalds
7fb312d225 Various fixes for tracing:
- Fix a return value of traceprobe_parse_event_name()
 
  - Fix NULL pointer dereference from failed ftrace enabling
 
  - Fix NULL pointer dereference when asking for registers from eprobes
 
  - Make eprobes consistent with kprobes/uprobes, filters and histograms
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYwKRrhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qosDAP9WySmPxjoMfR0hbjmnepLy2zJtBbIq
 ABWR3LDrjvLlYwD9H/wrD+6ctOZtXh5XJc0Vn5z6XEyNtqrVGSse7Lm+sg4=
 =qb/R
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Various fixes for tracing:

   - Fix a return value of traceprobe_parse_event_name()

   - Fix NULL pointer dereference from failed ftrace enabling

   - Fix NULL pointer dereference when asking for registers from eprobes

   - Make eprobes consistent with kprobes/uprobes, filters and
     histograms"

* tag 'trace-v6.0-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Have filter accept "common_cpu" to be consistent
  tracing/probes: Have kprobes and uprobes use $COMM too
  tracing/eprobes: Have event probes be consistent with kprobes and uprobes
  tracing/eprobes: Fix reading of string fields
  tracing/eprobes: Do not hardcode $comm as a string
  tracing/eprobes: Do not allow eprobes to use $stack, or % for regs
  ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
  tracing/perf: Fix double put of trace event when init fails
  tracing: React to error return from traceprobe_parse_event_name()
2022-08-21 14:49:42 -07:00
Steven Rostedt (Google)
b2380577d4 tracing: Have filter accept "common_cpu" to be consistent
Make filtering consistent with histograms. As "cpu" can be a field of an
event, allow for "common_cpu" to keep it from being confused with the
"cpu" field of the event.

Link: https://lkml.kernel.org/r/20220820134401.513062765@goodmis.org
Link: https://lore.kernel.org/all/20220820220920.e42fa32b70505b1904f0a0ad@kernel.org/

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 1e3bac71c5053 ("tracing/histogram: Rename "cpu" to "common_cpu"")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
ab8384442e tracing/probes: Have kprobes and uprobes use $COMM too
Both $comm and $COMM can be used to get current->comm in eprobes and the
filtering and histogram logic. Make kprobes and uprobes consistent in this
regard and allow both $comm and $COMM as well. Currently kprobes and
uprobes only handle $comm, which is inconsistent with the other utilities,
and can be confusing to users.

Link: https://lkml.kernel.org/r/20220820134401.317014913@goodmis.org
Link: https://lore.kernel.org/all/20220820220442.776e1ddaf8836e82edb34d01@kernel.org/

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 533059281ee5 ("tracing: probeevent: Introduce new argument fetching code")
Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
6a832ec3d6 tracing/eprobes: Have event probes be consistent with kprobes and uprobes
Currently, if a symbol "@" is attempted to be used with an event probe
(eprobes), it will cause a NULL pointer dereference crash.

Both kprobes and uprobes can reference data other than the main registers.
Such as immediate address, symbols and the current task name. Have eprobes
do the same thing.

For "comm", if "comm" is used and the event being attached to does not
have the "comm" field, then make it the "$comm" that kprobes has. This is
consistent to the way histograms and filters work.

Link: https://lkml.kernel.org/r/20220820134401.136924220@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
f04dec9346 tracing/eprobes: Fix reading of string fields
Currently when an event probe (eprobe) hooks to a string field, it does
not display it as a string, but instead as a number. This makes the field
rather useless. Handle the different kinds of strings, dynamic, static,
relational/dynamic etc.

Now when a string field is used, the ":string" type can be used to display
it:

  echo "e:sw sched/sched_switch comm=$next_comm:string" > dynamic_events

Link: https://lkml.kernel.org/r/20220820134400.959640191@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
02333de90e tracing/eprobes: Do not hardcode $comm as a string
The variable $comm is hard coded as a string, which is true for both
kprobes and uprobes, but for event probes (eprobes) it is a field name. In
most cases the "comm" field would be a string, but there's no guarantee of
that fact.

Do not assume that comm is a string. Not to mention, it currently forces
comm fields to fault, as string processing for event probes is currently
broken.

Link: https://lkml.kernel.org/r/20220820134400.756152112@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Steven Rostedt (Google)
2673c60ee6 tracing/eprobes: Do not allow eprobes to use $stack, or % for regs
While playing with event probes (eprobes), I tried to see what would
happen if I attempted to retrieve the instruction pointer (%rip) knowing
that event probes do not use pt_regs. The result was:

 BUG: kernel NULL pointer dereference, address: 0000000000000024
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP PTI
 CPU: 1 PID: 1847 Comm: trace-cmd Not tainted 5.19.0-rc5-test+ #309
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
 RIP: 0010:get_event_field.isra.0+0x0/0x50
 Code: ff 48 c7 c7 c0 8f 74 a1 e8 3d 8b f5 ff e8 88 09 f6 ff 4c 89 e7 e8
50 6a 13 00 48 89 ef 5b 5d 41 5c 41 5d e9 42 6a 13 00 66 90 <48> 63 47 24
8b 57 2c 48 01 c6 8b 47 28 83 f8 02 74 0e 83 f8 04 74
 RSP: 0018:ffff916c394bbaf0 EFLAGS: 00010086
 RAX: ffff916c854041d8 RBX: ffff916c8d9fbf50 RCX: ffff916c255d2000
 RDX: 0000000000000000 RSI: ffff916c255d2008 RDI: 0000000000000000
 RBP: 0000000000000000 R08: ffff916c3a2a0c08 R09: ffff916c394bbda8
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff916c854041d8
 R13: ffff916c854041b0 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff916c9ea40000(0000)
knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000024 CR3: 000000011b60a002 CR4: 00000000001706e0
 Call Trace:
  <TASK>
  get_eprobe_size+0xb4/0x640
  ? __mod_node_page_state+0x72/0xc0
  __eprobe_trace_func+0x59/0x1a0
  ? __mod_lruvec_page_state+0xaa/0x1b0
  ? page_remove_file_rmap+0x14/0x230
  ? page_remove_rmap+0xda/0x170
  event_triggers_call+0x52/0xe0
  trace_event_buffer_commit+0x18f/0x240
  trace_event_raw_event_sched_wakeup_template+0x7a/0xb0
  try_to_wake_up+0x260/0x4c0
  __wake_up_common+0x80/0x180
  __wake_up_common_lock+0x7c/0xc0
  do_notify_parent+0x1c9/0x2a0
  exit_notify+0x1a9/0x220
  do_exit+0x2ba/0x450
  do_group_exit+0x2d/0x90
  __x64_sys_exit_group+0x14/0x20
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x46/0xb0

Obviously this is not the desired result.

Move the testing for TPARG_FL_TPOINT which is only used for event probes
to the top of the "$" variable check, as all the other variables are not
used for event probes. Also add a check in the register parsing "%" to
fail if an event probe is used.

Link: https://lkml.kernel.org/r/20220820134400.564426983@goodmis.org

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: Tom Zanussi <zanussi@kernel.org>
Fixes: 7491e2c44278 ("tracing: Add a probe that attaches to trace events")
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:08 -04:00
Yang Jihong
c3b0f72e80 ftrace: Fix NULL pointer dereference in is_ftrace_trampoline when ftrace is dead
ftrace_startup does not remove ops from ftrace_ops_list when
ftrace_startup_enable fails:

register_ftrace_function
  ftrace_startup
    __register_ftrace_function
      ...
      add_ftrace_ops(&ftrace_ops_list, ops)
      ...
    ...
    ftrace_startup_enable // if ftrace failed to modify, ftrace_disabled is set to 1
    ...
  return 0 // ops is in the ftrace_ops_list.

When ftrace_disabled = 1, unregister_ftrace_function simply returns without doing anything:
unregister_ftrace_function
  ftrace_shutdown
    if (unlikely(ftrace_disabled))
            return -ENODEV;  // return here, __unregister_ftrace_function is not executed,
                             // as a result, ops is still in the ftrace_ops_list
    __unregister_ftrace_function
    ...

If ops is dynamically allocated, it will be free later, in this case,
is_ftrace_trampoline accesses NULL pointer:

is_ftrace_trampoline
  ftrace_ops_trampoline
    do_for_each_ftrace_op(op, ftrace_ops_list) // OOPS! op may be NULL!

Syzkaller reports as follows:
[ 1203.506103] BUG: kernel NULL pointer dereference, address: 000000000000010b
[ 1203.508039] #PF: supervisor read access in kernel mode
[ 1203.508798] #PF: error_code(0x0000) - not-present page
[ 1203.509558] PGD 800000011660b067 P4D 800000011660b067 PUD 130fb8067 PMD 0
[ 1203.510560] Oops: 0000 [#1] SMP KASAN PTI
[ 1203.511189] CPU: 6 PID: 29532 Comm: syz-executor.2 Tainted: G    B   W         5.10.0 #8
[ 1203.512324] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 1203.513895] RIP: 0010:is_ftrace_trampoline+0x26/0xb0
[ 1203.514644] Code: ff eb d3 90 41 55 41 54 49 89 fc 55 53 e8 f2 00 fd ff 48 8b 1d 3b 35 5d 03 e8 e6 00 fd ff 48 8d bb 90 00 00 00 e8 2a 81 26 00 <48> 8b ab 90 00 00 00 48 85 ed 74 1d e8 c9 00 fd ff 48 8d bb 98 00
[ 1203.518838] RSP: 0018:ffffc900012cf960 EFLAGS: 00010246
[ 1203.520092] RAX: 0000000000000000 RBX: 000000000000007b RCX: ffffffff8a331866
[ 1203.521469] RDX: 0000000000000000 RSI: 0000000000000008 RDI: 000000000000010b
[ 1203.522583] RBP: 0000000000000000 R08: 0000000000000000 R09: ffffffff8df18b07
[ 1203.523550] R10: fffffbfff1be3160 R11: 0000000000000001 R12: 0000000000478399
[ 1203.524596] R13: 0000000000000000 R14: ffff888145088000 R15: 0000000000000008
[ 1203.525634] FS:  00007f429f5f4700(0000) GS:ffff8881daf00000(0000) knlGS:0000000000000000
[ 1203.526801] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1203.527626] CR2: 000000000000010b CR3: 0000000170e1e001 CR4: 00000000003706e0
[ 1203.528611] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1203.529605] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Therefore, when ftrace_startup_enable fails, we need to rollback registration
process and remove ops from ftrace_ops_list.

Link: https://lkml.kernel.org/r/20220818032659.56209-1-yangjihong1@huawei.com

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:07 -04:00
Steven Rostedt (Google)
7249921d94 tracing/perf: Fix double put of trace event when init fails
If in perf_trace_event_init(), the perf_trace_event_open() fails, then it
will call perf_trace_event_unreg() which will not only unregister the perf
trace event, but will also call the put() function of the tp_event.

The problem here is that the trace_event_try_get_ref() is called by the
caller of perf_trace_event_init() and if perf_trace_event_init() returns a
failure, it will then call trace_event_put(). But since the
perf_trace_event_unreg() already called the trace_event_put() function, it
triggers a WARN_ON().

 WARNING: CPU: 1 PID: 30309 at kernel/trace/trace_dynevent.c:46 trace_event_dyn_put_ref+0x15/0x20

If perf_trace_event_reg() does not call the trace_event_try_get_ref() then
the perf_trace_event_unreg() should not be calling trace_event_put(). This
breaks symmetry and causes bugs like these.

Pull out the trace_event_put() from perf_trace_event_unreg() and call it
in the locations that perf_trace_event_unreg() is called. This not only
fixes this bug, but also brings back the proper symmetry of the reg/unreg
vs get/put logic.

Link: https://lore.kernel.org/all/cover.1660347763.git.kjlx@templeofstupid.com/
Link: https://lkml.kernel.org/r/20220816192817.43d5e17f@gandalf.local.home

Cc: stable@vger.kernel.org
Fixes: 1d18538e6a092 ("tracing: Have dynamic events have a ref counter")
Reported-by: Krister Johansen <kjlx@templeofstupid.com>
Reviewed-by: Krister Johansen <kjlx@templeofstupid.com>
Tested-by: Krister Johansen <kjlx@templeofstupid.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:07 -04:00
Lukas Bulwahn
d8a64313c1 tracing: React to error return from traceprobe_parse_event_name()
The function traceprobe_parse_event_name() may set the first two function
arguments to a non-null value and still return -EINVAL to indicate an
unsuccessful completion of the function. Hence, it is not sufficient to
just check the result of the two function arguments for being not null,
but the return value also needs to be checked.

Commit 95c104c378dc ("tracing: Auto generate event name when creating a
group of events") changed the error-return-value checking of the second
traceprobe_parse_event_name() invocation in __trace_eprobe_create() and
removed checking the return value to jump to the error handling case.

Reinstate using the return value in the error-return-value checking.

Link: https://lkml.kernel.org/r/20220811071734.20700-1-lukas.bulwahn@gmail.com

Fixes: 95c104c378dc ("tracing: Auto generate event name when creating a group of events")
Acked-by: Linyu Yuan <quic_linyyuan@quicinc.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2022-08-21 15:56:07 -04:00
Kuniyuki Iwashima
9c80e79906 kprobes: don't call disarm_kprobe() for disabled kprobes
The assumption in __disable_kprobe() is wrong, and it could try to disarm
an already disarmed kprobe and fire the WARN_ONCE() below. [0]  We can
easily reproduce this issue.

1. Write 0 to /sys/kernel/debug/kprobes/enabled.

  # echo 0 > /sys/kernel/debug/kprobes/enabled

2. Run execsnoop.  At this time, one kprobe is disabled.

  # /usr/share/bcc/tools/execsnoop &
  [1] 2460
  PCOMM            PID    PPID   RET ARGS

  # cat /sys/kernel/debug/kprobes/list
  ffffffff91345650  r  __x64_sys_execve+0x0    [FTRACE]
  ffffffff91345650  k  __x64_sys_execve+0x0    [DISABLED][FTRACE]

3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes
   kprobes_all_disarmed to false but does not arm the disabled kprobe.

  # echo 1 > /sys/kernel/debug/kprobes/enabled

  # cat /sys/kernel/debug/kprobes/list
  ffffffff91345650  r  __x64_sys_execve+0x0    [FTRACE]
  ffffffff91345650  k  __x64_sys_execve+0x0    [DISABLED][FTRACE]

4. Kill execsnoop, when __disable_kprobe() calls disarm_kprobe() for the
   disabled kprobe and hits the WARN_ONCE() in __disarm_kprobe_ftrace().

  # fg
  /usr/share/bcc/tools/execsnoop
  ^C

Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses
some cleanups and leaves the aggregated kprobe in the hash table.  Then,
__unregister_trace_kprobe() initialises tk->rp.kp.list and creates an
infinite loop like this.

  aggregated kprobe.list -> kprobe.list -.
                                     ^    |
                                     '.__.'

In this situation, these commands fall into the infinite loop and result
in RCU stall or soft lockup.

  cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the
                                       infinite loop with RCU.

  /usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex,
                                   and __get_valid_kprobe() is stuck in
				   the loop.

To avoid the issue, make sure we don't call disarm_kprobe() for disabled
kprobes.

[0]
Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2)
WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
Modules linked in: ena
CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28
Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017
RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff <0f> 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94
RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001
RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff
RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff
R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40
R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000
FS:  00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
 __disable_kprobe (kernel/kprobes.c:1716)
 disable_kprobe (kernel/kprobes.c:2392)
 __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340)
 disable_trace_kprobe (kernel/trace/trace_kprobe.c:429)
 perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168)
 perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295)
 _free_event (kernel/events/core.c:4971)
 perf_event_release_kernel (kernel/events/core.c:5176)
 perf_release (kernel/events/core.c:5186)
 __fput (fs/file_table.c:321)
 task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1))
 exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201)
 syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296)
 do_syscall_64 (arch/x86/entry/common.c:87)
 entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
RIP: 0033:0x7fe7ff210654
Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc
RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654
RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008
RBP: 0000000000000000 R08: 94ae31d6fda838a4 R0900007fe8001c9d30
R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600
R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560
</TASK>

Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com
Fixes: 69d54b916d83 ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reported-by: Ayushman Dutta <ayudutta@amazon.com>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: Kuniyuki Iwashima <kuni1840@gmail.com>
Cc: Ayushman Dutta <ayudutta@amazon.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-20 15:17:46 -07:00
Randy Dunlap
a8faed3a02 kernel/sys_ni: add compat entry for fadvise64_64
When CONFIG_ADVISE_SYSCALLS is not set/enabled and CONFIG_COMPAT is
set/enabled, the riscv compat_syscall_table references
'compat_sys_fadvise64_64', which is not defined:

riscv64-linux-ld: arch/riscv/kernel/compat_syscall_table.o:(.rodata+0x6f8):
undefined reference to `compat_sys_fadvise64_64'

Add 'fadvise64_64' to kernel/sys_ni.c as a conditional COMPAT function so
that when CONFIG_ADVISE_SYSCALLS is not set, there is a fallback function
available.

Link: https://lkml.kernel.org/r/20220807220934.5689-1-rdunlap@infradead.org
Fixes: d3ac21cacc24 ("mm: Support compiling out madvise and fadvise")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-20 15:17:45 -07:00
Linus Torvalds
4c2d0b039c Including fixes from netfilter.
Current release - regressions:
 
  - tcp: fix cleanup and leaks in tcp_read_skb() (the new way BPF
    socket maps get data out of the TCP stack)
 
  - tls: rx: react to strparser initialization errors
 
  - netfilter: nf_tables: fix scheduling-while-atomic splat
 
  - net: fix suspicious RCU usage in bpf_sk_reuseport_detach()
 
 Current release - new code bugs:
 
  - mlxsw: ptp: fix a couple of races, static checker warnings
    and error handling
 
 Previous releases - regressions:
 
  - netfilter:
    - nf_tables: fix possible module reference underflow in error path
    - make conntrack helpers deal with BIG TCP (skbs > 64kB)
    - nfnetlink: re-enable conntrack expectation events
 
  - net: fix potential refcount leak in ndisc_router_discovery()
 
 Previous releases - always broken:
 
  - sched: cls_route: disallow handle of 0
 
  - neigh: fix possible local DoS due to net iface start/stop loop
 
  - rtnetlink: fix module refcount leak in rtnetlink_rcv_msg
 
  - sched: fix adding qlen to qcpu->backlog in gnet_stats_add_queue_cpu
 
  - virtio_net: fix endian-ness for RSS
 
  - dsa: mv88e6060: prevent crash on an unused port
 
  - fec: fix timer capture timing in `fec_ptp_enable_pps()`
 
  - ocelot: stats: fix races, integer wrapping and reading incorrect
    registers (the change of register definitions here accounts for
    bulk of the changed LoC in this PR)
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmL+lGYACgkQMUZtbf5S
 IrunKw/+OfV68qJ2C+zg/qPgZg5XAD/v+3WuQo9Vsj4Z+dmxelyQkKqok61xLc6t
 eXr8v3/stDM1/zxHqCc0zJZMGhOug4RLS6kfVVwNbo6XaceTJlKcFTgM1bjQgLyT
 pMlet2JMhzpmWkMma2oztsG4zQaWSITCCjgLJByUmeO8+zKXDMojc1eew2bH8ueo
 KzZjIys+lHdEIo2uhGEU3OdhqnFn2zdVGVxcmtgtV3N9rIobnHiJdVwqLlTgnTvQ
 nU5ZoYUM4h1AG7gKSXsDbM0CPH3s4xavpkA3rMB1x4ahfxNd3y6WmpVt9qjE5wME
 8HbzutQ+x7Xf2XAQBBZma/KjmLW0GCHlQhRT+RHBryk21Yizb04HqXNMB1sPFZe6
 uDAvSZjZqPX+3aMznLTzz1T+F1TJygoeVNQ2tlxHkMuPrfS9g3T+jiohGnELF8+K
 /A3g7oCQin/qiMk35JXBuhGk4RqjyPsITOwAZ2OycHZWD/U5xd1OlkKPGUoUAg+m
 y+7XswZZJ/uBw+U+16AMMzg8vxCmoBHbgYGvnw0+96wpv4yVqTW26Wtzv01gjZPp
 wZuJkd+sHZLBNP5RkBC0PQj5rfcUj+4PUTXtW+57z+XM0HcmcqsXZHLXpMr4rS0b
 EnSsuDlfp9SWwfpMld75v/LA19a6opi6novjY4Nds3+t22ffEHY=
 =ednY
 -----END PGP SIGNATURE-----

Merge tag 'net-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter.

  Current release - regressions:

   - tcp: fix cleanup and leaks in tcp_read_skb() (the new way BPF
     socket maps get data out of the TCP stack)

   - tls: rx: react to strparser initialization errors

   - netfilter: nf_tables: fix scheduling-while-atomic splat

   - net: fix suspicious RCU usage in bpf_sk_reuseport_detach()

  Current release - new code bugs:

   - mlxsw: ptp: fix a couple of races, static checker warnings and
     error handling

  Previous releases - regressions:

   - netfilter:
      - nf_tables: fix possible module reference underflow in error path
      - make conntrack helpers deal with BIG TCP (skbs > 64kB)
      - nfnetlink: re-enable conntrack expectation events

   - net: fix potential refcount leak in ndisc_router_discovery()

  Previous releases - always broken:

   - sched: cls_route: disallow handle of 0

   - neigh: fix possible local DoS due to net iface start/stop loop

   - rtnetlink: fix module refcount leak in rtnetlink_rcv_msg

   - sched: fix adding qlen to qcpu->backlog in gnet_stats_add_queue_cpu

   - virtio_net: fix endian-ness for RSS

   - dsa: mv88e6060: prevent crash on an unused port

   - fec: fix timer capture timing in `fec_ptp_enable_pps()`

   - ocelot: stats: fix races, integer wrapping and reading incorrect
     registers (the change of register definitions here accounts for
     bulk of the changed LoC in this PR)"

* tag 'net-6.0-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
  net: moxa: MAC address reading, generating, validity checking
  tcp: handle pure FIN case correctly
  tcp: refactor tcp_read_skb() a bit
  tcp: fix tcp_cleanup_rbuf() for tcp_read_skb()
  tcp: fix sock skb accounting in tcp_read_skb()
  igb: Add lock to avoid data race
  dt-bindings: Fix incorrect "the the" corrections
  net: genl: fix error path memory leak in policy dumping
  stmmac: intel: Add a missing clk_disable_unprepare() call in intel_eth_pci_remove()
  net: ethernet: mtk_eth_soc: fix possible NULL pointer dereference in mtk_xdp_run
  net/mlx5e: Allocate flow steering storage during uplink initialization
  net: mscc: ocelot: report ndo_get_stats64 from the wraparound-resistant ocelot->stats
  net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset
  net: mscc: ocelot: make struct ocelot_stat_layout array indexable
  net: mscc: ocelot: fix race between ndo_get_stats64 and ocelot_check_stats_work
  net: mscc: ocelot: turn stats_lock into a spinlock
  net: mscc: ocelot: fix address of SYS_COUNT_TX_AGING counter
  net: mscc: ocelot: fix incorrect ndo_get_stats64 packet counters
  net: dsa: felix: fix ethtool 256-511 and 512-1023 TX packet counters
  net: dsa: don't warn in dsa_port_set_state_now() when driver doesn't support it
  ...
2022-08-18 19:37:15 -07:00
David Howells
fc4aaf9fb3 net: Fix suspicious RCU usage in bpf_sk_reuseport_detach()
bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags()
to obtain the value of sk->sk_user_data, but that function is only usable
if the RCU read lock is held, and neither that function nor any of its
callers hold it.

Fix this by adding a new helper, __locked_read_sk_user_data_with_flags()
that checks to see if sk->sk_callback_lock() is held and use that here
instead.

Alternatively, making __rcu_dereference_sk_user_data_with_flags() use
rcu_dereference_checked() might suffice.

Without this, the following warning can be occasionally observed:

=============================
WARNING: suspicious RCU usage
6.0.0-rc1-build2+ #563 Not tainted
-----------------------------
include/net/sock.h:592 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
5 locks held by locktest/29873:
 #0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121
 #1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70
 #2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0
 #3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd
 #4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4

stack backtrace:
CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563
Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x4c/0x5f
 bpf_sk_reuseport_detach+0x6d/0xa4
 reuseport_detach_sock+0x75/0xdd
 inet_unhash+0xa5/0x1c0
 tcp_set_state+0x169/0x20f
 ? lockdep_sock_is_held+0x3a/0x3a
 ? __lock_release.isra.0+0x13e/0x220
 ? reacquire_held_locks+0x1bb/0x1bb
 ? hlock_class+0x31/0x96
 ? mark_lock+0x9e/0x1af
 __tcp_close+0x50/0x4b6
 tcp_close+0x28/0x70
 inet_release+0x8e/0xa7
 __sock_release+0x95/0x121
 sock_close+0x14/0x17
 __fput+0x20f/0x36a
 task_work_run+0xa3/0xcc
 exit_to_user_mode_prepare+0x9c/0x14d
 syscall_exit_to_user_mode+0x18/0x44
 entry_SYSCALL_64_after_hwframe+0x63/0xcd

Fixes: cf8c1e967224 ("net: refactor bpf_sk_reuseport_detach()")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hawkins Jiawei <yin31149@gmail.com>
Link: https://lore.kernel.org/r/166064248071.3502205.10036394558814861778.stgit@warthog.procyon.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-08-17 16:42:59 -07:00
Tejun Heo
4f7e723643 cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock
Bringing up a CPU may involve creating and destroying tasks which requires
read-locking threadgroup_rwsem, so threadgroup_rwsem nests inside
cpus_read_lock(). However, cpuset's ->attach(), which may be called with
thredagroup_rwsem write-locked, also wants to disable CPU hotplug and
acquires cpus_read_lock(), leading to a deadlock.

Fix it by guaranteeing that ->attach() is always called with CPU hotplug
disabled and removing cpus_read_lock() call from cpuset_attach().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-and-tested-by: Imran Khan <imran.f.khan@oracle.com>
Reported-and-tested-by: Xuewen Yan <xuewen.yan@unisoc.com>
Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug")
Cc: stable@vger.kernel.org # v5.17+
2022-08-17 07:36:05 -10:00
Hao Jia
76b079ef4c sched/psi: Remove unused parameter nbytes of psi_trigger_create()
psi_trigger_create()'s 'nbytes' parameter is not used, so we can remove it.

Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-15 12:35:25 -10:00
Hao Jia
2b97cf7628 sched/psi: Zero the memory of struct psi_group
After commit 5f69a6577bc3 ("psi: dont alloc memory for psi by default"),
the memory used by struct psi_group is no longer allocated and zeroed
in cgroup_create().

Since the memory of struct psi_group is not zeroed, the data in this
memory is random, which will lead to inaccurate psi statistics when
creating a new cgroup.

So we use kzlloc() to allocate and zero the struct psi_group and
remove the redundant zeroing in group_init().

Steps to reproduce:
1. Use cgroup v2 and enable CONFIG_PSI
2. Create a new cgroup, and query psi statistics
mkdir /sys/fs/cgroup/test
cat /sys/fs/cgroup/test/cpu.pressure
some avg10=0.00 avg60=0.00 avg300=47927752200.00 total=12884901
full avg10=561815124.00 avg60=125835394188.00 avg300=1077090462000.00 total=10273561772

cat /sys/fs/cgroup/test/io.pressure
some avg10=1040093132823.95 avg60=1203770351379.21 avg300=3862252669559.46 total=4294967296
full avg10=921884564601.39 avg60=0.00 avg300=1984507298.35 total=442381631

cat /sys/fs/cgroup/test/memory.pressure
some avg10=232476085778.11 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=2585658472280.57 total=12884901

Fixes: commit 5f69a6577bc3 ("psi: dont alloc memory for psi by default")
Signed-off-by: Hao Jia <jiahao.os@bytedance.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
2022-08-15 12:35:13 -10:00
David Gow
41a55567b9 module: kunit: Load .kunit_test_suites section when CONFIG_KUNIT=m
The new KUnit module handling has KUnit test suites listed in a
.kunit_test_suites section of each module. This should be loaded when
the module is, but at the moment this only happens if KUnit is built-in.

Also load this when KUnit is enabled as a module: it'll not be usable
unless KUnit is loaded, but such modules are likely to depend on KUnit
anyway, so it's unlikely to ever be loaded needlessly.

Fixes: 3d6e44623841 ("kunit: unify module and builtin suite definitions")
Signed-off-by: David Gow <davidgow@google.com>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2022-08-15 13:51:07 -06:00
Linus Torvalds
5d6a0f4da9 xen: branch for v6.0-rc1b
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYvi0yQAKCRCAXGG7T9hj
 vmikAQDWSrcWuxDkGnzut0A1tBQRUCWDMyKPqigWAA5tH2sPgAEAtWfBvT1xyl7T
 gZ22I7o21WxxDGyvNUcA65pK7c2cpg8=
 =UMbq
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-6.0-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull more xen updates from Juergen Gross:

 - fix the handling of the "persistent grants" feature negotiation
   between Xen blkfront and Xen blkback drivers

 - a cleanup of xen.config and adding xen.config to Xen section in
   MAINTAINERS

 - support HVMOP_set_evtchn_upcall_vector, which is more compliant to
   "normal" interrupt handling than the global callback used up to now

 - further small cleanups

* tag 'for-linus-6.0-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  MAINTAINERS: add xen config fragments to XEN HYPERVISOR sections
  xen: remove XEN_SCRUB_PAGES in xen.config
  xen/pciback: Fix comment typo
  xen/xenbus: fix return type in xenbus_file_read()
  xen-blkfront: Apply 'feature_persistent' parameter when connect
  xen-blkback: Apply 'feature_persistent' parameter when connect
  xen-blkback: fix persistent grants negotiation
  x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
2022-08-14 09:28:54 -07:00
Linus Torvalds
f6eb0fed6a Misc timer fixes:
- fix a potential use-after-free bug in posix timers
  - correct a prototype
  - address a build warning
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmL3epQRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1iPZw/+I/9GXcf3SzbG5M6Nf21SJpSjC4hAHHgb
 eyv5MUNxKvCHU5iT2SrCvgKjESl5I/E70kubeRHJnvarBPUzGnHHzGlYIYOaJPQ7
 irJpUj/6R8ps4UsMBJ8vj5f3b7163zhBJVP8egDW6roT1HUrYTFeIjIli/SOCxpY
 H1/DqHlbEALE5o5xykg3zuqAbywym+hNRleIVls4wqjZNnfqiTElSuW9xqw9xt3n
 9xYmOKZaztdv5Lp2JCm7QOu2byGzeHje72ppsDcBZ3EBvHUBLSndhfe5NQUGhtxy
 UlBqAELA653uPgPnNKLRMqt/kop8emHqvAx8T0RawPwoUS6XGDVxRX+my8+HKklg
 P8KsM/8W7+3KTHz0bf72DEHTFiXCzlswRzdOSvP5bR4xw1G4ychzvuxAiPDFR3zT
 v7uPgykxxCrEexVCBCdPmrl4WikwLJtcrSXtJ4bsisxQFlq7WWd2/osZkTffI3pN
 IIxDXuHFHC78lrUMk2OQ+ITBz01z4nCFSlgMGZ6ZY6ppS1Rndy1HG/B2NgjW1zGP
 Y/1xq/nWaql0QO7RmyoJXt1ZSMJYCyKFocRDh9nBmtBSlYm3A8aIA8b4i1VRRG1G
 8HOkdS8ef2eOWj8wqk0NvoTbiGjV7YM5pf0g1dmRLA+aGCBD1P9/iFcBv5b6Uxaq
 qZ7ZtuQzsyc=
 =Plg8
 -----END PGP SIGNATURE-----

Merge tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer fixes from Ingo Molnar:
 "Misc timer fixes:

   - fix a potential use-after-free bug in posix timers

   - correct a prototype

   - address a build warning"

* tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  posix-cpu-timers: Cleanup CPU timers before freeing them during exec
  time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64
  posix-timers: Make do_clock_gettime() static
2022-08-13 14:38:22 -07:00