37720 Commits

Author SHA1 Message Date
Linus Torvalds
43aa0a195f objtool updates:
- Improve retpoline code patching by separating it from alternatives which
    reduces memory footprint and allows to do better optimizations in the
    actual runtime patching.
 
  - Add proper retpoline support for x86/BPF
 
  - Address noinstr warnings in x86/kvm, lockdep and paravirtualization code
 
  - Add support to handle pv_opsindirect calls in the noinstr analysis
 
  - Classify symbols upfront and cache the result to avoid redundant
    str*cmp() invocations.
 
  - Add a CFI hash to reduce memory consumption which also reduces runtime
    on a allyesconfig by ~50%
 
  - Adjust XEN code to make objtool handling more robust and as a side
    effect to prevent text fragmentation due to placement of the hypercall
    page.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/GFgTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoc1JD/0Sz6seP2OUMxbMT3gCcFo9sMvYTdsM
 7WuGFbBbnCIo7g8JH7k0zRRBigptMp2eUtQXKkgaaIbWN4JbuVKf8KxN5/qXxLi4
 fJ12QnNTGH9N2jtzl5wKmpjaKJnnJMD9D10XwoR+T6gn6NHd+AgLEs7GxxuQUlgo
 eC9oEXhNHC8uNhiZc38EwfwmItI1bRgaLrnZWIL4rYGSMxfCK1/cEOpWrFfX9wmj
 /diB6oqMyPXZXMCtgpX7TniUr5XOTCcUkeO9mQv5bmyq/YM/8hrTbcVSJlsVYLvP
 EsBnUSHAcfLFiHXwa1RNiIGdbiPjbN+UYeXGAvqF58f3e5dTIHtN/UmWo7OH93If
 9rLMVNcMpsfPx7QRk2IxEPumLCkyfwjzfKrVDM6P6TKEIUzD1og4IK9gTlfykVsh
 56G5XiCOC/X2x8IMxKTLGuBiAVLFHXK/rSwoqhvNEWBFKDbP13QWs0LurBcW09Sa
 /kQI9pIBT1xFA/R+OY5Xy1cqNVVK1Gxmk8/bllCijA9pCFSCFM4hLZE5CevdrBCV
 h5SdqEK5hIlzFyypXfsCik/4p/+rfvlGfUKtFsPctxx29SPe+T0orx+l61jiWQok
 rZOflwMawK5lDuASHrvNHGJcWaTwoo3VcXMQDnQY0Wulc43J5IFBaPxkZzgyd+S1
 4lktHxatrCMUgw==
 =pfZi
 -----END PGP SIGNATURE-----

Merge tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull objtool updates from Thomas Gleixner:

 - Improve retpoline code patching by separating it from alternatives
   which reduces memory footprint and allows to do better optimizations
   in the actual runtime patching.

 - Add proper retpoline support for x86/BPF

 - Address noinstr warnings in x86/kvm, lockdep and paravirtualization
   code

 - Add support to handle pv_opsindirect calls in the noinstr analysis

 - Classify symbols upfront and cache the result to avoid redundant
   str*cmp() invocations.

 - Add a CFI hash to reduce memory consumption which also reduces
   runtime on a allyesconfig by ~50%

 - Adjust XEN code to make objtool handling more robust and as a side
   effect to prevent text fragmentation due to placement of the
   hypercall page.

* tag 'objtool-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (41 commits)
  bpf,x86: Respect X86_FEATURE_RETPOLINE*
  bpf,x86: Simplify computing label offsets
  x86,bugs: Unconditionally allow spectre_v2=retpoline,amd
  x86/alternative: Add debug prints to apply_retpolines()
  x86/alternative: Try inline spectre_v2=retpoline,amd
  x86/alternative: Handle Jcc __x86_indirect_thunk_\reg
  x86/alternative: Implement .retpoline_sites support
  x86/retpoline: Create a retpoline thunk array
  x86/retpoline: Move the retpoline thunk declarations to nospec-branch.h
  x86/asm: Fixup odd GEN-for-each-reg.h usage
  x86/asm: Fix register order
  x86/retpoline: Remove unused replacement symbols
  objtool,x86: Replace alternatives with .retpoline_sites
  objtool: Shrink struct instruction
  objtool: Explicitly avoid self modifying code in .altinstr_replacement
  objtool: Classify symbols
  objtool: Support pv_opsindirect calls for noinstr
  x86/xen: Rework the xen_{cpu,irq,mmu}_opsarrays
  x86/xen: Mark xen_force_evtchn_callback() noinstr
  x86/xen: Make irq_disable() noinstr
  ...
2021-11-01 13:24:43 -07:00
Linus Torvalds
595b28fb0c Locking updates:
- Move futex code into kernel/futex/ and split up the kitchen sink into
    seperate files to make integration of sys_futex_waitv() simpler.
 
  - Add a new sys_futex_waitv() syscall which allows to wait on multiple
    futexes. The main use case is emulating Windows' WaitForMultipleObjects
    which allows Wine to improve the performance of Windows Games. Also
    native Linux games can benefit from this interface as this is a common
    wait pattern for this kind of applications.
 
  - Add context to ww_mutex_trylock() to provide a path for i915 to rework
    their eviction code step by step without making lockdep upset until the
    final steps of rework are completed. It's also useful for regulator and
    TTM to avoid dropping locks in the non contended path.
 
  - Lockdep and might_sleep() cleanups and improvements
 
  - A few improvements for the RT substitutions.
 
  - The usual small improvements and cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/FTITHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVNZD/9vIm3Bu1Coz8tbNXz58AiCYq9Y/vp5
 mzFgSzz+VJTkW5Vh8jo5Uel4rCKZyt+rL276EoaRPzYl8KFtWDbpK3qd3PrXKqTX
 At49JO4ttAMJUHIBQ6vblEkykmfEd9YPU1uSWk5roJ+s7Jmr5VWnu0FEWHP00As5
 tWOca/TM0ei9kof26V2fl5aecTGII4i4Zsvy+LPsXtI+TnmP0gSBcGAS/5UnZTtJ
 vQRWTR3ojoYvh5iTmNqbaURYoQLe2j8yscn1DSW1CABWVmP12eDWs+N7jRP4b5S9
 73xOv5P7vpva41wxrK2ir5iNkpsLE97VL2JOHTW8nm7orblfiuxHLTCkTjEdd2pO
 h8blI2IBizEB3JYn2BMkOAaZQOSjN8hd6Ye/b2B4AMEGWeXEoEv6eVy/orYKCluQ
 XDqGn47Vce/SYmo5vfTB8VMt6nANx8PKvOP3IvjHInYEQBgiT6QrlUw3RRkXBp5s
 clQkjYYwjAMVIXowcCrdhoKjMROzi6STShVwHwGL8MaZXqr8Vl6BUO9ckU0pY+4C
 F000Hzwxi8lGEQ9k+P+BnYOEzH5osCty8lloKiQ/7ciX6T+CZHGJPGK/iY4YL8P5
 C3CJWMsHCqST7DodNFJmdfZt99UfIMmEhshMDduU9AAH0tHCn8vOu0U6WvCtpyBp
 BvHj68zteAtlYg==
 =RZ4x
 -----END PGP SIGNATURE-----

Merge tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Thomas Gleixner:

 - Move futex code into kernel/futex/ and split up the kitchen sink into
   seperate files to make integration of sys_futex_waitv() simpler.

 - Add a new sys_futex_waitv() syscall which allows to wait on multiple
   futexes.

   The main use case is emulating Windows' WaitForMultipleObjects which
   allows Wine to improve the performance of Windows Games. Also native
   Linux games can benefit from this interface as this is a common wait
   pattern for this kind of applications.

 - Add context to ww_mutex_trylock() to provide a path for i915 to
   rework their eviction code step by step without making lockdep upset
   until the final steps of rework are completed. It's also useful for
   regulator and TTM to avoid dropping locks in the non contended path.

 - Lockdep and might_sleep() cleanups and improvements

 - A few improvements for the RT substitutions.

 - The usual small improvements and cleanups.

* tag 'locking-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
  locking: Remove spin_lock_flags() etc
  locking/rwsem: Fix comments about reader optimistic lock stealing conditions
  locking: Remove rcu_read_{,un}lock() for preempt_{dis,en}able()
  locking/rwsem: Disable preemption for spinning region
  docs: futex: Fix kernel-doc references
  futex: Fix PREEMPT_RT build
  futex2: Documentation: Document sys_futex_waitv() uAPI
  selftests: futex: Test sys_futex_waitv() wouldblock
  selftests: futex: Test sys_futex_waitv() timeout
  selftests: futex: Add sys_futex_waitv() test
  futex,arm: Wire up sys_futex_waitv()
  futex,x86: Wire up sys_futex_waitv()
  futex: Implement sys_futex_waitv()
  futex: Simplify double_lock_hb()
  futex: Split out wait/wake
  futex: Split out requeue
  futex: Rename mark_wake_futex()
  futex: Rename: match_futex()
  futex: Rename: hb_waiter_{inc,dec,pending}()
  futex: Split out PI futex
  ...
2021-11-01 13:15:36 -07:00
Linus Torvalds
91e1c99e17 perf updates:
core:
 
   - Allow ftrace to instrument parts of the perf core code
 
   - Add a new mem_hops field to perf_mem_data_src which allows to represent
     intra-node/package or inter-node/off-package details to prepare for
     next generation systems which have more hieararchy within the
     node/pacakge level.
 
  tools:
 
   - Update for the new mem_hops field in perf_mem_data_src
 
  arch:
 
   - A set of constraints fixes for the Intel uncore PMU
 
   - The usual set of small fixes and improvements for x86 and PPC
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF/GkQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoaD8D/wLhXR8RxtF4W9HJmHA+5XFsPtg+isp
 ZNU2kOs4gZskFx75NQaRv5ikA8y68TKdIx+NuQvRLYItaMveTToLSsJ55bfGMxIQ
 JHqDvANUNxBmAACnbYQlqf9WgB0i/3fCUHY5lpmN0waKjaswz7WNpycv4ccShVZr
 PKbgEjkeFBhplCqqOF0X5H3V+4q85+nZONm1iSNd4S7/3B6OCxOf1u78usL1bbtW
 yJAMSuTeOVUZCJm7oVywKW/ZlCscT135aKr6xe5QTrjlPuRWzuLaXNezdMnMyoVN
 HVv8a0ClACb8U5KiGfhvaipaIlIAliWJp2qoiNjrspDruhH6Yc+eNh1gUhLbtNpR
 4YZR5jxv4/mS13kzMMQg00cCWQl7N4whPT+ZE9pkpshGt+EwT+Iy3U+v13wDfnnp
 MnDggpWYGEkAck13t/T6DwC3qBIsVujtpiG+tt/ERbTxiuxi1ccQTGY3PDjtHV3k
 tIMH5n7l4jEpfl8VmoSUgz/2h1MLZnQUWp41GXkjkaOt7uunQZen+nAwqpTm28KV
 7U6U0h1q6r7HxOZRxkPPe4HSV+aBNH3H1LeNBfEd3hDCFGf6MY6vLow+2BE9ybk7
 Y6LPbRqq0SN3sd5MND0ZvQEt5Zgol8CMlX+UKoLEEv7RognGbIxkgpK7exv5pC9w
 nWj7TaMfpRzPgw==
 =Oj0G
 -----END PGP SIGNATURE-----

Merge tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf updates from Thomas Gleixner:
 "Core:

   - Allow ftrace to instrument parts of the perf core code

   - Add a new mem_hops field to perf_mem_data_src which allows to
     represent intra-node/package or inter-node/off-package details to
     prepare for next generation systems which have more hieararchy
     within the node/pacakge level.

  Tools:

   - Update for the new mem_hops field in perf_mem_data_src

  Arch:

   - A set of constraints fixes for the Intel uncore PMU

   - The usual set of small fixes and improvements for x86 and PPC"

* tag 'perf-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/intel: Fix ICL/SPR INST_RETIRED.PREC_DIST encodings
  powerpc/perf: Fix data source encodings for L2.1 and L3.1 accesses
  tools/perf: Add mem_hops field in perf_mem_data_src structure
  perf: Add mem_hops field in perf_mem_data_src structure
  perf: Add comment about current state of PERF_MEM_LVL_* namespace and remove an extra line
  perf/core: Allow ftrace for functions in kernel/event/core.c
  perf/x86: Add new event for AUX output counter index
  perf/x86: Add compiler barrier after updating BTS
  perf/x86/intel/uncore: Fix Intel SPR M3UPI event constraints
  perf/x86/intel/uncore: Fix Intel SPR M2PCIE event constraints
  perf/x86/intel/uncore: Fix Intel SPR IIO event constraints
  perf/x86/intel/uncore: Fix Intel SPR CHA event constraints
  perf/x86/intel/uncore: Fix Intel ICX IIO event constraints
  perf/x86/intel/uncore: Fix invalid unit check
  perf/x86/intel/uncore: Support extra IMC channel on Ice Lake server
2021-11-01 13:12:15 -07:00
Linus Torvalds
5a47ebe98e Updates for the interrupt subsystem:
Core changes:
 
   - Prevent a potential deadlock when initial priority is assigned to a
     newly created interrupt thread. A recent change to plug a race between
     cpuset and __sched_setscheduler() introduced a new lock dependency
     which is now triggered. Break the lock dependency chain by moving the
     priority assignment to the thread function.
 
   - A couple of small updates to make the irq core RT safe.
 
   - Confine the irq_cpu_online/offline() API to the only left unfixable
     user Cavium Octeon so that it does not grow new usage.
 
   - A small documentation update
 
  Driver changes:
 
   - A large cross architecture rework to move irq_enter/exit() into the
     architecture code to make addressing the NOHZ_FULL/RCU issues simpler.
 
   - The obligatory new irq chip driver for Microchip EIC
 
   - Modularize a few irq chip drivers
 
   - Expand usage of devm_*() helpers throughout the driver code
 
   - The usual small fixes and improvements all over the place
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmF+8BUTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoWs2EACeNbL93aIFokd2/RllRSr4VvMjKNyW
 PpA0RYDOz1Jh4ldK+7b/EYapKgAkR3yyOtz+jyjRE7jsQK0pQeLtYNLd3cTzsD7K
 LCvl8rq6cbRqyFoSC15UKKNbQ/f+o/3LeGPoipr5NQZRMepxk2J/yBCNRXHvIbe6
 oLMQJUgw7KKtvCrCUX9OSei4F09T1qsNrIYb7QafP5+v0zndAT7uKNivWrKGFrsh
 Uk9epoH3hIkvQERkpmzwJEJaq6oyqhoYQy7ZRGayEPwIdCyivJGZrVX0mZk1LX58
 uc8u5grIslX9MqZEQWBweR5y7nISB494NGKmoCInu66U/+3DSOg3AGH2Rfw8PNFZ
 lMKdXzYoDgv2y6LeiLtTUKV4K1NBRXo0BhwSGbPw0o6C03/x003kG824Y+/naU75
 6q05BZSia1PagPV3e0UAm0A2Rnjj/5uso2fEk0eGBSGM27jf9SQcSE8DVrEiLRd1
 2N5uAXbMdfu4xACsEI1Uxu1KNOSQnUhBCy0X6Ppj1a083kLG7jg/126ebb05R8G4
 MF79PFt+xUPSzmuKc/xwCdANtW+zzoyjYl5w6mwELBJ9veNbPShokGBTN/qzjXKZ
 vdr3/pXx95lRAzFnGOnETesm3IyObruU4K8NbMKd2b+eYa0w1WuZCKnutGLfsqxg
 byhCEw459e3P2g==
 =r6ln
 -----END PGP SIGNATURE-----

Merge tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq updates from Thomas Gleixner:
 "Updates for the interrupt subsystem:

  Core changes:

   - Prevent a potential deadlock when initial priority is assigned to a
     newly created interrupt thread. A recent change to plug a race
     between cpuset and __sched_setscheduler() introduced a new lock
     dependency which is now triggered. Break the lock dependency chain
     by moving the priority assignment to the thread function.

   - A couple of small updates to make the irq core RT safe.

   - Confine the irq_cpu_online/offline() API to the only left unfixable
     user Cavium Octeon so that it does not grow new usage.

   - A small documentation update

  Driver changes:

   - A large cross architecture rework to move irq_enter/exit() into the
     architecture code to make addressing the NOHZ_FULL/RCU issues
     simpler.

   - The obligatory new irq chip driver for Microchip EIC

   - Modularize a few irq chip drivers

   - Expand usage of devm_*() helpers throughout the driver code

   - The usual small fixes and improvements all over the place"

* tag 'irq-core-2021-10-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
  h8300: Fix linux/irqchip.h include mess
  dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
  MIPS: irq: Avoid an unused-variable error
  genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
  irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
  MIPS: loongson64: Drop call to irq_cpu_offline()
  irq: remove handle_domain_{irq,nmi}()
  irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
  irq: riscv: perform irqentry in entry code
  irq: openrisc: perform irqentry in entry code
  irq: csky: perform irqentry in entry code
  irq: arm64: perform irqentry in entry code
  irq: arm: perform irqentry in entry code
  irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
  irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
  irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
  irq: add generic_handle_arch_irq()
  irq: unexport handle_irq_desc()
  irq: simplify handle_domain_{irq,nmi}()
  irq: mips: simplify do_domain_IRQ()
  ...
2021-11-01 13:09:10 -07:00
He Fengqing
588e5d8766 cgroup: bpf: Move wrapper for __cgroup_bpf_*() to kernel/bpf/cgroup.c
In commit 324bda9e6c5a("bpf: multi program support for cgroup+bpf")
cgroup_bpf_*() called from kernel/bpf/syscall.c, but now they are only
used in kernel/bpf/cgroup.c, so move these function to
kernel/bpf/cgroup.c, like cgroup_bpf_replace().

Signed-off-by: He Fengqing <hefengqing@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-01 06:49:00 -10:00
Dan Schatzberg
81c49d39ae cgroup: Fix rootcg cpu.stat guest double counting
In account_guest_time in kernel/sched/cputime.c guest time is
attributed to both CPUTIME_NICE and CPUTIME_USER in addition to
CPUTIME_GUEST_NICE and CPUTIME_GUEST respectively. Therefore, adding
both to calculate usage results in double counting any guest time at
the rootcg.

Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup")
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2021-11-01 06:47:08 -10:00
Linus Torvalds
33c8846c81 for-5.16/block-2021-10-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmF8KDgQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmQ2D/wO0nH3U+3+OZChi3XUwYck9Dev3o6BANCF
 ClATiK/kivZY0xY1r8J4ixirZo2gcjIMpWSC3JGYZ5LdspfmYGLUbMjfZsaeU23i
 lAKaX1IqfArmHN76k3IU1bKCg7B0/LFwC0q9QTFWTSwNSs8RK/EZLJ61U1hEXUb3
 OfIpaMmvPiMaU7yuPqhcZK14m1cg1srrLM4rFB/PqsWWStF07pHq32WeArGDAU0e
 Fe0YSnYD7qqA5Qc37KwqjCTmmxKX5YZf7etIcA6p3DNmwcuQrVNzKoCH/ZEDijaD
 E2bS/BWbN1x96+rtoEZfBYEaNIrkmJzmW6+fJ53OITbJF3KqP6V66erhqNcFYCzC
 mhFlRe7voXb/8AP7zQqSIhK529BUBM36sQ6nF7EiQcDrfLc1z39mq6eblUxbknIA
 DDPISD5Tseik9N9x0bc7vINseKyHI1E90VAU/XKADcuGbzLvehPx+2p+Iq5ch5Ah
 oa1G3RdlWWQOZxphJHWJhu1qMfo5+FP9dFZj1aoo7b8Kbc/CedyoQe71cpIE5wNh
 Jj/EpWJnuyKXwuTic2VYGC+6ezM9O5DSdqCfP3YuZky95VESyvRCKJYMMgBYRVdC
 /LuxhnBXIY2G8An7ZTnX0kLCCvLbapIwa0NyA98/xeOngO843coJ6wn8ZmE9LJNH
 kMmpCygUrA==
 =QWC+
 -----END PGP SIGNATURE-----

Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - mq-deadline accounting improvements (Bart)

 - blk-wbt timer fix (Andrea)

 - Untangle the block layer includes (Christoph)

 - Rework the poll support to be bio based, which will enable adding
   support for polling for bio based drivers (Christoph)

 - Block layer core support for multi-actuator drives (Damien)

 - blk-crypto improvements (Eric)

 - Batched tag allocation support (me)

 - Request completion batching support (me)

 - Plugging improvements (me)

 - Shared tag set improvements (John)

 - Concurrent queue quiesce support (Ming)

 - Cache bdev in ->private_data for block devices (Pavel)

 - bdev dio improvements (Pavel)

 - Block device invalidation and block size improvements (Xie)

 - Various cleanups, fixes, and improvements (Christoph, Jackie,
   Masahira, Tejun, Yu, Pavel, Zheng, me)

* tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
  blk-mq-debugfs: Show active requests per queue for shared tags
  block: improve readability of blk_mq_end_request_batch()
  virtio-blk: Use blk_validate_block_size() to validate block size
  loop: Use blk_validate_block_size() to validate block size
  nbd: Use blk_validate_block_size() to validate block size
  block: Add a helper to validate the block size
  block: re-flow blk_mq_rq_ctx_init()
  block: prefetch request to be initialized
  block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
  block: add rq_flags to struct blk_mq_alloc_data
  block: add async version of bio_set_polled
  block: kill DIO_MULTI_BIO
  block: kill unused polling bits in __blkdev_direct_IO()
  block: avoid extra iter advance with async iocb
  block: Add independent access ranges support
  blk-mq: don't issue request directly in case that current is to be blocked
  sbitmap: silence data race warning
  blk-cgroup: synchronize blkg creation against policy deactivation
  block: refactor bio_iov_bvec_set()
  block: add single bio async direct IO helper
  ...
2021-11-01 09:19:50 -07:00
Pawan Gupta
8a03e56b25 bpf: Disallow unprivileged bpf by default
Disabling unprivileged BPF would help prevent unprivileged users from
creating certain conditions required for potential speculative execution
side-channel attacks on unmitigated affected hardware.

A deep dive on such attacks and current mitigations is available here [0].

Sync with what many distros are currently applying already, and disable
unprivileged BPF by default. An admin can enable this at runtime, if
necessary, as described in 08389d888287 ("bpf: Add kconfig knob for
disabling unpriv bpf by default").

  [0] "BPF and Spectre: Mitigating transient execution attacks", Daniel Borkmann, eBPF Summit '21
      https://ebpf.io/summit-2021-slides/eBPF_Summit_2021-Keynote-Daniel_Borkmann-BPF_and_Spectre.pdf

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/bpf/0ace9ce3f97656d5f62d11093ad7ee81190c3c25.1635535215.git.pawan.kumar.gupta@linux.intel.com
2021-11-01 17:06:47 +01:00
Linus Torvalds
49f8275c7d Memory folios
Add memory folios, a new type to represent either order-0 pages or
 the head page of a compound page.  This should be enough infrastructure
 to support filesystems converting from pages to folios.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmF9uI0ACgkQDpNsjXcp
 gj7MUAf/R7LCZ+xFiIedw7SAgb/DGK0C9uVjuBEIZgAw21ZUw/GuPI6cuKBMFGGf
 rRcdtlvMpwi7yZJcoNXxaqU/xPaaJMjf2XxscIvYJP1mjlZVuwmP9dOx0neNvWOc
 T+8lqR6c1TLl82lpqIjGFLwvj2eVowq2d3J5jsaIJFd4odmmYVInrhJXOzC/LQ54
 Niloj5ksehf+KUIRLDz7ycppvIHhlVsoAl0eM2dWBAtL0mvT7Nyn/3y+vnMfV2v3
 Flb4opwJUgTJleYc16oxTn9svT2yS8q2uuUemRDLW8ABghoAtH3fUUk43RN+5Krd
 LYCtbeawtkikPVXZMfWybsx5vn0c3Q==
 =7SBe
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache

Pull memory folios from Matthew Wilcox:
 "Add memory folios, a new type to represent either order-0 pages or the
  head page of a compound page. This should be enough infrastructure to
  support filesystems converting from pages to folios.

  The point of all this churn is to allow filesystems and the page cache
  to manage memory in larger chunks than PAGE_SIZE. The original plan
  was to use compound pages like THP does, but I ran into problems with
  some functions expecting only a head page while others expect the
  precise page containing a particular byte.

  The folio type allows a function to declare that it's expecting only a
  head page. Almost incidentally, this allows us to remove various calls
  to VM_BUG_ON(PageTail(page)) and compound_head().

  This converts just parts of the core MM and the page cache. For 5.17,
  we intend to convert various filesystems (XFS and AFS are ready; other
  filesystems may make it) and also convert more of the MM and page
  cache to folios. For 5.18, multi-page folios should be ready.

  The multi-page folios offer some improvement to some workloads. The
  80% win is real, but appears to be an artificial benchmark (postgres
  startup, which isn't a serious workload). Real workloads (eg building
  the kernel, running postgres in a steady state, etc) seem to benefit
  between 0-10%. I haven't heard of any performance losses as a result
  of this series. Nobody has done any serious performance tuning; I
  imagine that tweaking the readahead algorithm could provide some more
  interesting wins. There are also other places where we could choose to
  create large folios and currently do not, such as writes that are
  larger than PAGE_SIZE.

  I'd like to thank all my reviewers who've offered review/ack tags:
  Christoph Hellwig, David Howells, Jan Kara, Jeff Layton, Johannes
  Weiner, Kirill A. Shutemov, Michal Hocko, Mike Rapoport, Vlastimil
  Babka, William Kucharski, Yu Zhao and Zi Yan.

  I'd also like to thank those who gave feedback I incorporated but
  haven't offered up review tags for this part of the series: Nick
  Piggin, Mel Gorman, Ming Lei, Darrick Wong, Ted Ts'o, John Hubbard,
  Hugh Dickins, and probably a few others who I forget"

* tag 'folio-5.16' of git://git.infradead.org/users/willy/pagecache: (90 commits)
  mm/writeback: Add folio_write_one
  mm/filemap: Add FGP_STABLE
  mm/filemap: Add filemap_get_folio
  mm/filemap: Convert mapping_get_entry to return a folio
  mm/filemap: Add filemap_add_folio()
  mm/filemap: Add filemap_alloc_folio
  mm/page_alloc: Add folio allocation functions
  mm/lru: Add folio_add_lru()
  mm/lru: Convert __pagevec_lru_add_fn to take a folio
  mm: Add folio_evictable()
  mm/workingset: Convert workingset_refault() to take a folio
  mm/filemap: Add readahead_folio()
  mm/filemap: Add folio_mkwrite_check_truncate()
  mm/filemap: Add i_blocks_per_folio()
  mm/writeback: Add folio_redirty_for_writepage()
  mm/writeback: Add folio_account_redirty()
  mm/writeback: Add folio_clear_dirty_for_io()
  mm/writeback: Add folio_cancel_dirty()
  mm/writeback: Add folio_account_cleaned()
  mm/writeback: Add filemap_dirty_folio()
  ...
2021-11-01 08:47:59 -07:00
Joerg Roedel
52d96919d6 Merge branches 'apple/dart', 'arm/mediatek', 'arm/renesas', 'arm/smmu', 'arm/tegra', 'iommu/fixes', 'x86/amd', 'x86/vt-d' and 'core' into next 2021-10-31 22:26:53 +01:00
Vincent Guittot
8ea9183db4 sched/fair: Cleanup newidle_balance
update_next_balance() uses sd->last_balance which is not modified by
load_balance() so we can merge the 2 calls in one place.

No functional change

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-6-vincent.guittot@linaro.org
2021-10-31 11:11:38 +01:00
Vincent Guittot
c5b0a7eefc sched/fair: Remove sysctl_sched_migration_cost condition
With a default value of 500us, sysctl_sched_migration_cost is
significanlty higher than the cost of load_balance. Remove the
condition and rely on the sd->max_newidle_lb_cost to abort
newidle_balance.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-5-vincent.guittot@linaro.org
2021-10-31 11:11:38 +01:00
Vincent Guittot
e60b56e46b sched/fair: Wait before decaying max_newidle_lb_cost
Decay max_newidle_lb_cost only when it has not been updated for a while
and ensure to not decay a recently changed value.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-4-vincent.guittot@linaro.org
2021-10-31 11:11:38 +01:00
Vincent Guittot
9d783c8dd1 sched/fair: Skip update_blocked_averages if we are defering load balance
In newidle_balance(), the scheduler skips load balance to the new idle cpu
when the 1st sd of this_rq is:

   this_rq->avg_idle < sd->max_newidle_lb_cost

Doing a costly call to update_blocked_averages() will not be useful and
simply adds overhead when this condition is true.

Check the condition early in newidle_balance() to skip
update_blocked_averages() when possible.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-3-vincent.guittot@linaro.org
2021-10-31 11:11:37 +01:00
Vincent Guittot
9e9af819db sched/fair: Account update_blocked_averages in newidle_balance cost
The time spent to update the blocked load can be significant depending of
the complexity fo the cgroup hierarchy. Take this time into account in
the cost of the 1st load balance of a newly idle cpu.

Also reduce the number of call to sched_clock_cpu() and track more actual
work.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lore.kernel.org/r/20211019123537.17146-2-vincent.guittot@linaro.org
2021-10-31 11:11:37 +01:00
Arnd Bergmann
f98a3dccfc locking: Remove spin_lock_flags() etc
parisc, ia64 and powerpc32 are the only remaining architectures that
provide custom arch_{spin,read,write}_lock_flags() functions, which are
meant to re-enable interrupts while waiting for a spinlock.

However, none of these can actually run into this codepath, because
it is only called on architectures without CONFIG_GENERIC_LOCKBREAK,
or when CONFIG_DEBUG_LOCK_ALLOC is set without CONFIG_LOCKDEP, and none
of those combinations are possible on the three architectures.

Going back in the git history, it appears that arch/mn10300 may have
been able to run into this code path, but there is a good chance that
it never worked. On the architectures that still exist, it was
already impossible to hit back in 2008 after the introduction of
CONFIG_GENERIC_LOCKBREAK, and possibly earlier.

As this is all dead code, just remove it and the helper functions built
around it. For arch/ia64, the inline asm could be cleaned up, but
it seems safer to leave it untouched.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Helge Deller <deller@gmx.de>  # parisc
Link: https://lore.kernel.org/r/20211022120058.1031690-1-arnd@kernel.org
2021-10-30 16:37:28 +02:00
kernel test robot
feea69ec12 tracing/histogram: Fix semicolon.cocci warnings
kernel/trace/trace_events_hist.c:6039:2-3: Unneeded semicolon

 Remove unneeded semicolon.

Generated by: scripts/coccinelle/misc/semicolon.cocci

Link: https://lkml.kernel.org/r/20211030005615.GA41257@3074f0d39c61

Fixes: c5eac6ee8bc5 ("tracing/histogram: Simplify handling of .sym-offset in expressions")
CC: Kalesh Singh <kaleshsingh@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-29 22:12:38 -04:00
Eric W. Biederman
941edc5bf1 exit/syscall_user_dispatch: Send ordinary signals on failure
Use force_fatal_sig instead of calling do_exit directly.  This ensures
the ordinary signal handling path gets invoked, core dumps as
appropriate get created, and for multi-threaded processes all of the
threads are terminated not just a single thread.

When asked Gabriel Krisman Bertazi <krisman@collabora.com> said [1]:
> ebiederm@xmission.com (Eric W. Biederman) asked:
>
> > Why does do_syscal_user_dispatch call do_exit(SIGSEGV) and
> > do_exit(SIGSYS) instead of force_sig(SIGSEGV) and force_sig(SIGSYS)?
> >
> > Looking at the code these cases are not expected to happen, so I would
> > be surprised if userspace depends on any particular behaviour on the
> > failure path so I think we can change this.
>
> Hi Eric,
>
> There is not really a good reason, and the use case that originated the
> feature doesn't rely on it.
>
> Unless I'm missing yet another problem and others correct me, I think
> it makes sense to change it as you described.
>
> > Is using do_exit in this way something you copied from seccomp?
>
> I'm not sure, its been a while, but I think it might be just that.  The
> first prototype of SUD was implemented as a seccomp mode.

If at some point it becomes interesting we could relax
"force_fatal_sig(SIGSEGV)" to instead say
"force_sig_fault(SIGSEGV, SEGV_MAPERR, sd->selector)".

I avoid doing that in this patch to avoid making it possible
to catch currently uncatchable signals.

Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
[1] https://lkml.kernel.org/r/87mtr6gdvi.fsf@collabora.com
Link: https://lkml.kernel.org/r/20211020174406.17889-14-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-10-29 14:31:33 -05:00
Eric W. Biederman
26d5badbcc signal: Implement force_fatal_sig
Add a simple helper force_fatal_sig that causes a signal to be
delivered to a process as if the signal handler was set to SIG_DFL.

Reimplement force_sigsegv based upon this new helper.  This fixes
force_sigsegv so that when it forces the default signal handler
to be used the code now forces the signal to be unblocked as well.

Reusing the tested logic in force_sig_info_to_task that was built for
force_sig_seccomp this makes the implementation trivial.

This is interesting both because it makes force_sigsegv simpler and
because there are a couple of buggy places in the kernel that call
do_exit(SIGILL) or do_exit(SIGSYS) because there is no straight
forward way today for those places to simply force the exit of a
process with the chosen signal.  Creating force_fatal_sig allows
those places to be implemented with normal signal exits.

Link: https://lkml.kernel.org/r/20211020174406.17889-13-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-10-29 14:31:33 -05:00
Eric W. Biederman
111e70490d exit/kthread: Have kernel threads return instead of calling do_exit
In 2009 Oleg reworked[1] the kernel threads so that it is not
necessary to call do_exit if you are not using kthread_stop().  Remove
the explicit calls of do_exit and complete_and_exit (with a NULL
completion) that were previously necessary.

[1] 63706172f332 ("kthreads: rework kthread_stop()")
Link: https://lkml.kernel.org/r/20211020174406.17889-12-ebiederm@xmission.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2021-10-29 14:31:33 -05:00
Linus Torvalds
6f11521267 Tracing comment fixes:
- Some bots have informed me that some of the ftrace functions kernel-doc
   has formatting issues.
 
 - Also, fix my snake instinct.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYXwAqxQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qpl+AQDn8QTG2RN9+jWVTNkCcWNNQDKPi5Ij
 QYquG/clrCVnwgEAtxUoC+ZAcnsadBm5gXwjzCi62aWfAb9apkTHb6fZAgc=
 =H3GU
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.15-rc6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing comment fixes from Steven Rostedt:

 - Some bots have informed me that some of the ftrace functions
   kernel-doc has formatting issues.

 - Also, fix my snake instinct.

* tag 'trace-v5.15-rc6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix misspelling of "missing"
  ftrace: Fix kernel-doc formatting issues
2021-10-29 10:41:07 -07:00
Steven Rostedt (VMware)
ddcf906fe5 tracing: Fix misspelling of "missing"
My snake instinct was on and I wrote "misssing" instead of "missing".

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-29 09:54:14 -04:00
Steven Rostedt (VMware)
6130722f11 ftrace: Fix kernel-doc formatting issues
Some functions had kernel-doc that used a comma instead of a hash to
separate the function name from the one line description.

Also, the "ftrace_is_dead()" had an incomplete description.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-29 09:52:23 -04:00
Will Deacon
3d9c8315fa Merge branch 'for-next/scs' into for-next/core
* for-next/scs:
  scs: Release kasan vmalloc poison in scs_free process
2021-10-29 12:25:24 +01:00
Borislav Petkov
2258a6fc33 irqchip updates for Linux 5.16
- A large cross-arch rework to move irq_enter()/irq_exit() into
   the arch code, and removing it from the generic irq code.
   Thanks to Mark Rutland for the huge effort!
 
 - A few irqchip drivers are made modular (broadcom, meson), because
   that's apparently a thing...
 
 - A new driver for the Microchip External Interrupt Controller
 
 - The irq_cpu_offline()/irq_cpu_online() API is now deprecated and
   can only be selected on the Cavium Octeon platform. Once this
   platform is removed, the API will be removed at the same time.
 
 - A sprinkle of devm_* helper, as people seem to love that.
 
 - The usual spattering of small fixes and minor improvements.
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmF7rnYPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpDudEP/i3WmAcXQYKJpRz075M8S6PZ8BXeTKUe7WMK
 rrslOkxDqyQ2SVqMLII1xkyOWafC7BnRjexm/ASwrBsc6GyQha7B2YsKy1m/NEwy
 ZcnXCCIg71LpDrUyxbscFxB6s5OvUN0yv+a+WnEAmOXpD1x3S8x5tHmRUfsRGksR
 zOhKaYPLqgCiw3VHRuhEKFUA+CMjXxHhw3lJv6gPh6TRjdXQuJouau2dBzr7tQEd
 h9Jq2OatWXiwPr00hQDDILbdH4+fQYKJqsaaLNX0Pxexg2slRWHwrgA2o/w0tTVW
 99HOc9hN04QoLkDfyQis40L1YC7VOIr5OAqzUehdYELT8UsrZS288Rr6099n4M/Y
 x8Nzcg4eA+jVUz1VMEBA9qR45fKjEMcTAXyNAAYLsov/obSgGH/PSOYaunG2xvYq
 iiJBM/g506PTw2MRROqrH5oKiER3tTD65f5NM0mJONr3xEm9XT74m0JIodgVZ4QX
 0LMJytgetg0b+yZcFY25GhJ+2mGoYwB2eiZBVjE3FyLSs0epcuzogaKRi5axK4sN
 rvlAtgNZiOg7tzRqiPIQKSzO3dCyJjR86t5fd1cRBl/WPmywvA2Lkcgd09V2oyJe
 FEp1QllpgYw0a5+aIS+bdOUK63FLnLdEMas7WgSAAxA4/jjgP1p+SbytOD81psL0
 4r02YN2A
 =/NLR
 -----END PGP SIGNATURE-----

Merge tag 'irqchip-5.16' into irq/core

Merge irqchip updates for Linux 5.16 from Marc Zyngier:

- A large cross-arch rework to move irq_enter()/irq_exit() into
  the arch code, and removing it from the generic irq code.
  Thanks to Mark Rutland for the huge effort!

- A few irqchip drivers are made modular (broadcom, meson), because
  that's apparently a thing...

- A new driver for the Microchip External Interrupt Controller

- The irq_cpu_offline()/irq_cpu_online() API is now deprecated and
  can only be selected on the Cavium Octeon platform. Once this
  platform is removed, the API will be removed at the same time.

- A sprinkle of devm_* helper, as people seem to love that.

- The usual spattering of small fixes and minor improvements.

* tag 'irqchip-5.16': (912 commits)
  h8300: Fix linux/irqchip.h include mess
  dt-bindings: irqchip: renesas-irqc: Document r8a774e1 bindings
  MIPS: irq: Avoid an unused-variable error
  genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
  irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
  MIPS: loongson64: Drop call to irq_cpu_offline()
  irq: remove handle_domain_{irq,nmi}()
  irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
  irq: riscv: perform irqentry in entry code
  irq: openrisc: perform irqentry in entry code
  irq: csky: perform irqentry in entry code
  irq: arm64: perform irqentry in entry code
  irq: arm: perform irqentry in entry code
  irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
  irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
  irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
  irq: add generic_handle_arch_irq()
  irq: unexport handle_irq_desc()
  irq: simplify handle_domain_{irq,nmi}()
  irq: mips: simplify do_domain_IRQ()
  ...

Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20211029083332.3680101-1-maz@kernel.org
2021-10-29 11:58:35 +02:00
Kumar Kartikeya Dwivedi
d6aef08a87 bpf: Add bpf_kallsyms_lookup_name helper
This helper allows us to get the address of a kernel symbol from inside
a BPF_PROG_TYPE_SYSCALL prog (used by gen_loader), so that we can
relocate typeless ksym vars.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211028063501.2239335-2-memxor@gmail.com
2021-10-28 16:30:06 -07:00
Joanne Koong
9330986c03 bpf: Add bloom filter map implementation
This patch adds the kernel-side changes for the implementation of
a bpf bloom filter map.

The bloom filter map supports peek (determining whether an element
is present in the map) and push (adding an element to the map)
operations.These operations are exposed to userspace applications
through the already existing syscalls in the following way:

BPF_MAP_LOOKUP_ELEM -> peek
BPF_MAP_UPDATE_ELEM -> push

The bloom filter map does not have keys, only values. In light of
this, the bloom filter map's API matches that of queue stack maps:
user applications use BPF_MAP_LOOKUP_ELEM/BPF_MAP_UPDATE_ELEM
which correspond internally to bpf_map_peek_elem/bpf_map_push_elem,
and bpf programs must use the bpf_map_peek_elem and bpf_map_push_elem
APIs to query or add an element to the bloom filter map. When the
bloom filter map is created, it must be created with a key_size of 0.

For updates, the user will pass in the element to add to the map
as the value, with a NULL key. For lookups, the user will pass in the
element to query in the map as the value, with a NULL key. In the
verifier layer, this requires us to modify the argument type of
a bloom filter's BPF_FUNC_map_peek_elem call to ARG_PTR_TO_MAP_VALUE;
as well, in the syscall layer, we need to copy over the user value
so that in bpf_map_peek_elem, we know which specific value to query.

A few things to please take note of:
 * If there are any concurrent lookups + updates, the user is
responsible for synchronizing this to ensure no false negative lookups
occur.
 * The number of hashes to use for the bloom filter is configurable from
userspace. If no number is specified, the default used will be 5 hash
functions. The benchmarks later in this patchset can help compare the
performance of using different number of hashes on different entry
sizes. In general, using more hashes decreases both the false positive
rate and the speed of a lookup.
 * Deleting an element in the bloom filter map is not supported.
 * The bloom filter map may be used as an inner map.
 * The "max_entries" size that is specified at map creation time is used
to approximate a reasonable bitmap size for the bloom filter, and is not
otherwise strictly enforced. If the user wishes to insert more entries
into the bloom filter than "max_entries", they may do so but they should
be aware that this may lead to a higher false positive rate.

Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211027234504.30744-2-joannekoong@fb.com
2021-10-28 13:22:49 -07:00
Jakub Kicinski
7df621a3ee Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
include/net/sock.h
  7b50ecfcc6cd ("net: Rename ->stream_memory_read to ->sock_is_readable")
  4c1e34c0dbff ("vsock: Enable y2038 safe timeval for timeout")

drivers/net/ethernet/marvell/octeontx2/af/rvu_debugfs.c
  0daa55d033b0 ("octeontx2-af: cn10k: debugfs for dumping LMTST map table")
  e77bcdd1f639 ("octeontx2-af: Display all enabled PF VF rsrc_alloc entries.")

Adjacent code addition in both cases, keep both.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-28 10:43:58 -07:00
Linus Torvalds
411a44c24a Networking fixes for 5.15-rc8/final, including fixes from WiFi
(mac80211), and BPF.
 
 Current release - regressions:
 
  - skb_expand_head: adjust skb->truesize to fix socket memory
    accounting
 
  - mptcp: fix corrupt receiver key in MPC + data + checksum
 
 Previous releases - regressions:
 
  - multicast: calculate csum of looped-back and forwarded packets
 
  - cgroup: fix memory leak caused by missing cgroup_bpf_offline
 
  - cfg80211: fix management registrations locking, prevent list
    corruption
 
  - cfg80211: correct false positive in bridge/4addr mode check
 
  - tcp_bpf: fix race in the tcp_bpf_send_verdict resulting in reusing
    previous verdict
 
 Previous releases - always broken:
 
  - sctp: enhancements for the verification tag, prevent attackers
    from killing SCTP sessions
 
  - tipc: fix size validations for the MSG_CRYPTO type
 
  - mac80211: mesh: fix HE operation element length check, prevent
    out of bound access
 
  - tls: fix sign of socket errors, prevent positive error codes
    being reported from read()/write()
 
  - cfg80211: scan: extend RCU protection in cfg80211_add_nontrans_list()
 
  - implement ->sock_is_readable() for UDP and AF_UNIX, fix poll()
    for sockets in a BPF sockmap
 
  - bpf: fix potential race in tail call compatibility check resulting
    in two operations which would make the map incompatible succeeding
 
  - bpf: prevent increasing bpf_jit_limit above max
 
  - bpf: fix error usage of map_fd and fdget() in generic batch update
 
  - phy: ethtool: lock the phy for consistency of results
 
  - prevent infinite while loop in skb_tx_hash() when Tx races with
    driver reconfiguring the queue <> traffic class mapping
 
  - usbnet: fixes for bad HW conjured by syzbot
 
  - xen: stop tx queues during live migration, prevent UAF
 
  - net-sysfs: initialize uid and gid before calling net_ns_get_ownership
 
  - mlxsw: prevent Rx stalls under memory pressure
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmF6zR8ACgkQMUZtbf5S
 Irtu4w//e7BJVjn1zKnGHo5SdpHwIxePP5sgD7rGT9udTlYUeteRAEQPALlA8oHo
 6nx59eZBlvOt4+1yyK8qVzT94aLca8dwJ4j7dlONrvLFyWinSUlSZ5ayH8Co2f8t
 ZGYVg+EinR6b+iaeUp5PG0VGbW+FmyIO6iS2xytireJxW6sytZ2BLlrus4+j7BCM
 oCaLD+P6A1bCH1PxDMO2GRFNXphxPX3azec33HpNevHD0iwdgrjAlnz0+PZ5uiEA
 AQ92PVw59+I1el/h0SxDsxfOrCdUMWbM4ZoC2wf0jDC8o6tAg4KxhlI0lAiMvhla
 iqo0BTxhMWAeqC4CMmZJVGAR7zY7xpZiLowp92qSCzZpG0XKVxPaomLFGHFiU3Im
 JDNGGmWAsE5maWsMbUtGr9Gd4Gxoor5r2YfFwCBsMsg7VtO98lCIdORi44VGVD3A
 z2OtHEk0ismOmu8ktDumpFPTF0l1G5HR+YxsVP7obEh3T+hv4nEwPtHQ7OsByuxg
 BBUrmOGr3hB5kcYtbmT/CqLqGXHNCCEAS0f+7mCPsGJfMvsOZeVurdk5GS7SjPIE
 870ogJycV+KRP7ZoefXQxiRe9oCmQs9tFQqCxyPwZ8O5vevTx78D5vj1dkuF4Q5w
 cbn2qbgJAtNe0UN1Gxw3emarofXarAVQoO7n+29CDFG5fe/NlnY=
 =61p6
 -----END PGP SIGNATURE-----

Merge tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Including fixes from WiFi (mac80211), and BPF.

  Current release - regressions:

   - skb_expand_head: adjust skb->truesize to fix socket memory
     accounting

   - mptcp: fix corrupt receiver key in MPC + data + checksum

  Previous releases - regressions:

   - multicast: calculate csum of looped-back and forwarded packets

   - cgroup: fix memory leak caused by missing cgroup_bpf_offline

   - cfg80211: fix management registrations locking, prevent list
     corruption

   - cfg80211: correct false positive in bridge/4addr mode check

   - tcp_bpf: fix race in the tcp_bpf_send_verdict resulting in reusing
     previous verdict

  Previous releases - always broken:

   - sctp: enhancements for the verification tag, prevent attackers from
     killing SCTP sessions

   - tipc: fix size validations for the MSG_CRYPTO type

   - mac80211: mesh: fix HE operation element length check, prevent out
     of bound access

   - tls: fix sign of socket errors, prevent positive error codes being
     reported from read()/write()

   - cfg80211: scan: extend RCU protection in
     cfg80211_add_nontrans_list()

   - implement ->sock_is_readable() for UDP and AF_UNIX, fix poll() for
     sockets in a BPF sockmap

   - bpf: fix potential race in tail call compatibility check resulting
     in two operations which would make the map incompatible succeeding

   - bpf: prevent increasing bpf_jit_limit above max

   - bpf: fix error usage of map_fd and fdget() in generic batch update

   - phy: ethtool: lock the phy for consistency of results

   - prevent infinite while loop in skb_tx_hash() when Tx races with
     driver reconfiguring the queue <> traffic class mapping

   - usbnet: fixes for bad HW conjured by syzbot

   - xen: stop tx queues during live migration, prevent UAF

   - net-sysfs: initialize uid and gid before calling
     net_ns_get_ownership

   - mlxsw: prevent Rx stalls under memory pressure"

* tag 'net-5.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
  Revert "net: hns3: fix pause config problem after autoneg disabled"
  mptcp: fix corrupt receiver key in MPC + data + checksum
  riscv, bpf: Fix potential NULL dereference
  octeontx2-af: Fix possible null pointer dereference.
  octeontx2-af: Display all enabled PF VF rsrc_alloc entries.
  octeontx2-af: Check whether ipolicers exists
  net: ethernet: microchip: lan743x: Fix skb allocation failure
  net/tls: Fix flipped sign in async_wait.err assignment
  net/tls: Fix flipped sign in tls_err_abort() calls
  net/smc: Correct spelling mistake to TCPF_SYN_RECV
  net/smc: Fix smc_link->llc_testlink_time overflow
  nfp: bpf: relax prog rejection for mtu check through max_pkt_offset
  vmxnet3: do not stop tx queues after netif_device_detach()
  r8169: Add device 10ec:8162 to driver r8169
  ptp: Document the PTP_CLK_MAGIC ioctl number
  usbnet: fix error return code in usbnet_probe()
  net: hns3: adjust string spaces of some parameters of tx bd info in debugfs
  net: hns3: expand buffer len for some debugfs command
  net: hns3: add more string spaces for dumping packets number of queue info in debugfs
  net: hns3: fix data endian problem of some functions of debugfs
  ...
2021-10-28 10:17:31 -07:00
Linus Torvalds
fc18cc89b9 Do not WARN when attaching event probe to non-existent event
If the user tries to attach an event probe (eprobe) to an event that does
 not exist, it will trigger a warning. There's an error check that only
 expects memory issues otherwise it is considered a bug. But changes in the
 code to move around the locking made it that it can error out if the user
 attempts to attach to an event that does not exist, returning an -ENODEV.
 As this path can be caused by user space putting in a bad value, do not
 trigger a WARN.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYXoHQhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qjT+AQCx4ThfDRwuUkIyfzJR68b6t9YnOL3p
 gqoSsjIj2JvzzQD/VrsXbmZJw9iYBYKFzkDxaNkRpI7HWFdInD7jzRTo4w0=
 =RWQl
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.15-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Do not WARN when attaching event probe to non-existent event

  If the user tries to attach an event probe (eprobe) to an event that
  does not exist, it will trigger a warning. There's an error check that
  only expects memory issues otherwise it is considered a bug. But
  changes in the code to move around the locking made it that it can
  error out if the user attempts to attach to an event that does not
  exist, returning an -ENODEV. As this path can be caused by user space
  putting in a bad value, do not trigger a WARN"

* tag 'trace-v5.15-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Do not warn when connecting eprobe to non existing event
2021-10-28 09:50:56 -07:00
Marc Zyngier
5f5739d5f7 Merge branch irq/irq_cpu_offline into irq/irqchip-next
* irq/irq_cpu_offline:
  : .
  : Make irq_cpu_{on,off}line() deprecated kernel API, and only
  : enable it for some obscure Cavium platform after having
  : moved all the other users away from it.
  :
  : Next step, drop the platform itself.
  : .
  genirq: Hide irq_cpu_{on,off}line() behind a deprecated option
  irqchip/mips-gic: Get rid of the reliance on irq_cpu_online()
  MIPS: loongson64: Drop call to irq_cpu_offline()

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-10-28 13:34:57 +01:00
Marc Zyngier
c6dca712f6 Merge branch irq/remove-handle-domain-irq-20211026 into irq/irqchip-next
* irq/remove-handle-domain-irq-20211026:
  : Large rework of the architecture entry code from Mark Rutland.
  : From the cover letter:
  :
  : <quote>
  : The handle_domain_{irq,nmi}() functions were oringally intended as a
  : convenience, but recent rework to entry code across the kernel tree has
  : demonstrated that they cause more pain than they're worth and prevent
  : architectures from being able to write robust entry code.
  :
  : This series reworks the irq code to remove them, handling the necessary
  : entry work consistently in entry code (be it architectural or generic).
  : </quote>
  MIPS: irq: Avoid an unused-variable error
  irq: remove handle_domain_{irq,nmi}()
  irq: remove CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
  irq: riscv: perform irqentry in entry code
  irq: openrisc: perform irqentry in entry code
  irq: csky: perform irqentry in entry code
  irq: arm64: perform irqentry in entry code
  irq: arm: perform irqentry in entry code
  irq: add a (temporary) CONFIG_HANDLE_DOMAIN_IRQ_IRQENTRY
  irq: nds32: avoid CONFIG_HANDLE_DOMAIN_IRQ
  irq: arc: avoid CONFIG_HANDLE_DOMAIN_IRQ
  irq: add generic_handle_arch_irq()
  irq: unexport handle_irq_desc()
  irq: simplify handle_domain_{irq,nmi}()
  irq: mips: simplify do_domain_IRQ()
  irq: mips: stop (ab)using handle_domain_irq()
  irq: mips: simplify bcm6345_l1_irq_handle()
  irq: mips: avoid nested irq_enter()

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-10-28 13:34:52 +01:00
Steven Rostedt (VMware)
7fa598f970 tracing: Do not warn when connecting eprobe to non existing event
When the syscall trace points are not configured in, the kselftests for
ftrace will try to attach an event probe (eprobe) to one of the system
call trace points. This triggered a WARNING, because the failure only
expects to see memory issues. But this is not the only failure. The user
may attempt to attach to a non existent event, and the kernel must not
warn about it.

Link: https://lkml.kernel.org/r/20211027120854.0680aa0f@gandalf.local.home

Fixes: 7491e2c442781 ("tracing: Add a probe that attaches to trace events")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-27 21:47:55 -04:00
Eric Dumazet
61a0abaee2 bpf: Use u64_stats_t in struct bpf_prog_stats
Commit 316580b69d0a ("u64_stats: provide u64_stats_t type")
fixed possible load/store tearing on 64bit arches.

For instance the following C code

stats->nsecs += sched_clock() - start;

Could be rightfully implemented like this by a compiler,
confusing concurrent readers a lot:

stats->nsecs += sched_clock();
// arbitrary delay
stats->nsecs -= start;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211026214133.3114279-4-eric.dumazet@gmail.com
2021-10-27 11:13:52 -07:00
Eric Dumazet
d979617aa8 bpf: Fixes possible race in update_prog_stats() for 32bit arches
It seems update_prog_stats() suffers from same issue fixed
in the prior patch:

As it can run while interrupts are enabled, it could
be re-entered and the u64_stats syncp could be mangled.

Fixes: fec56f5890d9 ("bpf: Introduce BPF trampoline")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211026214133.3114279-3-eric.dumazet@gmail.com
2021-10-27 11:13:52 -07:00
Robin H. Johnson
a90afe8d02 tracing: Show size of requested perf buffer
If the perf buffer isn't large enough, provide a hint about how large it
needs to be for whatever is running.

Link: https://lkml.kernel.org/r/20210831043723.13481-1-robbat2@gentoo.org

Signed-off-by: Robin H. Johnson <robbat2@gentoo.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-27 12:25:09 -04:00
王贇
d33cc65737 ftrace: do CPU checking after preemption disabled
With CONFIG_DEBUG_PREEMPT we observed reports like:

  BUG: using smp_processor_id() in preemptible
  caller is perf_ftrace_function_call+0x6f/0x2e0
  CPU: 1 PID: 680 Comm: a.out Not tainted
  Call Trace:
   <TASK>
   dump_stack_lvl+0x8d/0xcf
   check_preemption_disabled+0x104/0x110
   ? optimize_nops.isra.7+0x230/0x230
   ? text_poke_bp_batch+0x9f/0x310
   perf_ftrace_function_call+0x6f/0x2e0
   ...
   __text_poke+0x5/0x620
   text_poke_bp_batch+0x9f/0x310

This telling us the CPU could be changed after task is preempted, and
the checking on CPU before preemption will be invalid.

Since now ftrace_test_recursion_trylock() will help to disable the
preemption, this patch just do the checking after trylock() to address
the issue.

Link: https://lkml.kernel.org/r/54880691-5fe2-33e7-d12f-1fa6136f5183@linux.alibaba.com

CC: Steven Rostedt <rostedt@goodmis.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
Reported-by: Abaci <abaci@linux.alibaba.com>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-27 11:22:09 -04:00
王贇
ce5e48036c ftrace: disable preemption when recursion locked
As the documentation explained, ftrace_test_recursion_trylock()
and ftrace_test_recursion_unlock() were supposed to disable and
enable preemption properly, however currently this work is done
outside of the function, which could be missing by mistake.

And since the internal using of trace_test_and_set_recursion()
and trace_clear_recursion() also require preemption disabled, we
can just merge the logical.

This patch will make sure the preemption has been disabled when
trace_test_and_set_recursion() return bit >= 0, and
trace_clear_recursion() will enable the preemption if previously
enabled.

Link: https://lkml.kernel.org/r/13bde807-779c-aa4c-0672-20515ae365ea@linux.alibaba.com

CC: Petr Mladek <pmladek@suse.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Jisheng Zhang <jszhang@kernel.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Miroslav Benes <mbenes@suse.cz>
Reported-by: Abaci <abaci@linux.alibaba.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
[ Removed extra line in comment - SDR ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-27 11:21:49 -04:00
Amir Goldstein
dabe729ddd fsnotify: clarify contract for create event hooks
Clarify argument names and contract for fsnotify_create() and
fsnotify_mkdir() to reflect the anomaly of kernfs, which leaves dentries
negavite after mkdir/create.

Remove the WARN_ON(!inode) in audit code that were added by the Fixes
commit under the wrong assumption that dentries cannot be negative after
mkdir/create.

Fixes: aa93bdc5500c ("fsnotify: use helpers to access data by data_type")
Link: https://lore.kernel.org/linux-fsdevel/87mtp5yz0q.fsf@collabora.com/
Link: https://lore.kernel.org/r/20211025192746.66445-4-krisman@collabora.com
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Jan Kara <jack@suse.cz>
2021-10-27 12:32:34 +02:00
Christophe JAILLET
9fbd8dc19a dma-mapping: use 'bitmap_zalloc()' when applicable
'dma_mem->bitmap' is a bitmap. So use 'bitmap_zalloc()' to simplify code,
improve the semantic and avoid some open-coded arithmetic in allocator
arguments.

Also change the corresponding 'kfree()' into 'bitmap_free()' to keep
consistency.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-27 08:20:09 +02:00
Kalesh Singh
722eddaa40 tracing/histogram: Optimize division by a power of 2
The division is a slow operation. If the divisor is a power of 2, use a
shift instead.

Results were obtained using Android's version of perf (simpleperf[1]) as
described below:

1. hist_field_div() is modified to call 2 test functions:
   test_hist_field_div_[not]_optimized(); passing them the
   same args. Use noinline and volatile to ensure these are
   not optimized out by the compiler.
2. Create a hist event trigger that uses division:
      events/kmem/rss_stat$ echo 'hist:keys=common_pid:x=size/<divisor>'
         >> trigger
      events/kmem/rss_stat$ echo 'hist:keys=common_pid:vals=$x'
         >> trigger
3. Run Android's lmkd_test[2] to generate rss_stat events, and
   record CPU samples with Android's simpleperf:
      simpleperf record -a --exclude-perf --post-unwind=yes -m 16384 -g
         -f 2000 -o perf.data

== Results ==

Divisor is a power of 2 (divisor == 32):

   test_hist_field_div_not_optimized  | 8,717,091 cpu-cycles
   test_hist_field_div_optimized      | 1,643,137 cpu-cycles

If the divisor is a power of 2, the optimized version is ~5.3x faster.

Divisor is not a power of 2 (divisor == 33):

   test_hist_field_div_not_optimized  | 4,444,324 cpu-cycles
   test_hist_field_div_optimized      | 5,497,958 cpu-cycles

If the divisor is not a power of 2, as expected, the optimized version is
slightly slower (~24% slower).

[1] https://android.googlesource.com/platform/system/extras/+/master/simpleperf/doc/README.md
[2] https://cs.android.com/android/platform/superproject/+/master:system/memory/lmkd/tests/lmkd_test.cpp

Link: https://lkml.kernel.org/r/20211025200852.3002369-7-kaleshsingh@google.com

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26 20:27:20 -04:00
Kalesh Singh
f47716b7a9 tracing/histogram: Covert expr to const if both operands are constants
If both operands of a hist trigger expression are constants, convert the
expression to a constant. This optimization avoids having to perform the
same calculation multiple times and also saves on memory since the
merged constants are represented by a single struct hist_field instead
or multiple.

Link: https://lkml.kernel.org/r/20211025200852.3002369-6-kaleshsingh@google.com

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26 20:27:19 -04:00
Kalesh Singh
c5eac6ee8b tracing/histogram: Simplify handling of .sym-offset in expressions
The '-' in .sym-offset can confuse the hist trigger arithmetic
expression parsing. Simplify the handling of this by replacing the
'sym-offset' with 'symXoffset'. This allows us to correctly evaluate
expressions where the user may have inadvertently added a .sym-offset
modifier to one of the operands in an expression, instead of bailing
out. In this case the .sym-offset has no effect on the evaluation of the
expression. The only valid use of the .sym-offset is as a hist key
modifier.

Link: https://lkml.kernel.org/r/20211025200852.3002369-5-kaleshsingh@google.com

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26 20:27:19 -04:00
Kalesh Singh
9710b2f341 tracing: Fix operator precedence for hist triggers expression
The current histogram expression evaluation logic evaluates the
expression from right to left. This can lead to incorrect results
if the operations are not associative (as is the case for subtraction
and, the now added, division operators).
	e.g. 16-8-4-2 should be 2 not 10 --> 16-8-4-2 = ((16-8)-4)-2
	     64/8/4/2 should be 1 not 16 --> 64/8/4/2 = ((64/8)/4)/2

Division and multiplication are currently limited to single operation
expression due to operator precedence support not yet implemented.

Rework the expression parsing to support the correct evaluation of
expressions containing operators of different precedences; and fix
the associativity error by evaluating expressions with operators of
the same precedence from left to right.

Examples:
        (1) echo 'hist:keys=common_pid:a=8,b=4,c=2,d=1,w=$a-$b-$c-$d' \
                  >> event/trigger
        (2) echo 'hist:keys=common_pid:x=$a/$b/3/2' >> event/trigger
        (3) echo 'hist:keys=common_pid:y=$a+10/$c*1024' >> event/trigger
        (4) echo 'hist:keys=common_pid:z=$a/$b+$c*$d' >> event/trigger

Link: https://lkml.kernel.org/r/20211025200852.3002369-4-kaleshsingh@google.com

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26 20:27:19 -04:00
Kalesh Singh
bcef044150 tracing: Add division and multiplication support for hist triggers
Adds basic support for division and multiplication operations for
hist trigger variable expressions.

For simplicity this patch only supports, division and multiplication
for a single operation expression (e.g. x=$a/$b), as currently
expressions are always evaluated right to left. This can lead to some
incorrect results:

	e.g. echo 'hist:keys=common_pid:x=8-4-2' >> event/trigger

	     8-4-2 should evaluate to 2 i.e. (8-4)-2
	     but currently x evaluate to  6 i.e. 8-(4-2)

Multiplication and division in sub-expressions will work correctly, once
correct operator precedence support is added (See next patch in this
series).

For the undefined case of division by 0, the histogram expression
evaluates to (u64)(-1). Since this cannot be detected when the
expression is created, it is the responsibility of the user to be
aware and account for this possibility.

Examples:
	echo 'hist:keys=common_pid:a=8,b=4,x=$a/$b' \
                   >> event/trigger

	echo 'hist:keys=common_pid:y=5*$b' \
                   >> event/trigger

Link: https://lkml.kernel.org/r/20211025200852.3002369-3-kaleshsingh@google.com

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26 20:27:19 -04:00
Kalesh Singh
52cfb37353 tracing: Add support for creating hist trigger variables from literal
Currently hist trigger expressions don't support the use of numeric
literals:
	e.g. echo 'hist:keys=common_pid:x=$y-1234'
		--> is not valid expression syntax

Having the ability to use numeric constants in hist triggers supports
a wider range of expressions for creating variables.

Add support for creating trace event histogram variables from numeric
literals.

	e.g. echo 'hist:keys=common_pid:x=1234,y=size-1024' >> event/trigger

A negative numeric constant is created, using unary minus operator
(parentheses are required).

	e.g. echo 'hist:keys=common_pid:z=-(2)' >> event/trigger

Constants can be used with division/multiplication (added in the
next patch in this series) to implement granularity filters for frequent
trace events. For instance we can limit emitting the rss_stat
trace event to when there is a 512KB cross over in the rss size:

  # Create a synthetic event to monitor instead of the high frequency
  # rss_stat event
  echo 'rss_stat_throttled unsigned int mm_id; unsigned int curr;
	int member; long size' >> tracing/synthetic_events

  # Create a hist trigger that emits the synthetic rss_stat_throttled
  # event only when the rss size crosses a 512KB boundary.
  echo 'hist:keys=keys=mm_id,member:bucket=size/0x80000:onchange($bucket)
      .rss_stat_throttled(mm_id,curr,member,size)'
        >> events/kmem/rss_stat/trigger

A use case for using constants with addition/subtraction is not yet
known, but for completeness the use of constants are supported for all
operators.

Link: https://lkml.kernel.org/r/20211025200852.3002369-2-kaleshsingh@google.com

Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26 20:27:19 -04:00
Jakub Kicinski
440ffcdd9d Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-10-26

We've added 12 non-merge commits during the last 7 day(s) which contain
a total of 23 files changed, 118 insertions(+), 98 deletions(-).

The main changes are:

1) Fix potential race window in BPF tail call compatibility check, from Toke Høiland-Jørgensen.

2) Fix memory leak in cgroup fs due to missing cgroup_bpf_offline(), from Quanyang Wang.

3) Fix file descriptor reference counting in generic_map_update_batch(), from Xu Kuohai.

4) Fix bpf_jit_limit knob to the max supported limit by the arch's JIT, from Lorenz Bauer.

5) Fix BPF sockmap ->poll callbacks for UDP and AF_UNIX sockets, from Cong Wang and Yucong Sun.

6) Fix BPF sockmap concurrency issue in TCP on non-blocking sendmsg calls, from Liu Jian.

7) Fix build failure of INODE_STORAGE and TASK_STORAGE maps on !CONFIG_NET, from Tejun Heo.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Fix potential race in tail call compatibility check
  bpf: Move BPF_MAP_TYPE for INODE_STORAGE and TASK_STORAGE outside of CONFIG_NET
  selftests/bpf: Use recv_timeout() instead of retries
  net: Implement ->sock_is_readable() for UDP and AF_UNIX
  skmsg: Extract and reuse sk_msg_is_readable()
  net: Rename ->stream_memory_read to ->sock_is_readable
  tcp_bpf: Fix one concurrency problem in the tcp_bpf_send_verdict function
  cgroup: Fix memory leak caused by missing cgroup_bpf_offline
  bpf: Fix error usage of map_fd and fdget() in generic_map_update_batch()
  bpf: Prevent increasing bpf_jit_limit above max
  bpf: Define bpf_jit_alloc_exec_limit for arm64 JIT
  bpf: Define bpf_jit_alloc_exec_limit for riscv JIT
====================

Link: https://lore.kernel.org/r/20211026201920.11296-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-10-26 14:38:55 -07:00
Tiezhu Yang
b9e94a7bb6 test_kprobes: Move it from kernel/ to lib/
Since config KPROBES_SANITY_TEST is in lib/Kconfig.debug, it is better to
let test_kprobes.c in lib/, just like other similar tests found in lib/.

Link: https://lkml.kernel.org/r/1635213091-24387-4-git-send-email-yangtiezhu@loongson.cn

Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26 17:23:46 -04:00
Masami Hiramatsu
1f6d3a8f5e kprobes: Add a test case for stacktrace from kretprobe handler
Add a test case for stacktrace from kretprobe handler and
nested kretprobe handlers.

This test checks both of stack trace inside kretprobe handler
and stack trace from pt_regs. Those stack trace must include
actual function return address instead of kretprobe trampoline.
The nested kretprobe stacktrace test checks whether the unwinder
can correctly unwind the call frame on the stack which has been
modified by the kretprobe.

Since the stacktrace on kretprobe is correctly fixed only on x86,
this introduces a meta kconfig ARCH_CORRECT_STACKTRACE_ON_KRETPROBE
which tells user that the stacktrace on kretprobe is correct or not.

The test results will be shown like below;

 TAP version 14
 1..1
     # Subtest: kprobes_test
     1..6
     ok 1 - test_kprobe
     ok 2 - test_kprobes
     ok 3 - test_kretprobe
     ok 4 - test_kretprobes
     ok 5 - test_stacktrace_on_kretprobe
     ok 6 - test_stacktrace_on_nested_kretprobe
 # kprobes_test: pass:6 fail:0 skip:0 total:6
 # Totals: pass:6 fail:0 skip:0 total:6
 ok 1 - kprobes_test

Link: https://lkml.kernel.org/r/163516211244.604541.18350507860972214415.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-10-26 17:23:45 -04:00
Toke Høiland-Jørgensen
54713c85f5 bpf: Fix potential race in tail call compatibility check
Lorenzo noticed that the code testing for program type compatibility of
tail call maps is potentially racy in that two threads could encounter a
map with an unset type simultaneously and both return true even though they
are inserting incompatible programs.

The race window is quite small, but artificially enlarging it by adding a
usleep_range() inside the check in bpf_prog_array_compatible() makes it
trivial to trigger from userspace with a program that does, essentially:

        map_fd = bpf_create_map(BPF_MAP_TYPE_PROG_ARRAY, 4, 4, 2, 0);
        pid = fork();
        if (pid) {
                key = 0;
                value = xdp_fd;
        } else {
                key = 1;
                value = tc_fd;
        }
        err = bpf_map_update_elem(map_fd, &key, &value, 0);

While the race window is small, it has potentially serious ramifications in
that triggering it would allow a BPF program to tail call to a program of a
different type. So let's get rid of it by protecting the update with a
spinlock. The commit in the Fixes tag is the last commit that touches the
code in question.

v2:
- Use a spinlock instead of an atomic variable and cmpxchg() (Alexei)
v3:
- Put lock and the members it protects into an embedded 'owner' struct (Daniel)

Fixes: 3324b584b6f6 ("ebpf: misc core cleanup")
Reported-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20211026110019.363464-1-toke@redhat.com
2021-10-26 12:37:28 -07:00