35575 Commits

Author SHA1 Message Date
Steven Rostedt (VMware)
256cfdd6fd tracing: Do not count ftrace events in top level enable output
The file /sys/kernel/tracing/events/enable is used to enable all events by
echoing in "1", or disabling all events when echoing in "0". To know if all
events are enabled, disabled, or some are enabled but not all of them,
cating the file should show either "1" (all enabled), "0" (all disabled), or
"X" (some enabled but not all of them). This works the same as the "enable"
files in the individule system directories (like tracing/events/sched/enable).

But when all events are enabled, the top level "enable" file shows "X". The
reason is that its checking the "ftrace" events, which are special events
that only exist for their format files. These include the format for the
function tracer events, that are enabled when the function tracer is
enabled, but not by the "enable" file. The check includes these events,
which will always be disabled, and even though all true events are enabled,
the top level "enable" file will show "X" instead of "1".

To fix this, have the check test the event's flags to see if it has the
"IGNORE_ENABLE" flag set, and if so, not test it.

Cc: stable@vger.kernel.org
Fixes: 553552ce1796c ("tracing: Combine event filter_active and enable into single flags field")
Reported-by: "Yordan Karadzhov (VMware)" <y.karadz@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-02-05 15:40:04 -05:00
Johannes Berg
55b6f763d8 init/gcov: allow CONFIG_CONSTRUCTORS on UML to fix module gcov
On ARCH=um, loading a module doesn't result in its constructors getting
called, which breaks module gcov since the debugfs files are never
registered.  On the other hand, in-kernel constructors have already been
called by the dynamic linker, so we can't call them again.

Get out of this conundrum by allowing CONFIG_CONSTRUCTORS to be
selected, but avoiding the in-kernel constructor calls.

Also remove the "if !UML" from GCOV selecting CONSTRUCTORS now, since we
really do want CONSTRUCTORS, just not kernel binary ones.

Link: https://lkml.kernel.org/r/20210120172041.c246a2cac2fb.I1358f584b76f1898373adfed77f4462c8705b736@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jessica Yu <jeyu@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-02-05 11:03:47 -08:00
Alexey Dobriyan
174bcc691f timens: Delete no-op time_ns_init()
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andrei Vagin <avagin@gmail.com>
Link: https://lore.kernel.org/r/20201228215402.GA572900@localhost.localdomain
2021-02-05 19:32:09 +01:00
Alexandre Belloni
b5c28ea601 alarmtimer: Update kerneldoc
Update kerneldoc comments to reflect the actual arguments and return values
of the documented functions.

Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210202013457.3482388-1-alexandre.belloni@bootlin.com
2021-02-05 19:26:41 +01:00
Geert Uytterhoeven
24c242ec7a ntp: Use freezable workqueue for RTC synchronization
The bug fixed by commit e3fab2f3de081e98 ("ntp: Fix RTC synchronization on
32-bit platforms") revealed an underlying issue: RTC synchronization may
happen anytime, even while the system is partially suspended.

On systems where the RTC is connected to an I2C bus, the I2C bus controller
may already or still be suspended, triggering a WARNING during suspend or
resume from s2ram:

    WARNING: CPU: 0 PID: 124 at drivers/i2c/i2c-core.h:54 __i2c_transfer+0x634/0x680
    i2c i2c-6: Transfer while suspended
    [...]
    Workqueue: events_power_efficient sync_hw_clock
    [...]
      (__i2c_transfer)
      (i2c_transfer)
      (regmap_i2c_read)
      ...
      (da9063_rtc_set_time)
      (rtc_set_time)
      (sync_hw_clock)
      (process_one_work)

Fix this race condition by using the freezable instead of the normal
power-efficient workqueue.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Rafael J. Wysocki <rafael@kernel.org>
Link: https://lore.kernel.org/r/20210125143039.1051912-1-geert+renesas@glider.be
2021-02-05 18:03:13 +01:00
Peter Zijlstra
7f82e631d2 locking/lockdep: Avoid unmatched unlock
Commit f6f48e180404 ("lockdep: Teach lockdep about "USED" <- "IN-NMI"
inversions") overlooked that print_usage_bug() releases the graph_lock
and called it without the graph lock held.

Fixes: f6f48e180404 ("lockdep: Teach lockdep about "USED" <- "IN-NMI" inversions")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lkml.kernel.org/r/YBfkuyIfB1+VRxXP@hirez.programming.kicks-ass.net
2021-02-05 17:20:15 +01:00
Barry Song
9f5f8ec501 dma-mapping: benchmark: use u8 for reserved field in uAPI structure
The original code put five u32 before a u64 expansion[10] array. Five is
odd, this will cause trouble in the extension of the structure by adding
new features. This patch moves to use u8 for reserved field to avoid
future alignment risk.
Meanwhile, it also clears the memory of struct map_benchmark in tools,
otherwise, if users use old version to run on newer kernel, the random
expansion value will cause side effect on newer kernel.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-02-05 12:48:46 +01:00
Yonghong Song
23a2d70c7a bpf: Refactor BPF_PSEUDO_CALL checking as a helper function
There is no functionality change. This refactoring intends
to facilitate next patch change with BPF_PSEUDO_FUNC.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210204234827.1628953-1-yhs@fb.com
2021-02-04 21:51:47 -08:00
KP Singh
ba90c2cc02 bpf: Allow usage of BPF ringbuffer in sleepable programs
The BPF ringbuffer map is pre-allocated and the implementation logic
does not rely on disabling preemption or per-cpu data structures. Using
the BPF ringbuffer sleepable LSM and tracing programs does not trigger
any warnings with DEBUG_ATOMIC_SLEEP, DEBUG_PREEMPT,
PROVE_RCU and PROVE_LOCKING and LOCKDEP enabled.

This allows helpers like bpf_copy_from_user and bpf_ima_inode_hash to
write to the BPF ring buffer from sleepable BPF programs.

Signed-off-by: KP Singh <kpsingh@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210204193622.3367275-2-kpsingh@kernel.org
2021-02-04 16:35:00 -08:00
Ben Gardon
f3d4b4b1dc sched: Add cond_resched_rwlock
Safely rescheduling while holding a spin lock is essential for keeping
long running kernel operations running smoothly. Add the facility to
cond_resched rwlocks.

CC: Ingo Molnar <mingo@redhat.com>
CC: Will Deacon <will@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210202185734.1680553-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-02-04 05:27:43 -05:00
Bui Quang Minh
6183f4d3a0 bpf: Check for integer overflow when using roundup_pow_of_two()
On 32-bit architecture, roundup_pow_of_two() can return 0 when the argument
has upper most bit set due to resulting 1UL << 32. Add a check for this case.

Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210127063653.3576-1-minhquangbui99@gmail.com
2021-02-03 21:45:33 +01:00
Linus Torvalds
dbc15d24f9 Tracing fixes:
- Initialize tracing-graph-pause at task creation, not start of
    function tracing. Causes the pause counter to be corrupted.
  - Set "pause-on-trace" for latency tracers as that option breaks
    their output (regression).
  - Fix the wrong error return for setting kretprobes on future
    modules (before they are loaded).
  - Fix re-registering the same kretprobe.
  - Add missing value check for added RCU variable reload.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYBnNzhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qlpoAP4hU98lfAButfYTuuS7Id+/r21bB4lG
 9HHB72wkpEfs8AEAlTDC5c3eXhnXXJC4a8b4sGv1wvBiHL2ZoW/yQ/4oZgA=
 =hpY/
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Initialize tracing-graph-pause at task creation, not start of
   function tracing, to avoid corrupting the pause counter.

 - Set "pause-on-trace" for latency tracers as that option breaks their
   output (regression).

 - Fix the wrong error return for setting kretprobes on future modules
   (before they are loaded).

 - Fix re-registering the same kretprobe.

 - Add missing value check for added RCU variable reload.

* tag 'trace-v5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracepoint: Fix race between tracing and removing tracepoint
  kretprobe: Avoid re-registration of the same kretprobe earlier
  tracing/kprobe: Fix to support kretprobe events on unloaded modules
  tracing: Use pause-on-trace with the latency tracers
  fgraph: Initialize tracing_graph_pause at task creation
2021-02-03 10:02:00 -08:00
Alexei Starovoitov
548f1191d8 bpf: Unbreak BPF_PROG_TYPE_KPROBE when kprobe is called via do_int3
The commit 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
converted do_int3 handler to be "NMI-like".
That made old if (in_nmi()) check abort execution of bpf programs
attached to kprobe when kprobe is firing via int3
(For example when kprobe is placed in the middle of the function).
Remove the check to restore user visible behavior.

Fixes: 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")
Reported-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/20210203070636.70926-1-alexei.starovoitov@gmail.com
2021-02-03 15:54:22 +01:00
Brendan Jackman
37086bfdc7 bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH
When BPF_FETCH is set, atomic instructions load a value from memory
into a register. The current verifier code first checks via
check_mem_access whether we can access the memory, and then checks
via check_reg_arg whether we can write into the register.

For loads, check_reg_arg has the side-effect of marking the
register's value as unkonwn, and check_mem_access has the side effect
of propagating bounds from memory to the register. This currently only
takes effect for stack memory.

Therefore with the current order, bounds information is thrown away,
but by simply reversing the order of check_reg_arg
vs. check_mem_access, we can instead propagate bounds smartly.

A simple test is added with an infinite loop that can only be proved
unreachable if this propagation is present. This is implemented both
with C and directly in test_verifier using assembly.

Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Brendan Jackman <jackmanb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210202135002.4024825-1-jackmanb@google.com
2021-02-02 18:23:29 -08:00
Jakub Kicinski
d1e1355aef Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-02-02 14:21:31 -08:00
Linus Torvalds
7d36ccd4bd dma-mapping fix for 5.11
- fix a kernel crash in the new dma-mapping benchmark test (Barry Song)
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmAZGkkLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYNF0A//WjFHsPCMbbhlOlCMcRD/I1+JLQYchjlWp48wAzxq
 K3fmGnLlsvd3lCyUQzLLcUMSs+NsaTlzNNtH+MSNfGvX3x/8Mz9F57AgJ7C2xlaD
 XXbob5SKAqls6UFw6sBhlNbUe/l3Tup7LgqyCqQGRfpftycO7Vk70oHfFir0xqeB
 IX2s2s9UK+iePRtqfhyOylipIPXG3A3TnDJ+T3x5wsw6m4ejr8TVUVNsA1Xg0mha
 xsMVELyPwp3pEYp0+LNZsvtRC6uv7MeNf11mqmtk9VOZtrnS1VsD+ZPeNXzoCP3n
 5iipcFGeiz8ac2bdldIjwYa4wyIBZR1kOOQ+1+gq42/Hs/Eu9N3aQydhgrzw8ZnY
 MxUHbHazhpB2qt2lxyMjODVRKFdEt2FrkP7a7d3guw2Il4o2f14g9MZS72c5H5jk
 Vg90VSW8UQr51MKamZkgLeIgFaqVTOkgXO5h7e+dfD/XMgm4Vt3qsjcBSQfTcCrj
 4efmJwUvyv8x13HXoSvistKXG6kTH+rBtbxD6RvzE2C9dasb5L3mYIj+loQJqU7+
 u/+X9ZUVHTvXlH7j7Fc9vBgCUf0VlZJ8N3LXl5zN7zRP40vcvudlhyNj1PWhiUlu
 G5RBhuyTm3jd+NqCYE7XCqf+dvA7BGgkW2LZM/qAhlZgmJaWLwlJwAe0XA0KR9uR
 U5w=
 =l+IX
 -----END PGP SIGNATURE-----

Merge tag 'dma-mapping-5.11-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:
 "Fix a kernel crash in the new dma-mapping benchmark test (Barry Song)"

* tag 'dma-mapping-5.11-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: benchmark: fix kernel crash when dma_map_single fails
2021-02-02 10:40:20 -08:00
Linus Torvalds
a992562872 Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
trees.
 
 Current release - regressions:
 
  - ip_tunnel: fix mtu calculation
 
  - mlx5: fix function calculation for page trees
 
 Previous releases - regressions:
 
  - vsock: fix the race conditions in multi-transport support
 
  - neighbour: prevent a dead entry from updating gc_list
 
  - dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
 
 Previous releases - always broken:
 
  - bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
                 cgroup getsockopt infra when user space is trying
 		to race against optlen, from Loris Reiff.
 
  - bpf: add missing fput() in BPF inode storage map update helper
 
  - udp: ipv4: manipulate network header of NATed UDP GRO fraglist
 
  - mac80211: fix station rate table updates on assoc
 
  - r8169: work around RTL8125 UDP HW bug
 
  - igc: report speed and duplex as unknown when device is runtime
         suspended
 
  - rxrpc: fix deadlock around release of dst cached on udp tunnel
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmAZjwQACgkQMUZtbf5S
 IruLbQ//Yg9+xEnqhDuOJZtYHB0rsJjLlKmtvgOsBr8BaTcUEPoPoqUPm+EMvCHb
 o1fFa1qIrbS5luVEofu9hNX7DGXwvgawaMW2TympJhqLZQqjazCMB/st99LphhJw
 RvaZI8aDOikosT4c+I0vm83jDQETonrjziIcPfHHPjn/Q+amGRRRXiTSQnRF/MlU
 oARCG+U3kHsHBDUPNSCtSjKXshoZPjFb/pD7fQAlzzm7CssvbPhNWbducueyP2Fb
 XW4RwJu9QBBH2JS6uZJ1Y6LVoRzusmE9dUam3KhkiL/CHs72lWPsc+Rn5gbBPvc5
 Y4T4h61Xti1O4ULKdqhGceror6XY+4Qb1VlHWWztOhIo00wIAv3IHbTup/4o0HBr
 j84MtcyOl/qxSFXjunPJkbWJngXikrkIMS0Bl6ZcPAejYM9wN6vCgbvFCHbEg1Rx
 cWFnYyS9FCLduaxHSizv050tWhknOdX+zHK3fOtlW0yWnreJAB8Hoc21Zm7YKvg0
 GxxcGK6AhqJ6s2ixVDv7MyJrltJ/hOJQb+T3HgHFuY2BYUs8F2r/HoHU/u4uCl76
 RdBzbC/sLnBpMHf6r1rHTnGPsapoJOOYWnej71l425vX1qr5xnmxVNNB6HReObNv
 +/jPoRYa5BVsVt2LmDcuH1O32pXJPWKVBR7Yfa6Bn2yzhcbECTc=
 =ZByM
 -----END PGP SIGNATURE-----

Merge tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski:
 "Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
  trees.

  Current release - regressions:

   - ip_tunnel: fix mtu calculation

   - mlx5: fix function calculation for page trees

  Previous releases - regressions:

   - vsock: fix the race conditions in multi-transport support

   - neighbour: prevent a dead entry from updating gc_list

   - dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add

  Previous releases - always broken:

   - bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
     cgroup getsockopt infra when user space is trying to race against
     optlen, from Loris Reiff.

   - bpf: add missing fput() in BPF inode storage map update helper

   - udp: ipv4: manipulate network header of NATed UDP GRO fraglist

   - mac80211: fix station rate table updates on assoc

   - r8169: work around RTL8125 UDP HW bug

   - igc: report speed and duplex as unknown when device is runtime
     suspended

   - rxrpc: fix deadlock around release of dst cached on udp tunnel"

* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
  net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
  net: ipa: fix two format specifier errors
  net: ipa: use the right accessor in ipa_endpoint_status_skip()
  net: ipa: be explicit about endianness
  net: ipa: add a missing __iomem attribute
  net: ipa: pass correct dma_handle to dma_free_coherent()
  r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
  net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
  net: mvpp2: TCAM entry enable should be written after SRAM data
  net: lapb: Copy the skb before sending a packet
  net/mlx5e: Release skb in case of failure in tc update skb
  net/mlx5e: Update max_opened_tc also when channels are closed
  net/mlx5: Fix leak upon failure of rule creation
  net/mlx5: Fix function calculation for page trees
  docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
  udp: ipv4: manipulate network header of NATed UDP GRO fraglist
  net: ip_tunnel: fix mtu calculation
  vsock: fix the race conditions in multi-transport support
  net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
  ibmvnic: device remove has higher precedence over reset
  ...
2021-02-02 10:26:09 -08:00
Kan Liang
2a6c6b7d7a perf/core: Add PERF_SAMPLE_WEIGHT_STRUCT
Current PERF_SAMPLE_WEIGHT sample type is very useful to expresses the
cost of an action represented by the sample. This allows the profiler
to scale the samples to be more informative to the programmer. It could
also help to locate a hotspot, e.g., when profiling by memory latencies,
the expensive load appear higher up in the histograms. But current
PERF_SAMPLE_WEIGHT sample type is solely determined by one factor. This
could be a problem, if users want two or more factors to contribute to
the weight. For example, Golden Cove core PMU can provide both the
instruction latency and the cache Latency information as factors for the
memory profiling.

For current X86 platforms, although meminfo::latency is defined as a
u64, only the lower 32 bits include the valid data in practice (No
memory access could last than 4G cycles). The higher 32 bits can be used
to store new factors.

Add a new sample type, PERF_SAMPLE_WEIGHT_STRUCT, to indicate the new
sample weight structure. It shares the same space as the
PERF_SAMPLE_WEIGHT sample type.

Users can apply either the PERF_SAMPLE_WEIGHT sample type or the
PERF_SAMPLE_WEIGHT_STRUCT sample type to retrieve the sample weight, but
they cannot apply both sample types simultaneously.

Currently, only X86 and PowerPC use the PERF_SAMPLE_WEIGHT sample type.
- For PowerPC, there is nothing changed for the PERF_SAMPLE_WEIGHT
  sample type. There is no effect for the new PERF_SAMPLE_WEIGHT_STRUCT
  sample type. PowerPC can re-struct the weight field similarly later.
- For X86, the same value will be dumped for the PERF_SAMPLE_WEIGHT
  sample type or the PERF_SAMPLE_WEIGHT_STRUCT sample type for now.
  The following patches will apply the new factors for the
  PERF_SAMPLE_WEIGHT_STRUCT sample type.

The field in the union perf_sample_weight should be shared among
different architectures. A generic name is required, but it's hard to
abstract a name that applies to all architectures. For example, on X86,
the fields are to store all kinds of latency. While on PowerPC, it
stores MMCRA[TECX/TECM], which should not be latency. So a general name
prefix 'var$NUM' is used here.

Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/1611873611-156687-2-git-send-email-kan.liang@linux.intel.com
2021-02-01 15:31:36 +01:00
Linus Torvalds
f7ea44c717 A single fix for the single step reporting regression caused by getting the
condition wrong when moving SYSCALL_EMU away from TIF flags.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmAWh7cTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYof8zEAC+Qm4Myg9SiHWr8EiZa4+TqmUxTge8
 oV36+Je18y7vHFElGBByCwfEHvsLO/mi3kgKn2lBZsSyiSiUs15p0S5M/7A7HmbW
 mcFmHioECe7VUL5Ml1Y6mhyhA9o3QdAv3PAHwNBbvUwbJSrCS7rld94T4xeZiaBh
 y00qFikxKTbblgSZpVKDG7wUKYHVQwJMqYVw6I6Y4iB+QfM1EGQxWFzV2td3H/UE
 A7g8Ay8QOXxd/agnwZaOTHrQy2Rsnp3n9sD5Y6hVpZLT3FulxsaSftK/ngn9uTou
 bFYagpXxJRPt6TylK2Y8Nn2Y1ZcLoq/bj7XKSN0MpgcM+y3/vV9GUOpyFmDTug2F
 P5onx7S6vKxG3ews+WlTxHYaSRWbO0OHWLTM+FHbW7ben/DjWNVNBa4L1u3w0Skq
 igyqmCzQURjkDbsCaMsdKPeG0KJOlCqTNj4aImskNGv5OUt77rziGg42jI07MLYV
 mE9+e/cw5P1FVoVaaMwplUvOmGaG8647IdapDo0UctHm7Y+GC81bwry/bbW/oesi
 7acnmCrO/sILwzE1H+YQnxofVlTXW/pCx3MUfNUEyJOUuI7orobfd1MOJjVUKj++
 Zm5bRy8h0RZ5q9Xy2GwCh0mSRihbtQzBXdbwIDpltFYUBF6cHh1ryqKAHBE55JiB
 IQ4J+3OK1f8dNQ==
 =WMj1
 -----END PGP SIGNATURE-----

Merge tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull single stepping fix from Thomas Gleixner:
 "A single fix for the single step reporting regression caused by
  getting the condition wrong when moving SYSCALL_EMU away from TIF
  flags"

[ There's apparently another problem too, fix pending ]

* tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  entry: Unbreak single step reporting behaviour
2021-01-31 11:39:32 -08:00
Marc Zyngier
4c457e8cb7 genirq/msi: Activate Multi-MSI early when MSI_FLAG_ACTIVATE_EARLY is set
When MSI_FLAG_ACTIVATE_EARLY is set (which is the case for PCI),
__msi_domain_alloc_irqs() performs the activation of the interrupt (which
in the case of PCI results in the endpoint being programmed) as soon as the
interrupt is allocated.

But it appears that this is only done for the first vector, introducing an
inconsistent behaviour for PCI Multi-MSI.

Fix it by iterating over the number of vectors allocated to each MSI
descriptor. This is easily achieved by introducing a new
"for_each_msi_vector" iterator, together with a tiny bit of refactoring.

Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early")
Reported-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210123122759.1781359-1-maz@kernel.org
2021-01-30 01:22:31 +01:00
Wang ShaoBo
0188b87899 kretprobe: Avoid re-registration of the same kretprobe earlier
Our system encountered a re-init error when re-registering same kretprobe,
where the kretprobe_instance in rp->free_instances is illegally accessed
after re-init.

Implementation to avoid re-registration has been introduced for kprobe
before, but lags for register_kretprobe(). We must check if kprobe has
been re-registered before re-initializing kretprobe, otherwise it will
destroy the data struct of kretprobe registered, which can lead to memory
leak, system crash, also some unexpected behaviors.

We use check_kprobe_rereg() to check if kprobe has been re-registered
before running register_kretprobe()'s body, for giving a warning message
and terminate registration process.

Link: https://lkml.kernel.org/r/20210128124427.2031088-1-bobo.shaobowang@huawei.com

Cc: stable@vger.kernel.org
Fixes: 1f0ab40976460 ("kprobes: Prevent re-registration of the same kprobe")
[ The above commit should have been done for kretprobes too ]
Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-29 17:29:16 -05:00
Linus Torvalds
32b0c410cd Power management fixes for 5.11-rc6
- Fix a deadlock caused by attempting to acquire the same mutex
    twice in a row in the "kexec jump" code (Baoquan He).
 
  - Modify the hibernation image saving code to flush the unwritten
    data to the swap storage later so as to avoid failing to write the
    image signature which is possible in some cases (Laurent Badel).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmAUNj8SHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxrEcP/2KQPLD4PkHMw8qr2h2m9Dp6Lc5bl+C2
 bEL/IeDNojtndF7z9q3Fp7EOpffJJV1q9zX06HEKZF4d59fa9gE5oGt9bRcpRbpf
 74cDRTLCNr4UpigzTJux2wfgy9XZ8mWuRzIQUTOHgn17YK2tKteTFInxsCqo45+A
 i6zj0EYM/0UVGX48ZPf/JS6QqzI5Zh73dOuz/PjqTsmKBKQl3X1mJRGyLKeBhb6I
 MTaBR622PyTDCXzksLxApk4k1Oh7+f6TRUMmykA8KdIwRZCfdp23AxzT8EWaRXZD
 BNUwCBCKLSiQFtuySvXLgeMAf2yPk0B+0CHFAriy8YiuGqJSN4Q4/PtnDl7TS61J
 BieKAJPbNClvNRc3j8XxyWHR1lcNabxsoE4l4PKXVrrsHu7qrylJV1+d/ZfeL5o+
 k0izFUf5PCECBo0nIA1sWWWJU0ro5YQ3mkTB6Yk0jTt4PK//UaZjrFhpbebtPWnS
 M06El03mzebRDl87K6L5/kDAty8yx+5Y1L3Y/KSk3X4LTsySnwsIbPJh1ZUL9HLe
 FXJRa7zUYX0CiwXT65oWhnrbaat02BA/CrkFVmkFPA/+izhgN580TcDx7ljC3Hyt
 1WrsWyvmmmPYrTDqB6DirrwwAYqF9XO53lqf42CFSzdu+fjoDHwDVUyEOMQMO50p
 HuLwvCyGb7Mm
 =jh7b
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a deadlock in the 'kexec jump' code and address a possible
  hibernation image creation issue.

  Specifics:

   - Fix a deadlock caused by attempting to acquire the same mutex twice
     in a row in the "kexec jump" code (Baoquan He)

   - Modify the hibernation image saving code to flush the unwritten
     data to the swap storage later so as to avoid failing to write the
     image signature which is possible in some cases (Laurent Badel)"

* tag 'pm-5.11-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: hibernate: flush swap writer after marking
  kernel: kexec: remove the lock operation of system_transition_mutex
2021-01-29 13:30:09 -08:00
Masami Hiramatsu
97c753e62e tracing/kprobe: Fix to support kretprobe events on unloaded modules
Fix kprobe_on_func_entry() returns error code instead of false so that
register_kretprobe() can return an appropriate error code.

append_trace_kprobe() expects the kprobe registration returns -ENOENT
when the target symbol is not found, and it checks whether the target
module is unloaded or not. If the target module doesn't exist, it
defers to probe the target symbol until the module is loaded.

However, since register_kretprobe() returns -EINVAL instead of -ENOENT
in that case, it always fail on putting the kretprobe event on unloaded
modules. e.g.

Kprobe event:
/sys/kernel/debug/tracing # echo p xfs:xfs_end_io >> kprobe_events
[   16.515574] trace_kprobe: This probe might be able to register after target module is loaded. Continue.

Kretprobe event: (p -> r)
/sys/kernel/debug/tracing # echo r xfs:xfs_end_io >> kprobe_events
sh: write error: Invalid argument
/sys/kernel/debug/tracing # cat error_log
[   41.122514] trace_kprobe: error: Failed to register probe event
  Command: r xfs:xfs_end_io
             ^

To fix this bug, change kprobe_on_func_entry() to detect symbol lookup
failure and return -ENOENT in that case. Otherwise it returns -EINVAL
or 0 (succeeded, given address is on the entry).

Link: https://lkml.kernel.org/r/161176187132.1067016.8118042342894378981.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: 59158ec4aef7 ("tracing/kprobes: Check the probe on unloaded module correctly")
Reported-by: Jianlin Lv <Jianlin.Lv@arm.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-29 15:39:48 -05:00
Viktor Rosendahl
da7f84cdf0 tracing: Use pause-on-trace with the latency tracers
Eaerlier, tracing was disabled when reading the trace file. This behavior
was changed with:

commit 06e0a548bad0 ("tracing: Do not disable tracing when reading the
trace file").

This doesn't seem to work with the latency tracers.

The above mentioned commit dit not only change the behavior but also added
an option to emulate the old behavior. The idea with this patch is to
enable this pause-on-trace option when the latency tracers are used.

Link: https://lkml.kernel.org/r/20210119164344.37500-2-Viktor.Rosendahl@bmw.de

Cc: stable@vger.kernel.org
Fixes: 06e0a548bad0 ("tracing: Do not disable tracing when reading the trace file")
Signed-off-by: Viktor Rosendahl <Viktor.Rosendahl@bmw.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-29 15:35:33 -05:00
Steven Rostedt (VMware)
7e0a922046 fgraph: Initialize tracing_graph_pause at task creation
On some archs, the idle task can call into cpu_suspend(). The cpu_suspend()
will disable or pause function graph tracing, as there's some paths in
bringing down the CPU that can have issues with its return address being
modified. The task_struct structure has a "tracing_graph_pause" atomic
counter, that when set to something other than zero, the function graph
tracer will not modify the return address.

The problem is that the tracing_graph_pause counter is initialized when the
function graph tracer is enabled. This can corrupt the counter for the idle
task if it is suspended in these architectures.

   CPU 1				CPU 2
   -----				-----
  do_idle()
    cpu_suspend()
      pause_graph_tracing()
          task_struct->tracing_graph_pause++ (0 -> 1)

				start_graph_tracing()
				  for_each_online_cpu(cpu) {
				    ftrace_graph_init_idle_task(cpu)
				      task-struct->tracing_graph_pause = 0 (1 -> 0)

      unpause_graph_tracing()
          task_struct->tracing_graph_pause-- (0 -> -1)

The above should have gone from 1 to zero, and enabled function graph
tracing again. But instead, it is set to -1, which keeps it disabled.

There's no reason that the field tracing_graph_pause on the task_struct can
not be initialized at boot up.

Cc: stable@vger.kernel.org
Fixes: 380c4b1411ccd ("tracing/function-graph-tracer: append the tracing_graph_flag")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=211339
Reported-by: pierre.gondois@arm.com
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-01-29 15:07:32 -05:00
Nikolay Borisov
442187f3c2 locking/rwsem: Remove empty rwsem.h
This is a leftover from 7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem")

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20210126101721.976027-1-nborisov@suse.com
2021-01-29 20:02:34 +01:00
Jakub Kicinski
06cc6e5dc6 Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2021-01-29

1) Fix two copy_{from,to}_user() warn_on_once splats for BPF cgroup getsockopt
   infra when user space is trying to race against optlen, from Loris Reiff.

2) Fix a missing fput() in BPF inode storage map update helper, from Pan Bian.

3) Fix a build error on unresolved symbols on disabled networking / keys LSM
   hooks, from Mikko Ylinen.

4) Fix preload BPF prog build when the output directory from make points to a
   relative path, from Quentin Monnet.

* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf, preload: Fix build when $(O) points to a relative path
  bpf: Drop disabled LSM hooks from the sleepable set
  bpf, inode_storage: Put file handler if no storage was found
  bpf, cgroup: Fix problematic bounds check
  bpf, cgroup: Fix optlen WARN_ON_ONCE toctou
====================

Link: https://lore.kernel.org/r/20210129001556.6648-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 21:07:45 -08:00
Viresh Kumar
be65de6b03 fs: Remove dcookies support
The dcookies stuff was only used by the kernel's old oprofile code. Now
that oprofile's support is removed from the kernel, there is no need for
dcookies as well. Remove it.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Robert Richter <rric@kernel.org>
Acked-by: William Cohen <wcohen@redhat.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
2021-01-29 10:06:46 +05:30
Tobias Klauser
61ca36c8c4 bpf: Simplify cases in bpf_base_func_proto
!perfmon_capable() is checked before the last switch(func_id) in
bpf_base_func_proto. Thus, the cases BPF_FUNC_trace_printk and
BPF_FUNC_snprintf_btf can be moved to that last switch(func_id) to omit
the inline !perfmon_capable() checks.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210127174615.3038-1-tklauser@distanz.ch
2021-01-29 02:20:28 +01:00
Jakub Kicinski
c358f95205 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
drivers/net/can/dev.c
  b552766c872f ("can: dev: prevent potential information leak in can_fill_info()")
  3e77f70e7345 ("can: dev: move driver related infrastructure into separate subdir")
  0a042c6ec991 ("can: dev: move netlink related code into seperate file")

  Code move.

drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
  57ac4a31c483 ("net/mlx5e: Correctly handle changing the number of queues when the interface is down")
  214baf22870c ("net/mlx5e: Support HTB offload")

  Adjacent code changes

net/switchdev/switchdev.c
  20776b465c0c ("net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP")
  ffb68fc58e96 ("net: switchdev: remove the transaction structure from port object notifiers")
  bae33f2b5afe ("net: switchdev: remove the transaction structure from port attributes")

  Transaction parameter gets dropped otherwise keep the fix.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-28 17:09:31 -08:00
Yuxuan Shui
41c1a06d1d entry: Unbreak single step reporting behaviour
The move of TIF_SYSCALL_EMU to SYSCALL_WORK_SYSCALL_EMU broke single step
reporting. The original code reported the single step when TIF_SINGLESTEP
was set and TIF_SYSCALL_EMU was not set. The SYSCALL_WORK conversion got
the logic wrong and now the reporting only happens when both bits are set.

Restore the original behaviour.

[ tglx: Massaged changelog and dropped the pointless double negation ]

Fixes: 64eb35f701f0 ("ptrace: Migrate TIF_SYSCALL_EMU to use SYSCALL_WORK flag")
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Link: https://lore.kernel.org/r/877do3gaq9.fsf@m5Zedd9JOGzJrf0
2021-01-28 13:46:55 +01:00
Alex Shi
bf594bf400 locking/rtmutex: Add missing kernel-doc markup
To fix the following issues:
kernel/locking/rtmutex.c:1612: warning: Function parameter or member
'lock' not described in '__rt_mutex_futex_unlock'
kernel/locking/rtmutex.c:1612: warning: Function parameter or member
'wake_q' not described in '__rt_mutex_futex_unlock'
kernel/locking/rtmutex.c:1675: warning: Function parameter or member
'name' not described in '__rt_mutex_init'
kernel/locking/rtmutex.c:1675: warning: Function parameter or member
'key' not described in '__rt_mutex_init'

[ tglx: Change rt lock to rt_mutex for consistency sake ]

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/1605257895-5536-2-git-send-email-alex.shi@linux.alibaba.com
2021-01-28 13:20:18 +01:00
Jangwoong Kim
0f9438503e futex: Remove unneeded gotos
Get rid of gotos that do not contain any cleanup. These were not removed in
commit 9180bd467f9a ("futex: Remove put_futex_key()").

Signed-off-by: Jangwoong Kim <6812skiii@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201230122953.10473-1-6812skiii@gmail.com
2021-01-28 13:20:18 +01:00
Alejandro Colomar
1ce53e2c2a futex: Change utime parameter to be 'const ... *'
futex(2) says that 'utime' is a pointer to 'const'.  The implementation
doesn't use 'const'; however, it _never_ modifies the contents of utime.

- futex() either uses 'utime' as a pointer to struct or as a 'u32'.

- In case it's used as a 'u32', it makes a copy of it, and of course it is
  not dereferenced.

- In case it's used as a 'struct __kernel_timespec __user *', the pointer
  is not dereferenced inside the futex() definition, and it is only passed
  to a function: get_timespec64(), which accepts a 'const struct
  __kernel_timespec __user *'.

[ tglx: Make the same change to the compat syscall and fixup the prototypes. ]

Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201128123945.4592-1-alx.manpages@gmail.com
2021-01-28 13:20:18 +01:00
Emil Renner Berthing
c260954177 genirq: Use new tasklet API for resend_tasklet
This converts the resend_tasklet to use the new API in
commit 12cc923f1ccc ("tasklet: Introduce new initialization API")

The new API changes the argument passed to the callback function, but
fortunately the argument isn't used so it is straight forward to use
DECLARE_TASKLET() rather than DECLARE_TASKLET_OLD().

Signed-off-by: Emil Renner Berthing <kernel@esmil.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210123182456.6521-1-esmil@mailme.dk
2021-01-28 11:18:04 +01:00
Yang Yang
127c8c5f05 audit: Make audit_filter_syscall() return void
No invoker uses the return value of audit_filter_syscall().
So make it return void, and amend the comment of
audit_filter_syscall().

Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Richard Guy Briggs <rgb@redhat.com>
[PM: removed the changelog from the description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2021-01-27 21:55:14 -05:00
Stanislav Fomichev
772412176f bpf: Allow rewriting to ports under ip_unprivileged_port_start
At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
to the privileged ones (< ip_unprivileged_port_start), but it will
be rejected later on in the __inet_bind or __inet6_bind.

Let's add another return value to indicate that CAP_NET_BIND_SERVICE
check should be ignored. Use the same idea as we currently use
in cgroup/egress where bit #1 indicates CN. Instead, for
cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should
be bypassed.

v5:
- rename flags to be less confusing (Andrey Ignatov)
- rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags
  and accept BPF_RET_SET_CN (no behavioral changes)

v4:
- Add missing IPv6 support (Martin KaFai Lau)

v3:
- Update description (Martin KaFai Lau)
- Fix capability restore in selftest (Martin KaFai Lau)

v2:
- Switch to explicit return code (Martin KaFai Lau)

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/20210127193140.3170382-1-sdf@google.com
2021-01-27 18:18:15 -08:00
Menglong Dong
60e578e82b bpf: Change 'BPF_ADD' to 'BPF_AND' in print_bpf_insn()
This 'BPF_ADD' is duplicated, and I belive it should be 'BPF_AND'.

Fixes: 981f94c3e921 ("bpf: Add bitwise atomic instructions")
Signed-off-by: Menglong Dong <dong.menglong@zte.com.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Brendan Jackman <jackmanb@google.com>
Link: https://lore.kernel.org/bpf/20210127022507.23674-1-dong.menglong@zte.com.cn
2021-01-27 22:23:46 +01:00
Zqiang
ccf7ce46ab PM: sleep: No need to check PF_WQ_WORKER in thaw_kernel_threads()
Because PF_KTHREAD is set for all wq worker threads, it is
not necessary to check PF_WQ_WORKER in addition to it in
thaw_kernel_threads(), so stop doing that.

Signed-off-by: Zqiang <qiang.zhang@windriver.com>
[ rjw: Subject and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2021-01-27 19:20:17 +01:00
Mel Gorman
bae4ec1364 sched/fair: Move avg_scan_cost calculations under SIS_PROP
As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
check and while we are at it, exclude the cost of initialising the CPU
mask from the average scan cost.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210125085909.4600-3-mgorman@techsingularity.net
2021-01-27 17:26:44 +01:00
Mel Gorman
e6e0dc2d54 sched/fair: Remove SIS_AVG_CPU
SIS_AVG_CPU was introduced as a means of avoiding a search when the
average search cost indicated that the search would likely fail. It was
a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair: Make
select_idle_cpu() more aggressive") and later replaced with a proportional
search depth by commit 1ad3aaf3fcd2 ("sched/core: Implement new approach
to scale select_idle_cpu()").

While there are corner cases where SIS_AVG_CPU is better, it has now been
disabled for almost three years. As the intent of SIS_PROP is to reduce
the time complexity of select_idle_cpu(), lets drop SIS_AVG_CPU and focus
on SIS_PROP as a throttling mechanism.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210125085909.4600-2-mgorman@techsingularity.net
2021-01-27 17:26:43 +01:00
Valentin Schneider
620a6dc407 sched/topology: Make sched_init_numa() use a set for the deduplicating sort
The deduplicating sort in sched_init_numa() assumes that the first line in
the distance table contains all unique values in the entire table. I've
been trying to pen what this exactly means for the topology, but it's not
straightforward. For instance, topology.c uses this example:

  node   0   1   2   3
    0:  10  20  20  30
    1:  20  10  20  20
    2:  20  20  10  20
    3:  30  20  20  10

  0 ----- 1
  |     / |
  |   /   |
  | /     |
  2 ----- 3

Which works out just fine. However, if we swap nodes 0 and 1:

  1 ----- 0
  |     / |
  |   /   |
  | /     |
  2 ----- 3

we get this distance table:

  node   0  1  2  3
    0:  10 20 20 20
    1:  20 10 20 30
    2:  20 20 10 20
    3:  20 30 20 10

Which breaks the deduplicating sort (non-representative first line). In
this case this would just be a renumbering exercise, but it so happens that
we can have a deduplicating sort that goes through the whole table in O(n²)
at the extra cost of a temporary memory allocation (i.e. any form of set).

The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following
this, implement the set as a 256-bits bitmap. Should this not be
satisfactory (i.e. we want to support 32-bit values), then we'll have to go
for some other sparse set implementation.

This has the added benefit of letting us allocate just the right amount of
memory for sched_domains_numa_distance[], rather than an arbitrary
(nr_node_ids + 1).

Note: DT binding equivalent (distance-map) decodes distances as 32-bit
values.

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com
2021-01-27 17:26:42 +01:00
Qais Yousef
0ae78eec8a sched/eas: Don't update misfit status if the task is pinned
If the task is pinned to a cpu, setting the misfit status means that
we'll unnecessarily continuously attempt to migrate the task but fail.

This continuous failure will cause the balance_interval to increase to
a high value, and eventually cause unnecessary significant delays in
balancing the system when real imbalance happens.

Caught while testing uclamp where rt-app calibration loop was pinned to
cpu 0, shortly after which we spawn another task with high util_clamp
value. The task was failing to migrate after over 40ms of runtime due to
balance_interval unnecessary expanded to a very high value from the
calibration loop.

Not done here, but it could be useful to extend the check for pinning to
verify that the affinity of the task has a cpu that fits. We could end
up in a similar situation otherwise.

Fixes: 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Acked-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210119120755.2425264-1-qais.yousef@arm.com
2021-01-27 17:26:42 +01:00
Barry Song
d17405d52b dma-mapping: benchmark: fix kernel crash when dma_map_single fails
if dma_map_single() fails, kernel will give the below oops since
task_struct has been destroyed and we are running into the memory
corruption due to use-after-free in kthread_stop():

[   48.095310] Unable to handle kernel paging request at virtual address 000000c473548040
[   48.095736] Mem abort info:
[   48.095864]   ESR = 0x96000004
[   48.096025]   EC = 0x25: DABT (current EL), IL = 32 bits
[   48.096268]   SET = 0, FnV = 0
[   48.096401]   EA = 0, S1PTW = 0
[   48.096538] Data abort info:
[   48.096659]   ISV = 0, ISS = 0x00000004
[   48.096820]   CM = 0, WnR = 0
[   48.097079] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000104639000
[   48.098099] [000000c473548040] pgd=0000000000000000, p4d=0000000000000000
[   48.098832] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[   48.099232] Modules linked in:
[   48.099387] CPU: 0 PID: 2 Comm: kthreadd Tainted: G        W
[   48.099887] Hardware name: linux,dummy-virt (DT)
[   48.100078] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[   48.100516] pc : __kmalloc_node+0x214/0x368
[   48.100944] lr : __kmalloc_node+0x1f4/0x368
[   48.101458] sp : ffff800011f0bb80
[   48.101843] x29: ffff800011f0bb80 x28: ffff0000c0098ec0
[   48.102330] x27: 0000000000000000 x26: 00000000001d4600
[   48.102648] x25: ffff0000c0098ec0 x24: ffff800011b6a000
[   48.102988] x23: 00000000ffffffff x22: ffff0000c0098ec0
[   48.103333] x21: ffff8000101d7a54 x20: 0000000000000dc0
[   48.103657] x19: ffff0000c0001e00 x18: 0000000000000000
[   48.104069] x17: 0000000000000000 x16: 0000000000000000
[   48.105449] x15: 000001aa0304e7b9 x14: 00000000000003b1
[   48.106401] x13: ffff8000122d5000 x12: ffff80001228d000
[   48.107296] x11: ffff0000c0154340 x10: 0000000000000000
[   48.107862] x9 : ffff80000fffffff x8 : ffff0000c473527f
[   48.108326] x7 : ffff800011e62f58 x6 : ffff0000c01c8ed8
[   48.108778] x5 : ffff0000c0098ec0 x4 : 0000000000000000
[   48.109223] x3 : 00000000001d4600 x2 : 0000000000000040
[   48.109656] x1 : 0000000000000001 x0 : ff0000c473548000
[   48.110104] Call trace:
[   48.110287]  __kmalloc_node+0x214/0x368
[   48.110493]  __vmalloc_node_range+0xc4/0x298
[   48.110805]  copy_process+0x2c8/0x15c8
[   48.111133]  kernel_clone+0x5c/0x3c0
[   48.111373]  kernel_thread+0x64/0x90
[   48.111604]  kthreadd+0x158/0x368
[   48.111810]  ret_from_fork+0x10/0x30
[   48.112336] Code: 17ffffe9 b9402a62 b94008a1 11000421 (f8626802)
[   48.112884] ---[ end trace d4890e21e75419d5 ]---

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-01-27 17:18:38 +01:00
Quentin Monnet
150a27328b bpf, preload: Fix build when $(O) points to a relative path
Building the kernel with CONFIG_BPF_PRELOAD, and by providing a relative
path for the output directory, may fail with the following error:

  $ make O=build bindeb-pkg
  ...
  /.../linux/tools/scripts/Makefile.include:5: *** O=build does not exist.  Stop.
  make[7]: *** [/.../linux/kernel/bpf/preload/Makefile:9: kernel/bpf/preload/libbpf.a] Error 2
  make[6]: *** [/.../linux/scripts/Makefile.build:500: kernel/bpf/preload] Error 2
  make[5]: *** [/.../linux/scripts/Makefile.build:500: kernel/bpf] Error 2
  make[4]: *** [/.../linux/Makefile:1799: kernel] Error 2
  make[4]: *** Waiting for unfinished jobs....

In the case above, for the "bindeb-pkg" target, the error is produced by
the "dummy" check in Makefile.include, called from libbpf's Makefile.
This check changes directory to $(PWD) before checking for the existence
of $(O). But at this step we have $(PWD) pointing to "/.../linux/build",
and $(O) pointing to "build". So the Makefile.include tries in fact to
assert the existence of a directory named "/.../linux/build/build",
which does not exist.

Note that the error does not occur for all make targets and
architectures combinations. This was observed on x86 for "bindeb-pkg",
or for a regular build for UML [0].

Here are some details. The root Makefile recursively calls itself once,
after changing directory to $(O). The content for the variable $(PWD) is
preserved across recursive calls to make, so it is unchanged at this
step. For "bindeb-pkg", $(PWD) is eventually updated because the target
writes a new Makefile (as debian/rules) and calls it indirectly through
dpkg-buildpackage. This script does not preserve $(PWD), which is reset
to the current working directory when the target in debian/rules is
called.

Although not investigated, it seems likely that something similar causes
UML to change its value for $(PWD).

Non-trivial fixes could be to remove the use of $(PWD) from the "dummy"
check, or to make sure that $(PWD) and $(O) are preserved or updated to
always play well and form a valid $(PWD)/$(O) path across the different
targets and architectures. Instead, we take a simpler approach and just
update $(O) when calling libbpf's Makefile, so it points to an absolute
path which should always resolve for the "dummy" check run (through
includes) by that Makefile.

David Gow previously posted a slightly different version of this patch
as a RFC [0], two months ago or so.

  [0] https://lore.kernel.org/bpf/20201119085022.3606135-1-davidgow@google.com/t/#u

Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.")
Reported-by: David Gow <davidgow@google.com>
Signed-off-by: Quentin Monnet <quentin@isovalent.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/bpf/20210126161320.24561-1-quentin@isovalent.com
2021-01-26 23:13:25 +01:00
Mikko Ylinen
78031381ae bpf: Drop disabled LSM hooks from the sleepable set
Some networking and keys LSM hooks are conditionally enabled
and when building the new sleepable BPF LSM hooks with those
LSM hooks disabled, the following build error occurs:

  BTFIDS  vmlinux
  FAILED unresolved symbol bpf_lsm_socket_socketpair

To fix the error, conditionally add the relevant networking/keys
LSM hooks to the sleepable set.

Fixes: 423f16108c9d8 ("bpf: Augment the set of sleepable LSM hooks")
Signed-off-by: Mikko Ylinen <mikko.ylinen@linux.intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: KP Singh <kpsingh@kernel.org>
Link: https://lore.kernel.org/bpf/20210125063936.89365-1-mikko.ylinen@linux.intel.com
2021-01-26 17:08:50 +01:00
Thomas Gleixner
34b1a1ce14 futex: Handle faults correctly for PI futexes
fixup_pi_state_owner() tries to ensure that the state of the rtmutex,
pi_state and the user space value related to the PI futex are consistent
before returning to user space. In case that the user space value update
faults and the fault cannot be resolved by faulting the page in via
fault_in_user_writeable() the function returns with -EFAULT and leaves
the rtmutex and pi_state owner state inconsistent.

A subsequent futex_unlock_pi() operates on the inconsistent pi_state and
releases the rtmutex despite not owning it which can corrupt the RB tree of
the rtmutex and cause a subsequent kernel stack use after free.

It was suggested to loop forever in fixup_pi_state_owner() if the fault
cannot be resolved, but that results in runaway tasks which is especially
undesired when the problem happens due to a programming error and not due
to malice.

As the user space value cannot be fixed up, the proper solution is to make
the rtmutex and the pi_state consistent so both have the same owner. This
leaves the user space value out of sync. Any subsequent operation on the
futex will fail because the 10th rule of PI futexes (pi_state owner and
user space value are consistent) has been violated.

As a consequence this removes the inept attempts of 'fixing' the situation
in case that the current task owns the rtmutex when returning with an
unresolvable fault by unlocking the rtmutex which left pi_state::owner and
rtmutex::owner out of sync in a different and only slightly less dangerous
way.

Fixes: 1b7558e457ed ("futexes: fix fault handling in futex_lock_pi")
Reported-by: gzobqq@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
2021-01-26 15:11:00 +01:00
Thomas Gleixner
f2dac39d93 futex: Simplify fixup_pi_state_owner()
Too many gotos already and an upcoming fix would make it even more
unreadable.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
2021-01-26 15:10:59 +01:00
Thomas Gleixner
6ccc84f917 futex: Use pi_state_update_owner() in put_pi_state()
No point in open coding it. This way it gains the extra sanity checks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
2021-01-26 15:10:59 +01:00
Thomas Gleixner
2156ac1934 rtmutex: Remove unused argument from rt_mutex_proxy_unlock()
Nothing uses the argument. Remove it as preparation to use
pi_state_update_owner().

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
2021-01-26 15:10:58 +01:00