Commit Graph

36743 Commits

Author SHA1 Message Date
Masami Hiramatsu
8e242060c6 tracing/probes: Reject events which have the same name of existing one
Since kprobe_events and uprobe_events only check whether the
other same-type probe event has the same name or not, if the
user gives the same name of the existing tracepoint event (or
the other type of probe events), it silently fails to create
the tracefs entry (but registered.) as below.

/sys/kernel/tracing # ls events/task/task_rename
enable   filter   format   hist     id       trigger
/sys/kernel/tracing # echo p:task/task_rename vfs_read >> kprobe_events
[  113.048508] Could not create tracefs 'task_rename' directory
/sys/kernel/tracing # cat kprobe_events
p:task/task_rename vfs_read

To fix this issue, check whether the existing events have the
same name or not in trace_probe_register_event_call(). If exists,
it rejects to register the new event.

Link: https://lkml.kernel.org/r/162936876189.187130.17558311387542061930.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-19 09:15:19 -04:00
Steven Rostedt (VMware)
8565a45d08 tracing/probes: Have process_fetch_insn() take a void * instead of pt_regs
In preparation to allow event probes to use the process_fetch_insn()
callback in trace_probe_tmpl.h, change the data passed to it from a
pointer to pt_regs, as the event probe will not be using regs, and make it
a void pointer instead.

Update the process_fetch_insn() callers for kprobe and uprobe events to
have the regs defined in the function and just typecast the void pointer
parameter.

Link: https://lkml.kernel.org/r/20210819041842.291622924@goodmis.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-19 09:09:03 -04:00
Steven Rostedt (VMware)
007517a019 tracing/probe: Change traceprobe_set_print_fmt() to take a type
Instead of a boolean "is_return" have traceprobe_set_print_fmt() take a
type (currently just PROBE_PRINT_NORMAL and PROBE_PRINT_RETURN). This will
simplify adding different types. For example, the development of the
event_probe, will need its own type as it prints an event, and not an IP.

Link: https://lkml.kernel.org/r/20210819041842.104626301@goodmis.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-19 09:08:29 -04:00
Steven Rostedt (VMware)
845cbf3e11 tracing/probes: Use struct_size() instead of defining custom macros
Remove SIZEOF_TRACE_KPROBE() and SIZEOF_TRACE_UPROBE() and use
struct_size() as that's what it is made for. No need to have custom
macros. Especially since struct_size() has some extra memory checks for
correctness.

Link: https://lkml.kernel.org/r/20210817035027.795000217@goodmis.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-18 18:13:52 -04:00
Steven Rostedt (VMware)
bc1b973455 tracing/probes: Allow for dot delimiter as well as slash for system names
Kprobe and uprobe events can add a "system" to the events that are created
via the kprobe_events and uprobe_events files respectively. If they do not
include a "system" in the name, then the default "kprobes" or "uprobes" is
used. The current notation to specify a system for one of these probe
events is to add a '/' delimiter in the name, where the content before the
'/' will be the system to use, and the content after will be the event
name.

 echo 'p:my_system/my_event' > kprobe_events

But this is inconsistent with the way histogram triggers separate their
system / event names. The histogram triggers use a '.' delimiter, which
can be confusing.

To allow this to be more consistent, as well as keep backward
compatibility, allow the kprobe and uprobe events to denote a system name
with either a '/' or a '.'.

That is:

  echo 'p:my_system/my_event' > kprobe_events

is equivalent to:

  echo 'p:my_system.my_event' > kprobe_events

Link: https://lore.kernel.org/linux-trace-devel/20210813004448.51c7de69ce432d338f4d226b@kernel.org/
Link: https://lkml.kernel.org/r/20210817035027.580493202@goodmis.org

Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-18 18:13:52 -04:00
Steven Rostedt (VMware)
fcd9db51df tracing/probe: Have traceprobe_parse_probe_arg() take a const arg
The two places that call traceprobe_parse_probe_arg() allocate a temporary
buffer to copy the argv[i] into, because argv[i] is constant and the
traceprobe_parse_probe_arg() will modify it to do the parsing. These two
places allocate this buffer and then free it right after calling this
function, leaving the onus of this allocation to the caller.

As there's about to be a third user of this function that will have to do
the same thing, instead of having the caller allocate the temporary
buffer, simply move that allocation into the traceprobe_parse_probe_arg()
itself, which will simplify the code of the callers.

Link: https://lkml.kernel.org/r/20210817035027.385422828@goodmis.org

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-18 18:13:52 -04:00
Steven Rostedt (VMware)
1d18538e6a tracing: Have dynamic events have a ref counter
As dynamic events are not created by modules, if something is attached to
one, calling "try_module_get()" on its "mod" field, is not going to keep
the dynamic event from going away.

Since dynamic events do not need the "mod" pointer of the event structure,
make a union out of it in order to save memory (there's one structure for
each of the thousand+ events in the kernel), and have any event with the
DYNAMIC flag set to use a ref counter instead.

Link: https://lore.kernel.org/linux-trace-devel/20210813004448.51c7de69ce432d338f4d226b@kernel.org/
Link: https://lkml.kernel.org/r/20210817035027.174869074@goodmis.org

Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-18 18:13:47 -04:00
Steven Rostedt (VMware)
8b0e6c744f tracing: Add DYNAMIC flag for dynamic events
To differentiate between static and dynamic events, add a new flag
DYNAMIC to the event flags that all dynamic events have set. This will
allow to differentiate when attaching to a dynamic event from a static
event.

Static events have a mod pointer that references the module they were
created in (or NULL for core kernel). This can be incremented when the
event has something attached to it. But there exists no such mechanism for
dynamic events. This is dangerous as the dynamic events may now disappear
without the "attachment" knowing that it no longer exists.

To enforce the dynamic flag, change dyn_event_add() to pass the event that
is being created such that it can set the DYNAMIC flag of the event. This
helps make sure that no location that creates a dynamic event misses
setting this flag.

Link: https://lore.kernel.org/linux-trace-devel/20210813004448.51c7de69ce432d338f4d226b@kernel.org/
Link: https://lkml.kernel.org/r/20210817035026.936958254@goodmis.org

Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-18 18:10:32 -04:00
Sebastian Andrzej Siewior
99c37d1a63 tracing: Replace deprecated CPU-hotplug functions.
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().

Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.

Link: https://lkml.kernel.org/r/20210803141621.780504-37-bigeasy@linutronix.de

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-17 15:47:14 -04:00
zhaoxiao
bd74095389 tracepoint: Fix kerneldoc comments
Fix function name in tracepoint.c kernel-doc comment
to remove a warning found by clang_w1.

kernel/tracepoint.c:589: warning: expecting prototype for register_tracepoint_notifier(). Prototype was for register_tracepoint_module_notifier() instead
kernel/tracepoint.c:613: warning: expecting prototype for unregister_tracepoint_notifier(). Prototype was for unregister_tracepoint_module_notifier() instead

Link: https://lkml.kernel.org/r/20210816052430.16539-1-zhaoxiao@uniontech.com

Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: zhaoxiao <zhaoxiao@uniontech.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:39:51 -04:00
Masami Hiramatsu
64dc7f6958 tracing/boot: Show correct histogram error command
Since trigger_process_regex() modifies given trigger actions
while parsing, the error message couldn't show what command
was passed to the trigger_process_regex() when it returns
an error.

To fix that, show the backed up trigger action command
instead of parsed buffer.

Link: https://lkml.kernel.org/r/162856126413.203126.9465564928450701424.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:21 -04:00
Masami Hiramatsu
17abd7c36c tracing/boot: Support multiple histograms for each event
Add multiple histograms support for each event. This allows
user to set multiple histograms to an event.

ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist[.N] {
...
}

The 'N' is a digit started string and it can be omitted
for the default histogram.

For example, multiple hist triggers example in the
Documentation/trace/histogram.rst can be written as below;

ftrace.event.net.netif_receive_skb.hist {
	1 {
		keys = skbaddr.hex
		values = len
		filter = len < 0
	}
	2 {
		keys = skbaddr.hex
		values = len
		filter = len > 4096
	}
	3 {
		keys = skbaddr.hex
		values = len
		filter = len == 256
	}
	4 {
		keys = skbaddr.hex
		values = len
	}
	5 {
		keys = len
		values = common_preempt_count
	}
}

Link: https://lkml.kernel.org/r/162856125628.203126.15846930277378572120.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:21 -04:00
Masami Hiramatsu
8993665abc tracing/boot: Support multiple handlers for per-event histogram
Support multiple handlers for per-event histogram in boot-time tracing.
Since the histogram can register multiple same handler-actions with
different parameters, this expands the syntax to support such cases.

With this update, the 'onmax', 'onchange' and 'onmatch' handler subkeys
under per-event histogram option will take a number subkeys optionally
as below. (see [.N])

ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist {
     onmax|onchange[.N] { var = <VAR>; <ACTION> [= <PARAM>] }
     onmatch[.N] { event = <EVENT>; <ACTION> [= <PARAM>] }
}

The 'N' must be a digit (or digit started word).

Thus user can add several handler-actions to the histogram,
for example,

ftrace.event.SOMEGROUP.SOMEEVENT.hist {
   keys = SOME_ID; lat = common_timestamp.usecs-$ts0
   onmatch.1 {
	event = GROUP1.STARTEVENT1
	trace = latency_event, SOME_ID, $lat
   }
   onmatch.2 {
	event = GROUP2.STARTEVENT2
	trace = latency_event, SOME_ID, $lat
   }
}

Then, it can trace the elapsed time from GROUP1.STARTEVENT1 to
SOMEGROUP.SOMEEVENT, and from GROUP2.STARTEVENT2 to
SOMEGROUP.SOMEEVENT with SOME_ID key.

Link: https://lkml.kernel.org/r/162856124905.203126.14913731908137885922.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:21 -04:00
Masami Hiramatsu
e66ed86ca6 tracing/boot: Add per-event histogram action options
Add a hist-trigger action syntax support to boot-time tracing.
Currently, boot-time tracing supports per-event actions as option
strings. However, for the histogram action, it has a special syntax
and usually needs a long action definition.
To make it readable and fit to the bootconfig syntax, this introduces
a new options for histogram.

Here are the histogram action options for boot-time tracing.

ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist {
     keys = <KEY>[,...]
     values = <VAL>[,...]
     sort = <SORT-KEY>[,...]
     size = <ENTRIES>
     name = <HISTNAME>
     var { <VAR> = <EXPR> ... }
     pause|continue|clear
     onmax|onchange { var = <VAR>; <ACTION> [= <PARAM>] }
     onmatch { event = <EVENT>; <ACTION> [= <PARAM>] }
     filter = <FILTER>
}

Where <ACTION> is one of below;

     trace = <EVENT>, <ARG1>[, ...]
     save = <ARG1>[, ...]
     snapshot

Link: https://lkml.kernel.org/r/162856124106.203126.10501871028479029087.stgit@devnote2

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:21 -04:00
Masahiro Yamada
de32951b29 tracing: Simplify the Kconfig dependency of FTRACE
The entire FTRACE block is surrounded by 'if TRACING_SUPPORT' ...
'endif'.

Using 'depends on' is a simpler way to guard FTRACE.

Link: https://lkml.kernel.org/r/20210731052233.4703-1-masahiroy@kernel.org

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:20 -04:00
Steven Rostedt (VMware)
ed2cf90735 tracing: Allow execnames to be passed as args for synthetic events
Allow common_pid.execname to be saved in a variable in one histogram to be
passed to another histogram that can pass it as a parameter to a synthetic
event.

 ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs:arg2=common_pid.execname' \
       > events/sched/sched_waking/trigger
 ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
 ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1,exec=$arg2'\
':onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,$exec)' \
 > events/sched/sched_switch/trigger

The above is a wake up latency synthetic event setup that passes the execname
of the common_pid that woke the task to the scheduling of that task, which
triggers a synthetic event that passes the original execname as a
parameter to display it.

 ># echo 1 > events/synthetic/enable
 ># cat trace
    <idle>-0       [006] d..4   186.863801: wakeup_lat: pid=1306 delta=65 wake_comm=kworker/u16:3
    <idle>-0       [000] d..4   186.863858: wakeup_lat: pid=163 delta=27 wake_comm=<idle>
    <idle>-0       [001] d..4   186.863903: wakeup_lat: pid=1307 delta=36 wake_comm=kworker/u16:4
    <idle>-0       [000] d..4   186.863927: wakeup_lat: pid=163 delta=5 wake_comm=<idle>
    <idle>-0       [006] d..4   186.863957: wakeup_lat: pid=1306 delta=24 wake_comm=kworker/u16:3
      sshd-1306    [006] d..4   186.864051: wakeup_lat: pid=61 delta=62 wake_comm=<idle>
    <idle>-0       [000] d..4   186.965030: wakeup_lat: pid=609 delta=18 wake_comm=<idle>
    <idle>-0       [006] d..4   186.987582: wakeup_lat: pid=1306 delta=65 wake_comm=kworker/u16:3
    <idle>-0       [000] d..4   186.987639: wakeup_lat: pid=163 delta=27 wake_comm=<idle>

Link: https://lkml.kernel.org/r/20210722142837.458596338@goodmis.org

Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:20 -04:00
Steven Rostedt (VMware)
3347d80baa tracing: Have histogram types be constant when possible
Instead of kstrdup("const", GFP_KERNEL), have the hist_field type simply
assign the constant hist_field->type = "const"; And when the value passed
to it is a variable, use "kstrdup_const(var, GFP_KERNEL);" which will just
copy the value if the variable is already a constant. This saves on having
to allocate when not needed.

All frees of the hist_field->type will need to use kfree_const().

Link: https://lkml.kernel.org/r/20210722142837.280718447@goodmis.org

Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:20 -04:00
Steven Rostedt (VMware)
3703643519 tracing/histogram: Update the documentation for the buckets modifier
Update both the tracefs README file as well as the histogram.rst to
include an explanation of what the buckets modifier is and how to use it.
Include an example with the wakeup_latency example for both log2 and the
buckets modifiers as there was no existing log2 example.

Link: https://lkml.kernel.org/r/20210707213922.167218794@goodmis.org

Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:20 -04:00
Steven Rostedt (VMware)
de9a48a360 tracing: Add linear buckets to histogram logic
There's been several times I wished the histogram logic had a "grouping"
feature for the buckets. Currently, each bucket has a size of one. That
is, if you trace the amount of requested allocations, each allocation is
its own bucket, even if you are interested in what allocates 100 bytes or
less, 100 to 200, 200 to 300, etc.

Also, without grouping, it fills up the allocated histogram buckets
quickly. If you are tracking latency, and don't care if something is 200
microseconds off, or 201 microseconds off, but want to track them by say
10 microseconds each. This can not currently be done.

There is a log2 but that grouping get's too big too fast for a lot of
cases.

Introduce a "buckets=SIZE" command to each field where it will record in a
rounded number. For example:

 ># echo 'hist:keys=bytes_req.buckets=100:sort=bytes_req' > events/kmem/kmalloc/trigger
 ># cat events/kmem/kmalloc/hist
 # event histogram
 #
 # trigger info:
 hist:keys=bytes_req.buckets=100:vals=hitcount:sort=bytes_req.buckets=100:size=2048
 [active]
 #

 { bytes_req: ~ 0-99 } hitcount:       3149
 { bytes_req: ~ 100-199 } hitcount:       1468
 { bytes_req: ~ 200-299 } hitcount:         39
 { bytes_req: ~ 300-399 } hitcount:        306
 { bytes_req: ~ 400-499 } hitcount:        364
 { bytes_req: ~ 500-599 } hitcount:         32
 { bytes_req: ~ 600-699 } hitcount:         69
 { bytes_req: ~ 700-799 } hitcount:         37
 { bytes_req: ~ 1200-1299 } hitcount:         16
 { bytes_req: ~ 1400-1499 } hitcount:         30
 { bytes_req: ~ 2000-2099 } hitcount:          6
 { bytes_req: ~ 4000-4099 } hitcount:       2168
 { bytes_req: ~ 5000-5099 } hitcount:          6

 Totals:
     Hits: 7690
     Entries: 13
     Dropped: 0

Link: https://lkml.kernel.org/r/20210707213921.980359719@goodmis.org

Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:20 -04:00
Masami Hiramatsu
6fe7c745f2 tracing/boot: Fix a hist trigger dependency for boot time tracing
Fixes a build error when CONFIG_HIST_TRIGGERS=n with boot-time
tracing. Since the trigger_process_regex() is defined only
when CONFIG_HIST_TRIGGERS=y, if it is disabled, the 'actions'
event option also must be disabled.

Link: https://lkml.kernel.org/r/162856123376.203126.582144262622247352.stgit@devnote2

Fixes: 81a59555ff ("tracing/boot: Add per-event settings")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:37:20 -04:00
Pingfan Liu
6c34df6f35 tracing: Apply trace filters on all output channels
The event filters are not applied on all of the output, which results in
the flood of printk when using tp_printk. Unfolding
event_trigger_unlock_commit_regs() into trace_event_buffer_commit(), so
the filters can be applied on every output.

Link: https://lkml.kernel.org/r/20210814034538.8428-1-kernelfans@gmail.com

Cc: stable@vger.kernel.org
Fixes: 0daa230296 ("tracing: Add tp_printk cmdline to have tracepoints go to printk()")
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-16 11:01:52 -04:00
Steven Rostedt (VMware)
5acce0bff2 tracing / histogram: Fix NULL pointer dereference on strcmp() on NULL event name
The following commands:

 # echo 'read_max u64 size;' > synthetic_events
 # echo 'hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)' > events/syscalls/sys_enter_read/trigger

Causes:

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU: 4 PID: 1763 Comm: bash Not tainted 5.14.0-rc2-test+ #155
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
 RIP: 0010:strcmp+0xc/0x20
 Code: 75 f7 31 c0 0f b6 0c 06 88 0c 02 48 83 c0 01 84 c9 75 f1 4c 89 c0
c3 0f 1f 80 00 00 00 00 31 c0 eb 08 48 83 c0 01 84 d2 74 0f <0f> b6 14 07
3a 14 06 74 ef 19 c0 83 c8 01 c3 31 c0 c3 66 90 48 89
 RSP: 0018:ffffb5fdc0963ca8 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffffffffb3a4e040 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff9714c0d0b640 RDI: 0000000000000000
 RBP: 0000000000000000 R08: 00000022986b7cde R09: ffffffffb3a4dff8
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff9714c50603c8
 R13: 0000000000000000 R14: ffff97143fdf9e48 R15: ffff9714c01a2210
 FS:  00007f1fa6785740(0000) GS:ffff9714da400000(0000)
knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 000000002d863004 CR4: 00000000001706e0
 Call Trace:
  __find_event_file+0x4e/0x80
  action_create+0x6b7/0xeb0
  ? kstrdup+0x44/0x60
  event_hist_trigger_func+0x1a07/0x2130
  trigger_process_regex+0xbd/0x110
  event_trigger_write+0x71/0xd0
  vfs_write+0xe9/0x310
  ksys_write+0x68/0xe0
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f1fa6879e87

The problem was the "trace(read_max,count)" where the "count" should be
"$count" as "onmax()" only handles variables (although it really should be
able to figure out that "count" is a field of sys_enter_read). But there's
a path that does not find the variable and ends up passing a NULL for the
event, which ends up getting passed to "strcmp()".

Add a check for NULL to return and error on the command with:

 # cat error_log
  hist:syscalls:sys_enter_read: error: Couldn't create or find variable
  Command: hist:keys=common_pid:count=count:onmax($count).trace(read_max,count)
                                ^
Link: https://lkml.kernel.org/r/20210808003011.4037f8d0@oasis.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 50450603ec tracing: Add 'onmax' hist trigger action support
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12 13:35:57 -04:00
Lukas Bulwahn
12f9951d3f tracing: define needed config DYNAMIC_FTRACE_WITH_ARGS
Commit 2860cd8a23 ("livepatch: Use the default ftrace_ops instead of
REGS when ARGS is available") intends to enable config LIVEPATCH when
ftrace with ARGS is available. However, the chain of configs to enable
LIVEPATCH is incomplete, as HAVE_DYNAMIC_FTRACE_WITH_ARGS is available,
but the definition of DYNAMIC_FTRACE_WITH_ARGS, combining DYNAMIC_FTRACE
and HAVE_DYNAMIC_FTRACE_WITH_ARGS, needed to enable LIVEPATCH, is missing
in the commit.

Fortunately, ./scripts/checkkconfigsymbols.py detects this and warns:

DYNAMIC_FTRACE_WITH_ARGS
Referencing files: kernel/livepatch/Kconfig

So, define the config DYNAMIC_FTRACE_WITH_ARGS analogously to the already
existing similar configs, DYNAMIC_FTRACE_WITH_REGS and
DYNAMIC_FTRACE_WITH_DIRECT_CALLS, in ./kernel/trace/Kconfig to connect the
chain of configs.

Link: https://lore.kernel.org/kernel-janitors/CAKXUXMwT2zS9fgyQHKUUiqo8ynZBdx2UEUu1WnV_q0OCmknqhw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20210806195027.16808-1-lukas.bulwahn@gmail.com

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: stable@vger.kernel.org
Fixes: 2860cd8a23 ("livepatch: Use the default ftrace_ops instead of REGS when ARGS is available")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12 13:35:57 -04:00
Daniel Bristot de Oliveira
0e05ba498d trace/osnoise: Print a stop tracing message
When using osnoise/timerlat with stop tracing, sometimes it is
not clear in which CPU the stop condition was hit, mainly
when using some extra events.

Print a message informing in which CPU the trace stopped, like
in the example below:

          <idle>-0       [006] d.h.  2932.676616: #1672599 context    irq timer_latency     34689 ns
          <idle>-0       [006] dNh.  2932.676618: irq_noise: local_timer:236 start 2932.676615639 duration 2391 ns
          <idle>-0       [006] dNh.  2932.676620: irq_noise: virtio0-output.0:47 start 2932.676620180 duration 86 ns
          <idle>-0       [003] d.h.  2932.676621: #1673374 context    irq timer_latency      1200 ns
          <idle>-0       [006] d...  2932.676623: thread_noise: swapper/6:0 start 2932.676615964 duration 4339 ns
          <idle>-0       [003] dNh.  2932.676623: irq_noise: local_timer:236 start 2932.676620597 duration 1881 ns
          <idle>-0       [006] d...  2932.676623: sched_switch: prev_comm=swapper/6 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/6 next_pid=852 next_prio=4
      timerlat/6-852     [006] ....  2932.676623: #1672599 context thread timer_latency     41931 ns
          <idle>-0       [003] d...  2932.676623: thread_noise: swapper/3:0 start 2932.676620854 duration 880 ns
          <idle>-0       [003] d...  2932.676624: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=timerlat/3 next_pid=849 next_prio=4
      timerlat/6-852     [006] ....  2932.676624: timerlat_main: stop tracing hit on cpu 6
      timerlat/3-849     [003] ....  2932.676624: #1673374 context thread timer_latency      4310 ns

Link: https://lkml.kernel.org/r/b30a0d7542adba019185f44ee648e60e14923b11.1626598844.git.bristot@kernel.org

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12 13:35:56 -04:00
Daniel Bristot de Oliveira
e1c4ad4a7f trace/timerlat: Add a header with PREEMPT_RT additional fields
Some extra flags are printed to the trace header when using the
PREEMPT_RT config. The extra flags are: need-resched-lazy,
preempt-lazy-depth, and migrate-disable.

Without printing these fields, the timerlat specific fields are
shifted by three positions, for example:

 # tracer: timerlat
 #
 #                                _-----=> irqs-off
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              || /
 #                              ||||             ACTIVATION
 #           TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY
 #              | |         |   ||||      |         |                  |                       |
           <idle>-0       [000] d..h...  3279.798871: #1     context    irq timer_latency       830 ns
            <...>-807     [000] .......  3279.798881: #1     context thread timer_latency     11301 ns

Add a new header for timerlat with the missing fields, to be used
when the PREEMPT_RT is enabled.

Link: https://lkml.kernel.org/r/babb83529a3211bd0805be0b8c21608230202c55.1626598844.git.bristot@kernel.org

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12 13:35:56 -04:00
Daniel Bristot de Oliveira
d03721a6e7 trace/osnoise: Add a header with PREEMPT_RT additional fields
Some extra flags are printed to the trace header when using the
PREEMPT_RT config. The extra flags are: need-resched-lazy,
preempt-lazy-depth, and migrate-disable.

Without printing these fields, the osnoise specific fields are
shifted by three positions, for example:

 # tracer: osnoise
 #
 #                                _-----=> irqs-off
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth                            MAX
 #                              || /                                             SINGLE      Interference counters:
 #                              ||||               RUNTIME      NOISE  %% OF CPU  NOISE    +-----------------------------+
 #           TASK-PID      CPU# ||||   TIMESTAMP    IN US       IN US  AVAILABLE  IN US     HW    NMI    IRQ   SIRQ THREAD
 #              | |         |   ||||      |           |             |    |            |      |      |      |      |      |
            <...>-741     [000] .......  1105.690909: 1000000        234  99.97660      36     21      0   1001     22      3
            <...>-742     [001] .......  1105.691923: 1000000        281  99.97190     197      7      0   1012     35     14
            <...>-743     [002] .......  1105.691958: 1000000       1324  99.86760     118     11      0   1016    155    143
            <...>-744     [003] .......  1105.691998: 1000000        109  99.98910      21      4      0   1004     33      7
            <...>-745     [004] .......  1105.692015: 1000000       2023  99.79770      97     37      0   1023     52     18

Add a new header for osnoise with the missing fields, to be used
when the PREEMPT_RT is enabled.

Link: https://lkml.kernel.org/r/1f03289d2a51fde5a58c2e7def063dc630820ad1.1626598844.git.bristot@kernel.org

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-12 13:35:56 -04:00
Mathieu Desnoyers
7b40066c97 tracepoint: Use rcu get state and cond sync for static call updates
State transitions from 1->0->1 and N->2->1 callbacks require RCU
synchronization. Rather than performing the RCU synchronization every
time the state change occurs, which is quite slow when many tracepoints
are registered in batch, instead keep a snapshot of the RCU state on the
most recent transitions which belong to a chain, and conditionally wait
for a grace period on the last transition of the chain if one g.p. has
not elapsed since the last snapshot.

This applies to both RCU and SRCU.

This brings the performance regression caused by commit 231264d692
("Fix: tracepoint: static call function vs data state mismatch") back to
what it was originally.

Before this commit:

  # trace-cmd start -e all
  # time trace-cmd start -p nop

  real	0m10.593s
  user	0m0.017s
  sys	0m0.259s

After this commit:

  # trace-cmd start -e all
  # time trace-cmd start -p nop

  real	0m0.878s
  user	0m0.000s
  sys	0m0.103s

Link: https://lkml.kernel.org/r/20210805192954.30688-1-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: 231264d692 ("Fix: tracepoint: static call function vs data state mismatch")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-06 10:54:41 -04:00
Mathieu Desnoyers
231264d692 tracepoint: Fix static call function vs data state mismatch
On a 1->0->1 callbacks transition, there is an issue with the new
callback using the old callback's data.

Considering __DO_TRACE_CALL:

        do {                                                            \
                struct tracepoint_func *it_func_ptr;                    \
                void *__data;                                           \
                it_func_ptr =                                           \
                        rcu_dereference_raw((&__tracepoint_##name)->funcs); \
                if (it_func_ptr) {                                      \
                        __data = (it_func_ptr)->data;                   \

----> [ delayed here on one CPU (e.g. vcpu preempted by the host) ]

                        static_call(tp_func_##name)(__data, args);      \
                }                                                       \
        } while (0)

It has loaded the tp->funcs of the old callback, so it will try to use the old
data. This can be fixed by adding a RCU sync anywhere in the 1->0->1
transition chain.

On a N->2->1 transition, we need an rcu-sync because you may have a
sequence of 3->2->1 (or 1->2->1) where the element 0 data is unchanged
between 2->1, but was changed from 3->2 (or from 1->2), which may be
observed by the static call. This can be fixed by adding an
unconditional RCU sync in transition 2->1.

Note, this fixes a correctness issue at the cost of adding a tremendous
performance regression to the disabling of tracepoints.

Before this commit:

  # trace-cmd start -e all
  # time trace-cmd start -p nop

  real	0m0.778s
  user	0m0.000s
  sys	0m0.061s

After this commit:

  # trace-cmd start -e all
  # time trace-cmd start -p nop

  real	0m10.593s
  user	0m0.017s
  sys	0m0.259s

A follow up fix will introduce a more lightweight scheme based on RCU
get_state and cond_sync, that will return the performance back to what it
was. As both this change and the lightweight versions are complex on their
own, for bisecting any issues that this may cause, they are kept as two
separate changes.

Link: https://lkml.kernel.org/r/20210805132717.23813-3-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: d25e37d89d ("tracepoint: Optimize using static_call()")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-05 15:42:08 -04:00
Mathieu Desnoyers
f7ec412125 tracepoint: static call: Compare data on transition from 2->1 callees
On transition from 2->1 callees, we should be comparing .data rather
than .func, because the same callback can be registered twice with
different data, and what we care about here is that the data of array
element 0 is unchanged to skip rcu sync.

Link: https://lkml.kernel.org/r/20210805132717.23813-2-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: 547305a646 ("tracepoint: Fix out of sync data passing by static caller")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-05 15:40:41 -04:00
Steven Rostedt (VMware)
51397dc6f2 tracing: Quiet smp_processor_id() use in preemptable warning in hwlat
The hardware latency detector (hwlat) has a mode that it runs one thread
across CPUs. The logic to move from the currently running CPU to the next
one in the list does a smp_processor_id() to find where it currently is.
Unfortunately, it's done with preemption enabled, and this triggers a
warning for using smp_processor_id() in a preempt enabled section.

As it is only using smp_processor_id() to get information on where it
currently is in order to simply move it to the next CPU, it doesn't really
care if it got moved in the mean time. It will simply balance out later if
such a case arises.

Switch smp_processor_id() to raw_smp_processor_id() to quiet that warning.

Link: https://lkml.kernel.org/r/20210804141848.79edadc0@oasis.local.home

Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Fixes: 8fa826b734 ("trace/hwlat: Implement the mode config option")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-05 09:27:31 -04:00
Masami Hiramatsu
a9d10ca498 tracing: Reject string operand in the histogram expression
Since the string type can not be the target of the addition / subtraction
operation, it must be rejected. Without this fix, the string type silently
converted to digits.

Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devnote2

Cc: stable@vger.kernel.org
Fixes: 100719dcef ("tracing: Add simple expression support to hist triggers")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-04 17:49:26 -04:00
Steven Rostedt (VMware)
2c05caa7ba tracing / histogram: Give calculation hist_fields a size
When working on my user space applications, I found a bug in the synthetic
event code where the automated synthetic event field was not matching the
event field calculation it was attached to. Looking deeper into it, it was
because the calculation hist_field was not given a size.

The synthetic event fields are matched to their hist_fields either by
having the field have an identical string type, or if that does not match,
then the size and signed values are used to match the fields.

The problem arose when I tried to match a calculation where the fields
were "unsigned int". My tool created a synthetic event of type "u32". But
it failed to match. The string was:

  diff=field1-field2:onmatch(event).trace(synth,$diff)

Adding debugging into the kernel, I found that the size of "diff" was 0.
And since it was given "unsigned int" as a type, the histogram fallback
code used size and signed. The signed matched, but the size of u32 (4) did
not match zero, and the event failed to be created.

This can be worse if the field you want to match is not one of the
acceptable fields for a synthetic event. As event fields can have any type
that is supported in Linux, this can cause an issue. For example, if a
type is an enum. Then there's no way to use that with any calculations.

Have the calculation field simply take on the size of what it is
calculating.

Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 100719dcef ("tracing: Add simple expression support to hist triggers")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-08-04 17:48:41 -04:00
Kamal Agrawal
ff41c28c4b tracing: Fix NULL pointer dereference in start_creating
The event_trace_add_tracer() can fail. In this case, it leads to a crash
in start_creating with below call stack. Handle the error scenario
properly in trace_array_create_dir.

Call trace:
down_write+0x7c/0x204
start_creating.25017+0x6c/0x194
tracefs_create_file+0xc4/0x2b4
init_tracer_tracefs+0x5c/0x940
trace_array_create_dir+0x58/0xb4
trace_array_create+0x1bc/0x2b8
trace_array_get_by_name+0xdc/0x18c

Link: https://lkml.kernel.org/r/1627651386-21315-1-git-send-email-kamaagra@codeaurora.org

Cc: stable@vger.kernel.org
Fixes: 4114fbfd02 ("tracing: Enable creating new instance early boot")
Signed-off-by: Kamal Agrawal <kamaagra@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-30 18:45:11 -04:00
Steven Rostedt (VMware)
352384d5c8 tracepoints: Update static_call before tp_funcs when adding a tracepoint
Because of the significant overhead that retpolines pose on indirect
calls, the tracepoint code was updated to use the new "static_calls" that
can modify the running code to directly call a function instead of using
an indirect caller, and this function can be changed at runtime.

In the tracepoint code that calls all the registered callbacks that are
attached to a tracepoint, the following is done:

	it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
	if (it_func_ptr) {
		__data = (it_func_ptr)->data;
		static_call(tp_func_##name)(__data, args);
	}

If there's just a single callback, the static_call is updated to just call
that callback directly. Once another handler is added, then the static
caller is updated to call the iterator, that simply loops over all the
funcs in the array and calls each of the callbacks like the old method
using indirect calling.

The issue was discovered with a race between updating the funcs array and
updating the static_call. The funcs array was updated first and then the
static_call was updated. This is not an issue as long as the first element
in the old array is the same as the first element in the new array. But
that assumption is incorrect, because callbacks also have a priority
field, and if there's a callback added that has a higher priority than the
callback on the old array, then it will become the first callback in the
new array. This means that it is possible to call the old callback with
the new callback data element, which can cause a kernel panic.

	static_call = callback1()
	funcs[] = {callback1,data1};
	callback2 has higher priority than callback1

	CPU 1				CPU 2
	-----				-----

   new_funcs = {callback2,data2},
               {callback1,data1}

   rcu_assign_pointer(tp->funcs, new_funcs);

  /*
   * Now tp->funcs has the new array
   * but the static_call still calls callback1
   */

				it_func_ptr = tp->funcs [ new_funcs ]
				data = it_func_ptr->data [ data2 ]
				static_call(callback1, data);

				/* Now callback1 is called with
				 * callback2's data */

				[ KERNEL PANIC ]

   update_static_call(iterator);

To prevent this from happening, always switch the static_call to the
iterator before assigning the tp->funcs to the new array. The iterator will
always properly match the callback with its data.

To trigger this bug:

  In one terminal:

    while :; do hackbench 50; done

  In another terminal

    echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
    while :; do
        echo 1 > /sys/kernel/tracing/set_event_pid;
        sleep 0.5
        echo 0 > /sys/kernel/tracing/set_event_pid;
        sleep 0.5
   done

And it doesn't take long to crash. This is because the set_event_pid adds
a callback to the sched_waking tracepoint with a high priority, which will
be called before the sched_waking trace event callback is called.

Note, the removal to a single callback updates the array first, before
changing the static_call to single callback, which is the proper order as
the first element in the array is the same as what the static_call is
being changed to.

Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/

Cc: stable@vger.kernel.org
Fixes: d25e37d89d ("tracepoint: Optimize using static_call()")
Reported-by: Stefan Metzmacher <metze@samba.org>
tested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-23 08:46:22 -04:00
Colin Ian King
3b1a8f457f ftrace: Remove redundant initialization of variable ret
The variable ret is being initialized with a value that is never
read, it is being updated later on. The assignment is redundant and
can be removed.

Link: https://lkml.kernel.org/r/20210721120915.122278-1-colin.king@canonical.com

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-23 08:46:02 -04:00
Nicolas Saenz Julienne
68e83498cb ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary
synchronize_rcu_tasks_rude() triggers IPIs and forces rescheduling on
all CPUs. It is a costly operation and, when targeting nohz_full CPUs,
very disrupting (hence the name). So avoid calling it when 'old_hash'
doesn't need to be freed.

Link: https://lkml.kernel.org/r/20210721114726.1545103-1-nsaenzju@redhat.com

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-23 08:45:53 -04:00
Steven Rostedt (VMware)
9528c19507 tracing: Clean up alloc_synth_event()
alloc_synth_event() currently has the following code to initialize the
event fields and dynamic_fields:

	for (i = 0, j = 0; i < n_fields; i++) {
		event->fields[i] = fields[i];

		if (fields[i]->is_dynamic) {
			event->dynamic_fields[j] = fields[i];
			event->dynamic_fields[j]->field_pos = i;
			event->dynamic_fields[j++] = fields[i];
			event->n_dynamic_fields++;
		}
	}

1) It would make more sense to have all fields keep track of their
   field_pos.

2) event->dynmaic_fields[j] is assigned twice for no reason.

3) We can move updating event->n_dynamic_fields outside the loop, and just
   assign it to j.

This combination makes the code much cleaner.

Link: https://lkml.kernel.org/r/20210721195341.29bb0f77@oasis.local.home

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-23 08:45:30 -04:00
Steven Rostedt (VMware)
1e3bac71c5 tracing/histogram: Rename "cpu" to "common_cpu"
Currently the histogram logic allows the user to write "cpu" in as an
event field, and it will record the CPU that the event happened on.

The problem with this is that there's a lot of events that have "cpu"
as a real field, and using "cpu" as the CPU it ran on, makes it
impossible to run histograms on the "cpu" field of events.

For example, if I want to have a histogram on the count of the
workqueue_queue_work event on its cpu field, running:

 ># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger

Gives a misleading and wrong result.

Change the command to "common_cpu" as no event should have "common_*"
fields as that's a reserved name for fields used by all events. And
this makes sense here as common_cpu would be a field used by all events.

Now we can even do:

 ># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
 ># cat events/workqueue/workqueue_queue_work/hist
 # event histogram
 #
 # trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
 #

 { common_cpu:          0, cpu:          2 } hitcount:          1
 { common_cpu:          0, cpu:          4 } hitcount:          1
 { common_cpu:          7, cpu:          7 } hitcount:          1
 { common_cpu:          0, cpu:          7 } hitcount:          1
 { common_cpu:          0, cpu:          1 } hitcount:          1
 { common_cpu:          0, cpu:          6 } hitcount:          2
 { common_cpu:          0, cpu:          5 } hitcount:          2
 { common_cpu:          1, cpu:          1 } hitcount:          4
 { common_cpu:          6, cpu:          6 } hitcount:          4
 { common_cpu:          5, cpu:          5 } hitcount:         14
 { common_cpu:          4, cpu:          4 } hitcount:         26
 { common_cpu:          0, cpu:          0 } hitcount:         39
 { common_cpu:          2, cpu:          2 } hitcount:        184

Now for backward compatibility, I added a trick. If "cpu" is used, and
the field is not found, it will fall back to "common_cpu" and work as
it did before. This way, it will still work for old programs that use
"cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
will get that event's "cpu" field, which is probably what it wants
anyway.

I updated the tracefs/README to include documentation about both the
common_timestamp and the common_cpu. This way, if that text is present in
the README, then an application can know that common_cpu is supported over
just plain "cpu".

Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home

Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 8b7622bf94 ("tracing: Add cpu field for hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-23 08:44:47 -04:00
Steven Rostedt (VMware)
3b13911a2f tracing: Synthetic event field_pos is an index not a boolean
Performing the following:

 ># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
 ># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs' > events/sched/sched_waking/trigger
 ># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1:onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,prev_comm)'\
      > events/sched/sched_switch/trigger
 ># echo 1 > events/synthetic/enable

Crashed the kernel:

 BUG: kernel NULL pointer dereference, address: 000000000000001b
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.13.0-rc5-test+ #104
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 RIP: 0010:strlen+0x0/0x20
 Code: f6 82 80 2b 0b bc 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2b 0b bc
  20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74 10
  48 89 f8 48 83 c0 01 80 38 9 f8 c3 31
 RSP: 0018:ffffaa75000d79d0 EFLAGS: 00010046
 RAX: 0000000000000002 RBX: ffff9cdb55575270 RCX: 0000000000000000
 RDX: ffff9cdb58c7a320 RSI: ffffaa75000d7b40 RDI: 000000000000001b
 RBP: ffffaa75000d7b40 R08: ffff9cdb40a4f010 R09: ffffaa75000d7ab8
 R10: ffff9cdb4398c700 R11: 0000000000000008 R12: ffff9cdb58c7a320
 R13: ffff9cdb55575270 R14: ffff9cdb58c7a000 R15: 0000000000000018
 FS:  0000000000000000(0000) GS:ffff9cdb5aa00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000000001b CR3: 00000000c0612006 CR4: 00000000001706e0
 Call Trace:
  trace_event_raw_event_synth+0x90/0x1d0
  action_trace+0x5b/0x70
  event_hist_trigger+0x4bd/0x4e0
  ? cpumask_next_and+0x20/0x30
  ? update_sd_lb_stats.constprop.0+0xf6/0x840
  ? __lock_acquire.constprop.0+0x125/0x550
  ? find_held_lock+0x32/0x90
  ? sched_clock_cpu+0xe/0xd0
  ? lock_release+0x155/0x440
  ? update_load_avg+0x8c/0x6f0
  ? enqueue_entity+0x18a/0x920
  ? __rb_reserve_next+0xe5/0x460
  ? ring_buffer_lock_reserve+0x12a/0x3f0
  event_triggers_call+0x52/0xe0
  trace_event_buffer_commit+0x1ae/0x240
  trace_event_raw_event_sched_switch+0x114/0x170
  __traceiter_sched_switch+0x39/0x50
  __schedule+0x431/0xb00
  schedule_idle+0x28/0x40
  do_idle+0x198/0x2e0
  cpu_startup_entry+0x19/0x20
  secondary_startup_64_no_verify+0xc2/0xcb

The reason is that the dynamic events array keeps track of the field
position of the fields array, via the field_pos variable in the
synth_field structure. Unfortunately, that field is a boolean for some
reason, which means any field_pos greater than 1 will be a bug (in this
case it was 2).

Link: https://lkml.kernel.org/r/20210721191008.638bce34@oasis.local.home

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: bd82631d7c ("tracing: Add support for dynamic strings to synthetic events")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-23 08:43:04 -04:00
Haoran Luo
67f0d6d988 tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
"head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
the same buffer page, whose "buffer_data_page" is empty and "read" field
is non-zero.

An error scenario could be constructed as followed (kernel perspective):

1. All pages in the buffer has been accessed by reader(s) so that all of
them will have non-zero "read" field.

2. Read and clear all buffer pages so that "rb_num_of_entries()" will
return 0 rendering there's no more data to read. It is also required
that the "read_page", "commit_page" and "tail_page" points to the same
page, while "head_page" is the next page of them.

3. Invoke "ring_buffer_lock_reserve()" with large enough "length"
so that it shot pass the end of current tail buffer page. Now the
"head_page", "commit_page" and "tail_page" points to the same page.

4. Discard current event with "ring_buffer_discard_commit()", so that
"head_page", "commit_page" and "tail_page" points to a page whose buffer
data page is now empty.

When the error scenario has been constructed, "tracing_read_pipe" will
be trapped inside a deadloop: "trace_empty()" returns 0 since
"rb_per_cpu_empty()" returns 0 when it hits the CPU containing such
constructed ring buffer. Then "trace_find_next_entry_inc()" always
return NULL since "rb_num_of_entries()" reports there's no more entry
to read. Finally "trace_seq_to_user()" returns "-EBUSY" spanking
"tracing_read_pipe" back to the start of the "waitagain" loop.

I've also written a proof-of-concept script to construct the scenario
and trigger the bug automatically, you can use it to trace and validate
my reasoning above:

  https://github.com/aegistudio/RingBufferDetonator.git

Tests has been carried out on linux kernel 5.14-rc2
(2734d6c1b1), my fixed version
of kernel (for testing whether my update fixes the bug) and
some older kernels (for range of affected kernels). Test result is
also attached to the proof-of-concept repository.

Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio

Cc: stable@vger.kernel.org
Fixes: bf41a158ca ("ring-buffer: make reentrant")
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Haoran Luo <www@aegistudio.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-22 11:52:33 -04:00
Linus Torvalds
3fdacf402b tracing: Fix the histogram logic from possibly crashing the kernel
Working on the histogram code, I found that if you dereference a char
 pointer in a trace event that happens to point to user space, it can crash
 the kernel, as it does no checks of that pointer. I have code coming that
 will do this better, so just remove this ability to treat character
 pointers in trace events as stings in the histogram.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYPH9FRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsyhAQDKiQzVJtjfsNbIWliDQOaUwJMO9tNl
 Qu5TUDmPbAA4fwD+MgYsnITPL+o/YcKQ+aMdj/wLLMKfIjhNkFY8wqdLvwg=
 =CN97
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fix from Steven Rostedt:
 "Fix the histogram logic from possibly crashing the kernel

  Working on the histogram code, I found that if you dereference a char
  pointer in a trace event that happens to point to user space, it can
  crash the kernel, as it does no checks of that pointer. I have code
  coming that will do this better, so just remove this ability to treat
  character pointers in trace events as stings in the histogram"

* tag 'trace-v5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Do not reference char * as a string in histograms
2021-07-17 12:36:51 -07:00
Linus Torvalds
6e442d0662 Merge branch 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU fixes from Paul McKenney:

 - fix regressions induced by a merge-window change in scheduler
   semantics, which means that smp_processor_id() can no longer be used
   in kthreads using simple affinity to bind themselves to a specific
   CPU.

 - fix a bug in Tasks Trace RCU that was thought to be strictly
   theoretical. However, production workloads have started hitting this,
   so these fixes need to be merged sooner rather than later.

 - fix a minor printk()-format-mismatch issue introduced during the
   merge window.

* 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu: Fix pr_info() formats and values in show_rcu_gp_kthreads()
  rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
  rcu-tasks: Don't delete holdouts within trc_inspect_reader()
  refscale: Avoid false-positive warnings in ref_scale_reader()
  scftorture: Avoid false-positive warnings in scftorture_invoker()
2021-07-16 11:08:57 -07:00
Steven Rostedt (VMware)
704adfb5a9 tracing: Do not reference char * as a string in histograms
The histogram logic was allowing events with char * pointers to be used as
normal strings. But it was easy to crash the kernel with:

 # echo 'hist:keys=filename' > events/syscalls/sys_enter_openat/trigger

And open some files, and boom!

 BUG: unable to handle page fault for address: 00007f2ced0c3280
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 1173fa067 P4D 1173fa067 PUD 1171b6067 PMD 1171dd067 PTE 0
 Oops: 0000 [#1] PREEMPT SMP
 CPU: 6 PID: 1810 Comm: cat Not tainted 5.13.0-rc5-test+ #61
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01
v03.03 07/14/2016
 RIP: 0010:strlen+0x0/0x20
 Code: f6 82 80 2a 0b a9 20 74 11 0f b6 50 01 48 83 c0 01 f6 82 80 2a 0b
a9 20 75 ef c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 <80> 3f 00 74
10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3

 RSP: 0018:ffffbdbf81567b50 EFLAGS: 00010246
 RAX: 0000000000000003 RBX: ffff93815cdb3800 RCX: ffff9382401a22d0
 RDX: 0000000000000100 RSI: 0000000000000000 RDI: 00007f2ced0c3280
 RBP: 0000000000000100 R08: ffff9382409ff074 R09: ffffbdbf81567c98
 R10: ffff9382409ff074 R11: 0000000000000000 R12: ffff9382409ff074
 R13: 0000000000000001 R14: ffff93815a744f00 R15: 00007f2ced0c3280
 FS:  00007f2ced0f8580(0000) GS:ffff93825a800000(0000)
knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f2ced0c3280 CR3: 0000000107069005 CR4: 00000000001706e0
 Call Trace:
  event_hist_trigger+0x463/0x5f0
  ? find_held_lock+0x32/0x90
  ? sched_clock_cpu+0xe/0xd0
  ? lock_release+0x155/0x440
  ? kernel_init_free_pages+0x6d/0x90
  ? preempt_count_sub+0x9b/0xd0
  ? kernel_init_free_pages+0x6d/0x90
  ? get_page_from_freelist+0x12c4/0x1680
  ? __rb_reserve_next+0xe5/0x460
  ? ring_buffer_lock_reserve+0x12a/0x3f0
  event_triggers_call+0x52/0xe0
  ftrace_syscall_enter+0x264/0x2c0
  syscall_trace_enter.constprop.0+0x1ee/0x210
  do_syscall_64+0x1c/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae

Where it triggered a fault on strlen(key) where key was the filename.

The reason is that filename is a char * to user space, and the histogram
code just blindly dereferenced it, with obvious bad results.

I originally tried to use strncpy_from_user/kernel_nofault() but found
that there's other places that its dereferenced and not worth the effort.

Just do not allow "char *" to act like strings.

Link: https://lkml.kernel.org/r/20210715000206.025df9d2@rorschach.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Tzvetomir Stoyanov <tz.stoyanov@gmail.com>
Cc: stable@vger.kernel.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Tom Zanussi <zanussi@kernel.org>
Fixes: 79e577cbce ("tracing: Support string type key properly")
Fixes: 5967bd5c42 ("tracing: Let filter_assign_type() detect FILTER_PTR_STRING")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-07-15 17:06:13 -04:00
Linus Torvalds
e9338abf0e fallthrough fixes for Clang for 5.14-rc2
Hi Linus,
 
 Please, pull the following patches that fix many fall-through
 warnings when building with Clang and -Wimplicit-fallthrough.
 
 This pull-request also contains the patch for Makefile that enables
 -Wimplicit-fallthrough for Clang, globally.
 
 It's also important to notice that since we have adopted the use of
 the pseudo-keyword macro fallthrough; we also want to avoid having
 more /* fall through */ comments being introduced. Notice that contrary
 to GCC, Clang doesn't recognize any comments as implicit fall-through
 markings when the -Wimplicit-fallthrough option is enabled. So, in
 order to avoid having more comments being introduced, we have to use
 the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang,
 will cause a warning in case a code comment is intended to be used
 as a fall-through marking. The patch for Makefile also enforces this.
 
 We had almost 4,000 of these issues for Clang in the beginning,
 and there might be a couple more out there when building some
 architectures with certain configurations. However, with the
 recent fixes I think we are in good shape and it is now possible
 to enable -Wimplicit-fallthrough for Clang. :)
 
 Thanks!
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEkmRahXBSurMIg1YvRwW0y0cG2zEFAmDvPV0ACgkQRwW0y0cG
 2zGJ1xAAsbSN8I+ESxFCKi4UwQC71988isb0rHGL6rQPBbEYD2bbI+82aGu6WH8z
 2ZVQZa5x3tUoIsqR2GMK5Npn1dFvuNWGZkYzlgjAp+FgzaULCLUc4NW9DEWU/Kbd
 UBz8ilXL7tUAlDalUYjXAVVTxMnRTZ+BDCbEmIdRZ5X/pMlD2xHeMGG4kh8/cHjJ
 +mqFwCFPZoyLJzPBnm37OEfIr8JxrLZAo1gfmz5sxfmBdYDY3hjV/BVgP8h0bkY9
 ITaq0LMEAPMXvU+4DP3DxuqzRr9COYMcddmbaWwsXPd7UAuoXwGLvqx7JE0QqY3g
 J80w4/hXqEa4o4/fUAsfjYEQoNTEnhl0iGskBJD2xy5Lj17/m4aOSL44ibr2Ag67
 yfJPWi9tooYELEGNobPVHPuqk4ts6amKTOKqHWMBEcTEOIGzaGWPEjRoyaMivfZ3
 G3ZbEPEYERcJRizm63UbciLeslGsxb6tdYQEy5VI8Wa7caz9Ci9HT+jjS7Vm2fuq
 1zg+vIgwuSxvwxrSV/PDTRcQm9EM/MoRL0D4jbr7gtYD6KNwH8Vo6M21CoaB9gH3
 FgMiCT4zy9KiONjVDJKkjtzuk+9OwpI5AAg0/fAQ4Ak9TzUN3Xfg7HpytETCIJ0Z
 Ii3Jqwe+3IRTNq94ZTIiX/0hT7i5GsPxUfPzI2vwttCJn9ctUcM=
 =XVNN
 -----END PGP SIGNATURE-----

Merge tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux

Pull fallthrough fixes from Gustavo Silva:
 "This fixes many fall-through warnings when building with Clang and
  -Wimplicit-fallthrough, and also enables -Wimplicit-fallthrough for
  Clang, globally.

  It's also important to notice that since we have adopted the use of
  the pseudo-keyword macro fallthrough, we also want to avoid having
  more /* fall through */ comments being introduced. Contrary to GCC,
  Clang doesn't recognize any comments as implicit fall-through markings
  when the -Wimplicit-fallthrough option is enabled.

  So, in order to avoid having more comments being introduced, we use
  the option -Wimplicit-fallthrough=5 for GCC, which similar to Clang,
  will cause a warning in case a code comment is intended to be used as
  a fall-through marking. The patch for Makefile also enforces this.

  We had almost 4,000 of these issues for Clang in the beginning, and
  there might be a couple more out there when building some
  architectures with certain configurations. However, with the recent
  fixes I think we are in good shape and it is now possible to enable
  the warning for Clang"

* tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
  Makefile: Enable -Wimplicit-fallthrough for Clang
  powerpc/smp: Fix fall-through warning for Clang
  dmaengine: mpc512x: Fix fall-through warning for Clang
  usb: gadget: fsl_qe_udc: Fix fall-through warning for Clang
  powerpc/powernv: Fix fall-through warning for Clang
  MIPS: Fix unreachable code issue
  MIPS: Fix fall-through warnings for Clang
  ASoC: Mediatek: MT8183: Fix fall-through warning for Clang
  power: supply: Fix fall-through warnings for Clang
  dmaengine: ti: k3-udma: Fix fall-through warning for Clang
  s390: Fix fall-through warnings for Clang
  dmaengine: ipu: Fix fall-through warning for Clang
  iommu/arm-smmu-v3: Fix fall-through warning for Clang
  mmc: jz4740: Fix fall-through warning for Clang
  PCI: Fix fall-through warning for Clang
  scsi: libsas: Fix fall-through warning for Clang
  video: fbdev: Fix fall-through warning for Clang
  math-emu: Fix fall-through warning
  cpufreq: Fix fall-through warning for Clang
  drm/msm: Fix fall-through warning in msm_gem_new_impl()
  ...
2021-07-15 13:57:31 -07:00
Linus Torvalds
8096acd744 Networking fixes for 5.14-rc2, including fixes from bpf and netfilter.
Current release - regressions:
 
  - sock: fix parameter order in sock_setsockopt()
 
 Current release - new code bugs:
 
  - netfilter: nft_last:
      - fix incorrect arithmetic when restoring last used
      - honor NFTA_LAST_SET on restoration
 
 Previous releases - regressions:
 
  - udp: properly flush normal packet at GRO time
 
  - sfc: ensure correct number of XDP queues; don't allow enabling the
         feature if there isn't sufficient resources to Tx from any CPU
 
  - dsa: sja1105: fix address learning getting disabled on the CPU port
 
  - mptcp: addresses a rmem accounting issue that could keep packets
         in subflow receive buffers longer than necessary, delaying
 	MPTCP-level ACKs
 
  - ip_tunnel: fix mtu calculation for ETHER tunnel devices
 
  - do not reuse skbs allocated from skbuff_fclone_cache in the napi
    skb cache, we'd try to return them to the wrong slab cache
 
  - tcp: consistently disable header prediction for mptcp
 
 Previous releases - always broken:
 
  - bpf: fix subprog poke descriptor tracking use-after-free
 
  - ipv6:
       - allocate enough headroom in ip6_finish_output2() in case
         iptables TEE is used
       - tcp: drop silly ICMPv6 packet too big messages to avoid
         expensive and pointless lookups (which may serve as a DDOS
 	vector)
       - make sure fwmark is copied in SYNACK packets
       - fix 'disable_policy' for forwarded packets (align with IPv4)
 
  - netfilter: conntrack: do not renew entry stuck in tcp SYN_SENT state
 
  - netfilter: conntrack: do not mark RST in the reply direction coming
       after SYN packet for an out-of-sync entry
 
  - mptcp: cleanly handle error conditions with MP_JOIN and syncookies
 
  - mptcp: fix double free when rejecting a join due to port mismatch
 
  - validate lwtstate->data before returning from skb_tunnel_info()
 
  - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path
 
  - mt76: mt7921: continue to probe driver when fw already downloaded
 
  - bonding: fix multiple issues with offloading IPsec to (thru?) bond
 
  - stmmac: ptp: fix issues around Qbv support and setting time back
 
  - bcmgenet: always clear wake-up based on energy detection
 
 Misc:
 
  - sctp: move 198 addresses from unusable to private scope
 
  - ptp: support virtual clocks and timestamping
 
  - openvswitch: optimize operation for key comparison
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmDu3mMACgkQMUZtbf5S
 Irsjxg//UwcPJMYFmXV+fGkEsWYe1Kf29FcUDEeANFtbltfAcIfZ0GoTbSDRnrVb
 HcYAKcm4XRx5bWWdQrQsQq/yiLbnS/rSLc7VRB+uRHWRKl3eYcaUB2rnCXsxrjGw
 wQJgOmztDCJS4BIky24iQpF/8lg7p/Gj2Ih532gh93XiYo612FrEJKkYb2/OQfYX
 GkbnZ0kL2Y1SV+bhy6aT5azvhHKM4/3eA4fHeJ2p8e2gOZ5ni0vpX0xEzdzKOCd0
 vwR/Wu3h/+2QuFYVcSsVguuM++JXACG8MAS/Tof78dtNM4a3kQxzqeh5Bv6IkfTu
 rokENLq4pjNRy+nBAOeQZj8Jd0K0kkf/PN9WMdGQtplMoFhjjV25R6PeRrV9wwPo
 peozIz2MuQo7Kfof1D+44h2foyLfdC28/Z0CvRbDpr5EHOfYynvBbrnhzIGdQp6V
 xgftKTOdgz2Djgg8HiblZund1FA44OYerddVAASrIsnSFnIz1VLVQIsfV+GLBwwc
 FawrIZ6WfIjzRSrDGOvDsbAQI47T/1jbaPJeK6XgjWkQmjEd6UtRWRZLYCxemQEw
 4HP3sWC96BOehuD8ylipVE1oFqrxCiOB/fZxezXqjo8dSX3NLdak4cCHTHoW5SuZ
 eEAxQRaBliKd+P7hoy9cZ57CAu3zUa8kijfM5QRlCAHF+zSxaPs=
 =QFnb
 -----END PGP SIGNATURE-----

Merge tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Jakub Kicinski.
 "Including fixes from bpf and netfilter.

  Current release - regressions:

   - sock: fix parameter order in sock_setsockopt()

  Current release - new code bugs:

   - netfilter: nft_last:
       - fix incorrect arithmetic when restoring last used
       - honor NFTA_LAST_SET on restoration

  Previous releases - regressions:

   - udp: properly flush normal packet at GRO time

   - sfc: ensure correct number of XDP queues; don't allow enabling the
     feature if there isn't sufficient resources to Tx from any CPU

   - dsa: sja1105: fix address learning getting disabled on the CPU port

   - mptcp: addresses a rmem accounting issue that could keep packets in
     subflow receive buffers longer than necessary, delaying MPTCP-level
     ACKs

   - ip_tunnel: fix mtu calculation for ETHER tunnel devices

   - do not reuse skbs allocated from skbuff_fclone_cache in the napi
     skb cache, we'd try to return them to the wrong slab cache

   - tcp: consistently disable header prediction for mptcp

  Previous releases - always broken:

   - bpf: fix subprog poke descriptor tracking use-after-free

   - ipv6:
       - allocate enough headroom in ip6_finish_output2() in case
         iptables TEE is used
       - tcp: drop silly ICMPv6 packet too big messages to avoid
         expensive and pointless lookups (which may serve as a DDOS
         vector)
       - make sure fwmark is copied in SYNACK packets
       - fix 'disable_policy' for forwarded packets (align with IPv4)

   - netfilter: conntrack:
       - do not renew entry stuck in tcp SYN_SENT state
       - do not mark RST in the reply direction coming after SYN packet
         for an out-of-sync entry

   - mptcp: cleanly handle error conditions with MP_JOIN and syncookies

   - mptcp: fix double free when rejecting a join due to port mismatch

   - validate lwtstate->data before returning from skb_tunnel_info()

   - tcp: call sk_wmem_schedule before sk_mem_charge in zerocopy path

   - mt76: mt7921: continue to probe driver when fw already downloaded

   - bonding: fix multiple issues with offloading IPsec to (thru?) bond

   - stmmac: ptp: fix issues around Qbv support and setting time back

   - bcmgenet: always clear wake-up based on energy detection

  Misc:

   - sctp: move 198 addresses from unusable to private scope

   - ptp: support virtual clocks and timestamping

   - openvswitch: optimize operation for key comparison"

* tag 'net-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (158 commits)
  net: dsa: properly check for the bridge_leave methods in dsa_switch_bridge_leave()
  sfc: add logs explaining XDP_TX/REDIRECT is not available
  sfc: ensure correct number of XDP queues
  sfc: fix lack of XDP TX queues - error XDP TX failed (-22)
  net: fddi: fix UAF in fza_probe
  net: dsa: sja1105: fix address learning getting disabled on the CPU port
  net: ocelot: fix switchdev objects synced for wrong netdev with LAG offload
  net: Use nlmsg_unicast() instead of netlink_unicast()
  octeontx2-pf: Fix uninitialized boolean variable pps
  ipv6: allocate enough headroom in ip6_finish_output2()
  net: hdlc: rename 'mod_init' & 'mod_exit' functions to be module-specific
  net: bridge: multicast: fix MRD advertisement router port marking race
  net: bridge: multicast: fix PIM hello router port marking race
  net: phy: marvell10g: fix differentiation of 88X3310 from 88X3340
  dsa: fix for_each_child.cocci warnings
  virtio_net: check virtqueue_add_sgs() return value
  mptcp: properly account bulk freed memory
  selftests: mptcp: fix case multiple subflows limited by server
  mptcp: avoid processing packet if a subflow reset
  mptcp: fix syncookie process if mptcp can not_accept new subflow
  ...
2021-07-14 09:24:32 -07:00
Christian Brauner
d1d488d813 fs: add vfs_parse_fs_param_source() helper
Add a simple helper that filesystems can use in their parameter parser
to parse the "source" parameter. A few places open-coded this function
and that already caused a bug in the cgroup v1 parser that we fixed.
Let's make it harder to get this wrong by introducing a helper which
performs all necessary checks.

Link: https://syzkaller.appspot.com/bug?id=6312526aba5beae046fdae8f00399f87aab48b12
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-14 09:19:06 -07:00
Christian Brauner
3b0462726e cgroup: verify that source is a string
The following sequence can be used to trigger a UAF:

    int fscontext_fd = fsopen("cgroup");
    int fd_null = open("/dev/null, O_RDONLY);
    int fsconfig(fscontext_fd, FSCONFIG_SET_FD, "source", fd_null);
    close_range(3, ~0U, 0);

The cgroup v1 specific fs parser expects a string for the "source"
parameter.  However, it is perfectly legitimate to e.g.  specify a file
descriptor for the "source" parameter.  The fs parser doesn't know what
a filesystem allows there.  So it's a bug to assume that "source" is
always of type fs_value_is_string when it can reasonably also be
fs_value_is_file.

This assumption in the cgroup code causes a UAF because struct
fs_parameter uses a union for the actual value.  Access to that union is
guarded by the param->type member.  Since the cgroup paramter parser
didn't check param->type but unconditionally moved param->string into
fc->source a close on the fscontext_fd would trigger a UAF during
put_fs_context() which frees fc->source thereby freeing the file stashed
in param->file causing a UAF during a close of the fd_null.

Fix this by verifying that param->type is actually a string and report
an error if not.

In follow up patches I'll add a new generic helper that can be used here
and by other filesystems instead of this error-prone copy-pasta fix.
But fixing it in here first makes backporting a it to stable a lot
easier.

Fixes: 8d2451f499 ("cgroup1: switch to option-by-option parsing")
Reported-by: syzbot+283ce5a46486d6acdbaf@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@kernel.org>
Cc: syzkaller-bugs <syzkaller-bugs@googlegroups.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-07-14 09:19:06 -07:00
Gustavo A. R. Silva
1adee589cd kernel: debug: Fix unreachable code in gdb_serial_stub()
Fix the following warning:

kernel/debug/gdbstub.c:1049:4: warning: fallthrough annotation in unreachable code [-Wimplicit-fallthrough]
                           fallthrough;
                           ^
   include/linux/compiler_attributes.h:210:41: note: expanded from macro 'fallthrough'
   # define fallthrough                    __attribute__((__fallthrough__)

by placing the fallthrough; statement inside ifdeffery.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-07-12 11:03:35 -05:00
Linus Torvalds
98f7fdced2 Two fixes:
- Fix a MIPS IRQ handling RCU bug
  - Remove a DocBook annotation for a parameter that doesn't exist anymore
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDq9IcRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1gOnA//U6VcTbivf1QG05Omof+378iFZWWSdDHP
 /iF12DrtcalqCH73o4iuiEg3W8PdISqvLh5QI9G1WvKNXOmX/RRoFPpd1W24uM/D
 DM9m3hEjjEX3AmRZoQ7yrA2Jzd435DKQOHjzClV/5ykKiyBw533twHm7L9sI6WCl
 fsP5yiCMXUP2wS2Go/wWgRjrIIb5iQezXGGxeKhP4zW/lam86ntPcdm8I6EuRD7r
 bT4TXmJkjk98FL+A6QSk8cP+ZlzauHkNa0HYiSz1pvOnwPDYs8IXacCEeOwog3Wr
 hH0IXbX2NXJIUPDQc/BvWzOMxRNBgMZ1ixnj8B2kaDfIM5YfaGAV3/GuogFCUSns
 1u9E+LkLohvE61w7m49PjnYIlWSkuQFrMlJL+69O2wiqmbFX9aQa/T9i7Mq7bmsv
 Q5qTe935QRVcraUPQ5h0uJnKmBL8xteHpQSSrF5ppsC8cxEDX/CC/GsqAn0xG15Q
 VZ91ocXr58zCXDpLbesULY7LbGyB6c1Kt55WYxhp/IOtRhEpVzD5Y9ukXYdckIpq
 25TMFsPyTUTaS3r8nD8G0QeW3CLVj8Lzi/r0P3oa8GvgyqxlX7jzXI8aGPRb7QIi
 JtX6iY2nH0DZfp600VMJNUS67yMDuDXJ2Me5WY3XkvpYuFtPGRm41HOs3dEV+Ay1
 nxBMxUC/6Ug=
 =xghF
 -----END PGP SIGNATURE-----

Merge tag 'irq-urgent-2021-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Ingo Molnar:
 "Two fixes:

   - Fix a MIPS IRQ handling RCU bug

   - Remove a DocBook annotation for a parameter that doesn't exist
     anymore"

* tag 'irq-urgent-2021-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/mips: Fix RCU violation when using irqdomain lookup on interrupt entry
  genirq/irqdesc: Drop excess kernel-doc entry @lookup
2021-07-11 11:17:57 -07:00
Linus Torvalds
877029d921 Three fixes:
- Fix load tracking bug/inconsistency
  - Fix a sporadic CFS bandwidth constraints enforcement bug
  - Fix a uclamp utilization tracking bug for newly woken tasks
 
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDq8scRHG1pbmdvQGtl
 cm5lbC5vcmcACgkQEnMQ0APhK1i1nQ/+PsOimY0+2kW2EUp/a1a8h1ZHfyG1RjDo
 Kl/CCUDdEP0OuWC/cdnvYgo3hNPTVCpZvsIoR/t7CFinuR/ubfsFtLZ5juOzjk5v
 2RjmSyOr0DbhUcO3GFKo+ABlrmiiJ9de89+oal+W/t/Zks/xqVHQ+f7HqSGvSd1f
 Lhem4U9UakbO3yjT/+VwSvNZgP8trtPQ6rrKw+yrwrxfjSo4D0Y+/3u/HCaYthIW
 5i1+uBEXGnZaU7QhDfqqzbGcAKLRA+i2vBmNfbOyUeCcyKTsKlLwX9L1DlgNPLRP
 XvxVJWJcxwTsLbwVG1F3TvWw93iSLi34jPO//2ZnNppEhA4fjxmLSYV3uIsm8PUY
 /YmDdZ6fTW7ZIO/nhfcf3nS8Sp0UlfHXL9dV3mn2EzeMLGOKZY6vgAKdvEd+Fj+y
 J+VB01MgmVzGvFr9o1/ez3vWyk03CLDQuYMUo0yVcqAi4OLaArAz5vxXR/cF4PsB
 r69CCdSinMj2finaR39Eq0431Tpv71NDDfyqjVJEOk88Weszu6IACIOJCvpy0ZLQ
 LOA5kl2I8/mYEevnXgg9NPX8XO2iUFS1cVVNsRHUe4zqQZPPoBD6Oppb+kmfUQUe
 gABCZK217nkqFH4GdC9RCtRdnb4HO+6H15cLlDHjilECgGOPeJ8CPaK3pRvzv0g3
 N2m4KcFI2j0=
 =FBcX
 -----END PGP SIGNATURE-----

Merge tag 'sched-urgent-2021-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fixes from Ingo Molnar:
 "Three fixes:

   - Fix load tracking bug/inconsistency

   - Fix a sporadic CFS bandwidth constraints enforcement bug

   - Fix a uclamp utilization tracking bug for newly woken tasks"

* tag 'sched-urgent-2021-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/uclamp: Ignore max aggregation if rq is idle
  sched/fair: Fix CFS bandwidth hrtimer expiry type
  sched/fair: Sync load_sum with load_avg after dequeue
2021-07-11 11:13:57 -07:00