125 Commits

Author SHA1 Message Date
Daniel Borkmann
445c31d813 bpf: fix map not being uncharged during map creation failure
commit 20b2b24f91f70e7d3f0918c077546cb21bd73a87 upstream.

In map_create(), we first find and create the map, then once that
succeeded, we charge it to the user's RLIMIT_MEMLOCK, and then fetch
a new anon fd through anon_inode_getfd(). The problem is, once the
latter fails, f.e. due to the RLIMIT_NOFILE limit, we only destroy
the map via map->ops->map_free(), but without uncharging the previously
locked memory first. That means the user_struct allocation is leaked
and the accounted RLIMIT_MEMLOCK memory is never released.
Make the label names in the fix consistent with bpf_prog_load().
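
Roughly, the fixed error path looks like this (illustrative sketch; the
labels follow the bpf_prog_load() convention mentioned above):

  err = bpf_map_new_fd(map);
  if (err < 0)
      /* fd creation failed: undo the memlock charge before freeing */
      goto free_map;
  return err;

  free_map:
      bpf_map_uncharge_memlock(map);
  free_map_nouncharge:
      map->ops->map_free(map);
      return err;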

Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-10 10:26:19 +02:00
Greg Kroah-Hartman
2350f8a157 bpf: Explicitly memset the bpf_attr structure
commit 8096f229421f7b22433775e928d506f0342e5907 upstream.

For the bpf syscall, we are relying on the compiler to properly zero out
the bpf_attr union that we copy userspace data into. Unfortunately that
doesn't always work properly: padding and other oddities might not be
correctly zeroed, and in some tests odd things have been found when the
stack is pre-initialized to other values.

Fix this by explicitly memsetting the structure to 0 before using it.
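
In the bpf(2) syscall entry this amounts to something like the following
sketch:

  union bpf_attr attr;

  memset(&attr, 0, sizeof(attr));
  if (copy_from_user(&attr, uattr, size) != 0)
      return -EFAULT;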

Reported-by: Maciej Żenczykowski <maze@google.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Alexander Potapenko <glider@google.com>
Reported-by: Alistair Delva <adelva@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://android-review.googlesource.com/c/kernel/common/+/1235490
Link: https://lore.kernel.org/bpf/20200320094813.GA421650@kroah.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-02 19:02:40 +02:00
Valdis Klētnieks
7dd2dc6524 bpf: silence warning messages in core
[ Upstream commit aee450cbe482a8c2f6fa5b05b178ef8b8ff107ca ]

Compiling kernel/bpf/core.c with W=1 causes a flood of warnings:

kernel/bpf/core.c:1198:65: warning: initialized field overwritten [-Woverride-init]
 1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
      |                                                                 ^~~~
kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
 1087 |  INSN_3(ALU, ADD,  X),   \
      |  ^~~~~~
kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
 1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
      |   ^~~~~~~~~~~~
kernel/bpf/core.c:1198:65: note: (near initialization for 'public_insntable[12]')
 1198 | #define BPF_INSN_3_TBL(x, y, z) [BPF_##x | BPF_##y | BPF_##z] = true
      |                                                                 ^~~~
kernel/bpf/core.c:1087:2: note: in expansion of macro 'BPF_INSN_3_TBL'
 1087 |  INSN_3(ALU, ADD,  X),   \
      |  ^~~~~~
kernel/bpf/core.c:1202:3: note: in expansion of macro 'BPF_INSN_MAP'
 1202 |   BPF_INSN_MAP(BPF_INSN_2_TBL, BPF_INSN_3_TBL),
      |   ^~~~~~~~~~~~

98 copies of the above.

The attached patch silences the warnings, because we *know* we're overwriting
the default initializer. That leaves bpf/core.c with only 6 other warnings,
which become more visible in comparison.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-08-04 09:34:46 +02:00
Alexei Starovoitov
1c74bd22e8 bpf: Prevent memory disambiguation attack
commit af86ca4e3088fe5eacf2f7e58c01fa68ca067672 upstream.

Detect code patterns where malicious 'speculative store bypass' can be used
and sanitize such patterns.

 39: (bf) r3 = r10
 40: (07) r3 += -216
 41: (79) r8 = *(u64 *)(r7 +0)   // slow read
 42: (7a) *(u64 *)(r10 -72) = 0  // verifier inserts this instruction
 43: (7b) *(u64 *)(r8 +0) = r3   // this store becomes slow due to r8
 44: (79) r1 = *(u64 *)(r6 +0)   // cpu speculatively executes this load
 45: (71) r2 = *(u8 *)(r1 +0)    // speculatively arbitrary 'load byte'
                                 // is now sanitized

Above code after x86 JIT becomes:
 e5: mov    %rbp,%rdx
 e8: add    $0xffffffffffffff28,%rdx
 ef: mov    0x0(%r13),%r14
 f3: movq   $0x0,-0x48(%rbp)
 fb: mov    %rdx,0x0(%r14)
 ff: mov    0x0(%rbx),%rdi
103: movzbq 0x0(%rdi),%rsi

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
[bwh: Backported to 4.4:
 - Add verifier_env parameter to check_stack_write()
 - Look up stack slot_types with state->stack_slot_type[] rather than
   state->stack[].slot_type[]
 - Drop bpf_verifier_env argument to verbose()
 - Adjust filename, context]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 21:55:15 +01:00
Ben Hutchings
451624d470 bpf/verifier: Pass instruction index to check_mem_access() and check_xadd()
Extracted from commit 31fd85816dbe "bpf: permits narrower load from
bpf program context fields".

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 21:55:15 +01:00
Ben Hutchings
168cb9b7b2 bpf/verifier: Add spi variable to check_stack_write()
Extracted from commit dc503a8ad984 "bpf/verifier: track liveness for
pruning".

Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 21:55:15 +01:00
Alexei Starovoitov
3c4bb079e1 bpf: support 8-byte metafield access
commit cedaf52693f02372010548c63b2e63228b959099 upstream.

The verifier supported only 4-byte metafields in
struct __sk_buff and struct xdp_md. The metafields in upcoming
struct bpf_perf_event are 8-byte to match register width in struct pt_regs.
Teach verifier to recognize 8-byte metafield access.
The patch doesn't affect safety of sockets and xdp programs.
They check for 4-byte only ctx access before these conditions are hit.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-12-17 21:55:15 +01:00
Daniel Borkmann
e25dc63aa3 bpf: generally move prog destruction to RCU deferral
[ Upstream commit 1aacde3d22c42281236155c1ef6d7a5aa32a826b ]

Jann Horn reported following analysis that could potentially result
in a very hard to trigger (if not impossible) UAF race, to quote his
event timeline:

 - Set up a process with threads T1, T2 and T3
 - Let T1 set up a socket filter F1 that invokes another filter F2
   through a BPF map [tail call]
 - Let T1 trigger the socket filter via a unix domain socket write,
   don't wait for completion
 - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
 - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
 - Let T3 close the file descriptor for F2, dropping the reference
   count of F2 to 2
 - At this point, T1 should have looked up F2 from the map, but not
   finished executing it
 - Let T3 remove F2 from the BPF map, dropping the reference count of
   F2 to 1
 - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
   the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
   via schedule_work()
 - At this point, the BPF program could be freed
 - BPF execution is still running in a freed BPF program

While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
event fd we're doing the syscall on doesn't disappear from underneath us
for the whole syscall time, that may not be the case for the bpf fd used as
an argument only after we did the put. It needs to be a valid fd pointing
to a BPF program at the time of the call to make the bpf_prog_get(), and
while T2 gets preempted, F2 must have dropped its reference to 1 on the other
CPU. The fput() from the close() in T3 should also add additional delay
to the reference drop via exit_task_work() when bpf_prog_release() gets
called, as well as scheduling bpf_prog_free_deferred().

That said, it makes nevertheless sense to move the BPF prog destruction
generally after RCU grace period to guarantee that such scenario above,
but also others as recently fixed in ceb56070359b ("bpf, perf: delay release
of BPF prog after grace period") with regards to tail calls won't happen.
Integrating bpf_prog_free_deferred() directly into the RCU callback is
not allowed since the invocation might happen from either softirq or
process context, so we're not permitted to block. Reviewing all bpf_prog_put()
invocations from the eBPF side (note, cBPF -> eBPF progs don't use this for
their destruction) with call_rcu() looks good to me.

Since we don't know whether, at the time of attaching the program, we're
already part of a tail call map, we need to use the RCU variant. However, due
to this, there won't be severely more stress on the RCU callback queue:
situations with above bpf_prog_get() and bpf_prog_put() combo in practice
normally won't lead to releases, but even if they would, enough effort/
cycles have to be put into loading a BPF program into the kernel already.
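
The resulting put path is then roughly the following (sketch; helper
names assumed):

  void bpf_prog_put(struct bpf_prog *prog)
  {
      if (atomic_dec_and_test(&prog->aux->refcnt))
          /* always defer the actual free past an RCU grace period */
          call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
  }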

Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2018-11-10 07:41:37 -08:00
Jakub Kicinski
b23dab51e9 bpf: fix references to free_bpf_prog_info() in comments
[ Upstream commit ab7f5bf0928be2f148d000a6eaa6c0a36e74750e ]

Comments in the verifier refer to free_bpf_prog_info() which
seems to have never existed in tree.  Replace it with
free_used_maps().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Sasha Levin <alexander.levin@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-08-06 16:24:37 +02:00
Teng Qin
ea7c24c785 bpf: map_get_next_key to return first key on NULL
commit 8fe45924387be6b5c1be59a7eb330790c61d5d10 upstream.

When iterating through a map, we need to find a key that does not exist
in the map so map_get_next_key will give us the first key of the map.
This often requires a lot of guessing in production systems.

This patch makes map_get_next_key return the first key when the key
pointer in the parameter is NULL.
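
From user space a full walk then looks roughly like this (sketch,
assuming a map with u64 keys):

  union bpf_attr attr = {};
  __u64 key, next_key;

  attr.map_fd = map_fd;
  attr.key = 0;    /* NULL key pointer: ask for the first key */
  attr.next_key = (__u64)(unsigned long)&next_key;

  while (syscall(__NR_bpf, BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)) == 0) {
      /* ... process next_key ... */
      key = next_key;
      attr.key = (__u64)(unsigned long)&key;    /* continue from here */
  }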

Signed-off-by: Teng Qin <qinteng@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Chenbo Feng <fengc@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-16 10:06:46 +02:00
Chenbo Feng
c9ea2f8af6 bpf: skip unnecessary capability check
commit 0fa4fe85f4724fff89b09741c437cbee9cf8b008 upstream.

The current check statement in the BPF syscall does a capability check
for CAP_SYS_ADMIN before checking sysctl_unprivileged_bpf_disabled. This
code path triggers unnecessary security hooks on capability checking
and causes false alarms about unprivileged processes trying to obtain
CAP_SYS_ADMIN access. This can be resolved by simply switching the order
of the statements; CAP_SYS_ADMIN is not required anyway if the
unprivileged bpf syscall is allowed.
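
Concretely, the reordered check looks like this (sketch):

  /* capable() is now only consulted when unprivileged bpf is disabled */
  if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN))
      return -EPERM;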

Signed-off-by: Chenbo Feng <fengc@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-28 18:40:17 +02:00
Jann Horn
a6132276ab bpf: fix incorrect sign extension in check_alu_op()
commit 95a762e2c8c942780948091f8f2a4f32fce1ac6f upstream.

Distinguish between
BPF_ALU64|BPF_MOV|BPF_K (load 32-bit immediate, sign-extended to 64-bit)
and BPF_ALU|BPF_MOV|BPF_K (load 32-bit immediate, zero-padded to 64-bit);
only perform sign extension in the first case.

This patch differs from the mainline one because the verifier's internals
have changed in the meantime. Mainline tracks register values as 64-bit
values; however, 4.4 still stores tracked register values as 32-bit
values with sign extension. Therefore, in the case of a 32-bit op with
negative immediate, the value can't be tracked; leave the register as
UNKNOWN_VALUE (set by the preceding check_reg_arg() call).


I have manually tested this patch on top of 4.4.122. For the following BPF
bytecode:

        BPF_MOV64_IMM(BPF_REG_1, 1),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 1, 1),
        BPF_EXIT_INSN(),

        BPF_MOV32_IMM(BPF_REG_1, 1),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 1, 1),
        BPF_EXIT_INSN(),

        BPF_MOV64_IMM(BPF_REG_1, -1),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, -1, 1),
        BPF_EXIT_INSN(),

        BPF_MOV32_IMM(BPF_REG_1, -1),
        BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, -1, 2),
        BPF_MOV32_IMM(BPF_REG_0, 42),
        BPF_EXIT_INSN(),

        BPF_MOV32_IMM(BPF_REG_0, 43),
        BPF_EXIT_INSN()

Verifier output on 4.4.122 without this patch:

0: (b7) r1 = 1
1: (15) if r1 == 0x1 goto pc+1
3: (b4) (u32) r1 = (u32) 1
4: (15) if r1 == 0x1 goto pc+1
6: (b7) r1 = -1
7: (15) if r1 == 0xffffffff goto pc+1
9: (b4) (u32) r1 = (u32) -1
10: (15) if r1 == 0xffffffff goto pc+2
13: (b4) (u32) r0 = (u32) 43
14: (95) exit

Verifier output on 4.4.122+ with this patch:

0: (b7) r1 = 1
1: (15) if r1 == 0x1 goto pc+1
3: (b4) (u32) r1 = (u32) 1
4: (15) if r1 == 0x1 goto pc+1
6: (b7) r1 = -1
7: (15) if r1 == 0xffffffff goto pc+1
9: (b4) (u32) r1 = (u32) -1
10: (15) if r1 == 0xffffffff goto pc+2
 R1=inv R10=fp
11: (b4) (u32) r0 = (u32) 42
12: (95) exit

from 10 to 13: R1=imm-1 R10=fp
13: (b4) (u32) r0 = (u32) 43
14: (95) exit


Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-22 09:23:32 +01:00
Daniel Borkmann
faa74a862a bpf: reject stores into ctx via st and xadd
[ upstream commit f37a8cb84cce18762e8f86a70bd6a49a66ab964c ]

Alexei found that verifier does not reject stores into context
via BPF_ST instead of BPF_STX. And while looking at it, we
also should not allow XADD variant of BPF_STX.

The context rewriter is only assuming either BPF_LDX_MEM- or
BPF_STX_MEM-type operations, thus reject anything other than
that so that assumptions in the rewriter properly hold. Add
test cases as well for BPF selftests.

Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:04:25 +01:00
Alexei Starovoitov
02662601a2 bpf: fix 32-bit divide by zero
[ upstream commit 68fda450a7df51cff9e5a4d4a4d9d0d5f2589153 ]

Due to some JITs doing the if (src_reg == 0) check in 64-bit mode
for div/mod operations, mask the upper 32 bits of the src register
before doing the check.
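
The fixup patches such instructions roughly as follows (illustrative
sketch):

  /* for BPF_ALU | BPF_DIV/BPF_MOD | BPF_X: zero the upper 32 bits of
   * src_reg so the JIT's 64-bit src_reg == 0 check sees the real
   * 32-bit divisor
   */
  insn_buf[0] = BPF_MOV32_REG(insn->src_reg, insn->src_reg);
  insn_buf[1] = *insn;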

Fixes: 622582786c9e ("net: filter: x86: internal BPF JIT")
Fixes: 7a12b5031c6b ("sparc64: Add eBPF JIT.")
Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:04:25 +01:00
Eric Dumazet
b72ba2a0d8 bpf: fix divides by zero
[ upstream commit c366287ebd698ef5e3de300d90cd62ee9ee7373e ]

Divides by zero are not nice, let's avoid them if possible.

Also do_div() seems not needed when dealing with 32bit operands,
but this seems a minor detail.
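
In the interpreter the division cases then bail out early, roughly
(sketch):

  ALU64_DIV_X:
      if (unlikely(SRC == 0))
          return 0;    /* abort the program instead of dividing by zero */
      DST = div64_u64(DST, SRC);
      CONT;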

Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:04:24 +01:00
Daniel Borkmann
7dcda40e52 bpf: arsh is not supported in 32 bit alu thus reject it
[ upstream commit 7891a87efc7116590eaba57acc3c422487802c6f ]

The following snippet was throwing an 'unknown opcode cc' warning
in BPF interpreter:

  0: (18) r0 = 0x0
  2: (7b) *(u64 *)(r10 -16) = r0
  3: (cc) (u32) r0 s>>= (u32) r0
  4: (95) exit

Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
generation, not all of them do, and neither does the interpreter. We
can leave the existing ones and implement it later in bpf-next for the
remaining ones, but properly reject this in the verifier for the time
being.
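
The verifier-side rejection is roughly (sketch):

  if (opcode == BPF_ARSH && BPF_CLASS(insn->code) != BPF_ALU64) {
      verbose("BPF_ARSH not supported for 32 bit ALU\n");
      return -EINVAL;
  }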

Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:04:24 +01:00
Alexei Starovoitov
28c486744e bpf: introduce BPF_JIT_ALWAYS_ON config
[ upstream commit 290af86629b25ffd1ed6232c4e9107da031705cb ]

The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.

A quote from the Google Project Zero blog:
"At this point, it would normally be necessary to locate gadgets in
the host kernel code that can be used to actually leak data by reading
from an attacker-controlled location, shifting and masking the result
appropriately and then using the result of that as offset to an
attacker-controlled address for a load. But piecing gadgets together
and figuring out which ones work in a speculation context seems annoying.
So instead, we decided to use the eBPF interpreter, which is built into
the host kernel - while there is no legitimate way to invoke it from inside
a VM, the presence of the code in the host kernel's text section is sufficient
to make it usable for the attack, just like with ordinary ROP gadgets."

To make the attacker's job harder, introduce the BPF_JIT_ALWAYS_ON config
option that removes the interpreter from the kernel in favor of JIT-only mode.
So far eBPF JIT is supported by:
x64, arm64, arm32, sparc64, s390, powerpc64, mips64

The start of JITed program is randomized and code page is marked as read-only.
In addition "constant blinding" can be turned on with net.core.bpf_jit_harden

v2->v3:
- move __bpf_prog_ret0 under ifdef (Daniel)

v1->v2:
- fix init order, test_bpf and cBPF (Daniel's feedback)
- fix offloaded bpf (Jakub's feedback)
- add 'return 0' dummy in case something can invoke prog->bpf_func
- retarget bpf tree. For bpf-next the patch would need one extra hunk.
  It will be sent when the trees are merged back to net-next

Considered doing:
  int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
but it seems better to land the patch as-is and in bpf-next remove
bpf_jit_enable global variable from all JITs, consolidate in one place
and remove this jit_init() function.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:04:24 +01:00
Alexei Starovoitov
361fb04812 bpf: fix bpf_tail_call() x64 JIT
[ upstream commit 90caccdd8cc0215705f18b92771b449b01e2474a ]

- bpf prog_array just like all other types of bpf array accepts 32-bit index.
  Clarify that in the comment.
- fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
- tighten corresponding check in the interpreter to stay consistent

The JIT bug can be triggered after the introduction of the BPF_F_NUMA_NODE flag
in commit 96eabe7a40aa in 4.14. Before that the map_flags would stay zero, and
though the JIT code is wrong, it would still check bounds correctly.
Hence the two Fixes tags. All other JITs don't have this problem.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation")
Fixes: b52f00e6a715 ("x86: bpf_jit: implement bpf_tail_call() helper")
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:04:24 +01:00
Alexei Starovoitov
1367d854b9 bpf: fix branch pruning logic
[ Upstream commit c131187db2d3fa2f8bf32fdf4e9a4ef805168467 ]

When the verifier detects that a register contains a runtime constant
and it's compared with another constant, it will prune exploration
of the branch that is guaranteed not to be taken at runtime.
This is all correct, but a malicious program may be constructed
in such a way that it always has a constant comparison and
the other branch is never taken under any conditions.
In this case such a path through the program will not be explored
by the verifier. It won't be taken at run-time either, but since
all instructions are JITed the malicious program may cause JITs
to complain about using reserved fields, etc.
To fix the issue we have to track the instructions explored by
the verifier and sanitize instructions that are dead at run time
with NOPs. We cannot reject such dead code, since llvm generates
it for valid C code; it doesn't do as much data flow
analysis as the verifier does.
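
A sketch of the sanitization pass (the NOP encoding chosen here is
illustrative):

  /* after verification: overwrite instructions never marked as seen */
  for (i = 0; i < insn_cnt; i++) {
      if (insn_aux_data[i].seen)
          continue;
      insns[i] = BPF_MOV64_REG(BPF_REG_0, BPF_REG_0);    /* harmless NOP */
  }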

Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-02-03 17:04:24 +01:00
Daniel Borkmann
095b0ba360 bpf, array: fix overflow in max_entries and undefined behavior in index_mask
commit bbeb6e4323dad9b5e0ee9f60c223dd532e2403b1 upstream.

syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
and thus unprivileged. With the recently added logic in b2157399cc98
("bpf: prevent out-of-bounds speculation") we round this up to the next
power of two value for max_entries for unprivileged such that we can
apply proper masking into potentially zeroed out map slots.

However, this will generate an index_mask of 0xffffffff, and therefore
a + 1 will let this overflow into new max_entries of 0. This will pass
allocation, etc, and later on map access we still enforce on the original
attr->max_entries value which was 0xfffffffd, therefore triggering GPF
all over the place. Thus bail out on overflow in such case.

Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
space is undefined. Therefore, do this by hand in a 64 bit variable.

This fixes all the issues triggered by syzkaller's reproducers.
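
A sketch of the fixed mask and size computation along the lines
described above:

  u32 index_mask, max_entries = attr->max_entries;
  u64 mask64;

  /* build the mask in 64-bit space to avoid 1UL << 32 on 32-bit archs */
  mask64 = fls_long(max_entries - 1);
  mask64 = 1ULL << mask64;
  mask64 -= 1;

  index_mask = mask64;
  if (unpriv) {
      /* round up so speculation stays within index_mask */
      max_entries = index_mask + 1;
      if (max_entries < attr->max_entries)
          return ERR_PTR(-E2BIG);    /* the + 1 overflowed: bail out */
  }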

Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:31 +01:00
Alexei Starovoitov
9a7fad4c0e bpf: prevent out-of-bounds speculation
commit b2157399cc9898260d6031c5bfe45fe137c1fbe7 upstream.

Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
memory accesses under a bounds check may be speculated even if the
bounds check fails, providing a primitive for building a side channel.

To avoid leaking kernel data round up array-based maps and mask the index
after bounds check, so speculated load with out of bounds index will load
either valid value from the array or zero from the padded area.

Unconditionally mask index for all array types even when max_entries
are not rounded to power of 2 for root user.
When map is created by unpriv user generate a sequence of bpf insns
that includes AND operation to make sure that JITed code includes
the same 'index & index_mask' operation.

If prog_array map is created by unpriv user replace
  bpf_tail_call(ctx, map, index);
with
  if (index >= max_entries) {
    index &= map->index_mask;
    bpf_tail_call(ctx, map, index);
  }
(along with roundup to power 2) to prevent out-of-bounds speculation.
There is secondary redundant 'if (index >= max_entries)' in the interpreter
and in all JITs, but they can be optimized later if necessary.

Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
cannot be used by unpriv, so no changes there.

That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
all architectures with and without JIT.

v2->v3:
Daniel noticed that attack potentially can be crafted via syscall commands
without loading the program, so add masking to those paths as well.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:31 +01:00
Alexei Starovoitov
648064515d bpf: adjust insn_aux_data when patching insns
commit 8041902dae5299c1f194ba42d14383f734631009 upstream.

convert_ctx_accesses() replaces single bpf instruction with a set of
instructions. Adjust corresponding insn_aux_data while patching.
It's needed to make sure subsequent 'for(all insn)' loops
have matching insn and insn_aux_data.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:31 +01:00
Alexei Starovoitov
19614eee06 bpf: refactor fixup_bpf_calls()
commit 79741b3bdec01a8628368fbcfccc7d189ed606cb upstream.

Reduce indent and make it iterate over instructions similarly to
convert_ctx_accesses(). Also convert the hard BUG_ON into a soft verifier error.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:30 +01:00
Alexei Starovoitov
14c7c55f45 bpf: move fixup_bpf_calls() function
commit e245c5c6a5656e4d61aa7bb08e9694fd6e5b2b9d upstream.

No functional change.
Move fixup_bpf_calls() to verifier.c;
it's being refactored in the next patch.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:30 +01:00
Jakub Kicinski
0748b80e43 bpf: don't (ab)use instructions to store state
commit 3df126f35f88dc76eea33769f85a3c3bb8ce6c6b upstream.

Storing state in reserved fields of instructions makes
it impossible to run verifier on programs already
marked as read-only. Allocate and use an array of
per-instruction state instead.

While touching the error path rename and move existing
jump target.

Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:30 +01:00
Daniel Borkmann
087a92287d bpf: add bpf_patch_insn_single helper
commit c237ee5eb33bf19fe0591c04ff8db19da7323a83 upstream.

Move the functionality to patch instructions out of the verifier
code and into the core as the new bpf_patch_insn_single() helper
will be needed later on for blinding as well. No changes in
functionality.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-01-17 09:35:30 +01:00
Daniel Borkmann
49630dd2e1 bpf: don't let ldimm64 leak map addresses on unprivileged
commit 0d0e57697f162da4aa218b5feafe614fb666db07 upstream.

The patch fixes two things at once:

1) It checks the env->allow_ptr_leaks and only prints the map address to
   the log if we have the privileges to do so, otherwise it just dumps 0
   as we would when kptr_restrict is enabled on %pK. Given the latter is
   off by default and not every distro sets it, I don't want to rely on
   this, hence the 0 by default for unprivileged.

2) Printing of ldimm64 in the verifier log is currently broken in that
   we don't print the full immediate, but only the 32 bit part of the
   first insn part for ldimm64. Thus, fix this up as well; it's okay to
   access, since we verified all ldimm64 earlier already (including just
   constants) through replace_map_fd_with_map_ptr().

Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
Fixes: cbd357008604 ("bpf: verifier (add ability to receive verification log)")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
[bwh: Backported to 4.4: s/bpf_verifier_env/verifier_env/]
Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-21 09:21:17 +01:00
Edward Cree
2ec54b21dd bpf/verifier: reject BPF_ALU64|BPF_END
[ Upstream commit e67b8a685c7c984e834e3181ef4619cd7025a136 ]

Neither ___bpf_prog_run nor the JITs accept it.
Also adds a new test case.

Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-10-21 17:09:02 +02:00
Daniel Borkmann
1a4f13e0a9 bpf: prevent leaking pointer via xadd on unpriviledged
commit 6bdf6abc56b53103324dfd270a86580306e1a232 upstream.

Leaking kernel addresses on unprivileged is generally disallowed,
for example, verifier rejects the following:

  0: (b7) r0 = 0
  1: (18) r2 = 0xffff897e82304400
  3: (7b) *(u64 *)(r1 +48) = r2
  R2 leaks addr into ctx

Doing pointer arithmetic on them is also forbidden, so that they
don't turn into unknown value and then get leaked out. However,
there's xadd as a special case, where we don't check the src reg
for being a pointer register, e.g. the following will pass:

  0: (b7) r0 = 0
  1: (7b) *(u64 *)(r1 +48) = r0
  2: (18) r2 = 0xffff897e82304400 ; map
  4: (db) lock *(u64 *)(r1 +48) += r2
  5: (95) exit

We could store the pointer into skb->cb, lose the type context,
and then read it out from there again to leak it eventually out
of a map value. Or more easily in a different variant, too:

   0: (bf) r6 = r1
   1: (7a) *(u64 *)(r10 -8) = 0
   2: (bf) r2 = r10
   3: (07) r2 += -8
   4: (18) r1 = 0x0
   6: (85) call bpf_map_lookup_elem#1
   7: (15) if r0 == 0x0 goto pc+3
   R0=map_value(ks=8,vs=8,id=0),min_value=0,max_value=0 R6=ctx R10=fp
   8: (b7) r3 = 0
   9: (7b) *(u64 *)(r0 +0) = r3
  10: (db) lock *(u64 *)(r0 +0) += r6
  11: (b7) r0 = 0
  12: (95) exit

  from 7 to 11: R0=inv,min_value=0,max_value=0 R6=ctx R10=fp
  11: (b7) r0 = 0
  12: (95) exit

Prevent this by checking xadd src reg for pointer types. Also
add a couple of test cases related to this.

Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-07-21 07:44:55 +02:00
Daniel Borkmann
bfe951d547 bpf, inode: disallow userns mounts
[ Upstream commit 612bacad78ba6d0a91166fc4487af114bac172a8 ]

Follow-up to commit e27f4a942a0e ("bpf: Use mount_nodev not mount_ns
to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag.

The original idea was to have a per mountns instance instead of a
single global fs instance, but that didn't work out and we had to
switch to the mount_nodev() model. The intent of that middle ground was
to keep users who don't play nice from creating endless instances
of bpf fs, which are difficult to control and discover from an admin
point of view, but at the same time it would have allowed us to be
more flexible with regard to namespaces.

Therefore, since we now did the switch to mount_nodev() as a fix
where individual instances are created, we also need to remove the userns
mount flag along with it to avoid running into the mentioned situation.
I don't expect any breakage at this early point in time with removing
the flag and we can revisit this later should the requirement for
this come up with future users. This and commit e27f4a942a0e have
been split to facilitate tracking should any of them run into the
unlikely case of causing a regression.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:17 -07:00
Eric W. Biederman
5b7ea922e1 bpf: Use mount_nodev not mount_ns to mount the bpf filesystem
[ Upstream commit e27f4a942a0ee4b84567a3c6cfa84f273e55cbb7 ]

While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
bpf filesystem.  Looking at the code I saw a broken usage of mount_ns
with current->nsproxy->mnt_ns. As the code does not acquire a
reference to the mount namespace it can not possibly be correct to
store the mount namespace on the superblock as it does.

Replace mount_ns with mount_nodev so that each mount of the bpf
filesystem returns a distinct instance, and the code is not buggy.

In discussion with Hannes Frederic Sowa it was reported that the use
of mount_ns was an attempt to have one bpf instance per mount
namespace, in an attempt to keep resources that pin resources from
hiding.  That intent simply does not work; the vfs is not built to
allow that kind of behavior.  Which means that the bpf filesystem
really is buggy both semantically and in its implementation, as it does
not, nor can it, implement the original intent.

This change is userspace visible, but my experience with similar
filesystems leads me to believe nothing will break with a model where each
mount of the bpf filesystem is distinct from all others.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-06-24 10:18:16 -07:00
Alexei Starovoitov
bb10156f57 bpf: fix check_map_func_compatibility logic
[ Upstream commit 6aff67c85c9e5a4bc99e5211c1bac547936626ca ]

The commit 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
introduced a clever way to check bpf_helper<->map_type compatibility.
Later on, commit a43eec304259 ("bpf: introduce bpf_perf_event_output() helper") adjusted
the logic and inadvertently broke it.
Get rid of the clever bool compare and go back to a two-way check
from the map and from the helper perspective.

Fixes: a43eec304259 ("bpf: introduce bpf_perf_event_output() helper")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:38 -07:00
Alexei Starovoitov
3899251bdb bpf: fix refcnt overflow
[ Upstream commit 92117d8443bc5afacc8d5ba82e541946310f106e ]

On a system with >32 Gbyte of physical memory and infinite RLIMIT_MEMLOCK,
a malicious application may overflow the 32-bit bpf program refcnt.
It's also possible to overflow the map refcnt on a 1 Tb system.
Impose a 32k hard limit, which means that the same bpf program or
map cannot be shared by more than 32k processes.
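
A sketch of the limit on the program side (the map side is analogous):

  #define BPF_MAX_REFCNT 32768

  struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
  {
      if (atomic_inc_return(&prog->aux->refcnt) > BPF_MAX_REFCNT) {
          atomic_dec(&prog->aux->refcnt);
          return ERR_PTR(-EBUSY);
      }
      return prog;
  }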

Fixes: 1be7f75d1668 ("bpf: enable non-root eBPF programs")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:37 -07:00
Jann Horn
608d2c3c7a bpf: fix double-fdput in replace_map_fd_with_map_ptr()
[ Upstream commit 8358b02bf67d3a5d8a825070e1aa73f25fb2e4c7 ]

When bpf(BPF_PROG_LOAD, ...) was invoked with a BPF program whose bytecode
references a non-map file descriptor as a map file descriptor, the error
handling code called fdput() twice instead of once (in __bpf_map_get() and
in replace_map_fd_with_map_ptr()). If the file descriptor table of the
current task is shared, this causes f_count to be decremented too much,
allowing the struct file to be freed while it is still in use
(use-after-free). This can be exploited to gain root privileges by an
unprivileged user.

This bug was introduced in
commit 0246e64d9a5f ("bpf: handle pseudo BPF_LD_IMM64 insn"), but is only
exploitable since
commit 1be7f75d1668 ("bpf: enable non-root eBPF programs") because
previously, CAP_SYS_ADMIN was required to reach the vulnerable code.

(posted publicly according to request by maintainer)

Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:37 -07:00
Alexei Starovoitov
8427d5547d bpf/verifier: reject invalid LD_ABS | BPF_DW instruction
[ Upstream commit d82bccc69041a51f7b7b9b4a36db0772f4cdba21 ]

verifier must check for reserved size bits in instruction opcode and
reject BPF_LD | BPF_ABS | BPF_DW and BPF_LD | BPF_IND | BPF_DW instructions,
otherwise interpreter will WARN_RATELIMIT on them during execution.
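
In the verifier's LD_ABS/LD_IND handling the added check is roughly
(sketch; error string illustrative):

  if (BPF_SIZE(insn->code) == BPF_DW) {
      verbose("invalid size for BPF_LD_ABS/BPF_LD_IND\n");
      return -EINVAL;
  }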

Fixes: ddd872bc3098 ("bpf: verifier: add checks for BPF_ABS | BPF_IND instructions")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-05-18 17:06:35 -07:00
Alexei Starovoitov
e8e4323262 bpf: avoid copying junk bytes in bpf_get_current_comm()
[ Upstream commit cdc4e47da8f4c32eeb6b2061a8a834f4362a12b7 ]

Lots of places in the kernel use memcpy(buf, comm, TASK_COMM_LEN); but
the result is typically passed to print("%s", buf) and extra bytes
after zero don't cause any harm.
In bpf the result of bpf_get_current_comm() is used as the part of
map key and was causing spurious hash map mismatches.
Use strlcpy() to guarantee a zero-terminated string.
The bpf verifier checks that the output buffer is zero-initialized,
so even for short task names the output buffer doesn't have junk bytes.
Note it's not a security concern, since kprobe+bpf is root only.
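
The helper body then becomes roughly (sketch; argument marshalling
elided):

  char *buf = (char *)(long) r1;

  strlcpy(buf, current->comm, size);    /* always NUL-terminated, no junk */
  return 0;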

Fixes: ffeedafbf023 ("bpf: introduce current->pid, tgid, uid, gid, comm accessors")
Reported-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-04-20 15:42:01 +09:00
Daniel Borkmann
a34f2f9f20 bpf: fix branch offset adjustment on backjumps after patching ctx expansion
[ Upstream commit a1b14d27ed0965838350f1377ff97c93ee383492 ]

When ctx access is used, the kernel often needs to expand/rewrite
instructions, so after that patching, branch offsets have to be
adjusted for both forward and backward jumps in the new eBPF program,
but for backward jumps it fails to account the delta. Meaning, for
example, if the expansion happens exactly on the insn that sits at
the jump target, it doesn't fix up the back jump offset.

Analysis on what the check in adjust_branches() is currently doing:

  /* adjust offset of jmps if necessary */
  if (i < pos && i + insn->off + 1 > pos)
    insn->off += delta;
  else if (i > pos && i + insn->off + 1 < pos)
    insn->off -= delta;

First condition (forward jumps):

  Before:                         After:

  insns[0]                        insns[0]
  insns[1] <--- i/insn            insns[1] <--- i/insn
  insns[2] <--- pos               insns[P] <--- pos
  insns[3]                        insns[P]  `------| delta
  insns[4] <--- target_X          insns[P]   `-----|
  insns[5]                        insns[3]
                                  insns[4] <--- target_X
                                  insns[5]

The first case is if we cross the pos-boundary and the jump instruction was
before pos. This is handled correctly. I.e. if i == pos, then this
would mean our jump that we currently check was the patchlet itself
that we just injected. Since such patchlets are self-contained and
have no awareness of any insns before or after the patched one, the
delta is correctly not adjusted. Also, the second condition in the
case of i + insn->off + 1 == pos means we jump to that newly patched
instruction, so no offset adjustment is needed. That part is correct.

Second condition (backward jumps):

  Before:                         After:

  insns[0]                        insns[0]
  insns[1] <--- target_X          insns[1] <--- target_X
  insns[2] <--- pos <-- target_Y  insns[P] <--- pos <-- target_Y
  insns[3]                        insns[P]  `------| delta
  insns[4] <--- i/insn            insns[P]   `-----|
  insns[5]                        insns[3]
                                  insns[4] <--- i/insn
                                  insns[5]

The second interesting case is where we cross the pos-boundary and the jump
instruction was after pos. A backward jump with i == pos would be
impossible and would pose a bug somewhere in the patchlet, so the first
condition checking i > pos is okay only by itself. However, i +
insn->off + 1 < pos does not always work as intended to trigger the
adjustment. It works when jump targets would be far off where the
delta wouldn't matter. But, for example, where the fixed insn->off
before pointed to pos (target_Y), it now points to pos + delta, so
that additional room needs to be taken into account for the check.
This means that i) both tests here need to be adjusted into pos + delta,
and ii) for the second condition, the test needs to be <= as pos
itself can be a target in the backjump, too.
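
Following that reasoning, the adjusted check looks like this (sketch):

  /* adjust offset of jmps if necessary */
  if (i < pos && i + insn->off + 1 > pos)
      insn->off += delta;
  else if (i > pos + delta && i + insn->off + 1 <= pos + delta)
      insn->off -= delta;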

Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-03-03 15:07:06 -08:00
Rabin Vincent
35987ff2ea net: bpf: reject invalid shifts
[ Upstream commit 229394e8e62a4191d592842cf67e80c62a492937 ]

On ARM64, a BUG() is triggered in the eBPF JIT if a filter with a
constant shift that can't be encoded in the immediate field of the
UBFM/SBFM instructions is passed to the JIT.  Since these shift
amounts, which are negative or >= regsize, are invalid, reject them in
the eBPF verifier and the classic BPF filter checker, for all
architectures.
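
The eBPF verifier side of the check is roughly (sketch):

  if ((opcode == BPF_LSH || opcode == BPF_RSH || opcode == BPF_ARSH) &&
      BPF_SRC(insn->code) == BPF_K) {
      int size = BPF_CLASS(insn->code) == BPF_ALU64 ? 64 : 32;

      if (insn->imm < 0 || insn->imm >= size) {
          verbose("invalid shift %d\n", insn->imm);
          return -EINVAL;
      }
  }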

Signed-off-by: Rabin Vincent <rabin@rab.in>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-01-31 11:29:01 -08:00
Alexei Starovoitov
01b3f52157 bpf: fix allocation warnings in bpf maps and integer overflow
For large map->value_size the user space can trigger memory allocation warnings like:
WARNING: CPU: 2 PID: 11122 at mm/page_alloc.c:2989
__alloc_pages_nodemask+0x695/0x14e0()
Call Trace:
 [<     inline     >] __dump_stack lib/dump_stack.c:15
 [<ffffffff82743b56>] dump_stack+0x68/0x92 lib/dump_stack.c:50
 [<ffffffff81244ec9>] warn_slowpath_common+0xd9/0x140 kernel/panic.c:460
 [<ffffffff812450f9>] warn_slowpath_null+0x29/0x30 kernel/panic.c:493
 [<     inline     >] __alloc_pages_slowpath mm/page_alloc.c:2989
 [<ffffffff81554e95>] __alloc_pages_nodemask+0x695/0x14e0 mm/page_alloc.c:3235
 [<ffffffff816188fe>] alloc_pages_current+0xee/0x340 mm/mempolicy.c:2055
 [<     inline     >] alloc_pages include/linux/gfp.h:451
 [<ffffffff81550706>] alloc_kmem_pages+0x16/0xf0 mm/page_alloc.c:3414
 [<ffffffff815a1c89>] kmalloc_order+0x19/0x60 mm/slab_common.c:1007
 [<ffffffff815a1cef>] kmalloc_order_trace+0x1f/0xa0 mm/slab_common.c:1018
 [<     inline     >] kmalloc_large include/linux/slab.h:390
 [<ffffffff81627784>] __kmalloc+0x234/0x250 mm/slub.c:3525
 [<     inline     >] kmalloc include/linux/slab.h:463
 [<     inline     >] map_update_elem kernel/bpf/syscall.c:288
 [<     inline     >] SYSC_bpf kernel/bpf/syscall.c:744

To avoid a never-succeeding kmalloc with order >= MAX_ORDER, check that
elem->value_size and the computed elem_size are within limits for both hash and
array type maps.
Also add __GFP_NOWARN to kmalloc(value_size | elem_size) to avoid OOM warnings.
Note kmalloc(key_size) is highly unlikely to trigger OOM, since key_size <= 512,
so keep those kmalloc-s as-is.

Large value_size can cause integer overflows in elem_size and map.pages
formulas, so check for that as well.

Fixes: aaac3ba95e4c ("bpf: charge user for creation of BPF maps and programs")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-02 23:36:00 -05:00
Daniel Borkmann
fbca9d2d35 bpf, array: fix heap out-of-bounds access when updating elements
During my own review, but also reported by Dmitry's syzkaller [1], it has been
noticed that we trigger a heap out-of-bounds access on eBPF array maps
when updating elements. This happens with each map whose map->value_size
(specified at map creation time) is not a multiple of 8 bytes.

In array_map_alloc(), elem_size is round_up(attr->value_size, 8) and
used to align array map slots for faster access. However, in function
array_map_update_elem(), we update the element as ...

memcpy(array->value + array->elem_size * index, value, array->elem_size);

... where we access 'value' out-of-bounds, since it was allocated from
map_update_elem() from syscall side as kmalloc(map->value_size, GFP_USER)
and later on copied through copy_from_user(value, uvalue, map->value_size).
Thus, up to 7 bytes, we can access out-of-bounds.

Same could happen from within an eBPF program, where in worst case we
access beyond an eBPF program's designated stack.

Since 1be7f75d1668 ("bpf: enable non-root eBPF programs") hasn't hit an
official release yet, it only affects privileged users.

In case of array_map_lookup_elem(), the verifier prevents eBPF programs
from accessing beyond map->value_size through check_map_access(). Also
from syscall side map_lookup_elem() only copies map->value_size back to
user, so nothing could leak.
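
The fix is to bound the copy in array_map_update_elem() by the
user-visible value size (sketch):

  /* copy only map->value_size bytes, not the 8-byte rounded elem_size */
  memcpy(array->value + array->elem_size * index, value, map->value_size);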

  [1] http://github.com/google/syzkaller

Fixes: 28fbcfa08d8e ("bpf: add array type of eBPF maps")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-12-01 21:56:17 -05:00
Daniel Borkmann
c9da161c65 bpf: fix clearing on persistent program array maps
Currently, when having map file descriptors pointing to program arrays,
there's still the issue that we unconditionally flush program array
contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
when such a file descriptor is released and is independent of the map's
refcount.

Having this flush independent of the refcount is for a reason: there
can be arbitrary complex dependency chains among tail calls, also circular
ones (direct or indirect, nesting limit determined during runtime), and
we need to make sure that the map drops all references to eBPF programs
it holds, so that the map's refcount can eventually drop to zero and
initiate its freeing. Btw, a walk of the whole dependency graph would
not be possible for various reasons, one being complexity and another
one inconsistency, i.e. new programs can be added to parts of the graph
at any time, so there's no guaranteed consistent state for the time of
such a walk.

Now, the program array pinning itself works, but the issue is that each
derived file descriptor on close would nevertheless call unconditionally
into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
this flush until the last reference to a user is dropped. As this only
concerns a subset of references (f.e. a prog array could hold a program
that itself has reference on the prog array holding it, etc), we need to
track them separately.

Short analysis on the refcounting: on map creation time usercnt will be
one, so there's no change in behaviour for bpf_map_release(), if unpinned.
If we already fail in map_create(), we are immediately freed, and no
file descriptor has been made public yet. In bpf_obj_pin_user(), we need
to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
reference, so before we drop the reference on the fd with fdput().
Therefore, if actual pinning fails, we need to drop that reference again
in bpf_any_put(), otherwise we keep holding it. When last reference
drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
care of dropping the usercnt again. In the bpf_obj_get_user() case, the
bpf_any_get() will grab a reference on the usercnt, still at a time when
we have the reference on the path. Should we later on fail to grab a new
file descriptor, bpf_any_put() will drop it, otherwise we hold it until
bpf_map_release() time.

Joint work with Alexei.

Fixes: b2197755b263 ("bpf: add support for persistent maps/progs")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-25 12:14:09 -05:00
Daniel Borkmann
1d056d9c95 bpf, verifier: annotate verbose printer with __printf
The verbose() printer dumps the verifier state to user space, so let gcc
take care to check calls to verbose() for (future) errors. make with W=1
correctly suggests: function might be possible candidate for 'gnu_printf'
format attribute [-Wsuggest-attribute=format].
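
The annotation itself is a one-liner on the declaration (sketch):

  static __printf(1, 2) void verbose(const char *fmt, ...);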

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-03 11:29:56 -05:00
Daniel Borkmann
b2197755b2 bpf: add support for persistent maps/progs
This work adds support for "persistent" eBPF maps/programs. The term
"persistent" is to be understood to mean that maps/programs have a facility
that lets them survive process termination. This is desired by various
eBPF subsystem users.

Just to name one example: tc classifier/action. Whenever tc parses
the ELF object, extracts and loads maps/progs into the kernel, these
file descriptors will be out of reach after the tc instance exits.
So a subsequent tc invocation won't be able to access/relocate on this
resource, and therefore maps cannot easily be shared, f.e. between the
ingress and egress networking data path.

The current workaround is that Unix domain sockets (UDS) need to be
instrumented in order to pass the created eBPF map/program file
descriptors to a third party management daemon through UDS' socket
passing facility. This makes it a bit complicated to deploy shared
eBPF maps or programs (programs f.e. for tail calls) among various
processes.

We've been brainstorming on how we could tackle this issue, and various
approaches have been tried out so far, which can be read up on in
the reference below.

The architecture we eventually ended up with is a minimal file system
that can hold map/prog objects. The file system is a per mount namespace
singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
mounts within a given namespace will point to the same instance. The
file system allows for creating a user-defined directory structure.
The objects for maps/progs are created/fetched through bpf(2) with
two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
along with a pathname is being passed to bpf(2) that in turn creates
(we call it eBPF object pinning) the file system nodes. Only the pathname
is being passed to bpf(2) for getting a new BPF file descriptor to an
existing node. The user can use that to access maps and progs later on,
through bpf(2). Removal of file system nodes is being managed through
normal VFS functions such as unlink(2), etc. The file system code is
kept to a very minimum and can be further extended later on.
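
For example, pinning and later re-fetching a program fd could look like
this from user space (sketch; the path is illustrative):

  union bpf_attr attr = {};

  attr.pathname = (__u64)(unsigned long)"/sys/fs/bpf/my_prog";
  attr.bpf_fd   = prog_fd;
  if (syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr)))
      perror("BPF_OBJ_PIN");

  /* later, possibly from a different process */
  new_fd = syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));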

The next step I'm working on is to add dump eBPF map/prog commands
to bpf(2), so that a specification from a given file descriptor can
be retrieved. This can be used by things like CRIU but also applications
can inspect the meta data after calling BPF_OBJ_GET.

Big thanks also to Alexei and Hannes who significantly contributed
in the design discussion that eventually let us end up with this
architecture here.

Reference: https://lkml.org/lkml/2015/10/15/925
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Daniel Borkmann
e9d8afa90b bpf: consolidate bpf_prog_put{, _rcu} dismantle paths
We currently have duplicated cleanup code in bpf_prog_put() and
bpf_prog_put_rcu() cleanup paths. Back then we decided that it was
not worth it to make it a common helper called by both, but with
the recent addition of resource charging, we could have avoided
the fix in commit ac00737f4e81 ("bpf: Need to call bpf_prog_uncharge_memlock
from bpf_prog_put") if we would have had only a single, common path.
We can simplify it further by assigning aux->prog only once during
allocation time.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Daniel Borkmann
c210129760 bpf: align and clean bpf_{map,prog}_get helpers
Add a bpf_map_get() function that we're going to use later on and
align/clean the remaining helpers a bit so that we have them a bit
more consistent:

  - __bpf_map_get() and __bpf_prog_get() that both work on the fd
    struct, check whether the descriptor is eBPF and return the
    pointer to the map/prog stored in the private data.

    Also, we can return f.file->private_data directly, the function
    signature is enough of a documentation already.

  - bpf_map_get() and bpf_prog_get() that both work on u32 user fd,
    call their respective __bpf_map_get()/__bpf_prog_get() variants,
    and take a reference.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Daniel Borkmann
aa79781b65 bpf: abstract anon_inode_getfd invocations
Since we're going to use anon_inode_getfd() invocations in more than just
the current places, make a helper function for both, so that we only need
to pass a map/prog pointer to the helper itself in order to get a fd. The
new helpers are called bpf_map_new_fd() and bpf_prog_new_fd().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 22:48:39 -05:00
Yang Shi
ac00881f92 bpf: convert hashtab lock to raw lock
When running bpf samples on an rt kernel, the below warning is reported:

BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:917
in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
Preemption disabled at:[<ffff80000017db58>] kprobe_perf_func+0x30/0x228

CPU: 3 PID: 477 Comm: ping Not tainted 4.1.10-rt8 #4
Hardware name: Freescale Layerscape 2085a RDB Board (DT)
Call trace:
[<ffff80000008a5b0>] dump_backtrace+0x0/0x128
[<ffff80000008a6f8>] show_stack+0x20/0x30
[<ffff8000007da90c>] dump_stack+0x7c/0xa0
[<ffff8000000e4830>] ___might_sleep+0x188/0x1a0
[<ffff8000007e2200>] rt_spin_lock+0x28/0x40
[<ffff80000018bf9c>] htab_map_update_elem+0x124/0x320
[<ffff80000018c718>] bpf_map_update_elem+0x40/0x58
[<ffff800000187658>] __bpf_prog_run+0xd48/0x1640
[<ffff80000017ca6c>] trace_call_bpf+0x8c/0x100
[<ffff80000017db58>] kprobe_perf_func+0x30/0x228
[<ffff80000017dd84>] kprobe_dispatcher+0x34/0x58
[<ffff8000007e399c>] kprobe_handler+0x114/0x250
[<ffff8000007e3bf4>] kprobe_breakpoint_handler+0x1c/0x30
[<ffff800000085b80>] brk_handler+0x88/0x98
[<ffff8000000822f0>] do_debug_exception+0x50/0xb8
Exception stack(0xffff808349687460 to 0xffff808349687580)
7460: 4ca2b600 ffff8083 4a3a7000 ffff8083 49687620 ffff8083 0069c5f8 ffff8000
7480: 00000001 00000000 007e0628 ffff8000 496874b0 ffff8083 007e1de8 ffff8000
74a0: 496874d0 ffff8083 0008e04c ffff8000 00000001 00000000 4ca2b600 ffff8083
74c0: 00ba2e80 ffff8000 49687528 ffff8083 49687510 ffff8083 000e5c70 ffff8000
74e0: 00c22348 ffff8000 00000000 ffff8083 49687510 ffff8083 000e5c74 ffff8000
7500: 4ca2b600 ffff8083 49401800 ffff8083 00000001 00000000 00000000 00000000
7520: 496874d0 ffff8083 00000000 00000000 00000000 00000000 00000000 00000000
7540: 2f2e2d2c 33323130 00000000 00000000 4c944500 ffff8083 00000000 00000000
7560: 00000000 00000000 008751e0 ffff8000 00000001 00000000 124e2d1d 00107b77

Convert the hashtab lock to a raw lock to avoid such warnings.
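
The conversion itself is mechanical (sketch):

  raw_spinlock_t lock;    /* was: spinlock_t lock; */
  ...
  raw_spin_lock_irqsave(&htab->lock, flags);
  /* ... update the bucket ... */
  raw_spin_unlock_irqrestore(&htab->lock, flags);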

Signed-off-by: Yang Shi <yang.shi@linaro.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-11-02 15:45:43 -05:00
Alexei Starovoitov
62544ce8e0 bpf: fix bpf_perf_event_read() helper
Fix safety checks for bpf_perf_event_read():
- only non-inherited events can be added to perf_event_array map
  (do this check statically at map insertion time)
- dynamically check that event is local and !pmu->count
Otherwise a buggy bpf program can cause a kernel splat.

Also fix error path after perf_event_attrs()
and remove redundant 'extern'.

Fixes: 35578d798400 ("bpf: Implement function bpf_perf_event_read() that get the selected hardware PMU conuter")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Wang Nan <wangnan0@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-26 21:49:26 -07:00
Alexei Starovoitov
a43eec3042 bpf: introduce bpf_perf_event_output() helper
This helper is used to send raw data from an eBPF program into a
special PERF_TYPE_SOFTWARE/PERF_COUNT_SW_BPF_OUTPUT perf_event.
User space needs to perf_event_open() it (either for one or all cpus) and
store FD into perf_event_array (similar to bpf_perf_event_read() helper)
before eBPF program can send data into it.

Today the programs triggered by kprobes collect the data and either store
it into maps or print it via bpf_trace_printk(), where the latter is a debug
facility and not suitable for streaming the data. This new helper replaces
such bpf_trace_printk() usage and allows programs to have dedicated
channel into user space for post-processing of the raw data collected.
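
A typical call from a kprobe program might look like this (sketch; map
name and payload are illustrative, and the perf_event_array FDs are
assumed to have been set up per cpu by user space):

  struct { u64 ts; u32 pid; } data = {
      .ts  = bpf_ktime_get_ns(),
      .pid = bpf_get_current_pid_tgid(),
  };

  bpf_perf_event_output(ctx, &events, bpf_get_smp_processor_id(),
                        &data, sizeof(data));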

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-22 06:42:15 -07:00
Tom Herbert
ac00737f4e bpf: Need to call bpf_prog_uncharge_memlock from bpf_prog_put
Currently, bpf_prog_uncharge_memlock is only called from __prog_put_rcu in
the bpf_prog_release path. We need to call this from bpf_prog_put also to
get correct accounting.

Fixes: aaac3ba95e4c8b49 ("bpf: charge user for creation of BPF maps and programs")
Signed-off-by: Tom Herbert <tom@herbertland.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-10-16 00:55:02 -07:00