linux/net
Alexei Starovoitov bd4cf0ed33 net: filter: rework/optimize internal BPF interpreter's instruction set
This patch replaces/reworks the kernel-internal BPF interpreter with
an optimized BPF instruction set format that is modelled more closely
on native instruction sets and is designed to be JITed with a
one-to-one mapping. As a result, the new interpreter is noticeably
faster than the current implementation of sk_run_filter(), mainly
for two reasons:

1. Fall-through jumps:

  Classic BPF jump instructions are forced to take either the 'true'
  or the 'false' branch, which causes a branch-miss penalty. The new
  BPF jump instructions have only one branch target and fall through
  otherwise, which fits the CPU branch predictor logic better. `perf
  stat` shows a drastic difference in branch-misses between the old
  and the new code.
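
  A minimal sketch of the difference in pseudo-C, where A is the
  classic accumulator and K an immediate (names illustrative):

    /* classic BPF: every jump picks one of two explicit targets */
    if (A == K)
            goto pc + jt;
    else
            goto pc + jf;

    /* internal BPF: a single target, fall-through otherwise */
    if (A == K)
            goto pc + off;
    /* not taken: fall through to the next instruction */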

2. Jump-threaded implementation of interpreter vs switch
   statement:

  Instead of a single table jump at the top of a 'switch' statement,
  gcc now generates multiple table-jump instructions, which helps
  the CPU branch predictor logic.
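
  A minimal sketch of the jump-threading technique (GCC's
  labels-as-values extension; label and variable names here are
  illustrative, not the kernel's):

    static const void *jumptable[256] = {
            [0 ... 255] = &&default_insn, /* GCC range initializer */
            [BPF_ALU64 | BPF_ADD | BPF_X] = &&alu64_add_x,
            /* ... one label per opcode ... */
    };

    goto *jumptable[insn->code];          /* initial dispatch */

    alu64_add_x:
            regs[insn->a_reg] += regs[insn->x_reg];
            insn++;
            goto *jumptable[insn->code];  /* each handler re-dispatches,
                                           * so the CPU gets many distinct
                                           * indirect-branch sites instead
                                           * of one hot 'switch' jump */
    default_insn:
            return 0;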

Note that the verification of filters is still done through
sk_chk_filter() in classical BPF format, so filters from user or
kernel space are verified in the same way as before, and the same
restrictions/constraints hold as well.
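
Roughly, the attach path still runs the classic checker first
(signature as in this kernel; surrounding code elided):

  err = sk_chk_filter(fprog->filter, fprog->len); /* classic verifier */
  if (err)
          return err;
  /* only a verified classic program is JITed or migrated */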

The current BPF JIT compilers are reused as is, so this upgrade
works without touching them, while still allowing for a successive
migration of the BPF JIT compilers to the new format.

The internal instruction set migration is done after probing for
JIT compilation: if a JIT compiler is able to create a native
opcode image, we use that, and in all other cases we do a follow-up
migration of the BPF program's instruction set, so that it can be
transparently run in the new interpreter.
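
A rough sketch of that attach-time flow (sk_convert_filter() is the
classic-to-internal translator added by this patch; the exact check
and call sites in the patch may differ):

  bpf_jit_compile(fp);                  /* probe the classic BPF JIT */
  if (fp->bpf_func == sk_run_filter) {
          /* no native image was produced: translate the classic
           * insns via sk_convert_filter() and run the result in
           * the new internal interpreter instead
           */
  }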

In short, the *internal* format extends BPF in the following way (more
details can be taken from the appended documentation):

  - Number of registers increases from 2 to 10
  - Register width increases from 32-bit to 64-bit
  - Conditional jt/jf targets replaced with jt/fall-through
  - Adds signed > and >= insns
  - 16 4-byte stack slots for register spill-fill replaced
    with up to 512 bytes of multi-use stack space
  - Introduction of bpf_call insn and register passing convention
    for zero overhead calls from/to other kernel functions (the
    convention is sketched after this list)
  - Adds arithmetic right shift and endianness conversion insns
  - Adds atomic_add insn
  - Old tax/txa insns are replaced with 'mov dst,src' insn
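
A sketch of the new instruction encoding and the bpf_call register
convention (field names as added by this patch; treat the layout
shown here as illustrative):

  struct sock_filter_int {
          __u8    code;    /* opcode */
          __u8    a_reg:4; /* destination register */
          __u8    x_reg:4; /* source register */
          __s16   off;     /* signed offset, e.g. jump target */
          __s32   imm;     /* signed immediate constant */
  };

  /* register convention, chosen to map 1:1 onto the native ABI:
   * R0      - return value from bpf_call and program exit value
   * R1 - R5 - function call arguments
   * R6 - R9 - callee-saved registers, preserved across calls
   * R10     - read-only frame pointer for the 512-byte stack
   */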

Performance of two BPF filters, generated by libpcap and bpf_asm
respectively, was measured on x86_64, i386 and arm32 (other libpcap
programs show similar performance differences):

fprog #1 is taken from Documentation/networking/filter.txt:
tcpdump -i eth0 port 22 -dd

fprog #2 is taken from 'man tcpdump':
tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
   ((tcp[12]&0xf0)>>2)) != 0)' -dd

Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
smaller is better:

--x86_64--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF      90       101        192       202
new BPF      31        71         47        97
old BPF jit  12        34         17        44
new BPF jit TBD

--i386--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF     107       136        227       252
new BPF      40       119         69       172

--arm32--
         fprog #1  fprog #1   fprog #2  fprog #2
         cache-hit cache-miss cache-hit cache-miss
old BPF     202       300        475       540
new BPF     180       270        330       470
old BPF jit  26       182         37       202
new BPF jit TBD

Thus, without changing any userland BPF filters, applications on
top of AF_PACKET (or other families) such as libpcap/tcpdump, as
well as the cls_bpf classifier, netfilter's xt_bpf, the team
driver's load-balancing mode, and many more will see better
interpreter filtering performance.

While we are replacing the internal BPF interpreter, we also need
to convert seccomp BPF in the same step to make use of the new
internal structure, since seccomp makes use of lower-level API
details without being decoupled through higher-level calls like
sk_unattached_filter_{create,destroy}(), for example.
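
For reference, the decoupled API that other users go through looks
roughly like this (signatures as in this kernel; error handling and
setup elided):

  struct sock_fprog fprog = { .len = len, .filter = insns };
  struct sk_filter *fp;

  err = sk_unattached_filter_create(&fp, &fprog);
  ...
  sk_unattached_filter_destroy(fp);

seccomp bypasses this layer and touches the filter internals
directly, hence the in-step conversion.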

Just as for normal socket filtering, seccomp BPF also sees a
time-to-verdict speedup:

05-sim-long_jumps.c from libseccomp was used as a micro-benchmark:

  seccomp_rule_add_exact(ctx,...  /* install the filter rules */
  seccomp_rule_add_exact(ctx,...

  rc = seccomp_load(ctx);         /* compile and attach the filter */

  for (i = 0; i < 10000000; i++)  /* drive 10M syscalls through it */
     syscall(199, 100);

'short filter' has 2 rules
'large filter' has 200 rules

'short filter' performance is slightly better on x86_64/i386/arm32;
'large filter' is much faster on x86_64 and i386 and shows no
difference on arm32.

--x86_64-- short filter
old BPF: 2.7 sec
 39.12%  bench  libc-2.15.so       [.] syscall
  8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
  6.31%  bench  [kernel.kallsyms]  [k] system_call
  5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
  3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
  3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
new BPF: 2.58 sec
 42.05%  bench  libc-2.15.so       [.] syscall
  6.91%  bench  [kernel.kallsyms]  [k] system_call
  6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
  6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
  5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp

--arm32-- short filter
old BPF: 4.0 sec
 39.92%  bench  [kernel.kallsyms]  [k] vector_swi
 16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
 14.66%  bench  libc-2.17.so       [.] syscall
  5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
new BPF: 3.7 sec
 35.93%  bench  [kernel.kallsyms]  [k] vector_swi
 21.89%  bench  libc-2.17.so       [.] syscall
 13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
  6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
  3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit

--x86_64-- large filter
old BPF: 8.6 seconds
    73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
    10.70%    bench  libc-2.15.so       [.] syscall
     5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
     1.97%    bench  [kernel.kallsyms]  [k] system_call
new BPF: 5.7 seconds
    66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
    16.75%    bench  libc-2.15.so       [.] syscall
     3.31%    bench  [kernel.kallsyms]  [k] system_call
     2.88%    bench  [kernel.kallsyms]  [k] __secure_computing

--i386-- large filter
old BPF: 5.4 sec
new BPF: 3.8 sec

--arm32-- large filter
old BPF: 13.5 sec
 73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
 10.29%  bench  [kernel.kallsyms]  [k] vector_swi
  6.46%  bench  libc-2.17.so       [.] syscall
  2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
  1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
new BPF: 13.5 sec
 76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
 10.98%  bench  [kernel.kallsyms]  [k] vector_swi
  5.87%  bench  libc-2.17.so       [.] syscall
  1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
  0.93%  bench  [kernel.kallsyms]  [k] sys_getuid

BPF filters generated by seccomp are very branchy, so the new
internal BPF interpreter performs better than the old one.
Performance gains will be even higher once a BPF JIT is committed
for the new structure, which is planned as future work (successive
JIT migrations).

BPF has also been stress-tested with trinity's BPF fuzzer.

Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Paul Moore <pmoore@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31 00:45:09 -04:00
9p 9p/trans_virtio.c: Fix broken zero-copy on vmalloc() buffers 2014-02-10 17:48:54 -08:00
802
8021q Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-29 18:48:54 -04:00
appletalk appletalk: fix checkpatch error with indent 2014-02-14 16:18:32 -05:00
atm atm: replace del_timer by del_timer_sync 2014-03-27 15:28:06 -04:00
ax25
batman-adv batman-adv: Start new development cycle 2014-03-22 09:18:59 +01:00
bluetooth Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2014-03-21 14:02:04 -04:00
bridge Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-29 18:48:54 -04:00
caif net: Include appropriate header file in caif/cfsrvl.c 2014-02-09 17:32:49 -08:00
can can: remove CAN FD compatibility for CAN 2.0 sockets 2014-03-03 14:29:52 +01:00
ceph net: remove unnecessary return's 2014-02-13 18:33:38 -05:00
core net: filter: rework/optimize internal BPF interpreter's instruction set 2014-03-31 00:45:09 -04:00
dcb
dccp dccp: re-enable debug macro 2014-02-16 23:45:00 -05:00
decnet net: Move prototype declaration to header file include/net/dn.h from net/decnet/af_decnet.c 2014-02-09 17:32:49 -08:00
dns_resolver
dsa
ethernet
hsr hsr: replace del_timer by del_timer_sync 2014-03-27 15:28:06 -04:00
ieee802154 ieee802154: dgram: cleanup set of broadcast panid 2014-03-20 17:19:45 -04:00
ipv4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-29 18:48:54 -04:00
ipv6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-29 18:48:54 -04:00
ipx ipx: implement shutdown() 2014-02-12 19:26:32 -05:00
irda
iucv af_iucv: recvmsg problem for SOCK_STREAM sockets 2014-03-20 00:06:55 -04:00
key Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-25 20:29:20 -04:00
l2tp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-14 22:31:55 -04:00
lapb
llc
mac80211 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2014-03-21 14:02:04 -04:00
mac802154 ieee802154: add proper length checks to header creations 2014-03-14 22:15:26 -04:00
mpls
netfilter Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-29 18:48:54 -04:00
netlabel
netlink netlink: autosize skb lengthes 2014-03-10 13:56:26 -04:00
netrom
nfc NFC: 3.15: First pull request 2014-03-17 13:16:50 -04:00
openvswitch Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-29 18:48:54 -04:00
packet packet: respect devices with LLTX flag in direct xmit 2014-03-28 16:49:48 -04:00
phonet
rds
rfkill
rose
rxrpc af_rxrpc: Keep rxrpc_call pointers in a hashtable 2014-03-04 10:36:53 +00:00
sched net: sched: use no more than one page in struct fw_head 2014-03-18 14:17:55 -04:00
sctp Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-14 22:31:55 -04:00
sunrpc NFS client bugfixes for Linux 3.14 2014-02-19 12:13:02 -08:00
tipc tipc: make discovery domain a bearer attribute 2014-03-28 14:46:29 -04:00
unix net: unix: non blocking recvmsg() should not return -EINTR 2014-03-26 17:05:40 -04:00
vmw_vsock
wimax
wireless Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2014-03-21 14:02:04 -04:00
x25
xfrm Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-25 20:29:20 -04:00
compat.c
Kconfig
Makefile
nonet.c
socket.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-03-14 22:31:55 -04:00
sysctl_net.c