linux

iv/linux

Author	SHA1	Message	Date
Daniel Borkmann	a4afd37b26	test_bpf: add tests related to BPF_MAXINSNS Couple of torture test cases related to the bug fixed in `0b59d8806a` ("ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits."). I've added a helper to allocate and fill the insn space. Output on x86_64 from my laptop: test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:0 7 PASS test_bpf: #234 BPF_MAXINSNS: Single literal jited:0 8 PASS test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:0 11553 PASS test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS test_bpf: #237 BPF_MAXINSNS: Very long jump jited:0 9 PASS test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:0 20329 20398 PASS test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:0 32178 32475 PASS test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:0 10518 PASS test_bpf: #233 BPF_MAXINSNS: Maximum possible literals jited:1 4 PASS test_bpf: #234 BPF_MAXINSNS: Single literal jited:1 4 PASS test_bpf: #235 BPF_MAXINSNS: Run/add until end jited:1 1625 PASS test_bpf: #236 BPF_MAXINSNS: Too many instructions PASS test_bpf: #237 BPF_MAXINSNS: Very long jump jited:1 8 PASS test_bpf: #238 BPF_MAXINSNS: Ctx heavy transformations jited:1 3301 3174 PASS test_bpf: #239 BPF_MAXINSNS: Call heavy transformations jited:1 24107 23491 PASS test_bpf: #240 BPF_MAXINSNS: Jump heavy test jited:1 8651 PASS Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Nicolas Schichan <nschichan@freebox.fr> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:34:10 -04:00
Eric Dumazet	264ea103a7	tcp: syncookies: extend validity range Now we allow storing more request socks per listener, we might hit syncookie mode less often and hit following bug in our stack : When we send a burst of syncookies, then exit this mode, tcp_synq_no_recent_overflow() can return false if the ACK packets coming from clients are coming three seconds after the end of syncookie episode. This is a way too strong requirement and conflicts with rest of syncookie code which allows ACK to be aged up to 2 minutes. Perfectly valid ACK packets are dropped just because clients might be in a crowded wifi environment or on another planet. So let's fix this, and also change tcp_synq_overflow() to not dirty a cache line for every syncookie we send, as we are under attack. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:32:17 -04:00
Alexander Duyck	c24a59649f	ip_tunnel: Report Rx dropped in ip_tunnel_get_stats64 The rx_dropped stat wasn't being reported when ip_tunnel_get_stats64 was called. This was leading to some confusing results in my debug as I was seeing rx_errors increment but no other value which pointed me toward the type of error being seen. This change corrects that by using netdev_stats_to_stats64 to copy all available dev stats instead of just the few that were hand picked. Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 22:30:54 -04:00
Willem de Bruijn	54d7c01d3e	packet: fix warnings in rollover lock contention Avoid two xchg calls whose return values were unused, causing a warning on some architectures. The relevant variable is a hint and read without mutual exclusion. This fix makes all writers hold the receive_queue lock. Suggested-by: David S. Miller <davem@davemloft.net> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 17:40:54 -04:00
françois romieu	4ffd3c730e	net: batch of last_rx update avoidance in ethernet drivers. None of those drivers uses last_rx for its own needs. See `4dc89133f4` ("net: add a comment on netdev->last_rx") for reference. Signed-off-by: Francois Romieu <romieu@fr.zoreil.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Zhangfei Gao <zhangfei.gao@linaro.org> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Wingman Kwok <w-kwok2@ti.com> Cc: Murali Karicheri <m-karicheri2@ti.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 17:38:17 -04:00
David S. Miller	7852dada15	Merge branch 'phy_turn_around' Florian Fainelli says: ==================== net: phy: broken turn-around support This is an attempt at solving the broken turn-around problem in a way that is not specific to the mdio-gpio driver, since it affects different kinds of platforms. We cannot make that localized to PHY device drivers because probing the PHY device which has a broken turn-around can fail as early as in get_phy_id(), therefore we need a bit of help from Device Tree/platform_data. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 13:40:55 -04:00
Florian Fainelli	ea48b2b8ad	net: phy: mdio-gpio: Handle phy_ignore_ta_mask Update mdiobb_read() to read whether the PHY has a broken turn-around, and if it does, ignore it to make the read succeeed. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 13:40:55 -04:00
Florian Fainelli	ab6016e0c1	of: mdio: Add a "broken-turn-around" property Some Ethernet PHY devices/switches may not properly release the MDIO bus during turn-around time, and fail to drive it low, which can be seen by some controllers as a read failure, while the data clocked in is still correct. Add a boolean property "broken-turn-around" which is parsed by the generic MDIO bus probing code and will set the corresponding bit in the MDIO bus phy_ignore_ta_mask bitmask for MDIO bus drivers to utilize that information. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 13:40:55 -04:00
Florian Fainelli	922f2dd1b6	net: phy: Add phy_ignore_ta_mask to account for broken turn-around Some PHY devices/switches will not release the turn-around line as they should do at the end of a MDIO transaction. To help with such situations, allow MDIO bus drivers to be made aware of such restrictions. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 13:40:55 -04:00
Ying Xue	fa787ae062	tipc: use sock_create_kern interface to create kernel socket After commit `eeb1bd5c40` ("net: Add a struct net parameter to sock_create_kern"), we should use sock_create_kern() to create kernel socket as the interface doesn't reference count struct net any more. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 13:39:33 -04:00
Brian Haley	dd3aa3b5fb	cls_flower: Fix compile error Fix compile error in net/sched/cls_flower.c net/sched/cls_flower.c: In function ‘fl_set_key’: net/sched/cls_flower.c:240:3: error: implicit declaration of function ‘tcf_change_indev’ [-Werror=implicit-function-declaration] err = tcf_change_indev(net, tb[TCA_FLOWER_INDEV]); Introduced in `77b9900ef5` Fixes: `77b9900ef5` ("tc: introduce Flower classifier") Signed-off-by: Brian Haley <brian.haley@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 13:34:35 -04:00
David S. Miller	b55b10bebb	Merge branch 'tipc-next' Jon Maloy says: ==================== tipc: some link layer improvements We continue eliminating redundant complexity at the link layer, and add a couple of improvements to the packet sending functionality. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
Jon Paul Maloy	dd3f9e70f5	tipc: add packet sequence number at instant of transmission Currently, the packet sequence number is updated and added to each packet at the moment a packet is added to the link backlog queue. This is wasteful, since it forces the code to traverse the send packet list packet by packet when adding them to the backlog queue. It would be better to just splice the whole packet list into the backlog queue when that is the right action to do. In this commit, we do this change. Also, since the sequence numbers cannot now be assigned to the packets at the moment they are added the backlog queue, we do instead calculate and add them at the moment of transmission, when the backlog queue has to be traversed anyway. We do this in the function tipc_link_push_packet(). Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
Jon Paul Maloy	f21e897ecc	tipc: improve link congestion algorithm The link congestion algorithm used until now implies two problems. - It is too generous towards lower-level messages in situations of high load by giving "absolute" bandwidth guarantees to the different priority levels. LOW traffic is guaranteed 10%, MEDIUM is guaranted 20%, HIGH is guaranteed 30%, and CRITICAL is guaranteed 40% of the available bandwidth. But, in the absence of higher level traffic, the ratio between two distinct levels becomes unreasonable. E.g. if there is only LOW and MEDIUM traffic on a system, the former is guaranteed 1/3 of the bandwidth, and the latter 2/3. This again means that if there is e.g. one LOW user and 10 MEDIUM users, the former will have 33.3% of the bandwidth, and the others will have to compete for the remainder, i.e. each will end up with 6.7% of the capacity. - Packets of type MSG_BUNDLER are created at SYSTEM importance level, but only after the packets bundled into it have passed the congestion test for their own respective levels. Since bundled packets don't result in incrementing the level counter for their own importance, only occasionally for the SYSTEM level counter, they do in practice obtain SYSTEM level importance. Hence, the current implementation provides a gap in the congestion algorithm that in the worst case may lead to a link reset. We now refine the congestion algorithm as follows: - A message is accepted to the link backlog only if its own level counter, and all superior level counters, permit it. - The importance of a created bundle packet is set according to its contents. A bundle packet created from messges at levels LOW to CRITICAL is given importance level CRITICAL, while a bundle created from a SYSTEM level message is given importance SYSTEM. In the latter case only subsequent SYSTEM level messages are allowed to be bundled into it. This solves the first problem described above, by making the bandwidth guarantee relative to the total number of users at all levels; only the upper limit for each level remains absolute. In the example described above, the single LOW user would use 1/11th of the bandwidth, the same as each of the ten MEDIUM users, but he still has the same guarantee against starvation as the latter ones. The fix also solves the second problem. If the CRITICAL level is filled up by bundle packets of that level, no lower level packets will be accepted any more. Suggested-by: Gergely Kiss <gergely.kiss@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
Jon Paul Maloy	cd4eee3c2e	tipc: simplify link supervision checkpointing We change the sequence number checkpointing that is performed by the timer in order to discover if the peer is active. Currently, we store a checkpoint of the next expected sequence number "rcv_nxt" at each timer expiration, and compare it to the current expected number at next timeout expiration. Instead, we now use the already existing field "silent_intv_cnt" for this task. We step the counter at each timeout expiration, and zero it at each valid received packet. If no valid packet has been received from the peer after "abort_limit" number of silent timer intervals, the link is declared faulty and reset. We also remove the multiple instances of timer activation from inside the FSM function "link_state_event()", and now do it at only one place; at the end of the timer function itself. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
Jon Paul Maloy	a97b9d3fa9	tipc: rename fields in struct tipc_link We rename some fields in struct tipc_link, in order to give them more descriptive names: next_in_no -> rcv_nxt next_out_no-> snd_nxt fsm_msg_cnt-> silent_intv_cnt cont_intv -> keepalive_intv last_retransmitted -> last_retransm There are no functional changes in this commit. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
Jon Paul Maloy	e4bf4f7696	tipc: simplify packet sequence number handling Although the sequence number in the TIPC protocol is 16 bits, we have until now stored it internally as an unsigned 32 bits integer. We got around this by always doing explicit modulo-65535 operations whenever we need to access a sequence number. We now make the incoming and outgoing sequence numbers to unsigned 16-bit integers, and remove the modulo operations where applicable. We also move the arithmetic inline functions for 16 bit integers to core.h, and the function buf_seqno() to msg.h, so they can easily be accessed from anywhere in the code. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:46 -04:00
Jon Paul Maloy	a6bf70f792	tipc: simplify include dependencies When we try to add new inline functions in the code, we sometimes run into circular include dependencies. The main problem is that the file core.h, which really should be at the root of the dependency chain, instead is a leaf. I.e., core.h includes a number of header files that themselves should be allowed to include core.h. In reality this is unnecessary, because core.h does not need to know the full signature of any of the structs it refers to, only their type declaration. In this commit, we remove all dependencies from core.h towards any other tipc header file. As a consequence of this change, we can now move the function tipc_own_addr(net) from addr.c to addr.h, and make it inline. There are no functional changes in this commit. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:45 -04:00
Jon Paul Maloy	75b44b018e	tipc: simplify link timer handling Prior to this commit, the link timer has been running at a "continuity interval" of configured link tolerance/4. When a timer wakes up and discovers that there has been no sign of life from the peer during the previous interval, it divides its own timer interval by another factor four, and starts sending one probe per new interval. When the configured link tolerance time has passed without answer, i.e. after 16 unacked probes, the link is declared faulty and reset. This is unnecessary complex. It is sufficient to continue with the original continuity interval, and instead reset the link after four missed probe responses. This makes the timer handling in the link simpler, and opens up for some planned later changes in this area. This commit implements this change. Reviewed-by: Richard Alpe <richard.alpe@ericsson.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:45 -04:00
Jon Paul Maloy	b1c29f6b10	tipc: simplify resetting and disabling of bearers Since commit 4b475e3f2f8e4e241de101c8240f1d74d0470494 ("tipc: eliminate delayed link deletion at link failover") the extra boolean parameter "shutting_down" is not any longer needed for the functions bearer_disable() and tipc_link_delete_list(). Furhermore, the function tipc_link_reset_links(), called from bearer_reset() is now unnecessary. We can just as well delete all the links, as we do in bearer_disable(), and start over with creating new links. This commit introduces those changes. Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:24:45 -04:00
David S. Miller	c16ead798e	Merge branch 'be2net-next' Venkat Duvvuru says: ==================== be2net: patch-set The following patch set has one new feature addition and two fixes. Patch 1 adds support for hwmon sysfs interface to display board temperature. Board temperature display through ethtool statistics is removed. Patch 2 reports "link down" in a few more error cases which are not handled currently. Patch 3 adds support for os2bmc. OS2BMC feature will allow the server to communicate with the on-board BMC/idrac (Baseboard Management Controller) over the LOM via standard Ethernet. More details are added in the commit log. Please review. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:21:42 -04:00
Venkata Duvvuru	760c295e0e	be2net: Support for OS2BMC. OS2BMC feature will allow the server to communicate with the on-board BMC/idrac (Baseboard Management Controller) over the LOM via standard Ethernet. When OS2BMC feature is enabled, the LOM will filter traffic coming from the host. If the destination MAC address matches the iDRAC MAC address, it will forward the packet to the NC-SI side band interface for iDRAC processing. Otherwise, it would send it out on the wire to the external network. Broadcast and multicast packets are sent on the side-band NC-SI channel and on the wire as well. Some of the packet filters are not supported in the NIC and hence driver will identify such packets and will hint the NIC to send those packets to the BMC. This is done by duplicating packets on the management ring. Packets are sent to the management ring, by setting mgmt bit in the wrb header. The NIC will forward the packets on the management ring to the BMC through the side-band NC-SI channel. Please refer to this online document for more details, http://www.dell.com/downloads/global/products/pedge/ os_to_bmc_passthrough_a_new_chapter_in_system_management.pdf Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:21:42 -04:00
Venkata Duvvuru	954f6825ee	be2net: Report a "link down" to the stack when a fatal error or fw reset happens. When an error (related to HW or FW) is detected on a function, the driver must pro-actively report a "link down" to the stack so that a possible failover can be initiated. This is being done currently only for some HW errors. This patch reports a "link down" even for fatal FW errors and EEH errors. Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:21:41 -04:00
Venkata Duvvuru	29e9122b3a	be2net: Export board temperature using hwmon-sysfs interface. Ethtool statistics is not the right place to display board temperature. This patch adds support to export die temperature of devices supported by be2net driver via the sysfs hwmon interface. Signed-off-by: Venkat Duvvuru <VenkatKumar.Duvvuru@Emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 12:21:41 -04:00
David S. Miller	5a99e7f22b	Merge branch 'nf-ingress' Pablo Neira Ayuso says: ==================== Netfilter ingress support (v4) This is the v4 round of patches to add the Netfilter ingress hook, it basically comes in two steps: 1) Add the CONFIG_NET_INGRESS switch to wrap the ingress static key around it. The idea is to use the same global static key to avoid adding more code to the hot path. 2) Add the Netfilter ingress hook after the tc ingress hook, under the global ingress_needed static key. As I said, the netfilter ingress hook also has its own static key, that is nested under the global static key. Please, see patch 5/5 for performance numbers and more information. I originally started this next round, as it was suggested, exploring the independent static key for netfilter ingress just after tc ingress, but the results that I gathered from that patch are not good for non-users: Result: OK: 6425927(c6425843+d83) usec, 100000000 (60byte,0frags) 15561955pps 7469Mb/sec (7469738400bps) errors: 100000000 this roughly means 500Kpps less performance wrt. the base numbers, so that's the reason why I discarded that approach and I focused on this. The idea of this patchset is to open the window to nf_tables, which comes with features that will work out-of-the-box (once the boiler plate code to support the 'netdev' table family is in place), to avoid repeating myself [1], the most relevant features are: 1) Multi-dimensional key dictionary lookups. 2) Arbitrary stateful flow tables. 3) Transactions and good support for dynamic updates. But there are also interest aspects to consider from userspace, such as the ability to support new layer 2 protocols without kernel updates, a well-defined netlink interface, userspace libraries and utilities for third party applications, among others. I hope we can be happy with this approach. Please, apply. Thanks. [1] http://marc.info/?l=netfilter-devel&m=143033337020328&w=2 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:06 -04:00
Pablo Neira	e687ad60af	netfilter: add netfilter ingress hook after handle_ing() under unique static key This patch adds the Netfilter ingress hook just after the existing tc ingress hook, that seems to be the consensus solution for this. Note that the Netfilter hook resides under the global static key that enables ingress filtering. Nonetheless, Netfilter still also has its own static key for minimal impact on the existing handle_ing(). * Without this patch: Result: OK: 6216490(c6216338+d152) usec, 100000000 (60byte,0frags) 16086246pps 7721Mb/sec (7721398080bps) errors: 100000000 42.46% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 25.92% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 7.81% kpktgend_0 [pktgen] [k] pktgen_thread_worker 5.62% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.70% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 2.34% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 1.44% kpktgend_0 [kernel.kallsyms] [k] __build_skb * With this patch: Result: OK: 6214833(c6214731+d101) usec, 100000000 (60byte,0frags) 16090536pps 7723Mb/sec (7723457280bps) errors: 100000000 41.23% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 26.57% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 7.72% kpktgend_0 [pktgen] [k] pktgen_thread_worker 5.55% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.78% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 2.06% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 1.43% kpktgend_0 [kernel.kallsyms] [k] __build_skb * Without this patch + tc ingress: tc filter add dev eth4 parent ffff: protocol ip prio 1 \ u32 match ip dst 4.3.2.1/32 Result: OK: 9269001(c9268821+d179) usec, 100000000 (60byte,0frags) 10788648pps 5178Mb/sec (5178551040bps) errors: 100000000 40.99% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 17.50% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 11.77% kpktgend_0 [cls_u32] [k] u32_classify 5.62% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat 5.18% kpktgend_0 [pktgen] [k] pktgen_thread_worker 3.23% kpktgend_0 [kernel.kallsyms] [k] tc_classify 2.97% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 1.83% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 1.50% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk 0.99% kpktgend_0 [kernel.kallsyms] [k] __build_skb * With this patch + tc ingress: tc filter add dev eth4 parent ffff: protocol ip prio 1 \ u32 match ip dst 4.3.2.1/32 Result: OK: 9308218(c9308091+d126) usec, 100000000 (60byte,0frags) 10743194pps 5156Mb/sec (5156733120bps) errors: 100000000 42.01% kpktgend_0 [kernel.kallsyms] [k] __netif_receive_skb_core 17.78% kpktgend_0 [kernel.kallsyms] [k] kfree_skb 11.70% kpktgend_0 [cls_u32] [k] u32_classify 5.46% kpktgend_0 [kernel.kallsyms] [k] tc_classify_compat 5.16% kpktgend_0 [pktgen] [k] pktgen_thread_worker 2.98% kpktgend_0 [kernel.kallsyms] [k] ip_rcv 2.84% kpktgend_0 [kernel.kallsyms] [k] tc_classify 1.96% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_internal 1.57% kpktgend_0 [kernel.kallsyms] [k] netif_receive_skb_sk Note that the results are very similar before and after. I can see gcc gets the code under the ingress static key out of the hot path. Then, on that cold branch, it generates the code to accomodate the netfilter ingress static key. My explanation for this is that this reduces the pressure on the instruction cache for non-users as the new code is out of the hot path, and it comes with minimal impact for tc ingress users. Using gcc version 4.8.4 on: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 8 [...] L1d cache: 16K L1i cache: 64K L2 cache: 2048K L3 cache: 8192K Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:05 -04:00
Pablo Neira	1cf51900f8	net: add CONFIG_NET_INGRESS to enable ingress filtering This new config switch enables the ingress filtering infrastructure that is controlled through the ingress_needed static key. This prepares the introduction of the Netfilter ingress hook that resides under this unique static key. Note that CONFIG_SCH_INGRESS automatically selects this, that should be no problem since this also depends on CONFIG_NET_CLS_ACT. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:05 -04:00
Pablo Neira	b8d0aad0c7	netfilter: add nf_hook_list_active() In preparation to have netfilter ingress per-device hook list. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:05 -04:00
Pablo Neira	f719148346	netfilter: add hook list to nf_hook_state Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:05 -04:00
Pablo Neira	87d5c18ce1	netfilter: cleanup struct nf_hook_ops indentation Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 01:10:05 -04:00
Dan Carpenter	a104a6b309	net: macb: OR vs AND typos The bitwise tests are always true here because it uses '\|' where '&' is intended. Fixes: `98b5a0f4a2` ('net: macb: Add support for jumbo frames') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-14 00:49:09 -04:00
Alexander Duyck	a080e7bd0a	net: Reserve skb headroom and set skb->dev even if using __alloc_skb When I had inlined __alloc_rx_skb into __netdev_alloc_skb and __napi_alloc_skb I had overlooked the fact that there was a return in the __alloc_rx_skb. As a result we weren't reserving headroom or setting the skb->dev in certain cases. This change corrects that by adding a couple of jump labels to jump to depending on __alloc_skb either succeeding or failing. Fixes: `9451980a66` ("net: Use cached copy of pfmemalloc to avoid accessing page") Reported-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> Tested-by: Kevin Hilman <khilman@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 18:07:24 -04:00
David S. Miller	59af132bb6	Merge branch 'geneve_tunnel_driver' John W. Linville says: ==================== add GENEVE netdev tunnel driver This 5-patch kernel series adds a netdev implementation of a GENEVE tunnel driver, and the single iproute2 patch enables creation and such for those netdevs. This makes use of the existing GENEVE infrastructure already used by the OVS code. The net/ipv4/geneve.c file is renamed as net/ipv4/geneve_core.c as part of these changes. drivers/net/Kconfig \| 14 + drivers/net/Makefile \| 1 drivers/net/geneve.c \| 503 +++++++++++++++++++++++++++++++++++++++++ include/net/geneve.h \| 5 include/uapi/linux/if_link.h \| 9 net/ipv4/Kconfig \| 4 net/ipv4/Makefile \| 2 net/ipv4/geneve.c \| 6 net/ipv4/geneve_core.c \| 4 net/openvswitch/Kconfig \| 2 net/openvswitch/vport-geneve.c \| 5 11 files changed, 538 insertions(+), 17 deletions(-) The overall structure of the GENEVE netdev driver is strongly influenced by the VXLAN netdev driver. This is not surprising, as the two drivers are intended to serve similar purposes. As development of the GENEVE driver continues, it is likely that those similarities will grow stronger. This will include both simple configuration options (e.g. TOS and TTL settings) and new control plane support. The current implementation is very simple, restricting itself to point to point links over IPv4. This is due only to the simplicity of the implementation, and no such limit is inherent to GENEVE in any way. Support for IPv6 links and more sophisticated control plane options are predictable enhancements. Using the included iproute2 patch, a GENEVE tunnel is created thusly: ip link add dev gnv0 type geneve remote 192.168.22.1 vni 1234 ip link set gnv0 up ip addr add 10.1.1.1/24 dev gnv0 After a corresponding tunnel interface is created at the link partner, traffic should proceed as expected. Please let me know if anyone has problems...thanks! ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:14 -04:00
John W. Linville	2d07dc79fe	geneve: add initial netdev driver for GENEVE tunnels This is an initial implementation of a netdev driver for GENEVE tunnels. This implementation uses a fixed UDP port, and only supports point-to-point links with specific partner endpoints. Only IPv4 links are supported at this time. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:13 -04:00
John W. Linville	d37d29c305	geneve_core: identify as driver library in modules description Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:13 -04:00
John W. Linville	11e1fa46b4	geneve: Rename support library as geneve_core net/ipv4/geneve.c -> net/ipv4/geneve_core.c This name better reflects the purpose of the module. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:13 -04:00
John W. Linville	35d32e8fe4	geneve: move definition of geneve_hdr() to geneve.h This is a static inline with identical definitions in multiple places... Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:13 -04:00
John W. Linville	125907ae5e	geneve: remove MODULE_ALIAS_RTNL_LINK from net/ipv4/geneve.c This file is essentially a library for implementing the geneve encapsulation protocol. The file does not register any rtnl_link_ops, so the MODULE_ALIAS_RTNL_LINK macro is inappropriate here. Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:59:12 -04:00
Pablo Neira	f0b5e8a42f	net: kill useless net_*_ingress_queue() definitions when NET_CLS_ACT is unset This fixes `4577139b2d` ("net: use jump label patching for ingress qdisc in __netif_receive_skb_core"). The only client of this is sch_ingress and it depends on NET_CLS_ACT. So there is no way these definition can be of any help. Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:44:28 -04:00
David S. Miller	9f0a74d7b6	Merge branch 'packet_rollover' Willem de Bruijn says: ==================== refine packet socket rollover: 1. mitigate a case of lock contention 2. avoid exporting resource exhaustion to other sockets, by migrating only to a victim socket that has ample room 3. avoid reordering of most flows on the socket, by migrating first the flow responsible for load imbalance 4. help processes detect load imbalance, by exporting rollover counters Context: rollover implements flow migration in packet socket fanout groups in case of extreme load imbalance. It is a specific implementation of migration that minimizes reordering by selecting the same victim socket when possible (and by selecting subsequent victims in a round robin fashion, from which its name derives). Changes: v2 -> v3: - statistics: replace unsigned long with __aligned_u64 v1 -> v2: - huge flow detection: run lockless - huge flow detection: replace stored index with random - contention avoidance: test in packet_poll while lock held - contention avoidance: clear pressure sooner packet_poll and packet_recvmsg would clear only if the sock is empty to avoid taking the necessary lock. But, * packet_poll already holds this lock, so a lockless variant __packet_rcv_has_room is cheap. * packet_recvmsg is usually called only for non-ring sockets, which also runs lockless. - preparation: drop "single return" patch packet_rcv_has_room is now a locked wrapper around __packet_rcv_has_room, achieving the same (single footer). The benchmark mentioned in the patches is at https://github.com/wdebruij/kerneltools/blob/master/tests/bench_rollover.c ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:43:01 -04:00
Willem de Bruijn	a9b6391814	packet: rollover statistics Rollover indicates exceptional conditions. Export a counter to inform socket owners of this state. If no socket with sufficient room is found, rollover fails. Also count these events. Finally, also count when flows are rolled over early thanks to huge flow detection, to validate its correctness. Tested: Read counters in bench_rollover on all other tests in the patchset Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:43:00 -04:00
Willem de Bruijn	3b3a5b0aab	packet: rollover huge flows before small flows Migrate flows from a socket to another socket in the fanout group not only when the socket is full. Start migrating huge flows early, to divert possible 4-tuple attacks without affecting normal traffic. Introduce fanout_flow_is_huge(). This detects huge flows, which are defined as taking up more than half the load. It does so cheaply, by storing the rxhashes of the N most recent packets. If over half of these are the same rxhash as the current packet, then drop it. This only protects against 4-tuple attacks. N is chosen to fit all data in a single cache line. Tested: Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input. lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s cpu rx rx.k drop.k rollover r.huge r.failed 0 14 14 0 0 0 0 1 20 20 0 0 0 0 2 16 16 0 0 0 0 3 6168824 6168824 0 4867721 4867721 0 4 4867741 4867741 0 0 0 0 5 12 12 0 0 0 0 6 15 15 0 0 0 0 7 17 17 0 0 0 0 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:43:00 -04:00
Willem de Bruijn	2ccdbaa6d5	packet: rollover lock contention avoidance Rollover has to call packet_rcv_has_room on sockets in the fanout group to find a socket to migrate to. This operation is expensive especially if the packet sockets use rings, when a lock has to be acquired. Avoid pounding on the lock by all sockets by temporarily marking a socket as "under memory pressure" when such pressure is detected. While set, only the socket owner may call packet_rcv_has_room on the socket. Once it detects normal conditions, it clears the flag. The socket is not used as a victim by any other socket in the meantime. Under reasonably balanced load, each socket writer frequently calls packet_rcv_has_room and clears its own pressure field. As a backup for when the socket is rarely written to, also clear the flag on reading (packet_recvmsg, packet_poll) if this can be done cheaply (i.e., without calling packet_rcv_has_room). This is only for edge cases. Tested: Ran bench_rollover: a process with 8 sockets in a single fanout group, each pinned to a single cpu that receives one nic recv interrupt. RPS and RFS are disabled. The benchmark uses packet rx_ring, which has to take a lock when determining whether a socket has room. Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread uniformly across the packet sockets (and inserted an iptables rule to drop in PREROUTING to avoid protocol stack processing). Without this patch, all sockets try to migrate traffic to neighbors, causing lock contention when searching for a non- empty neighbor. The lock is the top 9 entries. perf record -a -g sleep 5 - 17.82% bench_rollover [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock - 99.00% spin_lock + 81.77% packet_rcv_has_room.isra.41 + 18.23% tpacket_rcv + 0.84% packet_rcv_has_room.isra.41 + 5.20% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock + 5.15% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock + 5.14% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock + 5.12% ksoftirqd/7 [kernel.kallsyms] [k] _raw_spin_lock + 5.12% ksoftirqd/5 [kernel.kallsyms] [k] _raw_spin_lock + 5.10% ksoftirqd/4 [kernel.kallsyms] [k] _raw_spin_lock + 4.66% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock + 4.45% ksoftirqd/3 [kernel.kallsyms] [k] _raw_spin_lock + 1.55% bench_rollover [kernel.kallsyms] [k] packet_rcv_has_room.isra.41 On net-next with this patch, this lock contention is no longer a top entry. Most time is spent in the actual read function. Next up are other locks: + 15.52% bench_rollover bench_rollover [.] reader + 4.68% swapper [kernel.kallsyms] [k] memcpy_erms + 2.77% swapper [kernel.kallsyms] [k] packet_lookup_frame.isra.51 + 2.56% ksoftirqd/1 [kernel.kallsyms] [k] memcpy_erms + 2.16% swapper [kernel.kallsyms] [k] tpacket_rcv + 1.93% swapper [kernel.kallsyms] [k] mlx4_en_process_rx_cq Looking closer at the remaining _raw_spin_lock, the cost of probing in rollover is now comparable to the cost of taking the lock later in tpacket_rcv. - 1.51% swapper [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock + 33.41% packet_rcv_has_room + 28.15% tpacket_rcv + 19.54% enqueue_to_backlog + 6.45% __free_pages_ok + 2.78% packet_rcv_fanout + 2.13% fanout_demux_rollover + 2.01% netif_receive_skb_internal Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:43:00 -04:00
Willem de Bruijn	9954729bc3	packet: rollover only to socket with headroom Only migrate flows to sockets that have sufficient headroom, where sufficient is defined as having at least 25% empty space. The kernel has three different buffer types: a regular socket, a ring with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The latter two do not expose a read pointer to the kernel, so headroom is not computed easily. All three needs a different implementation to estimate free space. Tested: Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input. bench_rollover has as many sockets as there are NIC receive queues in the system. Each socket is owned by a process that is pinned to one of the receive cpus. RFS is disabled. RPS is enabled with an identity mapping (cpu x -> cpu x), to count drops with softnettop. lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s Press [Enter] to exit cpu rx rx.k drop.k rollover r.huge r.failed 0 16 16 0 0 0 0 1 21 21 0 0 0 0 2 5227502 5227502 0 0 0 0 3 18 18 0 0 0 0 4 6083289 6083289 0 5227496 0 0 5 22 22 0 0 0 0 6 21 21 0 0 0 0 7 9 9 0 0 0 0 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:42:59 -04:00
Willem de Bruijn	0648ab70af	packet: rollover prepare: per-socket state Replace rollover state per fanout group with state per socket. Future patches will add fields to the new structure. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:42:59 -04:00
Willem de Bruijn	ad377cab49	packet: rollover prepare: move code out of callsites packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover logic into the callee to simplify these callsites, especially with upcoming changes. The main differences between the two callsites is that the FLAG variant tests whether the socket previously selected by another mode (RR, RND, HASH, ..) has room before migrating flows, whereas the rollover mode has no original socket to test. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:42:59 -04:00
Eric Dumazet	7d771aaac7	ipv4: __ip_local_out_sk() is static __ip_local_out_sk() is only used from net/ipv4/ip_output.c net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not declared. Should it be static? Fixes: `7026b1ddb6` ("netfilter: Pass socket pointer down through okfn().") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:21:33 -04:00
Eric Dumazet	216f8bb9f6	tcp/dccp: tw_timer_handler() is static tw_timer_handler() is only used from net/ipv4/inet_timewait_sock.c Fixes: `789f558cfb` ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:21:33 -04:00
David S. Miller	dd58c6359b	Merge branch 'cls_flower' Jiri Pirko says: ==================== introduce programable flow dissector and cls_flower Per Davem's request, I prepared this patchset which introduces programmable flow dissector. For current users of flow_keys, there is a wrapper skb_flow_dissect_flow_keys which maintains the previous behaviour. For purposes of cls_flower, couple of new dissection keys were introduced. Note that this dissector can be also eventually used by openvswitch code. Also, as a next step, I plan to get rid of skb_flow_get_ports(export) and __skb_get_poff as their functionality can be now implemented by skb_flow_dissect as well. v2->v3: - remove TCA_FLOWER_POLICE attr suggested by Jamal v1->v2: - move __skb_tx_hash rather to dev.c as suggested by Alex ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:48 -04:00
Jiri Pirko	77b9900ef5	tc: introduce Flower classifier This patch introduces a flow-based filter. So far, the very essential packet fields are supported. This patch is only the first step. There is a lot of potential performance improvements possible to implement. Also a lot of features are missing now. They will be addressed in follow-up patches. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2015-05-13 15:19:48 -04:00

1 2 3 4 5 ...

519707 Commits