IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
- Keep fragments equally sized, avoids some problems with too small fragments,
by Sven Eckelmann
- Initialize gateway class correctly when BATMAN V is compiled in,
by Sven Eckelmann
-----BEGIN PGP SIGNATURE-----
iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAljKiocWHHN3QHNpbW9u
d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoSYaD/4o3vSRQbiu95rzaPZbZ8cfFRYz
FYS7b/SLV8Q8uxzCo05jimnYB9qGD8bZZlAPNhNZLqkSsTXc8vqfwPnyB9ytasrp
A8jefoMlO0A5w3AQxBErzg3wrYr5Gz6lqkgqZTv06Sb6xa7F/OEo9pMjdbCBX15D
G9y1FZdLZVhdVE8r9yUohbfcnZ4KkfBlViEXnjAcJoV2x4OOVGQjDdBGpaEQnwyk
wyGEFuIO2oWW6AlQXCbXomTUZ6bAzJ43lPbRDWTVdrVJg2RV65F4MsNDfLpXVBQV
GvmB33WVTzphQshxvhnjs2uWglnj8KGVcvpIp7GuRfGgpFJh9tbS2B0bITXCVxWU
+HkNja6QwObwhEvAWfINaTT9RR1kxWAQk6YB9VEq3RJoHufP49felOZ877oAkK8T
7CwazIe4rOhA+ISLUcUFk5W9B46DusPRyfhzrDZ6EM0P/ppGWpKAioawltLqX2P3
kBmIW8kB6YRHnTeCHJZTjdPdQkYeIAYVunnhLvRrmTbf0oyVh3+UnwgAVquwJj/5
WsEqNPsP0B++nsWmT+MPVOKj/JwcmtKfaDY2sor72aN6TZxq7MF9VGdcZiOu9neB
zILNcb0oIvUahb2SO2u/gH2E/4am+ED5StQmNxHdJf1A+iCYGXEy/yETgeIm51/S
TaHd1aH3VwmCR0sAKw==
=zGdm
-----END PGP SIGNATURE-----
Merge tag 'batadv-net-for-davem-20170316' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here are two batman-adv bugfixes:
- Keep fragments equally sized, avoids some problems with too small fragments,
by Sven Eckelmann
- Initialize gateway class correctly when BATMAN V is compiled in,
by Sven Eckelmann
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The code introduced by commit 2ccccf5fb43f ("net_sched: update
hierarchical backlog too") only sets prev_backlog in fq_codel_dequeue()
but not using that anywhere, remove that setting.
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Added a case for OVS_TUNNEL_KEY_ATTR_PAD to the switch statement
in ip_tun_from_nlattr in order to prevent the default case
returning an error.
Fixes: b46f6ded906e ("libnl: nla_put_be64(): align on a 64-bit area")
Signed-off-by: Kris Murphy <kriskend@linux.vnet.ibm.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In commit c3852ef7f2f8 ("ipv4: fib: Replay events when registering FIB
notifier") we dumped the FIB tables and replayed the events to the
passed notification block.
However, we merely sent a RULE_ADD notification in case custom rules
were in use. As explained in previous patches, this approach won't work
anymore. Instead, we should notify the caller about all the FIB rules
and let it act accordingly.
Upon registration to the FIB notification chain, replay a RULE_ADD
notification for each programmed FIB rule, custom or not. The integrity
of the dump is ensured by the mechanism introduced in the above
mentioned commit.
Prevent regressions by making sure current listeners correctly sanitize
the notified rules.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Whenever a FIB rule is added or removed, a notification is sent in the
FIB notification chain. However, listeners don't have a way to tell
which rule was added or removed.
This is problematic as we would like to give listeners the ability to
decide which action to execute based on the notified rule. Specifically,
offloading drivers should be able to determine if they support the
reflection of the notified FIB rule and flush their LPM tables in case
they don't.
Do that by adding a notifier info to these notifications and embed the
common FIB rule struct in it.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, when non-default (custom) FIB rules are used, devices capable
of layer 3 offloading flush their tables and let the kernel do the
forwarding instead.
When these devices' drivers are loaded they register to the FIB
notification chain, which lets them know about the existence of any
custom FIB rules. This is done by sending a RULE_ADD notification based
on the value of 'net->ipv4.fib_has_custom_rules'.
This approach is problematic when VRF offload is taken into account, as
upon the creation of the first VRF netdev, a l3mdev rule is programmed
to direct skbs to the VRF's table.
Instead of merely reading the above value and sending a single RULE_ADD
notification, we should iterate over all the FIB rules and send a
detailed notification for each, thereby allowing offloading drivers to
sanitize the rules they don't support and potentially flush their
tables.
While l3mdev rules are uniquely marked, the default rules are not.
Therefore, when they are being notified they might invoke offloading
drivers to unnecessarily flush their tables.
Solve this by adding an helper to check if a FIB rule is a default rule.
Namely, its selector should match all packets and its action should
point to the local, main or default tables.
As noted by David Ahern, uniquely marking the default rules is
insufficient. When using VRFs, it's common to avoid false hits by moving
the rule for the local table to just before the main table:
Default configuration:
$ ip rule show
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
Common configuration with VRFs:
$ ip rule show
1000: from all lookup [l3mdev-table]
32765: from all lookup local
32766: from all lookup main
32767: from all lookup default
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
At most it is used for debugging purpose, but I don't think
it is even useful for debugging, just remove it.
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
to be perfectly accurate - there are various error paths that miss unlock
of the RTNL.
To fix those, change the locking a bit to not be conditional in all those
nl80211_prepare_*_dump() functions, but make those require the RTNL to
start with, and fix the buggy error paths. This also let me use sparse
(by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
validate the changes.
Cc: stable@vger.kernel.org
Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
I mistakenly added the code to release sk->sk_frag in
sk_common_release() instead of sk_destruct()
TCP sockets using sk->sk_allocation == GFP_ATOMIC do no call
sk_common_release() at close time, thus leaking one (order-3) page.
iSCSI is using such sockets.
Fixes: 5640f7685831 ("net: use a per task frag allocator")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If a DSA switch driver cannot program an ageing time value due to it
being out-of-range, switchdev will raise a stack trace before failing.
To fix this, add ageing_time_min and ageing_time_max members to the
dsa_switch in order for the switch drivers to optionally specify their
supported ageing time limits.
The DSA core will now check for provided ageing time limits and return
-ERANGE from the switchdev prepare phase if the value is out-of-range.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ageing time is defined as unsigned int, so make
dsa_fastest_ageing_time return an unsigned int instead of int.
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The configurable priority to traffic class mapping and the user specified
queue ranges are used to configure the traffic class, overriding the
hardware defaults when the 'hw' option is set to 0. However, when the 'hw'
option is non-zero, the hardware QOS defaults are used.
This patch makes it so that we can pass the data the user provided to
ndo_setup_tc. This allows us to pull in the queue configuration if the
user requested it as well as any additional hardware offload type
requested by using a value other than 1 for the hw value.
Finally it also provides a means for the device driver to return the level
supported for the offload type via the qopt->hw value. Previously we were
just always assuming the value to be 1, in the future values beyond just 1
may be supported.
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch is meant to allow for support of multiple hardware offload type
for a single device. There is currently no bounds checking for the hw
member of the mqprio_qopt structure. This results in us being able to pass
values from 1 to 255 with all being treated the same. On retreiving the
value it is returned as 1 for anything 1 or greater being set.
With this change we are currently adding limited bounds checking by
defining an enum and using those values to limit the reported hardware
offloads.
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following patchset contains Netfilter fixes for your net tree, a
rather large batch of fixes targeted to nf_tables, conntrack and bridge
netfilter. More specifically, they are:
1) Don't track fragmented packets if the socket option IP_NODEFRAG is set.
From Florian Westphal.
2) SCTP protocol tracker assumes that ICMP error messages contain the
checksum field, what results in packet drops. From Ying Xue.
3) Fix inconsistent handling of AH traffic from nf_tables.
4) Fix new bitmap set representation with big endian. Fix mismatches in
nf_tables due to incorrect big endian handling too. Both patches
from Liping Zhang.
5) Bridge netfilter doesn't honor maximum fragment size field, cap to
largest fragment seen. From Florian Westphal.
6) Fake conntrack entry needs to be aligned to 8 bytes since the 3 LSB
bits are now used to store the ctinfo. From Steven Rostedt.
7) Fix element comments with the bitmap set type. Revert the flush
field in the nft_set_iter structure, not required anymore after
fixing up element comments.
8) Missing error on invalid conntrack direction from nft_ct, also from
Liping Zhang.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When dealing with ipv6 source tunnel key address attribute
(OVS_TUNNEL_KEY_ATTR_IPV6_SRC) we are wrongly setting the tunnel
dst ip, fix that.
Fixes: 6b26ba3a7d95 ('openvswitch: netlink attributes for IPv6 tunneling')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Reported-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Conflicts:
drivers/net/ethernet/broadcom/genet/bcmgenet.c
net/core/sock.c
Conflicts were overlapping changes in bcmgenet and the
lockdep handling of sockets.
Signed-off-by: David S. Miller <davem@davemloft.net>
We should jump to invoke __nft_ct_set_destroy() instead of just
return error.
Fixes: edee4f1e9245 ("netfilter: nft_ct: add zone id set support")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Pull networking fixes from David Miller:
1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
from Steffen Klassert.
2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
Alexei Starovoitov.
3) Fix detection of VLAN fitlering feature for bnx2x VF devices, from
Michal Schmidt.
4) We can get a divide by zero when TCP socket are morphed into
listening state, fix from Eric Dumazet.
5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
skb_complete_tx_timestamp(). From Eric Dumazet.
6) Use after free in dccp_feat_activate_values(), also from Eric
Dumazet.
7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
Jarod Wilson.
8) Fix use after free in vrf_xmit(), from David Ahern.
9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
Alexey Kodanev.
10) Properly check napi_complete_done() return value in order to decide
whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
Lendacky.
11) Fix double free of hwmon device in marvell phy driver, from Andrew
Lunn.
12) Don't crash on malformed netlink attributes in act_connmark, from
Etienne Noss.
13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
from Sabrina Dubroca.
14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
Florian Westphal.
15) Fix routing redirect races in dccp and tcp, basically the ICMP
handler can't modify the socket's cached route in it's locked by the
user at this moment. From Jon Maxwell.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
qed: Enable iSCSI Out-of-Order
qed: Correct out-of-bound access in OOO history
qed: Fix interrupt flags on Rx LL2
qed: Free previous connections when releasing iSCSI
qed: Fix mapping leak on LL2 rx flow
qed: Prevent creation of too-big u32-chains
qed: Align CIDs according to DORQ requirement
mlxsw: reg: Fix SPVMLR max record count
mlxsw: reg: Fix SPVM max record count
net: Resend IGMP memberships upon peer notification.
dccp: fix memory leak during tear-down of unsuccessful connection request
tun: fix premature POLLOUT notification on tun devices
dccp/tcp: fix routing redirect race
ucc/hdlc: fix two little issue
vxlan: fix ovs support
net: use net->count to check whether a netns is alive or not
bridge: drop netfilter fake rtable unconditionally
ipv6: avoid write to a possibly cloned skb
net: wimax/i2400m: fix NULL-deref at probe
isdn/gigaset: fix NULL-deref at probe
...
When we notify peers of potential changes, it's also good to update
IGMP memberships. For example, during VM migration, updating IGMP
memberships will redirect existing multicast streams to the VM at the
new location.
Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
silences the below warning:
net/core/lwtunnel.c: In function ‘lwtunnel_valid_encap_type_attr’:
net/core/lwtunnel.c:165:17: warning: variable ‘nla’ set but not used
[-Wunused-but-set-variable]
Fixes: 9ed59592e3e3 ("lwtunnel: fix autoload of lwt modules")
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When some errors occur, the scatter/gather list mapped to DMA addresses
should be handled.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function rds_ib_map_fmr is used only in the ib_fmr.c
file. As such, the static type is added to limit it in this file.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The function ib_dealloc_fmr will never be called. As such, it should
be removed.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When rdma_accept fails, rdma_reject is called in it. As such, it is
not necessary to execute rdma_reject again.
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes a memory leak, which happens if the connection request
is not fulfilled between parsing the DCCP options and handling the SYN
(because e.g. the backlog is full), because we forgot to free the
list of ack vectors.
Reported-by: Jianwen Ji <jiji@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
As Eric Dumazet pointed out this also needs to be fixed in IPv6.
v2: Contains the IPv6 tcp/Ipv6 dccp patches as well.
We have seen a few incidents lately where a dst_enty has been freed
with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
dst_entry. If the conditions/timings are right a crash then ensues when the
freed dst_entry is referenced later on. A Common crashing back trace is:
#8 [] page_fault at ffffffff8163e648
[exception RIP: __tcp_ack_snd_check+74]
.
.
#9 [] tcp_rcv_established at ffffffff81580b64
#10 [] tcp_v4_do_rcv at ffffffff8158b54a
#11 [] tcp_v4_rcv at ffffffff8158cd02
#12 [] ip_local_deliver_finish at ffffffff815668f4
#13 [] ip_local_deliver at ffffffff81566bd9
#14 [] ip_rcv_finish at ffffffff8156656d
#15 [] ip_rcv at ffffffff81566f06
#16 [] __netif_receive_skb_core at ffffffff8152b3a2
#17 [] __netif_receive_skb at ffffffff8152b608
#18 [] netif_receive_skb at ffffffff8152b690
#19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
#20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
#21 [] net_rx_action at ffffffff8152bac2
#22 [] __do_softirq at ffffffff81084b4f
#23 [] call_softirq at ffffffff8164845c
#24 [] do_softirq at ffffffff81016fc5
#25 [] irq_exit at ffffffff81084ee5
#26 [] do_IRQ at ffffffff81648ff8
Of course it may happen with other NIC drivers as well.
It's found the freed dst_entry here:
224 static bool tcp_in_quickack_mode(struct sock *sk)↩
225 {↩
226 ▹ const struct inet_connection_sock *icsk = inet_csk(sk);↩
227 ▹ const struct dst_entry *dst = __sk_dst_get(sk);↩
228 ↩
229 ▹ return (dst && dst_metric(dst, RTAX_QUICKACK)) ||↩
230 ▹ ▹ (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);↩
231 }↩
But there are other backtraces attributed to the same freed dst_entry in
netfilter code as well.
All the vmcores showed 2 significant clues:
- Remote hosts behind the default gateway had always been redirected to a
different gateway. A rtable/dst_entry will be added for that host. Making
more dst_entrys with lower reference counts. Making this more probable.
- All vmcores showed a postitive LockDroppedIcmps value, e.g:
LockDroppedIcmps 267
A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
regardless of whether user space has the socket locked. This can result in a
race condition where the same dst_entry cached in sk->sk_dst_entry can be
decremented twice for the same socket via:
do_redirect()->__sk_dst_check()-> dst_release().
Which leads to the dst_entry being prematurely freed with another socket
pointing to it via sk->sk_dst_cache and a subsequent crash.
To fix this skip do_redirect() if usespace has the socket locked. Instead let
the redirect take place later when user space does not have the socket
locked.
The dccp/IPv6 code is very similar in this respect, so fixing it there too.
As Eric Garver pointed out the following commit now invalidates routes. Which
can set the dst->obsolete flag so that ipv4_dst_check() returns null and
triggers the dst_release().
Fixes: ceb3320610d6 ("ipv4: Kill routes during PMTU/redirect updates.")
Cc: Eric Garver <egarver@redhat.com>
Cc: Hannes Sowa <hsowa@redhat.com>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The previous idea was to check whether a net namespace is in
net_exit_list or not. It doesn't work, because net->exit_list is used in
__register_pernet_operations and __unregister_pernet_operations where
all namespaces are added to a temporary list to make cleanup in a error
case, so list_empty(&net->exit_list) always returns false.
Reported-by: Mantas Mikulėnas <grawity@gmail.com>
Fixes: 002d8a1a6c11 ("net: skip genenerating uevents for network namespaces that are exiting")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andrey reported this kernel warning:
WARNING: CPU: 0 PID: 4114 at kernel/sched/core.c:7737 __might_sleep+0x149/0x1a0
do not call blocking ops when !TASK_RUNNING; state=1 set at
[<ffffffff813fcb22>] prepare_to_wait+0x182/0x530
The deeply nested alloc_skb is a problem.
Diagnosis: nesting is wrong. It makes zero sense. Fix it and the
implicit task state change problem automagically goes away.
alloc_skb() does not need to be in the "while" loop.
alloc_skb() does not need to be in the {prepare_to_wait/add_wait_queue ...
finish_wait/remove_wait_queue} block.
I claim that:
- alloc_tx() should only perform the "wait_for_decent_tx_drain" part
- alloc_skb() ought to be done directly in vcc_sendmsg
- alloc_skb() failure can be handled gracefully in vcc_sendmsg
- alloc_skb() may use a (m->msg_flags & MSG_DONTWAIT) dependent
GFP_{KERNEL / ATOMIC} flag
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-and-Tested-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: Chas Williams <3chas3@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow TTL propagation from IP packets to MPLS packets to be
configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which
allows the TTL to be set in the resulting MPLS packet, with the value
of 0 having the semantics of enabling propagation of the TTL from the
IP header (i.e. non-zero values disable propagation).
Also allow the configuration to be overridden globally by reusing the
same sysctl to control whether the TTL is propagated from IP packets
into the MPLS header. If the per-LWT attribute is set then it
overrides the global configuration. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl,
"net.mpls.default_ttl". This is kept separate from the configuration
of whether IP TTL propagation is enabled as it can be used in the
future when non-IP payloads are supported (i.e. where there is no
payload TTL that can be propagated).
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Provide the ability to control on a per-route basis whether the TTL
value from an MPLS packet is propagated to an IPv4/IPv6 packet when
the last label is popped as per the theoretical model in RFC 3443
through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to
mean disable propagation and 1 to mean enable propagation.
In order to provide the ability to change the behaviour for packets
arriving with IPv4/IPv6 Explicit Null labels and to provide an easy
way for a user to change the behaviour for all existing routes without
having to reprogram them, a global knob is provided. This is done
through the addition of a new per-namespace sysctl,
"net.mpls.ip_ttl_propagate", which defaults to enabled. If the
per-route attribute is set (either enabled or disabled) then it
overrides the global configuration.
Signed-off-by: Robert Shearman <rshearma@brocade.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andreas reports kernel oops during rmmod of the br_netfilter module.
Hannes debugged the oops down to a NULL rt6info->rt6i_indev.
Problem is that br_netfilter has the nasty concept of adding a fake
rtable to skb->dst; this happens in a br_netfilter prerouting hook.
A second hook (in bridge LOCAL_IN) is supposed to remove these again
before the skb is handed up the stack.
However, on module unload hooks get unregistered which means an
skb could traverse the prerouting hook that attaches the fake_rtable,
while the 'fake rtable remove' hook gets removed from the hooklist
immediately after.
Fixes: 34666d467cbf1e2e3c7 ("netfilter: bridge: move br_netfilter out of the core")
Reported-by: Andreas Karis <akaris@redhat.com>
Debugged-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_fragment, in case skb has a fraglist, checks if the
skb is cloned. If it is, it will move to the 'slow path' and allocates
new skbs for each fragment.
However, right before entering the slowpath loop, it updates the
nexthdr value of the last ipv6 extension header to NEXTHDR_FRAGMENT,
to account for the fragment header that will be inserted in the new
ipv6-fragment skbs.
In case original skb is cloned this munges nexthdr value of another
skb. Avoid this by doing the nexthdr update for each of the new fragment
skbs separately.
This was observed with tcpdump on a bridge device where netfilter ipv6
reassembly is active: tcpdump shows malformed fragment headers as
the l4 header (icmpv6, tcp, etc). is decoded as a fragment header.
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Andreas Karis <akaris@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 27596472473a ("ipv6: fix ECMP route replacement") introduced a
loop that removes all siblings of an ECMP route that is being
replaced. However, this loop doesn't stop when it has replaced
siblings, and keeps removing other routes with a higher metric.
We also end up triggering the WARN_ON after the loop, because after
this nsiblings < 0.
Instead, stop the loop when we have taken care of all routes with the
same metric as the route being replaced.
Reproducer:
===========
#!/bin/sh
ip netns add ns1
ip netns add ns2
ip -net ns1 link set lo up
for x in 0 1 2 ; do
ip link add veth$x netns ns2 type veth peer name eth$x netns ns1
ip -net ns1 link set eth$x up
ip -net ns2 link set veth$x up
done
ip -net ns1 -6 r a 2000::/64 nexthop via fe80::0 dev eth0 \
nexthop via fe80::1 dev eth1 nexthop via fe80::2 dev eth2
ip -net ns1 -6 r a 2000::/64 via fe80::42 dev eth0 metric 256
ip -net ns1 -6 r a 2000::/64 via fe80::43 dev eth0 metric 2048
echo "before replace, 3 routes"
ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
echo
ip -net ns1 -6 r c 2000::/64 nexthop via fe80::4 dev eth0 \
nexthop via fe80::5 dev eth1 nexthop via fe80::6 dev eth2
echo "after replace, only 2 routes, metric 2048 is gone"
ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
Fixes: 27596472473a ("ipv6: fix ECMP route replacement")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
Karel Rericha reported that in his test case, ICMP packets going through
boxes had normally about 5ms latency. But when running nft, actually
listing the sets with interval flags, latency would go up to 30-100ms.
This was observed when router throughput is from 600Mbps to 2Gbps.
This is because we use a single global spinlock to protect the whole
rbtree sets, so "dumping sets" will race with the "key lookup" inevitably.
But actually they are all _readers_, so it's ok to convert the spinlock
to rwlock to avoid competition between them. Also use per-set rwlock since
each set is independent.
Reported-by: Karel Rericha <karel@unitednetworks.cz>
Tested-by: Karel Rericha <karel@unitednetworks.cz>
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The limit token is independent between each rules, so there's no
need to use a global spinlock.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
also mark init_conntrack noinline, in most cases resolve_normal_ct will
find an existing conntrack entry.
text data bss dec hex filename
16735 5707 176 22618 585a net/netfilter/nf_conntrack_core.o
16687 5707 176 22570 582a net/netfilter/nf_conntrack_core.o
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This reverts commit 1f48ff6c5393aa7fe290faf5d633164f105b0aa7.
This patch is not required anymore now that we keep a dummy list of
set elements in the bitmap set implementation, so revert this before
we forget this code has no clients.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Instead of the actual interface index or name, set destination register
to just 1 or 0 depending on whether the lookup succeeded or not if
NFTA_FIB_F_PRESENT was set in userspace.
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
this allows to assign connection tracking helpers to
connections via nft objref infrastructure.
The idea is to first specifiy a helper object:
table ip filter {
ct helper some-name {
type "ftp"
protocol tcp
l3proto ip
}
}
and then assign it via
nft add ... ct helper set "some-name"
helper assignment works for new conntracks only as we cannot expand the
conntrack extension area once it has been committed to the main conntrack
table.
ipv4 and ipv6 protocols are tracked stored separately so
we can also handle families that observe both ipv4 and ipv6 traffic.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
this is needed by the upcoming ct helper object type --
we'd like to be able use the table family (ip, ip6, inet) to figure
out which helper has to be requested.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Element comments may come without any prior set flag, so we have to keep
a list of dummy struct nft_set_ext to keep this information around. This
is only useful for set dumps to userspace. From the packet path, this
set type relies on the bitmap representation. This patch simplifies the
logic since we don't need to allocate the dummy nft_set_ext structure
anymore on the fly at the cost of increasing memory consumption because
of the list of dummy struct nft_set_ext.
Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Since the nfct and nfctinfo have been combined, the nf_conn structure
must be at least 8 bytes aligned, as the 3 LSB bits are used for the
nfctinfo. But there's a fake nf_conn structure to denote untracked
connections, which is created by a PER_CPU construct. This does not
guarantee that it will be 8 bytes aligned and can break the logic in
determining the correct nfctinfo.
I triggered this on a 32bit machine with the following error:
BUG: unable to handle kernel NULL pointer dereference at 00000af4
IP: nf_ct_deliver_cached_events+0x1b/0xfb
*pdpt = 0000000031962001 *pde = 0000000000000000
Oops: 0000 [#1] SMP
[Modules linked in: ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 crc_ccitt ppdev r8169 parport_pc parport
OK ]
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-test+ #75
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
task: c126ec00 task.stack: c1258000
EIP: nf_ct_deliver_cached_events+0x1b/0xfb
EFLAGS: 00010202 CPU: 0
EAX: 0021cd01 EBX: 00000000 ECX: 27b0c767 EDX: 32bcb17a
ESI: f34135c0 EDI: f34135c0 EBP: f2debd60 ESP: f2debd3c
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: 00000af4 CR3: 309a0440 CR4: 001406f0
Call Trace:
<SOFTIRQ>
? ipv6_skip_exthdr+0xac/0xcb
ipv6_confirm+0x10c/0x119 [nf_conntrack_ipv6]
nf_hook_slow+0x22/0xc7
nf_hook+0x9a/0xad [ipv6]
? ip6t_do_table+0x356/0x379 [ip6_tables]
? ip6_fragment+0x9e9/0x9e9 [ipv6]
ip6_output+0xee/0x107 [ipv6]
? ip6_fragment+0x9e9/0x9e9 [ipv6]
dst_output+0x36/0x4d [ipv6]
NF_HOOK.constprop.37+0xb2/0xba [ipv6]
? icmp6_dst_alloc+0x2c/0xfd [ipv6]
? local_bh_enable+0x14/0x14 [ipv6]
mld_sendpack+0x1c5/0x281 [ipv6]
? mark_held_locks+0x40/0x5c
mld_ifc_timer_expire+0x1f6/0x21e [ipv6]
call_timer_fn+0x135/0x283
? detach_if_pending+0x55/0x55
? mld_dad_timer_expire+0x3e/0x3e [ipv6]
__run_timers+0x111/0x14b
? mld_dad_timer_expire+0x3e/0x3e [ipv6]
run_timer_softirq+0x1c/0x36
__do_softirq+0x185/0x37c
? test_ti_thread_flag.constprop.19+0xd/0xd
do_softirq_own_stack+0x22/0x28
</SOFTIRQ>
irq_exit+0x5a/0xa4
smp_apic_timer_interrupt+0x2a/0x34
apic_timer_interrupt+0x37/0x3c
By using DEFINE/DECLARE_PER_CPU_ALIGNED we can enforce at least 8 byte
alignment as all cache line sizes are at least 8 bytes or more.
Fixes: a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage area")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
consider a bridge with mtu 9000, but end host sending smaller
packets to another host with mtu < 9000.
In this case, after reassembly, bridge+defrag would refragment,
and then attempt to send the reassembled packet as long as it
was below 9k.
Instead we have to cap by the largest fragment size seen.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently, there are two different methods to store an u16 integer to
the u32 data register. For example:
u32 *dest = ®s->data[priv->dreg];
1. *dest = 0; *(u16 *) dest = val_u16;
2. *dest = val_u16;
For method 1, the u16 value will be stored like this, either in
big-endian or little-endian system:
0 15 31
+-+-+-+-+-+-+-+-+-+-+-+-+
| Value | 0 |
+-+-+-+-+-+-+-+-+-+-+-+-+
For method 2, in little-endian system, the u16 value will be the same
as listed above. But in big-endian system, the u16 value will be stored
like this:
0 15 31
+-+-+-+-+-+-+-+-+-+-+-+-+
| 0 | Value |
+-+-+-+-+-+-+-+-+-+-+-+-+
So later we use "memcmp(®s->data[priv->sreg], data, 2);" to do
compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong
result in big-endian system, as 0~15 bits will always be zero.
For the similar reason, when loading an u16 value from the u32 data
register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;",
the 2nd method will get the wrong value in the big-endian system.
So introduce some wrapper functions to store/load an u8 or u16
integer to/from the u32 data register, and use them in the right
place.
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Currently we just assume the element key as a u32 integer, regardless of
the set key length.
This is incorrect, for example, the tcp port number is only 16 bits.
So when we use the nft_payload expr to get the tcp dport and store
it to dreg, the dport will be stored at 0~15 bits, and 16~31 bits
will be padded with zero.
So the reg->data[dreg] will be looked like as below:
0 15 31
+-+-+-+-+-+-+-+-+-+-+-+-+
| tcp dport | 0 |
+-+-+-+-+-+-+-+-+-+-+-+-+
But for these big-endian systems, if we treate this register as a u32
integer, the element key will be larger than 65535, so the following
lookup in bitmap set will cause out of bound access.
Another issue is that if we add element with comments in bitmap
set(although the comments will be ignored eventually), the element will
vanish strangely. Because we treate the element key as a u32 integer, so
the comments will become the part of the element key, then the element
key will also be larger than 65535 and out of bound access will happen:
# nft add element t s { 1 comment test }
Since set->klen is 1 or 2, it's fine to treate the element key as a u8 or
u16 integer.
Fixes: 665153ff5752 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 49b499718fa1 ("net: sched: make default fifo qdiscs appear in the dump")
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use setup_timer() instead of init_timer() to simplify the code.
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multipath routes can be rendered usesless when a device in one of the
paths is deleted. For example:
$ ip -f mpls ro ls
100
nexthop as to 200 via inet 172.16.2.2 dev virt12
nexthop as to 300 via inet 172.16.3.2 dev br0
101
nexthop as to 201 via inet6 2000:2::2 dev virt12
nexthop as to 301 via inet6 2000:3::2 dev br0
$ ip li del br0
When br0 is deleted the other hop is not considered in
mpls_select_multipath because of the alive check -- rt_nhn_alive
is 0.
rt_nhn_alive is decremented once in mpls_ifdown when the device is taken
down (NETDEV_DOWN) and again when it is deleted (NETDEV_UNREGISTER). For
a 2 hop route, deleting one device drops the alive count to 0. Since
devices are taken down before unregistering, the decrement on
NETDEV_UNREGISTER is redundant.
Fixes: c89359a42e2a4 ("mpls: support for dead routes")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the mpls_router module is unloaded, mpls routes are deleted but
notifications are not sent to userspace leaving userspace caches
out of sync. Add the call to mpls_notify_route in mpls_net_exit as
routes are freed.
Fixes: 0189197f44160 ("mpls: Basic routing support")
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>