IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Eliminate the assumption that SKBs and SKB list heads can
be cast to eachother in SKB list handling code.
This change also appears to fix a bug since the list->next pointer is
sampled outside of holding the SKB queue lock.
Signed-off-by: David S. Miller <davem@davemloft.net>
It turns out I missed one VLAN_TAG_PRESENT in OVS code while rebasing.
This fixes it.
Fixes: 9df46aefafa6 ("OVS: remove use of VLAN_TAG_PRESENT")
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCA_FLOWER_KEY_ENC_OPTS and TCA_FLOWER_KEY_ENC_OPTS_MASK can only
currently contain further nested attributes, which are parsed by
hand, so the policy is never actually used resulting in a W=1
build warning:
net/sched/cls_flower.c:492:1: warning: ‘enc_opts_policy’ defined but not used [-Wunused-const-variable=]
enc_opts_policy[TCA_FLOWER_KEY_ENC_OPTS_MAX + 1] = {
Add the validation anyway to avoid potential bugs when other
attributes are added and to make the attribute structure slightly
more clear. Validation will also set extact to point to bad
attribute on error.
Fixes: 0a6e77784f49 ("net/sched: allow flower to match tunnel options")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In the udp6 code path, we needed multiple tests to select the correct
mib to be updated. Since we touch at least a counter at each iteration,
it's convenient to use the recently introduced __UDPX_MIB() helper once
and remove some code duplication.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Only first fragment has the sport/dport information,
not the following ones.
If we want consistent hash for all fragments, we need to
ignore ports even for first fragment.
This bug is visible for IPv6 traffic, if incoming fragments
do not have a flow label, since skb_get_hash() will give
different results for first fragment and following ones.
It is also visible if any routing rule wants dissection
and sport or dport.
See commit 5e5d6fed3741 ("ipv6: route: dissect flow
in input path if fib rules need it") for details.
[edumazet] rewrote the changelog completely.
Fixes: 06635a35d13d ("flow_dissect: use programable dissector in skb_flow_dissect and friends")
Signed-off-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
if skb is NULL pointer, and the following access of skb's
skb_mstamp_ns will trigger panic, which is same as BUG_ON
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When the socket is CAN FD enabled it can handle CAN FD frame
transmissions. Add an additional check in raw_sendmsg() as a CAN2.0 CAN
driver (non CAN FD) should never see a CAN FD frame. Due to the commonly
used can_dropped_invalid_skb() function the CAN 2.0 driver would drop
that CAN FD frame anyway - but with this patch the user gets a proper
-EINVAL return code.
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
This patch proposes to extend the sk_lookup() BPF API to the
XDP hookpoint. The sk_lookup() helper supports a lookup
on incoming packet to find the corresponding socket that will
receive this packet. Current support for this BPF API is
at the tc hookpoint. This patch will extend this API at XDP
hookpoint. A XDP program can map the incoming packet to the
5-tuple parameter and invoke the API to find the corresponding
socket structure.
Signed-off-by: Nitin Hande <Nitin.Hande@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
This patch allows eBPF programs that use sock_ops to send perf
based event notifications using bpf_perf_event_output(). Our main
use case for this is the following:
We would like to monitor some subset of TCP sockets in user-space,
(the monitoring application would define 4-tuples it wants to monitor)
using TCP_INFO stats to analyze reported problems. The idea is to
use those stats to see where the bottlenecks are likely to be ("is
it application-limited?" or "is there evidence of BufferBloat in
the path?" etc).
Today we can do this by periodically polling for tcp_info, but this
could be made more efficient if the kernel would asynchronously
notify the application via tcp_info when some "interesting"
thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc)
are reached. And to make this effective, it is better if
we could apply the threshold check *before* constructing the
tcp_info netlink notification, so that we don't waste resources
constructing notifications that will be discarded by the filter.
This work solves the problem by adding perf event based notification
support for sock_ops. The eBPF program can thus be designed to apply
any desired filters to the bpf_sock_ops and trigger a perf event
notification based on the evaluation from the filter. The user space
component can use these perf event notifications to either read any
state managed by the eBPF program, or issue a TCP_INFO netlink call
if desired.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Co-developed-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
To mirror software behaviour on offload more precisely inform
the drivers about the state of the harddrop flag.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Recently, in commit ab408b6dc744 ("tcp: switch tcp and sch_fq to new
earliest departure time model"), the TCP BBR code switched to a new
approach of using an explicit bbr_pacing_margin_percent for shaving a
pacing rate "haircut", rather than the previous implict
approach. Update an old comment to reflect the new approach.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes assumption than vlan_tci != 0 when tag is present.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
This removes assumptions about VLAN_TAG_PRESENT bit.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
__skb_checksum_complete_head() and __skb_checksum_complete()
are both declared in skbuff.h, they fit better in skbuff.c
than datagram.c.
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It's redundancy for the drivers to hold the list of vlans when
absolutely the same list exists in vlan core. In most cases it's
needed only to traverse the vlan devices, their vids and sync some
settings with h/w, so add API to simplify this.
At least some of these drivers also can benefit:
grep "for_each.*vid" -r drivers/net/ethernet/
drivers/net/ethernet/hisilicon/hns3/hns3_enet.c:
drivers/net/ethernet/synopsys/dwc-xlgmac-hw.c:
drivers/net/ethernet/qlogic/qlge/qlge_main.c:
drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c:
drivers/net/ethernet/via/via-rhine.c:
drivers/net/ethernet/via/via-velocity.c:
drivers/net/ethernet/intel/igb/igb_main.c:
drivers/net/ethernet/intel/ice/ice_main.c:
drivers/net/ethernet/intel/e1000/e1000_main.c:
drivers/net/ethernet/intel/i40e/i40e_main.c:
drivers/net/ethernet/intel/e1000e/netdev.c:
drivers/net/ethernet/intel/igbvf/netdev.c:
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c:
drivers/net/ethernet/intel/ixgb/ixgb_main.c:
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:
drivers/net/ethernet/amd/xgbe/xgbe-dev.c:
drivers/net/ethernet/emulex/benet/be_main.c:
drivers/net/ethernet/neterion/vxge/vxge-main.c:
drivers/net/ethernet/adaptec/starfire.c:
drivers/net/ethernet/brocade/bna/bnad.c:
Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to avoid all table update, and only remove or add new
address, the auxiliary function exists, named __hw_addr_sync_dev().
It allows end driver do nothing when nothing changed and add/rm when
concrete address is firstly added or lastly removed. But it doesn't
include cases when an address of real device or vlan was reused by
other vlans or vlan/macval devices.
For handaling events when address was reused/unreused the patch adds
new auxiliary routine - __hw_addr_ref_sync_dev(). It allows to do
nothing when nothing was changed and do updates only for an address
being added/reused/deleted/unreused. Thus, clone address changes for
vlans can be mirrored in the table. The function is exclusive with
__hw_addr_sync_dev(). It's responsibility of the end driver to
identify address vlan device, if it needs so.
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a minimal change to allow removing of VLAN_TAG_PRESENT.
It leaves OVS unable to use CFI bit, as fixing this would need
a deeper surgery involving userspace interface.
Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
When setting the SO_MARK socket option, if the mark changes, the dst
needs to be reset so that a new route lookup is performed.
This fixes the case where an application wants to change routing by
setting a new sk_mark. If this is done after some packets have already
been sent, the dst is cached and has no effect.
Signed-off-by: David Barmann <david.barmann@stackpath.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Multiple cpus might attempt to insert a new fragment in rhashtable,
if for example RPS is buggy, as reported by 배석진 in
https://patchwork.ozlabs.org/patch/994601/
We use rhashtable_lookup_get_insert_key() instead of
rhashtable_insert_fast() to let cpus losing the race
free their own inet_frag_queue and use the one that
was inserted by another cpu.
Fixes: 648700f76b03 ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: 배석진 <soukjin.bae@samsung.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
if local is NULL pointer, and the following access of local's
dev will trigger panic, which is same as BUG_ON
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As the destination port in FoU and GUE receiving sockets doesn't
necessarily match the remote destination port, we can't associate errors
to the encapsulating tunnels with a socket lookup -- we need to blindly
try them instead. This means we don't even know if we are handling errors
for FoU or GUE without digging into the packets.
Hence, implement a single handler for both, one for IPv4 and one for IPv6,
that will check whether the packet that generated the ICMP error used a
direct IP encapsulation or if it had a GUE header, and send the error to
the matching protocol handler, if any.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ICMP error handling is currently not possible for UDP tunnels not
employing a receiving socket with local destination port matching the
remote one, because we have no way to look them up.
Add an err_handler tunnel encapsulation operation that can be exported by
tunnels in order to pass the error to the protocol implementing the
encapsulation. We can't easily use a lookup function as we did for VXLAN
and GENEVE, as protocol error handlers, which would be in turn called by
implementations of this new operation, handle the errors themselves,
together with the tunnel lookup.
Without a socket, we can't be sure which encapsulation error handler is
the appropriate one: encapsulation handlers (the ones for FoU and GUE
introduced in the next patch, e.g.) will need to check the new error codes
returned by protocol handlers to figure out if errors match the given
encapsulation, and, in turn, report this error back, so that we can try
all of them in __udp{4,6}_lib_err_encap_no_sk() until we have a match.
v2:
- Name all arguments in err_handler prototypes (David Miller)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
We'll need this to handle ICMP errors for tunnels without a sending socket
(i.e. FoU and GUE). There, we might have to look up different types of IP
tunnels, registered as network protocols, before we get a match, so we
want this for the error handlers of IPPROTO_IPIP and IPPROTO_IPV6 in both
inet_protos and inet6_protos. These error codes will be used in the next
patch.
For consistency, return sensible error codes in protocol error handlers
whenever handlers can't handle errors because, even if valid, they don't
match a protocol or any of its states.
This has no effect on existing error handling paths.
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
For both IPv4 and IPv6, if we can't match errors to a socket, try
tunnels before ignoring them. Look up a socket with the original source
and destination ports as found in the UDP packet inside the ICMP payload,
this will work for tunnels that force the same destination port for both
endpoints, i.e. VXLAN and GENEVE.
Actually, lwtunnels could break this assumption if they are configured by
an external control plane to have different destination ports on the
endpoints: in this case, we won't be able to trace ICMP messages back to
them.
For IPv6 redirect messages, call ip6_redirect() directly with the output
interface argument set to the interface we received the packet from (as
it's the very interface we should build the exception on), otherwise the
new nexthop will be rejected. There's no such need for IPv4.
Tunnels can now export an encap_err_lookup() operation that indicates a
match. Pass the packet to the lookup function, and if the tunnel driver
reports a matching association, continue with regular ICMP error handling.
v2:
- Added newline between network and transport header sets in
__udp{4,6}_lib_err_encap() (David Miller)
- Removed redundant skb_reset_network_header(skb); in
__udp4_lib_err_encap()
- Removed redundant reassignment of iph in __udp4_lib_err_encap()
(Sabrina Dubroca)
- Edited comment to __udp{4,6}_lib_err_encap() to reflect the fact this
won't work with lwtunnels configured to use asymmetric ports. By the way,
it's VXLAN, not VxLAN (Jiri Benc)
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
avoid to compute the hash value if dev is not null, since
hash value is not used
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move destroying of the old child qdiscs outside of the sch_tree_lock()
section. This should improve the software qdisc replace but is even
more important for offloads. Calling offloads under a spin lock is
best avoided, and child's destroy would be called under sch_tree_lock().
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move destroying of the old child qdisc outside of the sch_tree_lock()
section. This should improve the software qdisc replace but is even
more important for offloads. Firstly calling offloads under a spin
lock is best avoided. Secondly the destroy event of existing child
would have been sent to the offload device before the replace, causing
confusion.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The code for grafting Qdiscs when there is a parent has two needless
indentation levels, and breaks the "keep the success path unindented"
guideline. Refactor.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc graft operation of offload-capable qdiscs performs a few
extra steps which are identical among all the qdiscs. Add
a helper to share this code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
PRIO and RED mark the qdisc with TCQ_F_OFFLOADED upon successful offload,
make MQ do the same. The consistency will help with consistent
graft callback behaviour.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Offload dump helper does not use opt parameter, remove it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc dump operation of offload-capable qdiscs performs a few
extra steps which are identical among all the qdiscs. Add
a helper to share this code.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is no need to have the '__be32 *p' variable static since new value
always be assigned before use it.
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Set the backlog earlier in inet_dccp_listen() and inet_listen(),
then we can avoid the redundant setting.
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GSO tunneled packets are always segmented in software before they are
transmitted by a VLAN, even when the lower device can offload tunnel
encapsulation and VLAN together (i.e., some bits in NETIF_F_GSO_ENCAP_ALL
mask are set in the lower device 'vlan_features'). If we let VLANs have
the same tunnel offload capabilities as their lower device, throughput
can improve significantly when CPU is limited on the transmitter side.
- set NETIF_F_GSO_ENCAP_ALL bits in the VLAN 'hw_features', to ensure
that 'features' will have those bits zeroed only when the lower device
has no hardware support for tunnel encapsulation.
- for the same reason, copy GSO-related bits of 'hw_enc_features' from
lower device to VLAN, and ensure to update that value when the lower
device changes its features.
- set NETIF_F_HW_CSUM bit in the VLAN 'hw_enc_features' if 'real_dev'
is able to compute checksums at least for a kind of packets, like done
with commit 8403debeead8 ("vlan: Keep NETIF_F_HW_CSUM similar to other
software devices"). This avoids software segmentation due to mismatching
checksum capabilities between VLAN's 'features' and 'hw_enc_features'.
Reported-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In some scenarios, the GRO engine can assemble an UDP GRO packet
that ultimately lands on a non GRO-enabled socket.
This patch tries to address the issue explicitly checking for the UDP
socket features before enqueuing the packet, and eventually segmenting
the unexpected GRO packet, as needed.
We must also cope with re-insertion requests: after segmentation the
UDP code calls the helper introduced by the previous patches, as needed.
Segmentation is performed by a common helper, which takes care of
updating socket and protocol stats is case of failure.
rfc v3 -> v1
- fix compile issues with rxrpc
- when gso_segment returns NULL, treat is as an error
- added 'ipv4' argument to udp_rcv_segment()
rfc v2 -> rfc v3
- moved udp_rcv_segment() into net/udp.h, account errors to socket
and ns, always return NULL or segs list
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can re-use it at the UDP level in the next patch
rfc v3 -> v1:
- add the helper declaration into the ipv6 header
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
So that we can re-use it at the UDP level in a later patch
rfc v3 -> v1
- add the helper declaration into the ip header
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When UDP GRO is enabled, the UDP_GRO cmsg will carry the ingress
datagram size. User-space can use such info to compute the original
packets layout.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This is the RX counterpart of commit bec1f6f69736 ("udp: generate gso
with UDP_SEGMENT"). When UDP_GRO is enabled, such socket is also
eligible for GRO in the rx path: UDP segments directed to such socket
are assembled into a larger GSO_UDP_L4 packet.
The core UDP GRO support is enabled with setsockopt(UDP_GRO).
Initial benchmark numbers:
Before:
udp rx: 1079 MB/s 769065 calls/s
After:
udp rx: 1466 MB/s 24877 calls/s
This change introduces a side effect in respect to UDP tunnels:
after a UDP tunnel creation, now the kernel performs a lookup per ingress
UDP packet, while before such lookup happened only if the ingress packet
carried a valid internal header csum.
rfc v2 -> rfc v3:
- fixed typos in macro name and comments
- really enforce UDP_GRO_CNT_MAX, instead of UDP_GRO_CNT_MAX + 1
- acquire socket lock in UDP_GRO setsockopt
rfc v1 -> rfc v2:
- use a new option to enable UDP GRO
- use static keys to protect the UDP GRO socket lookup
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The *encap_needed static keys are enabled by UDP tunnels
and several UDP encapsulations type, but they are never
turned off. This can cause unneeded overall performance
degradation for systems where such features are used
transiently.
This patch introduces complete book-keeping for such keys,
decreasing the usage at socket destruction time, if needed,
and avoiding that the same socket could increase the key
usage multiple times.
rfc v3 -> v1:
- add socket lock around udp_tunnel_encap_enable()
rfc v2 -> rfc v3:
- use udp_tunnel_encap_enable() in setsockopt()
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For bound udp sockets in a vrf, also check the sdif to get the index
for ingress devices enslaved to an l3mdev.
Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If the skb for multicast packets marked as enslaved to a VRF are
received, then the secondary device index should be used to obtain
the real device. And verify the multicast address against the
enslaved rather than the l3mdev device.
Signed-off-by: Dewi Morgan <morgand@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
If link-local packets are marked as enslaved to a VRF, then to allow
ping to the link-local from a vrf, the error handling for IPV6_PKTINFO
needs to be relaxed to also allow the pkt ipi6_ifindex to be that of a
slave device to the vrf.
Note that the real device also needs to be retrieved in icmp6_iif()
to set the ipv6 flow oif to this for icmp echo reply handling. The
recent commit 24b711edfc34 ("net/ipv6: Fix linklocal to global address
with VRF") takes care of this, so the sdif does not need checking here.
This fix makes ping to link-local consistent with that to global
addresses, in that this can now be done from within the same VRF that
the address is in.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When there exist a pair of raw sockets one unbound and one bound
to a VRF but equal in all other respects, when a packet is received
in the VRF context, __raw_v4_lookup() matches on both sockets.
This results in the packet being delivered over both sockets,
instead of only the raw socket bound to the VRF. The bound device
checks in __raw_v4_lookup() are replaced with a call to
raw_sk_bound_dev_eq() which correctly handles whether the packet
should be delivered over the unbound socket in such cases.
In __raw_v6_lookup() the match on the device binding of the socket is
similarly updated to use raw_sk_bound_dev_eq() which matches the
handling in __raw_v4_lookup().
Importantly raw_sk_bound_dev_eq() takes the raw_l3mdev_accept sysctl
into account.
Signed-off-by: Duncan Eastoe <deastoe@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a sysctl raw_l3mdev_accept to control raw socket lookup in a manner
similar to use of tcp_l3mdev_accept for stream and of udp_l3mdev_accept
for datagram sockets. Have this default to enabled for reasons of
backwards compatibility. This is so as to specify the output device
with cmsg and IP_PKTINFO, but using a socket not bound to the
corresponding VRF. This allows e.g. older ping implementations to be
run with specifying the device but without executing it in the VRF.
If the option is disabled, packets received in a VRF context are only
handled by a raw socket bound to the VRF, and correspondingly packets
in the default VRF are only handled by a socket not bound to any VRF.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>