IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
Back when tcp_tso_autosize() and TCP pacing were introduced,
our focus was really to reduce burst sizes for long distance
flows.
The simple heuristic of using sk_pacing_rate/1024 has worked
well, but can lead to too small packets for hosts in the same
rack/cluster, when thousands of flows compete for the bottleneck.
Neal Cardwell had the idea of making the TSO burst size
a function of both sk_pacing_rate and tcp_min_rtt()
Indeed, for local flows, sending bigger bursts is better
to reduce cpu costs, as occasional losses can be repaired
quite fast.
This patch is based on Neal Cardwell implementation
done more than two years ago.
bbr is adjusting max_pacing_rate based on measured bandwidth,
while cubic would over estimate max_pacing_rate.
/proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
this new feature, in logarithmic steps.
Tested:
100Gbit NIC, two hosts in the same rack, 4K MTU.
600 flows rate-limited to 20000000 bytes per second.
Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
96005
Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
65,945.29 msec task-clock # 2.845 CPUs utilized
1,314,632 context-switches # 19935.279 M/sec
5,292 cpu-migrations # 80.249 M/sec
940,641 page-faults # 14264.023 M/sec
201,117,030,926 cycles # 3049769.216 GHz (83.45%)
17,699,435,405 stalled-cycles-frontend # 8.80% frontend cycles idle (83.48%)
136,584,015,071 stalled-cycles-backend # 67.91% backend cycles idle (83.44%)
53,809,530,436 instructions # 0.27 insn per cycle
# 2.54 stalled cycles per insn (83.36%)
9,062,315,523 branches # 137422329.563 M/sec (83.22%)
153,008,621 branch-misses # 1.69% of all branches (83.32%)
23.182970846 seconds time elapsed
TcpInSegs 15648792 0.0
TcpOutSegs 58659110 0.0 # Average of 3.7 4K segments per TSO packet
TcpExtTCPDelivered 58654791 0.0
TcpExtTCPDeliveredCE 19 0.0
After patch:
~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
96046
Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
48,982.58 msec task-clock # 2.104 CPUs utilized
186,014 context-switches # 3797.599 M/sec
3,109 cpu-migrations # 63.472 M/sec
941,180 page-faults # 19214.814 M/sec
153,459,763,868 cycles # 3132982.807 GHz (83.56%)
12,069,861,356 stalled-cycles-frontend # 7.87% frontend cycles idle (83.32%)
120,485,917,953 stalled-cycles-backend # 78.51% backend cycles idle (83.24%)
36,803,672,106 instructions # 0.24 insn per cycle
# 3.27 stalled cycles per insn (83.18%)
5,947,266,275 branches # 121417383.427 M/sec (83.64%)
87,984,616 branch-misses # 1.48% of all branches (83.43%)
23.281200256 seconds time elapsed
TcpInSegs 1434706 0.0
TcpOutSegs 58883378 0.0 # Average of 41 4K segments per TSO packet
TcpExtTCPDelivered 58878971 0.0
TcpExtTCPDeliveredCE 9664 0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
tcp_should_autocork() is evaluating if it makes senses
to not immediately send current skb, hoping that
user space will add more payload on it by the
time TCP stack reacts to upcoming TX completions.
If current skb got MSG_EOR mark, then we know
that no further data will be added, it is therefore
futile to wait.
SOF_TIMESTAMPING_TX_ACK will become a bit more accurate,
if prior packets are still in qdisc/device queues.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Link: https://lore.kernel.org/r/20220309054706.2857266-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
when CONFIG_SYSCTL not set, smc_sysctl_net_init/exit
need to be static inline to avoid missing-prototypes
if compile with W=1.
Since __net_exit has noinline annotation when CONFIG_NET_NS
not set, it should not be used with static inline.
So remove the __net_init/exit when CONFIG_SYSCTL not set.
Fixes: 7de8eb0d9039 ("net/smc: fix compile warning for smc_sysctl")
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220309033051.41893-1-dust.li@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This adds support for running XDP programs through BPF_PROG_RUN in a mode
that enables live packet processing of the resulting frames. Previous uses
of BPF_PROG_RUN for XDP returned the XDP program return code and the
modified packet data to userspace, which is useful for unit testing of XDP
programs.
The existing BPF_PROG_RUN for XDP allows userspace to set the ingress
ifindex and RXQ number as part of the context object being passed to the
kernel. This patch reuses that code, but adds a new mode with different
semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES
flag.
When running BPF_PROG_RUN in this mode, the XDP program return codes will
be honoured: returning XDP_PASS will result in the frame being injected
into the networking stack as if it came from the selected networking
interface, while returning XDP_TX and XDP_REDIRECT will result in the frame
being transmitted out that interface. XDP_TX is translated into an
XDP_REDIRECT operation to the same interface, since the real XDP_TX action
is only possible from within the network drivers themselves, not from the
process context where BPF_PROG_RUN is executed.
Internally, this new mode of operation creates a page pool instance while
setting up the test run, and feeds pages from that into the XDP program.
The setup cost of this is amortised over the number of repetitions
specified by userspace.
To support the performance testing use case, we further optimise the setup
step so that all pages in the pool are pre-initialised with the packet
data, and pre-computed context and xdp_frame objects stored at the start of
each page. This makes it possible to entirely avoid touching the page
content on each XDP program invocation, and enables sending up to 9
Mpps/core on my test box.
Because the data pages are recycled by the page pool, and the test runner
doesn't re-initialise them for each run, subsequent invocations of the XDP
program will see the packet data in the state it was after the last time it
ran on that particular page. This means that an XDP program that modifies
the packet before redirecting it has to be careful about which assumptions
it makes about the packet content, but that is only an issue for the most
naively written programs.
Enabling the new flag is only allowed when not setting ctx_out and data_out
in the test specification, since using it means frames will be redirected
somewhere else, so they can't be returned.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220309105346.100053-2-toke@redhat.com
Steffen Klassert says:
====================
pull request (net): ipsec 2022-03-09
1) Fix IPv6 PMTU discovery for xfrm interfaces.
From Lina Wang.
2) Revert failing for policies and states that are
configured with XFRMA_IF_ID 0. It broke a
user configuration. From Kai Lueke.
3) Fix a possible buffer overflow in the ESP output path.
4) Fix ESP GSO for tunnel and BEET mode on inter address
family tunnels.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
When two ax25 devices attempted to establish connection, the requester use ax25_create(),
ax25_bind() and ax25_connect() to initiate connection. The receiver use ax25_rcv() to
accept connection and use ax25_create_cb() in ax25_rcv() to create ax25_cb, but the
ax25_cb->sk is NULL. When the receiver is detaching, a NULL pointer dereference bug
caused by sock_hold(sk) in ax25_kill_by_device() will happen. The corresponding
fail log is shown below:
===============================================================
BUG: KASAN: null-ptr-deref in ax25_device_event+0xfd/0x290
Call Trace:
...
ax25_device_event+0xfd/0x290
raw_notifier_call_chain+0x5e/0x70
dev_close_many+0x174/0x220
unregister_netdevice_many+0x1f7/0xa60
unregister_netdevice_queue+0x12f/0x170
unregister_netdev+0x13/0x20
mkiss_close+0xcd/0x140
tty_ldisc_release+0xc0/0x220
tty_release_struct+0x17/0xa0
tty_release+0x62d/0x670
...
This patch add condition check in ax25_kill_by_device(). If s->sk is
NULL, it will goto if branch to kill device.
Fixes: 4e0f718daf97 ("ax25: improve the incomplete fix to avoid UAF and NPD bugs")
Reported-by: Thomas Osterried <thomas@osterried.de>
Signed-off-by: Duoming Zhou <duoming@zju.edu.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
We have a number of cases where function returns drop/no drop
decision as a boolean. Now that we want to report the reason
code as well we have to pass extra output arguments.
We can make the reason code evaluate correctly as bool.
I believe we're good to reorder the reasons as they are
reported to user space as strings.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Felix driver declares FDB isolation but puts all standalone ports in
VID 0. This is mostly problem-free as discussed with Alvin here:
https://patchwork.kernel.org/project/netdevbpf/cover/20220302191417.1288145-1-vladimir.oltean@nxp.com/#24763870
however there is one catch. DSA still thinks that FDB entries are
installed on the CPU port as many times as there are user ports, and
this is problematic when multiple user ports share the same MAC address.
Consider the default case where all user ports inherit their MAC address
from the DSA master, and then the user runs:
ip link set swp0 address 00:01:02:03:04:05
The above will make dsa_slave_set_mac_address() call
dsa_port_standalone_host_fdb_add() for 00:01:02:03:04:05 in port 0's
standalone database, and dsa_port_standalone_host_fdb_del() for the old
address of swp0, again in swp0's standalone database.
Both the ->port_fdb_add() and ->port_fdb_del() will be propagated down
to the felix driver, which will end up deleting the old MAC address from
the CPU port. But this is still in use by other user ports, so we end up
breaking unicast termination for them.
There isn't a problem in the fact that DSA keeps track of host
standalone addresses in the individual database of each user port: some
drivers like sja1105 need this. There also isn't a problem in the fact
that some drivers choose the same VID/FID for all standalone ports.
It is just that the deletion of these host addresses must be delayed
until they are known to not be in use any longer, and only the driver
has this knowledge. Since DSA keeps these addresses in &cpu_dp->fdbs and
&cpu_db->mdbs, it is just a matter of walking over those lists and see
whether the same MAC address is present on the CPU port in the port db
of another user port.
I have considered reusing the generic dsa_port_walk_fdbs() and
dsa_port_walk_mdbs() schemes for this, but locking makes it difficult.
In the ->port_fdb_add() method and co, &dp->addr_lists_lock is held, but
dsa_port_walk_fdbs() also acquires that lock. Also, even assuming that
we introduce an unlocked variant of the address iterator, we'd still
need some relatively complex data structures, and a void *ctx in the
dsa_fdb_walk_cb_t which we don't currently pass, such that drivers are
able to figure out, after iterating, whether the same MAC address is or
isn't present in the port db of another port.
All the above, plus the fact that I expect other drivers to follow the
same model as felix where all standalone ports use the same FID, made me
conclude that a generic method provided by DSA is necessary:
dsa_fdb_present_in_other_db() and the mdb equivalent. Felix calls this
from the ->port_fdb_del() handler for the CPU port, when the database
was classified to either a port db, or a LAG db.
For symmetry, we also call this from ->port_fdb_add(), because if the
address was installed once, then installing it a second time serves no
purpose: it's already in hardware in VID 0 and it affects all standalone
ports.
This change moves dsa_db_equal() from switch.c to dsa.c, since it now
has one more caller.
Fixes: 54c319846086 ("net: mscc: ocelot: enforce FDB isolation when VLAN-unaware")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the slave unicast address is synced to hardware and to the DSA
master during dsa_slave_open(), this means that a call to
dsa_slave_set_mac_address() while the slave interface is down will
result to a call to dsa_port_standalone_host_fdb_del() and to
dev_uc_del() for the MAC address while there was no previous
dsa_port_standalone_host_fdb_add() or dev_uc_add().
This is a partial revert of the blamed commit below, which was too
aggressive.
Fixes: 35aae5ab9121 ("net: dsa: remove workarounds for changing master promisc/allmulti only while up")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
&cpu_db->fdbs and &cpu_db->mdbs may be uninitialized lists during some
call paths of felix_set_tag_protocol().
There was an attempt to avoid calling dsa_port_walk_fdbs() during setup
by using a "bool change" in the felix driver, but this doesn't work when
the tagging protocol is defined in the device tree, and a change is
triggered by DSA at pseudo-runtime:
dsa_tree_setup_switches
-> dsa_switch_setup
-> dsa_switch_setup_tag_protocol
-> ds->ops->change_tag_protocol
dsa_tree_setup_ports
-> dsa_port_setup
-> &dp->fdbs and &db->mdbs only get initialized here
So it seems like the only way to fix this is to move the initialization
of these lists earlier.
dsa_port_touch() is called from dsa_switch_touch_ports() which is called
from dsa_switch_parse_of(), and this runs completely before
dsa_tree_setup(). Similarly, dsa_switch_release_ports() runs after
dsa_tree_teardown().
Fixes: f9cef64fa23f ("net: dsa: felix: migrate host FDB and MDB entries when changing tag proto")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
There has been recent work towards matching each switchdev object
addition with a corresponding deletion.
Therefore, having elements in the fdbs, mdbs, vlans lists at the time of
a shared (DSA, CPU) port's teardown is indicative of a bug somewhere
else, and not something that is to be expected.
We shouldn't try to silently paper over that. Instead, print a warning
and a stack trace.
This change is a prerequisite for moving the initialization/teardown of
these lists. Make it clear that clearing the lists isn't needed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When receiving a state message, function tipc_link_validate_msg()
is called to validate its header portion. Then, its data portion
is validated before it can be accessed correctly. However, current
data sanity check is done after the message header is accessed to
update some link variables.
This commit fixes this issue by moving the data sanity check to
the beginning of state message handling and right after the header
sanity check.
Fixes: 9aa422ad3266 ("tipc: improve size validations for received domain records")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Link: https://lore.kernel.org/r/20220308021200.9245-1-tung.q.nguyen@dektech.com.au
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ENOTSUPP is documented as "should never be seen by user programs",
and thus not exposed in <errno.h>, and thus applications cannot safely
check against it (they get "Unknown error 524" as strerror). We should
rather return the well-known -EOPNOTSUPP.
This is similar to 2230a7ef5198 ("drop_monitor: Use correct error
code") and 4a5cdc604b9c ("net/tls: Fix return values to avoid
ENOTSUPP"), which did not seem to cause problems.
Signed-off-by: Samuel Thibault <samuel.thibault@labri.fr>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20220307223126.djzvg44v2o2jkjsx@begin
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The fullmesh flag mustn't be used with the signal flag when adding an
address. This patch added the necessary flags check for this case.
Fixes: 73c762c1f07d ("mptcp: set fullmesh flag in pm_netlink")
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The address ID selection for MPJ subflows created in response
to incoming ADD_ADDR option is currently unreliable: it happens
at MPJ socket creation time, when the local address could be
unknown.
Additionally, if the no local endpoint is available for the local
address, a new dummy endpoint is created, confusing the user-land.
This change refactor the code to move the address ID selection inside
the rebuild_header() helper, when the local address eventually
selected by the route lookup is finally known. If the address used
is not mapped by any endpoint - and thus can't be advertised/removed
pick the id 0 instead of allocate a new endpoint.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In some edge scenarios, an MPTCP subflows can use a local address
mapped by a "implicit" endpoint created by the in-kernel path manager.
Such endpoints presence can be confusing, as it's creation is hard
to track and will prevent the later endpoint creation from the user-space
using the same address.
Define a new endpoint flag to mark implicit endpoints and allow the
user-space to replace implicit them with user-provided data at endpoint
creation time.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The in-kernel MPTCP path manager, when processing the MPTCP_PM_CMD_FLUSH_ADDR
command, generates RM_ADDR events for each known local address. While that
is allowed by the RFC, it makes unpredictable the exact number of RM_ADDR
generated when both ends flush the PM addresses.
This change restricts the RM_ADDR generation to previously explicitly
announced addresses, and adjust the expected results in a bunch of related
self-tests.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Set subflow->data_avail with the enum value MPTCP_SUBFLOW_NODATA, instead
of using 0 directly.
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The tracepoint in get_mapping_status() only dumped the incoming mpext
fields. This patch added a new tracepoint in mptcp_sendmsg_frag() to dump
the outgoing mpext too.
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This was a prerequisite for the ill-fated
"netfilter: nat: force port remap to prevent shadowing well-known ports".
As this has been reverted, this change can be backed out too.
Signed-off-by: Florian Westphal <fw@strlen.de>
This reverts commit 878aed8db324bec64f3c3f956e64d5ae7375a5de.
This change breaks existing setups where conntrack is used with
asymmetric paths.
In these cases, the NAT transformation occurs on the syn-ack instead of
the syn:
1. SYN x:12345 -> y -> 443 // sent by initiator, receiverd by responder
2. SYNACK y:443 -> x:12345 // First packet seen by conntrack, as sent by responder
3. tuple_force_port_remap() gets called, sees:
'tcp from 443 to port 12345 NAT' -> pick a new source port, inititor receives
4. SYNACK y:$RANDOM -> x:12345 // connection is never established
While its possible to avoid the breakage with NOTRACK rules, a kernel
update should not break working setups.
An alternative to the revert is to augment conntrack to tag
mid-stream connections plus more code in the nat core to skip NAT
for such connections, however, this leads to more interaction/integration
between conntrack and NAT.
Therefore, revert, users will need to add explicit nat rules to avoid
port shadowing.
Link: https://lore.kernel.org/netfilter-devel/20220302105908.GA5852@breakpoint.cc/#R
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2051413
Signed-off-by: Florian Westphal <fw@strlen.de>
In this situation (VLAN filtering disabled on br0):
br0.10
/
br0
/ \
swp0 swp1
When a frame is transmitted from the VLAN upper, the bridge will send
it down to one of the switch ports with forward offloading
enabled. This will cause tag_dsa to generate a FORWARD tag. Before
this change, that tag would have it's VID set to 10, even though VID
10 is not loaded in the VTU.
Before the blamed commit, the frame would trigger a VTU miss and be
forwarded according to the PVT configuration. Now that all fabric
ports are in 802.1Q secure mode, the frame is dropped instead.
Therefore, restrict the condition under which we rewrite an 802.1Q tag
to a DSA tag. On standalone port's, reuse is always safe since we will
always generate FROM_CPU tags in that case. For bridged ports though,
we must ensure that VLAN filtering is enabled, which in turn
guarantees that the VID in question is loaded into the VTU.
Fixes: d352b20f4174 ("net: dsa: mv88e6xxx: Improve multichip isolation of standalone ports")
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20220307110548.812455-1-tobias@waldekranz.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
The clang static analyzer reports this issue
rtnetlink.c:5481:2: warning: Undefined or garbage
value returned to caller
return err;
^~~~~~~~~~
There is a function level err variable, in the
list_for_each_entry_rcu block there is a shadow
err. Remove the shadow.
In the same block, the call to nla_nest_start_noflag()
can fail without setting an err. Set the err
to -EMSGSIZE.
Fixes: 216e690631f5 ("net: rtnetlink: rtnl_fill_statsinfo(): Permit non-EMSGSIZE error returns")
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Clang static analysis reports this representative issue
dsa.c:486:2: warning: Undefined or garbage value
returned to caller
return err;
^~~~~~~~~~
err is only set in the loop. If the loop is empty,
garbage will be returned. So initialize err to 0
to handle this noop case.
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The esp tunnel GSO handlers use skb_mac_gso_segment to
push the inner packet to the segmentation handlers.
However, skb_mac_gso_segment takes the Ethernet Protocol
ID from 'skb->protocol' which is wrong for inter address
family tunnels. We fix this by introducing a new
skb_eth_gso_segment function.
This function can be used if it is necessary to pass the
Ethernet Protocol ID directly to the segmentation handler.
First users of this function will be the esp4 and esp6
tunnel segmentation handlers.
Fixes: c35fe4106b92 ("xfrm: Add mode handlers for IPsec on layer 2")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The xfrm{4,6}_beet_gso_segment() functions did not correctly set the
SKB_GSO_IPXIP4 and SKB_GSO_IPXIP6 gso types for the address family
tunneling case. Fix this by setting these gso types.
Fixes: 384a46ea7bdc7 ("esp4: add gso_segment for esp4 beet mode")
Fixes: 7f9e40eb18a99 ("esp6: add gso_segment for esp6 beet mode")
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
The maximum message size that can be send is bigger than
the maximum site that skb_page_frag_refill can allocate.
So it is possible to write beyond the allocated buffer.
Fix this by doing a fallback to COW in that case.
v2:
Avoid get get_order() costs as suggested by Linus Torvalds.
Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
Fixes: 03e2a30f6a27 ("esp6: Avoid skb_cow_data whenever possible")
Reported-by: valis <sec@valis.email>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
kernel test robot reports multiple warning for smc_sysctl:
In file included from net/smc/smc_sysctl.c:17:
>> net/smc/smc_sysctl.h:23:5: warning: no previous prototype \
for function 'smc_sysctl_init' [-Wmissing-prototypes]
int smc_sysctl_init(void)
^
and
>> WARNING: modpost: vmlinux.o(.text+0x12ced2d): Section mismatch \
in reference from the function smc_sysctl_exit() to the variable
.init.data:smc_sysctl_ops
The function smc_sysctl_exit() references
the variable __initdata smc_sysctl_ops.
This is often because smc_sysctl_exit lacks a __initdata
annotation or the annotation of smc_sysctl_ops is wrong.
and
net/smc/smc_sysctl.c: In function 'smc_sysctl_init_net':
net/smc/smc_sysctl.c:47:17: error: 'struct netns_smc' has no member named 'smc_hdr'
47 | net->smc.smc_hdr = register_net_sysctl(net, "net/smc", table);
Since we don't need global sysctl initialization. To make things
clean and simple, remove the global pernet_operations and
smc_sysctl_{init|exit}. Call smc_sysctl_net_{init|exit} directly
from smc_net_{init|exit}.
Also initialized sysctl_autocorking_size if CONFIG_SYSCTL it not
set, this make sure SMC autocorking is enabled by default if
CONFIG_SYSCTL is not set.
Fixes: 462791bbfa35 ("net/smc: add sysctl interface for SMC")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit
baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
the function netif_rx() can be used in preemptible/thread context as
well as in interrupt context.
Use netif_rx().
Cc: Remi Denis-Courmont <courmisch@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit
baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
the function netif_rx() can be used in preemptible/thread context as
well as in interrupt context.
Use netif_rx().
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: Luiz Augusto von Dentz <luiz.dentz@gmail.com>
Cc: linux-bluetooth@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit
baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
the function netif_rx() can be used in preemptible/thread context as
well as in interrupt context.
Use netif_rx().
Cc: Antonio Quartulli <a@unstable.cc>
Cc: Marek Lindner <mareklindner@neomailbox.ch>
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Cc: Sven Eckelmann <sven@narfation.org>
Cc: b.a.t.m.a.n@lists.open-mesh.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit
baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
the function netif_rx() can be used in preemptible/thread context as
well as in interrupt context.
Use netif_rx().
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: tipc-discussion@lists.sourceforge.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
PHY drivers such as micrel or dp83640 need to analyze whether a given
skb is a PTP sync message for one step functionality.
In order to avoid code duplication introduce a generic function and
move it to ptp classify.
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of __get_free_pages() and free_pages() use alloc_pages_exact()
and free_pages_exact(). This is in preparation of a change of
gnttab_end_foreign_access() which will prohibit use of high-order
pages.
By using the local variable "order" instead of ring->intf->ring_order
in the error path of xen_9pfs_front_alloc_dataring() another bug is
fixed, as the error path can be entered before ring->intf->ring_order
is being set.
By using alloc_pages_exact() the size in bytes is specified for the
allocation, which fixes another bug for the case of
order < (PAGE_SHIFT - XEN_PAGE_SHIFT).
This is part of CVE-2022-23041 / XSA-396.
Reported-by: Simon Gaiser <simon@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
V4:
- new patch
Since commit
baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
the function netif_rx() can be used in preemptible/thread context as
well as in interrupt context.
Use netif_rx().
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: linux-wireless@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit
baebdf48c3600 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
the function netif_rx() can be used in preemptible/thread context as
well as in interrupt context.
Use netif_rx().
Cc: Marc Kleine-Budde <mkl@pengutronix.de>
Cc: Oliver Hartkopp <socketcan@hartkopp.net>
Cc: Wolfgang Grandegger <wg@grandegger.com>
Cc: linux-can@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit a505cce6f7cfaf2aa2385aab7286063c96444526.
Leon says:
We already discussed that. SMC should be changed to use
RDMA CQ pool API
drivers/infiniband/core/cq.c.
ib_poll_handler() has much better implementation (tracing,
IRQ rescheduling, proper error handling) than this SMC variant.
Since we will switch to ib_poll_handler() in the future,
revert this patch.
Link: https://lore.kernel.org/netdev/20220301105332.GA9417@linux.alibaba.com/
Suggested-by: Leon Romanovsky <leon@kernel.org>
Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
After the blamed commit, dsa_tree_setup_master() may exit without
calling rtnl_unlock(), fix that.
Fixes: c146f9bc195a ("net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 68ac0f3810e76a853b5f7b90601a05c3048b8b54 because ID
0 was meant to be used for configuring the policy/state without
matching for a specific interface (e.g., Cilium is affected, see
https://github.com/cilium/cilium/pull/18789 and
https://github.com/cilium/cilium/pull/19019).
Signed-off-by: Kai Lueke <kailueke@linux.microsoft.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Include a few verifier selftests that test against the problems being
fixed by previous commits, i.e. release kfunc always require
PTR_TO_BTF_ID fixed and var_off to be 0, and negative offset is not
permitted and returns a helpful error message.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220304224645.3677453-9-memxor@gmail.com
Currently, -Wmissing-prototypes warning is ignored for GCC, but not
clang. This leads to clang build warning in W=1 mode. Since the flag
used by both compilers is same, we can use the unified __diag_ignore_all
macro that works for all supported versions and compilers which have
__diag macro support (currently GCC >= 8.0, and Clang >= 11.0).
Also add nf_conntrack_bpf.h include to prevent missing prototype warning
for register_nf_conntrack_bpf.
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220304224645.3677453-8-memxor@gmail.com
Realtek switches supports the same tag both before ethertype or between
payload and the CRC.
Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch added two more mibs for MP_RST, MPTCP_MIB_MPRSTTX for
the MP_RST sending and MPTCP_MIB_MPRSTRX for the MP_RST receiving.
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch added two more mibs for MP_FASTCLOSE, MPTCP_MIB_MPFASTCLOSETX
for the MP_FASTCLOSE sending and MPTCP_MIB_MPFASTCLOSERX for receiving.
Also added a debug log for MP_FASTCLOSE receiving, printed out the recv_key
of MP_FASTCLOSE in mptcp_parse_option to show that MP_RST is received.
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Add new PID/VID (0x13d3/0x3567) for MT7921
- Add new PID/VID (0x2550/0x8761) for Realtek 8761BU
- Add support for LG LGSBWAC02 (MT7663BUN)
- Add support for BCM43430A0 and BCM43430A1
- Add support for Intel Madison Peak (MsP2)
-----BEGIN PGP SIGNATURE-----
iQJNBAABCAA3FiEE7E6oRXp8w05ovYr/9JCA4xAyCykFAmIiaaUZHGx1aXoudm9u
LmRlbnR6QGludGVsLmNvbQAKCRD0kIDjEDILKUXQD/9ZbabqpDtsIDPVEqJH1GHb
Z1i0ID6FAUq9VnRHcs5PgrPlD3JaQhI3wMnF1FILu3NWqEyQnAf39LVlL8R0PvaQ
pYtkf+BZI4TKjqa+e4Mw+8NbBLHV2EV8O2tVUQ2gb3OwYvq+2ped4oDC1CkXJ/pA
6LfAuH7z4FNNiWlHeg6wCwYnTFicZmHfwUFab+laThMJ8Pj1mWHNxAmsWRhA+tM/
35WcHlocIPlXDNrY+tVOQNzme/pE91/yMUbI4ZWXIFf0vfMfjA7cbTS5mL9Tjgc8
1/HwiyOuCtPajp/uYAHMEveRK/W3O1FQGurVq3O7FfjfmzllIjm4aK3iLf3fwURe
ZSeeMSLr/FF28stnD2v1CtFET4GWhZMO1zn0jUI7I0GA+YHa7NY0dkczUFavng4o
P3UtkKtDhMgXLNU1vflBdtjjRYeiNw7G0PFsRJak/9x37MQIvUnR3pdN89fbS5Pl
D7R6i1oCSzoGXz245zObHPUMDEJ6vmksH4vEq4+TZX+Ngml6UNEib10KrpoxT/h3
A3KyX40ORfslkAAT8YHZwEUM6pS7gqIQUBdRzHVCKBo89A9HGBs91ZK1rshyDDsO
xfcp0gjr0hpoFhP0JNR7tj/jiGg+Lh8OjFCUGurNnWVLgi0hv3WRC8bBckIIWUVs
WpIuoOUAuz7Qlo0aiZ8ogQ==
=2uvT
-----END PGP SIGNATURE-----
Merge tag 'for-net-next-2022-03-04' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Luiz Augusto von Dentz says:
====================
bluetooth-next pull request for net-next:
- Add new PID/VID (0x13d3/0x3567) for MT7921
- Add new PID/VID (0x2550/0x8761) for Realtek 8761BU
- Add support for LG LGSBWAC02 (MT7663BUN)
- Add support for BCM43430A0 and BCM43430A1
- Add support for Intel Madison Peak (MsP2)
* tag 'for-net-next-2022-03-04' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (21 commits)
Bluetooth: btusb: Add another Realtek 8761BU
Bluetooth: hci_bcm: add BCM43430A0 & BCM43430A1
Bluetooth: use memset avoid memory leaks
Bluetooth: btmtksdio: Fix kernel oops when sdio suspend.
Bluetooth: btusb: Add a new PID/VID 13d3/3567 for MT7921
Bluetooth: move adv_instance_cnt read within the device lock
Bluetooth: hci_event: Add missing locking on hdev in hci_le_ext_adv_term_evt
Bluetooth: btusb: Make use of of BIT macro to declare flags
Bluetooth: Fix not checking for valid hdev on bt_dev_{info,warn,err,dbg}
Bluetooth: mediatek: fix the conflict between mtk and msft vendor event
Bluetooth: mt7921s: support bluetooth reset mechanism
Bluetooth: make array bt_uuid_any static const
Bluetooth: 6lowpan: No need to clear memory twice
Bluetooth: btusb: Improve stability for QCA devices
Bluetooth: btusb: add support for LG LGSBWAC02 (MT7663BUN)
Bluetooth: btusb: Add support for Intel Madison Peak (MsP2) device
Bluetooth: Improve skb handling in mgmt_device_connected()
Bluetooth: Fix skb allocation in mgmt_remote_name() & mgmt_device_connected()
Bluetooth: mgmt: Remove unneeded variable
Bluetooth: hci_sync: fix undefined return of hci_disconnect_all_sync()
...
====================
Link: https://lore.kernel.org/r/20220304193919.649815-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-03-04
We've added 32 non-merge commits during the last 14 day(s) which contain
a total of 59 files changed, 1038 insertions(+), 473 deletions(-).
The main changes are:
1) Optimize BPF stackmap's build_id retrieval by caching last valid build_id,
as consecutive stack frames are likely to be in the same VMA and therefore
have the same build id, from Hao Luo.
2) Several improvements to arm64 BPF JIT, that is, support for JITing
the atomic[64]_fetch_add, atomic[64]_[fetch_]{and,or,xor} and lastly
atomic[64]_{xchg|cmpxchg}. Also fix the BTF line info dump for JITed
programs, from Hou Tao.
3) Optimize generic BPF map batch deletion by only enforcing synchronize_rcu()
barrier once upon return to user space, from Eric Dumazet.
4) For kernel build parse DWARF and generate BTF through pahole with enabled
multithreading, from Kui-Feng Lee.
5) BPF verifier usability improvements by making log info more concise and
replacing inv with scalar type name, from Mykola Lysenko.
6) Two follow-up fixes for BPF prog JIT pack allocator, from Song Liu.
7) Add a new Kconfig to allow for loading kernel modules with non-matching
BTF type info; their BTF info is then removed on load, from Connor O'Brien.
8) Remove reallocarray() usage from bpftool and switch to libbpf_reallocarray()
in order to fix compilation errors for older glibc, from Mauricio Vásquez.
9) Fix libbpf to error on conflicting name in BTF when type declaration
appears before the definition, from Xu Kuohai.
10) Fix issue in BPF preload for in-kernel light skeleton where loaded BPF
program fds prevent init process from setting up fd 0-2, from Yucong Sun.
11) Fix libbpf reuse of pinned perf RB map when max_entries is auto-determined
by libbpf, from Stijn Tintel.
12) Several cleanups for libbpf and a fix to enforce perf RB map #pages to be
non-zero, from Yuntao Wang.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (32 commits)
bpf: Small BPF verifier log improvements
libbpf: Add a check to ensure that page_cnt is non-zero
bpf, x86: Set header->size properly before freeing it
x86: Disable HAVE_ARCH_HUGE_VMALLOC on 32-bit x86
bpf, test_run: Fix overflow in XDP frags bpf_test_finish
selftests/bpf: Update btf_dump case for conflicting names
libbpf: Skip forward declaration when counting duplicated type names
bpf: Add some description about BPF_JIT_ALWAYS_ON in Kconfig
bpf, docs: Add a missing colon in verifier.rst
bpf: Cache the last valid build_id
libbpf: Fix BPF_MAP_TYPE_PERF_EVENT_ARRAY auto-pinning
bpf, selftests: Use raw_tp program for atomic test
bpf, arm64: Support more atomic operations
bpftool: Remove redundant slashes
bpf: Add config to allow loading modules with BTF mismatches
bpf, arm64: Feed byte-offset into bpf line info
bpf, arm64: Call build_prologue() first in first JIT pass
bpf: Fix issue with bpf preload module taking over stdout/stdin of kernel.
bpftool: Bpf skeletons assert type sizes
bpf: Cleanup comments
...
====================
Link: https://lore.kernel.org/r/20220304164313.31675-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Use memset to initialize structs to prevent memory leaks
in l2cap_ecred_connect
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Minghao Chi (CGEL ZTE) <chi.minghao@zte.com.cn>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
The field adv_instance_cnt is always accessed within a device lock,
except in the function add_advertising. A concurrent remove of an
advertisement with adding another one could result in the if check
"if a new instance was actually added" to not trigger, resulting
in not triggering the "advertising added event".
Signed-off-by: Niels Dossche <niels.dossche@ugent.be>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Both hci_find_adv_instance and hci_remove_adv_instance have a comment
above their function definition saying that these two functions require
the caller to hold the hdev->lock lock. However, hci_le_ext_adv_term_evt
does not acquire that lock and neither does its caller hci_le_meta_evt
(hci_le_meta_evt calls hci_le_ext_adv_term_evt via an indirect function
call because of the lookup in hci_le_ev_table).
The other event handlers all acquire and release the hdev->lock and they
follow the rule that hci_find_adv_instance and hci_remove_adv_instance
must be called while holding the hdev->lock lock.
The solution is to make sure hci_le_ext_adv_term_evt also acquires and
releases the hdev->lock lock. The check on ev->status which logs a
warning and does an early return is not covered by the lock because
other functions also access ev->status without holding the lock.
Signed-off-by: Niels Dossche <niels.dossche@ugent.be>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>