IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
commit 63960260457a02af2a6cb35d75e6bdb17299c882 upstream.
When evaluating access control over kallsyms visibility, credentials at
open() time need to be used, not the "current" creds (though in BPF's
case, this has likely always been the same). Plumb access to associated
file->f_cred down through bpf_dump_raw_ok() and its callers now that
kallsysm_show_value() has been refactored to take struct cred.
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: 7105e828c087 ("bpf: allow for correlation of maps and helpers in dump")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8025751d4d55a2f32be6bdf825b6a80c299875f5 ]
If an ingress verdict program specifies message sizes greater than
skb->len and there is an ENOMEM error due to memory pressure we
may call the rcv_msg handler outside the strp_data_ready() caller
context. This is because on an ENOMEM error the strparser will
retry from a workqueue. The caller currently protects the use of
psock by calling the strp_data_ready() inside a rcu_read_lock/unlock
block.
But, in above workqueue error case the psock is accessed outside
the read_lock/unlock block of the caller. So instead of using
psock directly we must do a look up against the sk again to
ensure the psock is available.
There is an an ugly piece here where we must handle
the case where we paused the strp and removed the psock. On
psock removal we first pause the strparser and then remove
the psock. If the strparser is paused while an skb is
scheduled on the workqueue the skb will be dropped on the
flow and kfree_skb() is called. If the workqueue manages
to get called before we pause the strparser but runs the rcvmsg
callback after the psock is removed we will hit the unlikely
case where we run the sockmap rcvmsg handler but do not have
a psock. For now we will follow strparser logic and drop the
skb on the floor with skb_kfree(). This is ugly because the
data is dropped. To date this has not caused problems in practice
because either the application controlling the sockmap is
coordinating with the datapath so that skbs are "flushed"
before removal or we simply wait for the sock to be closed before
removing it.
This patch fixes the describe RCU bug and dropping the skb doesn't
make things worse. Future patches will improve this by allowing
the normal case where skbs are not merged to skip the strparser
altogether. In practice many (most?) use cases have no need to
merge skbs so its both a code complexity hit as seen above and
a performance issue. For example, in the Cilium case we always
set the strparser up to return sbks 1:1 without any merging and
have avoided above issues.
Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/159312679888.18340.15248924071966273998.stgit@john-XPS-13-9370
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 93dd5f185916b05e931cffae636596f21f98546e ]
There are two paths to generate the below RCU splat the first and
most obvious is the result of the BPF verdict program issuing a
redirect on a TLS socket (This is the splat shown below). Unlike
the non-TLS case the caller of the *strp_read() hooks does not
wrap the call in a rcu_read_lock/unlock. Then if the BPF program
issues a redirect action we hit the RCU splat.
However, in the non-TLS socket case the splat appears to be
relatively rare, because the skmsg caller into the strp_data_ready()
is wrapped in a rcu_read_lock/unlock. Shown here,
static void sk_psock_strp_data_ready(struct sock *sk)
{
struct sk_psock *psock;
rcu_read_lock();
psock = sk_psock(sk);
if (likely(psock)) {
if (tls_sw_has_ctx_rx(sk)) {
psock->parser.saved_data_ready(sk);
} else {
write_lock_bh(&sk->sk_callback_lock);
strp_data_ready(&psock->parser.strp);
write_unlock_bh(&sk->sk_callback_lock);
}
}
rcu_read_unlock();
}
If the above was the only way to run the verdict program we
would be safe. But, there is a case where the strparser may throw an
ENOMEM error while parsing the skb. This is a result of a failed
skb_clone, or alloc_skb_for_msg while building a new merged skb when
the msg length needed spans multiple skbs. This will in turn put the
skb on the strp_wrk workqueue in the strparser code. The skb will
later be dequeued and verdict programs run, but now from a
different context without the rcu_read_lock()/unlock() critical
section in sk_psock_strp_data_ready() shown above. In practice
I have not seen this yet, because as far as I know most users of the
verdict programs are also only working on single skbs. In this case no
merge happens which could trigger the above ENOMEM errors. In addition
the system would need to be under memory pressure. For example, we
can't hit the above case in selftests because we missed having tests
to merge skbs. (Added in later patch)
To fix the below splat extend the rcu_read_lock/unnlock block to
include the call to sk_psock_tls_verdict_apply(). This will fix both
TLS redirect case and non-TLS redirect+error case. Also remove
psock from the sk_psock_tls_verdict_apply() function signature its
not used there.
[ 1095.937597] WARNING: suspicious RCU usage
[ 1095.940964] 5.7.0-rc7-02911-g463bac5f1ca79 #1 Tainted: G W
[ 1095.944363] -----------------------------
[ 1095.947384] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage!
[ 1095.950866]
[ 1095.950866] other info that might help us debug this:
[ 1095.950866]
[ 1095.957146]
[ 1095.957146] rcu_scheduler_active = 2, debug_locks = 1
[ 1095.961482] 1 lock held by test_sockmap/15970:
[ 1095.964501] #0: ffff9ea6b25de660 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x13a/0x840 [tls]
[ 1095.968568]
[ 1095.968568] stack backtrace:
[ 1095.975001] CPU: 1 PID: 15970 Comm: test_sockmap Tainted: G W 5.7.0-rc7-02911-g463bac5f1ca79 #1
[ 1095.977883] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 1095.980519] Call Trace:
[ 1095.982191] dump_stack+0x8f/0xd0
[ 1095.984040] sk_psock_skb_redirect+0xa6/0xf0
[ 1095.986073] sk_psock_tls_strp_read+0x1d8/0x250
[ 1095.988095] tls_sw_recvmsg+0x714/0x840 [tls]
v2: Improve commit message to identify non-TLS redirect plus error case
condition as well as more common TLS case. In the process I decided
doing the rcu_read_unlock followed by the lock/unlock inside branches
was unnecessarily complex. We can just extend the current rcu block
and get the same effeective without the shuffling and branching.
Thanks Martin!
Fixes: e91de6afa81c1 ("bpf: Fix running sk_skb program types with ktls")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/159312677907.18340.11064813152758406626.stgit@john-XPS-13-9370
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit bc7a39b4272b9672d806d422b6850e8c1a09914c ]
When a memory leak was fixed, a return err was changed to goto err,
but, accidentally, the if (err) was removed, so now we always exit at
this point.
Fix it by adding if (err) back.
Fixes: 9951ebfcdf2b ("nl80211: fix potential leak in AP start")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200626124931.871ba5b31eee.I97340172d92164ee92f3c803fe20a8a6e97714e1@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 8ff41cc21714704ef0158a546c3c4d07fae2c952 upstream.
This code assumes that the user passed in enough data for a
qrtr_hdr_v1 or qrtr_hdr_v2 struct, but it's not necessarily true. If
the buffer is too small then it will read beyond the end.
Reported-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reported-by: syzbot+b8fe393f999a291a9ea6@syzkaller.appspotmail.com
Fixes: 194ccc88297a ("net: qrtr: Support decoding incoming v2 packets")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 02c28dffb13abbaaedece1e4a6493b48ad3f913a ]
Commit 2ad6691d988c, which moved the modification of the status annotation
for a packet in the Tx buffer prior to the retransmission moved the state
clearance, but managed to lose the bit that set it to UNACK.
Consequently, if a retransmission occurs, the packet is accidentally
changed to the ACK state (ie. 0) by masking it off, which means that the
packet isn't counted towards the tally of newly-ACK'd packets if it gets
hard-ACK'd. This then prevents the congestion control algorithm from
recovering properly.
Fix by reinstating the change of state to UNACK.
Spotted by the generic/460 xfstest.
Fixes: 2ad6691d988c ("rxrpc: Fix race between incoming ACK parser and retransmitter")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2ad6691d988c0c611362ddc2aad89e0fb50e3261 ]
There's a race between the retransmission code and the received ACK parser.
The problem is that the retransmission loop has to drop the lock under
which it is iterating through the transmission buffer in order to transmit
a packet, but whilst the lock is dropped, the ACK parser can crank the Tx
window round and discard the packets from the buffer.
The retransmission code then updated the annotations for the wrong packet
and a later retransmission thought it had to retransmit a packet that
wasn't there, leading to a NULL pointer dereference.
Fix this by:
(1) Moving the annotation change to before we drop the lock prior to
transmission. This means we can't vary the annotation depending on
the outcome of the transmission, but that's fine - we'll retransmit
again later if it failed now.
(2) Skipping the packet if the skb pointer is NULL.
The following oops was seen:
BUG: kernel NULL pointer dereference, address: 000000000000002d
Workqueue: krxrpcd rxrpc_process_call
RIP: 0010:rxrpc_get_skb+0x14/0x8a
...
Call Trace:
rxrpc_resend+0x331/0x41e
? get_vtime_delta+0x13/0x20
rxrpc_process_call+0x3c0/0x4ac
process_one_work+0x18f/0x27f
worker_thread+0x1a3/0x247
? create_worker+0x17d/0x17d
kthread+0xe6/0xeb
? kthread_delayed_work_timer_fn+0x83/0x83
ret_from_fork+0x1f/0x30
Fixes: 248f219cb8bc ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 7b2182ec381f8ea15c7eb1266d6b5d7da620ad93 upstream.
The RPC client currently doesn't handle ERR_CHUNK replies correctly.
rpcrdma_complete_rqst() incorrectly passes a negative number to
xprt_complete_rqst() as the number of bytes copied. Instead, set
task->tk_status to the error value, and return zero bytes copied.
In these cases, return -EIO rather than -EREMOTEIO. The RPC client's
finite state machine doesn't know what to do with -EREMOTEIO.
Additional clean ups:
- Don't double-count RDMA_ERROR replies
- Remove a stale comment
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.vger.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 89a3c9f5b9f0bcaa9aea3e8b2a616fcaea9aad78 upstream.
@subbuf is an output parameter of xdr_buf_subsegment(). A survey of
call sites shows that @subbuf is always uninitialized before
xdr_buf_segment() is invoked by callers.
There are some execution paths through xdr_buf_subsegment() that do
not set all of the fields in @subbuf, leaving some pointer fields
containing garbage addresses. Subsequent processing of that buffer
then results in a page fault.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit b7ade38165ca0001c5a3bd5314a314abbbfbb1b7 upstream.
__rpc_depopulate(gssd_dentry) was lost on error path
cc: stable@vger.kernel.org
Fixes: commit 4b9a445e3eeb ("sunrpc: create a new dummy pipe for gssd to hold open")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 715028460082d07a7ec6fcd87b14b46784346a72 ]
When using ip_set with counters and comment, traffic causes the kernel
to panic on 32-bit ARM:
Alignment trap: not handling instruction e1b82f9f at [<bf01b0dc>]
Unhandled fault: alignment exception (0x221) at 0xea08133c
PC is at ip_set_match_extensions+0xe0/0x224 [ip_set]
The problem occurs when we try to update the 64-bit counters - the
faulting address above is not 64-bit aligned. The problem occurs
due to the way elements are allocated, for example:
set->dsize = ip_set_elem_len(set, tb, 0, 0);
map = ip_set_alloc(sizeof(*map) + elements * set->dsize);
If the element has a requirement for a member to be 64-bit aligned,
and set->dsize is not a multiple of 8, but is a multiple of four,
then every odd numbered elements will be misaligned - and hitting
an atomic64_add() on that element will cause the kernel to panic.
ip_set_elem_len() must return a size that is rounded to the maximum
alignment of any extension field stored in the element. This change
ensures that is the case.
Fixes: 95ad1f4a9358 ("netfilter: ipset: Fix extension alignment")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit a2ad7c21ad8cf1ce4ad65e13df1c2a1c29b38ac5 ]
The handling of the receive window size (rwind) from a received ACK packet
is not correct. The rxrpc_input_ackinfo() function currently checks the
current Tx window size against the rwind from the ACK to see if it has
changed, but then limits the rwind size before storing it in the tx_winsize
member and, if it increased, wake up the transmitting process. This means
that if rwind > RXRPC_RXTX_BUFF_SIZE - 1, this path will always be
followed.
Fix this by limiting rwind before we compare it to tx_winsize.
The effect of this can be seen by enabling the rxrpc_rx_rwind_change
tracepoint.
Fixes: 702f2ac87a9a ("rxrpc: Wake up the transmitter if Rx window size increases on the peer")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 94579ac3f6d0820adc83b5dc5358ead0158101e9 ]
During IPsec performance testing, we see bad ICMP checksum. The error packet
has duplicated ESP trailer due to double validate_xmit_xfrm calls. The first call
is from ip_output, but the packet cannot be sent because
netif_xmit_frozen_or_stopped is true and the packet gets dev_requeue_skb. The second
call is from NET_TX softirq. However after the first call, the packet already
has the ESP trailer.
Fix by marking the skb with XFRM_XMIT bit after the packet is handled by
validate_xmit_xfrm to avoid duplicate ESP trailer insertion.
Fixes: f6e27114a60a ("net: Add a xfrm validate function to validate_xmit_skb")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 1a3db27ad9a72d033235b9673653962c02e3486e ]
Since the quiesce/activate rework, __netdev_watchdog_up() is directly
called in the ucc_geth driver.
Unfortunately, this function is not available for modules and thus
ucc_geth cannot be built as a module anymore. Fix it by exporting
__netdev_watchdog_up().
Since the commit introducing the regression was backported to stable
branches, this one should ideally be as well.
Fixes: 79dde73cf9bc ("net/ethernet/freescale: rework quiesce/activate for ucc_geth")
Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit b344579ca8478598937215f7005d6c7b84d28aee ]
Mirja Kuehlewind reported a bug in Linux TCP CUBIC Hystart, where
Hystart HYSTART_DELAY mechanism can exit Slow Start spuriously on an
ACK when the minimum rtt of a connection goes down. From inspection it
is clear from the existing code that this could happen in an example
like the following:
o The first 8 RTT samples in a round trip are 150ms, resulting in a
curr_rtt of 150ms and a delay_min of 150ms.
o The 9th RTT sample is 100ms. The curr_rtt does not change after the
first 8 samples, so curr_rtt remains 150ms. But delay_min can be
lowered at any time, so delay_min falls to 100ms. The code executes
the HYSTART_DELAY comparison between curr_rtt of 150ms and delay_min
of 100ms, and the curr_rtt is declared far enough above delay_min to
force a (spurious) exit of Slow start.
The fix here is simple: allow every RTT sample in a round trip to
lower the curr_rtt.
Fixes: ae27e98a5152 ("[TCP] CUBIC v2.3")
Reported-by: Mirja Kuehlewind <mirja.kuehlewind@ericsson.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3f608f0c41360b11b04c763f348b712f651c8bac ]
I spotted a few nits when comparing the in-tree version of sch_cake with
the out-of-tree one: A redundant error variable declaration shadowing an
outer declaration, and an indentation alignment issue. Fix both of these.
Fixes: 046f6fd5daef ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8c95eca0bb8c4bd2231a0d581f1ad0d50c90488c ]
As a further optimisation of the diffserv parsing codepath, we can skip it
entirely if CAKE is configured to neither use diffserv-based
classification, nor to zero out the diffserv bits.
Fixes: c87b4ecdbe8d ("sch_cake: Make sure we can write the IP header before changing DSCP bits")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 9208d2863ac689a563b92f2161d8d1e7127d0add ]
cake_handle_diffserv() tries to linearize mac and network header parts of
skb and to make it writable unconditionally. In some cases it leads to full
skb reallocation, which reduces throughput and increases CPU load. Some
measurements of IPv4 forward + NAPT on MIPS router with 580 MHz single-core
CPU was conducted. It appears that on kernel 4.9 skb_try_make_writable()
reallocates skb, if skb was allocated in ethernet driver via so-called
'build skb' method from page cache (it was discovered by strange increase
of kmalloc-2048 slab at first).
Obtain DSCP value via read-only skb_header_pointer() call, and leave
linearization only for DSCP bleaching or ECN CE setting. And, as an
additional optimisation, skip diffserv parsing entirely if it is not needed
by the current configuration.
Fixes: c87b4ecdbe8d ("sch_cake: Make sure we can write the IP header before changing DSCP bits")
Signed-off-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com>
[ fix a few style issues, reflow commit message ]
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ba61539c6ae57f4146284a5cb4f7b7ed8d42bf45 ]
In the datapath, the ip_tunnel_lookup() is used and it internally uses
fallback tunnel device pointer, which is fb_tunnel_dev.
This pointer variable should be set to NULL when a fb interface is deleted.
But there is no routine to set fb_tunnel_dev pointer to NULL.
So, this pointer will be still used after interface is deleted and
it eventually results in the use-after-free problem.
Test commands:
ip netns add A
ip netns add B
ip link add eth0 type veth peer name eth1
ip link set eth0 netns A
ip link set eth1 netns B
ip netns exec A ip link set lo up
ip netns exec A ip link set eth0 up
ip netns exec A ip link add gre1 type gre local 10.0.0.1 \
remote 10.0.0.2
ip netns exec A ip link set gre1 up
ip netns exec A ip a a 10.0.100.1/24 dev gre1
ip netns exec A ip a a 10.0.0.1/24 dev eth0
ip netns exec B ip link set lo up
ip netns exec B ip link set eth1 up
ip netns exec B ip link add gre1 type gre local 10.0.0.2 \
remote 10.0.0.1
ip netns exec B ip link set gre1 up
ip netns exec B ip a a 10.0.100.2/24 dev gre1
ip netns exec B ip a a 10.0.0.2/24 dev eth1
ip netns exec A hping3 10.0.100.2 -2 --flood -d 60000 &
ip netns del B
Splat looks like:
[ 77.793450][ C3] ==================================================================
[ 77.794702][ C3] BUG: KASAN: use-after-free in ip_tunnel_lookup+0xcc4/0xf30
[ 77.795573][ C3] Read of size 4 at addr ffff888060bd9c84 by task hping3/2905
[ 77.796398][ C3]
[ 77.796664][ C3] CPU: 3 PID: 2905 Comm: hping3 Not tainted 5.8.0-rc1+ #616
[ 77.797474][ C3] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 77.798453][ C3] Call Trace:
[ 77.798815][ C3] <IRQ>
[ 77.799142][ C3] dump_stack+0x9d/0xdb
[ 77.799605][ C3] print_address_description.constprop.7+0x2cc/0x450
[ 77.800365][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
[ 77.800908][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
[ 77.801517][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
[ 77.802145][ C3] kasan_report+0x154/0x190
[ 77.802821][ C3] ? ip_tunnel_lookup+0xcc4/0xf30
[ 77.803503][ C3] ip_tunnel_lookup+0xcc4/0xf30
[ 77.804165][ C3] __ipgre_rcv+0x1ab/0xaa0 [ip_gre]
[ 77.804862][ C3] ? rcu_read_lock_sched_held+0xc0/0xc0
[ 77.805621][ C3] gre_rcv+0x304/0x1910 [ip_gre]
[ 77.806293][ C3] ? lock_acquire+0x1a9/0x870
[ 77.806925][ C3] ? gre_rcv+0xfe/0x354 [gre]
[ 77.807559][ C3] ? erspan_xmit+0x2e60/0x2e60 [ip_gre]
[ 77.808305][ C3] ? rcu_read_lock_sched_held+0xc0/0xc0
[ 77.809032][ C3] ? rcu_read_lock_held+0x90/0xa0
[ 77.809713][ C3] gre_rcv+0x1b8/0x354 [gre]
[ ... ]
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dafabb6590cb15f300b77c095d50312e2c7c8e0f ]
In the datapath, the ip6gre_tunnel_lookup() is used and it internally uses
fallback tunnel device pointer, which is fb_tunnel_dev.
This pointer variable should be set to NULL when a fb interface is deleted.
But there is no routine to set fb_tunnel_dev pointer to NULL.
So, this pointer will be still used after interface is deleted and
it eventually results in the use-after-free problem.
Test commands:
ip netns add A
ip netns add B
ip link add eth0 type veth peer name eth1
ip link set eth0 netns A
ip link set eth1 netns B
ip netns exec A ip link set lo up
ip netns exec A ip link set eth0 up
ip netns exec A ip link add ip6gre1 type ip6gre local fc:0::1 \
remote fc:0::2
ip netns exec A ip -6 a a fc💯:1/64 dev ip6gre1
ip netns exec A ip link set ip6gre1 up
ip netns exec A ip -6 a a fc:0::1/64 dev eth0
ip netns exec A ip link set ip6gre0 up
ip netns exec B ip link set lo up
ip netns exec B ip link set eth1 up
ip netns exec B ip link add ip6gre1 type ip6gre local fc:0::2 \
remote fc:0::1
ip netns exec B ip -6 a a fc💯:2/64 dev ip6gre1
ip netns exec B ip link set ip6gre1 up
ip netns exec B ip -6 a a fc:0::2/64 dev eth1
ip netns exec B ip link set ip6gre0 up
ip netns exec A ping fc💯:2 -s 60000 &
ip netns del B
Splat looks like:
[ 73.087285][ C1] BUG: KASAN: use-after-free in ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre]
[ 73.088361][ C1] Read of size 4 at addr ffff888040559218 by task ping/1429
[ 73.089317][ C1]
[ 73.089638][ C1] CPU: 1 PID: 1429 Comm: ping Not tainted 5.7.0+ #602
[ 73.090531][ C1] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 73.091725][ C1] Call Trace:
[ 73.092160][ C1] <IRQ>
[ 73.092556][ C1] dump_stack+0x96/0xdb
[ 73.093122][ C1] print_address_description.constprop.6+0x2cc/0x450
[ 73.094016][ C1] ? ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre]
[ 73.094894][ C1] ? ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre]
[ 73.095767][ C1] ? ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre]
[ 73.096619][ C1] kasan_report+0x154/0x190
[ 73.097209][ C1] ? ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre]
[ 73.097989][ C1] ip6gre_tunnel_lookup+0x1064/0x13f0 [ip6_gre]
[ 73.098750][ C1] ? gre_del_protocol+0x60/0x60 [gre]
[ 73.099500][ C1] gre_rcv+0x1c5/0x1450 [ip6_gre]
[ 73.100199][ C1] ? ip6gre_header+0xf00/0xf00 [ip6_gre]
[ 73.100985][ C1] ? rcu_read_lock_sched_held+0xc0/0xc0
[ 73.101830][ C1] ? ip6_input_finish+0x5/0xf0
[ 73.102483][ C1] ip6_protocol_deliver_rcu+0xcbb/0x1510
[ 73.103296][ C1] ip6_input_finish+0x5b/0xf0
[ 73.103920][ C1] ip6_input+0xcd/0x2c0
[ 73.104473][ C1] ? ip6_input_finish+0xf0/0xf0
[ 73.105115][ C1] ? rcu_read_lock_held+0x90/0xa0
[ 73.105783][ C1] ? rcu_read_lock_sched_held+0xc0/0xc0
[ 73.106548][ C1] ipv6_rcv+0x1f1/0x300
[ ... ]
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: c12b395a4664 ("gre: Support GRE over IPv6")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 662051215c758ae8545451628816204ed6cd372d ]
Back in 2013, we made a change that broke fast retransmit
for non SACK flows.
Indeed, for these flows, a sender needs to receive three duplicate
ACK before starting fast retransmit. Sending ACK with different
receive window do not count.
Even if enabling SACK is strongly recommended these days,
there still are some cases where it has to be disabled.
Not increasing the window seems better than having to
rely on RTO.
After the fix, following packetdrill test gives :
// Initialize connection
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
+0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
+0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
+0 < . 1:1(0) ack 1 win 514
+0 accept(3, ..., ...) = 4
+0 < . 1:1001(1000) ack 1 win 514
// Quick ack
+0 > . 1:1(0) ack 1001 win 264
+0 < . 2001:3001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 3001:4001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 4001:5001(1000) ack 1 win 514
// DUPACK : Normally we should not change the window
+0 > . 1:1(0) ack 1001 win 264
+0 < . 1001:2001(1000) ack 1 win 514
// Hole is repaired.
+0 > . 1:1(0) ack 5001 win 272
Fixes: 4e4f1fc22681 ("tcp: properly increase rcv_ssthresh for ofo packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2570284060b48f3f79d8f1a2698792f36c385e9a ]
there is a problem with the CWR flag set in an incoming ACK segment
and it leads to the situation when the ECE flag is latched forever
the following packetdrill script shows what happens:
// Stack receives incoming segments with CE set
+0.1 <[ect0] . 11001:12001(1000) ack 1001 win 65535
+0.0 <[ce] . 12001:13001(1000) ack 1001 win 65535
+0.0 <[ect0] P. 13001:14001(1000) ack 1001 win 65535
// Stack repsonds with ECN ECHO
+0.0 >[noecn] . 1001:1001(0) ack 12001
+0.0 >[noecn] E. 1001:1001(0) ack 13001
+0.0 >[noecn] E. 1001:1001(0) ack 14001
// Write a packet
+0.1 write(3, ..., 1000) = 1000
+0.0 >[ect0] PE. 1001:2001(1000) ack 14001
// Pure ACK received
+0.01 <[noecn] W. 14001:14001(0) ack 2001 win 65535
// Since CWR was sent, this packet should NOT have ECE set
+0.1 write(3, ..., 1000) = 1000
+0.0 >[ect0] P. 2001:3001(1000) ack 14001
// but Linux will still keep ECE latched here, with packetdrill
// flagging a missing ECE flag, expecting
// >[ect0] PE. 2001:3001(1000) ack 14001
// in the script
In the situation above we will continue to send ECN ECHO packets
and trigger the peer to reduce the congestion window. To avoid that
we can check CWR on pure ACKs received.
v3:
- Add a sequence check to avoid sending an ACK to an ACK
v2:
- Adjusted the comment
- move CWR check before checking for unacknowledged packets
Signed-off-by: Denis Kirjanov <denis.kirjanov@suse.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 471e39df96b9a4c4ba88a2da9e25a126624d7a9c ]
If a socket is set ipv6only, it will still send IPv4 addresses in the
INIT and INIT_ACK packets. This potentially misleads the peer into using
them, which then would cause association termination.
The fix is to not add IPv4 addresses to ipv6only sockets.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Corey Minyard <cminyard@mvista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 17843655708e1941c0653af3cd61be6948e36f43 ]
ovs connection tracking module performs de-fragmentation on incoming
fragmented traffic. Take info account if traffic has been de-fragmented
in execute_check_pkt_len action otherwise we will perform the wrong
nested action considering the original packet size. This issue typically
occurs if ovs-vswitchd adds a rule in the pipeline that requires connection
tracking (e.g. OVN stateful ACLs) before execute_check_pkt_len action.
Moreover take into account GSO fragment size for GSO packet in
execute_check_pkt_len routine
Fixes: 4d5ec89fc8d14 ("net: openvswitch: Add a new action check_pkt_len")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 5eea3a63ff4aba6a26002e657a6d21934b7e2b96 ]
ie.,
$ ifconfig eth0 6.6.6.6 netmask 255.255.255.0
$ ip rule add from 6.6.6.6 table 6666
$ ip route add 9.9.9.9 via 6.6.6.6
$ ping -I 6.6.6.6 9.9.9.9
PING 9.9.9.9 (9.9.9.9) from 6.6.6.6 : 56(84) bytes of data.
3 packets transmitted, 0 received, 100% packet loss, time 2079ms
$ arp
Address HWtype HWaddress Flags Mask Iface
6.6.6.6 (incomplete) eth0
The arp request address is error, this is because fib_table_lookup in
fib_check_nh lookup the destnation 9.9.9.9 nexthop, the scope of
the fib result is RT_SCOPE_LINK,the correct scope is RT_SCOPE_HOST.
Here I add a check of whether this is RT_TABLE_MAIN to solve this problem.
Fixes: 3bfd847203c6 ("net: Use passed in table for nexthop lookups")
Signed-off-by: guodeqing <geffrey.guo@huawei.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 41b14fb8724d5a4b382a63cb4a1a61880347ccb8 ]
Clearing the sock TX queue in sk_set_socket() might cause unexpected
out-of-order transmit when called from sock_orphan(), as outstanding
packets can pick a different TX queue and bypass the ones already queued.
This is undesired in general. More specifically, it breaks the in-order
scheduling property guarantee for device-offloaded TLS sockets.
Remove the call to sk_tx_queue_clear() in sk_set_socket(), and add it
explicitly only where needed.
Fixes: e022f0b4a03f ("net: Introduce sk_tx_queue_mapping")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit db7202dec92e6caa2706c21d6fc359af318bde2e ]
The eth_addr member is passed to ether_addr functions that require
2-byte alignment, therefore the member must be properly aligned
to avoid unaligned accesses.
The problem is in place since the initial merge of multicast to unicast:
commit 6db6f0eae6052b70885562e1733896647ec1d807 bridge: multicast to unicast
Fixes: 6db6f0eae605 ("bridge: multicast to unicast")
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Felix Fietkau <nbd@nbd.name>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Martitz <t.martitz@avm.de>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 11d6011c2cf29f7c8181ebde6c8bc0c4d83adcd7 ]
Sequence counters write paths are critical sections that must never be
preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.
Commit 5dbe7c178d3f ("net: fix kernel deadlock with interface rename and
netdev name retrieval.") handled a deadlock, observed with
CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
infinitely spinning: it got scheduled after the seqcount write side
blocked inside its own critical section.
To fix that deadlock, among other issues, the commit added a
cond_resched() inside the read side section. While this will get the
non-preemptible kernel eventually unstuck, the seqcount reader is fully
exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
The fix is also still broken: if the seqcount reader belongs to a
real-time scheduling policy, it can spin forever and the kernel will
livelock.
Disabling preemption over the seqcount write side critical section will
not work: inside it are a number of GFP_KERNEL allocations and mutex
locking through the drivers/base/ :: device_rename() call chain.
>From all the above, replace the seqcount with a rwsem.
Fixes: 5dbe7c178d3f (net: fix kernel deadlock with interface rename and netdev name retrieval.)
Fixes: 30e6c9fa93cf (net: devnet_rename_seq should be a seqcount)
Fixes: c91f6df2db49 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
Cc: <stable@vger.kernel.org>
Reported-by: kbuild test robot <lkp@intel.com> [ v1 missing up_read() on error exit ]
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [ v1 missing up_read() on error exit ]
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 2da2b32fd9346009e9acdb68c570ca8d3966aba7 ]
CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
Both PREEMPT and PREEMPT_RT require the same functionality which today
depends on CONFIG_PREEMPT.
Update the comment to use CONFIG_PREEMPTION.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/r/20191015191821.11479-22-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 60e5ca8a64bad8f3e2e20a1e57846e497361c700 ]
Add missed bpf_map_charge_init() in sock_hash_alloc() and
correspondingly bpf_map_charge_finish() on ENOMEM.
It was found accidentally while working on unrelated selftest that
checks "map->memory.pages > 0" is true for all map types.
Before:
# bpftool m l
...
3692: sockhash name m_sockhash flags 0x0
key 4B value 4B max_entries 8 memlock 0B
After:
# bpftool m l
...
84: sockmap name m_sockmap flags 0x0
key 4B value 4B max_entries 8 memlock 4096B
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200612000857.2881453-1-rdna@fb.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit aa2cad0600ed2ca6a0ab39948d4db1666b6c962b ]
Propagate sock_alloc_send_skb error code, not set it to
EAGAIN unconditionally, when fail to allocate skb, which
might cause that user space unnecessary loops.
Fixes: 35fcde7f8deb ("xsk: support for Tx")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1591852266-24017-1-git-send-email-lirongqing@baidu.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0f5d82f187e1beda3fe7295dfc500af266a5bd80 ]
Added a check in the switch case on start_header that checks for
the existence of the header, and in the case that MAC is not set
and the caller requests for MAC, -EFAULT. If the caller requests
for NET then MAC's existence is completely ignored.
There is no function to check NET header's existence and as far
as cgroup_skb/egress is concerned it should always be set.
Removed for ptr >= the start of header, considering offset is
bounded unsigned and should always be true. len <= end - mac is
redundant to ptr + len <= end.
Fixes: 3eee1f75f2b9 ("bpf: fix bpf_skb_load_bytes_relative pkt length check")
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/76bb820ddb6a95f59a772ecbd8c8a336f646b362.1591812755.git.zhuyifei@google.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 75e68e5bf2c7fa9d3e874099139df03d5952a3e1 ]
We can end up modifying the sockhash bucket list from two CPUs when a
sockhash is being destroyed (sock_hash_free) on one CPU, while a socket
that is in the sockhash is unlinking itself from it on another CPU
it (sock_hash_delete_from_link).
This results in accessing a list element that is in an undefined state as
reported by KASAN:
| ==================================================================
| BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280
| Write of size 8 at addr dead000000000122 by task kworker/2:1/95
|
| CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
| dump_stack+0x97/0xe0
| ? sock_hash_free+0x13c/0x280
| __kasan_report.cold+0x5/0x40
| ? mark_lock+0xbc1/0xc00
| ? sock_hash_free+0x13c/0x280
| kasan_report+0x38/0x50
| ? sock_hash_free+0x152/0x280
| sock_hash_free+0x13c/0x280
| bpf_map_free_deferred+0xb2/0xd0
| ? bpf_map_charge_finish+0x50/0x50
| ? rcu_read_lock_sched_held+0x81/0xb0
| ? rcu_read_lock_bh_held+0x90/0x90
| process_one_work+0x59a/0xac0
| ? lock_release+0x3b0/0x3b0
| ? pwq_dec_nr_in_flight+0x110/0x110
| ? rwlock_bug.part.0+0x60/0x60
| worker_thread+0x7a/0x680
| ? _raw_spin_unlock_irqrestore+0x4c/0x60
| kthread+0x1cc/0x220
| ? process_one_work+0xac0/0xac0
| ? kthread_create_on_node+0xa0/0xa0
| ret_from_fork+0x24/0x30
| ==================================================================
Fix it by reintroducing spin-lock protected critical section around the
code that removes the elements from the bucket on sockhash free.
To do that we also need to defer processing of removed elements, until out
of atomic context so that we can unlink the socket from the map when
holding the sock lock.
Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200607205229.2389672-3-jakub@cloudflare.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 487082fb7bd2a32b66927d2b22e3a81b072b44f0 ]
When user application calls read() with MSG_PEEK flag to read data
of bpf sockmap socket, kernel panic happens at
__tcp_bpf_recvmsg+0x12c/0x350. sk_msg is not removed from ingress_msg
queue after read out under MSG_PEEK flag is set. Because it's not
judged whether sk_msg is the last msg of ingress_msg queue, the next
sk_msg may be the head of ingress_msg queue, whose memory address of
sg page is invalid. So it's necessary to add check codes to prevent
this problem.
[20759.125457] BUG: kernel NULL pointer dereference, address:
0000000000000008
[20759.132118] CPU: 53 PID: 51378 Comm: envoy Tainted: G E
5.4.32 #1
[20759.140890] Hardware name: Inspur SA5212M4/YZMB-00370-109, BIOS
4.1.12 06/18/2017
[20759.149734] RIP: 0010:copy_page_to_iter+0xad/0x300
[20759.270877] __tcp_bpf_recvmsg+0x12c/0x350
[20759.276099] tcp_bpf_recvmsg+0x113/0x370
[20759.281137] inet_recvmsg+0x55/0xc0
[20759.285734] __sys_recvfrom+0xc8/0x130
[20759.290566] ? __audit_syscall_entry+0x103/0x130
[20759.296227] ? syscall_trace_enter+0x1d2/0x2d0
[20759.301700] ? __audit_syscall_exit+0x1e4/0x290
[20759.307235] __x64_sys_recvfrom+0x24/0x30
[20759.312226] do_syscall_64+0x55/0x1b0
[20759.316852] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: dihu <anny.hu@linux.alibaba.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200605084625.9783-1-anny.hu@linux.alibaba.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 32f71aa497cfb23d37149c2ef16ad71fce2e45e2 ]
The user ID value isn't actually much use - and leaks a kernel pointer or a
userspace value - so replace it with the call debug ID, which appears in trace
points.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 118917d696dc59fd3e1741012c2f9db2294bed6f ]
Fix off-by-one issues in 'rpc_ntop6':
- 'snprintf' returns the number of characters which would have been
written if enough space had been available, excluding the terminating
null byte. Thus, a return value of 'sizeof(scopebuf)' means that the
last character was dropped.
- 'strcat' adds a terminating null byte to the string, thus if len ==
buflen, the null byte is written past the end of the buffer.
Signed-off-by: Fedor Tokarev <ftokarev@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 33a7c831565c43a7ee2f38c7df4c4a40e1dfdfed ]
When sockhash gets destroyed while sockets are still linked to it, we will
walk the bucket lists and delete the links. However, we are not freeing the
list elements after processing them, leaking the memory.
The leak can be triggered by close()'ing a sockhash map when it still
contains sockets, and observed with kmemleak:
unreferenced object 0xffff888116e86f00 (size 64):
comm "race_sock_unlin", pid 223, jiffies 4294731063 (age 217.404s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
81 de e8 41 00 00 00 00 c0 69 2f 15 81 88 ff ff ...A.....i/.....
backtrace:
[<00000000dd089ebb>] sock_hash_update_common+0x4ca/0x760
[<00000000b8219bd5>] sock_hash_update_elem+0x1d2/0x200
[<000000005e2c23de>] __do_sys_bpf+0x2046/0x2990
[<00000000d0084618>] do_syscall_64+0xad/0x9a0
[<000000000d96f263>] entry_SYSCALL_64_after_hwframe+0x49/0xb3
Fix it by freeing the list element when we're done with it.
Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200607205229.2389672-2-jakub@cloudflare.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 24c5efe41c29ee3e55bcf5a1c9f61ca8709622e8 upstream.
gss_mech_register() calls svcauth_gss_register_pseudoflavor() for each
flavour, but gss_mech_unregister() does not call auth_domain_put().
This is unbalanced and makes it impossible to reload the module.
Change svcauth_gss_register_pseudoflavor() to return the registered
auth_domain, and save it for later release.
Cc: stable@vger.kernel.org (v2.6.12+)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit d47a5dc2888fd1b94adf1553068b8dad76cec96c upstream.
There is no valid case for supporting duplicate pseudoflavor
registrations.
Currently the silent acceptance of such registrations is hiding a bug.
The rpcsec_gss_krb5 module registers 2 flavours but does not unregister
them, so if you load, unload, reload the module, it will happily
continue to use the old registration which now has pointers to the
memory were the module was originally loaded. This could lead to
unexpected results.
So disallow duplicate registrations.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651
Cc: stable@vger.kernel.org (v2.6.12+)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e91de6afa81c10e9f855c5695eb9a53168d96b73 ]
KTLS uses a stream parser to collect TLS messages and send them to
the upper layer tls receive handler. This ensures the tls receiver
has a full TLS header to parse when it is run. However, when a
socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
is enabled we end up with two stream parsers running on the same
socket.
The result is both try to run on the same socket. First the KTLS
stream parser runs and calls read_sock() which will tcp_read_sock
which in turn calls tcp_rcv_skb(). This dequeues the skb from the
sk_receive_queue. When this is done KTLS code then data_ready()
callback which because we stacked KTLS on top of the bpf stream
verdict program has been replaced with sk_psock_start_strp(). This
will in turn kick the stream parser again and eventually do the
same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
a skb from the sk_receive_queue.
At this point the data stream is broke. Part of the stream was
handled by the KTLS side some other bytes may have been handled
by the BPF side. Generally this results in either missing data
or more likely a "Bad Message" complaint from the kTLS receive
handler as the BPF program steals some bytes meant to be in a
TLS header and/or the TLS header length is no longer correct.
We've already broke the idealized model where we can stack ULPs
in any order with generic callbacks on the TX side to handle this.
So in this patch we do the same thing but for RX side. We add
a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
program is running and add a tls_sw_has_ctx_rx() helper so BPF
side can learn there is a TLS ULP on the socket.
Then on BPF side we omit calling our stream parser to avoid
breaking the data stream for the KTLS receiver. Then on the
KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
receiver is done with the packet but before it posts the
msg to userspace. This gives us symmetry between the TX and
RX halfs and IMO makes it usable again. On the TX side we
process packets in this order BPF -> TLS -> TCP and on
the receive side in the reverse order TCP -> TLS -> BPF.
Discovered while testing OpenSSL 3.0 Alpha2.0 release.
Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit ca2f5f21dbbd5e3a00cd3e97f728aa2ca0b2e011 ]
We will need this block of code called from tls context shortly
lets refactor the redirect logic so its easy to use. This also
cleans up the switch stmt so we have fewer fallthrough cases.
No logic changes are intended.
Fixes: d829e9c4112b5 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0d7c83463fdf7841350f37960a7abadd3e650b41 ]
Instead of EINVAL which should be used for malformed netlink messages.
Fixes: eb31628e37a0 ("netfilter: nf_tables: Add support for IPv6 NAT")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 9ad346c90509ebd983f60da7d082f261ad329507 ]
The commit 8c46fcd78308 ("batman-adv: disable ethtool link speed detection
when auto negotiation off") disabled the usage of ethtool's link_ksetting
when auto negotation was enabled due to invalid values when used with
tun/tap virtual net_devices. According to the patch, automatic measurements
should be used for these kind of interfaces.
But there are major flaws with this argumentation:
* automatic measurements are not implemented
* auto negotiation has nothing to do with the validity of the retrieved
values
The first point has to be fixed by a longer patch series. The "validity"
part of the second point must be addressed in the same patch series by
dropping the usage of ethtool's link_ksetting (thus always doing automatic
measurements over ethernet).
Drop the patch again to have more default values for various net_device
types/configurations. The user can still overwrite them using the
batadv_hardif's BATADV_ATTR_THROUGHPUT_OVERRIDE.
Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 56b5453a86203a44726f523b4133c1feca49ce7c ]
Bluetooth PTS test case HFP/AG/ACC/BI-12-I accepts SCO connection
with invalid parameter at the first SCO request expecting AG to
attempt another SCO request with the use of "safe settings" for
given codec, base on section 5.7.1.2 of HFP 1.7 specification.
This patch addresses it by adding "Invalid LMP Parameters" (0x1e)
to the SCO fallback case. Verified with below log:
< HCI Command: Setup Synchronous Connection (0x01|0x0028) plen 17
Handle: 256
Transmit bandwidth: 8000
Receive bandwidth: 8000
Max latency: 13
Setting: 0x0003
Input Coding: Linear
Input Data Format: 1's complement
Input Sample Size: 8-bit
# of bits padding at MSB: 0
Air Coding Format: Transparent Data
Retransmission effort: Optimize for link quality (0x02)
Packet type: 0x0380
3-EV3 may not be used
2-EV5 may not be used
3-EV5 may not be used
> HCI Event: Command Status (0x0f) plen 4
Setup Synchronous Connection (0x01|0x0028) ncmd 1
Status: Success (0x00)
> HCI Event: Number of Completed Packets (0x13) plen 5
Num handles: 1
Handle: 256
Count: 1
> HCI Event: Max Slots Change (0x1b) plen 3
Handle: 256
Max slots: 1
> HCI Event: Synchronous Connect Complete (0x2c) plen 17
Status: Invalid LMP Parameters / Invalid LL Parameters (0x1e)
Handle: 0
Address: 00:1B:DC:F2:21:59 (OUI 00-1B-DC)
Link type: eSCO (0x02)
Transmission interval: 0x00
Retransmission window: 0x02
RX packet length: 0
TX packet length: 0
Air mode: Transparent (0x03)
< HCI Command: Setup Synchronous Connection (0x01|0x0028) plen 17
Handle: 256
Transmit bandwidth: 8000
Receive bandwidth: 8000
Max latency: 8
Setting: 0x0003
Input Coding: Linear
Input Data Format: 1's complement
Input Sample Size: 8-bit
# of bits padding at MSB: 0
Air Coding Format: Transparent Data
Retransmission effort: Optimize for link quality (0x02)
Packet type: 0x03c8
EV3 may be used
2-EV3 may not be used
3-EV3 may not be used
2-EV5 may not be used
3-EV5 may not be used
> HCI Event: Command Status (0x0f) plen 4
Setup Synchronous Connection (0x01|0x0028) ncmd 1
Status: Success (0x00)
> HCI Event: Max Slots Change (0x1b) plen 3
Handle: 256
Max slots: 5
> HCI Event: Max Slots Change (0x1b) plen 3
Handle: 256
Max slots: 1
> HCI Event: Synchronous Connect Complete (0x2c) plen 17
Status: Success (0x00)
Handle: 257
Address: 00:1B:DC:F2:21:59 (OUI 00-1B-DC)
Link type: eSCO (0x02)
Transmission interval: 0x06
Retransmission window: 0x04
RX packet length: 30
TX packet length: 30
Air mode: Transparent (0x03)
Signed-off-by: Hsin-Yu Chao <hychao@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>