IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
[ Upstream commit acab713177 ]
This un-breaks lookups in sets that have the 'dynamic' flag set.
Given this active example configuration:
table filter {
set set1 {
type ipv4_addr
size 64
flags dynamic,timeout
timeout 1m
}
chain input {
type filter hook input priority 0; policy accept;
}
}
... this works:
nft add rule ip filter input add @set1 { ip saddr }
-> whenever rule is triggered, the source ip address is inserted
into the set (if it did not exist).
This won't work:
nft add rule ip filter input ip saddr @set1 counter
Error: Could not process rule: Operation not supported
In other words, we can add entries to the set, but then can't make
matching decision based on that set.
That is just wrong -- all set backends support lookups (else they would
not be very useful).
The failure comes from an explicit rejection in nft_lookup.c.
Looking at the history, it seems like NFT_SET_EVAL used to mean
'set contains expressions' (aka. "is a meter"), for instance something like
nft add rule ip filter input meter example { ip saddr limit rate 10/second }
or
nft add rule ip filter input meter example { ip saddr counter }
The actual meaning of NFT_SET_EVAL however, is
'set can be updated from the packet path'.
'meters' and packet-path insertions into sets, such as
'add @set { ip saddr }' use exactly the same kernel code (nft_dynset.c)
and thus require a set backend that provides the ->update() function.
The only set that provides this also is the only one that has the
NFT_SET_EVAL feature flag.
Removing the wrong check makes the above example work.
While at it, also fix the flag check during set instantiation to
allow supported combinations only.
Fixes: 8aeff920dc ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 18917d5147 upstream.
nfc_genl_deactivate_target() relies on the NFC_ATTR_TARGET_INDEX
attribute being present, but doesn't check whether it is actually
provided by the user. Same goes for nfc_genl_fw_download() and
NFC_ATTR_FIRMWARE_NAME.
This patch adds appropriate checks.
Found with syzkaller.
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 2d819d250a ]
Rajendra reported a kernel panic when a link was taken down:
[ 6870.263084] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
[ 6870.271856] IP: [<ffffffff8efc5764>] __ipv6_ifa_notify+0x154/0x290
<snip>
[ 6870.570501] Call Trace:
[ 6870.573238] [<ffffffff8efc58c6>] ? ipv6_ifa_notify+0x26/0x40
[ 6870.579665] [<ffffffff8efc98ec>] ? addrconf_dad_completed+0x4c/0x2c0
[ 6870.586869] [<ffffffff8efe70c6>] ? ipv6_dev_mc_inc+0x196/0x260
[ 6870.593491] [<ffffffff8efc9c6a>] ? addrconf_dad_work+0x10a/0x430
[ 6870.600305] [<ffffffff8f01ade4>] ? __switch_to_asm+0x34/0x70
[ 6870.606732] [<ffffffff8ea93a7a>] ? process_one_work+0x18a/0x430
[ 6870.613449] [<ffffffff8ea93d6d>] ? worker_thread+0x4d/0x490
[ 6870.619778] [<ffffffff8ea93d20>] ? process_one_work+0x430/0x430
[ 6870.626495] [<ffffffff8ea99dd9>] ? kthread+0xd9/0xf0
[ 6870.632145] [<ffffffff8f01ade4>] ? __switch_to_asm+0x34/0x70
[ 6870.638573] [<ffffffff8ea99d00>] ? kthread_park+0x60/0x60
[ 6870.644707] [<ffffffff8f01ae77>] ? ret_from_fork+0x57/0x70
[ 6870.650936] Code: 31 c0 31 d2 41 b9 20 00 08 02 b9 09 00 00 0
addrconf_dad_work is kicked to be scheduled when a device is brought
up. There is a race between addrcond_dad_work getting scheduled and
taking the rtnl lock and a process taking the link down (under rtnl).
The latter removes the host route from the inet6_addr as part of
addrconf_ifdown which is run for NETDEV_DOWN. The former attempts
to use the host route in __ipv6_ifa_notify. If the down event removes
the host route due to the race to the rtnl, then the BUG listed above
occurs.
Since the DAD sequence can not be aborted, add a check for the missing
host route in __ipv6_ifa_notify. The only way this should happen is due
to the previously mentioned race. The host route is created when the
address is added to an interface; it is only removed on a down event
where the address is kept. Add a warning if the host route is missing
AND the device is up; this is a situation that should never happen.
Fixes: f1705ec197 ("net: ipv6: Make address flushing on ifdown optional")
Reported-by: Rajendra Dendukuri <rajendra.dendukuri@broadcom.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit e95584a889 ]
We have identified a problem with the "oversubscription" policy in the
link transmission code.
When small messages are transmitted, and the sending link has reached
the transmit window limit, those messages will be bundled and put into
the link backlog queue. However, bundles of data messages are counted
at the 'CRITICAL' level, so that the counter for that level, instead of
the counter for the real, bundled message's level is the one being
increased.
Subsequent, to-be-bundled data messages at non-CRITICAL levels continue
to be tested against the unchanged counter for their own level, while
contributing to an unrestrained increase at the CRITICAL backlog level.
This leaves a gap in congestion control algorithm for small messages
that can result in starvation for other users or a "real" CRITICAL
user. Even that eventually can lead to buffer exhaustion & link reset.
We fix this by keeping a 'target_bskb' buffer pointer at each levels,
then when bundling, we only bundle messages at the same importance
level only. This way, we know exactly how many slots a certain level
have occupied in the queue, so can manage level congestion accurately.
By bundling messages at the same level, we even have more benefits. Let
consider this:
- One socket sends 64-byte messages at the 'CRITICAL' level;
- Another sends 4096-byte messages at the 'LOW' level;
When a 64-byte message comes and is bundled the first time, we put the
overhead of message bundle to it (+ 40-byte header, data copy, etc.)
for later use, but the next message can be a 4096-byte one that cannot
be bundled to the previous one. This means the last bundle carries only
one payload message which is totally inefficient, as for the receiver
also! Later on, another 64-byte message comes, now we make a new bundle
and the same story repeats...
With the new bundling algorithm, this will not happen, the 64-byte
messages will be bundled together even when the 4096-byte message(s)
comes in between. However, if the 4096-byte messages are sent at the
same level i.e. 'CRITICAL', the bundling algorithm will again cause the
same overhead.
Also, the same will happen even with only one socket sending small
messages at a rate close to the link transmit's one, so that, when one
message is bundled, it's transmitted shortly. Then, another message
comes, a new bundle is created and so on...
We will solve this issue radically by another patch.
Fixes: 365ad353c2 ("tipc: reduce risk of user starvation during link congestion")
Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d64bf89a75 ]
rds_ibdev:ipaddr_list and rds_ibdev:conn_list are initialized
after allocation some resources such as protection domain.
If allocation of such resources fail, then these uninitialized
variables are accessed in rds_ib_dev_free() in failure path. This
can potentially crash the system. The code has been updated to
initialize these variables very early in the function.
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
Signed-off-by: Sudhakar Dindukurti <sudhakar.dindukurti@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0d9138ffac ]
Lockdep is unhappy if two locks from the same class are held.
Fix the below warning for hyperv and virtio sockets (vmci socket code
doesn't have the issue) by using lock_sock_nested() when __vsock_release()
is called recursively:
============================================
WARNING: possible recursive locking detected
5.3.0+ #1 Not tainted
--------------------------------------------
server/1795 is trying to acquire lock:
ffff8880c5158990 (sk_lock-AF_VSOCK){+.+.}, at: hvs_release+0x10/0x120 [hv_sock]
but task is already holding lock:
ffff8880c5158150 (sk_lock-AF_VSOCK){+.+.}, at: __vsock_release+0x2e/0xf0 [vsock]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(sk_lock-AF_VSOCK);
lock(sk_lock-AF_VSOCK);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by server/1795:
#0: ffff8880c5d05ff8 (&sb->s_type->i_mutex_key#10){+.+.}, at: __sock_release+0x2d/0xa0
#1: ffff8880c5158150 (sk_lock-AF_VSOCK){+.+.}, at: __vsock_release+0x2e/0xf0 [vsock]
stack backtrace:
CPU: 5 PID: 1795 Comm: server Not tainted 5.3.0+ #1
Call Trace:
dump_stack+0x67/0x90
__lock_acquire.cold.67+0xd2/0x20b
lock_acquire+0xb5/0x1c0
lock_sock_nested+0x6d/0x90
hvs_release+0x10/0x120 [hv_sock]
__vsock_release+0x24/0xf0 [vsock]
__vsock_release+0xa0/0xf0 [vsock]
vsock_release+0x12/0x30 [vsock]
__sock_release+0x37/0xa0
sock_close+0x14/0x20
__fput+0xc1/0x250
task_work_run+0x98/0xc0
do_exit+0x344/0xc60
do_group_exit+0x47/0xb0
get_signal+0x15c/0xc50
do_signal+0x30/0x720
exit_to_usermode_loop+0x50/0xa0
do_syscall_64+0x24e/0x270
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f4184e85f31
Tested-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 8c7138b33e ]
The "reuse->sock[]" array is shared by multiple sockets. The going away
sk must unpublish itself from "reuse->sock[]" before making call_rcu()
call. However, this unpublish-action is currently done after a grace
period and it may cause use-after-free.
The fix is to move reuseport_detach_sock() to sk_destruct().
Due to the above reason, any socket with sk_reuseport_cb has
to go through the rcu grace period before freeing it.
It is a rather old bug (~3 yrs). The Fixes tag is not necessary
the right commit but it is the one that introduced the SOCK_RCU_FREE
logic and this fix is depending on it.
Fixes: a4298e4522 ("net: add SOCK_RCU_FREE socket flag")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit b406472b5a ]
Since commit c09551c6ff ("net: ipv4: use a dedicated counter
for icmp_v4 redirect packets") we use 'n_redirects' to account
for redirect packets, but we still use 'rate_tokens' to compute
the redirect packets exponential backoff.
If the device sent to the relevant peer any ICMP error packet
after sending a redirect, it will also update 'rate_token' according
to the leaking bucket schema; typically 'rate_token' will raise
above BITS_PER_LONG and the redirect packets backoff algorithm
will produce undefined behavior.
Fix the issue using 'n_redirects' to compute the exponential backoff
in ip_rt_send_redirect().
Note that we still clear rate_tokens after a redirect silence period,
to avoid changing an established behaviour.
The root cause predates git history; before the mentioned commit in
the critical scenario, the kernel stopped sending redirects, after
the mentioned commit the behavior more randomic.
Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Fixes: c09551c6ff ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 6af1799aaf ]
This began with a syzbot report. syzkaller was injecting
IPv6 TCP SYN packets having a v4mapped source address.
After an unsuccessful 4-tuple lookup, TCP creates a request
socket (SYN_RECV) and calls reqsk_queue_hash_req()
reqsk_queue_hash_req() calls sk_ehashfn(sk)
At this point we have AF_INET6 sockets, and the heuristic
used by sk_ehashfn() to either hash the IPv4 or IPv6 addresses
is to use ipv6_addr_v4mapped(&sk->sk_v6_daddr)
For the particular spoofed packet, we end up hashing V4 addresses
which were not initialized by the TCP IPv6 stack, so KMSAN fired
a warning.
I first fixed sk_ehashfn() to test both source and destination addresses,
but then faced various problems, including user-space programs
like packetdrill that had similar assumptions.
Instead of trying to fix the whole ecosystem, it is better
to admit that we have a dual stack behavior, and that we
can not build linux kernels without V4 stack anyway.
The dual stack API automatically forces the traffic to be IPv4
if v4mapped addresses are used at bind() or connect(), so it makes
no sense to allow IPv6 traffic to use the same v4mapped class.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 0e141f757b ]
erspan driver calls ether_setup(), after commit 61e84623ac
("net: centralize net_device min/max MTU checking"), the range
of mtu is [min_mtu, max_mtu], which is [68, 1500] by default.
It causes the dev mtu of the erspan device to not be greater
than 1500, this limit value is not correct for ipgre tap device.
Tested:
Before patch:
# ip link set erspan0 mtu 1600
Error: mtu greater than device maximum.
After patch:
# ip link set erspan0 mtu 1600
# ip -d link show erspan0
21: erspan0@NONE: <BROADCAST,MULTICAST> mtu 1600 qdisc noop state DOWN
mode DEFAULT group default qlen 1000
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 0
Fixes: 61e84623ac ("net: centralize net_device min/max MTU checking")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit c1d3ad84ea upstream.
Currently frame registrations are not purged, even when changing the
interface type. This can lead to potentially weird situations where
frames possibly not allowed on a given interface type remain registered
due to the type switching happening after registration.
The kernel currently relies on userspace apps to actually purge the
registrations themselves, this is not something that the kernel should
rely on.
Add a call to cfg80211_mlme_purge_registrations() to forcefully remove
any registrations left over prior to switching the iftype.
Cc: stable@vger.kernel.org
Signed-off-by: Denis Kenzior <denkenz@gmail.com>
Link: https://lore.kernel.org/r/20190828211110.15005-1-denkenz@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit ea8564c865 ]
userspace openvswitch patch "(dpif-linux: Implement the API
functions to allow multiple handler threads read upcall)"
changes its type from U32 to UNSPEC, but leave the kernel
unchanged
and after kernel 6e237d099f "(netlink: Relax attr validation
for fixed length types)", this bug is exposed by the below
warning
[ 57.215841] netlink: 'ovs-vswitchd': attribute type 5 has an invalid length.
Fixes: 5cd667b0a4 ("openvswitch: Allow each vport to have an array of 'port_id's")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 92974a1d00 ]
current 'sample' action doesn't push the mac header of ingress packets if
they are received by a layer 3 tunnel (like gre or sit); but it forgot to
check for gre over ipv6, so the following script:
# tc q a dev $d clsact
# tc f a dev $d ingress protocol ip flower ip_proto icmp action sample \
> group 100 rate 1
# psample -v -g 100
dumps everything, including outer header and mac, when $d is a gre tunnel
over ipv6. Fix this adding a missing label for ARPHRD_IP6GRE devices.
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Yotam Gigi <yotam.gi@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 73f0c11d11 ]
As the endpoint is unregistered there might still be work pending to
handle incoming messages, which will result in a use after free
scenario. The plan is to remove the rx_worker, but until then (and for
stable@) ensure that the work is stopped before the node is freed.
Fixes: bdabad3e36 ("net: Add Qualcomm IPC router")
Cc: stable@vger.kernel.org
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 68d19d7d99 ]
This reverts commit c49a8682fc.
There are devices which require low connection intervals for usable operation
including keyboards and mice. Forcing a static connection interval for
these types of devices has an impact in latency and causes a regression.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
If a SYN/FIN-segment is on the write-queue, skb->len is 0, but the
segment actually has been transmitted. end_seq and seq of the tcp_skb_cb
in that case will indicate this difference.
We should not remove such segments from the write-queue as we might be
in SYN_SENT-state and a retransmission-timer is running. When that one
fires, packets_out will be 1, but the write-queue would be empty,
resulting in:
[ 61.280214] ------------[ cut here ]------------
[ 61.281307] WARNING: CPU: 0 PID: 0 at net/ipv4/tcp_timer.c:429 tcp_retransmit_timer+0x18f9/0x2660
[ 61.283498] Modules linked in:
[ 61.284084] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.142 #58
[ 61.285214] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
[ 61.286644] task: ffffffff8401e1c0 task.stack: ffffffff84000000
[ 61.287758] RIP: 0010:tcp_retransmit_timer+0x18f9/0x2660
[ 61.288715] RSP: 0018:ffff88806ce07cb8 EFLAGS: 00010206
[ 61.289669] RAX: ffffffff8401e1c0 RBX: ffff88805c998b00 RCX: 0000000000000006
[ 61.290968] RDX: 0000000000000100 RSI: 0000000000000000 RDI: ffff88805c9994d8
[ 61.292314] RBP: ffff88805c99919a R08: ffff88807fff901c R09: ffff88807fff9008
[ 61.293547] R10: ffff88807fff9017 R11: ffff88807fff9010 R12: ffff88805c998b30
[ 61.294834] R13: ffffffff844b9380 R14: 0000000000000000 R15: ffff88805c99930c
[ 61.296086] FS: 0000000000000000(0000) GS:ffff88806ce00000(0000) knlGS:0000000000000000
[ 61.297523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 61.298646] CR2: 00007f721da50ff8 CR3: 0000000004014002 CR4: 00000000001606f0
[ 61.299944] Call Trace:
[ 61.300403] <IRQ>
[ 61.300806] ? kvm_sched_clock_read+0x21/0x30
[ 61.301689] ? sched_clock+0x5/0x10
[ 61.302433] ? sched_clock_cpu+0x18/0x170
[ 61.303173] tcp_write_timer_handler+0x2c1/0x7a0
[ 61.304038] tcp_write_timer+0x13e/0x160
[ 61.304794] call_timer_fn+0x14a/0x5f0
[ 61.305480] ? tcp_write_timer_handler+0x7a0/0x7a0
[ 61.306364] ? __next_timer_interrupt+0x140/0x140
[ 61.307229] ? _raw_spin_unlock_irq+0x24/0x40
[ 61.308033] ? tcp_write_timer_handler+0x7a0/0x7a0
[ 61.308887] ? tcp_write_timer_handler+0x7a0/0x7a0
[ 61.309760] run_timer_softirq+0xc41/0x1080
[ 61.310539] ? trigger_dyntick_cpu.isra.33+0x180/0x180
[ 61.311506] ? ktime_get+0x13f/0x1c0
[ 61.312232] ? clockevents_program_event+0x10d/0x2f0
[ 61.313158] __do_softirq+0x20b/0x96b
[ 61.313889] irq_exit+0x1a7/0x1e0
[ 61.314513] smp_apic_timer_interrupt+0xfc/0x4d0
[ 61.315386] apic_timer_interrupt+0x8f/0xa0
[ 61.316129] </IRQ>
Followed by a panic.
So, before removing an skb with skb->len == 0, let's make sure that the
skb is really empty by checking the end_seq and seq.
This patch needs to be backported only to 4.14 and older (among those
that applied the backport of fdfc5c8594).
Fixes: fdfc5c8594 ("tcp: remove empty skb from write queue in error cases")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
syzkaller is not happy since commit fdfc5c8594 ("tcp: remove empty skb
from write queue in error cases"):
CPU: 1 PID: 13814 Comm: syz-executor.4 Not tainted 4.14.143 #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
task: ffff888040105c00 task.stack: ffff8880649c0000
RIP: 0010:tcp_sendmsg_locked+0x6b4/0x4390 net/ipv4/tcp.c:1350
RSP: 0018:ffff8880649cf718 EFLAGS: 00010206
RAX: 0000000000000014 RBX: 000000000000001e RCX: ffffc90000717000
RDX: 0000000000000077 RSI: ffffffff82e760f7 RDI: 00000000000000a0
RBP: ffff8880649cfaa8 R08: 1ffff1100c939e7a R09: ffff8880401063c8
R10: 0000000000000003 R11: 0000000000000001 R12: dffffc0000000000
R13: ffff888043d74750 R14: ffff888043d74500 R15: 000000000000001e
FS: 00007f0afcb6d700(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2ca22000 CR3: 0000000040496004 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
tcp_sendmsg+0x2a/0x40 net/ipv4/tcp.c:1533
inet_sendmsg+0x173/0x4e0 net/ipv4/af_inet.c:784
sock_sendmsg_nosec net/socket.c:646 [inline]
sock_sendmsg+0xc3/0x100 net/socket.c:656
SYSC_sendto+0x35d/0x5e0 net/socket.c:1766
do_syscall_64+0x241/0x680 arch/x86/entry/common.c:292
entry_SYSCALL_64_after_hwframe+0x42/0xb7
The problem is that we are removing an skb from the write-queue that
could have been referenced by the sk_send_head. Thus, we need to check
for the send_head's sanity after removing it.
This patch needs to be backported only to 4.14 and older (among those
that applied the backport of fdfc5c8594).
Fixes: fdfc5c8594 ("tcp: remove empty skb from write queue in error cases")
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3a069024d3 ]
The find_pattern() debug output was printing the 'skip' character.
This can be a NULL-byte and messes up further pr_debug() output.
Output without the fix:
kernel: nf_conntrack_ftp: Pattern matches!
kernel: nf_conntrack_ftp: Skipped up to `<7>nf_conntrack_ftp: find_pattern `PORT': dlen = 8
kernel: nf_conntrack_ftp: find_pattern `EPRT': dlen = 8
Output with the fix:
kernel: nf_conntrack_ftp: Pattern matches!
kernel: nf_conntrack_ftp: Skipped up to 0x0 delimiter!
kernel: nf_conntrack_ftp: Match succeeded!
kernel: nf_conntrack_ftp: conntrack_ftp: match `172,17,0,100,200,207' (20 bytes at 4150681645)
kernel: nf_conntrack_ftp: find_pattern `PORT': dlen = 8
Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 0ff0f15a32 ]
Multiple batadv_ogm2_packet can be stored in an skbuff. The functions
batadv_v_ogm_send_to_if() uses batadv_v_ogm_aggr_packet() to check if there
is another additional batadv_ogm2_packet in the skb or not before they
continue processing the packet.
The length for such an OGM2 is BATADV_OGM2_HLEN +
batadv_ogm2_packet->tvlv_len. The check must first check that at least
BATADV_OGM2_HLEN bytes are available before it accesses tvlv_len (which is
part of the header. Otherwise it might try read outside of the currently
available skbuff to get the content of tvlv_len.
Fixes: 9323158ef9 ("batman-adv: OGMv2 - implement originators logic")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit 89a26cd4b5 ]
When running a 64-bit kernel with a 32-bit iptables binary, the size of
the xt_nfacct_match_info struct diverges.
kernel: sizeof(struct xt_nfacct_match_info) : 40
iptables: sizeof(struct xt_nfacct_match_info)) : 36
Trying to append nfacct related rules results in an unhelpful message.
Although it is suggested to look for more information in dmesg, nothing
can be found there.
# iptables -A <chain> -m nfacct --nfacct-name <acct-object>
iptables: Invalid argument. Run `dmesg' for more information.
This patch fixes the memory misalignment by enforcing 8-byte alignment
within the struct's first revision. This solution is often used in many
other uapi netfilter headers.
Signed-off-by: Juliana Rodrigueiro <juliana.rodrigueiro@intra2net.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit af38d07ed3 ]
Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
TCP_ECN_QUEUE_CWR.
Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
the behavior of data receivers, and deciding whether to reflect
incoming IP ECN CE marks as outgoing TCP th->ece marks. The
TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
is only called from tcp_undo_cwnd_reduction() by data senders during
an undo, so it should zero the sender-side state,
TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
incoming CE bits on incoming data packets just because outgoing
packets were spuriously retransmitted.
The bug has been reproduced with packetdrill to manifest in a scenario
with RFC3168 ECN, with an incoming data packet with CE bit set and
carrying a TCP timestamp value that causes cwnd undo. Before this fix,
the IP CE bit was ignored and not reflected in the TCP ECE header bit,
and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
even though the cwnd reduction had been undone. After this fix, the
sender properly reflects the CE bit and does not set the W bit.
Note: the bug actually predates 2005 git history; this Fixes footer is
chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
the patch applies cleanly (since before this commit the code was in a
.h file).
Fixes: bdf1ee5d3b ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 3dcbdb134f ]
Historically, support for frag_list packets entering skb_segment() was
limited to frag_list members terminating on exact same gso_size
boundaries. This is verified with a BUG_ON since commit 89319d3801
("net: Add frag_list support to skb_segment"), quote:
As such we require all frag_list members terminate on exact MSS
boundaries. This is checked using BUG_ON.
As there should only be one producer in the kernel of such packets,
namely GRO, this requirement should not be difficult to maintain.
However, since commit 6578171a7f ("bpf: add bpf_skb_change_proto helper"),
the "exact MSS boundaries" assumption no longer holds:
An eBPF program using bpf_skb_change_proto() DOES modify 'gso_size', but
leaves the frag_list members as originally merged by GRO with the
original 'gso_size'. Example of such programs are bpf-based NAT46 or
NAT64.
This lead to a kernel BUG_ON for flows involving:
- GRO generating a frag_list skb
- bpf program performing bpf_skb_change_proto() or bpf_skb_adjust_room()
- skb_segment() of the skb
See example BUG_ON reports in [0].
In commit 13acc94eff ("net: permit skb_segment on head_frag frag_list skb"),
skb_segment() was modified to support the "gso_size mangling" case of
a frag_list GRO'ed skb, but *only* for frag_list members having
head_frag==true (having a page-fragment head).
Alas, GRO packets having frag_list members with a linear kmalloced head
(head_frag==false) still hit the BUG_ON.
This commit adds support to skb_segment() for a 'head_skb' packet having
a frag_list whose members are *non* head_frag, with gso_size mangled, by
disabling SG and thus falling-back to copying the data from the given
'head_skb' into the generated segmented skbs - as suggested by Willem de
Bruijn [1].
Since this approach involves the penalty of skb_copy_and_csum_bits()
when building the segments, care was taken in order to enable this
solution only when required:
- untrusted gso_size, by testing SKB_GSO_DODGY is set
(SKB_GSO_DODGY is set by any gso_size mangling functions in
net/core/filter.c)
- the frag_list is non empty, its item is a non head_frag, *and* the
headlen of the given 'head_skb' does not match the gso_size.
[0]
https://lore.kernel.org/netdev/20190826170724.25ff616f@pixies/https://lore.kernel.org/netdev/9265b93f-253d-6b8c-f2b8-4b54eff1835c@fb.com/
[1]
https://lore.kernel.org/netdev/CA+FuTSfVsgNDi7c=GUU8nMg2hWxF2SjCNLXetHeVPdnxAW5K-w@mail.gmail.com/
Fixes: 6578171a7f ("bpf: add bpf_skb_change_proto helper")
Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 10cc514f45 ]
In event of failure during register_netdevice, free_netdev is
invoked immediately. free_netdev assumes that all the netdevice
refcounts have been dropped prior to it being called and as a
result frees and clears out the refcount pointer.
However, this is not necessarily true as some of the operations
in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for
invocation after a grace period. The IPv4 callback in_dev_rcu_put
tries to access the refcount after free_netdev is called which
leads to a null de-reference-
44837.761523: <6> Unable to handle kernel paging request at
virtual address 0000004a88287000
44837.761651: <2> pc : in_dev_finish_destroy+0x4c/0xc8
44837.761654: <2> lr : in_dev_finish_destroy+0x2c/0xc8
44837.762393: <2> Call trace:
44837.762398: <2> in_dev_finish_destroy+0x4c/0xc8
44837.762404: <2> in_dev_rcu_put+0x24/0x30
44837.762412: <2> rcu_nocb_kthread+0x43c/0x468
44837.762418: <2> kthread+0x118/0x128
44837.762424: <2> ret_from_fork+0x10/0x1c
Fix this by waiting for the completion of the call_rcu() in
case of register_netdevice errors.
Fixes: 93ee31f14f ("[NET]: Fix free_netdev on register_netdev failure.")
Cc: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit d23dbc479a ]
The '.exit' functions from 'pernet_operations' structure should be marked
as __net_exit, not __net_init.
Fixes: d862e54614 ("net: ipv6: Implement /proc/net/icmp6.")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit 685703b497 ]
There is a race condition for an established connection that is being closed
by the guest: the refcnt is 4 at the end of hvs_release() (Note: here the
'remove_sock' is false):
1 for the initial value;
1 for the sk being in the bound list;
1 for the sk being in the connected list;
1 for the delayed close_work.
After hvs_release() finishes, __vsock_release() -> sock_put(sk) *may*
decrease the refcnt to 3.
Concurrently, hvs_close_connection() runs in another thread:
calls vsock_remove_sock() to decrease the refcnt by 2;
call sock_put() to decrease the refcnt to 0, and free the sk;
next, the "release_sock(sk)" may hang due to use-after-free.
In the above, after hvs_release() finishes, if hvs_close_connection() runs
faster than "__vsock_release() -> sock_put(sk)", then there is not any issue,
because at the beginning of hvs_close_connection(), the refcnt is still 4.
The issue can be resolved if an extra reference is taken when the
connection is established.
Fixes: a9eeb998c2 ("hv_sock: Add support for delayed close")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit a15d56a607 upstream.
Multiple batadv_ogm_packet can be stored in an skbuff. The functions
batadv_iv_ogm_send_to_if()/batadv_iv_ogm_receive() use
batadv_iv_ogm_aggr_packet() to check if there is another additional
batadv_ogm_packet in the skb or not before they continue processing the
packet.
The length for such an OGM is BATADV_OGM_HLEN +
batadv_ogm_packet->tvlv_len. The check must first check that at least
BATADV_OGM_HLEN bytes are available before it accesses tvlv_len (which is
part of the header. Otherwise it might try read outside of the currently
available skbuff to get the content of tvlv_len.
Fixes: ef26157747 ("batman-adv: tvlv - basic infrastructure")
Reported-by: syzbot+355cab184197dbbfa384@syzkaller.appspotmail.com
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit dbb2483b2a upstream.
In commit 6a53b75932 ("xfrm: check id proto in validate_tmpl()")
I introduced a check for xfrm protocol, but according to Herbert
IPSEC_PROTO_ANY should only be used as a wildcard for lookup, so
it should be removed from validate_tmpl().
And, IPSEC_PROTO_ANY is expected to only match 3 IPSec-specific
protocols, this is why xfrm_state_flush() could still miss
IPPROTO_ROUTING, which leads that those entries are left in
net->xfrm.state_all before exit net. Fix this by replacing
IPSEC_PROTO_ANY with zero.
This patch also extracts the check from validate_tmpl() to
xfrm_id_proto_valid() and uses it in parse_ipsecrequest().
With this, no other protocols should be added into xfrm.
Fixes: 6a53b75932 ("xfrm: check id proto in validate_tmpl()")
Reported-by: syzbot+0bf0519d6e0de15914fe@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Zubin Mithra <zsm@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit dbf47a2a09 ]
Action sample doesn't properly handle psample_group pointer in overwrite
case. Following issues need to be fixed:
- In tcf_sample_init() function RCU_INIT_POINTER() is used to set
s->psample_group, even though we neither setting the pointer to NULL, nor
preventing concurrent readers from accessing the pointer in some way.
Use rcu_swap_protected() instead to safely reset the pointer.
- Old value of s->psample_group is not released or deallocated in any way,
which results resource leak. Use psample_group_put() on non-NULL value
obtained with rcu_swap_protected().
- The function psample_group_put() that released reference to struct
psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu
grace period when deallocating it. Extend struct psample_group with rcu
head and use kfree_rcu when freeing it.
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[ Upstream commit fdfc5c8594 ]
Vladimir Rutsky reported stuck TCP sessions after memory pressure
events. Edge Trigger epoll() user would never receive an EPOLLOUT
notification allowing them to retry a sendmsg().
Jason tested the case of sk_stream_alloc_skb() returning NULL,
but there are other paths that could lead both sendmsg() and sendpage()
to return -1 (EAGAIN), with an empty skb queued on the write queue.
This patch makes sure we remove this empty skb so that
Jason code can detect that the queue is empty, and
call sk->sk_write_space(sk) accordingly.
Fixes: ce5ec44099 ("tcp: ensure epoll edge trigger wakeup when write queue is empty")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jason Baron <jbaron@akamai.com>
Reported-by: Vladimir Rutsky <rutsky@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>