linux

iv/linux

Author	SHA1	Message	Date
Jing-Ting Wu	763f4fb76e	cgroup: Fix race condition at rebind_subsystems() Root cause: The rebind_subsystems() is no lock held when move css object from A list to B list,then let B's head be treated as css node at list_for_each_entry_rcu(). Solution: Add grace period before invalidating the removed rstat_css_node. Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Suggested-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Tested-by: Jing-Ting Wu <jing-ting.wu@mediatek.com> Link: https://lore.kernel.org/linux-arm-kernel/d8f0bc5e2fb6ed259f9334c83279b4c011283c41.camel@mediatek.com/T/ Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Fixes: a7df69b81aac ("cgroup: rstat: support cgroup1") Cc: stable@vger.kernel.org # v5.13+ Signed-off-by: Tejun Heo <tj@kernel.org>	2022-08-23 08:11:06 -10:00
Florian Westphal	cf97769c76	netfilter: conntrack: work around exceeded receive window When a TCP sends more bytes than allowed by the receive window, all future packets can be marked as invalid. This can clog up the conntrack table because of 5-day default timeout. Sequence of packets: 01 initiator > responder: [S], seq 171, win 5840, options [mss 1330,sackOK,TS val 63 ecr 0,nop,wscale 1] 02 responder > initiator: [S.], seq 33211, ack 172, win 65535, options [mss 1460,sackOK,TS val 010 ecr 63,nop,wscale 8] 03 initiator > responder: [.], ack 33212, win 2920, options [nop,nop,TS val 068 ecr 010], length 0 04 initiator > responder: [P.], seq 172:240, ack 33212, win 2920, options [nop,nop,TS val 279 ecr 010], length 68 Window is 5840 starting from 33212 -> 39052. 05 responder > initiator: [.], ack 240, win 256, options [nop,nop,TS val 872 ecr 279], length 0 06 responder > initiator: [.], seq 33212:34530, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 This is fine, conntrack will flag the connection as having outstanding data (UNACKED), which lowers the conntrack timeout to 300s. 07 responder > initiator: [.], seq 34530:35848, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 08 responder > initiator: [.], seq 35848:37166, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 09 responder > initiator: [.], seq 37166:38484, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 10 responder > initiator: [.], seq 38484:39802, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 1318 Packet 10 is already sending more than permitted, but conntrack doesn't validate this (only seq is tested vs. maxend, not 'seq+len'). 38484 is acceptable, but only up to 39052, so this packet should not have been sent (or only 568 bytes, not 1318). At this point, connection is still in '300s' mode. Next packet however will get flagged: 11 responder > initiator: [P.], seq 39802:40128, ack 240, win 256, options [nop,nop,TS val 892 ecr 279], length 326 nf_ct_proto_6: SEQ is over the upper bound (over the window of the receiver) .. LEN=378 .. SEQ=39802 ACK=240 ACK PSH .. Now, a couple of replies/acks comes in: 12 initiator > responder: [.], ack 34530, win 4368, [.. irrelevant acks removed ] 16 initiator > responder: [.], ack 39802, win 8712, options [nop,nop,TS val 296201291 ecr 2982371892], length 0 This ack is significant -- this acks the last packet send by the responder that conntrack considered valid. This means that ack == td_end. This will withdraw the 'unacked data' flag, the connection moves back to the 5-day timeout of established conntracks. 17 initiator > responder: ack 40128, win 10030, ... This packet is also flagged as invalid. Because conntrack only updates state based on packets that are considered valid, packet 11 'did not exist' and that gets us: nf_ct_proto_6: ACK is over upper bound 39803 (ACKed data not seen yet) .. SEQ=240 ACK=40128 WINDOW=10030 RES=0x00 ACK URG Because this received and processed by the endpoints, the conntrack entry remains in a bad state, no packets will ever be considered valid again: 30 responder > initiator: [F.], seq 40432, ack 2045, win 391, .. 31 initiator > responder: [.], ack 40433, win 11348, .. 32 initiator > responder: [F.], seq 2045, ack 40433, win 11348 .. ... all trigger 'ACK is over bound' test and we end up with non-early-evictable 5-day default timeout. NB: This patch triggers a bunch of checkpatch warnings because of silly indent. I will resend the cleanup series linked below to reduce the indent level once this change has propagated to net-next. I could route the cleanup via nf but that causes extra backport work for stable maintainers. Link: https://lore.kernel.org/netfilter-devel/20220720175228.17880-1-fw@strlen.de/T/#mb1d7147d36294573cc4f81d00f9f8dadfdd06cd8 Signed-off-by: Florian Westphal <fw@strlen.de>	2022-08-23 18:26:48 +02:00
Florian Westphal	7997eff828	netfilter: ebtables: reject blobs that don't provide all entry points Harshit Mogalapalli says: In ebt_do_table() function dereferencing 'private->hook_entry[hook]' can lead to NULL pointer dereference. [..] Kernel panic: general protection fault, probably for non-canonical address 0xdffffc0000000005: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000028-0x000000000000002f] [..] RIP: 0010:ebt_do_table+0x1dc/0x1ce0 Code: 89 fa 48 c1 ea 03 80 3c 02 00 0f 85 5c 16 00 00 48 b8 00 00 00 00 00 fc ff df 49 8b 6c df 08 48 8d 7d 2c 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 88 [..] Call Trace: nf_hook_slow+0xb1/0x170 __br_forward+0x289/0x730 maybe_deliver+0x24b/0x380 br_flood+0xc6/0x390 br_dev_xmit+0xa2e/0x12c0 For some reason ebtables rejects blobs that provide entry points that are not supported by the table, but what it should instead reject is the opposite: blobs that DO NOT provide an entry point supported by the table. t->valid_hooks is the bitmask of hooks (input, forward ...) that will see packets. Providing an entry point that is not support is harmless (never called/used), but the inverse isn't: it results in a crash because the ebtables traverser doesn't expect a NULL blob for a location its receiving packets for. Instead of fixing all the individual checks, do what iptables is doing and reject all blobs that differ from the expected hooks. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2022-08-23 18:23:15 +02:00
Vladimir Oltean	855a28f9c9	net: dsa: don't dereference NULL extack in dsa_slave_changeupper() When a driver returns -EOPNOTSUPP in dsa_port_bridge_join() but failed to provide a reason for it, DSA attempts to set the extack to say that software fallback will kick in. The problem is, when we use brctl and the legacy bridge ioctls, the extack will be NULL, and DSA dereferences it in the process of setting it. Sergei Antonov proves this using the following stack trace: Unable to handle kernel NULL pointer dereference at virtual address 00000000 PC is at dsa_slave_changeupper+0x5c/0x158 dsa_slave_changeupper from raw_notifier_call_chain+0x38/0x6c raw_notifier_call_chain from __netdev_upper_dev_link+0x198/0x3b4 __netdev_upper_dev_link from netdev_master_upper_dev_link+0x50/0x78 netdev_master_upper_dev_link from br_add_if+0x430/0x7f4 br_add_if from br_ioctl_stub+0x170/0x530 br_ioctl_stub from br_ioctl_call+0x54/0x7c br_ioctl_call from dev_ifsioc+0x4e0/0x6bc dev_ifsioc from dev_ioctl+0x2f8/0x758 dev_ioctl from sock_ioctl+0x5f0/0x674 sock_ioctl from sys_ioctl+0x518/0xe40 sys_ioctl from ret_fast_syscall+0x0/0x1c Fix the problem by only overriding the extack if non-NULL. Fixes: 1c6e8088d9a7 ("net: dsa: allow port_bridge_join() to override extack message") Link: https://lore.kernel.org/netdev/CABikg9wx7vB5eRDAYtvAm7fprJ09Ta27a4ZazC=NX5K4wn6pWA@mail.gmail.com/ Reported-by: Sergei Antonov <saproj@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Sergei Antonov <saproj@gmail.com> Link: https://lore.kernel.org/r/20220819173925.3581871-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-23 07:54:16 -07:00
Maciej Żenczykowski	4b2e3a17e9	net: ipvtap - add __init/__exit annotations to module init/exit funcs Looks to have been left out in an oversight. Cc: Mahesh Bandewar <maheshb@google.com> Cc: Sainath Grandhi <sainath.grandhi@intel.com> Fixes: 235a9d89da97 ('ipvtap: IP-VLAN based tap driver') Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20220821130808.12143-1-zenczykowski@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 15:45:53 +02:00
Paolo Abeni	52412f5543	Merge branch 'dsa-changes-for-multiple-cpu-ports-part-3' Vladimir Oltean says: ==================== DSA changes for multiple CPU ports (part 3) Those who have been following part 1: https://patchwork.kernel.org/project/netdevbpf/cover/20220511095020.562461-1-vladimir.oltean@nxp.com/ and part 2: https://patchwork.kernel.org/project/netdevbpf/cover/20220521213743.2735445-1-vladimir.oltean@nxp.com/ will know that I am trying to enable the second internal port pair from the NXP LS1028A Felix switch for DSA-tagged traffic via "ocelot-8021q". This series represents part 3 of that effort. Covered here are some preparations in DSA for handling multiple DSA masters: - when changing the tagging protocol via sysfs - when the masters go down as well as preparation for monitoring the upper devices of a DSA master (to support DSA masters under a LAG). There are also 2 small preparations for the ocelot driver, for the case where multiple tag_8021q CPU ports are used in a LAG. Both those changes have to do with PGID forwarding domains. Compared to v1, the patches were trimmed down to just another preparation stage, and the UAPI changes were pushed further out to part 4. https://patchwork.kernel.org/project/netdevbpf/cover/20220523104256.3556016-1-olteanv@gmail.com/ Compared to v2, I had to export a symbol I forgot to (ocelot_port_teardown_dsa_8021q_cpu), to avoid a build breakage when the felix and seville drivers are built as modules. ==================== Link: https://lore.kernel.org/r/20220819174820.3585002-1-vladimir.oltean@nxp.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:37 +02:00
Vladimir Oltean	291ac1517a	net: mscc: ocelot: adjust forwarding domain for CPU ports in a LAG Currently when we have 2 CPU ports configured for DSA tag_8021q mode and we put them in a LAG, a PGID dump looks like this: PGID_SRC[0] = ports 4, PGID_SRC[1] = ports 4, PGID_SRC[2] = ports 4, PGID_SRC[3] = ports 4, PGID_SRC[4] = ports 0, 1, 2, 3, 4, 5, PGID_SRC[5] = no ports (ports 0-3 are user ports, ports 4 and 5 are CPU ports) There are 2 problems with the configuration above: - user ports should enable forwarding towards both CPU ports, not just 4, and the aggregation PGIDs should prune one CPU port or the other from the destination port mask, based on a hash computed from packet headers. - CPU ports should not be allowed to forward towards themselves and also not towards other ports in the same LAG as themselves The first problem requires fixing up the PGID_SRC of user ports, when ocelot_port_assigned_dsa_8021q_cpu_mask() is called. We need to say that when a user port is assigned to a tag_8021q CPU port and that port is in a LAG, it should forward towards all ports in that LAG. The second problem requires fixing up the PGID_SRC of port 4, to remove ports 4 and 5 (in a LAG) from the allowed destinations. After this change, the PGID source masks look as follows: PGID_SRC[0] = ports 4, 5, PGID_SRC[1] = ports 4, 5, PGID_SRC[2] = ports 4, 5, PGID_SRC[3] = ports 4, 5, PGID_SRC[4] = ports 0, 1, 2, 3, PGID_SRC[5] = no ports Note that PGID_SRC[5] still looks weird (it should say "0, 1, 2, 3" just like PGID_SRC[4] does), but I've tested forwarding through this CPU port and it doesn't seem like anything is affected (it appears that PGID_SRC[4] is being looked up on forwarding from the CPU, since both ports 4 and 5 have logical port ID 4). The reason why it looks weird is because we've never called ocelot_port_assign_dsa_8021q_cpu() for any user port towards port 5 (all user ports are assigned to port 4 which is in a LAG with 5). Since things aren't broken, I'm willing to leave it like that for now and just document the oddity. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Vladimir Oltean	36a0bf4435	net: mscc: ocelot: set up tag_8021q CPU ports independent of user port affinity This is a partial revert of commit c295f9831f1d ("net: mscc: ocelot: switch from {,un}set to {,un}assign for tag_8021q CPU ports"), because as it turns out, this isn't how tag_8021q CPU ports under a LAG are supposed to work. Under that scenario, all user ports are "assigned" to the single tag_8021q CPU port represented by the logical port corresponding to the bonding interface. So one CPU port in a LAG would have is_dsa_8021q_cpu set to true (the one whose physical port ID is equal to the logical port ID), and the other one to false. In turn, this makes 2 undesirable things happen: (1) PGID_CPU contains only the first physical CPU port, rather than both (2) only the first CPU port will be added to the private VLANs used by ocelot for VLAN-unaware bridging To make the driver behave in the same way for both bonded CPU ports, we need to bring back the old concept of setting up a port as a tag_8021q CPU port, and this is what deals with VLAN membership and PGID_CPU updating. But we also need the CPU port "assignment" (the user to CPU port affinity), and this is what updates the PGID_SRC forwarding rules. All DSA CPU ports are statically configured for tag_8021q mode when the tagging protocol is changed to ocelot-8021q. User ports are "assigned" to one CPU port or the other dynamically (this will be handled by a future change). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Vladimir Oltean	5dc760d120	net: dsa: use dsa_tree_for_each_cpu_port in dsa_tree_{setup,teardown}_master More logic will be added to dsa_tree_setup_master() and dsa_tree_teardown_master() in upcoming changes. Reduce the indentation by one level in these functions by introducing and using a dedicated iterator for CPU ports of a tree. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Vladimir Oltean	f41ec1fd1c	net: dsa: all DSA masters must be down when changing the tagging protocol The fact that the tagging protocol is set and queried from the /sys/class/net/<dsa-master>/dsa/tagging file is a bit of a quirk from the single CPU port days which isn't aging very well now that DSA can have more than a single CPU port. This is because the tagging protocol is a switch property, yet in the presence of multiple CPU ports it can be queried and set from multiple sysfs files, all of which are handled by the same implementation. The current logic ensures that the net device whose sysfs file we're changing the tagging protocol through must be down. That net device is the DSA master, and this is fine for single DSA master / CPU port setups. But exactly because the tagging protocol is per switch [ tree, in fact ] and not per DSA master, this isn't fine any longer with multiple CPU ports, and we must iterate through the tree and find all DSA masters, and make sure that all of them are down. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Vladimir Oltean	7136097e11	net: dsa: only bring down user ports assigned to a given DSA master This is an adaptation of commit c0a8a9c27493 ("net: dsa: automatically bring user ports down when master goes down") for multiple DSA masters. When a DSA master goes down, only the user ports under its control should go down too, the others can still send/receive traffic. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Vladimir Oltean	4f03dcc6b9	net: dsa: existing DSA masters cannot join upper interfaces All the traffic to/from a DSA master is supposed to be distributed among its DSA switch upper interfaces, so we should not allow other upper device kinds. An exception to this is DSA_TAG_PROTO_NONE (switches with no DSA tags), and in that case it is actually expected to create e.g. VLAN interfaces on the master. But for those, netdev_uses_dsa(master) returns false, so the restriction doesn't apply. The motivation for this change is to allow LAG interfaces of DSA masters to be DSA masters themselves. We want to restrict the user's degrees of freedom by 1: the LAG should already have all DSA masters as lowers, and while lower ports of the LAG can be removed, none can be added after the fact. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Vladimir Oltean	920a33cd72	net: bridge: move DSA master bridging restriction to DSA When DSA gains support for multiple CPU ports in a LAG, it will become mandatory to monitor the changeupper events for the DSA master. In fact, there are already some restrictions to be imposed in that area, namely that a DSA master cannot be a bridge port except in some special circumstances. Centralize the restrictions at the level of the DSA layer as a preliminary step. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Vladimir Oltean	0498277ee1	net: dsa: don't stop at NOTIFY_OK when calling ds->ops->port_prechangeupper dsa_slave_prechangeupper_sanity_check() is supposed to enforce some adjacency restrictions, and calls ds->ops->port_prechangeupper if the driver implements it. We convert the error code from the port_prechangeupper() call to a notifier code, and 0 is converted to NOTIFY_OK, but the caller of dsa_slave_prechangeupper_sanity_check() stops at any notifier code different from NOTIFY_DONE. Avoid this by converting back the notifier code to an error code, so that both NOTIFY_OK and NOTIFY_DONE will be seen as 0. This allows more parallel sanity check functions to be added. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Vladimir Oltean	4c3f80d22b	net: dsa: walk through all changeupper notifier functions Traditionally, DSA has had a single netdev notifier handling function for each device type. For the sake of code cleanliness, we would like to introduce more handling functions which do one thing, but the conditions for entering these functions start to overlap. Example: a handling function which tracks whether any bridges contain both DSA and non-DSA interfaces. Either this is placed before dsa_slave_changeupper(), case in which it will prevent that function from executing, or we place it after dsa_slave_changeupper(), case in which we will prevent it from executing. The other alternative is to ignore errors from the new handling function (not ideal). To support this usage, we need to change the pattern. In the new model, we enter all notifier handling sub-functions, and exit with NOTIFY_DONE if there is nothing to do. This allows the sub-functions to be relatively free-form and independent from each other. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 11:39:22 +02:00
Paolo Abeni	139b5fbd52	Merge branch 'vsock-updates-for-so_rcvlowat-handling' Arseniy Krasnov says: ==================== vsock: updates for SO_RCVLOWAT handling This patchset includes some updates for SO_RCVLOWAT: 1) af_vsock: During my experiments with zerocopy receive, i found, that in some cases, poll() implementation violates POSIX: when socket has non- default SO_RCVLOWAT(e.g. not 1), poll() will always set POLLIN and POLLRDNORM bits in 'revents' even number of bytes available to read on socket is smaller than SO_RCVLOWAT value. In this case,user sees POLLIN flag and then tries to read data(for example using 'read()' call), but read call will be blocked, because SO_RCVLOWAT logic is supported in dequeue loop in af_vsock.c. But the same time, POSIX requires that: "POLLIN Data other than high-priority data may be read without blocking. POLLRDNORM Normal data may be read without blocking." See https://www.open-std.org/jtc1/sc22/open/n4217.pdf, page 293. So, we have, that poll() syscall returns POLLIN, but read call will be blocked. Also in man page socket(7) i found that: "Since Linux 2.6.28, select(2), poll(2), and epoll(7) indicate a socket as readable only if at least SO_RCVLOWAT bytes are available." I checked TCP callback for poll()(net/ipv4/tcp.c, tcp_poll()), it uses SO_RCVLOWAT value to set POLLIN bit, also i've tested TCP with this case for TCP socket, it works as POSIX required. I've added some fixes to af_vsock.c and virtio_transport_common.c, test is also implemented. 2) virtio/vsock: It adds some optimization to wake ups, when new data arrived. Now, SO_RCVLOWAT is considered before wake up sleepers who wait new data. There is no sense, to kick waiter, when number of available bytes in socket's queue < SO_RCVLOWAT, because if we wake up reader in this case, it will wait for SO_RCVLOWAT data anyway during dequeue, or in poll() case, POLLIN/POLLRDNORM bits won't be set, so such exit from poll() will be "spurious". This logic is also used in TCP sockets. 3) vmci/vsock: Same as 2), but i'm not sure about this changes. Will be very good, to get comments from someone who knows this code. 4) Hyper-V: As Dexuan Cui mentioned, for Hyper-V transport it is difficult to support SO_RCVLOWAT, so he suggested to disable this feature for Hyper-V. ==================== Link: https://lore.kernel.org/r/de41de4c-0345-34d7-7c36-4345258b7ba8@sberdevices.ru Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:45:03 +02:00
Arseniy Krasnov	b1346338fb	vsock_test: POLLIN + SO_RCVLOWAT test This adds test to check, that when poll() returns POLLIN, POLLRDNORM bits, next read call won't block. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:12 +02:00
Arseniy Krasnov	e061aed998	vmci/vsock: check SO_RCVLOWAT before wake up reader This adds extra condition to wake up data reader: do it only when number of readable bytes >= SO_RCVLOWAT. Otherwise, there is no sense to kick user, because it will wait until SO_RCVLOWAT bytes will be dequeued. This check is performed in vsock_data_ready(). Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Vishnu Dasa <vdasa@vmware.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:12 +02:00
Arseniy Krasnov	39f1ed33a4	virtio/vsock: check SO_RCVLOWAT before wake up reader This adds extra condition to wake up data reader: do it only when number of readable bytes >= SO_RCVLOWAT. Otherwise, there is no sense to kick user,because it will wait until SO_RCVLOWAT bytes will be dequeued. This check is performed in vsock_data_ready(). Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:11 +02:00
Arseniy Krasnov	f2fdcf67ac	vsock: add API call for data ready This adds 'vsock_data_ready()' which must be called by transport to kick sleeping data readers. It checks for SO_RCVLOWAT value before waking user, thus preventing spurious wake ups. Based on 'tcp_data_ready()' logic. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:11 +02:00
Arseniy Krasnov	ee0b3843a2	vsock: pass sock_rcvlowat to notify_poll_in as target Passing 1 as the target to notify_poll_in(), we don't honor what the user has set via SO_RCVLOWAT, going to set POLLIN and POLLRDNORM, even if we don't have the amount of bytes expected by the user. Let's use sock_rcvlowat() to get the right target to pass to notify_poll_in(); Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:11 +02:00
Arseniy Krasnov	a274f6ff3c	vmci/vsock: use 'target' in notify_poll_in callback This callback controls setting of POLLIN, POLLRDNORM output bits of poll() syscall, but in some cases, it is incorrectly to set it, when socket has at least 1 bytes of available data. Use 'target' which is already exists. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Vishnu Dasa <vdasa@vmware.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:11 +02:00
Arseniy Krasnov	e7a3266c91	virtio/vsock: use 'target' in notify_poll_in callback This callback controls setting of POLLIN, POLLRDNORM output bits of poll() syscall, but in some cases, it is incorrectly to set it, when socket has at least 1 bytes of available data. Use 'target' which is already exists. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:11 +02:00
Arseniy Krasnov	24764f8d3c	hv_sock: disable SO_RCVLOWAT support For Hyper-V it is quiet difficult to support this socket option,due to transport internals, so disable it. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Dexuan Cui <decui@microsoft.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:11 +02:00
Arseniy Krasnov	e38f22c860	vsock: SO_RCVLOWAT transport set callback This adds transport specific callback for SO_RCVLOWAT, because in some transports it may be difficult to know current available number of bytes ready to read. Thus, when SO_RCVLOWAT is set, transport may reject it. Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:43:11 +02:00
Zhengchao Shao	ab48508191	net: sched: remove duplicate check of user rights in qdisc In rtnetlink_rcv_msg function, the permission for all user operations is checked except the GET operation, which is the same as the checking in qdisc. Therefore, remove the user rights check in qdisc. Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com> Message-Id: <20220819041854.83372-1-shaozhengchao@huawei.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-08-23 10:18:26 +02:00
Roi Dayan	72e0bcd156	net/mlx5: TC, Add support for SF tunnel offload VF tunnel flow already exists and SF tunnel is the same flow. Support offloading of tunneling over SF device by allow to attach an encap route over SF and set to use indirect flow table on SF. Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:26 -07:00
Roi Dayan	430e2d5e2a	net/mlx5: E-Switch, Move send to vport meta rule creation Move the creation of the rules from offloads fdb table init to per rep vport init. This way the driver will creating the send to vport meta rule on any representor, e.g. SF representors. Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:25 -07:00
Roi Dayan	4a56181706	net/mlx5: E-Switch, Split creating fdb tables into smaller chunks Split esw_create_offloads_fdb_tables() into smaller functions. This will help maintenance. Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:25 -07:00
Jianbo Liu	8ea7bcf632	net/mlx5: E-Switch, Add default drop rule for unmatched packets The ft_offloads table serves to steer packets, which are from the eswitch, to the representor associated with the packets' source vport. Previously, if a packet's source vport or metadata was not associated with any representor, it was forwarded to the uplink representor. The representor got packets it shouldn't have as they weren't coming from the uplink vport. One such effect of this breakage can be observed if the uplink representor is attached to a bridge, where such illegal packets will be broadcast to the remaining ports, flooding the switch with illegal packets. In the case where IB loopback (e.g, SNAP) is enabled, all transmitted packets would be looped back, and received by the uplink representor, and result in an infinite feedback loop. Therefore, block this hole by adding a default drop rule to the ft_offloads table, so that all unmatched packets with no associated representor are dropped. Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Gavi Teitz <gavi@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:25 -07:00
Lama Kayal	d494dd2bb7	net/mlx5e: Completely eliminate priv from fs.h Complete the decoupling process of flow steering from en.h. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:24 -07:00
Lama Kayal	ca959d97d6	net/mlx5e: Make all ttc functions of en_fs get fs struct as argument Let all ttc creation be independent of priv, and pass relevant members of priv only. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:24 -07:00
Lama Kayal	45b83c6c68	net/mlx5e: Make flow steering arfs independent of priv Decouple arfs flow steering functionality from priv. Make all arfs functions defined under fs.h get flow_steering struct as an argument, thus helping with the process of decoupling the whole flow steering API from en.h. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:23 -07:00
Lama Kayal	93a07599ee	net/mlx5e: Introduce flow steering debug macros Introduce flow steering debug macros family, fs_*. These macros bring clean finish to the decoupling of flow steering process such that all flow steering flows can report warnings and provide debug information via these exclusive macros. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:23 -07:00
Lama Kayal	9c2c1c5e7f	net/mlx5e: Separate ethtool_steering from fs.h and make private Create a new fs_ethtool.h header file, where ethtool steering init and cleanup functions are declared in it. Make mlx5e_ethtool_steering struct private and declare at en_fs_ethtool.c. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:22 -07:00
Lama Kayal	e8b5c4bcb5	net/mlx5e: Directly get flow_steering struct as input when init/cleanup ethtool steering Let both mlx5e_ethtool_init_steering and mlx5e_ethtool_cleanup_steering get ethtool steering struct as input instead of priv, as passing priv is obsolete. Also modify other function through the flow similarly. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:22 -07:00
Lama Kayal	c7eafc5ed0	net/mlx5e: Convert ethtool_steering member of flow_steering struct to pointer Convert mlx5e_ethtool_steering member of mlx5e_flow_steering to a pointer, and allocate dynamically for each profile at flow_steering init. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:22 -07:00
Lama Kayal	81a0b241af	net/mlx5e: Drop priv argument of ptp function in en_fs Both mlx5e_ptp_alloc_rx_fs and mlx5e_ptp_free_rx_fs only make use of two priv member, pass them directly instead. This will help dropping priv from all en_fs file. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:21 -07:00
Lama Kayal	1be44b42b2	net/mlx5e: Decouple fs_tcp from en.h Make flow steering files fs_tcp.c/h independent of en.h such that they go through the flow steering API only. Make error reports be via mlx5_core API instead of netdev_err API, this to ensure a safe decoupling from en.h, and prevent redundant argument passing. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:21 -07:00
Lama Kayal	4e0ecc17a7	net/mlx5e: Decouple fs_tt_redirect from en.h Make flow steering files fs_tt_redirect.c/h independent of en.h such that it goes through the flow steering API only. Make error reports be via mlx5_core API instead of netdev_err API, this to ensure a safe decoupling from en.h, and prevent redundant argument passing. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:21 -07:00
Lama Kayal	f52f2faee5	net/mlx5e: Introduce flow steering API Move mlx5e_flow_steering struct to fs_en.c to make it private. Introduce flow_steering API and let other files go through it. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>	2022-08-22 22:44:20 -07:00
Jakub Kicinski	97d29b9231	Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2022-08-18 (ixgbe) This series contains updates to ixgbe driver only. Fabio M. De Francesco replaces kmap() call to page_address() for rx_buffer->page(). Jeff Daly adds a manual AN-37 restart to help resolve issues with some link partners. * '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ixgbe: Manual AN-37 for troublesome link partners for X550 SFI ixgbe: Don't call kmap() on page allocated with GFP_ATOMIC ==================== Link: https://lore.kernel.org/r/20220818223402.1294091-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 20:24:45 -07:00
Jakub Kicinski	0134fe8512	Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== Intel Wired LAN Driver Updates 2022-08-18 (ice) This series contains updates to ice driver only. Jesse and Anatolii add support for controlling FCS/CRC stripping via ethtool. Anirudh allows for 100M speeds on devices which support it. Sylwester removes ucast_shared field and the associated dead code related to it. Mikael removes non-inclusive language from the driver. * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: remove non-inclusive language ice: Remove ucast_shared ice: Allow 100M speeds for some devices ice: Implement FCS/CRC and VLAN stripping co-existence policy ice: Implement control of FCS/CRC stripping ==================== Link: https://lore.kernel.org/r/20220818155207.996297-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 20:10:50 -07:00
Jakub Kicinski	5003e52c31	Merge branch 'bonding-802-3ad-fix-no-transmission-of-lacpdus' Jonathan Toppins says: ==================== bonding: 802.3ad: fix no transmission of LACPDUs Configuring a bond in a specific order can leave the bond in a state where it never transmits LACPDUs. The first patch adds some kselftest infrastructure and the reproducer that demonstrates the problem. The second patch fixes the issue. The new third patch makes ad_ticks_per_sec a static const and removes the passing of this variable via the stack. ==================== Link: https://lore.kernel.org/r/cover.1660919940.git.jtoppins@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 18:30:28 -07:00
Jonathan Toppins	f2e44dffa9	bonding: 3ad: make ad_ticks_per_sec a const The value is only ever set once in bond_3ad_initialize and only ever read otherwise. There seems to be no reason to set the variable via bond_3ad_initialize when setting the global variable will do. Change ad_ticks_per_sec to a const to enforce its read-only usage. Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 18:30:24 -07:00
Jonathan Toppins	d745b5062a	bonding: 802.3ad: fix no transmission of LACPDUs This is caused by the global variable ad_ticks_per_sec being zero as demonstrated by the reproducer script discussed below. This causes all timer values in __ad_timer_to_ticks to be zero, resulting in the periodic timer to never fire. To reproduce: Run the script in `tools/testing/selftests/drivers/net/bonding/bond-break-lacpdu-tx.sh` which puts bonding into a state where it never transmits LACPDUs. line 44: ip link add fbond type bond mode 4 miimon 200 \ xmit_hash_policy 1 ad_actor_sys_prio 65535 lacp_rate fast setting bond param: ad_actor_sys_prio given: params.ad_actor_system = 0 call stack: bond_option_ad_actor_sys_prio() -> bond_3ad_update_ad_actor_settings() -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio -> ad.system.sys_mac_addr = bond->dev->dev_addr; because params.ad_actor_system == 0 results: ad.system.sys_mac_addr = bond->dev->dev_addr line 48: ip link set fbond address 52:54:00:3B:7C:A6 setting bond MAC addr call stack: bond->dev->dev_addr = new_mac line 52: ip link set fbond type bond ad_actor_sys_prio 65535 setting bond param: ad_actor_sys_prio given: params.ad_actor_system = 0 call stack: bond_option_ad_actor_sys_prio() -> bond_3ad_update_ad_actor_settings() -> set ad.system.sys_priority = bond->params.ad_actor_sys_prio -> ad.system.sys_mac_addr = bond->dev->dev_addr; because params.ad_actor_system == 0 results: ad.system.sys_mac_addr = bond->dev->dev_addr line 60: ip link set veth1-bond down master fbond given: params.ad_actor_system = 0 params.mode = BOND_MODE_8023AD ad.system.sys_mac_addr == bond->dev->dev_addr call stack: bond_enslave -> bond_3ad_initialize(); because first slave -> if ad.system.sys_mac_addr != bond->dev->dev_addr return results: Nothing is run in bond_3ad_initialize() because dev_addr equals sys_mac_addr leaving the global ad_ticks_per_sec zero as it is never initialized anywhere else. The if check around the contents of bond_3ad_initialize() is no longer needed due to commit 5ee14e6d336f ("bonding: 3ad: apply ad_actor settings changes immediately") which sets ad.system.sys_mac_addr if any one of the bonding parameters whos set function calls bond_3ad_update_ad_actor_settings(). This is because if ad.system.sys_mac_addr is zero it will be set to the current bond mac address, this causes the if check to never be true. Fixes: 5ee14e6d336f ("bonding: 3ad: apply ad_actor settings changes immediately") Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 18:30:19 -07:00
Jonathan Toppins	c078290a2b	selftests: include bonding tests into the kselftest infra This creates a test collection in drivers/net/bonding for bonding specific kernel selftests. The first test is a reproducer that provisions a bond and given the specific order in how the ip-link(8) commands are issued the bond never transmits an LACPDU frame on any of its slaves. Signed-off-by: Jonathan Toppins <jtoppins@redhat.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 18:30:16 -07:00
Sergei Antonov	0ee7828dfc	net: moxa: get rid of asymmetry in DMA mapping/unmapping Since priv->rx_mapping[i] is maped in moxart_mac_open(), we should unmap it from moxart_mac_stop(). Fixes 2 warnings. 1. During error unwinding in moxart_mac_probe(): "goto init_fail;", then moxart_mac_free_memory() calls dma_unmap_single() with priv->rx_mapping[i] pointers zeroed. WARNING: CPU: 0 PID: 1 at kernel/dma/debug.c:963 check_unmap+0x704/0x980 DMA-API: moxart-ethernet 92000000.mac: device driver tries to free DMA memory it has not allocated [device address=0x0000000000000000] [size=1600 bytes] CPU: 0 PID: 1 Comm: swapper Not tainted 5.19.0+ #60 Hardware name: Generic DT based system unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x34/0x44 dump_stack_lvl from __warn+0xbc/0x1f0 __warn from warn_slowpath_fmt+0x94/0xc8 warn_slowpath_fmt from check_unmap+0x704/0x980 check_unmap from debug_dma_unmap_page+0x8c/0x9c debug_dma_unmap_page from moxart_mac_free_memory+0x3c/0xa8 moxart_mac_free_memory from moxart_mac_probe+0x190/0x218 moxart_mac_probe from platform_probe+0x48/0x88 platform_probe from really_probe+0xc0/0x2e4 2. After commands: ip link set dev eth0 down ip link set dev eth0 up WARNING: CPU: 0 PID: 55 at kernel/dma/debug.c:570 add_dma_entry+0x204/0x2ec DMA-API: moxart-ethernet 92000000.mac: cacheline tracking EEXIST, overlapping mappings aren't supported CPU: 0 PID: 55 Comm: ip Not tainted 5.19.0+ #57 Hardware name: Generic DT based system unwind_backtrace from show_stack+0x10/0x14 show_stack from dump_stack_lvl+0x34/0x44 dump_stack_lvl from __warn+0xbc/0x1f0 __warn from warn_slowpath_fmt+0x94/0xc8 warn_slowpath_fmt from add_dma_entry+0x204/0x2ec add_dma_entry from dma_map_page_attrs+0x110/0x328 dma_map_page_attrs from moxart_mac_open+0x134/0x320 moxart_mac_open from __dev_open+0x11c/0x1ec __dev_open from __dev_change_flags+0x194/0x22c __dev_change_flags from dev_change_flags+0x14/0x44 dev_change_flags from devinet_ioctl+0x6d4/0x93c devinet_ioctl from inet_ioctl+0x1ac/0x25c v1 -> v2: Extraneous change removed. Fixes: 6c821bd9edc9 ("net: Add MOXA ART SoCs ethernet driver") Signed-off-by: Sergei Antonov <saproj@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220819110519.1230877-1-saproj@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 18:15:44 -07:00
Xiaolei Wang	6dbe852c37	net: phy: Don't WARN for PHY_READY state in mdio_bus_phy_resume() For some MAC drivers, they set the mac_managed_pm to true in its ->ndo_open() callback. So before the mac_managed_pm is set to true, we still want to leverage the mdio_bus_phy_suspend()/resume() for the phy device suspend and resume. In this case, the phy device is in PHY_READY, and we shouldn't warn about this. It also seems that the check of mac_managed_pm in WARN_ON is redundant since we already check this in the entry of mdio_bus_phy_resume(), so drop it. Fixes: 744d23c71af3 ("net: phy: Warn about incorrect mdio_bus_phy_resume() state") Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220819082451.1992102-1-xiaolei.wang@windriver.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 18:14:10 -07:00
Vladimir Oltean	b18e04e362	net: dsa: tag_8021q: remove old comment regarding dsa_8021q_netdev_ops Since commit 129bd7ca8ac0 ("net: dsa: Prevent usage of NET_DSA_TAG_8021Q as tagging protocol"), dsa_8021q_netdev_ops no longer exists, so remove the comment that talks about it. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220818143808.2808393-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-08-22 18:11:58 -07:00

... 3 4 5 6 7 ...

1122423 Commits