linux

iv/linux

Author	SHA1	Message	Date
Paolo Abeni	ea66758c17	tcp: allow MPTCP to update the announced window The MPTCP RFC requires that the MPTCP-level receive window's right edge never moves backward. Currently the MPTCP code enforces such constraint while tracking the right edge, but it does not reflects it on the wire, as MPTCP lacks a suitable hook to update accordingly the TCP header. This change modifies the existing mptcp_write_options() hook, providing the current packet's TCP header to the MPTCP protocol, so that the next patch could implement the above mentioned constraint. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 19:00:15 -07:00
Paolo Abeni	92be2f5227	mptcp: add mib for xmit window sharing Bump a counter for counter when snd_wnd is shared among subflow, for observability's sake. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 19:00:15 -07:00
Paolo Abeni	b713d00675	mptcp: really share subflow snd_wnd As per RFC, mptcp subflows use a "shared" snd_wnd: the effective window is the maximum among the current values received on all subflows. Without such feature a data transfer using multiple subflows could block. Window sharing is currently implemented in the RX side: __tcp_select_window uses the mptcp-level receive buffer to compute the announced window. That is not enough: the TCP stack will stick to the window size received on the given subflow; we need to propagate the msk window value on each subflow at xmit time. Change the packet scheduler to ignore the subflow level window and use instead the msk level one Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 19:00:14 -07:00
Andy Shevchenko	10b4a11fe7	firmware: tee_bnxt: Use UUID API for exporting the UUID There is export_uuid() function which exports uuid_t to the u8 array. Use it instead of open coding variant. This allows to hide the uuid_t internals. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20220504091407.70661-1-andriy.shevchenko@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 18:14:29 -07:00
David Ahern	c67b627e99	net: Make msg_zerocopy_alloc static msg_zerocopy_alloc is only used by msg_zerocopy_realloc; remove the export and make static in skbuff.c Signed-off-by: David Ahern <dsahern@kernel.org> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/r/20220504170947.18773-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 17:02:50 -07:00
Jakub Kicinski	8d602e1a13	net: move snowflake callers to netif_napi_add_tx_weight() Make the drivers with custom tx napi weight call netif_napi_add_tx_weight(). Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Link: https://lore.kernel.org/r/20220504163725.550782-2-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 15:54:18 -07:00
Jakub Kicinski	16d083e28f	net: switch to netif_napi_add_tx() Switch net callers to the new API not requiring the NAPI_POLL_WEIGHT argument. Acked-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Link: https://lore.kernel.org/r/20220504163725.550782-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 15:54:12 -07:00
Jakub Kicinski	fd49f8e61c	jme: remove an unnecessary indirection Remove a define which looks like a OS abstraction layer and makes spatch conversions on this driver problematic. Link: https://lore.kernel.org/r/20220504163939.551231-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 15:53:28 -07:00
Christophe Leroy	6bff3ffcf6	net: ethernet: Prepare cleanup of powerpc's asm/prom.h powerpc's asm/prom.h includes some headers that it doesn't need itself. In order to clean powerpc's asm/prom.h up in a further step, first clean all files that include asm/prom.h Some files don't need asm/prom.h at all. For those ones, just remove inclusion of asm/prom.h Some files don't need any of the items provided by asm/prom.h, but need some of the headers included by asm/prom.h. For those ones, add the needed headers that are brought by asm/prom.h at the moment and remove asm/prom.h Some files really need asm/prom.h but also need some of the headers included by asm/prom.h. For those one, leave asm/prom.h but also add the needed headers so that they can be removed from asm/prom.h in a later step. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/r/09a13d592d628de95d30943e59b2170af5b48110.1651663857.git.christophe.leroy@csgroup.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 15:53:02 -07:00
Christophe Leroy	d9ccf770c7	sungem: Prepare cleanup of powerpc's asm/prom.h powerpc's <asm/prom.h> includes some headers that it doesn't need itself. In order to clean powerpc's <asm/prom.h> up in a further step, first clean all files that include <asm/prom.h> sungem_phy.c doesn't use any object provided by <asm/prom.h>. But removing inclusion of <asm/prom.h> leads to the following errors: CC drivers/net/sungem_phy.o drivers/net/sungem_phy.c: In function 'bcm5421_init': drivers/net/sungem_phy.c:448:42: error: implicit declaration of function 'of_get_parent'; did you mean 'dget_parent'? [-Werror=implicit-function-declaration] 448 \| struct device_node np = of_get_parent(phy->platform_data); \| ^~~~~~~~~~~~~ \| dget_parent drivers/net/sungem_phy.c:448:42: warning: initialization of 'struct device_node ' from 'int' makes pointer from integer without a cast [-Wint-conversion] drivers/net/sungem_phy.c:450:35: error: implicit declaration of function 'of_get_property' [-Werror=implicit-function-declaration] 450 \| if (np == NULL \|\| of_get_property(np, "no-autolowpower", NULL)) \| ^~~~~~~~~~~~~~~ Remove <asm/prom.h> from included headers but add <linux/of.h> to handle the above. Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/r/f7a7fab3ec5edf803d934fca04df22631c2b449d.1651662885.git.christophe.leroy@csgroup.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 15:52:27 -07:00
Eyal Birger	1f86123b97	net: align SO_RCVMARK required privileges with SO_MARK The commit referenced in the "Fixes" tag added the SO_RCVMARK socket option for receiving the skb mark in the ancillary data. Since this is a new capability, and exposes admin configured details regarding the underlying network setup to sockets, let's align the needed capabilities with those of SO_MARK. Fixes: `6fd1d51cfa` ("net: SO_RCVMARK socket option for SO_MARK with recvmsg()") Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Link: https://lore.kernel.org/r/20220504095459.2663513-1-eyal.birger@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 15:48:17 -07:00
Jakub Kicinski	c4a67a21a6	Revert "Merge branch 'mlxsw-line-card-model'" This reverts commit `5e927a9f4b`, reversing changes made to `cfc1d91a7d`. The discussion is still ongoing so let's remove the uAPI until the discussion settles. Link: https://lore.kernel.org/all/20220425090021.32e9a98f@kernel.org/ Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20220504154037.539442-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 15:47:23 -07:00
Jakub Kicinski	c8227d568d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net tools/testing/selftests/net/forwarding/Makefile `f62c5acc80` ("selftests/net/forwarding: add missing tests to Makefile") `50fe062c80` ("selftests: forwarding: new test, verify host mdb entries") https://lore.kernel.org/all/20220502111539.0b7e4621@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-05 13:03:18 -07:00
Linus Torvalds	68533eb1fb	Networking fixes for 5.18-rc6, including fixes from can, rxrpc and wireguard Previous releases - regressions: - igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter() - mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter() - rds: acquire netns refcount on TCP sockets - rxrpc: enable IPv6 checksums on transport socket - nic: hinic: fix bug of wq out of bound access - nic: thunder: don't use pci_irq_vector() in atomic context - nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS flag - nic: mlx5e: - lag, fix use-after-free in fib event handler - fix deadlock in sync reset flow Previous releases - always broken: - tcp: fix insufficient TCP source port randomness - can: grcan: grcan_close(): fix deadlock - nfc: reorder destructive operations in to avoid bugs Misc: - wireguard: improve selftests reliability Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmJznX8SHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkDrgP/R9tErvWO/uvXpNgDr6Qh8osYt5Z297l EWyhz7cUm4LKi6MYWrRKR4uRK9n43DK+OVws5LXrYL0tIdJH3uYBE0RS67W9WmjA kE2Srq1A6wUi4koiYKeYDXtodCJLC93n+QnLBfih44Pc+xmk8t+G6qZ1n45qjRss gzV75AlIfErmjqyYi81DaZ6Z0TV4H5qPM4ZXRViIzH+Ccyx6rk/KNqU4wepoqRSi lCckTvMt9V7OiYHzM5Pu1kTUV07Jtiy7xkIQMdKYXCZpyqkmqyPFMM+0B7fDOEeP WZnkdUwi69WMVmeefcpEn7XsoNbVadGkTQM2EcUWvrxuCeawmGxYoORvvFs0IpAX YkYXk1US0Sd1L2XlMaus+HLsmmx4fWnb/hWqGL/D+arZOvTCOhBQItSRmKA6d+kM OLfj/gh0YLBsHVrCiHUN06oopvhWuBEBAJbVFkbJCvXoFGqHigijBCVjFBVH1p4o L5bWVEAQ8tkFdofXw0nOe6vRCD5BGN34N5DkqC5E8mj/uLP0FVEWOISV3TzKKF5B mEDGZAGN5bTf/ScvbF8XEaqtdk/cxv2ohWNn9wtgoaNBorgKtpTf99pXJtxV2+fs 3RiPM0My9uz8/wMveSfKShQntMSdnmQPMpJ4Vm0e4bOS1K0LRGUgZxOpX2/BTokq Iv5msx85X5/S =XuN7 -----END PGP SIGNATURE----- Merge tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from can, rxrpc and wireguard. Previous releases - regressions: - igmp: respect RCU rules in ip_mc_source() and ip_mc_msfilter() - mld: respect RCU rules in ip6_mc_source() and ip6_mc_msfilter() - rds: acquire netns refcount on TCP sockets - rxrpc: enable IPv6 checksums on transport socket - nic: hinic: fix bug of wq out of bound access - nic: thunder: don't use pci_irq_vector() in atomic context - nic: bnxt_en: fix possible bnxt_open() failure caused by wrong RFS flag - nic: mlx5e: - lag, fix use-after-free in fib event handler - fix deadlock in sync reset flow Previous releases - always broken: - tcp: fix insufficient TCP source port randomness - can: grcan: grcan_close(): fix deadlock - nfc: reorder destructive operations in to avoid bugs Misc: - wireguard: improve selftests reliability" * tag 'net-5.18-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits) NFC: netlink: fix sleep in atomic bug when firmware download timeout selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer tcp: drop the hash_32() part from the index calculation tcp: increase source port perturb table to 2^16 tcp: dynamically allocate the perturb table used by source ports tcp: add small random increments to the source port tcp: resalt the secret every 10 seconds tcp: use different parts of the port_offset for index and offset secure_seq: use the 64 bits of the siphash for port offset calculation wireguard: selftests: set panic_on_warn=1 from cmdline wireguard: selftests: bump package deps wireguard: selftests: restore support for ccache wireguard: selftests: use newer toolchains to fill out architectures wireguard: selftests: limit parallelism to $(nproc) tests at once wireguard: selftests: make routing loop test non-fatal net/mlx5: Fix matching on inner TTC net/mlx5: Avoid double clear or set of sync reset requested net/mlx5: Fix deadlock in sync reset flow net/mlx5e: Fix trust state reset in reload net/mlx5e: Avoid checking offload capability in post_parse action ...	2022-05-05 09:45:12 -07:00
Casper Andersson	1c1ed5a484	net: sparx5: Add handling of host MDB entries Handle adding and removing MDB entries for host Signed-off-by: Casper Andersson <casper.casan@gmail.com> Link: https://lore.kernel.org/r/20220503093922.1630804-1-casper.casan@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-05-05 15:03:45 +02:00
Duoming Zhou	4071bf121d	NFC: netlink: fix sleep in atomic bug when firmware download timeout There are sleep in atomic bug that could cause kernel panic during firmware download process. The root cause is that nlmsg_new with GFP_KERNEL parameter is called in fw_dnld_timeout which is a timer handler. The call trace is shown below: BUG: sleeping function called from invalid context at include/linux/sched/mm.h:265 Call Trace: kmem_cache_alloc_node __alloc_skb nfc_genl_fw_download_done call_timer_fn __run_timers.part.0 run_timer_softirq __do_softirq ... The nlmsg_new with GFP_KERNEL parameter may sleep during memory allocation process, and the timer handler is run as the result of a "software interrupt" that should not call any other function that could sleep. This patch changes allocation mode of netlink message from GFP_KERNEL to GFP_ATOMIC in order to prevent sleep in atomic bug. The GFP_ATOMIC flag makes memory allocation operation could be used in atomic context. Fixes: `9674da8759` ("NFC: Add firmware upload netlink command") Fixes: `9ea7187c53` ("NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220504055847.38026-1-duoming@zju.edu.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2022-05-05 10:18:15 +02:00
Jakub Kicinski	4950b6990e	Merge branch 'ocelot-vcap-cleanups' Vladimir Oltean says: ==================== Ocelot VCAP cleanups This is a series of minor code cleanups brought to the Ocelot switch driver logic for VCAP filters. - don't use list_for_each_safe() in ocelot_vcap_filter_add_to_block - don't use magic numbers for OCELOT_POLICER_DISCARD ==================== Link: https://lore.kernel.org/r/20220503120150.837233-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 20:42:18 -07:00
Vladimir Oltean	91d350d661	net: mscc: ocelot: don't use magic numbers for OCELOT_POLICER_DISCARD OCELOT_POLICER_DISCARD helps "kill dropped packets dead" since a PERMIT/DENY mask mode with a port mask of 0 isn't enough to stop the CPU port from receiving packets removed from the forwarding path. The hardcoded initialization done for it in ocelot_vcap_init() is confusing. All we need from it is to have a rate and a burst size of 0. Reuse qos_policer_conf_set() for that purpose. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 20:42:15 -07:00
Vladimir Oltean	8e90c499bd	net: mscc: ocelot: drop port argument from qos_policer_conf_set The "port" argument is used for nothing else except printing on the error path. Print errors on behalf of the policer index, which is less confusing anyway. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 20:42:15 -07:00
Vladimir Oltean	09fd1e0d14	net: mscc: ocelot: use list_for_each_entry in ocelot_vcap_filter_add_to_block Unify the code paths for adding to an empty list and to a list with elements by keeping a "pos" list_head element that indicates where to insert. Initialize "pos" with the list head itself in case list_for_each_entry() doesn't iterate over any element. Note that list_for_each_safe() isn't needed because no element is removed from the list while iterating. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 20:42:15 -07:00
Vladimir Oltean	3825a0d027	net: mscc: ocelot: add to tail of empty list in ocelot_vcap_filter_add_to_block This makes no functional difference but helps in minimizing the delta for a future change. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 20:42:14 -07:00
Vladimir Oltean	0a448bba50	net: mscc: ocelot: use list_add_tail in ocelot_vcap_filter_add_to_block() list_add(..., pos->prev) and list_add_tail(..., pos) are equivalent, use the later form to unify with the case where the list is empty later. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 20:42:14 -07:00
Michael Walle	fa728505f3	dt-bindings: net: lan966x: fix example In commit `4fdabd509d` ("dt-bindings: net: lan966x: remove PHY reset") the PHY reset was removed, but I failed to remove it from the example. Fix it. Fixes: `4fdabd509d` ("dt-bindings: net: lan966x: remove PHY reset") Reported-by: Rob Herring <robh@kernel.org> Signed-off-by: Michael Walle <michael@walle.cc> Acked-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20220503132038.2714128-1-michael@walle.cc Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 20:40:19 -07:00
Vladimir Oltean	5a7c5f70c7	selftests: ocelot: tc_flower_chains: specify conform-exceed action for policer As discussed here with Ido Schimmel: https://patchwork.kernel.org/project/netdevbpf/patch/20220224102908.5255-2-jianbol@nvidia.com/ the default conform-exceed action is "reclassify", for a reason we don't really understand. The point is that hardware can't offload that police action, so not specifying "conform-exceed" was always wrong, even though the command used to work in hardware (but not in software) until the kernel started adding validation for it. Fix the command used by the selftest by making the policer drop on exceed, and pass the packet to the next action (goto) on conform. Fixes: `8cd6b020b6` ("selftests: ocelot: add some example VCAP IS1, IS2 and ES0 tc offloads") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20220503121428.842906-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:40:19 -07:00
Jakub Kicinski	ef56248981	Merge branch 'insufficient-tcp-source-port-randomness' Willy Tarreau says: ==================== insufficient TCP source port randomness In a not-yet published paper, Moshe Kol, Amit Klein, and Yossi Gilad report being able to accurately identify a client by forcing it to emit only 40 times more connections than the number of entries in the table_perturb[] table, which is indexed by hashing the connection tuple. The current 2^8 setting allows them to perform that attack with only 10k connections, which is not hard to achieve in a few seconds. Eric, Amit and I have been working on this for a few weeks now imagining, testing and eliminating a number of approaches that Amit and his team were still able to break or that were found to be too risky or too expensive, and ended up with the simple improvements in this series that resists to the attack, doesn't degrade the performance, and preserves a reliable port selection algorithm to avoid connection failures, including the odd/even port selection preference that allows bind() to always find a port quickly even under strong connect() stress. The approach relies on several factors: - resalting the hash secret that's used to choose the table_perturb[] entry every 10 seconds to eliminate slow attacks and force the attacker to forget everything that was learned after this delay. This already eliminates most of the problem because if a client stays silent for more than 10 seconds there's no link between the previous and the next patterns, and 10s isn't yet frequent enough to cause too frequent repetition of a same port that may induce a connection failure ; - adding small random increments to the source port. Previously, a random 0 or 1 was added every 16 ports. Now a random 0 to 7 is added after each port. This means that with the default 32768-60999 range, a worst case rollover happens after 1764 connections, and an average of 3137. This doesn't stop statistical attacks but requires significantly more iterations of the same attack to confirm a guess. - increasing the table_perturb[] size from 2^8 to 2^16, which Amit says will require 2.6 million connections to be attacked with the changes above, making it pointless to get a fingerprint that will only last 10 seconds. Due to the size, the table was made dynamic. - a few minor improvements on the bits used from the hash, to eliminate some unfortunate correlations that may possibly have been exploited to design future attack models. These changes were tested under the most extreme conditions, up to 1.1 million connections per second to one and a few targets, showing no performance regression, and only 2 connection failures within 13 billion, which is less than 2^-32 and perfectly within usual values. The series is split into small reviewable changes and was already reviewed by Amit and Eric. ==================== Link: https://lore.kernel.org/r/20220502084614.24123-1-w@1wt.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:22:35 -07:00
Willy Tarreau	e8161345dd	tcp: drop the hash_32() part from the index calculation In commit `190cc82489` ("tcp: change source port randomizarion at connect() time"), the table_perturb[] array was introduced and an index was taken from the port_offset via hash_32(). But it turns out that hash_32() performs a multiplication while the input here comes from the output of SipHash in secure_seq, that is well distributed enough to avoid the need for yet another hash. Suggested-by: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:22:33 -07:00
Willy Tarreau	4c2c8f03a5	tcp: increase source port perturb table to 2^16 Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately identify a client by forcing it to emit only 40 times more connections than there are entries in the table_perturb[] table. The previous two improvements consisting in resalting the secret every 10s and adding randomness to each port selection only slightly improved the situation, and the current value of 2^8 was too small as it's not very difficult to make a client emit 10k connections in less than 10 seconds. Thus we're increasing the perturb table from 2^8 to 2^16 so that the same precision now requires 2.6M connections, which is more difficult in this time frame and harder to hide as a background activity. The impact is that the table now uses 256 kB instead of 1 kB, which could mostly affect devices making frequent outgoing connections. However such components usually target a small set of destinations (load balancers, database clients, perf assessment tools), and in practice only a few entries will be visited, like before. A live test at 1 million connections per second showed no performance difference from the previous value. Reported-by: Moshe Kol <moshe.kol@mail.huji.ac.il> Reported-by: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Reported-by: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:22:28 -07:00
Willy Tarreau	e926147618	tcp: dynamically allocate the perturb table used by source ports We'll need to further increase the size of this table and it's likely that at some point its size will not be suitable anymore for a static table. Let's allocate it on boot from inet_hashinfo2_init(), which is called from tcp_init(). Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:22:21 -07:00
Willy Tarreau	ca7af04025	tcp: add small random increments to the source port Here we're randomly adding between 0 and 7 random increments to the selected source port in order to add some noise in the source port selection that will make the next port less predictable. With the default port range of 32768-60999 this means a worst case reuse scenario of 14116/8=1764 connections between two consecutive uses of the same port, with an average of 14116/4.5=3137. This code was stressed at more than 800000 connections per second to a fixed target with all connections closed by the client using RSTs (worst condition) and only 2 connections failed among 13 billion, despite the hash being reseeded every 10 seconds, indicating a perfectly safe situation. Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:22:21 -07:00
Eric Dumazet	4dfa9b438e	tcp: resalt the secret every 10 seconds In order to limit the ability for an observer to recognize the source ports sequence used to contact a set of destinations, we should periodically shuffle the secret. 10 seconds looks effective enough without causing particular issues. Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Tested-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:22:21 -07:00
Willy Tarreau	9e9b70ae92	tcp: use different parts of the port_offset for index and offset Amit Klein suggests that we use different parts of port_offset for the table's index and the port offset so that there is no direct relation between them. Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:22:20 -07:00
Willy Tarreau	b2d057560b	secure_seq: use the 64 bits of the siphash for port offset calculation SipHash replaced MD5 in secure_ipv{4,6}_port_ephemeral() via commit `7cd23e5300` ("secure_seq: use SipHash in place of MD5"), but the output remained truncated to 32-bit only. In order to exploit more bits from the hash, let's make the functions return the full 64-bit of siphash_3u32(). We also make sure the port offset calculation in __inet_hash_connect() remains done on 32-bit to avoid the need for div_u64_rem() and an extra cost on 32-bit systems. Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Moshe Kol <moshe.kol@mail.huji.ac.il> Cc: Yossi Gilad <yossi.gilad@mail.huji.ac.il> Cc: Amit Klein <aksecurity@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:22:20 -07:00
Vasily Averin	425b9c7f51	memcg: accounting for objects allocated for new netdevice Creating a new netdevice allocates at least ~50Kb of memory for various kernel objects, but only ~5Kb of them are accounted to memcg. As a result, creating an unlimited number of netdevice inside a memcg-limited container does not fall within memcg restrictions, consumes a significant part of the host's memory, can cause global OOM and lead to random kills of host processes. The main consumers of non-accounted memory are: ~10Kb 80+ kernfs nodes ~6Kb ipv6_add_dev() allocations 6Kb __register_sysctl_table() allocations 4Kb neigh_sysctl_register() allocations 4Kb __devinet_sysctl_register() allocations 4Kb __addrconf_sysctl_register() allocations Accounting of these objects allows to increase the share of memcg-related memory up to 60-70% (~38Kb accounted vs ~54Kb total for dummy netdevice on typical VM with default Fedora 35 kernel) and this should be enough to somehow protect the host from misuse inside container. Other related objects are quite small and may not be taken into account to minimize the expected performance degradation. It should be separately mentonied ~300 bytes of percpu allocation of struct ipstats_mib in snmp6_alloc_dev(), on huge multi-cpu nodes it can become the main consumer of memory. This patch does not enables kernfs accounting as it affects other parts of the kernel and should be discussed separately. However, even without kernfs, this patch significantly improves the current situation and allows to take into account more than half of all netdevice allocations. Signed-off-by: Vasily Averin <vvs@openvz.org> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/354a0a5f-9ec3-a25c-3215-304eab2157bc@openvz.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 19:16:46 -07:00
Jakub Kicinski	205557ba99	Merge branch 'wireguard-patches-for-5-18-rc6' Jason A. Donenfeld says: ==================== wireguard patches for 5.18-rc6 In working on some other problems, I wound up leaning on the WireGuard CI more than usual and uncovered a few small issues with reliability. These are fairly low key changes, since they don't impact kernel code itself. One change does stick out in particular, though, which is the "make routing loop test non-fatal" commit. I'm not thrilled about doing this, but currently [1] remains unsolved, and I'm still working on a real solution to that (hopefully for 5.19 or 5.20 if I can come up with a good idea...), so for now that test just prints a big red warning instead. [1] https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/ ==================== Link: https://lore.kernel.org/r/20220504202920.72908-1-Jason@zx2c4.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 17:50:00 -07:00
Jason A. Donenfeld	3fc1b11e5d	wireguard: selftests: set panic_on_warn=1 from cmdline Rather than setting this once init is running, set panic_on_warn from the kernel command line, so that it catches splats from WireGuard initialization code and the various crypto selftests. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 17:49:57 -07:00
Jason A. Donenfeld	a6b8ea9144	wireguard: selftests: bump package deps Use newer, more reliable package dependencies. These should hopefully reduce flakes. However, we keep the old iputils package, as it accumulated bugs after resulting in flakes on slow machines. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 17:49:57 -07:00
Jason A. Donenfeld	d261ba6aa4	wireguard: selftests: restore support for ccache When moving to non-system toolchains, we inadvertantly killed the ability to use ccache. So instead, build ccache support into the test harness directly. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 17:49:56 -07:00
Jason A. Donenfeld	d5d9b29bc9	wireguard: selftests: use newer toolchains to fill out architectures Rather than relying on the system to have cross toolchains available, simply download musl.cc's ones and use that libc.so, and then we use it to fill in a few missing platforms, such as riscv64, riscv64, powerpc64, and s390x. Since riscv doesn't have a second serial port in its device description, we have to use virtio's vport. This is actually the same situation on ARM, but we were previously hacking QEMU up to work around this, which required a custom QEMU. Instead just do the vport trick on ARM too. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 17:49:56 -07:00
Jason A. Donenfeld	39f02bf1e5	wireguard: selftests: limit parallelism to $(nproc) tests at once The parallel tests were added to catch queueing issues from multiple cores. But what happens in reality when testing tons of processes is that these separate threads wind up fighting with the scheduler, and we wind up with contention in places we don't care about that decrease the chances of hitting a bug. So just do a test with the number of CPU cores, rather than trying to scale up arbitrarily. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 17:49:56 -07:00
Jason A. Donenfeld	ae2de669c1	wireguard: selftests: make routing loop test non-fatal I hate to do this, but I still do not have a good solution to actually fix this bug across architectures. So just disable it for now, so that the CI can still deliver actionable results. This commit adds a large red warning, so that at least the failure isn't lost forever, and hopefully this can be revisited down the line. Link: https://lore.kernel.org/netdev/CAHmME9pv1x6C4TNdL6648HydD8r+txpV4hTUXOBVkrapBXH4QQ@mail.gmail.com/ Link: https://lore.kernel.org/netdev/YmszSXueTxYOC41G@zx2c4.com/ Link: https://lore.kernel.org/wireguard/CAHmME9rNnBiNvBstb7MPwK-7AmAN0sOfnhdR=eeLrowWcKxaaQ@mail.gmail.com/ Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2022-05-04 17:49:56 -07:00
Linus Torvalds	a7391ad357	IOMMU Fixes for Linux v5.18-rc5 Including: - Fix for a regression in IOMMU core code which could cause NULL-ptr dereferences - Arm SMMU fixes for 5.18 - Fix off-by-one in SMMUv3 SVA TLB invalidation - Disable large mappings to workaround nvidia erratum - Intel VT-d fixes - Handle PCI stop marker messages in IOMMU driver to meet the requirement of I/O page fault handling framework. - Calculate a feasible mask for non-aligned page-selective IOTLB invalidation. - Two fixes for Apple DART IOMMU driver - Fix potential NULL-ptr dereference - Set module owner -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEr9jSbILcajRFYWYyK/BELZcBGuMFAmJyoYIACgkQK/BELZcB GuMLOhAAu4ZTxHmBbS/JDt0+DUKODB2z3vRDBoZxI+c6QCRvu5PhrCnR1nTmoNGg Falu/QlzcVmTupUOT0OLrTd4H8KKzlXJuFcVJAQh98ReSQ5zIkBX+b7ruL3+KKAU Y/zDy/nD5G9G1vM9/PjZsSvr2U9Vq1p3qNArEZDnd8UYaVokkC0nNsxQhAoVA43U WjLR9Y7wCnxLFweaa3ujzSzmAOgVcLeMBV/siaRcpu4ohwHS6kKOikVAJWlgp/pc bCHpfxJbBLSG7lIAagWHbC4EF8GxTl/+u1+8pw1fVXpBG4Qaj60VSiITznLqSBi4 avTTlwwPsLQOMshmARllIDWJny7TCEJZhqxSO22ki9Em32+zMA7e2ozKp2cSwyn2 qXFnF4o3lSpDE99qnQPHWqWXQVVabkuLgwyqqDHMg0BSKx3MPDhagXdmYh/eI9hF j0BeY1hkMJU+kt+bNTSKSwwsooanG8hZvsTepct9+oeIZndEbyG0oKi7nW+xK2zD bKDyrY8cr2MBIBcV+yiDPQoCmbegDBouxCZtkLDo4cug30IVqDWvu5efld1WkhBX BLJC5LI+EibhzEKBiHk6qb3Gz4nIvvbAs+oGVdPsjI2LMrICq4eJ4BFaZT8I8PfY QGg9iKW3SCtfwYlfm6/a2Y23k72T86zZFlgMW7fEN4pyR5SEkcc= =jB96 -----END PGP SIGNATURE----- Merge tag 'iomm-fixes-v5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu fixes from Joerg Roedel: "IOMMU core: - Fix for a regression which could cause NULL-ptr dereferences Arm SMMU: - Fix off-by-one in SMMUv3 SVA TLB invalidation - Disable large mappings to workaround nvidia erratum Intel VT-d: - Handle PCI stop marker messages in IOMMU driver to meet the requirement of I/O page fault handling framework. - Calculate a feasible mask for non-aligned page-selective IOTLB invalidation. Apple DART IOMMU: - Fix potential NULL-ptr dereference - Set module owner" * tag 'iomm-fixes-v5.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: iommu: Make sysfs robust for non-API groups iommu/dart: Add missing module owner to ops structure iommu/dart: check return value after calling platform_get_resource() iommu/vt-d: Drop stop marker messages iommu/vt-d: Calculate mask for non-aligned flushes iommu: arm-smmu: disable large page mappings for Nvidia arm-smmu iommu/arm-smmu-v3: Fix size calculation in arm_smmu_mm_invalidate_range()	2022-05-04 11:04:52 -07:00
Linus Torvalds	3118d7ab3f	Fix some issues that were reported This has been in for-next for a bit (longer than the times would indicate, I had to rebase to add some text to the headers) and these are fixes that need to go in. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE/Q1c5nzg9ZpmiCaGYfOMkJGb/4EFAmJyaB4ACgkQYfOMkJGb /4HOwBAAqdLDIo7WSYjWqmK4hUzLS5iBvgLiH41QKt/AVGkwX20YYANmlQFRShT/ Ue08cjbCcYKVyQCLAnV7nG4RU26U+eAlLL/owH8CVLGXEND9IGahzii1Up350fhX EjZgoHoEiU+xTGiRs/Tsag0v5WL++78qiXi7tQX2NsWwBT+vCDMEo/+Xv9h+TKzg okVi78/4LATI5jZJpCkz7uRjrEIPoUs5rTkfe5p4EhBfis+pd0CIs0gIsrsxB12p cnD55NOJoxAS0EQnRiLakjy320GrvlHUy0VfFByEkMWKWhl4c6aE60mWI85CcGMo qrovIeAiGZ8r3gqwP986WJBNbSve3f9b2gNeLlai+k4loeyANUOoTx7rgGZzzqNo KKJd9v/wjDXW/cGF41sTYoFHpJXF/CZfm6rusYcQrC/L052sEwG3clasWsmVsmml suRbEYzDoDPfyOqEcLztA/OJr6bkA+/eu458W4Q75WWNfF86mM1KapZjoKNXokq+ baE855AP4bdyoG+1lKpORk+3oOlqMWYyP5Bkmkz/JIyQAmzB8DhJZeD4SMHkMgp2 Bb/rjm/AZRxjEX/bvQyO31mpSaqM39Yxua4qRZcmxAW+QXTYBgtE9psNxZ7BGpRJ dfl/siFeVMUBUnFTlB5pf49Mt9QZomcQ2wBZ029gHncWvOjCSXs= =f3CV -----END PGP SIGNATURE----- Merge tag 'for-linus-5.17-2' of https://github.com/cminyard/linux-ipmi Pull IPMI fixes from Corey Minyard: "Fix some issues that were reported. This has been in for-next for a bit (longer than the times would indicate, I had to rebase to add some text to the headers) and these are fixes that need to go in" * tag 'for-linus-5.17-2' of https://github.com/cminyard/linux-ipmi: ipmi:ipmi_ipmb: Fix null-ptr-deref in ipmi_unregister_smi() ipmi: When handling send message responses, don't process the message	2022-05-04 11:01:31 -07:00
Robin Murphy	392bf51946	iommu: Make sysfs robust for non-API groups Groups created by VFIO backends outside the core IOMMU API should never be passed directly into the API itself, however they still expose their standard sysfs attributes, so we can still stumble across them that way. Take care to consider those cases before jumping into our normal assumptions of a fully-initialised core API group. Fixes: `3f6634d997` ("iommu: Use right way to retrieve iommu_ops") Reported-by: Jan Stancek <jstancek@redhat.com> Tested-by: Jan Stancek <jstancek@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Robin Murphy <robin.murphy@arm.com> Link: https://lore.kernel.org/r/86ada41986988511a8424e84746dfe9ba7f87573.1651667683.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>	2022-05-04 15:13:39 +02:00
David S. Miller	a37f37a2e7	Merge branch 'mlxsw-updates' Ido Schimmel says: ==================== mlxsw: Various updates Patches #1-#3 add missing topology diagrams in selftests and perform small cleanups. Patches #4-#5 make small adjustments in QoS configuration. See detailed description in the commit messages. Patches #6-#8 reduce the number of background EMAD transactions. The driver periodically queries the device (via EMAD transactions) about updates that cannot happen in certain situations. This can negatively impact the latency of time critical transactions, as the device is busy processing other transactions. Before: # perf stat -a -e devlink:devlink_hwmsg -- sleep 10 Performance counter stats for 'system wide': 452 devlink:devlink_hwmsg 10.009736160 seconds time elapsed After: # perf stat -a -e devlink:devlink_hwmsg -- sleep 10 Performance counter stats for 'system wide': 0 devlink:devlink_hwmsg 10.001726333 seconds time elapsed ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2022-05-04 11:21:33 +01:00
Ido Schimmel	cff9437605	mlxsw: spectrum_router: Only query neighbour activity when necessary The driver periodically queries the device for activity of neighbour entries in order to report it to the kernel's neighbour table. Avoid unnecessary activity query when no neighbours are installed. Use an atomic variable to track the number of neighbours, as it is read without any locks. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-05-04 11:21:33 +01:00
Ido Schimmel	b895000384	mlxsw: spectrum_switchdev: Only query FDB notifications when necessary The driver periodically queries the device for FDB notifications (e.g., learned, aged-out) in order to update the bridge driver. These notifications can only be generated when bridges are offloaded to the device. Avoid unnecessary queries by starting to query upon installation of the first bridge and stop querying upon removal of the last bridge. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-05-04 11:21:33 +01:00
Ido Schimmel	d1314096fb	mlxsw: spectrum_acl: Do not report activity for multicast routes The driver periodically queries the device for activity of ACL rules in order to report it to tc upon 'FLOW_CLS_STATS'. In Spectrum-2 and later ASICs, multicast routes are programmed as ACL rules, but unlike rules installed by tc, their activity is of no interest. Avoid unnecessary activity query for such rules by always reporting them as inactive. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-05-04 11:21:32 +01:00
Petr Machata	0106668cd2	mlxsw: Treat LLDP packets as control When trapping packets for on-CPU processing, Spectrum machines differentiate between control and non-control traps. Traffic trapped through non-control traps is treated as data and kept in shared buffer in pools 0-4. Traffic trapped through control traps is kept in the dedicated control buffer 9. The advantage of marking traps as control is that pressure in the data plane does not prevent the control traffic to be processed. When the LLDP trap was introduced, it was marked as a control trap. But then in commit `aed4b57211` ("mlxsw: spectrum: PTP: Hook into packet receive path"), PTP traps were introduced. Because Ethernet-encapsulated PTP packets look to the Spectrum-1 ASIC as LLDP traffic and are trapped under the LLDP trap, this trap was reconfigured as non-control, in sync with the PTP traps. There is however no requirement that PTP traffic be handled as data. Besides, the usual encapsulation for PTP traffic is UDP, not bare Ethernet, and that is in deployments that even need PTP, which is far less common than LLDP. This is reflected by the default policer, which was not bumped up to the 19Kpps / 24Kpps that is the expected load of a PTP-enabled Spectrum-1 switch. Marking of LLDP trap as non-control was therefore probably misguided. In this patch, change it back to control. Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-05-04 11:21:32 +01:00
Petr Machata	b6b584562c	mlxsw: spectrum_dcb: Do not warn about priority changes The idea behind the warnings is that the user would get warned in case when more than one priority is configured for a given DSCP value on a netdevice. The warning is currently wrong, because dcb_ieee_getapp_mask() returns the first matching entry, not all of them, and the warning will then claim that some priority is "current", when in fact it is not. But more importantly, the warning is misleading in general. Consider the following commands: # dcb app flush dev swp19 dscp-prio # dcb app add dev swp19 dscp-prio 24:3 # dcb app replace dev swp19 dscp-prio 24:2 The last command will issue the following warning: mlxsw_spectrum3 0000:07:00.0 swp19: Ignoring new priority 2 for DSCP 24 in favor of current value of 3 The reason is that the "replace" command works by first adding the new value, and then removing all old values. This is the only way to make the replacement without causing the traffic to be prioritized to whatever the chip defaults to. The warning is issued in response to adding the new priority, and then no warning is shown when the old priority is removed. The upshot is that the canonical way to change traffic prioritization always produces a warning about ignoring the new priority, but what gets configured is in fact what the user intended. An option to just emit warning every time that the prioritization changes just to make it clear that it happened is obviously unsatisfactory. Therefore, in this patch, remove the warnings. Reported-by: Maksym Yaremchuk <maksymy@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-05-04 11:21:32 +01:00
Petr Machata	faa7521add	selftests: router.sh: Add a diagram It is customary for selftests to have a comment with a topology diagram, which serves to illustrate the situation in which the test is done. This selftest lacks it. Add it. Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2022-05-04 11:21:32 +01:00

1 2 3 4 5 ...

1091318 Commits