linux

iv/linux

Author	SHA1	Message	Date
Steen Hegelund	81e164c4ae	net: microchip: sparx5: Add automatic selection of VCAP rule actionset With more than one possible actionset in a VCAP instance, the VCAP API will now use the actions in a VCAP rule to select the actionset that fits these actions the best possible way. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-26 10:07:44 +01:00
Steen Hegelund	88bd9ea70b	net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs This allows rules to be chained between VCAP instances, e.g. from IS0 Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2 Lookups. Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the hardware. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-26 10:07:44 +01:00
Steen Hegelund	542e6e2c20	net: microchip: sparx5: Add TC support for IS0 VCAP This enables the TC command to use the Sparx5 IS0 VCAP Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-26 10:07:44 +01:00
Steen Hegelund	7306fcd17c	net: microchip: sparx5: Add actionset type id information to rule This adds the actionset type id to the rule information. This is needed as we now have more than one actionset in a VCAP instance (IS0). Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-26 10:07:44 +01:00
Steen Hegelund	545609fd4e	net: microchip: sparx5: Add IS0 VCAP keyset configuration for Sparx5 This adds the IS0 VCAP port keyset configuration for Sparx5 and also updates the debugFS support to show the keyset configuration. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-26 10:07:44 +01:00
Steen Hegelund	f274a659fb	net: microchip: sparx5: Add IS0 VCAP model and updated KUNIT VCAP model This provides the IS0 (Ingress Stage 0) or CLM VCAP model for Sparx5. This VCAP provides classification actions for Sparx5. Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-26 10:07:44 +01:00
Jakub Kicinski	3f17e16f38	Merge branch 'add-ip_local_port_range-socket-option' Jakub Sitnicki says: ==================== Add IP_LOCAL_PORT_RANGE socket option This patch set is a follow up to the "How to share IPv4 addresses by partitioning the port space" talk given at LPC 2022 [1]. Please see patch #1 for the motivation & the use case description. Patch #2 adds tests exercising the new option in various scenarios. Documentation ------------- Proposed update to the ip(7) man-page: IP_LOCAL_PORT_RANGE (since Linux X.Y) Set or get the per-socket default local port range. This option can be used to clamp down the global local port range, defined by the ip_local_port_range /proc interface described below, for a given socket. The option takes an uint32_t value with the high 16 bits set to the upper range bound, and the low 16 bits set to the lower range bound. Range bounds are inclusive. The 16-bit values should be in host byte order. The lower bound has to be less than the upper bound when both bounds are not zero. Otherwise, setting the option fails with EINVAL. If either bound is outside of the global local port range, or is zero, then that bound has no effect. To reset the setting, pass zero as both the upper and the lower bound. Interaction with SELinux bind() hook ------------------------------------ SELinux bind() hook - selinux_socket_bind() - performs a permission check if the requested local port number lies outside of the netns ephemeral port range. The proposed socket option cannot be used change the ephemeral port range to extend beyond the per-netns port range, as set by net.ipv4.ip_local_port_range. Hence, there is no interaction with SELinux, AFAICT. RFC -> v1 RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/ * Allow either the high bound or the low bound, or both, to be zero * Add getsockopt support * Add selftests Links: ------ [1]: https://lpc.events/event/16/contributions/1349/ ==================== Link: https://lore.kernel.org/r/20221221-sockopt-port-range-v6-0-be255cc0e51f@cloudflare.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-25 22:45:02 -08:00
Jakub Sitnicki	ae5439658c	selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios: 1. pass invalid values to setsockopt 2. pass a range outside of the per-netns port range 3. configure a single-port range 4. exhaust a configured multi-port range 5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT) 6. set then get the per-socket port range Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-25 22:45:00 -08:00
Jakub Sitnicki	91d0b78c51	inet: Add IP_LOCAL_PORT_RANGE socket option Users who want to share a single public IP address for outgoing connections between several hosts traditionally reach for SNAT. However, SNAT requires state keeping on the node(s) performing the NAT. A stateless alternative exists, where a single IP address used for egress can be shared between several hosts by partitioning the available ephemeral port range. In such a setup: 1. Each host gets assigned a disjoint range of ephemeral ports. 2. Applications open connections from the host-assigned port range. 3. Return traffic gets routed to the host based on both, the destination IP and the destination port. An application which wants to open an outgoing connection (connect) from a given port range today can choose between two solutions: 1. Manually pick the source port by bind()'ing to it before connect()'ing the socket. This approach has a couple of downsides: a) Search for a free port has to be implemented in the user-space. If the chosen 4-tuple happens to be busy, the application needs to retry from a different local port number. Detecting if 4-tuple is busy can be either easy (TCP) or hard (UDP). In TCP case, the application simply has to check if connect() returned an error (EADDRNOTAVAIL). That is assuming that the local port sharing was enabled (REUSEADDR) by all the sockets. # Assume desired local port range is 60_000-60_511 s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1) s.bind(("192.0.2.1", 60_000)) s.connect(("1.1.1.1", 53)) # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy # Application must retry with another local port In case of UDP, the network stack allows binding more than one socket to the same 4-tuple, when local port sharing is enabled (REUSEADDR). Hence detecting the conflict is much harder and involves querying sock_diag and toggling the REUSEADDR flag [1]. b) For TCP, bind()-ing to a port within the ephemeral port range means that no connecting sockets, that is those which leave it to the network stack to find a free local port at connect() time, can use the this port. IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port will be skipped during the free port search at connect() time. 2. Isolate the app in a dedicated netns and use the use the per-netns ip_local_port_range sysctl to adjust the ephemeral port range bounds. The per-netns setting affects all sockets, so this approach can be used only if: - there is just one egress IP address, or - the desired egress port range is the same for all egress IP addresses used by the application. For TCP, this approach avoids the downsides of (1). Free port search and 4-tuple conflict detection is done by the network stack: system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'") s = socket(AF_INET, SOCK_STREAM) s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1) s.bind(("192.0.2.1", 0)) s.connect(("1.1.1.1", 53)) # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy For UDP this approach has limited applicability. Setting the IP_BIND_ADDRESS_NO_PORT socket option does not result in local source port being shared with other connected UDP sockets. Hence relying on the network stack to find a free source port, limits the number of outgoing UDP flows from a single IP address down to the number of available ephemeral ports. To put it another way, partitioning the ephemeral port range between hosts using the existing Linux networking API is cumbersome. To address this use case, add a new socket option at the SOL_IP level, named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the ephemeral port range for each socket individually. The option can be used only to narrow down the per-netns local port range. If the per-socket range lies outside of the per-netns range, the latter takes precedence. UAPI-wise, the low and high range bounds are passed to the kernel as a pair of u16 values in host byte order packed into a u32. This avoids pointer passing. PORT_LO = 40_000 PORT_HI = 40_511 s = socket(AF_INET, SOCK_STREAM) v = struct.pack("I", PORT_HI << 16 \| PORT_LO) s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v) s.bind(("127.0.0.1", 0)) s.getsockname() # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511), # if there is a free port. EADDRINUSE otherwise. [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116 Reviewed-by: Marek Majkowski <marek@cloudflare.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-25 22:45:00 -08:00
Randy Dunlap	6a7a2c18a9	net: Kconfig: fix spellos Fix spelling in net/ Kconfig files. (reported by codespell) Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Jozsef Kadlecsik <kadlec@netfilter.org> Cc: Florian Westphal <fw@strlen.de> Cc: coreteam@netfilter.org Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-25 22:39:56 -08:00
Vladimir Oltean	f5be9caf7b	net: ethtool: fix NULL pointer dereference in pause_prepare_data() In the following call path: ethnl_default_dumpit -> ethnl_default_dump_one -> ctx->ops->prepare_data -> pause_prepare_data struct genl_info *info will be passed as NULL, and pause_prepare_data() dereferences it while getting the extended ack pointer. To avoid that, just set the extack to NULL if "info" is NULL, since the netlink extack handling messages know how to deal with that. The pattern "info ? info->extack : NULL" is present in quite a few other "prepare_data" implementations, so it's clear that it's a more general problem to be dealt with at a higher level, but the code should have at least adhered to the current conventions to avoid the NULL dereference. Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:57:41 +00:00
Vladimir Oltean	c96de13632	net: ethtool: fix NULL pointer dereference in stats_prepare_data() In the following call path: ethnl_default_dumpit -> ethnl_default_dump_one -> ctx->ops->prepare_data -> stats_prepare_data struct genl_info *info will be passed as NULL, and stats_prepare_data() dereferences it while getting the extended ack pointer. To avoid that, just set the extack to NULL if "info" is NULL, since the netlink extack handling messages know how to deal with that. The pattern "info ? info->extack : NULL" is present in quite a few other "prepare_data" implementations, so it's clear that it's a more general problem to be dealt with at a higher level, but the code should have at least adhered to the current conventions to avoid the NULL dereference. Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:56:31 +00:00
David S. Miller	99db6fb043	Merge branch 's390-ism-generalized-interface' Jan Karcher says: ==================== drivers/s390/net/ism: Add generalized interface Previously, there was no clean separation between SMC-D code and the ISM device driver.This patch series addresses the situation to make ISM available for uses outside of SMC-D. In detail: SMC-D offers an interface via struct smcd_ops, which only the ISM module implements so far. However, there is no real separation between the smcd and ism modules, which starts right with the ISM device initialization, which calls directly into the SMC-D code. This patch series introduces a new API in the ISM module, which allows registration of arbitrary clients via include/linux/ism.h: struct ism_client. Furthermore, it introduces a "pure" struct ism_dev (i.e. getting rid of dependencies on SMC-D in the device structure), and adds a number of API calls for data transfers via ISM (see ism_register_dmb() & friends). Still, the ISM module implements the SMC-D API, and therefore has a number of internal helper functions for that matter. Note that the ISM API is consciously kept thin for now (as compared to the SMC-D API calls), as a number of API calls are only used with SMC-D and hardly have any meaningful usage beyond SMC-D, e.g. the VLAN-related calls. v1 -> v2: Removed s390x dependency which broke config for other archs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:49 +00:00
Stefan Raspl	8c81ba2034	net/smc: De-tangle ism and smc device initialization The struct device for ISM devices was part of struct smcd_dev. Move to struct ism_dev, provide a new API call in struct smcd_ops, and convert existing SMCD code accordingly. Furthermore, remove struct smcd_dev from struct ism_dev. This is the final part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:49 +00:00
Stefan Raspl	820f21009f	s390/ism: Consolidate SMC-D-related code The ism module had SMC-D-specific code sprinkled across the entire module. We are now consolidating the SMC-D-specific parts into the latter parts of the module, so it becomes more clear what code is intended for use with ISM, and which parts are glue code for usage in the context of SMC-D. This is the fourth part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:49 +00:00
Stefan Raspl	9de4df7b6b	net/smc: Separate SMC-D and ISM APIs We separate the code implementing the struct smcd_ops API in the ISM device driver from the functions that may be used by other exploiters of ISM devices. Note: We start out small, and don't offer the whole breadth of the ISM device for public use, as many functions are specific to or likely only ever used in the context of SMC-D. This is the third part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:48 +00:00
Stefan Raspl	8747716f39	net/smc: Register SMC-D as ISM client Register the smc module with the new ism device driver API. This is the second part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:48 +00:00
Stefan Raspl	89e7d2ba61	net/ism: Add new API for client registration Add a new API that allows other drivers to concurrently access ISM devices. To do so, we introduce a new API that allows other modules to register for ISM device usage. Furthermore, we move the GID to struct ism, where it belongs conceptually, and rename and relocate struct smcd_event to struct ism_event. This is the first part of a bigger overhaul of the interfaces between SMC and ISM. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:48 +00:00
Stefan Raspl	1baedb13f1	s390/ism: Introduce struct ism_dmb Conceptually, a DMB is a structure that belongs to ISM devices. However, SMC currently 'owns' this structure. So future exploiters of ISM devices would be forced to include SMC headers to work - which is just weird. Therefore, we switch ISM to struct ism_dmb, introduce a new public header with the definition (will be populated with further API calls later on), and, add a thin wrapper to please SMC. Since structs smcd_dmb and ism_dmb are identical, we can simply convert between the two for now. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:48 +00:00
Stefan Raspl	462502ff9a	net/ism: Add missing calls to disable bus-mastering Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:48 +00:00
Stefan Raspl	c40bff4132	net/smc: Terminate connections prior to device removal Removing an ISM device prior to terminating its associated connections doesn't end well. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:46:48 +00:00
Parav Pandit	d067111586	virtio-net: Reduce debug name field size to 16 bytes virtio queue index can be maximum of 65535. 16 bytes are enough to store the vq name with the existing string prefix. With this change, send queue struct saves 24 bytes and receive queue saves whole cache line worth 64 bytes per structure due to saving in alignment bytes. Pahole results before: pahole -s drivers/net/virtio_net.o \| \ grep -e "send_queue" -e "receive_queue" send_queue 1112 0 receive_queue 1280 1 Pahole results after: pahole -s drivers/net/virtio_net.o \| \ grep -e "send_queue" -e "receive_queue" send_queue 1088 0 receive_queue 1216 1 Signed-off-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-25 09:27:13 +00:00
Jakub Kicinski	4373a023e0	devlink: remove a dubious assumption in fmsg dumping Build bot detects that err may be returned uninitialized in devlink_fmsg_prepare_skb(). This is not really true because all fmsgs users should create at least one outer nest, and therefore fmsg can't be completely empty. That said the assumption is not trivial to confirm, so let's follow the bots advice, anyway. This code does not seem to have changed since its inception in commit 1db64e8733f6 ("devlink: Add devlink formatted message (fmsg) API") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Link: https://lore.kernel.org/r/20230124035231.787381-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-24 20:31:35 -08:00
Vladimir Oltean	28113cfada	net: mscc: ocelot: fix incorrect verify_enabled reporting in ethtool get_mm() We don't read the verify_enabled variable from hardware in the MAC Merge layer state GET operation, instead we always leave it set to "false". The user may think something is wrong if they set verify_enabled to true, then read it back and see it's still false, even though the configuration took place. Fixes: 6505b6805655 ("net: mscc: ocelot: add MAC Merge layer support for VSC9959") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20230123184538.3420098-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-24 18:34:20 -08:00
James Hershaw	74b4f1739d	nfp: flower: change get/set_eeprom logic and enable for flower reps The changes in this patch are as follows: - Alter the logic of get/set_eeprom functions to use the helper function nfp_app_from_netdev() which handles differentiating between an nfp_net and a nfp_repr. This allows us to get an agnostic backpointer to the pdev. - Enable the various eeprom commands by adding the 'get_eeprom_len', 'get_eeprom', 'set_eeprom' callbacks to the nfp_port_ethtool_ops struct. This allows the eeprom commands to work on representor interfaces, similar to a previous patch which added it to the vnics. Currently these are being used to configure persistent MAC addresses for the physical ports on the nfp. Signed-off-by: James Hershaw <james.hershaw@corigine.com> Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20230123134135.293278-1-simon.horman@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-24 18:19:12 -08:00
Guillaume Nault	90317bcdbd	ipv6: Make ip6_route_output_flags_noref() static. This function is only used in net/ipv6/route.c and has no reason to be visible outside of it. Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/50706db7f675e40b3594d62011d9363dce32b92e.1674495822.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-24 18:12:52 -08:00
Jakub Kicinski	ec8f7d495b	netlink: fix spelling mistake in dump size assert Commit 2c7bc10d0f7b ("netlink: add macro for checking dump ctx size") misspelled the name of the assert as asset, missing an R. Reported-by: Ido Schimmel <idosch@idosch.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/20230123222224.732338-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-24 16:29:11 -08:00
Paolo Abeni	c554520f2c	Merge branch 'netlink-protocol-specs' Jakub Kicinski says: ==================== Netlink protocol specs I think the Netlink proto specs are far along enough to merge. Filling in all attribute types and quirks will be an ongoing effort but we have enough to cover FOU so it's somewhat complete. I fully intend to continue polishing the code but at the same time I'd like to start helping others base their work on the specs (e.g. DPLL) and need to start working on some new families myself. That's the progress / motivation for merging. The RFC [1] has more of a high level blurb, plus I created a lot of documentation, I'm not going to repeat it here. There was also the talk at LPC [2]. [1] https://lore.kernel.org/all/20220811022304.583300-1-kuba@kernel.org/ [2] https://youtu.be/9QkXIQXkaQk?t=2562 v2: https://lore.kernel.org/all/20220930023418.1346263-1-kuba@kernel.org/ v3: https://lore.kernel.org/all/20230119003613.111778-1-kuba@kernel.org/1 v4: - spec improvements (patch 2) - Python cleanup (patch 3) - rename auto-gen files and use the right comment style ==================== Link: https://lore.kernel.org/r/20230120175041.342573-1-kuba@kernel.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 11:02:03 +01:00
Jakub Kicinski	e4b48ed460	tools: ynl: add a completely generic client Add a CLI sample which can take in arbitrary request in JSON format, convert it to Netlink and do the inverse for output. It's meant as a development tool primarily and perhaps for selftests which need to tickle netlink in a special way. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:58:11 +01:00
Jakub Kicinski	1d562c32e4	net: fou: use policy and operation tables generated from the spec Generate and plug in the spec-based tables. A little bit of renaming is needed in the FOU code. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:58:11 +01:00
Jakub Kicinski	08d323234d	net: fou: rename the source for linking We'll need to link two objects together to form the fou module. This means the source can't be called fou, the build system expects fou.o to be the combined object. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:58:11 +01:00
Jakub Kicinski	3a330496ba	net: fou: regenerate the uAPI from the spec Regenerate the FOU uAPI header from the YAML spec. The flags now come before attributes which use them, and the comments for type disappear (coders should look at the spec instead). Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:58:11 +01:00
Jakub Kicinski	4eb77b4ecd	netlink: add a proto specification for FOU FOU has a reasonably modern Genetlink family. Add a spec. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:58:11 +01:00
Jakub Kicinski	be5bea1cc0	net: add basic C code generators for Netlink Code generators to turn Netlink specs into C code. I'm definitely not proud of it. The main generator is in Python, there's a bash script to regen all code-gen'ed files in tree after making spec changes. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:58:11 +01:00
Jakub Kicinski	e616c07ca5	netlink: add schemas for YAML specs Add schemas for Netlink spec files. As described in the docs we have 4 "protocols" or compatibility levels, and each one comes with its own schema, but the more general / legacy schemas are superset of more modern ones: genetlink is the smallest followed by genetlink-c and genetlink-legacy. There is no schema for raw netlink, yet, I haven't found the time.. I don't know enough jsonschema to do inheritance or something but the repetition is not too bad. I hope. Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:58:11 +01:00
Jakub Kicinski	9d6a65079c	docs: add more netlink docs (incl. spec docs) Add documentation about the upcoming Netlink protocol specs. Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com> Acked-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:58:11 +01:00
Paolo Abeni	d961bee454	Merge branch 'net-sched-use-the-backlog-for-nested-mirred-ingress' Davide Caratti says: ==================== net/sched: use the backlog for nested mirred ingress TC mirred has a protection against excessive stack growth, but that protection doesn't really guarantee the absence of recursion, nor it guards against loops. Patch 1/2 rewords "recursion" to "nesting" to make this more clear. We can leverage on this existing mechanism to prevent TCP / SCTP from doing soft lock-up in some specific scenarios that uses mirred egress->ingress: patch 2 changes mirred so that the networking backlog is used for nested mirred ingress actions. ==================== Link: https://lore.kernel.org/r/cover.1674233458.git.dcaratti@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:30:56 +01:00
Davide Caratti	ca22da2fbd	act_mirred: use the backlog for nested calls to mirred ingress William reports kernel soft-lockups on some OVS topologies when TC mirred egress->ingress action is hit by local TCP traffic [1]. The same can also be reproduced with SCTP (thanks Xin for verifying), when client and server reach themselves through mirred egress to ingress, and one of the two peers sends a "heartbeat" packet (from within a timer). Enqueueing to backlog proved to fix this soft lockup; however, as Cong noticed [2], we should preserve - when possible - the current mirred behavior that counts as "overlimits" any eventual packet drop subsequent to the mirred forwarding action [3]. A compromise solution might use the backlog only when tcf_mirred_act() has a nest level greater than one: change tcf_mirred_forward() accordingly. Also, add a kselftest that can reproduce the lockup and verifies TC mirred ability to account for further packet drops after TC mirred egress->ingress (when the nest level is 1). [1] https://lore.kernel.org/netdev/33dc43f587ec1388ba456b4915c75f02a8aae226.1663945716.git.dcaratti@redhat.com/ [2] https://lore.kernel.org/netdev/Y0w%2FWWY60gqrtGLp@pop-os.localdomain/ [3] such behavior is not guaranteed: for example, if RPS or skb RX timestamping is enabled on the mirred target device, the kernel can defer receiving the skb and return NET_RX_SUCCESS inside tcf_mirred_forward(). Reported-by: William Zhao <wizhao@redhat.com> CC: Xin Long <lucien.xin@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:30:54 +01:00
Davide Caratti	78dcdffe04	net/sched: act_mirred: better wording on protection against excessive stack growth with commit e2ca070f89ec ("net: sched: protect against stack overflow in TC act_mirred"), act_mirred protected itself against excessive stack growth using per_cpu counter of nested calls to tcf_mirred_act(), and capping it to MIRRED_RECURSION_LIMIT. However, such protection does not detect recursion/loops in case the packet is enqueued to the backlog (for example, when the mirred target device has RPS or skb timestamping enabled). Change the wording from "recursion" to "nesting" to make it more clear to readers. CC: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:30:54 +01:00
Paolo Abeni	5cf6c22b5b	Merge branch 'fix-cpts-release-action-in-am65-cpts-driver' Siddharth Vadapalli says: ==================== Fix CPTS release action in am65-cpts driver Delete unreachable code in am65_cpsw_init_cpts() function, which was Reported-by: Leon Romanovsky <leon@kernel.org> at: https://lore.kernel.org/r/Y8aHwSnVK9+sAb24@unreal Remove the devm action associated with am65_cpts_release() and invoke the function directly on the cleanup and exit paths. v4: https://lore.kernel.org/r/20230120044201.357950-1-s-vadapalli@ti.com/ v3: https://lore.kernel.org/r/20230118095439.114222-1-s-vadapalli@ti.com/ v2: https://lore.kernel.org/r/20230116044517.310461-1-s-vadapalli@ti.com/ v1: https://lore.kernel.org/r/20230113104816.132815-1-s-vadapalli@ti.com/ ==================== Link: https://lore.kernel.org/r/20230120070731.383729-1-s-vadapalli@ti.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:08:54 +01:00
Siddharth Vadapalli	4ad8766cd3	net: ethernet: ti: am65-cpsw/cpts: Fix CPTS release action The am65_cpts_release() function is registered as a devm_action in the am65_cpts_create() function in am65-cpts driver. When the am65-cpsw driver invokes am65_cpts_create(), am65_cpts_release() is added in the set of devm actions associated with the am65-cpsw driver's device. In the event of probe failure or probe deferral, the platform_drv_probe() function invokes dev_pm_domain_detach() which powers off the CPSW and the CPSW's CPTS hardware, both of which share the same power domain. Since the am65_cpts_disable() function invoked by the am65_cpts_release() function attempts to reset the CPTS hardware by writing to its registers, the CPTS hardware is assumed to be powered on at this point. However, the hardware is powered off before the devm actions are executed. Fix this by getting rid of the devm action for am65_cpts_release() and invoking it directly on the cleanup and exit paths. Fixes: f6bd59526ca5 ("net: ethernet: ti: introduce am654 common platform time sync driver") Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:08:50 +01:00
Siddharth Vadapalli	0a974b1fff	net: ethernet: ti: am65-cpsw: Delete unreachable error handling code The am65_cpts_create() function returns -EOPNOTSUPP only when the config "CONFIG_TI_K3_AM65_CPTS" is disabled. Also, in the am65_cpsw_init_cpts() function, am65_cpts_create() can only be invoked if the config "CONFIG_TI_K3_AM65_CPTS" is enabled. Thus, the error handling code for the case in which the return value of am65_cpts_create() is -EOPNOTSUPP, is unreachable. Hence delete it. Reported-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com> Reviewed-by: Roger Quadros <rogerq@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-24 10:08:50 +01:00
Rakesh Sankaranarayanan	d7bf56e0c5	net: phy: microchip: run phy initialization during each link update PHY initialization is supposed to run on every mode changes. "lan87xx_config_aneg()" verifies every mode change using "phy_modify_changed()" function. Earlier code had phy_modify_changed() followed by genphy_soft_reset. But soft_reset resets all the pre-configured register values to default state, and lost all the initialization done. With this reason gen_phy_reset was removed. But it need to go through init sequence each time the mode changed. Update lan87xx_config_aneg() to invoke phy_init once successful mode update is detected. PHY init sequence added in lan87xx_phy_init() have slave init commands executed every time. Update the init sequence to run slave init only if phydev is in slave mode. Test setup contains LAN9370 EVB connected to SAMA5D3 (Running DSA), and issue can be reproduced by connecting link to any of the available ports after SAMA5D3 boot-up. With this issue, port will fail to update link state. But once the SAMA5D3 is reset with LAN9370 link in connected state itself, on boot-up link state will be reported as UP. But Again after some time, if link is moved to DOWN state, it will not get reported. Signed-off-by: Rakesh Sankaranarayanan <rakesh.sankaranarayanan@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20230120104733.724701-1-rakesh.sankaranarayanan@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-23 22:34:19 -08:00
Jakub Kicinski	90e05ef3d1	Merge branch 'net-dsa-microchip-add-support-for-credit-based-shaper' Arun Ramadoss says: ==================== net: dsa: microchip: add support for credit based shaper LAN937x switch family, KSZ9477, KSZ9567, KSZ9563 and KSZ8563 supports the credit based shaper. But there were few difference between LAN937x and KSZ switch like - number of queues for LAN937x is 8 and for others it is 4. - size of credit increment register for LAN937x is 24 and for other is 16-bit. This patch series add the credit based shaper with common implementation for LAN937x and KSZ swithes. ==================== Link: https://lore.kernel.org/r/20230120052135.32120-1-arun.ramadoss@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-23 22:12:37 -08:00
Arun Ramadoss	71d7920fb2	net: dsa: microchip: add support for credit based shaper KSZ9477, KSZ9567, KSZ9563, KSZ8563 and LAN937x supports Credit based shaper. To differentiate the chip supporting cbs, tc_cbs_supported flag is introduced in ksz_chip_data. And KSZ series has 16bit Credit increment registers whereas LAN937x has 24bit register. The value to be programmed in the credit increment is determined using the successive multiplication method to convert decimal fraction to hexadecimal fraction. For example: if idleslope is 10000 and sendslope is -90000, then bandwidth is 10000 - (-90000) = 100000. The 10% bandwidth of 100Mbps means 10/100 = 0.1(decimal). This value has to be converted to hexa. 1) 0.1 * 16 = 1.6 --> fraction 0.6 Carry = 1 (MSB) 2) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9 3) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9 4) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9 5) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9 6) 0.6 * 16 = 9.6 --> fraction 0.6 Carry = 9 (LSB) Now 0.1(decimal) becomes 0.199999(Hex). If it is LAN937x, 24 bit value will be programmed to Credit Inc register, 0x199999. For others 16 bit value will be prgrammed, 0x1999. Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-23 22:12:35 -08:00
Arun Ramadoss	e30f33a5f5	net: dsa: microchip: enable port queues for tc mqprio LAN937x family of switches has 8 queues per port where the KSZ switches has 4 queues per port. By default, only one queue per port is enabled. The queues are configurable in 2, 4 or 8. This patch add 8 number of queues for LAN937x and 4 for other switches. In the tag_ksz.c file, prioirty of the packet is queried using the skb buffer and the corresponding value is updated in the tag. Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-23 22:12:35 -08:00
Jesper Dangaard Brouer	3176eb8268	net: avoid irqsave in skb_defer_free_flush The spin_lock irqsave/restore API variant in skb_defer_free_flush can be replaced with the faster spin_lock irq variant, which doesn't need to read and restore the CPU flags. Using the unconditional irq "disable/enable" API variant is safe, because the skb_defer_free_flush() function is only called during NAPI-RX processing in net_rx_action(), where it is known the IRQs are enabled. Expected gain is 14 cycles from avoiding reading and restoring CPU flags in a spin_lock_irqsave/restore operation, measured via a microbencmark kernel module[1] on CPU E5-1650 v4 @ 3.60GHz. Microbenchmark overhead of spin_lock+unlock: - spin_lock_unlock_irq cost: 34 cycles(tsc) 9.486 ns - spin_lock_unlock_irqsave cost: 48 cycles(tsc) 13.567 ns We don't expect to see a measurable packet performance gain, as skb_defer_free_flush() is called infrequently once per NIC device NAPI bulk cycle and conditionally only if SKBs have been deferred by other CPUs via skb_attempt_defer_free(). [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Link: https://lore.kernel.org/r/167421646327.1321776.7390743166998776914.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-23 22:08:06 -08:00
Jon Maxwell	695a376b59	ipv6: Document that max_size sysctl is deprecated v4: fix deprecated typo. Document that max_size is deprecated due to: commit af6d10345ca7 ("ipv6: remove max_size check inline with ipv4") Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Link: https://lore.kernel.org/r/20230120232331.1273881-1-jmaxwell37@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-23 22:07:21 -08:00
Jesper Dangaard Brouer	f72ff8b81e	net: fix kfree_skb_list use of skb_mark_not_on_list A bug was introduced by commit eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk"). It unconditionally unlinked the SKB list via invoking skb_mark_not_on_list(). In this patch we choose to remove the skb_mark_not_on_list() call as it isn't necessary. It would be possible and correct to call skb_mark_not_on_list() only when __kfree_skb_reason() returns true, meaning the SKB is ready to be free'ed, as it calls/check skb_unref(). This fix is needed as kfree_skb_list() is also invoked on skb_shared_info frag_list (skb_drop_fraglist() calling kfree_skb_list()). A frag_list can have SKBs with elevated refcnt due to cloning via skb_clone_fraglist(), which takes a reference on all SKBs in the list. This implies the invariant that all SKBs in the list must have the same refcnt, when using kfree_skb_list(). Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com Fixes: eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/167421088417.1125894.9761158218878962159.stgit@firesoul Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-23 21:39:04 -08:00
Dan Carpenter	3bee9b573a	net: microchip: sparx5: Fix uninitialized variable in vcap_path_exist() The "eport" variable needs to be initialized to NULL for this code to work. Fixes: 814e7693207f ("net: microchip: vcap api: Add a storage state to a VCAP rule") Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com> Link: https://lore.kernel.org/r/Y8qbYAb+YSXo1DgR@kili Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-23 21:34:59 -08:00

1 2 3 4 5 ...

1155083 Commits