linux

iv/linux

Author	SHA1	Message	Date
Daniel Machon	1df99338e6	net: dcb: add helper functions to retrieve PCP and DSCP rewrite maps Add two new helper functions to retrieve a mapping of priority to PCP and DSCP bitmasks, where each bitmap contains ones in positions that match a rewrite entry. dcb_ieee_getrewr_prio_dscp_mask_map() reuses the dcb_ieee_app_prio_map, as this struct is already used for a similar mapping in the app table. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-20 09:33:22 +00:00
Daniel Machon	622f1b2fae	net: dcb: add new rewrite table Add new rewrite table and all the required functions, offload hooks and bookkeeping for maintaining it. The rewrite table reuses the app struct, and the entire set of app selectors. As such, some bookeeping code can be shared between the rewrite- and the APP table. New functions for getting, setting and deleting entries has been added. Apart from operating on the rewrite list, these functions do not emit a DCB_APP_EVENT when the list os modified. The new dcb_getrewr does a lookup based on selector and priority and returns the protocol, so that mappings from priority to protocol, for a given selector and ifindex is obtained. Also, a new nested attribute has been added, that encapsulates one or more app structs. This attribute is used to distinguish the two tables. The dcb_lock used for the APP table is reused for the rewrite table. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-20 09:33:22 +00:00
Daniel Machon	30568334b6	net: dcb: add new common function for set/del of app/rewr entries In preparation for DCB rewrite. Add a new function for setting and deleting both app and rewrite entries. Moving this into a separate function reduces duplicate code, as both type of entries requires the same set of checks. The function will now iterate through a configurable nested attribute (app or rewrite attr), validate each attribute and call the appropriate set- or delete function. Note that this function always checks for nla_len(attr_itr) < sizeof(struct dcb_app), which was only done in dcbnl_ieee_set and not in dcbnl_ieee_del prior to this patch. This means, that any userspace tool that used to shove in data < sizeof(struct dcb_app) would now receive -ERANGE. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-20 09:33:22 +00:00
Daniel Machon	34b7074d3f	net: dcb: modify dcb_app_add to take list_head ptr as parameter In preparation to DCB rewrite. Modify dcb_app_add to take new struct list_head * as parameter, to make the used list configurable. This is done to allow reusing the function for adding rewrite entries to the rewrite table, which is introduced in a later patch. Signed-off-by: Daniel Machon <daniel.machon@microchip.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-20 09:33:22 +00:00
Jiri Pirko	63ba54a52c	devlink: add instance lock assertion in devl_is_registered() After region and linecard lock removals, this helper is always supposed to be called with instance lock held. So put the assertion here and remove the comment which is no longer accurate. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:38 -08:00
Jiri Pirko	543753d9e2	devlink: remove devlink_dump_for_each_instance_get() helper devlink_dump_for_each_instance_get() is currently called from a single place in netlink.c. As there is no need to use this helper anywhere else in the future, remove it and call devlinks_xa_find_get() directly from while loop in devlink_nl_instance_iter_dump(). Also remove redundant idx clear on loop end as it is already done in devlink_nl_instance_iter_dump(). Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:38 -08:00
Jiri Pirko	19be51a93d	devlink: convert reporters dump to devlink_nl_instance_iter_dump() Benefit from recently introduced instance iteration and convert reporters .dumpit generic netlink callback to use it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:38 -08:00
Jiri Pirko	2557396808	devlink: convert linecards dump to devlink_nl_instance_iter_dump() Benefit from recently introduced instance iteration and convert linecards .dumpit generic netlink callback to use it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:38 -08:00
Jiri Pirko	e994a75fb7	devlink: remove reporter reference counting As long as the reporter life time is protected by devlink instance lock, the reference counting is no longer needed. Remove it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:38 -08:00
Jiri Pirko	9f167327ef	devlink: remove devl*_port_health_reporter_destroy() Remove port-specific health reporter destroy function as it is currently the same as the instance one so no longer needed. Inline __devlink_health_reporter_destroy() as it is no longer called from multiple places. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:37 -08:00
Jiri Pirko	1dea3b4e4c	devlink: remove reporters_lock Similar to other devlink objects, rely on devlink instance lock and remove object specific reporters_lock. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:37 -08:00
Jiri Pirko	dfdfd1305d	devlink: protect health reporter operation with instance lock Similar to other devlink objects, protect the reporters list by devlink instance lock. Alongside add unlocked versions of health reporter create/destroy functions and use them in drivers on call paths where the instance lock is held. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:37 -08:00
Jiri Pirko	3a10173f48	devlink: remove linecard reference counting As long as the linecard life time is protected by devlink instance lock, the reference counting is no longer needed. Remove it. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:37 -08:00
Jiri Pirko	5cc9049cb9	devlink: remove linecards lock Similar to other devlink objects, convert the linecards list to be protected by devlink instance lock. Alongside with that rename the create/destroy() functions to devl_* to indicate the devlink instance lock needs to be held while calling them. Signed-off-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-19 19:08:37 -08:00
Magnus Karlsson	68e5b6aa27	xdp: document xdp_do_flush() before napi_complete_done() Document in the XDP_REDIRECT manual section that drivers must call xdp_do_flush() before napi_complete_done(). The two reasons behind this can be found following the links below. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20221220185903.1105011-1-sbohrer@cloudflare.com Link: https://lore.kernel.org/all/20210624160609.292325-1-toke@redhat.com/ Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-18 14:33:34 +00:00
David S. Miller	4218b0e212	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next Florian Westphal says: ==================== Netfilter updates for net-next following patch set includes netfilter updates for your net-next tree. 1. Replace pr_debug use with nf_log infra for debugging in sctp conntrack. 2. Remove pr_debug calls, they are either useless or we have better options in place. 3. Avoid repeated load of ct->status in some spots. Some bit-flags cannot change during the lifeetime of a connection, so no need to re-fetch those. 4. Avoid uneeded nesting of rcu_read_lock during tuple lookup. 5. Remove the CLUSTERIP target. Marked as obsolete for years, and we still have WARN splats wrt. races of the out-of-band /proc interface installed by this target. 6. Add static key to nf_tables to avoid the retpoline mitigation if/else if cascade provided the cpu doesn't need the retpoline thunk. 7. add nf_tables objref calls to the retpoline mitigation workaround. 8. Split parts of nft_ct.c that do not need symbols exported by the conntrack modules and place them in nf_tables directly. This allows to avoid indirect call for 'ct status' checks. 9. Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not indicate an error if the referenced object (set, chain, rule...) did not exist, from Fernando. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-18 13:19:48 +00:00
Tanmay Bhushan	9259f6b573	ipv6: Remove extra counter pull before gc Per cpu entries are no longer used in consideration for doing gc or not. Remove the extra per cpu entries pull to directly check for time and perform gc. Signed-off-by: Tanmay Bhushan <007047221b@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-18 13:00:16 +00:00
Fernando Fernandez Mancera	f80a612dd7	netfilter: nf_tables: add support to destroy operation Introduce NFT_MSG_DESTROY* message type. The destroy operation performs a delete operation but ignoring the ENOENT errors. This is useful for the transaction semantics, where failing to delete an object which does not exist results in aborting the transaction. This new command allows the transaction to proceed in case the object does not exist. Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:09:00 +01:00
Florian Westphal	d9e7891476	netfilter: nf_tables: avoid retpoline overhead for some ct expression calls nft_ct expression cannot be made builtin to nf_tables without also forcing the conntrack itself to be builtin. However, this can be avoided by splitting retrieval of a few selector keys that only need to access the nf_conn structure, i.e. no function calls to nf_conntrack code. Many rulesets start with something like "ct status established,related accept" With this change, this no longer requires an indirect call, which gives about 1.8% more throughput with a simple conntrack-enabled forwarding test (retpoline thunk used). Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:05:25 +01:00
Florian Westphal	2032e907d8	netfilter: nf_tables: avoid retpoline overhead for objref calls objref expression is builtin, so avoid calls to it for RETOLINE=y builds. Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:05:25 +01:00
Florian Westphal	d8d7606278	netfilter: nf_tables: add static key to skip retpoline workarounds If CONFIG_RETPOLINE is enabled nf_tables avoids indirect calls for builtin expressions. On newer cpus indirect calls do not go through the retpoline thunk anymore, even for RETPOLINE=y builds. Just like with the new tc retpoline wrappers: Add a static key to skip the if / else if cascade if the cpu does not require retpolines. Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:05:25 +01:00
Florian Westphal	9db5d918e2	netfilter: ip_tables: remove clusterip target Marked as 'to be removed soon' since kernel 4.1 (2015). Functionality was superseded by the 'cluster' match, added in kernel 2.6.30 (2009). clusterip_tg_check still has races that can give proc_dir_entry 'ipt_CLUSTERIP/10.1.1.2' already registered followed by a WARN splat. Remove it instead of trying to fix this up again. clusterip uapi header is left as-is for now. Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:05:24 +01:00
Florian Westphal	2a2fa2efc6	netfilter: conntrack: move rcu read lock to nf_conntrack_find_get Move rcu_read_lock/unlock to nf_conntrack_find_get(), this avoids nested rcu_read_lock call from resolve_normal_ct(). Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:05:24 +01:00
Florian Westphal	4883ec512c	netfilter: conntrack: avoid reload of ct->status Compiler can't merge the two test_bit() calls, so load ct->status once and use non-atomic accesses. This is fine because IPS_EXPECTED or NAT_CLASH are either set at ct creation time or not at all, but compiler can't know that. Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:05:24 +01:00
Florian Westphal	50bfbb8957	netfilter: conntrack: remove pr_debug calls Those are all useless or dubious. getorigdst() is called via setsockopt, so return value/errno will already indicate an appropriate error. For other pr_debug calls there are better replacements, such as slab/slub debugging or 'conntrack -E' (ctnetlink events). Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:05:24 +01:00
Florian Westphal	f71cb8f45d	netfilter: conntrack: sctp: use nf log infrastructure for invalid packets The conntrack logging facilities include useful info such as in/out interface names and packet headers. Use those in more places instead of pr_debug calls. Furthermore, several pr_debug calls can be removed, they are useless on production machines due to the sheer volume of log messages. Signed-off-by: Florian Westphal <fw@strlen.de>	2023-01-18 13:05:24 +01:00
Pietro Borrello	21cbd90a6f	inet: fix fast path in __inet_hash_connect() __inet_hash_connect() has a fast path taken if sk_head(&tb->owners) is equal to the sk parameter. sk_head() returns the hlist_entry() with respect to the sk_node field. However entries in the tb->owners list are inserted with respect to the sk_bind_node field with sk_add_bind_node(). Thus the check would never pass and the fast path never execute. This fast path has never been executed or tested as this bug seems to be present since commit `1da177e4c3` ("Linux-2.6.12-rc2"), thus remove it to reduce code complexity. Signed-off-by: Pietro Borrello <borrello@diag.uniroma1.it> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230112-inet_hash_connect_bind_head-v3-1-b591fd212b93@diag.uniroma1.it Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-17 12:02:31 +01:00
Jesper Dangaard Brouer	eedade12f4	net: kfree_skb_list use kmem_cache_free_bulk The kfree_skb_list function walks SKB (via skb->next) and frees them individually to the SLUB/SLAB allocator (kmem_cache). It is more efficient to bulk free them via the kmem_cache_free_bulk API. This patches create a stack local array with SKBs to bulk free while walking the list. Bulk array size is limited to 16 SKBs to trade off stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache" uses objsize 256 bytes usually in an order-1 page 8192 bytes that is 32 objects per slab (can vary on archs and due to SLUB sharing). Thus, for SLUB the optimal bulk free case is 32 objects belonging to same slab, but runtime this isn't likely to occur. The expected gain from using kmem_cache bulk alloc and free API have been assessed via a microbencmark kernel module[1]. The module 'slab_bulk_test01' results at bulk 16 element: kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16) kmem-bulk Per elem: 64 cycles(tsc) 17.905 ns (step:16) More detailed description of benchmarks avail in [2]. [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org V2: rename function to kfree_skb_add_bulk. Reviewed-by: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-17 10:27:29 +01:00
Jesper Dangaard Brouer	a4650da2a2	net: fix call location in kfree_skb_list_reason The SKB drop reason uses __builtin_return_address(0) to give the call "location" to trace_kfree_skb() tracepoint skb:kfree_skb. To keep this stable for compilers kfree_skb_reason() is annotated with __fix_address (noinline __noclone) as fixed in commit `c205cc7534` ("net: skb: prevent the split of kfree_skb_reason() by gcc"). The function kfree_skb_list_reason() invoke kfree_skb_reason(), which cause the __builtin_return_address(0) "location" to report the unexpected address of kfree_skb_list_reason. Example output from 'perf script': kpktgend_0 1337 [000] 81.002597: skb:kfree_skb: skbaddr=0xffff888144824700 protocol=2048 location=kfree_skb_list_reason+0x1e reason: QDISC_DROP Patch creates an __always_inline __kfree_skb_reason() helper call that is called from both kfree_skb_list() and kfree_skb_list_reason(). Suggestions for solutions that shares code better are welcome. As preparation for next patch move __kfree_skb() invocation out of this helper function. Reviewed-by: Saeed Mahameed <saeed@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-17 10:27:29 +01:00
Dan Carpenter	501543b4ff	devlink: remove some unnecessary code This code checks if (attrs[DEVLINK_ATTR_TRAP_POLICER_ID]) twice. Once at the start of the function and then a couple lines later. Delete the second check since that one must be true. Because the second condition is always true, it means the: policer_item = group_item->policer_item; assignment is immediately over-written. Delete that as well. Signed-off-by: Dan Carpenter <error27@gmail.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/Y8EJz8oxpMhfiPUb@kili Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-17 10:02:49 +01:00
Hangbin Liu	0349b8779c	sched: add new attr TCA_EXT_WARN_MSG to report tc extact message We will report extack message if there is an error via netlink_ack(). But if the rule is not to be exclusively executed by the hardware, extack is not passed along and offloading failures don't get logged. In commit `81c7288b17` ("sched: cls: enable verbose logging") Marcelo made cls could log verbose info for offloading failures, which helps improving Open vSwitch debuggability when using flower offloading. It would also be helpful if userspace monitor tools, like "tc monitor", could log this kind of message, as it doesn't require vswitchd log level adjusment. Let's add a new tc attributes to report the extack message so the monitor program could receive the failures. e.g. # tc monitor added chain dev enp3s0f1np1 parent ffff: chain 0 added filter dev enp3s0f1np1 ingress protocol all pref 49152 flower chain 0 handle 0x1 ct_state +trk+new not_in_hw action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 Warning: mlx5_core: matching on ct_state +new isn't supported. In this patch I only report the extack message on add/del operations. It doesn't look like we need to report the extack message on get/dump operations. Note this message not only reporte to multicast groups, it could also be reported unicast, which may affect the current usersapce tool's behaivor. Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/20230113034353.2766735-1-liuhangbin@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-17 09:38:33 +01:00
Bobby Eshleman	71dc9ec9ac	virtio/vsock: replace virtio_vsock_pkt with sk_buff This commit changes virtio/vsock to use sk_buff instead of virtio_vsock_pkt. Beyond better conforming to other net code, using sk_buff allows vsock to use sk_buff-dependent features in the future (such as sockmap) and improves throughput. This patch introduces the following performance changes: Tool: Uperf Env: Phys Host + L1 Guest Payload: 64k Threads: 16 Test Runs: 10 Type: SOCK_STREAM Before: commit `b7bfaa761d` ("Linux 6.2-rc3") Before ------ g2h: 16.77Gb/s h2g: 10.56Gb/s After ----- g2h: 21.04Gb/s h2g: 10.76Gb/s Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-16 13:26:33 +00:00
Kirill Tkhai	b27401a30e	unix: Improve locking scheme in unix_show_fdinfo() After switching to TCP_ESTABLISHED or TCP_LISTEN sk_state, alive SOCK_STREAM and SOCK_SEQPACKET sockets can't change it anymore (since commit `3ff8bff704` "unix: Fix race in SOCK_SEQPACKET's unix_dgram_sendmsg()"). Thus, we do not need to take lock here. Signed-off-by: Kirill Tkhai <tkhai@ya.ru> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-16 11:21:11 +00:00
Piergiorgio Beruto	28dbf774bc	plca.c: fix obvious mistake in checking retval Revert a wrong fix that was done during the review process. The intention was to substitute "if(ret < 0)" with "if(ret)". Unfortunately, the intended fix did not meet the code. Besides, after additional review, it was decided that "if(ret < 0)" was actually the right thing to do. Fixes: `8580e16c28` ("net/ethtool: add netlink interface for the PLCA RS") Signed-off-by: Piergiorgio Beruto <piergiorgio.beruto@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/f2277af8951a51cfee2fb905af8d7a812b7beaf4.1673616357.git.piergiorgio.beruto@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-13 21:42:53 -08:00
Jon Maxwell	af6d10345c	ipv6: remove max_size check inline with ipv4 In ip6_dst_gc() replace: if (entries > gc_thresh) With: if (entries > ops->gc_thresh) Sending Ipv6 packets in a loop via a raw socket triggers an issue where a route is cloned by ip6_rt_cache_alloc() for each packet sent. This quickly consumes the Ipv6 max_size threshold which defaults to 4096 resulting in these warnings: [1] 99.187805] dst_alloc: 7728 callbacks suppressed [2] Route cache is full: consider increasing sysctl net.ipv6.route.max_size. . . [300] Route cache is full: consider increasing sysctl net.ipv6.route.max_size. When this happens the packet is dropped and sendto() gets a network is unreachable error: remaining pkt 200557 errno 101 remaining pkt 196462 errno 101 . . remaining pkt 126821 errno 101 Implement David Aherns suggestion to remove max_size check seeing that Ipv6 has a GC to manage memory usage. Ipv4 already does not check max_size. Here are some memory comparisons for Ipv4 vs Ipv6 with the patch: Test by running 5 instances of a program that sends UDP packets to a raw socket 5000000 times. Compare Ipv4 and Ipv6 performance with a similar program. Ipv4: Before test: MemFree: 29427108 kB Slab: 237612 kB ip6_dst_cache 1912 2528 256 32 2 : tunables 0 0 0 xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0 ip_dst_cache 2881 3990 192 42 2 : tunables 0 0 0 During test: MemFree: 29417608 kB Slab: 247712 kB ip6_dst_cache 1912 2528 256 32 2 : tunables 0 0 0 xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0 ip_dst_cache 44394 44394 192 42 2 : tunables 0 0 0 After test: MemFree: 29422308 kB Slab: 238104 kB ip6_dst_cache 1912 2528 256 32 2 : tunables 0 0 0 xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0 ip_dst_cache 3048 4116 192 42 2 : tunables 0 0 0 Ipv6 with patch: Errno 101 errors are not observed anymore with the patch. Before test: MemFree: 29422308 kB Slab: 238104 kB ip6_dst_cache 1912 2528 256 32 2 : tunables 0 0 0 xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0 ip_dst_cache 3048 4116 192 42 2 : tunables 0 0 0 During Test: MemFree: 29431516 kB Slab: 240940 kB ip6_dst_cache 11980 12064 256 32 2 : tunables 0 0 0 xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0 ip_dst_cache 3048 4116 192 42 2 : tunables 0 0 0 After Test: MemFree: 29441816 kB Slab: 238132 kB ip6_dst_cache 1902 2432 256 32 2 : tunables 0 0 0 xfrm_dst_cache 0 0 320 25 2 : tunables 0 0 0 ip_dst_cache 3048 4116 192 42 2 : tunables 0 0 0 Tested-by: Andrea Mayer <andrea.mayer@uniroma2.it> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230112012532.311021-1-jmaxwell37@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-13 20:59:14 -08:00
Keith Busch	c191751410	caif: don't assume iov_iter type The details of the iov_iter types are appropriately abstracted, so there's no need to check for specific type fields. Just let the abstractions handle it. This is preparing for io_uring/net's io_send to utilize the more efficient ITER_UBUF. Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20230111184245.3784393-1-kbusch@meta.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-13 20:44:20 -08:00
Yunhui Cui	6e6eda44b9	sock: add tracepoint for send recv length Add 2 tracepoints to monitor the tcp/udp traffic of per process and per cgroup. Regarding monitoring the tcp/udp traffic of each process, there are two existing solutions, the first one is https://www.atoptool.nl/netatop.php. The second is via kprobe/kretprobe. Netatop solution is implemented by registering the hook function at the hook point provided by the netfilter framework. These hook functions may be in the soft interrupt context and cannot directly obtain the pid. Some data structures are added to bind packets and processes. For example, struct taskinfobucket, struct taskinfo ... Every time the process sends and receives packets it needs multiple hashmaps,resulting in low performance and it has the problem fo inaccurate tcp/udp traffic statistics(for example: multiple threads share sockets). We can obtain the information with kretprobe, but as we know, kprobe gets the result by trappig in an exception, which loses performance compared to tracepoint. We compared the performance of tracepoints with the above two methods, and the results are as follows: ab -n 1000000 -c 1000 -r http://127.0.0.1/index.html without trace: Time per request: 39.660 [ms] (mean) Time per request: 0.040 [ms] (mean, across all concurrent requests) netatop: Time per request: 50.717 [ms] (mean) Time per request: 0.051 [ms] (mean, across all concurrent requests) kr: Time per request: 43.168 [ms] (mean) Time per request: 0.043 [ms] (mean, across all concurrent requests) tracepoint: Time per request: 41.004 [ms] (mean) Time per request: 0.041 [ms] (mean, across all concurrent requests It can be seen that tracepoint has better performance. Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com> Signed-off-by: Xiongchun Duan <duanxiongchun@bytedance.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-13 10:25:10 +00:00
Daniele Palmas	31de284239	ethtool: add tx aggregation parameters Add the following ethtool tx aggregation parameters: ETHTOOL_A_COALESCE_TX_AGGR_MAX_BYTES Maximum size in bytes of a tx aggregated block of frames. ETHTOOL_A_COALESCE_TX_AGGR_MAX_FRAMES Maximum number of frames that can be aggregated into a block. ETHTOOL_A_COALESCE_TX_AGGR_TIME_USECS Time in usecs after the first packet arrival in an aggregated block for the block to be sent. Signed-off-by: Daniele Palmas <dnlplm@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-13 10:23:52 +00:00
Christian Eggers	a32190b154	net: dsa: microchip: ptp: move pdelay_rsp correction field to tail tag For PDelay_Resp messages we will likely have a negative value in the correction field. The switch hardware cannot correctly update such values (produces an off by one error in the UDP checksum), so it must be moved to the time stamp field in the tail tag. Format of the correction field is 48 bit ns + 16 bit fractional ns. After updating the correction field, clone is no longer required hence it is freed. Signed-off-by: Christian Eggers <ceggers@arri.de> Co-developed-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-13 08:40:41 +00:00
Christian Eggers	ab32f56a41	net: dsa: microchip: ptp: add packet transmission timestamping This patch adds the routines for transmission of ptp packets. When the ptp pdelay_req packet to be transmitted, it uses the deferred xmit worker to schedule the packets. During irq_setup, interrupt for Sync, Pdelay_req and Pdelay_rsp are enabled. So interrupt is triggered for all three packets. But for p2p1step, we require only time stamp of Pdelay_req packet. Hence to avoid posting of the completion from ISR routine for Sync and Pdelay_resp packets, ts_en flag is introduced. This controls which packets need to processed for timestamp. After the packet is transmitted, ISR is triggered. The time at which packet transmitted is recorded to separate register. This value is reconstructed to absolute time and posted to the user application through socket error queue. Signed-off-by: Christian Eggers <ceggers@arri.de> Co-developed-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-13 08:40:41 +00:00
Christian Eggers	90188fff65	net: dsa: microchip: ptp: add packet reception timestamping Rx Timestamping is done through 4 additional bytes in tail tag. Whenever the ptp packet is received, the 4 byte hardware time stamped value is added before 1 byte tail tag. Also, bit 7 in tail tag indicates it as PTP frame. This 4 byte value is extracted from the tail tag and reconstructed to absolute time and assigned to skb hwtstamp. If the packet received in PDelay_Resp, then partial ingress timestamp is subtracted from the correction field. Since user space tools expects to be done in hardware. Signed-off-by: Christian Eggers <ceggers@arri.de> Co-developed-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-13 08:40:41 +00:00
Arun Ramadoss	c2977c61f3	net: dsa: microchip: ptp: add 4 bytes in tail tag when ptp enabled When the PTP is enabled in hardware bit 6 of PTP_MSG_CONF1 register, the transmit frame needs additional 4 bytes before the tail tag. It is needed for all the transmission packets irrespective of PTP packets or not. The 4-byte timestamp field is 0 for frames other than Pdelay_Resp. For the one-step Pdelay_Resp, the switch needs the receive timestamp of the Pdelay_Req message so that it can put the turnaround time in the correction field. Since PTP has to be enabled for both Transmission and reception timestamping, driver needs to track of the tx and rx setting of the all the user ports in the switch. Two flags hw_tx_en and hw_rx_en are added in ksz_port to track the timestampping setting of each port. When any one of ports has tx or rx timestampping enabled, bit 6 of PTP_MSG_CONF1 is set and it is indicated to tag_ksz.c through tagger bytes. This flag adds 4 additional bytes to the tail tag. When tx and rx timestamping of all the ports are disabled, then 4 bytes are not added. Tested using hwstamp -i <interface> Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> # mostly api Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-13 08:40:40 +00:00
Jakub Kicinski	a99da46ac0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net drivers/net/usb/r8152.c `be53771c87` ("r8152: add vendor/device ID pair for Microsoft Devkit") `ec51fbd1b8` ("r8152: add USB device driver for config selection") https://lore.kernel.org/all/20230113113339.658c4723@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-12 19:59:56 -08:00
Linus Torvalds	d9fc151172	Including fixes from rxrpc. Current release - regressions: - rxrpc: - only disconnect calls in the I/O thread - move client call connection to the I/O thread - fix incoming call setup race - eth: mlx5: - restore pkt rate policing support - fix memory leak on updating vport counters Previous releases - regressions: - gro: take care of DODGY packets - ipv6: deduct extension header length in rawv6_push_pending_frames - tipc: fix unexpected link reset due to discovery messages Previous releases - always broken: - sched: disallow noqueue for qdisc classes - eth: ice: fix potential memory leak in ice_gnss_tty_write() - eth: ixgbe: fix pci device refcount leak - eth: mlx5: - fix command stats access after free - fix macsec possible null dereference when updating MAC security entity (SecY) Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmPAGskSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOk9aQQAInUtOfTi0EwR5oveTMWOcDc8P1rGFru Yfid6d4gVRKDm9tosW8HSlnMVCDIGrvhmwVfMevkLjgQtgRXYecXM7MYMVH+6f6e yIF0azu5z2PEQvfTLTuTN++bQ3lgyfYXOB3mScCOtBE9BFXwjtL111Qby1QlvHTg sPIH5kCxpDfg3i2rge1BiyoQ4BWc4c6Us86CriKDX1vl7lilJccpWYxKFY8hyRzl PF3OVJlMph7jny4zKOa2chWUnDj5ycK289/x2rOla4EOX7R8IHDyL+sAAAvdm7/q FDuuetC3M+eo8/NTLiZkjTipw1nO+G0c1VtzAZ/wX1QkomwmN0yyPx47EllVH+ez YQ80UrXOF3f7xYXHZIhwCrIVSaHpLyZHSfDBW1r+vTokIRSJ+5TOIH/YAUUKSR0U kE4r+eHU3AdcBsDV0pZXtE0mUxROwRatOt5u+XQ3WYdORDyKo0HYu8QskIurgqUv Cnr554zogmC4Bt/uY7j5u9NvhUH/Xyp5RVXtaQnwz+hcncgVFASDDpblejHE2Lcu 8fb1NrwB7AsPnMDGUSjnG0BbQaTo6ccacBrIrhWxRbEBAQmZEbO507yoYz21EBM5 5XKWd1bTq1YG5oPYl9WR3FI9hSQN7vKsUoW4SXsuh5j65ENhCAwBK8i3liy1j+dS sf5xUgCg6KyH =4ANJ -----END PGP SIGNATURE----- Merge tag 'net-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from rxrpc. The rxrpc changes are noticeable large: to address a recent regression has been necessary completing the threaded refactor. Current release - regressions: - rxrpc: - only disconnect calls in the I/O thread - move client call connection to the I/O thread - fix incoming call setup race - eth: mlx5: - restore pkt rate policing support - fix memory leak on updating vport counters Previous releases - regressions: - gro: take care of DODGY packets - ipv6: deduct extension header length in rawv6_push_pending_frames - tipc: fix unexpected link reset due to discovery messages Previous releases - always broken: - sched: disallow noqueue for qdisc classes - eth: ice: fix potential memory leak in ice_gnss_tty_write() - eth: ixgbe: fix pci device refcount leak - eth: mlx5: - fix command stats access after free - fix macsec possible null dereference when updating MAC security entity (SecY)" * tag 'net-6.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (64 commits) r8152: add vendor/device ID pair for Microsoft Devkit net: stmmac: add aux timestamps fifo clearance wait bnxt: make sure we return pages to the pool net: hns3: fix wrong use of rss size during VF rss config ipv6: raw: Deduct extension header length in rawv6_push_pending_frames net: lan966x: check for ptp to be enabled in lan966x_ptp_deinit() net: sched: disallow noqueue for qdisc classes iavf/iavf_main: actually log ->src mask when talking about it igc: Fix PPS delta between two synchronized end-points ixgbe: fix pci device refcount leak octeontx2-pf: Fix resource leakage in VF driver unbind selftests/net: l2_tos_ttl_inherit.sh: Ensure environment cleanup on failure. selftests/net: l2_tos_ttl_inherit.sh: Run tests in their own netns. selftests/net: l2_tos_ttl_inherit.sh: Set IPv6 addresses with "nodad". net/mlx5e: Fix macsec possible null dereference when updating MAC security entity (SecY) net/mlx5e: Fix macsec ssci attribute handling in offload path net/mlx5: E-switch, Coverity: overlapping copy net/mlx5e: Don't support encap rules with gbp option net/mlx5: Fix ptp max frequency adjustment range net/mlx5e: Fix memory leak on updating vport counters ...	2023-01-12 18:20:44 -06:00
Linus Torvalds	bad8c4a850	xen: branch for v6.2-rc4 -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCY76ohgAKCRCAXGG7T9hj vo8fAP0XJ94B7asqcN4W3EyeyfqxUf1eZvmWRhrbKqpLnmHLaQEA/uJBkXL49Zj7 TTcbxR1coJ/hPwhtmONU4TNtCZ+RXw0= =2Ib5 -----END PGP SIGNATURE----- Merge tag 'for-linus-6.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: - two cleanup patches - a fix of a memory leak in the Xen pvfront driver - a fix of a locking issue in the Xen hypervisor console driver * tag 'for-linus-6.2-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/pvcalls: free active map buffer on pvcalls_front_free_map hvc/xen: lock console list traversal x86/xen: Remove the unused function p2m_index() xen: make remove callback of xen driver void returned	2023-01-12 17:02:20 -06:00
Bobby Eshleman	c43170b7e1	vsock: return errors other than -ENOMEM to socket This removes behaviour, where error code returned from any transport was always switched to ENOMEM. For example when user tries to send too big message via SEQPACKET socket, transport layers return EMSGSIZE, but this error code was always replaced with ENOMEM and returned to user. Signed-off-by: Bobby Eshleman <bobby.eshleman@bytedance.com> Signed-off-by: Arseniy Krasnov <AVKrasnov@sberdevices.ru> Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>	2023-01-12 12:53:54 +01:00
Jakub Kicinski	93e71edfd9	devlink: keep the instance mutex alive until references are gone The reference needs to keep the instance memory around, but also the instance lock must remain valid. Users will take the lock, check registration status and release the lock. mutex_destroy() etc. belong in the same place as the freeing of the memory. Unfortunately lockdep_unregister_key() sleeps so we need to switch the an rcu_work. Note that the problem is a bit hard to repro, because devlink_pernet_pre_exit() iterates over registered instances. AFAIU the instances must get devlink_free()d concurrently with the namespace getting deleted for the problem to occur. Reported-by: syzbot+d94d214ea473e218fc89@syzkaller.appspotmail.com Reported-by: syzbot+9f0dd863b87113935acf@syzkaller.appspotmail.com Fixes: `9053637e0d` ("devlink: remove the registration guarantee of references") Reviewed-by: Jiri Pirko <jiri@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://lore.kernel.org/r/20230111042908.988199-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>	2023-01-11 20:49:32 -08:00
Herbert Xu	cb3e9864cd	ipv6: raw: Deduct extension header length in rawv6_push_pending_frames The total cork length created by ip6_append_data includes extension headers, so we must exclude them when comparing them against the IPV6_CHECKSUM offset which does not include extension headers. Reported-by: Kyle Zeng <zengyhkyle@gmail.com> Fixes: `357b40a18b` ("[IPV6]: IPV6_CHECKSUM socket option can corrupt kernel memory") Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-11 12:49:13 +00:00
Piergiorgio Beruto	16178c8ef5	drivers/net/phy: add the link modes for the 10BASE-T1S Ethernet PHY This patch adds the link modes for the IEEE 802.3cg Clause 147 10BASE-T1S Ethernet PHY. According to the specifications, the 10BASE-T1S supports Point-To-Point Full-Duplex, Point-To-Point Half-Duplex and/or Point-To-Multipoint (AKA Multi-Drop) Half-Duplex operations. Signed-off-by: Piergiorgio Beruto <piergiorgio.beruto@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-11 08:35:02 +00:00
Piergiorgio Beruto	8580e16c28	net/ethtool: add netlink interface for the PLCA RS Add support for configuring the PLCA Reconciliation Sublayer on multi-drop PHYs that support IEEE802.3cg-2019 Clause 148 (e.g., 10BASE-T1S). This patch adds the appropriate netlink interface to ethtool. Signed-off-by: Piergiorgio Beruto <piergiorgio.beruto@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>	2023-01-11 08:35:02 +00:00

1 2 3 4 5 ...

71867 Commits