1202646 Commits

Author SHA1 Message Date
Jakub Kicinski
1e02d75817 Merge branch 'seg6-add-next-c-sid-support-for-srv6-end-x-behavior'
Andrea Mayer says:

====================
seg6: add NEXT-C-SID support for SRv6 End.X behavior

In the Segment Routing (SR) architecture a list of instructions, called
segments, can be added to the packet headers to influence the forwarding and
processing of the packets in an SR enabled network.

Considering the Segment Routing over IPv6 data plane (SRv6) [1], the segment
identifiers (SIDs) are IPv6 addresses (128 bits) and the segment list (SID
List) is carried in the Segment Routing Header (SRH). A segment may correspond
to a "behavior" that is executed by a node when the packet is received.
The Linux kernel currently supports a large subset of the behaviors described
in [2] (e.g., End, End.X, End.T and so on).

In some SRv6 scenarios, the number of segments carried by the SID List may
increase dramatically, reducing the MTU (Maximum Transfer Unit) size and/or
limiting the processing power of legacy hardware devices (due to longer IPv6
headers).

The NEXT-C-SID mechanism [3] extends the SRv6 architecture by providing several
ways to efficiently represent the SID List.
By leveraging the NEXT-C-SID, it is possible to encode several SRv6 segments
within a single 128 bit SID address (also referenced as Compressed SID
Container). In this way, the length of the SID List can be drastically reduced.

The NEXT-C-SID mechanism is built upon the "flavors" framework defined in [2].
This framework is already supported by the Linux SRv6 subsystem and is used to
modify and/or extend a subset of existing behaviors.

In this patchset, we extend the SRv6 End.X behavior in order to support the
NEXT-C-SID mechanism.

In details, the patchset is made of:
 - patch 1/2: add NEXT-C-SID support for SRv6 End.X behavior;
 - patch 2/2: add selftest for NEXT-C-SID in SRv6 End.X behavior.

From the user space perspective, we do not need to change the iproute2 code to
support the NEXT-C-SID flavor for the SRv6 End.X behavior. However, we will
update the man page considering the NEXT-C-SID flavor applied to the SRv6 End.X
behavior in a separate patch.

[1] - https://datatracker.ietf.org/doc/html/rfc8754
[2] - https://datatracker.ietf.org/doc/html/rfc8986
[3] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
====================

Link: https://lore.kernel.org/r/20230812180926.16689-1-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 18:51:49 -07:00
Paolo Lungaroni
1c53717c80 selftests: seg6: add selftest for NEXT-C-SID flavor in SRv6 End.X behavior
This selftest is designed for testing the support of NEXT-C-SID flavor
for SRv6 End.X behavior. It instantiates a virtual network composed of
several nodes: hosts and SRv6 routers. Each node is realized using a
network namespace that is properly interconnected to others through veth
pairs, according to the topology depicted in the selftest script file.
The test considers SRv6 routers implementing IPv4/IPv6 L3 VPNs leveraged
by hosts for communicating with each other. Such routers i) apply
different SRv6 Policies to the traffic received from connected hosts,
considering the IPv4 or IPv6 protocols; ii) use the NEXT-C-SID
compression mechanism for encoding several SRv6 segments within a single
128-bit SID address, referred to as a Compressed SID (C-SID) container.

The NEXT-C-SID is provided as a "flavor" of the SRv6 End.X behavior,
enabling it to properly process the C-SID containers. The correct
execution of the enabled NEXT-C-SID SRv6 End.X behavior is verified
through reachability tests carried out between hosts belonging to the
same VPN.

Signed-off-by: Paolo Lungaroni <paolo.lungaroni@uniroma2.it>
Co-developed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230812180926.16689-3-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 18:51:47 -07:00
Andrea Mayer
7458575a07 seg6: add NEXT-C-SID support for SRv6 End.X behavior
The NEXT-C-SID mechanism described in [1] offers the possibility of
encoding several SRv6 segments within a single 128 bit SID address. Such
a SID address is called a Compressed SID (C-SID) container. In this way,
the length of the SID List can be drastically reduced.

A SID instantiated with the NEXT-C-SID flavor considers an IPv6 address
logically structured in three main blocks: i) Locator-Block; ii)
Locator-Node Function; iii) Argument.

                        C-SID container
+------------------------------------------------------------------+
|     Locator-Block      |Loc-Node|            Argument            |
|                        |Function|                                |
+------------------------------------------------------------------+
<--------- B -----------> <- NF -> <------------- A --------------->

   (i) The Locator-Block can be any IPv6 prefix available to the provider;

  (ii) The Locator-Node Function represents the node and the function to
       be triggered when a packet is received on the node;

 (iii) The Argument carries the remaining C-SIDs in the current C-SID
       container.

This patch leverages the NEXT-C-SID mechanism previously introduced in the
Linux SRv6 subsystem [2] to support SID compression capabilities in the
SRv6 End.X behavior [3].
An SRv6 End.X behavior with NEXT-C-SID flavor works as an End.X behavior
but it is capable of processing the compressed SID List encoded in C-SID
containers.

An SRv6 End.X behavior with NEXT-C-SID flavor can be configured to support
user-provided Locator-Block and Locator-Node Function lengths. In this
implementation, such lengths must be evenly divisible by 8 (i.e. must be
byte-aligned), otherwise the kernel informs the user about invalid
values with a meaningful error code and message through netlink_ext_ack.

If Locator-Block and/or Locator-Node Function lengths are not provided
by the user during configuration of an SRv6 End.X behavior instance with
NEXT-C-SID flavor, the kernel will choose their default values i.e.,
32-bit Locator-Block and 16-bit Locator-Node Function.

[1] - https://datatracker.ietf.org/doc/html/draft-ietf-spring-srv6-srh-compression
[2] - https://lore.kernel.org/all/20220912171619.16943-1-andrea.mayer@uniroma2.it/
[3] - https://datatracker.ietf.org/doc/html/rfc8986#name-endx-l3-cross-connect

Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20230812180926.16689-2-andrea.mayer@uniroma2.it
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 18:51:47 -07:00
Jakub Kicinski
c65dffc6f2 Merge branch 'genetlink-provide-struct-genl_info-to-dumps'
Jakub Kicinski says:

====================
genetlink: provide struct genl_info to dumps

One of the biggest (which is not to say only) annoyances with genetlink
handling today is that doit and dumpit need some of the same information,
but it is passed to them in completely different structs.

The implementations commonly end up writing a _fill() method which
populates a message and have to pass at least 6 parameters. 3 of which
are extracted manually from request info.

After a lot of umming and ahing I decided to populate struct genl_info
for dumps, without trying to factor out only the common parts.
This makes the adoption easiest.

In the future we may add a new version of dump which takes
struct genl_info *info as the second argument, instead of
struct netlink_callback *cb. For now developers have to call
genl_info_dump(cb) to get the info.

Typical genetlink families no longer get exposed to netlink protocol
internals like pid and seq numbers.

v3:
 - correct the condition in ethtool code (patch 10)
v2: https://lore.kernel.org/all/20230810233845.2318049-1-kuba@kernel.org/
 - replace the GENL_INFO_NTF() macro with init helper
 - fix the commit messages
v1: https://lore.kernel.org/all/20230809182648.1816537-1-kuba@kernel.org/
====================

Link: https://lore.kernel.org/r/20230814214723.2924989-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski
f946270d05 ethtool: netlink: always pass genl_info to .prepare_data
We had a number of bugs in the past because developers forgot
to fully test dumps, which pass NULL as info to .prepare_data.
.prepare_data implementations would try to access info->extack
leading to a null-deref.

Now that dumps and notifications can access struct genl_info
we can pass it in, and remove the info null checks.

Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> # pause
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski
ec0e5b09b8 ethtool: netlink: simplify arguments to ethnl_default_parse()
Pass struct genl_info directly instead of its members.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski
0e19d3108a netdev-genl: use struct genl_info for reply construction
Use the just added APIs to make the code simpler.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski
5aa51d9f88 genetlink: add genlmsg_iput() API
Add some APIs and helpers required for convenient construction
of replies and notifications based on struct genl_info.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski
5c670a010d genetlink: add a family pointer to struct genl_info
Having family in struct genl_info is quite useful. It cuts
down the number of arguments which need to be passed to
helpers which already take struct genl_info.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:01:03 -07:00
Jakub Kicinski
7288dd2fd4 genetlink: use attrs from struct genl_info
Since dumps carry struct genl_info now, use the attrs pointer
from genl_info and remove the one in struct genl_dumpit_info.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 15:00:45 -07:00
Jakub Kicinski
9272af109f genetlink: add struct genl_info to struct genl_dumpit_info
Netlink GET implementations must currently juggle struct genl_info
and struct netlink_callback, depending on whether they were called
from doit or dumpit.

Add genl_info to the dump state and populate the fields.
This way implementations can simply pass struct genl_info around.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Jakub Kicinski
bffcc6882a genetlink: remove userhdr from struct genl_info
Only three families use info->userhdr today and going forward
we discourage using fixed headers in new families.
So having the pointer to user header in struct genl_info
is an overkill. Compute the header pointer at runtime.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Aaron Conole <aconole@redhat.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Jakub Kicinski
fde9bd4a4d genetlink: make genl_info->nlhdr const
struct netlink_callback has a const nlh pointer, make the
pointer in struct genl_info const as well, to make copying
between the two easier.

Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Jakub Kicinski
84817d8c60 genetlink: push conditional locking into dumpit/done
Add helpers which take/release the genl mutex based
on family->parallel_ops. Remove the separation between
handling of ops in locked and parallel families.

Future patches would make the duplicated code grow even more.

Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Link: https://lore.kernel.org/r/20230814214723.2924989-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-15 14:54:44 -07:00
Russell King (Oracle)
479b322ee6 net: dsa: mv88e6060: add phylink_get_caps implementation
Add a phylink_get_caps implementation for Marvell 88e6060 DSA switch.
This is a fast ethernet switch, with internal PHYs for ports 0 through
4. Port 4 also supports MII, REVMII, REVRMII and SNI. Port 5 supports
MII, REVMII, REVRMII and SNI without an internal PHY.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/E1qUkx7-003dMX-9b@rmk-PC.armlinux.org.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 18:57:17 -07:00
Jakub Kicinski
f3cc00303c Merge branch 'devlink-introduce-selective-dumps'
Jiri Pirko says:

====================
devlink: introduce selective dumps

Motivation:

For SFs, one devlink instance per SF is created. There might be
thousands of these on a single host. When a user needs to know port
handle for specific SF, he needs to dump all devlink ports on the host
which does not scale good.

Solution:

Allow user to pass devlink handle (and possibly other attributes)
alongside the dump command and dump only objects which are matching
the selection.

Use split ops to generate policies for dump callbacks acccording to
the attributes used for selection.

The userspace can use ctrl genetlink GET_POLICY command to find out if
the selective dumps are supported by kernel for particular command.

Example:
$ devlink port show
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

$ devlink port show auxiliary/mlx5_core.eth.0
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false

$ devlink port show auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

Extension:

patches #12 and #13 extends selection attributes by port index
for health reporter dumping.
====================

Link: https://lore.kernel.org/r/20230811155714.1736405-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:27 -07:00
Jiri Pirko
0149bca172 netlink: specs: devlink: extend health reporter dump attributes by port index
Allow user to pass port index for health reporter dump request.

Re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-14-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko
b03f13cb67 devlink: extend health reporter dump selector by port index
Introduce a possibility for devlink object to expose attributes it
supports for selection of dumped objects.

Use this by health reporter to indicate it supports port index based
selection of dump objects. Implement this selection mechanism in
devlink_nl_cmd_health_reporter_get_dump_one()

Example:
$ devlink health
pci/0000:08:00.0:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32769:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32770:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98304:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98305:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.1/98306:
  reporter vnic
    state healthy error 0 recover 0

$ devlink health show pci/0000:08:00.0
pci/0000:08:00.0:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter fw_fatal
    state healthy error 0 recover 0 grace_period 60000 auto_recover true auto_dump true
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32769:
  reporter vnic
    state healthy error 0 recover 0
pci/0000:08:00.0/32770:
  reporter vnic
    state healthy error 0 recover 0

$ devlink health show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768:
  reporter vnic
    state healthy error 0 recover 0

The last command is possible because of this patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-13-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko
34493336e7 netlink: specs: devlink: extend per-instance dump commands to accept instance attributes
Extend per-instance dump command definitions to accept instance
attributes. Allow parsing of devlink handle attributes so they could
be used for instance selection.

Re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-12-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko
4a1b5aa8b5 devlink: allow user to narrow per-instance dumps by passing handle attrs
For SFs, one devlink instance per SF is created. There might be
thousands of these on a single host. When a user needs to know port
handle for specific SF, he needs to dump all devlink ports on the host
which does not scale good.

Allow user to pass devlink handle attributes alongside the dump command
and dump only objects which are under selected devlink instance.

Example:
$ devlink port show
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

$ devlink port show auxiliary/mlx5_core.eth.0
auxiliary/mlx5_core.eth.0/65535: type eth netdev eth2 flavour physical port 0 splittable false

$ devlink port show auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1/131071: type eth netdev eth3 flavour physical port 1 splittable false

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-11-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko
833e479d33 devlink: remove converted commands from small ops
As the commands are already defined in split ops, remove them
from small ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-10-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko
ddff283280 devlink: remove duplicate temporary netlink callback prototypes
Remove the duplicate temporary netlink callback prototype as the
generated ones are already in place.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-9-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko
7199c86247 netlink: specs: devlink: add commands that do per-instance dump
Add the definitions for the commands that do per-instance dump
and re-generate the related code.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-8-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko
7d3c6fec61 devlink: pass flags as an arg of dump_one() callback
In order to easily set NLM_F_DUMP_FILTERED for partial dumps, pass the
flags as an arg of dump_one() callback. Currently, it is always
NLM_F_MULTI.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-7-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:25 -07:00
Jiri Pirko
24c8e56d4f devlink: introduce dumpit callbacks for split ops
Introduce dumpit callbacks for generated split ops. Have them
as a thin wrapper around iteration function and allow to pass dump_one()
function pointer directly without need to store in devlink_cmd structs.

Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-6-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Jiri Pirko
8fa995ad1f devlink: rename doit callbacks for per-instance dump commands
Rename netlink doit callback functions for the commands that do
implement per-instance dump to match the generated names that are going
to be introduce in the follow-up patch.

Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-5-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Jiri Pirko
ee6d78ac28 devlink: introduce devlink_nl_pre_doit_port*() helper functions
Define port handling helpers what don't rely on internal_flags.
Have __devlink_nl_pre_doit() to accept the flags as a function arg and
make devlink_nl_pre_doit() a wrapper helper function calling it.
Introduce new helpers devlink_nl_pre_doit_port() and
devlink_nl_pre_doit_port_optional() to be used by split ops in follow-up
patch.

Note that the function prototypes are temporary until the generated ones
will replace them in a follow-up patch.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-4-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Jiri Pirko
41a1d4d139 devlink: parse rate attrs in doit() callbacks
No need to give the rate any special treatment in netlink attributes
parsing, as unlike for ports, there is only a couple of commands
benefiting from that.

Remove DEVLINK_NL_FLAG_NEED_RATE*, make pre_doit() callback simpler
by moving the rate attributes parsing to rate_*_doit() ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-3-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Jiri Pirko
63618463cb devlink: parse linecard attr in doit() callbacks
No need to give the linecards any special treatment in netlink attribute
parsing, as unlike for ports, there is only a couple of commands
benefiting from that.

Remove DEVLINK_NL_FLAG_NEED_LINECARD, make pre_doit() callback simpler
by moving the linecard attribute parsing to linecard_[gs]et_doit() ops.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20230811155714.1736405-2-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-14 11:47:24 -07:00
Gabor Juhos
83b5f0253b net: phy: Introduce PSGMII PHY interface mode
The PSGMII interface is similar to QSGMII. The main difference
is that the PSGMII interface combines five SGMII lines into a
single link while in QSGMII only four lines are combined.

Similarly to the QSGMII, this interface mode might also needs
special handling within the MAC driver.

It is commonly used by Qualcomm with their QCA807x PHY series and
modern WiSoC-s.

Add definitions for the PHY layer to allow to express this type
of connection between the MAC and PHY.

Signed-off-by: Gabor Juhos <j4g8y7@gmail.com>
Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:12:53 +01:00
Robert Marko
de875d35e0 dt-bindings: net: ethernet-controller: add PSGMII mode
Add a new PSGMII mode which is similar to QSGMII with the difference being
that it combines 5 SGMII lines into a single link compared to 4 on QSGMII.

It is commonly used by Qualcomm on their QCA807x PHY series.

Signed-off-by: Robert Marko <robert.marko@sartura.hr>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:12:53 +01:00
David S. Miller
2d93c30c4e Merge branch 'mlxsw-redirection'
Petr Machata says:

====================
mlxsw: Support traffic redirection from a locked bridge port

Ido Schimmel writes:

It is possible to add a filter that redirects traffic from the ingress
of a bridge port that is locked (i.e., performs security / SMAC lookup)
and has learning enabled. For example:

 # ip link add name br0 type bridge
 # ip link set dev swp1 master br0
 # bridge link set dev swp1 learning on locked on mab on
 # tc qdisc add dev swp1 clsact
 # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2

In the kernel's Rx path, this filter is evaluated before the Rx handler
of the bridge, which means that redirected traffic should not be
affected by bridge port configuration such as learning.

However, the hardware data path is a bit different and the redirect
action (FORWARDING_ACTION in hardware) merely attaches a pointer to the
packet, which is later used by the L2 lookup stage to understand how to
forward the packet. Between both stages - ingress ACL and L2 lookup -
learning and security lookup are performed, which means that redirected
traffic is affected by bridge port configuration, unlike in the kernel's
data path.

The learning discrepancy was handled in commit 577fa14d2100 ("mlxsw:
spectrum: Do not process learned records with a dummy FID") by simply
ignoring learning notifications generated by the redirected traffic. A
similar solution is not possible for the security / SMAC lookup since
- unlike learning - the CPU is not involved and packets that failed the
lookup are dropped by the device.

Instead, solve this by prepending the ignore action to the redirect
action and use it to instruct the device to disable both learning and
the security / SMAC lookup for redirected traffic.

Patch #1 adds the ignore action.

Patch #2 prepends the action to the redirect action in flower offload
code.

Patch #3 removes the workaround in commit 577fa14d2100 ("mlxsw:
spectrum: Do not process learned records with a dummy FID") since it is
no longer needed.

Patch #4 adds a test case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:11:14 +01:00
Ido Schimmel
38c43a1ce7 selftests: forwarding: Add test case for traffic redirection from a locked port
Check that traffic can be redirected from a locked bridge port and that
it does not create locked FDB entries.

Cc: Hans J. Schultz <netdev@kapio-technology.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:11:14 +01:00
Ido Schimmel
9793a5a9c4 mlxsw: spectrum: Stop ignoring learning notifications from redirected traffic
As explained in the previous patch, with the ignore action prepended to
the redirect action, it is not longer possible for redirected traffic to
generate learning notifications.

Therefore, remove the workaround that was added in commit 577fa14d2100
("mlxsw: spectrum: Do not process learned records with a dummy FID") as
it is no longer needed.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:11:14 +01:00
Ido Schimmel
0433670e13 mlxsw: spectrum_flower: Disable learning and security lookup when redirecting
It is possible to add a filter that redirects traffic from the ingress
of a bridge port that is locked (i.e., performs security / SMAC lookup)
and has learning enabled. For example:

 # ip link add name br0 type bridge
 # ip link set dev swp1 master br0
 # bridge link set dev swp1 learning on locked on mab on
 # tc qdisc add dev swp1 clsact
 # tc filter add dev swp1 ingress pref 1 proto ip flower skip_sw src_ip 192.0.2.1 action mirred egress redirect dev swp2

In the kernel's Rx path, this filter is evaluated before the Rx handler
of the bridge, which means that redirected traffic should not be
affected by bridge port configuration such as learning.

However, the hardware data path is a bit different and the redirect
action (FORWARDING_ACTION in hardware) merely attaches a pointer to the
packet, which is later used by the L2 lookup stage to understand how to
forward the packet. Between both stages - ingress ACL and L2 lookup -
learning and security lookup are performed, which means that redirected
traffic is affected by bridge port configuration, unlike in the kernel's
data path.

The learning discrepancy was handled in commit 577fa14d2100 ("mlxsw:
spectrum: Do not process learned records with a dummy FID") by simply
ignoring learning notifications generated by the redirected traffic. A
similar solution is not possible for the security / SMAC lookup since
- unlike learning - the CPU is not involved and packets that failed the
lookup are dropped by the device.

Instead, solve this by prepending the ignore action to the redirect
action and use it to instruct the device to disable both learning and
the security / SMAC lookup for redirected traffic.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:11:14 +01:00
Ido Schimmel
d0d449c747 mlxsw: core_acl_flex_actions: Add IGNORE_ACTION
Add the IGNORE_ACTION which is used to ignore basic switching functions
such as learning on a per-packet basis.

The action will be prepended to the FORWARDING_ACTION in subsequent
patches.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:11:14 +01:00
Furong Xu
58c1e0bace net: stmmac: xgmac: show more MAC HW features in debugfs
1. Show TSSTSSEL(Timestamp System Time Source),
ADDMACADRSEL(additional MAC addresses), SMASEL(SMA/MDIO Interface),
HDSEL(Half-duplex Support) in debugfs.
2. Show exact number of additional MAC address registers for XGMAC2 core.
3. XGMAC2 core does not have different IP checksum offload types, so just
show rx_coe instead of rx_coe_type1 or rx_coe_type2.
4. XGMAC2 core does not have rxfifo_over_2048 definition, skip it.

Signed-off-by: Furong Xu <0x1207@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:09:00 +01:00
David S. Miller
a9142847b7 Merge branch 'net-stats-helpers'
Li Zetao says:

====================
Use helper functions to update stats

The patch set uses the helper functions dev_sw_netstats_rx_add() and
dev_sw_netstats_tx_add() to update stats, which is the same as
implementing the function separately.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:06:25 +01:00
Li Zetao
3c0930b491 vxlan: Use helper functions to update stats
Use the helper functions dev_sw_netstats_rx_add() and
dev_sw_netstats_tx_add() to update stats, which helps to
provide code readability.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:06:24 +01:00
Li Zetao
bf98bbe985 net: macsec: Use helper functions to update stats
Use the helper functions dev_sw_netstats_rx_add() and
dev_sw_netstats_tx_add() to update stats, which helps to
provide code readability.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:06:24 +01:00
William Tu
54f00cce11 vmxnet3: Add XDP support.
The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.

Background:
The vmxnet3 rx consists of three rings: ring0, ring1, and dataring.
For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and dma
mapped to the ring's descriptor. If LRO is enabled and packet size larger
than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to mapped the rest of
the buffer larger than VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is
allocated using alloc_page. So for LRO packets, the payload will be in one
buffer from r0 and multiple from r1, for non-LRO packets, only one
descriptor in r0 is used for packet size less than 3k.

When receiving a packet, the first descriptor will have the sop (start of
packet) bit set, and the last descriptor will have the eop (end of packet)
bit set. Non-LRO packets will have only one descriptor with both sop and
eop set.

Other than r0 and r1, vmxnet3 dataring is specifically designed for
handling packets with small size, usually 128 bytes, defined in
VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend
driver in ESXi to the ring's memory region at front-end vmxnet3 driver, in
order to avoid memory mapping/unmapping overhead. In summary, packet size:
    A. < 128B: use dataring
    B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
    C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
As a result, the patch adds XDP support for packets using dataring
and r0 (case A and B), not the large packet size when LRO is enabled.

XDP Implementation:
When user loads and XDP prog, vmxnet3 driver checks configurations, such
as mtu, lro, and re-allocate the rx buffer size for reserving the extra
headroom, XDP_PACKET_HEADROOM, for XDP frame. The XDP prog will then be
associated with every rx queue of the device. Note that when using dataring
for small packet size, vmxnet3 (front-end driver) doesn't control the
buffer allocation, as a result we allocate a new page and copy packet
from the dataring to XDP frame.

The receive side of XDP is implemented for case A and B, by invoking the
bpf program at vmxnet3_rq_rx_complete and handle its returned action.
The vmxnet3_process_xdp(), vmxnet3_process_xdp_small() function handles
the ring0 and dataring case separately, and decides the next journey of
the packet afterward.

For TX, vmxnet3 has split header design. Outgoing packets are parsed
first and protocol headers (L2/L3/L4) are copied to the backend. The
rest of the payload are dma mapped. Since XDP_TX does not parse the
packet protocol, the entire XDP frame is dma mapped for transmission
and transmitted in a batch. Later on, the frame is freed and recycled
back to the memory pool.

Performance:
Tested using two VMs inside one ESXi vSphere 7.0 machine, using single
core on each vmxnet3 device, sender using DPDK testpmd tx-mode attached
to vmxnet3 device, sending 64B or 512B UDP packet.

VM1 txgen:
$ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
--forward-mode=txonly --eth-peer=0,<mac addr of vm2>
option: add "--txonly-multi-flow"
option: use --txpkts=512 or 64 byte

VM2 running XDP:
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
$ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
options: XDP_DROP, XDP_PASS, XDP_TX

To test REDIRECT to cpu 0, use
$ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop

Single core performance comparison with skb-mode.
64B:      skb-mode -> native-mode
XDP_DROP: 1.6Mpps -> 2.4Mpps
XDP_PASS: 338Kpps -> 367Kpps
XDP_TX:   1.1Mpps -> 2.3Mpps
REDIRECT-drop: 1.3Mpps -> 2.3Mpps

512B:     skb-mode -> native-mode
XDP_DROP: 863Kpps -> 1.3Mpps
XDP_PASS: 275Kpps -> 376Kpps
XDP_TX:   554Kpps -> 1.2Mpps
REDIRECT-drop: 659Kpps -> 1.2Mpps

Demo: https://youtu.be/4lm1CSCi78Q

Future work:
- XDP frag support
- use napi_consume_skb() instead of dev_kfree_skb_any at unmap
- stats using u64_stats_t
- using bitfield macro BIT()
- optimization for DMA synchronization using actual frame length,
  instead of always max_len

Signed-off-by: William Tu <u9012063@gmail.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:03:52 +01:00
David S. Miller
76fa363558 Merge branch 'ovs-drop-reasons'
Adrian Moreno says:

====================
openvswitch: add drop reasons

There is currently a gap in drop visibility in the openvswitch module.
This series tries to improve this by adding a new drop reason subsystem
for OVS.

Apart from adding a new drop reasson subsystem and some common drop
reasons, this series takes Eric's preliminary work [1] on adding an
explicit drop action and integrates it into the same subsystem.

A limitation of this series is that it does not report upcall errors.
The reason is that there could be many sources of upcall drops and the
most common one, which is the netlink buffer overflow, cannot be
reported via kfree_skb() because the skb is freed in the netlink layer
(see [2]). Therefore, using a reason for the rare events and not the
common one would be even more misleading. I'd propose we add (in a
follow up patch) a tracepoint to better report upcall errors.

[1] https://lore.kernel.org/netdev/202306300609.tdRdZscy-lkp@intel.com/T/
[2] commit 1100248a5c5c ("openvswitch: Fix double reporting of drops in dropwatch")

---
v4 -> v5:
- Rebased
- Added a helper function to explicitly convert drop reason enum types

v3 -> v4:
- Changed names of errors following Ilya's suggestions
- Moved the ovs-dpctl.py changes from patch 7/7 to 3/7
- Added a test to ensure actions following a drop are rejected

rfc2 -> v3:
- Rebased on top of latest net-next

rfc1 -> rfc2:
- Fail when an explicit drop is not the last
- Added a drop reason for action errors
- Added braces around macros
- Dropped patch that added support for masks in ovs-dpctl.py as it's now
  included in Aaron's series [2].
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:07 +01:00
Adrian Moreno
4242029164 selftests: openvswitch: add explicit drop testcase
Test explicit drops generate the right drop reason. Also, verify that
the kernel rejects flows with actions following an explicit drop.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Adrian Moreno
aab1272f5d selftests: openvswitch: add drop reason testcase
Test if the correct drop reason is reported when OVS drops a packet due
to an explicit flow.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Adrian Moreno
43d95b30cf net: openvswitch: add misc error drop reasons
Use drop reasons from include/net/dropreason-core.h when a reasonable
candidate exists.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Adrian Moreno
f329d1bc1a net: openvswitch: add meter drop reason
By using an independent drop reason it makes it easy to distinguish
between QoS-triggered or flow-triggered drop.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Eric Garver
e7bc7db9ba net: openvswitch: add explicit drop action
From: Eric Garver <eric@garver.life>

This adds an explicit drop action. This is used by OVS to drop packets
for which it cannot determine what to do. An explicit action in the
kernel allows passing the reason _why_ the packet is being dropped or
zero to indicate no particular error happened (i.e: OVS intentionally
dropped the packet).

Since the error codes coming from userspace mean nothing for the kernel,
we squash all of them into only two drop reasons:
- OVS_DROP_EXPLICIT_WITH_ERROR to indicate a non-zero value was passed
- OVS_DROP_EXPLICIT to indicate a zero value was passed (no error)

e.g. trace all OVS dropped skbs

 # perf trace -e skb:kfree_skb --filter="reason >= 0x30000"
 [..]
 106.023 ping/2465 skb:kfree_skb(skbaddr: 0xffffa0e8765f2000, \
  location:0xffffffffc0d9b462, protocol: 2048, reason: 196611)

reason: 196611 --> 0x30003 (OVS_DROP_EXPLICIT)

Also, this patch allows ovs-dpctl.py to add explicit drop actions as:
  "drop"     -> implicit empty-action drop
  "drop(0)"  -> explicit non-error action drop
  "drop(42)" -> explicit error action drop

Signed-off-by: Eric Garver <eric@garver.life>
Co-developed-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Adrian Moreno
ec7bfb5e5a net: openvswitch: add action error drop reason
Add a drop reason for packets that are dropped because an action
returns a non-zero error code.

Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
Adrian Moreno
9d802da40b net: openvswitch: add last-action drop reason
Create a new drop reason subsystem for openvswitch and add the first
drop reason to represent last-action drops.

Last-action drops happen when a flow has an empty action list or there
is no action that consumes the packet (output, userspace, recirc, etc).
It is the most common way in which OVS drops packets.

Implementation-wise, most of these skb-consuming actions already call
"consume_skb" internally and return directly from within the
do_execute_actions() loop so with minimal changes we can assume that
any skb that exits the loop normally is a packet drop.

Signed-off-by: Adrian Moreno <amorenoz@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14 08:01:06 +01:00
David S. Miller
afb0c19242 Merge branch 'mptcp-remove-msk-subflow'
Matthieu Baerts says:

====================
mptcp: get rid of msk->subflow

The MPTCP protocol maintains an additional struct socket per connection,
mainly to be able to easily use tcp-level struct socket operations.

This leads to several side effects, beyond the quite unfortunate /
confusing 'subflow' field name:

- active and passive sockets behaviour is inconsistent: only active ones
  have a not NULL msk->subflow, leading to different error handling and
  different error code returned to the user-space in several places.

- active sockets uses an unneeded, larger amount of memory

- passive sockets can't successfully go through accept(), disconnect(),
  accept() sequence, see [1] for more details.

The 13 first patches of this series are from Paolo and address all the
above, finally getting rid of the blamed field:

- The first patch is a minor clean-up.

- In the next 11 patches, msk->subflow usage is systematically removed
  from the MPTCP protocol, replacing it with direct msk->first usage,
  eventually introducing new core helpers when needed.

- The 13th patch finally disposes the field, and it's the only patch in
  the series intended to produce functional changes.

The last and 14th patch is from Kuniyuki and it is not linked to the
previous ones: it is a small clean-up to get rid of an unnecessary check
in mptcp_init_sock().

[1] https://github.com/multipath-tcp/mptcp_net-next/issues/290
====================

Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
2023-08-14 07:06:14 +01:00