1155117 Commits

Author SHA1 Message Date
Jakub Kicinski
82fe335b78 Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
virtchnl: update and refactor

Jesse Brandeburg says:

The virtchnl.h file is used by i40e/ice physical function (PF) drivers
and irdma when talking to the iavf driver. This series cleans up the
header file by removing unused elements, adding/cleaning some comments,
fixing the data structures so they are explicitly defined, including
padding, and finally does a long overdue rename of the IWARP members in
the structures to RDMA, since the ice driver and it's associated Intel
Ethernet E800 series adapters support both RDMA and IWARP.

The whole series should result in no functional change, but hopefully
clearer code.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  virtchnl: i40e/iavf: rename iwarp to rdma
  virtchnl: do structure hardening
  virtchnl: update header and increase header clarity
  virtchnl: remove unused structure declaration
====================

Link: https://lore.kernel.org/r/20230125212441.4030014-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 23:29:12 -08:00
Jakub Kicinski
0313afe8b8 Merge branch 'tools-ynl-prevent-reorder-and-fix-flags'
Jakub Kicinski says:

====================
tools: ynl: prevent reorder and fix flags

Some codegen improvements for YAML specs.

First, Lorenzon discovered when switching the XDP feature family
to use flags instead of pure enum that the kdoc got garbled.
The support for enum and flags is therefore unified.

Second when regenerating all families we discussed so far I noticed
that some netlink policies jumped around. We need to ensure we don't
render code based on their ordering in a hash.
====================

Link: https://lore.kernel.org/r/20230126000235.1085551-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:32:43 -08:00
Jakub Kicinski
3a43ded081 tools: ynl: store ops in ordered dict to avoid random ordering
When rendering code we should walk the ops in the order in which
they are declared in the spec. This is both more intuitive and
prevents code from jumping around when hashing in the dict changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:32:41 -08:00
Jakub Kicinski
b49c34e217 tools: ynl: rename ops_list -> msg_list
ops_list contains all the operations, but the main iteration use
case is to walk only ops which define attrs. Rename ops_list to
msg_list, because now it looks like the contents are the same,
just the format is different. While at it convert from tuple
to just keys, none of the users care about the name of the op.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:32:41 -08:00
Jakub Kicinski
66fa34b9c2 tools: ynl: support kdocs for flags in code generation
Lorenzo reports that after switching from enum to flags netdev
family lost ability to render kdoc (and the enum contents got
generally garbled).

Combine the flags and enum handling in uAPI handling.

Reported-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:32:41 -08:00
Jakub Kicinski
868c82f34c Merge branch 'convert-drivers-to-return-xfrm-configuration-errors-through-extack'
Leon Romanovsky says:

====================
Convert drivers to return XFRM configuration errors through extack

This series continues effort started by Sabrina to return XFRM configuration
errors through extack. It allows for user space software stack easily present
driver failure reasons to users.

As a note, Intel drivers have a path where extack is equal to NULL, and error
prints won't be available in current patchset. If it is needed, it can be
changed by adding special to Intel macro to print to dmesg in case of
extack == NULL.
====================

Link: https://lore.kernel.org/r/cover.1674560845.git.leon@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:51 -08:00
Leon Romanovsky
8c284ea429 cxgb4: fill IPsec state validation failure reason
Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
3fe5798627 bonding: fill IPsec state validation failure reason
Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
505c500cfc ixgbe: fill IPsec state validation failure reason
Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
c068ec5c96 ixgbevf: fill IPsec state validation failure reason
Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
05ddf5f8cb nfp: fill IPsec state validation failure reason
Rely on extack to return failure reason.

Reviewed-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
6c48697955 netdevsim: Fill IPsec state validation failure reason
Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
902812b816 net/mlx5e: Fill IPsec state validation failure reason
Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
7681a4f58f xfrm: extend add state callback to set failure reason
Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.

Let's use extack to return this information to users.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
1bb70c5ab6 net/mlx5e: Fill IPsec policy validation failure reason
Rely on extack to return failure reason.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Leon Romanovsky
3089386db0 xfrm: extend add policy callback to set failure reason
Almost all validation logic is in the drivers, but they are
missing reliable way to convey failure reason to userspace
applications.

Let's use extack to return this information to users.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-26 16:28:48 -08:00
Vladimir Oltean
9179f5fe41 net: ethtool: provide shims for stats aggregation helpers when CONFIG_ETHTOOL_NETLINK=n
ethtool_aggregate_*_stats() are implemented in net/ethtool/stats.c, a
file which is compiled out when CONFIG_ETHTOOL_NETLINK=n. In order to
avoid adding Kbuild dependencies from drivers (which call these helpers)
on CONFIG_ETHTOOL_NETLINK, let's add some shim definitions which simply
make the helpers dead code.

This means the function prototypes should have been located in
include/linux/ethtool_netlink.h rather than include/linux/ethtool.h.

Fixes: 449c5459641a ("net: ethtool: add helpers for aggregate statistics")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230125110214.4127759-1-vladimir.oltean@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 15:28:25 +01:00
Paolo Abeni
97f7d3dd76 Merge branch 'mptcp-add-mixed-v4-v6-support-for-the-in-kernel-pm'
Matthieu Baerts says:

====================
mptcp: add mixed v4/v6 support for the in-kernel PM

Before these patches, the in-kernel Path-Manager would not allow, for
the same MPTCP connection, having a mix of subflows in v4 and v6.

MPTCP's RFC 8684 doesn't forbid that and it is even recommended to do so
as the path in v4 and v6 are likely different. Some networks are also
v4 or v6 only, we cannot assume they all have both v4 and v6 support.

Patch 1 then removes this artificial constraint in the in-kernel PM
currently enforcing there are no mixed subflows in place, either in
address announcement or in subflow creation areas.

Patch 2 makes sure the sk_ipv6only attribute is also propagated to
subflows, just in case a new PM wouldn't respect it.

Some selftests have also been added for the in-kernel PM (patch 3).

Patches 4 to 8 are just some cleanups and small improvements in the
printed messages in the userspace PM. It is not linked to the rest but
identified when working on a related patch modifying this selftest,
already in -net:

  commit 4656d72c1efa ("selftests: mptcp: userspace: validate v4-v6 subflows mix")
---
====================

Link: https://lore.kernel.org/r/20230123-upstream-net-next-pm-v4-v6-v1-0-43fac502bfbf@tessares.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:33 +01:00
Matthieu Baerts
8dbdf24f4e selftests: mptcp: userspace: avoid read errors
During the cleanup phase, the server pids were killed with a SIGTERM
directly, not using a SIGUSR1 first to quit safely. As a result, this
test was often ending with two error messages:

  read: Connection reset by peer

While at it, use a for-loop to terminate all the PIDs the same way.

Also the different files are now removed after having killed the PIDs
using them. It makes more sense to do that in this order.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:30 +01:00
Matthieu Baerts
10d4273411 selftests: mptcp: userspace: print error details if any
Before, only '[FAIL]' was printed in case of error during the validation
phase.

Now, in case of failure, the variable name, its value and expected one
are displayed to help understand what was wrong.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:30 +01:00
Matthieu Baerts
1c0b0ee264 selftests: mptcp: userspace: refactor asserts
Instead of having a long list of conditions to check, it is possible to
give a list of variable names to compare with their 'e_XXX' version.

This will ease the introduction of the following commit which will print
which condition has failed (if any).

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:30 +01:00
Matthieu Baerts
f790ae03db selftests: mptcp: userspace: print titles
This script is running a few tests after having setup the environment.

Printing titles helps understand what is being tested.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:30 +01:00
Matthieu Baerts
40c71f763f mptcp: userspace pm: use a single point of exit
Like in all other functions in this file, a single point of exit is used
when extra operations are needed: unlock, decrement refcount, etc.

There is no functional change for the moment but it is better to do the
same here to make sure all cleanups are done in case of intermediate
errors.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:30 +01:00
Paolo Abeni
ad3493746e selftests: mptcp: add test-cases for mixed v4/v6 subflows
Note that we can't guess the listener family anymore based on the client
target address: always use IPv6.

The fullmesh flag with endpoints from different families is also
validated here.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:30 +01:00
Matthieu Baerts
7e9740e0e8 mptcp: propagate sk_ipv6only to subflows
Usually, attributes are propagated to subflows as well.

Here, if subflows are created by other ways than the MPTCP path-manager,
it is important to make sure they are in v6 if it is asked by the
userspace.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:30 +01:00
Paolo Abeni
b9d69db87f mptcp: let the in-kernel PM use mixed IPv4 and IPv6 addresses
Currently the in-kernel PM arbitrary enforces that created subflow's
family must match the main MPTCP socket while the RFC allows mixing
IPv4 and IPv6 subflows.

This patch changes the in-kernel PM logic to create subflows matching
the currently selected source (or destination) address. IPv4 sockets
can pick only IPv4 addresses (and v4 mapped in v6), while IPv6 sockets
not restricted to V6ONLY can pick either IPv4 and IPv6 addresses as
long as the source and destination matches.

A helper, previously introduced is used to ease family matching checks,
taking care of IPv4 vs IPv4-mapped-IPv6 vs IPv6 only addresses.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/269
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 13:33:30 +01:00
Jamie Bainbridge
d0941130c9 icmp: Add counters for rate limits
There are multiple ICMP rate limiting mechanisms:

* Global limits: net.ipv4.icmp_msgs_burst/icmp_msgs_per_sec
* v4 per-host limits: net.ipv4.icmp_ratelimit/ratemask
* v6 per-host limits: net.ipv6.icmp_ratelimit/ratemask

However, when ICMP output is limited, there is no way to tell
which limit has been hit or even if the limits are responsible
for the lack of ICMP output.

Add counters for each of the cases above. As we are within
local_bh_disable(), use the __INC stats variant.

Example output:

 # nstat -sz "*RateLimit*"
 IcmpOutRateLimitGlobal          134                0.0
 IcmpOutRateLimitHost            770                0.0
 Icmp6OutRateLimitHost           84                 0.0

Signed-off-by: Jamie Bainbridge <jamie.bainbridge@gmail.com>
Suggested-by: Abhishek Rawal <rawal.abhishek92@gmail.com>
Link: https://lore.kernel.org/r/273b32241e6b7fdc5c609e6f5ebc68caf3994342.1674605770.git.jamie.bainbridge@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:52:18 +01:00
Paolo Abeni
9f92752788 Merge branch 'adding-sparx5-is0-vcap-support'
Steen Hegelund says:

====================
Adding Sparx5 IS0 VCAP support

This provides the Ingress Stage 0 (IS0) VCAP (Versatile Content-Aware
Processor) support for the Sparx5 platform.

The IS0 VCAP (also known in the datasheet as CLM) is a classifier VCAP that
mainly extracts frame information to metadata that follows the frame in the
Sparx5 processing flow all the way to the egress port.

The IS0 VCAP has 4 lookups and they are accessible with a TC chain id:

- chain 1000000: IS0 Lookup 0
- chain 1100000: IS0 Lookup 1
- chain 1200000: IS0 Lookup 2
- chain 1300000: IS0 Lookup 3
- chain 1400000: IS0 Lookup 4
- chain 1500000: IS0 Lookup 5

Each of these lookups have their own port keyset configuration that decides
which keys will be used for matching on which traffic type.

The IS0 VCAP has these traffic classifications:

- IPv4 frames
- IPv6 frames
- Unicast MPLS frames (ethertype = 0x8847)
- Multicast MPLS frames (ethertype = 0x8847)
- Other frame types than MPLS, IPv4 and IPv6

The IS0 VCAP has an action that allows setting the value of a PAG (Policy
Association Group) key field in the frame metadata, and this can be used
for matching in an IS2 VCAP rule.

This allow rules in the IS0 VCAP to be linked to rules in the IS2 VCAP.

The linking is exposed by using the TC "goto chain" action with an offset
from the IS2 chain ids.

As an example a "goto chain 8000001" will use a PAG value of 1 to chain to
a rule in IS2 Lookup 0.
====================

Link: https://lore.kernel.org/r/20230124104511.293938-1-steen.hegelund@microchip.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:08:05 +01:00
Steen Hegelund
52df82cc91 net: microchip: sparx5: Add support for IS0 VCAP CVLAN TC keys
This adds support for parsing and matching on the CVLAN tags in the Sparx5
IS0 VCAP.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:07:44 +01:00
Steen Hegelund
63e3564507 net: microchip: sparx5: Add support for IS0 VCAP ethernet protocol types
This allows the IS0 VCAP to have its own list of supported ethernet
protocol types matching what is supported by the VCAPs port lookup
classification.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:07:44 +01:00
Steen Hegelund
81e164c4ae net: microchip: sparx5: Add automatic selection of VCAP rule actionset
With more than one possible actionset in a VCAP instance, the VCAP API will
now use the actions in a VCAP rule to select the actionset that fits these
actions the best possible way.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:07:44 +01:00
Steen Hegelund
88bd9ea70b net: microchip: sparx5: Add TC filter chaining support for IS0 and IS2 VCAPs
This allows rules to be chained between VCAP instances, e.g. from IS0
Lookup 0 to IS0 Lookup 1, or from one of the IS0 Lookups to one of the IS2
Lookups.

Chaining from an IS2 Lookup to another IS2 Lookup is not supported in the
hardware.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:07:44 +01:00
Steen Hegelund
542e6e2c20 net: microchip: sparx5: Add TC support for IS0 VCAP
This enables the TC command to use the Sparx5 IS0 VCAP

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:07:44 +01:00
Steen Hegelund
7306fcd17c net: microchip: sparx5: Add actionset type id information to rule
This adds the actionset type id to the rule information.  This is needed as
we now have more than one actionset in a VCAP instance (IS0).

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:07:44 +01:00
Steen Hegelund
545609fd4e net: microchip: sparx5: Add IS0 VCAP keyset configuration for Sparx5
This adds the IS0 VCAP port keyset configuration for Sparx5 and also
updates the debugFS support to show the keyset configuration.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:07:44 +01:00
Steen Hegelund
f274a659fb net: microchip: sparx5: Add IS0 VCAP model and updated KUNIT VCAP model
This provides the IS0 (Ingress Stage 0) or CLM VCAP model for Sparx5.
This VCAP provides classification actions for Sparx5.

Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2023-01-26 10:07:44 +01:00
Jakub Kicinski
3f17e16f38 Merge branch 'add-ip_local_port_range-socket-option'
Jakub Sitnicki says:

====================
Add IP_LOCAL_PORT_RANGE socket option

This patch set is a follow up to the "How to share IPv4 addresses by
partitioning the port space" talk given at LPC 2022 [1].

Please see patch #1 for the motivation & the use case description.
Patch #2 adds tests exercising the new option in various scenarios.

Documentation
-------------

Proposed update to the ip(7) man-page:

       IP_LOCAL_PORT_RANGE (since Linux X.Y)
              Set or get the per-socket default local  port  range.  This
              option  can  be  used  to  clamp down the global local port
              range, defined by the ip_local_port_range  /proc  interface
              described below, for a given socket.

              The  option  takes  an uint32_t value with the high 16 bits
              set to the upper range bound, and the low 16  bits  set  to
              the  lower  range  bound.  Range  bounds are inclusive. The
              16-bit values should be in host byte order.

              The lower bound has to be less than the  upper  bound  when
              both  bounds  are  not  zero. Otherwise, setting the option
              fails with EINVAL.

              If either bound is outside of the global local port  range,
              or is zero, then that bound has no effect.

              To  reset  the setting, pass zero as both the upper and the
              lower bound.

Interaction with SELinux bind() hook
------------------------------------

SELinux bind() hook - selinux_socket_bind() - performs a permission check
if the requested local port number lies outside of the netns ephemeral port
range.

The proposed socket option cannot be used change the ephemeral port range
to extend beyond the per-netns port range, as set by
net.ipv4.ip_local_port_range.

Hence, there is no interaction with SELinux, AFAICT.

RFC -> v1
RFC: https://lore.kernel.org/netdev/20220912225308.93659-1-jakub@cloudflare.com/

 * Allow either the high bound or the low bound, or both, to be zero
 * Add getsockopt support
 * Add selftests

Links:
------

[1]: https://lpc.events/event/16/contributions/1349/
====================

Link: https://lore.kernel.org/r/20221221-sockopt-port-range-v6-0-be255cc0e51f@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25 22:45:02 -08:00
Jakub Sitnicki
ae5439658c selftests/net: Cover the IP_LOCAL_PORT_RANGE socket option
Exercise IP_LOCAL_PORT_RANGE socket option in various scenarios:

1. pass invalid values to setsockopt
2. pass a range outside of the per-netns port range
3. configure a single-port range
4. exhaust a configured multi-port range
5. check interaction with late-bind (IP_BIND_ADDRESS_NO_PORT)
6. set then get the per-socket port range

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25 22:45:00 -08:00
Jakub Sitnicki
91d0b78c51 inet: Add IP_LOCAL_PORT_RANGE socket option
Users who want to share a single public IP address for outgoing connections
between several hosts traditionally reach for SNAT. However, SNAT requires
state keeping on the node(s) performing the NAT.

A stateless alternative exists, where a single IP address used for egress
can be shared between several hosts by partitioning the available ephemeral
port range. In such a setup:

1. Each host gets assigned a disjoint range of ephemeral ports.
2. Applications open connections from the host-assigned port range.
3. Return traffic gets routed to the host based on both, the destination IP
   and the destination port.

An application which wants to open an outgoing connection (connect) from a
given port range today can choose between two solutions:

1. Manually pick the source port by bind()'ing to it before connect()'ing
   the socket.

   This approach has a couple of downsides:

   a) Search for a free port has to be implemented in the user-space. If
      the chosen 4-tuple happens to be busy, the application needs to retry
      from a different local port number.

      Detecting if 4-tuple is busy can be either easy (TCP) or hard
      (UDP). In TCP case, the application simply has to check if connect()
      returned an error (EADDRNOTAVAIL). That is assuming that the local
      port sharing was enabled (REUSEADDR) by all the sockets.

        # Assume desired local port range is 60_000-60_511
        s = socket(AF_INET, SOCK_STREAM)
        s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s.bind(("192.0.2.1", 60_000))
        s.connect(("1.1.1.1", 53))
        # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
        # Application must retry with another local port

      In case of UDP, the network stack allows binding more than one socket
      to the same 4-tuple, when local port sharing is enabled
      (REUSEADDR). Hence detecting the conflict is much harder and involves
      querying sock_diag and toggling the REUSEADDR flag [1].

   b) For TCP, bind()-ing to a port within the ephemeral port range means
      that no connecting sockets, that is those which leave it to the
      network stack to find a free local port at connect() time, can use
      the this port.

      IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
      will be skipped during the free port search at connect() time.

2. Isolate the app in a dedicated netns and use the use the per-netns
   ip_local_port_range sysctl to adjust the ephemeral port range bounds.

   The per-netns setting affects all sockets, so this approach can be used
   only if:

   - there is just one egress IP address, or
   - the desired egress port range is the same for all egress IP addresses
     used by the application.

   For TCP, this approach avoids the downsides of (1). Free port search and
   4-tuple conflict detection is done by the network stack:

     system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")

     s = socket(AF_INET, SOCK_STREAM)
     s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
     s.bind(("192.0.2.1", 0))
     s.connect(("1.1.1.1", 53))
     # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy

  For UDP this approach has limited applicability. Setting the
  IP_BIND_ADDRESS_NO_PORT socket option does not result in local source
  port being shared with other connected UDP sockets.

  Hence relying on the network stack to find a free source port, limits the
  number of outgoing UDP flows from a single IP address down to the number
  of available ephemeral ports.

To put it another way, partitioning the ephemeral port range between hosts
using the existing Linux networking API is cumbersome.

To address this use case, add a new socket option at the SOL_IP level,
named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
ephemeral port range for each socket individually.

The option can be used only to narrow down the per-netns local port
range. If the per-socket range lies outside of the per-netns range, the
latter takes precedence.

UAPI-wise, the low and high range bounds are passed to the kernel as a pair
of u16 values in host byte order packed into a u32. This avoids pointer
passing.

  PORT_LO = 40_000
  PORT_HI = 40_511

  s = socket(AF_INET, SOCK_STREAM)
  v = struct.pack("I", PORT_HI << 16 | PORT_LO)
  s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
  s.bind(("127.0.0.1", 0))
  s.getsockname()
  # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
  # if there is a free port. EADDRINUSE otherwise.

[1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116

Reviewed-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25 22:45:00 -08:00
Randy Dunlap
6a7a2c18a9 net: Kconfig: fix spellos
Fix spelling in net/ Kconfig files.
(reported by codespell)

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: coreteam@netfilter.org
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Link: https://lore.kernel.org/r/20230124181724.18166-1-rdunlap@infradead.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-01-25 22:39:56 -08:00
Jesse Brandeburg
2723f3b5d4 virtchnl: i40e/iavf: rename iwarp to rdma
Since the latest Intel hardware does both IWARP and ROCE, rename the
term IWARP in the virtchnl header to be RDMA. Do this for both upper and
lower case instances. Many of the non-virtchnl.h changes were done with
regular expression replacements using perl like:
perl -p -i -e 's/_IWARP/_RDMA/' <files>
perl -p -i -e 's/_iwarp/_rdma/' <files>
and I had to pick up a few instances manually.

The virtchnl.h header has some comments and clarity added around when to
use certain defines.

note: had to fix a checkpatch warning for a long line by wrapping one of
the lines I changed.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Jakub Andrysiak <jakub.andrysiak@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-01-25 08:55:19 -08:00
Jesse Brandeburg
4e4df55941 virtchnl: do structure hardening
The virtchnl interface can have a bunch of "soft" defined structures
hardened by using explicit sizes for declarations, and then referring to
the enum type that uses them in a comment. None of these changes should
change any of the structure sizes.

Also, remove a duplicate line in a switch statement and let two cases
uses the same code.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-01-25 08:55:19 -08:00
Jesse Brandeburg
43fc70a208 virtchnl: update header and increase header clarity
We already have the SPDX header, so just leave a copyright notice with
an updated year and get rid of the boilerplate header (so 2002!).

In addition, update a couple of comments to clarify how the various
parts of the virtchannel header interaction work.

No functional changes.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-01-25 08:55:19 -08:00
Jesse Brandeburg
3a6a9b3be2 virtchnl: remove unused structure declaration
Nothing uses virtchnl_msg, just remove it.

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Marek Szlosek <marek.szlosek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2023-01-25 08:55:19 -08:00
Vladimir Oltean
f5be9caf7b net: ethtool: fix NULL pointer dereference in pause_prepare_data()
In the following call path:

ethnl_default_dumpit
-> ethnl_default_dump_one
   -> ctx->ops->prepare_data
      -> pause_prepare_data

struct genl_info *info will be passed as NULL, and pause_prepare_data()
dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reported-by: syzbot+9d44aae2720fc40b8474@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25 09:57:41 +00:00
Vladimir Oltean
c96de13632 net: ethtool: fix NULL pointer dereference in stats_prepare_data()
In the following call path:

ethnl_default_dumpit
-> ethnl_default_dump_one
   -> ctx->ops->prepare_data
      -> stats_prepare_data

struct genl_info *info will be passed as NULL, and stats_prepare_data()
dereferences it while getting the extended ack pointer.

To avoid that, just set the extack to NULL if "info" is NULL, since the
netlink extack handling messages know how to deal with that.

The pattern "info ? info->extack : NULL" is present in quite a few other
"prepare_data" implementations, so it's clear that it's a more general
problem to be dealt with at a higher level, but the code should have at
least adhered to the current conventions to avoid the NULL dereference.

Fixes: 04692c9020b7 ("net: ethtool: netlink: retrieve stats from multiple sources (eMAC, pMAC)")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25 09:56:31 +00:00
David S. Miller
99db6fb043 Merge branch 's390-ism-generalized-interface'
Jan Karcher says:

====================
drivers/s390/net/ism: Add generalized interface

Previously, there was no clean separation between SMC-D code and the ISM
device driver.This patch series addresses the situation to make ISM available
for uses outside of SMC-D.
In detail: SMC-D offers an interface via struct smcd_ops, which only the
ISM module implements so far. However, there is no real separation between
the smcd and ism modules, which starts right with the ISM device
initialization, which calls directly into the SMC-D code.
This patch series introduces a new API in the ISM module, which allows
registration of arbitrary clients via include/linux/ism.h: struct ism_client.
Furthermore, it introduces a "pure" struct ism_dev (i.e. getting rid of
dependencies on SMC-D in the device structure), and adds a number of API
calls for data transfers via ISM (see ism_register_dmb() & friends).
Still, the ISM module implements the SMC-D API, and therefore has a number
of internal helper functions for that matter.
Note that the ISM API is consciously kept thin for now (as compared to the
SMC-D API calls), as a number of API calls are only used with SMC-D and
hardly have any meaningful usage beyond SMC-D, e.g. the VLAN-related calls.

v1 -> v2:
  Removed s390x dependency which broke config for other archs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25 09:46:49 +00:00
Stefan Raspl
8c81ba2034 net/smc: De-tangle ism and smc device initialization
The struct device for ISM devices was part of struct smcd_dev. Move to
struct ism_dev, provide a new API call in struct smcd_ops, and convert
existing SMCD code accordingly.
Furthermore, remove struct smcd_dev from struct ism_dev.
This is the final part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25 09:46:49 +00:00
Stefan Raspl
820f21009f s390/ism: Consolidate SMC-D-related code
The ism module had SMC-D-specific code sprinkled across the entire module.
We are now consolidating the SMC-D-specific parts into the latter parts
of the module, so it becomes more clear what code is intended for use with
ISM, and which parts are glue code for usage in the context of SMC-D.
This is the fourth part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25 09:46:49 +00:00
Stefan Raspl
9de4df7b6b net/smc: Separate SMC-D and ISM APIs
We separate the code implementing the struct smcd_ops API in the ISM
device driver from the functions that may be used by other exploiters of
ISM devices.
Note: We start out small, and don't offer the whole breadth of the ISM
device for public use, as many functions are specific to or likely only
ever used in the context of SMC-D.
This is the third part of a bigger overhaul of the interfaces between SMC
and ISM.

Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Signed-off-by: Jan Karcher <jaka@linux.ibm.com>
Signed-off-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2023-01-25 09:46:48 +00:00