2019-05-27 08:55:01 +02:00
/* SPDX-License-Identifier: GPL-2.0-or-later */
net: Distributed Switch Architecture protocol support
Distributed Switch Architecture is a protocol for managing hardware
switch chips. It consists of a set of MII management registers and
commands to configure the switch, and an ethernet header format to
signal which of the ports of the switch a packet was received from
or is intended to be sent to.
The switches that this driver supports are typically embedded in
access points and routers, and a typical setup with a DSA switch
looks something like this:
+-----------+ +-----------+
| | RGMII | |
| +-------+ +------ 1000baseT MDI ("WAN")
| | | 6-port +------ 1000baseT MDI ("LAN1")
| CPU | | ethernet +------ 1000baseT MDI ("LAN2")
| |MIImgmt| switch +------ 1000baseT MDI ("LAN3")
| +-------+ w/5 PHYs +------ 1000baseT MDI ("LAN4")
| | | |
+-----------+ +-----------+
The switch driver presents each port on the switch as a separate
network interface to Linux, polls the switch to maintain software
link state of those ports, forwards MII management interface
accesses to those network interfaces (e.g. as done by ethtool) to
the switch, and exposes the switch's hardware statistics counters
via the appropriate Linux kernel interfaces.
This initial patch supports the MII management interface register
layout of the Marvell 88E6123, 88E6161 and 88E6165 switch chips, and
supports the "Ethertype DSA" packet tagging format.
(There is no officially registered ethertype for the Ethertype DSA
packet format, so we just grab a random one. The ethertype to use
is programmed into the switch, and the switch driver uses the value
of ETH_P_EDSA for this, so this define can be changed at any time in
the future if the one we chose is allocated to another protocol or
if Ethertype DSA gets its own officially registered ethertype, and
everything will continue to work.)
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Tested-by: Nicolas Pitre <nico@marvell.com>
Tested-by: Byron Bradley <byron.bbradley@gmail.com>
Tested-by: Tim Ellis <tim.ellis@mac.com>
Tested-by: Peter van Valderen <linux@ddcrew.com>
Tested-by: Dirk Teurlings <dirk@upexia.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 13:44:02 +00:00
/*
* include / net / dsa . h - Driver for Distributed Switch Architecture switch chips
dsa: add switch chip cascading support
The initial version of the DSA driver only supported a single switch
chip per network interface, while DSA-capable switch chips can be
interconnected to form a tree of switch chips. This patch adds support
for multiple switch chips on a network interface.
An example topology for a 16-port device with an embedded CPU is as
follows:
+-----+ +--------+ +--------+
| |eth0 10| switch |9 10| switch |
| CPU +----------+ +-------+ |
| | | chip 0 | | chip 1 |
+-----+ +---++---+ +---++---+
|| ||
|| ||
||1000baseT ||1000baseT
||ports 1-8 ||ports 9-16
This requires a couple of interdependent changes in the DSA layer:
- The dsa platform driver data needs to be extended: there is still
only one netdevice per DSA driver instance (eth0 in the example
above), but each of the switch chips in the tree needs its own
mii_bus device pointer, MII management bus address, and port name
array. (include/net/dsa.h) The existing in-tree dsa users need
some small changes to deal with this. (arch/arm)
- The DSA and Ethertype DSA tagging modules need to be extended to
use the DSA device ID field on receive and demultiplex the packet
accordingly, and fill in the DSA device ID field on transmit
according to which switch chip the packet is heading to.
(net/dsa/tag_{dsa,edsa}.c)
- The concept of "CPU port", which is the switch chip port that the
CPU is connected to (port 10 on switch chip 0 in the example), needs
to be extended with the concept of "upstream port", which is the
port on the switch chip that will bring us one hop closer to the CPU
(port 10 for both switch chips in the example above).
- The dsa platform data needs to specify which ports on which switch
chips are links to other switch chips, so that we can enable DSA
tagging mode on them. (For inter-switch links, we always use
non-EtherType DSA tagging, since it has lower overhead. The CPU
link uses dsa or edsa tagging depending on what the 'root' switch
chip supports.) This is done by specifying "dsa" for the given
port in the port array.
- The dsa platform data needs to be extended with information on via
which port to reach any given switch chip from any given switch chip.
This info is specified via the per-switch chip data struct ->rtable[]
array, which gives the nexthop ports for each of the other switches
in the tree.
For the example topology above, the dsa platform data would look
something like this:
static struct dsa_chip_data sw[2] = {
{
.mii_bus = &foo,
.sw_addr = 1,
.port_names[0] = "p1",
.port_names[1] = "p2",
.port_names[2] = "p3",
.port_names[3] = "p4",
.port_names[4] = "p5",
.port_names[5] = "p6",
.port_names[6] = "p7",
.port_names[7] = "p8",
.port_names[9] = "dsa",
.port_names[10] = "cpu",
.rtable = (s8 []){ -1, 9, },
}, {
.mii_bus = &foo,
.sw_addr = 2,
.port_names[0] = "p9",
.port_names[1] = "p10",
.port_names[2] = "p11",
.port_names[3] = "p12",
.port_names[4] = "p13",
.port_names[5] = "p14",
.port_names[6] = "p15",
.port_names[7] = "p16",
.port_names[10] = "dsa",
.rtable = (s8 []){ 10, -1, },
},
},
static struct dsa_platform_data pd = {
.netdev = &foo,
.nr_switches = 2,
.sw = sw,
};
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Tested-by: Gary Thomas <gary@mlbassoc.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2009-03-20 09:52:09 +00:00
* Copyright ( c ) 2008 - 2009 Marvell Semiconductor
net: Distributed Switch Architecture protocol support
Distributed Switch Architecture is a protocol for managing hardware
switch chips. It consists of a set of MII management registers and
commands to configure the switch, and an ethernet header format to
signal which of the ports of the switch a packet was received from
or is intended to be sent to.
The switches that this driver supports are typically embedded in
access points and routers, and a typical setup with a DSA switch
looks something like this:
+-----------+ +-----------+
| | RGMII | |
| +-------+ +------ 1000baseT MDI ("WAN")
| | | 6-port +------ 1000baseT MDI ("LAN1")
| CPU | | ethernet +------ 1000baseT MDI ("LAN2")
| |MIImgmt| switch +------ 1000baseT MDI ("LAN3")
| +-------+ w/5 PHYs +------ 1000baseT MDI ("LAN4")
| | | |
+-----------+ +-----------+
The switch driver presents each port on the switch as a separate
network interface to Linux, polls the switch to maintain software
link state of those ports, forwards MII management interface
accesses to those network interfaces (e.g. as done by ethtool) to
the switch, and exposes the switch's hardware statistics counters
via the appropriate Linux kernel interfaces.
This initial patch supports the MII management interface register
layout of the Marvell 88E6123, 88E6161 and 88E6165 switch chips, and
supports the "Ethertype DSA" packet tagging format.
(There is no officially registered ethertype for the Ethertype DSA
packet format, so we just grab a random one. The ethertype to use
is programmed into the switch, and the switch driver uses the value
of ETH_P_EDSA for this, so this define can be changed at any time in
the future if the one we chose is allocated to another protocol or
if Ethertype DSA gets its own officially registered ethertype, and
everything will continue to work.)
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Tested-by: Nicolas Pitre <nico@marvell.com>
Tested-by: Byron Bradley <byron.bbradley@gmail.com>
Tested-by: Tim Ellis <tim.ellis@mac.com>
Tested-by: Peter van Valderen <linux@ddcrew.com>
Tested-by: Dirk Teurlings <dirk@upexia.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 13:44:02 +00:00
*/
# ifndef __LINUX_NET_DSA_H
# define __LINUX_NET_DSA_H
2017-02-07 15:03:05 -08:00
# include <linux/if.h>
2011-11-30 22:07:18 +00:00
# include <linux/if_ether.h>
2011-11-27 17:06:08 +00:00
# include <linux/list.h>
2017-02-03 13:20:20 -05:00
# include <linux/notifier.h>
2011-11-25 14:32:52 +00:00
# include <linux/timer.h>
# include <linux/workqueue.h>
2014-08-27 17:04:49 -07:00
# include <linux/of.h>
2014-10-17 16:02:13 -07:00
# include <linux/ethtool.h>
2018-02-14 01:07:48 +01:00
# include <linux/net_tstamp.h>
2018-05-10 13:17:32 -07:00
# include <linux/phy.h>
2019-01-15 15:06:11 -08:00
# include <linux/platform_data/dsa.h>
2019-05-28 20:38:12 +03:00
# include <linux/phylink.h>
2017-03-28 23:45:07 +02:00
# include <net/devlink.h>
2017-05-17 15:46:04 -04:00
# include <net/switchdev.h>
2011-11-25 14:32:52 +00:00
2017-01-30 12:41:40 -08:00
struct tc_action ;
2017-02-07 15:03:05 -08:00
struct phy_device ;
struct fixed_phy_status ;
2018-05-10 13:17:32 -07:00
struct phylink_link_state ;
2017-01-30 12:41:40 -08:00
2019-04-28 19:37:12 +02:00
# define DSA_TAG_PROTO_NONE_VALUE 0
# define DSA_TAG_PROTO_BRCM_VALUE 1
# define DSA_TAG_PROTO_BRCM_PREPEND_VALUE 2
# define DSA_TAG_PROTO_DSA_VALUE 3
# define DSA_TAG_PROTO_EDSA_VALUE 4
# define DSA_TAG_PROTO_GSWIP_VALUE 5
# define DSA_TAG_PROTO_KSZ9477_VALUE 6
# define DSA_TAG_PROTO_KSZ9893_VALUE 7
# define DSA_TAG_PROTO_LAN9303_VALUE 8
# define DSA_TAG_PROTO_MTK_VALUE 9
# define DSA_TAG_PROTO_QCA_VALUE 10
# define DSA_TAG_PROTO_TRAILER_VALUE 11
net: dsa: Optional VLAN-based port separation for switches without tagging
This patch provides generic DSA code for using VLAN (802.1Q) tags for
the same purpose as a dedicated switch tag for injection/extraction.
It is based on the discussions and interest that has been so far
expressed in https://www.spinics.net/lists/netdev/msg556125.html.
Unlike all other DSA-supported tagging protocols, CONFIG_NET_DSA_TAG_8021Q
does not offer a complete solution for drivers (nor can it). Instead, it
provides generic code that driver can opt into calling:
- dsa_8021q_xmit: Inserts a VLAN header with the specified contents.
Can be called from another tagging protocol's xmit function.
Currently the LAN9303 driver is inserting headers that are simply
802.1Q with custom fields, so this is an opportunity for code reuse.
- dsa_8021q_rcv: Retrieves the TPID and TCI from a VLAN-tagged skb.
Removing the VLAN header is left as a decision for the caller to make.
- dsa_port_setup_8021q_tagging: For each user port, installs an Rx VID
and a Tx VID, for proper untagged traffic identification on ingress
and steering on egress. Also sets up the VLAN trunk on the upstream
(CPU or DSA) port. Drivers are intentionally left to call this
function explicitly, depending on the context and hardware support.
The expected switch behavior and VLAN semantics should not be violated
under any conditions. That is, after calling
dsa_port_setup_8021q_tagging, the hardware should still pass all
ingress traffic, be it tagged or untagged.
For uniformity with the other tagging protocols, a module for the
dsa_8021q_netdev_ops structure is registered, but the typical usage is
to set up another tagging protocol which selects CONFIG_NET_DSA_TAG_8021Q,
and calls the API from tag_8021q.h. Null function definitions are also
provided so that a "depends on" is not forced in the Kconfig.
This tagging protocol only works when switch ports are standalone, or
when they are added to a VLAN-unaware bridge. It will probably remain
this way for the reasons below.
When added to a bridge that has vlan_filtering 1, the bridge core will
install its own VLANs and reset the pvids through switchdev. For the
bridge core, switchdev is a write-only pipe. All VLAN-related state is
kept in the bridge core and nothing is read from DSA/switchdev or from
the driver. So the bridge core will break this port separation because
it will install the vlan_default_pvid into all switchdev ports.
Even if we could teach the bridge driver about switchdev preference of a
certain vlan_default_pvid (task difficult in itself since the current
setting is per-bridge but we would need it per-port), there would still
exist many other challenges.
Firstly, in the DSA rcv callback, a driver would have to perform an
iterative reverse lookup to find the correct switch port. That is
because the port is a bridge slave, so its Rx VID (port PVID) is subject
to user configuration. How would we ensure that the user doesn't reset
the pvid to a different value (which would make an O(1) translation
impossible), or to a non-unique value within this DSA switch tree (which
would make any translation impossible)?
Finally, not all switch ports are equal in DSA, and that makes it
difficult for the bridge to be completely aware of this anyway.
The CPU port needs to transmit tagged packets (VLAN trunk) in order for
the DSA rcv code to be able to decode source information.
But the bridge code has absolutely no idea which switch port is the CPU
port, if nothing else then just because there is no netdevice registered
by DSA for the CPU port.
Also DSA does not currently allow the user to specify that they want the
CPU port to do VLAN trunking anyway. VLANs are added to the CPU port
using the same flags as they were added on the user port.
So the VLANs installed by dsa_port_setup_8021q_tagging per driver
request should remain private from the bridge's and user's perspective,
and should not alter the VLAN semantics observed by the user.
In the current implementation a VLAN range ending at 4095 (VLAN_N_VID)
is reserved for this purpose. Each port receives a unique Rx VLAN and a
unique Tx VLAN. Separate VLANs are needed for Rx and Tx because they
serve different purposes: on Rx the switch must process traffic as
untagged and process it with a port-based VLAN, but with care not to
hinder bridging. On the other hand, the Tx VLAN is where the
reachability restrictions are imposed, since by tagging frames in the
xmit callback we are telling the switch onto which port to steer the
frame.
Some general guidance on how this support might be employed for
real-life hardware (some comments made by Florian Fainelli):
- If the hardware supports VLAN tag stacking, it should somehow back
up its private VLAN settings when the bridge tries to override them.
Then the driver could re-apply them as outer tags. Dedicating an outer
tag per bridge device would allow identical inner tag VID numbers to
co-exist, yet preserve broadcast domain isolation.
- If the switch cannot handle VLAN tag stacking, it should disable this
port separation when added as slave to a vlan_filtering bridge, in
that case having reduced functionality.
- Drivers for old switches that don't support the entire VLAN_N_VID
range will need to rework the current range selection mechanism.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 13:19:22 +03:00
# define DSA_TAG_PROTO_8021Q_VALUE 12
2019-05-05 13:19:27 +03:00
# define DSA_TAG_PROTO_SJA1105_VALUE 13
2019-07-29 19:49:46 +02:00
# define DSA_TAG_PROTO_KSZ8795_VALUE 14
2019-11-14 17:03:29 +02:00
# define DSA_TAG_PROTO_OCELOT_VALUE 15
2019-12-18 09:02:14 +01:00
# define DSA_TAG_PROTO_AR9331_VALUE 16
2020-07-08 14:25:36 +02:00
# define DSA_TAG_PROTO_RTL4_A_VALUE 17
2020-11-03 08:10:54 +01:00
# define DSA_TAG_PROTO_HELLCREEK_VALUE 18
2021-01-14 13:57:32 -06:00
# define DSA_TAG_PROTO_XRS700X_VALUE 19
2021-01-29 03:00:08 +02:00
# define DSA_TAG_PROTO_OCELOT_8021Q_VALUE 20
net: dsa: tag_ocelot: create separate tagger for Seville
The ocelot tagger is a hot mess currently, it relies on memory
initialized by the attached driver for basic frame transmission.
This is against all that DSA tagging protocols stand for, which is that
the transmission and reception of a DSA-tagged frame, the data path,
should be independent from the switch control path, because the tag
protocol is in principle hot-pluggable and reusable across switches
(even if in practice it wasn't until very recently). But if another
driver like dsa_loop wants to make use of tag_ocelot, it couldn't.
This was done to have common code between Felix and Ocelot, which have
one bit difference in the frame header format. Quoting from commit
67c2404922c2 ("net: dsa: felix: create a template for the DSA tags on
xmit"):
Other alternatives have been analyzed, such as:
- Create a separate tag_seville.c: too much code duplication for just 1
bit field difference.
- Create a separate DSA_TAG_PROTO_SEVILLE under tag_ocelot.c, just like
tag_brcm.c, which would have a separate .xmit function. Again, too
much code duplication for just 1 bit field difference.
- Allocate the template from the init function of the tag_ocelot.c
module, instead of from the driver: couldn't figure out a method of
accessing the correct port template corresponding to the correct
tagger in the .xmit function.
The really interesting part is that Seville should have had its own
tagging protocol defined - it is not compatible on the wire with Ocelot,
even for that single bit. In principle, a packet generated by
DSA_TAG_PROTO_OCELOT when booted on NXP LS1028A would look in a certain
way, but when booted on NXP T1040 it would look differently. The reverse
is also true: a packet generated by a Seville switch would be
interpreted incorrectly by Wireshark if it was told it was generated by
an Ocelot switch.
Actually things are a bit more nuanced. If we concentrate only on the
DSA tag, what I said above is true, but Ocelot/Seville also support an
optional DSA tag prefix, which can be short or long, and it is possible
to distinguish the two taggers based on an integer constant put in that
prefix. Nonetheless, creating a separate tagger is still justified,
since the tag prefix is optional, and without it, there is again no way
to distinguish.
Claiming backwards binary compatibility is a bit more tough, since I've
already changed the format of tag_ocelot once, in commit 5124197ce58b
("net: dsa: tag_ocelot: use a short prefix on both ingress and egress").
Therefore I am not very concerned with treating this as a bugfix and
backporting it to stable kernels (which would be another mess due to the
fact that there would be lots of conflicts with the other DSA_TAG_PROTO*
definitions). It's just simpler to say that the string values of the
taggers have ABI value starting with kernel 5.12, which will be when the
changing of tag protocol via /sys/class/net/<dsa-master>/dsa/tagging
goes live.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14 00:37:58 +02:00
# define DSA_TAG_PROTO_SEVILLE_VALUE 21
2021-03-17 11:29:26 +01:00
# define DSA_TAG_PROTO_BRCM_LEGACY_VALUE 22
net: dsa: add support for the SJA1110 native tagging protocol
The SJA1110 has improved a few things compared to SJA1105:
- To send a control packet from the host port with SJA1105, one needed
to program a one-shot "management route" over SPI. This is no longer
true with SJA1110, you can actually send "in-band control extensions"
in the packets sent by DSA, these are in fact DSA tags which contain
the destination port and switch ID.
- When receiving a control packet from the switch with SJA1105, the
source port and switch ID were written in bytes 3 and 4 of the
destination MAC address of the frame (which was a very poor shot at a
DSA header). If the control packet also had an RX timestamp, that
timestamp was sent in an actual follow-up packet, so there were
reordering concerns on multi-core/multi-queue DSA masters, where the
metadata frame with the RX timestamp might get processed before the
actual packet to which that timestamp belonged (there is no way to
pair a packet to its timestamp other than the order in which they were
received). On SJA1110, this is no longer true, control packets have
the source port, switch ID and timestamp all in the DSA tags.
- Timestamps from the switch were partial: to get a 64-bit timestamp as
required by PTP stacks, one would need to take the partial 24-bit or
32-bit timestamp from the packet, then read the current PTP time very
quickly, and then patch in the high bits of the current PTP time into
the captured partial timestamp, to reconstruct what the full 64-bit
timestamp must have been. That is awful because packet processing is
done in NAPI context, but reading the current PTP time is done over
SPI and therefore needs sleepable context.
But it also aggravated a few things:
- Not only is there a DSA header in SJA1110, but there is a DSA trailer
in fact, too. So DSA needs to be extended to support taggers which
have both a header and a trailer. Very unconventional - my understanding
is that the trailer exists because the timestamps couldn't be prepared
in time for putting them in the header area.
- Like SJA1105, not all packets sent to the CPU have the DSA tag added
to them, only control packets do:
* the ones which match the destination MAC filters/traps in
MAC_FLTRES1 and MAC_FLTRES0
* the ones which match FDB entries which have TRAP or TAKETS bits set
So we could in theory hack something up to request the switch to take
timestamps for all packets that reach the CPU, and those would be
DSA-tagged and contain the source port / switch ID by virtue of the
fact that there needs to be a timestamp trailer provided. BUT:
- The SJA1110 does not parse its own DSA tags in a way that is useful
for routing in cross-chip topologies, a la Marvell. And the sja1105
driver already supports cross-chip bridging from the SJA1105 days.
It does that by automatically setting up the DSA links as VLAN trunks
which contain all the necessary tag_8021q RX VLANs that must be
communicated between the switches that span the same bridge. So when
using tag_8021q on sja1105, it is possible to have 2 switches with
ports sw0p0, sw0p1, sw1p0, sw1p1, and 2 VLAN-unaware bridges br0 and
br1, and br0 can take sw0p0 and sw1p0, and br1 can take sw0p1 and
sw1p1, and forwarding will happen according to the expected rules of
the Linux bridge.
We like that, and we don't want that to go away, so as a matter of
fact, the SJA1110 tagger still needs to support tag_8021q.
So the sja1110 tagger is a hybrid between tag_8021q for data packets,
and the native hardware support for control packets.
On RX, packets have a 13-byte trailer if they contain an RX timestamp.
That trailer is padded in such a way that its byte 8 (the start of the
"residence time" field - not parsed by Linux because we don't care) is
aligned on a 16 byte boundary. So the padding has a variable length
between 0 and 15 bytes. The DSA header contains the offset of the
beginning of the padding relative to the beginning of the frame (and the
end of the padding is obviously the end of the packet minus 13 bytes,
the length of the trailer). So we discard it.
Packets which don't have a trailer contain the source port and switch ID
information in the header (they are "trap-to-host" packets). Packets
which have a trailer contain the source port and switch ID in the trailer.
On TX, the destination port mask and switch ID is always in the trailer,
so we always need to say in the header that a trailer is present.
The header needs a custom EtherType and this was chosen as 0xdadc, after
0xdada which is for Marvell and 0xdadb which is for VLANs in
VLAN-unaware mode on SJA1105 (and SJA1110 in fact too).
Because we use tag_8021q in concert with the native tagging protocol,
control packets will have 2 DSA tags.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11 22:01:29 +03:00
# define DSA_TAG_PROTO_SJA1110_VALUE 23
2021-10-18 11:38:00 +02:00
# define DSA_TAG_PROTO_RTL8_4_VALUE 24
2019-04-28 19:37:12 +02:00
2014-09-11 21:18:09 -07:00
enum dsa_tag_protocol {
2019-04-28 19:37:12 +02:00
DSA_TAG_PROTO_NONE = DSA_TAG_PROTO_NONE_VALUE ,
DSA_TAG_PROTO_BRCM = DSA_TAG_PROTO_BRCM_VALUE ,
2021-03-17 11:29:26 +01:00
DSA_TAG_PROTO_BRCM_LEGACY = DSA_TAG_PROTO_BRCM_LEGACY_VALUE ,
2019-04-28 19:37:12 +02:00
DSA_TAG_PROTO_BRCM_PREPEND = DSA_TAG_PROTO_BRCM_PREPEND_VALUE ,
DSA_TAG_PROTO_DSA = DSA_TAG_PROTO_DSA_VALUE ,
DSA_TAG_PROTO_EDSA = DSA_TAG_PROTO_EDSA_VALUE ,
DSA_TAG_PROTO_GSWIP = DSA_TAG_PROTO_GSWIP_VALUE ,
DSA_TAG_PROTO_KSZ9477 = DSA_TAG_PROTO_KSZ9477_VALUE ,
DSA_TAG_PROTO_KSZ9893 = DSA_TAG_PROTO_KSZ9893_VALUE ,
DSA_TAG_PROTO_LAN9303 = DSA_TAG_PROTO_LAN9303_VALUE ,
DSA_TAG_PROTO_MTK = DSA_TAG_PROTO_MTK_VALUE ,
DSA_TAG_PROTO_QCA = DSA_TAG_PROTO_QCA_VALUE ,
DSA_TAG_PROTO_TRAILER = DSA_TAG_PROTO_TRAILER_VALUE ,
net: dsa: Optional VLAN-based port separation for switches without tagging
This patch provides generic DSA code for using VLAN (802.1Q) tags for
the same purpose as a dedicated switch tag for injection/extraction.
It is based on the discussions and interest that has been so far
expressed in https://www.spinics.net/lists/netdev/msg556125.html.
Unlike all other DSA-supported tagging protocols, CONFIG_NET_DSA_TAG_8021Q
does not offer a complete solution for drivers (nor can it). Instead, it
provides generic code that driver can opt into calling:
- dsa_8021q_xmit: Inserts a VLAN header with the specified contents.
Can be called from another tagging protocol's xmit function.
Currently the LAN9303 driver is inserting headers that are simply
802.1Q with custom fields, so this is an opportunity for code reuse.
- dsa_8021q_rcv: Retrieves the TPID and TCI from a VLAN-tagged skb.
Removing the VLAN header is left as a decision for the caller to make.
- dsa_port_setup_8021q_tagging: For each user port, installs an Rx VID
and a Tx VID, for proper untagged traffic identification on ingress
and steering on egress. Also sets up the VLAN trunk on the upstream
(CPU or DSA) port. Drivers are intentionally left to call this
function explicitly, depending on the context and hardware support.
The expected switch behavior and VLAN semantics should not be violated
under any conditions. That is, after calling
dsa_port_setup_8021q_tagging, the hardware should still pass all
ingress traffic, be it tagged or untagged.
For uniformity with the other tagging protocols, a module for the
dsa_8021q_netdev_ops structure is registered, but the typical usage is
to set up another tagging protocol which selects CONFIG_NET_DSA_TAG_8021Q,
and calls the API from tag_8021q.h. Null function definitions are also
provided so that a "depends on" is not forced in the Kconfig.
This tagging protocol only works when switch ports are standalone, or
when they are added to a VLAN-unaware bridge. It will probably remain
this way for the reasons below.
When added to a bridge that has vlan_filtering 1, the bridge core will
install its own VLANs and reset the pvids through switchdev. For the
bridge core, switchdev is a write-only pipe. All VLAN-related state is
kept in the bridge core and nothing is read from DSA/switchdev or from
the driver. So the bridge core will break this port separation because
it will install the vlan_default_pvid into all switchdev ports.
Even if we could teach the bridge driver about switchdev preference of a
certain vlan_default_pvid (task difficult in itself since the current
setting is per-bridge but we would need it per-port), there would still
exist many other challenges.
Firstly, in the DSA rcv callback, a driver would have to perform an
iterative reverse lookup to find the correct switch port. That is
because the port is a bridge slave, so its Rx VID (port PVID) is subject
to user configuration. How would we ensure that the user doesn't reset
the pvid to a different value (which would make an O(1) translation
impossible), or to a non-unique value within this DSA switch tree (which
would make any translation impossible)?
Finally, not all switch ports are equal in DSA, and that makes it
difficult for the bridge to be completely aware of this anyway.
The CPU port needs to transmit tagged packets (VLAN trunk) in order for
the DSA rcv code to be able to decode source information.
But the bridge code has absolutely no idea which switch port is the CPU
port, if nothing else then just because there is no netdevice registered
by DSA for the CPU port.
Also DSA does not currently allow the user to specify that they want the
CPU port to do VLAN trunking anyway. VLANs are added to the CPU port
using the same flags as they were added on the user port.
So the VLANs installed by dsa_port_setup_8021q_tagging per driver
request should remain private from the bridge's and user's perspective,
and should not alter the VLAN semantics observed by the user.
In the current implementation a VLAN range ending at 4095 (VLAN_N_VID)
is reserved for this purpose. Each port receives a unique Rx VLAN and a
unique Tx VLAN. Separate VLANs are needed for Rx and Tx because they
serve different purposes: on Rx the switch must process traffic as
untagged and process it with a port-based VLAN, but with care not to
hinder bridging. On the other hand, the Tx VLAN is where the
reachability restrictions are imposed, since by tagging frames in the
xmit callback we are telling the switch onto which port to steer the
frame.
Some general guidance on how this support might be employed for
real-life hardware (some comments made by Florian Fainelli):
- If the hardware supports VLAN tag stacking, it should somehow back
up its private VLAN settings when the bridge tries to override them.
Then the driver could re-apply them as outer tags. Dedicating an outer
tag per bridge device would allow identical inner tag VID numbers to
co-exist, yet preserve broadcast domain isolation.
- If the switch cannot handle VLAN tag stacking, it should disable this
port separation when added as slave to a vlan_filtering bridge, in
that case having reduced functionality.
- Drivers for old switches that don't support the entire VLAN_N_VID
range will need to rework the current range selection mechanism.
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 13:19:22 +03:00
DSA_TAG_PROTO_8021Q = DSA_TAG_PROTO_8021Q_VALUE ,
2019-05-05 13:19:27 +03:00
DSA_TAG_PROTO_SJA1105 = DSA_TAG_PROTO_SJA1105_VALUE ,
2019-07-29 19:49:46 +02:00
DSA_TAG_PROTO_KSZ8795 = DSA_TAG_PROTO_KSZ8795_VALUE ,
2019-11-14 17:03:29 +02:00
DSA_TAG_PROTO_OCELOT = DSA_TAG_PROTO_OCELOT_VALUE ,
2019-12-18 09:02:14 +01:00
DSA_TAG_PROTO_AR9331 = DSA_TAG_PROTO_AR9331_VALUE ,
2020-07-08 14:25:36 +02:00
DSA_TAG_PROTO_RTL4_A = DSA_TAG_PROTO_RTL4_A_VALUE ,
2020-11-03 08:10:54 +01:00
DSA_TAG_PROTO_HELLCREEK = DSA_TAG_PROTO_HELLCREEK_VALUE ,
2021-01-14 13:57:32 -06:00
DSA_TAG_PROTO_XRS700X = DSA_TAG_PROTO_XRS700X_VALUE ,
2021-01-29 03:00:08 +02:00
DSA_TAG_PROTO_OCELOT_8021Q = DSA_TAG_PROTO_OCELOT_8021Q_VALUE ,
net: dsa: tag_ocelot: create separate tagger for Seville
The ocelot tagger is a hot mess currently, it relies on memory
initialized by the attached driver for basic frame transmission.
This is against all that DSA tagging protocols stand for, which is that
the transmission and reception of a DSA-tagged frame, the data path,
should be independent from the switch control path, because the tag
protocol is in principle hot-pluggable and reusable across switches
(even if in practice it wasn't until very recently). But if another
driver like dsa_loop wants to make use of tag_ocelot, it couldn't.
This was done to have common code between Felix and Ocelot, which have
one bit difference in the frame header format. Quoting from commit
67c2404922c2 ("net: dsa: felix: create a template for the DSA tags on
xmit"):
Other alternatives have been analyzed, such as:
- Create a separate tag_seville.c: too much code duplication for just 1
bit field difference.
- Create a separate DSA_TAG_PROTO_SEVILLE under tag_ocelot.c, just like
tag_brcm.c, which would have a separate .xmit function. Again, too
much code duplication for just 1 bit field difference.
- Allocate the template from the init function of the tag_ocelot.c
module, instead of from the driver: couldn't figure out a method of
accessing the correct port template corresponding to the correct
tagger in the .xmit function.
The really interesting part is that Seville should have had its own
tagging protocol defined - it is not compatible on the wire with Ocelot,
even for that single bit. In principle, a packet generated by
DSA_TAG_PROTO_OCELOT when booted on NXP LS1028A would look in a certain
way, but when booted on NXP T1040 it would look differently. The reverse
is also true: a packet generated by a Seville switch would be
interpreted incorrectly by Wireshark if it was told it was generated by
an Ocelot switch.
Actually things are a bit more nuanced. If we concentrate only on the
DSA tag, what I said above is true, but Ocelot/Seville also support an
optional DSA tag prefix, which can be short or long, and it is possible
to distinguish the two taggers based on an integer constant put in that
prefix. Nonetheless, creating a separate tagger is still justified,
since the tag prefix is optional, and without it, there is again no way
to distinguish.
Claiming backwards binary compatibility is a bit more tough, since I've
already changed the format of tag_ocelot once, in commit 5124197ce58b
("net: dsa: tag_ocelot: use a short prefix on both ingress and egress").
Therefore I am not very concerned with treating this as a bugfix and
backporting it to stable kernels (which would be another mess due to the
fact that there would be lots of conflicts with the other DSA_TAG_PROTO*
definitions). It's just simpler to say that the string values of the
taggers have ABI value starting with kernel 5.12, which will be when the
changing of tag protocol via /sys/class/net/<dsa-master>/dsa/tagging
goes live.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-14 00:37:58 +02:00
DSA_TAG_PROTO_SEVILLE = DSA_TAG_PROTO_SEVILLE_VALUE ,
net: dsa: add support for the SJA1110 native tagging protocol
The SJA1110 has improved a few things compared to SJA1105:
- To send a control packet from the host port with SJA1105, one needed
to program a one-shot "management route" over SPI. This is no longer
true with SJA1110, you can actually send "in-band control extensions"
in the packets sent by DSA, these are in fact DSA tags which contain
the destination port and switch ID.
- When receiving a control packet from the switch with SJA1105, the
source port and switch ID were written in bytes 3 and 4 of the
destination MAC address of the frame (which was a very poor shot at a
DSA header). If the control packet also had an RX timestamp, that
timestamp was sent in an actual follow-up packet, so there were
reordering concerns on multi-core/multi-queue DSA masters, where the
metadata frame with the RX timestamp might get processed before the
actual packet to which that timestamp belonged (there is no way to
pair a packet to its timestamp other than the order in which they were
received). On SJA1110, this is no longer true, control packets have
the source port, switch ID and timestamp all in the DSA tags.
- Timestamps from the switch were partial: to get a 64-bit timestamp as
required by PTP stacks, one would need to take the partial 24-bit or
32-bit timestamp from the packet, then read the current PTP time very
quickly, and then patch in the high bits of the current PTP time into
the captured partial timestamp, to reconstruct what the full 64-bit
timestamp must have been. That is awful because packet processing is
done in NAPI context, but reading the current PTP time is done over
SPI and therefore needs sleepable context.
But it also aggravated a few things:
- Not only is there a DSA header in SJA1110, but there is a DSA trailer
in fact, too. So DSA needs to be extended to support taggers which
have both a header and a trailer. Very unconventional - my understanding
is that the trailer exists because the timestamps couldn't be prepared
in time for putting them in the header area.
- Like SJA1105, not all packets sent to the CPU have the DSA tag added
to them, only control packets do:
* the ones which match the destination MAC filters/traps in
MAC_FLTRES1 and MAC_FLTRES0
* the ones which match FDB entries which have TRAP or TAKETS bits set
So we could in theory hack something up to request the switch to take
timestamps for all packets that reach the CPU, and those would be
DSA-tagged and contain the source port / switch ID by virtue of the
fact that there needs to be a timestamp trailer provided. BUT:
- The SJA1110 does not parse its own DSA tags in a way that is useful
for routing in cross-chip topologies, a la Marvell. And the sja1105
driver already supports cross-chip bridging from the SJA1105 days.
It does that by automatically setting up the DSA links as VLAN trunks
which contain all the necessary tag_8021q RX VLANs that must be
communicated between the switches that span the same bridge. So when
using tag_8021q on sja1105, it is possible to have 2 switches with
ports sw0p0, sw0p1, sw1p0, sw1p1, and 2 VLAN-unaware bridges br0 and
br1, and br0 can take sw0p0 and sw1p0, and br1 can take sw0p1 and
sw1p1, and forwarding will happen according to the expected rules of
the Linux bridge.
We like that, and we don't want that to go away, so as a matter of
fact, the SJA1110 tagger still needs to support tag_8021q.
So the sja1110 tagger is a hybrid between tag_8021q for data packets,
and the native hardware support for control packets.
On RX, packets have a 13-byte trailer if they contain an RX timestamp.
That trailer is padded in such a way that its byte 8 (the start of the
"residence time" field - not parsed by Linux because we don't care) is
aligned on a 16 byte boundary. So the padding has a variable length
between 0 and 15 bytes. The DSA header contains the offset of the
beginning of the padding relative to the beginning of the frame (and the
end of the padding is obviously the end of the packet minus 13 bytes,
the length of the trailer). So we discard it.
Packets which don't have a trailer contain the source port and switch ID
information in the header (they are "trap-to-host" packets). Packets
which have a trailer contain the source port and switch ID in the trailer.
On TX, the destination port mask and switch ID is always in the trailer,
so we always need to say in the header that a trailer is present.
The header needs a custom EtherType and this was chosen as 0xdadc, after
0xdada which is for Marvell and 0xdadb which is for VLANs in
VLAN-unaware mode on SJA1105 (and SJA1110 in fact too).
Because we use tag_8021q in concert with the native tagging protocol,
control packets will have 2 DSA tags.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-11 22:01:29 +03:00
DSA_TAG_PROTO_SJA1110 = DSA_TAG_PROTO_SJA1110_VALUE ,
2021-10-18 11:38:00 +02:00
DSA_TAG_PROTO_RTL8_4 = DSA_TAG_PROTO_RTL8_4_VALUE ,
2014-09-11 21:18:09 -07:00
} ;
2014-08-27 17:04:55 -07:00
2018-02-14 01:07:49 +01:00
struct dsa_switch ;
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 01:34:37 +02:00
struct dsa_switch_tree ;
2014-08-27 17:04:46 -07:00
2017-08-09 14:41:16 +02:00
struct dsa_device_ops {
struct sk_buff * ( * xmit ) ( struct sk_buff * skb , struct net_device * dev ) ;
2021-07-31 17:14:32 +03:00
struct sk_buff * ( * rcv ) ( struct sk_buff * skb , struct net_device * dev ) ;
2020-09-26 22:32:05 +03:00
void ( * flow_dissect ) ( const struct sk_buff * skb , __be16 * proto ,
int * offset ) ;
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 01:34:37 +02:00
int ( * connect ) ( struct dsa_switch_tree * dst ) ;
void ( * disconnect ) ( struct dsa_switch_tree * dst ) ;
2021-06-11 22:01:24 +03:00
unsigned int needed_headroom ;
unsigned int needed_tailroom ;
2019-04-28 19:37:11 +02:00
const char * name ;
2019-04-28 19:37:14 +02:00
enum dsa_tag_protocol proto ;
2020-09-26 22:32:02 +03:00
/* Some tagging protocols either mangle or shift the destination MAC
* address , in which case the DSA master would drop packets on ingress
* if what it understands out of the destination MAC address is not in
* its RX filter .
*/
bool promisc_on_master ;
2017-08-09 14:41:16 +02:00
} ;
2020-07-19 20:49:52 -07:00
/* This structure defines the control interfaces that are overlayed by the
* DSA layer on top of the DSA CPU / management net_device instance . This is
* used by the core net_device layer while calling various net_device_ops
* function pointers .
*/
struct dsa_netdevice_ops {
2021-07-27 15:45:13 +02:00
int ( * ndo_eth_ioctl ) ( struct net_device * dev , struct ifreq * ifr ,
int cmd ) ;
2020-07-19 20:49:52 -07:00
} ;
2019-04-28 19:37:12 +02:00
# define DSA_TAG_DRIVER_ALIAS "dsa_tag-"
# define MODULE_ALIAS_DSA_TAG_DRIVER(__proto) \
MODULE_ALIAS ( DSA_TAG_DRIVER_ALIAS __stringify ( __proto # # _VALUE ) )
2011-11-25 14:32:52 +00:00
struct dsa_switch_tree {
2016-06-04 21:17:07 +02:00
struct list_head list ;
2017-02-03 13:20:20 -05:00
/* Notifier chain for switch-wide events */
struct raw_notifier_head nh ;
2016-06-04 21:17:07 +02:00
/* Tree identifier */
2017-11-03 19:05:21 -04:00
unsigned int index ;
2016-06-04 21:17:07 +02:00
/* Number of switches attached to this tree */
struct kref refcount ;
/* Has this tree been applied to the hardware? */
2017-11-06 16:11:46 -05:00
bool setup ;
2016-06-04 21:17:07 +02:00
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 03:00:05 +02:00
/* Tagging protocol operations */
const struct dsa_device_ops * tag_ops ;
2021-04-20 20:53:10 +02:00
/* Default tagging protocol preferred by the switches in this
* tree .
*/
enum dsa_tag_protocol default_proto ;
2011-11-25 14:32:52 +00:00
/*
* Configuration data for the platform device that owns
* this dsa switch tree instance .
*/
struct dsa_platform_data * pd ;
2019-10-21 16:51:16 -04:00
/* List of switch ports */
struct list_head ports ;
2019-10-30 22:09:13 -04:00
/* List of DSA links composing the routing table */
struct list_head rtable ;
2021-01-13 09:42:53 +01:00
/* Maps offloaded LAG netdevs to a zero-based linear ID for
* drivers that need it .
*/
struct net_device * * lags ;
unsigned int lags_len ;
2021-07-22 18:55:39 +03:00
/* Track the largest switch index within a tree */
unsigned int last_switch ;
2011-11-25 14:32:52 +00:00
} ;
2021-01-13 09:42:53 +01:00
# define dsa_lags_foreach_id(_id, _dst) \
for ( ( _id ) = 0 ; ( _id ) < ( _dst ) - > lags_len ; ( _id ) + + ) \
if ( ( _dst ) - > lags [ ( _id ) ] )
# define dsa_lag_foreach_port(_dp, _dst, _lag) \
list_for_each_entry ( ( _dp ) , & ( _dst ) - > ports , list ) \
if ( ( _dp ) - > lag_dev = = ( _lag ) )
2021-02-09 19:02:12 -06:00
# define dsa_hsr_foreach_port(_dp, _ds, _hsr) \
list_for_each_entry ( ( _dp ) , & ( _ds ) - > dst - > ports , list ) \
if ( ( _dp ) - > ds = = ( _ds ) & & ( _dp ) - > hsr_dev = = ( _hsr ) )
2021-01-13 09:42:53 +01:00
static inline struct net_device * dsa_lag_dev ( struct dsa_switch_tree * dst ,
unsigned int id )
{
return dst - > lags [ id ] ;
}
static inline int dsa_lag_id ( struct dsa_switch_tree * dst ,
struct net_device * lag )
{
unsigned int id ;
dsa_lags_foreach_id ( id , dst ) {
if ( dsa_lag_dev ( dst , id ) = = lag )
return id ;
}
return - ENODEV ;
}
2020-03-29 14:51:59 +03:00
/* TC matchall action types */
2017-01-30 12:41:40 -08:00
enum dsa_port_mall_action_type {
DSA_PORT_MALL_MIRROR ,
2020-03-29 14:51:59 +03:00
DSA_PORT_MALL_POLICER ,
2017-01-30 12:41:40 -08:00
} ;
/* TC mirroring entry */
struct dsa_mall_mirror_tc_entry {
u8 to_local_port ;
bool ingress ;
} ;
2020-03-29 14:51:59 +03:00
/* TC port policer entry */
struct dsa_mall_policer_tc_entry {
2020-06-29 14:54:16 +08:00
u32 burst ;
2020-03-29 14:51:59 +03:00
u64 rate_bytes_per_sec ;
} ;
2017-01-30 12:41:40 -08:00
/* TC matchall entry */
struct dsa_mall_tc_entry {
struct list_head list ;
unsigned long cookie ;
enum dsa_port_mall_action_type type ;
union {
struct dsa_mall_mirror_tc_entry mirror ;
2020-03-29 14:51:59 +03:00
struct dsa_mall_policer_tc_entry policer ;
2017-01-30 12:41:40 -08:00
} ;
} ;
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
struct dsa_bridge {
struct net_device * dev ;
unsigned int num ;
2021-12-06 18:57:58 +02:00
bool tx_fwd_offload ;
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
refcount_t refcount ;
} ;
2017-01-30 12:41:40 -08:00
2016-06-04 21:16:57 +02:00
struct dsa_port {
2017-10-16 11:12:18 -04:00
/* A CPU port is physically connected to a master device.
* A user port exposed to userspace has a slave device .
*/
union {
struct net_device * master ;
struct net_device * slave ;
} ;
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 03:00:05 +02:00
/* Copy of the tagging protocol operations, for quicker access
* in the data path . Valid only for the CPU ports .
*/
2017-09-29 17:19:18 -04:00
const struct dsa_device_ops * tag_ops ;
2017-09-29 17:19:19 -04:00
/* Copies for faster access in master receive hot path */
struct dsa_switch_tree * dst ;
2021-07-31 17:14:32 +03:00
struct sk_buff * ( * rcv ) ( struct sk_buff * skb , struct net_device * dev ) ;
2017-09-29 17:19:19 -04:00
2017-10-26 11:22:57 -04:00
enum {
DSA_PORT_TYPE_UNUSED = 0 ,
DSA_PORT_TYPE_CPU ,
DSA_PORT_TYPE_DSA ,
DSA_PORT_TYPE_USER ,
} type ;
2017-01-27 15:29:38 -05:00
struct dsa_switch * ds ;
unsigned int index ;
2017-02-04 13:02:43 -08:00
const char * name ;
2019-06-14 13:49:20 -04:00
struct dsa_port * cpu_dp ;
of: net: pass the dst buffer to of_get_mac_address()
of_get_mac_address() returns a "const void*" pointer to a MAC address.
Lately, support to fetch the MAC address by an NVMEM provider was added.
But this will only work with platform devices. It will not work with
PCI devices (e.g. of an integrated root complex) and esp. not with DSA
ports.
There is an of_* variant of the nvmem binding which works without
devices. The returned data of a nvmem_cell_read() has to be freed after
use. On the other hand the return of_get_mac_address() points to some
static data without a lifetime. The trick for now, was to allocate a
device resource managed buffer which is then returned. This will only
work if we have an actual device.
Change it, so that the caller of of_get_mac_address() has to supply a
buffer where the MAC address is written to. Unfortunately, this will
touch all drivers which use the of_get_mac_address().
Usually the code looks like:
const char *addr;
addr = of_get_mac_address(np);
if (!IS_ERR(addr))
ether_addr_copy(ndev->dev_addr, addr);
This can then be simply rewritten as:
of_get_mac_address(np, ndev->dev_addr);
Sometimes is_valid_ether_addr() is used to test the MAC address.
of_get_mac_address() already makes sure, it just returns a valid MAC
address. Thus we can just test its return code. But we have to be
careful if there are still other sources for the MAC address before the
of_get_mac_address(). In this case we have to keep the
is_valid_ether_addr() call.
The following coccinelle patch was used to convert common cases to the
new style. Afterwards, I've manually gone over the drivers and fixed the
return code variable: either used a new one or if one was already
available use that. Mansour Moufid, thanks for that coccinelle patch!
<spml>
@a@
identifier x;
expression y, z;
@@
- x = of_get_mac_address(y);
+ x = of_get_mac_address(y, z);
<...
- ether_addr_copy(z, x);
...>
@@
identifier a.x;
@@
- if (<+... x ...+>) {}
@@
identifier a.x;
@@
if (<+... x ...+>) {
...
}
- else {}
@@
identifier a.x;
expression e;
@@
- if (<+... x ...+>@e)
- {}
- else
+ if (!(e))
{...}
@@
expression x, y, z;
@@
- x = of_get_mac_address(y, z);
+ of_get_mac_address(y, z);
... when != x
</spml>
All drivers, except drivers/net/ethernet/aeroflex/greth.c, were
compile-time tested.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-04-12 19:47:17 +02:00
u8 mac [ ETH_ALEN ] ;
2016-06-04 21:16:58 +02:00
struct device_node * dn ;
2016-07-18 20:45:38 -04:00
unsigned int ageing_time ;
2019-04-28 21:45:43 +03:00
bool vlan_filtering ;
2021-08-08 17:35:26 +03:00
/* Managed by DSA on user ports and by drivers on CPU and DSA ports */
net: dsa: centralize fast ageing when address learning is turned off
Currently DSA leaves it down to device drivers to fast age the FDB on a
port when address learning is disabled on it. There are 2 reasons for
doing that in the first place:
- when address learning is disabled by user space, through
IFLA_BRPORT_LEARNING or the brport_attr_learning sysfs, what user
space typically wants to achieve is to operate in a mode with no
dynamic FDB entry on that port. But if the port is already up, some
addresses might have been already learned on it, and it seems silly to
wait for 5 minutes for them to expire until something useful can be
done.
- when a port leaves a bridge and becomes standalone, DSA turns off
address learning on it. This also has the nice side effect of flushing
the dynamically learned bridge FDB entries on it, which is a good idea
because standalone ports should not have bridge FDB entries on them.
We let drivers manage fast ageing under this condition because if DSA
were to do it, it would need to track each port's learning state, and
act upon the transition, which it currently doesn't.
But there are 2 reasons why doing it is better after all:
- drivers might get it wrong and not do it (see b53_port_set_learning)
- we would like to flush the dynamic entries from the software bridge
too, and letting drivers do that would be another pain point
So track the port learning state and trigger a fast age process
automatically within DSA.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-08 17:35:23 +03:00
bool learning ;
2016-09-22 16:49:22 -04:00
u8 stp_state ;
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
struct dsa_bridge * bridge ;
2017-03-28 23:45:07 +02:00
struct devlink_port devlink_port ;
2020-10-04 18:12:53 +02:00
bool devlink_port_setup ;
2018-05-10 13:17:36 -07:00
struct phylink * pl ;
2019-05-28 20:38:12 +03:00
struct phylink_config pl_config ;
2021-01-13 09:42:53 +01:00
struct net_device * lag_dev ;
bool lag_tx_enabled ;
2021-02-09 19:02:12 -06:00
struct net_device * hsr_dev ;
2019-05-05 13:19:25 +03:00
2019-10-21 16:51:16 -04:00
struct list_head list ;
2017-06-13 13:27:20 -07:00
/*
* Original copy of the master netdev ethtool_ops
*/
const struct ethtool_ops * orig_ethtool_ops ;
2019-01-15 14:43:04 -08:00
/*
* Original copy of the master netdev net_device_ops
*/
2020-07-19 20:49:52 -07:00
const struct dsa_netdevice_ops * netdev_ops ;
2019-10-21 16:51:19 -04:00
net: dsa: reference count the MDB entries at the cross-chip notifier level
Ever since the cross-chip notifiers were introduced, the design was
meant to be simplistic and just get the job done without worrying too
much about dangling resources left behind.
For example, somebody installs an MDB entry on sw0p0 in this daisy chain
topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
sw1p4 and sw2p4.
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Then the same person deletes that MDB entry. The cross-chip notifier for
deletion only matches sw0p0:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
Why?
Because the DSA links are 'trunk' ports, if we just go ahead and delete
the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
entries when they are still needed. Just consider the fact that somebody
does:
- add a multicast MAC address towards sw0p0 [ via the cross-chip
notifiers it gets installed on the DSA links too ]
- add the same multicast MAC address towards sw0p1 (another port of that
same switch)
- delete the same multicast MAC address from sw0p0.
At this point, if we deleted the MAC address from the DSA links, it
would be flooded, even though there is still an entry on switch 0 which
needs it not to.
So that is why deletions only match the targeted source port and nothing
on DSA links. Of course, dangling resources means that the hardware
tables will eventually run out given enough additions/removals, but hey,
at least it's simple.
But there is a bigger concern which needs to be addressed, and that is
our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
on the upstream port, and a similar thing happens on deletion:
dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
upstream port.
When there are 2 VLAN-unaware bridges spanning the same switch (which is
a use case DSA proudly supports), each bridge will install its own
SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
user ports enslaved to br0 and the user ports enslaved to br1. Not good.
The host-trapped multicast addresses installed by br1 will be deleted
when any state changes in br0 (IGMP timers expire, or ports leave, etc).
To avoid this, we could of course go the route of the zero-sum game and
delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
design is to just admit that on shared ports like DSA links and CPU
ports, we should be reference counting calls, even if this consumes some
dynamic memory which DSA has traditionally avoided. On the flip side,
the hardware tables of switches are limited in size, so it would be good
if the OS managed them properly instead of having them eventually
overflow.
To address the memory usage concern, we only apply the refcounting of
MDB entries on ports that are really shared (CPU ports and DSA links)
and not on user ports. In a typical single-switch setup, this means only
the CPU port (and the host MDB entries are not that many, really).
The name of the newly introduced data structures (dsa_mac_addr) is
chosen in such a way that will be reusable for host FDB entries (next
patch).
With this change, we can finally have the same matching logic for the
MDB additions and deletions, as well as for their host-trapped variants.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 17:06:50 +03:00
/* List of MAC addresses that must be forwarded on this port.
* These are only valid on CPU ports and DSA links .
*/
net: dsa: introduce locking for the address lists on CPU and DSA ports
Now that the rtnl_mutex is going away for dsa_port_{host_,}fdb_{add,del},
no one is serializing access to the address lists that DSA keeps for the
purpose of reference counting on shared ports (CPU and cascade ports).
It can happen for one dsa_switch_do_fdb_del to do list_del on a dp->fdbs
element while another dsa_switch_do_fdb_{add,del} is traversing dp->fdbs.
We need to avoid that.
Currently dp->mdbs is not at risk, because dsa_switch_do_mdb_{add,del}
still runs under the rtnl_mutex. But it would be nice if it would not
depend on that being the case. So let's introduce a mutex per port (the
address lists are per port too) and share it between dp->mdbs and
dp->fdbs.
The place where we put the locking is interesting. It could be tempting
to put a DSA-level lock which still serializes calls to
.port_fdb_{add,del}, but it would still not avoid concurrency with other
driver code paths that are currently under rtnl_mutex (.port_fdb_dump,
.port_fast_age). So it would add a very false sense of security (and
adding a global switch-wide lock in DSA to resynchronize with the
rtnl_lock is also counterproductive and hard).
So the locking is intentionally done only where the dp->fdbs and dp->mdbs
lists are traversed. That means, from a driver perspective, that
.port_fdb_add will be called with the dp->addr_lists_lock mutex held on
the CPU port, but not held on user ports. This is done so that driver
writers are not encouraged to rely on any guarantee offered by
dp->addr_lists_lock.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-24 20:17:54 +03:00
struct mutex addr_lists_lock ;
2021-06-29 17:06:52 +03:00
struct list_head fdbs ;
net: dsa: reference count the MDB entries at the cross-chip notifier level
Ever since the cross-chip notifiers were introduced, the design was
meant to be simplistic and just get the job done without worrying too
much about dangling resources left behind.
For example, somebody installs an MDB entry on sw0p0 in this daisy chain
topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
sw1p4 and sw2p4.
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Then the same person deletes that MDB entry. The cross-chip notifier for
deletion only matches sw0p0:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
Why?
Because the DSA links are 'trunk' ports, if we just go ahead and delete
the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
entries when they are still needed. Just consider the fact that somebody
does:
- add a multicast MAC address towards sw0p0 [ via the cross-chip
notifiers it gets installed on the DSA links too ]
- add the same multicast MAC address towards sw0p1 (another port of that
same switch)
- delete the same multicast MAC address from sw0p0.
At this point, if we deleted the MAC address from the DSA links, it
would be flooded, even though there is still an entry on switch 0 which
needs it not to.
So that is why deletions only match the targeted source port and nothing
on DSA links. Of course, dangling resources means that the hardware
tables will eventually run out given enough additions/removals, but hey,
at least it's simple.
But there is a bigger concern which needs to be addressed, and that is
our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
on the upstream port, and a similar thing happens on deletion:
dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
upstream port.
When there are 2 VLAN-unaware bridges spanning the same switch (which is
a use case DSA proudly supports), each bridge will install its own
SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
user ports enslaved to br0 and the user ports enslaved to br1. Not good.
The host-trapped multicast addresses installed by br1 will be deleted
when any state changes in br0 (IGMP timers expire, or ports leave, etc).
To avoid this, we could of course go the route of the zero-sum game and
delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
design is to just admit that on shared ports like DSA links and CPU
ports, we should be reference counting calls, even if this consumes some
dynamic memory which DSA has traditionally avoided. On the flip side,
the hardware tables of switches are limited in size, so it would be good
if the OS managed them properly instead of having them eventually
overflow.
To address the memory usage concern, we only apply the refcounting of
MDB entries on ports that are really shared (CPU ports and DSA links)
and not on user ports. In a typical single-switch setup, this means only
the CPU port (and the host MDB entries are not that many, really).
The name of the newly introduced data structures (dsa_mac_addr) is
chosen in such a way that will be reusable for host FDB entries (next
patch).
With this change, we can finally have the same matching logic for the
MDB additions and deletions, as well as for their host-trapped variants.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 17:06:50 +03:00
struct list_head mdbs ;
2019-10-21 16:51:19 -04:00
bool setup ;
2016-06-04 21:16:57 +02:00
} ;
2019-10-30 22:09:13 -04:00
/* TODO: ideally DSA ports would have a single dp->link_dp member,
* and no dst - > rtable nor this struct dsa_link would be needed ,
* but this would require some more complex tree walking ,
* so keep it stupid at the moment and list them all .
*/
struct dsa_link {
struct dsa_port * dp ;
struct dsa_port * link_dp ;
struct list_head list ;
} ;
net: dsa: reference count the MDB entries at the cross-chip notifier level
Ever since the cross-chip notifiers were introduced, the design was
meant to be simplistic and just get the job done without worrying too
much about dangling resources left behind.
For example, somebody installs an MDB entry on sw0p0 in this daisy chain
topology. It gets installed using ds->ops->port_mdb_add() on sw0p0,
sw1p4 and sw2p4.
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ x ]
Then the same person deletes that MDB entry. The cross-chip notifier for
deletion only matches sw0p0:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
[ x ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
[ ] [ ] [ ] [ ] [ ]
Why?
Because the DSA links are 'trunk' ports, if we just go ahead and delete
the MDB from sw1p4 and sw2p4 directly, we might delete those multicast
entries when they are still needed. Just consider the fact that somebody
does:
- add a multicast MAC address towards sw0p0 [ via the cross-chip
notifiers it gets installed on the DSA links too ]
- add the same multicast MAC address towards sw0p1 (another port of that
same switch)
- delete the same multicast MAC address from sw0p0.
At this point, if we deleted the MAC address from the DSA links, it
would be flooded, even though there is still an entry on switch 0 which
needs it not to.
So that is why deletions only match the targeted source port and nothing
on DSA links. Of course, dangling resources means that the hardware
tables will eventually run out given enough additions/removals, but hey,
at least it's simple.
But there is a bigger concern which needs to be addressed, and that is
our support for SWITCHDEV_OBJ_ID_HOST_MDB. DSA simply translates such an
object into a dsa_port_host_mdb_add() which ends up as ds->ops->port_mdb_add()
on the upstream port, and a similar thing happens on deletion:
dsa_port_host_mdb_del() will trigger ds->ops->port_mdb_del() on the
upstream port.
When there are 2 VLAN-unaware bridges spanning the same switch (which is
a use case DSA proudly supports), each bridge will install its own
SWITCHDEV_OBJ_ID_HOST_MDB entries. But upon deletion, DSA goes ahead and
emits a DSA_NOTIFIER_MDB_DEL for dp->cpu_dp, which is shared between the
user ports enslaved to br0 and the user ports enslaved to br1. Not good.
The host-trapped multicast addresses installed by br1 will be deleted
when any state changes in br0 (IGMP timers expire, or ports leave, etc).
To avoid this, we could of course go the route of the zero-sum game and
delete the DSA_NOTIFIER_MDB_DEL call for dp->cpu_dp. But the better
design is to just admit that on shared ports like DSA links and CPU
ports, we should be reference counting calls, even if this consumes some
dynamic memory which DSA has traditionally avoided. On the flip side,
the hardware tables of switches are limited in size, so it would be good
if the OS managed them properly instead of having them eventually
overflow.
To address the memory usage concern, we only apply the refcounting of
MDB entries on ports that are really shared (CPU ports and DSA links)
and not on user ports. In a typical single-switch setup, this means only
the CPU port (and the host MDB entries are not that many, really).
The name of the newly introduced data structures (dsa_mac_addr) is
chosen in such a way that will be reusable for host FDB entries (next
patch).
With this change, we can finally have the same matching logic for the
MDB additions and deletions, as well as for their host-trapped variants.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-06-29 17:06:50 +03:00
struct dsa_mac_addr {
unsigned char addr [ ETH_ALEN ] ;
u16 vid ;
refcount_t refcount ;
struct list_head list ;
} ;
2011-11-27 17:06:08 +00:00
struct dsa_switch {
2019-10-21 16:51:19 -04:00
bool setup ;
2016-05-10 23:27:23 +02:00
struct device * dev ;
2011-11-27 17:06:08 +00:00
/*
* Parent switch tree , and switch index .
*/
struct dsa_switch_tree * dst ;
2017-11-03 19:05:20 -04:00
unsigned int index ;
2011-11-27 17:06:08 +00:00
2017-02-03 13:20:20 -05:00
/* Listener for switch fabric events */
struct notifier_block nb ;
2016-04-13 02:40:40 +02:00
/*
* Give the switch driver somewhere to hang its private data
* structure .
*/
void * priv ;
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 01:34:37 +02:00
void * tagger_data ;
2011-11-27 17:06:08 +00:00
/*
* Configuration data for this switch .
*/
2016-05-10 23:27:24 +02:00
struct dsa_chip_data * cd ;
2011-11-27 17:06:08 +00:00
/*
2016-08-23 12:38:56 -04:00
* The switch operations .
2011-11-27 17:06:08 +00:00
*/
2017-01-08 14:52:08 -08:00
const struct dsa_switch_ops * ops ;
2011-11-27 17:06:08 +00:00
/*
* Slave mii_bus and devices for the individual ports .
*/
2014-08-27 17:04:51 -07:00
u32 phys_mii_mask ;
2011-11-27 17:06:08 +00:00
struct mii_bus * slave_mii_bus ;
2017-01-27 15:29:36 -05:00
2017-03-15 15:53:49 -04:00
/* Ageing Time limits in msecs */
unsigned int ageing_time_min ;
unsigned int ageing_time_max ;
net: dsa: let the core manage the tag_8021q context
The basic problem description is as follows:
Be there 3 switches in a daisy chain topology:
|
sw0p0 sw0p1 sw0p2 sw0p3 sw0p4
[ user ] [ user ] [ user ] [ dsa ] [ cpu ]
|
+---------+
|
sw1p0 sw1p1 sw1p2 sw1p3 sw1p4
[ user ] [ user ] [ user ] [ dsa ] [ dsa ]
|
+---------+
|
sw2p0 sw2p1 sw2p2 sw2p3 sw2p4
[ user ] [ user ] [ user ] [ user ] [ dsa ]
The CPU will not be able to ping through the user ports of the
bottom-most switch (like for example sw2p0), simply because tag_8021q
was not coded up for this scenario - it has always assumed DSA switch
trees with a single switch.
To add support for the topology above, we must admit that the RX VLAN of
sw2p0 must be added on some ports of switches 0 and 1 as well. This is
in fact a textbook example of thing that can use the cross-chip notifier
framework that DSA has set up in switch.c.
There is only one problem: core DSA (switch.c) is not able right now to
make the connection between a struct dsa_switch *ds and a struct
dsa_8021q_context *ctx. Right now, it is drivers who call into
tag_8021q.c and always provide a struct dsa_8021q_context *ctx pointer,
and tag_8021q.c calls them back with the .tag_8021q_vlan_{add,del}
methods.
But with cross-chip notifiers, it is possible for tag_8021q to call
drivers without drivers having ever asked for anything. A good example
is right above: when sw2p0 wants to set itself up for tag_8021q,
the .tag_8021q_vlan_add method needs to be called for switches 1 and 0,
so that they transport sw2p0's VLANs towards the CPU without dropping
them.
So instead of letting drivers manage the tag_8021q context, add a
tag_8021q_ctx pointer inside of struct dsa_switch, which will be
populated when dsa_tag_8021q_register() returns success.
The patch is fairly long-winded because we are partly reverting commit
5899ee367ab3 ("net: dsa: tag_8021q: add a context structure") which made
the driver-facing tag_8021q API use "ctx" instead of "ds". Now that we
can access "ctx" directly from "ds", this is no longer needed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-19 20:14:48 +03:00
/* Storage for drivers using tag_8021q */
struct dsa_8021q_context * tag_8021q_ctx ;
2017-03-28 23:45:07 +02:00
/* devlink used to represent this switch device */
struct devlink * devlink ;
2017-09-03 20:27:00 -07:00
/* Number of switch port queues */
unsigned int num_tx_queues ;
2019-04-28 21:45:44 +03:00
/* Disallow bridge core from requesting different VLAN awareness
* settings on ports if not hardware - supported
*/
bool vlan_filtering_is_global ;
net: dsa: let drivers state that they need VLAN filtering while standalone
As explained in commit e358bef7c392 ("net: dsa: Give drivers the chance
to veto certain upper devices"), the hellcreek driver uses some tricks
to comply with the network stack expectations: it enforces port
separation in standalone mode using VLANs. For untagged traffic,
bridging between ports is prevented by using different PVIDs, and for
VLAN-tagged traffic, it never accepts 8021q uppers with the same VID on
two ports, so packets with one VLAN cannot leak from one port to another.
That is almost fine*, and has worked because hellcreek relied on an
implicit behavior of the DSA core that was changed by the previous
patch: the standalone ports declare the 'rx-vlan-filter' feature as 'on
[fixed]'. Since most of the DSA drivers are actually VLAN-unaware in
standalone mode, that feature was actually incorrectly reflecting the
hardware/driver state, so there was a desire to fix it. This leaves the
hellcreek driver in a situation where it has to explicitly request this
behavior from the DSA framework.
We configure the ports as follows:
- Standalone: 'rx-vlan-filter' is on. An 8021q upper on top of a
standalone hellcreek port will go through dsa_slave_vlan_rx_add_vid
and will add a VLAN to the hardware tables, giving the driver the
opportunity to refuse it through .port_prechangeupper.
- Bridged with vlan_filtering=0: 'rx-vlan-filter' is off. An 8021q upper
on top of a bridged hellcreek port will not go through
dsa_slave_vlan_rx_add_vid, because there will not be any attempt to
offload this VLAN. The driver already disables VLAN awareness, so that
upper should receive the traffic it needs.
- Bridged with vlan_filtering=1: 'rx-vlan-filter' is on. An 8021q upper
on top of a bridged hellcreek port will call dsa_slave_vlan_rx_add_vid,
and can again be vetoed through .port_prechangeupper.
*It is not actually completely fine, because if I follow through
correctly, we can have the following situation:
ip link add br0 type bridge vlan_filtering 0
ip link set lan0 master br0 # lan0 now becomes VLAN-unaware
ip link set lan0 nomaster # lan0 fails to become VLAN-aware again, therefore breaking isolation
This patch fixes that corner case by extending the DSA core logic, based
on this requested attribute, to change the VLAN awareness state of the
switch (port) when it leaves the bridge.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-24 00:22:58 +03:00
/* Keep VLAN filtering enabled on ports not offloading any upper. */
bool needs_standalone_vlan_filtering ;
2020-05-12 20:20:25 +03:00
/* Pass .port_vlan_add and .port_vlan_del to drivers even for bridges
* that have vlan_filtering = 0. All drivers should ideally set this ( and
* then the option would get removed ) , but it is unknown whether this
* would break things or not .
*/
bool configure_vlan_while_not_filtering ;
2020-10-01 19:42:12 -07:00
/* If the switch driver always programs the CPU port as egress tagged
* despite the VLAN configuration indicating otherwise , then setting
* @ untag_bridge_pvid will force the DSA receive path to pop the bridge ' s
* default_pvid VLAN tagged frames to offer a consistent behavior
* between a vlan_filtering = 0 and vlan_filtering = 1 bridge device .
*/
bool untag_bridge_pvid ;
net: dsa: listen for SWITCHDEV_{FDB,DEL}_ADD_TO_DEVICE on foreign bridge neighbors
Some DSA switches (and not only) cannot learn source MAC addresses from
packets injected from the CPU. They only perform hardware address
learning from inbound traffic.
This can be problematic when we have a bridge spanning some DSA switch
ports and some non-DSA ports (which we'll call "foreign interfaces" from
DSA's perspective).
There are 2 classes of problems created by the lack of learning on
CPU-injected traffic:
- excessive flooding, due to the fact that DSA treats those addresses as
unknown
- the risk of stale routes, which can lead to temporary packet loss
To illustrate the second class, consider the following situation, which
is common in production equipment (wireless access points, where there
is a WLAN interface and an Ethernet switch, and these form a single
bridging domain).
AP 1:
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
| ^ ^
| | |
| | |
| Client A Client B
|
|
|
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
AP 2
- br0 of AP 1 will know that Clients A and B are reachable via wlan0
- the hardware fdb of a DSA switch driver today is not kept in sync with
the software entries on other bridge ports, so it will not know that
clients A and B are reachable via the CPU port UNLESS the hardware
switch itself performs SA learning from traffic injected from the CPU.
Nonetheless, a substantial number of switches don't.
- the hardware fdb of the DSA switch on AP 2 may autonomously learn that
Client A and B are reachable through swp0. Therefore, the software br0
of AP 2 also may or may not learn this. In the example we're
illustrating, some Ethernet traffic has been going on, and br0 from AP
2 has indeed learnt that it can reach Client B through swp0.
One of the wireless clients, say Client B, disconnects from AP 1 and
roams to AP 2. The topology now looks like this:
AP 1:
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
| ^
| |
| Client A
|
|
| Client B
| |
| v
+------------+ +------------+ +------------+ +------------+ +------------+
| swp0 | | swp1 | | swp2 | | swp3 | | wlan0 |
+------------+ +------------+ +------------+ +------------+ +------------+
+------------------------------------------------------------------------+
| br0 |
+------------------------------------------------------------------------+
AP 2
- br0 of AP 1 still knows that Client A is reachable via wlan0 (no change)
- br0 of AP 1 will (possibly) know that Client B has left wlan0. There
are cases where it might never find out though. Either way, DSA today
does not process that notification in any way.
- the hardware FDB of the DSA switch on AP 1 may learn autonomously that
Client B can be reached via swp0, if it receives any packet with
Client 1's source MAC address over Ethernet.
- the hardware FDB of the DSA switch on AP 2 still thinks that Client B
can be reached via swp0. It does not know that it has roamed to wlan0,
because it doesn't perform SA learning from the CPU port.
Now Client A contacts Client B.
AP 1 routes the packet fine towards swp0 and delivers it on the Ethernet
segment.
AP 2 sees a frame on swp0 and its fdb says that the destination is swp0.
Hairpinning is disabled => drop.
This problem comes from the fact that these switches have a 'blind spot'
for addresses coming from software bridging. The generic solution is not
to assume that hardware learning can be enabled somehow, but to listen
to more bridge learning events. It turns out that the bridge driver does
learn in software from all inbound frames, in __br_handle_local_finish.
A proper SWITCHDEV_FDB_ADD_TO_DEVICE notification is emitted for the
addresses serviced by the bridge on 'foreign' interfaces. The software
bridge also does the right thing on migration, by notifying that the old
entry is deleted, so that does not need to be special-cased in DSA. When
it is deleted, we just need to delete our static FDB entry towards the
CPU too, and wait.
The problem is that DSA currently only cares about SWITCHDEV_FDB_ADD_TO_DEVICE
events received on its own interfaces, such as static FDB entries.
Luckily we can change that, and DSA can listen to all switchdev FDB
add/del events in the system and figure out if those events were emitted
by a bridge that spans at least one of DSA's own ports. In case that is
true, DSA will also offload that address towards its own CPU port, in
the eventuality that there might be bridge clients attached to the DSA
switch who want to talk to the station connected to the foreign
interface.
In terms of implementation, we need to keep the fdb_info->added_by_user
check for the case where the switchdev event was targeted directly at a
DSA switch port. But we don't need to look at that flag for snooped
events. So the check is currently too late, we need to move it earlier.
This also simplifies the code a bit, since we avoid uselessly allocating
and freeing switchdev_work.
We could probably do some improvements in the future. For example,
multi-bridge support is rudimentary at the moment. If there are two
bridges spanning a DSA switch's ports, and both of them need to service
the same MAC address, then what will happen is that the migration of one
of those stations will trigger the deletion of the FDB entry from the
CPU port while it is still used by other bridge. That could be improved
with reference counting but is left for another time.
This behavior needs to be enabled at driver level by setting
ds->assisted_learning_on_cpu_port = true. This is because we don't want
to inflict a potential performance penalty (accesses through
MDIO/I2C/SPI are expensive) to hardware that really doesn't need it
because address learning on the CPU port works there.
Reported-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-06 11:51:35 +02:00
/* Let DSA manage the FDB entries towards the CPU, based on the
* software bridge database .
*/
bool assisted_learning_on_cpu_port ;
2019-04-28 21:45:48 +03:00
/* In case vlan_filtering_is_global is set, the VLAN awareness state
* should be retrieved from here and not from the per - port settings .
*/
bool vlan_filtering ;
2020-01-06 03:34:12 +02:00
/* MAC PCS does not provide link state change interrupt, and requires
* polling . Flag passed on to PHYLINK .
*/
bool pcs_poll ;
net: dsa: implement auto-normalization of MTU for bridge hardware datapath
Many switches don't have an explicit knob for configuring the MTU
(maximum transmission unit per interface). Instead, they do the
length-based packet admission checks on the ingress interface, for
reasons that are easy to understand (why would you accept a packet in
the queuing subsystem if you know you're going to drop it anyway).
So it is actually the MRU that these switches permit configuring.
In Linux there only exists the IFLA_MTU netlink attribute and the
associated dev_set_mtu function. The comments like to play blind and say
that it's changing the "maximum transfer unit", which is to say that
there isn't any directionality in the meaning of the MTU word. So that
is the interpretation that this patch is giving to things: MTU == MRU.
When 2 interfaces having different MTUs are bridged, the bridge driver
MTU auto-adjustment logic kicks in: what br_mtu_auto_adjust() does is it
adjusts the MTU of the bridge net device itself (and not that of the
slave net devices) to the minimum value of all slave interfaces, in
order for forwarded packets to not exceed the MTU regardless of the
interface they are received and send on.
The idea behind this behavior, and why the slave MTUs are not adjusted,
is that normal termination from Linux over the L2 forwarding domain
should happen over the bridge net device, which _is_ properly limited by
the minimum MTU. And termination over individual slave devices is
possible even if those are bridged. But that is not "forwarding", so
there's no reason to do normalization there, since only a single
interface sees that packet.
The problem with those switches that can only control the MRU is with
the offloaded data path, where a packet received on an interface with
MRU 9000 would still be forwarded to an interface with MRU 1500. And the
br_mtu_auto_adjust() function does not really help, since the MTU
configured on the bridge net device is ignored.
In order to enforce the de-facto MTU == MRU rule for these switches, we
need to do MTU normalization, which means: in order for no packet larger
than the MTU configured on this port to be sent, then we need to limit
the MRU on all ports that this packet could possibly come from. AKA
since we are configuring the MRU via MTU, it means that all ports within
a bridge forwarding domain should have the same MTU.
And that is exactly what this patch is trying to do.
>From an implementation perspective, we try to follow the intent of the
user, otherwise there is a risk that we might livelock them (they try to
change the MTU on an already-bridged interface, but we just keep
changing it back in an attempt to keep the MTU normalized). So the MTU
that the bridge is normalized to is either:
- The most recently changed one:
ip link set dev swp0 master br0
ip link set dev swp1 master br0
ip link set dev swp0 mtu 1400
This sequence will make swp1 inherit MTU 1400 from swp0.
- The one of the most recently added interface to the bridge:
ip link set dev swp0 master br0
ip link set dev swp1 mtu 1400
ip link set dev swp1 master br0
The above sequence will make swp0 inherit MTU 1400 as well.
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 21:55:43 +02:00
/* For switches that only have the MRU configurable. To ensure the
* configured MTU is not exceeded , normalization of MRU on all bridged
* interfaces is needed .
*/
bool mtu_enforcement_ingress ;
2021-01-13 09:42:53 +01:00
/* Drivers that benefit from having an ID associated with each
* offloaded LAG should set this to the maximum number of
* supported IDs . DSA will then maintain a mapping of _at
* least_ these many IDs , accessible to drivers via
* dsa_lag_id ( ) .
*/
unsigned int num_lag_ids ;
2021-12-06 18:57:48 +02:00
/* Drivers that support bridge forwarding offload or FDB isolation
* should set this to the maximum number of bridges spanning the same
* switch tree ( or all trees , in the case of cross - tree bridging
* support ) that can be offloaded .
net: dsa: add support for bridge TX forwarding offload
For a DSA switch, to offload the forwarding process of a bridge device
means to send the packets coming from the software bridge as data plane
packets. This is contrary to everything that DSA has done so far,
because the current taggers only know to send control packets (ones that
target a specific destination port), whereas data plane packets are
supposed to be forwarded according to the FDB lookup, much like packets
ingressing on any regular ingress port. If the FDB lookup process
returns multiple destination ports (flooding, multicast), then
replication is also handled by the switch hardware - the bridge only
sends a single packet and avoids the skb_clone().
DSA keeps for each bridge port a zero-based index (the number of the
bridge). Multiple ports performing TX forwarding offload to the same
bridge have the same dp->bridge_num value, and ports not offloading the
TX data plane of a bridge have dp->bridge_num = -1.
The tagger can check if the packet that is being transmitted on has
skb->offload_fwd_mark = true or not. If it does, it can be sure that the
packet belongs to the data plane of a bridge, further information about
which can be obtained based on dp->bridge_dev and dp->bridge_num.
It can then compose a DSA tag for injecting a data plane packet into
that bridge number.
For the switch driver side, we offer two new dsa_switch_ops methods,
called .port_bridge_fwd_offload_{add,del}, which are modeled after
.port_bridge_{join,leave}.
These methods are provided in case the driver needs to configure the
hardware to treat packets coming from that bridge software interface as
data plane packets. The switchdev <-> bridge interaction happens during
the netdev_master_upper_dev_link() call, so to switch drivers, the
effect is that the .port_bridge_fwd_offload_add() method is called
immediately after .port_bridge_join().
If the bridge number exceeds the number of bridges for which the switch
driver can offload the TX data plane (and this includes the case where
the driver can offload none), DSA falls back to simply returning
tx_fwd_offload = false in the switchdev_bridge_port_offload() call.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22 18:55:40 +03:00
*/
2021-12-06 18:57:48 +02:00
unsigned int max_num_bridges ;
net: dsa: add support for bridge TX forwarding offload
For a DSA switch, to offload the forwarding process of a bridge device
means to send the packets coming from the software bridge as data plane
packets. This is contrary to everything that DSA has done so far,
because the current taggers only know to send control packets (ones that
target a specific destination port), whereas data plane packets are
supposed to be forwarded according to the FDB lookup, much like packets
ingressing on any regular ingress port. If the FDB lookup process
returns multiple destination ports (flooding, multicast), then
replication is also handled by the switch hardware - the bridge only
sends a single packet and avoids the skb_clone().
DSA keeps for each bridge port a zero-based index (the number of the
bridge). Multiple ports performing TX forwarding offload to the same
bridge have the same dp->bridge_num value, and ports not offloading the
TX data plane of a bridge have dp->bridge_num = -1.
The tagger can check if the packet that is being transmitted on has
skb->offload_fwd_mark = true or not. If it does, it can be sure that the
packet belongs to the data plane of a bridge, further information about
which can be obtained based on dp->bridge_dev and dp->bridge_num.
It can then compose a DSA tag for injecting a data plane packet into
that bridge number.
For the switch driver side, we offer two new dsa_switch_ops methods,
called .port_bridge_fwd_offload_{add,del}, which are modeled after
.port_bridge_{join,leave}.
These methods are provided in case the driver needs to configure the
hardware to treat packets coming from that bridge software interface as
data plane packets. The switchdev <-> bridge interaction happens during
the netdev_master_upper_dev_link() call, so to switch drivers, the
effect is that the .port_bridge_fwd_offload_add() method is called
immediately after .port_bridge_join().
If the bridge number exceeds the number of bridges for which the switch
driver can offload the TX data plane (and this includes the case where
the driver can offload none), DSA falls back to simply returning
tx_fwd_offload = false in the switchdev_bridge_port_offload() call.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-22 18:55:40 +03:00
2017-01-27 15:29:36 -05:00
size_t num_ports ;
2011-11-27 17:06:08 +00:00
} ;
2019-10-21 16:51:15 -04:00
static inline struct dsa_port * dsa_to_port ( struct dsa_switch * ds , int p )
2017-10-26 11:22:51 -04:00
{
2019-10-21 16:51:17 -04:00
struct dsa_switch_tree * dst = ds - > dst ;
2019-10-25 14:48:53 -04:00
struct dsa_port * dp ;
2019-10-21 16:51:17 -04:00
list_for_each_entry ( dp , & dst - > ports , list )
if ( dp - > ds = = ds & & dp - > index = = p )
2019-10-25 14:48:53 -04:00
return dp ;
2019-10-21 16:51:17 -04:00
2019-10-25 14:48:53 -04:00
return NULL ;
2017-10-26 11:22:58 -04:00
}
2017-10-26 11:22:51 -04:00
2021-06-21 19:42:15 +03:00
static inline bool dsa_port_is_dsa ( struct dsa_port * port )
{
return port - > type = = DSA_PORT_TYPE_DSA ;
}
static inline bool dsa_port_is_cpu ( struct dsa_port * port )
{
return port - > type = = DSA_PORT_TYPE_CPU ;
}
static inline bool dsa_port_is_user ( struct dsa_port * dp )
{
return dp - > type = = DSA_PORT_TYPE_USER ;
}
net: dsa: flush switchdev workqueue before tearing down CPU/DSA ports
Sometimes when unbinding the mv88e6xxx driver on Turris MOX, these error
messages appear:
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete be:79:b4:9e:9e:96 vid 0 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 100 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 1 from fdb: -2
mv88e6085 d0032004.mdio-mii:12: port 1 failed to delete d8:58:d7:00:ca:6d vid 0 from fdb: -2
(and similarly for other ports)
What happens is that DSA has a policy "even if there are bugs, let's at
least not leak memory" and dsa_port_teardown() clears the dp->fdbs and
dp->mdbs lists, which are supposed to be empty.
But deleting that cleanup code, the warnings go away.
=> the FDB and MDB lists (used for refcounting on shared ports, aka CPU
and DSA ports) will eventually be empty, but are not empty by the time
we tear down those ports. Aka we are deleting them too soon.
The addresses that DSA complains about are host-trapped addresses: the
local addresses of the ports, and the MAC address of the bridge device.
The problem is that offloading those entries happens from a deferred
work item scheduled by the SWITCHDEV_FDB_DEL_TO_DEVICE handler, and this
races with the teardown of the CPU and DSA ports where the refcounting
is kept.
In fact, not only it races, but fundamentally speaking, if we iterate
through the port list linearly, we might end up tearing down the shared
ports even before we delete a DSA user port which has a bridge upper.
So as it turns out, we need to first tear down the user ports (and the
unused ones, for no better place of doing that), then the shared ports
(the CPU and DSA ports). In between, we need to ensure that all work
items scheduled by our switchdev handlers (which only run for user
ports, hence the reason why we tear them down first) have finished.
Fixes: 161ca59d39e9 ("net: dsa: reference count the MDB entries at the cross-chip notifier level")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20210914134726.2305133-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-09-14 16:47:26 +03:00
static inline bool dsa_port_is_unused ( struct dsa_port * dp )
{
return dp - > type = = DSA_PORT_TYPE_UNUSED ;
}
2017-10-26 11:22:58 -04:00
static inline bool dsa_is_unused_port ( struct dsa_switch * ds , int p )
{
return dsa_to_port ( ds , p ) - > type = = DSA_PORT_TYPE_UNUSED ;
2017-10-26 11:22:51 -04:00
}
2011-11-27 17:06:08 +00:00
static inline bool dsa_is_cpu_port ( struct dsa_switch * ds , int p )
{
2017-10-26 11:22:58 -04:00
return dsa_to_port ( ds , p ) - > type = = DSA_PORT_TYPE_CPU ;
2011-11-27 17:06:08 +00:00
}
2015-08-17 23:52:51 +02:00
static inline bool dsa_is_dsa_port ( struct dsa_switch * ds , int p )
{
2017-10-26 11:22:58 -04:00
return dsa_to_port ( ds , p ) - > type = = DSA_PORT_TYPE_DSA ;
2015-08-17 23:52:51 +02:00
}
2017-10-26 11:22:54 -04:00
static inline bool dsa_is_user_port ( struct dsa_switch * ds , int p )
2017-03-11 16:12:58 -05:00
{
2017-10-26 11:22:58 -04:00
return dsa_to_port ( ds , p ) - > type = = DSA_PORT_TYPE_USER ;
2017-03-11 16:12:58 -05:00
}
net: dsa: introduce helpers for iterating through ports using dp
Since the DSA conversion from the ds->ports array into the dst->ports
list, the DSA API has encouraged driver writers, as well as the core
itself, to write inefficient code.
Currently, code that wants to filter by a specific type of port when
iterating, like {!unused, user, cpu, dsa}, uses the dsa_is_*_port helper.
Under the hood, this uses dsa_to_port which iterates again through
dst->ports. But the driver iterates through the port list already, so
the complexity is quadratic for the typical case of a single-switch
tree.
This patch introduces some iteration helpers where the iterator is
already a struct dsa_port *dp, so that the other variant of the
filtering functions, dsa_port_is_{unused,user,cpu_dsa}, can be used
directly on the iterator. This eliminates the second lookup.
These functions can be used both by the core and by drivers.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-10-20 20:49:49 +03:00
# define dsa_tree_for_each_user_port(_dp, _dst) \
list_for_each_entry ( ( _dp ) , & ( _dst ) - > ports , list ) \
if ( dsa_port_is_user ( ( _dp ) ) )
# define dsa_switch_for_each_port(_dp, _ds) \
list_for_each_entry ( ( _dp ) , & ( _ds ) - > dst - > ports , list ) \
if ( ( _dp ) - > ds = = ( _ds ) )
# define dsa_switch_for_each_port_safe(_dp, _next, _ds) \
list_for_each_entry_safe ( ( _dp ) , ( _next ) , & ( _ds ) - > dst - > ports , list ) \
if ( ( _dp ) - > ds = = ( _ds ) )
# define dsa_switch_for_each_port_continue_reverse(_dp, _ds) \
list_for_each_entry_continue_reverse ( ( _dp ) , & ( _ds ) - > dst - > ports , list ) \
if ( ( _dp ) - > ds = = ( _ds ) )
# define dsa_switch_for_each_available_port(_dp, _ds) \
dsa_switch_for_each_port ( ( _dp ) , ( _ds ) ) \
if ( ! dsa_port_is_unused ( ( _dp ) ) )
# define dsa_switch_for_each_user_port(_dp, _ds) \
dsa_switch_for_each_port ( ( _dp ) , ( _ds ) ) \
if ( dsa_port_is_user ( ( _dp ) ) )
# define dsa_switch_for_each_cpu_port(_dp, _ds) \
dsa_switch_for_each_port ( ( _dp ) , ( _ds ) ) \
if ( dsa_port_is_cpu ( ( _dp ) ) )
2017-10-26 11:22:56 -04:00
static inline u32 dsa_user_ports ( struct dsa_switch * ds )
{
2021-10-20 20:49:50 +03:00
struct dsa_port * dp ;
2017-10-26 11:22:58 -04:00
u32 mask = 0 ;
2017-10-26 11:22:56 -04:00
2021-10-20 20:49:50 +03:00
dsa_switch_for_each_user_port ( dp , ds )
mask | = BIT ( dp - > index ) ;
2017-10-26 11:22:58 -04:00
return mask ;
2017-10-16 11:12:19 -04:00
}
2019-10-30 22:09:13 -04:00
/* Return the local port used to reach an arbitrary switch device */
static inline unsigned int dsa_routing_port ( struct dsa_switch * ds , int device )
{
struct dsa_switch_tree * dst = ds - > dst ;
struct dsa_link * dl ;
list_for_each_entry ( dl , & dst - > rtable , list )
if ( dl - > dp - > ds = = ds & & dl - > link_dp - > ds - > index = = device )
return dl - > dp - > index ;
return ds - > num_ports ;
}
2017-11-30 12:56:42 -05:00
/* Return the local port used to reach an arbitrary switch port */
static inline unsigned int dsa_towards_port ( struct dsa_switch * ds , int device ,
int port )
{
if ( device = = ds - > index )
return port ;
else
2019-10-30 22:09:13 -04:00
return dsa_routing_port ( ds , device ) ;
2017-11-30 12:56:42 -05:00
}
/* Return the local port used to reach the dedicated CPU port */
2017-12-05 15:34:13 -05:00
static inline unsigned int dsa_upstream_port ( struct dsa_switch * ds , int port )
2011-11-27 17:06:08 +00:00
{
2017-12-05 15:34:13 -05:00
const struct dsa_port * dp = dsa_to_port ( ds , port ) ;
const struct dsa_port * cpu_dp = dp - > cpu_dp ;
if ( ! cpu_dp )
return port ;
2011-11-27 17:06:08 +00:00
2017-11-30 12:56:42 -05:00
return dsa_towards_port ( ds , cpu_dp - > ds - > index , cpu_dp - > index ) ;
2011-11-27 17:06:08 +00:00
}
2021-06-29 17:06:48 +03:00
/* Return true if this is the local port used to reach the CPU port */
static inline bool dsa_is_upstream_port ( struct dsa_switch * ds , int port )
{
if ( dsa_is_unused_port ( ds , port ) )
return false ;
return port = = dsa_upstream_port ( ds , port ) ;
}
/* Return true if @upstream_ds is an upstream switch of @downstream_ds, meaning
* that the routing port from @ downstream_ds to @ upstream_ds is also the port
* which @ downstream_ds uses to reach its dedicated CPU .
*/
static inline bool dsa_switch_is_upstream_of ( struct dsa_switch * upstream_ds ,
struct dsa_switch * downstream_ds )
{
int routing_port ;
if ( upstream_ds = = downstream_ds )
return true ;
routing_port = dsa_routing_port ( downstream_ds , upstream_ds - > index ) ;
return dsa_is_upstream_port ( downstream_ds , routing_port ) ;
}
2019-04-28 21:45:49 +03:00
static inline bool dsa_port_is_vlan_filtering ( const struct dsa_port * dp )
{
const struct dsa_switch * ds = dp - > ds ;
if ( ds - > vlan_filtering_is_global )
return ds - > vlan_filtering ;
else
return dp - > vlan_filtering ;
}
2021-03-18 20:25:33 +01:00
static inline
struct net_device * dsa_port_to_bridge_port ( const struct dsa_port * dp )
{
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
if ( ! dp - > bridge )
2021-03-18 20:25:33 +01:00
return NULL ;
if ( dp - > lag_dev )
return dp - > lag_dev ;
else if ( dp - > hsr_dev )
return dp - > hsr_dev ;
return dp - > slave ;
}
2021-12-06 18:57:52 +02:00
static inline struct net_device *
dsa_port_bridge_dev_get ( const struct dsa_port * dp )
{
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
return dp - > bridge ? dp - > bridge - > dev : NULL ;
2021-12-06 18:57:52 +02:00
}
static inline unsigned int dsa_port_bridge_num_get ( struct dsa_port * dp )
{
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
return dp - > bridge ? dp - > bridge - > num : 0 ;
2021-12-06 18:57:52 +02:00
}
static inline bool dsa_port_bridge_same ( const struct dsa_port * a ,
const struct dsa_port * b )
{
struct net_device * br_a = dsa_port_bridge_dev_get ( a ) ;
struct net_device * br_b = dsa_port_bridge_dev_get ( b ) ;
/* Standalone ports are not in the same bridge with one another */
return ( ! br_a | | ! br_b ) ? false : ( br_a = = br_b ) ;
}
2021-12-06 18:57:55 +02:00
static inline bool dsa_port_offloads_bridge_port ( struct dsa_port * dp ,
const struct net_device * dev )
{
return dsa_port_to_bridge_port ( dp ) = = dev ;
}
static inline bool
dsa_port_offloads_bridge_dev ( struct dsa_port * dp ,
const struct net_device * bridge_dev )
{
/* DSA ports connected to a bridge, and event was emitted
* for the bridge .
*/
return dsa_port_bridge_dev_get ( dp ) = = bridge_dev ;
}
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
static inline bool dsa_port_offloads_bridge ( struct dsa_port * dp ,
const struct dsa_bridge * bridge )
{
return dsa_port_bridge_dev_get ( dp ) = = bridge - > dev ;
}
2021-12-06 18:57:55 +02:00
/* Returns true if any port of this tree offloads the given net_device */
static inline bool dsa_tree_offloads_bridge_port ( struct dsa_switch_tree * dst ,
const struct net_device * dev )
{
struct dsa_port * dp ;
list_for_each_entry ( dp , & dst - > ports , list )
if ( dsa_port_offloads_bridge_port ( dp , dev ) )
return true ;
return false ;
}
/* Returns true if any port of this tree offloads the given bridge */
static inline bool
dsa_tree_offloads_bridge_dev ( struct dsa_switch_tree * dst ,
const struct net_device * bridge_dev )
{
struct dsa_port * dp ;
list_for_each_entry ( dp , & dst - > ports , list )
if ( dsa_port_offloads_bridge_dev ( dp , bridge_dev ) )
return true ;
return false ;
}
2017-08-06 16:15:49 +03:00
typedef int dsa_fdb_dump_cb_t ( const unsigned char * addr , u16 vid ,
bool is_static , void * data ) ;
2016-08-23 12:38:56 -04:00
struct dsa_switch_ops {
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 03:00:06 +02:00
/*
* Tagging protocol helpers called for the CPU ports and DSA links .
* @ get_tag_protocol retrieves the initial tagging protocol and is
* mandatory . Switches which can operate using multiple tagging
* protocols should implement @ change_tag_protocol and report in
* @ get_tag_protocol the tagger in current use .
*/
2017-11-10 15:22:52 -08:00
enum dsa_tag_protocol ( * get_tag_protocol ) ( struct dsa_switch * ds ,
2020-01-07 21:06:05 -08:00
int port ,
enum dsa_tag_protocol mprot ) ;
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit 98cdb4807123 ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
$ cat ./test_ocelot_8021q.sh
#!/bin/bash
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-01-29 03:00:06 +02:00
int ( * change_tag_protocol ) ( struct dsa_switch * ds , int port ,
enum dsa_tag_protocol proto ) ;
net: dsa: introduce tagger-owned storage for private and shared data
Ansuel is working on register access over Ethernet for the qca8k switch
family. This requires the qca8k tagging protocol driver to receive
frames which aren't intended for the network stack, but instead for the
qca8k switch driver itself.
The dp->priv is currently the prevailing method for passing data back
and forth between the tagging protocol driver and the switch driver.
However, this method is riddled with caveats.
The DSA design allows in principle for any switch driver to return any
protocol it desires in ->get_tag_protocol(). The dsa_loop driver can be
modified to do just that. But in the current design, the memory behind
dp->priv has to be allocated by the switch driver, so if the tagging
protocol is paired to an unexpected switch driver, we may end up in NULL
pointer dereferences inside the kernel, or worse (a switch driver may
allocate dp->priv according to the expectations of a different tagger).
The latter possibility is even more plausible considering that DSA
switches can dynamically change tagging protocols in certain cases
(dsa <-> edsa, ocelot <-> ocelot-8021q), and the current design lends
itself to mistakes that are all too easy to make.
This patch proposes that the tagging protocol driver should manage its
own memory, instead of relying on the switch driver to do so.
After analyzing the different in-tree needs, it can be observed that the
required tagger storage is per switch, therefore a ds->tagger_data
pointer is introduced. In principle, per-port storage could also be
introduced, although there is no need for it at the moment. Future
changes will replace the current usage of dp->priv with ds->tagger_data.
We define a "binding" event between the DSA switch tree and the tagging
protocol. During this binding event, the tagging protocol's ->connect()
method is called first, and this may allocate some memory for each
switch of the tree. Then a cross-chip notifier is emitted for the
switches within that tree, and they are given the opportunity to fix up
the tagger's memory (for example, they might set up some function
pointers that represent virtual methods for consuming packets).
Because the memory is owned by the tagger, there exists a ->disconnect()
method for the tagger (which is the place to free the resources), but
there doesn't exist a ->disconnect() method for the switch driver.
This is part of the design. The switch driver should make minimal use of
the public part of the tagger data, and only after type-checking it
using the supplied "proto" argument.
In the code there are in fact two binding events, one is the initial
event in dsa_switch_setup_tag_protocol(). At this stage, the cross chip
notifier chains aren't initialized, so we call each switch's connect()
method by hand. Then there is dsa_tree_bind_tag_proto() during
dsa_tree_change_tag_proto(), and here we have an old protocol and a new
one. We first connect to the new one before disconnecting from the old
one, to simplify error handling a bit and to ensure we remain in a valid
state at all times.
Co-developed-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-12-10 01:34:37 +02:00
/*
* Method for switch drivers to connect to the tagging protocol driver
* in current use . The switch driver can provide handlers for certain
* types of packets for switch management .
*/
int ( * connect_tag_protocol ) ( struct dsa_switch * ds ,
enum dsa_tag_protocol proto ) ;
2016-08-22 16:01:01 +02:00
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 17:29:16 +03:00
/* Optional switch-wide initialization and destruction methods */
2011-11-27 17:06:08 +00:00
int ( * setup ) ( struct dsa_switch * ds ) ;
2019-06-08 15:04:28 +03:00
void ( * teardown ) ( struct dsa_switch * ds ) ;
net: dsa: tear down devlink port regions when tearing down the devlink port on error
Commit 86f8b1c01a0a ("net: dsa: Do not make user port errors fatal")
decided it was fine to ignore errors on certain ports that fail to
probe, and go on with the ports that do probe fine.
Commit fb6ec87f7229 ("net: dsa: Fix type was not set for devlink port")
noticed that devlink_port_type_eth_set(dlp, dp->slave); does not get
called, and devlink notices after a timeout of 3600 seconds and prints a
WARN_ON. So it went ahead to unregister the devlink port. And because
there exists an UNUSED port flavour, we actually re-register the devlink
port as UNUSED.
Commit 08156ba430b4 ("net: dsa: Add devlink port regions support to
DSA") added devlink port regions, which are set up by the driver and not
by DSA.
When we trigger the devlink port deregistration and reregistration as
unused, devlink now prints another WARN_ON, from here:
devlink_port_unregister:
WARN_ON(!list_empty(&devlink_port->region_list));
So the port still has regions, which makes sense, because they were set
up by the driver, and the driver doesn't know we're unregistering the
devlink port.
Somebody needs to tear them down, and optionally (actually it would be
nice, to be consistent) set them up again for the new devlink port.
But DSA's layering stays in our way quite badly here.
The options I've considered are:
1. Introduce a function in devlink to just change a port's type and
flavour. No dice, devlink keeps a lot of state, it really wants the
port to not be registered when you set its parameters, so changing
anything can only be done by destroying what we currently have and
recreating it.
2. Make DSA cache the parameters passed to dsa_devlink_port_region_create,
and the region returned, keep those in a list, then when the devlink
port unregister needs to take place, the existing devlink regions are
destroyed by DSA, and we replay the creation of new regions using the
cached parameters. Problem: mv88e6xxx keeps the region pointers in
chip->ports[port].region, and these will remain stale after DSA frees
them. There are many things DSA can do, but updating mv88e6xxx's
private pointers is not one of them.
3. Just let the driver do it (i.e. introduce a very specific method
called ds->ops->port_reinit_as_unused, which unregisters its devlink
port devlink regions, then the old devlink port, then registers the
new one, then the devlink port regions for it). While it does work,
as opposed to the others, it's pretty horrible from an API
perspective and we can do better.
4. Introduce a new pair of methods, ->port_setup and ->port_teardown,
which in the case of mv88e6xxx must register and unregister the
devlink port regions. Call these 2 methods when the port must be
reinitialized as unused.
Naturally, I went for the 4th approach.
Fixes: 08156ba430b4 ("net: dsa: Add devlink port regions support to DSA")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 17:29:16 +03:00
/* Per-port initialization and destruction methods. Mandatory if the
* driver registers devlink port regions , optional otherwise .
*/
int ( * port_setup ) ( struct dsa_switch * ds , int port ) ;
void ( * port_teardown ) ( struct dsa_switch * ds , int port ) ;
2014-09-19 13:07:54 -07:00
u32 ( * get_phy_flags ) ( struct dsa_switch * ds , int port ) ;
2011-11-27 17:06:08 +00:00
/*
* Access to the switch ' s PHY registers .
*/
int ( * phy_read ) ( struct dsa_switch * ds , int port , int regnum ) ;
int ( * phy_write ) ( struct dsa_switch * ds , int port ,
int regnum , u16 val ) ;
2014-08-27 17:04:53 -07:00
/*
* Link state adjustment ( called from libphy )
*/
void ( * adjust_link ) ( struct dsa_switch * ds , int port ,
struct phy_device * phydev ) ;
2014-08-27 17:04:54 -07:00
void ( * fixed_link_update ) ( struct dsa_switch * ds , int port ,
struct fixed_phy_status * st ) ;
2014-08-27 17:04:53 -07:00
2018-05-10 13:17:32 -07:00
/*
* PHYLINK integration
*/
2021-11-30 13:10:01 +00:00
void ( * phylink_get_caps ) ( struct dsa_switch * ds , int port ,
struct phylink_config * config ) ;
2018-05-10 13:17:32 -07:00
void ( * phylink_validate ) ( struct dsa_switch * ds , int port ,
unsigned long * supported ,
struct phylink_link_state * state ) ;
int ( * phylink_mac_link_state ) ( struct dsa_switch * ds , int port ,
struct phylink_link_state * state ) ;
void ( * phylink_mac_config ) ( struct dsa_switch * ds , int port ,
unsigned int mode ,
const struct phylink_link_state * state ) ;
void ( * phylink_mac_an_restart ) ( struct dsa_switch * ds , int port ) ;
void ( * phylink_mac_link_down ) ( struct dsa_switch * ds , int port ,
unsigned int mode ,
phy_interface_t interface ) ;
void ( * phylink_mac_link_up ) ( struct dsa_switch * ds , int port ,
unsigned int mode ,
phy_interface_t interface ,
2020-02-26 10:23:46 +00:00
struct phy_device * phydev ,
int speed , int duplex ,
bool tx_pause , bool rx_pause ) ;
2018-05-10 13:17:32 -07:00
void ( * phylink_fixed_state ) ( struct dsa_switch * ds , int port ,
struct phylink_link_state * state ) ;
2011-11-27 17:06:08 +00:00
/*
2021-01-11 11:46:57 +01:00
* Port statistics counters .
2011-11-27 17:06:08 +00:00
*/
2018-04-25 12:12:50 -07:00
void ( * get_strings ) ( struct dsa_switch * ds , int port ,
u32 stringset , uint8_t * data ) ;
2011-11-27 17:06:08 +00:00
void ( * get_ethtool_stats ) ( struct dsa_switch * ds ,
int port , uint64_t * data ) ;
2018-04-25 12:12:50 -07:00
int ( * get_sset_count ) ( struct dsa_switch * ds , int port , int sset ) ;
2018-04-25 12:12:52 -07:00
void ( * get_ethtool_phy_stats ) ( struct dsa_switch * ds ,
int port , uint64_t * data ) ;
2021-10-18 11:37:57 +02:00
void ( * get_eth_phy_stats ) ( struct dsa_switch * ds , int port ,
struct ethtool_eth_phy_stats * phy_stats ) ;
void ( * get_eth_mac_stats ) ( struct dsa_switch * ds , int port ,
struct ethtool_eth_mac_stats * mac_stats ) ;
void ( * get_eth_ctrl_stats ) ( struct dsa_switch * ds , int port ,
struct ethtool_eth_ctrl_stats * ctrl_stats ) ;
2021-01-11 11:46:57 +01:00
void ( * get_stats64 ) ( struct dsa_switch * ds , int port ,
struct rtnl_link_stats64 * s ) ;
2021-04-19 15:01:06 +02:00
void ( * self_test ) ( struct dsa_switch * ds , int port ,
struct ethtool_test * etest , u64 * data ) ;
2014-09-18 17:31:22 -07:00
2014-09-18 17:31:24 -07:00
/*
* ethtool Wake - on - LAN
*/
void ( * get_wol ) ( struct dsa_switch * ds , int port ,
struct ethtool_wolinfo * w ) ;
int ( * set_wol ) ( struct dsa_switch * ds , int port ,
struct ethtool_wolinfo * w ) ;
2018-02-14 01:07:48 +01:00
/*
* ethtool timestamp info
*/
int ( * get_ts_info ) ( struct dsa_switch * ds , int port ,
struct ethtool_ts_info * ts ) ;
2014-09-18 17:31:22 -07:00
/*
* Suspend and resume
*/
int ( * suspend ) ( struct dsa_switch * ds ) ;
int ( * resume ) ( struct dsa_switch * ds ) ;
2014-09-24 17:05:18 -07:00
/*
* Port enable / disable
*/
int ( * port_enable ) ( struct dsa_switch * ds , int port ,
struct phy_device * phy ) ;
2019-02-24 20:44:43 +01:00
void ( * port_disable ) ( struct dsa_switch * ds , int port ) ;
2014-09-24 17:05:21 -07:00
/*
2017-08-01 16:32:41 -04:00
* Port ' s MAC EEE settings
2014-09-24 17:05:21 -07:00
*/
2017-08-01 16:32:41 -04:00
int ( * set_mac_eee ) ( struct dsa_switch * ds , int port ,
struct ethtool_eee * e ) ;
int ( * get_mac_eee ) ( struct dsa_switch * ds , int port ,
struct ethtool_eee * e ) ;
2014-10-29 10:44:58 -07:00
2014-10-29 10:45:01 -07:00
/* EEPROM access */
int ( * get_eeprom_len ) ( struct dsa_switch * ds ) ;
int ( * get_eeprom ) ( struct dsa_switch * ds ,
struct ethtool_eeprom * eeprom , u8 * data ) ;
int ( * set_eeprom ) ( struct dsa_switch * ds ,
struct ethtool_eeprom * eeprom , u8 * data ) ;
2014-10-29 10:45:04 -07:00
/*
* Register access .
*/
int ( * get_regs_len ) ( struct dsa_switch * ds , int port ) ;
void ( * get_regs ) ( struct dsa_switch * ds , int port ,
struct ethtool_regs * regs , void * p ) ;
2015-02-24 13:15:33 -08:00
2020-11-03 08:10:55 +01:00
/*
* Upper device tracking .
*/
int ( * port_prechangeupper ) ( struct dsa_switch * ds , int port ,
struct netdev_notifier_changeupper_info * info ) ;
2015-02-24 13:15:33 -08:00
/*
* Bridge integration
*/
2016-07-18 20:45:38 -04:00
int ( * set_ageing_time ) ( struct dsa_switch * ds , unsigned int msecs ) ;
2016-03-13 16:21:32 -04:00
int ( * port_bridge_join ) ( struct dsa_switch * ds , int port ,
2021-12-06 18:57:57 +02:00
struct dsa_bridge bridge ,
bool * tx_fwd_offload ) ;
2017-01-27 15:29:41 -05:00
void ( * port_bridge_leave ) ( struct dsa_switch * ds , int port ,
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
struct dsa_bridge bridge ) ;
2016-04-06 11:55:03 -04:00
void ( * port_stp_state_set ) ( struct dsa_switch * ds , int port ,
u8 state ) ;
2016-09-22 16:49:22 -04:00
void ( * port_fast_age ) ( struct dsa_switch * ds , int port ) ;
net: dsa: act as passthrough for bridge port flags
There are multiple ways in which a PORT_BRIDGE_FLAGS attribute can be
expressed by the bridge through switchdev, and not all of them can be
emulated by DSA mid-layer API at the same time.
One possible configuration is when the bridge offloads the port flags
using a mask that has a single bit set - therefore only one feature
should change. However, DSA currently groups together unicast and
multicast flooding in the .port_egress_floods method, which limits our
options when we try to add support for turning off broadcast flooding:
do we extend .port_egress_floods with a third parameter which b53 and
mv88e6xxx will ignore? But that means that the DSA layer, which
currently implements the PRE_BRIDGE_FLAGS attribute all by itself, will
see that .port_egress_floods is implemented, and will report that all 3
types of flooding are supported - not necessarily true.
Another configuration is when the user specifies more than one flag at
the same time, in the same netlink message. If we were to create one
individual function per offloadable bridge port flag, we would limit the
expressiveness of the switch driver of refusing certain combinations of
flag values. For example, a switch may not have an explicit knob for
flooding of unknown multicast, just for flooding in general. In that
case, the only correct thing to do is to allow changes to BR_FLOOD and
BR_MCAST_FLOOD in tandem, and never allow mismatched values. But having
a separate .port_set_unicast_flood and .port_set_multicast_flood would
not allow the driver to possibly reject that.
Also, DSA doesn't consider it necessary to inform the driver that a
SWITCHDEV_ATTR_ID_BRIDGE_MROUTER attribute was offloaded, because it
just calls .port_egress_floods for the CPU port. When we'll add support
for the plain SWITCHDEV_ATTR_ID_PORT_MROUTER, that will become a real
problem because the flood settings will need to be held statefully in
the DSA middle layer, otherwise changing the mrouter port attribute will
impact the flooding attribute. And that's _assuming_ that the underlying
hardware doesn't have anything else to do when a multicast router
attaches to a port than flood unknown traffic to it. If it does, there
will need to be a dedicated .port_set_mrouter anyway.
So we need to let the DSA drivers see the exact form that the bridge
passes this switchdev attribute in, otherwise we are standing in the
way. Therefore we also need to use this form of language when
communicating to the driver that it needs to configure its initial
(before bridge join) and final (after bridge leave) port flags.
The b53 and mv88e6xxx drivers are converted to the passthrough API and
their implementation of .port_egress_floods is split into two: a
function that configures unicast flooding and another for multicast.
The mv88e6xxx implementation is quite hairy, and it turns out that
the implementations of unknown unicast flooding are actually the same
for 6185 and for 6352:
behind the confusing names actually lie two individual bits:
NO_UNKNOWN_MC -> FLOOD_UC = 0x4 = BIT(2)
NO_UNKNOWN_UC -> FLOOD_MC = 0x8 = BIT(3)
so there was no reason to entangle them in the first place.
Whereas the 6185 writes to MV88E6185_PORT_CTL0_FORWARD_UNKNOWN of
PORT_CTL0, which has the exact same bit index. I have left the
implementations separate though, for the only reason that the names are
different enough to confuse me, since I am not able to double-check with
a user manual. The multicast flooding setting for 6185 is in a different
register than for 6352 though.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-02-12 17:15:56 +02:00
int ( * port_pre_bridge_flags ) ( struct dsa_switch * ds , int port ,
struct switchdev_brport_flags flags ,
struct netlink_ext_ack * extack ) ;
int ( * port_bridge_flags ) ( struct dsa_switch * ds , int port ,
struct switchdev_brport_flags flags ,
struct netlink_ext_ack * extack ) ;
2015-08-10 09:09:49 -04:00
2015-08-13 12:52:17 -04:00
/*
* VLAN support
*/
2016-02-26 13:16:00 -05:00
int ( * port_vlan_filtering ) ( struct dsa_switch * ds , int port ,
2021-02-13 22:43:19 +02:00
bool vlan_filtering ,
struct netlink_ext_ack * extack ) ;
2021-01-09 02:01:53 +02:00
int ( * port_vlan_add ) ( struct dsa_switch * ds , int port ,
2021-02-13 22:43:18 +02:00
const struct switchdev_obj_port_vlan * vlan ,
struct netlink_ext_ack * extack ) ;
2015-11-01 12:33:55 -05:00
int ( * port_vlan_del ) ( struct dsa_switch * ds , int port ,
const struct switchdev_obj_port_vlan * vlan ) ;
2015-08-10 09:09:49 -04:00
/*
* Forwarding database
*/
2017-08-06 16:15:40 +03:00
int ( * port_fdb_add ) ( struct dsa_switch * ds , int port ,
2017-08-06 16:15:39 +03:00
const unsigned char * addr , u16 vid ) ;
2015-08-10 09:09:49 -04:00
int ( * port_fdb_del ) ( struct dsa_switch * ds , int port ,
2017-08-06 16:15:39 +03:00
const unsigned char * addr , u16 vid ) ;
2015-10-22 09:34:38 -04:00
int ( * port_fdb_dump ) ( struct dsa_switch * ds , int port ,
2017-08-06 16:15:49 +03:00
dsa_fdb_dump_cb_t * cb , void * data ) ;
2016-08-31 11:50:03 -04:00
/*
* Multicast database
*/
2021-01-09 02:01:52 +02:00
int ( * port_mdb_add ) ( struct dsa_switch * ds , int port ,
2017-11-30 11:23:58 -05:00
const struct switchdev_obj_port_mdb * mdb ) ;
2016-08-31 11:50:03 -04:00
int ( * port_mdb_del ) ( struct dsa_switch * ds , int port ,
const struct switchdev_obj_port_mdb * mdb ) ;
2017-01-30 09:48:40 -08:00
/*
* RXNFC
*/
int ( * get_rxnfc ) ( struct dsa_switch * ds , int port ,
struct ethtool_rxnfc * nfc , u32 * rule_locs ) ;
int ( * set_rxnfc ) ( struct dsa_switch * ds , int port ,
struct ethtool_rxnfc * nfc ) ;
2017-01-30 12:41:40 -08:00
/*
* TC integration
*/
2020-02-29 16:31:13 +02:00
int ( * cls_flower_add ) ( struct dsa_switch * ds , int port ,
struct flow_cls_offload * cls , bool ingress ) ;
int ( * cls_flower_del ) ( struct dsa_switch * ds , int port ,
struct flow_cls_offload * cls , bool ingress ) ;
int ( * cls_flower_stats ) ( struct dsa_switch * ds , int port ,
struct flow_cls_offload * cls , bool ingress ) ;
2017-01-30 12:41:40 -08:00
int ( * port_mirror_add ) ( struct dsa_switch * ds , int port ,
struct dsa_mall_mirror_tc_entry * mirror ,
bool ingress ) ;
void ( * port_mirror_del ) ( struct dsa_switch * ds , int port ,
struct dsa_mall_mirror_tc_entry * mirror ) ;
2020-03-29 14:51:59 +03:00
int ( * port_policer_add ) ( struct dsa_switch * ds , int port ,
struct dsa_mall_policer_tc_entry * policer ) ;
void ( * port_policer_del ) ( struct dsa_switch * ds , int port ) ;
2019-09-15 04:59:59 +03:00
int ( * port_setup_tc ) ( struct dsa_switch * ds , int port ,
enum tc_setup_type type , void * type_data ) ;
2017-03-30 17:37:14 -04:00
/*
* Cross - chip operations
*/
net: dsa: permit cross-chip bridging between all trees in the system
One way of utilizing DSA is by cascading switches which do not all have
compatible taggers. Consider the following real-life topology:
+---------------------------------------------------------------+
| LS1028A |
| +------------------------------+ |
| | DSA master for Felix | |
| |(internal ENETC port 2: eno2))| |
| +------------+------------------------------+-------------+ |
| | Felix embedded L2 switch | |
| | | |
| | +--------------+ +--------------+ +--------------+ | |
| | |DSA master for| |DSA master for| |DSA master for| | |
| | | SJA1105 1 | | SJA1105 2 | | SJA1105 3 | | |
| | |(Felix port 1)| |(Felix port 2)| |(Felix port 3)| | |
+--+-+--------------+---+--------------+---+--------------+--+--+
+-----------------------+ +-----------------------+ +-----------------------+
| SJA1105 switch 1 | | SJA1105 switch 2 | | SJA1105 switch 3 |
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
|sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3| |sw3p0|sw3p1|sw3p2|sw3p3|
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
The above can be described in the device tree as follows (obviously not
complete):
mscc_felix {
dsa,member = <0 0>;
ports {
port@4 {
ethernet = <&enetc_port2>;
};
};
};
sja1105_switch1 {
dsa,member = <1 1>;
ports {
port@4 {
ethernet = <&mscc_felix_port1>;
};
};
};
sja1105_switch2 {
dsa,member = <2 2>;
ports {
port@4 {
ethernet = <&mscc_felix_port2>;
};
};
};
sja1105_switch3 {
dsa,member = <3 3>;
ports {
port@4 {
ethernet = <&mscc_felix_port3>;
};
};
};
Basically we instantiate one DSA switch tree for every hardware switch
in the system, but we still give them globally unique switch IDs (will
come back to that later). Having 3 disjoint switch trees makes the
tagger drivers "just work", because net devices are registered for the
3 Felix DSA master ports, and they are also DSA slave ports to the ENETC
port. So packets received on the ENETC port are stripped of their
stacked DSA tags one by one.
Currently, hardware bridging between ports on the same sja1105 chip is
possible, but switching between sja1105 ports on different chips is
handled by the software bridge. This is fine, but we can do better.
In fact, the dsa_8021q tag used by sja1105 is compatible with cascading.
In other words, a sja1105 switch can correctly parse and route a packet
containing a dsa_8021q tag. So if we could enable hardware bridging on
the Felix DSA master ports, cross-chip bridging could be completely
offloaded.
Such as system would be used as follows:
ip link add dev br0 type bridge && ip link set dev br0 up
for port in sw0p0 sw0p1 sw0p2 sw0p3 \
sw1p0 sw1p1 sw1p2 sw1p3 \
sw2p0 sw2p1 sw2p2 sw2p3; do
ip link set dev $port master br0
done
The above makes switching between ports on the same row be performed in
hardware, and between ports on different rows in software. Now assume
the Felix switch ports are called swp0, swp1, swp2. By running the
following extra commands:
ip link add dev br1 type bridge && ip link set dev br1 up
for port in swp0 swp1 swp2; do
ip link set dev $port master br1
done
the CPU no longer sees packets which traverse sja1105 switch boundaries
and can be forwarded directly by Felix. The br1 bridge would not be used
for any sort of traffic termination.
For this to work, we need to give drivers an opportunity to listen for
bridging events on DSA trees other than their own, and pass that other
tree index as argument. I have made the assumption, for the moment, that
the other existing DSA notifiers don't need to be broadcast to other
trees. That assumption might turn out to be incorrect. But in the
meantime, introduce a dsa_broadcast function, similar in purpose to
dsa_port_notify, which is used only by the bridging notifiers.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:37:41 +03:00
int ( * crosschip_bridge_join ) ( struct dsa_switch * ds , int tree_index ,
int sw_index , int port ,
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
struct dsa_bridge bridge ) ;
net: dsa: permit cross-chip bridging between all trees in the system
One way of utilizing DSA is by cascading switches which do not all have
compatible taggers. Consider the following real-life topology:
+---------------------------------------------------------------+
| LS1028A |
| +------------------------------+ |
| | DSA master for Felix | |
| |(internal ENETC port 2: eno2))| |
| +------------+------------------------------+-------------+ |
| | Felix embedded L2 switch | |
| | | |
| | +--------------+ +--------------+ +--------------+ | |
| | |DSA master for| |DSA master for| |DSA master for| | |
| | | SJA1105 1 | | SJA1105 2 | | SJA1105 3 | | |
| | |(Felix port 1)| |(Felix port 2)| |(Felix port 3)| | |
+--+-+--------------+---+--------------+---+--------------+--+--+
+-----------------------+ +-----------------------+ +-----------------------+
| SJA1105 switch 1 | | SJA1105 switch 2 | | SJA1105 switch 3 |
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
|sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3| |sw3p0|sw3p1|sw3p2|sw3p3|
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
The above can be described in the device tree as follows (obviously not
complete):
mscc_felix {
dsa,member = <0 0>;
ports {
port@4 {
ethernet = <&enetc_port2>;
};
};
};
sja1105_switch1 {
dsa,member = <1 1>;
ports {
port@4 {
ethernet = <&mscc_felix_port1>;
};
};
};
sja1105_switch2 {
dsa,member = <2 2>;
ports {
port@4 {
ethernet = <&mscc_felix_port2>;
};
};
};
sja1105_switch3 {
dsa,member = <3 3>;
ports {
port@4 {
ethernet = <&mscc_felix_port3>;
};
};
};
Basically we instantiate one DSA switch tree for every hardware switch
in the system, but we still give them globally unique switch IDs (will
come back to that later). Having 3 disjoint switch trees makes the
tagger drivers "just work", because net devices are registered for the
3 Felix DSA master ports, and they are also DSA slave ports to the ENETC
port. So packets received on the ENETC port are stripped of their
stacked DSA tags one by one.
Currently, hardware bridging between ports on the same sja1105 chip is
possible, but switching between sja1105 ports on different chips is
handled by the software bridge. This is fine, but we can do better.
In fact, the dsa_8021q tag used by sja1105 is compatible with cascading.
In other words, a sja1105 switch can correctly parse and route a packet
containing a dsa_8021q tag. So if we could enable hardware bridging on
the Felix DSA master ports, cross-chip bridging could be completely
offloaded.
Such as system would be used as follows:
ip link add dev br0 type bridge && ip link set dev br0 up
for port in sw0p0 sw0p1 sw0p2 sw0p3 \
sw1p0 sw1p1 sw1p2 sw1p3 \
sw2p0 sw2p1 sw2p2 sw2p3; do
ip link set dev $port master br0
done
The above makes switching between ports on the same row be performed in
hardware, and between ports on different rows in software. Now assume
the Felix switch ports are called swp0, swp1, swp2. By running the
following extra commands:
ip link add dev br1 type bridge && ip link set dev br1 up
for port in swp0 swp1 swp2; do
ip link set dev $port master br1
done
the CPU no longer sees packets which traverse sja1105 switch boundaries
and can be forwarded directly by Felix. The br1 bridge would not be used
for any sort of traffic termination.
For this to work, we need to give drivers an opportunity to listen for
bridging events on DSA trees other than their own, and pass that other
tree index as argument. I have made the assumption, for the moment, that
the other existing DSA notifiers don't need to be broadcast to other
trees. That assumption might turn out to be incorrect. But in the
meantime, introduce a dsa_broadcast function, similar in purpose to
dsa_port_notify, which is used only by the bridging notifiers.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:37:41 +03:00
void ( * crosschip_bridge_leave ) ( struct dsa_switch * ds , int tree_index ,
int sw_index , int port ,
net: dsa: keep the bridge_dev and bridge_num as part of the same structure
The main desire behind this is to provide coherent bridge information to
the fast path without locking.
For example, right now we set dp->bridge_dev and dp->bridge_num from
separate code paths, it is theoretically possible for a packet
transmission to read these two port properties consecutively and find a
bridge number which does not correspond with the bridge device.
Another desire is to start passing more complex bridge information to
dsa_switch_ops functions. For example, with FDB isolation, it is
expected that drivers will need to be passed the bridge which requested
an FDB/MDB entry to be offloaded, and along with that bridge_dev, the
associated bridge_num should be passed too, in case the driver might
want to implement an isolation scheme based on that number.
We already pass the {bridge_dev, bridge_num} pair to the TX forwarding
offload switch API, however we'd like to remove that and squash it into
the basic bridge join/leave API. So that means we need to pass this
pair to the bridge join/leave API.
During dsa_port_bridge_leave, first we unset dp->bridge_dev, then we
call the driver's .port_bridge_leave with what used to be our
dp->bridge_dev, but provided as an argument.
When bridge_dev and bridge_num get folded into a single structure, we
need to preserve this behavior in dsa_port_bridge_leave: we need a copy
of what used to be in dp->bridge.
Switch drivers check bridge membership by comparing dp->bridge_dev with
the provided bridge_dev, but now, if we provide the struct dsa_bridge as
a pointer, they cannot keep comparing dp->bridge to the provided
pointer, since this only points to an on-stack copy. To make this
obvious and prevent driver writers from forgetting and doing stupid
things, in this new API, the struct dsa_bridge is provided as a full
structure (not very large, contains an int and a pointer) instead of a
pointer. An explicit comparison function needs to be used to determine
bridge membership: dsa_port_offloads_bridge().
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-12-06 18:57:56 +02:00
struct dsa_bridge bridge ) ;
2021-01-13 09:42:53 +01:00
int ( * crosschip_lag_change ) ( struct dsa_switch * ds , int sw_index ,
int port ) ;
int ( * crosschip_lag_join ) ( struct dsa_switch * ds , int sw_index ,
int port , struct net_device * lag ,
struct netdev_lag_upper_info * info ) ;
int ( * crosschip_lag_leave ) ( struct dsa_switch * ds , int sw_index ,
int port , struct net_device * lag ) ;
2018-02-14 01:07:48 +01:00
/*
* PTP functionality
*/
int ( * port_hwtstamp_get ) ( struct dsa_switch * ds , int port ,
struct ifreq * ifr ) ;
int ( * port_hwtstamp_set ) ( struct dsa_switch * ds , int port ,
struct ifreq * ifr ) ;
2021-04-27 12:21:59 +08:00
void ( * port_txtstamp ) ( struct dsa_switch * ds , int port ,
struct sk_buff * skb ) ;
2018-02-14 01:07:49 +01:00
bool ( * port_rxtstamp ) ( struct dsa_switch * ds , int port ,
struct sk_buff * skb , unsigned int type ) ;
2019-05-05 13:19:25 +03:00
2020-09-18 21:11:08 +02:00
/* Devlink parameters, etc */
2019-10-25 01:03:51 +02:00
int ( * devlink_param_get ) ( struct dsa_switch * ds , u32 id ,
struct devlink_param_gset_ctx * ctx ) ;
int ( * devlink_param_set ) ( struct dsa_switch * ds , u32 id ,
struct devlink_param_gset_ctx * ctx ) ;
2020-09-18 21:11:08 +02:00
int ( * devlink_info_get ) ( struct dsa_switch * ds ,
struct devlink_info_req * req ,
struct netlink_ext_ack * extack ) ;
2021-01-15 04:11:13 +02:00
int ( * devlink_sb_pool_get ) ( struct dsa_switch * ds ,
unsigned int sb_index , u16 pool_index ,
struct devlink_sb_pool_info * pool_info ) ;
int ( * devlink_sb_pool_set ) ( struct dsa_switch * ds , unsigned int sb_index ,
u16 pool_index , u32 size ,
enum devlink_sb_threshold_type threshold_type ,
struct netlink_ext_ack * extack ) ;
int ( * devlink_sb_port_pool_get ) ( struct dsa_switch * ds , int port ,
unsigned int sb_index , u16 pool_index ,
u32 * p_threshold ) ;
int ( * devlink_sb_port_pool_set ) ( struct dsa_switch * ds , int port ,
unsigned int sb_index , u16 pool_index ,
u32 threshold ,
struct netlink_ext_ack * extack ) ;
int ( * devlink_sb_tc_pool_bind_get ) ( struct dsa_switch * ds , int port ,
unsigned int sb_index , u16 tc_index ,
enum devlink_sb_pool_type pool_type ,
u16 * p_pool_index , u32 * p_threshold ) ;
int ( * devlink_sb_tc_pool_bind_set ) ( struct dsa_switch * ds , int port ,
unsigned int sb_index , u16 tc_index ,
enum devlink_sb_pool_type pool_type ,
u16 pool_index , u32 threshold ,
struct netlink_ext_ack * extack ) ;
int ( * devlink_sb_occ_snapshot ) ( struct dsa_switch * ds ,
unsigned int sb_index ) ;
int ( * devlink_sb_occ_max_clear ) ( struct dsa_switch * ds ,
unsigned int sb_index ) ;
int ( * devlink_sb_occ_port_pool_get ) ( struct dsa_switch * ds , int port ,
unsigned int sb_index , u16 pool_index ,
u32 * p_cur , u32 * p_max ) ;
int ( * devlink_sb_occ_tc_port_bind_get ) ( struct dsa_switch * ds , int port ,
unsigned int sb_index , u16 tc_index ,
enum devlink_sb_pool_type pool_type ,
u32 * p_cur , u32 * p_max ) ;
net: dsa: configure the MTU for switch ports
It is useful be able to configure port policers on a switch to accept
frames of various sizes:
- Increase the MTU for better throughput from the default of 1500 if it
is known that there is no 10/100 Mbps device in the network.
- Decrease the MTU to limit the latency of high-priority frames under
congestion, or work around various network segments that add extra
headers to packets which can't be fragmented.
For DSA slave ports, this is mostly a pass-through callback, called
through the regular ndo ops and at probe time (to ensure consistency
across all supported switches).
The CPU port is called with an MTU equal to the largest configured MTU
of the slave ports. The assumption is that the user might want to
sustain a bidirectional conversation with a partner over any switch
port.
The DSA master is configured the same as the CPU port, plus the tagger
overhead. Since the MTU is by definition L2 payload (sans Ethernet
header), it is up to each individual driver to figure out if it needs to
do anything special for its frame tags on the CPU port (it shouldn't
except in special cases). So the MTU does not contain the tagger
overhead on the CPU port.
However the MTU of the DSA master, minus the tagger overhead, is used as
a proxy for the MTU of the CPU port, which does not have a net device.
This is to avoid uselessly calling the .change_mtu function on the CPU
port when nothing should change.
So it is safe to assume that the DSA master and the CPU port MTUs are
apart by exactly the tagger's overhead in bytes.
Some changes were made around dsa_master_set_mtu(), function which was
now removed, for 2 reasons:
- dev_set_mtu() already calls dev_validate_mtu(), so it's redundant to
do the same thing in DSA
- __dev_set_mtu() returns 0 if ops->ndo_change_mtu is an absent method
That is to say, there's no need for this function in DSA, we can safely
call dev_set_mtu() directly, take the rtnl lock when necessary, and just
propagate whatever errors get reported (since the user probably wants to
be informed).
Some inspiration (mainly in the MTU DSA notifier) was taken from a
vaguely similar patch from Murali and Florian, who are credited as
co-developers down below.
Co-developed-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
Signed-off-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
Co-developed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 21:55:42 +02:00
/*
* MTU change functionality . Switches can also adjust their MRU through
* this method . By MTU , one understands the SDU ( L2 payload ) length .
* If the switch needs to account for the DSA tag on the CPU port , this
2020-07-15 09:42:43 -07:00
* method needs to do so privately .
net: dsa: configure the MTU for switch ports
It is useful be able to configure port policers on a switch to accept
frames of various sizes:
- Increase the MTU for better throughput from the default of 1500 if it
is known that there is no 10/100 Mbps device in the network.
- Decrease the MTU to limit the latency of high-priority frames under
congestion, or work around various network segments that add extra
headers to packets which can't be fragmented.
For DSA slave ports, this is mostly a pass-through callback, called
through the regular ndo ops and at probe time (to ensure consistency
across all supported switches).
The CPU port is called with an MTU equal to the largest configured MTU
of the slave ports. The assumption is that the user might want to
sustain a bidirectional conversation with a partner over any switch
port.
The DSA master is configured the same as the CPU port, plus the tagger
overhead. Since the MTU is by definition L2 payload (sans Ethernet
header), it is up to each individual driver to figure out if it needs to
do anything special for its frame tags on the CPU port (it shouldn't
except in special cases). So the MTU does not contain the tagger
overhead on the CPU port.
However the MTU of the DSA master, minus the tagger overhead, is used as
a proxy for the MTU of the CPU port, which does not have a net device.
This is to avoid uselessly calling the .change_mtu function on the CPU
port when nothing should change.
So it is safe to assume that the DSA master and the CPU port MTUs are
apart by exactly the tagger's overhead in bytes.
Some changes were made around dsa_master_set_mtu(), function which was
now removed, for 2 reasons:
- dev_set_mtu() already calls dev_validate_mtu(), so it's redundant to
do the same thing in DSA
- __dev_set_mtu() returns 0 if ops->ndo_change_mtu is an absent method
That is to say, there's no need for this function in DSA, we can safely
call dev_set_mtu() directly, take the rtnl lock when necessary, and just
propagate whatever errors get reported (since the user probably wants to
be informed).
Some inspiration (mainly in the MTU DSA notifier) was taken from a
vaguely similar patch from Murali and Florian, who are credited as
co-developers down below.
Co-developed-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
Signed-off-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
Co-developed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 21:55:42 +02:00
*/
int ( * port_change_mtu ) ( struct dsa_switch * ds , int port ,
int new_mtu ) ;
int ( * port_max_mtu ) ( struct dsa_switch * ds , int port ) ;
2021-01-13 09:42:53 +01:00
/*
* LAG integration
*/
int ( * port_lag_change ) ( struct dsa_switch * ds , int port ) ;
int ( * port_lag_join ) ( struct dsa_switch * ds , int port ,
struct net_device * lag ,
struct netdev_lag_upper_info * info ) ;
int ( * port_lag_leave ) ( struct dsa_switch * ds , int port ,
struct net_device * lag ) ;
2021-02-09 19:02:12 -06:00
/*
* HSR integration
*/
int ( * port_hsr_join ) ( struct dsa_switch * ds , int port ,
struct net_device * hsr ) ;
int ( * port_hsr_leave ) ( struct dsa_switch * ds , int port ,
struct net_device * hsr ) ;
2021-02-16 22:42:04 +01:00
/*
* MRP integration
*/
int ( * port_mrp_add ) ( struct dsa_switch * ds , int port ,
const struct switchdev_obj_mrp * mrp ) ;
int ( * port_mrp_del ) ( struct dsa_switch * ds , int port ,
const struct switchdev_obj_mrp * mrp ) ;
int ( * port_mrp_add_ring_role ) ( struct dsa_switch * ds , int port ,
const struct switchdev_obj_ring_role_mrp * mrp ) ;
int ( * port_mrp_del_ring_role ) ( struct dsa_switch * ds , int port ,
const struct switchdev_obj_ring_role_mrp * mrp ) ;
2021-07-19 20:14:49 +03:00
/*
* tag_8021q operations
*/
int ( * tag_8021q_vlan_add ) ( struct dsa_switch * ds , int port , u16 vid ,
u16 flags ) ;
int ( * tag_8021q_vlan_del ) ( struct dsa_switch * ds , int port , u16 vid ) ;
2019-10-25 01:03:51 +02:00
} ;
# define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes) \
DEVLINK_PARAM_DRIVER ( _id , _name , _type , _cmodes , \
dsa_devlink_param_get , dsa_devlink_param_set , NULL )
int dsa_devlink_param_get ( struct devlink * dl , u32 id ,
struct devlink_param_gset_ctx * ctx ) ;
int dsa_devlink_param_set ( struct devlink * dl , u32 id ,
struct devlink_param_gset_ctx * ctx ) ;
int dsa_devlink_params_register ( struct dsa_switch * ds ,
const struct devlink_param * params ,
size_t params_count ) ;
void dsa_devlink_params_unregister ( struct dsa_switch * ds ,
const struct devlink_param * params ,
size_t params_count ) ;
2019-11-05 01:12:57 +01:00
int dsa_devlink_resource_register ( struct dsa_switch * ds ,
const char * resource_name ,
u64 resource_size ,
u64 resource_id ,
u64 parent_resource_id ,
const struct devlink_resource_size_params * size_params ) ;
void dsa_devlink_resources_unregister ( struct dsa_switch * ds ) ;
void dsa_devlink_resource_occ_get_register ( struct dsa_switch * ds ,
u64 resource_id ,
devlink_resource_occ_get_t * occ_get ,
void * occ_get_priv ) ;
void dsa_devlink_resource_occ_get_unregister ( struct dsa_switch * ds ,
u64 resource_id ) ;
2020-09-18 21:11:04 +02:00
struct devlink_region *
dsa_devlink_region_create ( struct dsa_switch * ds ,
const struct devlink_region_ops * ops ,
u32 region_max_snapshots , u64 region_size ) ;
2020-10-04 18:12:55 +02:00
struct devlink_region *
dsa_devlink_port_region_create ( struct dsa_switch * ds ,
int port ,
const struct devlink_port_region_ops * ops ,
u32 region_max_snapshots , u64 region_size ) ;
2020-09-18 21:11:04 +02:00
void dsa_devlink_region_destroy ( struct devlink_region * region ) ;
net: dsa: introduce a dsa_port_from_netdev public helper
As its implementation shows, this is synonimous with calling
dsa_slave_dev_check followed by dsa_slave_to_port, so it is quite simple
already and provides functionality which is already there.
However there is now a need for these functions outside dsa_priv.h, for
example in drivers that perform mirroring and redirection through
tc-flower offloads (they are given raw access to the flow_cls_offload
structure), where they need to call this function on act->dev.
But simply exporting dsa_slave_to_port would make it non-inline and
would result in an extra function call in the hotpath, as can be seen
for example in sja1105:
Before:
000006dc <sja1105_xmit>:
{
6dc: e92d4ff0 push {r4, r5, r6, r7, r8, r9, sl, fp, lr}
6e0: e1a04000 mov r4, r0
6e4: e591958c ldr r9, [r1, #1420] ; 0x58c <- Inline dsa_slave_to_port
6e8: e1a05001 mov r5, r1
6ec: e24dd004 sub sp, sp, #4
u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
6f0: e1c901d8 ldrd r0, [r9, #24]
6f4: ebfffffe bl 0 <dsa_8021q_tx_vid>
6f4: R_ARM_CALL dsa_8021q_tx_vid
u8 pcp = netdev_txq_to_tc(netdev, queue_mapping);
6f8: e1d416b0 ldrh r1, [r4, #96] ; 0x60
u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
6fc: e1a08000 mov r8, r0
After:
000006e4 <sja1105_xmit>:
{
6e4: e92d4ff0 push {r4, r5, r6, r7, r8, r9, sl, fp, lr}
6e8: e1a04000 mov r4, r0
6ec: e24dd004 sub sp, sp, #4
struct dsa_port *dp = dsa_slave_to_port(netdev);
6f0: e1a00001 mov r0, r1
{
6f4: e1a05001 mov r5, r1
struct dsa_port *dp = dsa_slave_to_port(netdev);
6f8: ebfffffe bl 0 <dsa_slave_to_port>
6f8: R_ARM_CALL dsa_slave_to_port
6fc: e1a09000 mov r9, r0
u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
700: e1c001d8 ldrd r0, [r0, #24]
704: ebfffffe bl 0 <dsa_8021q_tx_vid>
704: R_ARM_CALL dsa_8021q_tx_vid
Because we want to avoid possible performance regressions, introduce
this new function which is designed to be public.
Suggested-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 22:20:52 +03:00
struct dsa_port * dsa_port_from_netdev ( struct net_device * netdev ) ;
2019-11-05 01:12:57 +01:00
2019-10-25 01:03:51 +02:00
struct dsa_devlink_priv {
struct dsa_switch * ds ;
2011-11-27 17:06:08 +00:00
} ;
2020-09-18 21:11:03 +02:00
static inline struct dsa_switch * dsa_devlink_to_ds ( struct devlink * dl )
{
struct dsa_devlink_priv * dl_priv = devlink_priv ( dl ) ;
return dl_priv - > ds ;
}
2020-10-04 18:12:56 +02:00
static inline
struct dsa_switch * dsa_devlink_port_to_ds ( struct devlink_port * port )
{
struct devlink * dl = port - > devlink ;
struct dsa_devlink_priv * dl_priv = devlink_priv ( dl ) ;
return dl_priv - > ds ;
}
static inline int dsa_devlink_port_to_port ( struct devlink_port * port )
{
return port - > index ;
}
2017-01-08 14:52:07 -08:00
struct dsa_switch_driver {
struct list_head list ;
2017-01-08 14:52:08 -08:00
const struct dsa_switch_ops * ops ;
2017-01-08 14:52:07 -08:00
} ;
2017-02-04 13:02:42 -08:00
struct net_device * dsa_dev_to_net_device ( struct device * dev ) ;
2011-11-27 17:06:08 +00:00
2017-06-01 16:07:11 -04:00
/* Keep inline for faster access in hot path */
2020-05-10 19:37:40 +03:00
static inline bool netdev_uses_dsa ( const struct net_device * dev )
2017-03-28 23:45:06 +02:00
{
# if IS_ENABLED(CONFIG_NET_DSA)
2017-06-01 16:07:13 -04:00
return dev - > dsa_ptr & & dev - > dsa_ptr - > rcv ;
2017-03-28 23:45:06 +02:00
# endif
return false ;
}
2020-09-26 22:32:06 +03:00
/* All DSA tags that push the EtherType to the right (basically all except tail
* tags , which don ' t break dissection ) can be treated the same from the
* perspective of the flow dissector .
*
* We need to return :
* - offset : the ( B - A ) difference between :
* A . the position of the real EtherType and
* B . the current skb - > data ( aka ETH_HLEN bytes into the frame , aka 2 bytes
* after the normal EtherType was supposed to be )
* The offset in bytes is exactly equal to the tagger overhead ( and half of
* that , in __be16 shorts ) .
*
* - proto : the value of the real EtherType .
*/
static inline void dsa_tag_generic_flow_dissect ( const struct sk_buff * skb ,
__be16 * proto , int * offset )
{
# if IS_ENABLED(CONFIG_NET_DSA)
const struct dsa_device_ops * ops = skb - > dev - > dsa_ptr - > tag_ops ;
2021-06-11 22:01:24 +03:00
int tag_len = ops - > needed_headroom ;
2020-09-26 22:32:06 +03:00
* offset = tag_len ;
* proto = ( ( __be16 * ) skb - > data ) [ ( tag_len / 2 ) - 1 ] ;
# endif
}
2020-07-19 20:49:52 -07:00
# if IS_ENABLED(CONFIG_NET_DSA)
static inline int __dsa_netdevice_ops_check ( struct net_device * dev )
{
int err = - EOPNOTSUPP ;
if ( ! dev - > dsa_ptr )
return err ;
if ( ! dev - > dsa_ptr - > netdev_ops )
return err ;
return 0 ;
}
2021-07-27 15:45:13 +02:00
static inline int dsa_ndo_eth_ioctl ( struct net_device * dev , struct ifreq * ifr ,
int cmd )
2020-07-19 20:49:52 -07:00
{
const struct dsa_netdevice_ops * ops ;
int err ;
err = __dsa_netdevice_ops_check ( dev ) ;
if ( err )
return err ;
ops = dev - > dsa_ptr - > netdev_ops ;
2021-07-27 15:45:13 +02:00
return ops - > ndo_eth_ioctl ( dev , ifr , cmd ) ;
2020-07-19 20:49:52 -07:00
}
# else
2021-07-27 15:45:13 +02:00
static inline int dsa_ndo_eth_ioctl ( struct net_device * dev , struct ifreq * ifr ,
int cmd )
2020-07-19 20:49:52 -07:00
{
return - EOPNOTSUPP ;
}
# endif
2016-06-04 21:17:07 +02:00
void dsa_unregister_switch ( struct dsa_switch * ds ) ;
2017-05-26 18:12:51 -04:00
int dsa_register_switch ( struct dsa_switch * ds ) ;
net: dsa: be compatible with masters which unregister on shutdown
Lino reports that on his system with bcmgenet as DSA master and KSZ9897
as a switch, rebooting or shutting down never works properly.
What does the bcmgenet driver have special to trigger this, that other
DSA masters do not? It has an implementation of ->shutdown which simply
calls its ->remove implementation. Otherwise said, it unregisters its
network interface on shutdown.
This message can be seen in a loop, and it hangs the reboot process there:
unregister_netdevice: waiting for eth0 to become free. Usage count = 3
So why 3?
A usage count of 1 is normal for a registered network interface, and any
virtual interface which links itself as an upper of that will increment
it via dev_hold. In the case of DSA, this is the call path:
dsa_slave_create
-> netdev_upper_dev_link
-> __netdev_upper_dev_link
-> __netdev_adjacent_dev_insert
-> dev_hold
So a DSA switch with 3 interfaces will result in a usage count elevated
by two, and netdev_wait_allrefs will wait until they have gone away.
Other stacked interfaces, like VLAN, watch NETDEV_UNREGISTER events and
delete themselves, but DSA cannot just vanish and go poof, at most it
can unbind itself from the switch devices, but that must happen strictly
earlier compared to when the DSA master unregisters its net_device, so
reacting on the NETDEV_UNREGISTER event is way too late.
It seems that it is a pretty established pattern to have a driver's
->shutdown hook redirect to its ->remove hook, so the same code is
executed regardless of whether the driver is unbound from the device, or
the system is just shutting down. As Florian puts it, it is quite a big
hammer for bcmgenet to unregister its net_device during shutdown, but
having a common code path with the driver unbind helps ensure it is well
tested.
So DSA, for better or for worse, has to live with that and engage in an
arms race of implementing the ->shutdown hook too, from all individual
drivers, and do something sane when paired with masters that unregister
their net_device there. The only sane thing to do, of course, is to
unlink from the master.
However, complications arise really quickly.
The pattern of redirecting ->shutdown to ->remove is not unique to
bcmgenet or even to net_device drivers. In fact, SPI controllers do it
too (see dspi_shutdown -> dspi_remove), and presumably, I2C controllers
and MDIO controllers do it too (this is something I have not researched
too deeply, but even if this is not the case today, it is certainly
plausible to happen in the future, and must be taken into consideration).
Since DSA switches might be SPI devices, I2C devices, MDIO devices, the
insane implication is that for the exact same DSA switch device, we
might have both ->shutdown and ->remove getting called.
So we need to do something with that insane environment. The pattern
I've come up with is "if this, then not that", so if either ->shutdown
or ->remove gets called, we set the device's drvdata to NULL, and in the
other hook, we check whether the drvdata is NULL and just do nothing.
This is probably not necessary for platform devices, just for devices on
buses, but I would really insist for consistency among drivers, because
when code is copy-pasted, it is not always copy-pasted from the best
sources.
So depending on whether the DSA switch's ->remove or ->shutdown will get
called first, we cannot really guarantee even for the same driver if
rebooting will result in the same code path on all platforms. But
nonetheless, we need to do something minimally reasonable on ->shutdown
too to fix the bug. Of course, the ->remove will do more (a full
teardown of the tree, with all data structures freed, and this is why
the bug was not caught for so long). The new ->shutdown method is kept
separate from dsa_unregister_switch not because we couldn't have
unregistered the switch, but simply in the interest of doing something
quick and to the point.
The big question is: does the DSA switch's ->shutdown get called earlier
than the DSA master's ->shutdown? If not, there is still a risk that we
might still trigger the WARN_ON in unregister_netdevice that says we are
attempting to unregister a net_device which has uppers. That's no good.
Although the reference to the master net_device won't physically go away
even if DSA's ->shutdown comes afterwards, remember we have a dev_hold
on it.
The answer to that question lies in this comment above device_link_add:
* A side effect of the link creation is re-ordering of dpm_list and the
* devices_kset list by moving the consumer device and all devices depending
* on it to the ends of these lists (that does not happen to devices that have
* not been registered when this function is called).
so the fact that DSA uses device_link_add towards its master is not
exactly for nothing. device_shutdown() walks devices_kset from the back,
so this is our guarantee that DSA's shutdown happens before the master's
shutdown.
Fixes: 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings")
Link: https://lore.kernel.org/netdev/20210909095324.12978-1-LinoSanfilippo@gmx.de/
Reported-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2021-09-17 16:34:33 +03:00
void dsa_switch_shutdown ( struct dsa_switch * ds ) ;
2020-05-10 19:37:42 +03:00
struct dsa_switch * dsa_switch_find ( int tree_index , int sw_index ) ;
2016-08-18 15:30:12 -07:00
# ifdef CONFIG_PM_SLEEP
int dsa_switch_suspend ( struct dsa_switch * ds ) ;
int dsa_switch_resume ( struct dsa_switch * ds ) ;
# else
static inline int dsa_switch_suspend ( struct dsa_switch * ds )
{
return 0 ;
}
static inline int dsa_switch_resume ( struct dsa_switch * ds )
{
return 0 ;
}
# endif /* CONFIG_PM_SLEEP */
2017-10-11 10:57:48 -07:00
# if IS_ENABLED(CONFIG_NET_DSA)
2021-01-07 03:24:01 +02:00
bool dsa_slave_dev_check ( const struct net_device * dev ) ;
2017-10-11 10:57:48 -07:00
# else
2021-01-07 03:24:01 +02:00
static inline bool dsa_slave_dev_check ( const struct net_device * dev )
{
return false ;
}
2017-10-11 10:57:48 -07:00
# endif
2019-05-05 13:19:25 +03:00
netdev_tx_t dsa_enqueue_skb ( struct sk_buff * skb , struct net_device * dev ) ;
2018-04-25 12:12:52 -07:00
int dsa_port_get_phy_strings ( struct dsa_port * dp , uint8_t * data ) ;
int dsa_port_get_ethtool_phy_stats ( struct dsa_port * dp , uint64_t * data ) ;
int dsa_port_get_phy_sset_count ( struct dsa_port * dp ) ;
2018-05-10 13:17:32 -07:00
void dsa_port_phylink_mac_change ( struct dsa_switch * ds , int port , bool up ) ;
2018-04-25 12:12:52 -07:00
2019-04-28 19:37:15 +02:00
struct dsa_tag_driver {
const struct dsa_device_ops * ops ;
struct list_head list ;
struct module * owner ;
} ;
void dsa_tag_drivers_register ( struct dsa_tag_driver * dsa_tag_driver_array [ ] ,
unsigned int count ,
struct module * owner ) ;
void dsa_tag_drivers_unregister ( struct dsa_tag_driver * dsa_tag_driver_array [ ] ,
unsigned int count ) ;
# define dsa_tag_driver_module_drivers(__dsa_tag_drivers_array, __count) \
static int __init dsa_tag_driver_module_init ( void ) \
{ \
dsa_tag_drivers_register ( __dsa_tag_drivers_array , __count , \
THIS_MODULE ) ; \
return 0 ; \
} \
module_init ( dsa_tag_driver_module_init ) ; \
\
static void __exit dsa_tag_driver_module_exit ( void ) \
{ \
dsa_tag_drivers_unregister ( __dsa_tag_drivers_array , __count ) ; \
} \
module_exit ( dsa_tag_driver_module_exit )
/**
* module_dsa_tag_drivers ( ) - Helper macro for registering DSA tag
* drivers
* @ __ops_array : Array of tag driver strucutres
*
* Helper macro for DSA tag drivers which do not do anything special
* in module init / exit . Each module may only use this macro once , and
* calling it replaces module_init ( ) and module_exit ( ) .
*/
# define module_dsa_tag_drivers(__ops_array) \
dsa_tag_driver_module_drivers ( __ops_array , ARRAY_SIZE ( __ops_array ) )
# define DSA_TAG_DRIVER_NAME(__ops) dsa_tag_driver ## _ ## __ops
/* Create a static structure we can build a linked list of dsa_tag
* drivers
*/
# define DSA_TAG_DRIVER(__ops) \
static struct dsa_tag_driver DSA_TAG_DRIVER_NAME ( __ops ) = { \
. ops = & __ops , \
}
/**
* module_dsa_tag_driver ( ) - Helper macro for registering a single DSA tag
* driver
* @ __ops : Single tag driver structures
*
* Helper macro for DSA tag drivers which do not do anything special
* in module init / exit . Each module may only use this macro once , and
* calling it replaces module_init ( ) and module_exit ( ) .
*/
# define module_dsa_tag_driver(__ops) \
DSA_TAG_DRIVER ( __ops ) ; \
\
static struct dsa_tag_driver * dsa_tag_driver_array [ ] = { \
& DSA_TAG_DRIVER_NAME ( __ops ) \
} ; \
module_dsa_tag_drivers ( dsa_tag_driver_array )
net: Distributed Switch Architecture protocol support
Distributed Switch Architecture is a protocol for managing hardware
switch chips. It consists of a set of MII management registers and
commands to configure the switch, and an ethernet header format to
signal which of the ports of the switch a packet was received from
or is intended to be sent to.
The switches that this driver supports are typically embedded in
access points and routers, and a typical setup with a DSA switch
looks something like this:
+-----------+ +-----------+
| | RGMII | |
| +-------+ +------ 1000baseT MDI ("WAN")
| | | 6-port +------ 1000baseT MDI ("LAN1")
| CPU | | ethernet +------ 1000baseT MDI ("LAN2")
| |MIImgmt| switch +------ 1000baseT MDI ("LAN3")
| +-------+ w/5 PHYs +------ 1000baseT MDI ("LAN4")
| | | |
+-----------+ +-----------+
The switch driver presents each port on the switch as a separate
network interface to Linux, polls the switch to maintain software
link state of those ports, forwards MII management interface
accesses to those network interfaces (e.g. as done by ethtool) to
the switch, and exposes the switch's hardware statistics counters
via the appropriate Linux kernel interfaces.
This initial patch supports the MII management interface register
layout of the Marvell 88E6123, 88E6161 and 88E6165 switch chips, and
supports the "Ethertype DSA" packet tagging format.
(There is no officially registered ethertype for the Ethertype DSA
packet format, so we just grab a random one. The ethertype to use
is programmed into the switch, and the switch driver uses the value
of ETH_P_EDSA for this, so this define can be changed at any time in
the future if the one we chose is allocated to another protocol or
if Ethertype DSA gets its own officially registered ethertype, and
everything will continue to work.)
Signed-off-by: Lennert Buytenhek <buytenh@marvell.com>
Tested-by: Nicolas Pitre <nico@marvell.com>
Tested-by: Byron Bradley <byron.bbradley@gmail.com>
Tested-by: Tim Ellis <tim.ellis@mac.com>
Tested-by: Peter van Valderen <linux@ddcrew.com>
Tested-by: Dirk Teurlings <dirk@upexia.nl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2008-10-07 13:44:02 +00:00
# endif
2019-04-28 19:37:15 +02:00