The added value of this function is that it can deal both with the case
where the VLAN header is in the skb head and with the case where it is in
the offload field.
This is something I was not able to do using other functions in the
network stack.
Since both ocelot-8021q and sja1105 need to do the same stuff, let's
make it a common service provided by tag_8021q.
This is done as refactoring for the new SJA1110 tagger, which partly
uses tag_8021q as well (just like SJA1105), and will be the third caller.
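As a rough illustration of the two cases (the helper below is made up for
this example and is not the actual tag_8021q function), a receive hook
could fetch the VLAN TCI like this:

/* Sketch only: illustrative helper, not the tag_8021q code itself. */
#include <linux/if_vlan.h>
#include <linux/skbuff.h>

static void example_get_rx_vlan(struct sk_buff *skb, u16 *vid, u16 *pcp)
{
        u16 tci;

        if (skb_vlan_tag_present(skb)) {
                /* Tag was stripped by hardware into the offload field */
                tci = skb_vlan_tag_get(skb);
        } else {
                /* Tag is still inline in the skb head: pop it and read it */
                if (__skb_vlan_pop(skb, &tci))
                        return;
        }

        *vid = tci & VLAN_VID_MASK;
        *pcp = (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT;
}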
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
All users of tag_8021q select it in Kconfig, so shim functions are not
needed because it is not possible for it to be disabled and its callers
enabled.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes no sense and is not needed, it is probably a debugging
leftover.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some really really weird switches just couldn't decide whether to use a
normal or a tail tagger, so they just did both.
This creates problems for DSA, because we only have the concept of an
'overhead' which can be applied to the headroom or to the tailroom of
the skb (like for example during the central TX reallocation procedure),
depending on the value of bool tail_tag, but not to both.
We need to generalize DSA to cater for these odd switches by
transforming the 'overhead / tail_tag' pair into 'needed_headroom /
needed_tailroom'.
The DSA master's MTU is increased to account for both.
The flow dissector code is modified such that it only calls the DSA
adjustment callback if the tagger has a non-zero header length.
Taggers are trivially modified to declare either needed_headroom or
needed_tailroom, based on the tail_tag value that they currently
declare.
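As a hedged sketch of what a tagger might declare after this change (the
tagger name, protocol and byte counts below are placeholders; only the two
fields mirror the commit text):

static const struct dsa_device_ops example_tag_ops = {
        .name            = "example",          /* placeholder name */
        .proto           = DSA_TAG_PROTO_NONE, /* placeholder protocol */
        .needed_headroom = 8, /* tag bytes prepended before the frame */
        .needed_tailroom = 4, /* tag bytes appended after the payload */
};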
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
On SJA1105, there is support for a cascade port which is presumably
connected to a downstream SJA1105 switch. The upstream one does not take
PTP timestamps for packets received on this port, presumably because the
downstream switch already did (and for PTP, it only makes sense for the
leaf nodes in a DSA switch tree to do that).
I haven't been able to validate that feature in a fully assembled setup,
so I am disabling the feature by setting the cascade port to an unused
port value (ds->num_ports).
In SJA1110, multiple cascade ports are supported, and CASC_PORT changed
from a port number into a bit mask. So when CASC_PORT is set to ds->num_ports
(which is 11 on SJA1110), it is actually set to 0b1011, so ports 3, 1
and 0 are configured as cascade ports and we cannot take RX timestamps
on them.
So we need to introduce a check for SJA1110 and set things differently
(to zero there), so that the cascading feature is properly disabled and
RX timestamps can be taken on all ports.
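A standalone sketch of the CASC_PORT choice described above (the function
name and parameters are illustrative assumptions, not driver code):

#include <linux/types.h>

static u32 example_casc_port_value(bool casc_port_is_bitmask, u32 num_ports)
{
        if (casc_port_is_bitmask) {
                /* SJA1110 style: num_ports == 11 == 0b1011 would wrongly
                 * mark ports 0, 1 and 3 as cascade ports, so write an
                 * empty mask to keep cascading disabled.
                 */
                return 0;
        }

        /* SJA1105 style: an unused port number keeps cascading disabled. */
        return num_ports;
}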
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
As opposed to SJA1105 where there are parts with TTEthernet and parts
without, in SJA1110 all parts support it, but it must be enabled in the
static config. So enable it unconditionally. We use it for the tc-taprio
and tc-gate offload.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang says:
====================
net: hns3: add support for PTP
This series adds PTP support for the HNS3 ethernet driver.
change log:
V1 -> V2:
1. use spinlock to prevent concurrency
2. add the handling when ptp_clock_register() returns NULL
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a debugfs interface for dumping ptp information, which
is helpful for debugging.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add PTP support for the HNS3 ethernet driver.
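One detail called out in the cover letter above is handling a NULL return
from ptp_clock_register(); a generic (non-hns3-specific) sketch of that
pattern:

#include <linux/err.h>
#include <linux/ptp_clock_kernel.h>

/* Sketch only: generic registration pattern, not the hns3 driver code. */
static int example_ptp_register(struct device *dev,
                                struct ptp_clock_info *info,
                                struct ptp_clock **clock)
{
        *clock = ptp_clock_register(info, dev);
        if (IS_ERR(*clock)) {
                int err = PTR_ERR(*clock);

                *clock = NULL;
                return err;
        }

        /* NULL is not an error: it means PTP clock support is disabled in
         * the kernel, so the driver simply runs without a PHC.
         */
        return 0;
}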
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder says:
====================
net: ipa: memory region rework, part 2
This is the second portion of a set of patches updating the IPA
memory region code.
In this portion (part 2), the focus is on adjusting the code so that
it no longer assumes the memory region descriptor array is indexed
by the region identifier. This brings with it some related cleanup.
Three loops are changed so their loop index variable is an unsigned
rather than an enumerated type.
A set of functions is changed so a region identifier (rather than a
memory region descriptor pointer) is passed as argument, to simplify
their call sites. This isn't entirely related or required, but I
think it improves the code.
A validation function for filter and route table memory regions is
changed to take memory region IDs, rather than determining which
region to validate based on a set of Boolean flags.
Finally, ipa_mem_find() is created to abstract getting a memory
descriptor based on its ID, and it is used everywhere rather than
indexing the array. With that implemented, all of the memory regions can
be defined by arrays of entries without providing index designators.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Finally the code handles the IPA memory region array in the
configuration data without assuming it is indexed by region ID.
Get rid of the array index designators where these arrays are
initialized. As a result, there's no more need to define an
explicitly undefined memory region ID, so get rid of that.
Change ipa_mem_find() so it no longer assumes the ipa->mem[] array
is indexed by memory region ID. Instead, have it search the array
for the entry having the requested memory ID, and return the address
of the descriptor if found. Otherwise return NULL.
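The lookup could look roughly like this (a sketch using the field names
assumed by this series, not a verbatim copy of the driver):

const struct ipa_mem *ipa_mem_find(struct ipa *ipa, enum ipa_mem_id mem_id)
{
        u32 i;

        for (i = 0; i < ipa->mem_count; i++) {
                const struct ipa_mem *mem = &ipa->mem[i];

                if (mem->id == mem_id)
                        return mem;   /* entry defining the requested region */
        }

        return NULL;  /* region is not defined for this platform */
}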
Stop allowing memory regions to be defined with zero size and zero
canary value. Check for this condition in ipa_mem_valid_one().
As a result, it is not necessary to check for this case in
ipa_mem_config().
Finally, there is no need for IPA_MEM_UNDEFINED to be defined any
more, so get rid of it.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce a new function that abstracts finding information about a
region in IPA-local memory, given its memory region ID. For now it
simply uses the region ID as an index into the IPA memory array.
If the region is not defined, ipa_mem_find() returns a null pointer.
Update all code that accesses the ipa->mem[] array directly to use
ipa_mem_find() instead. The return value must be checked for null
when optional memory regions are sought.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Stop passing most of the Boolean flags to ipa_table_valid_one(), and
just pass a memory region ID to it instead. We still need to
indicate whether we're operating on a routing or filter table.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a memory region ID rather than the address of a memory region
descriptor to ipa_table_reset_add() to simplify callers. Similarly,
pass memory region IDs to ipa_table_init_add().
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a memory region ID rather than the address of a memory region
descriptor to ipa_mem_zero_region_add() to simplify callers.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass a memory region ID rather than the address of a memory region
descriptor to ipa_filter_reset_table(), to simplify callers.
We can eliminate the check for a zero region size in this function
because ipa_table_reset_add() checks that before adding anything to
the transaction.
Note that here and in subsequent commits there is no need to check
whether a memory region exists, because we will have already
verified that during initialization.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Do some general cleanup in ipa_cmd_header_valid():
- Delay assigning the mem variable until just before it's used.
- Assign the maximum offset and size values together.
- Improve comments explaining the single range of memory being
made up of a modem portion and an AP portion.
- Record the offset of the combined range in a local variable.
- Do the initial size assignment right after assigning the offset.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change ipa_mem_valid() to iterate over the entries using a u32 index
variable rather than using a memory region ID. Use the ID found
inside the memory descriptor rather than the loop index.
Change ipa_mem_size_valid() to iterate over the entries but without
assuming the array index is the memory region ID. "Empty" entries
will have zero size; and we'll temporarily assume such entries have
zero offset as well (they all do, currently).
Similarly, don't assume the mem[] array is indexed by ID in
ipa_mem_config(). There, "empty" entries will have a zero canary
count, so no special assumptions are needed to handle them correctly.
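A hedged sketch of that iteration pattern (the validation helper called
inside the loop is hypothetical):

static bool example_mem_valid(struct ipa *ipa)
{
        u32 i;  /* plain array index, not a memory region ID */

        for (i = 0; i < ipa->mem_count; i++) {
                const struct ipa_mem *mem = &ipa->mem[i];

                /* "Empty" entries have zero size (and, for now, zero offset) */
                if (!mem->size)
                        continue;

                /* Validate using the ID found inside the descriptor */
                if (!example_mem_valid_one(ipa, mem))
                        return false;
        }

        return true;
}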
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the device to be initialized at a later time if
it is not available at boot. The device will be allowed to probe but
will be given a "down" state. After completing device probe and
registering the net device, the driver will await an interrupt signal
from its partner device, indicating that it is ready for boot. The
driver will schedule a work event to perform the necessary procedure
and begin operation.
Co-developed-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Thomas Falcon <tlfalcon@linux.ibm.com>
Signed-off-by: Cristobal Forno <cforno12@linux.ibm.com>
Acked-by: Lijun Pan <lijunp213@gmail.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadym Kochan says:
====================
net: marvell: prestera: add LAG support
The following features are supported:
- LAG basic operations
- create/delete LAG
- add/remove a member to LAG
- enable/disable member in LAG
- LAG Bridge support
- LAG VLAN support
- LAG FDB support
Limitations:
- Only HASH lag tx type is supported
- The Hash parameters are not configurable. They are applied
during the LAG creation stage.
- Enslaving a port to the LAG device that already has an
upper device is not supported.
Changes extracted from:
https://lkml.org/lkml/2021/2/3/877
and marked with "v2".
v2:
There are 2 additional preparation patches which simplify the
netdev topology handling.
1) Initialize 'lag' with NULL in prestera_lag_create() [suggested by Vladimir Oltean]
2) Use -ENOSPC in prestera_lag_port_add() if the maximum number of
   LAGs has been reached. [suggested by Vladimir Oltean]
3) Do not propagate netdev events to prestera_switchdev but call
   bridge-specific functions instead; this simplifies the code.
   [suggested by Vladimir Oltean]
4) Check info->link_up in prestera_netdev_port_lower_event()
   [suggested by Vladimir Oltean]
5) Return -EOPNOTSUPP in prestera_netdev_port_event() in case the
   LAG hashing mode is not supported. [suggested by Vladimir Oltean]
6) Do not pass the "lower" netdev to the bridge join/leave functions;
   it is not needed because offloading settings are applied on the
   particular physical port, and it would otherwise require an extra
   upper-dev lookup when the port is in a LAG that is in a bridge,
   on VLAN add/del. [suggested by Vladimir Oltean]
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The following features are supported:
- LAG basic operations
- create/delete LAG
- add/remove a member to LAG
- enable/disable member in LAG
- LAG Bridge support
- LAG VLAN support
- LAG FDB support
Limitations:
- Only HASH lag tx type is supported
- The Hash parameters are not configurable. They are applied
during the LAG creation stage.
- Enslaving a port to the LAG device that already has an
upper device is not supported.
Co-developed-by: Andrii Savka <andrii.savka@plvision.eu>
Signed-off-by: Andrii Savka <andrii.savka@plvision.eu>
Signed-off-by: Serhiy Boiko <serhiy.boiko@plvision.eu>
Co-developed-by: Vadym Kochan <vkochan@marvell.com>
Signed-off-by: Vadym Kochan <vkochan@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace prestera_bridge_port_event(...) with
prestera_bridge_port_join(...) and prestera_bridge_port_leave().
This simplifies the code by keeping the netdev-event-specific handling
in prestera_main.c only.
Signed-off-by: Vadym Kochan <vkochan@marvell.com>
CC: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move handling of the PRECHANGEUPPER event from prestera_switchdev to
prestera_main, which is responsible for basic netdev event handling
and for routing events to the related module.
Signed-off-by: Vadym Kochan <vkochan@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
platform_get_irq() already prints an error message when the interrupt
is missing, hence there is no need to duplicate that message in the
drivers.
Signed-off-by: Tan Zhongjun <tanzhongjun@yulong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add description for `tfrc_invert_loss_event_rate` to fix the W=1 warnings:
net/dccp/ccids/lib/tfrc_equation.c:695: warning: Function parameter or
member 'loss_event_rate' not described in 'tfrc_invert_loss_event_rate'
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Richard Sailer <richard_siegfried@systemli.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert list_for_each() to list_for_each_entry() where
applicable. This simplifies the code.
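A generic before/after sketch of this kind of conversion (the struct and
list below are made up, not the driver's own):

#include <linux/list.h>

struct example_item {
        struct list_head node;
        int value;
};

/* Before: list_for_each() plus an explicit list_entry() per iteration */
static int example_sum_old(struct list_head *items)
{
        struct list_head *pos;
        int sum = 0;

        list_for_each(pos, items) {
                struct example_item *it;

                it = list_entry(pos, struct example_item, node);
                sum += it->value;
        }
        return sum;
}

/* After: list_for_each_entry() hides the container_of() step */
static int example_sum_new(struct list_head *items)
{
        struct example_item *it;
        int sum = 0;

        list_for_each_entry(it, items, node)
                sum += it->value;
        return sum;
}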
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert list_for_each() to list_for_each_entry() where
applicable. This simplifies the code.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Lijun Pan <lijunp213@gmail.com>
Reviewed-by: Dany Madden <drt@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert list_for_each() to list_for_each_entry() where
applicable. This simplifies the code.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use devm_platform_get_and_ioremap_resource() to simplify
code.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use devm_platform_get_and_ioremap_resource() to simplify
code and avoid a null-ptr-deref by checking 'res' in it.
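A generic sketch of this conversion (hypothetical probe function, not the
converted driver): the helper combines platform_get_resource() and
devm_ioremap_resource() and validates 'res' internally.

#include <linux/err.h>
#include <linux/platform_device.h>

static int example_probe(struct platform_device *pdev)
{
        struct resource *res;
        void __iomem *base;

        base = devm_platform_get_and_ioremap_resource(pdev, 0, &res);
        if (IS_ERR(base))
                return PTR_ERR(base);

        /* 'res' and 'base' are both valid here */
        return 0;
}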
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 5a5586112b92 ("net: stmmac: support FPE link partner
hand-shaking procedure") introduced the following coverity warning:
"Parse warning (PW.MIXED_ENUM_TYPE)"
"1. mixed_enum_type: enumerated type mixed with another type"
This is because both "lo_state" and "lp_state", whose data type is
enum stmmac_fpe_state, are assigned "FPE_EVENT_UNKNOWN", which is a
macro defined as 0. Fix this by assigning both variables the correct
enum value.
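To illustrate the warning class generically (made-up names, not the stmmac
code): initializing an enum-typed variable with an unrelated macro mixes
types, while the enum's own value keeps it consistent.

enum example_state { EXAMPLE_STATE_OFF = 0, EXAMPLE_STATE_ON };
#define EXAMPLE_EVENT_UNKNOWN 0

static enum example_state bad_init  = EXAMPLE_EVENT_UNKNOWN; /* mixed types */
static enum example_state good_init = EXAMPLE_STATE_OFF;     /* consistent  */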
Fixes: 5a5586112b92 ("net: stmmac: support FPE link partner hand-shaking procedure")
Signed-off-by: Wong Vee Khee <vee.khee.wong@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
A null-ptr-deref will occur if platform_get_resource() returns NULL,
so we need to check its return value.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use devm_platform_get_and_ioremap_resource() to simplify
code.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use devm_platform_get_and_ioremap_resource() to simplify
code.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li says:
====================
net: ixp4xx_hss: clean up some code style issues
This patchset cleans up some code style issues.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Braces {} should be used on all arms of this statement.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Networking block comments don't use an empty /* line,
use /* Comment...
Block comments use * on subsequent lines.
Block comments use a trailing */ on a separate line.
This patch fixes these comment style issues.
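For reference, the preferred networking block comment style looks like this:

/* Networking comments start on the opening line,
 * continuation lines are prefixed with ' *',
 * and the closing marker sits on its own line.
 */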
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
According to checkpatch.pl, a space is prohibited after the open
parenthesis '('.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add the space required before the open parenthesis '(' and after the
close brace '}'.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Should not use assignment in if condition.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the checkpatch error as "foo* bar" and should be "foo *bar",
and "(foo*)" should be "(foo *)".
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch fixes the checkpatch error about missing a blank line
after declarations.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes some redundant blank lines.
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'mlx5-updates-2021-06-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2021-06-09
Introduce steering header insert/remove and switchdev bridge offloads
1) From Yevgeny, Steering header insert/remove support
ConnectX supports offloading of various encapsulations and decapsulations
(e.g. VXLAN), which are performed by 'Packet Reformat' action.
Starting with ConnectX-6 DX, a new reformat type is supported - INSERT_HEADER.
This reformat allows inserting an arbitrary size buffer at a selected location
in the packet on RX flows.
The insert/remove header support is needed as a prerequisite for the
bridge offloads VLAN pop/push support, see below.
2) From Vlad, Support for bridge offloads for switchdev mode
This change implements bridge offloads with VLAN-support that works on top
of mlx5 representors in switchdev mode.
HIGH-LEVEL OVERVIEW
Hardware supported by mlx5 driver doesn't provide dynamic learning or aging
functionality and requires the driver to emulate all switch-like behavior
in software. As such, all packets by default go through miss path, appear
on representor and get to software bridge, if it is the upper device of the
representor. This causes bridge to process packet in software, learn the
MAC address to FDB and send SWITCHDEV_FDB_ADD_TO_DEVICE event to all
subscribers. Upon reception of SWITCHDEV_FDB_ADD_TO_DEVICE notification
mlx5 bridge offloads the FDB to hardware and sends back
SWITCHDEV_FDB_ADD_TO_BRIDGE notification to prevent such entries from being
aged out by the kernel bridge. Leaving aging to the kernel bridge would
result in deletion of offloaded dynamic FDB entries every aging_time
period, because packets are processed by hardware and, consequently, the
'used' timestamp of the FDB entry is not updated. Hardware aging is
emulated in the driver by running a periodic workqueue task that manually
updates the rules according to their hardware counter:
- If hardware counter has changed since last update, the handler updates
'used' timestamp in kernel bridge dynamic entry by sending
SWITCHDEV_FDB_ADD_TO_BRIDGE notification for the entry.
- If FDB entry wasn't updated for user-controllable aging_time period,
then the FDB entry is unoffloaded from hardware and corresponding
SWITCHDEV_FDB_DEL_TO_BRIDGE notification is sent to kernel bridge.
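As a rough sketch of the refresh step above (the helper name and calling
context are assumptions, not the mlx5 code), the driver notifies the kernel
bridge so the entry's 'used' timestamp gets bumped:

#include <net/switchdev.h>

static void example_fdb_refresh(struct net_device *brport_dev,
                                const unsigned char *mac, u16 vid)
{
        struct switchdev_notifier_fdb_info info = {
                .addr = mac,
                .vid = vid,
                .offloaded = true,
        };

        /* Tell the kernel bridge the entry is still in use so it is not
         * aged out; SWITCHDEV_FDB_DEL_TO_BRIDGE would be sent instead when
         * the driver unoffloads an entry.
         */
        call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE, brport_dev,
                                 &info.info, NULL);
}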
The mlx5 bridge offload implementation fully supports port VLAN objects,
including PVID (vlan push) and "Egress Untagged" (vlan pop).
SOFTWARE ARCHITECTURE
Mlx5_eswitch is extended with pointer to new mlx5_esw_bridge_offloads
structure which has a linked list of mlx5_esw_bridge objects. Struct
mlx5_esw_bridge is the main switch object in mlx5 that holds all data for
offloaded FDB entries and metadata for bridge ports and their vlans. The
mlx5_esw_bridge object is created when first representor of eswitch vport
is added to bridge and deleted when the last representor is detached from
it. Bridge FDB entries are saved in a linked list (to iterate over all
FDB entries in the aging workqueue task) and also in a hashtable for
quick lookup by MAC+VLAN tuple. Port metadata is stored in struct
mlx5_esw_bridge_port that is saved in xarray to allow quick lookup by vport
number. Part of the port metadata is the set of port vlans that are
represented by mlx5_esw_bridge_vlan structure. The vlan structure points to
all FDBs on vlan/port via fdb_list linked list.
Simplified diagram of mlx5 bridge objects:
+------------------+
| mxl5_eswitch |
| |
| br_offloads |
+--------+---------+
|
+--------v-------------------+
| mlx5_esw_bridge_offloads |
| |
+--> bridges |
| +-------+--------------------+
| |
| |
| +---v---------------+
| | mlx5_esw_bridge |
| | |
| | vports |
| | |
| | fdb_ht |
| +---+---------------+
| |
| +---v---------------+
+------+ mlx5_esw_bridge |
| |
+-------------------------+ vports |
| | |
| | fdb_ht +------------------------------------------+
| +-------------------+ |
| |
| |
| +----------------------+ +---------------------------+ |
+-> mlx5_esw_bridge_port | +--> mlx5_esw_bridge_fdb_entry <-+
| | | +----------------------+ | +--+------------------------+ |
| | vlans +--+-> mlx5_esw_bridge_vlan | | | |
| | | | | | | +--v------------------------+ |
| +----------------------+ | | fdb_list +--+ | mlx5_esw_bridge_fdb_entry <-+
| | +-------^--------------+ +--+------------------------+ |
| +----------------------+ | | | |
+-> mlx5_esw_bridge_port | | +-----------------------+ |
| | | |
| vlans | | -----------------------+ |
| | +-> mlx5_esw_bridge_vlan | |
+----------------------+ | | +---------------------------+ |
| fdb_list +-----> mlx5_esw_bridge_fdb_entry <-+
+-------^--------------+ +--+------------------------+
| |
+-----------------------+
HARDWARE REPRESENTATION
In order to adhere to kernel software datapath model bridge offloads must
come after TC and NF FDBs. However, since netfilter offload in mlx5 is
implemented with unmanaged tables, its miss path is not automatically
connected to next priority and requires the code to manually connect with
slow table. To keep bridge offloads encapsulated and not mix it with
eswitch offloads new FDB_TC_MISS priority is created between FDB_FT_OFFLOAD
and FDB_SLOW_PATH which allows bridge offloads to be created without
exposing its internal tables to any other modules since miss path of
managed TC-miss table is automatically wired to next priority.
The bridge tables are created with new priority FDB_BR_OFFLOAD in FDB
namespace. The new priority is between tc-miss and slow path priorities.
The priority consists of two levels: the ingress table, which is global
per eswitch, matches incoming packets by src_mac/vid and redirects them
to the next level (the egress table), which is chosen according to the
ingress port's bridge membership and matches on dst_mac/vid in order to
redirect the packet to a vport, as shown in the following diagram:
+
|
+---------v----------+
| |
| FDB_TC_OFFLOAD |
| |
+---------+----------+
|
|
+---------v----------+
| |
| FDB_FT_OFFLOAD |
| |
+---------+----------+
|
|
+---------v----------+
| |
| FDB_TC_MISS |
| |
+---------+----------+
|
+--------------------------------------+
| | |
| +------+ |
| | |
| +------v--------+ FDB_BR_OFFLOAD |
| | INGRESS_TABLE | |
| +------+---+----+ |
| | | match |
| | +---------+ |
| | | | +-------+
| | +-------v-------+ match | | |
| | | EGRESS_TABLE +------------> vport |
| | +-------+-------+ | | |
| | | | +-------+
| | miss | |
| +------+------+ |
| | |
+--------------------------------------+
|
|
+---------v----------+
| |
| FDB_SLOW_PATH |
| |
+---------+----------+
|
v
PATCHES OVERVIEW
1-3 - Miscellaneous refactorings and infrastructure changes.
4 - Mlx5 bridge offload infrastructure and dedicated fs_core
namespace/tables implementation.
5 - FDB entry offload.
6 - Dynamic FDB entry aging.
7-10 - VLAN filtering offload.
11 - Tracepoints for main mlx5 bridge offload events (FDB entry
offload/unoffload, VLAN add/delete, etc.)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Use devm_platform_get_and_ioremap_resource() to simplify
code and avoid a null-ptr-deref by checking 'res' in it.
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>