linux/Documentation/bpf/instruction-set.rst
Linus Torvalds 0326074ff4 Networking changes for 6.1.
Core
 ----
 
  - Introduce and use a single page frag cache for allocating small skb
    heads, clawing back the 10-20% performance regression in UDP flood
    test from previous fixes.
 
  - Run packets which already went thru HW coalescing thru SW GRO.
    This significantly improves TCP segment coalescing and simplifies
    deployments as different workloads benefit from HW or SW GRO.
 
  - Shrink the size of the base zero-copy send structure.
 
  - Move TCP init under a new slow / sleepable version of DO_ONCE().
 
 BPF
 ---
 
  - Add BPF-specific, any-context-safe memory allocator.
 
  - Add helpers/kfuncs for PKCS#7 signature verification from BPF
    programs.
 
  - Define a new map type and related helpers for user space -> kernel
    communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).
 
  - Allow targeting BPF iterators to loop through resources of one
    task/thread.
 
  - Add ability to call selected destructive functions.
    Expose crash_kexec() to allow BPF to trigger a kernel dump.
    Use CAP_SYS_BOOT check on the loading process to judge permissions.
 
  - Enable BPF to collect custom hierarchical cgroup stats efficiently
    by integrating with the rstat framework.
 
  - Support struct arguments for trampoline based programs.
    Only structs with size <= 16B and x86 are supported.
 
  - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
    sockets (instead of just TCP and UDP sockets).
 
  - Add a helper for accessing CLOCK_TAI for time sensitive network
    related programs.
 
  - Support accessing network tunnel metadata's flags.
 
  - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.
 
  - Add support for writing to Netfilter's nf_conn:mark.
 
 Protocols
 ---------
 
  - WiFi: more Extremely High Throughput (EHT) and Multi-Link
    Operation (MLO) work (802.11be, WiFi 7).
 
  - vsock: improve support for SO_RCVLOWAT.
 
  - SMC: support SO_REUSEPORT.
 
  - Netlink: define and document how to use netlink in a "modern" way.
    Support reporting missing attributes via extended ACK.
 
  - IPSec: support collect metadata mode for xfrm interfaces.
 
  - TCPv6: send consistent autoflowlabel in SYN_RECV state
    and RST packets.
 
  - TCP: introduce optional per-netns connection hash table to allow
    better isolation between namespaces (opt-in, at the cost of memory
    and cache pressure).
 
  - MPTCP: support TCP_FASTOPEN_CONNECT.
 
  - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.
 
  - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.
 
  - Open vSwitch:
    - Allow specifying ifindex of new interfaces.
    - Allow conntrack and metering in non-initial user namespace.
 
  - TLS: support the Korean ARIA-GCM crypto algorithm.
 
  - Remove DECnet support.
 
 Driver API
 ----------
 
  - Allow selecting the conduit interface used by each port
    in DSA switches, at runtime.
 
  - Ethernet Power Sourcing Equipment and Power Device support.
 
  - Add tc-taprio support for queueMaxSDU parameter, i.e. setting
    per traffic class max frame size for time-based packet schedules.
 
  - Support PHY rate matching - adapting between differing host-side
    and link-side speeds.
 
  - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.
 
  - Validate OF (device tree) nodes for DSA shared ports; make
    phylink-related properties mandatory on DSA and CPU ports.
    Enforcing more uniformity should allow transitioning to phylink.
 
  - Require that flash component name used during update matches one
    of the components for which version is reported by info_get().
 
  - Remove "weight" argument from driver-facing NAPI API as much
    as possible. It's one of those magic knobs which seemed like
    a good idea at the time but is too indirect to use in practice.
 
  - Support offload of TLS connections with 256 bit keys.
 
 New hardware / drivers
 ----------------------
 
  - Ethernet:
    - Microchip KSZ9896 6-port Gigabit Ethernet Switch
    - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
    - Analog Devices ADIN1110 and ADIN2111 industrial single pair
      Ethernet (10BASE-T1L) MAC+PHY.
    - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).
 
  - Ethernet SFPs / modules:
    - RollBall / Hilink / Turris 10G copper SFPs
    - HALNy GPON module
 
  - WiFi:
    - CYW43439 SDIO chipset (brcmfmac)
    - CYW89459 PCIe chipset (brcmfmac)
    - BCM4378 on Apple platforms (brcmfmac)
 
 Drivers
 -------
 
  - CAN:
    - gs_usb: HW timestamp support
 
  - Ethernet PHYs:
    - lan8814: cable diagnostics
 
  - Ethernet NICs:
    - Intel (100G):
      - implement control of FCS/CRC stripping
      - port splitting via devlink
      - L2TPv3 filtering offload
    - nVidia/Mellanox:
      - tunnel offload for sub-functions
      - MACSec offload, w/ Extended packet number and replay
        window offload
      - significantly restructure, and optimize the AF_XDP support,
        align the behavior with other vendors
    - Huawei:
      - configuring DSCP map for traffic class selection
      - querying standard FEC statistics
      - querying SerDes lane number via ethtool
    - Marvell/Cavium:
      - egress priority flow control
      - MACSec offload
    - AMD/SolarFlare:
      - PTP over IPv6 and raw Ethernet
    - small / embedded:
      - ax88772: convert to phylink (to support SFP cages)
      - altera: tse: convert to phylink
      - ftgmac100: support fixed link
      - enetc: standard Ethtool counters
      - macb: ZynqMP SGMII dynamic configuration support
      - tsnep: support multi-queue and use page pool
      - lan743x: Rx IP & TCP checksum offload
      - igc: add xdp frags support to ndo_xdp_xmit
 
  - Ethernet high-speed switches:
    - Marvell (prestera):
      - support SPAN port features (traffic mirroring)
      - nexthop object offloading
    - Microchip (sparx5):
      - multicast forwarding offload
      - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)
 
  - Ethernet embedded switches:
    - Marvell (mv88e6xxx):
      - support RGMII cmode
    - NXP (felix):
      - standardized ethtool counters
    - Microchip (lan966x):
      - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
      - traffic policing and mirroring
      - link aggregation / bonding offload
      - QUSGMII PHY mode support
 
  - Qualcomm 802.11ax WiFi (ath11k):
    - cold boot calibration support on WCN6750
    - support to connect to a non-transmit MBSSID AP profile
    - enable remain-on-channel support on WCN6750
    - Wake-on-WLAN support for WCN6750
    - support to provide transmit power from firmware via nl80211
    - support to get power save duration for each client
    - spectral scan support for 160 MHz
 
  - MediaTek WiFi (mt76):
    - WiFi-to-Ethernet bridging offload for MT7986 chips
 
  - RealTek WiFi (rtw89):
    - P2P support
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmM7vtkACgkQMUZtbf5S
 Irvotg//dmh53rC+UMKO3OgOqPlSMnaqzbUdDEfN6mj4Mpox7Csb8zERVURHhBHY
 fvlXWsDgxmvgTebI5fvNC5+f1iW5xcqgJV2TWnNmDOKWwvQwb6qQfgixVmunvkpe
 IIukMXYt0dAf9bXeeEfbNXcCb85cPwB76stX0tMV6BX7osp3T0TL1fvFk0NJkL0j
 TeydLad/yAQtPb4TbeWYjNDoxPVDf0cVpUrevLGmWE88UMYmgTqPze+h1W5Wri52
 bzjdLklY/4cgcIZClHQ6F9CeRWqEBxvujA5Hj/cwOcn/ptVVJWUGi7sQo3sYkoSs
 HFu+F8XsTec14kGNC0Ab40eVdqs5l/w8+E+4jvgXeKGOtVns8DwoiUIzqXpyty89
 Ib04mffrwWNjFtHvo/kIsNwP05X2PGE9HUHfwsTUfisl/ASvMmQp7D7vUoqQC/4B
 AMVzT5qpjkmfBHYQQGuw8FxJhMeAOjC6aAo6censhXJyiUhIfleQsN0syHdaNb8q
 9RZlhAgQoVb6ZgvBV8r8unQh/WtNZ3AopwifwVJld2unsE/UNfQy2KyqOWBES/zf
 LP9sfuX0JnmHn8s1BQEUMPU1jF9ZVZCft7nufJDL6JhlAL+bwZeEN4yCiAHOPZqE
 ymSLHI9s8yWZoNpuMWKrI9kFexVnQFKmA3+quAJUcYHNMSsLkL8=
 =Gsio
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core:

   - Introduce and use a single page frag cache for allocating small skb
     heads, clawing back the 10-20% performance regression in UDP flood
     test from previous fixes.

   - Run packets which already went thru HW coalescing thru SW GRO. This
     significantly improves TCP segment coalescing and simplifies
     deployments as different workloads benefit from HW or SW GRO.

   - Shrink the size of the base zero-copy send structure.

   - Move TCP init under a new slow / sleepable version of DO_ONCE().

  BPF:

   - Add BPF-specific, any-context-safe memory allocator.

   - Add helpers/kfuncs for PKCS#7 signature verification from BPF
     programs.

   - Define a new map type and related helpers for user space -> kernel
     communication over a ring buffer (BPF_MAP_TYPE_USER_RINGBUF).

   - Allow targeting BPF iterators to loop through resources of one
     task/thread.

   - Add ability to call selected destructive functions. Expose
     crash_kexec() to allow BPF to trigger a kernel dump. Use
     CAP_SYS_BOOT check on the loading process to judge permissions.

   - Enable BPF to collect custom hierarchical cgroup stats efficiently
     by integrating with the rstat framework.

   - Support struct arguments for trampoline based programs. Only
     structs with size <= 16B and x86 are supported.

   - Invoke cgroup/connect{4,6} programs for unprivileged ICMP ping
     sockets (instead of just TCP and UDP sockets).

   - Add a helper for accessing CLOCK_TAI for time sensitive network
     related programs.

   - Support accessing network tunnel metadata's flags.

   - Make TCP SYN ACK RTO tunable by BPF programs with TCP Fast Open.

   - Add support for writing to Netfilter's nf_conn:mark.

  Protocols:

   - WiFi: more Extremely High Throughput (EHT) and Multi-Link Operation
     (MLO) work (802.11be, WiFi 7).

   - vsock: improve support for SO_RCVLOWAT.

   - SMC: support SO_REUSEPORT.

   - Netlink: define and document how to use netlink in a "modern" way.
     Support reporting missing attributes via extended ACK.

   - IPSec: support collect metadata mode for xfrm interfaces.

   - TCPv6: send consistent autoflowlabel in SYN_RECV state and RST
     packets.

   - TCP: introduce optional per-netns connection hash table to allow
     better isolation between namespaces (opt-in, at the cost of memory
     and cache pressure).

   - MPTCP: support TCP_FASTOPEN_CONNECT.

   - Add NEXT-C-SID support in Segment Routing (SRv6) End behavior.

   - Adjust IP_UNICAST_IF sockopt behavior for connected UDP sockets.

   - Open vSwitch:
      - Allow specifying ifindex of new interfaces.
      - Allow conntrack and metering in non-initial user namespace.

   - TLS: support the Korean ARIA-GCM crypto algorithm.

   - Remove DECnet support.

  Driver API:

   - Allow selecting the conduit interface used by each port in DSA
     switches, at runtime.

   - Ethernet Power Sourcing Equipment and Power Device support.

   - Add tc-taprio support for queueMaxSDU parameter, i.e. setting per
     traffic class max frame size for time-based packet schedules.

   - Support PHY rate matching - adapting between differing host-side
     and link-side speeds.

   - Introduce QUSGMII PHY mode and 1000BASE-KX interface mode.

   - Validate OF (device tree) nodes for DSA shared ports; make
     phylink-related properties mandatory on DSA and CPU ports.
     Enforcing more uniformity should allow transitioning to phylink.

   - Require that flash component name used during update matches one of
     the components for which version is reported by info_get().

   - Remove "weight" argument from driver-facing NAPI API as much as
     possible. It's one of those magic knobs which seemed like a good
     idea at the time but is too indirect to use in practice.

   - Support offload of TLS connections with 256 bit keys.

  New hardware / drivers:

   - Ethernet:
      - Microchip KSZ9896 6-port Gigabit Ethernet Switch
      - Renesas Ethernet AVB (EtherAVB-IF) Gen4 SoCs
      - Analog Devices ADIN1110 and ADIN2111 industrial single pair
        Ethernet (10BASE-T1L) MAC+PHY.
      - Rockchip RV1126 Gigabit Ethernet (a version of stmmac IP).

   - Ethernet SFPs / modules:
      - RollBall / Hilink / Turris 10G copper SFPs
      - HALNy GPON module

   - WiFi:
      - CYW43439 SDIO chipset (brcmfmac)
      - CYW89459 PCIe chipset (brcmfmac)
      - BCM4378 on Apple platforms (brcmfmac)

  Drivers:

   - CAN:
      - gs_usb: HW timestamp support

   - Ethernet PHYs:
      - lan8814: cable diagnostics

   - Ethernet NICs:
      - Intel (100G):
         - implement control of FCS/CRC stripping
         - port splitting via devlink
         - L2TPv3 filtering offload
      - nVidia/Mellanox:
         - tunnel offload for sub-functions
         - MACSec offload, w/ Extended packet number and replay window
           offload
         - significantly restructure, and optimize the AF_XDP support,
           align the behavior with other vendors
      - Huawei:
         - configuring DSCP map for traffic class selection
         - querying standard FEC statistics
         - querying SerDes lane number via ethtool
      - Marvell/Cavium:
         - egress priority flow control
         - MACSec offload
      - AMD/SolarFlare:
         - PTP over IPv6 and raw Ethernet
      - small / embedded:
         - ax88772: convert to phylink (to support SFP cages)
         - altera: tse: convert to phylink
         - ftgmac100: support fixed link
         - enetc: standard Ethtool counters
         - macb: ZynqMP SGMII dynamic configuration support
         - tsnep: support multi-queue and use page pool
         - lan743x: Rx IP & TCP checksum offload
         - igc: add xdp frags support to ndo_xdp_xmit

   - Ethernet high-speed switches:
      - Marvell (prestera):
         - support SPAN port features (traffic mirroring)
         - nexthop object offloading
      - Microchip (sparx5):
         - multicast forwarding offload
         - QoS queuing offload (tc-mqprio, tc-tbf, tc-ets)

   - Ethernet embedded switches:
      - Marvell (mv88e6xxx):
         - support RGMII cmode
      - NXP (felix):
         - standardized ethtool counters
      - Microchip (lan966x):
         - QoS queuing offload (tc-mqprio, tc-tbf, tc-cbs, tc-ets)
         - traffic policing and mirroring
         - link aggregation / bonding offload
         - QUSGMII PHY mode support

   - Qualcomm 802.11ax WiFi (ath11k):
      - cold boot calibration support on WCN6750
      - support to connect to a non-transmit MBSSID AP profile
      - enable remain-on-channel support on WCN6750
      - Wake-on-WLAN support for WCN6750
      - support to provide transmit power from firmware via nl80211
      - support to get power save duration for each client
      - spectral scan support for 160 MHz

   - MediaTek WiFi (mt76):
      - WiFi-to-Ethernet bridging offload for MT7986 chips

   - RealTek WiFi (rtw89):
      - P2P support"

* tag 'net-next-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1864 commits)
  eth: pse: add missing static inlines
  once: rename _SLOW to _SLEEPABLE
  net: pse-pd: add regulator based PSE driver
  dt-bindings: net: pse-dt: add bindings for regulator based PoDL PSE controller
  ethtool: add interface to interact with Ethernet Power Equipment
  net: mdiobus: search for PSE nodes by parsing PHY nodes.
  net: mdiobus: fwnode_mdiobus_register_phy() rework error handling
  net: add framework to support Ethernet PSE and PDs devices
  dt-bindings: net: phy: add PoDL PSE property
  net: marvell: prestera: Propagate nh state from hw to kernel
  net: marvell: prestera: Add neighbour cache accounting
  net: marvell: prestera: add stub handler neighbour events
  net: marvell: prestera: Add heplers to interact with fib_notifier_info
  net: marvell: prestera: Add length macros for prestera_ip_addr
  net: marvell: prestera: add delayed wq and flush wq on deinit
  net: marvell: prestera: Add strict cleanup of fib arbiter
  net: marvell: prestera: Add cleanup of allocated fib_nodes
  net: marvell: prestera: Add router nexthops ABI
  eth: octeon: fix build after netif_napi_add() changes
  net/mlx5: E-Switch, Return EBUSY if can't get mode lock
  ...
2022-10-04 13:38:03 -07:00

329 lines
12 KiB
ReStructuredText

.. contents::
.. sectnum::
========================================
eBPF Instruction Set Specification, v1.0
========================================
This document specifies version 1.0 of the eBPF instruction set.
Registers and calling convention
================================
eBPF has 10 general purpose registers and a read-only frame pointer register,
all of which are 64-bits wide.
The eBPF calling convention is defined as:
* R0: return value from function calls, and exit value for eBPF programs
* R1 - R5: arguments for function calls
* R6 - R9: callee saved registers that function calls will preserve
* R10: read-only frame pointer to access stack
R0 - R5 are scratch registers and eBPF programs needs to spill/fill them if
necessary across calls.
Instruction encoding
====================
eBPF has two instruction encodings:
* the basic instruction encoding, which uses 64 bits to encode an instruction
* the wide instruction encoding, which appends a second 64-bit immediate value
(imm64) after the basic instruction for a total of 128 bits.
The basic instruction encoding looks as follows:
============= ======= =============== ==================== ============
32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB)
============= ======= =============== ==================== ============
immediate offset source register destination register opcode
============= ======= =============== ==================== ============
Note that most instructions do not use all of the fields.
Unused fields shall be cleared to zero.
Instruction classes
-------------------
The three LSB bits of the 'opcode' field store the instruction class:
========= ===== =============================== ===================================
class value description reference
========= ===== =============================== ===================================
BPF_LD 0x00 non-standard load operations `Load and store instructions`_
BPF_LDX 0x01 load into register operations `Load and store instructions`_
BPF_ST 0x02 store from immediate operations `Load and store instructions`_
BPF_STX 0x03 store from register operations `Load and store instructions`_
BPF_ALU 0x04 32-bit arithmetic operations `Arithmetic and jump instructions`_
BPF_JMP 0x05 64-bit jump operations `Arithmetic and jump instructions`_
BPF_JMP32 0x06 32-bit jump operations `Arithmetic and jump instructions`_
BPF_ALU64 0x07 64-bit arithmetic operations `Arithmetic and jump instructions`_
========= ===== =============================== ===================================
Arithmetic and jump instructions
================================
For arithmetic and jump instructions (``BPF_ALU``, ``BPF_ALU64``, ``BPF_JMP`` and
``BPF_JMP32``), the 8-bit 'opcode' field is divided into three parts:
============== ====== =================
4 bits (MSB) 1 bit 3 bits (LSB)
============== ====== =================
operation code source instruction class
============== ====== =================
The 4th bit encodes the source operand:
====== ===== ========================================
source value description
====== ===== ========================================
BPF_K 0x00 use 32-bit immediate as source operand
BPF_X 0x08 use 'src_reg' register as source operand
====== ===== ========================================
The four MSB bits store the operation code.
Arithmetic instructions
-----------------------
``BPF_ALU`` uses 32-bit wide operands while ``BPF_ALU64`` uses 64-bit wide operands for
otherwise identical operations.
The 'code' field encodes the operation as below:
======== ===== ==========================================================
code value description
======== ===== ==========================================================
BPF_ADD 0x00 dst += src
BPF_SUB 0x10 dst -= src
BPF_MUL 0x20 dst \*= src
BPF_DIV 0x30 dst /= src
BPF_OR 0x40 dst \|= src
BPF_AND 0x50 dst &= src
BPF_LSH 0x60 dst <<= src
BPF_RSH 0x70 dst >>= src
BPF_NEG 0x80 dst = ~src
BPF_MOD 0x90 dst %= src
BPF_XOR 0xa0 dst ^= src
BPF_MOV 0xb0 dst = src
BPF_ARSH 0xc0 sign extending shift right
BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
======== ===== ==========================================================
``BPF_ADD | BPF_X | BPF_ALU`` means::
dst_reg = (u32) dst_reg + (u32) src_reg;
``BPF_ADD | BPF_X | BPF_ALU64`` means::
dst_reg = dst_reg + src_reg
``BPF_XOR | BPF_K | BPF_ALU`` means::
src_reg = (u32) src_reg ^ (u32) imm32
``BPF_XOR | BPF_K | BPF_ALU64`` means::
src_reg = src_reg ^ imm32
Byte swap instructions
~~~~~~~~~~~~~~~~~~~~~~
The byte swap instructions use an instruction class of ``BPF_ALU`` and a 4-bit
'code' field of ``BPF_END``.
The byte swap instructions operate on the destination register
only and do not use a separate source register or immediate value.
The 1-bit source operand field in the opcode is used to select what byte
order the operation convert from or to:
========= ===== =================================================
source value description
========= ===== =================================================
BPF_TO_LE 0x00 convert between host byte order and little endian
BPF_TO_BE 0x08 convert between host byte order and big endian
========= ===== =================================================
The 'imm' field encodes the width of the swap operations. The following widths
are supported: 16, 32 and 64.
Examples:
``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means::
dst_reg = htole16(dst_reg)
``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means::
dst_reg = htobe64(dst_reg)
Jump instructions
-----------------
``BPF_JMP32`` uses 32-bit wide operands while ``BPF_JMP`` uses 64-bit wide operands for
otherwise identical operations.
The 'code' field encodes the operation as below:
======== ===== ========================= ============
code value description notes
======== ===== ========================= ============
BPF_JA 0x00 PC += off BPF_JMP only
BPF_JEQ 0x10 PC += off if dst == src
BPF_JGT 0x20 PC += off if dst > src unsigned
BPF_JGE 0x30 PC += off if dst >= src unsigned
BPF_JSET 0x40 PC += off if dst & src
BPF_JNE 0x50 PC += off if dst != src
BPF_JSGT 0x60 PC += off if dst > src signed
BPF_JSGE 0x70 PC += off if dst >= src signed
BPF_CALL 0x80 function call
BPF_EXIT 0x90 function / program return BPF_JMP only
BPF_JLT 0xa0 PC += off if dst < src unsigned
BPF_JLE 0xb0 PC += off if dst <= src unsigned
BPF_JSLT 0xc0 PC += off if dst < src signed
BPF_JSLE 0xd0 PC += off if dst <= src signed
======== ===== ========================= ============
The eBPF program needs to store the return value into register R0 before doing a
BPF_EXIT.
Load and store instructions
===========================
For load and store instructions (``BPF_LD``, ``BPF_LDX``, ``BPF_ST``, and ``BPF_STX``), the
8-bit 'opcode' field is divided as:
============ ====== =================
3 bits (MSB) 2 bits 3 bits (LSB)
============ ====== =================
mode size instruction class
============ ====== =================
The mode modifier is one of:
============= ===== ==================================== =============
mode modifier value description reference
============= ===== ==================================== =============
BPF_IMM 0x00 64-bit immediate instructions `64-bit immediate instructions`_
BPF_ABS 0x20 legacy BPF packet access (absolute) `Legacy BPF Packet access instructions`_
BPF_IND 0x40 legacy BPF packet access (indirect) `Legacy BPF Packet access instructions`_
BPF_MEM 0x60 regular load and store operations `Regular load and store operations`_
BPF_ATOMIC 0xc0 atomic operations `Atomic operations`_
============= ===== ==================================== =============
The size modifier is one of:
============= ===== =====================
size modifier value description
============= ===== =====================
BPF_W 0x00 word (4 bytes)
BPF_H 0x08 half word (2 bytes)
BPF_B 0x10 byte
BPF_DW 0x18 double word (8 bytes)
============= ===== =====================
Regular load and store operations
---------------------------------
The ``BPF_MEM`` mode modifier is used to encode regular load and store
instructions that transfer data between a register and memory.
``BPF_MEM | <size> | BPF_STX`` means::
*(size *) (dst_reg + off) = src_reg
``BPF_MEM | <size> | BPF_ST`` means::
*(size *) (dst_reg + off) = imm32
``BPF_MEM | <size> | BPF_LDX`` means::
dst_reg = *(size *) (src_reg + off)
Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``.
Atomic operations
-----------------
Atomic operations are operations that operate on memory and can not be
interrupted or corrupted by other access to the same memory region
by other eBPF programs or means outside of this specification.
All atomic operations supported by eBPF are encoded as store operations
that use the ``BPF_ATOMIC`` mode modifier as follows:
* ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations
* ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations
* 8-bit and 16-bit wide atomic operations are not supported.
The 'imm' field is used to encode the actual atomic operation.
Simple atomic operation use a subset of the values defined to encode
arithmetic operations in the 'imm' field to encode the atomic operation:
======== ===== ===========
imm value description
======== ===== ===========
BPF_ADD 0x00 atomic add
BPF_OR 0x40 atomic or
BPF_AND 0x50 atomic and
BPF_XOR 0xa0 atomic xor
======== ===== ===========
``BPF_ATOMIC | BPF_W | BPF_STX`` with 'imm' = BPF_ADD means::
*(u32 *)(dst_reg + off16) += src_reg
``BPF_ATOMIC | BPF_DW | BPF_STX`` with 'imm' = BPF ADD means::
*(u64 *)(dst_reg + off16) += src_reg
In addition to the simple atomic operations, there also is a modifier and
two complex atomic operations:
=========== ================ ===========================
imm value description
=========== ================ ===========================
BPF_FETCH 0x01 modifier: return old value
BPF_XCHG 0xe0 | BPF_FETCH atomic exchange
BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange
=========== ================ ===========================
The ``BPF_FETCH`` modifier is optional for simple atomic operations, and
always set for the complex atomic operations. If the ``BPF_FETCH`` flag
is set, then the operation also overwrites ``src_reg`` with the value that
was in memory before it was modified.
The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value
addressed by ``dst_reg + off``.
The ``BPF_CMPXCHG`` operation atomically compares the value addressed by
``dst_reg + off`` with ``R0``. If they match, the value addressed by
``dst_reg + off`` is replaced with ``src_reg``. In either case, the
value that was at ``dst_reg + off`` before the operation is zero-extended
and loaded back to ``R0``.
64-bit immediate instructions
-----------------------------
Instructions with the ``BPF_IMM`` 'mode' modifier use the wide instruction
encoding for an extra imm64 value.
There is currently only one such instruction.
``BPF_LD | BPF_DW | BPF_IMM`` means::
dst_reg = imm64
Legacy BPF Packet access instructions
-------------------------------------
eBPF previously introduced special instructions for access to packet data that were
carried over from classic BPF. However, these instructions are
deprecated and should no longer be used.