.. SPDX-License-Identifier: GPL-2.0

============================
BPF_PROG_TYPE_CGROUP_SOCKOPT
============================

The ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` program type can be attached to two
cgroup hooks:

* ``BPF_CGROUP_GETSOCKOPT`` - called every time a process executes the
  ``getsockopt`` system call.
* ``BPF_CGROUP_SETSOCKOPT`` - called every time a process executes the
  ``setsockopt`` system call.

The context (``struct bpf_sockopt``) carries the associated socket (``sk``)
and all input arguments: ``level``, ``optname``, ``optval`` and ``optlen``.
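
To make the mapping between the system calls and the context fields
concrete, here is a minimal userspace sketch. The helper name
``reuseaddr_roundtrip`` and the choice of ``SO_REUSEADDR`` on an ``AF_UNIX``
socket are illustrative assumptions; whether any BPF program is attached to
the caller's cgroup is outside the control of this code.

.. code-block:: c

	#include <sys/socket.h>
	#include <unistd.h>

	/*
	 * Hypothetical helper: set SO_REUSEADDR on a fresh socket and read
	 * it back. Returns the value read, or -1 on any failure.
	 */
	static int reuseaddr_roundtrip(void)
	{
		int one = 1, val = 0;
		socklen_t len = sizeof(val);
		int fd = socket(AF_UNIX, SOCK_STREAM, 0);

		if (fd < 0)
			return -1;

		/*
		 * level = SOL_SOCKET, optname = SO_REUSEADDR, optval = &one,
		 * optlen = sizeof(one). A BPF_CGROUP_SETSOCKOPT program, if
		 * attached, sees these values in its context.
		 */
		if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
			val = -1;
		/* A BPF_CGROUP_GETSOCKOPT program, if attached, runs for this call. */
		else if (getsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &val, &len) < 0)
			val = -1;

		close(fd);
		return val;
	}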

BPF_CGROUP_SETSOCKOPT
=====================

``BPF_CGROUP_SETSOCKOPT`` is triggered *before* the kernel handles the
socket option, and its context is writable: the program can modify the
supplied arguments before they are passed down to the kernel. This hook has
access to the cgroup and socket local storage.

If a BPF program sets ``optlen`` to -1, control returns to userspace after
all other BPF programs in the cgroup chain finish (i.e. the kernel
``setsockopt`` handling is *not* executed).

Note that ``optlen`` cannot be increased beyond the user-supplied
value. It can only be decreased or set to -1. Any other value will
trigger ``EFAULT``.
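
The ``optlen`` rule can be restated as a small userspace model. This is a
sketch of the rule as worded in this section, not the kernel's
implementation; the function ``check_setsockopt_optlen`` and the enum names
are hypothetical, and the handling of ``optval`` larger than ``PAGE_SIZE``
is deliberately left out.

.. code-block:: c

	#include <errno.h>

	/* Hypothetical outcomes of the check performed after the hook runs. */
	enum setsockopt_action {
		RUN_KERNEL_HANDLER,	/* pass the arguments down to the kernel */
		BYPASS_KERNEL,		/* optlen == -1: skip kernel handling */
	};

	/*
	 * user_optlen is the length supplied by the caller; bpf_optlen is the
	 * value the BPF program left in ctx->optlen. Returns 0 and fills
	 * *action, or -EFAULT when the value is out of range.
	 */
	static int check_setsockopt_optlen(int user_optlen, int bpf_optlen,
					   enum setsockopt_action *action)
	{
		if (bpf_optlen == -1) {
			*action = BYPASS_KERNEL;
			return 0;
		}

		/* Neither growing the buffer nor other negative values are allowed. */
		if (bpf_optlen < -1 || bpf_optlen > user_optlen)
			return -EFAULT;

		*action = RUN_KERNEL_HANDLER;
		return 0;
	}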

Return Type
-----------

* ``0`` - reject the syscall; ``EPERM`` is returned to userspace.
* ``1`` - success; continue with the next BPF program in the cgroup chain.

BPF_CGROUP_GETSOCKOPT
=====================

``BPF_CGROUP_GETSOCKOPT`` is triggered *after* the kernel handling of the
socket option. The BPF hook can observe ``optval``, ``optlen`` and ``retval``
if it is interested in what the kernel has returned. The hook can also
override ``optval``, adjust ``optlen`` and reset ``retval`` to 0. If ``optlen``
is increased above the initial ``getsockopt`` value (i.e. the userspace
buffer is too small), ``EFAULT`` is returned.

This hook has access to the cgroup and socket local storage.

Note that the only acceptable values for ``retval`` are 0 and the
original value that the kernel returned. Any other value will trigger
``EFAULT``.
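
The two constraints stated in this section can be summarised in a short
userspace model. It is only a sketch of the rules as worded here, with a
hypothetical function name, and it does not cover any other validation the
kernel may perform.

.. code-block:: c

	#include <errno.h>

	/*
	 * kernel_retval and user_optlen describe the state before the BPF
	 * program ran; bpf_retval and bpf_optlen are what the program left in
	 * the context. Returns the value the syscall would return under the
	 * rules above.
	 */
	static int check_getsockopt_result(int kernel_retval, int bpf_retval,
					   int user_optlen, int bpf_optlen)
	{
		/* optlen must still fit in the caller's buffer. */
		if (bpf_optlen > user_optlen)
			return -EFAULT;

		/* retval may only be reset to 0 or left as the kernel set it. */
		if (bpf_retval != 0 && bpf_retval != kernel_retval)
			return -EFAULT;

		return bpf_retval;
	}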

Return Type
-----------

* ``0`` - reject the syscall; ``EPERM`` is returned to userspace.
* ``1`` - success: copy ``optval`` and ``optlen`` to userspace and return
  ``retval`` from the syscall (note that this can be overwritten by
  a BPF program from the parent cgroup).

Cgroup Inheritance
==================

Suppose there is the following cgroup hierarchy, where each cgroup
has a ``BPF_CGROUP_GETSOCKOPT`` program attached with the
``BPF_F_ALLOW_MULTI`` flag::

  A (root, parent)
   \
    B (child)

When an application in cgroup B calls ``getsockopt``,
the programs are executed from the bottom up: B, then A. The first program
(B) sees the result of the kernel's ``getsockopt``. It can optionally
adjust ``optval`` and ``optlen`` and reset ``retval`` to 0. Control then
passes to the second program (A), which sees the same context as B,
including any modifications B made.

The same applies to ``BPF_CGROUP_SETSOCKOPT``: if programs are attached to
A and B, the trigger order is B, then A. If B changes any of
the input arguments (``level``, ``optname``, ``optval``, ``optlen``),
the next program in the chain (A) sees those changes,
*not* the original ``setsockopt`` arguments. The potentially
modified values are then passed down to the kernel.
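
The propagation of changes along the chain can be illustrated with a
userspace model. Everything here is hypothetical: ``struct mock_sockopt`` is
a stand-in for the real context, ``prog_b`` and ``prog_a`` play the roles of
the programs attached to B and A, and combining the return values into a
single verdict is a simplifying assumption of this sketch.

.. code-block:: c

	#include <stddef.h>

	/* Mock of a few context fields; not the kernel's struct bpf_sockopt. */
	struct mock_sockopt {
		int level;
		int optname;
		int optlen;
	};

	typedef int (*mock_prog_fn)(struct mock_sockopt *ctx);

	/* Stand-in for the program attached to child cgroup B. */
	static int prog_b(struct mock_sockopt *ctx)
	{
		ctx->optlen = 4;
		return 1;
	}

	/* Stand-in for the program attached to parent cgroup A. */
	static int prog_a(struct mock_sockopt *ctx)
	{
		/* A observes the value written by B, not the original one. */
		if (ctx->optlen == 4)
			ctx->optlen = 2;
		return 1;
	}

	/* Run every program, in order, against one shared context. */
	static int run_chain(mock_prog_fn *progs, size_t n,
			     struct mock_sockopt *ctx)
	{
		int verdict = 1;
		size_t i;

		for (i = 0; i < n; i++) {
			if (!progs[i](ctx))
				verdict = 0;
		}
		return verdict;
	}

Calling ``run_chain`` with the array ``{ prog_b, prog_a }`` mirrors the
bottom-up order: the context that reaches ``prog_a`` already carries the
``optlen`` written by ``prog_b``.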

Large optval
============

When ``optval`` is larger than ``PAGE_SIZE``, the BPF program
can access only the first ``PAGE_SIZE`` bytes of that data, so it has two
options:

* Set ``optlen`` to zero, which indicates that the kernel should
  use the original buffer from userspace. Any modifications
  made by the BPF program to ``optval`` are ignored.
* Set ``optlen`` to a value less than ``PAGE_SIZE``, which
  indicates that the kernel should use BPF's trimmed ``optval``.

When the BPF program returns with ``optlen`` greater than
``PAGE_SIZE``, userspace receives the original kernel
buffers without any modifications that the BPF program might have
applied.
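
The cases above can be condensed into a decision function. This is a
userspace sketch with hypothetical names, not kernel code. The text only
speaks of values below and above ``PAGE_SIZE``, so treating ``optlen`` equal
to the page size as fitting is an assumption of this sketch.

.. code-block:: c

	/* Which buffer ends up being used after the BPF program returns. */
	enum optval_source {
		USE_ORIGINAL_BUFFER,	/* BPF modifications are ignored */
		USE_BPF_BUFFER,		/* BPF's (trimmed) optval is used */
	};

	/*
	 * bpf_optlen is the value the program left in ctx->optlen for an
	 * optval that was originally larger than page_size.
	 */
	static enum optval_source pick_large_optval(long bpf_optlen,
						    long page_size)
	{
		/* Zero explicitly asks for the original buffer. */
		if (bpf_optlen == 0)
			return USE_ORIGINAL_BUFFER;

		/* Left above the page size: the program's copy is not used. */
		if (bpf_optlen > page_size)
			return USE_ORIGINAL_BUFFER;

		return USE_BPF_BUFFER;
	}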

Example
=======

The recommended way to write these BPF programs is as follows:

.. code-block:: c

	SEC("cgroup/getsockopt")
	int getsockopt(struct bpf_sockopt *ctx)
	{
		__u8 *optval_end = ctx->optval_end;
		__u8 *optval = ctx->optval;

		/* Custom socket option. */
		if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
			/* Bounds check before touching optval. */
			if (optval + 1 > optval_end)
				return 0;
			ctx->retval = 0;
			optval[0] = ...;
			ctx->optlen = 1;
			return 1;
		}

		/* Modify kernel's socket option. */
		if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
			/* Bounds check before touching optval. */
			if (optval + 1 > optval_end)
				return 0;
			ctx->retval = 0;
			optval[0] = ...;
			ctx->optlen = 1;
			return 1;
		}

		/* For optval larger than PAGE_SIZE, use the kernel's buffer. */
		if (ctx->optlen > PAGE_SIZE)
			ctx->optlen = 0;

		return 1;
	}

	SEC("cgroup/setsockopt")
	int setsockopt(struct bpf_sockopt *ctx)
	{
		__u8 *optval_end = ctx->optval_end;
		__u8 *optval = ctx->optval;

		/* Custom socket option. */
		if (ctx->level == MY_SOL && ctx->optname == MY_OPTNAME) {
			/* do something */
			/* Bypass the kernel's setsockopt handling. */
			ctx->optlen = -1;
			return 1;
		}

		/* Modify kernel's socket option. */
		if (ctx->level == SOL_IP && ctx->optname == IP_FREEBIND) {
			/* Bounds check before touching optval. */
			if (optval + 1 > optval_end)
				return 0;
			optval[0] = ...;
			return 1;
		}

		/* For optval larger than PAGE_SIZE, use the kernel's buffer. */
		if (ctx->optlen > PAGE_SIZE)
			ctx->optlen = 0;

		return 1;
	}

See ``tools/testing/selftests/bpf/progs/sockopt_sk.c`` for an example
of a BPF program that handles socket options.