Commit Graph

1058886 Commits

Author SHA1 Message Date
34ec7cbf1e KVM: arm64: Hook up ->page_count() for hypervisor stage-1 page-table
kvm_pgtable_hyp_unmap() relies on the ->page_count() function callback
being provided by the memory-management operations for the page-table.

Wire up this callback for the hypervisor stage-1 page-table.

Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-5-qperret@google.com
2021-12-16 12:58:56 +00:00
d6b4bd3f48 KVM: arm64: Fixup hyp stage-1 refcount
In nVHE-protected mode, the hyp stage-1 page-table refcount is broken
due to the lack of refcount support in the early allocator. Fix-up the
refcount in the finalize walker, once the 'hyp_vmemmap' is up and running.

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-4-qperret@google.com
2021-12-16 12:58:56 +00:00
2ea2ff91e8 KVM: arm64: Refcount hyp stage-1 pgtable pages
To prepare the ground for allowing hyp stage-1 mappings to be removed at
run-time, update the KVM page-table code to maintain a correct refcount
using the ->{get,put}_page() function callbacks.

Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-3-qperret@google.com
2021-12-16 12:58:55 +00:00
1fac3cfb9c KVM: arm64: Provide {get,put}_page() stubs for early hyp allocator
In nVHE protected mode, the EL2 code uses a temporary allocator during
boot while re-creating its stage-1 page-table. Unfortunately, the
hyp_vmmemap is not ready to use at this stage, so refcounting pages
is not possible. That is not currently a problem because hyp stage-1
mappings are never removed, which implies refcounting of page-table
pages is unnecessary.

In preparation for allowing hypervisor stage-1 mappings to be removed,
provide stub implementations for {get,put}_page() in the early allocator.

Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20211215161232.1480836-2-qperret@google.com
2021-12-16 12:58:55 +00:00
2d761dbf7f Merge branch kvm-arm64/fpsimd-tracking into kvmarm-master/next
* kvm-arm64/fpsimd-tracking:
  : .
  : Simplify the handling of both the FP/SIMD and SVE state by
  : removing the need for mapping the thread at EL2, and by
  : dropping the tracking of the host's SVE state which is
  : always invalid by construction.
  : .
  arm64/fpsimd: Document the use of TIF_FOREIGN_FPSTATE by KVM
  KVM: arm64: Stop mapping current thread_info at EL2
  KVM: arm64: Introduce flag shadowing TIF_FOREIGN_FPSTATE
  KVM: arm64: Remove unused __sve_save_state
  KVM: arm64: Get rid of host SVE tracking/saving
  KVM: arm64: Reorder vcpu flag definitions

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-01 12:13:44 +00:00
e525523c19 Merge branch kvm-arm64/vcpu-first-run into kvmarm-master/next
* kvm-arm64/vcpu-first-run:
  : Rework the "vcpu first run" sequence to be driven by KVM's
  : "PID change" callback, removing the need for extra state.
  KVM: arm64: Drop vcpu->arch.has_run_once for vcpu->pid
  KVM: arm64: Merge kvm_arch_vcpu_run_pid_change() and kvm_vcpu_first_run_init()
  KVM: arm64: Restructure the point where has_run_once is advertised
  KVM: arm64: Move kvm_arch_vcpu_run_pid_change() out of line
  KVM: arm64: Move SVE state mapping at HYP to finalize-time

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-01 11:52:20 +00:00
cc5705fb1b KVM: arm64: Drop vcpu->arch.has_run_once for vcpu->pid
With the transition to kvm_arch_vcpu_run_pid_change() to handle
the "run once" activities, it becomes obvious that has_run_once
is now an exact shadow of vcpu->pid.

Replace vcpu->arch.has_run_once with a new vcpu_has_run_once()
helper that directly checks for vcpu->pid, and get rid of the
now unused field.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-01 11:51:22 +00:00
b5aa368abf KVM: arm64: Merge kvm_arch_vcpu_run_pid_change() and kvm_vcpu_first_run_init()
The kvm_arch_vcpu_run_pid_change() helper gets called on each PID
change. The kvm_vcpu_first_run_init() helper gets run on the...
first run(!) of a vcpu.

As it turns out, the first run of a vcpu also triggers a PID change
event (vcpu->pid is initially NULL).

Use this property to merge these two helpers and get rid of another
arm64-specific oddity.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-01 11:51:21 +00:00
1408e73d21 KVM: arm64: Restructure the point where has_run_once is advertised
Restructure kvm_vcpu_first_run_init() to set the has_run_once
flag after having completed all the "run once" activities.

This includes moving the flip of the userspace irqchip static key
to a point where nothing can fail.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-01 11:51:21 +00:00
052f064d42 KVM: arm64: Move kvm_arch_vcpu_run_pid_change() out of line
Having kvm_arch_vcpu_run_pid_change() inline doesn't bring anything
to the table. Move it next to kvm_vcpu_first_run_init(), which will
be convenient for what is next to come.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-01 11:51:21 +00:00
bff01a61af KVM: arm64: Move SVE state mapping at HYP to finalize-time
We currently map the SVE state to HYP on detection of a PID change.
Although this matches what we do for FPSIMD, this is pretty pointless
for SVE, as the buffer is per-vcpu and has nothing to do with the
thread that is being run.

Move the mapping of the SVE state to finalize-time, which is where
we allocate the state memory, and thus the most logical place to
do this.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-12-01 11:51:20 +00:00
d58071a8a7 Linux 5.16-rc3 v5.16-rc3 2021-11-28 14:09:19 -08:00
d06c942efe Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull vhost,virtio,vdpa bugfixes from Michael Tsirkin:
 "Misc fixes all over the place.

  Revert of virtio used length validation series: the approach taken
  does not seem to work, breaking too many guests in the process. We'll
  need to do length validation using some other approach"

[ This merge also ends up reverting commit f7a36b03a7 ("vsock/virtio:
  suppress used length validation"), which came in through the
  networking tree in the meantime, and was part of that whole used
  length validation series   - Linus ]

* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
  vdpa_sim: avoid putting an uninitialized iova_domain
  vhost-vdpa: clean irqs before reseting vdpa device
  virtio-blk: modify the value type of num in virtio_queue_rq()
  vhost/vsock: cleanup removing `len` variable
  vhost/vsock: fix incorrect used length reported to the guest
  Revert "virtio_ring: validate used buffer length"
  Revert "virtio-net: don't let virtio core to validate used length"
  Revert "virtio-blk: don't let virtio core to validate used length"
  Revert "virtio-scsi: don't let virtio core to validate used buffer length"
2021-11-28 11:58:52 -08:00
9557e60b8c Merge tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 build fix from Thomas Gleixner:
 "A single fix for a missing __init annotation of prepare_command_line()"

* tag 'x86-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/boot: Mark prepare_command_line() __init
2021-11-28 09:24:50 -08:00
97891bbf38 Merge tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
 "A single scheduler fix to ensure that there is no stale KASAN shadow
  state left on the idle task's stack when a CPU is brought up after it
  was brought down before"

* tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/scs: Reset task stack state in bringup_cpu()
2021-11-28 09:15:34 -08:00
1ed1d3a3da Merge tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
 "A single fix for perf to prevent it from sending SIGTRAP to another
  task from a trace point event as it's not possible to deliver a
  synchronous signal to a different task from there"

* tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf: Ignore sigtrap for tracepoints destined for other tasks
2021-11-28 09:10:54 -08:00
d039f38801 Merge tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "Two regression fixes for reader writer semaphores:

   - Plug a race in the lock handoff which is caused by inconsistency of
     the reader and writer path and can lead to corruption of the
     underlying counter.

   - down_read_trylock() is suboptimal when the lock is contended and
     multiple readers trylock concurrently. That's due to the initial
     value being read non-atomically which results in at least two
     compare exchange loops. Making the initial readout atomic reduces
     this significantly. Whith 40 readers by 11% in a benchmark which
     enforces contention on mmap_sem"

* tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Optimize down_read_trylock() under highly contended case
  locking/rwsem: Make handoff bit handling more consistent
2021-11-28 09:04:41 -08:00
f8132d62a2 Merge tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull another tracing fix from Steven Rostedt:
 "Fix the fix of pid filtering

  The setting of the pid filtering flag tested the "trace only this pid"
  case twice, and ignored the "trace everything but this pid" case.

  The 5.15 kernel does things a little differently due to the new sparse
  pid mask introduced in 5.16, and as the bug was discovered running the
  5.15 kernel, and the first fix was initially done for that kernel,
  that fix handled both cases (only pid and all but pid), but the
  forward port to 5.16 created this bug"

* tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Test the 'Do not trace this pid' case in create event
2021-11-28 08:50:53 -08:00
0757ca01d9 Merge tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull iommu fixes from Joerg Roedel:

 - Intel VT-d fixes:
     - Remove unused PASID_DISABLED
     - Fix RCU locking
     - Fix for the unmap_pages call-back

 - Rockchip RK3568 address mask fix

 - AMD IOMMUv2 log message clarification

* tag 'iommu-fixes-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
  iommu/vt-d: Fix unmap_pages support
  iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock()
  iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568
  iommu/amd: Clarify AMD IOMMUv2 initialization messages
  iommu/vt-d: Remove unused PASID_DISABLED
2021-11-28 07:17:38 -08:00
3498e7f2bb Merge tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull ksmbd fixes from Steve French:
 "Five ksmbd server fixes, four of them for stable:

   - memleak fix

   - fix for default data stream on filesystems that don't support xattr

   - error logging fix

   - session setup fix

   - minor doc cleanup"

* tag '5.16-rc2-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: fix memleak in get_file_stream_info()
  ksmbd: contain default data stream even if xattr is empty
  ksmbd: downgrade addition info error msg to debug in smb2_get_info_sec()
  docs: filesystem: cifs: ksmbd: Fix small layout issues
  ksmbd: Fix an error handling path in 'smb2_sess_setup()'
2021-11-27 14:49:35 -08:00
00169a9245 vmxnet3: Use generic Kconfig option for page size limit
Use the architecture independent Kconfig option PAGE_SIZE_LESS_THAN_64KB
to indicate that VMXNET3 requires a page size smaller than 64kB.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-27 14:34:41 -08:00
4eec7faf67 fs: ntfs: Limit NTFS_RW to page sizes smaller than 64k
NTFS_RW code allocates page size dependent arrays on the stack. This
results in build failures if the page size is 64k or larger.

  fs/ntfs/aops.c: In function 'ntfs_write_mst_block':
  fs/ntfs/aops.c:1311:1: error:
	the frame size of 2240 bytes is larger than 2048 bytes

Since commit f22969a660 ("powerpc/64s: Default to 64K pages for 64 bit
book3s") this affects ppc:allmodconfig builds, but other architectures
supporting page sizes of 64k or larger are also affected.

Increasing the maximum frame size for affected architectures just to
silence this error does not really help.  The frame size would have to
be set to a really large value for 256k pages.  Also, a large frame size
could potentially result in stack overruns in this code and elsewhere
and is therefore not desirable.  Make NTFS_RW dependent on page sizes
smaller than 64k instead.

Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-27 14:34:41 -08:00
1f0e290cc5 arch: Add generic Kconfig option indicating page size smaller than 64k
NTFS_RW and VMXNET3 require a page size smaller than 64kB.  Add generic
Kconfig option for use outside architecture code to avoid architecture
specific Kconfig options in that code.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Cc: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-27 14:34:41 -08:00
27ff768fa2 tracing: Test the 'Do not trace this pid' case in create event
When creating a new event (via a module, kprobe, eprobe, etc), the
descriptors that are created must add flags for pid filtering if an
instance has pid filtering enabled, as the flags are used at the time the
event is executed to know if pid filtering should be done or not.

The "Only trace this pid" case was added, but a cut and paste error made
that case checked twice, instead of checking the "Trace all but this pid"
case.

Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/

Fixes: 6cb206508b ("tracing: Check pid filtering when creating events")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-27 16:50:43 -05:00
4f0dda359c Merge tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
 "Fixes for a resource leak and a build robot complaint about totally
  dead code:

   - Fix buffer resource leak that could lead to livelock on corrupt fs.

   - Remove unused function xfs_inew_wait to shut up the build robots"

* tag 'xfs-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: remove xfs_inew_wait
  xfs: Fix the free logic of state in xfs_attr_node_hasname
2021-11-27 12:59:54 -08:00
adfb743ac0 Merge tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap fixes from Darrick Wong:
 "A single iomap bug fix and a cleanup for 5.16-rc2.

  The bug fix changes how iomap deals with reading from an inline data
  region -- whereas the current code (incorrectly) lets the iomap read
  iter try for more bytes after reading the inline region (which zeroes
  the rest of the page!) and hopes the next iteration terminates, we
  surveyed the inlinedata implementations and realized that all
  inlinedata implementations also require that the inlinedata region end
  at EOF, so we can simply terminate the read.

  The second patch documents these assumptions in the code so that
  they're not subtle implications anymore, and cleans up some of the
  grosser parts of that function.

  Summary:

   - Fix an accounting problem where unaligned inline data reads can run
     off the end of the read iomap iterator. iomap has historically
     required that inline data mappings only exist at the end of a file,
     though this wasn't documented anywhere.

   - Document iomap_read_inline_data and change its return type to be
     appropriate for the information that it's actually returning"

* tag 'iomap-5.16-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  iomap: iomap_read_inline_data cleanup
  iomap: Fix inline extent handling in iomap_readpage
2021-11-27 12:50:03 -08:00
86155d6b43 Merge tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
 "Two fixes to event pid filtering:

   - Make sure newly created events reflect the current state of pid
     filtering

   - Take pid filtering into account when recording trigger events.
     (Also clean up the if statement to be cleaner)"

* tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix pid filtering when triggers are attached
  tracing: Check pid filtering when creating events
2021-11-27 12:03:57 -08:00
86799cdfbc Merge tag 'io_uring-5.16-2021-11-27' of git://git.kernel.dk/linux-block
Pull more io_uring fixes from Jens Axboe:
 "The locking fixup that was applied earlier this rc has both a deadlock
  and IRQ safety issue, let's get that ironed out before -rc3. This
  contains:

   - Link traversal locking fix (Pavel)

   - Cancelation fix (Pavel)

   - Relocate cond_resched() for huge buffer chain freeing, avoiding a
     softlockup warning (Ye)

   - Fix timespec validation (Ye)"

* tag 'io_uring-5.16-2021-11-27' of git://git.kernel.dk/linux-block:
  io_uring: Fix undefined-behaviour in io_issue_sqe
  io_uring: fix soft lockup when call __io_remove_buffers
  io_uring: fix link traversal locking
  io_uring: fail cancellation for EXITING tasks
2021-11-27 11:28:37 -08:00
650c8edf53 Merge tag 'block-5.16-2021-11-27' of git://git.kernel.dk/linux-block
Pull more block fixes from Jens Axboe:
 "Turns out that the flushing out of pending fixes before the
  Thanksgiving break didn't quite work out in terms of timing, so here's
  a followup set of fixes:

   - rq_qos_done() should be called regardless of whether or not we're
     the final put of the request, it's not related to the freeing of
     the state. This fixes an IO stall with wbt that a few users have
     reported, a regression in this release.

   - Only define zram_wb_devops if it's used, fixing a compilation
     warning for some compilers"

* tag 'block-5.16-2021-11-27' of git://git.kernel.dk/linux-block:
  zram: only make zram_wb_devops for CONFIG_ZRAM_WRITEBACK
  block: call rq_qos_done() before ref check in batch completions
2021-11-27 11:19:42 -08:00
9e9fbe44be Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
 "Twelve fixes, eleven in drivers (target, qla2xx, scsi_debug, mpt3sas,
  ufs). The core fix is a minor correction to the previous state update
  fix for the iscsi daemons"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: scsi_debug: Zero clear zones at reset write pointer
  scsi: core: sysfs: Fix setting device state to SDEV_RUNNING
  scsi: scsi_debug: Sanity check block descriptor length in resp_mode_select()
  scsi: target: configfs: Delete unnecessary checks for NULL
  scsi: target: core: Use RCU helpers for INQUIRY t10_alua_tg_pt_gp
  scsi: mpt3sas: Fix incorrect system timestamp
  scsi: mpt3sas: Fix system going into read-only mode
  scsi: mpt3sas: Fix kernel panic during drive powercycle test
  scsi: ufs: ufs-mediatek: Add put_device() after of_find_device_by_node()
  scsi: scsi_debug: Fix type in min_t to avoid stack OOB
  scsi: qla2xxx: edif: Fix off by one bug in qla_edif_app_getfcinfo()
  scsi: ufs: ufshpb: Fix warning in ufshpb_set_hpb_read_to_upiu()
2021-11-27 11:15:17 -08:00
7413927713 Merge tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:

   - NFSv42: Fix pagecache invalidation after COPY/CLONE

  Bugfixes:

   - NFSv42: Don't fail clone() just because the server failed to return
     post-op attributes

   - SUNRPC: use different lockdep keys for INET6 and LOCAL

   - NFSv4.1: handle NFS4ERR_NOSPC from CREATE_SESSION

   - SUNRPC: fix header include guard in trace header"

* tag 'nfs-for-5.16-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: use different lock keys for INET6 and LOCAL
  sunrpc: fix header include guard in trace header
  NFSv4.1: handle NFS4ERR_NOSPC by CREATE_SESSION
  NFSv42: Fix pagecache invalidation after COPY/CLONE
  NFS: Add a tracepoint to show the results of nfs_set_cache_invalid()
  NFSv42: Don't fail clone() unless the OP_CLONE operation failed
2021-11-27 10:33:55 -08:00
52dc4c640a Merge tag 'erofs-for-5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fix from Gao Xiang:
 "Fix an ABBA deadlock introduced by XArray conversion"

* tag 'erofs-for-5.16-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: fix deadlock when shrink erofs slab
2021-11-27 10:27:35 -08:00
7b65b798a6 Merge tag 'powerpc-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
 "Fix KVM using a Power9 instruction on earlier CPUs, which could lead
  to the host SLB being incorrectly invalidated and a subsequent host
  crash.

  Fix kernel hardlockup on vmap stack overflow on 32-bit.

  Thanks to Christophe Leroy, Nicholas Piggin, and Fabiano Rosas"

* tag 'powerpc-5.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/32: Fix hardlockup on vmap stack overflow
  KVM: PPC: Book3S HV: Prevent POWER7/8 TLB flush flushing SLB
2021-11-27 10:06:15 -08:00
6be088036c Merge tag 'mips-fixes_5.16_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:

 - build fix for ZSTD enabled configs

 - fix for preempt warning

 - fix for loongson FTLB detection

 - fix for page table level selection

* tag 'mips-fixes_5.16_2' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
  MIPS: use 3-level pgtable for 64KB page size on MIPS_VA_BITS_48
  MIPS: loongson64: fix FTLB configuration
  MIPS: Fix using smp_processor_id() in preemptible in show_cpuinfo()
  MIPS: boot/compressed/: add __ashldi3 to target for ZSTD compression
2021-11-27 09:50:31 -08:00
f6223ff799 io_uring: Fix undefined-behaviour in io_issue_sqe
We got issue as follows:
================================================================================
UBSAN: Undefined behaviour in ./include/linux/ktime.h:42:14
signed integer overflow:
-4966321760114568020 * 1000000000 cannot be represented in type 'long long int'
CPU: 1 PID: 2186 Comm: syz-executor.2 Not tainted 4.19.90+ #12
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x3f0 arch/arm64/kernel/time.c:78
 show_stack+0x28/0x38 arch/arm64/kernel/traps.c:158
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x170/0x1dc lib/dump_stack.c:118
 ubsan_epilogue+0x18/0xb4 lib/ubsan.c:161
 handle_overflow+0x188/0x1dc lib/ubsan.c:192
 __ubsan_handle_mul_overflow+0x34/0x44 lib/ubsan.c:213
 ktime_set include/linux/ktime.h:42 [inline]
 timespec64_to_ktime include/linux/ktime.h:78 [inline]
 io_timeout fs/io_uring.c:5153 [inline]
 io_issue_sqe+0x42c8/0x4550 fs/io_uring.c:5599
 __io_queue_sqe+0x1b0/0xbc0 fs/io_uring.c:5988
 io_queue_sqe+0x1ac/0x248 fs/io_uring.c:6067
 io_submit_sqe fs/io_uring.c:6137 [inline]
 io_submit_sqes+0xed8/0x1c88 fs/io_uring.c:6331
 __do_sys_io_uring_enter fs/io_uring.c:8170 [inline]
 __se_sys_io_uring_enter fs/io_uring.c:8129 [inline]
 __arm64_sys_io_uring_enter+0x490/0x980 fs/io_uring.c:8129
 invoke_syscall arch/arm64/kernel/syscall.c:53 [inline]
 el0_svc_common+0x374/0x570 arch/arm64/kernel/syscall.c:121
 el0_svc_handler+0x190/0x260 arch/arm64/kernel/syscall.c:190
 el0_svc+0x10/0x218 arch/arm64/kernel/entry.S:1017
================================================================================

As ktime_set only judge 'secs' if big than KTIME_SEC_MAX, but if we pass
negative value maybe lead to overflow.
To address this issue, we must check if 'sec' is negative.

Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211118015907.844807-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-27 06:41:38 -07:00
1d0254e6b4 io_uring: fix soft lockup when call __io_remove_buffers
I got issue as follows:
[ 567.094140] __io_remove_buffers: [1]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680
[  594.360799] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
[  594.364987] Modules linked in:
[  594.365405] irq event stamp: 604180238
[  594.365906] hardirqs last  enabled at (604180237): [<ffffffff93fec9bd>] _raw_spin_unlock_irqrestore+0x2d/0x50
[  594.367181] hardirqs last disabled at (604180238): [<ffffffff93fbbadb>] sysvec_apic_timer_interrupt+0xb/0xc0
[  594.368420] softirqs last  enabled at (569080666): [<ffffffff94200654>] __do_softirq+0x654/0xa9e
[  594.369551] softirqs last disabled at (569080575): [<ffffffff913e1d6a>] irq_exit_rcu+0x1ca/0x250
[  594.370692] CPU: 2 PID: 108 Comm: kworker/u32:5 Tainted: G            L    5.15.0-next-20211112+ #88
[  594.371891] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
[  594.373604] Workqueue: events_unbound io_ring_exit_work
[  594.374303] RIP: 0010:_raw_spin_unlock_irqrestore+0x33/0x50
[  594.375037] Code: 48 83 c7 18 53 48 89 f3 48 8b 74 24 10 e8 55 f5 55 fd 48 89 ef e8 ed a7 56 fd 80 e7 02 74 06 e8 43 13 7b fd fb bf 01 00 00 00 <e8> f8 78 474
[  594.377433] RSP: 0018:ffff888101587a70 EFLAGS: 00000202
[  594.378120] RAX: 0000000024030f0d RBX: 0000000000000246 RCX: 1ffffffff2f09106
[  594.379053] RDX: 0000000000000000 RSI: ffffffff9449f0e0 RDI: 0000000000000001
[  594.379991] RBP: ffffffff9586cdc0 R08: 0000000000000001 R09: fffffbfff2effcab
[  594.380923] R10: ffffffff977fe557 R11: fffffbfff2effcaa R12: ffff8881b8f3def0
[  594.381858] R13: 0000000000000246 R14: ffff888153a8b070 R15: 0000000000000000
[  594.382787] FS:  0000000000000000(0000) GS:ffff888399c00000(0000) knlGS:0000000000000000
[  594.383851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  594.384602] CR2: 00007fcbe71d2000 CR3: 00000000b4216000 CR4: 00000000000006e0
[  594.385540] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  594.386474] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  594.387403] Call Trace:
[  594.387738]  <TASK>
[  594.388042]  find_and_remove_object+0x118/0x160
[  594.389321]  delete_object_full+0xc/0x20
[  594.389852]  kfree+0x193/0x470
[  594.390275]  __io_remove_buffers.part.0+0xed/0x147
[  594.390931]  io_ring_ctx_free+0x342/0x6a2
[  594.392159]  io_ring_exit_work+0x41e/0x486
[  594.396419]  process_one_work+0x906/0x15a0
[  594.399185]  worker_thread+0x8b/0xd80
[  594.400259]  kthread+0x3bf/0x4a0
[  594.401847]  ret_from_fork+0x22/0x30
[  594.402343]  </TASK>

Message from syslogd@localhost at Nov 13 09:09:54 ...
kernel:watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [kworker/u32:5:108]
[  596.793660] __io_remove_buffers: [2099199]start ctx=0xffff8881067bf000 bgid=65533 buf=0xffff8881fefe1680

We can reproduce this issue by follow syzkaller log:
r0 = syz_io_uring_setup(0x401, &(0x7f0000000300), &(0x7f0000003000/0x2000)=nil, &(0x7f0000ff8000/0x4000)=nil, &(0x7f0000000280)=<r1=>0x0, &(0x7f0000000380)=<r2=>0x0)
sendmsg$ETHTOOL_MSG_FEATURES_SET(0xffffffffffffffff, &(0x7f0000003080)={0x0, 0x0, &(0x7f0000003040)={&(0x7f0000000040)=ANY=[], 0x18}}, 0x0)
syz_io_uring_submit(r1, r2, &(0x7f0000000240)=@IORING_OP_PROVIDE_BUFFERS={0x1f, 0x5, 0x0, 0x401, 0x1, 0x0, 0x100, 0x0, 0x1, {0xfffd}}, 0x0)
io_uring_enter(r0, 0x3a2d, 0x0, 0x0, 0x0, 0x0)

The reason above issue  is 'buf->list' has 2,100,000 nodes, occupied cpu lead
to soft lockup.
To solve this issue, we need add schedule point when do while loop in
'__io_remove_buffers'.
After add  schedule point we do regression, get follow data.
[  240.141864] __io_remove_buffers: [1]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
[  268.408260] __io_remove_buffers: [1]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
[  275.899234] __io_remove_buffers: [2099199]start ctx=0xffff888170603000 bgid=65533 buf=0xffff8881116fcb00
[  296.741404] __io_remove_buffers: [1]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
[  305.090059] __io_remove_buffers: [2099199]start ctx=0xffff8881b92d2000 bgid=65533 buf=0xffff888130c83180
[  325.415746] __io_remove_buffers: [1]start ctx=0xffff8881b92d1000 bgid=65533 buf=0xffff8881a17d8f00
[  333.160318] __io_remove_buffers: [2099199]start ctx=0xffff8881b659c000 bgid=65533 buf=0xffff8881010fe380
...

Fixes:8bab4c09f24e("io_uring: allow conditional reschedule for intensive iterators")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Link: https://lore.kernel.org/r/20211122024737.2198530-1-yebin10@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-11-27 06:41:32 -07:00
a55f224ff5 tracing: Fix pid filtering when triggers are attached
If a event is filtered by pid and a trigger that requires processing of
the event to happen is a attached to the event, the discard portion does
not take the pid filtering into account, and the event will then be
recorded when it should not have been.

Cc: stable@vger.kernel.org
Fixes: 3fdaf80f4a ("tracing: Implement event pid filtering")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-11-26 17:37:06 -05:00
86dc40c7ea iommu/vt-d: Fix unmap_pages support
When supporting only the .map and .unmap callbacks of iommu_ops,
the IOMMU driver can make assumptions about the size and alignment
used for mappings based on the driver provided pgsize_bitmap.  VT-d
previously used essentially PAGE_MASK for this bitmap as any power
of two mapping was acceptably filled by native page sizes.

However, with the .map_pages and .unmap_pages interface we're now
getting page-size and count arguments.  If we simply combine these
as (page-size * count) and make use of the previous map/unmap
functions internally, any size and alignment assumptions are very
different.

As an example, a given vfio device assignment VM will often create
a 4MB mapping at IOVA pfn [0x3fe00 - 0x401ff].  On a system that
does not support IOMMU super pages, the unmap_pages interface will
ask to unmap 1024 4KB pages at the base IOVA.  dma_pte_clear_level()
will recurse down to level 2 of the page table where the first half
of the pfn range exactly matches the entire pte level.  We clear the
pte, increment the pfn by the level size, but (oops) the next pte is
on a new page, so we exit the loop an pop back up a level.  When we
then update the pfn based on that higher level, we seem to assume
that the previous pfn value was at the start of the level.  In this
case the level size is 256K pfns, which we add to the base pfn and
get a results of 0x7fe00, which is clearly greater than 0x401ff,
so we're done.  Meanwhile we never cleared the ptes for the remainder
of the range.  When the VM remaps this range, we're overwriting valid
ptes and the VT-d driver complains loudly, as reported by the user
report linked below.

The fix for this seems relatively simple, if each iteration of the
loop in dma_pte_clear_level() is assumed to clear to the end of the
level pte page, then our next pfn should be calculated from level_pfn
rather than our working pfn.

Fixes: 3f34f12597 ("iommu/vt-d: Implement map/unmap_pages() iommu_ops callback")
Reported-by: Ajay Garg <ajaygargnsit@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Giovanni Cabiddu <giovanni.cabiddu@intel.com>
Link: https://lore.kernel.org/all/20211002124012.18186-1-ajaygargnsit@gmail.com/
Link: https://lore.kernel.org/r/163659074748.1617923.12716161410774184024.stgit@omen
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-11-26 22:54:47 +01:00
4e5973dd27 iommu/vt-d: Fix an unbalanced rcu_read_lock/rcu_read_unlock()
If we return -EOPNOTSUPP, the rcu lock remains lock. This is spurious.
Go through the end of the function instead. This way, the missing
'rcu_read_unlock()' is called.

Fixes: 7afd7f6aa2 ("iommu/vt-d: Check FL and SL capability sanity in scalable mode")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/40cc077ca5f543614eab2a10e84d29dd190273f6.1636217517.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211126135556.397932-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-11-26 22:54:47 +01:00
f7ff3cff35 iommu/rockchip: Fix PAGE_DESC_HI_MASKs for RK3568
With the submission of iommu driver for RK3568 a subtle bug was
introduced: PAGE_DESC_HI_MASK1 and PAGE_DESC_HI_MASK2 have to be
the other way arround - that leads to random errors, especially when
addresses beyond 32 bit are used.

Fix it.

Fixes: c55356c534 ("iommu: rockchip: Add support for iommu v2")
Signed-off-by: Alex Bee <knaerzche@gmail.com>
Tested-by: Peter Geis <pgwipeout@gmail.com>
Reviewed-by: Heiko Stuebner <heiko@sntech.de>
Tested-by: Dan Johansen <strit@manjaro.org>
Reviewed-by: Benjamin Gaignard <benjamin.gaignard@collabora.com>
Link: https://lore.kernel.org/r/20211124021325.858139-1-knaerzche@gmail.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2021-11-26 22:54:20 +01:00
717e88aad3 iommu/amd: Clarify AMD IOMMUv2 initialization messages
The messages printed on the initialization of the AMD IOMMUv2 driver
have caused some confusion in the past. Clarify the messages to lower
the confusion in the future.

Cc: stable@vger.kernel.org
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Link: https://lore.kernel.org/r/20211123105507.7654-3-joro@8bytes.org
2021-11-26 22:54:20 +01:00
21e96a2035 iommu/vt-d: Remove unused PASID_DISABLED
The macro is unused after commit 00ecd54013 so it can be removed.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: 00ecd54013 ("iommu/vt-d: Clean up unused PASID updating functions")
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20211123105507.7654-2-joro@8bytes.org
2021-11-26 22:54:20 +01:00
c5c17547b7 Merge tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from netfilter.

  Current release - regressions:

   - r8169: fix incorrect mac address assignment

   - vlan: fix underflow for the real_dev refcnt when vlan creation
     fails

   - smc: avoid warning of possible recursive locking

  Current release - new code bugs:

   - vsock/virtio: suppress used length validation

   - neigh: fix crash in v6 module initialization error path

  Previous releases - regressions:

   - af_unix: fix change in behavior in read after shutdown

   - igb: fix netpoll exit with traffic, avoid warning

   - tls: fix splice_read() when starting mid-record

   - lan743x: fix deadlock in lan743x_phy_link_status_change()

   - marvell: prestera: fix bridge port operation

  Previous releases - always broken:

   - tcp_cubic: fix spurious Hystart ACK train detections for
     not-cwnd-limited flows

   - nexthop: fix refcount issues when replacing IPv6 groups

   - nexthop: fix null pointer dereference when IPv6 is not enabled

   - phylink: force link down and retrigger resolve on interface change

   - mptcp: fix delack timer length calculation and incorrect early
     clearing

   - ieee802154: handle iftypes as u32, prevent shift-out-of-bounds

   - nfc: virtual_ncidev: change default device permissions

   - netfilter: ctnetlink: fix error codes and flags used for kernel
     side filtering of dumps

   - netfilter: flowtable: fix IPv6 tunnel addr match

   - ncsi: align payload to 32-bit to fix dropped packets

   - iavf: fix deadlock and loss of config during VF interface reset

   - ice: avoid bpf_prog refcount underflow

   - ocelot: fix broken PTP over IP and PTP API violations

  Misc:

   - marvell: mvpp2: increase MTU limit when XDP enabled"

* tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
  net: dsa: microchip: implement multi-bridge support
  net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
  net: mscc: ocelot: set up traps for PTP packets
  net: ptp: add a definition for the UDP port for IEEE 1588 general messages
  net: mscc: ocelot: create a function that replaces an existing VCAP filter
  net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP
  net: hns3: fix incorrect components info of ethtool --reset command
  net: hns3: fix one incorrect value of page pool info when queried by debugfs
  net: hns3: add check NULL address for page pool
  net: hns3: fix VF RSS failed problem after PF enable multi-TCs
  net: qed: fix the array may be out of bound
  net/smc: Don't call clcsock shutdown twice when smc shutdown
  net: vlan: fix underflow for the real_dev refcnt
  ptp: fix filter names in the documentation
  ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()
  nfc: virtual_ncidev: change default device permissions
  net/sched: sch_ets: don't peek at classes beyond 'nbands'
  net: stmmac: Disable Tx queues when reconfiguring the interface
  selftests: tls: test for correct proto_ops
  tls: fix replacing proto_ops
  ...
2021-11-26 12:58:53 -08:00
b3612ccdf2 net: dsa: microchip: implement multi-bridge support
Current driver version is able to handle only one bridge at time.
Configuring two bridges on two different ports would end up shorting this
bridges by HW. To reproduce it:

	ip l a name br0 type bridge
	ip l a name br1 type bridge
	ip l s dev br0 up
	ip l s dev br1 up
	ip l s lan1 master br0
	ip l s dev lan1 up
	ip l s lan2 master br1
	ip l s dev lan2 up

	Ping on lan1 and get response on lan2, which should not happen.

This happened, because current driver version is storing one global "Port VLAN
Membership" and applying it to all ports which are members of any
bridge.
To solve this issue, we need to handle each port separately.

This patch is dropping the global port member storage and calculating
membership dynamically depending on STP state and bridge participation.

Note: STP support was broken before this patch and should be fixed
separately.

Fixes: c2e866911e ("net: dsa: microchip: break KSZ9477 DSA driver into two files")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Link: https://lore.kernel.org/r/20211126123926.2981028-1-o.rempel@pengutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26 12:46:38 -08:00
5367cf1c3a Merge tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
 "These fix a NULL pointer dereference in the CPPC library code and a
  locking issue related to printing the names of ACPI device nodes in
  the device properties framework.

  Specifics:

   - Fix NULL pointer dereference in the CPPC library code occuring on
     hybrid systems without CPPC support (Rafael Wysocki).

   - Avoid attempts to acquire a semaphore with interrupts off when
     printing the names of ACPI device nodes and clean up code on top of
     that fix (Sakari Ailus)"

* tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  ACPI: CPPC: Add NULL pointer check to cppc_get_perf()
  ACPI: Make acpi_node_get_parent() local
  ACPI: Get acpi_device's parent from the parent field
2021-11-26 12:19:13 -08:00
0ce629b15d Merge tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
 "These address three issues in the intel_pstate driver and fix two
  problems related to hibernation.

  Specifics:

   - Make intel_pstate work correctly on Ice Lake server systems with
     out-of-band performance control enabled (Adamos Ttofari).

   - Fix EPP handling in intel_pstate during CPU offline and online in
     the active mode (Rafael Wysocki).

   - Make intel_pstate support ITMT on asymmetric systems with
     overclocking enabled (Srinivas Pandruvada).

   - Fix hibernation image saving when using the user space interface
     based on the snapshot special device file (Evan Green).

   - Make the hibernation code release the snapshot block device using
     the same mode that was used when acquiring it (Thomas Zeitlhofer)"

* tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  PM: hibernate: Fix snapshot partial write lengths
  PM: hibernate: use correct mode for swsusp_close()
  cpufreq: intel_pstate: ITMT support for overclocked system
  cpufreq: intel_pstate: Fix active mode offline/online EPP handling
  cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs
2021-11-26 12:14:50 -08:00
925c94371c Merge tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
 "Fix a regression caused by a bugfix in the previous release. The
  symptom is a VM_BUG_ON triggered from splice to the fuse device.

  Unfortunately the original bugfix was already backported to a number
  of stable releases, so this fix-fix will need to be backported as
  well"

* tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: release pipe buf after last use
2021-11-26 12:01:31 -08:00
32c5449754 Merge branch 'fix-broken-ptp-over-ip-on-ocelot-switches'
Vladimir Oltean says:

====================
Fix broken PTP over IP on Ocelot switches

Changes in v2: added patch 5, added Richard's ack for the whole series
sans patch 5 which is new.

Po Liu reported recently that timestamping PTP over IPv4 is broken using
the felix driver on NXP LS1028A. This has been known for a while, of
course, since it has always been broken. The reason is because IP PTP
packets are currently treated as unknown IP multicast, which is not
flooded to the CPU port in the ocelot driver design, so packets don't
reach the ptp4l program.

The series solves the problem by installing packet traps per port when
the timestamping ioctl is called, depending on the RX filter selected
(L2, L4 or both).
====================

Link: https://lore.kernel.org/r/20211126172845.3149260-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26 11:38:23 -08:00
c49a35eedf net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
The driver doesn't support RX timestamping for non-PTP packets, but it
declares that it does. Restrict the reported RX filters to PTP v2 over
L2 and over L4.

Fixes: 4e3b0468e6 ("net: mscc: PTP Hardware Clock (PHC) support")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26 11:38:21 -08:00
96ca08c058 net: mscc: ocelot: set up traps for PTP packets
IEEE 1588 support was declared too soon for the Ocelot switch. Out of
reset, this switch does not apply any special treatment for PTP packets,
i.e. when an event message is received, the natural tendency is to
forward it by MAC DA/VLAN ID. This poses a problem when the ingress port
is under a bridge, since user space application stacks (written
primarily for endpoint ports, not switches) like ptp4l expect that PTP
messages are always received on AF_PACKET / AF_INET sockets (depending
on the PTP transport being used), and never being autonomously
forwarded. Any forwarding, if necessary (for example in Transparent
Clock mode) is handled in software by ptp4l. Having the hardware forward
these packets too will cause duplicates which will confuse endpoints
connected to these switches.

So PTP over L2 barely works, in the sense that PTP packets reach the CPU
port, but they reach it via flooding, and therefore reach lots of other
unwanted destinations too. But PTP over IPv4/IPv6 does not work at all.
This is because the Ocelot switch have a separate destination port mask
for unknown IP multicast (which PTP over IP is) flooding compared to
unknown non-IP multicast (which PTP over L2 is) flooding. Specifically,
the driver allows the CPU port to be in the PGID_MC port group, but not
in PGID_MCIPV4 and PGID_MCIPV6. There are several presentations from
Allan Nielsen which explain that the embedded MIPS CPU on Ocelot
switches is not very powerful at all, so every penny they could save by
not allowing flooding to the CPU port module matters. Unknown IP
multicast did not make it.

The de facto consensus is that when a switch is PTP-aware and an
application stack for PTP is running, switches should have some sort of
trapping mechanism for PTP packets, to extract them from the hardware
data path. This avoids both problems:
(a) PTP packets are no longer flooded to unwanted destinations
(b) PTP over IP packets are no longer denied from reaching the CPU since
    they arrive there via a trap and not via flooding

It is not the first time when this change is attempted. Last time, the
feedback from Allan Nielsen and Andrew Lunn was that the traps should
not be installed by default, and that PTP-unaware switching may be
desired for some use cases:
https://patchwork.ozlabs.org/project/netdev/patch/20190813025214.18601-5-yangbo.lu@nxp.com/

To address that feedback, the present patch adds the necessary packet
traps according to the RX filter configuration transmitted by user space
through the SIOCSHWTSTAMP ioctl. Trapping is done via VCAP IS2, where we
keep 5 filters, which are amended each time RX timestamping is enabled
or disabled on a port:
- 1 for PTP over L2
- 2 for PTP over IPv4 (UDP ports 319 and 320)
- 2 for PTP over IPv6 (UDP ports 319 and 320)

The cookie by which these filters (invisible to tc) are identified is
strategically chosen such that it does not collide with the filters used
for the ocelot-8021q tagging protocol by the Felix driver, or with the
MRP traps set up by the Ocelot library.

Other alternatives were considered, like patching user space to do
something, but there are so many ways in which PTP packets could be made
to reach the CPU, generically speaking, that "do what?" is a very valid
question. The ptp4l program from the linuxptp stack already attempts to
do something: it calls setsockopt(IP_ADD_MEMBERSHIP) (and
PACKET_ADD_MEMBERSHIP, respectively) which translates in both cases into
a dev_mc_add() on the interface, in the kernel:
https://github.com/richardcochran/linuxptp/blob/v3.1.1/udp.c#L73
https://github.com/richardcochran/linuxptp/blob/v3.1.1/raw.c

Reality shows that this is not sufficient in case the interface belongs
to a switchdev driver, as dev_mc_add() does not show the intention to
trap a packet to the CPU, but rather the intention to not drop it (it is
strictly for RX filtering, same as promiscuous does not mean to send all
traffic to the CPU, but to not drop traffic with unknown MAC DA). This
topic is a can of worms in itself, and it would be great if user space
could just stay out of it.

On the other hand, setting up PTP traps privately within the driver is
not new by any stretch of the imagination:
https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c#L833
https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/dsa/hirschmann/hellcreek.c#L1050
https://elixir.bootlin.com/linux/v5.16-rc2/source/include/linux/dsa/sja1105.h#L21

So this is the approach taken here as well. The difference here being
that we prepare and destroy the traps per port, dynamically at runtime,
as opposed to driver init time, because apparently, PTP-unaware
forwarding is a use case.

Fixes: 4e3b0468e6 ("net: mscc: PTP Hardware Clock (PHC) support")
Reported-by: Po Liu <po.liu@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-11-26 11:38:21 -08:00