IF YOU WOULD LIKE TO GET AN ACCOUNT, please write an
email to Administrator. User accounts are meant only to access repo
and report issues and/or generate pull requests.
This is a purpose-specific Git hosting for
BaseALT
projects. Thank you for your understanding!
Только зарегистрированные пользователи имеют доступ к сервису!
Для получения аккаунта, обратитесь к администратору.
follow_pud_mask() does not support non-present pud entry now. As long as
I tested on x86_64 server, follow_pud_mask() still simply returns
no_page_table() for non-present_pud_entry() due to pud_bad(), so no severe
user-visible effect should happen. But generally we should call
follow_huge_pud() for non-present pud entry for 1GB hugetlb page.
Update pud_huge() and follow_huge_pud() to handle non-present pud entries.
The changes are similar to previous works for pud entries commit
e66f17ff7177 ("mm/hugetlb: take page table lock in follow_huge_pmd()") and
commit cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present
hugepage").
Link: https://lkml.kernel.org/r/20220714042420.1847125-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- an old(er) binutils build fix,
- a new-GCC build fix,
- and a kexec boot environment fix.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmLuv4URHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1it3A//fGfrzHGtjHraiBy0H1Erlz0dUa4q/r6v
xPQVFYteGwL/Ynv2rOJreiEXNhv9pRv0cXXNS5iWh8IcP8IUNw6rfYmgr1aDpXdq
WkbJvwouX6JSo3g/CMekKd+Mf7NgA4O1OO65E80c4WJnxgd0AYvr6IxJRLR7X0C7
HwU6p6PmP/RHWT5T170z6sgun+6QdDEYSwFYOhxawL+BJaKEBYnQ0LLQgJazhe7z
uVxONQA9OdWBwMzvZygbOuTzc990jCHRPYgvYQhSZ8CUPuVzaa7IB9KUXh6lu93d
a7nqM3GlWTowBULY6Xq7gWJaJ7jsVWXjqo8SWVlb6YwoLR9dgGSW5bCGV0rOA6o3
yPjQhIQ9H4NOx126wPcCRBh3osGFjqlWUXVw7W51aNgd7hCvlbpWWmREeI/Pm1Ew
WBjQqpf4l0S+0On5FEFaF7swAG3b6KSNSKw7WBmpmTNt5eWOot0EtnjGW75ATpxM
+j2fj/1MIZ/Zp+wYaNK/+abM4sXHhYvU9gpPdJslRr+r2AVjy9gCZ/0zuUIVytwC
gOdV9KhqzlXPJCTm+py7fBt2qM2P5rKT2HBQYiJwIquB2njI0kjUBOJWXsGQ/F/y
hGd6WY8uDuwzzg5JtyfwE6fPGovxL5GCc4w9CYz0DbP0txPYuhMOdkHtAYLyraAj
wtdalMt3cT8=
=EM/G
-----END PGP SIGNATURE-----
Merge tag 'x86-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
- build fix for old(er) binutils
- build fix for new GCC
- kexec boot environment fix
* tag 'x86-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y
x86/numa: Use cpumask_available instead of hardcoded NULL check
x86/bus_lock: Don't assume the init value of DEBUGCTLMSR.BUS_LOCK_DETECT to be zero
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve latency
and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYuravgAKCRDdBJ7gKXxA
jpqSAQDrXSdII+ht9kSHlaCVYjqRFQz/rRvURQrWQV74f6aeiAD+NHHeDPwZn11/
SPktqEUrF1pxnGQxqLh1kUFUhsVZQgE=
=w/UH
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCYuooOQAKCRCAXGG7T9hj
vmmlAPoCfYBh4jKwRnvGvyn+sPQed/r0TH0wnsGK1ccONhyIvAD+IZcSTPsnp4Cj
m1URGGff2PvAyjOIAzQZbKZomtfICwM=
=z2e5
-----END PGP SIGNATURE-----
Merge tag 'for-linus-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- a series fine tuning virtio support for Xen guests, including removal
the now again unused "platform_has()" feature.
- a fix for host admin triggered reboot of Xen guests
- a simple spelling fix
* tag 'for-linus-6.0-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: don't require virtio with grants for non-PV guests
kernel: remove platform_has() infrastructure
virtio: replace restricted mem access flag with callback
xen: Fix spelling mistake
xen/manage: Use orderly_reboot() to reboot
GCC-12 started triggering a new warning:
arch/x86/mm/numa.c: In function ‘cpumask_of_node’:
arch/x86/mm/numa.c:916:39: warning: the comparison will always evaluate as ‘false’ for the address of ‘node_to_cpumask_map’ will never be NULL [-Waddress]
916 | if (node_to_cpumask_map[node] == NULL) {
| ^~
node_to_cpumask_map is of type cpumask_var_t[].
When CONFIG_CPUMASK_OFFSTACK is set, cpumask_var_t is typedef'd to a
pointer for dynamic allocation, else to an array of one element. The
"wicked game" can be checked on line 700 of include/linux/cpumask.h.
The original code in debug_cpumask_set_cpu() and cpumask_of_node() were
probably written by the original authors with CONFIG_CPUMASK_OFFSTACK=y
(i.e. dynamic allocation) in mind, checking if the cpumask was available
via a direct NULL check.
When CONFIG_CPUMASK_OFFSTACK is not set, GCC gives the above warning
while compiling the kernel.
Fix that by using cpumask_available(), which does the NULL check when
CONFIG_CPUMASK_OFFSTACK is set, otherwise returns true. Use it wherever
such checks are made.
Conditional definitions of cpumask_available() can be found along with
the definition of cpumask_var_t. Check the cpumask.h reference mentioned
above.
Fixes: c032ef60d1aa ("cpumask: convert node_to_cpumask_map[] to cpumask_var_t")
Fixes: de2d9445f162 ("x86: Unify node_to_cpumask_map handling between 32 and 64bit")
Signed-off-by: Siddh Raman Pant <code@siddh.me>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220731160913.632092-1-code@siddh.me
This pull request contains the following branches:
doc.2022.06.21a: Documentation updates.
fixes.2022.07.19a: Miscellaneous fixes.
nocb.2022.07.19a: Callback-offload updates, perhaps most notably a new
RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to
be offloaded at boot time, regardless of kernel boot parameters.
This is useful to battery-powered systems such as ChromeOS
and Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel
boot parameter prevents offloaded callbacks from interfering
with real-time workloads and with energy-efficiency mechanisms.
poll.2022.07.21a: Polled grace-period updates, perhaps most notably
making these APIs account for both normal and expedited grace
periods.
rcu-tasks.2022.06.21a: Tasks RCU updates, perhaps most notably reducing
the CPU overhead of RCU tasks trace grace periods by more than
a factor of two on a system with 15,000 tasks. The reduction
is expected to increase with the number of tasks, so it seems
reasonable to hypothesize that a system with 150,000 tasks might
see a 20-fold reduction in CPU overhead.
torture.2022.06.21a: Torture-test updates.
ctxt.2022.07.05a: Updates that merge RCU's dyntick-idle tracking into
context tracking, thus reducing the overhead of transitioning to
kernel mode from either idle or nohz_full userspace execution
for kernels that track context independently of RCU. This is
expected to be helpful primarily for kernels built with
CONFIG_NO_HZ_FULL=y.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmLgMcgTHHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jArXD/0fjbCwqpRjHVTzjMY8jN4zDkqZZD6m
g8Fx27hZ4ToNFwRptyHwNezrNj14skjAJEXfdjaVw32W62ivXvf0HINvSzsTLCSq
k2kWyBdXLc9CwY5p5W4smnpn5VoAScjg5PoPL59INoZ/Zziji323C7Zepl/1DYJt
0T6bPCQjo1ZQoDUCyVpSjDmAqxnderWG0MeJVt74GkLqmnYLANg0GH8c7mH4+9LL
kVGlLp5nlPgNJ4FEoFdMwNU8T/ETmaVld/m2dkiawjkXjJzB2XKtBigU91DDmXz5
7DIdV4ABrxiy4kGNqtIe/jFgnKyVD7xiDpyfjd6KTeDr/rDS8u2ZH7+1iHsyz3g0
Np/tS3vcd0KR+gI/d0eXxPbgm5sKlCmKw/nU2eArpW/+4LmVXBUfHTG9Jg+LJmBc
JrUh6aEdIZJZHgv/nOQBNig7GJW43IG50rjuJxAuzcxiZNEG5lUSS23ysaA9CPCL
PxRWKSxIEfK3kdmvVO5IIbKTQmIBGWlcWMTcYictFSVfBgcCXpPAksGvqA5JiUkc
egW+xLFo/7K+E158vSKsVqlWZcEeUbsNJ88QOlpqnRgH++I2Yv/LhK41XfJfpH+Y
ALxVaDd+mAq6v+qSHNVq9wT3ozXIPy/zK1hDlMIqx40h2YvaEsH4je+521oSoN9r
vX60+QNxvUBLwA==
=vUNm
-----END PGP SIGNATURE-----
Merge tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull RCU updates from Paul McKenney:
- Documentation updates
- Miscellaneous fixes
- Callback-offload updates, perhaps most notably a new
RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that causes all CPUs to be
offloaded at boot time, regardless of kernel boot parameters.
This is useful to battery-powered systems such as ChromeOS and
Android. In addition, a new RCU_NOCB_CPU_CB_BOOST kernel boot
parameter prevents offloaded callbacks from interfering with
real-time workloads and with energy-efficiency mechanisms
- Polled grace-period updates, perhaps most notably making these APIs
account for both normal and expedited grace periods
- Tasks RCU updates, perhaps most notably reducing the CPU overhead of
RCU tasks trace grace periods by more than a factor of two on a
system with 15,000 tasks.
The reduction is expected to increase with the number of tasks, so it
seems reasonable to hypothesize that a system with 150,000 tasks
might see a 20-fold reduction in CPU overhead
- Torture-test updates
- Updates that merge RCU's dyntick-idle tracking into context tracking,
thus reducing the overhead of transitioning to kernel mode from
either idle or nohz_full userspace execution for kernels that track
context independently of RCU.
This is expected to be helpful primarily for kernels built with
CONFIG_NO_HZ_FULL=y
* tag 'rcu.2022.07.26a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (98 commits)
rcu: Add irqs-disabled indicator to expedited RCU CPU stall warnings
rcu: Diagnose extended sync_rcu_do_polled_gp() loops
rcu: Put panic_on_rcu_stall() after expedited RCU CPU stall warnings
rcutorture: Test polled expedited grace-period primitives
rcu: Add polled expedited grace-period primitives
rcutorture: Verify that polled GP API sees synchronous grace periods
rcu: Make Tiny RCU grace periods visible to polled APIs
rcu: Make polled grace-period API account for expedited grace periods
rcu: Switch polled grace-period APIs to ->gp_seq_polled
rcu/nocb: Avoid polling when my_rdp->nocb_head_rdp list is empty
rcu/nocb: Add option to opt rcuo kthreads out of RT priority
rcu: Add nocb_cb_kthread check to rcu_is_callbacks_kthread()
rcu/nocb: Add an option to offload all CPUs on boot
rcu/nocb: Fix NOCB kthreads spawn failure with rcu_nocb_rdp_deoffload() direct call
rcu/nocb: Invert rcu_state.barrier_mutex VS hotplug lock locking order
rcu/nocb: Add/del rdp to iterate from rcuog itself
rcu/tree: Add comment to describe GP-done condition in fqs loop
rcu: Initialize first_gp_fqs at declaration in rcu_gp_fqs()
rcu/kvfree: Remove useless monitor_todo flag
rcu: Cleanup RCU urgency state for offline CPU
...
pr_warn_once() change broke that
- Simplify {JMP,CALL}_NOSPEC and let the objtool retpoline patching
infra take care of them instead of having unreadable alternative macros
there
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnvG8ACgkQEsHwGGHe
VUpbPw//WUoBMnR9B9xpwuk6EBA0rVbQqCt0QYy/h27QuD9aIC+lYduL8CId1AbS
J+bqDvpqFQRqc9idAvgkjspJNMnbOSqzAx1AbT4gBCH33hPYrB/6F7XasXSDZn0M
OVJDyvhOhror2I6YuFc1uwjMPBZj8+fgv823io/RKqvJfj/WUoIBK1oxUlzm4rNs
LkrxgKpvs3QznJ0RNnZUP1kKnezzC1RtxIdz8QD3rirpuZITF8HD04jIUyK7gohh
XP5Vgt+9BzRCHyn43XT7UEobz93WASDn0YOlDCdOuMxhJONYpnvk8dYJkss2BReG
oCGZNiUHKNALd/FWUr5EeWoa84XQwRwm2Y+jjdH3al7LIaIbERGXbTIHsWH6JtVk
+EiXqj+B6d2muZpG/ka0PSgPvbeTeipeugNDzXGCLuWHkjLRwJRV/afUmzLuT7lJ
zzrejZeXACtWmTLXXt5EPAPWsibapiuA6/4ucyr+jJj7CCY8pmrOGcxqHKlq7MHU
9Lk50F6Y9jbEe7mRPmR67dj/P9QAFYewjVDLEgOdyganrNNGfRkddJVbckYYZTVF
YSUsoZBAc5E3Fzo/yQkG3+6K0d13Syuf3QRhGxe2JvYlWt2mktaFsRtcTGSCd9KM
AJw8tV0FDezxmdKQ8DNkeLG+tAt7m22dRJkhbSUilYQsNR2Cwxg=
=70Gc
-----END PGP SIGNATURE-----
Merge tag 'x86_core_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 core updates from Borislav Petkov:
- Have invalid MSR accesses warnings appear only once after a
pr_warn_once() change broke that
- Simplify {JMP,CALL}_NOSPEC and let the objtool retpoline patching
infra take care of them instead of having unreadable alternative
macros there
* tag 'x86_core_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/extable: Fix ex_handler_msr() print condition
x86,nospec: Simplify {JMP,CALL}_NOSPEC
- Update pkeys documentation
- Avoid reading contended mm's TLB generation var if not absolutely
necessary along with fixing a case where arch_tlbbatch_flush() doesn't
adhere to the generation scheme and thus violates the conditions for the
above avoidance.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnmpYACgkQEsHwGGHe
VUrINQ/9FGnQya6mTJitM3Ohdzu1lOrHm5+XAxCO3SVzPPQlx0mRZmszzDOIZpG/
9iCEDhSi+kLdkTwIXk8Nmm1imNT2MSqswjQYr8KDtl69/j12W8Y0Pb5C5tnQnUyi
FXPiVVCAk0iegNg+QvarQa8Ou6tGWDqFMLzdrq9XNokdBmFq7FCDsOjdwd8So3IY
95755wDtCxgBXc2TVr08qSpD0Q/VlHKqb5shtzuoBe9a0YLEaRmWne9UzTOx5U6c
//qk8lmy9ohL8dmN7SgcRITzfpU8ue+/J4oZ+GV9mc/UTW5Ah2WNX+3BFnmCqZrK
gr7G5pukuuJxFj8yGzGbGIM28OHKYIE+So2Q5pA6Vrqst/oyDJS+pcoxyhAYGYCQ
hDjp4yu5AUnsPky6h6VHaR8Er5Nvo7YwhdSazcGD+HC7smwbnVEzI5H7MUgcJ05F
1CkAQSy2TVZe0hhilOu8dcHN23+2ISF8BzxKbn4qtZOsJTN6/U4MYFWl6VPh8P80
vjZcIJYZ4i6Gz03m7ITk2bHwfOD8f/7UkbZEggO/GYm1BgmxaMB0IogoIkSUG9vN
CLGZomRMfBcVVS1DTWJsUzRLbNx3x3pL41NrlxPbC/rTmvts5eJAvcDcffPfRGzx
tCqcASRdV7tQBgMT5MLjmIY8cM1aphdGSdlKVD7QHZ11bJVFZE4=
=aD0S
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm updates from Borislav Petkov:
- Rename a PKRU macro to make more sense when reading the code
- Update pkeys documentation
- Avoid reading contended mm's TLB generation var if not absolutely
necessary along with fixing a case where arch_tlbbatch_flush()
doesn't adhere to the generation scheme and thus violates the
conditions for the above avoidance.
* tag 'x86_mm_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/tlb: Ignore f->new_tlb_gen when zero
x86/pkeys: Clarify PKRU_AD_KEY macro
Documentation/protection-keys: Clean up documentation for User Space pkeys
x86/mm/tlb: Avoid reading mm_tlb_gen when possible
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmLnlkQACgkQEsHwGGHe
VUp9ZA//VO6OYECA3ff6x1KBqB6dbC/W2dc54CPy2MSw5xwUlpPuD0ipbRMM7UEn
zhnMNSwXgtqJdtdJZfvT9XwvrBjigAJJNjR/E3SA7bJ9jk10MEWFY/0YHtshV9vp
y75c028m3yIxf+97Cku59k2Y69u1LWgM1mpFYVgJJCwkprliY9JMyvhboSXO63v9
SR9m6jnDOJV1dIdmp7SgQvKEy2gRkz/kli5yzgNgw9Q0t5yodudaq9nEVKh0tWqP
EQ6TzyORVGmqFZ5Jxti3W6NqUardrCeWwmodC1KwWm7vAcfaJET9ADTj3sAYPG4m
m26i0fdjixnYAu27adiG7txtVgoZ3JkkNMLkDa30S2Dau1zhmGxdwAFrzLP2P84T
UGQqsZ7TNkhtLp2Jlb5pfOAdj5q8mI4BDT6KLIXgJYRm0kLkV7W069mcYCKFD2S4
BCHmDNG1ZVFQUVLG16gdZe4mRlf8mJ8WLCkDbXEAOGHCCB944xLuspgh2L88VrhR
s8EP8mPn9rXGM4drAsao6+gseS4Q1gJ0E2gFh8YTOb0fFGvlTbcvqWCQSVC6P65z
xRv1MWKg/VZcPo3Xpe40CELMcobOGxmxXaKQRJp6KSY7SdvidSb41DGv6LKffmPJ
xckm/13Aip3zVXeBE8hrjulphOAKIfPkujbW0jSTOepTnVBc3Hc=
=ps1Q
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanup from Borislav Petkov:
- A single CONFIG_ symbol correction in a comment
* tag 'x86_cleanups_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Refer to the intended config STRICT_DEVMEM in a comment
Instead of having a global flag to require restricted memory access
for all virtio devices, introduce a callback which can select that
requirement on a per-device basis.
For convenience add a common function returning always true, which can
be used for use cases like SEV.
Per default use a callback always returning false.
As the callback needs to be set in early init code already, add a
virtio anchor which is builtin in case virtio is enabled.
Signed-off-by: Juergen Gross <jgross@suse.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 guest using Xen
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Link: https://lore.kernel.org/r/20220622063838.8854-2-jgross@suse.com
Signed-off-by: Juergen Gross <jgross@suse.com>
On Fri, Jun 17, 2022 at 02:08:52PM +0300, Stephane Eranian wrote:
> Some changes to the way invalid MSR accesses are reported by the
> kernel is causing some problems with messages printed on the
> console.
>
> We have seen several cases of ex_handler_msr() printing invalid MSR
> accesses once but the callstack multiple times causing confusion on
> the console.
> The problem here is that another earlier commit (5.13):
>
> a358f40600b3 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality")
>
> Modifies all the pr_*_once() calls to always return true claiming
> that no caller is ever checking the return value of the functions.
>
> This is why we are seeing the callstack printed without the
> associated printk() msg.
Extract the ONCE_IF(cond) part into __ONCE_LTE_IF() and use that to
implement DO_ONCE_LITE_IF() and fix the extable code.
Fixes: a358f40600b3 ("once: implement DO_ONCE_LITE for non-fast-path "do once" functionality")
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Stephane Eranian <eranian@google.com>
Link: https://lkml.kernel.org/r/YqyVFsbviKjVGGZ9@worktop.programming.kicks-ass.net
Commit aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when
possible") introduced an optimization to skip superfluous TLB
flushes based on the generation provided in flush_tlb_info.
However, arch_tlbbatch_flush() does not provide any generation in
flush_tlb_info and populates the flush_tlb_info generation with
0. This 0 is causes the flush_tlb_info to be interpreted as a
superfluous, old flush. As a result, try_to_unmap_one() would
not perform any TLB flushes.
Fix it by checking whether f->new_tlb_gen is nonzero. Zero value
is anyhow is an invalid generation value. To avoid future
confusion, introduce TLB_GENERATION_INVALID constant and use it
properly. Add warnings to ensure no partial flushes are done with
TLB_GENERATION_INVALID or when f->mm is NULL, since this does not
make any sense.
In addition, add the missing unlikely().
[ dhansen: change VM_BUG_ON() -> VM_WARN_ON(), clarify changelog ]
Fixes: aa44284960d5 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://lkml.kernel.org/r/20220710232837.3618-1-namit@vmware.com
This moves protection_map[] inside the platform and makes it a static.
This also defines a helper function add_encrypt_protection_map() that can
update the protection_map[] array with pgprot_encrypted().
Link: https://lkml.kernel.org/r/20220711070600.2378316-7-anshuman.khandual@arm.com
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
x86_has_pat_wp() is using a wrong test, as it relies on the normal
PAT configuration used by the kernel. In case the PAT MSR has been
setup by another entity (e.g. Xen hypervisor) it might return false
even if the PAT configuration is allowing WP mappings. This due to the
fact that when running as Xen PV guest the PAT MSR is setup by the
hypervisor and cannot be changed by the guest. This results in the WP
related entry to be at a different position when running as Xen PV
guest compared to the bare metal or fully virtualized case.
The correct way to test for WP support is:
1. Get the PTE protection bits needed to select WP mode by reading
__cachemode2pte_tbl[_PAGE_CACHE_MODE_WP] (depending on the PAT MSR
setting this might return protection bits for a stronger mode, e.g.
UC-)
2. Translate those bits back into the real cache mode selected by those
PTE bits by reading __pte2cachemode_tbl[__pte2cm_idx(prot)]
3. Test for the cache mode to be _PAGE_CACHE_MODE_WP
Fixes: f88a68facd9a ("x86/mm: Extend early_memremap() support with additional attrs")
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.14
Link: https://lore.kernel.org/r/20220503132207.17234-1-jgross@suse.com
Commit a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads") adds a
comment to the function devmem_is_allowed() referring to a non-existing
config STRICT_IOMEM, whereas the comment very likely intended to refer to
the config STRICT_DEVMEM, as the commit adds some behavior for the config
STRICT_DEVMEM.
Most of the initial analysis was actually done by Dave Hansen in the
email thread below (see Link).
Refer to the intended and existing config STRICT_DEVMEM.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/r/f9074e8d-9314-9d7d-7bf5-5b5538c8be8d@intel.com/
Link: https://lkml.kernel.org/r/20220707115442.21107-1-lukas.bulwahn@gmail.com
The RCU dynticks counter is going to be merged into the context tracking
subsystem. Prepare with moving the IRQ extended quiescent states
entrypoints to context tracking. For now those are dumb redirection to
existing RCU calls.
[ paulmck: Apply Stephen Rothwell feedback from -next. ]
[ paulmck: Apply Nathan Chancellor feedback. ]
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Nicolas Saenz Julienne <nsaenz@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Cc: Yu Liao <liaoyu15@huawei.com>
Cc: Phil Auld <pauld@redhat.com>
Cc: Paul Gortmaker<paul.gortmaker@windriver.com>
Cc: Alex Belits <abelits@marvell.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Tested-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Update retpoline validation with the new CONFIG_RETPOLINE requirement of
not having bare naked RET instructions.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Specifically, it's because __enc_copy() encrypts the kernel after
being relocated outside the kernel in sme_encrypt_execute(), and the
RET macro's jmp offset isn't amended prior to execution.
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
I observed that for each of the shared file-backed page faults, we're very
likely to retry one more time for the 1st write fault upon no page. It's
because we'll need to release the mmap lock for dirty rate limit purpose
with balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
Then after that throttling we return VM_FAULT_RETRY.
We did that probably because VM_FAULT_RETRY is the only way we can return
to the fault handler at that time telling it we've released the mmap lock.
However that's not ideal because it's very likely the fault does not need
to be retried at all since the pgtable was well installed before the
throttling, so the next continuous fault (including taking mmap read lock,
walk the pgtable, etc.) could be in most cases unnecessary.
It's not only slowing down page faults for shared file-backed, but also add
more mmap lock contention which is in most cases not needed at all.
To observe this, one could try to write to some shmem page and look at
"pgfault" value in /proc/vmstat, then we should expect 2 counts for each
shmem write simply because we retried, and vm event "pgfault" will capture
that.
To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
show that we've completed the whole fault and released the lock. It's also
a hint that we should very possibly not need another fault immediately on
this page because we've just completed it.
This patch provides a ~12% perf boost on my aarch64 test VM with a simple
program sequentially dirtying 400MB shmem file being mmap()ed and these are
the time it needs:
Before: 650.980 ms (+-1.94%)
After: 569.396 ms (+-1.38%)
I believe it could help more than that.
We need some special care on GUP and the s390 pgfault handler (for gmap
code before returning from pgfault), the rest changes in the page fault
handlers should be relatively straightforward.
Another thing to mention is that mm_account_fault() does take this new
fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
them as-is.
Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Vineet Gupta <vgupta@kernel.org>
Acked-by: Guo Ren <guoren@kernel.org>
Acked-by: Max Filippov <jcmvbkbc@gmail.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> [arm part]
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Brian Cain <bcain@quicinc.com>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Richard Weinberger <richard@nod.at>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Will Deacon <will@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Chris Zankel <chris@zankel.net>
Cc: Hugh Dickins <hughd@google.com>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Helge Deller <deller@gmx.de>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It seems to exist since the old times and never used once. Remove them.
Link: https://lkml.kernel.org/r/20220525195220.10241-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When changing the PKRU_AD_KEY macro to be used for PKS the name came
into question.[1]
The intent of PKRU_AD_KEY is to set an initial value for the PKRU
register but that is just a mask value.
Clarify this by changing the name to PKRU_AD_MASK().
NOTE the checkpatch errors are ignored for the init_pkru_value to align
the values in the code.
[1] https://lore.kernel.org/lkml/eff862e2-bfaa-9e12-42b5-a12467d72a22@intel.com/
Suggested-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220419170649.1022246-3-ira.weiny@intel.com
On extreme TLB shootdown storms, the mm's tlb_gen cacheline is highly
contended and reading it should (arguably) be avoided as much as
possible.
Currently, flush_tlb_func() reads the mm's tlb_gen unconditionally,
even when it is not necessary (e.g., the mm was already switched).
This is wasteful.
Moreover, one of the existing optimizations is to read mm's tlb_gen to
see if there are additional in-flight TLB invalidations and flush the
entire TLB in such a case. However, if the request's tlb_gen was already
flushed, the benefit of checking the mm's tlb_gen is likely to be offset
by the overhead of the check itself.
Running will-it-scale with tlb_flush1_threads show a considerable
benefit on 56-core Skylake (up to +24%):
threads Baseline (v5.17+) +Patch
1 159960 160202
5 310808 308378 (-0.7%)
10 479110 490728
15 526771 562528
20 534495 587316
25 547462 628296
30 579616 666313
35 594134 701814
40 612288 732967
45 617517 749727
50 637476 735497
55 614363 778913 (+24%)
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20220606180123.2485171-1-namit@vmware.com
Instead of using arch_has_restricted_virtio_memory_access() together
with CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS, replace those
with platform_has() and a new platform feature
PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS.
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>
Tested-by: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> # Arm64 only
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Borislav Petkov <bp@suse.de>
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmKcdlsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoaDeD/oDyKpbOf+ta83AyZm+h6Q+KzYf44n6
XRaT8Ip1f/aUk+Ucv2YLpZT/LRiBenmJpsdDRaGhje83pS2cnXStd8UhEBjiWmeq
vBGR8QpXJtFna1Ouizwwq+Pp2mnktK+YEk+Qf17z9UwrRzqXlgTIJqPWsVY1plmV
lHVhThdkr09PR4aSBN9TQzTFxG0Qtf3mHbsRyPTzODldMtSbdCy9Zy2xlTigDIJX
uTU/ykoB8qZpJguacAZpExwxvoRHy95cpxNCQfC5paeAv3gJ0U1b3PAJA4VkVlko
h/2crEzHV0MZRx12o1le72K/2UfuvTeb3949VD3ZBIE5fPSH/K/niwl9BtntXSQj
Bs0Kbm3k94tJz2VunG0FO7ayJ8/KL98zLHXQR6yRBTGvtkj9m7vhUMaN268NSLQH
meAIuMreJejfsKV2Z/z9y+nSWebKdEQ0xkOq51qe3s2ivroKshTUNR0h7S35xyw5
+R8lzJZH4rVVBOcxzCLTaIVsJbVHGfE65gb6JIyUT2sRRxXm53gYPC2rvIRzi8eA
1h1ODpz0pn3LewXsigOJGmiK11Dda5uJAF58t1BJmGi9kXSH/CVfYBTcDOXcYNHy
sA+oq+yrQ5RSywv1D9zuDzOgbcEZ1mB82aWuy4Li+k9uhdCyNIsNy7Nbw9paCDZk
Oy+IMDLroCieSw==
=KN22
-----END PGP SIGNATURE-----
Merge tag 'x86-mm-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm cleanup from Thomas Gleixner:
"Use PAGE_ALIGNED() instead of open coding it in the x86/mm code"
* tag 'x86-mm-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Use PAGE_ALIGNED(x) instead of IS_ALIGNED(x, PAGE_SIZE)
- Add support for clearing memory error via pwrite(2) on DAX
- Fix 'security overwrite' support in the presence of media errors
- Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSbo+XnGs+rwLz9XGXfioYZHlFsZwUCYpFPcQAKCRDfioYZHlFs
Z9A3AQCdfoT5sY3OK+I/3oTvJ//6lw2MtXrnXFM046ICKPi9sgD8CzR9mRAHA+vj
kxOtJEU2bA9naninXGORsDUndiNkwQo=
=gVIn
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm and DAX updates from Dan Williams:
"New support for clearing memory errors when a file is in DAX mode,
alongside with some other fixes and cleanups.
Previously it was only possible to clear these errors using a truncate
or hole-punch operation to trigger the filesystem to reallocate the
block, now, any page aligned write can opportunistically clear errors
as well.
This change spans x86/mm, nvdimm, and fs/dax, and has received the
appropriate sign-offs. Thanks to Jane for her work on this.
Summary:
- Add support for clearing memory error via pwrite(2) on DAX
- Fix 'security overwrite' support in the presence of media errors
- Miscellaneous cleanups and fixes for nfit_test (nvdimm unit tests)"
* tag 'libnvdimm-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
pmem: implement pmem_recovery_write()
pmem: refactor pmem_clear_poison()
dax: add .recovery_write dax_operation
dax: introduce DAX_RECOVERY_WRITE dax access mode
mce: fix set_mce_nospec to always unmap the whole page
x86/mce: relocate set{clear}_mce_nospec() functions
acpi/nfit: rely on mce->misc to determine poison granularity
testing: nvdimm: asm/mce.h is not needed in nfit.c
testing: nvdimm: iomap: make __nfit_test_ioremap a macro
nvdimm: Allow overwrite in the presence of disabled dimms
tools/testing/nvdimm: remove unneeded flush_workqueue
The <linux/mm.h> already provides the PAGE_ALIGNED() macro. Let's
use this macro instead of IS_ALIGNED() and passing PAGE_SIZE directly.
No change in functionality.
[ mingo: Tweak changelog. ]
Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20220526142038.1582839-1-bh1scw@gmail.com
file-backed transparent hugepages.
Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for runtime
enablement of the recent huge page vmemmap optimization feature.
Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
David Vernet has done some fixup work on the memcg selftests.
Peter Xu has taught userfaultfd to handle write protection faults against
shmem- and hugetlbfs-backed files.
More DAMON development from SeongJae Park - adding online tuning of the
feature and support for monitoring of fixed virtual address ranges. Also
easier discovery of which monitoring operations are available.
Nadav Amit has done some optimization of TLB flushing during mprotect().
Neil Brown continues to labor away at improving our swap-over-NFS support.
David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
Peng Liu fixed some errors in the core hugetlb code.
Joao Martins has reduced the amount of memory consumed by device-dax's
compound devmaps.
Some cleanups of the arch-specific pagemap code from Anshuman Khandual.
Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
Roman Gushchin has done more work on the memcg selftests.
And, of course, many smaller fixes and cleanups. Notably, the customary
million cleanup serieses from Miaohe Lin.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYo52xQAKCRDdBJ7gKXxA
jtJFAQD238KoeI9z5SkPMaeBRYSRQmNll85mxs25KapcEgWgGQD9FAb7DJkqsIVk
PzE+d9hEfirUGdL6cujatwJ6ejYR8Q8=
=nFe6
-----END PGP SIGNATURE-----
Merge tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Almost all of MM here. A few things are still getting finished off,
reviewed, etc.
- Yang Shi has improved the behaviour of khugepaged collapsing of
readonly file-backed transparent hugepages.
- Johannes Weiner has arranged for zswap memory use to be tracked and
managed on a per-cgroup basis.
- Munchun Song adds a /proc knob ("hugetlb_optimize_vmemmap") for
runtime enablement of the recent huge page vmemmap optimization
feature.
- Baolin Wang contributes a series to fix some issues around hugetlb
pagetable invalidation.
- Zhenwei Pi has fixed some interactions between hwpoisoned pages and
virtualization.
- Tong Tiangen has enabled the use of the presently x86-only
page_table_check debugging feature on arm64 and riscv.
- David Vernet has done some fixup work on the memcg selftests.
- Peter Xu has taught userfaultfd to handle write protection faults
against shmem- and hugetlbfs-backed files.
- More DAMON development from SeongJae Park - adding online tuning of
the feature and support for monitoring of fixed virtual address
ranges. Also easier discovery of which monitoring operations are
available.
- Nadav Amit has done some optimization of TLB flushing during
mprotect().
- Neil Brown continues to labor away at improving our swap-over-NFS
support.
- David Hildenbrand has some fixes to anon page COWing versus
get_user_pages().
- Peng Liu fixed some errors in the core hugetlb code.
- Joao Martins has reduced the amount of memory consumed by
device-dax's compound devmaps.
- Some cleanups of the arch-specific pagemap code from Anshuman
Khandual.
- Muchun Song has found and fixed some errors in the TLB flushing of
transparent hugepages.
- Roman Gushchin has done more work on the memcg selftests.
... and, of course, many smaller fixes and cleanups. Notably, the
customary million cleanup serieses from Miaohe Lin"
* tag 'mm-stable-2022-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (381 commits)
mm: kfence: use PAGE_ALIGNED helper
selftests: vm: add the "settings" file with timeout variable
selftests: vm: add "test_hmm.sh" to TEST_FILES
selftests: vm: check numa_available() before operating "merge_across_nodes" in ksm_tests
selftests: vm: add migration to the .gitignore
selftests/vm/pkeys: fix typo in comment
ksm: fix typo in comment
selftests: vm: add process_mrelease tests
Revert "mm/vmscan: never demote for memcg reclaim"
mm/kfence: print disabling or re-enabling message
include/trace/events/percpu.h: cleanup for "percpu: improve percpu_alloc_percpu event trace"
include/trace/events/mmflags.h: cleanup for "tracing: incorrect gfp_t conversion"
mm: fix a potential infinite loop in start_isolate_page_range()
MAINTAINERS: add Muchun as co-maintainer for HugeTLB
zram: fix Kconfig dependency warning
mm/shmem: fix shmem folio swapoff hang
cgroup: fix an error handling path in alloc_pagecache_max_30M()
mm: damon: use HPAGE_PMD_SIZE
tracing: incorrect isolate_mote_t cast in mm_vmscan_lru_isolate
nodemask.h: fix compilation error with GCC12
...
- don't over-decrypt memory (Robin Murphy)
- takes min align mask into account for the swiotlb max mapping size
(Tianyu Lan)
- use GFP_ATOMIC in dma-debug (Mikulas Patocka)
- fix DMA_ATTR_NO_KERNEL_MAPPING on xen/arm (me)
- don't fail on highmem CMA pages in dma_direct_alloc_pages (me)
- cleanup swiotlb initialization and share more code with swiotlb-xen
(me, Stefano Stabellini)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmKObTQLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYObmA//dIcDB/q4iFGD+WJh4MhM+asx0ZsdF2OJz42WEhgT
Z9duOrgcneEQundCamqJP9rNTs980LHDA8uWQC5rZEc9vxuRVOdS7bSgYRUwWh6B
r0ZjOsvQCn+ChoZML8uyk4rfmEINq+EvJuec3G5fgecZOhPuJS2i2uzzv5cHwqgP
ChC0fwyZlkfdECXgvZXbEoCJLfTgGNlziN6Ai8dirSoqgEQUoCsY89/M7OiEBvV2
R4XUWD7OvQERfB4t6xLuUHyzf9PAuWB+OiblRVNeAmK3lMjxVrc3k4kIowgklnzD
8hfmphAa9Zou3zdfi6Gd4fiQRHRVOwKVp1rtqUmJ+lPSiwyMzu64z9ld2+2qac0h
V4sSr/yJkhxnBT4/0MkTChvhnRobisackpUzNRpiM4ck7cNVb7eAvkISsbH+pWI9
aEexPhbyskjlV+GOyM4QL4ygG0dpXY0HSyoh6uaSVsaXMycnWIsJCPidXxV1HGV0
q2/RLHuHwYxia8cYCF01/DQvwOKSjwbU0zModxtRezGD5GYh2C0a+SrA1aX+qiTu
yGJCs2UHtSQstAt78tTVp499YeDeL/oGSQkPAu8zyRkSczzF+CncGTuXyoJbAWyK
otcgERWljgZ4scxjfu1uacfoVhKQ7nOu7hiJokL0U80FESAennLC3ZlocvB9h/ff
HNA=
=n2rk
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.19-2022-05-25' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- don't over-decrypt memory (Robin Murphy)
- takes min align mask into account for the swiotlb max mapping size
(Tianyu Lan)
- use GFP_ATOMIC in dma-debug (Mikulas Patocka)
- fix DMA_ATTR_NO_KERNEL_MAPPING on xen/arm (me)
- don't fail on highmem CMA pages in dma_direct_alloc_pages (me)
- cleanup swiotlb initialization and share more code with swiotlb-xen
(me, Stefano Stabellini)
* tag 'dma-mapping-5.19-2022-05-25' of git://git.infradead.org/users/hch/dma-mapping: (23 commits)
dma-direct: don't over-decrypt memory
swiotlb: max mapping size takes min align mask into account
swiotlb: use the right nslabs-derived sizes in swiotlb_init_late
swiotlb: use the right nslabs value in swiotlb_init_remap
swiotlb: don't panic when the swiotlb buffer can't be allocated
dma-debug: change allocation mode from GFP_NOWAIT to GFP_ATIOMIC
dma-direct: don't fail on highmem CMA pages in dma_direct_alloc_pages
swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm
x86: remove cruft from <asm/dma-mapping.h>
swiotlb: remove swiotlb_init_with_tbl and swiotlb_init_late_with_tbl
swiotlb: merge swiotlb-xen initialization into swiotlb
swiotlb: provide swiotlb_init variants that remap the buffer
swiotlb: pass a gfp_mask argument to swiotlb_init_late
swiotlb: add a SWIOTLB_ANY flag to lift the low memory restriction
swiotlb: make the swiotlb_init interface more useful
x86: centralize setting SWIOTLB_FORCE when guest memory encryption is enabled
x86: remove the IOMMU table infrastructure
MIPS/octeon: use swiotlb_init instead of open coding it
arm/xen: don't check for xen_initial_domain() in xen_create_contiguous_region
swiotlb: rename swiotlb_late_init_with_default_size
...
- Remove function export
- Correct asm constraint
- Fix __setup handlers retval
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKL6VkACgkQEsHwGGHe
VUqs6g/+Ikpd4Mrou4P5Ul8QNdN9mEzwUfW6i8VpoA3h1L6mKkZxbUsbSz9xInjw
MAhrcevujW6GwdQdus2sUcSlX+jxl6c/IlMdf8RegNPY/JBPDX4dRA7rPetvZEDm
ZiIYVTiEzJoOzPDJeO7a3v5EHPsY6CjsCFhGz7hjIcrwQjzCLkL5MqG+WDAtebe+
QVdbllD2RlZNPDyHYE5Lqh1h+Y0e4n6kS7LCWxexfHlNOZ5KBRVyIJvz/xOZFZ1/
9oX0UDD2gfH5chLs8GKsr7cZYERMtNlKBPoxGzl8iKF4iUeiksdj3P5y+mdcFaDG
YbM7aXewmbyLyiCkh1zXU6Mw3lK1VfUtVXtEYj+qXf1jWp59ctNEJkc6/VAcaKh7
oS7MNG7Y44B8XwdH7MiqDE7eVCnqEjIR+BIiwjyXNLFP1AXZMAXuBzXPF/vZ3Gyf
3N5vzO4VNEN6Oa1TReSspKwYvq2uPtHMjLX2rT6Py2ru32mj2dCc5E7GD83RKL8V
vDIz4VGOZyGfjp6gClMBsyK4mYwSwgXbnOci7DJn56mMf2qzBJITILXc31zz4gX2
E9kiBu/4Mwjnrx9QRpCNXu7iddBA3YM2NMtNlwBcCgZOFaFz/yOx9TpnugF17WHQ
VVtQi8wlcsS+F05Y11b7euusMQyk1EpWabIrw8UQd+61Dwpz58Q=
=/WGB
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
"A variety of fixes which don't fit any other tip bucket:
- Remove unnecessary function export
- Correct asm constraint
- Fix __setup handlers retval"
* tag 'x86_misc_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Cleanup the control_va_addr_alignment() __setup handler
x86: Fix return value of __setup handlers
x86/delay: Fix the wrong asm constraint in delay_loop()
x86/amd_nb: Unexport amd_cache_northbridges()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLxsIACgkQEsHwGGHe
VUrm0w//foA+nrqdDxGHEj7xLSdymnqZ5KOfjDQyL8hdMOSVdwdUf1XSwQY8TjD6
7U4gGw0TqBBooVLguKgY6NQz/wiucJ5FuU8uibkESMXfFO5oGPTwRupNAs8tV1T6
Jhs513VWHrT/8g3XYFqVQMo3MmBmtX3tF0kj+FxT9yOHNQphJ6y2c6CbcpLMZt3s
MMTaaQh6g65pzi9sW6Gl4iSEhR0COYHEq+zPPFdZOc1vB3xjhyEkz1Fuq/lMR+pr
K5tTzaZ9tuelcjw7ZPni+WoyapI092f1EWV6nZQb6EV6bXasM+HBDLFPKen4GhyZ
TqWYqaycDqNFiMkrKrXDHpOCNKWWYshwdFTZ6k4+FZSCx+4y2Sjjt4KdUEJW2qee
lWBsiqRPNfwkPrg9sMO+3TDt4U28KywsaA6U+oQ+AJn7kLe9dQqNnlzVvajBer32
+1kk2pY8m245ulVE+pcY+mhVlrOyuVpYk8H/7LnRiZAgucrkhHeJXSoW6VUn1G0G
MmqLxCQPQ6kF4617m+BXavrUdLV2sotA0xZJ8mG4XNSCMwQ1Ymr2UJM6u3Jrq7SA
p6XJQv/1e4NluC2h97N1e6nHxM61JJdFxYG/y+eDidnctH+TU6wIM2TFRXBNCsy3
vqkIimuuV/9PfmMdzdr8o7KX9gmpY4dBeCrOSPlHiA0yaYYdmbY=
=0DKB
-----END PGP SIGNATURE-----
Merge tag 'x86_mm_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm fixlet from Borislav Petkov:
- A sparse address space annotation fix
* tag 'x86_mm_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fault: Cast an argument to the proper address space in prefetch()
frequency invariance code along with removing the need for unnecessary IPIs
- Finally remove a.out support
- The usual trivial cleanups and fixes all over x86
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLn48ACgkQEsHwGGHe
VUpbkg/+PELrc0y/qxLM/+dyftKYY16Rhk6ZVAXfwqlh5ldyVQcLMUgKwDqYyTn2
XmgdI3cTcFlH2K7j6ANWLu0I9NPaviimUcEdMVcXt7aY5mGWk/q4hIyCYM8d41sV
qKx4OjNSdyoofG6MtwFLJDuoeVg99Bqgvm4nP9BuxL0dZJ2hfcUZ7MTxYCx9ZYjK
/3trx0NV287Yg/wm91EU0nLQzy9xbGS7WCmMnse6uxiUdm2vXbBt8oNFF4f747Dj
0cArfNrMgYq4Cv5bgt/Ki0NU/n4EOGDpJUSyQwlnjDKeN81ESPy7IWtTQ6cE/rJK
BZeUIPiGiYHwtqXv0UTAPGLG8cAqKeab8u0xAOyrFVDkTc0+WlPJRsUAOmRRGIGE
M8ZjoxrLeuFgxw6vKpVjaA+mDRj3qEpSH+IrTcekS98PN7gmVzvq03GobgGbT7YB
xmtbThJa+514FfUVckkyC0+A56BknUIgVxwFPqrthE2atzYTbH67hW4U0yVWXXr7
2VI7ttozBrYVgHCWhD9eoT0uhyD74Vl6pqHnqzY9ShIfKVUGvMgKHHg04nLLtF7W
hm87xV3Q5UEmXhTmDzT1rUZ99mBUxGbWxk227I9raMugIh7pp9wIr57+7O0LRYfX
TdnE2+tL8RMi7+XzRH5iLhnwkrvahBESeHSQ7GVI1Y2zMmmFN+0=
=Dks/
-----END PGP SIGNATURE-----
Merge tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cleanups from Borislav Petkov:
- Serious sanitization and cleanup of the whole APERF/MPERF and
frequency invariance code along with removing the need for
unnecessary IPIs
- Finally remove a.out support
- The usual trivial cleanups and fixes all over x86
* tag 'x86_cleanups_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86: Remove empty files
x86/speculation: Add missing srbds=off to the mitigations= help text
x86/prctl: Remove pointless task argument
x86/aperfperf: Make it correct on 32bit and UP kernels
x86/aperfmperf: Integrate the fallback code from show_cpuinfo()
x86/aperfmperf: Replace arch_freq_get_on_cpu()
x86/aperfmperf: Replace aperfmperf_get_khz()
x86/aperfmperf: Store aperf/mperf data for cpu frequency reads
x86/aperfmperf: Make parts of the frequency invariance code unconditional
x86/aperfmperf: Restructure arch_scale_freq_tick()
x86/aperfmperf: Put frequency invariance aperf/mperf data into a struct
x86/aperfmperf: Untangle Intel and AMD frequency invariance init
x86/aperfmperf: Separate AP/BP frequency invariance init
x86/smp: Move APERF/MPERF code where it belongs
x86/aperfmperf: Dont wake idle CPUs in arch_freq_get_on_cpu()
x86/process: Fix kernel-doc warning due to a changed function name
x86: Remove a.out support
x86/mm: Replace nodes_weight() with nodes_empty() where appropriate
x86: Replace cpumask_weight() with cpumask_empty() where appropriate
x86/pkeys: Remove __arch_set_user_pkey_access() declaration
...
are not really needed anymore
- Misc fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLdfgACgkQEsHwGGHe
VUpB5Q//TIGVgmnSd0YYxY2cIe047lfcd34D+3oEGk0d2FidtirP/tjgBqIXRuY5
UncoveqBuI/6/7bodP/ANg9DNVXv2489eFYyZtEOLSGnfzV2AU10aw95cuQQG+BW
YIc6bGSsgfiNo8Vtj4L3xkVqxOrqaCYnh74GTSNNANht3i8KH8Qq9n3qZTuMiF6R
fH9xWak3TZB2nMzHdYrXh0sSR6eBHN3KYSiT0DsdlU9PUlavlSPFYQRiAlr6FL6J
BuYQdlUaCQbINvaviGW4SG7fhX32RfF/GUNaBajB40TO6H98KZLpBBvstWQ841xd
/o44o5wbghoGP1ne8OKwP+SaAV2bE6twd5eO1lpwcpXnQfATvjQ2imxvOiRhy5LY
pFPt/hko9gKWJ6SI0SQ4tiKJALFPLWD6561scHU6PoriFhv0SRIaPmJyEsDYynMz
bCXaPPsoovRwwwBfAxxQjljIlhQSBVt3gWZ8NWD1tYbNaqM+WK7xKBaONGh3OCw3
iK7lsbbljtM0zmANImYyeo7+Hr1NVOmMiK2WZYbxhxgzH3l8v/6EbDt3I70WU57V
9apCU3/nk/HFpX65SdW5qmuiWLVdH9NXrEqbvaUB4ApT18MdUUugewBhcGnf3Umu
wEtltzziqcIkxzDoXXpBGWpX31S7PsM2XVDqYC7dwuNttgEw2Fc=
=7AUX
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 CPU feature updates from Borislav Petkov:
- Remove a bunch of chicken bit options to turn off CPU features which
are not really needed anymore
- Misc fixes and cleanups
* tag 'x86_cpu_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation: Add missing prototype for unpriv_ebpf_notify()
x86/pm: Fix false positive kmemleak report in msr_build_context()
x86/speculation/srbds: Do not try to turn mitigation off when not supported
x86/cpu: Remove "noclflush"
x86/cpu: Remove "noexec"
x86/cpu: Remove "nosmep"
x86/cpu: Remove CONFIG_X86_SMAP and "nosmap"
x86/cpu: Remove "nosep"
x86/cpu: Allow feature bit names from /proc/cpuinfo in clearcpuid=
This is the Intel version of a confidential computing solution called
Trust Domain Extensions (TDX). This series adds support to run the
kernel as part of a TDX guest. It provides similar guest protections to
AMD's SEV-SNP like guest memory and register state encryption, memory
integrity protection and a lot more.
Design-wise, it differs from AMD's solution considerably: it uses
a software module which runs in a special CPU mode called (Secure
Arbitration Mode) SEAM. As the name suggests, this module serves as sort
of an arbiter which the confidential guest calls for services it needs
during its lifetime.
Just like AMD's SNP set, this series reworks and streamlines certain
parts of x86 arch code so that this feature can be properly accomodated.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLbisACgkQEsHwGGHe
VUqZLg/7B55iygCwzz0W/KLcXL2cISatUpzGbFs1XTbE9DMz06BPkOsEjF2k8ckv
kfZjgqhSx3GvUI80gK0Tn2M2DfIj3nKuNSXd1pfextP7AxEf68FFJsQz1Ju7bHpT
pZaG+g8IK4+mnEHEKTCO9ANg/Zw8yqJLdtsCaCNE9SUGUfQ6m/ujTEfsambXDHNm
khyCAgpIGSOt51/4apoR9ebyrNCaeVbDawpIPjTy+iyFRc/WyaLFV9CQ8klw4gbw
r/90x2JYxvAf0/z/ifT9Wa+TnYiQ0d4VjFbfr0iJ4GcPn5L3EIoIKPE8vPGMpoSX
fLSzoNmAOT3ja57ytUUQ3o0edoRUIPEdixOebf9qWvE/aj7W37YRzrlJ8Ej/x9Jy
HcI4WZF6Dr1bh6FnI/xX2eVZRzLOL4j9gNyPCwIbvgr1NjDqQnxU7nhxVMmQhJrs
IdiEcP5WYerLKfka/uF//QfWUg5mDBgFa1/3xK57Z3j0iKWmgjaPpR0SWlOKjj8G
tr0gGN9ejikZTqXKGsHn8fv/R3bjXvbVD8z0IEcx+MIrRmZPnX2QBlg7UA1AXV5n
HoVwPFdH1QAtjZq1MRcL4hTOjz3FkS68rg7ZH0f2GWJAzWmEGytBIhECRnN/PFFq
VwRB4dCCt0bzqRxkiH5lzdgR+xqRe61juQQsMzg+Flv/trpXDqM=
=ac9K
-----END PGP SIGNATURE-----
Merge tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull Intel TDX support from Borislav Petkov:
"Intel Trust Domain Extensions (TDX) support.
This is the Intel version of a confidential computing solution called
Trust Domain Extensions (TDX). This series adds support to run the
kernel as part of a TDX guest. It provides similar guest protections
to AMD's SEV-SNP like guest memory and register state encryption,
memory integrity protection and a lot more.
Design-wise, it differs from AMD's solution considerably: it uses a
software module which runs in a special CPU mode called (Secure
Arbitration Mode) SEAM. As the name suggests, this module serves as
sort of an arbiter which the confidential guest calls for services it
needs during its lifetime.
Just like AMD's SNP set, this series reworks and streamlines certain
parts of x86 arch code so that this feature can be properly
accomodated"
* tag 'x86_tdx_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
x86/tdx: Fix RETs in TDX asm
x86/tdx: Annotate a noreturn function
x86/mm: Fix spacing within memory encryption features message
x86/kaslr: Fix build warning in KASLR code in boot stub
Documentation/x86: Document TDX kernel architecture
ACPICA: Avoid cache flush inside virtual machines
x86/tdx/ioapic: Add shared bit for IOAPIC base address
x86/mm: Make DMA memory shared for TD guest
x86/mm/cpa: Add support for TDX shared memory
x86/tdx: Make pages shared in ioremap()
x86/topology: Disable CPU online/offline control for TDX guests
x86/boot: Avoid #VE during boot for TDX platforms
x86/boot: Set CR0.NE early and keep it set during the boot
x86/acpi/x86/boot: Add multiprocessor wake-up support
x86/boot: Add a trampoline for booting APs via firmware handoff
x86/tdx: Wire up KVM hypercalls
x86/tdx: Port I/O: Add early boot support
x86/tdx: Port I/O: Add runtime hypercalls
x86/boot: Port I/O: Add decompression-time support for TDX
x86/boot: Port I/O: Allow to hook up alternative helpers
...
Add to confidential guests the necessary memory integrity protection
against malicious hypervisor-based attacks like data replay, memory
remapping and others, thus achieving a stronger isolation from the
hypervisor.
At the core of the functionality is a new structure called a reverse
map table (RMP) with which the guest has a say in which pages get
assigned to it and gets notified when a page which it owns, gets
accessed/modified under the covers so that the guest can take an
appropriate action.
In addition, add support for the whole machinery needed to launch a SNP
guest, details of which is properly explained in each patch.
And last but not least, the series refactors and improves parts of the
previous SEV support so that the new code is accomodated properly and
not just bolted on.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmKLU2AACgkQEsHwGGHe
VUpb/Q//f4LGiJf4nw1flzpe90uIsHNwAafng3NOjeXmhI/EcOlqPf23WHPCgg3Z
2umfa4sRZyj4aZubDd7tYAoq4qWrQ7pO7viWCNTh0InxBAILOoMPMuq2jSAbq0zV
ASUJXeQ2bqjYxX4JV4N5f3HT2l+k68M0mpGLN0H+O+LV9pFS7dz7Jnsg+gW4ZP25
PMPLf6FNzO/1tU1aoYu80YDP1ne4eReLrNzA7Y/rx+S2NAetNwPn21AALVgoD4Nu
vFdKh4MHgtVbwaQuh0csb/+4vD+tDXAhc8lbIl+Abl9ZxJaDWtAJW5D9e2CnsHk1
NOkHwnrzizzhtGK1g56YPUVRFAWhZYMOI1hR0zGPLQaVqBnN4b+iahPeRiV0XnGE
PSbIHSfJdeiCkvLMCdIAmpE5mRshhRSUfl1CXTCdetMn8xV/qz/vG6bXssf8yhTV
cfLGPHU7gfVmsbR9nk5a8KZ78PaytxOxfIDXvCy8JfQwlIWtieaCcjncrj+sdMJy
0fdOuwvi4jma0cyYuPolKiS1Hn4ldeibvxXT7CZQlIx6jZShMbpfpTTJs11XdtHm
PdDAc1TY3AqI33mpy9DhDQmx/+EhOGxY3HNLT7evRhv4CfdQeK3cPVUWgo4bGNVv
ZnFz7nvmwpyufltW9K8mhEZV267174jXGl6/idxybnlVE7ESr2Y=
=Y8kW
-----END PGP SIGNATURE-----
Merge tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull AMD SEV-SNP support from Borislav Petkov:
"The third AMD confidential computing feature called Secure Nested
Paging.
Add to confidential guests the necessary memory integrity protection
against malicious hypervisor-based attacks like data replay, memory
remapping and others, thus achieving a stronger isolation from the
hypervisor.
At the core of the functionality is a new structure called a reverse
map table (RMP) with which the guest has a say in which pages get
assigned to it and gets notified when a page which it owns, gets
accessed/modified under the covers so that the guest can take an
appropriate action.
In addition, add support for the whole machinery needed to launch a
SNP guest, details of which is properly explained in each patch.
And last but not least, the series refactors and improves parts of the
previous SEV support so that the new code is accomodated properly and
not just bolted on"
* tag 'x86_sev_for_v5.19_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
x86/entry: Fixup objtool/ibt validation
x86/sev: Mark the code returning to user space as syscall gap
x86/sev: Annotate stack change in the #VC handler
x86/sev: Remove duplicated assignment to variable info
x86/sev: Fix address space sparse warning
x86/sev: Get the AP jump table address from secrets page
x86/sev: Add missing __init annotations to SEV init routines
virt: sevguest: Rename the sevguest dir and files to sev-guest
virt: sevguest: Change driver name to reflect generic SEV support
x86/boot: Put globals that are accessed early into the .data section
x86/boot: Add an efi.h header for the decompressor
virt: sevguest: Fix bool function returning negative value
virt: sevguest: Fix return value check in alloc_shared_pages()
x86/sev-es: Replace open-coded hlt-loop with sev_es_terminate()
virt: sevguest: Add documentation for SEV-SNP CPUID Enforcement
virt: sevguest: Add support to get extended report
virt: sevguest: Add support to derive key
virt: Add SEV-SNP guest driver
x86/sev: Register SEV-SNP guest request platform device
x86/sev: Provide support for SNP guest request NAEs
...
The set_memory_uc() approach doesn't work well in all cases.
As Dan pointed out when "The VMM unmapped the bad page from
guest physical space and passed the machine check to the guest."
"The guest gets virtual #MC on an access to that page. When
the guest tries to do set_memory_uc() and instructs cpa_flush()
to do clean caches that results in taking another fault / exception
perhaps because the VMM unmapped the page from the guest."
Since the driver has special knowledge to handle NP or UC,
mark the poisoned page with NP and let driver handle it when
it comes down to repair.
Please refer to discussions here for more details.
https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/
Now since poisoned page is marked as not-present, in order to
avoid writing to a not-present page and trigger kernel Oops,
also fix pmem_do_write().
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Relocate the twin mce functions to arch/x86/mm/pat/set_memory.c
file where they belong.
While at it, fixup a function name in a comment.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
[sfr: gate {set,clear}_mce_nospec() by CONFIG_X86_64]
Link: https://lore.kernel.org/r/165272527328.90175.8336008202048685278.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
p4d_clear_huge may be optimized for void return type and function usage.
vunmap_p4d_range function saves a few steps here.
Link: https://lkml.kernel.org/r/20220507150630.90399-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.
The first TLB flush is only necessary to prevent the dirty bit (and with a
lesser importance the access bit) from changing while the PTE is modified.
However, this is not necessary as the x86 CPUs set the dirty-bit
atomically with an additional check that the PTE is (still) present. One
caveat is Intel's Knights Landing that has a bug and does not do so.
Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch specific pmdp_invalidate_ad()
that only invalidates the access and dirty bit from further changes.
Link: https://lkml.kernel.org/r/20220401180821.1986781-4-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The unused part precedes the new range spanned by the start, end parameters
of vmemmap_use_new_sub_pmd(). This means it actually goes from
ALIGN_DOWN(start, PMD_SIZE) up to start.
Use the correct address when applying the mark using memset.
Fixes: 8d400913c231 ("x86/vmemmap: handle unpopulated sub-pmd ranges")
Signed-off-by: Adrian-Ken Rueegsegger <ken@codelabs.ch>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220509090637.24152-2-ken@codelabs.ch
In preparation for Clang supporting randstruct, reorganize the Kconfigs,
move the attribute macros, and generalize the feature to be named
CONFIG_RANDSTRUCT for on/off, CONFIG_RANDSTRUCT_FULL for the full
randomization mode, and CONFIG_RANDSTRUCT_PERFORMANCE for the cache-line
sized mode.
Cc: linux-hardening@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220503205503.3054173-4-keescook@chromium.org
__setup() handlers should return 1 to obsolete_checksetup() in
init/main.c to indicate that the boot option has been handled. A return
of 0 causes the boot option/value to be listed as an Unknown kernel
parameter and added to init's (limited) argument (no '=') or environment
(with '=') strings. So return 1 from these x86 __setup handlers.
Examples:
Unknown kernel command line parameters "apicpmtimer
BOOT_IMAGE=/boot/bzImage-517rc8 vdso=1 ring3mwait=disable", will be
passed to user space.
Run /sbin/init as init process
with arguments:
/sbin/init
apicpmtimer
with environment:
HOME=/
TERM=linux
BOOT_IMAGE=/boot/bzImage-517rc8
vdso=1
ring3mwait=disable
Fixes: 2aae950b21e4 ("x86_64: Add vDSO for x86-64 with gettimeofday/clock_gettime/getcpu")
Fixes: 77b52b4c5c66 ("x86: add "debugpat" boot option")
Fixes: e16fd002afe2 ("x86/cpufeature: Enable RING3MWAIT for Knights Landing")
Fixes: b8ce33590687 ("x86_64: convert to clock events")
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Link: https://lore.kernel.org/r/20220314012725.26661-1-rdunlap@infradead.org
Drop lookup_address_in_mm() now that KVM is providing it's own variant
of lookup_address_in_pgd() that is safe for use with user addresses, e.g.
guards against page tables being torn down. A variant that provides a
non-init mm is inherently dangerous and flawed, as the only reason to use
an mm other than init_mm is to walk a userspace mapping, and
lookup_address_in_pgd() does not play nice with userspace mappings, e.g.
doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE()
to ensure an upper level entry isn't converted to a huge page between
checking the PAGE_SIZE bit and grabbing the address of the next level
down.
This reverts commit 13c72c060f1ba6f4eddd7b1c4f52a8aded43d6d9.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <YmwIi3bXr/1yhYV/@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The word of "free" is not expressive enough to express the feature of
optimizing vmemmap pages associated with each HugeTLB, rename this keywork
to "optimize". In this patch , cheanup configs to make code more
expressive.
Link: https://lkml.kernel.org/r/20220404074652.68024-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This defines and exports a platform specific custom vm_get_page_prot() via
subscribing ARCH_HAS_VM_GET_PAGE_PROT. This also unsubscribes from config
ARCH_HAS_FILTER_PGPROT, after dropping off arch_filter_pgprot() and
arch_vm_get_page_prot().
Link: https://lkml.kernel.org/r/20220414062125.609297-6-anshuman.khandual@arm.com
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David S. Miller <davem@davemloft.net>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The spacing is off in the memory encryption features message on AMD
platforms that support memory encryption, e.g.:
"Memory Encryption Features active:AMD SEV SEV-ES"
There is no space before "AMD" and two spaces after it. Fix this so that
the message is spaced properly:
"Memory Encryption Features active: AMD SEV SEV-ES"
Fixes: 968b493173ac ("x86/mm: Make DMA memory shared for TD guest")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/02401f3024b18e90bc2508147e22e729436cb6d9.1650298573.git.thomas.lendacky@amd.com
Move enabling SWIOTLB_FORCE for guest memory encryption into common code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Various mm code calls nodes_weight() to check if any bit of a given
nodemask is set.
This can be done more efficiently with nodes_empty() because nodes_empty()
stops traversing the nodemask as soon as it finds first set bit, while
nodes_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220210224933.379149-26-yury.norov@gmail.com
In some cases, x86 code calls cpumask_weight() to check if any bit of a
given cpumask is set.
This can be done more efficiently with cpumask_empty() because
cpumask_empty() stops traversing the cpumask as soon as it finds first set
bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Wahl <steve.wahl@hpe.com>
Link: https://lore.kernel.org/r/20220210224933.379149-17-yury.norov@gmail.com
Intel TDX doesn't allow VMM to directly access guest private memory.
Any memory that is required for communication with the VMM must be
shared explicitly. The same rule applies for any DMA to and from the
TDX guest. All DMA pages have to be marked as shared pages. A generic way
to achieve this without any changes to device drivers is to use the
SWIOTLB framework.
The previous patch ("Add support for TDX shared memory") gave TDX guests
the _ability_ to make some pages shared, but did not make any pages
shared. This actually marks SWIOTLB buffers *as* shared.
Start returning true for cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) in
TDX guests. This has several implications:
- Allows the existing mem_encrypt_init() to be used for TDX which
sets SWIOTLB buffers shared (aka. "decrypted").
- Ensures that all DMA is routed via the SWIOTLB mechanism (see
pci_swiotlb_detect())
Stop selecting DYNAMIC_PHYSICAL_MASK directly. It will get set
indirectly by selecting X86_MEM_ENCRYPT.
mem_encrypt_init() is currently under an AMD-specific #ifdef. Move it to
a generic area of the header.
Co-developed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lkml.kernel.org/r/20220405232939.73860-28-kirill.shutemov@linux.intel.com