Merge tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM updates from Andrew Morton:

 - Some swap cleanups from Ma Wupeng ("fix WARN_ON in add_to_avail_list")

 - Peter Xu has a series ("mm/gup: Unify hugetlb, speed up thp") which reduces the special-case code for handling hugetlb pages in GUP. It also speeds up GUP handling of transparent hugepages.

 - Peng Zhang provides some maple tree speedups ("Optimize the fast path of mas_store()").

 - Sergey Senozhatsky has improved the performance of zsmalloc during compaction ("zsmalloc: small compaction improvements").

 - Domenico Cerasuolo has developed additional selftest code for zswap ("selftests: cgroup: add zswap test program").

 - xu xin has done some work on KSM's handling of zero pages. These changes are mainly to enable the user to better understand the effectiveness of KSM's treatment of zero pages ("ksm: support tracking KSM-placed zero-pages").

 - Jeff Xu has fixed the behaviour of memfd's MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED sysctl ("mm/memfd: fix sysctl MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED").

 - David Howells has fixed an fscache optimization ("mm, netfs, fscache: Stop read optimisation when folio removed from pagecache").

 - Axel Rasmussen has given userfaultfd the ability to simulate memory poisoning ("add UFFDIO_POISON to simulate memory poisoning with UFFD").

 - Miaohe Lin has contributed some routine maintenance work on the memory-failure code ("mm: memory-failure: remove unneeded PageHuge() check").

 - Peng Zhang has contributed some maintenance work on the maple tree code ("Improve the validation for maple tree and some cleanup").

 - Hugh Dickins has optimized the collapsing of shmem or file pages into THPs ("mm: free retracted page table by RCU").

 - Jiaqi Yan has a patch series which permits us to use the healthy subpages within a hardware poisoned huge page for general purposes ("Improve hugetlbfs read on HWPOISON hugepages").

 - Kemeng Shi has done some maintenance work on the pagetable-check code ("Remove unused parameters in page_table_check").

 - More folioification work from Matthew Wilcox ("More filesystem folio conversions for 6.6"), ("Followup folio conversions for zswap"). And from ZhangPeng ("Convert several functions in page_io.c to use a folio").

 - page_ext cleanups from Kemeng Shi ("minor cleanups for page_ext").

 - Baoquan He has converted some architectures to use the GENERIC_IOREMAP ioremap()/iounmap() code ("mm: ioremap: Convert architectures to take GENERIC_IOREMAP way").

 - Anshuman Khandual has optimized arm64 tlb shootdown ("arm64: support batched/deferred tlb shootdown during page reclamation/migration").

 - Better maple tree lockdep checking from Liam Howlett ("More strict maple tree lockdep"). Liam also developed some efficiency improvements ("Reduce preallocations for maple tree").

 - Cleanup and optimization to the secondary IOMMU TLB invalidation, from Alistair Popple ("Invalidate secondary IOMMU TLB on permission upgrade").

 - Ryan Roberts fixes some arm64 MM selftest issues ("selftests/mm fixes for arm64").

 - Kemeng Shi provides some maintenance work on the compaction code ("Two minor cleanups for compaction").

 - Some reduction in mmap_lock pressure from Matthew Wilcox ("Handle most file-backed faults under the VMA lock").

 - Aneesh Kumar contributes code to use the vmemmap optimization for DAX on ppc64, under some circumstances ("Add support for DAX vmemmap optimization for ppc64").

 - page-ext cleanups from Kemeng Shi ("add page_ext_data to get client data in page_ext"), ("minor cleanups to page_ext header").

 - Some zswap cleanups from Johannes Weiner ("mm: zswap: three cleanups").

 - kmsan cleanups from ZhangPeng ("minor cleanups for kmsan").

 - VMA handling cleanups from Kefeng Wang ("mm: convert to vma_is_initial_heap/stack()").

 - DAMON feature work from SeongJae Park ("mm/damon/sysfs-schemes: implement DAMOS tried total bytes file"), ("Extend DAMOS filters for address ranges and DAMON monitoring targets").

 - Compaction work from Kemeng Shi ("Fixes and cleanups to compaction").

 - Liam Howlett has improved the maple tree node replacement code ("maple_tree: Change replacement strategy").

 - ZhangPeng has a general code cleanup - use the K() macro more widely ("cleanup with helper macro K()").

 - Aneesh Kumar brings memmap-on-memory to ppc64 ("Add support for memmap on memory feature on ppc64").

 - pagealloc cleanups from Kemeng Shi ("Two minor cleanups for pcp list in page_alloc"), ("Two minor cleanups for get pageblock migratetype").

 - Vishal Moola introduces a memory descriptor for page table tracking, "struct ptdesc" ("Split ptdesc from struct page").

 - memfd selftest maintenance work from Aleksa Sarai ("memfd: cleanups for vm.memfd_noexec").

 - MM include file rationalization from Hugh Dickins ("arch: include asm/cacheflush.h in asm/hugetlb.h").

 - THP debug output fixes from Hugh Dickins ("mm,thp: fix sloppy text output").

 - kmemleak improvements from Xiaolei Wang ("mm/kmemleak: use object_cache instead of kmemleak_initialized").

 - More folio-related cleanups from Matthew Wilcox ("Remove _folio_dtor and _folio_order").

 - A VMA locking scalability improvement from Suren Baghdasaryan ("Per-VMA lock support for swap and userfaults").

 - pagetable handling cleanups from Matthew Wilcox ("New page table range API").

 - A batch of swap/thp cleanups from David Hildenbrand ("mm/swap: stop using page->private on tail pages for THP_SWAP + cleanups").

 - Cleanups and speedups to the hugetlb fault handling from Matthew Wilcox ("Change calling convention for ->huge_fault").

 - Matthew Wilcox has also done some maintenance work on the MM subsystem documentation ("Improve mm documentation").
* tag 'mm-stable-2023-08-28-18-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (489 commits)
  maple_tree: shrink struct maple_tree
  maple_tree: clean up mas_wr_append()
  secretmem: convert page_is_secretmem() to folio_is_secretmem()
  nios2: fix flush_dcache_page() for usage from irq context
  hugetlb: add documentation for vma_kernel_pagesize()
  mm: add orphaned kernel-doc to the rst files.
  mm: fix clean_record_shared_mapping_range kernel-doc
  mm: fix get_mctgt_type() kernel-doc
  mm: fix kernel-doc warning from tlb_flush_rmaps()
  mm: remove enum page_entry_size
  mm: allow ->huge_fault() to be called without the mmap_lock held
  mm: move PMD_ORDER to pgtable.h
  mm: remove checks for pte_index
  memcg: remove duplication detection for mem_cgroup_uncharge_swap
  mm/huge_memory: work on folio->swap instead of page->private when splitting folio
  mm/swap: inline folio_set_swap_entry() and folio_swap_entry()
  mm/swap: use dedicated entry for swap in folio
  mm/swap: stop using page->private on tail pages for THP_SWAP
  selftests/mm: fix WARNING comparing pointer to 0
  selftests: cgroup: fix test_kmem_memcg_deletion kernel mem check
  ...
commit b96a3e9142
@@ -29,8 +29,10 @@ Description:	Writing 'on' or 'off' to this file makes the kdamond starts or
 file updates contents of schemes stats files of the kdamond.
 Writing 'update_schemes_tried_regions' to the file updates
 contents of 'tried_regions' directory of every scheme directory
-of this kdamond. Writing 'clear_schemes_tried_regions' to the
-file removes contents of the 'tried_regions' directory.
+of this kdamond. Writing 'update_schemes_tried_bytes' to the
+file updates only '.../tried_regions/total_bytes' files of this
+kdamond. Writing 'clear_schemes_tried_regions' to the file
+removes contents of the 'tried_regions' directory.

 What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/pid
 Date:		Mar 2022
@@ -269,8 +271,10 @@ What:	/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/
 Date:		Dec 2022
 Contact:	SeongJae Park <sj@kernel.org>
 Description:	Writing to and reading from this file sets and gets the type of
-the memory of the interest. 'anon' for anonymous pages, or
-'memcg' for specific memory cgroup can be written and read.
+the memory of the interest. 'anon' for anonymous pages,
+'memcg' for specific memory cgroup, 'addr' for address range
+(an open-ended interval), or 'target' for DAMON monitoring
+target can be written and read.

 What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/memcg_path
 Date:		Dec 2022
@@ -279,6 +283,27 @@ Description:	If 'memcg' is written to the 'type' file, writing to and
 reading from this file sets and gets the path to the memory
 cgroup of the interest.

+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_start
+Date:		Jul 2023
+Contact:	SeongJae Park <sj@kernel.org>
+Description:	If 'addr' is written to the 'type' file, writing to or reading
+from this file sets or gets the start address of the address
+range for the filter.
+
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/addr_end
+Date:		Jul 2023
+Contact:	SeongJae Park <sj@kernel.org>
+Description:	If 'addr' is written to the 'type' file, writing to or reading
+from this file sets or gets the end address of the address
+range for the filter.
+
+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/target_idx
+Date:		Dec 2022
+Contact:	SeongJae Park <sj@kernel.org>
+Description:	If 'target' is written to the 'type' file, writing to or
+reading from this file sets or gets the index of the DAMON
+monitoring target of the interest.
+
 What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/filters/<F>/matching
 Date:		Dec 2022
 Contact:	SeongJae Park <sj@kernel.org>
@@ -317,6 +342,13 @@ Contact:	SeongJae Park <sj@kernel.org>
 Description:	Reading this file returns the number of the exceed events of
 the scheme's quotas.

+What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/total_bytes
+Date:		Jul 2023
+Contact:	SeongJae Park <sj@kernel.org>
+Description:	Reading this file returns the total amount of memory that
+corresponding DAMON-based Operation Scheme's action has tried
+to be applied.
+
 What:		/sys/kernel/mm/damon/admin/kdamonds/<K>/contexts/<C>/schemes/<S>/tried_regions/<R>/start
 Date:		Oct 2022
 Contact:	SeongJae Park <sj@kernel.org>
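The new ``total_bytes`` file pairs with a new ``state`` keyword. A minimal sketch of how the two interact; the helper name and the kdamond/scheme indices in the comments are assumptions for illustration, not part of the ABI:

```shell
# Hypothetical helper, assuming a kdamond and a scheme are already set up:
# refresh only the total_bytes accounting (no per-region subdirectories
# are created) and read the result back.
damos_tried_bytes() {
    kdamond_dir=$1   # e.g. /sys/kernel/mm/damon/admin/kdamonds/0
    scheme_dir=$2    # e.g. $kdamond_dir/contexts/0/schemes/0
    echo update_schemes_tried_bytes > "$kdamond_dir/state"
    cat "$scheme_dir/tried_regions/total_bytes"
}
```

On a live system the write to ``state`` is what triggers the update; reading ``total_bytes`` alone returns the value from the previous refresh.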
@@ -10,7 +10,7 @@ Description:
 dropping it if possible. The kernel will then be placed
 on the bad page list and never be reused.

-The offlining is done in kernel specific granuality.
+The offlining is done in kernel specific granularity.
 Normally it's the base page size of the kernel, but
 this might change.

@@ -35,7 +35,7 @@ Description:
 to access this page assuming it's poisoned by the
 hardware.

-The offlining is done in kernel specific granuality.
+The offlining is done in kernel specific granularity.
 Normally it's the base page size of the kernel, but
 this might change.
@@ -92,8 +92,6 @@ Brief summary of control files.
 memory.oom_control		set/show oom controls.
 memory.numa_stat		show the number of memory usage per numa
 node
-memory.kmem.limit_in_bytes	This knob is deprecated and writing to
-it will return -ENOTSUPP.
 memory.kmem.usage_in_bytes	show current kernel memory allocation
 memory.kmem.failcnt		show the number of kernel memory usage
 hits limits
@@ -141,8 +141,8 @@ nodemask_t
 The size of a nodemask_t type. Used to compute the number of online
 nodes.

-(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|compound_order|compound_head)
--------------------------------------------------------------------------------------------------
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_order|compound_head)
+----------------------------------------------------------------------------------

 User-space tools compute their values based on the offset of these
 variables. The variables are used when excluding unnecessary pages.
@@ -325,8 +325,8 @@ NR_FREE_PAGES
 On linux-2.6.21 or later, the number of free pages is in
 vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.

-PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask
-------------------------------------------------------------------------------
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb
+-----------------------------------------------------------------------------------------

 Page attributes. These flags are used to filter various unnecessary for
 dumping pages.
@@ -338,12 +338,6 @@ More page attributes. These flags are used to filter various unnecessary for
 dumping pages.

-
-HUGETLB_PAGE_DTOR
------------------
-
-The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
-excludes these pages.

 x86_64
 ======
@@ -87,7 +87,7 @@ comma (","). ::
 │ │ │ │ │ │ │ filters/nr_filters
 │ │ │ │ │ │ │ │ 0/type,matching,memcg_id
 │ │ │ │ │ │ │ stats/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
-│ │ │ │ │ │ │ tried_regions/
+│ │ │ │ │ │ │ tried_regions/total_bytes
 │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
 │ │ │ │ │ │ │ │ ...
 │ │ │ │ │ │ ...
@@ -127,14 +127,18 @@ in the state. Writing ``commit`` to the ``state`` file makes kdamond reads the
 user inputs in the sysfs files except ``state`` file again. Writing
 ``update_schemes_stats`` to ``state`` file updates the contents of stats files
 for each DAMON-based operation scheme of the kdamond. For details of the
-stats, please refer to :ref:`stats section <sysfs_schemes_stats>`. Writing
-``update_schemes_tried_regions`` to ``state`` file updates the DAMON-based
-operation scheme action tried regions directory for each DAMON-based operation
-scheme of the kdamond. Writing ``clear_schemes_tried_regions`` to ``state``
-file clears the DAMON-based operating scheme action tried regions directory for
-each DAMON-based operation scheme of the kdamond. For details of the
-DAMON-based operation scheme action tried regions directory, please refer to
-:ref:`tried_regions section <sysfs_schemes_tried_regions>`.
+stats, please refer to :ref:`stats section <sysfs_schemes_stats>`.
+
+Writing ``update_schemes_tried_regions`` to ``state`` file updates the
+DAMON-based operation scheme action tried regions directory for each
+DAMON-based operation scheme of the kdamond. Writing
+``update_schemes_tried_bytes`` to ``state`` file updates only
+``.../tried_regions/total_bytes`` files. Writing
+``clear_schemes_tried_regions`` to ``state`` file clears the DAMON-based
+operating scheme action tried regions directory for each DAMON-based operation
+scheme of the kdamond. For details of the DAMON-based operation scheme action
+tried regions directory, please refer to :ref:`tried_regions section
+<sysfs_schemes_tried_regions>`.

 If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.

@@ -359,15 +363,21 @@ number (``N``) to the file creates the number of child directories named ``0``
 to ``N-1``. Each directory represents each filter. The filters are evaluated
 in the numeric order.

-Each filter directory contains three files, namely ``type``, ``matcing``, and
-``memcg_path``. You can write one of two special keywords, ``anon`` for
-anonymous pages, or ``memcg`` for specific memory cgroup filtering. In case of
-the memory cgroup filtering, you can specify the memory cgroup of the interest
-by writing the path of the memory cgroup from the cgroups mount point to
-``memcg_path`` file. You can write ``Y`` or ``N`` to ``matching`` file to
-filter out pages that does or does not match to the type, respectively. Then,
-the scheme's action will not be applied to the pages that specified to be
-filtered out.
+Each filter directory contains six files, namely ``type``, ``matcing``,
+``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type``
+file, you can write one of four special keywords: ``anon`` for anonymous pages,
+``memcg`` for specific memory cgroup, ``addr`` for specific address range (an
+open-ended interval), or ``target`` for specific DAMON monitoring target
+filtering. In case of the memory cgroup filtering, you can specify the memory
+cgroup of the interest by writing the path of the memory cgroup from the
+cgroups mount point to ``memcg_path`` file. In case of the address range
+filtering, you can specify the start and end address of the range to
+``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring
+target filtering, you can specify the index of the target between the list of
+the DAMON context's monitoring targets list to ``target_idx`` file. You can
+write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does
+not match to the type, respectively. Then, the scheme's action will not be
+applied to the pages that specified to be filtered out.

 For example, below restricts a DAMOS action to be applied to only non-anonymous
 pages of all memory cgroups except ``/having_care_already``.::
@@ -381,8 +391,14 @@ pages of all memory cgroups except ``/having_care_already``.::
     echo /having_care_already > 1/memcg_path
     echo N > 1/matching

-Note that filters are currently supported only when ``paddr``
-`implementation <sysfs_contexts>` is being used.
+Note that ``anon`` and ``memcg`` filters are currently supported only when
+``paddr`` `implementation <sysfs_contexts>` is being used.
+
+Also, memory regions that are filtered out by ``addr`` or ``target`` filters
+are not counted as the scheme has tried to those, while regions that filtered
+out by other type filters are counted as the scheme has tried to. The
+difference is applied to :ref:`stats <damos_stats>` and
+:ref:`tried regions <sysfs_schemes_tried_regions>`.

 .. _sysfs_schemes_stats:

@@ -406,13 +422,21 @@ stats by writing a special keyword, ``update_schemes_stats`` to the relevant
 schemes/<N>/tried_regions/
 --------------------------

+This directory initially has one file, ``total_bytes``.
+
 When a special keyword, ``update_schemes_tried_regions``, is written to the
-relevant ``kdamonds/<N>/state`` file, DAMON creates directories named integer
-starting from ``0`` under this directory. Each directory contains files
-exposing detailed information about each of the memory region that the
-corresponding scheme's ``action`` has tried to be applied under this directory,
-during next :ref:`aggregation interval <sysfs_monitoring_attrs>`. The
-information includes address range, ``nr_accesses``, and ``age`` of the region.
+relevant ``kdamonds/<N>/state`` file, DAMON updates the ``total_bytes`` file so
+that reading it returns the total size of the scheme tried regions, and creates
+directories named integer starting from ``0`` under this directory. Each
+directory contains files exposing detailed information about each of the memory
+region that the corresponding scheme's ``action`` has tried to be applied under
+this directory, during next :ref:`aggregation interval
+<sysfs_monitoring_attrs>`. The information includes address range,
+``nr_accesses``, and ``age`` of the region.
+
+Writing ``update_schemes_tried_bytes`` to the relevant ``kdamonds/<N>/state``
+file will only update the ``total_bytes`` file, and will not create the
+subdirectories.

 The directories will be removed when another special keyword,
 ``clear_schemes_tried_regions``, is written to the relevant
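The address-range filter documented above can be driven in the same echo style as the doc's memcg example. This is an illustrative sketch, not from the patch: it assumes the scheme's ``filters`` directory already exists and uses filter index 0 by choice.

```shell
# Illustrative only: install a single 'addr' filter so that the scheme's
# action skips pages in the given physical address range [start, end).
damos_addr_filter() {
    filters_dir=$1 start=$2 end=$3
    echo 1 > "$filters_dir/nr_filters"      # on real sysfs, creates 0/
    echo addr > "$filters_dir/0/type"
    echo "$start" > "$filters_dir/0/addr_start"
    echo "$end" > "$filters_dir/0/addr_end"
    echo Y > "$filters_dir/0/matching"      # Y: filter out matching pages
}
```

As the doc notes, regions excluded by ``addr`` filters are not counted in the scheme's tried stats.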
@@ -159,6 +159,8 @@ The effectiveness of KSM and MADV_MERGEABLE is shown in ``/sys/kernel/mm/ksm/``:

 general_profit
 how effective is KSM. The calculation is explained below.
+pages_scanned
+how many pages are being scanned for ksm
 pages_shared
 how many shared pages are being used
 pages_sharing
@@ -173,6 +175,13 @@ stable_node_chains
 the number of KSM pages that hit the ``max_page_sharing`` limit
 stable_node_dups
 number of duplicated KSM pages
+ksm_zero_pages
+how many zero pages that are still mapped into processes were mapped by
+KSM when deduplicating.
+
+When ``use_zero_pages`` is/was enabled, the sum of ``pages_sharing`` +
+``ksm_zero_pages`` represents the actual number of pages saved by KSM.
+if ``use_zero_pages`` has never been enabled, ``ksm_zero_pages`` is 0.

 A high ratio of ``pages_sharing`` to ``pages_shared`` indicates good
 sharing, but a high ratio of ``pages_unshared`` to ``pages_sharing``
@@ -196,21 +205,25 @@ several times, which are unprofitable memory consumed.
 1) How to determine whether KSM save memory or consume memory in system-wide
 range? Here is a simple approximate calculation for reference::

-	general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
+	general_profit =~ ksm_saved_pages * sizeof(page) - (all_rmap_items) *
 	sizeof(rmap_item);

-where all_rmap_items can be easily obtained by summing ``pages_sharing``,
-``pages_shared``, ``pages_unshared`` and ``pages_volatile``.
+where ksm_saved_pages equals to the sum of ``pages_sharing`` +
+``ksm_zero_pages`` of the system, and all_rmap_items can be easily
+obtained by summing ``pages_sharing``, ``pages_shared``, ``pages_unshared``
+and ``pages_volatile``.

 2) The KSM profit inner a single process can be similarly obtained by the
 following approximate calculation::

-	process_profit =~ ksm_merging_pages * sizeof(page) -
+	process_profit =~ ksm_saved_pages * sizeof(page) -
 	ksm_rmap_items * sizeof(rmap_item).

-where ksm_merging_pages is shown under the directory ``/proc/<pid>/``,
-and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. The process profit
-is also shown in ``/proc/<pid>/ksm_stat`` as ksm_process_profit.
+where ksm_saved_pages equals to the sum of ``ksm_merging_pages`` and
+``ksm_zero_pages``, both of which are shown under the directory
+``/proc/<pid>/ksm_stat``, and ksm_rmap_items is also shown in
+``/proc/<pid>/ksm_stat``. The process profit is also shown in
+``/proc/<pid>/ksm_stat`` as ksm_process_profit.

 From the perspective of application, a high ratio of ``ksm_rmap_items`` to
 ``ksm_merging_pages`` means a bad madvise-applied policy, so developers or
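The updated system-wide profit approximation is simple enough to script. A hedged sketch follows: the 4 KiB page size and 64-byte rmap_item size are illustrative guesses (both depend on architecture and kernel build), not values from the document.

```shell
# Approximate general_profit per the formula above:
#   (pages_sharing + ksm_zero_pages) * sizeof(page)
#     - all_rmap_items * sizeof(rmap_item)
# PAGE_SIZE and RMAP_ITEM_SIZE are assumptions for illustration.
ksm_general_profit() {
    pages_sharing=$1 ksm_zero_pages=$2 all_rmap_items=$3
    PAGE_SIZE=4096 RMAP_ITEM_SIZE=64
    ksm_saved_pages=$((pages_sharing + ksm_zero_pages))
    echo $((ksm_saved_pages * PAGE_SIZE - all_rmap_items * RMAP_ITEM_SIZE))
}
```

On a real system the inputs would come from the ``/sys/kernel/mm/ksm/`` counters listed above (all_rmap_items being the sum of ``pages_sharing``, ``pages_shared``, ``pages_unshared`` and ``pages_volatile``).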
@@ -433,6 +433,18 @@ The following module parameters are currently defined:
 memory in a way that huge pages in bigger
 granularity cannot be formed on hotplugged
 memory.
+
+With value "force" it could result in memory
+wastage due to memmap size limitations. For
+example, if the memmap for a memory block
+requires 1 MiB, but the pageblock size is 2
+MiB, 1 MiB of hotplugged memory will be wasted.
+Note that there are still cases where the
+feature cannot be enforced: for example, if the
+memmap is smaller than a single page, or if the
+architecture does not support the forced mode
+in all configurations.
+
 ``online_policy``	read-write: Set the basic policy used for
 automatic zone selection when onlining memory
 blocks without specifying a target zone.
@@ -669,7 +681,7 @@ when still encountering permanently unmovable pages within ZONE_MOVABLE
 (-> BUG), memory offlining will keep retrying until it eventually succeeds.

 When offlining is triggered from user space, the offlining context can be
-terminated by sending a fatal signal. A timeout based offlining can easily be
+terminated by sending a signal. A timeout based offlining can easily be
 implemented via::

     % timeout $TIMEOUT offline_block | failure_handling
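The ``offline_block`` command in the ``timeout`` example is left undefined by the document. One plausible shape, using the standard memory-block sysfs interface, is sketched below; since ``timeout`` execs a separate program, in practice this would live in a small script on ``$PATH`` rather than as a shell function. The optional root argument is our addition so the logic can be exercised against a fake tree.

```shell
# offline_block (illustrative): request offlining of one memory block via
# the memory hotplug sysfs ABI, e.g. /sys/devices/system/memory/memory42/state.
offline_block() {
    block=$1                                  # e.g. memory42
    root=${2:-/sys/devices/system/memory}
    echo offline > "$root/$block/state"
}

# Bounded offlining, as in the doc's example; timeout sends a signal after
# $TIMEOUT seconds, which terminates the offlining context:
#   timeout "$TIMEOUT" offline_block_script memory42 || echo "offline failed"
```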
@@ -244,6 +244,21 @@ write-protected (so future writes will also result in a WP fault). These ioctls
 support a mode flag (``UFFDIO_COPY_MODE_WP`` or ``UFFDIO_CONTINUE_MODE_WP``
 respectively) to configure the mapping this way.

+Memory Poisioning Emulation
+---------------------------
+
+In response to a fault (either missing or minor), an action userspace can
+take to "resolve" it is to issue a ``UFFDIO_POISON``. This will cause any
+future faulters to either get a SIGBUS, or in KVM's case the guest will
+receive an MCE as if there were hardware memory poisoning.
+
+This is used to emulate hardware memory poisoning. Imagine a VM running on a
+machine which experiences a real hardware memory error. Later, we live migrate
+the VM to another physical machine. Since we want the migration to be
+transparent to the guest, we want that same address range to act as if it was
+still poisoned, even though it's on a new physical host which ostensibly
+doesn't have a memory error in the exact same spot.

 QEMU/KVM
 ========
@@ -49,7 +49,7 @@ compressed pool.
 Design
 ======
 
-Zswap receives pages for compression through the Frontswap API and is able to
+Zswap receives pages for compression from the swap subsystem and is able to
 evict pages from its own compressed pool on an LRU basis and write them back to
 the backing swap device in the case that the compressed pool is full.
 
@@ -70,19 +70,19 @@ means the compression ratio will always be 2:1 or worse (because of half-full
 zbud pages). The zsmalloc type zpool has a more complex compressed page
 storage method, and it can achieve greater storage densities.
 
-When a swap page is passed from frontswap to zswap, zswap maintains a mapping
+When a swap page is passed from swapout to zswap, zswap maintains a mapping
 of the swap entry, a combination of the swap type and swap offset, to the zpool
 handle that references that compressed swap page. This mapping is achieved
 with a red-black tree per swap type. The swap offset is the search key for the
 tree nodes.
 
-During a page fault on a PTE that is a swap entry, frontswap calls the zswap
-load function to decompress the page into the page allocated by the page fault
-handler.
+During a page fault on a PTE that is a swap entry, the swapin code calls the
+zswap load function to decompress the page into the page allocated by the page
+fault handler.
 
 Once there are no PTEs referencing a swap page stored in zswap (i.e. the count
-in the swap_map goes to 0) the swap code calls the zswap invalidate function,
-via frontswap, to free the compressed entry.
+in the swap_map goes to 0) the swap code calls the zswap invalidate function
+to free the compressed entry.
 
 Zswap seeks to be simple in its policies. Sysfs attributes allow for one user
 controlled policy:
@@ -134,6 +134,7 @@ Usage of helpers:
 bio_for_each_bvec_all()
 bio_first_bvec_all()
 bio_first_page_all()
+bio_first_folio_all()
 bio_last_bvec_all()
 
 * The following helpers iterate over single-page segment. The passed 'struct
@@ -88,13 +88,17 @@ changes occur:
 
 This is used primarily during fault processing.
 
-5) ``void update_mmu_cache(struct vm_area_struct *vma,
-   unsigned long address, pte_t *ptep)``
+5) ``void update_mmu_cache_range(struct vm_fault *vmf,
+   struct vm_area_struct *vma, unsigned long address, pte_t *ptep,
+   unsigned int nr)``
 
-At the end of every page fault, this routine is invoked to
-tell the architecture specific code that a translation
-now exists at virtual address "address" for address space
-"vma->vm_mm", in the software page tables.
+At the end of every page fault, this routine is invoked to tell
+the architecture specific code that translations now exist
+in the software page tables for address space "vma->vm_mm"
+at virtual address "address" for "nr" consecutive pages.
+
+This routine is also invoked in various other places which pass
+a NULL "vmf".
 
 A port may use this information in any way it so chooses.
 For example, it could use this event to pre-load TLB
@@ -269,7 +273,7 @@ maps this page at its virtual address.
 If D-cache aliasing is not an issue, these two routines may
 simply call memcpy/memset directly and do nothing more.
 
-``void flush_dcache_page(struct page *page)``
+``void flush_dcache_folio(struct folio *folio)``
 
 This routines must be called when:
 
@@ -277,7 +281,7 @@ maps this page at its virtual address.
 and / or in high memory
 b) the kernel is about to read from a page cache page and user space
 shared/writable mappings of this page potentially exist. Note
-that {get,pin}_user_pages{_fast} already call flush_dcache_page
+that {get,pin}_user_pages{_fast} already call flush_dcache_folio
 on any page found in the user address space and thus driver
 code rarely needs to take this into account.
 
@@ -291,7 +295,7 @@ maps this page at its virtual address.
 
 The phrase "kernel writes to a page cache page" means, specifically,
 that the kernel executes store instructions that dirty data in that
-page at the page->virtual mapping of that page. It is important to
+page at the kernel virtual mapping of that page. It is important to
 flush here to handle D-cache aliasing, to make sure these kernel stores
 are visible to user space mappings of that page.
 
@@ -302,21 +306,22 @@ maps this page at its virtual address.
 If D-cache aliasing is not an issue, this routine may simply be defined
 as a nop on that architecture.
 
-There is a bit set aside in page->flags (PG_arch_1) as "architecture
+There is a bit set aside in folio->flags (PG_arch_1) as "architecture
 private". The kernel guarantees that, for pagecache pages, it will
 clear this bit when such a page first enters the pagecache.
 
-This allows these interfaces to be implemented much more efficiently.
-It allows one to "defer" (perhaps indefinitely) the actual flush if
-there are currently no user processes mapping this page. See sparc64's
-flush_dcache_page and update_mmu_cache implementations for an example
-of how to go about doing this.
+This allows these interfaces to be implemented much more
+efficiently. It allows one to "defer" (perhaps indefinitely) the
+actual flush if there are currently no user processes mapping this
+page. See sparc64's flush_dcache_folio and update_mmu_cache_range
+implementations for an example of how to go about doing this.
 
-The idea is, first at flush_dcache_page() time, if page_file_mapping()
-returns a mapping, and mapping_mapped on that mapping returns %false,
-just mark the architecture private page flag bit. Later, in
-update_mmu_cache(), a check is made of this flag bit, and if set the
-flush is done and the flag bit is cleared.
+The idea is, first at flush_dcache_folio() time, if
+folio_flush_mapping() returns a mapping, and mapping_mapped() on that
+mapping returns %false, just mark the architecture private page
+flag bit. Later, in update_mmu_cache_range(), a check is made
+of this flag bit, and if set the flush is done and the flag bit
+is cleared.
 
 .. important::
 
@@ -326,12 +331,6 @@ maps this page at its virtual address.
 dirty. Again, see sparc64 for examples of how
 to deal with this.
 
-``void flush_dcache_folio(struct folio *folio)``
-This function is called under the same circumstances as
-flush_dcache_page(). It allows the architecture to
-optimise for flushing the entire folio of pages instead
-of flushing one page at a time.
-
 ``void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
 unsigned long user_vaddr, void *dst, void *src, int len)``
 ``void copy_from_user_page(struct vm_area_struct *vma, struct page *page,
@@ -352,7 +351,7 @@ maps this page at its virtual address.
 
 When the kernel needs to access the contents of an anonymous
 page, it calls this function (currently only
-get_user_pages()). Note: flush_dcache_page() deliberately
+get_user_pages()). Note: flush_dcache_folio() deliberately
 doesn't work for an anonymous page. The default
 implementation is a nop (and should remain so for all coherent
 architectures). For incoherent architectures, it should flush
|
||||
``void flush_icache_page(struct vm_area_struct *vma, struct page *page)``
|
||||
|
||||
All the functionality of flush_icache_page can be implemented in
|
||||
flush_dcache_page and update_mmu_cache. In the future, the hope
|
||||
flush_dcache_folio and update_mmu_cache_range. In the future, the hope
|
||||
is to remove this interface completely.
|
||||
|
||||
The final category of APIs is for I/O to deliberately aliased address
|
||||
|
@@ -115,3 +115,28 @@ More Memory Management Functions
 .. kernel-doc:: include/linux/mmzone.h
 .. kernel-doc:: mm/util.c
 :functions: folio_mapping
+
+.. kernel-doc:: mm/rmap.c
+.. kernel-doc:: mm/migrate.c
+.. kernel-doc:: mm/mmap.c
+.. kernel-doc:: mm/kmemleak.c
+.. #kernel-doc:: mm/hmm.c (build warnings)
+.. kernel-doc:: mm/memremap.c
+.. kernel-doc:: mm/hugetlb.c
+.. kernel-doc:: mm/swap.c
+.. kernel-doc:: mm/zpool.c
+.. kernel-doc:: mm/memcontrol.c
+.. #kernel-doc:: mm/memory-tiers.c (build warnings)
+.. kernel-doc:: mm/shmem.c
+.. kernel-doc:: mm/migrate_device.c
+.. #kernel-doc:: mm/nommu.c (duplicates kernel-doc from other files)
+.. kernel-doc:: mm/mapping_dirty_helpers.c
+.. #kernel-doc:: mm/memory-failure.c (build warnings)
+.. kernel-doc:: mm/percpu.c
+.. kernel-doc:: mm/maccess.c
+.. kernel-doc:: mm/vmscan.c
+.. kernel-doc:: mm/memory_hotplug.c
+.. kernel-doc:: mm/mmu_notifier.c
+.. kernel-doc:: mm/balloon_compaction.c
+.. kernel-doc:: mm/huge_memory.c
+.. kernel-doc:: mm/io-mapping.c
@@ -9,7 +9,7 @@
 | alpha: | TODO |
 | arc: | TODO |
 | arm: | TODO |
-| arm64: | N/A |
+| arm64: | ok |
 | csky: | TODO |
 | hexagon: | TODO |
 | ia64: | TODO |
@@ -636,26 +636,29 @@ vm_operations_struct
 
 prototypes::
 
-        void (*open)(struct vm_area_struct*);
-        void (*close)(struct vm_area_struct*);
-        vm_fault_t (*fault)(struct vm_area_struct*, struct vm_fault *);
+        void (*open)(struct vm_area_struct *);
+        void (*close)(struct vm_area_struct *);
+        vm_fault_t (*fault)(struct vm_fault *);
+        vm_fault_t (*huge_fault)(struct vm_fault *, unsigned int order);
+        vm_fault_t (*map_pages)(struct vm_fault *, pgoff_t start, pgoff_t end);
         vm_fault_t (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *);
         vm_fault_t (*pfn_mkwrite)(struct vm_area_struct *, struct vm_fault *);
         int (*access)(struct vm_area_struct *, unsigned long, void*, int, int);
 
 locking rules:
 
-============= ========= ===========================
+============= ========== ===========================
 ops           mmap_lock  PageLocked(page)
-============= ========= ===========================
-open:         yes
-close:        yes
-fault:        yes       can return with page locked
-map_pages:    read
-page_mkwrite: yes       can return with page locked
-pfn_mkwrite:  yes
-access:       yes
-============= ========= ===========================
+============= ========== ===========================
+open:         write
+close:        read/write
+fault:        read       can return with page locked
+huge_fault:   maybe-read
+map_pages:    maybe-read
+page_mkwrite: read       can return with page locked
+pfn_mkwrite:  read
+access:       read
+============= ========== ===========================
 
 ->fault() is called when a previously not present pte is about to be faulted
 in. The filesystem must find and return the page associated with the passed in
@@ -665,11 +668,18 @@ then ensure the page is not already truncated (invalidate_lock will block
 subsequent truncate), and then return with VM_FAULT_LOCKED, and the page
 locked. The VM will unlock the page.
 
+->huge_fault() is called when there is no PUD or PMD entry present. This
+gives the filesystem the opportunity to install a PUD or PMD sized page.
+Filesystems can also use the ->fault method to return a PMD sized page,
+so implementing this function may not be necessary. In particular,
+filesystems should not call filemap_fault() from ->huge_fault().
+The mmap_lock may not be held when this method is called.
+
 ->map_pages() is called when VM asks to map easy accessible pages.
 Filesystem should find and map pages associated with offsets from "start_pgoff"
 till "end_pgoff". ->map_pages() is called with the RCU lock held and must
 not block. If it's not possible to reach a page without blocking,
-filesystem should skip it. Filesystem should use do_set_pte() to setup
+filesystem should skip it. Filesystem should use set_pte_range() to setup
 page table entry. Pointer to entry associated with the page is passed in
 "pte" field in vm_fault structure. Pointers to entries for other offsets
 should be calculated relative to "pte".
@@ -938,3 +938,14 @@ file pointer instead of struct dentry pointer. d_tmpfile() is similarly
 changed to simplify callers. The passed file is in a non-open state and on
 success must be opened before returning (e.g. by calling
 finish_open_simple()).
+
+---
+
+**mandatory**
+
+Calling convention for ->huge_fault has changed. It now takes a page
+order instead of an enum page_entry_size, and it may be called without the
+mmap_lock held. All in-tree users have been audited and do not seem to
+depend on the mmap_lock being held, but out of tree users should verify
+for themselves. If they do need it, they can return VM_FAULT_RETRY to
+be called with the mmap_lock held.
@@ -380,12 +380,24 @@ number of filters for each scheme. Each filter specifies the type of target
 memory, and whether it should exclude the memory of the type (filter-out), or
 all except the memory of the type (filter-in).
 
-As of this writing, anonymous page type and memory cgroup type are supported by
-the feature. Some filter target types can require additional arguments. For
-example, the memory cgroup filter type asks users to specify the file path of
-the memory cgroup for the filter. Hence, users can apply specific schemes to
-only anonymous pages, non-anonymous pages, pages of specific cgroups, all pages
-excluding those of specific cgroups, and any combination of those.
+Currently, anonymous page, memory cgroup, address range, and DAMON monitoring
+target type filters are supported by the feature. Some filter target types
+require additional arguments. The memory cgroup filter type asks users to
+specify the file path of the memory cgroup for the filter. The address range
+type asks the start and end addresses of the range. The DAMON monitoring
+target type asks the index of the target from the context's monitoring targets
+list. Hence, users can apply specific schemes to only anonymous pages,
+non-anonymous pages, pages of specific cgroups, all pages excluding those of
+specific cgroups, pages in specific address range, pages in specific DAMON
+monitoring targets, and any combination of those.
+
+To handle filters efficiently, the address range and DAMON monitoring target
+type filters are handled by the core layer, while others are handled by
+operations set. If a memory region is filtered by a core layer-handled filter,
+it is not counted as the scheme has tried to the region. In contrast, if a
+memory region is filtered by an operations set layer-handled filter, it is
+counted as the scheme has tried. The difference in accounting leads to changes
+in the statistics.
 
 
 Application Programming Interface
@@ -1,264 +0,0 @@
-=========
-Frontswap
-=========
-
-Frontswap provides a "transcendent memory" interface for swap pages.
-In some environments, dramatic performance savings may be obtained because
-swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk.
-
-.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/
-
-Frontswap is so named because it can be thought of as the opposite of
-a "backing" store for a swap device. The storage is assumed to be
-a synchronous concurrency-safe page-oriented "pseudo-RAM device" conforming
-to the requirements of transcendent memory (such as Xen's "tmem", or
-in-kernel compressed memory, aka "zcache", or future RAM-like devices);
-this pseudo-RAM device is not directly accessible or addressable by the
-kernel and is of unknown and possibly time-varying size. The driver
-links itself to frontswap by calling frontswap_register_ops to set the
-frontswap_ops funcs appropriately and the functions it provides must
-conform to certain policies as follows:
-
-An "init" prepares the device to receive frontswap pages associated
-with the specified swap device number (aka "type"). A "store" will
-copy the page to transcendent memory and associate it with the type and
-offset associated with the page. A "load" will copy the page, if found,
-from transcendent memory into kernel memory, but will NOT remove the page
-from transcendent memory. An "invalidate_page" will remove the page
-from transcendent memory and an "invalidate_area" will remove ALL pages
-associated with the swap type (e.g., like swapoff) and notify the "device"
-to refuse further stores with that swap type.
-
-Once a page is successfully stored, a matching load on the page will normally
-succeed. So when the kernel finds itself in a situation where it needs
-to swap out a page, it first attempts to use frontswap. If the store returns
-success, the data has been successfully saved to transcendent memory and
-a disk write and, if the data is later read back, a disk read are avoided.
-If a store returns failure, transcendent memory has rejected the data, and the
-page can be written to swap as usual.
-
-Note that if a page is stored and the page already exists in transcendent memory
-(a "duplicate" store), either the store succeeds and the data is overwritten,
-or the store fails AND the page is invalidated. This ensures stale data may
-never be obtained from frontswap.
-
-If properly configured, monitoring of frontswap is done via debugfs in
-the `/sys/kernel/debug/frontswap` directory. The effectiveness of
-frontswap can be measured (across all swap devices) with:
-
-``failed_stores``
-how many store attempts have failed
-
-``loads``
-how many loads were attempted (all should succeed)
-
-``succ_stores``
-how many store attempts have succeeded
-
-``invalidates``
-how many invalidates were attempted
-
-A backend implementation may provide additional metrics.
-
-FAQ
-===
-
-* Where's the value?
-
-When a workload starts swapping, performance falls through the floor.
-Frontswap significantly increases performance in many such workloads by
-providing a clean, dynamic interface to read and write swap pages to
-"transcendent memory" that is otherwise not directly addressable to the kernel.
-This interface is ideal when data is transformed to a different form
-and size (such as with compression) or secretly moved (as might be
-useful for write-balancing for some RAM-like devices). Swap pages (and
-evicted page-cache pages) are a great use for this kind of slower-than-RAM-
-but-much-faster-than-disk "pseudo-RAM device".
-
-Frontswap with a fairly small impact on the kernel,
-provides a huge amount of flexibility for more dynamic, flexible RAM
-utilization in various system configurations:
-
-In the single kernel case, aka "zcache", pages are compressed and
-stored in local memory, thus increasing the total anonymous pages
-that can be safely kept in RAM. Zcache essentially trades off CPU
-cycles used in compression/decompression for better memory utilization.
-Benchmarks have shown little or no impact when memory pressure is
-low while providing a significant performance improvement (25%+)
-on some workloads under high memory pressure.
-
-"RAMster" builds on zcache by adding "peer-to-peer" transcendent memory
-support for clustered systems. Frontswap pages are locally compressed
-as in zcache, but then "remotified" to another system's RAM. This
-allows RAM to be dynamically load-balanced back-and-forth as needed,
-i.e. when system A is overcommitted, it can swap to system B, and
-vice versa. RAMster can also be configured as a memory server so
-many servers in a cluster can swap, dynamically as needed, to a single
-server configured with a large amount of RAM... without pre-configuring
-how much of the RAM is available for each of the clients!
-
-In the virtual case, the whole point of virtualization is to statistically
-multiplex physical resources across the varying demands of multiple
-virtual machines. This is really hard to do with RAM and efforts to do
-it well with no kernel changes have essentially failed (except in some
-well-publicized special-case workloads).
-Specifically, the Xen Transcendent Memory backend allows otherwise
-"fallow" hypervisor-owned RAM to not only be "time-shared" between multiple
-virtual machines, but the pages can be compressed and deduplicated to
-optimize RAM utilization. And when guest OS's are induced to surrender
-underutilized RAM (e.g. with "selfballooning"), sudden unexpected
-memory pressure may result in swapping; frontswap allows those pages
-to be swapped to and from hypervisor RAM (if overall host system memory
-conditions allow), thus mitigating the potentially awful performance impact
-of unplanned swapping.
-
-A KVM implementation is underway and has been RFC'ed to lkml. And,
-using frontswap, investigation is also underway on the use of NVM as
-a memory extension technology.
-
-* Sure there may be performance advantages in some situations, but
-what's the space/time overhead of frontswap?
-
-If CONFIG_FRONTSWAP is disabled, every frontswap hook compiles into
-nothingness and the only overhead is a few extra bytes per swapon'ed
-swap device. If CONFIG_FRONTSWAP is enabled but no frontswap "backend"
-registers, there is one extra global variable compared to zero for
-every swap page read or written. If CONFIG_FRONTSWAP is enabled
-AND a frontswap backend registers AND the backend fails every "store"
-request (i.e. provides no memory despite claiming it might),
-CPU overhead is still negligible -- and since every frontswap fail
-precedes a swap page write-to-disk, the system is highly likely
-to be I/O bound and using a small fraction of a percent of a CPU
-will be irrelevant anyway.
-
-As for space, if CONFIG_FRONTSWAP is enabled AND a frontswap backend
-registers, one bit is allocated for every swap page for every swap
-device that is swapon'd. This is added to the EIGHT bits (which
-was sixteen until about 2.6.34) that the kernel already allocates
-for every swap page for every swap device that is swapon'd. (Hugh
-Dickins has observed that frontswap could probably steal one of
-the existing eight bits, but let's worry about that minor optimization
-later.) For very large swap disks (which are rare) on a standard
-4K pagesize, this is 1MB per 32GB swap.
-
-When swap pages are stored in transcendent memory instead of written
-out to disk, there is a side effect that this may create more memory
-pressure that can potentially outweigh the other advantages. A
-backend, such as zcache, must implement policies to carefully (but
-dynamically) manage memory limits to ensure this doesn't happen.
-
-* OK, how about a quick overview of what this frontswap patch does
-in terms that a kernel hacker can grok?
-
-Let's assume that a frontswap "backend" has registered during
-kernel initialization; this registration indicates that this
-frontswap backend has access to some "memory" that is not directly
-accessible by the kernel. Exactly how much memory it provides is
-entirely dynamic and random.
-
-Whenever a swap-device is swapon'd frontswap_init() is called,
-passing the swap device number (aka "type") as a parameter.
-This notifies frontswap to expect attempts to "store" swap pages
-associated with that number.
-
-Whenever the swap subsystem is readying a page to write to a swap
-device (c.f swap_writepage()), frontswap_store is called. Frontswap
-consults with the frontswap backend and if the backend says it does NOT
-have room, frontswap_store returns -1 and the kernel swaps the page
-to the swap device as normal. Note that the response from the frontswap
-backend is unpredictable to the kernel; it may choose to never accept a
-page, it could accept every ninth page, or it might accept every
-page. But if the backend does accept a page, the data from the page
-has already been copied and associated with the type and offset,
-and the backend guarantees the persistence of the data. In this case,
-frontswap sets a bit in the "frontswap_map" for the swap device
-corresponding to the page offset on the swap device to which it would
-otherwise have written the data.
-
-When the swap subsystem needs to swap-in a page (swap_readpage()),
-it first calls frontswap_load() which checks the frontswap_map to
-see if the page was earlier accepted by the frontswap backend. If
-it was, the page of data is filled from the frontswap backend and
-the swap-in is complete. If not, the normal swap-in code is
-executed to obtain the page of data from the real swap device.
-
-So every time the frontswap backend accepts a page, a swap device read
-and (potentially) a swap device write are replaced by a "frontswap backend
-store" and (possibly) a "frontswap backend loads", which are presumably much
-faster.
-
-* Can't frontswap be configured as a "special" swap device that is
-just higher priority than any real swap device (e.g. like zswap,
-or maybe swap-over-nbd/NFS)?
-
-No. First, the existing swap subsystem doesn't allow for any kind of
-swap hierarchy. Perhaps it could be rewritten to accommodate a hierarchy,
-but this would require fairly drastic changes. Even if it were
-rewritten, the existing swap subsystem uses the block I/O layer which
-assumes a swap device is fixed size and any page in it is linearly
-addressable. Frontswap barely touches the existing swap subsystem,
-and works around the constraints of the block I/O subsystem to provide
-a great deal of flexibility and dynamicity.
-
-For example, the acceptance of any swap page by the frontswap backend is
-entirely unpredictable. This is critical to the definition of frontswap
-backends because it grants completely dynamic discretion to the
-backend. In zcache, one cannot know a priori how compressible a page is.
-"Poorly" compressible pages can be rejected, and "poorly" can itself be
-defined dynamically depending on current memory constraints.
-
-Further, frontswap is entirely synchronous whereas a real swap
-device is, by definition, asynchronous and uses block I/O. The
-block I/O layer is not only unnecessary, but may perform "optimizations"
-that are inappropriate for a RAM-oriented device including delaying
-the write of some pages for a significant amount of time. Synchrony is
-required to ensure the dynamicity of the backend and to avoid thorny race
-conditions that would unnecessarily and greatly complicate frontswap
-and/or the block I/O subsystem. That said, only the initial "store"
-and "load" operations need be synchronous. A separate asynchronous thread
-is free to manipulate the pages stored by frontswap. For example,
-the "remotification" thread in RAMster uses standard asynchronous
-kernel sockets to move compressed frontswap pages to a remote machine.
-Similarly, a KVM guest-side implementation could do in-guest compression
-and use "batched" hypercalls.
-
-In a virtualized environment, the dynamicity allows the hypervisor
-(or host OS) to do "intelligent overcommit". For example, it can
-choose to accept pages only until host-swapping might be imminent,
-then force guests to do their own swapping.
-
-There is a downside to the transcendent memory specifications for
-frontswap: Since any "store" might fail, there must always be a real
-slot on a real swap device to swap the page. Thus frontswap must be
-implemented as a "shadow" to every swapon'd device with the potential
-capability of holding every page that the swap device might have held
-and the possibility that it might hold no pages at all. This means
-that frontswap cannot contain more pages than the total of swapon'd
-swap devices. For example, if NO swap device is configured on some
-installation, frontswap is useless. Swapless portable devices
-can still use frontswap but a backend for such devices must configure
-some kind of "ghost" swap device and ensure that it is never used.
-
-* Why this weird definition about "duplicate stores"? If a page
-has been previously successfully stored, can't it always be
-successfully overwritten?
-
-Nearly always it can, but no, sometimes it cannot. Consider an example
-where data is compressed and the original 4K page has been compressed
-to 1K. Now an attempt is made to overwrite the page with data that
-is non-compressible and so would take the entire 4K. But the backend
-has no more space. In this case, the store must be rejected. Whenever
-frontswap rejects a store that would overwrite, it also must invalidate
-the old data and ensure that it is no longer accessible. Since the
-swap subsystem then writes the new data to the read swap device,
-this is the correct course of action to ensure coherency.
-
-* Why does the frontswap patch create the new include file swapfile.h?
-
-The frontswap code depends on some swap-subsystem-internal data
-structures that have, over the years, moved back and forth between
-static and global. This seemed a reasonable compromise: Define
-them as global but declare them in a new include file that isn't
-included by the large number of source files that include swap.h.
-
-Dan Magenheimer, last updated April 9, 2012
@@ -206,4 +206,5 @@ Functions
 =========
 
 .. kernel-doc:: include/linux/highmem.h
 .. kernel-doc:: mm/highmem.c
+.. kernel-doc:: include/linux/highmem-internal.h
@@ -271,12 +271,12 @@ to the global reservation count (resv_huge_pages).
 Freeing Huge Pages
 ==================
 
-Huge page freeing is performed by the routine free_huge_page(). This routine
-is the destructor for hugetlbfs compound pages. As a result, it is only
-passed a pointer to the page struct. When a huge page is freed, reservation
-accounting may need to be performed. This would be the case if the page was
-associated with a subpool that contained reserves, or the page is being freed
-on an error path where a global reserve count must be restored.
+Huge pages are freed by free_huge_folio(). It is only passed a pointer
+to the folio as it is called from the generic MM code. When a huge page
+is freed, reservation accounting may need to be performed. This would
+be the case if the page was associated with a subpool that contained
+reserves, or the page is being freed on an error path where a global
+reserve count must be restored.
 
 The page->private field points to any subpool associated with the page.
 If the PagePrivate flag is set, it indicates the global reserve count should
 
@@ -525,7 +525,7 @@ However, there are several instances where errors are encountered after a huge
 page is allocated but before it is instantiated. In this case, the page
 allocation has consumed the reservation and made the appropriate subpool,
 reservation map and global count adjustments. If the page is freed at this
-time (before instantiation and clearing of PagePrivate), then free_huge_page
+time (before instantiation and clearing of PagePrivate), then free_huge_folio
 will increment the global reservation count. However, the reservation map
 indicates the reservation was consumed. This resulting inconsistent state
 will cause the 'leak' of a reserved huge page. The global reserve count will

@@ -44,7 +44,6 @@ above structured documentation, or deleted if it has served its purpose.
   balance
   damon/index
   free_page_reporting
   frontswap
   hmm
   hwpoison
   hugetlbfs_reserv

@@ -58,7 +58,7 @@ Support of split page table lock by an architecture
===================================================

There's no need in special enabling of PTE split page table lock: everything
required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which
required is done by pagetable_pte_ctor() and pagetable_pte_dtor(), which
must be called on PTE table allocation / freeing.

Make sure the architecture doesn't use slab allocator for page table
@@ -68,8 +68,8 @@ This field shares storage with page->ptl.
PMD split lock only makes sense if you have more than two page table
levels.

PMD split lock enabling requires pgtable_pmd_page_ctor() call on PMD table
allocation and pgtable_pmd_page_dtor() on freeing.
PMD split lock enabling requires pagetable_pmd_ctor() call on PMD table
allocation and pagetable_pmd_dtor() on freeing.

Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
@@ -77,7 +77,7 @@ paths: i.e X86_PAE preallocate few PMDs on pgd_alloc().

With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.

NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must
NOTE: pagetable_pte_ctor() and pagetable_pmd_ctor() can fail -- it must
be handled properly.

page->ptl
@@ -97,7 +97,7 @@ trick:
    split lock with enabled DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, but costs
    one more cache line for indirect access;

The spinlock_t allocated in pgtable_pte_page_ctor() for PTE table and in
pgtable_pmd_page_ctor() for PMD table.
The spinlock_t allocated in pagetable_pte_ctor() for PTE table and in
pagetable_pmd_ctor() for PMD table.

Please, never access page->ptl directly -- use appropriate helper.

@@ -210,6 +210,7 @@ the device (altmap).

The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst

The differences with HugeTLB are relatively minor.

@@ -263,3 +263,8 @@ is heavy internal fragmentation and zspool compaction is unable to relocate
objects and release zspages. In these cases, it is recommended to decrease
the limit on the size of the zspage chains (as specified by the
CONFIG_ZSMALLOC_CHAIN_SIZE option).

Functions
=========

.. kernel-doc:: mm/zsmalloc.c

@@ -36,6 +36,7 @@ powerpc
   ultravisor
   vas-api
   vcpudispatch_stats
   vmemmap_dedup

features
Documentation/powerpc/vmemmap_dedup.rst | 101 (new file)
@@ -0,0 +1,101 @@
.. SPDX-License-Identifier: GPL-2.0

==========
Device DAX
==========

The device-dax interface uses the tail deduplication technique explained in
Documentation/mm/vmemmap_dedup.rst

On powerpc, vmemmap deduplication is only used with radix MMU translation. Also
with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap
deduplication.

With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap
page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no
vmemmap deduplication possible.

With 1G PUD level mapping, we require 16384 struct pages and a single 64K
vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
require 16 64K pages in vmemmap to map the struct page for 1G PUD level mapping.

Here's how things look like on device-dax after the sections are populated::

 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PUD    |                     +-----------+                       | | |
 |   level   |                     |     .     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     .     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     15    | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

With 4K page size, 2M PMD level mapping requires 512 struct pages and a single
4K vmemmap page contains 64 struct pages (4K/sizeof(struct page)). Hence we
require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping.

Here's how things look like on device-dax after the sections are populated::

 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PMD    |                     +-----------+                       | | |
 |   level   |                     |     5     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     6     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |     7     | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

With 1G PUD level mapping, we require 262144 struct pages and a single 4K
vmemmap page can contain 64 struct pages (4K/sizeof(struct page)). Hence we
require 4096 4K pages in vmemmap to map the struct pages for 1G PUD level
mapping.

Here's how things look like on device-dax after the sections are populated::

 +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
 |           |                     |     0     | -------------> |     0     |
 |           |                     +-----------+                +-----------+
 |           |                     |     1     | -------------> |     1     |
 |           |                     +-----------+                +-----------+
 |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
 |           |                     +-----------+                   | | | | |
 |           |                     |     3     | ------------------+ | | | |
 |           |                     +-----------+                     | | | |
 |           |                     |     4     | --------------------+ | | |
 |    PUD    |                     +-----------+                       | | |
 |   level   |                     |     .     | ----------------------+ | |
 |  mapping  |                     +-----------+                         | |
 |           |                     |     .     | ------------------------+ |
 |           |                     +-----------+                           |
 |           |                     |    4095   | --------------------------+
 |           |                     +-----------+
 |           |
 |           |
 |           |
 +-----------+

@@ -1,196 +0,0 @@
:Original: Documentation/mm/frontswap.rst

:翻译:

 司延腾 Yanteng Si <siyanteng@loongson.cn>

:校译:

=========
Frontswap
=========

Frontswap为交换页提供了一个 “transcendent memory” 的接口。在一些环境中,由
于交换页被保存在RAM(或类似RAM的设备)中,而不是交换磁盘,因此可以获得巨大的性能
节省(提高)。

.. _Transcendent memory in a nutshell: https://lwn.net/Articles/454795/

Frontswap之所以这么命名,是因为它可以被认为是与swap设备的“back”存储相反。存
储器被认为是一个同步并发安全的面向页面的“伪RAM设备”,符合transcendent memory
(如Xen的“tmem”,或内核内压缩内存,又称“zcache”,或未来的类似RAM的设备)的要
求;这个伪RAM设备不能被内核直接访问或寻址,其大小未知且可能随时间变化。驱动程序通过
调用frontswap_register_ops将自己与frontswap链接起来,以适当地设置frontswap_ops
的功能,它提供的功能必须符合某些策略,如下所示:

一个 “init” 将设备准备好接收与指定的交换设备编号(又称“类型”)相关的frontswap
交换页。一个 “store” 将把该页复制到transcendent memory,并与该页的类型和偏移
量相关联。一个 “load” 将把该页,如果找到的话,从transcendent memory复制到内核
内存,但不会从transcendent memory中删除该页。一个 “invalidate_page” 将从
transcendent memory中删除该页,一个 “invalidate_area” 将删除所有与交换类型
相关的页(例如,像swapoff)并通知 “device” 拒绝进一步存储该交换类型。

一旦一个页面被成功存储,在该页面上的匹配加载通常会成功。因此,当内核发现自己处于需
要交换页面的情况时,它首先尝试使用frontswap。如果存储的结果是成功的,那么数据就已
经成功的保存到了transcendent memory中,并且避免了磁盘写入,如果后来再读回数据,
也避免了磁盘读取。如果存储返回失败,transcendent memory已经拒绝了该数据,且该页
可以像往常一样被写入交换空间。

请注意,如果一个页面被存储,而该页面已经存在于transcendent memory中(一个 “重复”
的存储),要么存储成功,数据被覆盖,要么存储失败,该页面被废止。这确保了旧的数据永远
不会从frontswap中获得。

如果配置正确,对frontswap的监控是通过 `/sys/kernel/debug/frontswap` 目录下的
debugfs完成的。frontswap的有效性可以通过以下方式测量(在所有交换设备中):

``failed_stores``
    有多少次存储的尝试是失败的

``loads``
    尝试了多少次加载(应该全部成功)

``succ_stores``
    有多少次存储的尝试是成功的

``invalidates``
    尝试了多少次作废

后台实现可以提供额外的指标。

经常问到的问题
==============

* 价值在哪里?

当一个工作负载开始交换时,性能就会下降。Frontswap通过提供一个干净的、动态的接口来
读取和写入交换页到 “transcendent memory”,从而大大增加了许多这样的工作负载的性
能,否则内核是无法直接寻址的。当数据被转换为不同的形式和大小(比如压缩)或者被秘密
移动(对于一些类似RAM的设备来说,这可能对写平衡很有用)时,这个接口是理想的。交换
页(和被驱逐的页面缓存页)是这种比RAM慢但比磁盘快得多的“伪RAM设备”的一大用途。

Frontswap对内核的影响相当小,为各种系统配置中更动态、更灵活的RAM利用提供了巨大的
灵活性:

在单一内核的情况下,又称“zcache”,页面被压缩并存储在本地内存中,从而增加了可以安
全保存在RAM中的匿名页面总数。Zcache本质上是用压缩/解压缩的CPU周期换取更好的内存利
用率。Benchmarks测试显示,当内存压力较低时,几乎没有影响,而在高内存压力下的一些
工作负载上,则有明显的性能改善(25%以上)。

“RAMster” 在zcache的基础上增加了对集群系统的 “peer-to-peer” transcendent memory
的支持。Frontswap页面像zcache一样被本地压缩,但随后被“remotified” 到另一个系
统的RAM。这使得RAM可以根据需要动态地来回负载平衡,也就是说,当系统A超载时,它可以
交换到系统B,反之亦然。RAMster也可以被配置成一个内存服务器,因此集群中的许多服务器
可以根据需要动态地交换到配置有大量内存的单一服务器上......而不需要预先配置每个客户
有多少内存可用

在虚拟情况下,虚拟化的全部意义在于统计地将物理资源在多个虚拟机的不同需求之间进行复
用。对于RAM来说,这真的很难做到,而且在不改变内核的情况下,要做好这一点的努力基本上
是失败的(除了一些广为人知的特殊情况下的工作负载)。具体来说,Xen Transcendent Memory
后端允许管理器拥有的RAM “fallow”,不仅可以在多个虚拟机之间进行“time-shared”,
而且页面可以被压缩和重复利用,以优化RAM的利用率。当客户操作系统被诱导交出未充分利用
的RAM时(如 “selfballooning”),突然出现的意外内存压力可能会导致交换;frontswap
允许这些页面被交换到管理器RAM中或从管理器RAM中交换(如果整体主机系统内存条件允许),
从而减轻计划外交换可能带来的可怕的性能影响。

一个KVM的实现正在进行中,并且已经被RFC'ed到lkml。而且,利用frontswap,对NVM作为
内存扩展技术的调查也在进行中。

* 当然,在某些情况下可能有性能上的优势,但frontswap的空间/时间开销是多少?

如果 CONFIG_FRONTSWAP 被禁用,每个 frontswap 钩子都会编译成空,唯一的开销是每
个 swapon'ed swap 设备的几个额外字节。如果 CONFIG_FRONTSWAP 被启用,但没有
frontswap的 “backend” 寄存器,每读或写一个交换页就会有一个额外的全局变量,而不
是零。如果 CONFIG_FRONTSWAP 被启用,并且有一个frontswap的backend寄存器,并且
后端每次 “store” 请求都失败(即尽管声称可能,但没有提供内存),CPU 的开销仍然可以
忽略不计 - 因为每次frontswap失败都是在交换页写到磁盘之前,系统很可能是 I/O 绑定
的,无论如何使用一小部分的 CPU 都是不相关的。

至于空间,如果CONFIG_FRONTSWAP被启用,并且有一个frontswap的backend注册,那么
每个交换设备的每个交换页都会被分配一个比特。这是在内核已经为每个交换设备的每个交换
页分配的8位(在2.6.34之前是16位)上增加的。(Hugh Dickins观察到,frontswap可能
会偷取现有的8个比特,但是我们以后再来担心这个小的优化问题)。对于标准的4K页面大小的
非常大的交换盘(这很罕见),这是每32GB交换盘1MB开销。

当交换页存储在transcendent memory中而不是写到磁盘上时,有一个副作用,即这可能会
产生更多的内存压力,有可能超过其他的优点。一个backend,比如zcache,必须实现策略
来仔细(但动态地)管理内存限制,以确保这种情况不会发生。

* 好吧,那就用内核骇客能理解的术语来快速概述一下这个frontswap补丁的作用如何?

我们假设在内核初始化过程中,一个frontswap 的 “backend” 已经注册了;这个注册表
明这个frontswap 的 “backend” 可以访问一些不被内核直接访问的“内存”。它到底提
供了多少内存是完全动态和随机的。

每当一个交换设备被交换时,就会调用frontswap_init(),把交换设备的编号(又称“类
型”)作为一个参数传给它。这就通知了frontswap,以期待 “store” 与该号码相关的交
换页的尝试。

每当交换子系统准备将一个页面写入交换设备时(参见swap_writepage()),就会调用
frontswap_store。Frontswap与frontswap backend协商,如果backend说它没有空
间,frontswap_store返回-1,内核就会照常把页换到交换设备上。注意,来自frontswap
backend的响应对内核来说是不可预测的;它可能选择从不接受一个页面,可能接受每九个
页面,也可能接受每一个页面。但是如果backend确实接受了一个页面,那么这个页面的数
据已经被复制并与类型和偏移量相关联了,而且backend保证了数据的持久性。在这种情况
下,frontswap在交换设备的“frontswap_map” 中设置了一个位,对应于交换设备上的
页面偏移量,否则它就会将数据写入该设备。

当交换子系统需要交换一个页面时(swap_readpage()),它首先调用frontswap_load(),
检查frontswap_map,看这个页面是否早先被frontswap backend接受。如果是,该页
的数据就会从frontswap后端填充,换入就完成了。如果不是,正常的交换代码将被执行,
以便从真正的交换设备上获得这一页的数据。

所以每次frontswap backend接受一个页面时,交换设备的读取和(可能)交换设备的写
入都被 “frontswap backend store” 和(可能)“frontswap backend loads”
所取代,这可能会快得多。

* frontswap不能被配置为一个 “特殊的” 交换设备,它的优先级要高于任何真正的交换
  设备(例如像zswap,或者可能是swap-over-nbd/NFS)?

首先,现有的交换子系统不允许有任何种类的交换层次结构。也许它可以被重写以适应层次
结构,但这将需要相当大的改变。即使它被重写,现有的交换子系统也使用了块I/O层,它
假定交换设备是固定大小的,其中的任何页面都是可线性寻址的。Frontswap几乎没有触
及现有的交换子系统,而是围绕着块I/O子系统的限制,提供了大量的灵活性和动态性。

例如,frontswap backend对任何交换页的接受是完全不可预测的。这对frontswap backend
的定义至关重要,因为它赋予了backend完全动态的决定权。在zcache中,人们无法预
先知道一个页面的可压缩性如何。可压缩性 “差” 的页面会被拒绝,而 “差” 本身也可
以根据当前的内存限制动态地定义。

此外,frontswap是完全同步的,而真正的交换设备,根据定义,是异步的,并且使用
块I/O。块I/O层不仅是不必要的,而且可能进行 “优化”,这对面向RAM的设备来说是
不合适的,包括将一些页面的写入延迟相当长的时间。同步是必须的,以确保后端的动
态性,并避免棘手的竞争条件,这将不必要地大大增加frontswap和/或块I/O子系统的
复杂性。也就是说,只有最初的 “store” 和 “load” 操作是需要同步的。一个独立
的异步线程可以自由地操作由frontswap存储的页面。例如,RAMster中的 “remotification”
线程使用标准的异步内核套接字,将压缩的frontswap页面移动到远程机器。同样,
KVM的客户方实现可以进行客户内压缩,并使用 “batched” hypercalls。

在虚拟化环境中,动态性允许管理程序(或主机操作系统)做“intelligent overcommit”。
例如,它可以选择只接受页面,直到主机交换可能即将发生,然后强迫客户机做他们
自己的交换。

transcendent memory规格的frontswap有一个坏处。因为任何 “store” 都可
能失败,所以必须在一个真正的交换设备上有一个真正的插槽来交换页面。因此,
frontswap必须作为每个交换设备的 “影子” 来实现,它有可能容纳交换设备可能
容纳的每一个页面,也有可能根本不容纳任何页面。这意味着frontswap不能包含比
swap设备总数更多的页面。例如,如果在某些安装上没有配置交换设备,frontswap
就没有用。无交换设备的便携式设备仍然可以使用frontswap,但是这种设备的
backend必须配置某种 “ghost” 交换设备,并确保它永远不会被使用。


* 为什么会有这种关于 “重复存储” 的奇怪定义?如果一个页面以前被成功地存储过,
  难道它不能总是被成功地覆盖吗?

几乎总是可以的,不,有时不能。考虑一个例子,数据被压缩了,原来的4K页面被压
缩到了1K。现在,有人试图用不可压缩的数据覆盖该页,因此会占用整个4K。但是
backend没有更多的空间了。在这种情况下,这个存储必须被拒绝。每当frontswap
拒绝一个会覆盖的存储时,它也必须使旧的数据作废,并确保它不再被访问。因为交
换子系统会把新的数据写到读交换设备上,这是确保一致性的正确做法。

* 为什么frontswap补丁会创建新的头文件swapfile.h?

frontswap代码依赖于一些swap子系统内部的数据结构,这些数据结构多年来一直
在静态和全局之间来回移动。这似乎是一个合理的妥协:将它们定义为全局,但在一
个新的包含文件中声明它们,该文件不被包含swap.h的大量源文件所包含。

Dan Magenheimer,最后更新于2012年4月9日

@@ -219,7 +219,7 @@ vma_commit_reservation()之间,预留映射有可能被改变。如果hugetlb_
释放巨页
========

巨页释放是由函数free_huge_page()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
巨页释放是由函数free_huge_folio()执行的。这个函数是hugetlbfs复合页的析构器。因此,它只传
递一个指向页面结构体的指针。当一个巨页被释放时,可能需要进行预留计算。如果该页与包含保
留的子池相关联,或者该页在错误路径上被释放,必须恢复全局预留计数,就会出现这种情况。

@@ -387,7 +387,7 @@ region_count()在解除私有巨页映射时被调用。在私有映射中,预

然而,有几种情况是,在一个巨页被分配后,但在它被实例化之前,就遇到了错误。在这种情况下,
页面分配已经消耗了预留,并进行了适当的子池、预留映射和全局计数调整。如果页面在这个时候被释放
(在实例化和清除PagePrivate之前),那么free_huge_page将增加全局预留计数。然而,预留映射
(在实例化和清除PagePrivate之前),那么free_huge_folio将增加全局预留计数。然而,预留映射
显示保留被消耗了。这种不一致的状态将导致预留的巨页的 “泄漏” 。全局预留计数将比它原本的要高,
并阻止分配一个预先分配的页面。

@@ -42,7 +42,6 @@ Linux内存管理文档
   damon/index
   free_page_reporting
   ksm
   frontswap
   hmm
   hwpoison
   hugetlbfs_reserv

@@ -56,16 +56,16 @@ Hugetlb特定的辅助函数:
架构对分页表锁的支持
====================

没有必要特别启用PTE分页表锁:所有需要的东西都由pgtable_pte_page_ctor()
和pgtable_pte_page_dtor()完成,它们必须在PTE表分配/释放时被调用。
没有必要特别启用PTE分页表锁:所有需要的东西都由pagetable_pte_ctor()
和pagetable_pte_dtor()完成,它们必须在PTE表分配/释放时被调用。

确保架构不使用slab分配器来分配页表:slab使用page->slab_cache来分配其页
面。这个区域与page->ptl共享存储。

PMD分页锁只有在你有两个以上的页表级别时才有意义。

启用PMD分页锁需要在PMD表分配时调用pgtable_pmd_page_ctor(),在释放时调
用pgtable_pmd_page_dtor()。
启用PMD分页锁需要在PMD表分配时调用pagetable_pmd_ctor(),在释放时调
用pagetable_pmd_dtor()。

分配通常发生在pmd_alloc_one()中,释放发生在pmd_free()和pmd_free_tlb()
中,但要确保覆盖所有的PMD表分配/释放路径:即X86_PAE在pgd_alloc()中预先
@@ -73,7 +73,7 @@ PMD分页锁只有在你有两个以上的页表级别时才有意义。

一切就绪后,你可以设置CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK。

注意:pgtable_pte_page_ctor()和pgtable_pmd_page_ctor()可能失败--必
注意:pagetable_pte_ctor()和pagetable_pmd_ctor()可能失败--必
须正确处理。

page->ptl
@@ -90,7 +90,7 @@ page->ptl用于访问分割页表锁,其中'page'是包含该表的页面struc
的指针并动态分配它。这允许在启用DEBUG_SPINLOCK或DEBUG_LOCK_ALLOC的
情况下使用分页锁,但由于间接访问而多花了一个缓存行。

PTE表的spinlock_t分配在pgtable_pte_page_ctor()中,PMD表的spinlock_t
分配在pgtable_pmd_page_ctor()中。
PTE表的spinlock_t分配在pagetable_pte_ctor()中,PMD表的spinlock_t
分配在pagetable_pmd_ctor()中。

请不要直接访问page->ptl -- 使用适当的辅助函数。

@@ -8438,13 +8438,6 @@ F: Documentation/power/freezing-of-tasks.rst
F: include/linux/freezer.h
F: kernel/freezer.c

FRONTSWAP API
M: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
L: linux-kernel@vger.kernel.org
S: Maintained
F: include/linux/frontswap.h
F: mm/frontswap.c

FS-CACHE: LOCAL CACHING FOR NETWORK FILESYSTEMS
M: David Howells <dhowells@redhat.com>
L: linux-cachefs@redhat.com (moderated for non-subscribers)

@@ -14898,7 +14891,6 @@ NETWORKING [TCP]
M: Eric Dumazet <edumazet@google.com>
L: netdev@vger.kernel.org
S: Maintained
F: include/linux/net_mm.h
F: include/linux/tcp.h
F: include/net/tcp.h
F: include/trace/events/tcp.h

@@ -53,9 +53,16 @@ extern void flush_icache_user_page(struct vm_area_struct *vma,
#define flush_icache_user_page flush_icache_user_page
#endif /* CONFIG_SMP */

/* This is used only in __do_fault and do_swap_page. */
#define flush_icache_page(vma, page) \
    flush_icache_user_page((vma), (page), 0, 0)
/*
 * Both implementations of flush_icache_user_page flush the entire
 * address space, so one call, no matter how many pages.
 */
static inline void flush_icache_pages(struct vm_area_struct *vma,
        struct page *page, unsigned int nr)
{
    flush_icache_user_page(vma, page, 0, 0);
}
#define flush_icache_pages flush_icache_pages

#include <asm-generic/cacheflush.h>

@@ -26,7 +26,6 @@ struct vm_area_struct;
 * hook is made available.
 */
#define set_pte(pteptr, pteval) ((*(pteptr)) = (pteval))
#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)

/* PMD_SHIFT determines the size of the area a second-level page table can map */
#define PMD_SHIFT (PAGE_SHIFT + (PAGE_SHIFT-3))
@@ -189,7 +188,8 @@ extern unsigned long __zero_page(void);
 * and a page entry and page directory to the page they refer to.
 */
#define page_to_pa(page) (page_to_pfn(page) << PAGE_SHIFT)
#define pte_pfn(pte) (pte_val(pte) >> 32)
#define PFN_PTE_SHIFT 32
#define pte_pfn(pte) (pte_val(pte) >> PFN_PTE_SHIFT)

#define pte_page(pte) pfn_to_page(pte_pfn(pte))
#define mk_pte(page, pgprot) \
@@ -303,6 +303,12 @@ extern inline void update_mmu_cache(struct vm_area_struct * vma,
{
}

static inline void update_mmu_cache_range(struct vm_fault *vmf,
        struct vm_area_struct *vma, unsigned long address,
        pte_t *ptep, unsigned int nr)
{
}

/*
 * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that
 * are !pte_none() && !pte_present().

@@ -26,6 +26,7 @@ config ARC
    select GENERIC_PENDING_IRQ if SMP
    select GENERIC_SCHED_CLOCK
    select GENERIC_SMP_IDLE_THREAD
    select GENERIC_IOREMAP
    select HAVE_ARCH_KGDB
    select HAVE_ARCH_TRACEHOOK
    select HAVE_ARCH_TRANSPARENT_HUGEPAGE if ARC_MMU_V4

@@ -18,24 +18,18 @@
#include <linux/mm.h>
#include <asm/shmparam.h>

/*
 * Semantically we need this because icache doesn't snoop dcache/dma.
 * However ARC Cache flush requires paddr as well as vaddr, latter not available
 * in the flush_icache_page() API. So we no-op it but do the equivalent work
 * in update_mmu_cache()
 */
#define flush_icache_page(vma, page)

void flush_cache_all(void);

void flush_icache_range(unsigned long kstart, unsigned long kend);
void __sync_icache_dcache(phys_addr_t paddr, unsigned long vaddr, int len);
void __inv_icache_page(phys_addr_t paddr, unsigned long vaddr);
void __flush_dcache_page(phys_addr_t paddr, unsigned long vaddr);
void __inv_icache_pages(phys_addr_t paddr, unsigned long vaddr, unsigned nr);
void __flush_dcache_pages(phys_addr_t paddr, unsigned long vaddr, unsigned nr);

#define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1

void flush_dcache_page(struct page *page);
void flush_dcache_folio(struct folio *folio);
#define flush_dcache_folio flush_dcache_folio

void dma_cache_wback_inv(phys_addr_t start, unsigned long sz);
void dma_cache_inv(phys_addr_t start, unsigned long sz);

@@ -21,8 +21,9 @@
#endif

extern void __iomem *ioremap(phys_addr_t paddr, unsigned long size);
extern void __iomem *ioremap_prot(phys_addr_t paddr, unsigned long size,
                                  unsigned long flags);
#define ioremap ioremap
#define ioremap_prot ioremap_prot
#define iounmap iounmap
static inline void __iomem *ioport_map(unsigned long port, unsigned int nr)
{
    return (void __iomem *)port;
@@ -32,8 +33,6 @@ static inline void ioport_unmap(void __iomem *addr)
{
}

extern void iounmap(const volatile void __iomem *addr);

/*
 * io{read,write}{16,32}be() macros
 */

@@ -100,14 +100,12 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
    return __pte((pte_val(pte) & _PAGE_CHG_MASK) | pgprot_val(newprot));
}

static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t pteval)
{
    set_pte(ptep, pteval);
}
struct vm_fault;
void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
        unsigned long address, pte_t *ptep, unsigned int nr);

void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
        pte_t *ptep);
#define update_mmu_cache(vma, addr, ptep) \
    update_mmu_cache_range(NULL, vma, addr, ptep, 1)

/*
 * Encode/decode swap entries and swap PTEs. Swap PTEs are all PTEs that

@@ -169,6 +169,7 @@
#define pte_ERROR(e) \
    pr_crit("%s:%d: bad pte %08lx.\n", __FILE__, __LINE__, pte_val(e))

#define PFN_PTE_SHIFT PAGE_SHIFT
#define pte_none(x) (!pte_val(x))
#define pte_present(x) (pte_val(x) & _PAGE_PRESENT)
#define pte_clear(mm,addr,ptep) set_pte_at(mm, addr, ptep, __pte(0))

@@ -752,17 +752,17 @@ static inline void arc_slc_enable(void)
 * There's a corollary case, where kernel READs from a userspace mapped page.
 * If the U-mapping is not congruent to K-mapping, former needs flushing.
 */
void flush_dcache_page(struct page *page)
void flush_dcache_folio(struct folio *folio)
{
    struct address_space *mapping;

    if (!cache_is_vipt_aliasing()) {
        clear_bit(PG_dc_clean, &page->flags);
        clear_bit(PG_dc_clean, &folio->flags);
        return;
    }

    /* don't handle anon pages here */
    mapping = page_mapping_file(page);
    mapping = folio_flush_mapping(folio);
    if (!mapping)
        return;

@@ -771,17 +771,27 @@ void flush_dcache_page(struct page *page)
     * Make a note that K-mapping is dirty
     */
    if (!mapping_mapped(mapping)) {
        clear_bit(PG_dc_clean, &page->flags);
    } else if (page_mapcount(page)) {
        clear_bit(PG_dc_clean, &folio->flags);
    } else if (folio_mapped(folio)) {
        /* kernel reading from page with U-mapping */
        phys_addr_t paddr = (unsigned long)page_address(page);
        unsigned long vaddr = page->index << PAGE_SHIFT;
        phys_addr_t paddr = (unsigned long)folio_address(folio);
        unsigned long vaddr = folio_pos(folio);

        /*
         * vaddr is not actually the virtual address, but is
         * congruent to every user mapping.
         */
        if (addr_not_cache_congruent(paddr, vaddr))
            __flush_dcache_page(paddr, vaddr);
            __flush_dcache_pages(paddr, vaddr,
                        folio_nr_pages(folio));
    }
}
EXPORT_SYMBOL(flush_dcache_folio);

void flush_dcache_page(struct page *page)
{
    return flush_dcache_folio(page_folio(page));
}
EXPORT_SYMBOL(flush_dcache_page);

/*
@@ -921,18 +931,18 @@ void __sync_icache_dcache(phys_addr_t paddr, unsigned long vaddr, int len)
}

/* wrapper to compile time eliminate alignment checks in flush loop */
void __inv_icache_page(phys_addr_t paddr, unsigned long vaddr)
void __inv_icache_pages(phys_addr_t paddr, unsigned long vaddr, unsigned nr)
{
    __ic_line_inv_vaddr(paddr, vaddr, PAGE_SIZE);
    __ic_line_inv_vaddr(paddr, vaddr, nr * PAGE_SIZE);
}

/*
 * wrapper to clearout kernel or userspace mappings of a page
 * For kernel mappings @vaddr == @paddr
 */
void __flush_dcache_page(phys_addr_t paddr, unsigned long vaddr)
void __flush_dcache_pages(phys_addr_t paddr, unsigned long vaddr, unsigned nr)
{
    __dc_line_op(paddr, vaddr & PAGE_MASK, PAGE_SIZE, OP_FLUSH_N_INV);
    __dc_line_op(paddr, vaddr & PAGE_MASK, nr * PAGE_SIZE, OP_FLUSH_N_INV);
}

noinline void flush_cache_all(void)
@@ -962,10 +972,10 @@ void flush_cache_page(struct vm_area_struct *vma, unsigned long u_vaddr,

    u_vaddr &= PAGE_MASK;

    __flush_dcache_page(paddr, u_vaddr);
    __flush_dcache_pages(paddr, u_vaddr, 1);

    if (vma->vm_flags & VM_EXEC)
        __inv_icache_page(paddr, u_vaddr);
        __inv_icache_pages(paddr, u_vaddr, 1);
}

void flush_cache_range(struct vm_area_struct *vma, unsigned long start,
@@ -978,9 +988,9 @@ void flush_anon_page(struct vm_area_struct *vma, struct page *page,
             unsigned long u_vaddr)
{
    /* TBD: do we really need to clear the kernel mapping */
    __flush_dcache_page((phys_addr_t)page_address(page), u_vaddr);
    __flush_dcache_page((phys_addr_t)page_address(page),
                (phys_addr_t)page_address(page));
    __flush_dcache_pages((phys_addr_t)page_address(page), u_vaddr, 1);
    __flush_dcache_pages((phys_addr_t)page_address(page),
                (phys_addr_t)page_address(page), 1);

}

@@ -989,6 +999,8 @@ void flush_anon_page(struct vm_area_struct *vma, struct page *page,
void copy_user_highpage(struct page *to, struct page *from,
    unsigned long u_vaddr, struct vm_area_struct *vma)
{
    struct folio *src = page_folio(from);
    struct folio *dst = page_folio(to);
    void *kfrom = kmap_atomic(from);
    void *kto = kmap_atomic(to);
    int clean_src_k_mappings = 0;
@@ -1005,7 +1017,7 @@ void copy_user_highpage(struct page *to, struct page *from,
     * addr_not_cache_congruent() is 0
     */
    if (page_mapcount(from) && addr_not_cache_congruent(kfrom, u_vaddr)) {
        __flush_dcache_page((unsigned long)kfrom, u_vaddr);
        __flush_dcache_pages((unsigned long)kfrom, u_vaddr, 1);
        clean_src_k_mappings = 1;
    }

@@ -1019,17 +1031,17 @@ void copy_user_highpage(struct page *to, struct page *from,
     * non copied user pages (e.g. read faults which wire in pagecache page
     * directly).
     */
    clear_bit(PG_dc_clean, &to->flags);
    clear_bit(PG_dc_clean, &dst->flags);

    /*
     * if SRC was already usermapped and non-congruent to kernel mapping
     * sync the kernel mapping back to physical page
     */
    if (clean_src_k_mappings) {
        __flush_dcache_page((unsigned long)kfrom, (unsigned long)kfrom);
        set_bit(PG_dc_clean, &from->flags);
        __flush_dcache_pages((unsigned long)kfrom,
                    (unsigned long)kfrom, 1);
    } else {
        clear_bit(PG_dc_clean, &from->flags);
        clear_bit(PG_dc_clean, &src->flags);
    }

    kunmap_atomic(kto);
@@ -1038,8 +1050,9 @@ void copy_user_highpage(struct page *to, struct page *from,

void clear_user_page(void *to, unsigned long u_vaddr, struct page *page)
{
    struct folio *folio = page_folio(page);
    clear_page(to);
    clear_bit(PG_dc_clean, &page->flags);
    clear_bit(PG_dc_clean, &folio->flags);
}
EXPORT_SYMBOL(clear_user_page);

@@ -8,7 +8,6 @@
#include <linux/module.h>
#include <linux/io.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/cache.h>

static inline bool arc_uncached_addr_space(phys_addr_t paddr)
@@ -25,13 +24,6 @@ static inline bool arc_uncached_addr_space(phys_addr_t paddr)

void __iomem *ioremap(phys_addr_t paddr, unsigned long size)
{
    phys_addr_t end;

    /* Don't allow wraparound or zero size */
    end = paddr + size - 1;
    if (!size || (end < paddr))
        return NULL;

    /*
     * If the region is h/w uncached, MMU mapping can be elided as optim
     * The cast to u32 is fine as this region can only be inside 4GB
@@ -51,55 +43,22 @@ EXPORT_SYMBOL(ioremap);
 * ARC hardware uncached region, this one still goes thru the MMU as caller
 * might need finer access control (R/W/X)
 */
void __iomem *ioremap_prot(phys_addr_t paddr, unsigned long size,
void __iomem *ioremap_prot(phys_addr_t paddr, size_t size,
                           unsigned long flags)
{
    unsigned int off;
    unsigned long vaddr;
    struct vm_struct *area;
    phys_addr_t end;
    pgprot_t prot = __pgprot(flags);

    /* Don't allow wraparound, zero size */
    end = paddr + size - 1;
    if ((!size) || (end < paddr))
        return NULL;

    /* An early platform driver might end up here */
    if (!slab_is_available())
        return NULL;

    /* force uncached */
    prot = pgprot_noncached(prot);

    /* Mappings have to be page-aligned */
    off = paddr & ~PAGE_MASK;
    paddr &= PAGE_MASK_PHYS;
    size = PAGE_ALIGN(end + 1) - paddr;

    /*
     * Ok, go for it..
     */
    area = get_vm_area(size, VM_IOREMAP);
    if (!area)
        return NULL;
    area->phys_addr = paddr;
    vaddr = (unsigned long)area->addr;
    if (ioremap_page_range(vaddr, vaddr + size, paddr, prot)) {
        vunmap((void __force *)vaddr);
        return NULL;
    }
    return (void __iomem *)(off + (char __iomem *)vaddr);
    return generic_ioremap_prot(paddr, size, pgprot_noncached(prot));
}
EXPORT_SYMBOL(ioremap_prot);


void iounmap(const volatile void __iomem *addr)
void iounmap(volatile void __iomem *addr)
{
    /* weird double cast to handle phys_addr_t > 32 bits */
    if (arc_uncached_addr_space((phys_addr_t)(u32)addr))
        return;

    vfree((void *)(PAGE_MASK & (unsigned long __force)addr));
    generic_iounmap(addr);
}
EXPORT_SYMBOL(iounmap);
|
||||
|
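The wraparound/zero-size check deleted above is now done centrally by generic_ioremap_prot(). A minimal userspace sketch of that check (the helper name `io_range_valid` is illustrative, not a kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the "Don't allow wraparound or zero size" test the old ARC
 * ioremap()/ioremap_prot() open-coded: compute the inclusive end address
 * and reject empty ranges or ranges whose end wraps past the start. */
static bool io_range_valid(uint64_t paddr, uint64_t size)
{
	uint64_t end = paddr + size - 1;	/* wraps on overflow */

	return size != 0 && end >= paddr;
}
```

If the requested span overflows the physical address space, `end` wraps below `paddr` and the mapping is refused.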
@@ -467,8 +467,8 @@ void create_tlb(struct vm_area_struct *vma, unsigned long vaddr, pte_t *ptep)
 * Note that flush (when done) involves both WBACK - so physical page is
 * in sync as well as INV - so any non-congruent aliases don't remain
 */
-void update_mmu_cache(struct vm_area_struct *vma, unsigned long vaddr_unaligned,
-		      pte_t *ptep)
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long vaddr_unaligned, pte_t *ptep, unsigned int nr)
 {
	unsigned long vaddr = vaddr_unaligned & PAGE_MASK;
	phys_addr_t paddr = pte_val(*ptep) & PAGE_MASK_PHYS;
@@ -491,15 +491,19 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long vaddr_unaligned,
	 */
	if ((vma->vm_flags & VM_EXEC) ||
	    addr_not_cache_congruent(paddr, vaddr)) {

-		int dirty = !test_and_set_bit(PG_dc_clean, &page->flags);
+		struct folio *folio = page_folio(page);
+		int dirty = !test_and_set_bit(PG_dc_clean, &folio->flags);
		if (dirty) {
+			unsigned long offset = offset_in_folio(folio, paddr);
+			nr = folio_nr_pages(folio);
+			paddr -= offset;
+			vaddr -= offset;
			/* wback + inv dcache lines (K-mapping) */
-			__flush_dcache_page(paddr, paddr);
+			__flush_dcache_pages(paddr, paddr, nr);

			/* invalidate any existing icache lines (U-mapping) */
			if (vma->vm_flags & VM_EXEC)
-				__inv_icache_page(paddr, vaddr);
+				__inv_icache_pages(paddr, vaddr, nr);
		}
	}
 }
@@ -531,7 +535,7 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
			  pmd_t *pmd)
 {
	pte_t pte = __pte(pmd_val(*pmd));
-	update_mmu_cache(vma, addr, &pte);
+	update_mmu_cache_range(NULL, vma, addr, &pte, HPAGE_PMD_NR);
 }

 void local_flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,

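The hunk above backs `paddr` and `vaddr` up by the address's offset within its folio so the whole folio is flushed in one call. A userspace sketch of that rounding (types and names are illustrative only):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Hypothetical stand-in for struct folio: a base physical address plus a
 * page count. */
struct fake_folio {
	unsigned long base;
	unsigned int nr_pages;
};

/* Model of the adjustment in update_mmu_cache_range(): subtract
 * offset_in_folio() so the flush starts at the folio boundary. */
static unsigned long round_to_folio_start(const struct fake_folio *folio,
					  unsigned long paddr)
{
	unsigned long offset = paddr - folio->base;

	return paddr - offset;
}
```

After the adjustment, `nr` is taken from `folio_nr_pages()`, so `__flush_dcache_pages()` covers exactly the folio span.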
@@ -231,14 +231,15 @@ vivt_flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned
			vma->vm_flags);
 }

-static inline void
-vivt_flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsigned long pfn)
+static inline void vivt_flush_cache_pages(struct vm_area_struct *vma,
+		unsigned long user_addr, unsigned long pfn, unsigned int nr)
 {
	struct mm_struct *mm = vma->vm_mm;

	if (!mm || cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm))) {
		unsigned long addr = user_addr & PAGE_MASK;
-		__cpuc_flush_user_range(addr, addr + PAGE_SIZE, vma->vm_flags);
+		__cpuc_flush_user_range(addr, addr + nr * PAGE_SIZE,
+				vma->vm_flags);
	}
 }

@@ -247,15 +248,17 @@ vivt_flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsig
		vivt_flush_cache_mm(mm)
 #define flush_cache_range(vma,start,end) \
		vivt_flush_cache_range(vma,start,end)
-#define flush_cache_page(vma,addr,pfn) \
-		vivt_flush_cache_page(vma,addr,pfn)
+#define flush_cache_pages(vma, addr, pfn, nr) \
+		vivt_flush_cache_pages(vma, addr, pfn, nr)
 #else
-extern void flush_cache_mm(struct mm_struct *mm);
-extern void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end);
-extern void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsigned long pfn);
+void flush_cache_mm(struct mm_struct *mm);
+void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+void flush_cache_pages(struct vm_area_struct *vma, unsigned long user_addr,
+		unsigned long pfn, unsigned int nr);
 #endif

 #define flush_cache_dup_mm(mm) flush_cache_mm(mm)
+#define flush_cache_page(vma, addr, pfn) flush_cache_pages(vma, addr, pfn, 1)

 /*
 * flush_icache_user_range is used when we want to ensure that the
@@ -289,7 +292,9 @@ extern void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr
 * See update_mmu_cache for the user space part.
 */
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
-extern void flush_dcache_page(struct page *);
+void flush_dcache_page(struct page *);
+void flush_dcache_folio(struct folio *folio);
+#define flush_dcache_folio flush_dcache_folio

 #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1
 static inline void flush_kernel_vmap_range(void *addr, int size)
@@ -316,12 +321,6 @@ static inline void flush_anon_page(struct vm_area_struct *vma,
 #define flush_dcache_mmap_lock(mapping)		xa_lock_irq(&mapping->i_pages)
 #define flush_dcache_mmap_unlock(mapping)	xa_unlock_irq(&mapping->i_pages)

-/*
- * We don't appear to need to do anything here.  In fact, if we did, we'd
- * duplicate cache flushing elsewhere performed by flush_dcache_page().
- */
-#define flush_icache_page(vma,page)	do { } while (0)
-
 /*
 * flush_cache_vmap() is used when creating mappings (eg, via vmap,
 * vmalloc, ioremap etc) in kernel space for pages.  On non-VIPT

@@ -10,6 +10,7 @@
 #ifndef _ASM_ARM_HUGETLB_H
 #define _ASM_ARM_HUGETLB_H

+#include <asm/cacheflush.h>
 #include <asm/page.h>
 #include <asm/hugetlb-3level.h>
 #include <asm-generic/hugetlb.h>

@@ -207,8 +207,9 @@ static inline void __sync_icache_dcache(pte_t pteval)
 extern void __sync_icache_dcache(pte_t pteval);
 #endif

-void set_pte_at(struct mm_struct *mm, unsigned long addr,
-		      pte_t *ptep, pte_t pteval);
+void set_ptes(struct mm_struct *mm, unsigned long addr,
+		      pte_t *ptep, pte_t pteval, unsigned int nr);
+#define set_ptes set_ptes

 static inline pte_t clear_pte_bit(pte_t pte, pgprot_t prot)
 {

@@ -39,7 +39,9 @@ static inline void __tlb_remove_table(void *_table)
 static inline void
 __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr)
 {
-	pgtable_pte_page_dtor(pte);
+	struct ptdesc *ptdesc = page_ptdesc(pte);
+
+	pagetable_pte_dtor(ptdesc);

 #ifndef CONFIG_ARM_LPAE
	/*
@@ -50,17 +52,17 @@ __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr)
	__tlb_adjust_range(tlb, addr - PAGE_SIZE, 2 * PAGE_SIZE);
 #endif

-	tlb_remove_table(tlb, pte);
+	tlb_remove_ptdesc(tlb, ptdesc);
 }

 static inline void
 __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr)
 {
 #ifdef CONFIG_ARM_LPAE
-	struct page *page = virt_to_page(pmdp);
+	struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);

-	pgtable_pmd_page_dtor(page);
-	tlb_remove_table(tlb, page);
+	pagetable_pmd_dtor(ptdesc);
+	tlb_remove_ptdesc(tlb, ptdesc);
 #endif
 }

@@ -619,18 +619,22 @@ extern void flush_bp_all(void);
 * If PG_dcache_clean is not set for the page, we need to ensure that any
 * cache entries for the kernels virtual memory range are written
 * back to the page. On ARMv6 and later, the cache coherency is handled via
- * the set_pte_at() function.
+ * the set_ptes() function.
 */
 #if __LINUX_ARM_ARCH__ < 6
-extern void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr,
-	pte_t *ptep);
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr);
 #else
-static inline void update_mmu_cache(struct vm_area_struct *vma,
-				    unsigned long addr, pte_t *ptep)
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
+		unsigned int nr)
 {
 }
 #endif

+#define update_mmu_cache(vma, addr, ptep) \
+	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
+
 #define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)

 #endif

@@ -64,10 +64,11 @@ static void mc_copy_user_page(void *from, void *to)
 void v4_mc_copy_user_highpage(struct page *to, struct page *from,
	unsigned long vaddr, struct vm_area_struct *vma)
 {
+	struct folio *src = page_folio(from);
	void *kto = kmap_atomic(to);

-	if (!test_and_set_bit(PG_dcache_clean, &from->flags))
-		__flush_dcache_page(page_mapping_file(from), from);
+	if (!test_and_set_bit(PG_dcache_clean, &src->flags))
+		__flush_dcache_folio(folio_flush_mapping(src), src);

	raw_spin_lock(&minicache_lock);

@@ -69,11 +69,12 @@ static void discard_old_kernel_data(void *kto)
 static void v6_copy_user_highpage_aliasing(struct page *to,
	struct page *from, unsigned long vaddr, struct vm_area_struct *vma)
 {
+	struct folio *src = page_folio(from);
	unsigned int offset = CACHE_COLOUR(vaddr);
	unsigned long kfrom, kto;

-	if (!test_and_set_bit(PG_dcache_clean, &from->flags))
-		__flush_dcache_page(page_mapping_file(from), from);
+	if (!test_and_set_bit(PG_dcache_clean, &src->flags))
+		__flush_dcache_folio(folio_flush_mapping(src), src);

	/* FIXME: not highmem safe */
	discard_old_kernel_data(page_address(to));

@@ -84,10 +84,11 @@ static void mc_copy_user_page(void *from, void *to)
 void xscale_mc_copy_user_highpage(struct page *to, struct page *from,
	unsigned long vaddr, struct vm_area_struct *vma)
 {
+	struct folio *src = page_folio(from);
	void *kto = kmap_atomic(to);

-	if (!test_and_set_bit(PG_dcache_clean, &from->flags))
-		__flush_dcache_page(page_mapping_file(from), from);
+	if (!test_and_set_bit(PG_dcache_clean, &src->flags))
+		__flush_dcache_folio(folio_flush_mapping(src), src);

	raw_spin_lock(&minicache_lock);

@@ -709,19 +709,21 @@ static void __dma_page_dev_to_cpu(struct page *page, unsigned long off,
	 * Mark the D-cache clean for these pages to avoid extra flushing.
	 */
	if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
-		unsigned long pfn;
-		size_t left = size;
+		struct folio *folio = pfn_folio(paddr / PAGE_SIZE);
+		size_t offset = offset_in_folio(folio, paddr);

-		pfn = page_to_pfn(page) + off / PAGE_SIZE;
-		off %= PAGE_SIZE;
-		if (off) {
-			pfn++;
-			left -= PAGE_SIZE - off;
-		}
-		while (left >= PAGE_SIZE) {
-			page = pfn_to_page(pfn++);
-			set_bit(PG_dcache_clean, &page->flags);
-			left -= PAGE_SIZE;
+		for (;;) {
+			size_t sz = folio_size(folio) - offset;
+
+			if (size < sz)
+				break;
+			if (!offset)
+				set_bit(PG_dcache_clean, &folio->flags);
+			offset = 0;
+			size -= sz;
+			if (!size)
+				break;
+			folio = folio_next(folio);
		}
	}
 }

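The replacement loop above only marks a folio clean when the DMA range covers it entirely (the `!offset` test skips a partially covered head, and the `size < sz` break skips a partial tail). A userspace model of that coverage walk, with hypothetical folio sizes in place of the real folio chain:

```c
#include <assert.h>

/* Walk a byte range [offset, offset + size) across an array of folio sizes
 * and count the folios that are fully covered -- the ones the hunk above
 * would mark PG_dcache_clean. Layout and names are illustrative only. */
static unsigned int count_fully_covered(const unsigned long *folio_sizes,
					unsigned int nfolios,
					unsigned long offset,
					unsigned long size)
{
	unsigned int covered = 0, i = 0;

	for (;;) {
		unsigned long sz = folio_sizes[i] - offset;

		if (size < sz)		/* tail folio only partly covered */
			break;
		if (!offset)		/* head offset 0 => whole folio */
			covered++;
		offset = 0;
		size -= sz;
		if (!size || ++i == nfolios)
			break;
	}
	return covered;
}
```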
@@ -117,11 +117,10 @@ static int adjust_pte(struct vm_area_struct *vma, unsigned long address,
	 * must use the nested version.  This also means we need to
	 * open-code the spin-locking.
	 */
-	pte = pte_offset_map(pmd, address);
+	pte = pte_offset_map_nolock(vma->vm_mm, pmd, address, &ptl);
	if (!pte)
		return 0;

-	ptl = pte_lockptr(vma->vm_mm, pmd);
	do_pte_lock(ptl);

	ret = do_adjust_pte(vma, address, pfn, pte);
@@ -181,12 +180,12 @@ make_coherent(struct address_space *mapping, struct vm_area_struct *vma,
 *
 * Note that the pte lock will be held.
 */
-void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr,
-	pte_t *ptep)
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr)
 {
	unsigned long pfn = pte_pfn(*ptep);
	struct address_space *mapping;
-	struct page *page;
+	struct folio *folio;

	if (!pfn_valid(pfn))
		return;
@@ -195,13 +194,13 @@ void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr,
	 * The zero page is never written to, so never has any dirty
	 * cache lines, and therefore never needs to be flushed.
	 */
-	page = pfn_to_page(pfn);
-	if (page == ZERO_PAGE(0))
+	if (is_zero_pfn(pfn))
		return;

-	mapping = page_mapping_file(page);
-	if (!test_and_set_bit(PG_dcache_clean, &page->flags))
-		__flush_dcache_page(mapping, page);
+	folio = page_folio(pfn_to_page(pfn));
+	mapping = folio_flush_mapping(folio);
+	if (!test_and_set_bit(PG_dcache_clean, &folio->flags))
+		__flush_dcache_folio(mapping, folio);
	if (mapping) {
		if (cache_is_vivt())
			make_coherent(mapping, vma, addr, ptep, pfn);

@@ -95,10 +95,10 @@ void flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned
	__flush_icache_all();
 }

-void flush_cache_page(struct vm_area_struct *vma, unsigned long user_addr, unsigned long pfn)
+void flush_cache_pages(struct vm_area_struct *vma, unsigned long user_addr, unsigned long pfn, unsigned int nr)
 {
	if (cache_is_vivt()) {
-		vivt_flush_cache_page(vma, user_addr, pfn);
+		vivt_flush_cache_pages(vma, user_addr, pfn, nr);
		return;
	}

@@ -196,29 +196,31 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
 #endif
 }

-void __flush_dcache_page(struct address_space *mapping, struct page *page)
+void __flush_dcache_folio(struct address_space *mapping, struct folio *folio)
 {
	/*
	 * Writeback any data associated with the kernel mapping of this
	 * page.  This ensures that data in the physical page is mutually
	 * coherent with the kernels mapping.
	 */
-	if (!PageHighMem(page)) {
-		__cpuc_flush_dcache_area(page_address(page), page_size(page));
+	if (!folio_test_highmem(folio)) {
+		__cpuc_flush_dcache_area(folio_address(folio),
+					folio_size(folio));
	} else {
		unsigned long i;
		if (cache_is_vipt_nonaliasing()) {
-			for (i = 0; i < compound_nr(page); i++) {
-				void *addr = kmap_atomic(page + i);
+			for (i = 0; i < folio_nr_pages(folio); i++) {
+				void *addr = kmap_local_folio(folio,
+								i * PAGE_SIZE);
				__cpuc_flush_dcache_area(addr, PAGE_SIZE);
-				kunmap_atomic(addr);
+				kunmap_local(addr);
			}
		} else {
-			for (i = 0; i < compound_nr(page); i++) {
-				void *addr = kmap_high_get(page + i);
+			for (i = 0; i < folio_nr_pages(folio); i++) {
+				void *addr = kmap_high_get(folio_page(folio, i));
				if (addr) {
					__cpuc_flush_dcache_area(addr, PAGE_SIZE);
-					kunmap_high(page + i);
+					kunmap_high(folio_page(folio, i));
				}
			}
		}
@@ -230,15 +232,14 @@ void __flush_dcache_page(struct address_space *mapping, struct page *page)
	 * userspace colour, which is congruent with page->index.
	 */
	if (mapping && cache_is_vipt_aliasing())
-		flush_pfn_alias(page_to_pfn(page),
-				page->index << PAGE_SHIFT);
+		flush_pfn_alias(folio_pfn(folio), folio_pos(folio));
 }

-static void __flush_dcache_aliases(struct address_space *mapping, struct page *page)
+static void __flush_dcache_aliases(struct address_space *mapping, struct folio *folio)
 {
	struct mm_struct *mm = current->active_mm;
-	struct vm_area_struct *mpnt;
-	pgoff_t pgoff;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff, pgoff_end;

	/*
	 * There are possible user space mappings of this page:
@@ -246,21 +247,36 @@ static void __flush_dcache_aliases(struct address_space *mapping, struct page *p
	 * data in the current VM view associated with this page.
	 * - aliasing VIPT: we only need to find one mapping of this page.
	 */
-	pgoff = page->index;
+	pgoff = folio->index;
+	pgoff_end = pgoff + folio_nr_pages(folio) - 1;

	flush_dcache_mmap_lock(mapping);
-	vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
-		unsigned long offset;
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff_end) {
+		unsigned long start, offset, pfn;
+		unsigned int nr;

		/*
		 * If this VMA is not in our MM, we can ignore it.
		 */
-		if (mpnt->vm_mm != mm)
+		if (vma->vm_mm != mm)
			continue;
-		if (!(mpnt->vm_flags & VM_MAYSHARE))
+		if (!(vma->vm_flags & VM_MAYSHARE))
			continue;
-		offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
-		flush_cache_page(mpnt, mpnt->vm_start + offset, page_to_pfn(page));
+
+		start = vma->vm_start;
+		pfn = folio_pfn(folio);
+		nr = folio_nr_pages(folio);
+		offset = pgoff - vma->vm_pgoff;
+		if (offset > -nr) {
+			pfn -= offset;
+			nr += offset;
+		} else {
+			start += offset * PAGE_SIZE;
+		}
+		if (start + nr * PAGE_SIZE > vma->vm_end)
+			nr = (vma->vm_end - start) / PAGE_SIZE;
+
+		flush_cache_pages(vma, start, pfn, nr);
	}
	flush_dcache_mmap_unlock(mapping);
 }

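The `offset > -nr` comparison in the hunk above is unsigned-wraparound arithmetic: when the folio starts before the VMA, `pgoff - vma->vm_pgoff` wraps to a huge value that still compares correctly against `-nr`, trimming the leading pages; otherwise the flush start moves forward into the VMA, and the tail is clamped to `vm_end`. A userspace model of that clamping (names and the return convention are illustrative, not kernel API):

```c
#include <assert.h>

/* Given a folio spanning file pages [pgoff, pgoff + folio_nr) and a VMA
 * mapping file pages [vm_pgoff, vm_pgoff + vm_npages), compute how many
 * pages to flush and the first page index within the VMA, using the same
 * unsigned-wrap trick as __flush_dcache_aliases(). */
static unsigned int pages_to_flush(unsigned long pgoff, unsigned int folio_nr,
				   unsigned long vm_pgoff,
				   unsigned long vm_npages,
				   unsigned long *first_vma_page)
{
	unsigned long offset = pgoff - vm_pgoff;  /* may wrap "negative" */
	unsigned long start_page = 0;
	unsigned long nr = folio_nr;

	if (offset > -(unsigned long)folio_nr) {
		/* folio starts before the VMA: drop the leading pages */
		nr += offset;			/* modular add == subtract */
	} else {
		start_page = offset;		/* folio starts inside VMA */
	}
	if (start_page + nr > vm_npages)	/* clamp to the VMA end */
		nr = vm_npages - start_page;

	*first_vma_page = start_page;
	return (unsigned int)nr;
}
```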
@@ -269,7 +285,7 @@ static void __flush_dcache_aliases(struct address_space *mapping, struct page *p
 void __sync_icache_dcache(pte_t pteval)
 {
	unsigned long pfn;
-	struct page *page;
+	struct folio *folio;
	struct address_space *mapping;

	if (cache_is_vipt_nonaliasing() && !pte_exec(pteval))
@@ -279,14 +295,14 @@ void __sync_icache_dcache(pte_t pteval)
	if (!pfn_valid(pfn))
		return;

-	page = pfn_to_page(pfn);
+	folio = page_folio(pfn_to_page(pfn));
	if (cache_is_vipt_aliasing())
-		mapping = page_mapping_file(page);
+		mapping = folio_flush_mapping(folio);
	else
		mapping = NULL;

-	if (!test_and_set_bit(PG_dcache_clean, &page->flags))
-		__flush_dcache_page(mapping, page);
+	if (!test_and_set_bit(PG_dcache_clean, &folio->flags))
+		__flush_dcache_folio(mapping, folio);

	if (pte_exec(pteval))
		__flush_icache_all();
@@ -312,7 +328,7 @@ void __sync_icache_dcache(pte_t pteval)
 * Note that we disable the lazy flush for SMP configurations where
 * the cache maintenance operations are not automatically broadcasted.
 */
-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
 {
	struct address_space *mapping;

@@ -320,31 +336,36 @@ void flush_dcache_page(struct page *page)
	 * The zero page is never written to, so never has any dirty
	 * cache lines, and therefore never needs to be flushed.
	 */
-	if (page == ZERO_PAGE(0))
+	if (is_zero_pfn(folio_pfn(folio)))
		return;

	if (!cache_ops_need_broadcast() && cache_is_vipt_nonaliasing()) {
-		if (test_bit(PG_dcache_clean, &page->flags))
-			clear_bit(PG_dcache_clean, &page->flags);
+		if (test_bit(PG_dcache_clean, &folio->flags))
+			clear_bit(PG_dcache_clean, &folio->flags);
		return;
	}

-	mapping = page_mapping_file(page);
+	mapping = folio_flush_mapping(folio);

	if (!cache_ops_need_broadcast() &&
-	    mapping && !page_mapcount(page))
-		clear_bit(PG_dcache_clean, &page->flags);
+	    mapping && !folio_mapped(folio))
+		clear_bit(PG_dcache_clean, &folio->flags);
	else {
-		__flush_dcache_page(mapping, page);
+		__flush_dcache_folio(mapping, folio);
		if (mapping && cache_is_vivt())
-			__flush_dcache_aliases(mapping, page);
+			__flush_dcache_aliases(mapping, folio);
		else if (mapping)
			__flush_icache_all();
-		set_bit(PG_dcache_clean, &page->flags);
+		set_bit(PG_dcache_clean, &folio->flags);
	}
 }
-EXPORT_SYMBOL(flush_dcache_page);
+EXPORT_SYMBOL(flush_dcache_folio);
+
+void flush_dcache_page(struct page *page)
+{
+	flush_dcache_folio(page_folio(page));
+}
+EXPORT_SYMBOL(flush_dcache_page);

 /*
 * Flush an anonymous page so that users of get_user_pages()
 * can safely access the data. The expected sequence is:

@@ -45,7 +45,7 @@ struct mem_type {

 const struct mem_type *get_mem_type(unsigned int type);

-extern void __flush_dcache_page(struct address_space *mapping, struct page *page);
+void __flush_dcache_folio(struct address_space *mapping, struct folio *folio);

 /*
 * ARM specific vm_struct->flags bits.

@@ -737,11 +737,12 @@ static void __init *early_alloc(unsigned long sz)

 static void *__init late_alloc(unsigned long sz)
 {
-	void *ptr = (void *)__get_free_pages(GFP_PGTABLE_KERNEL, get_order(sz));
+	void *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM,
+			get_order(sz));

-	if (!ptr || !pgtable_pte_page_ctor(virt_to_page(ptr)))
+	if (!ptdesc || !pagetable_pte_ctor(ptdesc))
		BUG();
-	return ptr;
+	return ptdesc_to_virt(ptdesc);
 }

 static pte_t * __init arm_pte_alloc(pmd_t *pmd, unsigned long addr,
@@ -1788,7 +1789,7 @@ void __init paging_init(const struct machine_desc *mdesc)
	bootmem_init();

	empty_zero_page = virt_to_page(zero_page);
-	__flush_dcache_page(NULL, empty_zero_page);
+	__flush_dcache_folio(NULL, page_folio(empty_zero_page));
 }

 void __init early_mm_init(const struct machine_desc *mdesc)
@@ -1797,8 +1798,8 @@ void __init early_mm_init(const struct machine_desc *mdesc)
	early_paging_init(mdesc);
 }

-void set_pte_at(struct mm_struct *mm, unsigned long addr,
-		      pte_t *ptep, pte_t pteval)
+void set_ptes(struct mm_struct *mm, unsigned long addr,
+		      pte_t *ptep, pte_t pteval, unsigned int nr)
 {
	unsigned long ext = 0;

@@ -1808,5 +1809,11 @@ void set_pte_at(struct mm_struct *mm, unsigned long addr,
		ext |= PTE_EXT_NG;
	}

-	set_pte_ext(ptep, pteval, ext);
+	for (;;) {
+		set_pte_ext(ptep, pteval, ext);
+		if (--nr == 0)
+			break;
+		ptep++;
+		pte_val(pteval) += PAGE_SIZE;
+	}
 }

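The set_ptes() loop above writes `nr` consecutive PTEs, bumping the physical address encoded in the PTE value by PAGE_SIZE each step. A userspace sketch of that loop shape, with pte_t modelled as a bare unsigned long (an assumption for illustration only):

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Model of the set_ptes() loop: install nr PTEs for nr consecutive pages,
 * advancing the table slot and the encoded physical address together. */
static void set_ptes_model(unsigned long *ptep, unsigned long pteval,
			   unsigned int nr)
{
	for (;;) {
		*ptep = pteval;		/* stands in for set_pte_ext() */
		if (--nr == 0)
			break;
		ptep++;
		pteval += PAGE_SIZE;
	}
}
```

The same `for (;;)` shape, rather than a `while (nr--)`, mirrors the kernel hunk: the final iteration must not advance past the last slot.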
@@ -180,6 +180,12 @@ void setup_mm_for_reboot(void)
 {
 }

+void flush_dcache_folio(struct folio *folio)
+{
+	__cpuc_flush_dcache_area(folio_address(folio), folio_size(folio));
+}
+EXPORT_SYMBOL(flush_dcache_folio);
+
 void flush_dcache_page(struct page *page)
 {
	__cpuc_flush_dcache_area(page_address(page), PAGE_SIZE);

@@ -25,7 +25,7 @@ static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
	return 0;
 }

-static bool in_range(unsigned long start, unsigned long size,
+static bool range_in_range(unsigned long start, unsigned long size,
	unsigned long range_start, unsigned long range_end)
 {
	return start >= range_start && start < range_end &&
@@ -63,8 +63,8 @@ static int change_memory_common(unsigned long addr, int numpages,
	if (!size)
		return 0;

-	if (!in_range(start, size, MODULES_VADDR, MODULES_END) &&
-	    !in_range(start, size, VMALLOC_START, VMALLOC_END))
+	if (!range_in_range(start, size, MODULES_VADDR, MODULES_END) &&
+	    !range_in_range(start, size, VMALLOC_START, VMALLOC_END))
		return -EINVAL;

	return __change_memory_common(start, size, set_mask, clear_mask);

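The renamed helper above is a pure predicate: the whole span `[start, start + size)` must lie inside `[range_start, range_end)`. The hunk cuts off the second line of the return expression, so the completion below is an assumption based on the visible first line:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the range_in_range() predicate: start must be inside the
 * enclosing range, and the remaining room must hold the whole size.
 * The second condition is reconstructed, not shown in the hunk. */
static bool range_in_range(unsigned long start, unsigned long size,
			   unsigned long range_start, unsigned long range_end)
{
	return start >= range_start && start < range_end &&
	       size <= range_end - start;
}
```

Expressing the tail check as `size <= range_end - start` instead of `start + size <= range_end` avoids overflow when `start + size` would wrap.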
@@ -78,6 +78,7 @@ config ARM64
	select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION
	select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION
	select ARCH_KEEP_MEMBLOCK
+	select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
	select ARCH_USE_CMPXCHG_LOCKREF
	select ARCH_USE_GNU_PROPERTY
	select ARCH_USE_MEMTEST
@@ -96,6 +97,7 @@ config ARM64
	select ARCH_SUPPORTS_NUMA_BALANCING
	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
	select ARCH_SUPPORTS_PER_VMA_LOCK
+	select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
	select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT
	select ARCH_WANT_DEFAULT_BPF_JIT
	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
@@ -348,9 +350,6 @@ config GENERIC_CSUM
 config GENERIC_CALIBRATE_DELAY
	def_bool y

-config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
-	def_bool y
-
 config SMP
	def_bool y

@@ -114,7 +114,7 @@ extern void copy_to_user_page(struct vm_area_struct *, struct page *,
 #define copy_to_user_page copy_to_user_page

 /*
- * flush_dcache_page is used when the kernel has written to the page
+ * flush_dcache_folio is used when the kernel has written to the page
 * cache page at virtual address page->virtual.
 *
 * If this page isn't mapped (ie, page_mapping == NULL), or it might
@@ -127,6 +127,8 @@ extern void copy_to_user_page(struct vm_area_struct *, struct page *,
 */
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 extern void flush_dcache_page(struct page *);
+void flush_dcache_folio(struct folio *);
+#define flush_dcache_folio flush_dcache_folio

 static __always_inline void icache_inval_all_pou(void)
 {

@@ -10,6 +10,7 @@
 #ifndef __ASM_HUGETLB_H
 #define __ASM_HUGETLB_H

+#include <asm/cacheflush.h>
 #include <asm/page.h>

 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
@@ -60,4 +61,19 @@ extern void huge_ptep_modify_prot_commit(struct vm_area_struct *vma,

 #include <asm-generic/hugetlb.h>

+#define __HAVE_ARCH_FLUSH_HUGETLB_TLB_RANGE
+static inline void flush_hugetlb_tlb_range(struct vm_area_struct *vma,
+					   unsigned long start,
+					   unsigned long end)
+{
+	unsigned long stride = huge_page_size(hstate_vma(vma));
+
+	if (stride == PMD_SIZE)
+		__flush_tlb_range(vma, start, end, stride, false, 2);
+	else if (stride == PUD_SIZE)
+		__flush_tlb_range(vma, start, end, stride, false, 1);
+	else
+		__flush_tlb_range(vma, start, end, PAGE_SIZE, false, 0);
+}
+
 #endif /* __ASM_HUGETLB_H */

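The new flush_hugetlb_tlb_range() above picks the TLBI translation level from the hugepage stride: PMD-sized pages invalidate at level 2, PUD-sized at level 1, and anything else (e.g. contiguous-bit hugepages) falls back to level 0 with a PAGE_SIZE stride. A userspace sketch of that selection; the size constants model a 4K-granule arm64 layout and are assumptions for illustration:

```c
#include <assert.h>

#define SKETCH_PMD_SIZE (2UL << 20)	/* 2 MiB, 4K granule assumption */
#define SKETCH_PUD_SIZE (1UL << 30)	/* 1 GiB, 4K granule assumption */

/* Map a hugepage stride to the TLBI level used by the hunk above. */
static int tlbi_level_for_stride(unsigned long stride)
{
	if (stride == SKETCH_PMD_SIZE)
		return 2;
	if (stride == SKETCH_PUD_SIZE)
		return 1;
	return 0;	/* fall back: flush at page granularity */
}
```

Passing the right level lets the hardware skip walking lower table levels during the invalidation.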
@@ -139,8 +139,7 @@ extern void __memset_io(volatile void __iomem *, int, size_t);
 * I/O memory mapping functions.
 */

-bool ioremap_allowed(phys_addr_t phys_addr, size_t size, unsigned long prot);
-#define ioremap_allowed ioremap_allowed
+#define ioremap_prot ioremap_prot

 #define _PAGE_IOREMAP PROT_DEVICE_nGnRE

@@ -90,7 +90,7 @@ static inline bool try_page_mte_tagging(struct page *page)
 }

 void mte_zero_clear_page_tags(void *addr);
-void mte_sync_tags(pte_t old_pte, pte_t pte);
+void mte_sync_tags(pte_t pte);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -122,7 +122,7 @@ static inline bool try_page_mte_tagging(struct page *page)
 static inline void mte_zero_clear_page_tags(void *addr)
 {
 }
-static inline void mte_sync_tags(pte_t old_pte, pte_t pte)
+static inline void mte_sync_tags(pte_t pte)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)

@@ -338,30 +338,29 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
	 * don't expose tags (instruction fetches don't check tags).
	 */
	if (system_supports_mte() && pte_access_permitted(pte, false) &&
-	    !pte_special(pte)) {
-		pte_t old_pte = READ_ONCE(*ptep);
-		/*
-		 * We only need to synchronise if the new PTE has tags enabled
-		 * or if swapping in (in which case another mapping may have
-		 * set tags in the past even if this PTE isn't tagged).
-		 * (!pte_none() && !pte_present()) is an open coded version of
-		 * is_swap_pte()
-		 */
-		if (pte_tagged(pte) || (!pte_none(old_pte) && !pte_present(old_pte)))
-			mte_sync_tags(old_pte, pte);
-	}
+	    !pte_special(pte) && pte_tagged(pte))
+		mte_sync_tags(pte);

	__check_safe_pte_update(mm, ptep, pte);

	set_pte(ptep, pte);
 }

-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
-			      pte_t *ptep, pte_t pte)
+static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
+			    pte_t *ptep, pte_t pte, unsigned int nr)
 {
-	page_table_check_pte_set(mm, addr, ptep, pte);
-	return __set_pte_at(mm, addr, ptep, pte);
+	page_table_check_ptes_set(mm, ptep, pte, nr);
+
+	for (;;) {
+		__set_pte_at(mm, addr, ptep, pte);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+		pte_val(pte) += PAGE_SIZE;
+	}
 }
+#define set_ptes set_ptes

 /*
 * Huge pte definitions.

|
||||
static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
|
||||
pmd_t *pmdp, pmd_t pmd)
|
||||
{
|
||||
page_table_check_pmd_set(mm, addr, pmdp, pmd);
|
||||
page_table_check_pmd_set(mm, pmdp, pmd);
|
||||
return __set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd));
|
||||
}
|
||||
|
||||
static inline void set_pud_at(struct mm_struct *mm, unsigned long addr,
|
||||
pud_t *pudp, pud_t pud)
|
||||
{
|
||||
page_table_check_pud_set(mm, addr, pudp, pud);
|
||||
page_table_check_pud_set(mm, pudp, pud);
|
||||
return __set_pte_at(mm, addr, (pte_t *)pudp, pud_pte(pud));
|
||||
}
|
||||
|
||||
@ -940,7 +939,7 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
|
||||
{
|
||||
pte_t pte = __pte(xchg_relaxed(&pte_val(*ptep), 0));
|
||||
|
||||
page_table_check_pte_clear(mm, address, pte);
|
||||
page_table_check_pte_clear(mm, pte);
|
||||
|
||||
return pte;
|
||||
}
|
||||
@ -952,7 +951,7 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
|
||||
{
|
||||
pmd_t pmd = __pmd(xchg_relaxed(&pmd_val(*pmdp), 0));
|
||||
|
||||
page_table_check_pmd_clear(mm, address, pmd);
|
||||
page_table_check_pmd_clear(mm, pmd);
|
||||
|
||||
return pmd;
|
||||
}
|
||||
@ -988,7 +987,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
|
||||
static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
|
||||
unsigned long address, pmd_t *pmdp, pmd_t pmd)
|
||||
{
|
||||
page_table_check_pmd_set(vma->vm_mm, address, pmdp, pmd);
|
||||
page_table_check_pmd_set(vma->vm_mm, pmdp, pmd);
|
||||
return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd)));
|
||||
}
|
||||
#endif
|
||||
@ -1061,8 +1060,9 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
|
||||
/*
|
||||
* On AArch64, the cache coherency is handled via the set_pte_at() function.
|
||||
*/
|
||||
static inline void update_mmu_cache(struct vm_area_struct *vma,
|
||||
unsigned long addr, pte_t *ptep)
|
||||
static inline void update_mmu_cache_range(struct vm_fault *vmf,
|
||||
struct vm_area_struct *vma, unsigned long addr, pte_t *ptep,
|
||||
unsigned int nr)
|
||||
{
|
||||
/*
|
||||
* We don't do anything here, so there's a very small chance of
|
||||
@ -1071,6 +1071,8 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
|
||||
*/
|
||||
}
|
||||
|
||||
#define update_mmu_cache(vma, addr, ptep) \
|
||||
update_mmu_cache_range(NULL, vma, addr, ptep, 1)
|
||||
#define update_mmu_cache_pmd(vma, address, pmd) do { } while (0)
|
||||
|
||||
#ifdef CONFIG_ARM64_PA_BITS_52
|
||||
|
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -75,18 +75,20 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
				  unsigned long addr)
 {
-	pgtable_pte_page_dtor(pte);
-	tlb_remove_table(tlb, pte);
+	struct ptdesc *ptdesc = page_ptdesc(pte);
+
+	pagetable_pte_dtor(ptdesc);
+	tlb_remove_ptdesc(tlb, ptdesc);
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
				  unsigned long addr)
 {
-	struct page *page = virt_to_page(pmdp);
+	struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);
 
-	pgtable_pmd_page_dtor(page);
-	tlb_remove_table(tlb, page);
+	pagetable_pmd_dtor(ptdesc);
+	tlb_remove_ptdesc(tlb, ptdesc);
 }
 #endif
 
@@ -94,7 +96,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
				  unsigned long addr)
 {
-	tlb_remove_table(tlb, virt_to_page(pudp));
+	tlb_remove_ptdesc(tlb, virt_to_ptdesc(pudp));
 }
 #endif
--- /dev/null
+++ b/arch/arm64/include/asm/tlbbatch.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ARCH_ARM64_TLBBATCH_H
+#define _ARCH_ARM64_TLBBATCH_H
+
+struct arch_tlbflush_unmap_batch {
+	/*
+	 * For arm64, HW can do tlb shootdown, so we don't
+	 * need to record cpumask for sending IPI
+	 */
+};
+
+#endif /* _ARCH_ARM64_TLBBATCH_H */
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -13,6 +13,7 @@
 #include <linux/bitfield.h>
 #include <linux/mm_types.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 #include <asm/cputype.h>
 #include <asm/mmu.h>
 
@@ -252,17 +253,26 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
 	dsb(ish);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
-static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
-					 unsigned long uaddr)
+static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
+					   unsigned long uaddr)
 {
 	unsigned long addr;
 
 	dsb(ishst);
-	addr = __TLBI_VADDR(uaddr, ASID(vma->vm_mm));
+	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
+						(uaddr & PAGE_MASK) + PAGE_SIZE);
+}
+
+static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
+					 unsigned long uaddr)
+{
+	return __flush_tlb_page_nosync(vma->vm_mm, uaddr);
 }
 
 static inline void flush_tlb_page(struct vm_area_struct *vma,
@@ -272,6 +282,53 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
+static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+#ifdef CONFIG_ARM64_WORKAROUND_REPEAT_TLBI
+	/*
+	 * TLB flush deferral is not required on systems which are affected by
+	 * ARM64_WORKAROUND_REPEAT_TLBI, as __tlbi()/__tlbi_user() implementation
+	 * will have two consecutive TLBI instructions with a dsb(ish) in between
+	 * defeating the purpose (i.e save overall 'dsb ish' cost).
+	 */
+	if (unlikely(cpus_have_const_cap(ARM64_WORKAROUND_REPEAT_TLBI)))
+		return false;
+#endif
+	return true;
+}
+
+static inline void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+					     struct mm_struct *mm,
+					     unsigned long uaddr)
+{
+	__flush_tlb_page_nosync(mm, uaddr);
+}
+
+/*
+ * If mprotect/munmap/etc occurs during TLB batched flushing, we need to
+ * synchronise all the TLBI issued with a DSB to avoid the race mentioned in
+ * flush_tlb_batched_pending().
+ */
+static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
+{
+	dsb(ish);
+}
+
+/*
+ * To support TLB batched flush for multiple pages unmapping, we only send
+ * the TLBI for each page in arch_tlbbatch_add_pending() and wait for the
+ * completion at the end in arch_tlbbatch_flush(). Since we've already issued
+ * TLBI for each page so only a DSB is needed to synchronise its effect on the
+ * other CPUs.
+ *
+ * This will save the time waiting on DSB comparing issuing a TLBI;DSB sequence
+ * for each page.
+ */
+static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+	dsb(ish);
+}
+
 /*
  * This is meant to avoid soft lock-ups on large TLB flushing ranges and not
  * necessarily a performance improvement.
@@ -358,6 +415,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 		scale++;
 	}
 	dsb(ish);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -35,41 +35,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
 EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
 #endif
 
-static void mte_sync_page_tags(struct page *page, pte_t old_pte,
-			       bool check_swap, bool pte_is_tagged)
-{
-	if (check_swap && is_swap_pte(old_pte)) {
-		swp_entry_t entry = pte_to_swp_entry(old_pte);
-
-		if (!non_swap_entry(entry))
-			mte_restore_tags(entry, page);
-	}
-
-	if (!pte_is_tagged)
-		return;
-
-	if (try_page_mte_tagging(page)) {
-		mte_clear_page_tags(page_address(page));
-		set_page_mte_tagged(page);
-	}
-}
-
-void mte_sync_tags(pte_t old_pte, pte_t pte)
+void mte_sync_tags(pte_t pte)
 {
 	struct page *page = pte_page(pte);
 	long i, nr_pages = compound_nr(page);
-	bool check_swap = nr_pages == 1;
-	bool pte_is_tagged = pte_tagged(pte);
-
-	/* Early out if there's nothing to do */
-	if (!check_swap && !pte_is_tagged)
-		return;
 
 	/* if PG_mte_tagged is set, tags have already been initialised */
-	for (i = 0; i < nr_pages; i++, page++)
-		if (!page_mte_tagged(page))
-			mte_sync_page_tags(page, old_pte, check_swap,
-					   pte_is_tagged);
+	for (i = 0; i < nr_pages; i++, page++) {
+		if (try_page_mte_tagging(page)) {
+			mte_clear_page_tags(page_address(page));
+			set_page_mte_tagged(page);
+		}
+	}
 
 	/* ensure the tags are visible before the PTE is set */
 	smp_wmb();
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -587,7 +587,6 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
-#ifdef CONFIG_PER_VMA_LOCK
 	if (!(mm_flags & FAULT_FLAG_USER))
 		goto lock_mmap;
 
@@ -600,7 +599,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 		goto lock_mmap;
 	}
 	fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, regs);
-	vma_end_read(vma);
+	if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)))
+		vma_end_read(vma);
 
 	if (!(fault & VM_FAULT_RETRY)) {
 		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
@@ -615,7 +615,6 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr,
 		return 0;
 	}
 lock_mmap:
-#endif /* CONFIG_PER_VMA_LOCK */
 
 retry:
 	vma = lock_mm_and_find_vma(mm, addr, regs);
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -51,20 +51,13 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
 
 void __sync_icache_dcache(pte_t pte)
 {
-	struct page *page = pte_page(pte);
+	struct folio *folio = page_folio(pte_page(pte));
 
-	/*
-	 * HugeTLB pages are always fully mapped, so only setting head page's
-	 * PG_dcache_clean flag is enough.
-	 */
-	if (PageHuge(page))
-		page = compound_head(page);
-
-	if (!test_bit(PG_dcache_clean, &page->flags)) {
-		sync_icache_aliases((unsigned long)page_address(page),
-				    (unsigned long)page_address(page) +
-					    page_size(page));
-		set_bit(PG_dcache_clean, &page->flags);
+	if (!test_bit(PG_dcache_clean, &folio->flags)) {
+		sync_icache_aliases((unsigned long)folio_address(folio),
+				    (unsigned long)folio_address(folio) +
+					    folio_size(folio));
+		set_bit(PG_dcache_clean, &folio->flags);
 	}
 }
 EXPORT_SYMBOL_GPL(__sync_icache_dcache);
@@ -74,17 +67,16 @@ EXPORT_SYMBOL_GPL(__sync_icache_dcache);
  * it as dirty for later flushing when mapped in user space (if executable,
  * see __sync_icache_dcache).
  */
+void flush_dcache_folio(struct folio *folio)
+{
+	if (test_bit(PG_dcache_clean, &folio->flags))
+		clear_bit(PG_dcache_clean, &folio->flags);
+}
+EXPORT_SYMBOL(flush_dcache_folio);
+
 void flush_dcache_page(struct page *page)
 {
-	/*
-	 * HugeTLB pages are always fully mapped and only head page will be
-	 * set PG_dcache_clean (see comments in __sync_icache_dcache()).
-	 */
-	if (PageHuge(page))
-		page = compound_head(page);
-
-	if (test_bit(PG_dcache_clean, &page->flags))
-		clear_bit(PG_dcache_clean, &page->flags);
+	flush_dcache_folio(page_folio(page));
 }
 EXPORT_SYMBOL(flush_dcache_page);
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -236,7 +236,7 @@ static void clear_flush(struct mm_struct *mm,
 	unsigned long i, saddr = addr;
 
 	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
-		pte_clear(mm, addr, ptep);
+		ptep_clear(mm, addr, ptep);
 
 	flush_tlb_range(&vma, saddr, addr);
 }
--- a/arch/arm64/mm/ioremap.c
+++ b/arch/arm64/mm/ioremap.c
@@ -3,20 +3,22 @@
 #include <linux/mm.h>
 #include <linux/io.h>
 
-bool ioremap_allowed(phys_addr_t phys_addr, size_t size, unsigned long prot)
+void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
+			   unsigned long prot)
 {
 	unsigned long last_addr = phys_addr + size - 1;
 
 	/* Don't allow outside PHYS_MASK */
 	if (last_addr & ~PHYS_MASK)
-		return false;
+		return NULL;
 
 	/* Don't allow RAM to be mapped. */
 	if (WARN_ON(pfn_is_map_memory(__phys_to_pfn(phys_addr))))
-		return false;
+		return NULL;
 
-	return true;
+	return generic_ioremap_prot(phys_addr, size, __pgprot(prot));
 }
+EXPORT_SYMBOL(ioremap_prot);
 
 /*
  * Must be called after early_fixmap_init
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -426,6 +426,7 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
 static phys_addr_t pgd_pgtable_alloc(int shift)
 {
 	phys_addr_t pa = __pgd_pgtable_alloc(shift);
+	struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
 
 	/*
 	 * Call proper page table ctor in case later we need to
@@ -433,12 +434,12 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 	 * this pre-allocated page table.
 	 *
 	 * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is
-	 * folded, and if so pgtable_pmd_page_ctor() becomes nop.
+	 * folded, and if so pagetable_pte_ctor() becomes nop.
 	 */
 	if (shift == PAGE_SHIFT)
-		BUG_ON(!pgtable_pte_page_ctor(phys_to_page(pa)));
+		BUG_ON(!pagetable_pte_ctor(ptdesc));
 	else if (shift == PMD_SHIFT)
-		BUG_ON(!pgtable_pmd_page_ctor(phys_to_page(pa)));
+		BUG_ON(!pagetable_pmd_ctor(ptdesc));
 
 	return pa;
 }
--- a/arch/arm64/mm/mteswap.c
+++ b/arch/arm64/mm/mteswap.c
@@ -33,8 +33,9 @@ int mte_save_tags(struct page *page)
 
 	mte_save_page_tags(page_address(page), tag_storage);
 
-	/* page_private contains the swap entry.val set in do_swap_page */
-	ret = xa_store(&mte_pages, page_private(page), tag_storage, GFP_KERNEL);
+	/* lookup the swap entry.val from the page */
+	ret = xa_store(&mte_pages, page_swap_entry(page).val, tag_storage,
+		       GFP_KERNEL);
 	if (WARN(xa_is_err(ret), "Failed to store MTE tags")) {
 		mte_free_tag_storage(tag_storage);
 		return xa_err(ret);
--- a/arch/csky/abiv1/cache.c
+++ b/arch/csky/abiv1/cache.c
@@ -15,45 +15,51 @@
 
 #define PG_dcache_clean		PG_arch_1
 
-void flush_dcache_page(struct page *page)
+void flush_dcache_folio(struct folio *folio)
 {
 	struct address_space *mapping;
 
-	if (page == ZERO_PAGE(0))
+	if (is_zero_pfn(folio_pfn(folio)))
 		return;
 
-	mapping = page_mapping_file(page);
+	mapping = folio_flush_mapping(folio);
 
-	if (mapping && !page_mapcount(page))
-		clear_bit(PG_dcache_clean, &page->flags);
+	if (mapping && !folio_mapped(folio))
+		clear_bit(PG_dcache_clean, &folio->flags);
 	else {
 		dcache_wbinv_all();
 		if (mapping)
 			icache_inv_all();
-		set_bit(PG_dcache_clean, &page->flags);
+		set_bit(PG_dcache_clean, &folio->flags);
 	}
 }
+EXPORT_SYMBOL(flush_dcache_folio);
+
+void flush_dcache_page(struct page *page)
+{
+	flush_dcache_folio(page_folio(page));
+}
 EXPORT_SYMBOL(flush_dcache_page);
 
-void update_mmu_cache(struct vm_area_struct *vma, unsigned long addr,
-		      pte_t *ptep)
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep, unsigned int nr)
 {
 	unsigned long pfn = pte_pfn(*ptep);
-	struct page *page;
+	struct folio *folio;
 
 	flush_tlb_page(vma, addr);
 
 	if (!pfn_valid(pfn))
 		return;
 
-	page = pfn_to_page(pfn);
-	if (page == ZERO_PAGE(0))
+	if (is_zero_pfn(pfn))
 		return;
 
-	if (!test_and_set_bit(PG_dcache_clean, &page->flags))
+	folio = page_folio(pfn_to_page(pfn));
+	if (!test_and_set_bit(PG_dcache_clean, &folio->flags))
 		dcache_wbinv_all();
 
-	if (page_mapping_file(page)) {
+	if (folio_flush_mapping(folio)) {
 		if (vma->vm_flags & VM_EXEC)
 			icache_inv_all();
 	}
--- a/arch/csky/abiv1/inc/abi/cacheflush.h
+++ b/arch/csky/abiv1/inc/abi/cacheflush.h
@@ -9,6 +9,8 @@
 
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 extern void flush_dcache_page(struct page *);
+void flush_dcache_folio(struct folio *);
+#define flush_dcache_folio flush_dcache_folio
 
 #define flush_cache_mm(mm)			dcache_wbinv_all()
 #define flush_cache_page(vma, page, pfn)	cache_wbinv_all()
@@ -43,7 +45,6 @@ extern void flush_cache_range(struct vm_area_struct *vma, unsigned long start, u
 #define flush_cache_vmap(start, end)		cache_wbinv_all()
 #define flush_cache_vunmap(start, end)		cache_wbinv_all()
 
-#define flush_icache_page(vma, page)		do {} while (0);
 #define flush_icache_range(start, end)		cache_wbinv_range(start, end)
 #define flush_icache_mm_range(mm, start, end)	cache_wbinv_range(start, end)
 #define flush_icache_deferred(mm)		do {} while (0);
--- a/arch/csky/abiv2/cacheflush.c
+++ b/arch/csky/abiv2/cacheflush.c
@@ -7,30 +7,33 @@
 #include <asm/cache.h>
 #include <asm/tlbflush.h>
 
-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
-		      pte_t *pte)
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long address, pte_t *pte, unsigned int nr)
 {
-	unsigned long addr;
-	struct page *page;
+	unsigned long pfn = pte_pfn(*pte);
+	struct folio *folio;
+	unsigned int i;
 
 	flush_tlb_page(vma, address);
 
-	if (!pfn_valid(pte_pfn(*pte)))
+	if (!pfn_valid(pfn))
 		return;
 
-	page = pfn_to_page(pte_pfn(*pte));
-	if (page == ZERO_PAGE(0))
-		return;
+	folio = page_folio(pfn_to_page(pfn));
 
-	if (test_and_set_bit(PG_dcache_clean, &page->flags))
+	if (test_and_set_bit(PG_dcache_clean, &folio->flags))
 		return;
 
-	addr = (unsigned long) kmap_atomic(page);
-
-	icache_inv_range(address, address + PAGE_SIZE);
-	dcache_wb_range(addr, addr + PAGE_SIZE);
-
-	kunmap_atomic((void *) addr);
+	icache_inv_range(address, address + nr*PAGE_SIZE);
+	for (i = 0; i < folio_nr_pages(folio); i++) {
+		unsigned long addr = (unsigned long) kmap_local_folio(folio,
+								i * PAGE_SIZE);
+
+		dcache_wb_range(addr, addr + PAGE_SIZE);
+		if (vma->vm_flags & VM_EXEC)
+			icache_inv_range(addr, addr + PAGE_SIZE);
+		kunmap_local((void *) addr);
+	}
 }
 
 void flush_icache_deferred(struct mm_struct *mm)
--- a/arch/csky/abiv2/inc/abi/cacheflush.h
+++ b/arch/csky/abiv2/inc/abi/cacheflush.h
@@ -18,16 +18,21 @@
 
 #define PG_dcache_clean		PG_arch_1
 
+static inline void flush_dcache_folio(struct folio *folio)
+{
+	if (test_bit(PG_dcache_clean, &folio->flags))
+		clear_bit(PG_dcache_clean, &folio->flags);
+}
+#define flush_dcache_folio flush_dcache_folio
+
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
 static inline void flush_dcache_page(struct page *page)
 {
-	if (test_bit(PG_dcache_clean, &page->flags))
-		clear_bit(PG_dcache_clean, &page->flags);
+	flush_dcache_folio(page_folio(page));
 }
 
 #define flush_dcache_mmap_lock(mapping)		do { } while (0)
 #define flush_dcache_mmap_unlock(mapping)	do { } while (0)
-#define flush_icache_page(vma, page)		do { } while (0)
 
 #define flush_icache_range(start, end)		cache_wbinv_range(start, end)
--- a/arch/csky/include/asm/pgalloc.h
+++ b/arch/csky/include/asm/pgalloc.h
@@ -63,8 +63,8 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 #define __pte_free_tlb(tlb, pte, address)			\
 do {								\
-	pgtable_pte_page_dtor(pte);				\
-	tlb_remove_page(tlb, pte);				\
+	pagetable_pte_dtor(page_ptdesc(pte));			\
+	tlb_remove_page_ptdesc(tlb, page_ptdesc(pte));		\
 } while (0)
 
 extern void pagetable_init(void);
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -28,6 +28,7 @@
 #define pgd_ERROR(e) \
	pr_err("%s:%d: bad pgd %08lx.\n", __FILE__, __LINE__, pgd_val(e))
 
+#define PFN_PTE_SHIFT	PAGE_SHIFT
 #define pmd_pfn(pmd)	(pmd_phys(pmd) >> PAGE_SHIFT)
 #define pmd_page(pmd)	(pfn_to_page(pmd_phys(pmd) >> PAGE_SHIFT))
 #define pte_clear(mm, addr, ptep)	set_pte((ptep), \
@@ -90,7 +91,6 @@ static inline void set_pte(pte_t *p, pte_t pte)
 	/* prevent out of order excution */
 	smp_mb();
 }
-#define set_pte_at(mm, addr, ptep, pteval) set_pte(ptep, pteval)
 
 static inline pte_t *pmd_page_vaddr(pmd_t pmd)
 {
@@ -263,8 +263,10 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
 extern void paging_init(void);
 
-void update_mmu_cache(struct vm_area_struct *vma, unsigned long address,
-		      pte_t *pte);
+void update_mmu_cache_range(struct vm_fault *vmf, struct vm_area_struct *vma,
+		unsigned long address, pte_t *pte, unsigned int nr);
+#define update_mmu_cache(vma, addr, ptep) \
+	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 
 #define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
	remap_pfn_range(vma, vaddr, pfn, size, prot)
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -25,6 +25,7 @@ config HEXAGON
	select NEED_SG_DMA_LENGTH
	select NO_IOPORT_MAP
	select GENERIC_IOMAP
+	select GENERIC_IOREMAP
	select GENERIC_SMP_IDLE_THREAD
	select STACKTRACE_SUPPORT
	select GENERIC_CLOCKEVENTS_BROADCAST
--- a/arch/hexagon/include/asm/cacheflush.h
+++ b/arch/hexagon/include/asm/cacheflush.h
@@ -18,7 +18,7 @@
  *  - flush_cache_range(vma, start, end) flushes a range of pages
  *  - flush_icache_range(start, end) flush a range of instructions
  *  - flush_dcache_page(pg) flushes(wback&invalidates) a page for dcache
- *  - flush_icache_page(vma, pg) flushes(invalidates) a page for icache
+ *  - flush_icache_pages(vma, pg, nr) flushes(invalidates) nr pages for icache
  *
  *  Need to doublecheck which one is really needed for ptrace stuff to work.
  */
@@ -58,12 +58,16 @@ extern void flush_cache_all_hexagon(void);
  * clean the cache when the PTE is set.
  *
  */
-static inline void update_mmu_cache(struct vm_area_struct *vma,
-					unsigned long address, pte_t *ptep)
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+		struct vm_area_struct *vma, unsigned long address,
+		pte_t *ptep, unsigned int nr)
 {
 	/*  generic_ptrace_pokedata doesn't wind up here, does it?  */
 }
 
+#define update_mmu_cache(vma, addr, ptep) \
+	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
+
 void copy_to_user_page(struct vm_area_struct *vma, struct page *page,
		       unsigned long vaddr, void *dst, void *src, int len);
 #define copy_to_user_page copy_to_user_page
--- a/arch/hexagon/include/asm/io.h
+++ b/arch/hexagon/include/asm/io.h
@@ -27,8 +27,6 @@
 extern int remap_area_pages(unsigned long start, unsigned long phys_addr,
				unsigned long end, unsigned long flags);
 
-extern void iounmap(const volatile void __iomem *addr);
-
 /* Defined in lib/io.c, needed for smc91x driver. */
 extern void __raw_readsw(const void __iomem *addr, void *data, int wordlen);
 extern void __raw_writesw(void __iomem *addr, const void *data, int wordlen);
@@ -170,8 +168,13 @@ static inline void writel(u32 data, volatile void __iomem *addr)
 #define writew_relaxed __raw_writew
 #define writel_relaxed __raw_writel
 
-void __iomem *ioremap(unsigned long phys_addr, unsigned long size);
-#define ioremap_uc(X, Y) ioremap((X), (Y))
+/*
+ * I/O memory mapping functions.
+ */
+#define _PAGE_IOREMAP (_PAGE_PRESENT | _PAGE_READ | _PAGE_WRITE | \
+		       (__HEXAGON_C_DEV << 6))
+
+#define ioremap_uc(addr, size) ioremap((addr), (size))
 
 
 #define __raw_writel writel
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -87,10 +87,10 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
		max_kernel_seg = pmdindex;
 }
 
-#define __pte_free_tlb(tlb, pte, addr)		\
-do {						\
-	pgtable_pte_page_dtor((pte));		\
-	tlb_remove_page((tlb), (pte));		\
+#define __pte_free_tlb(tlb, pte, addr)				\
+do {								\
+	pagetable_pte_dtor((page_ptdesc(pte)));			\
+	tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));	\
 } while (0)
 
 #endif
--- a/arch/hexagon/include/asm/pgtable.h
+++ b/arch/hexagon/include/asm/pgtable.h
@@ -338,6 +338,7 @@ static inline int pte_exec(pte_t pte)
 /* __swp_entry_to_pte - extract PTE from swap entry */
 #define __swp_entry_to_pte(x) ((pte_t) { (x).val })
 
+#define PFN_PTE_SHIFT	PAGE_SHIFT
 /* pfn_pte - convert page number and protection value to page table entry */
 #define pfn_pte(pfn, pgprot) __pte((pfn << PAGE_SHIFT) | pgprot_val(pgprot))
 
@@ -345,14 +346,6 @@ static inline int pte_exec(pte_t pte)
 #define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT)
 #define set_pmd(pmdptr, pmdval) (*(pmdptr) = (pmdval))
 
-/*
- * set_pte_at - update page table and do whatever magic may be
- * necessary to make the underlying hardware/firmware take note.
- *
- * VM may require a virtual instruction to alert the MMU.
- */
-#define set_pte_at(mm, addr, ptep, pte) set_pte(ptep, pte)
-
 static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 {
	return (unsigned long)__va(pmd_val(pmd) & PAGE_MASK);
--- a/arch/hexagon/kernel/hexagon_ksyms.c
+++ b/arch/hexagon/kernel/hexagon_ksyms.c
@@ -14,12 +14,10 @@
 EXPORT_SYMBOL(__clear_user_hexagon);
 EXPORT_SYMBOL(raw_copy_from_user);
 EXPORT_SYMBOL(raw_copy_to_user);
-EXPORT_SYMBOL(iounmap);
 EXPORT_SYMBOL(__vmgetie);
 EXPORT_SYMBOL(__vmsetie);
 EXPORT_SYMBOL(__vmyield);
 EXPORT_SYMBOL(empty_zero_page);
-EXPORT_SYMBOL(ioremap);
 EXPORT_SYMBOL(memcpy);
 EXPORT_SYMBOL(memset);
--- a/arch/hexagon/mm/Makefile
+++ b/arch/hexagon/mm/Makefile
@@ -3,5 +3,5 @@
 # Makefile for Hexagon memory management subsystem
 #
 
-obj-y := init.o ioremap.o uaccess.o vm_fault.o cache.o
+obj-y := init.o uaccess.o vm_fault.o cache.o
 obj-y += copy_to_user.o copy_from_user.o vm_tlb.o
--- a/arch/hexagon/mm/ioremap.c
+++ /dev/null
@@ -1,44 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * I/O remap functions for Hexagon
- *
- * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
- */
-
-#include <linux/io.h>
-#include <linux/vmalloc.h>
-#include <linux/mm.h>
-
-void __iomem *ioremap(unsigned long phys_addr, unsigned long size)
-{
-	unsigned long last_addr, addr;
-	unsigned long offset = phys_addr & ~PAGE_MASK;
-	struct vm_struct *area;
-
-	pgprot_t prot = __pgprot(_PAGE_PRESENT|_PAGE_READ|_PAGE_WRITE
-					|(__HEXAGON_C_DEV << 6));
-
-	last_addr = phys_addr + size - 1;
-
-	/* Wrapping not allowed */
-	if (!size || (last_addr < phys_addr))
-		return NULL;
-
-	/* Rounds up to next page size, including whole-page offset */
-	size = PAGE_ALIGN(offset + size);
-
-	area = get_vm_area(size, VM_IOREMAP);
-	addr = (unsigned long)area->addr;
-
-	if (ioremap_page_range(addr, addr+size, phys_addr, prot)) {
-		vunmap((void *)addr);
-		return NULL;
-	}
-
-	return (void __iomem *) (offset + addr);
-}
-
-void iounmap(const volatile void __iomem *addr)
-{
-	vunmap((void *) ((unsigned long) addr & PAGE_MASK));
-}
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -47,6 +47,7 @@ config IA64
	select GENERIC_IRQ_LEGACY
	select ARCH_HAVE_NMI_SAFE_CMPXCHG
	select GENERIC_IOMAP
+	select GENERIC_IOREMAP
	select GENERIC_SMP_IDLE_THREAD
	select ARCH_TASK_STRUCT_ON_STACK
	select ARCH_TASK_STRUCT_ALLOCATOR
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -798,22 +798,30 @@ sba_io_pdir_entry(u64 *pdir_ptr, unsigned long vba)
 #endif
 
 #ifdef ENABLE_MARK_CLEAN
-/**
+/*
  * Since DMA is i-cache coherent, any (complete) pages that were written via
  * DMA can be marked as "clean" so that lazy_mmu_prot_update() doesn't have to
  * flush them when they get mapped into an executable vm-area.
  */
-static void
-mark_clean (void *addr, size_t size)
+static void mark_clean(void *addr, size_t size)
 {
-	unsigned long pg_addr, end;
-
-	pg_addr = PAGE_ALIGN((unsigned long) addr);
-	end = (unsigned long) addr + size;
-	while (pg_addr + PAGE_SIZE <= end) {
-		struct page *page = virt_to_page((void *)pg_addr);
-		set_bit(PG_arch_1, &page->flags);
-		pg_addr += PAGE_SIZE;
+	struct folio *folio = virt_to_folio(addr);
+	ssize_t left = size;
+	size_t offset = offset_in_folio(folio, addr);
+
+	if (offset) {
+		left -= folio_size(folio) - offset;
+		if (left <= 0)
+			return;
+		folio = folio_next(folio);
+	}
+
+	while (left >= folio_size(folio)) {
+		left -= folio_size(folio);
+		set_bit(PG_arch_1, &folio->flags);
+		if (!left)
+			break;
+		folio = folio_next(folio);
 	}
 }
 #endif
--- a/arch/ia64/include/asm/cacheflush.h
+++ b/arch/ia64/include/asm/cacheflush.h
@@ -13,10 +13,16 @@
 #include <asm/page.h>
 
 #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1
-#define flush_dcache_page(page)			\
-do {						\
-	clear_bit(PG_arch_1, &(page)->flags);	\
-} while (0)
+static inline void flush_dcache_folio(struct folio *folio)
+{
+	clear_bit(PG_arch_1, &folio->flags);
+}
+#define flush_dcache_folio flush_dcache_folio
+
+static inline void flush_dcache_page(struct page *page)
+{
+	flush_dcache_folio(page_folio(page));
+}
 
 extern void flush_icache_range(unsigned long start, unsigned long end);
 #define flush_icache_range flush_icache_range
--- a/arch/ia64/include/asm/io.h
+++ b/arch/ia64/include/asm/io.h
@@ -243,15 +243,12 @@ static inline void outsl(unsigned long port, const void *src,
 
 # ifdef __KERNEL__
 
-extern void __iomem * ioremap(unsigned long offset, unsigned long size);
+#define _PAGE_IOREMAP pgprot_val(PAGE_KERNEL)
+
 extern void __iomem * ioremap_uc(unsigned long offset, unsigned long size);
-extern void iounmap (volatile void __iomem *addr);
-static inline void __iomem * ioremap_cache (unsigned long phys_addr, unsigned long size)
-{
-	return ioremap(phys_addr, size);
-}
-#define ioremap ioremap
-#define ioremap_cache ioremap_cache
 
+#define ioremap_prot ioremap_prot
+#define ioremap_cache ioremap
 #define ioremap_uc ioremap_uc
 #define iounmap iounmap
--- a/arch/ia64/include/asm/pgtable.h
+++ b/arch/ia64/include/asm/pgtable.h
@@ -206,6 +206,7 @@ ia64_phys_addr_valid (unsigned long addr)
 #define RGN_MAP_SHIFT (PGDIR_SHIFT + PTRS_PER_PGD_SHIFT - 3)
 #define RGN_MAP_LIMIT	((1UL << RGN_MAP_SHIFT) - PAGE_SIZE)	/* per region addr limit */
 
+#define PFN_PTE_SHIFT	PAGE_SHIFT
 /*
  * Conversion functions: convert page frame number (pfn) and a protection value to a page
  * table entry (pte).
@@ -303,8 +304,6 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
	*ptep = pteval;
 }
 
-#define set_pte_at(mm,addr,ptep,pteval) set_pte(ptep,pteval)
-
 /*
  * Make page protection values cacheable, uncacheable, or write-
  * combining.  Note that "protection" is really a misnomer here as the
@@ -396,6 +395,7 @@ pte_same (pte_t a, pte_t b)
	return pte_val(a) == pte_val(b);
 }
 
+#define update_mmu_cache_range(vmf, vma, address, ptep, nr) do { } while (0)
 #define update_mmu_cache(vma, address, ptep) do { } while (0)
 
 extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -50,30 +50,44 @@ void
 __ia64_sync_icache_dcache (pte_t pte)
 {
 	unsigned long addr;
-	struct page *page;
+	struct folio *folio;
 
-	page = pte_page(pte);
-	addr = (unsigned long) page_address(page);
+	folio = page_folio(pte_page(pte));
+	addr = (unsigned long)folio_address(folio);
 
-	if (test_bit(PG_arch_1, &page->flags))
+	if (test_bit(PG_arch_1, &folio->flags))
 		return;				/* i-cache is already coherent with d-cache */
 
-	flush_icache_range(addr, addr + page_size(page));
-	set_bit(PG_arch_1, &page->flags);	/* mark page as clean */
+	flush_icache_range(addr, addr + folio_size(folio));
+	set_bit(PG_arch_1, &folio->flags);	/* mark page as clean */
 }
 
 /*
- * Since DMA is i-cache coherent, any (complete) pages that were written via
+ * Since DMA is i-cache coherent, any (complete) folios that were written via
  * DMA can be marked as "clean" so that lazy_mmu_prot_update() doesn't have to
  * flush them when they get mapped into an executable vm-area.
  */
 void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
 {
 	unsigned long pfn = PHYS_PFN(paddr);
+	struct folio *folio = page_folio(pfn_to_page(pfn));
+	ssize_t left = size;
+	size_t offset = offset_in_folio(folio, paddr);
 
-	do {
-		set_bit(PG_arch_1, &pfn_to_page(pfn)->flags);
-	} while (++pfn <= PHYS_PFN(paddr + size - 1));
+	if (offset) {
+		left -= folio_size(folio) - offset;
+		if (left <= 0)
+			return;
+		folio = folio_next(folio);
+	}
+
+	while (left >= (ssize_t)folio_size(folio)) {
+		left -= folio_size(folio);
+		set_bit(PG_arch_1, &folio->flags);
+		if (!left)
+			break;
+		folio = folio_next(folio);
+	}
 }
 
 inline void
@@ -29,13 +29,9 @@ early_ioremap (unsigned long phys_addr, unsigned long size)
 	return __ioremap_uc(phys_addr);
 }
 
-void __iomem *
-ioremap (unsigned long phys_addr, unsigned long size)
+void __iomem *ioremap_prot(phys_addr_t phys_addr, size_t size,
+			   unsigned long flags)
 {
-	void __iomem *addr;
-	struct vm_struct *area;
-	unsigned long offset;
-	pgprot_t prot;
 	u64 attr;
 	unsigned long gran_base, gran_size;
 	unsigned long page_base;
@@ -68,36 +64,12 @@ ioremap (unsigned long phys_addr, unsigned long size)
 	 */
 	page_base = phys_addr & PAGE_MASK;
 	size = PAGE_ALIGN(phys_addr + size) - page_base;
-	if (efi_mem_attribute(page_base, size) & EFI_MEMORY_WB) {
-		prot = PAGE_KERNEL;
-
-		/*
-		 * Mappings have to be page-aligned
-		 */
-		offset = phys_addr & ~PAGE_MASK;
-		phys_addr &= PAGE_MASK;
-
-		/*
-		 * Ok, go for it..
-		 */
-		area = get_vm_area(size, VM_IOREMAP);
-		if (!area)
-			return NULL;
-
-		area->phys_addr = phys_addr;
-		addr = (void __iomem *) area->addr;
-		if (ioremap_page_range((unsigned long) addr,
-				(unsigned long) addr + size, phys_addr, prot)) {
-			vunmap((void __force *) addr);
-			return NULL;
-		}
-
-		return (void __iomem *) (offset + (char __iomem *)addr);
-	}
+	if (efi_mem_attribute(page_base, size) & EFI_MEMORY_WB)
+		return generic_ioremap_prot(phys_addr, size, __pgprot(flags));
 
 	return __ioremap_uc(phys_addr);
 }
-EXPORT_SYMBOL(ioremap);
+EXPORT_SYMBOL(ioremap_prot);
 
 void __iomem *
 ioremap_uc(unsigned long phys_addr, unsigned long size)
@@ -114,8 +86,7 @@ early_iounmap (volatile void __iomem *addr, unsigned long size)
 {
 }
 
-void
-iounmap (volatile void __iomem *addr)
+void iounmap(volatile void __iomem *addr)
 {
 	if (REGION_NUMBER(addr) == RGN_GATE)
 		vunmap((void *) ((unsigned long) addr & PAGE_MASK));
@@ -60,7 +60,7 @@ config LOONGARCH
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_LD_ORPHAN_WARN
-	select ARCH_WANT_OPTIMIZE_VMEMMAP
+	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
 	select ARCH_WANTS_NO_INSTR
 	select BUILDTIME_TABLE_SORT
 	select COMMON_CLK
@@ -46,7 +46,6 @@ void local_flush_icache_range(unsigned long start, unsigned long end);
 #define flush_cache_page(vma, vmaddr, pfn)		do { } while (0)
 #define flush_cache_vmap(start, end)			do { } while (0)
 #define flush_cache_vunmap(start, end)			do { } while (0)
-#define flush_icache_page(vma, page)			do { } while (0)
 #define flush_icache_user_page(vma, page, addr, len)	do { } while (0)
 #define flush_dcache_page(page)				do { } while (0)
 #define flush_dcache_mmap_lock(mapping)			do { } while (0)
@@ -5,8 +5,6 @@
 #ifndef _ASM_IO_H
 #define _ASM_IO_H
 
-#define ARCH_HAS_IOREMAP_WC
-
 #include <linux/kernel.h>
 #include <linux/types.h>
@@ -45,9 +45,9 @@ extern void pagetable_init(void);
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
 #define __pte_free_tlb(tlb, pte, address)			\
-do {								\
-	pgtable_pte_page_dtor(pte);				\
-	tlb_remove_page((tlb), pte);				\
+do {								\
+	pagetable_pte_dtor(page_ptdesc(pte));			\
+	tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));	\
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -55,18 +55,18 @@ do {								\
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pmd_t *pmd;
-	struct page *pg;
+	struct ptdesc *ptdesc;
 
-	pg = alloc_page(GFP_KERNEL_ACCOUNT);
-	if (!pg)
+	ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, 0);
+	if (!ptdesc)
 		return NULL;
 
-	if (!pgtable_pmd_page_ctor(pg)) {
-		__free_page(pg);
+	if (!pagetable_pmd_ctor(ptdesc)) {
+		pagetable_free(ptdesc);
 		return NULL;
 	}
 
-	pmd = (pmd_t *)page_address(pg);
+	pmd = ptdesc_address(ptdesc);
 	pmd_init(pmd);
 	return pmd;
 }
@@ -80,10 +80,13 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pud_t *pud;
+	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-	pud = (pud_t *) __get_free_page(GFP_KERNEL);
-	if (pud)
-		pud_init(pud);
+	if (!ptdesc)
+		return NULL;
+	pud = ptdesc_address(ptdesc);
 
+	pud_init(pud);
 	return pud;
 }
@@ -50,12 +50,12 @@
 #define _PAGE_NO_EXEC		(_ULCAST_(1) << _PAGE_NO_EXEC_SHIFT)
 #define _PAGE_RPLV		(_ULCAST_(1) << _PAGE_RPLV_SHIFT)
 #define _CACHE_MASK		(_ULCAST_(3) << _CACHE_SHIFT)
-#define _PFN_SHIFT		(PAGE_SHIFT - 12 + _PAGE_PFN_SHIFT)
+#define PFN_PTE_SHIFT		(PAGE_SHIFT - 12 + _PAGE_PFN_SHIFT)
 
 #define _PAGE_USER	(PLV_USER << _PAGE_PLV_SHIFT)
 #define _PAGE_KERN	(PLV_KERN << _PAGE_PLV_SHIFT)
 
-#define _PFN_MASK (~((_ULCAST_(1) << (_PFN_SHIFT)) - 1) & \
+#define _PFN_MASK (~((_ULCAST_(1) << (PFN_PTE_SHIFT)) - 1) & \
 		   ((_ULCAST_(1) << (_PAGE_PFN_END_SHIFT)) - 1))
 
 /*
@@ -237,9 +237,9 @@ extern pmd_t mk_pmd(struct page *page, pgprot_t prot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp, pmd_t pmd);
 
 #define pte_page(x)		pfn_to_page(pte_pfn(x))
-#define pte_pfn(x)		((unsigned long)(((x).pte & _PFN_MASK) >> _PFN_SHIFT))
-#define pfn_pte(pfn, prot)	__pte(((pfn) << _PFN_SHIFT) | pgprot_val(prot))
-#define pfn_pmd(pfn, prot)	__pmd(((pfn) << _PFN_SHIFT) | pgprot_val(prot))
+#define pte_pfn(x)		((unsigned long)(((x).pte & _PFN_MASK) >> PFN_PTE_SHIFT))
+#define pfn_pte(pfn, prot)	__pte(((pfn) << PFN_PTE_SHIFT) | pgprot_val(prot))
+#define pfn_pmd(pfn, prot)	__pmd(((pfn) << PFN_PTE_SHIFT) | pgprot_val(prot))
 
 /*
  * Initialize a new pgd / pud / pmd table with invalid pointers.
@@ -334,19 +334,13 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
 	}
 }
 
-static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
-			      pte_t *ptep, pte_t pteval)
-{
-	set_pte(ptep, pteval);
-}
-
 static inline void pte_clear(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
 	/* Preserve global status for the pair */
 	if (pte_val(*ptep_buddy(ptep)) & _PAGE_GLOBAL)
-		set_pte_at(mm, addr, ptep, __pte(_PAGE_GLOBAL));
+		set_pte(ptep, __pte(_PAGE_GLOBAL));
 	else
-		set_pte_at(mm, addr, ptep, __pte(0));
+		set_pte(ptep, __pte(0));
 }
 
 #define PGD_T_LOG2	(__builtin_ffs(sizeof(pgd_t)) - 1)
@@ -445,11 +439,20 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 extern void __update_tlb(struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep);
 
-static inline void update_mmu_cache(struct vm_area_struct *vma,
-			unsigned long address, pte_t *ptep)
+static inline void update_mmu_cache_range(struct vm_fault *vmf,
+		struct vm_area_struct *vma, unsigned long address,
+		pte_t *ptep, unsigned int nr)
 {
-	__update_tlb(vma, address, ptep);
+	for (;;) {
+		__update_tlb(vma, address, ptep);
+		if (--nr == 0)
+			break;
+		address += PAGE_SIZE;
+		ptep++;
+	}
 }
+#define update_mmu_cache(vma, addr, ptep) \
+	update_mmu_cache_range(NULL, vma, addr, ptep, 1)
 
 #define __HAVE_ARCH_UPDATE_MMU_TLB
 #define update_mmu_tlb	update_mmu_cache
@@ -462,7 +465,7 @@ static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
 
 static inline unsigned long pmd_pfn(pmd_t pmd)
 {
-	return (pmd_val(pmd) & _PFN_MASK) >> _PFN_SHIFT;
+	return (pmd_val(pmd) & _PFN_MASK) >> PFN_PTE_SHIFT;
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -11,10 +11,11 @@
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	pgd_t *ret, *init;
+	pgd_t *init, *ret = NULL;
+	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-	ret = (pgd_t *) __get_free_page(GFP_KERNEL);
-	if (ret) {
+	if (ptdesc) {
+		ret = (pgd_t *)ptdesc_address(ptdesc);
 		init = pgd_offset(&init_mm, 0UL);
 		pgd_init(ret);
 		memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
@@ -107,7 +108,7 @@ pmd_t mk_pmd(struct page *page, pgprot_t prot)
 {
 	pmd_t pmd;
 
-	pmd_val(pmd) = (page_to_pfn(page) << _PFN_SHIFT) | pgprot_val(prot);
+	pmd_val(pmd) = (page_to_pfn(page) << PFN_PTE_SHIFT) | pgprot_val(prot);
 
 	return pmd;
 }
@@ -252,7 +252,7 @@ static void output_pgtable_bits_defines(void)
 	pr_define("_PAGE_WRITE_SHIFT %d\n", _PAGE_WRITE_SHIFT);
 	pr_define("_PAGE_NO_READ_SHIFT %d\n", _PAGE_NO_READ_SHIFT);
 	pr_define("_PAGE_NO_EXEC_SHIFT %d\n", _PAGE_NO_EXEC_SHIFT);
-	pr_define("_PFN_SHIFT %d\n", _PFN_SHIFT);
+	pr_define("PFN_PTE_SHIFT %d\n", PFN_PTE_SHIFT);
 	pr_debug("\n");
 }