8cd7c588de
Patch series "Remove dependency on congestion_wait in mm/", v5.

This series removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested.  It's not a clever implementation but
congestion_wait has been broken for a long time [1].

Even if congestion throttling worked, it was never a great idea.  While
excessive dirty/writeback pages at the tail of the LRU is one possible
reason that reclaim may be slow, there is also the problem of too many
pages being isolated and reclaim failing for other reasons (elevated
references, too many pages isolated, excessive LRU contention etc).

This series replaces the "congestion" throttling with 3 different types.

 - If there are too many dirty/writeback pages, sleep until a timeout or
   enough pages get cleaned

 - If too many pages are isolated, sleep until enough isolated pages are
   either reclaimed or put back on the LRU

 - If no progress is being made, direct reclaim tasks sleep until
   another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work.  A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem.  Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may time out.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increases.  It
has four types of worker.

 - One "anon latency" worker creates small mappings with mmap() and
   times how long it takes to fault the mapping reading it 4K at a time

 - X file writers which is fio randomly writing X files where the total
   size of the files add up to the allowed dirty_ratio.  fio is allowed
   to run for a warmup period to allow some file-backed pages to
   accumulate.  The duration of the warmup is based on the best-case
   linear write speed of the storage.

 - Y file readers which is fio randomly reading small files

 - Z anon memory hogs which continually map (100-dirty_ratio)% of memory

 - Total estimated WSS = (100+dirty_ratio) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon
memory where some anonymous memory must be swapped and there is a high
likelihood of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages.  The results below are based on having zero
readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results.  The other two machines had much better latencies for example.
First, the latency results of the "anon latency" worker

stutterp
                            5.15.0-rc1               5.15.0-rc1
                               vanilla   mm-reclaimcongest-v5r4
Amean  mmap-4       31.4003 (  0.00%)     2661.0198 ( -8374.52%)
Amean  mmap-7       38.1641 (  0.00%)      149.2891 (  -291.18%)
Amean  mmap-12      60.0981 (  0.00%)      187.8105 (  -212.51%)
Amean  mmap-21     161.2699 (  0.00%)      213.9107 (   -32.64%)
Amean  mmap-30     174.5589 (  0.00%)      377.7548 (  -116.41%)
Amean  mmap-48    8106.8160 (  0.00%)     1070.5616 (    86.79%)
Stddev mmap-4       41.3455 (  0.00%)    27573.9676 (-66591.66%)
Stddev mmap-7       53.5556 (  0.00%)     4608.5860 ( -8505.23%)
Stddev mmap-12     171.3897 (  0.00%)     5559.4542 ( -3143.75%)
Stddev mmap-21    1506.6752 (  0.00%)     5746.2507 (  -281.39%)
Stddev mmap-30     557.5806 (  0.00%)     7678.1624 ( -1277.05%)
Stddev mmap-48   61681.5718 (  0.00%)    14507.2830 (    76.48%)
Max-90 mmap-4       31.4243 (  0.00%)       83.1457 (  -164.59%)
Max-90 mmap-7       41.0410 (  0.00%)       41.0720 (    -0.08%)
Max-90 mmap-12      66.5255 (  0.00%)       53.9073 (    18.97%)
Max-90 mmap-21     146.7479 (  0.00%)      105.9540 (    27.80%)
Max-90 mmap-30     193.9513 (  0.00%)       64.3067 (    66.84%)
Max-90 mmap-48     277.9137 (  0.00%)      591.0594 (  -112.68%)
Max    mmap-4     1913.8009 (  0.00%)   299623.9695 (-15555.96%)
Max    mmap-7     2423.9665 (  0.00%)   204453.1708 ( -8334.65%)
Max    mmap-12    6845.6573 (  0.00%)   221090.3366 ( -3129.64%)
Max    mmap-21   56278.6508 (  0.00%)   213877.3496 (  -280.03%)
Max    mmap-30   19716.2990 (  0.00%)   216287.6229 (  -997.00%)
Max    mmap-48  477923.9400 (  0.00%)   245414.8238 (    48.65%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout, increasing the amount of
inefficient scanning of the LRU.  There is no prioritisation of reclaim
tasks making progress based on each task's rate of page allocation
versus progress of reclaim.  The variance is also impacted for high
worker counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers.
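
The bracketed percentages in the latency table read as the relative
change against the vanilla baseline, with positive values meaning lower
(better) latency.  A minimal sketch of that arithmetic, assuming the
column is computed as (baseline - patched) / baseline * 100.  The helper
names are hypothetical and are not taken from mmtests:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: relative gain of `patched` over `baseline` in
 * percent.  Positive means the patched value is lower than the baseline.
 */
double pct_gain(double baseline, double patched)
{
	assert(baseline != 0.0);
	return (baseline - patched) / baseline * 100.0;
}

/* Print one comparison row in roughly the layout used in the table. */
void print_row(const char *name, double baseline, double patched)
{
	printf("%-8s %12.4f %12.4f (%9.2f%%)\n",
	       name, baseline, patched, pct_gain(baseline, patched));
}
```

For example, print_row("mmap-48", 8106.8160, 1070.5616) reports a gain of
roughly 86.79%, while the mmap-21 row comes out at roughly -32.64%.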
Max-90 shows that 90% of the stalls are comparable but the Max results
show the massive outliers which are increased due to stalling.

It is expected that this will be very machine dependent.  Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not.  The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger.  The warmup period calculation is not ideal as it's based
on linear writes whereas fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable.  For example,
these are the latencies on a single-socket machine that had more memory

Amean  mmap-4       42.2287 (  0.00%)       49.6838 * -17.65%*
Amean  mmap-7      216.4326 (  0.00%)       47.4451 *  78.08%*
Amean  mmap-12    2412.0588 (  0.00%)       51.7497 (  97.85%)
Amean  mmap-21    5546.2548 (  0.00%)       51.8862 (  99.06%)
Amean  mmap-30    1085.3121 (  0.00%)       72.1004 (  93.36%)

The overall system CPU usage and elapsed time is as follows

                    5.15.0-rc3               5.15.0-rc3
                       vanilla   mm-reclaimcongest-v5r4
Duration User          6989.03                   983.42
Duration System        7308.12                   799.68
Duration Elapsed       2277.67                  2092.98

The patches reduce system CPU usage by 89% as the vanilla kernel is
rarely stalling.

The high-level /proc/vmstats show

                                     5.15.0-rc1               5.15.0-rc1
                                        vanilla   mm-reclaimcongest-v5r2
Ops Direct pages scanned          1056608451.00             503594991.00
Ops Kswapd pages scanned           109795048.00             147289810.00
Ops Kswapd pages reclaimed          63269243.00              31036005.00
Ops Direct pages reclaimed          10803973.00               6328887.00
Ops Kswapd efficiency %                   57.62                    21.07
Ops Kswapd velocity                    48204.98                 57572.86
Ops Direct efficiency %                    1.02                     1.26
Ops Direct velocity                   463898.83                196845.97

Kswapd scanned fewer pages but the detailed pattern is different.  The
vanilla kernel scans slowly over time whereas the patched kernel
exhibits burst patterns of scan activity.  Direct reclaim scanning is
reduced by 52% due to stalling.

The pattern for stealing pages is also slightly different.
Both kernels exhibit spikes but the vanilla kernel when reclaiming shows
pages being reclaimed over a period of time whereas the patches tend to
reclaim in spikes.  The difference is that vanilla is not throttling and
instead scanning constantly, finding some pages over time, whereas the
patched kernel throttles and reclaims in spikes.

Ops Percentage direct scans               90.59                    77.37

With vanilla, direct reclaim accounted for 90.59% of pages scanned
whereas with the patches, 77.37% were scanned by direct reclaim due to
throttling.

Ops Page writes by reclaim           2613590.00               1687131.00

Page writes from reclaim context are reduced.

Ops Page writes anon                 2932752.00               1917048.00

And there is less swapping.

Ops Page reclaim immediate         996248528.00             107664764.00

The number of pages encountered at the tail of the LRU tagged for
immediate reclaim but still dirty/writeback is reduced by 89%.

Ops Slabs scanned                     164284.00                153608.00

Slab scan activity is similar.

ftrace was used to gather stall activity

Vanilla
-------
      1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
      2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
      8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
     29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
  82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The vast majority of wait_iff_congested calls do not stall at all.  What
is likely happening is that cond_resched() reschedules the task for a
short period when the BDI is not registering congestion (which it never
will in this test setup).
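
The trace output in this message is aggregated as "count, delay" buckets.
A small sketch of how such a histogram can be summarised, for instance to
put a number on "the vast majority do not stall".  The structure and
function names are illustrative only and the bucket values are copied
from the writeback_wait_iff_congested listing:

```c
#include <assert.h>
#include <stddef.h>

/*
 * One bucket of an aggregated trace histogram: `count` events that each
 * slept for `usec_delayed` microseconds.
 */
struct stall_bucket {
	unsigned long count;
	unsigned long usec_delayed;
};

/*
 * Percentage of events whose delay equals `usec`.  Pass 0 for "did not
 * stall" or the timeout value for "slept for the full timeout".
 */
double share_with_delay(const struct stall_bucket *b, size_t n,
			unsigned long usec)
{
	unsigned long total = 0, match = 0;
	size_t i;

	assert(b != NULL || n == 0);
	for (i = 0; i < n; i++) {
		total += b[i].count;
		if (b[i].usec_delayed == usec)
			match += b[i].count;
	}
	return total ? (double)match / (double)total * 100.0 : 0.0;
}

/* The vanilla writeback_wait_iff_congested buckets. */
const struct stall_bucket wait_iff_congested[] = {
	{ 1, 16000 }, { 2, 12000 }, { 8, 8000 }, { 29, 4000 }, { 82394, 0 },
};
```

Calling share_with_delay(wait_iff_congested, 5, 0) shows that well over
99% of the vanilla wait_iff_congested events had a delay of zero.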
      1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
      2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
      4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
    380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
    778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait, if called, always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

      1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
     94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
    112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
Roughly 16% of stalls reached the timeout before expiry.  For direct
reclaim, the number of times stalled for each reason were

   6624 reason=VMSCAN_THROTTLE_ISOLATED
  93246 reason=VMSCAN_THROTTLE_NOPROGRESS
  96934 reason=VMSCAN_THROTTLE_WRITEBACK

The most common reason to stall was due to excessive pages tagged for
immediate reclaim at the tail of the LRU followed by a failure to make
forward progress.
A relatively small number were due to too many pages isolated from the
LRU by parallel threads

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

      9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
     12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
     83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
   6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all.  A small number reached the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over the
map

      1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
      6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
     11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
     16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
     18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
     21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
     26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
     27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
     28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
     29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
     31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
     32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
     33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
     37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
     38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
     40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
     43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
     55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
     56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
     58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
     59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
     61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
     79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
     88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
     94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
    118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
    119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
    126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
    146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
    159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
    178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
    183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
    237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
    266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
    313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
    347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
    470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
    559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
    964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
   7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
  22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
  51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at
all.  The remainder slept a little, allowing other reclaim tasks to make
progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

      1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
      3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
     16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
     24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
     28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
     32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
     42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
     77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
     99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
    137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
    190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
    339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
    518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
    852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
   3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
   7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
  83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all.  This is very different to kswapd
where only a tiny percentage of stalls due to writeback reached the
timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls.  There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on
the workload, memory size and the speed of the storage.  A better
approach to improve the series further would be to prioritise tasks
based on their rate of allocation with the caveat that it may be very
expensive to track.

This patch (of 5):

Page reclaim throttles on wait_iff_congested under the following
conditions:

 - kswapd is encountering pages under writeback and marked for immediate
   reclaim implying that pages are cycling through the LRU faster than
   pages can be cleaned.

 - Direct reclaim will stall if all dirty pages are backed by congested
   inodes.

wait_iff_congested is almost completely broken with few exceptions.
This patch adds a new node-based waitqueue and tracks the number of
throttled tasks and pages written back since throttling started.  If
enough pages belonging to the node are written back then the throttled
tasks will wake early.  If not, the throttled tasks sleep until the
timeout expires.
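
The wake-early-or-time-out behaviour described for this patch can be
sketched in userspace.  This is only an analogy built on POSIX threads
under stated assumptions, not the kernel implementation: struct
node_throttle, its fields and every function name below are
hypothetical, and a mutex plus condition variable stand in for the
per-node wait queue and page counters.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-node throttle state. */
struct node_throttle {
	pthread_mutex_t lock;
	pthread_cond_t wait;
	unsigned int nr_throttled;	/* tasks currently sleeping */
	unsigned long nr_written;	/* pages cleaned since throttling began */
	unsigned long wake_thresh;	/* pages needed for an early wakeup */
};

void throttle_init(struct node_throttle *t, unsigned long wake_thresh)
{
	pthread_mutex_init(&t->lock, NULL);
	pthread_cond_init(&t->wait, NULL);
	t->nr_throttled = 0;
	t->nr_written = 0;
	t->wake_thresh = wake_thresh;
}

/* Sleep until enough pages are written back or the timeout expires.
 * Returns true when woken early. */
bool throttle_wait(struct node_throttle *t, long timeout_ms)
{
	struct timespec deadline;
	bool early;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&t->lock);
	if (t->nr_throttled++ == 0)
		t->nr_written = 0;	/* first sleeper restarts the count */
	while (t->nr_written < t->wake_thresh && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&t->wait, &t->lock, &deadline);
	early = t->nr_written >= t->wake_thresh;
	assert(t->nr_throttled > 0);
	t->nr_throttled--;
	pthread_mutex_unlock(&t->lock);
	return early;
}

/* Account completed writeback; wake sleepers once the threshold is met. */
void throttle_account_written(struct node_throttle *t, unsigned long nr_pages)
{
	pthread_mutex_lock(&t->lock);
	if (t->nr_throttled) {
		t->nr_written += nr_pages;
		if (t->nr_written >= t->wake_thresh)
			pthread_cond_broadcast(&t->wait);
	}
	pthread_mutex_unlock(&t->lock);
}

/* Simulated writeback: clean 4 pages every millisecond, 100 times. */
static void *writer(void *arg)
{
	struct timespec ts = { 0, 1000000L };
	int i;

	for (i = 0; i < 100; i++) {
		throttle_account_written(arg, 4);
		nanosleep(&ts, NULL);
	}
	return NULL;
}

/* Demo: a sleeper with a long timeout is woken early by the writer. */
bool demo_early_wake(void)
{
	struct node_throttle t;
	pthread_t tid;
	bool early;

	throttle_init(&t, 8);
	pthread_create(&tid, NULL, writer, &t);
	early = throttle_wait(&t, 5000);
	pthread_join(tid, NULL);
	return early;
}
```

With no writer running, throttle_wait() returns false once the timeout
expires; with the simulated writer it returns true well before the
deadline.  Pages are only counted while at least one task is throttled,
which mirrors the "pages written back since throttling started" wording
above.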
[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]

Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
1041 lines
26 KiB
C
// SPDX-License-Identifier: GPL-2.0-only

#include <linux/wait.h>
#include <linux/rbtree.h>
#include <linux/backing-dev.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/writeback.h>
#include <linux/device.h>
#include <trace/events/writeback.h>

struct backing_dev_info noop_backing_dev_info;
EXPORT_SYMBOL_GPL(noop_backing_dev_info);

static struct class *bdi_class;
static const char *bdi_unknown_name = "(unknown)";

/*
 * bdi_lock protects bdi_tree and updates to bdi_list. bdi_list has RCU
 * reader side locking.
 */
DEFINE_SPINLOCK(bdi_lock);
static u64 bdi_id_cursor;
static struct rb_root bdi_tree = RB_ROOT;
LIST_HEAD(bdi_list);

/* bdi_wq serves all asynchronous writeback tasks */
struct workqueue_struct *bdi_wq;

#define K(x) ((x) << (PAGE_SHIFT - 10))

#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
#include <linux/seq_file.h>

static struct dentry *bdi_debug_root;

static void bdi_debug_init(void)
{
	bdi_debug_root = debugfs_create_dir("bdi", NULL);
}

static int bdi_debug_stats_show(struct seq_file *m, void *v)
{
	struct backing_dev_info *bdi = m->private;
	struct bdi_writeback *wb = &bdi->wb;
	unsigned long background_thresh;
	unsigned long dirty_thresh;
	unsigned long wb_thresh;
	unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
	struct inode *inode;

	nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
	spin_lock(&wb->list_lock);
	list_for_each_entry(inode, &wb->b_dirty, i_io_list)
		nr_dirty++;
	list_for_each_entry(inode, &wb->b_io, i_io_list)
		nr_io++;
	list_for_each_entry(inode, &wb->b_more_io, i_io_list)
		nr_more_io++;
	list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
		if (inode->i_state & I_DIRTY_TIME)
			nr_dirty_time++;
	spin_unlock(&wb->list_lock);

	global_dirty_limits(&background_thresh, &dirty_thresh);
	wb_thresh = wb_calc_thresh(wb, dirty_thresh);

	seq_printf(m,
		   "BdiWriteback: %10lu kB\n"
		   "BdiReclaimable: %10lu kB\n"
		   "BdiDirtyThresh: %10lu kB\n"
		   "DirtyThresh: %10lu kB\n"
		   "BackgroundThresh: %10lu kB\n"
		   "BdiDirtied: %10lu kB\n"
		   "BdiWritten: %10lu kB\n"
		   "BdiWriteBandwidth: %10lu kBps\n"
		   "b_dirty: %10lu\n"
		   "b_io: %10lu\n"
		   "b_more_io: %10lu\n"
		   "b_dirty_time: %10lu\n"
		   "bdi_list: %10u\n"
		   "state: %10lx\n",
		   (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
		   (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
		   K(wb_thresh),
		   K(dirty_thresh),
		   K(background_thresh),
		   (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
		   (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
		   (unsigned long) K(wb->write_bandwidth),
		   nr_dirty,
		   nr_io,
		   nr_more_io,
		   nr_dirty_time,
		   !list_empty(&bdi->bdi_list), bdi->wb.state);

	return 0;
}
DEFINE_SHOW_ATTRIBUTE(bdi_debug_stats);

static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
{
	bdi->debug_dir = debugfs_create_dir(name, bdi_debug_root);

	debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
			    &bdi_debug_stats_fops);
}

static void bdi_debug_unregister(struct backing_dev_info *bdi)
{
	debugfs_remove_recursive(bdi->debug_dir);
}
#else
static inline void bdi_debug_init(void)
{
}
static inline void bdi_debug_register(struct backing_dev_info *bdi,
				      const char *name)
{
}
static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
{
}
#endif

static ssize_t read_ahead_kb_store(struct device *dev,
				   struct device_attribute *attr,
				   const char *buf, size_t count)
{
	struct backing_dev_info *bdi = dev_get_drvdata(dev);
	unsigned long read_ahead_kb;
	ssize_t ret;

	ret = kstrtoul(buf, 10, &read_ahead_kb);
	if (ret < 0)
		return ret;

	bdi->ra_pages = read_ahead_kb >> (PAGE_SHIFT - 10);

	return count;
}

#define BDI_SHOW(name, expr)						\
static ssize_t name##_show(struct device *dev,				\
			   struct device_attribute *attr, char *buf)	\
{									\
	struct backing_dev_info *bdi = dev_get_drvdata(dev);		\
									\
	return sysfs_emit(buf, "%lld\n", (long long)expr);		\
}									\
static DEVICE_ATTR_RW(name);

BDI_SHOW(read_ahead_kb, K(bdi->ra_pages))

static ssize_t min_ratio_store(struct device *dev,
		struct device_attribute *attr, const char *buf, size_t count)
{
	struct backing_dev_info *bdi = dev_get_drvdata(dev);
	unsigned int ratio;
	ssize_t ret;

	ret = kstrtouint(buf, 10, &ratio);
	if (ret < 0)
		return ret;

	ret = bdi_set_min_ratio(bdi, ratio);
	if (!ret)
		ret = count;

	return ret;
}
BDI_SHOW(min_ratio, bdi->min_ratio)

static ssize_t max_ratio_store(struct device *dev,
		struct device_attribute *attr, const char *buf, size_t count)
{
	struct backing_dev_info *bdi = dev_get_drvdata(dev);
	unsigned int ratio;
	ssize_t ret;

	ret = kstrtouint(buf, 10, &ratio);
	if (ret < 0)
		return ret;

	ret = bdi_set_max_ratio(bdi, ratio);
	if (!ret)
		ret = count;

	return ret;
}
BDI_SHOW(max_ratio, bdi->max_ratio)

static ssize_t stable_pages_required_show(struct device *dev,
					  struct device_attribute *attr,
					  char *buf)
{
	dev_warn_once(dev,
		"the stable_pages_required attribute has been removed. Use the stable_writes queue attribute instead.\n");
	return sysfs_emit(buf, "%d\n", 0);
}
static DEVICE_ATTR_RO(stable_pages_required);

static struct attribute *bdi_dev_attrs[] = {
	&dev_attr_read_ahead_kb.attr,
	&dev_attr_min_ratio.attr,
	&dev_attr_max_ratio.attr,
	&dev_attr_stable_pages_required.attr,
	NULL,
};
ATTRIBUTE_GROUPS(bdi_dev);

static __init int bdi_class_init(void)
{
	bdi_class = class_create(THIS_MODULE, "bdi");
	if (IS_ERR(bdi_class))
		return PTR_ERR(bdi_class);

	bdi_class->dev_groups = bdi_dev_groups;
	bdi_debug_init();

	return 0;
}
postcore_initcall(bdi_class_init);

static int bdi_init(struct backing_dev_info *bdi);

static int __init default_bdi_init(void)
{
	int err;

	bdi_wq = alloc_workqueue("writeback", WQ_MEM_RECLAIM | WQ_UNBOUND |
				 WQ_SYSFS, 0);
	if (!bdi_wq)
		return -ENOMEM;

	err = bdi_init(&noop_backing_dev_info);

	return err;
}
subsys_initcall(default_bdi_init);

/*
 * This function is used when the first inode for this wb is marked dirty. It
 * wakes up the corresponding bdi thread, which should then take care of the
 * periodic background write-out of dirty inodes. Since the write-out would
 * start only 'dirty_writeback_interval' centisecs from now anyway, we just
 * set up a timer which wakes the bdi thread up later.
 *
 * Note, we wouldn't bother setting up the timer, but this function is on the
 * fast-path (used by '__mark_inode_dirty()'), so we save a few context
 * switches by delaying the wake-up.
 *
 * We have to be careful not to postpone flush work if it is scheduled for
 * earlier. Thus we use queue_delayed_work().
 */
void wb_wakeup_delayed(struct bdi_writeback *wb)
{
	unsigned long timeout;

	timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
	spin_lock_bh(&wb->work_lock);
	if (test_bit(WB_registered, &wb->state))
		queue_delayed_work(bdi_wq, &wb->dwork, timeout);
	spin_unlock_bh(&wb->work_lock);
}

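The `* 10` in wb_wakeup_delayed() converts centiseconds to milliseconds before msecs_to_jiffies() is applied. A minimal userspace sketch of that unit conversion follows; `centisecs_to_msecs()` is a hypothetical helper introduced only for illustration and does not exist in the kernel.

```c
#include <assert.h>
#include <limits.h>

/*
 * Hypothetical helper (not a kernel API): one centisecond is 10 ms,
 * which is the conversion wb_wakeup_delayed() performs inline.
 */
static unsigned int centisecs_to_msecs(unsigned int centisecs)
{
	/* Guard against overflow of the multiplication in this sketch. */
	assert(centisecs <= UINT_MAX / 10);
	return centisecs * 10;
}
```

For example, an interval of 500 centiseconds corresponds to a 5000 ms delay before the flusher work is queued.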
static void wb_update_bandwidth_workfn(struct work_struct *work)
{
	struct bdi_writeback *wb = container_of(to_delayed_work(work),
						struct bdi_writeback, bw_dwork);

	wb_update_bandwidth(wb);
}

/*
 * Initial write bandwidth: 100 MB/s
 */
#define INIT_BW		(100 << (20 - PAGE_SHIFT))

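INIT_BW expresses 100 MB/s in pages per second: 1 MB is `1 << 20` bytes and a page is `1 << PAGE_SHIFT` bytes, so shifting by `20 - PAGE_SHIFT` converts megabytes to pages. The sketch below works through that arithmetic in userspace; `mb_per_sec_to_pages()` is a hypothetical helper, and the page shifts used are illustrative assumptions rather than values taken from this file.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel API): convert a bandwidth in MB/s
 * into pages/s for a given page shift, mirroring the INIT_BW expression.
 */
static unsigned long mb_per_sec_to_pages(unsigned long mb,
					 unsigned int page_shift)
{
	/* This sketch only handles pages of at most 1 MiB. */
	assert(page_shift <= 20);
	return mb << (20 - page_shift);
}
```

With 4 KiB pages (a page shift of 12), 100 MB/s works out to 25600 pages per second; with 64 KiB pages (a page shift of 16) it is 1600.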
static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
		   gfp_t gfp)
{
	int i, err;

	memset(wb, 0, sizeof(*wb));

	wb->bdi = bdi;
	wb->last_old_flush = jiffies;
	INIT_LIST_HEAD(&wb->b_dirty);
	INIT_LIST_HEAD(&wb->b_io);
	INIT_LIST_HEAD(&wb->b_more_io);
	INIT_LIST_HEAD(&wb->b_dirty_time);
	spin_lock_init(&wb->list_lock);

	atomic_set(&wb->writeback_inodes, 0);
	wb->bw_time_stamp = jiffies;
	wb->balanced_dirty_ratelimit = INIT_BW;
	wb->dirty_ratelimit = INIT_BW;
	wb->write_bandwidth = INIT_BW;
	wb->avg_write_bandwidth = INIT_BW;

	spin_lock_init(&wb->work_lock);
	INIT_LIST_HEAD(&wb->work_list);
	INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
	INIT_DELAYED_WORK(&wb->bw_dwork, wb_update_bandwidth_workfn);
	wb->dirty_sleep = jiffies;

	err = fprop_local_init_percpu(&wb->completions, gfp);
	if (err)
		return err;

	for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
		err = percpu_counter_init(&wb->stat[i], 0, gfp);
		if (err)
			goto out_destroy_stat;
	}

	return 0;

out_destroy_stat:
	while (i--)
		percpu_counter_destroy(&wb->stat[i]);
	fprop_local_destroy_percpu(&wb->completions);
	return err;
}

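The error path in wb_init() uses `while (i--)` so that only the counters initialised before the failure are destroyed. The following userspace sketch shows the same partial-initialisation unwind; `item_init()`, `item_destroy()`, `init_all()` and `NR_ITEMS` are hypothetical stand-ins for the percpu counter calls, not kernel interfaces.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_ITEMS 4	/* hypothetical stand-in for NR_WB_STAT_ITEMS */

/* Hypothetical initialiser: fails on purpose when idx == fail_at. */
static int item_init(int **slot, int idx, int fail_at)
{
	if (idx == fail_at)
		return -1;
	*slot = malloc(sizeof(**slot));
	return *slot ? 0 : -1;
}

static void item_destroy(int **slot)
{
	free(*slot);
	*slot = NULL;
}

/*
 * Initialise every slot.  On failure at index i, tear down slots
 * i-1 .. 0 and leave slot i (never initialised) untouched.
 */
static int init_all(int *items[NR_ITEMS], int fail_at)
{
	int i, err;

	assert(items != NULL);

	for (i = 0; i < NR_ITEMS; i++) {
		err = item_init(&items[i], i, fail_at);
		if (err)
			goto out_destroy;
	}
	return 0;

out_destroy:
	while (i--)
		item_destroy(&items[i]);
	return err;
}
```

Because `i` still holds the index of the failed slot when the label is reached, the post-decrement visits exactly the slots that were successfully set up, in reverse order.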
static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb);

/*
 * Remove bdi from the global list and shut down any threads we have running
 */
static void wb_shutdown(struct bdi_writeback *wb)
{
	/* Make sure nobody queues further work */
	spin_lock_bh(&wb->work_lock);
	if (!test_and_clear_bit(WB_registered, &wb->state)) {
		spin_unlock_bh(&wb->work_lock);
		return;
	}
	spin_unlock_bh(&wb->work_lock);

	cgwb_remove_from_bdi_list(wb);
	/*
	 * Drain work list and shut down the delayed_work.  !WB_registered
	 * tells wb_workfn() that @wb is dying and its work_list needs to
	 * be drained no matter what.
	 */
	mod_delayed_work(bdi_wq, &wb->dwork, 0);
	flush_delayed_work(&wb->dwork);
	WARN_ON(!list_empty(&wb->work_list));
	flush_delayed_work(&wb->bw_dwork);
}

static void wb_exit(struct bdi_writeback *wb)
{
	int i;

	WARN_ON(delayed_work_pending(&wb->dwork));

	for (i = 0; i < NR_WB_STAT_ITEMS; i++)
		percpu_counter_destroy(&wb->stat[i]);

	fprop_local_destroy_percpu(&wb->completions);
}

#ifdef CONFIG_CGROUP_WRITEBACK

#include <linux/memcontrol.h>

/*
 * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
 * memcg->cgwb_list.  bdi->cgwb_tree is also RCU protected.
 */
static DEFINE_SPINLOCK(cgwb_lock);
static struct workqueue_struct *cgwb_release_wq;

static LIST_HEAD(offline_cgwbs);
static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);

static void cgwb_release_workfn(struct work_struct *work)
{
	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
						release_work);
	struct blkcg *blkcg = css_to_blkcg(wb->blkcg_css);
	struct backing_dev_info *bdi = wb->bdi;

	mutex_lock(&wb->bdi->cgwb_release_mutex);
	wb_shutdown(wb);

	css_put(wb->memcg_css);
	css_put(wb->blkcg_css);
	mutex_unlock(&wb->bdi->cgwb_release_mutex);

	/* triggers blkg destruction if no online users left */
	blkcg_unpin_online(blkcg);

	fprop_local_destroy_percpu(&wb->memcg_completions);

	spin_lock_irq(&cgwb_lock);
	list_del(&wb->offline_node);
	spin_unlock_irq(&cgwb_lock);

	percpu_ref_exit(&wb->refcnt);
	wb_exit(wb);
	bdi_put(bdi);
	WARN_ON_ONCE(!list_empty(&wb->b_attached));
	kfree_rcu(wb, rcu);
}

static void cgwb_release(struct percpu_ref *refcnt)
{
	struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
						refcnt);
	queue_work(cgwb_release_wq, &wb->release_work);
}

static void cgwb_kill(struct bdi_writeback *wb)
{
	lockdep_assert_held(&cgwb_lock);

	WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
	list_del(&wb->memcg_node);
	list_del(&wb->blkcg_node);
	list_add(&wb->offline_node, &offline_cgwbs);
	percpu_ref_kill(&wb->refcnt);
}

static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
{
	spin_lock_irq(&cgwb_lock);
	list_del_rcu(&wb->bdi_node);
	spin_unlock_irq(&cgwb_lock);
}

static int cgwb_create(struct backing_dev_info *bdi,
		       struct cgroup_subsys_state *memcg_css, gfp_t gfp)
{
	struct mem_cgroup *memcg;
	struct cgroup_subsys_state *blkcg_css;
	struct blkcg *blkcg;
	struct list_head *memcg_cgwb_list, *blkcg_cgwb_list;
	struct bdi_writeback *wb;
	unsigned long flags;
	int ret = 0;

	memcg = mem_cgroup_from_css(memcg_css);
	blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys);
	blkcg = css_to_blkcg(blkcg_css);
	memcg_cgwb_list = &memcg->cgwb_list;
	blkcg_cgwb_list = &blkcg->cgwb_list;

	/* look up again under lock and discard on blkcg mismatch */
	spin_lock_irqsave(&cgwb_lock, flags);
	wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
	if (wb && wb->blkcg_css != blkcg_css) {
		cgwb_kill(wb);
		wb = NULL;
	}
	spin_unlock_irqrestore(&cgwb_lock, flags);
	if (wb)
		goto out_put;

	/* need to create a new one */
	wb = kmalloc(sizeof(*wb), gfp);
	if (!wb) {
		ret = -ENOMEM;
		goto out_put;
	}

	ret = wb_init(wb, bdi, gfp);
	if (ret)
		goto err_free;

	ret = percpu_ref_init(&wb->refcnt, cgwb_release, 0, gfp);
	if (ret)
		goto err_wb_exit;

	ret = fprop_local_init_percpu(&wb->memcg_completions, gfp);
	if (ret)
		goto err_ref_exit;

	wb->memcg_css = memcg_css;
	wb->blkcg_css = blkcg_css;
	INIT_LIST_HEAD(&wb->b_attached);
	INIT_WORK(&wb->release_work, cgwb_release_workfn);
	set_bit(WB_registered, &wb->state);
	bdi_get(bdi);

	/*
	 * The root wb determines the registered state of the whole bdi and
	 * memcg_cgwb_list and blkcg_cgwb_list's next pointers indicate
	 * whether they're still online.  Don't link @wb if any is dead.
	 * See wb_memcg_offline() and wb_blkcg_offline().
	 */
	ret = -ENODEV;
	spin_lock_irqsave(&cgwb_lock, flags);
	if (test_bit(WB_registered, &bdi->wb.state) &&
	    blkcg_cgwb_list->next && memcg_cgwb_list->next) {
		/* we might have raced another instance of this function */
		ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
		if (!ret) {
			list_add_tail_rcu(&wb->bdi_node, &bdi->wb_list);
			list_add(&wb->memcg_node, memcg_cgwb_list);
			list_add(&wb->blkcg_node, blkcg_cgwb_list);
			blkcg_pin_online(blkcg);
			css_get(memcg_css);
			css_get(blkcg_css);
		}
	}
	spin_unlock_irqrestore(&cgwb_lock, flags);
	if (ret) {
		if (ret == -EEXIST)
			ret = 0;
		goto err_fprop_exit;
	}
	goto out_put;

err_fprop_exit:
	bdi_put(bdi);
	fprop_local_destroy_percpu(&wb->memcg_completions);
err_ref_exit:
	percpu_ref_exit(&wb->refcnt);
err_wb_exit:
	wb_exit(wb);
err_free:
	kfree(wb);
out_put:
	css_put(blkcg_css);
	return ret;
}

/**
 * wb_get_lookup - get wb for a given memcg
 * @bdi: target bdi
 * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
 *
 * Try to get the wb for @memcg_css on @bdi.  The returned wb has its
 * refcount incremented.
 *
 * This function uses css_get() on @memcg_css and thus expects its refcnt
 * to be positive on invocation.  IOW, rcu_read_lock() protection on
 * @memcg_css isn't enough.  try_get it before calling this function.
 *
 * A wb is keyed by its associated memcg.  As blkcg implicitly enables
 * memcg on the default hierarchy, memcg association is guaranteed to be
 * more specific (equal or descendant to the associated blkcg) and thus can
 * identify both the memcg and blkcg associations.
 *
 * Because the blkcg associated with a memcg may change as blkcg is enabled
 * and disabled closer to root in the hierarchy, each wb keeps track of
 * both the memcg and blkcg associated with it and verifies the blkcg on
 * each lookup.  On mismatch, the existing wb is discarded and a new one is
 * created.
 */
struct bdi_writeback *wb_get_lookup(struct backing_dev_info *bdi,
				    struct cgroup_subsys_state *memcg_css)
{
	struct bdi_writeback *wb;

	if (!memcg_css->parent)
		return &bdi->wb;

	rcu_read_lock();
	wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
	if (wb) {
		struct cgroup_subsys_state *blkcg_css;

		/* see whether the blkcg association has changed */
		blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &io_cgrp_subsys);
		if (unlikely(wb->blkcg_css != blkcg_css || !wb_tryget(wb)))
			wb = NULL;
		css_put(blkcg_css);
	}
	rcu_read_unlock();

	return wb;
}

/**
 * wb_get_create - get wb for a given memcg, create if necessary
 * @bdi: target bdi
 * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
 * @gfp: allocation mask to use
 *
 * Try to get the wb for @memcg_css on @bdi.  If it doesn't exist, try to
 * create one.  See wb_get_lookup() for more details.
 */
struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
				    struct cgroup_subsys_state *memcg_css,
				    gfp_t gfp)
{
	struct bdi_writeback *wb;

	might_alloc(gfp);

	if (!memcg_css->parent)
		return &bdi->wb;

	do {
		wb = wb_get_lookup(bdi, memcg_css);
	} while (!wb && !cgwb_create(bdi, memcg_css, gfp));

	return wb;
}

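The do/while loop in wb_get_create() retries the lookup after every successful create and stops as soon as either the lookup finds an entry or the create step reports an error. A minimal userspace sketch of that lookup-or-create shape is shown below; `struct cache` and the `cache_*()` functions are hypothetical and model only the control flow, not the locking, refcounting or RCU of the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-slot cache standing in for bdi->cgwb_tree. */
struct cache {
	int present;	/* non-zero once an entry exists */
	int value;	/* the cached object */
	int create_err;	/* error that cache_create() should return, or 0 */
};

static int *cache_lookup(struct cache *c)
{
	return c->present ? &c->value : NULL;
}

/* Returns 0 on success, a negative error code otherwise. */
static int cache_create(struct cache *c)
{
	if (c->create_err)
		return c->create_err;
	c->present = 1;
	c->value = 42;
	return 0;
}

/* Look up; if missing, create and look up again; give up on error. */
static int *cache_get_create(struct cache *c)
{
	int *v;

	assert(c != NULL);

	do {
		v = cache_lookup(c);
	} while (!v && !cache_create(c));

	return v;
}
```

The caller only ever receives an object obtained through the lookup path, so in the kernel version the same reference-taking and validity checks apply whether the entry already existed or was just created.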
static int cgwb_bdi_init(struct backing_dev_info *bdi)
{
	int ret;

	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
	mutex_init(&bdi->cgwb_release_mutex);
	init_rwsem(&bdi->wb_switch_rwsem);

	ret = wb_init(&bdi->wb, bdi, GFP_KERNEL);
	if (!ret) {
		bdi->wb.memcg_css = &root_mem_cgroup->css;
		bdi->wb.blkcg_css = blkcg_root_css;
	}
	return ret;
}

static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
{
	struct radix_tree_iter iter;
	void **slot;
	struct bdi_writeback *wb;

	WARN_ON(test_bit(WB_registered, &bdi->wb.state));

	spin_lock_irq(&cgwb_lock);
	radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
		cgwb_kill(*slot);
	spin_unlock_irq(&cgwb_lock);

	mutex_lock(&bdi->cgwb_release_mutex);
	spin_lock_irq(&cgwb_lock);
	while (!list_empty(&bdi->wb_list)) {
		wb = list_first_entry(&bdi->wb_list, struct bdi_writeback,
				      bdi_node);
		spin_unlock_irq(&cgwb_lock);
		wb_shutdown(wb);
		spin_lock_irq(&cgwb_lock);
	}
	spin_unlock_irq(&cgwb_lock);
	mutex_unlock(&bdi->cgwb_release_mutex);
}

/*
 * cleanup_offline_cgwbs_workfn - try to release dying cgwbs
 *
 * Try to release dying cgwbs by switching attached inodes to the nearest
 * living ancestor's writeback. Processed wbs are placed at the end
 * of the list to guarantee forward progress.
 */
static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
{
	struct bdi_writeback *wb;
	LIST_HEAD(processed);

	spin_lock_irq(&cgwb_lock);

	while (!list_empty(&offline_cgwbs)) {
		wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
				      offline_node);
		list_move(&wb->offline_node, &processed);

		/*
		 * If wb is dirty, cleaning up the writeback by switching
		 * attached inodes will result in an effective removal of any
		 * bandwidth restrictions, which isn't the goal.  Instead,
		 * it can be postponed until the next time, when all io
		 * will likely be completed.  If in the meantime some inodes
		 * get re-dirtied, they should eventually be switched to
		 * a new cgwb.
		 */
		if (wb_has_dirty_io(wb))
			continue;

		if (!wb_tryget(wb))
			continue;

		spin_unlock_irq(&cgwb_lock);
		while (cleanup_offline_cgwb(wb))
			cond_resched();
		spin_lock_irq(&cgwb_lock);

		wb_put(wb);
	}

	if (!list_empty(&processed))
		list_splice_tail(&processed, &offline_cgwbs);

	spin_unlock_irq(&cgwb_lock);
}

/**
 * wb_memcg_offline - kill all wb's associated with a memcg being offlined
 * @memcg: memcg being offlined
 *
 * Also prevents creation of any new wb's associated with @memcg.
 */
void wb_memcg_offline(struct mem_cgroup *memcg)
{
	struct list_head *memcg_cgwb_list = &memcg->cgwb_list;
	struct bdi_writeback *wb, *next;

	spin_lock_irq(&cgwb_lock);
	list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
		cgwb_kill(wb);
	memcg_cgwb_list->next = NULL;	/* prevent new wb's */
	spin_unlock_irq(&cgwb_lock);

	queue_work(system_unbound_wq, &cleanup_offline_cgwbs_work);
}

/**
 * wb_blkcg_offline - kill all wb's associated with a blkcg being offlined
 * @blkcg: blkcg being offlined
 *
 * Also prevents creation of any new wb's associated with @blkcg.
 */
void wb_blkcg_offline(struct blkcg *blkcg)
{
	struct bdi_writeback *wb, *next;

	spin_lock_irq(&cgwb_lock);
	list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node)
		cgwb_kill(wb);
	blkcg->cgwb_list.next = NULL;	/* prevent new wb's */
	spin_unlock_irq(&cgwb_lock);
}

static void cgwb_bdi_register(struct backing_dev_info *bdi)
{
	spin_lock_irq(&cgwb_lock);
	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
	spin_unlock_irq(&cgwb_lock);
}

static int __init cgwb_init(void)
{
	/*
	 * There can be many concurrent release work items overwhelming
	 * system_wq.  Put them in a separate wq and limit concurrency.
	 * There's no point in executing many of these in parallel.
	 */
	cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
	if (!cgwb_release_wq)
		return -ENOMEM;

	return 0;
}
subsys_initcall(cgwb_init);

#else	/* CONFIG_CGROUP_WRITEBACK */

static int cgwb_bdi_init(struct backing_dev_info *bdi)
{
	return wb_init(&bdi->wb, bdi, GFP_KERNEL);
}

static void cgwb_bdi_unregister(struct backing_dev_info *bdi) { }

static void cgwb_bdi_register(struct backing_dev_info *bdi)
{
	list_add_tail_rcu(&bdi->wb.bdi_node, &bdi->wb_list);
}

static void cgwb_remove_from_bdi_list(struct bdi_writeback *wb)
{
	list_del_rcu(&wb->bdi_node);
}

#endif	/* CONFIG_CGROUP_WRITEBACK */

static int bdi_init(struct backing_dev_info *bdi)
{
	int ret;

	bdi->dev = NULL;

	kref_init(&bdi->refcnt);
	bdi->min_ratio = 0;
	bdi->max_ratio = 100;
	bdi->max_prop_frac = FPROP_FRAC_BASE;
	INIT_LIST_HEAD(&bdi->bdi_list);
	INIT_LIST_HEAD(&bdi->wb_list);
	init_waitqueue_head(&bdi->wb_waitq);

	ret = cgwb_bdi_init(bdi);

	return ret;
}

struct backing_dev_info *bdi_alloc(int node_id)
{
	struct backing_dev_info *bdi;

	bdi = kzalloc_node(sizeof(*bdi), GFP_KERNEL, node_id);
	if (!bdi)
		return NULL;

	if (bdi_init(bdi)) {
		kfree(bdi);
		return NULL;
	}
	bdi->capabilities = BDI_CAP_WRITEBACK | BDI_CAP_WRITEBACK_ACCT;
	bdi->ra_pages = VM_READAHEAD_PAGES;
	bdi->io_pages = VM_READAHEAD_PAGES;
	timer_setup(&bdi->laptop_mode_wb_timer, laptop_mode_timer_fn, 0);
	return bdi;
}
EXPORT_SYMBOL(bdi_alloc);

static struct rb_node **bdi_lookup_rb_node(u64 id, struct rb_node **parentp)
{
	struct rb_node **p = &bdi_tree.rb_node;
	struct rb_node *parent = NULL;
	struct backing_dev_info *bdi;

	lockdep_assert_held(&bdi_lock);

	while (*p) {
		parent = *p;
		bdi = rb_entry(parent, struct backing_dev_info, rb_node);

		if (bdi->id > id)
			p = &(*p)->rb_left;
		else if (bdi->id < id)
			p = &(*p)->rb_right;
		else
			break;
	}

	if (parentp)
		*parentp = parent;
	return p;
}

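bdi_lookup_rb_node() returns a pointer to the link where a node with the given id either lives or would be attached, which lets the same descent serve both lookup and insertion. The sketch below illustrates that pointer-to-link technique on a plain, unbalanced binary search tree in userspace; `struct node`, `lookup_link()` and `insert()` are hypothetical names, and the rebalancing that the kernel rbtree performs through rb_insert_color() is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plain BST node; the kernel uses struct rb_node instead. */
struct node {
	unsigned long long id;
	struct node *left, *right;
};

/*
 * Descend from *root looking for id.  Returns the link that points at
 * the matching node, or the empty link where such a node would go.
 * The last node visited is stored in *parentp when parentp is non-NULL.
 */
static struct node **lookup_link(struct node **root, unsigned long long id,
				 struct node **parentp)
{
	struct node **p = root;
	struct node *parent = NULL;

	assert(root != NULL);

	while (*p) {
		parent = *p;
		if (parent->id > id)
			p = &(*p)->left;
		else if (parent->id < id)
			p = &(*p)->right;
		else
			break;
	}

	if (parentp)
		*parentp = parent;
	return p;
}

/* Insert n unless its id is already present; returns 0 or -1. */
static int insert(struct node **root, struct node *n)
{
	struct node **p = lookup_link(root, n->id, NULL);

	if (*p)
		return -1;
	n->left = n->right = NULL;
	*p = n;
	return 0;
}
```

A caller that only wants to search checks whether the returned link is non-empty, as bdi_get_by_id() does, while a caller that wants to insert writes the new node through the returned link.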
/**
 * bdi_get_by_id - lookup and get bdi from its id
 * @id: bdi id to lookup
 *
 * Find bdi matching @id and get it.  Returns NULL if the matching bdi
 * doesn't exist or is already unregistered.
 */
struct backing_dev_info *bdi_get_by_id(u64 id)
{
	struct backing_dev_info *bdi = NULL;
	struct rb_node **p;

	spin_lock_bh(&bdi_lock);
	p = bdi_lookup_rb_node(id, NULL);
	if (*p) {
		bdi = rb_entry(*p, struct backing_dev_info, rb_node);
		bdi_get(bdi);
	}
	spin_unlock_bh(&bdi_lock);

	return bdi;
}

int bdi_register_va(struct backing_dev_info *bdi, const char *fmt, va_list args)
{
	struct device *dev;
	struct rb_node *parent, **p;

	if (bdi->dev)	/* The driver needs to use separate queues per device */
		return 0;

	vsnprintf(bdi->dev_name, sizeof(bdi->dev_name), fmt, args);
	dev = device_create(bdi_class, NULL, MKDEV(0, 0), bdi, bdi->dev_name);
	if (IS_ERR(dev))
		return PTR_ERR(dev);

	cgwb_bdi_register(bdi);
	bdi->dev = dev;

	bdi_debug_register(bdi, dev_name(dev));
	set_bit(WB_registered, &bdi->wb.state);

	spin_lock_bh(&bdi_lock);

	bdi->id = ++bdi_id_cursor;

	p = bdi_lookup_rb_node(bdi->id, &parent);
	rb_link_node(&bdi->rb_node, parent, p);
	rb_insert_color(&bdi->rb_node, &bdi_tree);

	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);

	spin_unlock_bh(&bdi_lock);

	trace_writeback_bdi_register(bdi);
	return 0;
}

int bdi_register(struct backing_dev_info *bdi, const char *fmt, ...)
{
	va_list args;
	int ret;

	va_start(args, fmt);
	ret = bdi_register_va(bdi, fmt, args);
	va_end(args);
	return ret;
}
EXPORT_SYMBOL(bdi_register);

void bdi_set_owner(struct backing_dev_info *bdi, struct device *owner)
{
	WARN_ON_ONCE(bdi->owner);
	bdi->owner = owner;
	get_device(owner);
}

/*
 * Remove bdi from bdi_list, and ensure that it is no longer visible
 */
static void bdi_remove_from_list(struct backing_dev_info *bdi)
{
	spin_lock_bh(&bdi_lock);
	rb_erase(&bdi->rb_node, &bdi_tree);
	list_del_rcu(&bdi->bdi_list);
	spin_unlock_bh(&bdi_lock);

	synchronize_rcu_expedited();
}

void bdi_unregister(struct backing_dev_info *bdi)
{
	del_timer_sync(&bdi->laptop_mode_wb_timer);

	/* make sure nobody finds us on the bdi_list anymore */
	bdi_remove_from_list(bdi);
	wb_shutdown(&bdi->wb);
	cgwb_bdi_unregister(bdi);

	if (bdi->dev) {
		bdi_debug_unregister(bdi);
		device_unregister(bdi->dev);
		bdi->dev = NULL;
	}

	if (bdi->owner) {
		put_device(bdi->owner);
		bdi->owner = NULL;
	}
}
EXPORT_SYMBOL(bdi_unregister);

static void release_bdi(struct kref *ref)
{
	struct backing_dev_info *bdi =
			container_of(ref, struct backing_dev_info, refcnt);

	WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb.state));
	WARN_ON_ONCE(bdi->dev);
	wb_exit(&bdi->wb);
	kfree(bdi);
}

void bdi_put(struct backing_dev_info *bdi)
{
	kref_put(&bdi->refcnt, release_bdi);
}
EXPORT_SYMBOL(bdi_put);

const char *bdi_dev_name(struct backing_dev_info *bdi)
{
	if (!bdi || !bdi->dev)
		return bdi_unknown_name;
	return bdi->dev_name;
}
EXPORT_SYMBOL_GPL(bdi_dev_name);

static wait_queue_head_t congestion_wqh[2] = {
	__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
	__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
static atomic_t nr_wb_congested[2];

void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
	wait_queue_head_t *wqh = &congestion_wqh[sync];
	enum wb_congested_state bit;

	bit = sync ? WB_sync_congested : WB_async_congested;
	if (test_and_clear_bit(bit, &bdi->wb.congested))
		atomic_dec(&nr_wb_congested[sync]);
	smp_mb__after_atomic();
	if (waitqueue_active(wqh))
		wake_up(wqh);
}
EXPORT_SYMBOL(clear_bdi_congested);

void set_bdi_congested(struct backing_dev_info *bdi, int sync)
{
	enum wb_congested_state bit;

	bit = sync ? WB_sync_congested : WB_async_congested;
	if (!test_and_set_bit(bit, &bdi->wb.congested))
		atomic_inc(&nr_wb_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);

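set_bdi_congested() and clear_bdi_congested() adjust nr_wb_congested only when the congested bit actually changes state, so repeated calls do not skew the counter. A rough userspace sketch of that transition-only counting, using C11 atomics in place of the kernel bitops, is given below; `mark_congested()` and `clear_congested()` are hypothetical names, and the wait queue wake-up and memory barrier of the real code are omitted.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Set the bit; bump the counter only on a 0 -> 1 transition. */
static void mark_congested(atomic_ulong *state, atomic_int *nr,
			   unsigned int bit)
{
	unsigned long mask;

	assert(bit < sizeof(unsigned long) * CHAR_BIT);
	mask = 1UL << bit;

	/* atomic_fetch_or() returns the previous value of *state. */
	if (!(atomic_fetch_or(state, mask) & mask))
		atomic_fetch_add(nr, 1);
}

/* Clear the bit; drop the counter only on a 1 -> 0 transition. */
static void clear_congested(atomic_ulong *state, atomic_int *nr,
			    unsigned int bit)
{
	unsigned long mask;

	assert(bit < sizeof(unsigned long) * CHAR_BIT);
	mask = 1UL << bit;

	if (atomic_fetch_and(state, ~mask) & mask)
		atomic_fetch_sub(nr, 1);
}
```

Testing the previous value returned by the read-modify-write operation is what keeps the counter consistent when two callers race to set or clear the same bit.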
/**
 * congestion_wait - wait for a backing_dev to become uncongested
 * @sync: SYNC or ASYNC IO
 * @timeout: timeout in jiffies
 *
 * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
 * write congestion.  If no backing_devs are congested then just wait for the
 * next write to be completed.
 */
long congestion_wait(int sync, long timeout)
{
	long ret;
	unsigned long start = jiffies;
	DEFINE_WAIT(wait);
	wait_queue_head_t *wqh = &congestion_wqh[sync];

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	ret = io_schedule_timeout(timeout);
	finish_wait(wqh, &wait);

	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
					jiffies_to_usecs(jiffies - start));

	return ret;
}
EXPORT_SYMBOL(congestion_wait);