linux

iv/linux

Author	SHA1	Message	Date
Luwei Kang	e0018afec5	perf/x86/intel/pt: add new capability for Intel PT This adds support for "output to Trace Transport subsystem" capability of Intel PT. It means that PT can output its trace to an MMIO address range rather than system memory buffer. Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2018-12-21 11:28:33 +01:00
Luwei Kang	69843a913f	perf/x86/intel/pt: Add new bit definitions for PT MSRs Add bit definitions for Intel PT MSRs to support trace output directed to the memeory subsystem and holds a count if packet bytes that have been sent out. These are required by the upcoming PT support in KVM guests for MSRs read/write emulation. Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2018-12-21 11:28:33 +01:00
Luwei Kang	61be2998ca	perf/x86/intel/pt: Introduce intel_pt_validate_cap() intel_pt_validate_hw_cap() validates whether a given PT capability is supported by the hardware. It checks the PT capability array which reflects the capabilities of the hardware on which the code is executed. For setting up PT for KVM guests this is not correct as the capability array for the guest can be different from the host array. Provide a new function to check against a given capability array. Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2018-12-21 11:28:32 +01:00
Chao Peng	f6d079ce86	perf/x86/intel/pt: Export pt_cap_get() pt_cap_get() is required by the upcoming PT support in KVM guests. Export it and move the capabilites enum to a global header. As a global functions, "pt_" is already used for ptrace and other things, so it makes sense to use "intel_pt_" as a prefix. Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2018-12-21 11:28:32 +01:00
Chao Peng	887eda13b5	perf/x86/intel/pt: Move Intel PT MSRs bit defines to global header The Intel Processor Trace (PT) MSR bit defines are in a private header. The upcoming support for PT virtualization requires these defines to be accessible from KVM code. Move them to the global MSR header file. Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Luwei Kang <luwei.kang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2018-12-21 11:28:31 +01:00
Andrew Jones	8cee58161e	kvm: selftests: aarch64: dirty_log_test: support greater than 40-bit IPAs When KVM has KVM_CAP_ARM_VM_IPA_SIZE we can test with > 40-bit IPAs by using the 'type' field of KVM_CREATE_VM. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:30 +01:00
Andrew Jones	cdbd242848	kvm: selftests: add pa-48/va-48 VM modes Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:30 +01:00
Andrew Jones	696ade770f	kvm: selftests: dirty_log_test: improve mode param management Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:29 +01:00
Andrew Jones	fd3f6f8139	kvm: selftests: dirty_log_test: reset guest test phys offset We need to reset the offset for each mode as it will change depending on the number of guest physical address bits. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:28 +01:00
Andrew Jones	6498e1da84	kvm: selftests: dirty_log_test: always use -t There's no reason not to always test the topmost physical addresses, and if the user wants to try lower addresses then '-p' (used to be '-o before this patch) can be used. Let's remove the '-t' option and just always do what it did. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:28 +01:00
Andrew Jones	d4df5a1560	kvm: selftests: dirty_log_test: don't identity map the test mem It isn't necessary and can even cause problems when testing high guest physical addresses. This patch leaves the test memory id- mapped by default, but when using '-t' the test memory virtual addresses stay the same even though the physical addresses switch to the topmost valid addresses. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:27 +01:00
Andrew Jones	b442324b58	kvm: selftests: x86_64: dirty_log_test: fix -t Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:27 +01:00
Wei Yang	bdd303cb1b	KVM: fix some typos Signed-off-by: Wei Yang <richard.weiyang@gmail.com> [Preserved the iff and a probably intentional weird bracket notation. Also dropped the style change to make a single-purpose patch. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:26 +01:00
Peng Hao	649472a169	x86/kvmclock: convert to SPDX identifiers Update the verbose license text with the matching SPDX license identifier. Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> [Changed deprecated GPL-2.0+ to GPL-2.0-or-later. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:25 +01:00
Sean Christopherson	9b7ebff23c	KVM: x86: Remove KF() macro placeholder Although well-intentioned, keeping the KF() definition as a hint for handling scattered CPUID features may be counter-productive. Simply redefining the bit position only works for directly manipulating the guest's CPUID leafs, e.g. it doesn't make guest_cpuid_has() magically work. Taking an alternative approach, e.g. ensuring the bit position is identical between the Linux-defined and hardware-defined features, may be a simpler and/or more effective method of exposing scattered features to the guest. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:25 +01:00
Jim Mattson	788fc1e9ad	kvm: vmx: Allow guest read access to IA32_TSC Let the guest read the IA32_TSC MSR with the generic RDMSR instruction as well as the specific RDTSC(P) instructions. Note that the hardware applies the TSC multiplier and offset (when applicable) to the result of RDMSR(IA32_TSC), just as it does to the result of RDTSC(P). Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:24 +01:00
Jim Mattson	9ebdfe5230	kvm: nVMX: NMI-window and interrupt-window exiting should wake L2 from HLT According to the SDM, "NMI-window exiting" VM-exits wake a logical processor from the same inactive states as would an NMI and "interrupt-window exiting" VM-exits wake a logical processor from the same inactive states as would an external interrupt. Specifically, they wake a logical processor from the shutdown state and from the states entered using the HLT and MWAIT instructions. Fixes: 6dfacadd5858 ("KVM: nVMX: Add support for activity state HLT") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> [Squashed comments of two Jim's patches and used the simplified code hunk provided by Sean. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:24 +01:00
Tambe, William	e081354d6a	KVM: nSVM: Fix nested guest support for PAUSE filtering. Currently, the nested guest's PAUSE intercept intentions are not being honored. Instead, since the L0 hypervisor's pause_filter_count and pause_filter_thresh values are still in place, these values are used instead of those programmed in the VMCB by the L1 hypervisor. To honor the desired PAUSE intercept support of the L1 hypervisor, the L0 hypervisor must use the PAUSE filtering fields of the L1 hypervisor. This requires saving and restoring of both the L0 and L1 hypervisor's PAUSE filtering fields. Signed-off-by: William Tambe <william.tambe@amd.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:23 +01:00
Jim Mattson	7a86dab8cf	kvm: Change offset in kvm_write_guest_offset_cached to unsigned Since the offset is added directly to the hva from the gfn_to_hva_cache, a negative offset could result in an out of bounds write. The existing BUG_ON only checks for addresses beyond the end of the gfn_to_hva_cache, not for addresses before the start of the gfn_to_hva_cache. Note that all current call sites have non-negative offsets. Fixes: 4ec6e8636256 ("kvm: Introduce kvm_write_guest_offset_cached()") Reported-by: Cfir Cohen <cfir@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Cfir Cohen <cfir@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:22 +01:00
Jim Mattson	f1b9dd5eb8	kvm: Disallow wraparound in kvm_gfn_to_hva_cache_init Previously, in the case where (gpa + len) wrapped around, the entire region was not validated, as the comment claimed. It doesn't actually seem that wraparound should be allowed here at all. Furthermore, since some callers don't check the return code from this function, it seems prudent to clear ghc->memslot in the event of an error. Fixes: 8f964525a121f ("KVM: Allow cross page reads and writes from cached translations.") Reported-by: Cfir Cohen <cfir@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Cfir Cohen <cfir@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Cc: Andrew Honig <ahonig@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:22 +01:00
YueHaibing	ba7424b200	KVM: VMX: Remove duplicated include from vmx.c Remove duplicated include. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:21 +01:00
Vitaly Kuznetsov	b85c32dd27	selftests: kvm: report failed stage when exit reason is unexpected When we get a report like ==== Test Assertion Failure ==== x86_64/state_test.c:157: run->exit_reason == KVM_EXIT_IO pid=955 tid=955 - Success 1 0x0000000000401350: main at state_test.c:154 2 0x00007fc31c9e9412: ?? ??:0 3 0x000000000040159d: _start at ??:? Unexpected exit reason: 8 (SHUTDOWN), it is not obvious which particular stage failed. Add the info. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:21 +01:00
Vitaly Kuznetsov	e87555e550	KVM: x86: svm: report MSR_IA32_MCG_EXT_CTL as unsupported AMD doesn't seem to implement MSR_IA32_MCG_EXT_CTL and svm code in kvm knows nothing about it, however, this MSR is among emulated_msrs and thus returned with KVM_GET_MSR_INDEX_LIST. The consequent KVM_GET_MSRS, of course, fails. Report the MSR as unsupported to not confuse userspace. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2018-12-21 11:28:20 +01:00
Paolo Bonzini	ed8e481227	KVM: x86: fix size of x86_fpu_cache objects The memory allocation in b666a4b69739 ("kvm: x86: Dynamically allocate guest_fpu", 2018-11-06) is wrong, there are other members in struct fpu before the fpregs_state union and the patch should be doing something similar to the code in fpu__init_task_struct_size. It's enough to run a guest and then rmmod kvm to see slub errors which are actually caused by memory corruption. For now let's revert it to sizeof(struct fpu), which is conservative. I have plans to move fsave/fxsave/xsave directly in KVM, without using the kernel FPU helpers, and once it's done, the size of the object in the cache will be something like kvm_xstate_size. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2018-12-21 11:28:19 +01:00
Mantas Mikulėnas	7a71712293	Input: synaptics - enable SMBus for HP EliteBook 840 G4 dmesg reports that "Your touchpad (PNP: SYN3052 SYN0100 SYN0002 PNP0f13) says it can support a different bus." I've tested the offered psmouse.synaptics_intertouch=1 with 4.18.x and 4.19.x and it seems to work well. No problems seen with suspend/resume. Also, it appears that RMI/SMBus mode is actually required for 3-4 finger multitouch gestures to work -- otherwise they are not reported at all. Information from dmesg in both modes: psmouse serio3: synaptics: Touchpad model: 1, fw: 8.2, id: 0x1e2b1, caps: 0xf00123/0x840300/0x2e800/0x0, board id: 3139, fw id: 2000742 psmouse serio3: synaptics: Trying to set up SMBus access rmi4_smbus 6-002c: registering SMbus-connected sensor rmi4_f01 rmi4-00.fn01: found RMI device, manufacturer: Synaptics, product: TM3139-001, fw id: 2000742 Signed-off-by: Mantas Mikulėnas <grawity@gmail.com> Reviewed-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>	2018-12-21 01:09:17 -08:00
Rafael J. Wysocki	a465d38fa3	Merge branches 'pm-devfreq', 'pm-avs' and 'pm-tools' * pm-devfreq: PM / devfreq: add devfreq_suspend/resume() functions PM / devfreq: add support for suspend/resume of a devfreq device PM / devfreq: refactor set_target frequency function * pm-avs: PM / AVS: SmartReflex: Switch to SPDX Licence ID PM / AVS: SmartReflex: NULL check before some freeing functions is not needed PM / AVS: SmartReflex: remove unused function * pm-tools: tools/power/x86/intel_pstate_tracer: Fix non root execution for post processing a trace file tools/power turbostat: consolidate duplicate model numbers tools/power turbostat: fix goldmont C-state limit decoding cpupower : Auto-completion for cpupower tool tools/power turbostat: reduce debug output tools/power turbosat: fix AMD APIC-id output	2018-12-21 10:07:37 +01:00
Rafael J. Wysocki	442a5d000a	Merge branches 'pm-core', 'pm-qos', 'pm-domains' and 'pm-sleep' * pm-core: PM-runtime: Switch autosuspend over to using hrtimers * pm-qos: PM / QoS: Change to use DEFINE_SHOW_ATTRIBUTE macro * pm-domains: PM / Domains: remove define_genpd_open_function() and define_genpd_debugfs_fops() * pm-sleep: PM / sleep: convert to DEFINE_SHOW_ATTRIBUTE	2018-12-21 10:06:44 +01:00
Rafael J. Wysocki	6f049e7c87	Merge branch 'pm-opp' * pm-opp: PM / Domains: Propagate performance state updates PM / Domains: Factorize dev_pm_genpd_set_performance_state() PM / Domains: Save OPP table pointer in genpd OPP: Don't return 0 on error from of_get_required_opp_performance_state() OPP: Add dev_pm_opp_xlate_performance_state() helper OPP: Improve _find_table_of_opp_np() PM / Domains: Make genpd performance states orthogonal to the idlestates OPP: Fix missing debugfs supply directory for OPPs OPP: Use opp_table->regulators to verify no regulator case OPP: Remove of_dev_pm_opp_find_required_opp() OPP: Rename and relocate of_genpd_opp_to_performance_state() OPP: Configure all required OPPs OPP: Add dev_pm_opp_{set\|put}_genpd_virt_dev() helper PM / Domains: Add genpd_opp_to_performance_state() OPP: Populate OPPs from "required-opps" property OPP: Populate required opp tables from "required-opps" property OPP: Separate out custom OPP handler specific code OPP: Identify and mark genpd OPP tables PM / Domains: Rename genpd virtual devices as virt_dev	2018-12-21 10:06:18 +01:00
Rafael J. Wysocki	3a56fe685d	Merge branches 'pm-cpuidle', 'pm-cpufreq' and 'pm-cpufreq-sched' * pm-cpuidle: cpuidle: Add 'above' and 'below' idle state metrics cpuidle: big.LITTLE: fix refcount leak cpuidle: Add cpuidle.governor= command line parameter cpuidle: poll_state: Disregard disable idle states Documentation: admin-guide: PM: Add cpuidle document * pm-cpufreq: cpufreq: qcom-hw: Add support for QCOM cpufreq HW driver dt-bindings: cpufreq: Introduce QCOM cpufreq firmware bindings cpufreq: nforce2: Remove meaningless return cpufreq: ia64: Remove unused header files cpufreq: imx6q: save one condition block for normal case of nvmem read cpufreq: imx6q: remove unused code cpufreq: pmac64: add of_node_put() cpufreq: powernv: add of_node_put() Documentation: intel_pstate: Clarify coordination of P-State limits cpufreq: intel_pstate: Force HWP min perf before offline cpufreq: s3c24xx: Change to use DEFINE_SHOW_ATTRIBUTE macro * pm-cpufreq-sched: sched/cpufreq: Add the SPDX tags	2018-12-21 10:06:06 +01:00
Rafael J. Wysocki	3eb8536846	Merge branch 'acpi-pci' * acpi-pci: ACPI: Make PCI slot detection driver depend on PCI ACPI/IORT: Stub out ACS functions when CONFIG_PCI is not set arm64: select ACPI PCI code only when both features are enabled PCI/ACPI: Allow ACPI to be built without CONFIG_PCI set ACPICA: Remove PCI bits from ACPICA when CONFIG_PCI is unset ACPI: Allow CONFIG_PCI to be unset for reboot ACPI: Move PCI reset to a separate function	2018-12-21 10:04:23 +01:00
Rafael J. Wysocki	4cd9da8ad1	Merge branches 'acpi-tables', 'acpi-soc', 'acpi-apei' and 'acpi-misc' * acpi-tables: ACPI / tables: Add an ifdef around amlcode and dsdt_amlcode ACPI / tables: add DSDT AmlCode new declaration name support ACPI: SPCR: Consider baud rate 0 as preconfigured state * acpi-soc: ACPI / LPSS: Ignore acpi_device_fix_up_power() return value ACPI / APD: Add clock frequency for Hisilicon Hip08 SPI controller * acpi-apei: ACPI/APEI: Clear GHES block_status before panic() ACPI, APEI, EINJ: Change to use DEFINE_SHOW_ATTRIBUTE macro * acpi-misc: ACPI: fix acpi_find_child_device() invocation in acpi_preset_companion()	2018-12-21 10:04:14 +01:00
Rafael J. Wysocki	1027fb0fb9	Merge branch 'acpica' * acpica: ACPICA: Update version to 20181213 ACPICA: change coding style to match ACPICA, no functional change ACPICA: Debug output: Add option to display method/object evaluation ACPICA: disassembler: disassemble OEMx tables as AML ACPICA: Add "Windows 2018.2" string in the _OSI support ACPICA: Expressions in package elements are not supported ACPICA: Update buffer-to-string conversions ACPICA: add comments, no functional change ACPICA: Remove defines that use deprecated flag ACPICA: Add "Windows 2018" string in the _OSI support ACPICA: Update version to 20181031 ACPICA: iASL: Enhance error detection ACPICA: iASL: adding definition and disassembly for TPM2 revision 3 ACPICA: Use %d for signed int print formatting instead of %u ACPICA: Debugger: refactor to fix unused variable warning	2018-12-21 10:03:16 +01:00
Benjamin Tissoires	d21ff5d7f8	Input: elantech - disable elan-i2c for P52 and P72 The current implementation of elan_i2c is known to not support those 2 laptops. A proper fix is to tweak both elantech and elan_i2c to transmit the correct information from PS/2, which would make a bad candidate for stable. So to give us some time for fixing the root of the problem, disable elan_i2c for the devices we know are not behaving properly. Link: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1803600 Link: https://bugs.archlinux.org/task/59714 Fixes: df077237cf55 Input: elantech - detect new ICs and setup Host Notify for them Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: Benjamin Tissoires <benjamin.tissoires@redhat.com> Acked-by: Peter Hutterer <peter.hutterer@who-t.net> Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>	2018-12-21 00:57:01 -08:00
Uwe Kleine-König	c8da642d41	gpio: mvebu: only fail on missing clk if pwm is actually to be used The gpio IP on Armada 370 at offset 0x18180 has neither a clk nor pwm registers. So there is no need for a clk as the pwm isn't used anyhow. So only check for the clk in the presence of the pwm registers. This fixes a failure to probe the gpio driver for the above mentioned gpio device. Fixes: 757642f9a584 ("gpio: mvebu: Add limited PWM support") Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Gregory CLEMENT <gregory.clement@bootlin.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2018-12-21 09:23:46 +01:00
Christophe Leroy	abf221d2f5	gpio: max7301: fix driver for use with CONFIG_VMAP_STACK spi_read() and spi_write() require DMA-safe memory. When CONFIG_VMAP_STACK is selected, those functions cannot be used with buffers on stack. This patch replaces calls to spi_read() and spi_write() by spi_write_then_read() which doesn't require DMA-safe buffers. Fixes: 0c36ec314735 ("gpio: gpio driver for max7301 SPI GPIO expander") Cc: <stable@vger.kernel.org> Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2018-12-21 09:23:45 +01:00
Tony Lindgren	00ded24c33	gpio: gpio-omap: Revert deferred wakeup quirk handling for regressions Commit ec0daae685b2 ("gpio: omap: Add level wakeup handling for omap4 based SoCs") attempted to fix omap4 GPIO wakeup handling as it was blocking deeper SoC idle states. However this caused a regression for GPIOs during runtime having over second long latencies for Ethernet GPIO interrupt as reportedy by Russell King <rmk+kernel@armlinux.org.uk>. Let's fix this issue by doing a partial revert of the breaking commit. We still want to keep the quirk handling around as it is also used for OMAP_GPIO_QUIRK_IDLE_REMOVE_TRIGGER. The real fix for omap4 GPIO wakeup handling involves fixes for omap_set_gpio_trigger() and omap_gpio_unmask_irq() and will be posted separately. And we must keep the wakeup bit enabled during runtime because of module doing clock autogating with autoidle configured. Reported-by: Russell King <rmk+kernel@armlinux.org.uk> Fixes: ec0daae685b2 ("gpio: omap: Add level wakeup handling for omap4 based SoCs") Cc: Aaro Koskinen <aaro.koskinen@iki.fi> Cc: Grygorii Strashko <grygorii.strashko@ti.com> Cc: Keerthy <j-keerthy@ti.com> Cc: Ladislav Michl <ladis@linux-mips.org> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Tero Kristo <t-kristo@ti.com> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org>	2018-12-21 09:23:45 +01:00
Alexey Kardashevskiy	7f92891778	vfio_pci: Add NVIDIA GV100GL [Tesla V100 SXM2] subdriver POWER9 Witherspoon machines come with 4 or 6 V100 GPUs which are not pluggable PCIe devices but still have PCIe links which are used for config space and MMIO. In addition to that the GPUs have 6 NVLinks which are connected to other GPUs and the POWER9 CPU. POWER9 chips have a special unit on a die called an NPU which is an NVLink2 host bus adapter with p2p connections to 2 to 3 GPUs, 3 or 2 NVLinks to each. These systems also support ATS (address translation services) which is a part of the NVLink2 protocol. Such GPUs also share on-board RAM (16GB or 32GB) to the system via the same NVLink2 so a CPU has cache-coherent access to a GPU RAM. This exports GPU RAM to the userspace as a new VFIO device region. This preregisters the new memory as device memory as it might be used for DMA. This inserts pfns from the fault handler as the GPU memory is not onlined until the vendor driver is loaded and trained the NVLinks so doing this earlier causes low level errors which we fence in the firmware so it does not hurt the host system but still better be avoided; for the same reason this does not map GPU RAM into the host kernel (usual thing for emulated access otherwise). This exports an ATSD (Address Translation Shootdown) register of NPU which allows TLB invalidations inside GPU for an operating system. The register conveniently occupies a single 64k page. It is also presented to the userspace as a new VFIO device region. One NPU has 8 ATSD registers, each of them can be used for TLB invalidation in a GPU linked to this NPU. This allocates one ATSD register per an NVLink bridge allowing passing up to 6 registers. Due to the host firmware bug (just recently fixed), only 1 ATSD register per NPU was actually advertised to the host system so this passes that alone register via the first NVLink bridge device in the group which is still enough as QEMU collects them all back and presents to the guest via vPHB to mimic the emulated NPU PHB on the host. In order to provide the userspace with the information about GPU-to-NVLink connections, this exports an additional capability called "tgt" (which is an abbreviated host system bus address). The "tgt" property tells the GPU its own system address and allows the guest driver to conglomerate the routing information so each GPU knows how to get directly to the other GPUs. For ATS to work, the nest MMU (an NVIDIA block in a P9 CPU) needs to know LPID (a logical partition ID or a KVM guest hardware ID in other words) and PID (a memory context ID of a userspace process, not to be confused with a linux pid). This assigns a GPU to LPID in the NPU and this is why this adds a listener for KVM on an IOMMU group. A PID comes via NVLink from a GPU and NPU uses a PID wildcard to pass it through. This requires coherent memory and ATSD to be available on the host as the GPU vendor only supports configurations with both features enabled and other configurations are known not to work. Because of this and because of the ways the features are advertised to the host system (which is a device tree with very platform specific properties), this requires enabled POWERNV platform. The V100 GPUs do not advertise any of these capabilities via the config space and there are more than just one device ID so this relies on the platform to tell whether these GPUs have special abilities such as NVLinks. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:47 +11:00
Alexey Kardashevskiy	c2c0f1cde0	vfio_pci: Allow regions to add own capabilities VFIO regions already support region capabilities with a limited set of fields. However the subdriver might have to report to the userspace additional bits. This adds an add_capability() hook to vfio_pci_regops. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:47 +11:00
Alexey Kardashevskiy	a15b1883fe	vfio_pci: Allow mapping extra regions So far we only allowed mapping of MMIO BARs to the userspace. However there are GPUs with on-board coherent RAM accessible via side channels which we also want to map to the userspace. The first client for this is NVIDIA V100 GPU with NVLink2 direct links to a POWER9 NPU-enabled CPU; such GPUs have 16GB RAM which is coherently mapped to the system address space, we are going to export these as an extra PCI region. We already support extra PCI regions and this adds support for mapping them to the userspace. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Acked-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:47 +11:00
Alexey Kardashevskiy	58629c0dc3	powerpc/powernv/npu: Fault user page into the hypervisor's pagetable When a page fault happens in a GPU, the GPU signals the OS and the GPU driver calls the fault handler which populated a page table; this allows the GPU to complete an ATS request. On the bare metal get_user_pages() is enough as it adds a pte to the kernel page table but under KVM the partition scope tree does not get updated so ATS will still fail. This reads a byte from an effective address which causes HV storage interrupt and KVM updates the partition scope tree. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	135ef95405	powerpc/powernv/npu: Check mmio_atsd array bounds when populating A broken device tree might contain more than 8 values and introduce hard to debug memory corruption bug. This adds the boundary check. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	1b785611e1	powerpc/powernv/npu: Add release_ownership hook In order to make ATS work and translate addresses for arbitrary LPID and PID, we need to program an NPU with LPID and allow PID wildcard matching with a specific MSR mask. This implements a helper to assign a GPU to LPAR and program the NPU with a wildcard for PID and a helper to do clean-up. The helper takes MSR (only DR/HV/PR/SF bits are allowed) to program them into NPU2 for ATS checkout requests support. This exports pnv_npu2_unmap_lpar_dev() as following patches will use it from the VFIO driver. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	0bd971676e	powerpc/powernv/npu: Add compound IOMMU groups At the moment the powernv platform registers an IOMMU group for each PE. There is an exception though: an NVLink bridge which is attached to the corresponding GPU's IOMMU group making it a master. Now we have POWER9 systems with GPUs connected to each other directly bypassing PCI. At the moment we do not control state of these links so we have to put such interconnected GPUs to one IOMMU group which means that the old scheme with one GPU as a master won't work - there will be up to 3 GPUs in such group. This introduces a npu_comp struct which represents a compound IOMMU group made of multiple PEs - PCI PEs (for GPUs) and NPU PEs (for NVLink bridges). This converts the existing NVLink1 code to use the new scheme. >From now on, each PE must have a valid iommu_table_group_ops which will either be called directly (for a single PE group) or indirectly from a compound group handlers. This moves IOMMU group registration for NVLink-connected GPUs to npu-dma.c. For POWER8, this stores a new compound group pointer in the PE (so a GPU is still a master); for POWER9 the new group pointer is stored in an NPU (which is allocated per a PCI host controller). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> [mpe: Initialise npdev to NULL in pnv_try_setup_npu_table_group()] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	83fb8ccf97	powerpc/powernv/npu: Convert NPU IOMMU helpers to iommu_table_group_ops At the moment NPU IOMMU is manipulated directly from the IODA2 PCI PE code; PCI PE acts as a master to NPU PE. Soon we will have compound IOMMU groups with several PEs from several different PHB (such as interconnected GPUs and NPUs) so there will be no single master but a one big IOMMU group. This makes a first step and converts an NPU PE with a set of extern function to a table group. This should cause no behavioral change. Note that pnv_npu_release_ownership() has never been implemented. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	b04149c2dd	powerpc/powernv/npu: Move single TVE handling to NPU PE Normal PCI PEs have 2 TVEs, one per a DMA window; however NPU PE has only one which points to one of two tables of the corresponding PCI PE. So whenever a new DMA window is programmed to PEs, the NPU PE needs to release old table in order to use the new one. Commit d41ce7b1bcc3e ("powerpc/powernv/npu: Do not try invalidating 32bit table when 64bit table is enabled") did just that but in pci-ioda.c while it actually belongs to npu-dma.c. This moves the single TVE handling to npu-dma.c. This does not implement restoring though as it is highly unlikely that we can set the table to PCI PE and cannot to NPU PE and if that fails, we could only set 32bit table to NPU PE and this configuration is not really supported or wanted. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	847e6563aa	powerpc/powernv: Reference iommu_table while it is linked to a group The iommu_table pointer stored in iommu_table_group may get stale by accident, this adds referencing and removes a redundant comment about this. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	5eada8a3f0	powerpc/iommu_api: Move IOMMU groups setup to a single place Registering new IOMMU groups and adding devices to them are separated in code and the latter is dug in the DMA setup code which it does not really belong to. This moved IOMMU groups setup to a separate helper which registers a group and adds devices as before. This does not make a difference as IOMMU groups are not used anyway; the only dependency here is that iommu_add_device() requires a valid pointer to an iommu_table (set by set_iommu_table_base()). To keep the old behaviour, this does not add new IOMMU groups for PEs with no DMA weight and also skips NVLink bridges which do not have pci_controller_ops::setup_bridge (the normal way of adding PEs). Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	c4e9d3c1e6	powerpc/powernv/pseries: Rework device adding to IOMMU groups The powernv platform registers IOMMU groups and adds devices to them from the pci_controller_ops::setup_bridge() hook except one case when virtual functions (SRIOV VFs) are added from a bus notifier. The pseries platform registers IOMMU groups from the pci_controller_ops::dma_bus_setup() hook and adds devices from the pci_controller_ops::dma_dev_setup() hook. The very same bus notifier used for powernv does not add devices for pseries though as __of_scan_bus() adds devices first, then it does the bus/dev DMA setup. Both platforms use iommu_add_device() which takes a device and expects it to have a valid IOMMU table struct with an iommu_table_group pointer which in turn points the iommu_group struct (which represents an IOMMU group). Although the helper seems easy to use, it relies on some pre-existing device configuration and associated data structures which it does not really need. This simplifies iommu_add_device() to take the table_group pointer directly. Pseries already has a table_group pointer handy and the bus notified is not used anyway. For powernv, this copies the existing bus notifier, makes it work for powernv only which means an easy way of getting to the table_group pointer. This was tested on VFs but should also support physical PCI hotplug. Since iommu_add_device() receives the table_group pointer directly, pseries does not do TCE cache invalidation (the hypervisor does) nor allow multiple groups per a VFIO container (in other words sharing an IOMMU table between partitionable endpoints), this removes iommu_table_group_link from pseries. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	c409c63161	powerpc/pseries: Remove IOMMU API support for non-LPAR systems The pci_dma_bus_setup_pSeries and pci_dma_dev_setup_pSeries hooks are registered for the pseries platform which does not have FW_FEATURE_LPAR; these would be pre-powernv platforms which we never supported PCI pass through for anyway so remove it. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00
Alexey Kardashevskiy	3be2df00e2	powerpc/pseries/npu: Enable platform support We already changed NPU API for GPUs to not to call OPAL and the remaining bit is initializing NPU structures. This searches for POWER9 NVLinks attached to any device on a PHB and initializes an NPU structure if any found. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>	2018-12-21 16:20:46 +11:00

... 2 3 4 5 6 ...

802696 Commits