19897 Commits

Author SHA1 Message Date
Linus Torvalds
f74ad8df4e PCI changes for the v3.17 merge window:
Resource management
     - Support BAR sizes up to 128GB (Yinghai Lu)
     - Keep original resource if we fail to expand it (Guo Chao)
     - Return conventional error values from pci_revert_fw_address() (Bjorn Helgaas)
     - Tidy resource assignment messages (Bjorn Helgaas)
     - Don't exclude low BIOS area for non-PCI cards (Christoph Schulz)
 
   PCI device hotplug
     - Prevent NULL dereference during pciehp probe (Andreas Noever)
     - Make pciehp pcie_wait_cmd() self-contained (Bjorn Helgaas)
     - Wait for pciehp hotplug command completion lazily (Bjorn Helgaas)
     - Compute pciehp timeout from hotplug command start time (Bjorn Helgaas)
     - Remove pciehp assumptions about which commands cause completion events (Bjorn Helgaas)
     - Clear pciehp Data Link Layer State Changed during init (Myron Stowe)
     - Remove pciehp struct controller.no_cmd_complete (Rajat Jain)
     - Remove cpqphp unnecessary null test (Fabian Frederick)
     - Remove "invalid IRQ" warning for hot-added PCIe ports (Jiang Liu)
 
   IOMMU
     - Add DMA alias quirk for Intel 82801 bridge (Alex Williamson)
 
   MSI
     - Add internal msix_clear_and_set_ctrl() (Yijing Wang)
     - Remove unused msi_enabled_mask() (Yijing Wang)
     - Cache Multiple Message Capable in struct msi_desc (Yijing Wang)
     - Add msi_setup_entry() to clean up initialization (Yijing Wang)
     - Remove unused msi_remove_pci_irq_vectors() (Yijing Wang)
     - Retrieve first MSI IRQ from msi_desc rather than pci_dev (Yijing Wang)
     - Remove unused list access in __pci_restore_msix_state() (Yijing Wang)
     - Use irq_get_msi_desc() to simplify code (Yijing Wang)
 
   Generic host bridge driver
     - Fix GPL v2 license string typo (Bjorn Helgaas)
 
   Marvell MVEBU
     - Fix GPL v2 license string typo (Thierry Reding)
 
   NVIDIA Tegra
     - Use correct initial HW settings (Phil Edworthy)
     - Remove rcar_pcie_setup_window() resource argument (Phil Edworthy)
     - Fix GPL v2 license string typo (Thierry Reding)
 
   Renesas R-Car
     - Remove redundant config accessor register checks (Sergei Shtylyov)
     - Fix GPL v2 license string typo (Bjorn Helgaas)
 
   Virtualization
     - Factor secondary bus reset logic (Gavin Shan)
     - Remove duplicate powerpc reset logic (Gavin Shan)
 
   Miscellaneous
     - Rework default VGA detection for EFI (Bruno Prémont)
     - Fix sysfs "acpi_index" and "label" errors for NIC renaming (Simone Gotti)
     - Configure ASPM at pci_enable_device()-time (Vidya Sagar)
     - Add include/linux/pci_ids.h include guard (Rasmus Villemoes)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.11 (GNU/Linux)
 
 iQIcBAABAgAGBQJTzppBAAoJEFmIoMA60/r8mtQP/jgVWCSU+0ulHjoxVSRLu4Lc
 UGKQFjS03oWWflHdvW6wZFqN82Ynva9fYCLMtiKdPg7cgTosSRT3I4DjAIm80ZI/
 kZvHxSmi6DBYmchZBsWzj60zxNiYZeEgd7CevzcJRHuwbKNMr2y12s6hjJbyl5lF
 ygaXWpDveKjsEDjyk9vKjUGwul/NJKynar253Yh178XaoypdGuiEIw3D1lQFMZZp
 ADcRijIi+CD2BENtDr6fbldbj+yQ93yyUSloEnaKtWZD+Ao5IsHngN0IyRu+l1Wl
 LFob0AsopeYVFKdw22Gn1KAq9Jj01acsSBRXjgrauU+tLY512Vkbp1lFYl85B/38
 /Z0VNHncmIh29rq9Tl2xQwEeI3Ja27FfnMjC70dLM5YjWf8vsYnDEQZHyxAAe15D
 p3H3YuuDjmvHkoSrHY/68DLfDl9ubw3/BFUlCMqijL7444ZWLXathrnCV8ZJimmr
 PlF/m7GtXYF4wIw19m9KQqNBUPJJEsVHExKzICOY4v5/nMlvx4ZkBDR3tPNEH1sk
 3AYKjLDw21Nle7yKcAlxDI/TYWZqxuph23UpevzlQd16tutq2i2FqpauiqI3DFm4
 VfYVbOVQwfeUJt11VOCgxvE7RsTxCk5QefB+YKVAdVK6vMZHeZxsetYvrCDptnea
 cId/NfiEFnmr+u3mAyPM
 =U5Ip
 -----END PGP SIGNATURE-----

Merge tag 'pci-v3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

Pull PCI updates from Bjorn Helgaas:
 "I'll be on vacation until Aug 11, and I suspect the merge window will
  open before then, so I'm sending this to you early.  There are more
  things I'd like to get into v3.17, so I hope to send another pull
  request soon after I return.

  The most notable pieces here are:

   - Support BARs up to 128GB (up from 8GB)
   - Fix SR-IOV resource assignment when we fail to expand a resource
   - Rework pciehp to handle a common hardware erratum
   - Cleanup MSI
   - Fix NIC renaming issue
   - Fix VGA default device issue on EFI systems
   - Fix ASPM configuration (previously we didn't enable it as expected)

  Alex Williamson has graciously agreed to take care of any major issues
  with this if you take it before I return.

  Details:

  Resource management
    - Support BAR sizes up to 128GB (Yinghai Lu)
    - Keep original resource if we fail to expand it (Guo Chao)
    - Return conventional error values from pci_revert_fw_address() (Bjorn Helgaas)
    - Tidy resource assignment messages (Bjorn Helgaas)
    - Don't exclude low BIOS area for non-PCI cards (Christoph Schulz)

  PCI device hotplug
    - Prevent NULL dereference during pciehp probe (Andreas Noever)
    - Make pciehp pcie_wait_cmd() self-contained (Bjorn Helgaas)
    - Wait for pciehp hotplug command completion lazily (Bjorn Helgaas)
    - Compute pciehp timeout from hotplug command start time (Bjorn Helgaas)
    - Remove pciehp assumptions about which commands cause completion events (Bjorn Helgaas)
    - Clear pciehp Data Link Layer State Changed during init (Myron Stowe)
    - Remove pciehp struct controller.no_cmd_complete (Rajat Jain)
    - Remove cpqphp unnecessary null test (Fabian Frederick)
    - Remove "invalid IRQ" warning for hot-added PCIe ports (Jiang Liu)

  IOMMU
    - Add DMA alias quirk for Intel 82801 bridge (Alex Williamson)

  MSI
    - Add internal msix_clear_and_set_ctrl() (Yijing Wang)
    - Remove unused msi_enabled_mask() (Yijing Wang)
    - Cache Multiple Message Capable in struct msi_desc (Yijing Wang)
    - Add msi_setup_entry() to clean up initialization (Yijing Wang)
    - Remove unused msi_remove_pci_irq_vectors() (Yijing Wang)
    - Retrieve first MSI IRQ from msi_desc rather than pci_dev (Yijing Wang)
    - Remove unused list access in __pci_restore_msix_state() (Yijing Wang)
    - Use irq_get_msi_desc() to simplify code (Yijing Wang)

  Generic host bridge driver
    - Fix GPL v2 license string typo (Bjorn Helgaas)

  Marvell MVEBU
    - Fix GPL v2 license string typo (Thierry Reding)

  NVIDIA Tegra
    - Use correct initial HW settings (Phil Edworthy)
    - Remove rcar_pcie_setup_window() resource argument (Phil Edworthy)
    - Fix GPL v2 license string typo (Thierry Reding)

  Renesas R-Car
    - Remove redundant config accessor register checks (Sergei Shtylyov)
    - Fix GPL v2 license string typo (Bjorn Helgaas)

  Virtualization
    - Factor secondary bus reset logic (Gavin Shan)
    - Remove duplicate powerpc reset logic (Gavin Shan)

  Miscellaneous
    - Rework default VGA detection for EFI (Bruno Prémont)
    - Fix sysfs "acpi_index" and "label" errors for NIC renaming (Simone Gotti)
    - Configure ASPM at pci_enable_device()-time (Vidya Sagar)
    - Add include/linux/pci_ids.h include guard (Rasmus Villemoes)"

* tag 'pci-v3.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (38 commits)
  PCI/MSI: Use irq_get_msi_desc() to simplify code
  PCI/MSI: Remove unused list access in __pci_restore_msix_state()
  PCI/MSI: Retrieve first MSI IRQ from msi_desc rather than pci_dev
  PCI/MSI: Remove unused function msi_remove_pci_irq_vectors()
  PCI/MSI: Add msi_setup_entry() to clean up MSI initialization
  PCI: Configure ASPM when enabling device
  x86: don't exclude low BIOS area when allocating address space for non-PCI cards
  PCI: generic: Fix GPL v2 license string typo
  PCI: rcar: Fix GPL v2 license string typo
  PCI: tegra: Fix GPL v2 license string typo
  PCI: mvebu: Fix GPL v2 license string typo
  PCI: Add include guard to include/linux/pci_ids.h
  x86, ia64: Move EFI_FB vga_default_device() initialization to pci_vga_fixup()
  PCI: Tidy resource assignment messages
  PCI: Return conventional error values from pci_revert_fw_address()
  PCI: Cleanup control flow
  PCI: Support BAR sizes up to 128GB
  PCI: cpqphp: Remove unnecessary null test before debugfs_remove()
  PCI: pciehp: Clear Data Link Layer State Changed during init
  PCI: Add bridge DMA alias quirk for Intel 82801 bridge
  ...
2014-08-04 09:29:37 -07:00
Mark Brown
a6ce305207 Merge remote-tracking branches 'asoc/topic/intel', 'asoc/topic/kirkwood', 'asoc/topic/max98090' and 'asoc/topic/mc13783' into asoc-next 2014-08-04 16:31:45 +01:00
Dan Carpenter
4c51cb005b x86/pmc_atom: Silence shift wrapping warnings in pmc_sleep_tmr_show()
I don't know if we really need 64 bits here but these variables are
declared as u64 and it can't hurt to cast this so we prevent any shift
wrapping.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Aubrey Li <aubrey.li@linux.intel.com>
Link: http://lkml.kernel.org/r/20140801082715.GE28869@mwanda
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-08-02 16:52:17 -07:00
Alexei Starovoitov
7ae457c1e5 net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix

split 'struct sk_filter' into
struct sk_filter {
	atomic_t        refcnt;
	struct rcu_head rcu;
	struct bpf_prog *prog;
};
and
struct bpf_prog {
        u32                     jited:1,
                                len:31;
        struct sock_fprog_kern  *orig_prog;
        unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                            const struct bpf_insn *filter);
        union {
                struct sock_filter      insns[0];
                struct bpf_insn         insnsi[0];
                struct work_struct      work;
        };
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases

split SK_RUN_FILTER macro into:
    SK_RUN_FILTER to be used with 'struct sk_filter *' and
    BPF_PROG_RUN to be used with 'struct bpf_prog *'

__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function

also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:

sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter

API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet

API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:03:58 -07:00
Alexei Starovoitov
8fb575ca39 net: filter: rename sk_convert_filter() -> bpf_convert_filter()
to indicate that this function is converting classic BPF into eBPF
and not related to sockets

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:38 -07:00
Linus Torvalds
f88cf230a4 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Peter Anvin:
 "A single fix to not invoke the espfix code on Xen PV, as it turns out
  to oops the guest when invoked after all.  This patch leaves some
  amount of dead code, in particular unnecessary initialization of the
  espfix stacks when they won't be used, but in the interest of keeping
  the patch minimal that cleanup can wait for the next cycle"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86_64/entry/xen: Do not invoke espfix64 on Xen
2014-08-01 17:37:01 -07:00
H. Peter Anvin
5e3bf215f4 x86/apic/vsmp: Make is_vsmp_box() static
Since checkin

411cf9ee2946 x86, vsmp: Remove is_vsmp_box() from apic_is_clustered_box()

... is_vsmp_box() is only used in vsmp_64.c and does not have any
header file declaring it, so make it explicitly static.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Shai Fultheim <shai@scalemp.com>
Cc: Oren Twaig <oren@scalemp.com>
Link: http://lkml.kernel.org/r/1404036068-11674-1-git-send-email-oren@scalemp.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-08-01 15:09:45 -07:00
Matt Rushton
3fbdf631b7 xen/setup: Remove Identity Map Debug Message
Removing a debug message for setting the identity map since it becomes
rather noisy after rework of the identity map code.

Signed-off-by: Matthew Rushton <mrushton@amazon.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2014-07-31 18:40:13 +01:00
Dave Hansen
a5102476a2 x86/mm: Set TLB flush tunable to sane value (33)
This has been run through Intel's LKP tests across a wide range
of modern sytems and workloads and it wasn't shown to make a
measurable performance difference positive or negative.

Now that we have some shiny new tracepoints, we can actually
figure out what the heck is going on.

During a kernel compile, 60% of the flush_tlb_mm_range() calls
are for a single page.  It breaks down like this:

 size   percent  percent<=
  V        V        V
GLOBAL:   2.20%   2.20% avg cycles:  2283
     1:  56.92%  59.12% avg cycles:  1276
     2:  13.78%  72.90% avg cycles:  1505
     3:   8.26%  81.16% avg cycles:  1880
     4:   7.41%  88.58% avg cycles:  2447
     5:   1.73%  90.31% avg cycles:  2358
     6:   1.32%  91.63% avg cycles:  2563
     7:   1.14%  92.77% avg cycles:  2862
     8:   0.62%  93.39% avg cycles:  3542
     9:   0.08%  93.47% avg cycles:  3289
    10:   0.43%  93.90% avg cycles:  3570
    11:   0.20%  94.10% avg cycles:  3767
    12:   0.08%  94.18% avg cycles:  3996
    13:   0.03%  94.20% avg cycles:  4077
    14:   0.02%  94.23% avg cycles:  4836
    15:   0.04%  94.26% avg cycles:  5699
    16:   0.06%  94.32% avg cycles:  5041
    17:   0.57%  94.89% avg cycles:  5473
    18:   0.02%  94.91% avg cycles:  5396
    19:   0.03%  94.95% avg cycles:  5296
    20:   0.02%  94.96% avg cycles:  6749
    21:   0.18%  95.14% avg cycles:  6225
    22:   0.01%  95.15% avg cycles:  6393
    23:   0.01%  95.16% avg cycles:  6861
    24:   0.12%  95.28% avg cycles:  6912
    25:   0.05%  95.32% avg cycles:  7190
    26:   0.01%  95.33% avg cycles:  7793
    27:   0.01%  95.34% avg cycles:  7833
    28:   0.01%  95.35% avg cycles:  8253
    29:   0.08%  95.42% avg cycles:  8024
    30:   0.03%  95.45% avg cycles:  9670
    31:   0.01%  95.46% avg cycles:  8949
    32:   0.01%  95.46% avg cycles:  9350
    33:   3.11%  98.57% avg cycles:  8534
    34:   0.02%  98.60% avg cycles: 10977
    35:   0.02%  98.62% avg cycles: 11400

We get in to dimishing returns pretty quickly.  On pre-IvyBridge
CPUs, we used to set the limit at 8 pages, and it was set at 128
on IvyBrige.  That 128 number looks pretty silly considering that
less than 0.5% of the flushes are that large.

The previous code tried to size this number based on the size of
the TLB.  Good idea, but it's error-prone, needs maintenance
(which it didn't get up to now), and probably would not matter in
practice much.

Settting it to 33 means that we cover the mallopt
M_TRIM_THRESHOLD, which is the most universally common size to do
flushes.

That's the short version.  Here's the long one for why I chose 33:

1. These numbers have a constant bias in the timestamps from the
   tracing.  Probably counts for a couple hundred cycles in each of
   these tests, but it should be fairly _even_ across all of them.
   The smallest delta between the tracepoints I have ever seen is
   335 cycles.  This is one reason the cycles/page cost goes down in
   general as the flushes get larger.  The true cost is nearer to
   100 cycles.
2. A full flush is more expensive than a single invlpg, but not
   by much (single percentages).
3. A dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns
   (~34 cycles).  At those rates, refilling the 512-entry dTLB takes
   22,000 cycles.
4. 22,000 cycles is approximately the equivalent of doing 85
   invlpg operations.  But, the odds are that the TLB can
   actually be filled up faster than that because TLB misses that
   are close in time also tend to leverage the same caches.
6. ~98% of flushes are <=33 pages.  There are a lot of flushes of
   33 pages, probably because libc's M_TRIM_THRESHOLD is set to
   128k (32 pages)
7. I've found no consistent data to support changing the IvyBridge
   vs. SandyBridge tunable by a factor of 16

I used the performance counters on this hardware (IvyBridge i5-3320M)
to figure out the tlb miss costs:

ocperf.py stat -e dtlb_load_misses.walk_duration,dtlb_load_misses.walk_completed,dtlb_store_misses.walk_duration,dtlb_store_misses.walk_completed,itlb_misses.walk_duration,itlb_misses.walk_completed,itlb.itlb_flush

     7,720,030,970      dtlb_load_misses_walk_duration                                    [57.13%]
       169,856,353      dtlb_load_misses_walk_completed                                    [57.15%]
       708,832,859      dtlb_store_misses_walk_duration                                    [57.17%]
        19,346,823      dtlb_store_misses_walk_completed                                    [57.17%]
     2,779,687,402      itlb_misses_walk_duration                                    [57.15%]
        82,241,148      itlb_misses_walk_completed                                    [57.13%]
           770,717      itlb_itlb_flush                                              [57.11%]

Show that a dtlb miss is 17.1ns (~45 cycles) and a itlb miss is 13.0ns
(~34 cycles).  At those rates, refilling the 512-entry dTLB takes
22,000 cycles.  On a SandyBridge system with more cores and larger
caches, those are dtlb=13.4ns and itlb=9.5ns.

cat perf.stat.txt | perl -pe 's/,//g'
	| awk '/itlb_misses_walk_duration/ { icyc+=$1 }
		/itlb_misses_walk_completed/ { imiss+=$1 }
		/dtlb_.*_walk_duration/ { dcyc+=$1 }
		/dtlb_.*.*completed/ { dmiss+=$1 }
		END {print "itlb cyc/miss: ", icyc/imiss, " dtlb cyc/miss: ", dcyc/dmiss, "   -----    ", icyc,imiss, dcyc,dmiss }

On Westmere CPUs, the counters to use are: itlb_flush,itlb_misses.walk_cycles,itlb_misses.any,dtlb_misses.walk_cycles,dtlb_misses.any

The assumptions that this code went in under:
https://lkml.org/lkml/2012/6/12/119 say that a flush and a refill are
about 100ns.  Being generous, that is over by a factor of 6 on the
refill side, although it is fairly close on the cost of an invlpg.
An increase of a single invlpg operation seems to lengthen the flush
range operation by about 200 cycles.  Here is one example of the data
collected for flushing 10 and 11 pages (full data are below):

    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145

How to generate this table:

	echo 10000 > /sys/kernel/debug/tracing/buffer_size_kb
	echo x86-tsc > /sys/kernel/debug/tracing/trace_clock
	echo 'reason != 0' > /sys/kernel/debug/tracing/events/tlb/tlb_flush/filter
	echo 1 > /sys/kernel/debug/tracing/events/tlb/tlb_flush/enable

Pipe the trace output in to this script:

	http://sr71.net/~dave/intel/201402-tlb/trace-time-diff-process.pl.txt

Note that these data were gathered with the invlpg threshold set to
150 pages.  Only data points with >=50 of samples were printed:

Flush    % of     %<=
in       flush    this
pages      es     size
------------------------------------------------------------------------------
    -1:   2.20%   2.20% avg cycles:  2283 cycles/page: xxxx samples: 23960
     1:  56.92%  59.12% avg cycles:  1276 cycles/page: 1276 samples: 620895
     2:  13.78%  72.90% avg cycles:  1505 cycles/page:  752 samples: 150335
     3:   8.26%  81.16% avg cycles:  1880 cycles/page:  626 samples: 90131
     4:   7.41%  88.58% avg cycles:  2447 cycles/page:  611 samples: 80877
     5:   1.73%  90.31% avg cycles:  2358 cycles/page:  471 samples: 18885
     6:   1.32%  91.63% avg cycles:  2563 cycles/page:  427 samples: 14397
     7:   1.14%  92.77% avg cycles:  2862 cycles/page:  408 samples: 12441
     8:   0.62%  93.39% avg cycles:  3542 cycles/page:  442 samples: 6721
     9:   0.08%  93.47% avg cycles:  3289 cycles/page:  365 samples: 917
    10:   0.43%  93.90% avg cycles:  3570 cycles/page:  357 samples: 4714
    11:   0.20%  94.10% avg cycles:  3767 cycles/page:  342 samples: 2145
    12:   0.08%  94.18% avg cycles:  3996 cycles/page:  333 samples: 864
    13:   0.03%  94.20% avg cycles:  4077 cycles/page:  313 samples: 289
    14:   0.02%  94.23% avg cycles:  4836 cycles/page:  345 samples: 236
    15:   0.04%  94.26% avg cycles:  5699 cycles/page:  379 samples: 390
    16:   0.06%  94.32% avg cycles:  5041 cycles/page:  315 samples: 643
    17:   0.57%  94.89% avg cycles:  5473 cycles/page:  321 samples: 6229
    18:   0.02%  94.91% avg cycles:  5396 cycles/page:  299 samples: 224
    19:   0.03%  94.95% avg cycles:  5296 cycles/page:  278 samples: 367
    20:   0.02%  94.96% avg cycles:  6749 cycles/page:  337 samples: 185
    21:   0.18%  95.14% avg cycles:  6225 cycles/page:  296 samples: 1964
    22:   0.01%  95.15% avg cycles:  6393 cycles/page:  290 samples: 83
    23:   0.01%  95.16% avg cycles:  6861 cycles/page:  298 samples: 61
    24:   0.12%  95.28% avg cycles:  6912 cycles/page:  288 samples: 1307
    25:   0.05%  95.32% avg cycles:  7190 cycles/page:  287 samples: 533
    26:   0.01%  95.33% avg cycles:  7793 cycles/page:  299 samples: 94
    27:   0.01%  95.34% avg cycles:  7833 cycles/page:  290 samples: 66
    28:   0.01%  95.35% avg cycles:  8253 cycles/page:  294 samples: 73
    29:   0.08%  95.42% avg cycles:  8024 cycles/page:  276 samples: 846
    30:   0.03%  95.45% avg cycles:  9670 cycles/page:  322 samples: 296
    31:   0.01%  95.46% avg cycles:  8949 cycles/page:  288 samples: 79
    32:   0.01%  95.46% avg cycles:  9350 cycles/page:  292 samples: 60
    33:   3.11%  98.57% avg cycles:  8534 cycles/page:  258 samples: 33936
    34:   0.02%  98.60% avg cycles: 10977 cycles/page:  322 samples: 268
    35:   0.02%  98.62% avg cycles: 11400 cycles/page:  325 samples: 177
    36:   0.01%  98.63% avg cycles: 11504 cycles/page:  319 samples: 161
    37:   0.02%  98.65% avg cycles: 11596 cycles/page:  313 samples: 182
    38:   0.02%  98.66% avg cycles: 11850 cycles/page:  311 samples: 195
    39:   0.01%  98.68% avg cycles: 12158 cycles/page:  311 samples: 128
    40:   0.01%  98.68% avg cycles: 11626 cycles/page:  290 samples: 78
    41:   0.04%  98.73% avg cycles: 11435 cycles/page:  278 samples: 477
    42:   0.01%  98.73% avg cycles: 12571 cycles/page:  299 samples: 74
    43:   0.01%  98.74% avg cycles: 12562 cycles/page:  292 samples: 78
    44:   0.01%  98.75% avg cycles: 12991 cycles/page:  295 samples: 108
    45:   0.01%  98.76% avg cycles: 13169 cycles/page:  292 samples: 78
    46:   0.02%  98.78% avg cycles: 12891 cycles/page:  280 samples: 261
    47:   0.01%  98.79% avg cycles: 13099 cycles/page:  278 samples: 67
    48:   0.01%  98.80% avg cycles: 13851 cycles/page:  288 samples: 77
    49:   0.01%  98.80% avg cycles: 13749 cycles/page:  280 samples: 66
    50:   0.01%  98.81% avg cycles: 13949 cycles/page:  278 samples: 73
    52:   0.00%  98.82% avg cycles: 14243 cycles/page:  273 samples: 52
    54:   0.01%  98.83% avg cycles: 15312 cycles/page:  283 samples: 87
    55:   0.01%  98.84% avg cycles: 15197 cycles/page:  276 samples: 109
    56:   0.02%  98.86% avg cycles: 15234 cycles/page:  272 samples: 208
    57:   0.00%  98.86% avg cycles: 14888 cycles/page:  261 samples: 53
    58:   0.01%  98.87% avg cycles: 15037 cycles/page:  259 samples: 59
    59:   0.01%  98.87% avg cycles: 15752 cycles/page:  266 samples: 63
    62:   0.00%  98.89% avg cycles: 16222 cycles/page:  261 samples: 54
    64:   0.02%  98.91% avg cycles: 17179 cycles/page:  268 samples: 248
    65:   0.12%  99.03% avg cycles: 18762 cycles/page:  288 samples: 1324
    85:   0.00%  99.10% avg cycles: 21649 cycles/page:  254 samples: 50
   127:   0.01%  99.18% avg cycles: 32397 cycles/page:  255 samples: 75
   128:   0.13%  99.31% avg cycles: 31711 cycles/page:  247 samples: 1466
   129:   0.18%  99.49% avg cycles: 33017 cycles/page:  255 samples: 1927
   181:   0.33%  99.84% avg cycles:  2489 cycles/page:   13 samples: 3547
   256:   0.05%  99.91% avg cycles:  2305 cycles/page:    9 samples: 550
   512:   0.03%  99.95% avg cycles:  2133 cycles/page:    4 samples: 304
  1512:   0.01%  99.99% avg cycles:  3038 cycles/page:    2 samples: 65

Here are the tlb counters during a 10-second slice of a kernel compile
for a SandyBridge system.  It's better than IvyBridge, but probably
due to the larger caches since this was one of the 'X' extreme parts.

    10,873,007,282      dtlb_load_misses_walk_duration
       250,711,333      dtlb_load_misses_walk_completed
     1,212,395,865      dtlb_store_misses_walk_duration
        31,615,772      dtlb_store_misses_walk_completed
     5,091,010,274      itlb_misses_walk_duration
       163,193,511      itlb_misses_walk_completed
         1,321,980      itlb_itlb_flush

      10.008045158 seconds time elapsed

# cat perf.stat.1392743721.txt | perl -pe 's/,//g' | awk '/itlb_misses_walk_duration/ { icyc+=$1 } /itlb_misses_walk_completed/ { imiss+=$1 } /dtlb_.*_walk_duration/ { dcyc+=$1 } /dtlb_.*.*completed/ { dmiss+=$1 } END {print "itlb cyc/miss: ", icyc/imiss/3.3, " dtlb cyc/miss: ", dcyc/dmiss/3.3, "   -----    ", icyc,imiss, dcyc,dmiss }'
itlb ns/miss:  9.45338  dtlb ns/miss:  12.9716

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154103.10C1115E@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-31 08:48:51 -07:00
Dave Hansen
2d040a1ce9 x86/mm: New tunable for single vs full TLB flush
Most of the logic here is in the documentation file.  Please take
a look at it.

I know we've come full-circle here back to a tunable, but this
new one is *WAY* simpler.  I challenge anyone to describe in one
sentence how the old one worked.  Here's the way the new one
works:

	If we are flushing more pages than the ceiling, we use
	the full flush, otherwise we use per-page flushes.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154101.12B52CAF@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-31 08:48:51 -07:00
Dave Hansen
d17d8f9ded x86/mm: Add tracepoints for TLB flushes
We don't have any good way to figure out what kinds of flushes
are being attempted.  Right now, we can try to use the vm
counters, but those only tell us what we actually did with the
hardware (one-by-one vs full) and don't tell us what was actually
_requested_.

This allows us to select out "interesting" TLB flushes that we
might want to optimize (like the ranged ones) and ignore the ones
that we have very little control over (the ones at context
switch).

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154059.4C96CBA5@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-31 08:48:51 -07:00
Dave Hansen
a23421f111 x86/mm: Unify remote INVLPG code
There are currently three paths through the remote flush code:

1. full invalidation
2. single page invalidation using invlpg
3. ranged invalidation using invlpg

This takes 2 and 3 and combines them in to a single path by
making the single-page one just be the start and end be start
plus a single page.  This makes placement of our tracepoint easier.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154058.E0F90408@viggo.jf.intel.com
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-31 08:48:51 -07:00
Dave Hansen
9dfa6dee53 x86/mm: Fix missed global TLB flush stat
If we take the

	if (end == TLB_FLUSH_ALL || vmflag & VM_HUGETLB) {
		local_flush_tlb();
		goto out;
	}

path out of flush_tlb_mm_range(), we will have flushed the tlb,
but not incremented NR_TLB_LOCAL_FLUSH_ALL.  This unifies the
way out of the function so that we always take a single path when
doing a full tlb flush.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154056.FF763B76@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-31 08:48:50 -07:00
Dave Hansen
e9f4e0a9fe x86/mm: Rip out complicated, out-of-date, buggy TLB flushing
I think the flush_tlb_mm_range() code that tries to tune the
flush sizes based on the CPU needs to get ripped out for
several reasons:

1. It is obviously buggy.  It uses mm->total_vm to judge the
   task's footprint in the TLB.  It should certainly be using
   some measure of RSS, *NOT* ->total_vm since only resident
   memory can populate the TLB.
2. Haswell, and several other CPUs are missing from the
   intel_tlb_flushall_shift_set() function.  Thus, it has been
   demonstrated to bitrot quickly in practice.
3. It is plain wrong in my vm:
	[    0.037444] Last level iTLB entries: 4KB 0, 2MB 0, 4MB 0
	[    0.037444] Last level dTLB entries: 4KB 0, 2MB 0, 4MB 0
	[    0.037444] tlb_flushall_shift: 6
   Which leads to it to never use invlpg.
4. The assumptions about TLB refill costs are wrong:
	http://lkml.kernel.org/r/1337782555-8088-3-git-send-email-alex.shi@intel.com
    (more on this in later patches)
5. I can not reproduce the original data: https://lkml.org/lkml/2012/5/17/59
   I believe the sample times were too short.  Running the
   benchmark in a loop yields times that vary quite a bit.

Note that this leaves us with a static ceiling of 1 page.  This
is a conservative, dumb setting, and will be revised in a later
patch.

This also removes the code which attempts to predict whether we
are flushing data or instructions.  We expect instruction flushes
to be relatively rare and not worth tuning for explicitly.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154055.ABC88E89@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-31 08:48:50 -07:00
Dave Hansen
4995ab9cf5 x86/mm: Clean up the TLB flushing code
The

	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids)

line of code is not exactly the easiest to audit, especially when
it ends up at two different indentation levels.  This eliminates
one of the the copy-n-paste versions.  It also gives us a unified
exit point for each path through this function.  We need this in
a minute for our tracepoint.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: http://lkml.kernel.org/r/20140731154054.44F1CDDC@viggo.jf.intel.com
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-31 08:48:50 -07:00
David Rientjes
2f078b9cb8 x86, apic: Remove enable_apic_mode callback
The enable_apic_mode() apic callback is never called, so remove it.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302352320.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:44 -07:00
David Rientjes
11a8318ef5 x86, apic: Remove setup_portio_remap callback
Since commit b5660ba76b41 ("x86, platforms: Remove NUMAQ") removed NUMAQ,
the setup_portio_remap() apic callback has been obsolete.  Remove it.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302351480.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:44 -07:00
David Rientjes
e76661ba09 x86, apic: Remove multi_timer_check callback
Since commit b5660ba76b41 ("x86, platforms: Remove NUMAQ") removed NUMAQ,
the multi_timer_check() apic callback has been obsolete.  Remove it.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302351120.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:43 -07:00
David Rientjes
4c59f3e63d x86, apic: Replace noop_check_apicid_used
noop_check_apicid_used() has the same implementation as
default_check_apicid_used() in the standard header file, so replace the
former with the latter.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302350450.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:43 -07:00
David Rientjes
658ffd7e6f x86, apic: Remove check_apicid_present callback
The check_apicid_present() apic callback is never called, so remove it
and functions that implement it.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302350160.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:42 -07:00
David Rientjes
c460b5d340 x86, apic: Remove mps_oem_check callback
Since commit b5660ba76b41 ("x86, platforms: Remove NUMAQ") removed NUMAQ,
the mps_oem_check() apic callback has been obsolete.  Remove it.

This allows generic_mps_oem_check() to be removed as well.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302349390.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:42 -07:00
David Rientjes
300eddf967 x86, apic: Remove smp_callin_clear_local_apic callback
Since commit b5660ba76b41 ("x86, platforms: Remove NUMAQ") removed NUMAQ,
the smp_callin_clear_local_apic() apic callback has been obsolete.
Remove it.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302349040.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:41 -07:00
David Rientjes
6ab1b27c84 x86, apic: Replace trampoline physical addresses with defaults
The trampoline_phys_{high,low} members of struct apic are always
initialized to DEFAULT_TRAMPOLINE_PHYS_HIGH and TRAMPOLINE_PHYS_LOW,
respectively.  Hardwire the constants and remove the unneeded members.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302348330.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:41 -07:00
David Rientjes
80a2670379 x86, apic: Remove x86_32_numa_cpu_node callback
Since commit b5660ba76b41 ("x86, platforms: Remove NUMAQ") removed NUMAQ,
the x86_32_numa_cpu_node() apic callback has been obsolete.  Remove it.

Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1407302348060.17503@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2014-07-31 08:05:40 -07:00
Mark D Rustad
42cbc04fd3 x86/kvm: Resolve shadow warnings in macro expansion
Resolve shadow warnings that appear in W=2 builds. Instead of
using ret to hold the return pointer, save the length in a new
variable saved_len and compute the pointer on exit. This also
resolves a very technical error, in that ret was declared as
a const char *, when it really was a char * const.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-31 16:33:29 +02:00
David S. Miller
f139c74a8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 13:25:49 -07:00
H. Peter Anvin
c3107e3c50 APEI is currently implemented so that it depends on x86 hardware.
The primary dependency is that GHES uses the x86 NMI for hardware
 error notification and MCE for memory error handling. These patches
 remove that dependency.
 
 Other APEI features such as error reporting via external IRQ, error
 serialization, or error injection, do not require changes to use them
 on non-x86 architectures.
 
 The following patch set eliminates the APEI Kconfig x86 dependency
 by making these changes:
 - treat NMI notification as GHES architecture - HAVE_ACPI_APEI_NMI
 - group and wrap around #ifdef CONFIG_HAVE_ACPI_APEI_NMI code which
   is used only for NMI path
 - identify architectural boxes and abstract it accordingly (tlb flush and MCE)
 - rework ioremap for both IRQ and NMI context
 
 NMI code is kept in ghes.c file since NMI and IRQ context are tightly coupled.
 
 Note, these patches introduce no functional changes for x86. The NMI notification
 feature is hard selected for x86. Architectures that want to use this
 feature should also provide NMI code infrastructure.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJT2BaPAAoJEKurIx+X31iBLGMP/0yyWOna4229p9CmuElSP3os
 Kb+9Thru+Wg4ihj43CYW0nznQnamCaqBa5NpDXZn0Ebtxc08SSGVzbf+z+vBMeD+
 HW4093m4g8sGL7i4JdAol0MEPpKTQRdpj525N/h/xWVSDXQ0Bq3vQ7DS1/j1Bp4k
 Lq3G8dEk+4LjNPcQ5YBPl71zWJOC4iUctfh1OpFdfgA04804Vis3j8T6ljE7/72M
 51xXK3af9ktIg6MU2HOwraUsSspVeJs/4lPu4fab4XI07BRDb4T7yx19a9VaBy67
 m6TaTd3eC/Z0Uh+51grNuXSnWQK4fvahRZJEwiRdC0wL3w3mhdZkmqm0nBdBFyof
 5b251+FOazOtZdMsWS/mMjQUjybQ+4k9zpnndIPw/5rqxJ8lgaP7o81e+hw1Xh1Q
 E0ZWUMXnAIkRmkyYLUv5aTICRYIZtAC/C1QrR5ZB/9Q+yvtxp13dbqGzWhcF7AIw
 UK/yb5T5ZAzvuJlmPG0ZiV75HH9bjX4OFV3AhXJIEG/iTOdVVpat8yICFrT33Xpc
 uAwRXQvz6mn2c2xpZcJqSJQlXKg2nbrfUmscU8P8Zu6mQpvBB/+2cDbW/5wfuKbE
 NpD0aB5PxhHY+nNvIfOsTUk72aZcZdUEQJt/792vhnMYb/IK1X/qa4zrVmOqlZKt
 mtXwUQWdj3kSG36mgssO
 =nYdd
 -----END PGP SIGNATURE-----

Merge tag 'please-pull-apei' into x86/ras

APEI is currently implemented so that it depends on x86 hardware.
The primary dependency is that GHES uses the x86 NMI for hardware
error notification and MCE for memory error handling. These patches
remove that dependency.

Other APEI features such as error reporting via external IRQ, error
serialization, or error injection, do not require changes to use them
on non-x86 architectures.

The following patch set eliminates the APEI Kconfig x86 dependency
by making these changes:
- treat NMI notification as GHES architecture - HAVE_ACPI_APEI_NMI
- group and wrap around #ifdef CONFIG_HAVE_ACPI_APEI_NMI code which
  is used only for NMI path
- identify architectural boxes and abstract it accordingly (tlb flush and MCE)
- rework ioremap for both IRQ and NMI context

NMI code is kept in ghes.c file since NMI and IRQ context are tightly coupled.

Note, these patches introduce no functional changes for x86. The NMI notification
feature is hard selected for x86. Architectures that want to use this
feature should also provide NMI code infrastructure.
2014-07-30 10:48:00 -07:00
Linus Torvalds
acba648dca Fix BUG when trying to expand the grant table. This seems to occur
often during boot with Ubuntu 14.04 PV guests.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQEcBAABAgAGBQJT2PhgAAoJEFxbo/MsZsTRlzIH/1HjbkGZmRlOj5wcrYlWCUJ/
 DGLBHc76so52xd9oP8COT5tuSVP6/usPPLFaOmVZ7fMiOpoyz9d3lc0g56otw3gJ
 tTUFTyW0EoFtvmIl50OMC726p9azETjA3P2XJkV/D3GhBGGqgrP5uR+mRvisvq3y
 eGZEx1UIHv1jov47TBFR1NcckXBWw+6J9m34y9h6an9VNDCuuGwYZ8dfGAFsLrVb
 lGLTmgQQmyk4SexVINfOwL40KkVDVEq+X74HcPviyNHEIy66xLzMtKpL+Sf4xeuv
 VG3JhqAUGuRGGK48rrbpxhBbpxGp35O9RV68YrGssxfuTejSYduw5zTzzt30QIA=
 =cr8X
 -----END PGP SIGNATURE-----

Merge tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Pull Xen fix from David Vrabel:
 "Fix BUG when trying to expand the grant table.  This seems to occur
  often during boot with Ubuntu 14.04 PV guests"

* tag 'stable/for-linus-3.16-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  x86/xen: safely map and unmap grant frames when in atomic context
2014-07-30 09:00:20 -07:00
Chris J Arges
296f047502 KVM: vmx: remove duplicate vmx_mpx_supported() prototype
Remove a prototype which was added by both 93c4adc7afe and 36be0b9deb2.

Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-30 17:43:57 +02:00
David Vrabel
b7dd0e350e x86/xen: safely map and unmap grant frames when in atomic context
arch_gnttab_map_frames() and arch_gnttab_unmap_frames() are called in
atomic context but were calling alloc_vm_area() which might sleep.

Also, if a driver attempts to allocate a grant ref from an interrupt
and the table needs expanding, then the CPU may already by in lazy MMU
mode and apply_to_page_range() will BUG when it tries to re-enable
lazy MMU mode.

These two functions are only used in PV guests.

Introduce arch_gnttab_init() to allocates the virtual address space in
advance.

Avoid the use of apply_to_page_range() by using saving and using the
array of PTE addresses from the alloc_vm_area() call (which ensures
that the required page tables are pre-allocated).

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2014-07-30 14:22:47 +01:00
Andy Lutomirski
7209a75d20 x86_64/entry/xen: Do not invoke espfix64 on Xen
This moves the espfix64 logic into native_iret.  To make this work,
it gets rid of the native patch for INTERRUPT_RETURN:
INTERRUPT_RETURN on native kernels is now 'jmp native_iret'.

This changes the 16-bit SS behavior on Xen from OOPSing to leaking
some bits of the Xen hypervisor's RSP (I think).

[ hpa: this is a nonzero cost on native, but probably not enough to
  measure. Xen needs to fix this in their own code, probably doing
  something equivalent to espfix64. ]

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/7b8f1d8ef6597cb16ae004a43c56980a7de3cf94.1406129132.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org>
2014-07-28 15:25:40 -07:00
Alexander Graf
784aa3d7fb KVM: Rename and add argument to check_extension
In preparation to make the check_extension function available to VM scope
we add a struct kvm * argument to the function header and rename the function
accordingly. It will still be called from the /dev/kvm fd, but with a NULL
argument for struct kvm *.

Signed-off-by: Alexander Graf <agraf@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-28 15:23:17 +02:00
Ingo Molnar
5030c69755 Linux 3.16-rc7
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJT1VYNAAoJEHm+PkMAQRiGQJwIAKSYp1Uqz5O/e5r0V1TlZKT4
 1B4Njopl57PwSrJQWcGEuH2yHyM896vfPO4L6BJIOfyWzh8kwpQqclDt6uhXoF/v
 OsO1zb/7/j+n/pDZsePqP9AyIgErsHEBgUbhecDqzjN++ITPcZjQ6TIMPglZaumN
 jFAdAZuAaEwqAk8jqN2wlm689Fh9MuUEarHXbXLCqu5RgLrWhFGhp/cTWY62aqnZ
 XfEeQ9KtpRZmlR/IYjerbb1eRH7ZdJsZ88WngLX9dj/JdNxHWBkWQBXGAusXk5Fk
 y6LsIV3TjyBdrRKJ1Ifyg/2EIXHNBs8HxTFGXpjtp2HPuMLDxZOWOWikb9URtNg=
 =Fjf4
 -----END PGP SIGNATURE-----

Merge tag 'v3.16-rc7' into perf/core, to merge in the latest fixes before applying new changes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-07-28 10:00:33 +02:00
Rafael J. Wysocki
8989e1cc35 Merge branch 'acpi-config'
* acpi-config:
  ACPI / processor: Introduce ARCH_MIGHT_HAVE_ACPI_PDC
  ACPI: Don't use acpi_lapic in ACPI core code
  ACPI: add config for BIOS table scan
2014-07-27 23:52:48 +02:00
Rafael J. Wysocki
7e1c1a82f5 Merge branch 'acpi-headers'
* acpi-headers:
  ACPI: Add support to force header inclusion rules for <acpi/acpi.h>.
  ACPI / SFI: Fix wrong <acpi/acpi.h> inclusion in SFI/ACPI wrapper - table definitions.
  ACPICA: Linux: Allow ACPICA inclusion for CONFIG_ACPI=n builds.
  ACPICA: Linux: Add support to exclude <asm/acenv.h> inclusion.
  ACPICA: Linux: Add stub implementation of ACPICA 64-bit mathematics.
  ACPICA: Linux: Add stub support for Linux specific variables and functions.
2014-07-27 23:52:05 +02:00
Linus Torvalds
9dae0a3fc4 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A bunch of fixes for perf and kprobes:
   - revert a commit that caused a perf group regression
   - silence dmesg spam
   - fix kprobe probing errors on ia64 and ppc64
   - filter kprobe faults from userspace
   - lockdep fix for perf exit path
   - prevent perf #GP in KVM guest
   - correct perf event and filters"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
  kprobes/x86: Don't try to resolve kprobe faults from userspace
  perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
  perf/x86/intel: Protect LBR and extra_regs against KVM lying
  perf: Fix lockdep warning on process exit
  perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
  perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
  perf: Revert ("perf: Always destroy groups on exit")
2014-07-27 09:57:16 -07:00
Linus Torvalds
43a255c210 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Peter Anvin:
 "A couple of crash fixes, plus a fix that on 32 bits would cause a
  missing -ENOSYS for nonexistent system calls"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, cpu: Fix cache topology for early P4-SMT
  x86_32, entry: Store badsys error code in %eax
  x86, MCE: Robustify mcheck_init_device
2014-07-27 09:53:01 -07:00
Andy Lutomirski
53b884ac37 x86_64/vsyscall: Fix warn_bad_vsyscall log output
This commit in Linux 3.6:

    commit c767a54ba0657e52e6edaa97cbe0b0a8bf1c1655
    Author: Joe Perches <joe@perches.com>
    Date:   Mon May 21 19:50:07 2012 -0700

        x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>

caused warn_bad_vsyscall to output garbage in the middle of the
line.  Revert the bad part of it.

The printk in question isn't actually bare; the level is "%s".

The bug this fixes is purely cosmetic; backports are optional.

Cc: <stable@vger.kernel.org> # v3.6+
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/03eac1f24110bbe496ecc12a4df467e0d88466d4.1406330947.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-25 16:34:15 -07:00
Andy Lutomirski
ac379835e8 x86/vdso: Set VM_MAYREAD for the vvar vma
The VVAR area can, obviously, be read; that is kind of the point.

AFAIK this has no effect whatsoever unless x86 suddenly turns into a
nommu architecture.  Nonetheless, not setting it is suspicious.

Reported-by: Nathan Lynch <Nathan_Lynch@mentor.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/e4c8bf4bc2725bda22c4a4b7d0c82adcd8f8d9b8.1406330779.git.luto@amacapital.net
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-25 16:32:53 -07:00
Li, Aubrey
f855911c1f x86/pmc_atom: Expose PMC device state and platform sleep state
Add the following interfaces to exposes PMC device state and sleep
state residency via debugfs:
	/sys/kernel/debugfs/pmc_atom/dev_state
	/sys/kernel/debugfs/pmc_atom/sleep_state

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Link: http://lkml.kernel.org/r/53B0FF59.8000600@linux.intel.com
Signed-off-by: Kasagar, Srinidhi <srinidhi.kasagar@intel.com>
Reviewed-by: Rudramuni, Vishwesh M <vishwesh.m.rudramuni@intel.com>
Reviewed-by: Joe Perches <joe@perches.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-25 14:12:14 -07:00
Li, Aubrey
b00055cade x86/pmc_atom: Eisable a few S0ix wake up events for S0ix residency
Disable PMC S0IX_WAKE_EN events coming from LPC block(unused) and
also from GPIO_SUS ored dedicated IRQs (must be disabled as per PMC
programming rule), GPIOSCORE ored dedicated IRQs (must be disabled
as per PMC programming rule), GPIO_SUS shared IRQ (not necessary
since the IOAPIC_DS wake event will still work), GPIO_SCORE shared
IRQ (not necessary since the IOAPIC_DS wake event will still work).

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Link: http://lkml.kernel.org/r/53B0FF22.5080403@linux.intel.com
Signed-off-by: Olivier Leveque <olivier.leveque@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-25 14:11:58 -07:00
Li, Aubrey
93e5eadd1f x86/platform: New Intel Atom SOC power management controller driver
The Power Management Controller (PMC) controls many of the power
management features present in the Atom SoC. This driver provides
a native power off function via PMC PCI IO port.

On some ACPI hardware-reduced platforms(e.g. ASUS-T100), ACPI sleep
registers are not valid so that (*pm_power_off)() is not hooked by
acpi_power_off(). The power off function in this driver is installed
only when pm_power_off is NULL.

Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Link: http://lkml.kernel.org/r/53B0FEEA.3010805@linux.intel.com
Signed-off-by: Lejun Zhu <lejun.zhu@linux.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-25 14:11:29 -07:00
Mark Rustad
b55a8144d1 x86/kvm: Resolve shadow warning from min macro
Resolve a shadow warning generated in W=2 builds by the nested
use of the min macro by instead using the min3 macro for the
minimum of 3 values.

Signed-off-by: Mark Rustad <mark.d.rustad@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-25 16:05:54 +02:00
Alexei Starovoitov
2695fb552c net: filter: rename 'struct sock_filter_int' into 'struct bpf_insn'
eBPF is used by socket filtering, seccomp and soon by tracing and
exposed to userspace, therefore 'sock_filter_int' name is not accurate.
Rename it to 'bpf_insn'

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-24 23:27:17 -07:00
H. Peter Anvin
bf72f5dee0 Promote one fix for 3.16
This fix was necessary after
 
 9c15a24b038f ("x86/mce: Improve mcheck_init_device() error handling")
 
 went in. What this patch did was, among others, check the return value
 of misc_register and exit early if it encountered an error. Original
 code sloppily didn't do that.
 
 However,
 
         cef12ee52b05 ("xen/mce: Add mcelog support for Xen platform")
 
 made it so that xen's init routine xen_late_init_mcelog runs first. This
 was needed for the xen mcelog device which is supposed to be independent
 from the baremetal one.
 
 Initially it was reported that misc_register() fails often on xen and
 that's why it needed fixing. However, it is *supposed* to fail by
 design, when running in dom0 so that the xen mcelog device file gets
 registered first.
 
 And *then* you need the notifier *not* unregistered on the error path so
 that the timer does get deleted properly in the CPU hotplug notifier.
 
 Btw, this fix is needed also on baremetal in the unlikely event that
 misc_register(&mce_chrdev_device) fails there too.
 
 I was unsure whether to rush it in now and decided to delay it to 3.17.
 However, xen people wanted it promoted as it breaks xen when doing cpu
 hotplug there. So, after a bit of simmering in tip/master for initial
 smoke testing, let's move it to 3.16. It fixes a semi-regression which
 got introduced in 3.16 so no need for stable tagging.
 
 tip/x86/ras contains that exact same commit but we can't remove it
 there as it is not the last one. It won't cause any merge issues, as I
 confirmed locally but I should state here the special situation of this
 one fix explicitly anyway.
 
 Thanks.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJTzUMOAAoJEBLB8Bhh3lVKDC0P/3azgACl7qXkpUoFIUW4RSGJ
 3psmzjYd8RU2Trh7+8NFTUccLYIe6+4wyE8R4fgx1iV3kuUPOuAIRn0sRk4/TgCC
 IwA+j9v28jilFQvXuv+v105mIdiTmKJotcL3+k4I3HUR8RehnHK+BrlxR8e/zQzU
 W/Vk9cgz8wzPxMmmpvUE8OHmWYEk6h/rkCz6TF0bAJbgSq9iN37mmpKfXwwvaIRi
 CFP/UXIduipEuYUd+8fJqUg14roDFvA5hXPz7abM0gSr1NjR/k724f2G4iDnez0V
 pm8I+PmyEaZYI6udUrbyZ0X9wTcgHLe6GtwyEIBaOOqkMZbjC4anEe7VUJFIKr5/
 dNN538DGXxNfrf0u5lkHF/SCiE/T3eoP8imaMeZXwE04HHjWOfvqJs7miUPfGjKE
 VYvbY2fVbjbYCFjgqc2XkJgPpwGpOF2WCaXva2hyIRKrXsgt725wvoBCnNMX+AD7
 l71AeexVCN3TGmP7Twi7W8d23QPR0+M7x5SNkFaUTepJ2f+s3Fpeopq9mfbAtM6i
 p6fRDVncOBalNW3G5hED3YYcjfjQgpiKMQ7zm84K71NkSCBOsjiatFrPD+Y0lBwM
 LNtOViTXFvXW5xMBDDjQq3+InIbzt7MlQMVP4nP6rkEjOLfR60+ivr9sPUuqTb//
 hqI5bCPgWhxEBrzA7iIj
 =z/P/
 -----END PGP SIGNATURE-----

x86: Merge tag 'ras_urgent' into x86/urgent

Promote one fix for 3.16

This fix was necessary after

9c15a24b038f ("x86/mce: Improve mcheck_init_device() error handling")

went in. What this patch did was, among others, check the return value
of misc_register and exit early if it encountered an error. Original
code sloppily didn't do that.

However,

        cef12ee52b05 ("xen/mce: Add mcelog support for Xen platform")

made it so that xen's init routine xen_late_init_mcelog runs first. This
was needed for the xen mcelog device which is supposed to be independent
from the baremetal one.

Initially it was reported that misc_register() fails often on xen and
that's why it needed fixing. However, it is *supposed* to fail by
design, when running in dom0 so that the xen mcelog device file gets
registered first.

And *then* you need the notifier *not* unregistered on the error path so
that the timer does get deleted properly in the CPU hotplug notifier.

Btw, this fix is needed also on baremetal in the unlikely event that
misc_register(&mce_chrdev_device) fails there too.

I was unsure whether to rush it in now and decided to delay it to 3.17.
However, xen people wanted it promoted as it breaks xen when doing cpu
hotplug there. So, after a bit of simmering in tip/master for initial
smoke testing, let's move it to 3.16. It fixes a semi-regression which
got introduced in 3.16 so no need for stable tagging.

tip/x86/ras contains that exact same commit but we can't remove it
there as it is not the last one. It won't cause any merge issues, as I
confirmed locally but I should state here the special situation of this
one fix explicitly anyway.

Thanks.

Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2014-07-24 16:32:31 -07:00
Paolo Bonzini
03916db934 Replace NR_VMX_MSR with its definition
Using ARRAY_SIZE directly makes it easier to read the code.  While touching
the code, replace the division by a multiplication in the recently added
BUILD_BUG_ON.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-24 14:21:57 +02:00
Nadav Amit
0123be429f KVM: x86: Assertions to check no overrun in MSR lists
Currently there is no check whether shared MSRs list overrun the allocated size
which can results in bugs. In addition there is no check that vmx->guest_msrs
has sufficient space to accommodate all the VMX msrs.  This patch adds the
assertions.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-24 14:16:57 +02:00
Nadav Amit
d6e8c85456 KVM: x86: set rflags.rf during fault injection
x86 does not automatically set rflags.rf during event injection. This patch
does partial job, setting rflags.rf upon fault injection.  It does not handle
the setting of RF upon interrupt injection on rep-string instruction.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-24 14:04:01 +02:00
Nadav Amit
b9a1ecb909 KVM: x86: Setting rflags.rf during rep-string emulation
This patch updates RF for rep-string emulation.  The flag is set upon the first
iteration, and cleared after the last (if emulated). It is intended to make
sure that if a trap (in future data/io #DB emulation) or interrupt is delivered
to the guest during the rep-string instruction, RF will be set correctly. RF
affects whether instruction breakpoint in the guest is masked.

Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-24 14:03:54 +02:00
Thomas Gleixner
d28ede8379 timekeeping: Create struct tk_read_base and use it in struct timekeeper
The members of the new struct are the required ones for the new NMI
safe accessor to clcok monotonic. In order to reuse the existing
timekeeping code and to make the update of the fast NMI safe
timekeepers a simple memcpy use the struct for the timekeeper as well
and convert all users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 15:01:53 -07:00