18420 Commits

Author SHA1 Message Date
Al Viro
3064c3563b death to mnt_pinned
Rather than playing silly buggers with vfsmount refcounts, just have
acct_on() ask fs/namespace.c for internal clone of file->f_path.mnt
and replace it with said clone.  Then attach the pin to original
vfsmount.  Voila - the clone will be alive until the file gets closed,
making sure that underlying superblock remains active, etc., and
we can drop the original vfsmount, so that it's not kept busy.
If the file lives until the final mntput of the original vfsmount,
we'll notice that there's an fs_pin (one in bsd_acct_struct that
holds that file) and mnt_pin_kill() will take it out.  Since
->kill() is synchronous, we won't proceed past that point until
these files are closed (and private clones of our vfsmount are
gone), so we get the same ordering warranties we used to get.

mnt_pin()/mnt_unpin()/->mnt_pinned is gone now, and good riddance -
it never became usable outside of kernel/acct.c (and racy wrt
umount even there).

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:09 -04:00
Al Viro
efb170c228 take fs_pin stuff to fs/*
Add a new field to fs_pin - kill(pin).  That's what umount and r/o remount
will be calling for all pins attached to vfsmount and superblock resp.
Called after bumping the refcount, so it won't go away under us.  Dropping
the refcount is responsibility of the instance.  All generic stuff moved to
fs/fs_pin.c; the next step will rip all the knowledge of kernel/acct.c from
fs/super.c and fs/namespace.c.  After that - death to mnt_pin(); it was
intended to be usable as generic mechanism for code that wants to attach
objects to vfsmount, so that they would not make the sucker busy and
would get killed on umount.  Never got it right; it remained acct.c-specific
all along.  Now it's very close to being killable.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
1629d0eb3e start carving bsd_acct_struct up
pull generic parts into struct fs_pin.  Eventually we want those
to replace mnt_pin()/mnt_unpin() mess; that stuff will move to
fs/*.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
215748e67d acct: move mnt_pin() upwards.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
17c0a5aaff make acct_kill() wait for file closing.
Do actual closing of file via schedule_work().  And use
__fput_sync() there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
2798d4ce61 acct: get rid of acct_lock for acct->count
* make acct->count atomic and acct freeing - rcu-delayed.
* instead of grabbing acct_lock around the places where we take a reference,
do that under rcu_read_lock() with atomic_long_inc_not_zero().
* have the new acct locked before making ns->bacct point to it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
215752fce3 acct: get rid of acct_list
Put these suckers on per-vfsmount and per-superblock lists instead.
Note: right now it's still acct_lock for everything, but that's
going to change.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
54a4d58a64 acct: simplify check_free_space()
a) file can't be NULL
b) file can't be changed under us
c) all writes are serialized by acct->lock; no need to mess with
spinlock there.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
b8f00e6be4 acct: new lifetime rules
Do not reuse bsd_acct_struct after closing the damn thing.
Structure lifetime is controlled by refcount now.  We also
have a mutex in there, held over closing and writing (the
file is O_APPEND, so we are not losing any concurrency).

As the result, we do not need to bother with get_file()/fput()
on log write anymore.  Moreover, do_acct_process() only needs
acct itself; file and pidns are picked from it.

Killed instances are distinguished by having NULL ->ns.
Refcount is protected by acct_lock; anybody taking the
mutex needs to grab a reference first.

The things will get a lot simpler in the next commits - this
is just the minimal chunk switching to the new lifetime rules.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
9df7fa16ee acct: serialize acct_on()
brute-force - on a global mutex that isn't nested into anything.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:08 -04:00
Al Viro
795a2f22a8 acct() should honour the limits from the very beginning
We need to check free space on the first write to freshly opened log.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
e25ff11ff1 split the slow path in acct_process() off
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
cdd37e2309 separate namespace-independent parts of filling acct_t
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
ed44724b79 acct: switch to __kernel_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
ecfdb33d1f acct: encode_comp_t(0) is 0, fortunately...
There was an amusing bogosity in ac_rw calculation - it tried to
do encode_comp_t(encode_comp_t(0) / 1024).  Seeing that comp_t is
a 3-bit exponent + 13-bit mantissa... it's a good thing that 0 is
represented by all-bits-clear.

The history of that one is interesting - it was introduced in
2.1.68pre1, when acct.c had been reworked and moved to separate
file.  Two months later (2.1.86) somebody has noticed that the
sucker won't compile - there was no task_struct::io_usage.
At which point the ac_io calculation had changed from
encode_comp_t(current->io_usage) to encode_comp_t(0) and the
bug in the next line (absolutely real back then, had it ever
managed to compile) become a harmless bogosity.  Looks like
nobody has ever noticed until now.

Anyway, let's bury that idiocy now that it got noticed.  17 years
is long enough...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-08-07 14:40:07 -04:00
Al Viro
82df9c8beb Merge commit 'ccbf62d8a284cf181ac28c8e8407dd077d90dd4b' into for-next
backmerge to avoid kernel/acct.c conflict
2014-08-07 14:07:57 -04:00
Linus Torvalds
8d71844b51 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "Two fixes in the timer area:
   - a long-standing lock inversion due to a printk
   - suspend-related hrtimer corruption in sched_clock"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
  sched_clock: Avoid corrupting hrtimer tree during suspend
2014-08-03 09:58:20 -07:00
Jan Kara
504d58745c timer: Fix lock inversion between hrtimer_bases.lock and scheduler locks
clockevents_increase_min_delta() calls printk() from under
hrtimer_bases.lock. That causes lock inversion on scheduler locks because
printk() can call into the scheduler. Lockdep puts it as:

======================================================
[ INFO: possible circular locking dependency detected ]
3.15.0-rc8-06195-g939f04b #2 Not tainted
-------------------------------------------------------
trinity-main/74 is trying to acquire lock:
 (&port_lock_key){-.....}, at: [<811c60be>] serial8250_console_write+0x8c/0x10c

but task is already holding lock:
 (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #5 (hrtimer_bases.lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<8103c918>] __hrtimer_start_range_ns+0x1c/0x197
       [<8107ec20>] perf_swevent_start_hrtimer.part.41+0x7a/0x85
       [<81080792>] task_clock_event_start+0x3a/0x3f
       [<810807a4>] task_clock_event_add+0xd/0x14
       [<8108259a>] event_sched_in+0xb6/0x17a
       [<810826a2>] group_sched_in+0x44/0x122
       [<81082885>] ctx_sched_in.isra.67+0x105/0x11f
       [<810828e6>] perf_event_sched_in.isra.70+0x47/0x4b
       [<81082bf6>] __perf_install_in_context+0x8b/0xa3
       [<8107eb8e>] remote_function+0x12/0x2a
       [<8105f5af>] smp_call_function_single+0x2d/0x53
       [<8107e17d>] task_function_call+0x30/0x36
       [<8107fb82>] perf_install_in_context+0x87/0xbb
       [<810852c9>] SYSC_perf_event_open+0x5c6/0x701
       [<810856f9>] SyS_perf_event_open+0x17/0x19
       [<8142f8ee>] syscall_call+0x7/0xb

-> #4 (&ctx->lock){......}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

-> #3 (&rq->lock){-.-.-.}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f04c>] _raw_spin_lock+0x21/0x30
       [<81040873>] __task_rq_lock+0x33/0x3a
       [<8104184c>] wake_up_new_task+0x25/0xc2
       [<8102474b>] do_fork+0x15c/0x2a0
       [<810248a9>] kernel_thread+0x1a/0x1f
       [<814232a2>] rest_init+0x1a/0x10e
       [<817af949>] start_kernel+0x303/0x308
       [<817af2ab>] i386_start_kernel+0x79/0x7d

-> #2 (&p->pi_lock){-.-...}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<810413dd>] try_to_wake_up+0x1d/0xd6
       [<810414cd>] default_wake_function+0xb/0xd
       [<810461f3>] __wake_up_common+0x39/0x59
       [<81046346>] __wake_up+0x29/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #1 (&tty->write_wait){-.....}:
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<81046332>] __wake_up+0x15/0x3b
       [<811b8733>] tty_wakeup+0x49/0x51
       [<811c3568>] uart_write_wakeup+0x17/0x19
       [<811c5dc1>] serial8250_tx_chars+0xbc/0xfb
       [<811c5f28>] serial8250_handle_irq+0x54/0x6a
       [<811c5f57>] serial8250_default_handle_irq+0x19/0x1c
       [<811c56d8>] serial8250_interrupt+0x38/0x9e
       [<810510e7>] handle_irq_event_percpu+0x5f/0x1e2
       [<81051296>] handle_irq_event+0x2c/0x43
       [<81052cee>] handle_level_irq+0x57/0x80
       [<81002a72>] handle_irq+0x46/0x5c
       [<810027df>] do_IRQ+0x32/0x89
       [<8143036e>] common_interrupt+0x2e/0x33
       [<8142f23c>] _raw_spin_unlock_irqrestore+0x3f/0x49
       [<811c25a4>] uart_start+0x2d/0x32
       [<811c2c04>] uart_write+0xc7/0xd6
       [<811bc6f6>] n_tty_write+0xb8/0x35e
       [<811b9beb>] tty_write+0x163/0x1e4
       [<811b9cd9>] redirected_tty_write+0x6d/0x75
       [<810b6ed6>] vfs_write+0x75/0xb0
       [<810b7265>] SyS_write+0x44/0x77
       [<8142f8ee>] syscall_call+0x7/0xb

-> #0 (&port_lock_key){-.....}:
       [<8104a62d>] __lock_acquire+0x9ea/0xc6d
       [<8104a942>] lock_acquire+0x92/0x101
       [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
       [<811c60be>] serial8250_console_write+0x8c/0x10c
       [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
       [<8104f5d5>] console_unlock+0x1d7/0x398
       [<8104fb70>] vprintk_emit+0x3da/0x3e4
       [<81425f76>] printk+0x17/0x19
       [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
       [<8105c548>] clockevents_program_event+0xe7/0xf3
       [<8105cc1c>] tick_program_event+0x1e/0x23
       [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
       [<8103c49e>] __remove_hrtimer+0x5b/0x79
       [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
       [<8103cb4b>] hrtimer_cancel+0xd/0x18
       [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
       [<81080705>] task_clock_event_stop+0x20/0x64
       [<81080756>] task_clock_event_del+0xd/0xf
       [<81081350>] event_sched_out+0xab/0x11e
       [<810813e0>] group_sched_out+0x1d/0x66
       [<81081682>] ctx_sched_out+0xaf/0xbf
       [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
       [<8142cacc>] __schedule+0x4c6/0x4cb
       [<8142cae0>] schedule+0xf/0x11
       [<8142f9a6>] work_resched+0x5/0x30

other info that might help us debug this:

Chain exists of:
  &port_lock_key --> &ctx->lock --> hrtimer_bases.lock

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(hrtimer_bases.lock);
                               lock(&ctx->lock);
                               lock(hrtimer_bases.lock);
  lock(&port_lock_key);

 *** DEADLOCK ***

4 locks held by trinity-main/74:
 #0:  (&rq->lock){-.-.-.}, at: [<8142c6f3>] __schedule+0xed/0x4cb
 #1:  (&ctx->lock){......}, at: [<81081df3>] __perf_event_task_sched_out+0x1dc/0x34f
 #2:  (hrtimer_bases.lock){-.-...}, at: [<8103caeb>] hrtimer_try_to_cancel+0x13/0x66
 #3:  (console_lock){+.+...}, at: [<8104fb5d>] vprintk_emit+0x3c7/0x3e4

stack backtrace:
CPU: 0 PID: 74 Comm: trinity-main Not tainted 3.15.0-rc8-06195-g939f04b #2
 00000000 81c3a310 8b995c14 81426f69 8b995c44 81425a99 8161f671 8161f570
 8161f538 8161f559 8161f538 8b995c78 8b142bb0 00000004 8b142fdc 8b142bb0
 8b995ca8 8104a62d 8b142fac 000016f2 81c3a310 00000001 00000001 00000003
Call Trace:
 [<81426f69>] dump_stack+0x16/0x18
 [<81425a99>] print_circular_bug+0x18f/0x19c
 [<8104a62d>] __lock_acquire+0x9ea/0xc6d
 [<8104a942>] lock_acquire+0x92/0x101
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8142f11d>] _raw_spin_lock_irqsave+0x2e/0x3e
 [<811c60be>] ? serial8250_console_write+0x8c/0x10c
 [<811c60be>] serial8250_console_write+0x8c/0x10c
 [<8104af87>] ? lock_release+0x191/0x223
 [<811c6032>] ? wait_for_xmitr+0x76/0x76
 [<8104e402>] call_console_drivers.constprop.31+0x87/0x118
 [<8104f5d5>] console_unlock+0x1d7/0x398
 [<8104fb70>] vprintk_emit+0x3da/0x3e4
 [<81425f76>] printk+0x17/0x19
 [<8105bfa0>] clockevents_program_min_delta+0x104/0x116
 [<8105cc1c>] tick_program_event+0x1e/0x23
 [<8103c43c>] hrtimer_force_reprogram+0x88/0x8f
 [<8103c49e>] __remove_hrtimer+0x5b/0x79
 [<8103cb21>] hrtimer_try_to_cancel+0x49/0x66
 [<8103cb4b>] hrtimer_cancel+0xd/0x18
 [<8107f102>] perf_swevent_cancel_hrtimer.part.60+0x2b/0x30
 [<81080705>] task_clock_event_stop+0x20/0x64
 [<81080756>] task_clock_event_del+0xd/0xf
 [<81081350>] event_sched_out+0xab/0x11e
 [<810813e0>] group_sched_out+0x1d/0x66
 [<81081682>] ctx_sched_out+0xaf/0xbf
 [<81081e04>] __perf_event_task_sched_out+0x1ed/0x34f
 [<8104416d>] ? __dequeue_entity+0x23/0x27
 [<81044505>] ? pick_next_task_fair+0xb1/0x120
 [<8142cacc>] __schedule+0x4c6/0x4cb
 [<81047574>] ? trace_hardirqs_off_caller+0xd7/0x108
 [<810475b0>] ? trace_hardirqs_off+0xb/0xd
 [<81056346>] ? rcu_irq_exit+0x64/0x77

Fix the problem by using printk_deferred() which does not call into the
scheduler.

Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-08-01 12:54:41 +02:00
David Rientjes
3a1122d26c kexec: fix build error when hugetlbfs is disabled
free_huge_page() is undefined without CONFIG_HUGETLBFS and there's no
need to filter PageHuge() page is such a configuration either, so avoid
exporting the symbol to fix a build error:

   In file included from kernel/kexec.c:14:0:
   kernel/kexec.c: In function 'crash_save_vmcoreinfo_init':
   kernel/kexec.c:1623:20: error: 'free_huge_page' undeclared (first use in this function)
     VMCOREINFO_SYMBOL(free_huge_page);
                       ^

Introduced by commit 8f1d26d0e59b ("kexec: export free_huge_page to
VMCOREINFO")

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Acked-by: Olof Johansson <olof@lixom.net>
Cc: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-30 20:09:37 -07:00
Josh Triplett
e0198b290d Josh has moved
My IBM email addresses haven't worked for years; also map some
old-but-functional forwarding addresses to my canonical address.

Update my GPG key fingerprint; I moved to 4096R a long time ago.

Update description.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-30 17:16:13 -07:00
Atsushi Kumagai
8f1d26d0e5 kexec: export free_huge_page to VMCOREINFO
PG_head_mask was added into VMCOREINFO to filter huge pages in b3acc56bfe1
("kexec: save PG_head_mask in VMCOREINFO"), but makedumpfile still need
another symbol to filter *hugetlbfs* pages.

If a user hope to filter user pages, makedumpfile tries to exclude them by
checking the condition whether the page is anonymous, but hugetlbfs pages
aren't anonymous while they also be user pages.

We know it's possible to detect them in the same way as PageHuge(),
so we need the start address of free_huge_page():

    int PageHuge(struct page *page)
    {
            if (!PageCompound(page))
                    return 0;

            page = compound_head(page);
            return get_compound_page_dtor(page) == free_huge_page;
    }

For that reason, this patch changes free_huge_page() into public
to export it to VMCOREINFO.

Signed-off-by: Atsushi Kumagai <kumagai-atsushi@mxc.nes.nec.co.jp>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-30 17:16:13 -07:00
Linus Torvalds
9dae0a3fc4 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A bunch of fixes for perf and kprobes:
   - revert a commit that caused a perf group regression
   - silence dmesg spam
   - fix kprobe probing errors on ia64 and ppc64
   - filter kprobe faults from userspace
   - lockdep fix for perf exit path
   - prevent perf #GP in KVM guest
   - correct perf event and filters"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kprobes: Fix "Failed to find blacklist" probing errors on ia64 and ppc64
  kprobes/x86: Don't try to resolve kprobe faults from userspace
  perf/x86/intel: Avoid spamming kernel log for BTS buffer failure
  perf/x86/intel: Protect LBR and extra_regs against KVM lying
  perf: Fix lockdep warning on process exit
  perf/x86/intel/uncore: Fix SNB-EP/IVT Cbox filter mappings
  perf/x86/intel: Use proper dTLB-load-misses event on IvyBridge
  perf: Revert ("perf: Always destroy groups on exit")
2014-07-27 09:57:16 -07:00
Stephen Boyd
f723aa1817 sched_clock: Avoid corrupting hrtimer tree during suspend
During suspend we call sched_clock_poll() to update the epoch and
accumulated time and reprogram the sched_clock_timer to fire
before the next wrap-around time. Unfortunately,
sched_clock_poll() doesn't restart the timer, instead it relies
on the hrtimer layer to do that and during suspend we aren't
calling that function from the hrtimer layer. Instead, we're
reprogramming the expires time while the hrtimer is enqueued,
which can cause the hrtimer tree to be corrupted. Furthermore, we
restart the timer during suspend but we update the epoch during
resume which seems counter-intuitive.

Let's fix this by saving the accumulated state and canceling the
timer during suspend. On resume we can update the epoch and
restart the timer similar to what we would do if we were starting
the clock for the first time.

Fixes: a08ca5d1089d "sched_clock: Use an hrtimer instead of timer"
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1406174630-23458-1-git-send-email-john.stultz@linaro.org
Cc: Ingo Molnar <mingo@kernel.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-07-24 12:02:49 +02:00
Thomas Gleixner
ccbf62d8a2 sched: Make task->start_time nanoseconds based
Simplify the timespec to nsec/usec conversions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:05 -07:00
Thomas Gleixner
57e0be041d sched: Make task->real_start_time nanoseconds based
Simplify the only user of this data by removing the timespec
conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:05 -07:00
Thomas Gleixner
d560fed6ab time: Export nsecs_to_jiffies()
Required for moving drivers to the nanosecond based interfaces.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:04 -07:00
Thomas Gleixner
dcaab54e34 timekeeping: Remove ktime_get_monotonic_offset()
No more users.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:03 -07:00
Thomas Gleixner
9a6b51976e timekeeping: Provide ktime_mono_to_any()
ktime based conversion function to map a monotonic time stamp to a
different CLOCK.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:01 -07:00
Thomas Gleixner
48064f5f67 timekeeping; Use ktime based data for ktime_get_update_offsets_tick()
No need to juggle with timespecs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:01 -07:00
Thomas Gleixner
a37c0aad60 timekeeping: Use ktime_t data for ktime_get_update_offsets_now()
No need to juggle with timespecs.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:00 -07:00
Thomas Gleixner
afab07c0e9 timekeeping: Use ktime_t based data for ktime_get_clocktai()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:18:00 -07:00
Thomas Gleixner
b82c817e2d timekeeping; Use ktime_t based data for ktime_get_boottime()
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:59 -07:00
Thomas Gleixner
f5264d5d5a timekeeping: Use ktime_t based data for ktime_get_real()
Speed up the readout.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:59 -07:00
Thomas Gleixner
0077dc60f2 timekeeping: Provide ktime_get_with_offset()
Provide a helper function which lets us implement ktime_t based
interfaces for real, boot and tai clocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:58 -07:00
Thomas Gleixner
a016a5bd62 timekeeping: Use ktime_t based data for ktime_get()
Speed up ktime_get() by using ktime_t based data. Text size shrinks by
64 bytes on x8664.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:58 -07:00
Thomas Gleixner
7c032df557 timekeeping: Provide internal ktime_t based data
The ktime_t based interfaces are used a lot in performance critical
code pathes. Add ktime_t based data so the interfaces don't have to
convert from the xtime/timespec based data.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:57 -07:00
Thomas Gleixner
f111adfdd7 timekeeping: Use timekeeping_update() instead of memcpy()
We already have a function which does the right thing, that also makes
sure that the coming ktime_t based cached values are getting updated.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:57 -07:00
Thomas Gleixner
3fdb14fd1d timekeeping: Cache optimize struct timekeeper
struct timekeeper is quite badly sorted for the hot readout path. Most
time access functions need to load two cache lines.

Rearrange it so ktime_get() and getnstimeofday() are happy with a
single cache line.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:56 -07:00
Thomas Gleixner
c905fae43f timekeeper: Move tk_xtime to core code
No users outside of the core.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:55 -07:00
Thomas Gleixner
d6d29896c6 timekeeping: Provide timespec64 based interfaces
To convert callers of the core code to timespec64 we need to provide
the proper interfaces.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:55 -07:00
Thomas Gleixner
8b094cd03b time: Consolidate the time accessor prototypes
Right now we have time related prototypes in 3 different header
files. Move it to a single timekeeping header file and move the core
internal stuff into a core private header.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:54 -07:00
John Stultz
7d489d15ce timekeeping: Convert timekeeping core to use timespec64s
Convert the core timekeeping logic to use timespec64s. This moves the
2038 issues out of the core logic and into all of the accessor
functions.

Future changes will need to push the timespec64s out to all
timekeeping users, but that can be done interface by interface.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:54 -07:00
John Stultz
49cd6f8699 time: More core infrastructure for timespec64
Helper and conversion functions for timespec64.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:17:53 -07:00
Thomas Gleixner
166afb6451 ktime: Sanitize ktime_to_us/ms conversion
With the plain nanoseconds based ktime_t we can simply use
ktime_divns() instead of going through loops and hoops of
timespec/timeval conversion.

Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:16:50 -07:00
John Stultz
24e4a8c3e8 ktime: Kill non-scalar ktime_t implementation for 2038
The non-scalar ktime_t implementation is basically a timespec
which has to be changed to support dates past 2038 on 32bit
systems.

This patch removes the non-scalar ktime_t implementation, forcing
the scalar s64 nanosecond version on all architectures.

This may have additional performance overhead on some 32bit
systems when converting between ktime_t and timespec structures,
however the majority of 32bit systems (arm and i386) were already
using scalar ktime_t, so no performance regressions will be seen
on those platforms.

On affected platforms, I'm open to finding optimizations, including
avoiding converting to timespecs where possible.

[ tglx: We can now cleanup the ktime_t.tv64 mess, but thats a
  different issue and we can throw a coccinelle script at it ]

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:16:50 -07:00
John Stultz
76f4108892 hrtimer: Cleanup hrtimer accessors to the timekepeing state
Rather then having two similar but totally different implementations
that provide timekeeping state to the hrtimer code, try to unify the
two implementations to be more simliar.

Thus this clarifies ktime_get_update_offsets to
ktime_get_update_offsets_now and changes get_xtime...  to
ktime_get_update_offsets_tick.

Signed-off-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:16:50 -07:00
Thomas Gleixner
e06fde37b8 timekeeping: Simplify arch_gettimeoffset()
Provide a default stub function instead of having the extra
conditional. Cuts binary size on a m68k build by ~100 bytes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:16:50 -07:00
David Riley
e704f93af5 kernel: time: Add udelay_test module to validate udelay
Create a module that allows udelay() to be executed to ensure that
it is delaying at least as long as requested (with a little bit of
error allowed).

There are some configurations which don't have reliably udelay
due to using a loop delay with cpufreq changes which should use
a counter time based delay instead.  This test aims to identify
those configurations where timing is unreliable.

Signed-off-by: David Riley <davidriley@chromium.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
2014-07-23 10:16:35 -07:00
Tony Luck
58d4e21e50 tracing: Fix wraparound problems in "uptime" trace clock
The "uptime" trace clock added in:

    commit 8aacf017b065a805d27467843490c976835eb4a5
    tracing: Add "uptime" trace clock that uses jiffies

has wraparound problems when the system has been up more
than 1 hour 11 minutes and 34 seconds. It converts jiffies
to nanoseconds using:
        (u64)jiffies_to_usecs(jiffy) * 1000ULL
but since jiffies_to_usecs() only returns a 32-bit value, it
truncates at 2^32 microseconds.  An additional problem on 32-bit
systems is that the argument is "unsigned long", so fixing the
return value only helps until 2^32 jiffies (49.7 days on a HZ=1000
system).

Avoid these problems by using jiffies_64 as our basis, and
not converting to nanoseconds (we do convert to clock_t because
user facing API must not be dependent on internal kernel
HZ values).

Link: http://lkml.kernel.org/p/99d63c5bfe9b320a3b428d773825a37095bf6a51.1405708254.git.tony.luck@intel.com

Cc: stable@vger.kernel.org # 3.10+
Fixes: 8aacf017b065 "tracing: Add "uptime" trace clock that uses jiffies"
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-07-21 09:56:12 -04:00
Linus Torvalds
d057190925 Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
 "The locking department delivers:

   - A rather large and intrusive bundle of fixes to address serious
     performance regressions introduced by the new rwsem / mcs
     technology.  Simpler solutions have been discussed, but they would
     have been ugly bandaids with more risk than doing the right thing.

   - Make the rwsem spin on owner technology opt-in for architectures
     and enable it only on the known to work ones.

   - A few fixes to the lockdep userspace library"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/rwsem: Add CONFIG_RWSEM_SPIN_ON_OWNER
  locking/mutex: Disable optimistic spinning on some architectures
  locking/rwsem: Reduce the size of struct rw_semaphore
  locking/rwsem: Rename 'activity' to 'count'
  locking/spinlocks/mcs: Micro-optimize osq_unlock()
  locking/spinlocks/mcs: Introduce and use init macro and function for osq locks
  locking/spinlocks/mcs: Convert osq lock to atomic_t to reduce overhead
  locking/spinlocks/mcs: Rename optimistic_spin_queue() to optimistic_spin_node()
  locking/rwsem: Allow conservative optimistic spinning when readers have lock
  tools/liblockdep: Account for bitfield changes in lockdeps lock_acquire
  tools/liblockdep: Remove debug print left over from development
  tools/liblockdep: Fix comparison of a boolean value with a value of 2
2014-07-19 06:27:55 -10:00