The current virtual timer interface is inherently per-cpu and hard to
use. The sole user of the interface is appldata which uses it to execute
a function after a specific amount of cputime has been used over all cpus.
Rework the virtual timer interface to hook into the cputime accounting.
This makes the interface independent from the CPU timer interrupts, and
makes the virtual timers global as opposed to per-cpu.
Overall the code is greatly simplified. The downside is that the accuracy
is not as good as the original implementation, but it is still good enough
for appldata.
Reviewed-by: Jan Glauber <jang@linux.vnet.ibm.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove the file name from the comment at top of many files. In most
cases the file name was wrong anyway, so it's rather pointless.
Also unify the IBM copyright statement. We did have a lot of sightly
different statements and wanted to change them one after another
whenever a file gets touched. However that never happened. Instead
people start to take the old/"wrong" statements to use as a template
for new files.
So unify all of them in one go.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Setting the cpu restart parameters is done in three different fashions:
- directly setting the four parameters individually
- copying the four parameters with memcpy (using 4 * sizeof(long))
- copying the four parameters using a private structure
In addition code in entry*.S relies on a certain order of the restart
members of struct _lowcore.
Make all of this more robust to future changes by adding a
mem_absolute_assign(dest, val) define, which assigns val to dest
using absolute addressing mode. Also the load multiple instructions
in entry*.S have been split into separate load instruction so the
order of the struct _lowcore members doesn't matter anymore.
In addition move the prototypes of memcpy_real/absolute from uaccess.h
to processor.h. These memcpy* variants are not related to uaccess at all.
string.h doesn't seem to match as well, so lets use processor.h.
Also replace the eight byte array in struct _lowcore which represents a
misaliged u64 with a u64. The compiler will always create code that
handles the misaligned u64 correctly.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use sigp order code defines in assembly code as well.
With this change all places that use sigp constants should
have been converted to use self describing defines instead
of directly using constants.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
There is a small race window in the __switch_to code in regard to
the transfer of the TIF_MCCK_PENDING bit from the previous to the
next task. The bit is transferred before the task struct pointer
and the thread-info pointer for the next task has been stored to
lowcore. If a machine check sets the TIF_MCCK_PENDING bit between
the transfer code and the store of current/thread_info the bit
is still set for the previous task. And if the previous task has
terminated it can get lost. The effect is that a pending CRW is
not retrieved until the next machine checks sets TIF_MCCK_PENDING.
To fix this reorder __switch_to to first store the task struct
and thread-info pointer and then do the transfer of the bit.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Whenever the cpu loads an enabled wait PSW it will appear as idle to the
underlying host system. The code in default_idle calls vtime_stop_cpu
which does the necessary voodoo to get the cpu time accounting right.
The udelay code just loads an enabled wait PSW. To correct this rework
the vtime_stop_cpu/vtime_start_cpu logic and move the difficult parts
to entry[64].S, vtime_stop_cpu can now be called from anywhere and
vtime_start_cpu is gone. The correction of the cpu time during wakeup
from an enabled wait PSW is done with a critical section in entry[64].S.
As vtime_start_cpu is gone, s390_idle_check can be removed as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Define struct pcpu and merge some of the NR_CPUS arrays into it, including
__cpu_logical_map, current_set and smp_cpu_state. Split smp related
functions to those operating on physical cpus and the functions operating
on a logical cpu number. Make the functions for physical cpus use a
pointer to a struct pcpu. This hides the knowledge about cpu addresses in
smp.c, entry[64].S and swsusp_asm64.S, thus remove the sigp.h header.
The PSW restart mechanism is used to start secondary cpus, calling a
function on an online cpu, calling a function on the ipl cpu, and for
the nmi signal. Replace the different assembler functions with a
single function restart_int_handler. The new entry point calls a function
whose pointer is stored in the lowcore of the target cpu and it can wait
for the source cpu to stop. This covers all existing use cases.
Overall the code is now simpler and there are ~380 lines less code.
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The 16 bit value at the lowcore location with offset 0x84 is the
cpu address that is associated with an external interrupt. Rename
the field from cpu_addr to ext_cpu_addr to make that clear.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Move the program interruption code and the translation exception identifier
to the pt_regs structure as 'int_code' and 'int_parm_long' and make the
first level interrupt handler in entry[64].S store the two values. That
makes it possible to drop 'prot_addr' and 'trap_no' from the thread_struct
and to reduce the number of arguments to a lot of functions. Finally
un-inline do_trap. Overall this saves 5812 bytes in the .text section of
the 64 bit kernel.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Another round of cleanup for entry[64].S, in particular the program check
handler looks more reasonable now. The code size for the 31 bit kernel
has been reduced by 616 byte and by 528 byte for the 64 bit version.
Even better the code is a bit faster as well.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add an explicit TIF_SYSCALL bit that indicates if a task is inside
a system call. The svc_code in the pt_regs structure is now only
valid if TIF_SYSCALL is set. With this definition TIF_RESTART_SVC
can be replaced with TIF_SYSCALL. Overall do_signal is a bit more
readable and it saves a few lines of code.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
For a ERESTARTNOHAND/ERESTARTSYS/ERESTARTNOINTR restarting system call
do_signal will prepare the restart of the system call with a rewind of
the PSW before calling get_signal_to_deliver (where the debugger might
take control). For A ERESTART_RESTARTBLOCK restarting system call
do_signal will set -EINTR as return code.
There are two issues with this approach:
1) strace never sees ERESTARTNOHAND, ERESTARTSYS, ERESTARTNOINTR or
ERESTART_RESTARTBLOCK as the rewinding already took place or the
return code has been changed to -EINTR
2) if get_signal_to_deliver does not return with a signal to deliver
the restart via the repeat of the svc instruction is left in place.
This opens a race if another signal is made pending before the
system call instruction can be reexecuted. The original system call
will be restarted even if the second signal would have ended the
system call with -EINTR.
These two issues can be solved by dropping the early rewind of the
system call before get_signal_to_deliver has been called and by using
the TIF_RESTART_SVC magic to do the restart if no signal has to be
delivered. The only situation where the system call restart via the
repeat of the svc instruction is appropriate is when a SA_RESTART
signal is delivered to user space.
Unfortunately this breaks inferior calls by the debugger again. The
system call number and the length of the system call instruction is
lost over the inferior call and user space will see ERESTARTNOHAND/
ERESTARTSYS/ERESTARTNOINTR/ERESTART_RESTARTBLOCK. To correct this a
new ptrace interface is added to save/restore the system call number
and system call instruction length.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Remove the save_area_64 field from the 0xe00 - 0xf00 area in the lowcore.
Use a free slot in the save_area array instead.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With this patch a new S390 shutdown trigger "restart" is added. If under
z/VM "systerm restart" is entered or under the HMC the "PSW restart" button
is pressed, the PSW located at 0 (31 bit) or 0x1a0 (64 bit) bit is loaded.
Now we execute do_restart() that processes the restart action that is
defined under /sys/firmware/shutdown_actions/on_restart. Currently the
following actions are possible: reipl (default), stop, vmcmd, dump, and
dump_reipl.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
The alignment is missing for various global symbols in s390 assembly code.
With a recent gcc and an instruction like stgrl this can lead to a
specification exception if the instruction uses such a mis-aligned address.
Specify the alignment explicitely and while add it define __ALIGN for s390
and use the ENTRY define to save some lines of code.
Signed-off-by: Jan Glauber <jang@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On cpu hot remove a PFAULT CANCEL command is sent to the hypervisor
which in turn will cancel all outstanding pfault requests that have
been issued on that cpu (the same happens with a SIGP cpu reset).
The result is that we end up with uninterruptible processes where
the interrupt that would wake up these processes never arrives.
In order to solve this all processes which wait for a pfault
completion interrupt get woken up after a cpu hot remove. The worst
case that could happen is that they fault again and in turn need to
wait again.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
When starting a new CPU we currently jump to start_secondary() without
setting register 14 (the return address) correctly. Therefore on the stack
frame for start_secondary an invalid return address is stored. This leads
to wrong stack back traces in kernel dumps.
Example:
#00 [1f33fe48] cpu_idle at 10614a
#01 [1f33fe90] start_secondary at 54fa88
#02 [1f33feb8] (null) at 0 <--- invalid
To fix this start_secondary() is called now with basr/brasl that sets
register 14 correctly. The output of the stack backtrace looks then
like the following:
#00 [1f33fe48] cpu_idle at 10614a
#01 [1f33fe90] start_secondary at 54fa88
#02 [1f33feb8] restart_base at 54f41e <--- correct
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Make the code in the 31 bit entry.S code as similar as possible to the
64 bit version in entry64.S. That makes it easier to add new code to
the first level interrupt handler that affects both 31 and 64 bit kernels.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Add kprobes annotations to get the massive 'probe kernel.function("*") {}'
stress test working.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Fix kprobes after git commit 1e54622e04
broke it. The kprobe_handler is now called with interrupts in the state
at the time of the breakpoint. The single step of the replaced instruction
is done with interrupts off which makes it necessary to enable and disable
the interupts in the kprobes code.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Do the setup of the stack overflow argument for the sixth system
call parameter right before the branch to the system call function.
That simplifies the system call parameter access code.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Read external interrupts parameters from the lowcore in the first
level interrupt handler in entry[64].S.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Read all required fields for program checks from the lowcore in the
first level interrupt handler in entry[64].S. If the context that
caused the fault was enabled for interrupts we can now re-enable the
irqs in entry[64].S.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
In case user space is single stepped (PER) the program check handler
claims too early that IRQs are enabled on the return path.
Subsequent checks will notice that the IRQ mask in the PSW and
what lockdep thinks the IRQ mask should be do not correlate and
therefore will print a warning to the console and disable lockdep.
Fix this by doing all the work within the correct context.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Copy the last breaking event address from the lowcore to a new
field in the thread_struct on each system entry. Add a new
ptrace request PTRACE_GET_LAST_BREAK and a new utrace regset
REGSET_LAST_BREAK to query the last breaking event.
This is useful for debugging wild branches in user space code.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
A machine check can interrupt the i/o and external interrupt handler
anytime. If the machine check occurs while the interrupt handler is
waking up from idle vtime_start_cpu can get executed a second time
and the int_clock / async_enter_timer values in the lowcore get
clobbered. This can confuse the cpu time accounting.
To fix this problem two changes are needed. First the machine check
handler has to use its own copies of int_clock and async_enter_timer,
named mcck_clock and mcck_enter_timer. Second the nested execution
of vtime_start_cpu has to be prevented. This is done in s390_idle_check
by checking the wait bit in the program status word.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The system call path in entry[64].S is run with interrupts enabled.
Remove the irq tracing check from the system call exit code. If a
program check interrupted a context enabled for interrupts do a
call to trace_irq_off_caller in the program check handler before
branching to the system call exit code.
Restructure the system call and io interrupt return code to avoid
avoid the lpsw[e] to disable machine checks.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cleanup the #ifdef mess at io_work in entry[64].S and streamline the
TIF work code of the system call and io exit path.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If a machine check interrupts the io interrupt handler on one of the
instructions between io_return and io_leave the critical section
cleanup code will move the return psw to io_work_loop. By doing that
the switch from the asynchronous interrupt stack to the process stack
is skipped. If e.g. TIF_NEED_RESCHED is set things break because
the scheduler is called with the asynchronous interrupts stack.
Moving the psw back to io_return instead fixes the problem.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Use asm offsets to make sure the offset defines to struct _lowcore and
its layout don't get out of sync.
Also add a BUILD_BUG_ON() which checks that the size of the structure
is sane.
And while being at it change those sites which use odd casts to access
the current lowcore. These should use S390_lowcore instead.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
If irq flags tracing is enabled the TRACE_IRQS_ON macros expands to
a function call which clobbers registers %r0-%r5. The macro is used
in the code path for single stepped system calls. The argument
registers %r2-%r6 need to be restored from the stack before the system
call function is called.
Cc: stable@kernel.org
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On s390 there are two ways of specifying the system call number for
the svc instruction. The standard way is to use the immediate field
in the instruction (or to use EXecute for values unknown during
assemble time). This can encode 256 system calls.
The kernel ABI also allows to put the system call number in r1 and
then execute svc 0 to enable system call numbers > 255.
It turns out that single stepping svc 0 is broken, since the PER
program check handler uses r1. We have to use a different register.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* 'tracing-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (105 commits)
ring-buffer: only enable ring_buffer_swap_cpu when needed
ring-buffer: check for swapped buffers in start of committing
tracing: report error in trace if we fail to swap latency buffer
tracing: add trace_array_printk for internal tracers to use
tracing: pass around ring buffer instead of tracer
tracing: make tracing_reset safe for external use
tracing: use timestamp to determine start of latency traces
tracing: Remove mentioning of legacy latency_trace file from documentation
tracing/filters: Defer pred allocation, fix memory leak
tracing: remove users of tracing_reset
tracing: disable buffers and synchronize_sched before resetting
tracing: disable update max tracer while reading trace
tracing: print out start and stop in latency traces
ring-buffer: disable all cpu buffers when one finds a problem
ring-buffer: do not count discarded events
ring-buffer: remove ring_buffer_event_discard
ring-buffer: fix ring_buffer_read crossing pages
ring-buffer: remove unnecessary cpu_relax
ring-buffer: do not swap buffers during a commit
ring-buffer: do not reset while in a commit
...
The sysc_restore_trace_psw and io_restore_trace_psw storage locations
are created in the .text section. When creating and IPLing from a named
saved system (NSS), writing to these locations causes a protection exception
(because the .text section is mapped as shared read-only in the NSS).
To permit write access, move the storage locations into the .data section.
The problem occurs only when CONFIG_TRACE_IRQFLAGS is set.
The git commmit that has introduced these variables is:
411788ea7f
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
s/HAVE_FTRACE_SYSCALLS/HAVE_SYSCALL_TRACEPOINTS/g
s/TIF_SYSCALL_FTRACE/TIF_SYSCALL_TRACEPOINT/g
The syscall enter/exit tracing is no longer specific to just ftrace, so
they now have names that reflect their tie to tracepoints instead.
Signed-off-by: Josh Stone <jistone@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Jiaying Zhang <jiayingz@google.com>
Cc: Martin Bligh <mbligh@google.com>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
LKML-Reference: <1251150194-1713-2-git-send-email-jistone@redhat.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
System call tracer support for s390.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Enable secure computing on s390 as well.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Reset the cpu timer to the maximum value and correctly initialize the
cpu accounting values in the lowcore when the cpu is started.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Distinguish the cputime of the idle process where idle is actually using
cpu cycles from the cputime where idle is sleeping on an enabled wait psw.
The former is accounted as system time, the later as idle time.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On s390 we always want to run with precise cputime accounting.
Remove the config options VIRT_TIMER and VIRT_CPU_ACCOUNTING.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On s390 we have ret_from_fork jump not to the "do all work we
normally do on return from syscall" as on x86, ppc, etc., but to the
"do all such work except audit". Historical reasons - the codepath
triggered when we have AUDIT process flag set is separated from the
normall one and they converge at sysc_return, which is the common
part of post-syscall work. And does not include calling audit_syscall_exit() -
that's done in the end of sysc_tracesys path, just before that path jumps
to sysc_return.
IOW, the child returning from fork()/clone()/vfork() doesn't
call audit_syscall_exit() at all, so no matter what we do with its
audit context, we are not going to see the audit entry.
The fix is simple: have ret_from_fork go to the point just past
the call of sys_.... in the 'we have AUDIT flag set' path. There we
have (64bit variant; for 31bit the situation is the same):
sysc_tracenogo:
tm __TI_flags+7(%r9),(_TIF_SYSCALL_TRACE|_TIF_SYSCALL_AUDIT)
jz sysc_return
la %r2,SP_PTREGS(%r15) # load pt_regs
larl %r14,sysc_return # return point is sysc_return
jg do_syscall_trace_exit
which is precisely what we need - check the flag, bugger off to sysc_return
if not set, otherwise call do_syscall_trace_exit() and bugger off to
sysc_return. r9 has just been properly set by ret_from_fork itself,
so we are fine.
Tested on s390x, seems to work fine. WARNING: it's been about
16 years since my last contact with 3X0 assembler[1], so additional
review would be very welcome. I don't think I've managed to screw it
up, but...
[1] that *was* in another country and besides, the box is dead...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
syscall_get_nr() currently returns a valid result only if the call
chain of the traced process includes do_syscall_trace_enter(). But
collect_syscall() can be called for any sleeping task, the result of
syscall_get_nr() in general is completely bogus.
To make syscall_get_nr() work for any sleeping task the traps field
in pt_regs is replace with svcnr - the system call number the process
is executing. If svcnr == 0 the process is not on a system call path.
The syscall_get_arguments and syscall_set_arguments use regs->gprs[2]
for the first system call parameter. This is incorrect since gprs[2]
may have been overwritten with the system call number if the call
chain includes do_syscall_trace_enter. Use regs->orig_gprs2 instead.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
With CONFIG_IRQSOFF_TRACER the trace_hardirqs_off() function includes
a call to __builtin_return_address(1). But we calltrace_hardirqs_off()
from early entry code. There we have just a single stack frame.
So this results in a kernel stack backchain walk that would walk beyond
the kernel stack. Following the NULL terminated backchain this results
in a lowcore read access.
To fix this we simply call trace_hardirqs_off_caller() and pass the
current instruction pointer.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
arch/s390/kernel/built-in.o: In function `cleanup_io_leave_insn':
mem_detect.c:(.text+0x10592): undefined reference to `lockdep_sys_exit'
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
* System call parameter and result access functions
* Add tracehook calls
* Split syscall_trace into two functions do_syscall_trace_enter and
do_syscall_trace_exit
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
On return from syscall or interrupt, we have to check if we return to
userspace (likely) and if there is work todo (less likely) to decide
if we handle the work. We can optimize this check: we first check for
the less likely work case and then check for userspace.
This patch is also a preparation for an additional patch, that fixes a bug
in KVM dealing with cpu bound guests.
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>