ftrace: Consider shared max priority in latency histograms

The algorithm used so far to trace the process with the highest priority requires that no other processes with the same priority are being woken up simultaneously. Otherwise, a process with a lower priority may be picked up for tracing, which leads to an erroneously high latency value.

Generally, the wakeup latency of a process that exclusively uses the highest priority of the system is due to software or hardware issues we would like to solve or, at least, keep as small as possible. This is what latency measurements are made for, after all.

The wakeup latency of a process that shares the highest priority of the system with other processes is quite another story. It may contain the worst-case runtime durations of the other processes; thus, it is the result of the priority design of a given system and nothing a kernel developer or hardware engineer may want to fix.

This said, we need to separately record latencies
i) of processes that exclusively use the highest priority of the system and
ii) of processes that share the highest priority of the system with other processes.

The above-mentioned shortcoming of the tracing algorithm also applies to the variable tracing_max_latency that the wakeup latency tracer uses, since it is based on the same procedure as the original version of the latency histogram. In consequence, if several processes share the highest priority of the system, the variable tracing_max_latency may contain erroneously high values.

We could now patch the wakeup latency tracer as well and separately record the various latencies, but we had better document this behavior and recommend the latency histograms to reliably determine a system's worst-case wakeup latency.

Simplified and cleaned up a bit. Added some more help info to Kconfig.

Signed-off-by: Carsten Emde <C.Emde@osadl.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Commit 16731e6f authored Oct 26, 2009 by Carsten Emde Committed by Thomas Gleixner Nov 06, 2009
5 changed files
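The distinction drawn in the commit message, between a process that holds the highest priority alone and one that shares it, can be illustrated with a small user-space sketch. This is not the kernel implementation in kernel/trace/latency_hist.c; the function names, the tuple layout, and the convention that a larger number means a higher priority are assumptions made only for illustration.

```python
from collections import Counter

# Histogram keys mirror the two debugfs subdirectories named in the patch.
EXCLUSIVE = "wakeup"
SHARED = "wakeup/sharedprio"


def classify_wakeup(pending):
    """Decide which histogram the top-priority wakeups belong to.

    `pending` is a hypothetical list of (pid, prio, latency) tuples for
    tasks that were woken up on one CPU. A larger `prio` is assumed to
    mean a higher priority.
    """
    if not pending:
        return None
    top = max(prio for _pid, prio, _lat in pending)
    top_tasks = [task for task in pending if task[1] == top]
    kind = EXCLUSIVE if len(top_tasks) == 1 else SHARED
    return kind, top_tasks


def record(histograms, pending):
    """Count the latency of each top-priority task in the matching histogram."""
    result = classify_wakeup(pending)
    if result is None:
        return
    kind, tasks = result
    for _pid, _prio, latency in tasks:
        histograms[kind][latency] += 1


def new_histograms():
    """Create one empty latency histogram per category."""
    return {EXCLUSIVE: Counter(), SHARED: Counter()}
```

With this separation, a latency that includes the runtime of an equal-priority competitor ends up only in the shared histogram and does not distort the exclusive one.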
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -111,9 +111,14 @@ of ftrace. Here is a list of some of the key files:
 	For example, the time interrupts are disabled.
 	This time is saved in this file. The max trace
 	will also be stored, and displayed by "trace".
-	A new max trace will only be recorded if the
+	A new max trace will only be recorded, if the
 	latency is greater than the value in this
-	file. (in microseconds)
+	file (in microseconds). Note that the max latency
+	recorded by the wakeup and the wakeup_rt tracer
+	does not necessarily reflect the worst-case latency
+	of the system, but may be erroneously high in
+	case two or more processes share the maximum
+	priority of the system.

  buffer_size_kb:


--- a/Documentation/trace/histograms.txt
+++ b/Documentation/trace/histograms.txt
@@ -24,7 +24,7 @@ histograms of potential sources of latency, the kernel stores the time
 stamp at the start of a critical section, determines the time elapsed
 when the end of the section is reached, and increments the frequency
 counter of that latency value - irrespective of whether any concurrently
-running process is affected by latency or not.
+running process is affected by this latency or not.
 - Configuration items (in the Kernel hacking/Tracers submenu)
  CONFIG_INTERRUPT_OFF_LATENCY
  CONFIG_PREEMPT_OFF_LATENCY
@@ -71,18 +71,20 @@ histogram data - one per CPU - are available in the files
 /sys/kernel/debug/tracing/latency_hist/irqsoff/CPUx
 /sys/kernel/debug/tracing/latency_hist/preemptirqsoff/CPUx
 /sys/kernel/debug/tracing/latency_hist/wakeup/CPUx.
+/sys/kernel/debug/tracing/latency_hist/wakeup/sharedprio/CPUx.

 The histograms are reset by writing non-zero to the file "reset" in a
 particular latency directory. To reset all latency data, use

-#!/bin/sh
+#!/bin/bash

-HISTDIR=/sys/kernel/debug/tracing/latency_hist
+TRACINGDIR=/sys/kernel/debug/tracing
+HISTDIR=$TRACINGDIR/latency_hist

 if test -d $HISTDIR
 then
  cd $HISTDIR
-  for i in */reset
+  for i in `find . | grep /reset$`
  do
    echo 1 >$i
  done
@@ -133,6 +135,18 @@ grep -v " 0$" /sys/kernel/debug/tracing/latency_hist/preemptoff/CPU0
   25	               1


+* Two types of wakeup latency histograms
+
+Two different algorithms are used to determine the wakeup latency of a
+process. One of them only considers processes that exclusively use the
+highest priority of the system, the other one records the wakeup latency
+of a process even if it shares the highest system priority with other
+processes. The former is used to improve hardware and system software;
+the related histograms are located in the wakeup subdirectory. The
+latter is used to optimize the priority design of a given system; the
+related histograms are located in the wakeup/sharedprio subdirectory.
+
+
 * Wakeup latency of a selected process

 To only collect wakeup latency data of a particular process, write the
@@ -146,11 +160,17 @@ PIDs are not considered, if this variable is set to 0.
 * Details of the process with the highest wakeup latency so far

 Selected data of the process that suffered from the highest wakeup
-latency that occurred in a particular CPU are available in the file
+latency that occurred in a particular CPU are available in the files
+
+/sys/kernel/debug/tracing/latency_hist/wakeup/max_latency-CPUx
+
+and
+
+/sys/kernel/debug/tracing/latency_hist/wakeup/sharedprio/max_latency-CPUx,

-/sys/kernel/debug/tracing/latency_hist/wakeup/max_latency-CPUx.
+respectively.

 The format of the data is
 <PID> <Priority> <Latency> <Command>

-These data are also reset when the wakeup histogram ist reset.
+These data are also reset when the related wakeup histograms are reset.
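The record format `<PID> <Priority> <Latency> <Command>` documented above can be read from user space. The following is a minimal sketch: the helper names are hypothetical, the fields are assumed to be whitespace-separated with the command as the last field, and the sample values in the usage note are invented.

```python
from typing import NamedTuple

# Base directory taken from the documentation above.
WAKEUP_DIR = "/sys/kernel/debug/tracing/latency_hist/wakeup"


class MaxLatency(NamedTuple):
    pid: int
    priority: int
    latency: int
    command: str


def parse_max_latency(line):
    """Parse one '<PID> <Priority> <Latency> <Command>' record."""
    # Split at most three times so a command containing spaces stays whole.
    fields = line.strip().split(None, 3)
    if len(fields) != 4:
        raise ValueError("unexpected max_latency record: %r" % line)
    pid, priority, latency, command = fields
    return MaxLatency(int(pid), int(priority), int(latency), command)


def max_latency_paths(cpu, base=WAKEUP_DIR):
    """Return the exclusive and the shared-priority file for one CPU."""
    return [
        "%s/max_latency-CPU%d" % (base, cpu),
        "%s/sharedprio/max_latency-CPU%d" % (base, cpu),
    ]
```

For example, `parse_max_latency("1234 99 57 cyclictest")` yields a record with `pid=1234`, `priority=99`, `latency=57` and `command="cyclictest"`. Reading the real files requires a kernel built with CONFIG_WAKEUP_LATENCY_HIST and a mounted debugfs.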
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1548,6 +1548,9 @@ struct task_struct {
 	unsigned long trace;
 	/* bitmask of trace recursion */
 	unsigned long trace_recursion;
+#ifdef CONFIG_WAKEUP_LATENCY_HIST
+	u64 preempt_timestamp_hist;
+#endif
 #endif /* CONFIG_TRACING */
 #ifdef CONFIG_PREEMPT_RT
 	/*

--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -143,7 +143,6 @@ config FUNCTION_GRAPH_TRACER
 	  the return value. This is done by setting the current return 
 	  address on the current task structure into a stack of calls.

-
 config IRQSOFF_TRACER
 	bool "Interrupts-off Latency Tracer"
 	default n
@@ -171,15 +170,15 @@ config INTERRUPT_OFF_HIST
 	bool "Interrupts-off Latency Histogram"
 	depends on IRQSOFF_TRACER
 	help
-	  This option generates a continuously updated histogram (one per cpu)
+	  This option generates continuously updated histograms (one per cpu)
 	  of the duration of time periods with interrupts disabled. The
-	  histogram is disabled by default. To enable it, write a non-zero
-	  number to the related file in
+	  histograms are disabled by default. To enable them, write a non-zero
+	  number to

 	      /sys/kernel/debug/tracing/latency_hist/enable/preemptirqsoff

-	  If PREEMPT_OFF_HIST is also selected, an additional histogram (one
-	  per cpu) is generated that accumulates the duration of time periods
+	  If PREEMPT_OFF_HIST is also selected, additional histograms (one
+	  per cpu) are generated that accumulate the duration of time periods
 	  when both interrupts and preemption are disabled.

 config PREEMPT_TRACER
@@ -208,15 +207,15 @@ config PREEMPT_OFF_HIST
 	bool "Preemption-off Latency Histogram"
 	depends on PREEMPT_TRACER
 	help
-	  This option generates a continuously updated histogram (one per cpu)
+	  This option generates continuously updated histograms (one per cpu)
 	  of the duration of time periods with preemption disabled. The
-	  histogram is disabled by default. To enable it, write a non-zero
+	  histograms are disabled by default. To enable them, write a non-zero
 	  number to

 	      /sys/kernel/debug/tracing/latency_hist/enable/preemptirqsoff

-	  If INTERRUPT_OFF_HIST is also selected, an additional histogram (one
-	  per cpu) is generated that accumulates the duration of time periods
+	  If INTERRUPT_OFF_HIST is also selected, additional histograms (one
+	  per cpu) are generated that accumulate the duration of time periods
 	  when both interrupts and preemption are disabled.

 config SCHED_TRACER
@@ -232,12 +231,20 @@ config WAKEUP_LATENCY_HIST
 	bool "Scheduling Latency Histogram"
 	depends on SCHED_TRACER
 	help
-	  This option generates a continuously updated histogram (one per cpu)
-	  of the scheduling latency of the highest priority task. The histogram
-	  is disabled by default. To enable it, write a non-zero number to
+	  This option generates continuously updated histograms (one per cpu)
+	  of the scheduling latency of the highest priority task.
+	  The histograms are disabled by default. To enable them, write a
+	  non-zero number to

 	      /sys/kernel/debug/tracing/latency_hist/enable/wakeup

+	  Two different algorithms are used, one to determine the latency of
+	  processes that exclusively use the highest priority of the system and
+	  another one to determine the latency of processes that share the
+	  highest system priority with other processes. The former is used to
+	  improve hardware and system software, the latter to optimize the
+	  priority design of a given system.
+
 config SYSPROF_TRACER
 	bool "Sysprof Tracer"
 	depends on X86

--- a/kernel/trace/latency_hist.c
+++ b/kernel/trace/latency_hist.c