Implementing Thread State Analysis (TSA)
Lately, I have been reading Systems Performance: Enterprise and the Cloud by Brendan Gregg[1]. Among other things, the book provides a set of methodologies for application performance analysis. One of these is Thread State Analysis (TSA).
This post describes some of the exploration I did around TSA, a methodology Gregg has covered multiple times[2][3], and proposes a solution to one of the book's exercises, which asks the reader to implement a TSA tool for Linux.
The code presented later in the post is available at https://github.com/samueleresca/tsastat.
Overview
The Thread State Analysis (TSA) method measures the time threads spend in certain states and then investigates each state, from most to least frequent[2:1]. It can be used as a first step to analyze application performance.
The methodology can be applied independently of the type of application you want to investigate, and it works as an initial approach to point your investigation in the right direction.
The methodology proposes a set of thread states[4] that most operating systems have. For each state, it provides some follow-up actions to get further insights into the state of the system. Below are the states proposed by the methodology:
Running
: the time spent by the thread running on-CPU. The call to action in case of a high value is to investigate the application with profiling tools to determine where the CPU cycles are spent. Note that, since spinning locks run on-CPU, they add time to this state.

Runnable
: the time spent by the thread waiting for scheduling. A high value in this state usually implies that the application needs more CPU. Therefore, review the CPU resources and the limits configured for the application or the system.

Swapping
: the time spent by the thread in a swapped-out, wait-for-residency state. A long time spent in this state usually points to memory utilization issues.

Sleeping
: the time spent by the thread waiting for any I/O. A high value here might suggest resource limitations (e.g. disk, network). The next step is to understand which I/O is the bottleneck and to perform Off-CPU analysis[5] to gather further details.

Lock
: the time spent by the thread waiting on lock contention. A high value here suggests investigating which lock is blocking the thread and why it is blocking. Further lock analysis is required.
A prerequisite for capturing these states is identifying how they are actually implemented in the operating system. On Linux, we can find the possible states of a thread by looking at linux/sched.h. More on this in a later section.
Implementing TSA on Linux with bpftrace
As mentioned in the intro, this section proposes an implementation of the Thread State Analysis on Linux using bpftrace. To quote the exercise in the book:
"Develop a tool for Linux called
tsastat(8)
that prints columns for multiple thread state analysis states, with time spent in each. This can behave similarly to pidstat(1) and produce a rolling output."
All the code described next is available at https://github.com/samueleresca/tsastat. The scripts require bpftrace to be installed on the system. More detailed instructions are in the README of the repo.
A similar tool for FreeBSD has already been written by Gregg[6].
Calculating on-CPU time (Running state)
The on-CPU time is the time spent by the thread on the CPU. We can use the tracepoint:sched:sched_switch tracepoint, which is triggered at every scheduler switch event. Each tracepoint comes with some arguments that are passed in when the action is executed. It is possible to list the arguments using the following command:
$ bpftrace -lv "tracepoint:sched:sched_switch"
tracepoint:sched:sched_switch
char prev_comm[16]
pid_t prev_pid
int prev_prio
long prev_state
char next_comm[16]
pid_t next_pid
int next_prio
In this case, the prev_-prefixed fields track information about the task that is switching out of the CPU, while the next_-prefixed fields refer to the task that is about to run.
Below is a bpftrace script that gathers the on-CPU time for each thread:
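(A minimal sketch of the idea follows; the map names @running_ts and @cpu_time are illustrative, and the full version in the repo differs in the details.)

tracepoint:sched:sched_switch
{
    // On-CPU time of the task being switched out: delta between now
    // and the timestamp recorded when it was switched in.
    if (@running_ts[args->prev_pid]) {
        @cpu_time[args->prev_comm, args->prev_pid] += nsecs - @running_ts[args->prev_pid];
        delete(@running_ts[args->prev_pid]);
    }

    // Record when the next task starts running on-CPU, so the next
    // switch can compute its interval.
    @running_ts[args->next_pid] = nsecs;
}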
At each sched_switch, the script calculates the interval between the current time (nsecs) and the timestamp recorded when the previous task was switched in. In this way, we can keep track of the on-CPU time. In addition, it records the start time for next_pid, to be used in the subsequent interval calculation.
Tracking Runnable state (Waiting on Queue)
To calculate the Runnable state, we can introduce two new tracepoints:

- tracepoint:sched:sched_wakeup: triggers when the scheduler wakes up a task. It typically occurs when a task transitions from a sleeping state to a runnable state.
- tracepoint:sched:sched_wakeup_new: triggers when a new task transitions for the first time into a runnable state.
Below is a snippet to compute the runnable state time:
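(A sketch of the logic; the map names @runnable_ts and @runq_time are illustrative, and the repo version differs in the details.)

// A task becomes runnable: record the wakeup timestamp.
tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
    @runnable_ts[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
{
    // Run-queue latency of the task being switched in: delta between
    // the wakeup timestamp and the moment it actually gets the CPU.
    if (@runnable_ts[args->next_pid]) {
        @runq_time[args->next_comm, args->next_pid] += nsecs - @runnable_ts[args->next_pid];
        delete(@runnable_ts[args->next_pid]);
    }
}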
The implementation listens for the wakeup and wakeup_new tracepoints to record the timestamp at which a task moves into a runnable state. Next, it uses the previously introduced tracepoint:sched:sched_switch to calculate the delta between the runnable and running states. This gives us the Runnable time for each thread.
Tracking the Sleeping time (off-CPU)
Sleeping time indicates that the task is off-CPU, waiting for an event. Tracking the sleeping times can be built on top of the previously used tracepoints.
The tracepoint:sched:sched_switch action checks the state of the previous thread. If the switched-out thread is not in the TASK_RUNNING state, we save the timestamp and the state:
tracepoint:sched:sched_switch
{
    ...

    // Store sleep state for previous task
    // in case it is not TASK_RUNNING
    if ($prev_state > 0)
    {
        @sleeping_ts[$prev_pid] = nsecs;
        @sleeping_state[$prev_pid] = $prev_state;
    }
}
The code above lets us keep track of the sleeping timestamps and states of each thread. The next step is to record the sleep time every time the thread wakes up. This can be done by extending the sched_wakeup and sched_wakeup_new probes.
Below is a snippet showing the changes.
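(The version here is a sketch: the state constants come from linux/sched.h and are kernel-version dependent, and the map names are illustrative; the repo version differs in the details.)

tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
    $pid = args->pid;

    if (@sleeping_ts[$pid]) {
        $delta = nsecs - @sleeping_ts[$pid];
        $state = @sleeping_state[$pid];

        // State bits as defined in linux/sched.h (values assumed here;
        // they are kernel-version dependent).
        if ($state & 0x0001) { @sleep_time[args->comm, $pid] += $delta; }   // TASK_INTERRUPTIBLE
        if ($state & 0x0002) { @usleep_time[args->comm, $pid] += $delta; }  // TASK_UNINTERRUPTIBLE
        if ($state & 0x0004) { @stopped_time[args->comm, $pid] += $delta; } // __TASK_STOPPED

        delete(@sleeping_ts[$pid]);
        delete(@sleeping_state[$pid]);
    }

    // The task is runnable again: record the wakeup timestamp used for
    // the Runnable state calculation.
    @runnable_ts[$pid] = nsecs;
}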
The implementation uses the state values from linux/sched.h[7] to check the thread's state saved during the scheduler switch. It computes the delta and saves the result in the map corresponding to the sleeping state. Finally, once the delta is saved, it deletes the key.
The task states referenced by the implementation are:
- The TASK_INTERRUPTIBLE state indicates the task is sleeping (or waiting). The task will move to the TASK_RUNNING state once some event occurs, and it can be awakened by signals. This state is for tasks that can safely be interrupted without any risk of causing inconsistencies. An example is an application calling functions like poll() or select() to wait for input from the keyboard.
- The TASK_UNINTERRUPTIBLE state refers to a sleeping task, similarly to TASK_INTERRUPTIBLE, but user-space signals cannot wake it up. An example is a call to a blocking function like read() to fetch data from the disk. In that case, the system puts the task in an uninterruptible state because an interruption could cause data corruption or inconsistency.
- The TASK_RUNNING state indicates the task is currently executing or ready to execute on a CPU.
- The TASK_STOPPED state means the task has been stopped, usually due to a SIGSTOP or a similar signal that pauses the task.
The next section moves on to the implementation of lock time tracking.
Tracking lock time (Locked state)
This section proposes two implementations to calculate the time spent on lock contention. The first listens for the futex ("fast userspace mutex") syscall[8]. The implementation uses the following tracepoints:

- tracepoint:syscalls:sys_enter_futex: triggered on the lock attempt.
- tracepoint:syscalls:sys_exit_futex: triggered on the lock release.
Similarly to the previous states, we can keep track of the lock attempts by saving the timestamps in a map and calculating the delta on the lock release.
Below is the implementation:
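(The following is a sketch of the approach; the FUTEX_WAIT mask and the map names are assumptions based on linux/futex.h, and the repo version may differ.)

// FUTEX_WAIT is operation 0 in linux/futex.h; the mask strips the
// FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME bits (an assumption,
// check your kernel headers).
tracepoint:syscalls:sys_enter_futex
/ (args->op & 0x7f) == 0 /
{
    @lock_ts[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_futex
/ @lock_ts[tid] /
{
    @lock_time[comm, tid] += nsecs - @lock_ts[tid];
    delete(@lock_ts[tid]);
}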
When tracepoint:syscalls:sys_enter_futex events fire, the action is executed only for FUTEX_WAIT operations (see the predicate declared in the probe)[9]. The action records the thread id and the initial timestamp. When tracepoint:syscalls:sys_exit_futex events fire, the action checks whether the initial timestamp for that thread has been saved (see the predicate). It then calculates the time delta. Finally, it deletes the thread key from the lock_ts map.
Futexes operate in two phases:
- fast path: threads can lock and unlock resources directly in user space without entering the kernel if there’s no contention.
- slow path: if contention occurs (e.g., a thread needs to wait for a lock to be released), futexes involve the kernel to handle waiting and waking up threads.
Note that the tracepoints above capture only the slow path. On top of that, futexes are a general synchronization mechanism, so the timing above might be misleading: it may also capture other synchronization operations, not just locks.
The second implementation uses the tracepoint:lock:contention_begin
and tracepoint:lock:contention_end
[10] tracepoints. This has a similar pattern to the previous one:
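(Below is a sketch of this variant; the map names are illustrative.)

tracepoint:lock:contention_begin
{
    // A thread starts contending for a lock: record the timestamp.
    @contention_ts[tid] = nsecs;
}

tracepoint:lock:contention_end
/ @contention_ts[tid] /
{
    // Contention is over: accumulate the time spent waiting.
    @lock_time[comm, tid] += nsecs - @contention_ts[tid];
    delete(@contention_ts[tid]);
}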
The contention start time is recorded per thread at contention_begin. The delta is calculated at contention_end, when the lock contention ends.
Displaying the results
We can use the END built-in event to display the final results. The action bound to the event has access to all the maps populated previously (see the previous sections):
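(A sketch of such an END block follows; it assumes a bpftrace version with for-loops over maps and reuses the illustrative map names from the earlier sketches.)

END
{
    printf("\nThread state analysis:\n");
    printf("%-16s %-8s %10s %10s %10s %10s\n",
        "COMM", "PID", "CPU", "RUNQ", "SLP", "LCK");

    // Iterating over map entries requires a bpftrace version with
    // for-loops over maps (an assumption). $kv.0 is the (comm, pid)
    // key tuple, $kv.1 the accumulated nanoseconds, converted to ms.
    for ($kv : @cpu_time) {
        printf("%-16s %-8d %10llu %10llu %10llu %10llu\n",
            $kv.0.0, $kv.0.1,
            $kv.1 / 1000000,
            @runq_time[$kv.0.0, $kv.0.1] / 1000000,
            @sleep_time[$kv.0.0, $kv.0.1] / 1000000,
            @lock_time[$kv.0.0, $kv.0.1] / 1000000);
    }
}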
The script first prints the table header. Then it iterates over the maps and prints the aggregated time spent in each state for every thread, converting the times to milliseconds.
By default, bpftrace prints the content of all non-empty maps at exit. To avoid that behaviour, you can set the print_maps_on_exit=0 configuration option[11].
As a final result, you will get a pidstat(1)-style output similar to the following:
user@ubuntu-vm:~/Projects$ tsastat.bt
Attaching 7 probes...
Tracing scheduler events... Hit Ctrl-C to end.
^C
Thread state analysis:
COMM PID CPU RUNQ SLP USL SUS LCK DEA
futex_contentio 32834 2 201 201 0 0 225 0
xdg-desktop-por 2401 7 0 58985 0 0 0 0
futex_contentio 33769 2 209 348 0 0 384 0
CacheThread_Blo 2727 0 0 0 0 0 0 0
futex_contentio 33207 3 352 402 0 0 503 0
futex_contentio 32026 2 138 160 0 0 166 0
kworker/u4:5 31410 19 0 0 110 3 0 3224
gmain 2410 1 0 0 0 0 0 0
ThreadPoolForeg 2744 32 0 746 0 0 7077 0
For better readability, you can implement a rolling output: replace the END built-in event with interval:5s and clear the terminal output each time.
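A hypothetical sketch of the rolling variant, reusing the illustrative map names from the earlier sketches:

// Hypothetical rolling variant: report every 5 seconds instead of at exit,
// then reset the accumulators so each table only covers the last interval.
interval:5s
{
    printf("\n%s  Thread state analysis (last 5s)\n", strftime("%H:%M:%S", nsecs));
    // ... same header and printing loop as in the END block ...
    clear(@cpu_time);
    clear(@runq_time);
    clear(@sleep_time);
    clear(@lock_time);
}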
Wrap up
The post proposes an implementation of Thread State Analysis on Linux using bpftrace, in response to an exercise in the Systems Performance book[1:1]. The tracepoints used in the implementation might vary based on your OS configuration. In general, you can get a list of the tracepoints available on your system by running bpftrace -l 'tracepoint:*'. In addition, please verify the script before using it in production environments: it might cause noticeable overhead on your system, as it traces very frequent events (including scheduler events).
Systems Performance: Enterprise and the Cloud, 2nd Edition, Brendan Gregg
The original TSA method post by Gregg: https://www.brendangregg.com/tsamethod.html
In Linux, a task is a runnable entity that can be a thread, a process with a single thread, or a kernel thread. In this post, thread and task are used interchangeably.
Off-CPU analysis is another methodology introduced by Brendan Gregg.
See: https://github.com/brendangregg/DTrace-tools/blob/master/sched/tstates.d
These tracepoints were introduced in the kernel in 2022. See: [PATCH 1/4] locking: Add lock contention tracepoints