Implementing Thread State Analysis (TSA)
Lately, I have been reading Systems Performance: Enterprise and the Cloud by Brendan Gregg[1]. Among other things, the book provides a set of methodologies for application performance analysis. One of these is Thread State Analysis (TSA).
This post describes some of the exploration I did around TSA, a methodology Gregg has covered multiple times[2][3], and proposes a solution to one of the book's exercises, which asks the reader to implement a TSA tool for Linux.
The code presented later in the post is available at https://github.com/samueleresca/tsastat.
Overview
The Thread State Analysis (TSA) method measures the time threads spend in certain states and then investigates each state, from most to least frequent[2:1]. It can be used as a first step to analyze application performance.
The methodology can be applied independently of the type of application you want to investigate, and it works as an initial approach to point your investigation in the right direction.
The methodology proposes a set of thread states[4] that most operating systems have. For each state, it provides some follow-up actions to get further insights into the state of the system. Below are the states proposed by the methodology:
Running
: the time spent by the thread running on-CPU. The call to action in case of a high value is to investigate the application with profiling tools to determine where the CPU cycles are spent. Note that, since spinning locks run on-CPU, they add time to this state.

Runnable
: the time spent by the thread waiting for scheduling. A high value in this state usually implies that the application needs more CPU. Therefore, review the CPU resources and the limits configured for the application or the system.

Swapping
: the time spent by the thread in a swapped-out, wait-for-residency state. A long time spent in this state usually points to memory utilization issues.

Sleeping
: the time spent by the thread waiting for any I/O. A high value here might suggest resource limitations (e.g. disk, network). The next step is to understand which I/O is the bottleneck and to perform Off-CPU analysis[5] to gather further details.

Lock
: the time spent by the thread waiting on lock contention. A high value here suggests investigating which lock is blocking the thread and why it is blocking. Further lock analysis is required.
A prerequisite for capturing these states is identifying how they are actually implemented in the operating system. On Linux, we can find the possible states of a thread by looking at linux/sched.h. More on this in a later section.
Implementing TSA on Linux with bpftrace
As mentioned in the intro, this section proposes an implementation of the Thread State Analysis on Linux using bpftrace. To quote the exercise in the book:
"Develop a tool for Linux called
tsastat(8)
that prints columns for multiple thread state analysis states, with time spent in each. This can behave similarly to pidstat(1) and produce a rolling output."
All the code described next is available at https://github.com/samueleresca/tsastat. The scripts require bpftrace to be installed on the system. More detailed instructions are in the README of the repo.
A similar tool for FreeBSD has already been written by Gregg[6].
Calculating on-CPU time (Running state)
The on-CPU time is the time spent by the thread on the CPU. We can use the tracepoint:sched:sched_switch tracepoint, which is triggered at every scheduler switch event. Each tracepoint comes with some arguments that are passed in when the action is executed. It is possible to list the arguments using the following command:
$ bpftrace -lv "tracepoint:sched:sched_switch"
tracepoint:sched:sched_switch
char prev_comm[16]
pid_t prev_pid
int prev_prio
long prev_state
char next_comm[16]
pid_t next_pid
int next_prio
In this case, the prev_-prefixed fields track information about the task that is switching out of the CPU, while the next_-prefixed fields refer to the task that is about to run.
Below is a bpftrace script that gathers the on-CPU time for each thread:
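(A minimal sketch of the idea follows; the map names @running_ts and @cpu_time are illustrative, and the full version in the repo differs in the details.)

tracepoint:sched:sched_switch
{
    // On-CPU time of the task being switched out: delta between now
    // and the timestamp recorded when it was switched in.
    if (@running_ts[args->prev_pid]) {
        @cpu_time[args->prev_comm, args->prev_pid] += nsecs - @running_ts[args->prev_pid];
        delete(@running_ts[args->prev_pid]);
    }

    // Record when the next task starts running on-CPU, so the next
    // switch can compute its interval.
    @running_ts[args->next_pid] = nsecs;
}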
At each sched_switch, the script calculates the interval between the current time (nsecs) and the timestamp recorded when the previous task was switched in. In this way, we can keep track of the on-CPU time. In addition, it records the start time for next_pid, to be used in the subsequent interval calculation.
Tracking Runnable state (Waiting on Queue)
To calculate the Runnable state, we can introduce two new tracepoints:

- tracepoint:sched:sched_wakeup: triggers when the scheduler wakes up a task. It typically occurs when a task transitions from a sleeping state to a runnable state.
- tracepoint:sched:sched_wakeup_new: triggers when a new task transitions for the first time into a runnable state.
Below is a snippet to compute the runnable state time:
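(A sketch of the logic; the map names @runnable_ts and @runq_time are illustrative, and the repo version differs in the details.)

// A task becomes runnable: record the wakeup timestamp.
tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
    @runnable_ts[args->pid] = nsecs;
}

tracepoint:sched:sched_switch
{
    // Run-queue latency of the task being switched in: delta between
    // the wakeup timestamp and the moment it actually gets the CPU.
    if (@runnable_ts[args->next_pid]) {
        @runq_time[args->next_comm, args->next_pid] += nsecs - @runnable_ts[args->next_pid];
        delete(@runnable_ts[args->next_pid]);
    }
}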
The implementation listens for the wakeup and wakeup_new tracepoints to record the timestamp at which a task moves into a runnable state. Next, it uses the previously introduced tracepoint:sched:sched_switch to calculate the delta between the runnable and running states. This gives us the Runnable time for each thread.
Tracking the Sleeping time (off-CPU)
Sleeping time indicates that the task is off-CPU, waiting for an event. Tracking the sleeping times can be built on top of the previously used tracepoints.
The tracepoint:sched:sched_switch action checks the state of the previous thread. If the switched-out thread is not in the TASK_RUNNING state, we save the timestamp and the state:
tracepoint:sched:sched_switch
{
    ...

    // Store sleep state for previous task
    // in case it is not TASK_RUNNING
    if ($prev_state > 0)
    {
        @sleeping_ts[$prev_pid] = nsecs;
        @sleeping_state[$prev_pid] = $prev_state;
    }
}
The code above lets us keep track of the sleeping timestamps and states of each thread. The next step is to record the sleep time every time the thread wakes up. This can be done by extending the sched_wakeup and sched_wakeup_new probes.
Below is a snippet showing the changes.
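(The version here is a sketch: the state constants come from linux/sched.h and are kernel-version dependent, and the map names are illustrative; the repo version differs in the details.)

tracepoint:sched:sched_wakeup,
tracepoint:sched:sched_wakeup_new
{
    $pid = args->pid;

    if (@sleeping_ts[$pid]) {
        $delta = nsecs - @sleeping_ts[$pid];
        $state = @sleeping_state[$pid];

        // State bits as defined in linux/sched.h (values assumed here;
        // they are kernel-version dependent).
        if ($state & 0x0001) { @sleep_time[args->comm, $pid] += $delta; }   // TASK_INTERRUPTIBLE
        if ($state & 0x0002) { @usleep_time[args->comm, $pid] += $delta; }  // TASK_UNINTERRUPTIBLE
        if ($state & 0x0004) { @stopped_time[args->comm, $pid] += $delta; } // __TASK_STOPPED

        delete(@sleeping_ts[$pid]);
        delete(@sleeping_state[$pid]);
    }

    // The task is runnable again: record the wakeup timestamp used for
    // the Runnable state calculation.
    @runnable_ts[$pid] = nsecs;
}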
The implementation uses the state values from linux/sched.h[7] to check the thread's state saved during the scheduler switch. It computes the delta and saves the result in the map corresponding to the sleeping state. Finally, once the delta is saved, it deletes the key.
The task states referenced by the implementation are:
- The TASK_INTERRUPTIBLE state indicates the task is sleeping (or waiting). The task will move to the TASK_RUNNING state once some event occurs, and it can be awakened by signals. This state is for tasks that can safely be interrupted without any risk of causing inconsistencies. An example is an application calling functions like poll() or select() to wait for input from the keyboard.
- The TASK_UNINTERRUPTIBLE state refers to a sleeping task, similarly to TASK_INTERRUPTIBLE, but user-space signals cannot wake it up. An example is a call to a blocking function like read() to fetch data from the disk. In that case, the system puts the task in an uninterruptible state because an interruption could cause data corruption or inconsistency.
- The TASK_RUNNING state indicates the task is currently executing or ready to execute on a CPU.
- The TASK_STOPPED state means the task has been stopped, usually due to a SIGSTOP or a similar signal that pauses the task.
The next section moves on to the implementation of lock time tracking.
Tracking lock time (Locked state)
This section proposes two implementations to calculate the time spent on lock contention. The first listens for the futex ("fast userspace mutex") syscall[8]. The implementation uses the following tracepoints:

- tracepoint:syscalls:sys_enter_futex: triggered on the lock attempt.
- tracepoint:syscalls:sys_exit_futex: triggered on the lock release.
Similarly to the previous states, we can keep track of the lock attempts by saving the timestamps in a map and calculating the delta on the lock release.
Below is the implementation:
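(The following is a sketch of the approach; the FUTEX_WAIT mask and the map names are assumptions based on linux/futex.h, and the repo version may differ.)

// FUTEX_WAIT is operation 0 in linux/futex.h; the mask strips the
// FUTEX_PRIVATE_FLAG and FUTEX_CLOCK_REALTIME bits (an assumption,
// check your kernel headers).
tracepoint:syscalls:sys_enter_futex
/ (args->op & 0x7f) == 0 /
{
    @lock_ts[tid] = nsecs;
}

tracepoint:syscalls:sys_exit_futex
/ @lock_ts[tid] /
{
    @lock_time[comm, tid] += nsecs - @lock_ts[tid];
    delete(@lock_ts[tid]);
}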
When tracepoint:syscalls:sys_enter_futex events fire, the action is executed only for FUTEX_WAIT operations (see the predicate declared in the probe)[9]. The action records the thread id and the initial timestamp. When tracepoint:syscalls:sys_exit_futex events fire, the action checks whether the initial timestamp for that thread has been saved (see the predicate). It then calculates the time delta. Finally, it deletes the thread key from the lock_ts map.
Futexes operate in two phases:
- fast path: threads can lock and unlock resources directly in user space without entering the kernel if there’s no contention.
- slow path: if contention occurs (e.g., a thread needs to wait for a lock to be released), futexes involve the kernel to handle waiting and waking up threads.
Note that the tracepoints above capture only the slow path. On top of that, futexes are a general synchronization mechanism, so the timing above might be misleading: it may also capture other synchronization operations, not just locks.
The second implementation uses the tracepoint:lock:contention_begin
and tracepoint:lock:contention_end
[10] tracepoints. This has a similar pattern to the previous one:
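(Below is a sketch of this variant; the map names are illustrative.)

tracepoint:lock:contention_begin
{
    // A thread starts contending for a lock: record the timestamp.
    @contention_ts[tid] = nsecs;
}

tracepoint:lock:contention_end
/ @contention_ts[tid] /
{
    // Contention is over: accumulate the time spent waiting.
    @lock_time[comm, tid] += nsecs - @contention_ts[tid];
    delete(@contention_ts[tid]);
}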
The contention start time is recorded per thread at contention_begin. The delta is calculated at contention_end, when the lock contention ends.
Displaying the results
We can use the END built-in event to display the final results. The action bound to the event has access to all the maps populated previously (see the previous sections):
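(A sketch of such an END block follows; it assumes a bpftrace version with for-loops over maps and reuses the illustrative map names from the earlier sketches.)

END
{
    printf("\nThread state analysis:\n");
    printf("%-16s %-8s %10s %10s %10s %10s\n",
        "COMM", "PID", "CPU", "RUNQ", "SLP", "LCK");

    // Iterating over map entries requires a bpftrace version with
    // for-loops over maps (an assumption). $kv.0 is the (comm, pid)
    // key tuple, $kv.1 the accumulated nanoseconds, converted to ms.
    for ($kv : @cpu_time) {
        printf("%-16s %-8d %10llu %10llu %10llu %10llu\n",
            $kv.0.0, $kv.0.1,
            $kv.1 / 1000000,
            @runq_time[$kv.0.0, $kv.0.1] / 1000000,
            @sleep_time[$kv.0.0, $kv.0.1] / 1000000,
            @lock_time[$kv.0.0, $kv.0.1] / 1000000);
    }
}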
The script first prints the table header. Then it iterates over the maps and prints the aggregated time spent in each state for every thread, converting the times to milliseconds.
By default, bpftrace prints the content of all non-empty maps at exit. To avoid that behaviour, you can set the print_maps_on_exit=0 configuration option[11].
As a final result, you will get a pidstat(1)-style output similar to the following:
user@ubuntu-vm:~/Projects$ tsastat.bt
Attaching 7 probes...
Tracing scheduler events... Hit Ctrl-C to end.
^C
Thread state analysis:
COMM PID CPU RUNQ SLP USL SUS LCK DEA
futex_contentio 32834 2 201 201 0 0 225 0
xdg-desktop-por 2401 7 0 58985 0 0 0 0
futex_contentio 33769 2 209 348 0 0 384 0
CacheThread_Blo 2727 0 0 0 0 0 0 0
futex_contentio 33207 3 352 402 0 0 503 0
futex_contentio 32026 2 138 160 0 0 166 0
kworker/u4:5 31410 19 0 0 110 3 0 3224
gmain 2410 1 0 0 0 0 0 0
ThreadPoolForeg 2744 32 0 746 0 0 7077 0
For better readability, you can implement a rolling output: replace the END built-in event with interval:5s and clear the terminal output each time.
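A hypothetical sketch of the rolling variant, reusing the illustrative map names from the earlier sketches:

// Hypothetical rolling variant: report every 5 seconds instead of at exit,
// then reset the accumulators so each table only covers the last interval.
interval:5s
{
    printf("\n%s  Thread state analysis (last 5s)\n", strftime("%H:%M:%S", nsecs));
    // ... same header and printing loop as in the END block ...
    clear(@cpu_time);
    clear(@runq_time);
    clear(@sleep_time);
    clear(@lock_time);
}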
Wrap up
The post proposes an implementation of Thread State Analysis on Linux using bpftrace, in response to an exercise in the Systems Performance book[1:1]. The tracepoints used in the implementation might vary based on your OS configuration. In general, you can get a list of the tracepoints available on your system by running bpftrace -l 'tracepoint:*'. In addition, please verify the script before using it in production environments: it might cause noticeable overhead on your system, as it traces very frequent events (including scheduler events).
Systems Performance: Enterprise and the Cloud, 2nd Edition, Brendan Gregg
The original TSA method post by Gregg: https://www.brendangregg.com/tsamethod.html
In Linux, a task is a runnable entity that can be a thread, a process with a single thread, or a kernel thread. In this post, thread and task are used interchangeably.
Off-CPU analysis is another methodology introduced by Brendan Gregg.
See: https://github.com/brendangregg/DTrace-tools/blob/master/sched/tstates.d
These tracepoints were introduced in the kernel in 2022. See: [PATCH 1/4] locking: Add lock contention tracepoints