<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Samuele Resca]]></title><description><![CDATA[Samuele Resca]]></description><link>https://samueleresca.net/</link><image><url>https://samueleresca.net/favicon.png</url><title>Samuele Resca</title><link>https://samueleresca.net/</link></image><generator>Ghost 5.33</generator><lastBuildDate>Thu, 23 Apr 2026 06:53:08 GMT</lastBuildDate><atom:link href="https://samueleresca.net/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Implementing Thread State Analysis (TSA)]]></title><description><![CDATA[This post goes through some of the exploration I did around Thread State Analysis (TSA) and proposes a solution to one of the exercises of the System Performance book where the ask is to implement a TSA tool for Linux.]]></description><link>https://samueleresca.net/implementing-thread-state-analysis-methodology/</link><guid isPermaLink="false">673f6f6a827b68149a69ac2d</guid><category><![CDATA[performance]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Fri, 13 Dec 2024 21:09:24 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>Lately, I have been reading Systems Performance: Enterprise and the Cloud by Brendan Gregg<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. Among other things, the book provides a set of methodologies for application performance analysis. One of these is Thread State Analysis (TSA).</p>
<p>This post describes some of the exploration I did around TSA. The methodology has been covered by Gregg multiple times<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup><sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>. This post proposes a solution to one of the book&apos;s exercises which asks to implement a TSA tool for Linux.</p>
<p>The code presented later in the post is available at <a href="https://github.com/samueleresca/tsastat">https://github.com/samueleresca/tsastat</a>.</p>
<h2 id="overview">Overview</h2>
<p>The Thread State Analysis (TSA) method measures the time threads spend in certain states. It then investigates each state, from most to least frequent<sup class="footnote-ref"><a href="#fn2" id="fnref2:1">[2:1]</a></sup>. It can be used as a first step to analyse application performance.</p>
<p>The methodology can be used independently of the type of application you want to investigate. Furthermore, it can be applied as an initial approach to guide your investigation in the right direction.</p>
<p>The methodology proposes a set of thread states<sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup> that most operating systems have. For each state, it provides some follow-up actions to get further insights into the state of the system. Below are the states proposed by the methodology:</p>
<ul>
<li><code>Running</code>: the time spent by the thread running on-CPU. If this value is high, investigate the application with profiling tools to determine where the CPU cycles are spent. Note that, since spin locks run on-CPU, they add time in this state.</li>
<li><code>Runnable</code>: the time spent by the thread waiting to be scheduled. A high value in this state usually implies that the application needs more CPU, so review the CPU resources and the limits configured for the application or the system.</li>
<li><code>Swapping</code>: the time spent by the thread in a swapped-out, wait-for-residency state. A long time spent in this state usually points to memory utilization issues.</li>
<li><code>Sleeping</code>: the time spent by the thread waiting for any I/O. A high value here might suggest resource limitations (e.g. disk, network). The next step is to understand which I/O is the bottleneck and to perform off-CPU analysis<sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup> to gather further details.</li>
<li><code>Lock</code>: the time spent by the thread waiting on lock contention. A high value here suggests investigating which lock is blocking the thread and why it is blocking; further lock analysis is required.</li>
</ul>
<p>A prerequisite for capturing these states is identifying how they are actually implemented in the operating system. For Linux systems, we can gather the possible states of a thread by looking at <code>linux/sched.h</code>. More on this in a later section.</p>
<h2 id="implementing-tsa-on-linux-with-bpftrace">Implementing TSA on Linux with bpftrace</h2>
<p>As mentioned in the intro, this section proposes an implementation of the Thread State Analysis on Linux using bpftrace. To quote the exercise in the book:</p>
<blockquote>
<p><em>&quot;Develop a tool for Linux called <code>tsastat(8)</code> that prints columns for multiple thread state analysis states, with time spent in each. This can behave similarly to pidstat(1) and produce a rolling output.&quot;</em></p>
</blockquote>
<p>All the code described next is available at <a href="https://github.com/samueleresca/tsastat">https://github.com/samueleresca/tsastat</a>. The scripts require bpftrace to be installed on the system. More detailed instructions are in the README of the repo.</p>
<p>A similar tool for FreeBSD was already implemented by Gregg<sup class="footnote-ref"><a href="#fn6" id="fnref6">[6]</a></sup>.</p>
<h3 id="calculating-on-cpu-time-running-state">Calculating on-CPU time (<code>Running</code> state)</h3>
<p>The on-CPU time is the time spent by the thread on the CPU. We can use the <code>tracepoint:sched:sched_switch</code> tracepoint that is triggered at every scheduler switch event. Each tracepoint comes with some arguments that are passed in when the action is executed. It is possible to list the arguments using the following command:</p>
<pre><code>$ bpftrace -lv &quot;tracepoint:sched:sched_switch&quot;
tracepoint:sched:sched_switch
    char prev_comm[16]
    pid_t prev_pid
    int prev_prio
    long prev_state
    char next_comm[16]
    pid_t next_pid
    int next_prio
</code></pre>
<p>In this case, the <code>prev_</code>-prefixed fields track the information of the task that switched out from the CPU, while the <code>next_</code>-prefixed fields refer to the task that is about to run.</p>
<p>Below is a bpftrace script that gathers the on-CPU time for each thread:</p>
<script src="https://gist.github.com/samueleresca/2ef2907401b636a2e3cf527bf95a41a8.js"></script>
<p>At each <code>sched_switch</code> the script calculates the interval between the current time (<code>nsecs</code>) and the start time of the previous task. In this way, we can keep track of the on-CPU interval. In addition, it sets the start time for the <code>next_pid</code> for the subsequent interval calculations.</p>
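<p>The bookkeeping performed by the script can also be illustrated outside bpftrace. Below is a hypothetical Python sketch that replays a simplified stream of <code>sched_switch</code> events (tuples of timestamp, <code>prev_pid</code>, <code>next_pid</code>) and accumulates the on-CPU time per thread; the map names mirror the ones used in the script:</p>

```python
from collections import defaultdict

def on_cpu_times(switch_events):
    """Replays (timestamp_ns, prev_pid, next_pid) sched_switch events."""
    start_ts = {}              # pid -> timestamp of the last switch-in
    cpu_ns = defaultdict(int)  # pid -> accumulated on-CPU nanoseconds
    for ts, prev_pid, next_pid in switch_events:
        if prev_pid in start_ts:
            # Close the on-CPU interval of the task that switched out.
            cpu_ns[prev_pid] += ts - start_ts.pop(prev_pid)
        # Open the interval for the task that is about to run.
        start_ts[next_pid] = ts
    return dict(cpu_ns)

# pid 1 runs 0-100 and 250-300, pid 2 runs 100-250 (nanoseconds).
events = [(0, 0, 1), (100, 1, 2), (250, 2, 1), (300, 1, 0)]
print(on_cpu_times(events))  # {1: 150, 2: 150}
```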
<h3 id="tracking-runnable-state-waiting-on-queue">Tracking <code>Runnable</code> state (Waiting on Queue)</h3>
<p>To calculate the <code>Runnable</code> state we can introduce two new tracepoints:</p>
<ul>
<li><code>tracepoint:sched:sched_wakeup</code> triggers when the scheduler wakes up a task. It typically occurs when a task transitions from a sleeping state to a runnable state.</li>
<li><code>tracepoint:sched:sched_wakeup_new</code> triggers when a new task transitions for the first time into a runnable state.</li>
</ul>
<p>Below is a snippet to compute the runnable state time:</p>
<script src="https://gist.github.com/samueleresca/9c46655d335fa0ba5ad2a48536ab874c.js"></script>
<p>The implementation listens for the <code>wakeup</code> and <code>wakeup_new</code> tracepoints to track the timestamp of a task moved into a runnable state. Next, it uses the previously introduced <code>tracepoint:sched:sched_switch</code> to calculate the delta between the runnable and running states. This will give us the <code>Runnable</code> time for each thread.</p>
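<p>The same idea can be sketched in Python with a hypothetical, simplified event stream: a wakeup stores the timestamp, and the following switch-in of the same pid closes the run-queue wait:</p>

```python
def runqueue_times(events):
    """Replays (kind, timestamp_ns, pid) events, kind being one of
    'wakeup', 'wakeup_new' or 'switch_in'."""
    wakeup_ts = {}  # pid -> timestamp the task became runnable
    runq_ns = {}    # pid -> accumulated run-queue wait in nanoseconds
    for kind, ts, pid in events:
        if kind in ("wakeup", "wakeup_new"):
            wakeup_ts[pid] = ts
        elif kind == "switch_in" and pid in wakeup_ts:
            runq_ns[pid] = runq_ns.get(pid, 0) + ts - wakeup_ts.pop(pid)
    return runq_ns

events = [("wakeup", 10, 7), ("switch_in", 40, 7),       # pid 7 waited 30 ns
          ("wakeup_new", 50, 8), ("switch_in", 120, 8)]  # pid 8 waited 70 ns
print(runqueue_times(events))  # {7: 30, 8: 70}
```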
<h3 id="tracking-the-sleeping-time-off-cpu">Tracking the <code>Sleeping</code> time (off-CPU)</h3>
<p>Sleeping times indicate the task is off-CPU waiting for an event. The tracking of the sleeping times can be built on top of the previously used tracepoints.</p>
<p>The <code>tracepoint:sched:sched_switch</code> action verifies the state of the previous thread. If the thread that was switched out is not in the <code>TASK_RUNNING</code> state, we can save the timestamp and the state:</p>
<pre><code>tracepoint:sched:sched_switch
{
  ...
  // Store sleep state for previous task
  // in case it is not TASK_RUNNING
  if ($prev_state &gt; 0)
  {
    @sleeping_ts[$prev_pid] = nsecs;
    @sleeping_state[$prev_pid] = $prev_state;
  }
}
</code></pre>
<p>The code above keeps track of the sleeping timestamps and states of each thread. The next step is to record the sleep time every time the thread wakes up. This can be done by extending the <code>sched_wakeup</code> and <code>sched_wakeup_new</code> tracepoints.</p>
<p>Below is a full snippet showing the changes.</p>
<script src="https://gist.github.com/samueleresca/f5efa6e8c21d155c684a1cbb309f0774.js"></script>
<p>The implementation uses the states in <code>linux/sched.h</code><sup class="footnote-ref"><a href="#fn7" id="fnref7">[7]</a></sup> to check the thread&apos;s state saved during the scheduler switch. It computes the delta and saves the results in the map corresponding to the sleeping state. Finally, once the delta is saved, it proceeds by deleting the key.</p>
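<p>The same save-then-close pattern can be sketched in Python (hypothetical, simplified events): the switch-out records both the timestamp and the state, and the wakeup closes the interval under a key that includes the state:</p>

```python
def sleep_times(events):
    """Replays (kind, timestamp_ns, pid, state) events, kind being
    'switch_out' or 'wakeup'; state is the raw prev_state value."""
    sleep_ts, sleep_state = {}, {}
    totals = {}  # (pid, state) -> accumulated sleeping nanoseconds
    for kind, ts, pid, state in events:
        if kind == "switch_out" and state != 0:
            # Not TASK_RUNNING: the task is going to sleep.
            sleep_ts[pid], sleep_state[pid] = ts, state
        elif kind == "wakeup" and pid in sleep_ts:
            key = (pid, sleep_state.pop(pid))
            totals[key] = totals.get(key, 0) + ts - sleep_ts.pop(pid)
    return totals

# pid 3 sleeps in TASK_INTERRUPTIBLE (0x1) from t=100 to t=400.
events = [("switch_out", 100, 3, 0x1), ("wakeup", 400, 3, 0x0)]
print(sleep_times(events))  # {(3, 1): 300}
```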
<p>The sleeping states tracked in the implementation are:</p>
<ul>
<li><code>TASK_INTERRUPTIBLE</code> state indicates the task is sleeping (or waiting). The task will move to the <code>TASK_RUNNING</code> state once some event occurs. The task can be awakened by signals. This state is for tasks that can safely be interrupted without any risk of causing inconsistencies. An example might be if the application calls functions like <code>poll()</code> or <code>select()</code> to wait for input from the keyboard.</li>
<li>The <code>TASK_UNINTERRUPTIBLE</code> state refers to a sleeping task, similarly to <code>TASK_INTERRUPTIBLE</code>, but user-space signals cannot wake it up. An example is a call to a blocking function like <code>read()</code> to fetch data from the disk. In that case, the system puts the task in an uninterruptible state because an interruption could cause data corruption or inconsistency.</li>
<li><code>TASK_RUNNING</code> state indicates the task is currently executing or is ready to execute on a CPU.</li>
<li><code>TASK_STOPPED</code> state means the task has been stopped. This is usually due to a SIGSTOP signal or a similar signal that pauses the task.</li>
</ul>
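<p>For reference, the classification against <code>prev_state</code> can be sketched as a small decoder. The constants below are taken from <code>include/linux/sched.h</code> of recent kernels; the values have moved around across kernel versions, so verify them against your own header:</p>

```python
# Task state bits as defined in include/linux/sched.h (recent kernels).
TASK_INTERRUPTIBLE = 0x0001
TASK_UNINTERRUPTIBLE = 0x0002
TASK_STOPPED = 0x0004  # __TASK_STOPPED in the header

def classify(prev_state):
    """Maps a raw prev_state value to a state name."""
    if prev_state == 0:
        return "TASK_RUNNING"  # still runnable: the task was preempted
    if prev_state & TASK_INTERRUPTIBLE:
        return "TASK_INTERRUPTIBLE"
    if prev_state & TASK_UNINTERRUPTIBLE:
        return "TASK_UNINTERRUPTIBLE"
    if prev_state & TASK_STOPPED:
        return "TASK_STOPPED"
    return "OTHER"

print(classify(0x0))  # TASK_RUNNING
print(classify(0x2))  # TASK_UNINTERRUPTIBLE
```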
<p>The next section covers the implementation of tracking lock times.</p>
<h3 id="tracking-lock-time-locked-state">Tracking lock time (<code>Locked</code> state)</h3>
<p>This section proposes two implementations to calculate the time spent on lock contention. The first listens for the futex syscall<sup class="footnote-ref"><a href="#fn8" id="fnref8">[8]</a></sup> (&quot;Fast Userspace muTEX&quot;). The implementation uses the following tracepoints:</p>
<ul>
<li><code>tracepoint:syscalls:sys_enter_futex</code> triggered on the lock attempt.</li>
<li><code>tracepoint:syscalls:sys_exit_futex</code> triggered when the futex syscall returns (e.g. after the wait completes).</li>
</ul>
<p>Similarly to the previous states, we can keep track of the lock attempts by saving the timestamps in a map and calculating the delta on the lock release.</p>
<p>Below is the implementation:</p>
<script src="https://gist.github.com/samueleresca/91a7793206ded928d13cc0e215aca9a9.js"></script>
<p>When <code>tracepoint:syscalls:sys_enter_futex</code> events fire, the action gets executed in case of <code>FUTEX_WAIT</code> operations (see the predicate declared in the probe)<sup class="footnote-ref"><a href="#fn9" id="fnref9">[9]</a></sup>. The action tracks the thread id and the initial timestamp. When <code>tracepoint:syscalls:sys_exit_futex</code> events fire, the action checks if the initial timestamp for that thread has been saved (see predicate). It then calculates the time delta. Finally, it deletes the thread key from the <code>lock_ts</code> map.</p>
<p>Futexes operate in two phases:</p>
<ul>
<li><em>fast path</em>: threads can lock and unlock resources directly in user space without entering the kernel if there&#x2019;s no contention.</li>
<li><em>slow path</em>: if contention occurs (e.g., a thread needs to wait for a lock to be released), futexes involve the kernel to handle waiting and waking up threads.</li>
</ul>
<p>Note that the tracepoints above capture the slow path. On top of that, futexes are a general synchronization mechanism used in the kernel, so the timing above might be misleading: it might also capture other synchronization operations, not just locks.</p>
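<p>To exercise the script, a small program that deliberately contends on a lock can be used. Below is a hypothetical driver in Python: on Linux, CPython locks are futex-backed, so contended acquisitions go through the <code>FUTEX_WAIT</code> slow path traced above:</p>

```python
import threading
import time

lock = threading.Lock()

def worker():
    # Contended acquisitions force the threads into the futex slow path.
    for _ in range(100):
        with lock:
            time.sleep(0.0001)  # hold the lock briefly to force waiting

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("done")
```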
<p>The second implementation uses the <code>tracepoint:lock:contention_begin</code> and <code>tracepoint:lock:contention_end</code><sup class="footnote-ref"><a href="#fn10" id="fnref10">[10]</a></sup> tracepoints. This has a similar pattern to the previous one:</p>
<script src="https://gist.github.com/samueleresca/c8d4e2cfa8146261bdd87918de834ad6.js"></script>
<p>The contention time by thread is measured at <code>contention_begin</code>. The delta is calculated using the <code>contention_end</code> time when the lock contention ends.</p>
<h3 id="displaying-the-results">Displaying the results</h3>
<p>We can use the <code>END</code> built-in event to display the final results. The action bound to the event has access to all the maps populated previously (see the previous sections):</p>
<script src="https://gist.github.com/samueleresca/9c65ec73ffd67d4250a1e371557a451d.js"></script>
<p>The script first prints the table header. Then it iterates over the maps to print the per-thread time aggregation for each state, converting the times to milliseconds.</p>
<p>By default, bpftrace prints the content of the non-empty maps at exit. To avoid that behaviour, it is possible to specify the <code>print_maps_on_exit=0</code> configuration<sup class="footnote-ref"><a href="#fn11" id="fnref11">[11]</a></sup>.</p>
<p>As a final result, you will get something similar to the pidstat output:</p>
<pre><code>user@ubuntu-vm:~/Projects$ tsastat.bt
Attaching 7 probes...
Tracing scheduler events... Hit Ctrl-C to end.
^C
Thread state analysis:
COMM             PID       CPU   RUNQ    SLP    USL    SUS    LCK    DEA
futex_contentio  32834       2    201    201      0      0    225      0
xdg-desktop-por  2401        7      0  58985      0      0      0      0
futex_contentio  33769       2    209    348      0      0    384      0
CacheThread_Blo  2727        0      0      0      0      0      0      0
futex_contentio  33207       3    352    402      0      0    503      0
futex_contentio  32026       2    138    160      0      0    166      0
kworker/u4:5     31410      19      0      0    110      3      0   3224
gmain            2410        1      0      0      0      0      0      0
ThreadPoolForeg  2744       32      0    746      0      0   7077      0
</code></pre>
<p>For better readability, you can implement rolling output: replace the <code>END</code> built-in event with <code>interval:5s</code> and clear the terminal output each time.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>The post proposes an implementation of Thread State Analysis on Linux using bpftrace, in response to an exercise in the System Performance book<sup class="footnote-ref"><a href="#fn1" id="fnref1:1">[1:1]</a></sup>. The tracepoints used in the implementation might vary based on your OS&apos;s configuration; in general, you can get a list of the tracepoints available on your system by running <code>bpftrace -l &apos;tracepoint:*&apos;</code>. In addition, please verify the script before using it in production environments: it might cause noticeable overhead on your system, as it traces very frequent events (including the scheduler events).</p>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p><a href="https://www.brendangregg.com/blog/2020-07-15/systems-performance-2nd-edition.html">Systems Performance: Enterprise and the Cloud, 2nd Edition</a> <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref1:1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>The original TSA method post by Gregg: <a href="https://www.brendangregg.com/tsamethod.html">https://www.brendangregg.com/tsamethod.html</a> <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref2:1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn3" class="footnote-item"><p><a href="https://www.brendangregg.com/blog/2017-10-28/bsd-performance-analysis-methodologies.html">EuroBSDcon: System Performance Analysis Methodologies</a> <a href="#fnref3" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn4" class="footnote-item"><p>In Linux a task is a runnable entity that can be either a thread, a process with a single thread or a kernel thread. In this post, thread and task are used interchangeably. <a href="#fnref4" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn5" class="footnote-item"><p>The Off-CPU analysis is another methodology <a href="https://www.brendangregg.com/offcpuanalysis.html">introduced by Brendan Gregg</a> <a href="#fnref5" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn6" class="footnote-item"><p>See: <a href="https://github.com/brendangregg/DTrace-tools/blob/master/sched/tstates.d">https://github.com/brendangregg/DTrace-tools/blob/master/sched/tstates.d</a> <a href="#fnref6" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn7" class="footnote-item"><p><a href="https://github.com/torvalds/linux/blob/v6.12/include/linux/sched.h#L99">include/linux/sched.h</a> <a href="#fnref7" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn8" class="footnote-item"><p><a href="https://man7.org/linux/man-pages/man2/futex.2.html">futex manpage</a> <a href="#fnref8" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn9" class="footnote-item"><p><a href="https://github.com/torvalds/linux/blob/v6.12/include/uapi/linux/futex.h#L11">include/uapi/linux/futex</a> <a href="#fnref9" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn10" class="footnote-item"><p>These tracepoints have been introduced in 2022 in the kernel. See: <a href="https://lore.kernel.org/all/20220301010412.431299-2-namhyung@kernel.org/">[PATCH 1/4] locking: Add lock contention tracepoints</a> <a href="#fnref10" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn11" class="footnote-item"><p><a href="https://github.com/bpftrace/bpftrace/blob/master/man/adoc/bpftrace.adoc#print_maps_on_exit">print_maps_on_exit - bpftrace</a> <a href="#fnref11" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Notes on CVE assessment]]></title><description><![CDATA[This post collects some notes about the lifecycle of vulnerabilities. It also discusses the challenges I faced during the assessment process: from the need to keep the analysis consistent to the limits of the CVSS base score. ]]></description><link>https://samueleresca.net/notes-on-cve-assessment/</link><guid isPermaLink="false">64aafac8827b68149a698f6e</guid><category><![CDATA[security]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Fri, 22 Mar 2024 18:05:32 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>This post collects some notes about the lifecycle of vulnerabilities. It also discusses the challenges I faced during the assessment process: from the need to keep the analysis consistent to the limits of the CVSS base score. The post interchanges the terms &quot;vulnerability&quot; and &quot;CVE&quot;. A &quot;system&quot; is any solution or service, which is deployed somewhere or consumed by another upstream component.</p>
<h2 id="vulnerabilities-lifecycle">Vulnerabilities lifecycle</h2>
<p>CVE is an acronym for Common Vulnerabilities and Exposures. &quot;A CVE&quot; usually refers to a specific vulnerability affecting a system. For example, if you read &quot;This CVE is affecting X&quot;, it means a vulnerability with a CVE ID is present in component X or its downstream dependencies (e.g. third-party open-source dependencies).</p>
<p>CVE is a system maintained by <a href="https://www.mitre.org/">the MITRE Corporation</a> in collaboration with other organisations.</p>
<p>The lifecycle of a vulnerability usually follows these steps:</p>
<ol>
<li>A person or an organisation (let&apos;s call them a researcher) finds a security flaw in a software product.</li>
<li>The researcher privately discloses the security concern to the owner(s) of the software product (let&apos;s call them the vendor).</li>
<li>The vendor plans a fix by following a pre-defined security policy. It also prepares the disclosure of the vulnerability.</li>
<li>At the start of the disclosure process, the vendor reports the vulnerability to a CVE Program Partner, also known as a CVE Numbering Authority (CNA)<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>. This process assigns a CVE ID to the vulnerability.</li>
<li>The CNA submits the details of the vulnerability linked to the CVE ID.</li>
<li>Finally, the CVE ID is published.</li>
</ol>
<p>The final result usually looks like something similar to this: <a href="https://www.cve.org/CVERecord?id=CVE-2023-27536">CVE-2023-27536 - cve.org</a>. The CVE details summarize the problem briefly and they display the name of the CNA that published the CVE.</p>
<p>Note that the <a href="https://www.cve.org/">cve.org</a> website does not provide any information on the severity of a CVE. The next section, &quot;Scoring process&quot;, explains how severities are assigned.</p>
<h3 id="vulnerabilities-in-open-source">Vulnerabilities in Open-Source</h3>
<p>The open-source security reporting process varies across projects. The owner of an OSS package may be an individual, a community, or an organization. As a result, the open-source ecosystem does not have a unified vulnerability reporting process. Yet, there are some similarities:</p>
<ul>
<li>Open-source projects usually have a <code>SECURITY.md</code> file. It describes the process for reporting security problems for the project.</li>
<li>Maintainers and small organisations owning open-source projects use the <a href="https://www.cve.org/PartnerInformation/ListofPartners/partner/mitre">MITRE Corporation CNA</a> to create CVE IDs.</li>
<li>As a general rule, vulnerabilities or security concerns should never be disclosed in public (e.g. on the project&apos;s GitHub issues page). Usually, the best approach is to follow the process in the <code>SECURITY.md</code> file; alternatively, reach out to the organization or the maintainer in private.</li>
</ul>
<p>Note that more mature OSS organizations usually have more established security teams and processes in place. <sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup></p>
<h3 id="scoring-process">Scoring process</h3>
<p>When a CVE ID is generated, multiple groups and vendors take part in the process of assigning the severity and other details to the vulnerability. The details include:</p>
<ul>
<li>The severity of the vulnerability, usually measured through the CVSS standard.</li>
<li>The proof of exploitation, if any.</li>
<li>All the vulnerable versions of the component.</li>
<li>Any reference to the vulnerability.</li>
<li>Any link to a specific CWE-ID tracked in the <a href="https://cwe.mitre.org/">Common Weakness Enumeration List (CWE)</a> maintained by MITRE.</li>
</ul>
<p>These groups and vendors keep records of the details in their databases. Some of the well-known examples are: the <a href="https://nvd.nist.gov/">National Vulnerability Database (NVD)</a><sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup> which is supported by NIST, <a href="https://github.com/advisories">GitHub Advisories Database</a>, or more vendor specific databases such as the <a href="https://ubuntu.com/security/cves">Ubuntu CVEs report</a>.</p>
<p>CVSS is the common scoring system for measuring the severity of a CVE. The most adopted version is CVSS 3.1. A few months ago, CVSS 4.0 <a href="https://samueleresca.net/changes-and-improvements-in-cvss-4-0/">was released</a>, but it will take time for the field to adopt it. Vulnerability databases map the CVSS severity score to the CVEs. This score is also referred to as the &quot;base score&quot;, because it tracks the intrinsic characteristics of the vulnerability, which are constant over time.</p>
<p>The CVSS score of a CVE is linked to a CVSS vector. A CVSS vector gives a compact view of the characteristics that determine a vulnerability&apos;s severity. The vector string has metric names and values separated by forward slashes. For example, the following CVSS vector:</p>
<pre><code>AV:L/AC:H/PR:L/UI:R
</code></pre>
<p>represents a vulnerability with the following metrics: <code>Attack Vector (AV): Local</code>, <code>Attack Complexity (AC): High</code>, <code>Privileges Required (PR): Low</code> and <code>User Interaction (UI): Required</code>.</p>
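<p>As an illustration, expanding the vector string is mechanical. Below is a minimal, hypothetical sketch covering only the four metrics of the example; the abbreviations and value names follow the CVSS 3.1 specification:</p>

```python
# Metric abbreviations and value names from the CVSS 3.1 specification
# (only the four metrics of the example vector are covered).
METRICS = {"AV": "Attack Vector", "AC": "Attack Complexity",
           "PR": "Privileges Required", "UI": "User Interaction"}
VALUES = {
    "AV": {"N": "Network", "A": "Adjacent", "L": "Local", "P": "Physical"},
    "AC": {"L": "Low", "H": "High"},
    "PR": {"N": "None", "L": "Low", "H": "High"},
    "UI": {"N": "None", "R": "Required"},
}

def expand(vector):
    """Expands a CVSS vector string into readable metric names and values."""
    out = {}
    for part in vector.split("/"):
        metric, value = part.split(":")
        out[METRICS[metric]] = VALUES[metric][value]
    return out

print(expand("AV:L/AC:H/PR:L/UI:R"))
# {'Attack Vector': 'Local', 'Attack Complexity': 'High',
#  'Privileges Required': 'Low', 'User Interaction': 'Required'}
```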
<p>The result at the end of the scoring process is what developers and engineers usually exchange when they refer to a CVE: <a href="https://nvd.nist.gov/vuln/detail/CVE-2023-27536">CVE-2023-27536 - nvd.nist.gov</a>.</p>
<h2 id="challenges-in-vulnerabilities-assessment">Challenges in vulnerabilities assessment</h2>
<p>This section summarizes some of the challenges you might face when assessing vulnerabilities and when defining a vulnerability assessment process.</p>
<h3 id="cves-system-is-not-perfect">The CVE system is not perfect</h3>
<p>CVEs and CVSS scoring are not perfect. Some OSS maintainers have expressed their frustration with reported CVEs that are not really security flaws<sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup><sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup>. This is the main cause of CVE disputes. You should look at why a CVE is disputed: this prevents wasting effort in assessing something that is not a vulnerability.</p>
<h3 id="cvss-base-score-is-not-accurate">CVSS base score is not accurate</h3>
<p>The CVSS base score is the raw score and severity given to a CVE, as found on NVD or other CVE databases. It doesn&apos;t include any environmental or temporal metrics<sup class="footnote-ref"><a href="#fn6" id="fnref6">[6]</a></sup>. The CVSS base score is not enough to measure the true severity of a vulnerability, because it ignores how and where the affected component is deployed or consumed. The new CVSS 4.0 specification clarifies that the base score alone is not an accurate measure of the severity of a vulnerability: it states that the base score (CVSS-B) should be combined with the other metric groups, such as the Threat (CVSS-BT) and Environmental (CVSS-BE) groups.</p>
<p>Populating the Environmental and Temporal metric groups, in addition to the Base score metrics, gives a more precise measure of the vulnerability risk<sup class="footnote-ref"><a href="#fn7" id="fnref7">[7]</a></sup>. This enables organizations to prioritize their mitigation efforts and use resources more efficiently.</p>
<h3 id="keep-analysis-consistent">Keep analysis consistent</h3>
<p>It is important to keep the analysis consistent when evaluating vulnerabilities. This means that similar or identical CVEs should be analyzed in the same way across the different services they impact. Also, the analysis should not depend on who is doing it: the same CVE might be evaluated by different engineers on the project. Inconsistency often indicates the lack of a documented process shared across the team.</p>
<h3 id="vulnerabilities-hot-spots">Vulnerabilities hot spots</h3>
<p>Vulnerability hot spots are the parts and components of the software dependency tree affected by many CVEs. Different factors can lead to hot spots: it could be an old dependency (older dependencies usually accumulate more CVEs), or a base image that installs many packages.</p>
<p>It is important to identify these hot spots. Some parts of a system are also more prone to vulnerabilities; for example, config parsing and I/O subsystems are more sensitive. Sometimes a CVE impacts an entire protocol: for example, <a href="https://nvd.nist.gov/vuln/detail/CVE-2023-44487">CVE-2023-44487</a> affects the HTTP/2 protocol, which implies that all the components handling HTTP/2 will be vulnerable.</p>
<h2 id="analyze-vulnerabilities">Analyze vulnerabilities</h2>
<p>This section contains some tips on how to assess vulnerabilities. It presumes you did everything possible to fix the vulnerabilities in your system (for example, you updated to the latest dependency versions) and that you have a system in place for detecting new vulnerabilities, usually vulnerability-scanning tooling running at the CI level across your repositories.</p>
<h3 id="early-checks">Early checks</h3>
<p>I typically perform some early checks before doing an in-depth analysis. The early checks should be documented and shared between the engineers in the team. There are many reasons for performing early checks:</p>
<ul>
<li>Vulnerability analysis tooling might detect false positives.</li>
<li>Vulnerability data sources are not always accurate.</li>
<li>Vulnerabilities usually affect only a section of a dependency.</li>
</ul>
<p>Below are some questions to ask yourself before analyzing a vulnerability. The goal of the early checks is to save time by excluding the vulnerability from a more in-depth analysis.</p>
<h4 id="1-is-a-false-positive">1. Is it a false positive?</h4>
<p>The tools that find CVEs are not always reliable. They might report false positives<sup class="footnote-ref"><a href="#fn8" id="fnref8">[8]</a></sup>. It is better to rule out this scenario first.</p>
<p>Also, the scan tools can have errors at different levels. Most tools look for packages and binaries within a dependency and compare them with a set of data sources, usually NVD, the GitHub Advisory Database, and others. If the CVE metadata in a database is not correct, the tooling flags a false positive. An example is <code>CVE-2023-39017</code>: it affected the <code>quartz-jobs</code> package but also showed up on the <code>quartz</code> package<sup class="footnote-ref"><a href="#fn9" id="fnref9">[9]</a></sup>.</p>
<h4 id="2-is-the-cve-os-specific">2. Is the CVE OS-specific?</h4>
<p>Some vulnerabilities are effective only when the affected system runs on a certain operating system. If your deployment is not using that OS, then the vulnerability cannot be exploited.</p>
<p>For reference:</p>
<ul>
<li><a href="https://rustsec.org/advisories/RUSTSEC-2023-0001.html">RUSTSEC-2023-0001</a> affecting <code>tokio</code> only when the library is running on Windows.</li>
<li><a href="https://nvd.nist.gov/vuln/detail/CVE-2023-4807">CVE-2023-4807</a> affecting <code>openssl</code> only for Windows 64 platform.</li>
</ul>
<p>It is often not very costly to check if a vulnerability depends on the OS. Doing this early check usually saves time by excluding vulnerabilities affecting an OS not used by your solution.</p>
<h4 id="3-is-the-cve-exploitable-under-any-particular-condition">3. Is the CVE exploitable under any particular condition?</h4>
<p>This point builds on the previous one. A vulnerability may only impact some features of a library. For instance, <a href="https://nvd.nist.gov/vuln/detail/CVE-2023-37276">CVE-2023-37276</a> only affects the HTTP server (i.e. <code>aiohttp.Application</code>) of the <code>aiohttp</code> package. If your system only uses the HTTP client capabilities of the package, it is not actually exposed to the vulnerability.</p>
<h4 id="4-is-the-cve-disputed">4. Is the CVE disputed?</h4>
<p>Vendors or maintainers might dispute vulnerabilities. Disputes can happen for various reasons; a few examples:</p>
<ul>
<li><a href="https://nvd.nist.gov/vuln/detail/CVE-2018-20225">CVE-2018-20225</a> is disputed because the reported behavior is considered a feature of the pip package manager.</li>
<li><a href="https://nvd.nist.gov/vuln/detail/CVE-2023-45322">CVE-2023-45322</a> is not considered critical enough by the vendor to warrant a CVE ID.</li>
<li><a href="https://nvd.nist.gov/vuln/detail/CVE-2023-39017">CVE-2023-39017</a> is another disputed vulnerability. The dispute concerns whether untrusted user input can plausibly reach the code where the injection could occur.</li>
</ul>
<p>A disputed CVE is not always a sign of low risk. However, the person who disputes the CVE is usually the maintainer or owner of the affected component, and they typically provide the background and insights explaining why the CVE is not a security issue. Depending on that, the prioritization of the fix might change.</p>
<h3 id="scoring-analysis">Scoring analysis</h3>
<p>The early checks eliminate the vulnerabilities that are not exploitable or do not affect the product. Then, the scoring analysis assesses how severe the vulnerability is for the system. This can be done by populating the additional metric groups of the CVSS score.<sup class="footnote-ref"><a href="#fn10" id="fnref10">[10]</a></sup></p>
<p>The following are some questions to consider while populating the additional metric groups and assessing the severity.</p>
<h4 id="is-the-affected-system-deployed-behind-network-boundary">Is the affected system deployed behind a network boundary?</h4>
<p>Many vulnerabilities carry the <code>Attack Vector: Network</code> metric. The network attack vector establishes that the attack has to be conducted over the network to be successful. If the affected component sits in a private network or behind a firewall, an attacker would need to be inside the network to exploit the vulnerability. So, if the system is deployed on a private network, the <code>Modified Attack Vector</code> can be overridden to <code>Adj. network</code><sup class="footnote-ref"><a href="#fn11" id="fnref11">[11]</a></sup>. This helps reduce the severity of the vulnerability.</p>
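The effect of such an override can be made concrete with a small sketch of the CVSS 3.1 base formula (scope-unchanged vulnerabilities only; the metric weights come from the FIRST v3.1 specification, and the example vectors are illustrative):

```python
# CVSS 3.1 metric weights (scope unchanged), per the FIRST specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}  # Attack Vector
AC = {"L": 0.77, "H": 0.44}                        # Attack Complexity
PR = {"N": 0.85, "L": 0.62, "H": 0.27}             # Privileges Required
UI = {"N": 0.85, "R": 0.62}                        # User Interaction
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}             # Confidentiality/Integrity/Availability impact

def roundup(x):
    # "Round up to one decimal" as defined in CVSS 3.1, Appendix A.
    i = round(x * 100000)
    return i / 100000 if i % 10000 == 0 else (i // 10000 + 1) / 10

def base_score(av, ac, pr, ui, c, i, a):
    iss = 1 - (1 - CIA[c]) * (1 - CIA[i]) * (1 - CIA[a])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[av] * AC[ac] * PR[pr] * UI[ui]
    if impact <= 0:
        return 0.0
    return roundup(min(impact + exploitability, 10))

# A critical vulnerability reachable over the network...
print(base_score("N", "L", "N", "N", "H", "H", "H"))  # 9.8
# ...re-scored with Modified Attack Vector = Adjacent for a private deployment.
print(base_score("A", "L", "N", "N", "H", "H", "H"))  # 8.8
```

Overriding just the attack vector drops the example score from 9.8 (Critical) to 8.8 (High).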
<h4 id="is-the-affected-system-running-on-containers">Is the affected system running on containers?</h4>
<p>The CVSS metrics might have different implications depending on the infrastructure where the vulnerable component runs, for example when the affected system is deployed in a container orchestration environment (e.g. Kubernetes) <sup class="footnote-ref"><a href="#fn12" id="fnref12">[12]</a></sup>.</p>
<p>A vulnerability with <code>Attack Vector: Local</code> means that the attacker needs local access to the system. In a container orchestration system, that would mean the attacker already has high privileges and can run a local command inside a container. So, the privileges needed to launch an attack increase.</p>
<p>Moreover, a vulnerability that enables privilege escalation on a system deployed in a container orchestration environment would only grant privileges within the container (not the whole host). Therefore, in some cases, the CVSS confidentiality, integrity, and availability metrics can be lowered.</p>
<h4 id="populate-the-temporal-score-metrics">Populate the Temporal Score metrics</h4>
<p>The Temporal score metrics<sup class="footnote-ref"><a href="#fn6" id="fnref6:1">[6:1]</a></sup> measure the current state of exploit techniques. In CVSS 3.1, temporal metrics are defined as:</p>
<ul>
<li><code>Exploit Code Maturity</code> which measures the likelihood of the vulnerability being attacked.</li>
<li><code>Remediation Level</code> which tracks if there is an official fix or a workaround in place.</li>
<li><code>Report Confidence</code> which measures the degree of confidence in the existence of the vulnerability and in the related technical details.</li>
</ul>
<p>During analysis, fill in Temporal Score metrics to reflect the status of the vulnerability at the time of the assessment.</p>
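As a sketch, the temporal adjustment in CVSS 3.1 multiplies the base score by the three temporal metric weights (weights below are from the FIRST v3.1 specification; the example values are illustrative):

```python
def roundup(x):
    # Rounding helper defined in CVSS 3.1, Appendix A.
    i = round(x * 100000)
    return i / 100000 if i % 10000 == 0 else (i // 10000 + 1) / 10

# Temporal metric weights from the CVSS 3.1 specification.
EXPLOIT_CODE_MATURITY = {"X": 1.0, "H": 1.0, "F": 0.97, "P": 0.94, "U": 0.91}
REMEDIATION_LEVEL = {"X": 1.0, "U": 1.0, "W": 0.97, "T": 0.96, "O": 0.95}
REPORT_CONFIDENCE = {"X": 1.0, "C": 1.0, "R": 0.96, "U": 0.92}

def temporal_score(base, e, rl, rc):
    return roundup(base * EXPLOIT_CODE_MATURITY[e]
                        * REMEDIATION_LEVEL[rl]
                        * REPORT_CONFIDENCE[rc])

# A 9.8 base score where only a proof of concept exists (E:P),
# an official fix is available (RL:O), and the report is confirmed (RC:C):
print(temporal_score(9.8, "P", "O", "C"))  # 8.8
```

Filling in the temporal metrics at assessment time lowers the example score from 9.8 to 8.8.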
<h2 id="an-approach-to-analysed-vulnerabilities-tracking">An approach to analysed vulnerabilities tracking</h2>
<p>As discussed, analysis consistency can be challenging, especially when the analysis is performed by multiple engineers. A possible solution is to have a set of assessment questions that the engineer should answer when doing vulnerability analysis. Besides that, it is helpful to record the previously analysed vulnerabilities in a central place. One way to do this is to maintain a file with the known vulnerabilities of the system in the git repository (or repositories). If the system spans multiple repositories, each service will have its own analysis file. The file should contain the following information for each analysed vulnerability:</p>
<ul>
<li><em>CVE-ID (string)</em>: The unique identifier of the vulnerability.</li>
<li><em>Affected component (string)</em>: The component affected by the vulnerability, for example a 3rd-party OSS package or a container image. Note that the version of the affected component should be specified as well (e.g. <code>cryptography:38.0.0</code>, <code>&lt;your_registry&gt;/busybox:1.27</code>).</li>
<li><em>Modified score (numeric)</em>: The modified score derived from the analysis and the modified CVSS vector.</li>
<li><em>Modified CVSS vector (string)</em>: The modified CVSS vector derived from the assessment.</li>
<li><em>Analysis details (text)</em>: A text that briefly explains the reasoning behind the analysis that has been done.</li>
</ul>
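As a sketch, one entry of such a file could look like the following YAML (the file name, field names, version, score, and vector are all illustrative, not a real assessment):

```yaml
# known-vulnerabilities.yaml (all values illustrative)
- cve_id: CVE-2023-37276
  affected_component: aiohttp:3.8.1
  modified_score: 2.3
  modified_cvss_vector: CVSS:3.1/AV:N/AC:L/PR:N/UI:R/S:U/C:L/I:N/A:N/MAV:A
  analysis_details: >-
    Only the aiohttp HTTP server is affected; this service uses the
    package exclusively as an HTTP client, and it is deployed behind
    a firewall on a private network.
```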
<p>The fields above would be repeated for each assessed vulnerability, and they can be stored in a versioned file in any format: JSON, YAML, Markdown. Documenting the examined vulnerabilities and adding them to the repositories has several benefits:</p>
<ul>
<li>You can keep track of all the analyses in one place, and everything is versioned. An engineer can look at previous analyses and use them as a reference or correct them.</li>
<li>If you use git tags or release branches to deploy the service, you will have an analysis file for each release. Therefore, you can check the status of the vulnerabilities at each release point.</li>
<li>You can track the history of the analysis using git history.</li>
<li>It is easy to add automation to the CI pipeline (e.g.: alerting the internal security team in case a new vulnerability is detected but not analyzed yet).</li>
</ul>
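A minimal sketch of such a CI step, assuming the scanner output and the analysis file have already been parsed into lists of CVE IDs (the function name and the data are hypothetical):

```python
def find_unanalysed(scanner_findings, analysed_cves):
    """Return the CVE IDs reported by the scanner that have no recorded analysis."""
    return sorted(set(scanner_findings) - set(analysed_cves))

# Hypothetical data: IDs reported by the scanner vs. IDs in the analysis file.
findings = ["CVE-2023-37276", "CVE-2023-4807", "CVE-2023-45322"]
analysed = ["CVE-2023-37276", "CVE-2023-45322"]

missing = find_unanalysed(findings, analysed)
if missing:
    # In a real pipeline this could fail the build or alert the security team.
    print("Unanalysed vulnerabilities: " + ", ".join(missing))
```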
<h2 id="vulnerabilities-in-container-images">Vulnerabilities in container images</h2>
<p>Today, a lot of production software runs on container orchestration systems (e.g. Kubernetes). Orchestration systems come with Cloud-native patterns, e.g. sidecar containers and container injection. These patterns can be useful and efficient in some cases, but they also force you to depend on a high number of container images<sup class="footnote-ref"><a href="#fn13" id="fnref13">[13]</a></sup>. This can pose some problems: a container image might ship a lot of packages and dependencies, which are vectors for vulnerabilities.</p>
<p>Minimal container images come in handy here. Make sure you have only the minimum dependencies needed to run your app, and that a <code>nonroot</code> user is correctly pre-configured. Scratch or distroless images are good for these cases. Canonical recently presented an interesting set of tooling inspired by distroless images, called &quot;chisel&quot;<sup class="footnote-ref"><a href="#fn14" id="fnref14">[14]</a></sup>. Chisel produces &quot;chiselled images&quot;: it lets you create images with a subset of Debian packages. It is based on the idea that a package (let&apos;s call it <code>package A</code>) which depends on another package (<code>package B</code>) usually only uses some of <code>package B</code>&apos;s files. So, we can chisel <code>package B</code> to cut the unnecessary content, which reduces the surface for vulnerabilities.</p>
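As an illustration, a minimal multi-stage build keeps the toolchain out of the final image and ships a pre-configured <code>nonroot</code> user (the base images, paths, and the Go example are assumptions, not a prescription):

```dockerfile
# Build stage: compile the application with the full toolchain.
FROM golang:1.21 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /app .

# Final stage: a distroless base image with a pre-configured nonroot user.
# Only the static binary is copied over; no shell, no package manager.
FROM gcr.io/distroless/static-debian12:nonroot
COPY --from=build /app /app
USER nonroot
ENTRYPOINT ["/app"]
```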
<p>Finally, before adding a container image as a dependency of your system, check if the image follows best practices.</p>
<ul>
<li>Is there a <code>nonroot</code> user set up in the image?</li>
<li>Does the image come from a reputable vendor?</li>
<li>Is the image creation process open-sourced?</li>
<li>How often is a new version of the image released?</li>
<li>In case a CVE impacts the image, how does the vendor release a fixed version of the image?</li>
<li>In case a CVE impacts the image, does the vendor fix all the major versions of the image or only the most recent one?</li>
<li>If you are running in a containerized environment, are you using minimal container images?</li>
</ul>
<p>The considerations above might help you reduce the number of vulnerabilities to assess in your container-based system.</p>
<h2 id="prioritize-mitigation-with-epss">Prioritize mitigation with EPSS</h2>
<p>first.org is introducing a new scoring system called EPSS. The definition is:</p>
<blockquote>
<p>The Exploit Prediction Scoring System (EPSS) is a data-driven effort for estimating the likelihood (probability) that a software vulnerability will be exploited in the wild.</p>
</blockquote>
<p>A detailed explanation of EPSS can be found in the paper: <a href="https://arxiv.org/pdf/2302.14172.pdf">Enhancing Vulnerability Prioritization: Data-Driven Exploit Predictions with Community-Driven Insights</a>.</p>
<p>EPSS doesn&apos;t measure the risk of a vulnerability, but the probability that the vulnerability will be exploited. So, by combining the EPSS and CVSS scores, we can see both the chance of a vulnerability being exploited and the impact of the exploit. This can help organizations prioritize their vulnerability management and allocate resources better.</p>
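As a sketch, a simple prioritization could rank findings by combining the two scores (the data and the ranking rule are illustrative; real programs often use EPSS/CVSS thresholds rather than a product):

```python
# Hypothetical findings: (placeholder ID, CVSS base score, EPSS probability).
findings = [
    ("CVE-A", 9.8, 0.02),  # critical impact, rarely exploited
    ("CVE-B", 7.5, 0.90),  # high impact, very likely to be exploited
    ("CVE-C", 5.3, 0.01),  # medium impact, rarely exploited
]

def priority(finding):
    _, cvss, epss = finding
    # Illustrative rule: expected severity = impact x likelihood of exploitation.
    return cvss * epss

for cve, cvss, epss in sorted(findings, key=priority, reverse=True):
    print(f"{cve}: cvss={cvss} epss={epss} priority={cvss * epss:.2f}")
```

Under this rule, the highly exploitable high-severity finding (CVE-B) outranks the rarely exploited critical one (CVE-A).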
<h2 id="final-thoughts">Final thoughts</h2>
<p>This post went through the lifecycle of vulnerabilities and the challenges in assessing them. A few takeaways:</p>
<ul>
<li>There is no doubt that vulnerabilities must be fixed. Try your best to keep all the 3rd-party components consumed by your software up to date.</li>
<li>The ecosystem surrounding CVEs and CVSS works, but is not perfect. Don&apos;t blindly accept the severity of a CVE as accurate.</li>
<li>The CVSS Base score does not reflect the true impact of a vulnerability on your system. To get an accurate severity, you need to assess the CVE in relation to your system and populate the additional CVSS metrics.</li>
<li>Establish a method for identifying, evaluating, and tracking CVEs.</li>
<li>Do whatever you can to keep your container images as slim as possible.</li>
</ul>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p>A partner is a trusted source participating in the CVE program. It could be a company or an organization. A partner could also be a CVE Numbering Authority (CNA) that assigns new CVE IDs. <a href="https://www.cve.org/PartnerInformation/ListofPartners">The list of partners</a> is available on cve.org. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p>For example, the Apache Software Foundation (ASF) has an established <a href="https://www.apache.org/security/">security policy</a>. That said, new vulnerabilities are usually reported privately to the project&apos;s security mailing list. <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn3" class="footnote-item"><p>At the time of writing, NVD is delaying the analysis of vulnerabilities and addressing challenges in the NVD program. More details available at <a href="https://resilientcyber.substack.com/p/death-knell-of-the-nvd">Death Knell of the NVD? - Resilient Cyber</a> <a href="#fnref3" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn4" class="footnote-item"><p><a href="https://daniel.haxx.se/blog/2023/08/26/cve-2020-19909-is-everything-that-is-wrong-with-cves/">CVE-2020-19909 is everything that is wrong with CVEs | daniel.haxx.se</a> <a href="#fnref4" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn5" class="footnote-item"><p><a href="https://www.postgresql.org/about/news/cve-2020-21469-is-not-a-security-vulnerability-2701/">PostgreSQL: CVE-2020-21469 is not a security vulnerability</a> <a href="#fnref5" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn6" class="footnote-item"><p>Starting from CVSS 4.0 the &quot;Temporal score metrics&quot; have been renamed to &quot;Threat metrics&quot;. <a href="#fnref6" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref6:1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn7" class="footnote-item"><p><a href="https://www.first.org/cvss/v4-0/cvss-v40-presentation.pdf">Announcing CVSS v4.0 - first.org</a> <a href="#fnref7" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn8" class="footnote-item"><p>Vulnerability detection tools can report false positives/false negatives. Here are two lists: <a href="https://github.com/aquasecurity/trivy/discussions/categories/false-detection">Trivy false positive reports</a> and <a href="https://github.com/anchore/grype/issues?q=label%3Afalse-positive">Grype false positive reports</a> <a href="#fnref8" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn9" class="footnote-item"><p>See the following GitHub issue <a href="https://github.com/quartz-scheduler/quartz/issues/943">quartz-scheduler - #943</a>. The discussion in the issue highlights multiple problems related to the CVE, from the wrong package being associated with the vulnerability to the &quot;indefensible&quot; CVSS 3.1 base score. <a href="#fnref9" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn10" class="footnote-item"><p>There are multiple CVSS tools online for populating the additional metric groups. A few examples: <a href="https://nvd.nist.gov/vuln-metrics/cvss/v3-calculator">NVD - CVSS v3 Calculator</a>, <a href="https://www.first.org/cvss/calculator/3.0">Common Vulnerability Scoring System Version 3.0 Calculator</a>. You can copy and paste the CVSS vector of the vulnerability and start populating the metrics. <a href="#fnref10" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn11" class="footnote-item"><p>See &quot;Table 1: Attack Vector&quot; for a description of the &quot;Adjacent&quot; metric in <a href="https://www.first.org/cvss/v3.1/specification-document#2-1-1-Attack-Vector-AV">CVSS v3.1: Specification Document</a> <a href="#fnref11" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn12" class="footnote-item"><p><a href="https://www.redhat.com/en/blog/containers-vulnerability-risk-assessment">Containers vulnerability risk assessment - Red Hat Blog</a> discusses how the impact of vulnerabilities on containerized environments can differ from traditional operating systems by providing some concrete examples. <a href="#fnref12" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn13" class="footnote-item"><p>Container images are usually fetched from, or built by, 3rd-party providers. Some examples off the top of my head are the Istio or Ingress NGINX controller images. <a href="#fnref13" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn14" class="footnote-item"><p><a href="https://github.com/canonical/chisel">Github - canonical/chisel</a> <a href="#fnref14" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Changes and improvements in CVSS 4.0]]></title><description><![CDATA[The CVSS Special Interest Group (SIG) recently released the new 4.0 version of CVSS. This post outlines the changes and the improvements in CVSS 4.0. These notes originate from the CVSS 4.0 public preview presentations.]]></description><link>https://samueleresca.net/changes-and-improvements-in-cvss-4-0/</link><guid isPermaLink="false">64afe15a827b68149a699115</guid><category><![CDATA[security]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Thu, 14 Mar 2024 19:57:05 GMT</pubDate><content:encoded><![CDATA[<!--kg-card-begin: markdown--><p>The Common Vulnerability Scoring System (CVSS) is a system for capturing the properties and severity of software vulnerabilities. The <a href="https://first.org">Forum of Incident Response and Security Teams (FIRST)</a> and the CVSS Special Interest Group (SIG) recently released the 4.0 version of CVSS<sup class="footnote-ref"><a href="#fn1" id="fnref1">[1]</a></sup>.</p>
<p>This post outlines the improvements introduced in CVSS 4.0 and what&apos;s changed compared with CVSS 3.1. The post originates from the CVSS 4.0 public preview presentations<sup class="footnote-ref"><a href="#fn2" id="fnref2">[2]</a></sup><sup class="footnote-ref"><a href="#fn3" id="fnref3">[3]</a></sup>.</p>
<h2 id="base-score-alone-is-not-accurate">Base score alone is not accurate</h2>
<p>The CVSS 3.1 Base score is used as a primary input for assessing CVEs. The Base score definition captures all the vulnerability&apos;s characteristics that don&apos;t change over time or across user environments. Most of the CVEs published in the vendor databases come with the base score already set. Downstream consumers often rely only on the base score for vulnerability assessment and prioritization.</p>
<p>CVSS 4.0 emphasizes that the Base score is only one part of the CVSS scoring, introducing the following terms:</p>
<ul>
<li>CVSS-B refers to the CVSS base score alone.</li>
<li>CVSS-BT refers to a CVSS score that provides both the CVSS Base and Threat metrics.</li>
<li>CVSS-BE refers to a CVSS score that provides both CVSS Base and Environmental metrics.</li>
<li>CVSS-BTE refers to a CVSS score that provides CVSS Base, Threat and Environmental metrics.</li>
</ul>
<p>CVSS 4.0 states that using more metrics makes the vulnerability assessment more accurate. <sup class="footnote-ref"><a href="#fn2" id="fnref2:1">[2:1]</a></sup> Moreover, CVSS-B should not be used alone to set the remediation priority of a vulnerability.</p>
<h2 id="temporal-metrics-werent-impacting-the-final-score">Temporal metrics weren&apos;t impacting the final score</h2>
<p>In CVSS 3.1, the <em>Temporal Score metrics</em> had a low impact on the overall score. The <em>Temporal Score metrics</em> are <code>Exploit Code Maturity (E)</code>, <code>Remediation Level (RL)</code>, and <code>Report Confidence (RC)</code>. During the assessment, the <code>Remediation Level (RL)</code> is frequently set as <code>Official fix (O)</code> and the <code>Report Confidence (RC)</code> is frequently set as <code>Confirmed (C)</code>.</p>
<p>CVSS 4.0 introduces the following changes:</p>
<ul>
<li>Rename the <em>Temporal Score metrics</em> to <em>Threat Metrics</em>.</li>
<li>Rename the <code>Exploit Code Maturity</code> to <code>Exploit Maturity</code>.</li>
<li>The <code>Exploit Maturity</code> is now enumerated with <code>Attacked (A)</code>, <code>POC (P)</code>, <code>Unreported (U)</code>.</li>
<li>Retire the <code>Remediation Level (RL)</code> and <code>Report Confidence (RC)</code> metrics.</li>
</ul>
<p>On top of the changes above, the <em>Threat Metrics</em> now have a higher impact on the calculation of the final CVSS-BTE score.</p>
<h2 id="user-interaction-metric-wasnt-granular-enough">User Interaction metric wasn&apos;t granular enough</h2>
<p>The <code>User Interaction (UI)</code> metric in the CVSS 3.1 specification has two possible values: <code>None (N)</code> and <code>Required (R)</code>. These values do not distinguish between voluntary and involuntary interaction.</p>
<p>CVSS 4.0 introduces new metrics values for the <code>User Interaction (UI)</code> metric:</p>
<ul>
<li><code>None (N)</code> : the vulnerable system can be exploited without intervention.</li>
<li><code>Passive (P)</code> : a user needs to interact with the vulnerable system. These interactions would be involuntary.</li>
<li><code>Active (A)</code>: a user must interact with the vulnerable component and the attacker&#x2019;s payload. Or, the user&apos;s actions would subvert protection mechanisms and lead to exploitation.</li>
</ul>
<p>The changes above provide a better representation of the type of interaction needed for exploiting the system.</p>
<h2 id="scope-s-metric-was-ambiguous">Scope (S) metric was ambiguous</h2>
<p>CVSS 3.1 defined a <code>Scope (S)</code> metric. The metric captured the idea of a vulnerable system that, when compromised, could also impact resources beyond its own security scope. The metric was applied inconsistently across vendors.</p>
<p>CVSS 4.0 introduces the concept of a <em>vulnerable system</em>, which refers to the actual system that is vulnerable, and of <em>subsequent systems</em>, referring to the downstream systems that are impacted by the exploitation of the vulnerable system.</p>
<p>The new CVSS 4.0 <code>Impact Metrics</code> are split into two groups, representing the vulnerable and the subsequent system. Each group contains the familiar <code>Confidentiality (C)</code>, <code>Integrity (I)</code>, <code>Availability (A)</code> metrics.</p>
<p>This approach removes the need for the <code>Scope (S)</code> metric.</p>
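For illustration, a CVSS 4.0 base vector spells out the two impact groups explicitly. In the (hypothetical) vector below, <code>VC/VI/VA</code> describe the impact on the vulnerable system and <code>SC/SI/SA</code> the impact on subsequent systems:

```
CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:H/VA:H/SC:N/SI:N/SA:N
```

Here the vulnerable system is fully compromised (all <code>H</code>) while subsequent systems are untouched (all <code>N</code>), a distinction the old <code>Scope (S)</code> flag could not express.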
<h2 id="new-supplemental-metrics">New supplemental metrics</h2>
<p>CVSS 4.0 introduces a set of <code>Supplemental Metrics</code>. They give additional information on a vulnerability. Note that <code>Supplemental Metrics</code> do not affect the CVSS score; they store extra information to support organizations&apos; assessments. It is up to each organization to use or ignore this group of metrics during the assessment.</p>
<p>Below is a summary of the supplemental metrics added in CVSS 4.0.</p>
<h3 id="safety-s">Safety (S)</h3>
<p><code>Safety (S)</code> is intended for systems whose fitness of purpose is aligned to safety. Exploiting the vulnerability might have a safety impact on an individual or a system.</p>
<p>The <code>Safety (S)</code> metric can assume the following values:</p>
<ul>
<li><code>Not Defined (X)</code> indicates that the metric is not defined.</li>
<li><code>Negligible (N)</code> means the vulnerability&apos;s consequences meet the IEC 61508 definition of &quot;Negligible&quot;.</li>
<li><code>Present (P)</code> means that the vulnerability&apos;s consequences meet the IEC 61508 consequence definitions. These include &quot;Marginal&quot;, &quot;Critical&quot;, or &quot;Catastrophic&quot;.</li>
</ul>
<h3 id="automatable-au">Automatable (AU)</h3>
<p><code>Automatable (AU)</code> metric answers the following question &quot;Can the attackers automate the exploitation across multiple targets?&quot;. The <code>Automatable (AU)</code> metric can have these self-explanatory values: <code>Not Defined (X)</code>, <code>No (N)</code>, <code>Yes (Y)</code>.</p>
<h3 id="recovery-r">Recovery (R)</h3>
<p>The <code>Recovery (R)</code> metric describes how well a system can recover from exploitation. Recovery should be considered from an availability and performance point of view.</p>
<p>The <code>Recovery (R)</code> metric can assume the following values:</p>
<ul>
<li><code>Not Defined (X)</code>: the metric is not defined.</li>
<li><code>Automatic (A)</code>: the component recovers automatically after an attack.</li>
<li><code>User (U)</code>: The component requires manual intervention by the user to recover after an attack.</li>
<li><code>Irrecoverable (I)</code>: the component is irrecoverable by the user, after an attack.</li>
</ul>
<h3 id="value-density-v">Value Density (V)</h3>
<p>The <code>Value Density (V)</code> metric represents the amount of resources an attacker will gain control of with a single exploitation event.</p>
<p>The <code>Value Density(V)</code> metric can assume the following values:</p>
<ul>
<li><code>Diffuse (D)</code>: the system that contains the vulnerable components has limited resources.</li>
<li><code>Concentrated (C)</code>: the system that contains the vulnerable component is rich in resources.</li>
</ul>
<h3 id="vulnerability-response-effort-re">Vulnerability Response Effort (RE)</h3>
<p>The <code>Vulnerability Response Effort (RE)</code> measures how difficult it is for consumers to provide an initial response to the impact of vulnerabilities for deployed services.</p>
<p>The <code>Vulnerability Response Effort (RE)</code> has the following values:</p>
<ul>
<li><code>Low (L)</code>: the effort required to respond to the vulnerability is low. Examples provided include configuration workarounds.</li>
<li><code>Medium (M)</code>: responding to the vulnerability will need some effort. It could cause minimal service impact. Some examples provided by the CVSS 4.0 specifications include: simple remote updates or a low-touch software upgrade.</li>
<li><code>High (H)</code>: the actions required to respond to a vulnerability are significant and/or difficult. Some examples provided by the CVSS 4.0 specifications include: a highly privileged driver update or updates that require careful analysis.</li>
</ul>
<h3 id="provider-urgency-u">Provider Urgency (U)</h3>
<p>The <code>Provider Urgency (U)</code> metric is meant to capture the provider&apos;s assessment of how urgently the vulnerability should be fixed. It can assume the following values: <code>Not Defined (X)</code>, <code>Red</code>, <code>Amber</code>, <code>Green</code>, <code>Clear</code>, where <code>Red</code> implies the highest urgency and <code>Clear</code> the lowest (no urgency).</p>
<p>CVSS 4.0 also specifies that any provider along the supply chain may set a <code>Provider Urgency</code> rating. For example:</p>
<pre><code>Library Maintainer Urgency -&gt; OS/Distro Maintainer Urgency -&gt; Provider 1 .. 
Provider N  Urgency -&gt; Consumer
</code></pre>
<p>From a <code>Consumer</code> point of view, the <code>Provider N</code> urgency is the most accurate.</p>
<h2 id="new-epss-scoring-system">New EPSS scoring system</h2>
<p>On top of the CVSS 4.0 release, <a href="https://first.org">FIRST</a> recently introduced a new scoring system called EPSS<sup class="footnote-ref"><a href="#fn4" id="fnref4">[4]</a></sup>. The system relies on a model trained on a list of data sources (see <em>Table 1. Description of data sources used in EPSS</em><sup class="footnote-ref"><a href="#fn5" id="fnref5">[5]</a></sup>) to estimate the probability that a vulnerability will be exploited within 30 days following the prediction.</p>
<h2 id="recap">Recap</h2>
<ul>
<li>The Common Vulnerability Scoring System (CVSS) is a system for capturing the properties and severity of software vulnerabilities.</li>
<li>CVSS 4.0 introduces new terms and emphasizes that the Base score is only one part of the CVSS scoring.</li>
<li>The <em>Temporal Score metrics</em> have been renamed to <em>Threat Metrics</em> and have a higher impact on the final score.</li>
<li>The <code>User Interaction (UI)</code> metric has been updated to provide a better representation of the type of interaction needed for exploitation.</li>
<li>The <code>Scope (S)</code> metric has been removed and replaced with the concept of vulnerable and subsequent systems.</li>
<li>CVSS 4.0 introduces a set of Supplemental Metrics that do not affect the CVSS score but provide additional information on a vulnerability.</li>
<li>FIRST also introduced a new scoring system called EPSS, which estimates the probability that a vulnerability would be exploited within 30 days following the prediction.</li>
</ul>
<hr class="footnotes-sep">
<section class="footnotes">
<ol class="footnotes-list">
<li id="fn1" class="footnote-item"><p><a href="https://www.first.org/cvss/v4-0/">CVSS 4.0 specifications</a>. At the time of writing, most vendors have not adopted CVSS 4.0 yet. However, CVSS 4.0 launched in late 2023, and it will take time before it is widely used by vulnerability databases and integrated with vulnerability detection tools. <a href="#fnref1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn2" class="footnote-item"><p><a href="https://csrc.nist.gov/csrc/media/Presentations/2023/update-on-cvss-4-0/jan-25-2023-ssca-dugal-rich.pdf">CVSS v4.0 NIST presentation</a> <a href="#fnref2" class="footnote-backref">&#x21A9;&#xFE0E;</a> <a href="#fnref2:1" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn3" class="footnote-item"><p><a href="https://www.first.org/cvss/v4-0/cvss-v40-presentation.pdf">Announcing CVSS v4.0</a> <a href="#fnref3" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn4" class="footnote-item"><p><a href="https://www.first.org/epss/">Exploit Prediction Scoring System</a> <a href="#fnref4" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
<li id="fn5" class="footnote-item"><p><a href="https://arxiv.org/abs/2302.14172">Jacobs, Jay, Sasha Romanosky, Octavian Suciu, Ben Edwards, and Armin Sarabi. &quot;Enhancing Vulnerability prioritization: Data-driven exploit predictions with community-driven insights.</a> <a href="#fnref5" class="footnote-backref">&#x21A9;&#xFE0E;</a></p>
</li>
</ol>
</section>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Memory management optimization techniques]]></title><description><![CDATA[This article builds upon the previous one from this series and delves into the optimizations outlined in the "What a programmer can do" section of the "What every programmer should know about memory" paper by Ulrich Drepper. The code example is implemented in Rust]]></description><link>https://samueleresca.net/memory-management-optimizations-techniques/</link><guid isPermaLink="false">641581d7827b68149a698349</guid><category><![CDATA[performance]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Sat, 20 May 2023 13:42:27 GMT</pubDate><media:content url="https://samueleresca.net/content/images/2023/05/boccadasse-2.png" medium="image"/><content:encoded><![CDATA[<img src="https://samueleresca.net/content/images/2023/05/boccadasse-2.png" alt="Memory management optimization techniques"><p>This article builds upon the <a href="https://samueleresca.net/analysis-of-what-every-programmer-should-know-about-memory">previous one from this series</a> and delves into the optimizations outlined in the &quot;What a programmer can do&quot; section of the &quot;What every programmer should know about memory&quot; paper by Ulrich Drepper. The code examples are implemented in Rust, and the execution is analyzed using perf. </p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">The code in this post is available at <a href="https://github.com/samueleresca/what-every-programmer-should-know-about-memory">samueleresca/what-every-programmer-should-know-about-memory</a>.</div></div><p></p><!--kg-card-begin: markdown--><h2 id="understanding-perf">Understanding perf</h2>
<p>This post uses <code>perf</code> to explore how the code is executed on the CPU. <code>perf</code> is the official Linux profiler and it is included in the kernel source code, see <a href="https://perf.wiki.kernel.org/index.php/Tutorial">perf Tutorial</a>.</p>
<p>Let&apos;s take a moment to go through some of the main <code>perf</code> events that will be used in this post:</p>
<ul>
<li><code>instructions</code>: tells us how many instructions our code issued.</li>
<li><code>cycles</code> tells us how many CPU cycles our code took. The number of cycles is usually lower than the number of instructions because the CPU can issue many instructions per cycle. The ratio of <code>instructions</code> to <code>cycles</code> is called the <code>IPC</code> (Instructions Per Cycle). The <code>IPC</code> is a good metric for understanding how efficiently the CPU is running your code: the higher the <code>IPC</code>, the better. It is also an indicator of how well the code is vectorized.</li>
<li><code>cache-references</code> represents the number of times the CPU accessed the cache. If the data is not already in the cache, the CPU has to access the main memory.</li>
<li><code>cache-misses</code> represents the number of times the CPU accessed the main memory. As mentioned in the previous post of this series, this is a very expensive operation, and we want to minimize it.</li>
<li><code>task-clock</code> refers to how much CPU time our task took, usually reported in milliseconds. It differs from the total runtime because, if the program uses many CPUs or threads, the <code>task-clock</code> will be the sum of the time spent on each CPU or thread.</li>
<li><code>cs</code> represents the context switches happening during the code execution.</li>
<li><code>page-faults</code> represents the number of page faults happening during the code execution. Usually, there are dedicated events to distinguish between <code>minor-page-faults</code> and <code>major-page-faults</code>.</li>
</ul>
<p>Note that, depending on the CPU model and manufacturer, the events available in <code>perf</code> differ. In some cases, dedicated cache-layer events, such as L1 cache misses / references, are available. In other cases, you will have to use the generic <code>cache-misses</code> / <code>cache-references</code> events.</p>
<h2 id="bypassing-cache">Bypassing cache</h2>
<p>One of the ways of influencing memory proposed by the paper is to bypass CPU caches when data is initialized and not reused immediately. This approach avoids pushing out data that is already cached in favour of data that will not be read soon.</p>
<p>One of the intrinsics provided by processors is for <em>non-temporal write</em> operations. A <em>non-temporal write</em> flags the data as non-temporal so that it is not stored in the CPU cache.</p>
<p>Below is the implementation of a standard vs non-temporal matrix initialization:</p>
<script src="https://gist.github.com/samueleresca/6fb753f9b90f0d3e32af3e687ec19953.js"></script>
<p>The <code>standard_initialize</code> function initializes a matrix in a standard way. The <code>nocache_initialize</code> function takes the <em>non-temporal</em> approach by using the <code>_mm_stream_si32</code> intrinsic function.</p>
<p>The <code>_mm_stream_si32</code> function is defined in the <a href="https://doc.rust-lang.org/beta/core/arch/x86_64/fn._mm_stream_si32.html">std::arch::x86_64::_mm_stream_si32</a> backed by the corresponding <a href="https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_stream_si32&amp;ig_expand=7207">Intel intrinsic</a>.</p>
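<p>As a condensed, hedged sketch of the same idea (the measured implementation lives in the gist above, so the details here may differ), a standard and a <em>non-temporal</em> initialization can be compared as follows; the size <code>N</code> is an arbitrary choice:</p>

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::_mm_stream_si32;

const N: usize = 128; // arbitrary matrix side, small for illustration

// Standard initialization: every store goes through the cache hierarchy,
// potentially evicting data that is still useful.
fn standard_initialize(m: &mut [i32]) {
    for (i, cell) in m.iter_mut().enumerate() {
        *cell = i as i32;
    }
}

// Non-temporal initialization: _mm_stream_si32 hints that the data will not
// be reused soon, so the store bypasses the caches.
#[cfg(target_arch = "x86_64")]
fn nocache_initialize(m: &mut [i32]) {
    for i in 0..m.len() {
        // SAFETY: the pointer is in-bounds for the slice.
        unsafe { _mm_stream_si32(m.as_mut_ptr().add(i), i as i32) };
    }
}

fn main() {
    let mut standard = vec![0i32; N * N];
    standard_initialize(&mut standard);

    #[cfg(target_arch = "x86_64")]
    {
        let mut streamed = vec![0i32; N * N];
        nocache_initialize(&mut streamed);
        // Both produce the same data; only the cache behaviour differs.
        assert_eq!(standard, streamed);
    }
}
```

On non-x86_64 targets the streaming variant is compiled out, since the intrinsic is x86-specific.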
<p>The following analysis uses the <code>perf stat</code> command to understand the impact of non-temporal writes on the process execution. Note that the code is built in optimized mode using the <code>--release</code> flag.</p>
<table>
<thead>
<tr>
<th>Event Name</th>
<th><em>standard_initialize</em></th>
<th><em>nocache_initialize</em></th>
</tr>
</thead>
<tbody>
<tr>
<td><code>l1d.replacement</code></td>
<td>1,052,585</td>
<td>532,639</td>
</tr>
<tr>
<td><code>l2_lines_in.all</code></td>
<td>1,497,033</td>
<td>91,486</td>
</tr>
</tbody>
</table>
<p>The table contains the event count comparison between standard and <em>non-temporal</em> writes. The <code>l1d.replacement</code> event counts the cache lines replaced in the L1 data cache. The <code>l2_lines_in.all</code> event counts the cache lines filling the L2 cache.</p>
<p>Note that the numbers in the tables are the average event counts for 10 executions. A single execution is monitored using the <code>perf stat</code> command.<br>
The events are part of the <a href="https://lore.kernel.org/lkml/tip-b115df076d337a727017538d11d7d46f5bcbff15@git.kernel.org/">Intel Ice Lake V1.00 events definitions</a>.</p>
<p>The analysis above shows how the initialization using the <em>non-temporal</em> approach roughly halves the L1d cache line replacements. On top of that, the number of L2 cache line fills is ~16x lower than with a standard initialization.</p>
<h2 id="cache-optimization-with-sequential-access-and-loop-tiling">Cache optimization with sequential access and loop tiling</h2>
<p>Sequential access is crucial for improving L1d cache performance because the processor prefetches data.</p>
<p>The paper takes as an example a matrix multiplication described as:</p>
<p>$$c_{i,j} \space = \space a_{i,1} \space b_{1,j} + \space a_{i,2} \space b_{2,j} + \space \cdots + \space a_{i,n} \space b_{n,j}$$</p>
<p>Matrix multiplication can be performed as follows:</p>
<script src="https://gist.github.com/samueleresca/f6a3e5e4b0041b397038ac3652f05d13.js"></script>
<p>The matrix multiplication above is an example of an operation that can be optimized by taking advantage of sequential access. The <code>m1</code> is accessed sequentially (by accessing the matrix per row), while the <code>m2</code> is not accessed sequentially (the implementation iterates on the columns first).</p>
<p>It is possible to transpose the <code>m2</code> matrix to access it sequentially. This is done in the <code>optimized</code> function below:</p>
<script src="https://gist.github.com/samueleresca/d8af00ccb0bc0b6bee01098219b0961a.js"></script>
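<p>As an inline, simplified sketch of the two access patterns (the measured versions are in the gists above; the flat row-major layout and the size <code>N</code> are assumptions of this example):</p>

```rust
const N: usize = 64; // arbitrary size for illustration

// Naive multiplication: m2 is walked column by column, i.e. non-sequentially.
fn non_optimized(m1: &[f64], m2: &[f64]) -> Vec<f64> {
    let mut res = vec![0.0; N * N];
    for i in 0..N {
        for j in 0..N {
            for k in 0..N {
                res[i * N + j] += m1[i * N + k] * m2[k * N + j];
            }
        }
    }
    res
}

// Transposing m2 first makes the inner loop walk both operands sequentially.
fn optimized(m1: &[f64], m2: &[f64]) -> Vec<f64> {
    let mut t = vec![0.0; N * N];
    for i in 0..N {
        for j in 0..N {
            t[j * N + i] = m2[i * N + j]; // transpose
        }
    }
    let mut res = vec![0.0; N * N];
    for i in 0..N {
        for j in 0..N {
            for k in 0..N {
                // Both m1 and t are now read along their rows.
                res[i * N + j] += m1[i * N + k] * t[j * N + k];
            }
        }
    }
    res
}

fn main() {
    let m1: Vec<f64> = (0..N * N).map(|x| x as f64).collect();
    let m2: Vec<f64> = (0..N * N).map(|x| (x % 7) as f64).collect();
    // Same result, different memory access pattern.
    assert_eq!(non_optimized(&m1, &m2), optimized(&m1, &m2));
}
```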
<p>Transposing <code>m2</code> before multiplying results in better performance. The table below shows the <code>cycles</code> and <code>instructions</code> perf events resulting from the execution of the <code>non_optimized</code> and <code>optimized</code> functions (again, the code has been compiled in release mode).</p>
<table>
<thead>
<tr>
<th>Event</th>
<th><em>non optimized</em></th>
<th><em>optimized</em></th>
</tr>
</thead>
<tbody>
<tr>
<td><code>cycles</code> (in billion)</td>
<td>13.66</td>
<td><strong>9.47</strong></td>
</tr>
<tr>
<td><code>instructions</code>  (in billion)</td>
<td>26.41</td>
<td><strong>24.97</strong></td>
</tr>
<tr>
<td><code>ins. per cycle</code></td>
<td>1.94</td>
<td><strong>2.64</strong></td>
</tr>
</tbody>
</table>
<p>The <code>ins. per cycle</code> is a metric that shows how many instructions are executed per cycle. The higher the value, the better the performance. The <code>optimized</code> function has higher <code>ins. per cycle</code> value because it is taking advantage of the sequential access for both the <code>m1</code> and <code>m2</code> matrices.</p>
<p>Another way to improve matrix multiplication speed is to increase the usage of L1d cache by implementing techniques such as <em>loop tiling</em>. The following code demonstrates how loop tiling is applied in matrix multiplication:</p>
<script src="https://gist.github.com/samueleresca/899ce8463a45ade58f5d7993a39f338c.js"></script>
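<p>A hedged inline sketch of the tiling idea (the measured implementation is in the gist above; the function name <code>multiply_tiled</code>, the tile size, and the pre-transposed second operand are assumptions of this sketch):</p>

```rust
const N: usize = 64;    // arbitrary matrix side
const TILE: usize = 16; // arbitrary block size; in practice sized to fit L1d

// Loop tiling: operate on TILE x TILE blocks so the working set of the inner
// loops stays resident in the L1d cache. m2_t is assumed already transposed.
fn multiply_tiled(m1: &[f64], m2_t: &[f64]) -> Vec<f64> {
    let mut res = vec![0.0; N * N];
    for ii in (0..N).step_by(TILE) {
        for jj in (0..N).step_by(TILE) {
            for kk in (0..N).step_by(TILE) {
                for i in ii..ii + TILE {
                    for j in jj..jj + TILE {
                        let mut acc = 0.0;
                        for k in kk..kk + TILE {
                            acc += m1[i * N + k] * m2_t[j * N + k];
                        }
                        res[i * N + j] += acc;
                    }
                }
            }
        }
    }
    res
}

fn main() {
    let m1: Vec<f64> = (0..N * N).map(|x| (x % 5) as f64).collect();
    let m2_t: Vec<f64> = (0..N * N).map(|x| (x % 3) as f64).collect();
    let tiled = multiply_tiled(&m1, &m2_t);

    // Reference: plain triple loop over the transposed operand.
    let mut reference = vec![0.0; N * N];
    for i in 0..N {
        for j in 0..N {
            for k in 0..N {
                reference[i * N + j] += m1[i * N + k] * m2_t[j * N + k];
            }
        }
    }
    assert_eq!(tiled, reference);
}
```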
<p>The loop tiling technique enhances cache locality by computing the multiplication in blocks that fit into the L1d cache, which maximizes cache usage and reduces cache misses. The results of the performance tests I ran on the three functions <code>non_optimized</code>, <code>optimized</code>, and <code>optimized_tiled</code> are presented in the table below:</p>
<table>
<thead>
<tr>
<th>Event</th>
<th><em>non optimized</em></th>
<th><em>optimized</em></th>
<th><em>optimized with loop tiling</em></th>
</tr>
</thead>
<tbody>
<tr>
<td><code>cycles</code> (in billion)</td>
<td>13.66</td>
<td>9.47</td>
<td><strong>7.05</strong></td>
</tr>
<tr>
<td><code>instructions</code> (in billion)</td>
<td>26.47</td>
<td>24.97</td>
<td><strong>27.77</strong></td>
</tr>
<tr>
<td><code>ins. per cycle</code></td>
<td>1.94</td>
<td>2.64</td>
<td><strong>3.94</strong></td>
</tr>
<tr>
<td><code>L1d-cache-load-misses</code> (in billion)</td>
<td>1.48</td>
<td>0.13</td>
<td><strong>0.05</strong></td>
</tr>
</tbody>
</table>
<p>As seen in the table, the <code>optimized_tiled</code> function outperforms the other two functions, with a higher <code>ins. per cycle</code> metric and fewer L1d cache misses. Note that additional optimization can probably be achieved using SIMD instructions. Again, that would involve the use of the intrinsics provided by the CPU vendor.</p>
<h2 id="optimizing-using-huge-pages">Optimizing using huge pages</h2>
<p>Another aspect to consider when optimizing an application is the use of <em>huge pages</em>. Huge pages are pages larger than the usual ~4KB page size. The advantage of using huge pages is that each TLB entry covers more memory, so the CPU can access more memory without having to walk the page table in main memory again. There is an awesome <a href="https://www.evanjones.ca/hugepages-are-a-good-idea.html">post by Evan Jones</a> that explains the advantages of using huge pages.<br>
The post refers to a <a href="https://www.usenix.org/conference/osdi21/presentation/hunter">Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator</a> paper by Google. The paper explains how they extended malloc to make it huge page aware and improve their requests-per-second (RPS) by 7.7% while reducing RAM usage by 2.4%.</p>
<p>Evan Jones provides the <a href="https://github.com/evanj/hugepagedemo">evanj/hugepagedemo</a> repository containing a Rust demo showing the advantages of huge pages. Again, the demo uses <code>perf</code> to analyze the performance of the code.</p>
<p>Some other techniques for obtaining huge pages are described in the <a href="https://www.hudsonrivertrading.com/hrtbeat/low-latency-optimization-part-2/">Low Latency Optimization: Using Huge Pages on Linux (Part 2)</a> post by Guillaume Morin.</p>
<h2 id="wrap-up">Wrap-Up</h2>
<p>This post covered some of the techniques that can be used to perform ad-hoc optimizations on the code: bypassing the cache using non-temporal writes, optimizing cache usage with sequential access and loop tiling, and using huge pages to improve TLB performance. These techniques can be applied in most programming languages. It is worth noting that these optimizations are strongly dependent on the CPU architecture. The same applies to the <code>perf</code> events tracked in the post.<br>
Finally, most of the theoretical concepts are explained in the <a href="https://lwn.net/Articles/250967/">What every programmer should know about memory</a> paper. The full code samples are available in the <a href="https://github.com/samueleresca/what-every-programmer-should-know-about-memory">samueleresca/what-every-programmer-should-know-about-memory</a> repository.</p>
<h2 id="references">References</h2>
<p><a href="https://lwn.net/Articles/255364/">What every programmer should know about memory (What a programmer can do) - Ulrich Drepper</a></p>
<p><a href="https://www.evanjones.ca/hugepages-are-a-good-idea.html">Hugepages are a good idea - Evan Jones</a></p>
<p><a href="https://www.usenix.org/conference/osdi21/presentation/hunter">Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator - A.H. Hunter, Chris Kennelly, Paul Turner, Darryl Gove, Tipp Moseley and Parthasarathy Ranganathan</a></p>
<p><a href="https://www.hudsonrivertrading.com/hrtbeat/low-latency-optimization-part-2/">Low Latency Optimization: Using Huge Pages on Linux (Part 2) - Guillaume Morin</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Analysis of 'What Every Programmer Should Know About Memory']]></title><description><![CDATA[This post highlights the main concepts useful to software engineers, presented in the preparatory part of the "What every programmer should know about memory" paper. The 2nd part of this series provides some Rust examples to explore how to write memory-optimized code]]></description><link>https://samueleresca.net/analysis-of-what-every-programmer-should-know-about-memory/</link><guid isPermaLink="false">63ea0538827b68149a697b2e</guid><category><![CDATA[performance]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Sat, 20 May 2023 13:41:59 GMT</pubDate><media:content url="https://samueleresca.net/content/images/2023/04/boccadasse-2.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://samueleresca.net/content/images/2023/04/boccadasse-2.png" alt="Analysis of &apos;What Every Programmer Should Know About Memory&apos;"><p>I have recently attended a Linux kernel programming training. One of the references in the memory management section of the training was the paper <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory</a> by Ulrich Drepper. The paper was published as a series in the <a href="https://lwn.net/">LWN.net</a> newsletter.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>This post highlights the main concepts useful to software engineers presented in the paper. The 2nd part of this series (see <a href="https://samueleresca.net/memory-management-optimizations-techniques">Memory management optimization techniques</a>) provides some Rust examples to explore how to write memory-optimized code (coming from the &quot;What a programmer can do&quot; section).</p>
<h2 id="this-content-is-from-2007-why-i-should-care">This content is from 2007, why should I care?</h2>
<p>Ulrich Drepper&apos;s paper titled &quot;What Every Programmer Should Know About Memory&quot; was published in 2007.</p>
<p>The paper provides an extensive exploration of memory, covering topics such as management, organization, and optimization. These fundamental concepts continue to apply today, and their understanding is crucial for programmers to write efficient and effective code.</p>
<p>Some of the concepts described in the paper are still widely referred to in the industry. For example, <a href="https://www.hudsonrivertrading.com/hrtbeat/low-latency-optimization-part-1/">Low Latency Optimization: Understanding Huge Pages (Part 1)</a> refers to the paper in the Further reading section. Furthermore, the concepts explored in Drepper&apos;s paper can be used to comprehend and make decisions based on the output of performance tools such as <code>perf</code>.</p>
<p>In summary, despite being over a decade old, Ulrich Drepper&apos;s paper remains highly pertinent to software engineering today. Some of its insights and concepts continue to be applicable, and understanding them is essential for optimal code performance.</p>
<h2 id="commodity-hardware-architectures">Commodity hardware architectures</h2>
<p>The paper starts with a section about the anatomy of commodity hardware. Systems scaling is usually achieved horizontally these days, and commodity hardware is what services and applications run on.</p>
<p>The commodity hardware structure usually involves the following:</p>
<ul>
<li>The <em>northbridge</em> connects the RAM with the CPUs. The northbridge also communicates with the southbridge to reach all other system devices.</li>
<li>The <em>FSB</em> is the bus that connects the CPUs with the northbridge.</li>
<li>The <em>memory controller</em> is the controller associated with each RAM. Different types of RAM correspond to different memory controllers. The memory controller can be contained in the northbridge or it can be a standalone component.</li>
<li>The <em>southbridge</em> (a.k.a. I/O bridge) handles communications with all the system devices (e.g. SATA, USB, PCI-E).</li>
</ul>
<p>The first hardware architecture presented in the paper is the following:</p>
<p><img src="https://samueleresca.net/content/images/2023/05/north_south_bridge_architecture-1.png" alt="Analysis of &apos;What Every Programmer Should Know About Memory&apos;" loading="lazy"></p>
<p>The main bottleneck in the architecture above is the northbridge&apos;s communication with the RAM. In this architecture, the northbridge contains the memory controller for interfacing with the RAM. Therefore, all the data goes through a single memory bus and a single memory controller.</p>
<p>Nowadays, architectures integrate the <em>memory controllers into the CPU</em>, so there is no need for the northbridge at all. Below is the schema describing an architecture without the northbridge:</p>
<p><img src="https://samueleresca.net/content/images/2023/05/integrated_memory_controller_architecture.png" alt="Analysis of &apos;What Every Programmer Should Know About Memory&apos;" loading="lazy"></p>
<p>In this case, the memory bandwidth is proportional to the number of CPUs. This architecture leads to non-uniform memory because the memory is now bound to a CPU (local memory), an organization usually referred to as NUMA: Non-Uniform Memory Architecture (<a href="#numa">see the dedicated section for more details</a>).</p>
<h3 id="optimizing-device-access-with-dma">Optimizing device access with DMA</h3>
<p>In the early architectures, every communication to/from a device had to pass through the CPU. This has been solved with DMA. DMA is an optimization for offloading the CPU: it allows transferring data from devices with minimal CPU involvement, as the CPU only needs to initiate the transfer and can then move on to other tasks.</p>
<h2 id="numa">NUMA</h2>
<p>NUMA stands for Non-Uniform Memory Architecture. The non-uniform part refers to the locality of the memory relative to the CPU.</p>
<p>In a NUMA architecture, a processor can access its local memory faster than non-local memory. As already mentioned, the Non-uniform memory is needed because the northbridge becomes a severe bottleneck since all the memory traffic is routed through it. So, the memory controller and the memory are integrated into the CPU.</p>
<p>NUMA requires that the OS takes into account the distributed nature of the memory:</p>
<ul>
<li>if a process runs on a given processor, the physical RAM assigned to the process address space should come from the local memory of that processor.</li>
<li>the OS should not migrate a process or a thread from one node to another. If migrating the process across nodes is necessary, the selection of the new processor should be restricted to processors that do not have higher access costs to the memory than the current processor.</li>
</ul>
<p>By default, memory in NUMA is not allocated only on the local node, to avoid saturating the local memory of nodes that are running large processes.</p>
<h2 id="cpu-caches">CPU caches</h2>
<p>The memory is not as fast as the CPU. Thus, the need for CPU caches between CPUs and memory. It is feasible to build a faster RAM (see SRAM internals), but it is not cheap. The choice is either for a large amount of DRAM or for a smaller amount of very fast SRAM.</p>
<p>The approach taken on most of the commodity hardware is hybrid. The DRAM is used as the main memory facility. On top of that, there is a small amount of SRAM to cache data used in the near future by the processor.</p>
<h3 id="architecture">Architecture</h3>
<p>The CPU cache architecture consists of multiple cache layers. The reason for having many layers is historical: new layers were added as the speed difference between the caches and the main memory kept increasing.</p>
<p>Below is the schema of a 3-level cache architecture:</p>
<p><img src="https://samueleresca.net/content/images/2023/05/3_level_cpu_architecture.png" alt="Analysis of &apos;What Every Programmer Should Know About Memory&apos;" loading="lazy"></p>
<p>The speed is higher in the cache layers closer to the core. The architecture presents two flavours of <code>L1</code> cache:</p>
<ul>
<li><code>L1d</code> cache is dedicated to the data.</li>
<li><code>L1i</code> cache is dedicated to the instructions.</li>
</ul>
<p>In a multi-processor / multi-core architecture a cache might be shared depending on the level. The paper describes the cache in a multi-processor architecture with the following schema:</p>
<p><img src="https://samueleresca.net/content/images/2023/05/multiprocessor_architecture.png" alt="Analysis of &apos;What Every Programmer Should Know About Memory&apos;" loading="lazy"></p>
<p>The schema represents 2 processors, each with 2 cores, and each core with 2 hardware threads. The <code>L2</code> and <code>L3</code> caches are shared across the cores within a processor, while the threads within a core share the <code>L1</code> caches.</p>
<h3 id="cpu-caches-implementation">CPU caches implementation</h3>
<p>All the data read or written by the CPU is stored in the CPU caches. If the CPU needs a data word, it searches in the caches first. Entries are stored in the cache as lines of several contiguous words for the following reasons:</p>
<ul>
<li>The <em>spatial locality</em>: in a short period, there is a good chance that the same code or data gets reused.</li>
<li>The RAM modules are more effective if they can transport many data words in a row.</li>
</ul>
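<p>Spatial locality is easy to observe in code: traversing a row-major matrix row by row touches consecutive addresses (and therefore consecutive cache lines), while traversing it column by column jumps a whole row ahead on every access. A small illustrative sketch, with arbitrary sizes:</p>

```rust
const ROWS: usize = 512; // arbitrary sizes for illustration
const COLS: usize = 512;

fn main() {
    let m = vec![1u32; ROWS * COLS]; // row-major flat matrix

    // Row-wise traversal: consecutive addresses, friendly to cache lines
    // and to the hardware prefetcher (spatial locality).
    let mut sum_rows = 0u64;
    for r in 0..ROWS {
        for c in 0..COLS {
            sum_rows += m[r * COLS + c] as u64;
        }
    }

    // Column-wise traversal: each access is COLS elements apart, so every
    // access may land on a different cache line.
    let mut sum_cols = 0u64;
    for c in 0..COLS {
        for r in 0..ROWS {
            sum_cols += m[r * COLS + c] as u64;
        }
    }

    // Same result, very different cache behaviour.
    assert_eq!(sum_rows, sum_cols);
}
```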
<p>When the processor needs memory content, the entire cache line is loaded into the <code>L1d</code>. The following steps describe, at a high level, the process for <em>modifying memory</em>:</p>
<ol>
<li>The processor loads a cache line first.</li>
<li>The processor modifies the interested part of the cache line, marking it as &quot;dirty&quot;.</li>
<li>When the processor writes back the changes to the main memory the dirty flag is cleared.</li>
</ol>
<p>To load new data into a cache, it is necessary to make room in it first. The <em>eviction process</em> from <code>L1d</code> pushes the cache line down into <code>L2</code>. In turn, evicting a cache line from <code>L2</code> means pushing the content into <code>L3</code>. Finally, <code>L3</code> pushes content into the main memory.</p>
<h3 id="associativity">Associativity</h3>
<p>The core problem in CPU cache implementations is that many memory locations compete for each place in the cache. The cache associativity plays a role in the performance of the overall system. Specifically, the associativity determines where a particular memory block can be placed when it is loaded into the cache.</p>
<pre><code>     &#x250C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x252C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x252C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2510;
     &#x2502;     TAG      &#x2502;      SET INDEX     &#x2502;   BLOCK   &#x2502;
     &#x2502;              &#x2502;                    &#x2502;   OFFSET  &#x2502;
     &#x2514;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2534;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2534;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2500;&#x2518;
          t bits            s bits           b bits
</code></pre>
<p>The caches are organized in sets divided into cache lines. The addresses referring to cached values are usually composed of the following parts:</p>
<ul>
<li><strong>Tag</strong> represents the tag to store alongside the cached value. The tag contains the address of the target data in the main memory. It is needed because in some cases many memory addresses can end up being stored in the same cache line, see associativity types below.</li>
<li><strong>Set index</strong> identifies the target set.</li>
<li><strong>Block offset</strong> identifies the offset within the cache line. A cache line can store many words; the offset is needed to identify a single word.</li>
</ul>
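<p>The address decomposition above can be sketched in code. The bit widths here are illustrative assumptions: a 64-byte cache line gives 6 block-offset bits, and 64 sets give 6 set-index bits:</p>

```rust
// Illustrative parameters: 64-byte lines (b = 6 bits), 64 sets (s = 6 bits).
const BLOCK_BITS: u32 = 6;
const SET_BITS: u32 = 6;

// Split an address into (tag, set index, block offset).
fn split_address(addr: u64) -> (u64, u64, u64) {
    let offset = addr & ((1 << BLOCK_BITS) - 1);          // low b bits
    let set = (addr >> BLOCK_BITS) & ((1 << SET_BITS) - 1); // next s bits
    let tag = addr >> (BLOCK_BITS + SET_BITS);            // remaining t bits
    (tag, set, offset)
}

fn main() {
    let (tag, set, offset) = split_address(0xdead_beef);
    assert_eq!(offset, 0x2f);             // low 6 bits
    assert_eq!(set, 0x3b);                // next 6 bits
    assert_eq!(tag, 0xdead_beef >> 12);   // everything above offset + set
}
```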
<p>There are 3 types of CPU caches: the <em>direct-mapped</em>, <em>fully-associative</em> and <em>set-associative</em> types. Each of them determines how the content of the cache is stored and fetched.</p>
<h4 id="direct-mapped-cache">Direct-mapped cache</h4>
<p>The cache is organized into many sets with a single cache line per set. Thus, a memory block can only occupy a single cache line.</p>
<p>The <em>write pattern</em> is as simple as placing the memory block in the set index identified by the address. The tag is stored alongside the set. The new data replace the old one already stored.</p>
<p>The <em>search pattern</em> is as simple as retrieving the set corresponding to the set index contained in the address. If the tag (stored alongside the set) corresponds to the tag in the address, the data is retrieved from the cache (cache hit). Otherwise, it is a cache miss.</p>
<p>The advantage of this approach is that it is simple and cheap to implement. The disadvantage is that it leads to a lot of <em>cache misses</em>.</p>
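<p>A toy sketch of the direct-mapped write and search patterns (the number of sets and the line size are arbitrary choices; a real cache stores the data words of the line alongside the tag, which is omitted here):</p>

```rust
const SETS: usize = 16;    // arbitrary number of sets
const BLOCK_BITS: u32 = 6; // 64-byte cache lines

// One cache line per set, holding only (valid, tag).
struct DirectMapped {
    lines: [(bool, u64); SETS],
}

impl DirectMapped {
    fn new() -> Self {
        Self { lines: [(false, 0); SETS] }
    }

    // Returns true on a hit; on a miss, the new tag replaces the old line.
    fn access(&mut self, addr: u64) -> bool {
        let line_addr = addr >> BLOCK_BITS;
        let set = (line_addr as usize) % SETS; // set index bits
        let tag = line_addr / SETS as u64;     // remaining high bits
        let (valid, stored_tag) = self.lines[set];
        if valid && stored_tag == tag {
            return true; // cache hit
        }
        self.lines[set] = (true, tag); // write pattern: replace unconditionally
        false
    }
}

fn main() {
    let mut cache = DirectMapped::new();
    assert!(!cache.access(0x1000)); // cold miss
    assert!(cache.access(0x1000));  // hit: same address
    assert!(cache.access(0x1004));  // hit: same 64-byte line
    // 0x1000 and 0x5000 map to the same set with different tags:
    assert!(!cache.access(0x5000)); // conflict miss, evicts the previous line
    assert!(!cache.access(0x1000)); // the evicted line misses again
}
```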
<h4 id="fully-associative-cache">Fully-associative cache</h4>
<p>A fully-associative cache is organized as a single set with many cache lines.<br>
A memory block can be placed in any of the cache lines. The addresses of a fully-associative cache do not have any <em>Set index</em>, as there is only one set.</p>
<p>The <em>write pattern</em> consists of looking for the first available cache line to place the memory block in. If no space is available, a block in the cache is evicted based on a replacement policy.</p>
<p>The <em>search pattern</em> consists of comparing the <em>Tag</em> bits of the memory address against the tag stored alongside each cache line. The <em>Offset</em> is used to select the byte to return.</p>
<p>The advantage is that the cache is fully utilised and the cache hit rate is maximized. The disadvantages are that the <em>search pattern</em> has to iterate over all the cache lines and that the implementation is expensive.</p>
<p>Because of the performance and hardware costs, this approach is usually adopted for small caches with a few dozen entries.</p>
<h4 id="set-associative">Set-associative</h4>
<p>It is a hybrid between the fully associative and the direct-mapped approach. Each set can contain many cache lines. It is the go-to approach for CPU caches.</p>
<p>A memory block is first mapped to a set and then it can be placed into any cache line of that set.</p>
<p>The <em>write pattern</em> consists of placing the memory block in any of the cache lines in the set defined by the <em>Set index</em>. The tag is stored in the field associated with the cache line. If all the cache lines in that set are occupied, one of its blocks is evicted.</p>
<p>The <em>search pattern</em> identifies the set using the <em>Set index</em>. The <em>Tag</em> bit is compared with the tags of all cache lines of the set.</p>
<p>This approach is a trade-off between the previous two approaches.</p>
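<p>A toy sketch of the set-associative write and search patterns (the sizes and the FIFO replacement policy are arbitrary simplifications; real caches typically use pseudo-LRU and store the data words as well):</p>

```rust
const SETS: usize = 8;     // arbitrary number of sets
const WAYS: usize = 4;     // arbitrary associativity (4-way)
const BLOCK_BITS: u32 = 6; // 64-byte cache lines

// Each set holds up to WAYS tags.
struct SetAssociative {
    sets: Vec<Vec<u64>>,
}

impl SetAssociative {
    fn new() -> Self {
        Self { sets: vec![Vec::new(); SETS] }
    }

    // Returns true on a hit; on a miss, installs the tag, evicting FIFO-style.
    fn access(&mut self, addr: u64) -> bool {
        let line_addr = addr >> BLOCK_BITS;
        let set = (line_addr as usize) % SETS;
        let tag = line_addr / SETS as u64;
        let ways = &mut self.sets[set];
        // Search pattern: compare the tag against every line of the set.
        if ways.contains(&tag) {
            return true;
        }
        // Write pattern: use a free way, otherwise evict the oldest entry.
        if ways.len() == WAYS {
            ways.remove(0);
        }
        ways.push(tag);
        false
    }
}

fn main() {
    let mut cache = SetAssociative::new();
    // Four lines that all map to set 0 can coexist in a 4-way cache...
    for i in 0..4u64 {
        assert!(!cache.access(i * (SETS as u64) * 64)); // cold misses
    }
    for i in 0..4u64 {
        assert!(cache.access(i * (SETS as u64) * 64)); // all hits
    }
    // ...but a fifth conflicting line evicts the oldest one.
    assert!(!cache.access(4 * (SETS as u64) * 64));
    assert!(!cache.access(0)); // line 0 was evicted (FIFO)
}
```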
<h3 id="write-behaviours-in-a-single-processor">Write behaviours in a single processor</h3>
<p>The paper describes 3 cache policy implementations to achieve coherency in a single processor scenario: <em>write-through</em>, <em>write-back</em> and <em>write-combining</em>.</p>
<p>The <em>write-through</em> implementation is the simplest. If the cache line is written to, the processor immediately also writes it to the main memory. This implementation is simple but slow, and it creates a lot of traffic on the FSB.</p>
<p>The <em>write-back</em> implementation consists of marking the cache line dirty just after it is updated. When the cache line is dropped from the cache, the dirty bit instructs the processor to write the data back to the main memory. The <em>write-back</em> approach is more complex, but it provides better performance and is the approach adopted by most processors.</p>
<p>The <em>write-combining</em> implementation combines many write accesses before the cache line is written out. This approach is used to reduce the number of bus transfers, and it is usually employed when a device is involved, such as a GPU.</p>
<h3 id="multi-processor-coherency-and-mesi">Multi-processor coherency and MESI</h3>
<p>The cache behaviours get more complicated in a multi-processor setup. The <em>write-back</em> implementation alone fails to maintain coherency across many processors.</p>
<p>Let&apos;s take as an example a scenario with 2 processors (<code>P1</code>, <code>P2</code>). If <code>P1</code> writes an updated cache line, that cache line won&apos;t be available either in the main memory or in the caches of <code>P2</code>, as they don&apos;t share the same caches. Thus, <code>P2</code> would read out-of-date information.</p>
<p>Providing access from <code>P2</code> to the cache of <code>P1</code> would be impractical for performance reasons. Thus, the go-to approach is to transfer the cached content over to the other processor in case it is needed. The same is also valid for the caches that are not shared on the same processor. The cache line transfer happens when a processor needs a cache line which is dirty in another processor.</p>
<p>The <em>MESI</em> is the coherency protocol that is used to implement <em>write-back</em> caches in a multi-processor scenario.</p>
<p>MESI takes its name from the 4 states a cache line can be in:</p>
<ul>
<li><em>Modified</em>: the cache line is dirty and only present in the current cache.</li>
<li><em>Exclusive</em>: the cache line is only present in the current cache, but it is not modified.</li>
<li><em>Shared</em>: the cache line is stored in many caches but is clean (it matches the main memory).</li>
<li><em>Invalid</em>: the cache line is invalid.</li>
</ul>
<p>The cache lines are initialized as <code>Invalid</code>:</p>
<ul>
<li>when data is loaded into the cache for writing, the cache line is marked as <code>Modified</code>.</li>
<li>when data is loaded for reading, the state depends on the other processors:
<ul>
<li>if another processor has the same data then the line is marked as <code>Shared</code></li>
<li>if no other processor has the same data then the line is <code>Exclusive</code></li>
</ul>
</li>
</ul>
<p>The state of a cache line is subject to 2 main stimuli:</p>
<ul>
<li>a <em>local request</em>, which does not need to notify other processors. For example, it happens when a <code>Modified</code> or <code>Exclusive</code> cache line is read or written locally. In this case, the state does not change.</li>
<li>a <em>bus request</em>, which happens when a processor needs to notify another processor. Some examples:
<ul>
<li>a processor called <code>P2</code> wants to read the cache line from another processor <code>P1</code>. <code>P1</code> has to send the content of the cache line to <code>P2</code> and change the state to <code>Shared</code>. The data is also sent to the memory controller, which stores it in memory.</li>
<li>a processor called <code>P2</code> wants to write a cache line that is in a <code>Shared</code> or <code>Invalid</code> state. This operation triggers a bus request to notify the other processors so that they mark their local copies as <code>Invalid</code>.</li>
</ul>
</li>
</ul>
<p>In general, a write operation can only be performed as a <em>local request</em> if the cache line is in the <code>Modified</code> / <code>Exclusive</code> state. If a write request acts on a cache line in a <code>Shared</code> state, all the other cached copies must be invalidated first. The invalidation happens by notifying the other processors of the change. This is also known as a <em>Request For Ownership (RFO)</em>.</p>
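<p>The state transitions described above can be summarized in a small sketch; reducing the bus traffic to three transition functions is a simplification of the protocol, not a full implementation:</p>

```rust
// The four MESI states of a cache line.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Mesi {
    Modified,
    Exclusive,
    Shared,
    Invalid,
}

// Transition on a write issued by the owning processor. From Modified or
// Exclusive this is a purely local request; from Shared or Invalid an RFO
// bus transaction invalidates all other copies first.
fn local_write(state: Mesi) -> Mesi {
    match state {
        Mesi::Modified | Mesi::Exclusive => Mesi::Modified, // local request only
        Mesi::Shared | Mesi::Invalid => Mesi::Modified,     // after an RFO
    }
}

// Transition observed by a processor when ANOTHER processor reads its line.
fn remote_read(state: Mesi) -> Mesi {
    match state {
        // A dirty line is transferred (and written back); both copies end Shared.
        Mesi::Modified | Mesi::Exclusive | Mesi::Shared => Mesi::Shared,
        Mesi::Invalid => Mesi::Invalid,
    }
}

// Transition observed when another processor performs an RFO on the line.
fn remote_write(_state: Mesi) -> Mesi {
    Mesi::Invalid // the local copy is always invalidated
}

fn main() {
    let mut line = Mesi::Exclusive; // loaded for reading, no other copies
    line = local_write(line);
    assert_eq!(line, Mesi::Modified);
    line = remote_read(line); // another processor reads: write back, share
    assert_eq!(line, Mesi::Shared);
    line = remote_write(line); // another processor writes: invalidate
    assert_eq!(line, Mesi::Invalid);
}
```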
<h2 id="performance-considerations-on-cpu-caches">Performance considerations on CPU caches</h2>
<p><em>Associativity</em> plays a crucial role in reducing cache misses, and thus in the performance of the CPU caches. The paper demonstrates how moving from direct mapping to set associativity can reduce cache misses by 44% in an 8MB L2 cache.</p>
<p>It is worth mentioning that another factor that plays a role in reducing cache misses is the <em>cache size</em>: the larger the cache, the lower the chance of a cache miss.</p>
<p><em>Request for ownership (RFO)</em> operations in the context of MESI can slow things down in multi-processor scenarios. They are expensive coordination operations: the coherency messages need to be sent to <code>N</code> processors and a reply awaited from all of them, so the speed of the exchange is determined by the longest possible time needed to get a reply.</p>
<h2 id="virtual-memory">Virtual Memory</h2>
<p>The paper proceeds by describing the virtual memory system used by the processors. The virtual memory provides the virtual address spaces used by the processors. Some of the advantages of virtual memory:</p>
<ul>
<li>Virtual memory offloads the physical memory by moving in-memory data to disk.</li>
<li>It provides a transparent abstraction to the process: each process sees memory as a single contiguous chunk.</li>
</ul>
<p>The component taking care of the virtual address space is the <em>Memory management unit (MMU)</em>.</p>
<h3 id="address-translation">Address translation</h3>
<p>The core functionality of the Memory management unit (MMU) is the translation from a virtual address to a physical address.</p>
<p>The translation is implemented using the <em>multi-level page table</em> structure. Defined in the paper with the following schema:</p>
<p><img src="https://samueleresca.net/content/images/2023/05/multi_level_page_schema.png" alt="Analysis of &apos;What Every Programmer Should Know About Memory&apos;" loading="lazy"></p>
<p>The virtual address is divided into 5 parts. The <code>Level 4 index</code> points to the <code>Level 4 Entry</code>, which is a reference to the <code>Level 3 Directory</code>. The structure continues until the <code>Level 1 Directory</code>, where the entry contains the reference to the physical page. The physical page address combined with the offset gives the physical address, and the operation of recursively navigating the page table is called <em>page tree walking</em>.</p>
<p>The data structure for building the page table is kept in main memory. Each process has its own page table, and the CPU is notified of every creation or change of a page table. The page tree walking operation is expensive: for example, for a 4-level page tree, the operation needs at least 4 memory accesses. On top of that, the address computation cannot be parallelized, because each level depends on the entry found in the upper-level directory.</p>
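<p>The split of the virtual address can be sketched in Go, assuming the common x86-64 layout of four 9-bit directory indexes plus a 12-bit page offset (the helper name is made up for illustration):</p>

```go
package main

import "fmt"

// splitVirtualAddress sketches the x86-64 4-level layout: a 48-bit
// virtual address is divided into four 9-bit directory indexes plus a
// 12-bit page offset, so a page-tree walk needs one memory access per level.
func splitVirtualAddress(va uint64) (idx [4]uint64, offset uint64) {
	offset = va & 0xFFF // low 12 bits: offset inside the 4 KiB page
	for level := 0; level < 4; level++ {
		// idx[0] is the Level 4 index (highest bits), idx[3] is Level 1.
		idx[level] = (va >> (12 + 9*uint(3-level))) & 0x1FF
	}
	return idx, offset
}

func main() {
	idx, off := splitVirtualAddress(0x7FFF_1234_5678)
	fmt.Println(idx, off)
}
```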
<p>Thus, it is necessary to cache the computed mappings between virtual and physical addresses. The cached values are stored in the <em>Translation Look-Aside Buffer (TLB)</em>.</p>
<p>The <em>Translation Look-Aside Buffer (TLB)</em> is usually built with a set-associative or a fully associative approach (depending on the purpose of the architecture). If the tag has a match in the cache, the physical address is computed by adding the offset specified in the virtual address. In case of a TLB cache miss, the CPU needs to execute the <em>page tree walking</em> process and store the result in the TLB cache.</p>
<p>Note that the TLB follows the same structure as the other CPU caches, so it can be built in multiple layers: L1TLB, L2TLB. Furthermore, there can be multiple TLBs for different purposes: an instruction TLB and a data TLB.</p>
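<p>A toy model of the TLB lookup path might look as follows. The <code>TLB</code> type and its <code>walk</code> function are illustrative assumptions, not a real MMU: the walk here is a made-up mapping that just stands in for the expensive multi-level page-table traversal:</p>

```go
package main

import "fmt"

// A toy TLB: maps virtual page numbers to physical page numbers.
// On a miss, the (expensive) page-tree walk runs and the result is cached.
type TLB struct {
	entries map[uint64]uint64
	walks   int // how many page-tree walks were needed
}

// walk stands in for the multi-level page-table walk; here it is a
// hypothetical mapping, just for illustration.
func (t *TLB) walk(vpn uint64) uint64 {
	t.walks++
	return vpn + 0x1000 // made-up physical page number
}

// Translate returns the physical address for a virtual address,
// consulting the TLB first (pages are 4 KiB, so the offset is 12 bits).
func (t *TLB) Translate(va uint64) uint64 {
	vpn, offset := va>>12, va&0xFFF
	ppn, hit := t.entries[vpn]
	if !hit {
		ppn = t.walk(vpn)
		t.entries[vpn] = ppn
	}
	return ppn<<12 | offset
}

func main() {
	tlb := &TLB{entries: map[uint64]uint64{}}
	tlb.Translate(0x2000) // miss: triggers a walk
	tlb.Translate(0x2ABC) // same page: TLB hit
	fmt.Println(tlb.walks) // only one walk for the two translations
}
```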
<h2 id="performance-considerations-on-tlb">Performance considerations on TLB</h2>
<p>The TLB size is limited, like the other CPU caches. The performance of the TLB can be influenced by changing the size of the pages.</p>
<p>Larger pages reduce the number of address translations needed, meaning that fewer TLB entries are needed. Furthermore, in case of a TLB miss, the page tree walk will be faster as the number of levels will be smaller.</p>
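<p>The effect on the number of TLB entries is easy to quantify. As a quick sketch (the <code>pagesNeeded</code> helper is hypothetical), mapping a 1 GiB buffer with 2 MiB huge pages needs 512 times fewer entries than with 4 KiB pages:</p>

```go
package main

import "fmt"

// pagesNeeded returns how many pages (and therefore how many TLB entries)
// are needed to map a buffer of the given size, rounding up.
func pagesNeeded(bufferSize, pageSize uint64) uint64 {
	return (bufferSize + pageSize - 1) / pageSize
}

func main() {
	const GiB = 1 << 30
	fmt.Println(pagesNeeded(GiB, 4<<10)) // 4 KiB pages: 262144 entries
	fmt.Println(pagesNeeded(GiB, 2<<20)) // 2 MiB huge pages: 512 entries
}
```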
<p>The downside of this approach is that large pages must be contiguous in physical memory. This results in more fragmented memory and an increase in wasted memory.</p>
<p>Huge pages are beneficial in scenarios where performance matters and resources are not a concern, for example, database servers or latency-sensitive systems. This series of posts provides a detailed overview of the use of huge pages:</p>
<p><a href="https://www.hudsonrivertrading.com/hrtbeat/low-latency-optimization-part-1/">Low Latency Optimization: Using Huge Pages on Linux (Part 1)</a></p>
<p><a href="https://www.hudsonrivertrading.com/hrtbeat/low-latency-optimization-part-2/">Low Latency Optimization: Using Huge Pages on Linux (Part 2)</a></p>
<h2 id="wrap-up">Wrap-Up</h2>
<p>This post analyzes some of the topics discussed in the paper <a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory by Ulrich Drepper</a>. CPU caches, virtual memory, the TLB, and the anatomy of the hardware all play a role in optimizing the performance of a system.</p>
<p>The next post (<a href="https://samueleresca.net/memory-management-optimizations-techniques">Memory management optimization techniques</a>) of this series is more hands-on: it explores the concepts introduced in the &quot;What programmers can do&quot; section of Ulrich&apos;s paper.</p>
<h2 id="references">References</h2>
<p><a href="https://people.freebsd.org/~lstewart/articles/cpumemory.pdf">What Every Programmer Should Know About Memory - Ulrich Drepper</a></p>
<p><a href="https://www.hudsonrivertrading.com/hrtbeat/low-latency-optimization-part-1/">Low Latency Optimization: Using Huge Pages on Linux (Part 1)</a></p>
<p><a href="https://www.hudsonrivertrading.com/hrtbeat/low-latency-optimization-part-2/">Low Latency Optimization: Using Huge Pages on Linux (Part 2)</a></p>
<p><a href="https://www.brendangregg.com/perf.html">perf tool</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Techniques for fuzz testing]]></title><description><![CDATA[Fuzz testing is a broad topic with many approaches and strategies.

This post summarizes some techniques for fuzz testing and the learnings I have made. It also goes through some fuzz tests running on some cloud-native foundation projects, such as etcd.]]></description><link>https://samueleresca.net/techniques-for-fuzz-testing/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac193</guid><category><![CDATA[security]]></category><category><![CDATA[Distributed Systems]]></category><category><![CDATA[fuzzing]]></category><category><![CDATA[etcd]]></category><category><![CDATA[fuzz]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Mon, 05 Dec 2022 09:31:42 GMT</pubDate><media:content url="https://samueleresca.net/content/images/2022/11/greenwich_uni.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post refers to the following <a href="https://github.com/dotnet/designs/pull/273">.NET design proposal</a>.</div></div><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">This post refers to the following pull request <a href="https://github.com/etcd-io/etcd/pull/14561">etcd-io/etcd/pull/14561</a>.</div></div><img src="https://samueleresca.net/content/images/2022/11/greenwich_uni.jpg" alt="Techniques for fuzz testing"><p></p><!--kg-card-begin: markdown--><p>Recently, I have been working on a fuzzer for testing the resilience of a parser. The fuzzer detected some crashes in a parser running on top of well-established production code. That made me realize how critical this testing technique is for the security of a system.</p>
<p>Fuzz testing is a broad topic with many approaches and strategies.</p>
<p>This post summarizes some techniques for fuzz testing and what I have learned. It also goes through some fuzz tests running in Cloud Native Computing Foundation (CNCF) projects, such as <a href="https://github.com/etcd-io/etcd">etcd</a>.</p>
<p>The index below summarizes the post sections:</p>
<ul>
<li><a href="#anatomy-of-fuzzing">Anatomy of fuzzing</a></li>
<li><a href="#black-box-fuzzing">Black-box fuzzing</a></li>
<li><a href="#coverage-guided-fuzzing">Coverage-guided fuzzing</a></li>
<li><a href="#blackbox-fuzzing-vs-coverage-guided-fuzzing">Blackbox fuzzing vs Coverage-guided fuzzing</a></li>
<li><a href="#fuzzing-in-a-distributed-systems-world">Fuzzing in a distributed systems world</a></li>
<li><a href="#use-case-etcd-fuzzing">[Use case] etcd fuzzing</a></li>
<li><a href="#use-case-net-fuzzing-design-proposal">[Use case] .NET fuzzing design proposal</a></li>
<li><a href="#final-thoughts-democratize-fuzz-testing">Democratize fuzz testing</a></li>
</ul>
<h2 id="anatomy-of-fuzzing">Anatomy of fuzzing</h2>
<p>Let&apos;s start simple.</p>
<p>The most basic example of a fuzzer might be a function that generates a random value. The generated value is then used as an input for the system under test:</p>
<pre><code class="language-python">while True:
    # Generation of the fuzzing value
    random_value = generate_random()
    # Implementation under test
    under_test_func(random_value)
</code></pre>
<p>The snippet above generates a random value, and <code>under_test_func</code> is executed with it for an indefinite number of iterations. Depending on the resilience of the <code>under_test_func</code> function, a certain input might make the snippet crash.</p>
<p>This is a very basic example of fuzzing. Usually, a fuzzer implementation is more complicated than that.</p>
<p>Before proceeding, let&apos;s clarify some terminology by describing a typical fuzzer architecture:</p>
<p><img src="https://samueleresca.net/content/images/2022/11/fuzz_schema.jpg" alt="Techniques for fuzz testing" loading="lazy"></p>
<ul>
<li>The <em>seed</em> is a hyperparameter of the fuzzing process. It is the initial input defined by the user that is used as a starting point to generate the fuzzing inputs.</li>
<li>The <em>fuzz engine</em> finds interesting inputs to send to a fuzz target. Usually, a fuzzing engine includes a mutator actor.</li>
<li>The <em>mutator</em> generates the <em>corpus</em> by manipulating the <em>seed</em>. There are different techniques used by mutators.</li>
<li>The <em>corpus</em> is a set of inputs that are used by the fuzz target.</li>
<li>The <em>fuzz input</em> is a single input that is passed to the fuzz target.</li>
<li>The <em>fuzz target</em> is usually a function that, given a <em>fuzz input</em>, makes a call to the <em>system under test</em>.</li>
<li>The <em>system under test</em> is the target of our fuzzing process.</li>
</ul>
<p>In case the <em>system under test</em> crashes, a new crash report file is generated.</p>
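<p>The components above can be sketched in a few lines of Go. The mutator here is deliberately naive and deterministic, and the crashing target is made up, both purely for illustration:</p>

```go
package main

import "fmt"

// Crash is a crash report: the input that made the target panic.
type Crash struct{ Input []byte }

// fuzz sketches the loop from the schema above: a mutator derives new
// inputs from the seed, each input is fed to the fuzz target, and panics
// are caught and turned into crash reports.
func fuzz(seed []byte, target func([]byte), rounds int) []Crash {
	var crashes []Crash
	input := append([]byte(nil), seed...)
	for i := 0; i < rounds; i++ {
		// A deliberately naive mutator: bump one byte per round.
		input[i%len(input)]++
		run := append([]byte(nil), input...) // snapshot for the report
		func() {
			defer func() {
				if r := recover(); r != nil {
					crashes = append(crashes, Crash{Input: run})
				}
			}()
			target(run)
		}()
	}
	return crashes
}

func main() {
	// Hypothetical system under test that crashes on a specific byte.
	target := func(in []byte) {
		if in[0] == 'c' {
			panic("boom")
		}
	}
	crashes := fuzz([]byte("aa"), target, 16)
	// The mutator eventually hits the crashing input.
	fmt.Println(len(crashes) > 0)
}
```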
<h3 id="fuzz-target-example">Fuzz target example</h3>
<p>The schema above defines a <em>fuzz target</em>. Concretely, fuzz targets usually have an implementation like the one below:</p>
<script src="https://gist.github.com/samueleresca/aa9b6cd9b7e68f09edf6a9ce2d4454c7.js"></script>
<p>The above code is the target function coming from the LLVM toolchain. The <code>LLVMFuzzerTestOneInput</code> defines the fuzz target function. The fuzz target receives the <em>corpus</em> generated by the <em>mutator</em>.<br>
Note that every fuzz engine defines its own syntax; in general, the function accepts the <em>corpus</em> as a parameter.</p>
<h2 id="black-box-fuzzing">Black-box fuzzing</h2>
<p>In Black-box fuzzing, the <em>fuzz engine</em> does not know anything about the system under test implementation. An example of black-box fuzzing might be a <em>fuzz target</em> that calls an application running on a container exposing an HTTP API.</p>
<p>Below is the schema that describes the scenario:</p>
<p><img src="https://samueleresca.net/content/images/2022/11/blackbox_fuzzing_schema.jpg" alt="Techniques for fuzz testing" loading="lazy"></p>
<p>The fuzz target calls the system under test through a client:</p>
<script src="https://gist.github.com/samueleresca/dbb5828aa38dd9d2de6308822c3194b7.js"></script>
<p>The fuzz target and the system under test are running in total isolation. The only information that they share is the network and the input that is sent from one of the <em>fuzz target</em> to the actual <em>system under test</em>.</p>
<p>Keep in mind that fuzzing across the network will slow things down. Furthermore, it is hard to detect crashes in the <em>system under test</em> because it runs as a separate instance, outside the fuzzing process. See the <a href="#fuzzing-in-a-distributed-systems-world">Fuzzing in a distributed systems world</a> section for more details and solutions.</p>
<h2 id="coverage-guided-fuzzing">Coverage-guided fuzzing</h2>
<p>Black-box fuzzing approaches the system under test blindly: it does not know anything about the internals of the system under test. Thus, the fuzz input is not tuned to hit every code path. Coverage-guided fuzzing instruments the system under test to gain awareness of the code paths that the fuzz input is covering. More in general, the coverage-guided fuzzing process has two key steps<a href="#references">[1]</a>:</p>
<ul>
<li>Instrument the system under test to inject tracking instructions at compile time. The purpose is to collect metrics on the code paths covered.</li>
<li>Keep or discard fuzz inputs depending on the coverage metrics collected.</li>
</ul>
<h3 id="coverage-guided-fuzzing-in-depth">Coverage-guided fuzzing in depth</h3>
<p>The following section describes how AFL++, a widespread fuzzing framework, implements coverage-guided fuzzing<a href="#references">[2]</a>.</p>
<p>Given the following snippet of code:</p>
<pre><code class="language-python">x = read_num()
y = read_num()

if x &gt; y:
    print x + &quot;is greater than&quot; + y
    return
if x == y:
    print x + &quot;is equal to&quot; + y
    return
print x + &quot;is less than&quot; + y
</code></pre>
<p>We can assign an edge for each branching in the code:</p>
<pre><code class="language-python">0: (A) x = read_num()
1: (A) y = read_num()
2: (A) if x &gt; y:
3: (B)   print x + &quot;is greater than&quot; + y
4: (B)   return
5: (A) if x == y:
6: (C)    print x + &quot;is equal to&quot; + y
7: (C)    return
8: (D) print x + &quot;is less than&quot; + y
9: (E) end program
</code></pre>
<p>The flow of the code above can be summarized with the following control flow graph, where each branch is an edge of the graph:</p>
<pre><code>
        &#x250C;&#x2500;&#x2500;&#x2500;&#x2510;
   &#x250C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2524; A &#x251C;&#x2500;&#x2500;&#x2500;&#x2500;&#x2510;
   &#x2502;    &#x2514;&#x2500;&#x252C;&#x2500;&#x2518;    &#x2502;
   &#x2502;      &#x2502;      &#x2502;
 &#x250C;&#x2500;V&#x2500;&#x2510;  &#x250C;&#x2500;V&#x2500;&#x2510;  &#x250C;&#x2500;V&#x2500;&#x2510;
 &#x2502; B &#x2502;  &#x2502; C &#x2502;  &#x2502; D &#x2502;
 &#x2514;&#x2500;&#x252C;&#x2500;&#x2518;  &#x2514;&#x2500;&#x252C;&#x2500;&#x2518;  &#x2514;&#x2500;&#x252C;&#x2500;&#x2518;
   &#x2502;      &#x2502;      &#x2502;
   &#x2502;    &#x250C;&#x2500;V&#x2500;&#x2510;    &#x2502;
   &#x2514;&#x2500;&#x2500;&#x2500;&#x2500;&gt; E &lt;&#x2500;&#x2500;&#x2500;&#x2500;&#x2518;
        &#x2514;&#x2500;&#x2500;&#x2500;&#x2518;
</code></pre>
<p>AFL++ uses the control-flow graphs in the instrumentation phase by injecting the following pseudo-code in each edge:</p>
<pre><code class="language-pseudo">cur_location = &lt;COMPILE_TIME_RANDOM&gt;;
shared_mem[cur_location ^ prev_location]++;
prev_location = cur_location &gt;&gt; 1;
</code></pre>
<p>The <code>cur_location</code> property identifies the current path within the code using a randomly generated value. The <code>shared_mem</code> map counts the number of times a certain path <code>(source, dest)</code> (given by the XOR <code>cur_location ^ prev_location</code>) has been hit. On top of that, the <code>prev_location</code> value is overridden in each iteration with the right bitshift of the <code>cur_location</code> variable. In this way, the injected code keeps track of the ordering: <code>A -&gt; B</code> is different from <code>B -&gt; A</code>.</p>
<p>In this way, AFL++ knows which branch has been hit by a specific input and whether it is worth continuing down that path.</p>
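<p>The pseudo-code above translates almost directly to Go. The block ids and the <code>Coverage</code> type below are illustrative; real AFL++ injects this logic at compile time and uses a fixed-size shared-memory array rather than a map:</p>

```go
package main

import "fmt"

// Coverage sketches AFL++'s edge-counting scheme: every basic block gets
// a compile-time random id, and each transition (prev -> cur) bumps a
// counter at index cur ^ (prev >> 1), so A->B and B->A land in
// different slots.
type Coverage struct {
	sharedMem map[uint32]int
	prev      uint32
}

// Visit records entering the basic block identified by cur.
func (c *Coverage) Visit(cur uint32) {
	c.sharedMem[cur^c.prev]++
	c.prev = cur >> 1
}

func main() {
	// Hypothetical block ids (in AFL++ they are compile-time randoms).
	const A, B = 0x10, 0x20
	cov := Coverage{sharedMem: map[uint32]int{}}
	cov.Visit(A)
	cov.Visit(B) // edge A->B
	cov.Visit(A) // edge B->A
	// A->B and B->A occupy distinct slots thanks to the shifted prev.
	fmt.Println(len(cov.sharedMem))
}
```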
<h3 id="new-behaviours-detection">New behaviours detection</h3>
<p>As mentioned above, the coverage algorithm keeps track of two parameters:</p>
<ul>
<li>The tuples representing paths in the code</li>
<li>The number of times that path has been hit (<code>hit_count</code>)</li>
</ul>
<p>This information is used to trigger extra processing. When an execution contains a new tuple, the input of that execution is marked as interesting. The hit counts for the execution path are bucketed into the following ranges:</p>
<pre><code>1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+ // Ranges of hit count.
</code></pre>
<p>If the algorithm detects a variation between hit count buckets then the input is marked as interesting. This helps to detect the variations in the execution of the instrumented code.</p>
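<p>The bucketing logic can be sketched as follows (the <code>bucket</code> helper illustrates the idea, it is not AFL&apos;s actual code):</p>

```go
package main

import "fmt"

// bucket maps a hit count to one of the coarse buckets
// (1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+); an input is only marked
// interesting when a path's count crosses a bucket boundary.
func bucket(hits int) int {
	switch {
	case hits <= 3:
		return hits
	case hits <= 7:
		return 4
	case hits <= 15:
		return 8
	case hits <= 31:
		return 16
	case hits <= 127:
		return 32
	default:
		return 128
	}
}

func main() {
	fmt.Println(bucket(5) == bucket(6)) // same bucket: not interesting
	fmt.Println(bucket(3) == bucket(4)) // boundary crossed: interesting
}
```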
<p>Even so, it is important to mention that coverage-guided fuzzing does not come for free. Instrumenting the system under test means adding an overhead at runtime. On some occasions, this overhead is not sustainable.</p>
<p>There has been some effort to optimize coverage-guided fuzzing recently. Most of the research focuses on better ways of instrumenting the code<a href="#references">[8]</a> and on more meaningful coverage metrics<a href="#references">[9]</a>.</p>
<h2 id="blackbox-fuzzing-vs-coverage-guided-fuzzing">Blackbox fuzzing vs Coverage-guided fuzzing</h2>
<p>We described two fuzzing techniques: black-box and coverage-guided fuzzing. This section gives some guidelines on which type of fuzzing we should use and when.</p>
<p>In general, coverage-guided fuzzing provides better coverage and is more effective. The downside is you need to instrument your target first. This might work well in self-contained libraries, but it might be hard to set up in more complex codebases.</p>
<p>The advice is to examine the system under test and understand the complexity of the codebase:</p>
<ul>
<li>For simple and independent codebases, such as libraries, I would focus on coverage-guided fuzzing.</li>
<li>For complex projects that are hard to decompose, black-box fuzzing is the quicker approach, and it can still spot weaknesses in the system under test. If you are spending a lot of time figuring out how to instrument your code to run fuzzing, you should probably start with a black-box approach.</li>
</ul>
<h2 id="fuzzing-in-a-distributed-systems-world">Fuzzing in a distributed systems world</h2>
<p>In the distributed systems era, most services talk over the network. Fuzzing a network service is not easy for three main reasons:</p>
<ul>
<li>In a stateful world, the server can be in various states. On top of that, the exchange usually needs to fit a certain transmission protocol.</li>
<li>The network slows down the fuzzing.</li>
<li>It is hard to detect a crash on a system running on a different process.</li>
</ul>
<p>The general approach for fuzzing over the network is to change the source code to read from <code>stdin</code> instead of relying on the network. <a href="#references">[4]</a></p>
<p>For example, <a href="https://github.com/AFLplusplus/AFLplusplus/tree/stable/utils/socket_fuzzing">AFLplusplus/utils/socket_fuzzing</a> is a utility embedded in AFL++ that provides a way to emulate a network socket by sending input through stdin.</p>
<p>If changing the source code is not viable, keep in mind that fuzzing over the network is still an option.</p>
<p><a href="https://github.com/aflnet/aflnet">aflnet/aflnet</a>: AFLNet provides a suite for testing network protocols. It uses a <code>pcap</code> dump of a real client-server exchange to generate valid sequences of requests while it keeps monitoring the state of the server. The input sequence is then mutated in a way that does not interfere with the rules of the protocol.</p>
<p><a href="https://github.com/microsoft/restler-fuzzer">microsoft/RESTler</a> provides a way to test a remote, distributed black box <a href="#references">[6]</a>. Even if it is not possible to instrument the code of a distributed black box, RESTler takes a hybrid approach between black-box and coverage-guided fuzzing by measuring the fuzzing effectiveness through a response error type metric. Every time the distributed black box returns an HTTP error type (HTTP status codes from <code>4xx</code> to <code>5xx</code>), it is considered a crash, and the coverage metrics are triggered.</p>
<p>RESTler is HTTP REST-based. What can we do when the system under test uses a protocol other than HTTP? Structure-aware fuzzing comes in handy when you want to compose a message following some rules.</p>
<h3 id="structure-aware-fuzzing">Structure-aware fuzzing</h3>
<p>Structure-aware fuzzing <a href="#references">[7]</a> is a technique that stands on top of a normal fuzzing process. It can be combined with black-box or coverage-guided fuzzing. Structure-aware fuzzing mutates the corpus following a defined structure.</p>
<p>One of the most common tools that implement structure-aware fuzzing is <a href="https://github.com/google/libprotobuf-mutator">google/libprotobuf-mutator</a>.</p>
<p>This library runs on top of other fuzz engines and mutates the inputs following a protobuf definition. For example, the following <code>proto</code> file defines the structure of a message:</p>
<script src="https://gist.github.com/samueleresca/92698cd5f4efc9252c8102a479c33b46.js"></script>
<p>The <code>Msg</code> above can be reused as input for a <em>fuzz target</em> as shown below:</p>
<script src="https://gist.github.com/samueleresca/75db3e31839eef9913543691c85f77c9.js"></script>
<p>This approach allows defining fuzz inputs based on certain rules, excluding all the cases where the corpus provided by the fuzz engine does not comply with a particular syntax or protocol.</p>
<p>The next section focuses on a concrete example of fuzzing on the <a href="https://github.com/etcd-io/etcd">etcd-io/etcd</a> OSS project.</p>
<h2 id="use-case-etcd-fuzzing">[Use case] etcd fuzzing</h2>
<p>I recently contributed to etcd fuzzing to solve the following issue: <a href="https://github.com/etcd-io/etcd/issues/13617">etcd/issues/13617</a>.<br>
The etcd process for receiving requests has two main steps:</p>
<ol>
<li>The incoming request goes through a validation step (<code>validator</code> step)</li>
<li>The incoming validated request is applied to the server (<code>apply</code> step)</li>
</ol>
<p><em>The goal was to ensure that API validators reject all requests that would cause panic.</em></p>
<p>In practice, below is the approach that has been taken:</p>
<ol>
<li>Generate an input request with fuzzing</li>
<li>Send the request to the validator<br>
2.1 If the validator rejects the request, the test passes &#x2705;<br>
2.2 If the validator does not reject the request, proceed with step 3 &#x2B07;&#xFE0F;</li>
<li>Send the request to the apply function<br>
3.1 If the apply function executes without panicking, the test passes &#x2705;<br>
3.2 If the apply function panics, the test fails &#x1F480;</li>
</ol>
<p>The following PR implements the approach described above.</p>
<p><a href="https://github.com/etcd-io/etcd/pull/14561">Ensure that input validation between API and Apply is consistent #14561</a>.</p>
<p>The implementation uses the new fuzzing toolchain released with Go 1.18; see <a href="https://go.dev/security/fuzz/">go.dev/security/fuzz</a>. Let&apos;s start by taking a look at one of the fuzz tests in the PR:</p>
<script src="https://gist.github.com/samueleresca/7984e9b877fc249bc2d2ceca1803dae8.js"></script>
<p>The above implementation shows a fuzz test case for an etcd <code>RangeRequest</code>. The test uses the fuzz capabilities provided by Golang.</p>
<ul>
<li><code>FuzzTxnRangeRequest</code> defines the test case.</li>
<li><code>f.Add</code> operation adds the seed corpus to the fuzz target.</li>
<li><code>f.Fuzz</code> defines the fuzz target. It takes as arguments the fuzzed parameters, mutated based on the seed corpus.</li>
</ul>
<p>The fuzz test proceeds by calling the <code>checkRangeRequest</code> function, which implements the validation for the request. In case the validation returns an error, the rest of the test is skipped using <code>t.Skip</code>. In case the request is valid, the test proceeds by calling the apply method (the <code>txn.Txn</code> function). The fuzz test fails if the <code>txn.Txn</code> function panics; otherwise, the test is successful.</p>
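<p>The validate-then-apply invariant can be sketched with stand-in functions. The <code>RangeRequest</code>, <code>checkRequest</code>, and <code>apply</code> below are hypothetical simplifications, not etcd&apos;s real implementations:</p>

```go
package main

import "fmt"

// Hypothetical request and validator/apply pair, standing in for etcd's
// checkRangeRequest and Txn. The invariant under test: any request the
// validator accepts must not make apply panic.
type RangeRequest struct{ Key, RangeEnd []byte }

func checkRequest(r RangeRequest) error {
	if len(r.Key) == 0 {
		return fmt.Errorf("key is not provided")
	}
	return nil
}

func apply(r RangeRequest) {
	_ = r.Key[0] // would panic on an empty key that slipped past validation
}

// invariantHolds mirrors the fuzz target's body: reject -> pass,
// accept and no panic -> pass, accept and panic -> fail.
func invariantHolds(r RangeRequest) (ok bool) {
	if err := checkRequest(r); err != nil {
		return true // validator rejected the request: nothing to apply
	}
	defer func() {
		if recover() != nil {
			ok = false // a validated request panicked in apply: a bug
		}
	}()
	apply(r)
	return true
}

func main() {
	fmt.Println(invariantHolds(RangeRequest{}))                 // rejected: passes
	fmt.Println(invariantHolds(RangeRequest{Key: []byte("k")})) // applied without panic: passes
}
```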
<p>The PR also adds a <code>fuzzing.sh</code> script that is used to execute the fuzz tests. The script is then embedded in CI using a GitHub Actions definition. Below is the core command defined in the <code>fuzzing.sh</code> script:</p>
<pre><code class="language-bash">go test -fuzz &quot;${target}&quot; -fuzztime &quot;${fuzz_time}&quot;
</code></pre>
<p>The command executes a specific <code>target</code> (e.g.: <code>FuzzTxnRangeRequest</code>) with a specific <code>fuzztime</code>.</p>
<h2 id="use-case-net-fuzzing-design-proposal">[Use case] .NET fuzzing design proposal</h2>
<p>Another open-source contribution I made recently around fuzzing is the following design proposal for the .NET toolchain <a href="https://github.com/dotnet/designs/pull/273">#273 - Add proposal for out-of-the-box fuzzing</a>. The proposal was highly inspired by the introduction of fuzzing within the Go toolchain. The main reasons that triggered this proposal are:</p>
<ul>
<li>The lack of a uniform and easy way to perform fuzz testing in the .NET ecosystem</li>
<li>Increasing awareness of the fuzzing practice among .NET consumers, and therefore increasing the overall security of .NET applications and libraries</li>
</ul>
<p>The proposal got rejected, but with some good news: the .NET team is already exploring some possible paths forward, and some internal prototyping effort has already been made.</p>
<h2 id="final-thoughts-democratize-fuzz-testing">Final thoughts: Democratize fuzz testing</h2>
<p>A few months back, the Cloud Native Computing Foundation (CNCF) went through some effort in the fuzzing space:</p>
<p><a href="https://www.cncf.io/blog/2022/06/28/improving-security-by-fuzzing-the-cncf-landscape/">Improving Security by Fuzzing the CNCF landscape</a></p>
<p>The CNCF post goes through a quick introduction to fuzzing and the purposes of this technique. Furthermore, the post shows a list of fuzz tests implemented within CNCF projects, such as Kubernetes, etcd, and Envoy.</p>
<p>The post&apos;s security reports highlight the various crashes that have been found in the OSS projects of the CNCF foundation, surfacing the need for these security practices in OSS software.</p>
<p>The .NET fuzzing design proposal shown before is an attempt to democratize fuzz testing in the .NET ecosystem. The easier it is to access this security practice, the higher the chance that fuzzing is adopted in the development process, and thus that the shipped software meets higher security and resilience standards.</p>
<p>Go 1.18 took a similar approach by including the fuzzing capabilities <a href="https://go.dev/security/fuzz/">within the language toolchain</a>. The Go native fuzzing proposal has brought up some interesting discussions available at <a href="https://github.com/golang/go/issues/44551">golang/go/issues/44551</a>.</p>
<p>LLVM took a similar approach a long time ago by including libFuzzer in the toolchain.</p>
<p>Democratizing security practices such as fuzzing within the OSS ecosystem is becoming crucial. Especially with the increasing involvement of OSS projects as part of enterprise solutions.</p>
<h2 id="references">References</h2>
<p>[1]<a href="https://arxiv.org/abs/2203.06910">Qian, R., Zhang, Q., Fang, C., &amp; Guo, L. (2022). Investigating Coverage Guided Fuzzing with Mutation Testing. arXiv preprint arXiv:2203.06910.</a><br>
[2]<a href="https://lcamtuf.coredump.cx/afl/technical_details.txt">Technical &quot;whitepaper&quot; for afl-fuzz</a><br>
[3]<a href="https://github.com/google/fuzzing">google/fuzzing</a><br>
[5]<a href="https://github.com/AFLplusplus/AFLplusplus/blob/stable/docs/best_practices.md">AFLplusplus best_practices</a><br>
[6]<a href="https://patricegodefroid.github.io/public_psfiles/fse2020.pdf">Patrice Godefroid, Bo-Yuan Huang, and Marina Polishchuk. 2020. Intelligent REST API data fuzzing.</a><br>
[7]<a href="https://www.youtube.com/watch?v=S8JvzWDnjc0">Going Beyond Coverage-Guided Fuzzing with Structured Fuzzing - Black Hat</a><br>
[8]<a href="https://ieeexplore.ieee.org/document/8835316">S. Nagy and M. Hicks, &quot;Full-Speed Fuzzing: Reducing Fuzzing Overhead through Coverage-Guided Tracing,&quot; 2019 IEEE Symposium on Security and Privacy (SP), 2019, pp. 787-802, doi: 10.1109/SP.2019.00069.</a><br>
[9]<a href="https://ieeexplore.ieee.org/document/9230360">L. Simon and A. Verma, &quot;Improving Fuzzing through Controlled Compilation,&quot; 2020 IEEE European Symposium on Security and Privacy (EuroS&amp;P), 2020, pp. 34-52, doi: 10.1109/EuroSP48549.2020.00011.</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[A practical approach to read write quorum systems [Part 2]]]></title><description><![CDATA[I published the post "A practical approach to read-write quorum systems" a few months ago. The post refers to the paper: Read-Write Quorum Systems Made Practical, and it goes through a concrete implementation of the tool called "Quoracle". I have decided to rewrite the tool in Golang to explore ]]></description><link>https://samueleresca.net/a-practical-approach-to-read-write-quorum-systems-part-2/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac191</guid><category><![CDATA[Distributed Systems]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Tue, 28 Dec 2021 15:42:21 GMT</pubDate><media:content url="https://samueleresca.net/content/images/2021/12/dublin-background-1.webp" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-emoji">&#x2139;&#xFE0F;</div><div class="kg-callout-text">The post is a continuation of <a href="https://samueleresca.net/practical-approach-to-read-write-quorum-systems/">A practical approach to read-write quorum systems</a>.</div></div><img src="https://samueleresca.net/content/images/2021/12/dublin-background-1.webp" alt="A practical approach to read write quorum systems [Part 2]"><p></p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x1F4A1;</div><div class="kg-callout-text">The code is available at <a href="https://github.com/samueleresca/quoracle-go">https://github.com/samueleresca/quoracle-go</a>.</div></div><p></p><!--kg-card-begin: markdown--><p>I published the post <a href="https://samueleresca.net/practical-approach-to-read-write-quorum-systems/">&quot;A practical approach to read-write quorum systems&quot;</a> a few months ago. 
The post refers to the paper <a href="https://mwhittaker.github.io/publications/quoracle.pdf">Read-Write Quorum Systems Made Practical - Michael Whittaker, Aleksey Charapko, Joseph M. Hellerstein, Heidi Howard, Ion Stoica</a>. It illustrates the implementation of &quot;Quoracle&quot;. Quoracle provides the optimal quorums with respect to either the <code>load</code>, <code>network</code> or <code>latency</code>.</p>
<p>I have decided to rewrite the tool in Golang to explore the ecosystem and the tooling of the language. This article goes through the Golang implementation of the <a href="https://github.com/mwhittaker/quoracle">original Python library</a>.</p>
<h2 id="quorum-expressions-definition">Quorum expressions definition</h2>
<p>First of all, let&apos;s discuss the implementation of the expressions. The original paper uses expressions and nodes to describe quorums:</p>
<pre><code>a, b, c = Node(&quot;a&quot;), Node(&quot;b&quot;), Node(&quot;c&quot;)
majority = QuorumSystem(reads=a*b + b*c + a*c)
</code></pre>
<p>The example above builds the majority quorums using the following pairs: <code>[a,b], [b,c], [a,c]</code>. Python can represent the expression above by overloading the <code>*</code> and the <code>+</code> operations. The <a href="https://github.com/mwhittaker/quoracle/blob/ee7400a8a992b3f9f8c17822b5bab15d537e7b46/quoracle/expr.py#L31">original quoracle library</a> uses the operations overload approach.</p>
<p>Go advocates simplicity, and <a href="https://golang.org/doc/faq#overloading">it does not embrace operator overloading</a>. So, it is necessary to proceed with a different approach.</p>
<p>In Golang, these methods describe the operations:</p>
<script src="https://gist.github.com/samueleresca/01c87c37464546766c718e786b322e93.js"></script>
<p>The <code>ExprOperator</code> interface defines the operations between two logical expressions. A logical expression between nodes represents many quorums.</p>
<p>Thus, it is possible to describe quorum as follows:</p>
<pre><code>a, b, c :=
NewNode(&quot;a&quot;), NewNode(&quot;b&quot;), NewNode(&quot;c&quot;)

// (a * b) + (b * c) + (a * c)
majority := NewQuorumSystemWithReads(a.Multiply(b).Add(b.Multiply(c)).Add(a.Multiply(c)))
</code></pre>
<p>The <code>ExprOperator</code> interface provides the same functionalities as the original library. In the example above, the pairs: <code>[a,b], [b,c], [a,c]</code> are majority quorums. The next section goes through the Golang definition of a <code>QuorumSystem</code> and how to use it.</p>
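<p>To make the method-chaining approach concrete, below is a minimal, self-contained sketch of an expression tree built through <code>Multiply</code> and <code>Add</code>. The types and the <code>String</code> rendering are hypothetical, not the library&apos;s actual implementation:</p>

```go
package main

import "fmt"

// expr is a tiny expression tree: a leaf node, or an AND ("*") / OR ("+")
// composition of two sub-expressions. Hypothetical types, not the library's API.
type expr struct {
	op    string // "node", "and", or "or"
	name  string
	parts []expr
}

func node(name string) expr { return expr{op: "node", name: name} }

// Multiply composes two expressions with logical AND (the paper's '*').
func (e expr) Multiply(o expr) expr { return expr{op: "and", parts: []expr{e, o}} }

// Add composes two expressions with logical OR (the paper's '+').
func (e expr) Add(o expr) expr { return expr{op: "or", parts: []expr{e, o}} }

func (e expr) String() string {
	switch e.op {
	case "node":
		return e.name
	case "and":
		return "(" + e.parts[0].String() + " * " + e.parts[1].String() + ")"
	default:
		return "(" + e.parts[0].String() + " + " + e.parts[1].String() + ")"
	}
}

func main() {
	a, b, c := node("a"), node("b"), node("c")
	majority := a.Multiply(b).Add(b.Multiply(c)).Add(a.Multiply(c))
	fmt.Println(majority) // (((a * b) + (b * c)) + (a * c))
}
```

<p>Printing the expression shows the same structure that operator overloading produces in Python, with the nesting made explicit by the chained calls.</p>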
<h2 id="quorum-system-overview-definition">Quorum system overview definition</h2>
<p>Now that we know how to define an <code>Expr</code> of nodes, we can declare a read-write quorum system. The below implementation describes the <code>QuorumSystem</code> struct used in the library.</p>
<script src="https://gist.github.com/samueleresca/621ac3cfc41482447edb8d004ac71dc1.js"></script>
<p>The <code>reads</code> and <code>writes</code> fields represent the quorums. The <code>nameToNode</code> field keeps track of the different nodes in the quorum system. The <code>QuorumSystem</code> struct has the methods for calculating the quorum system&apos;s capacity, latency, load, and network load.</p>
<p>The <code>StrategyOptions</code> parameter struct represents the configurations for the strategy optimisation:</p>
<script src="https://gist.github.com/samueleresca/840a180bc9ab4735dfb20fb274654b10.js"></script>
<p>The <code>Optimize</code> property points to the optimisation target. The <code>LoadLimit</code>, <code>NetworkLimit</code>, and <code>LatencyLimit</code> fields define optional limits on the load, the network, and the latency. The <code>ReadFraction</code> and the <code>WriteFraction</code> determine the workload distribution of the read and write operations.<br>
The <code>F</code> field represents the resilience of the quorum: a quorum <em>r</em> is <em>f</em>-resilient for some integer <em>f</em> if, despite removing any <em>f</em> nodes from <em>r</em>, <em>r</em> is still a read/write quorum<a href="#references">[1]</a>.</p>
<p>The library defines the initialization functions for a new <code>QuorumSystem</code>:</p>
<script src="https://gist.github.com/samueleresca/d8d55cb34f10c8c143c0c9fccb1e0c54.js"></script>
<p>The above code omits some method implementations for brevity. If the caller provides only the read or only the write quorums, the constructor computes the logical dual of the given quorums to derive the missing ones and initialises a new <code>QuorumSystem</code> struct. If the caller provides both read and write quorums, the constructor checks the validity of the quorums and returns a new <code>QuorumSystem</code> struct with the corresponding quorums.</p>
<p>The following section shows how to translate the optimal strategy problem into a linear programming problem.</p>
<h2 id="optimal-strategy-problem-definition">Optimal strategy problem definition</h2>
<p>The original Python implementation of quoracle uses the PuLP library and <a href="https://github.com/coin-or">coin-or</a>. The <a href="https://samueleresca.net/practical-approach-to-read-write-quorum-systems/">previous blog post</a> looked at how to use PuLP for optimisation problems in a Python runtime.</p>
<p>The Golang implementation uses a library called <a href="https://github.com/lanl/clp">lanl/clp</a>, which also relies on <a href="https://github.com/coin-or">coin-or</a> to solve linear programming optimisation problems.</p>
<p>The codebase defines a helper struct to build a linear programming problem:</p>
<script src="https://gist.github.com/samueleresca/68189f57c43c32e0a74c089c3d737bd2.js"></script>
<p>A <code>lpDefinition</code> struct contains the variables needed to describe a linear programming problem.<br>
Let&apos;s suppose that we have three six-faced dice, and no two dice are allowed to show the same value. The goal is to make the difference between the 1st and 2nd-largest dice smaller than the difference between the 2nd and the 3rd. The following <code>lpDefinition</code> represents the problem:</p>
<script src="https://gist.github.com/samueleresca/01ee4261c772f1330bc27a469c62b1f6.js"></script>
<p>The <code>lpDefinition</code> above stores a value in the <code>Vars</code> array for each die. The <code>Constraints</code> matrix represents the die range constraint, from 1 to 6. The <code>Objective</code> matrix contains the two goals of the problem:</p>
<ol>
<li>Each die must be different from the others (lines 10 and 11);</li>
<li>The difference between the 1st and 2nd-largest dice must be smaller than the one between the 2nd and the 3rd.</li>
</ol>
<p>The example above shows how to use the helper struct in the codebase to build a minimisation problem. Next, we will take a detailed look at the implementation for optimising the metrics. Depending on the optimisation target, the problem definition builds a different <code>lpDefinition</code> struct.</p>
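<p>For illustration, the dice problem could be assembled into an <code>lpDefinition</code>-like container as follows. The field shapes and coefficient rows below are assumptions made for this sketch, not the exact layout used by the codebase:</p>

```go
package main

import "fmt"

// lpDefinition sketches a container for an LP problem: one entry in Vars per
// variable, a [lower, upper] bound per variable, and rows of coefficients.
// Field shapes are assumptions, loosely following the post's description.
type lpDefinition struct {
	Vars        []float64
	Constraints [][2]float64
	Objectives  [][]float64
}

func newDiceProblem() lpDefinition {
	var def lpDefinition
	for i := 0; i < 3; i++ {
		def.Vars = append(def.Vars, 1.0)                            // one variable per die
		def.Constraints = append(def.Constraints, [2]float64{1, 6}) // each die in 1..6
	}
	// d1 - d2 >= 1 and d2 - d3 >= 1 keep the dice distinct (and ordered);
	// minimising (d1 - d2) - (d2 - d3) = d1 - 2*d2 + d3 drives the first gap
	// below the second one.
	def.Objectives = append(def.Objectives, []float64{1, -1, 0})
	def.Objectives = append(def.Objectives, []float64{0, 1, -1})
	def.Objectives = append(def.Objectives, []float64{1, -2, 1})
	return def
}

func main() {
	def := newDiceProblem()
	fmt.Println(len(def.Vars), len(def.Constraints), len(def.Objectives)) // 3 3 3
}
```
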
<h3 id="load-optimization-definition">Load optimization definition</h3>
<p>Let&apos;s start by describing how the load optimization problem is implemented. To recap, the formula for the load defined in the paper<a href="#references">[1]</a> is:</p>
<p>$$ \frac{f_r}{cap_R(x)} \sum_{{r \in R | x \in r }} p_r +  \frac{1 - f_r}{cap_W(x)}  \sum_{{w \in W | x \in w }}  p_w \leq L_{f_r} $$</p>
<p>The following snippet of code implements the above formula, and it builds the LP problem:</p>
<script src="https://gist.github.com/samueleresca/7f1ee46ca07f34d196e38cff614a8735.js"></script>
<p>The snippet omits some code for brevity. The <code>buildLoadDef</code> local function encapsulates the logic for building the load optimisation problem.<br>
The function initialises the <code>Vars</code> and the <code>Constraints</code> from the <code>lpVariable</code>s. Next, it adds the <code>l</code> variable and its constraint, representing the load \(L_{f_r}\) for the specific read fraction. It then builds the load formula for every <code>lpVariable</code> in the problem. For each <code>Node</code> in the quorum system, it applies the following expression in the case of a read quorum:</p>
<pre><code>tmp[v.Index] += fr * v.Value / float64(*qs.GetNodeByName(n.Name).ReadCapacity)
</code></pre>
<p>otherwise, in the case of a write quorum, it proceeds by using:</p>
<pre><code>tmp[v.Index] += (1 - fr) * v.Value / float64(*qs.GetNodeByName(n.Name).WriteCapacity)
</code></pre>
<p>The <code>ReadCapacity</code> and the <code>WriteCapacity</code> are configurable for each <code>Node</code>.<br>
The code needs to maintain the same order in the <code>Vars</code> and the <code>Objectives</code> arrays. Thus, each <code>lpVariable</code> uses an <code>Index</code> to refer to the exact position of each element in the arrays.</p>
<h3 id="network-load-optimization-definition">Network load optimization definition</h3>
<p>This section describes the implementation of the network load. Let&apos;s start by refreshing the formula<a href="#references">[1]</a>:</p>
<p>$$ f_r ( \sum_{r \in R} p_r  \cdot |r|) + (1 - f_r) ( \sum_{w \in W} p_w  \cdot |w|) $$</p>
<p>\(|r|\) and \(|w|\) are the sizes of the read and write quorum sets. The library builds the network load minimization problem as follows:</p>
<script src="https://gist.github.com/samueleresca/5f404cd9d90d449a6f80ecaafa42fb35.js"></script>
<p>The above code initialises a new <code>Vars</code> and the <code>Constraints</code> fields for each quorum. Then, it applies the network load formula by multiplying the length of the quorum with <code>fr</code>. Also, the implementation adds a row in the <code>Objectives</code> matrix in case we specify a network limit.</p>
<h3 id="latency-optimization-definition">Latency optimization definition</h3>
<p>The last optimization target is the latency. The formula described in the paper defines the latency as:</p>
<p>$$ f_r ( \sum_{r \in R} p_r \cdot latency(r)) + (1 - f_r) ( \sum_{w \in W} p_w \cdot latency(w)) $$</p>
<p>Below is the optimization definition of the latency:</p>
<script src="https://gist.github.com/samueleresca/3a6a9f0a077ff699bf7afd996db28cd0.js"></script>
<p>The implementation creates a new <code>lpDefinition</code>, populating the <code>Vars</code> and the <code>Constraints</code>. Then, it retrieves the latency for each of the <code>readQuorumVars</code> and <code>writeQuorumVars</code>. The latency of a quorum is the shortest time required to form a quorum after contacting the nodes in that quorum. The code calculates <code>l</code> using the <code>readQuorumLatency</code> and the <code>writeQuorumLatency</code> methods.</p>
<p>Next, it continues by applying the formula of the latency for the read quorums:</p>
<pre><code>obj[v.Index] = fr * v.Value * float64(l)
</code></pre>
<p>the code takes the same approach for the write quorums using the opposite workload:</p>
<pre><code>obj[v.Index] = (1 - fr) * v.Value * float64(l)
</code></pre>
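<p>Note that both the network-load and the latency formulas are expectations of a per-quorum weight under the strategy. A small sketch (hypothetical types and made-up numbers) that evaluates both from the same generic helper:</p>

```go
package main

import "fmt"

// quorum pairs a per-quorum weight source with the probability σ assigns to it.
type quorum struct {
	size    int     // |q|, used by the network-load formula
	latency float64 // latency(q) in seconds, used by the latency formula
	p       float64 // probability the strategy assigns to q
}

// expectation computes fr * Σ p_r·weight(r) + (1-fr) * Σ p_w·weight(w),
// the common shape of the network-load and latency definitions.
func expectation(fr float64, reads, writes []quorum, weight func(quorum) float64) float64 {
	sum := 0.0
	for _, r := range reads {
		sum += fr * r.p * weight(r)
	}
	for _, w := range writes {
		sum += (1 - fr) * w.p * weight(w)
	}
	return sum
}

func main() {
	reads := []quorum{{size: 2, latency: 1, p: 0.5}, {size: 2, latency: 2, p: 0.5}}
	writes := []quorum{{size: 3, latency: 2, p: 1.0}}
	fr := 0.8
	net := expectation(fr, reads, writes, func(q quorum) float64 { return float64(q.size) })
	lat := expectation(fr, reads, writes, func(q quorum) float64 { return q.latency })
	fmt.Println(net, lat)
}
```
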
<p>The following section describes how to translate the optimisation result into a new <code>Strategy</code>. Also, it shows how to execute the LP optimisation using the definitions seen in this section.</p>
<h2 id="strategy-initialisation">Strategy initialisation</h2>
<p>The previous section described how to build the problem definition. Now we can proceed by executing the optimisation. The snippet of code below describes the optimisation execution and the initialisation of the strategy:</p>
<script src="https://gist.github.com/samueleresca/213f057f4d2588565adeef22a1a7ad12.js"></script>
<p>As a first step, the code initialises a <code>NewSimplex</code>. The simplex algorithm is a popular linear programming algorithm. We want to minimise the objective of our problem, so the code sets the optimisation direction to <code>clp.Minimise</code>.</p>
<p>The code proceeds by creating a new <code>lpDefinition</code> based on the optimisation definitions seen in the previous section. For example, if the optimisation target is <code>Network</code>, the code calls the <code>buildNetworkDef</code> function.</p>
<p>On top of the optimisation target, the code needs to add another objective: the total sum of the read and write probabilities must be 1. The <code>getTotalProbabilityObjectives</code> method takes care of that: it returns a new objective array constraining the read and write probabilities to sum to <code>1</code>.</p>
<p>The code executes the optimisation and checks that the resulting <code>status</code> is optimal. If the operation succeeds, the code gets back the optimal solution. Then, for each quorum, it initialises a new <code>SigmaRecord</code> with the quorum and its probability of being selected.</p>
<p>Finally, it creates a new <code>Strategy</code> with the <code>SigmaR</code> (array of <code>SigmaRecord</code> for the read quorums) and the <code>SigmaW</code> (array of <code>SigmaRecord</code> for the write quorums).</p>
<p>Let&apos;s refresh the definition of strategy as mentioned in the paper:</p>
<p>$$ \sigma = (\sigma_R, \sigma_W) $$</p>
<p>\(\sigma_R\) and \(\sigma_W\) are probability distributions: \(\sigma_R(r)\) and \(\sigma_W(w)\) are respectively the probabilities that the strategy chooses a <em>read quorum</em> \(r\) and a <em>write quorum</em> \(w\).</p>
<p>quoracle-go represents a <code>Strategy</code> in a similar way using the following structs:</p>
<script src="https://gist.github.com/samueleresca/2a4a962ca28de13e7d31ff3ea8434ce7.js"></script>
<p>The <code>SigmaR</code> and <code>SigmaW</code> fields correspond to the strategy&apos;s probabilities of choosing a specific read or write quorum. Also, the <code>Strategy</code> struct maintains a hashmap storing each node and its likelihood of being selected.</p>
<p>The <code>Strategy</code> struct exposes some methods that retrieve some metrics and information. Below is the list of methods provided with the <code>Strategy</code> struct:</p>
<script src="https://gist.github.com/samueleresca/ee56fbfb9c839e85cb6e744ffb326b91.js"></script>
<p>The code above omits the implementation of the functions for brevity. The <code>GetReadQuorum</code> and <code>GetWriteQuorum</code> methods use a probability distribution to return the quorums. The <code>Load</code>, <code>Capacity</code>, <code>NetworkLoad</code> and <code>Latency</code> methods return the respective metrics for a given read or write workload. The <code>NodeLoad</code>, <code>NodeUtilization</code> and <code>NodeThroughput</code> methods target a specific <code>Node</code> in the quorum system. The implementations use the probability of a node being selected to calculate the respective node metric.</p>
<h2 id="searching-for-the-optimal-quorum-strategy">Searching for the optimal quorum strategy</h2>
<p>We have seen how the codebase leverages linear programming to find the optimal strategy.</p>
<p>Let&apos;s reiterate one of the primary purposes of quoracle. Given the node&apos;s details, an optimisation target and workload distribution, it returns the optimised strategy.</p>
<p>The <code>Search</code> method implements the rule mentioned above. It calculates the optimal strategy by trying all the combinations of quorums. Whenever the optimal strategy for a given valid quorum returns a better target metric, the <code>Search</code> function saves that.</p>
<p>Below is the code implementation of the <code>Search</code> function:</p>
<script src="https://gist.github.com/samueleresca/2d1c84c9ea40ec98b1bfd94e441c650d.js"></script>
<p>The <code>dupFreeExprs</code> and the <code>doSearch</code> functions encapsulate the core logic.<br>
The <code>dupFreeExprs</code> returns all the possible combinations of quorums composed using a list of nodes.<br>
The <code>doSearch</code> uses the <code>dupFreeExprs</code> outcome to initialise a new quorum system and find an optimal strategy. The implementation keeps track of the <code>Strategy</code> and <code>QuorumSystem</code> with the most optimised metric.</p>
<p>The search process is time-bound: when the operation reaches a specified timeout, the search stops. The timeout prevents long-hanging search processes.</p>
<h2 id="wrap-up">Wrap up</h2>
<p>This post went through the Golang port of quoracle, describing the main components implemented in the codebase. The code is available at <a href="https://github.com/samueleresca/quoracle-go">samueleresca/quoracle-go</a>.<br>
The project had two primary purposes. First, to put into practice the core concepts described in the paper<a href="#references">[1]</a>. Secondly, to explore the Golang ecosystem and tooling.<br>
Some of the concepts might seem very theoretical, but it is essential to know the basics. Quorums are the foundation of distributed systems topics such as replication and consensus.</p>
<h2 id="references">References</h2>
<p>[1]<a href="https://mwhittaker.github.io/publications/quoracle.pdf">Read-Write Quorum Systems Made Practical - Michael Whittaker, Aleksey Charapko, Joseph M. Hellerstein, Heidi Howard, Ion Stoica</a><br>
[2]<a href="https://github.com/lanl/clp">lanl/clp provides a Go interface to the COIN-OR Linear Programming (CLP) library</a><br>
[3]<a href="https://github.com/mwhittaker/quoracle">github.com/mwhittaker/quoracle</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[A practical approach to read-write quorum systems]]></title><description><![CDATA[Quorum systems allow consistency of replicated data; every time a group of servers needs to agree on something, a quorum is involved in the decisions. An example could be the leaderless databases, such as Dynamo. Read-write quorums define two configurable values, R and W. ]]></description><link>https://samueleresca.net/practical-approach-to-read-write-quorum-systems/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac190</guid><category><![CDATA[Distributed Systems]]></category><category><![CDATA[Quorum Systems]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Mon, 14 Jun 2021 11:51:41 GMT</pubDate><media:content url="https://samueleresca.net/content/images/2021/06/background_quorum_systems.jpg" medium="image"/><content:encoded><![CDATA[<div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-emoji">&#x2139;&#xFE0F;</div><div class="kg-callout-text">The 2nd part of this post is available: <a href="https://samueleresca.net/a-practical-approach-to-read-write-quorum-systems-part-2/">[Part 2] A practical approach to read-write quorum systems</a>.</div></div><img src="https://samueleresca.net/content/images/2021/06/background_quorum_systems.jpg" alt="A practical approach to read-write quorum systems"><p></p><!--kg-card-begin: markdown--><p>Quorum systems allow consistency of replicated data; every time a group of servers needs to agree on something, a quorum is involved in the decisions.</p>
<p>For example, leaderless databases such as Dynamo (not to be confused with DynamoDB) use a consistency protocol based on quorums<a href="#references">[1]</a>. Furthermore, most consensus algorithms, such as Zab, Raft, and Paxos, are based on quorums<a href="#references">[2]</a>.</p>
<p>Read-write quorums define two configurable parameters, <code>R</code> and <code>W</code>.<br>
<code>R</code> is the minimum number of nodes that must participate in a <em>read operation</em>, and <code>W</code> the minimum number of nodes that must participate in a <em>write operation</em>.</p>
<p>A few weeks ago, I came across this tweet:</p>
<!--kg-card-end: markdown--><!--kg-card-begin: html--><center><blockquote class="twitter-tweet" data-theme="dark"><p lang="en" dir="ltr">Majority quorums are pervasive in strongly consistent distributed systems yet there are many alternatives with better scalability and performance. Quoracle is a new open-source tool to find an optimal quorum system for a given distributed architecture. <a href="https://t.co/d4UpFNdQuf">https://t.co/d4UpFNdQuf</a> <a href="https://t.co/V2SaXYmkoP">pic.twitter.com/V2SaXYmkoP</a></p>&#x2014; Heidi Howard (@heidiann360) <a href="https://twitter.com/heidiann360/status/1381962220100222976?ref_src=twsrc%5Etfw">April 13, 2021</a></blockquote> </center> <script defer src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
<br><!--kg-card-end: html--><!--kg-card-begin: markdown--><p>The tweet refers to the following publication:</p>
<p><a href="https://mwhittaker.github.io/publications/quoracle.pdf">Read-Write Quorum Systems Made Practical - Michael Whittaker, Aleksey Charapko, Joseph M. Hellerstein, Heidi Howard, Ion Stoica</a></p>
<p>The paper reviews some concepts of the quorum systems, and it presents a concrete tool named &quot;Quoracle&quot; that explores the trade-offs of the read-write quorum systems.</p>
<p>Quoracle provides an alternative to the majority quorum systems that are widely adopted in distributed systems. The majority quorum can be defined as</p>
<p>$$ \frac{n}{2} + 1 \space where \space n = n. \space of \space nodes $$</p>
<p>In the case of a read-write quorum system, the majority is represented in a similar way:</p>
<p>$$ r = w = \frac{n}{2} + 1 $$</p>
<p>where \(r\) and \(w\) are the read and write quorums.</p>
<p>Below there are some notes I took while I was reading the paper and a detailed look at the proposed implementation of Quoracle. The topics discussed are listed below:</p>
<ul>
<li><a href="#quorum-system-definitions">Quorum system definitions</a></li>
<li><a href="#quoracle-overview">Quoracle overview</a></li>
<li><a href="#optimal-strategy-problem-implementation">Optimal strategy problem implementation</a></li>
<li><a href="#strategy-representation">Strategy representation</a></li>
<li><a href="#handling-quorum-failures">Handling quorum failures</a></li>
<li><a href="#optimal-strategy-search">Optimal strategy search</a></li>
</ul>
<h2 id="quorum-system-definitions">Quorum system definitions</h2>
<p>The paper resumes the definitions and the characteristics of the read-write quorum systems. These concepts are helpful to understand the theory of quorum systems and the implementation of Quoracle.</p>
<p>Given a list of nodes represented as follow:</p>
<p>$$ X = {{x_1,..,x_n}} $$</p>
<p>A <em>read-write quorum system</em> is defined as \(Q = (R, W)\) where \(R\) represents the read quorums, and \(W\) represents the write quorums. \(R\) and \(W\) are sets of subsets of the list of nodes defined in \(X\).</p>
<p>An additional constraint asserts that every read quorum and every write quorum (\(r\) and \(w\)) must intersect. This is represented with the following definition:</p>
<p>$$ \forall r \in R, \space \forall w \in W: \space r \cap w \neq \emptyset $$</p>
<p>It is possible to illustrate a quorum system with a grid as follows:</p>
<p><img src="https://samueleresca.net/content/images/2021/04/Screenshot-2021-04-23-at-18.09.51.png" alt="A practical approach to read-write quorum systems" loading="lazy"></p>
<p>The grid above describes a \( Q_{2x3} \) quorum system defined as \( Q_{2x3} = (abc + def, ad + be + cf)\).</p>
<p>Note that multiplication represents the nodes that are part of the same set while the addition separates the groups. This approach will be helpful to define read and write sets in Quoracle.</p>
<h3 id="fault-tolerance">Fault tolerance</h3>
<p>The paper describes the <em>read fault tolerance</em> of a quorum system as the largest number of nodes, called \(f\), that can fail while a read quorum is still available. The same definition applies to the <em>write fault tolerance</em>.</p>
<p>The <em>fault tolerance</em> of a quorum system is the minimum between the <em>read fault tolerance</em> and the <em>write fault tolerance</em>. For example, the \(Q_{2x3}\) quorum system defined above has a <em>read fault tolerance</em> of 1: any single node failure still leaves one read quorum (one row) available, but failing one node in each row makes both read quorums unavailable. On the other hand, the <em>write fault tolerance</em> is 2: any two node failures leave at least one write quorum (one column) intact, while three failures can take down all of them.</p>
<p>In this case, the <em>fault tolerance</em> of the quorum system is equal to 1.</p>
<h3 id="strategy">Strategy</h3>
<p>The paper describes the concept of <em>strategy</em> of a read-write quorum system: the strategy of a quorum system, represented in the paper as \(\sigma\), decides which quorum to contact.</p>
<p>The strategy of a read-write quorum system is represented as:</p>
<p>$$ \sigma = (\sigma_R, \sigma_W) $$</p>
<p>where \(\sigma_R\) and \(\sigma_W\) are probability distributions: \(\sigma_R(r)\) and \(\sigma_W(w)\) are respectively the probabilities that the strategy chooses a <em>read quorum</em> \(r\) and a <em>write quorum</em> \(w\).</p>
<h3 id="load-capacity-and-the-read-fraction">Load, capacity and the read fraction</h3>
<p>The <em>read load of a node</em> \(x\), represented as<br>
\(load_{\sigma_R}(x)\), is the probability that \(\sigma_R\) chooses a read quorum \(r\) that contains the node \(x\). The same definition applies to \(load_{\sigma_W}(x)\).</p>
<p>For calculating the <em>total load</em> of a node \(x\), we need to introduce another parameter: the <em>read fraction</em>.</p>
<p>The <em>read fraction</em>, represented as \(f_r\), is the percentage of read vs. write operations in a workload. For example, a workload with 50% read operations and 50% write operations has a read fraction \( f_r = 0.5 \).</p>
<p>Therefore, the <em>total load of x</em> can be represented by the following definition:<br>
$$ load(x) = f_rload_{\sigma_R}(x) + (1 - f_r)load_{\sigma_W}(x) $$</p>
<p>The <em>load of the quorum system</em> is the load of the optimal strategy, defined as the strategy that brings the lowest load.</p>
<p>The capacity can be represented as the inverse of the load:</p>
<p>$$ C = \frac{1}{L} \hspace{0.5cm} where \hspace{0.5cm} L = load $$</p>
<p>The higher the load on the quorums, the lower the capacity of the quorum system.</p>
<p>In conclusion, these are some of the definitions illustrated in the paper. This section gives a good overview of the possible parameters that can influence the performance and the efficiency of read-write quorum systems. The following excerpt is more practical, and it shows the application of these definitions on the implementation of Quoracle.</p>
<h2 id="quoracle-overview">Quoracle overview</h2>
<p>The concepts introduced in the previous section help understand the implementation of the quoracle library described in the paper, available on GitHub: <a href="https://github.com/mwhittaker/quoracle">mwhittaker/quoracle</a>.</p>
<p>Note that the quoracle lib is distributed on PyPI, and it can be installed in a project using the following command:</p>
<pre><code>pip install quoracle
</code></pre>
<p>The paper provides an initial example similar to this:</p>
<script src="https://gist.github.com/samueleresca/8aa60efe4bb69683ff62aed5d5787224.js"></script>
<p>The snippet above declares nodes using the <code>Node</code> class and the corresponding quorum system defined by the <code>QuorumSystem</code> class. The <code>Node</code> class is used to describe the nodes that are part of the \(X\) set:</p>
<script src="https://gist.github.com/samueleresca/96c54636ecfca06e5bb551d8c03f6c8e.js"></script>
<p>Each <code>Node</code> instance has a <code>name</code>, a <code>read_capacity</code>, a <code>write_capacity</code>, and a <code>latency</code> attribute.</p>
<p>The <code>read_capacity</code>, <code>write_capacity</code> and <code>latency</code> attributes represent heterogeneous nodes.</p>
<p>When a read/write capacity is specified for a node, it is used to normalise the node load during the linear optimisation; we will go through the detailed process later in this post.</p>
<p>By default, the <code>latency</code> of a Node is 1s. The attribute is used in the linear optimisation process for calculating the latency of the quorum systems.</p>
<p>The multiplication(<code>*</code>) and sum operation(<code>+</code>) are defined by the <code>Expr</code> class that is extended by the <code>Node</code> class:</p>
<script src="https://gist.github.com/samueleresca/cfab5d1f200148a1e292a595f5779595.js"></script>
<p>Each <code>Node</code> instance extends the <code>Expr</code> type, this approach allows to compose expressions of nodes, such as the following quorum definition:</p>
<pre><code class="language-python">&#xA0; a, b, c = Node(&apos;a&apos;), Node(&apos;b&apos;), Node(&apos;c&apos;)
&#xA0; q = QuorumSystem(reads=a * b + b * c + a * c)
</code></pre>
<p>Another key component is the <code>QuourmSystem</code> class: it encapsulates the read and write quorums of the system, and it defines some methods for calculating the <em>optimal strategy</em>: the <em>load</em>, the <em>latency</em>, and the <em>network load</em>.</p>
<p>The snippet below defines the <code>QuorumSystem</code> class implemented by the library and the signature of the key methods exposed by the class:</p>
<script src="https://gist.github.com/samueleresca/bb6fd0669e860c780379420979375a08.js"></script>
<p>The snippet above defines four methods for the <code>QuorumSystem</code> class: <code>load</code>, <code>capacity</code>, <code>network_load</code>, and <code>latency</code>. We will have a detailed look at their implementation later in this section. All these methods have a similar signature: they require an <code>optimize</code> parameter representing the objective of the optimisation, either the <code>read_fraction</code> or the <code>write_fraction</code> parameter (already described <a href="#quorum-system-definitions">in the theory section</a>), and some additional constraints used by the optimisation of the quorum strategy, such as the <code>load_limit</code>, <code>network_limit</code>, and <code>latency_limit</code> parameters.</p>
<p>As you may have noticed, the initialisation of the quorum system only provides the following read quorums:</p>
<pre><code>ab + bc + ac
</code></pre>
<p>In this case, the library will derive the write quorums by applying a <code>dual</code> operation to the given quorum. For example, applying the <code>dual</code> operation to the <code>ab + bc + ac</code> quorum yields:</p>
<pre><code>(a + b) * (b + c) * (a + c)
</code></pre>
<p>Once the <code>QuorumSystem</code> is initialised, it is possible to calculate the <code>load</code>, the <code>latency</code>, and the <code>network_load</code> of the system.</p>
<h2 id="optimal-strategy-problem-implementation">Optimal strategy problem implementation</h2>
<p>This section digs into the optimal strategy implementation. The <code>load</code>, <code>capacity</code>, <code>network_load</code> and <code>latency</code> methods in the <code>QuorumSystem</code> class use the concept of <em>optimal strategy</em>, referred to in the paper as \( \sigma^{\ast} = (\sigma_R^{\ast}, \sigma_W^{\ast}) \).</p>
<p>A strategy can be optimal in respect of the load, the network, or the latency. The objective of the optimisation can be configured using the <code>optimize</code> parameter.</p>
<p>The internal implementation turns the optimisation into a linear programming problem, and it relies on <a href="https://coin-or.github.io/pulp/">PuLP</a> to find the optimal strategy.</p>
<p>In the following sections, we discover the different optimisation objectives provided by Quoracle.</p>
<h3 id="load-optimization">Load optimization</h3>
<p>If <code>optimize == LOAD</code>, then the objective of the linear optimisation is the load. The implementation of the load optimization works as follows:</p>
<script src="https://gist.github.com/samueleresca/800044104b218afda46951e804be23fd.js"></script>
<p>The implementation minimizes the load using the following definition:</p>
<p>$$ \frac{f_r}{cap_R(x)} \sum_{{r \in R | x \in r }} p_r + &#xA0;\frac{1 - f_r}{cap_W(x)} &#xA0;\sum_{{w \in W | x \in w }} &#xA0;p_w \leq L_{f_r} $$</p>
<p>For each <code>read_fraction</code> (\(f_r\)) in the <code>QuorumSystem</code>, the <code>load</code> method proceeds by computing the load by calling <code>fr_load</code> method.</p>
<p>As a first step, the <code>fr_load</code> method defines the \(L_{f_r}\) variable, representing the load for that read fraction.<br>
It continues by adding the read quorum part of the definition above:</p>
<pre><code>x_load += fr * sum(vs) / self.node(x).read_capacity
</code></pre>
<p>the same for the write quorum part:</p>
<pre><code>x_load += (1 - fr) * sum(vs) / self.node(x).write_capacity
</code></pre>
<p>As a final step, it proceeds by finalising the constraint using the instruction:</p>
<pre><code>problem += (x_load &lt;= l, f&apos;{x}{fr}&apos;)
</code></pre>
<h3 id="network-load-optimization">Network load optimization</h3>
<p>The following expression defines the network load.</p>
<p>$$ f_r ( \sum_{r \in R} p_r &#xA0;\cdot |r|) + (1 - f_r) ( \sum_{w \in W} p_w &#xA0;\cdot |w|) $$</p>
<p>Where \(|r|\) and \(|w|\) are the length of the read and write quorums sets.</p>
<p>In concrete, the library implements the following code:</p>
<script src="https://gist.github.com/samueleresca/f1a43ab8acbc43b6f263a1aed2feaccb.js"></script>
<p>If <code>optimize == NETWORK</code>, quoracle will use the network function above as the target objective for the optimal strategy. The implementation follows the network load definition mentioned above: for each read and write quorum, it multiplies the quorum length by the corresponding <code>read_quorum_vars</code> and <code>write_quorum_vars</code> entries and sums the values together.</p>
<p>If \(f_r\) is a distribution, the network load is computed as a weighted average of all the network loads derived by every single value in the distribution.</p>
<h3 id="latency-optimization">Latency optimization</h3>
<p>The latency is another important aspect of a quorum system. In the real world, nodes are heterogeneous, and the latency depends on the network conditions and the status of the node that we are contacting.</p>
<p>The paper defines the latency as:</p>
<p>$$ f_r ( \sum_{r \in R} p_r \cdot latency(r)) + (1 - f_r) ( \sum_{w \in W} p_w \cdot latency(w)) $$</p>
<p>In practice, when <code>optimize == LATENCY</code>, Quoracle uses the following approach to calculate the latency:</p>
<script src="https://gist.github.com/samueleresca/20dbee03b347035bda18d383e180a668.js"></script>
<p>The <code>Node</code> class has a configurable <code>latency</code>. The implementation builds the latency minimisation problem by multiplying the latency in seconds assigned to each node with the read and write quorums. If \(f_r\) represents a distribution, then the implementation uses the weighted average of the values in the distribution to get the final result.</p>
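<p>The latency objective mirrors the network one. As an illustration only (quoracle's actual quorum-latency computation is more involved), the sketch below assumes a quorum's latency is the latency of its slowest member:</p>

```python
# Illustrative sketch: expected latency of a fixed strategy, under the
# simplifying assumption latency(q) = slowest node in the quorum q.
def expected_latency(fr, read_strategy, write_strategy, node_latency):
    def quorum_latency(q):
        return max(node_latency[n] for n in q)

    read_part = sum(p * quorum_latency(q) for q, p in read_strategy.items())
    write_part = sum(p * quorum_latency(q) for q, p in write_strategy.items())
    return fr * read_part + (1 - fr) * write_part

latencies = {'a': 1.0, 'b': 2.0, 'c': 4.0}  # seconds, per node
sigma_r = {frozenset('ab'): 1.0}
sigma_w = {frozenset('bc'): 1.0}
```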
<h3 id="additional-constraints">Additional constraints</h3>
<p>As we saw previously, the target of linear optimisation depends on the <code>optimize</code> parameter. Therefore, the resulting <em>optimal strategy</em> will be optimal regarding the <em>load</em>, the <em>network load</em>, or the <em>latency</em>.</p>
<p>In addition, it is possible to specify some additional constraints for the optimal strategy: these constraints are limits on the <em>load</em>, the <em>network</em>, or the <em>latency</em>.</p>
<p>Note that it is not possible to optimise for load and, at the same time, specify a load limit for the strategy; the same restriction applies to the network and the latency.</p>
<p>The additional constraints use the same definitions of load, network load, and latency we saw previously. The following example shows a final problem that defines an optimal load strategy with some constraints on the network and the latency.</p>
<p>Let&apos;s take for example the following optimization query:</p>
<pre><code class="language-python">q = QuorumSystem(reads=a * b + b * c + a * c)

q.load(optimize=LOAD, read_fraction=1, network_limit=3, latency_limit=datetime.timedelta(seconds=2))
</code></pre>
<p>The <code>QuorumSystem</code> is defined by some quorums composed of three nodes: a, b, c. We want to get the optimal load (<code>optimize==LOAD</code>), with a <code>read_fraction</code> of 1 (a 100% read workload), a <code>network_limit==3</code>, and a latency limit of 2 seconds (by default, every <code>Node</code> has a latency of 1s if we do not specify anything). The resulting optimisation problem that finds the optimal strategy looks like this:</p>
<pre><code>optimal_strategy:
    MINIMIZE
        1.0*l_1.0 + 0.0
    SUBJECT TO
        valid_read_strategy: r0 + r1 + r2 = 1
        valid_write_strategy: w0 + w1 + w2 + w3 + w4 + w5 + w6 + w7 = 1
        a1.0: - l_1.0 + r0 + r2 &lt;= 0
        b1.0: - l_1.0 + r0 + r1 &lt;= 0
        c1.0: - l_1.0 + r1 + r2 &lt;= 0
        network_limit: 2 r0 + 2 r1 + 2 r2 &lt;= 3
        latency_limit: r0 + r1 + r2 &lt;= 2
</code></pre>
<p>As you can see, the optimal strategy problem tries to minimise the load subject to some constraints:</p>
<ul>
<li>valid read/write strategies: the sum of all the read/write probabilities is 1;</li>
<li>the <code>a1.0</code>, <code>b1.0</code>, <code>c1.0</code> constraints represent the load optimization. In detail, the <code>a1.0</code> constraint, which represents the load for node <code>a</code>, imposes <code>r0 + r2 &lt;= l_1.0</code>;</li>
<li>the <code>network_limit</code> must be less than three as we specified in the <code>load</code> method call above;</li>
<li>the <code>latency_limit</code> must be less than 2s as we specified in the <code>load</code> method call above;</li>
</ul>
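<p>We can verify by hand, outside any LP solver, that the uniform read strategy \(r_0 = r_1 = r_2 = 1/3\) is feasible for the printed problem and achieves a load of \(2/3\):</p>

```python
# Manual feasibility check of the printed problem for the uniform strategy.
r = [1 / 3, 1 / 3, 1 / 3]
l = 2 / 3  # candidate value of the load variable l_1.0

eps = 1e-9
assert abs(sum(r) - 1) < eps   # valid_read_strategy
assert r[0] + r[2] <= l + eps  # a1.0: load of node a
assert r[0] + r[1] <= l + eps  # b1.0: load of node b
assert r[1] + r[2] <= l + eps  # c1.0: load of node c
assert 2 * sum(r) <= 3 + eps   # network_limit
assert sum(r) <= 2 + eps       # latency_limit
```

<p>No strategy can do better here: each node appears in two of the three quorums, so some node must carry at least 2/3 of the read probability.</p>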
<p>The reference to the <code>optimal_strategy</code> problem is then optimised, and the resulting values, encapsulated in a <code>pulp.LpVariable</code> instance, are passed to a new <code>Strategy</code> type.</p>
<h2 id="strategy-representation">Strategy representation</h2>
<p>Quoracle uses a <code>Strategy</code> class to represent the concept of strategy (<code>&#x3C3;</code>) for the quorum system. The following snippet of code describes the implementation and the constructor of a <code>Strategy[T]</code> type:</p>
<script src="https://gist.github.com/samueleresca/e50f62f55c8bc5dc40cd695bc9378b64.js"></script>
<p>The class extends a <code>Generic[T]</code> type, and it wraps a reference to a <code>QuorumSystem[T]</code> type, a <code>sigma_r</code>, and a <code>sigma_w</code> attribute; the last two properties represent the definition of strategy we saw in the first section:</p>
<p>$$ \sigma = (\sigma_R, \sigma_W) \newline$$</p>
<p>The <code>Strategy</code> constructor initializes two dictionaries: the <code>x_read_probability</code> and <code>x_write_probability</code>. They provide a map between node x and its probability of getting selected as part of a quorum.</p>
<p>The <code>Strategy</code> class exposes the methods for retrieving the read and write quorums and the nodes that are part of the quorum systems:</p>
<script src="https://gist.github.com/samueleresca/d1c5e7bcffc42769e0e48c40ecb4825a.js"></script>
<p>The <code>get_read_quorum</code> and the <code>get_write_quorum</code> methods pick the read/write set depending on the probability of getting selected.</p>
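<p>Conceptually, this weighted selection can be sketched with Python's <code>random.choices</code> (an illustration of the idea, not the library's actual code):</p>

```python
import random

# Illustrative sketch of probability-weighted quorum selection: sigma maps
# each quorum to its selection probability.
def pick_quorum(sigma, rng=random):
    quorums = list(sigma.keys())
    weights = list(sigma.values())  # selection probabilities
    return rng.choices(quorums, weights=weights, k=1)[0]

# With a single quorum at probability 1, the choice is deterministic.
only = pick_quorum({frozenset('ab'): 1.0})
```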
<p>Finally, the <code>Strategy</code> class implements the methods for calculating key characteristics we discussed previously, such as: the <code>load</code>, the <code>capacity</code>, the <code>network_latency</code>, and the <code>latency</code>:</p>
<script src="https://gist.github.com/samueleresca/de9af4dda1315677b51668c196991c1c.js"></script>
<p>The implementation works similarly to the constraint definitions of the optimal strategy problem we discussed in the previous sections. Again, it is possible to pass the <code>read_fraction</code> and the <code>write_fraction</code> distributions to represent heterogeneous workloads.</p>
<h2 id="handling-quorum-failures">Handling quorum failures</h2>
<p>In distributed systems, failures are frequent. The two main approaches adopted in quorum systems are:</p>
<ul>
<li>contact every node in the quorum system in case some of them are failing;</li>
<li>contact the minimum amount of nodes and eventually retry if a node fails (retry means losing a significant amount of time);</li>
</ul>
<p>The paper defines a dynamic value representing a quorum system&apos;s resilience, called <em>f-resilience</em>.</p>
<p>\(f\) represents the number of failures that are tolerated by a strategy.</p>
<p>Quoracle computes the read and write quorums that are <em>f-resilient</em> using dynamic programming. The logic is defined in the <code>_f_resilient_quorums</code> method:</p>
<script src="https://gist.github.com/samueleresca/7f9df308498a1c5cf3c3762f837602b6.js"></script>
<p>Given an \(f\) coefficient, a set of nodes, and an expression (<code>Expr</code>) indicating the read and write quorums, the method returns the sets of <em>f-resilient</em> quorums by iterating over all the possible combinations of nodes that can form a valid f-resilient quorum. Note that it is impossible to build resilient quorums in some scenarios, for example, when <code>f &gt; len(xs)</code>; in those cases, the method returns an empty set.</p>
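<p>The definition can be sketched with a few lines of brute-force Python (an illustration of f-resilience only, not quoracle's DP-based implementation; <code>is_quorum</code> is a hypothetical stand-in for evaluating the <code>Expr</code>):</p>

```python
from itertools import combinations

# A node set xs is f-resilient if, after removing any f of its nodes,
# the remaining nodes still contain a quorum.
def is_f_resilient(xs, f, is_quorum):
    if f >= len(xs):
        return False  # cannot survive f failures with <= f nodes
    return all(is_quorum(set(xs) - set(dead)) for dead in combinations(xs, f))

def f_resilient_quorums(f, nodes, is_quorum):
    return {frozenset(cand)
            for size in range(1, len(nodes) + 1)
            for cand in combinations(sorted(nodes), size)
            if is_f_resilient(cand, f, is_quorum)}

# Majority quorums over {a, b, c}: any two nodes form a quorum.
def majority(q):
    return len(q) >= 2

resilient = f_resilient_quorums(1, {'a', 'b', 'c'}, majority)
```

<p>With majority quorums over three nodes, only the full set {a, b, c} tolerates one failure, which illustrates the capacity cost of resilience mentioned below.</p>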
<p>The concept of resilience, as described in the paper, comes at the cost of capacity.</p>
<h2 id="optimal-strategy-search">Optimal strategy search</h2>
<p>Quoracle provides a way to search for an optimal quorum system depending on a given set of parameters:</p>
<pre><code>def search(nodes: List[Node[T]],
           read_fraction: Optional[Distribution] = None,
           write_fraction: Optional[Distribution] = None,
           optimize: str = LOAD,
           resilience: int = 0,
           load_limit: Optional[float] = None,
           network_limit: Optional[float] = None,
           latency_limit: Optional[datetime.timedelta] = None,
           f: int = 0,
           timeout: datetime.timedelta = datetime.timedelta(seconds=0)) \
           -&gt; Tuple[QuorumSystem[T], Strategy[T]]:
    # ...
</code></pre>
<p>The <code>nodes</code> parameter represents the nodes you have in the quorum system. As already discussed, the <code>read_fraction</code> could be a single value or a <code>Distribution</code> of percentages. Finally, the <code>optimize</code> parameter defines the objective of the optimisation.</p>
<p>The <code>load_limit</code>, <code>network_limit</code>, <code>latency_limit</code> are the additional constraints already explained in the <a href="#optimal-strategy-problem-implementation">Optimal strategy problem implementation</a> section.</p>
<p>The underlying implementation of the <code>search</code> function uses the types we already described, such as the <code>Strategy</code> and the <code>QuorumSystem</code> classes:</p>
<script src="https://gist.github.com/samueleresca/fba9d47adbb206794d47aec8fe79accb.js"></script>
<p>Given a set of <code>nodes</code>, the implementation calculates the list of all the de-duplicated expressions that can be composed with that set of nodes using the <code>_dup_free_exprs</code> function.</p>
<p>The implementation proceeds by calling the <code>do_search</code> function; it initialises a new instance of a <code>QuorumSystem</code> for each previously computed expression.</p>
<p>Once the <code>QuorumSystem</code> is initialised, it proceeds by calling the <code>strategy</code> method to get the <em>optimal strategy</em> for the current expression. If the target metric for that strategy is better than the one of the previous best strategy, it is promoted to the new optimal strategy. Otherwise, the algorithm continues.</p>
<p>The <code>do_search</code> function applies a linear programming optimisation for each expression derived from <code>_dup_free_exprs</code>. However, the computation is time-consuming: the <code>search</code> function accepts a <code>timeout</code> parameter limiting the search&apos;s computation time. The implementation checks the timeout at every iteration and stops the computation once it has been exceeded.</p>
<h2 id="final-thoughts">Final thoughts</h2>
<p>This post went through the concepts of quorum systems illustrated in the <a href="https://mwhittaker.github.io/publications/quoracle.pdf">Read-Write Quorum Systems Made Practical - Michael Whittaker, Aleksey Charapko, Joseph M. Hellerstein, Heidi Howard, Ion Stoica</a> paper, and it gave a detailed look at the Quoracle implementation. Quoracle is an excellent foundation to explore the trade-offs of read-write quorum systems. A closer look at the Quoracle implementation helps to understand the critical aspects of implementing quorums in distributed systems.</p>
<p>In the next couple of months, I will implement a Golang port of Quoracle for learning purposes. The port is in progress at the following: <a href="https://github.com/samueleresca/quoracle-go">samueleresca/quoracle-go</a>. I will probably write a more detailed post that gives a closer look at the Golang implementation.</p>
<p>Below there are the references used to write this post.</p>
<h2 id="references">References</h2>
<p>[1]<a href="https://sites.cs.ucsb.edu/~agrawal/fall2009/dynamo.pdf">Dynamo: Amazon&#x2019;s Highly Available Key-value Store</a></p>
<p>[2]<a href="https://www.usenix.org/legacy/event/atc10/tech/full_papers/Hunt.pdf">ZooKeeper: Wait-free coordination for Internet-scale systems</a></p>
<p>[3]<a href="https://martinfowler.com/articles/patterns-of-distributed-systems/quorum.html">Quorums - Martin Fowler</a></p>
<p>[4]<a href="https://mwhittaker.github.io/publications/quoracle.pdf">Read-Write Quorum Systems Made Practical - Michael Whittaker, Aleksey Charapko, Joseph M. Hellerstein, Heidi Howard, Ion Stoica</a></p>
<!--kg-card-end: markdown--><p></p><p></p><p></p><p></p><p></p>]]></content:encoded></item><item><title><![CDATA[Detecting node failures and the Phi accrual failure detector]]></title><description><![CDATA[Partial failure is an aspect of distributed systems; the asynchronous nature of the processes and the network infrastructure makes fault detection a complex topic. 
Failure detectors usually provide a way to identify and handle failures]]></description><link>https://samueleresca.net/detecting-node-failures-and-the-phi-accrual-failure-detector/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac18e</guid><category><![CDATA[Distributed Systems]]></category><category><![CDATA[Fault tolerance]]></category><category><![CDATA[Failure detectors]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Wed, 07 Apr 2021 16:02:01 GMT</pubDate><media:content url="https://samueleresca.net/content/images/2021/03/Screenshot-2021-03-23-at-10.33.02.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://samueleresca.net/content/images/2021/03/Screenshot-2021-03-23-at-10.33.02.jpg" alt="Detecting node failures and the Phi accrual failure detector"><p>Partial failure is an aspect of distributed systems; the asynchronous nature of the processes and the network infrastructure makes fault detection a complex topic.<br>
Failure detectors usually provide a way to identify and handle failures using a constant timeout: after a certain threshold, the failure detector declares the node offline.</p>
<p>This post is the result of some readings and notes related to an alternative approach for failure detection called &quot;&#x3C6; accrual failure detector&quot;.</p>
<p>Most of the concepts are from the following paper:</p>
<p><a href="https://ieeexplore.ieee.org/document/1353004">The &#x3C6; Accrual Failure Detector - Naohiro Hayashibara, Xavier D&#xE9;fago, Rami Yared and Takuya Katayama</a></p>
<p>A common way to detect node failures in asynchronous distributed systems is to use a heartbeat signal: let&apos;s suppose that we have a <code>p</code> process that pings a <code>q</code> monitor process. If <code>q</code> doesn&apos;t receive a heartbeat request from <code>p</code> after a certain delay (<code>&#x394;t</code>), then <code>q</code> can declare <code>p</code> failed.</p>
<p>How long should the timeout be before declaring <code>p</code> offline?</p>
<p>Since we have a binary signal, it is hard to distinguish an offline process from a slow one; we can end up with two different cases:</p>
<ul>
<li>A <em>short timeout</em> means that we can potentially declare the node offline even if it is not (<em>aggressive approach</em>), but we will detect offline nodes faster;</li>
<li>A <em>long timeout</em> reduces the risk of wrongly declaring nodes offline with the cons that the detection will be slower;</li>
</ul>
<p>Therefore, the timeout value is usually configured with an experimental approach and adjusted manually.</p>
<p>In addition to process slowness, some additional variables play an important role in a heartbeat timeframe: the <em>unbounded network delays</em>. Different parts of the network infrastructure can slow down the communication between two nodes, such as the TCP retry mechanism or congestion in a network switch.</p>
<p>An accrual failure detector calculates the variability of the response times based on a sample window, and it provides a dynamic way to identify a dependency failure.</p>
<h2 id="failuredetectorarchitecture">Failure detector architecture</h2>
<p>The paper describes the failure detectors with the following components:</p>
<ul>
<li>the <em>Monitoring</em> component receives the heartbeat from the network;</li>
<li>the <em>Interpretation</em> component defines the criteria that establish if a node should be considered available or not;</li>
<li>the <em>Action</em> component implements the decisions of what to do in case the node is not available based on the outcome of the <em>interpretation</em> component;</li>
</ul>
<p><img src="https://samueleresca.net/content/images/2021/03/accrualfailuredetector_arch-1.svg" alt="Detecting node failures and the Phi accrual failure detector" loading="lazy"></p>
<p>The image shows the different parts of a failure detector system.</p>
<p>The left schema represents a standard failure detector: the <em>monitoring</em> and the <em>interpretation</em> phases are implemented within the failure detection system. In this case, the failure detector returns a <em>suspicion</em> flag represented by a boolean value. The application logic does not perform any additional analysis on the value; it proceeds by executing an action depending on the <em>suspicion</em> result.</p>
<p>The schema on the right describes the anatomy of an accrual failure detector. The difference is that the accrual failure detector layer returns a <em>level of suspicion</em>, described by a dynamic numeric value. In this case, both the <em>interpretation</em> and the <em>action</em> are delegated to the application layer. This approach gives the application the freedom to perform different actions depending on the <em>suspicion level</em> and eventually prioritize the job reallocation using the level returned by the accrual failure detector.</p>
<h2 id="implementation">&#x3C6; implementation</h2>
<p>The paper proposes an implementation of the accrual failure detector called &quot;The <code>&#x3C6;</code> accrual failure detector&quot;.</p>
<p>The <code>&#x3C6;</code> is a value that is continuously adjusted depending on the current network conditions.</p>
<p>The <em>heartbeat signals</em> arrive from the network, and each heartbeat interval is stored in a <em>sample window collection</em> which has a fixed size. The <em>sample window collection</em> is used to estimate the distribution of the signals.</p>
<p>The <code>&#x3C6;</code> value is defined as follows:</p>
<p><img src="https://samueleresca.net/content/images/2021/03/phi_definition.png" alt="Detecting node failures and the Phi accrual failure detector" loading="lazy"></p>
<p><code>Tlast</code> is the time at which the failure detector received the most recent heartbeat. <code>tnow</code> is the current timestamp. <code>Plater</code> is the probability that a heartbeat will arrive more than <code>tnow - Tlast</code> after the previous one.</p>
<p>Since we are storing all the incoming intervals (<code>tnow - Tlast</code>) in the <em>sample window collection</em>, <code>Plater</code> can be derived from the cumulative distribution function estimated on those samples.</p>
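<p>Using the exact normal CDF, the definition can be sketched as follows (an illustration of the formula only; as discussed later, the actual implementations approximate the CDF with a logistic function):</p>

```python
import math

# Illustration of phi = -log10(P_later(t_now - t_last)), assuming the
# heartbeat intervals are normally distributed with the window's
# mean and std_dev.
def phi(t_now, t_last, mean, std_dev):
    time_diff = t_now - t_last
    # normal CDF evaluated via the error function
    cdf = 0.5 * (1.0 + math.erf((time_diff - mean) / (std_dev * math.sqrt(2.0))))
    p_later = 1.0 - cdf  # probability a heartbeat arrives after time_diff
    return -math.log10(p_later)
```

<p>When <code>time_diff</code> equals the mean interval, half of the heartbeats are expected to arrive later, so <code>&#x3C6; = -log10(0.5) &#x2248; 0.3</code>; <code>&#x3C6;</code> grows quickly as the silence exceeds the mean.</p>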
<h2 id="pythonimplementation">Python implementation</h2>
<p>The <code>&#x3C6;</code> accrual failure detector concepts described in the paper are already implemented in <a href="https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/PhiAccrualFailureDetector.scala">akka/akka</a>, and a slightly modified version is implemented <a href="http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">in Cassandra</a>.</p>
<p>In this section, I&apos;m describing a Python port of the <a href="https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/PhiAccrualFailureDetector.scala">akka implementation</a>. The source is available at the following repo: <a href="https://github.com/samueleresca/phi-accrual-failure-detector">samueleresca/phi-accrual-failure-detector</a>.<br>
The code implements a <code>&#x3C6;</code> accrual failure detector class called <code>PhiAccrualFailureDetector</code>.</p>
<p>Note that the implementation focuses on a single instance of the failure detector. Therefore, it ignores some components that are needed for handling multiple instances, such as a failure detector registry (see: <a href="https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/DefaultFailureDetectorRegistry.scala">akka.remote.DefaultFailureDetectorRegistry</a>).</p>
<h3 id="samplingwindow">Sampling window</h3>
<p>The implementation of the <em>sampling window collection</em> is in the <code>HeartbeatHistory</code> class:</p>
<script src="https://gist.github.com/samueleresca/585485d7340009b9eea8c94e2cb699e6.js"></script>
<p>The <code>HeartbeatHistory</code> class defines a list of <code>intervals</code> and a <code>max_sample_size</code> attribute, which indicates the sampling window size. The class implements some methods that retrieve the <code>mean</code>, the <code>std_dev</code>, and the <code>variance</code> of the distribution.</p>
<p><code>HeartbeatHistory</code> overrides the sum operation by explicitly declaring the <code>__add__</code> definition. The method allows adding a new interval to the <code>intervals</code> collection.<br>
If the size of <code>intervals</code> exceeds the <code>max_sample_size</code> defined for the collection, the implementation removes the oldest value in the list (see the <code>drop_oldest</code> method). This guarantees the fixed size of the collection.</p>
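<p>The same fixed-size behaviour can be obtained in plain Python with a <code>collections.deque</code>, which evicts the oldest element automatically (a simplified sketch with assumed names, not the port's actual class):</p>

```python
from collections import deque
import statistics

# Simplified sampling-window sketch: a bounded deque keeps only the most
# recent max_sample_size intervals.
class SamplingWindow:
    def __init__(self, max_sample_size):
        self.intervals = deque(maxlen=max_sample_size)

    def add(self, interval_ms):
        self.intervals.append(interval_ms)  # oldest value dropped when full

    def mean(self):
        return statistics.mean(self.intervals)

    def std_dev(self):
        return statistics.pstdev(self.intervals)

window = SamplingWindow(max_sample_size=3)
for interval in (1000, 1100, 900, 1000):
    window.add(interval)  # the first interval (1000) gets evicted
```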
<p>The <code>_HearbeatHistory</code> class is encapsulated by a <code>_State</code> class:</p>
<script src="https://gist.github.com/samueleresca/e4d156162b7feb6f9625e1230e10d422.js"></script>
<p>The <code>_State</code> class represents the accrual failure detector state, and it stores the heartbeat history and the latest heartbeat timestamp (<code>Tlast</code>). The class instance will be wrapped into an <code>AtomicReference</code> to guarantee the thread-safety.</p>
<h3 id="heartbeatmethod">Heartbeat method</h3>
<p>The <code>heartbeat</code> method handles the incoming signal from the network. The implementation is described here:</p>
<script src="https://gist.github.com/samueleresca/b0da0754e3bb86c12c822c0c80a8a3f4.js"></script>
<p>The code above omits some components of the class and focuses only on the <code>heartbeat</code> method.</p>
<p>The implementation uses the <code>get_time()</code> function to retrieve the current time in <code>ms</code>. The <code>get_time()</code> function also makes it easy to mock the current time in the testing phase.</p>
<p>In case the current state is not defined, it initializes the state with the first heartbeat, represented by the <code>first_heartbeat_estimate_ms</code> attribute.</p>
<p>It proceeds by calculating the <code>tnow - Tlast</code> interval and stores it in the heartbeat history state.</p>
<p>The <code>state</code> is wrapped into an <code>AtomicReference</code> type to handle the cases of multiple threads trying to access the attribute concurrently. The implementation calls the <code>compare_and_set</code> method for comparing the old state with the expected one. If the states do not match, the method retries recursively by calling itself.</p>
<h3 id="calculatingthephivalue">Calculating the Phi value</h3>
<p>The <code>phi</code> method calculates and returns the actual value of <code>&#x3C6;</code> computed using the <code>HeartbeatHistory</code> instance encapsulated in the <code>state</code> attribute:</p>
<script src="https://gist.github.com/samueleresca/59747f7ddc43f677cd5832dc1e54a22e.js"></script>
<p>The core of the implementation is defined in the <code>self.calc_phi</code> function. Given the <code>time_diff</code>, the <code>mean</code>, and the <code>std_dev</code> of the distribution, the function computes the logistic approximation of the cumulative normal distribution (for the details, see &quot;A logistic approximation to the cumulative normal distribution&quot; at the bottom of the post). Once the <code>phi</code> value is calculated, it is returned to the caller.</p>
<p>The <code>self.calc_phi</code> function wraps the <code>math.exp</code> operation with a <code>try-except</code> block. If the operation exceeds the floating-point range and raises an <code>OverflowError</code> exception, it assigns a <code>float(&apos;inf&apos;)</code> value to <code>e</code>. In case the argument of the <code>math.exp</code> operation is a very large negative value, the result will be rounded to <code>0</code>.</p>
<p>Another aspect to notice is the different calculation made depending on the <code>timeDiff &gt; mean</code> condition. This is because of a floating-point precision-loss concern that is well-described in the original Akka issue: <a href="https://github.com/akka/akka/issues/1821">akka/issues/1821</a>.</p>
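<p>Re-sketched outside the class, the computation looks roughly like this (a hedged sketch of the approach, not the port's exact code; the constants come from the logistic approximation paper referenced at the bottom of the post):</p>

```python
import math

# Standalone sketch of the logistic approximation of the normal CDF,
# with the overflow guard described above.
def calc_phi(time_diff, mean, std_dev):
    y = (time_diff - mean) / std_dev
    try:
        e = math.exp(-y * (1.5976 + 0.070566 * y * y))
    except OverflowError:
        e = float('inf')  # very negative y: phi collapses towards 0
    if time_diff > mean:
        # avoid precision loss when e/(1+e) is close to 0
        return -math.log10(e / (1.0 + e))
    return -math.log10(1.0 - 1.0 / (1.0 + e))
```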
<h2 id="usageexample">Usage example</h2>
<p>The following code describes a simple usage of the <code>PhiAccrualFailureDetector</code> class. It mocks the timings by overriding the <code>_get_time()</code> method defined in the class:</p>
<script src="https://gist.github.com/samueleresca/6fc431368aaf73260df8404fbe24c479.js"></script>
<p>The test above defines an <code>Iterable</code> of mocked times as follows:</p>
<pre><code>t0 = 0
t1 = 1000
t2 = 1100
t3 = 1200
t4 = 5200
t5 = 8200
</code></pre>
<p>And it executes a sequence of <code>heartbeat</code> methods, which will populate the <code>heartbeat_history</code> state of the accrual failure detector instance. When the test code calls the <code>is_available</code> method for the first time, the value of <code>&#x3C6;</code> is:</p>
<p><code>&#x3C6;: 0.025714293568000528</code></p>
<p>This is less than the <code>threshold</code> we defined in the class initialization; therefore, <code>is_available</code> will return <code>True</code>. When we call the <code>is_available</code> method for the second time (after skipping the <code>5200</code> time), we have a <code>&#x394;t = t5 - t3 = 8200 - 1200 = 7000</code>, which will lead to the following value of <code>&#x3C6;</code> :</p>
<p><code>&#x3C6;: 109.21058212993705</code></p>
<p>Therefore, the second <code>is_available</code> call returns <code>False</code>, since the <code>&#x3C6;</code> value is greater than the <code>threshold=3</code> we defined in the initialization.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This post goes through some concepts of the <code>&#x3D5;</code> Accrual failure detector paper, and it describes a concrete Python implementation available at the following link: <a href="https://github.com/samueleresca/phi-accrual-failure-detector">phi-accrual-failure-detector</a>. The code uses a fixed threshold (<code>phi_value &lt; threshold</code>) to decide if a node/process is available or not. Still, the resulting <code>&#x3C6;</code> value is dynamic, and an implementation could eventually assign different levels of availability depending on the resulting <code>&#x3C6;</code> value.</p>
<p>Below there are the references I used to write this post.</p>
<h2 id="references">References</h2>
<p><a href="https://ieeexplore.ieee.org/document/1353004">The &#x3D5; Accrual Failure Detector - Naohiro Hayashibara, Xavier D&#xE9;fago, Rami Yared and Takuya Katayama</a></p>
<p><a href="https://doc.akka.io/docs/akka/current/typed/failure-detector.html">Phi Accrual Failure Detector - Akka documentation</a></p>
<p><a href="https://github.com/akka/akka/blob/master/akka-remote/src/main/scala/akka/remote/PhiAccrualFailureDetector.scala">akka/akka source code</a></p>
<p><a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf">Cassandra - A Decentralized Structured Storage System</a></p>
<p><a href="https://core.ac.uk/display/41787448">A logistic approximation to the cumulative normal distribution</a></p>
<p><a href="https://medium.com/@arpitbhayani/phi-%CF%86-accrual-failure-detection-79c21ce53a7a">Phi &#x3C6; Accrual Failure Detection - @arpitbhayani</a></p>
<p><a href="https://unsplash.com/@diesektion">Cover photo by @diesektion</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Large-Scale Data Quality Verification in .NET PT.1]]></title><description><![CDATA[The quality testing of large data-sets plays an essential role in the reliability of data-intensive applications. The business decisions of companies rely on machine learning models; for this reason, data quality has gained a lot of importance. ]]></description><link>https://samueleresca.net/large-data-quality-verification/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac18d</guid><category><![CDATA[.NET Core]]></category><category><![CDATA[dotnetcore]]></category><category><![CDATA[dotnet]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Wed, 02 Sep 2020 12:04:32 GMT</pubDate><media:content url="https://samueleresca.net/content/images/2020/08/Screenshot-2020-08-27-at-23.23.52.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><img src="https://samueleresca.net/content/images/2020/08/Screenshot-2020-08-27-at-23.23.52.jpg" alt="Large-Scale Data Quality Verification in .NET PT.1"><p>The quality testing of large data-sets plays an essential role in the reliability of data-intensive applications. The business decisions of companies rely on machine learning models and data analysis; for this reason, data quality has gained a lot of importance. A few months ago, the <a href="https://github.com/awslabs/deequ">awslabs/deequ</a> library caught my attention.</p>
<p>The library helps to define unit tests for data, and it uses Apache Spark to support large data-sets. I started to dig into the implementation, and I&apos;m working on porting the library into the .NET ecosystem: <a href="https://github.com/samueleresca/deequ.NET">samueleresca/deequ.NET</a>.</p>
<h2 id="whydataqualityisimportant">Why is data quality important?</h2>
<p>One thing that I noticed when I jumped into the machine learning world is that ordinary software engineering practices are not enough to guarantee the stability of the codebase. One of the main reasons is well-described in the <a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf">Hidden Technical Debt in Machine Learning Systems</a> paper.</p>
<p>In traditional software projects, the established tools and techniques for code analysis, unit testing, integration testing, and monitoring address the common pitfalls deriving from code dependency debt. Although these tools and techniques are still valid on a machine learning project, they are not enough. In a machine learning project, the ecosystem of components and technologies is broader:</p>
<p><img src="https://samueleresca.net/content/images/2020/08/Screenshot-2020-08-01-at-15.13.57.jpg" alt="Large-Scale Data Quality Verification in .NET PT.1" loading="lazy"></p>
<p>The machine learning code is a minimum part of the whole project. A lot of components are dedicated to the pre-processing/preparation/validation phases, such as the feature extraction part, the data collection, and the data verification. One of the main assertions made by the research paper mentioned above is that the <em>data dependencies cost more than code dependencies</em>. Therefore, the versioning of the upstream data-sets and the quality testing data needs a considerable effort, and it plays an essential role in the reliability of the machine learning project.</p>
<h2 id="implementationdetails">Implementation details</h2>
<p><a href="https://pdfs.semanticscholar.org/3606/af912be399c789c40c1a1a5f16bd9bb84b6e.pdf">The automating large-scale data quality verification research</a> that inspired the deequ library describes the common pitfalls behind the data quality verification and provides a pattern for testing large-scale data-sets. It highlights three data quality dimensions: the <em>completeness</em>, the <em>consistency</em>, and the <em>accuracy</em> of the data.</p>
<p>The <em>completeness</em> represents the degree to which an entity can have all the values needed to describe a real-world object. For example, in the case of relational storage, it is the presence or not of null values.</p>
<p>The <em>consistency</em> refers to the semantic rules of data. More in detail, to all the rules that are related to a data type, a numeric interval, or a categorical column. The <em>consistency</em> dimension also describes the rules that involve multiple columns. For example, if the category value of a record is <code>t-shirt</code>, then the size could be in the range <code>{S, M, L}</code>.</p>
<p>On the other side, the <em>accuracy</em> focuses on the syntactic correctness of the data based on the definition domain. For example, a <code>color</code> field should not have a value of <code>XL</code>. Deequ uses these dimensions as the main reference to understand the data quality of a data-set.</p>
<p>The next sections go through the main components that the original deequ library uses, and show the corresponding implementation in the <a href="https://github.com/samueleresca/deequ.NET">deequ.NET</a> library.</p>
<h3 id="checkandconstraintdeclaration">Check and constraint declaration</h3>
<p>The library uses a declarative syntax for defining the list of <em>checks</em> and the related <em>constraints</em> that are used to assert the data quality of a data-set. Every constraint is identified by a <em>type</em> that describes the purpose, and a set of <em>arguments</em>:</p>
<p><img src="https://samueleresca.net/content/images/2020/08/Screenshot-2020-08-02-at-17.06.19.jpg" alt="Large-Scale Data Quality Verification in .NET PT.1" loading="lazy"></p>
<p>The declarative approach of the library asserts the quality of the data-set in the following way:</p>
<script src="https://gist.github.com/samueleresca/7a5fe4a609db092b2aee881c7bc0fbd0.js"></script>
<p>The <code>VerificationSuite</code> class exposes the API needed to load the data-set (<code>OnData</code>) and to declare the list of checks (<code>AddCheck</code>).<br>
Every check specifies a description, the list of constraints, and a <code>CheckLevel</code>, which defines the severity of the check.<br>
Once we have declared a list of <code>Check</code> instances, we can proceed by calling the <code>Run</code> method, which lazily executes the operations on the data and returns a <code>VerificationResult</code> instance.</p>
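<p>To make the flow above concrete, a minimal check declaration might look like the following sketch. It assumes a <a href="https://github.com/dotnet/spark">dotnet/spark</a> session; the specific constraint methods (<code>IsComplete</code>, <code>IsContainedIn</code>) are used here only as examples of the declarative API:</p>
<pre><code class="language-csharp">// Sketch: load a data-set and run two example constraints against it
SparkSession session = SparkSession.Builder().GetOrCreate();
DataFrame data = session.Read().Json(&quot;products.json&quot;);

VerificationResult result = new VerificationSuite()
    .OnData(data)
    .AddCheck(
        new Check(CheckLevel.Error, &quot;integrity checks&quot;)
            .IsComplete(&quot;id&quot;)                                  // no null values allowed
            .IsContainedIn(&quot;size&quot;, new[] { &quot;S&quot;, &quot;M&quot;, &quot;L&quot; }))   // categorical range
    .Run();
</code></pre>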
<h3 id="verificationoutput">Verification output</h3>
<p>As mentioned above, the verification output is represented by the <code>VerificationResult</code> type. Concretely, this is the core C# definition of <code>VerificationResult</code>:</p>
<script src="https://gist.github.com/samueleresca/0299c298a3699d5f163b06ad8135bd72.js"></script>
<p>The code above introduces the concept of the <code>CheckResult</code> type. The <code>CheckResult</code> class describes the result derived from a check, and it has the following implementation:</p>
<script src="https://gist.github.com/samueleresca/9c95fed3e05f8f61aab32c711a9c37f0.js"></script>
<p>For each executed <code>Check</code>, there is an associated <code>CheckResult</code> that contains the <code>Status</code> of the check and a list of <code>ConstraintResults</code> bound with that check. Therefore, once the <code>VerificationSuite</code> has been executed, it is possible to access the actual results of the checks:</p>
<script src="https://gist.github.com/samueleresca/489245d38336f3d6aef7085630a3d2c5.js"></script>
<p>The <code>Status</code> field represents the overall status of the <code>VerificationResult</code>. In case of failure, it is possible to iterate over every single <code>CheckResult</code> instance and extract the list of <code>ConstraintResults</code>. Furthermore, we can print out a message for every failing constraint together with the actual reason for the failure.</p>
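<p>Given a <code>VerificationResult</code> such as the one above, a failure report could be printed with a loop similar to the following sketch (the exact member names, such as <code>CheckResults</code>, <code>ConstraintStatus</code>, and <code>Message</code>, are assumptions based on the descriptions in this post):</p>
<pre><code class="language-csharp">if (result.Status != CheckStatus.Success)
{
    foreach (CheckResult checkResult in result.CheckResults.Values)
    {
        foreach (ConstraintResult constraintResult in checkResult.ConstraintResults)
        {
            if (constraintResult.Status == ConstraintStatus.Failure)
            {
                // Print the failing constraint and the reason for the failure
                Console.WriteLine($&quot;{constraintResult.Constraint}: {constraintResult.Message}&quot;);
            }
        }
    }
}
</code></pre>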
<p>At the foundation of each constraint execution, there is an analyzer that interfaces with the Apache Spark APIs. In the <a href="https://github.com/samueleresca/deequ.NET">deequ.NET</a> implementation, the Spark APIs are provided by the <a href="https://github.com/dotnet/spark">dotnet/spark</a> library. In the following section, we will see how the analyzer classes are abstracted from the rest of the layers of the library.</p>
<h3 id="analyzers">Analyzers</h3>
<p>Analyzers are the foundation of deequ. They implement the operators that compute the metrics used by the constraint instances. For each metric, the library has multiple analyzer implementations that rely on the Apache Spark operators. Therefore, all the logic for communicating with Spark is encapsulated in the analyzers layer.<br>
More in detail, the library uses the following interface to define a generic analyzer:</p>
<script src="https://gist.github.com/samueleresca/921ce516e39ebae2a6a0f6a469b907cf.js"></script>
<p>The interface declares a set of operations that are part of each analyzer&apos;s lifecycle:</p>
<ul>
<li><code>ComputeStateFrom</code> executes the computation of the state based on the <code>DataFrame</code>;</li>
<li><code>ComputeMetricFrom</code> computes and returns the <code>IMetric</code> depending on the state you are passing in;</li>
<li><code>Preconditions</code> returns a set of assertions that must be satisfied by the schema of the <code>DataFrame</code>;</li>
<li><code>Calculate</code> runs the preconditions, calculates, and returns an <code>IMetric</code> instance with the result of the computation. In addition, it optionally accepts an <code>IStateLoader</code> and an <code>IStatePersister</code> interface that can be used to load/persist the state into storage.</li>
</ul>
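<p>Put together, a simplified sketch of the interface could look as follows. The generic constraint, the exact parameter types, and the use of <code>StructType</code> for the schema assertions are assumptions based on the lifecycle described above, not the literal library definition:</p>
<pre><code class="language-csharp">public interface IAnalyzer&lt;out M&gt; where M : IMetric
{
    // Computes the state from the given DataFrame
    Option&lt;IState&gt; ComputeStateFrom(DataFrame dataFrame);

    // Converts a previously computed state into a metric
    M ComputeMetricFrom(Option&lt;IState&gt; state);

    // Assertions that the schema of the DataFrame must satisfy
    IEnumerable&lt;Action&lt;StructType&gt;&gt; Preconditions();

    // Runs the preconditions and calculates the metric, optionally
    // loading/persisting the state from/to storage
    M Calculate(DataFrame data, IStateLoader stateLoader = null,
        IStatePersister statePersister = null);
}
</code></pre>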
<p>Every analyzer implements the <code>IAnalyzer</code> interface to provide the core functionalities needed to run the operations in a distributed manner using the underlying Spark implementation. In addition to the <code>IAnalyzer</code>, the library also defines three additional interfaces: <code>IGroupingAnalyzer</code>, <code>IScanShareableAnalyzer</code>, and <code>IFilterableAnalyzer</code>.</p>
<p>The <code>IScanShareableAnalyzer</code> interface identifies an analyzer that runs a set of aggregation functions over the data and that can share scans over the data. The <code>IScanShareableAnalyzer</code> enriches the analyzer with the <code>AggregationFunctions</code> method, used to retrieve the list of the aggregation functions, and the <code>FromAggregationResult</code> method, used to return the state calculated from the execution of the aggregation functions.</p>
<p>The <code>IGroupingAnalyzer</code> interface identifies the analyzers that group the data by a specific set of columns. It defines the <code>GroupingColumns</code> method, which retrieves the list of grouping columns.</p>
<p>The <code>IFilterableAnalyzer</code> describes an analyzer that implements a filter condition on the fields, and it enriches each implementation with the <code>FilterCondition</code> method.</p>
<p>Let&apos;s continue with an example: the implementation of the <code>MaxLength</code> analyzer. As the name suggests, the purpose of this analyzer is to compute the maximum length of a string column in the data-set:</p>
<script src="https://gist.github.com/samueleresca/ae0624d30ace040e3cdda41970fa154a.js"></script>
<p>The class defines two properties: the <code>string Column</code> and the <code>Option&lt;string&gt; Where</code> condition of the analyzer. The <code>Where</code> condition is returned as the value of the <code>FilterCondition</code> method.<br>
The <code>AggregationFunctions</code> method calculates the <code>Length</code> of the field specified by the <code>Column</code> attribute, and it applies the <code>Max</code> function to the length of the specified <code>Column</code>. The Spark API exposes both the <code>Length</code> and the <code>Max</code> functions used in the <code>AggregationFunctions</code> method.<br>
Also, the class implements the <code>AdditionalPreconditions</code> method, which checks that the <code>Column</code> property of the class is present in the data-set and that the field is of type string. Finally, the analyzer instance will then be executed by the <code>ComputeStateFrom</code> method implemented in the <code>ScanShareableAnalyzer</code> parent class:</p>
<script src="https://gist.github.com/samueleresca/794619e6822c9d599c95e49f9de83e82.js"></script>
<p>The <code>IState</code> resulting from the execution of the above method is then eventually combined with the previous states persisted in memory and converted into a resulting <code>IMetric</code> instance by the <a href="https://github.com/samueleresca/deequ.net/blob/5343a4b362260117af33a3fa7cab5ddc34f8062d/src/deequ/Analyzers/Analyzer.cs#L64">Analyzer.CalculateMetric</a> method.</p>
<h3 id="incrementalcomputationofmetrics">Incremental computation of metrics</h3>
<p>In a real-world scenario, ETLs usually import batches of data, and the data-sets continuously grow in size with new data. Therefore, it is essential to support situations where the resulting metrics of the analyzers can be stored and calculated using an incremental approach. The research paper that inspired deequ describes the incremental computation of the metrics in the following way:</p>
<p><img src="https://samueleresca.net/content/images/2020/08/Screenshot-2020-08-19-at-20.38.48.png" alt="Large-Scale Data Quality Verification in .NET PT.1" loading="lazy"></p>
<p>On the left, you have the batch computation that is repeated every time the input data-set grows (&#x394;D). This approach needs access to the previous data-sets, and it results in a greater computational effort.<br>
On the right side, the data-set growth (&#x394;D) is combined with the state (<em>S</em>) of the previous computation. Therefore, the system does not need to re-scan the previous data-sets every time a new batch of data is processed.</p>
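<p>The key idea is that each metric is backed by a mergeable state. The following sketch is not part of deequ; it only illustrates the concept, showing how an average can be maintained as a <code>(sum, count)</code> state that is merged with the state of a new batch instead of re-reading the previous data:</p>
<pre><code class="language-csharp">// Hypothetical mergeable state for a mean metric
public class MeanState
{
    public double Sum { get; }
    public long Count { get; }

    public MeanState(double sum, long count) =&gt; (Sum, Count) = (sum, count);

    // Combining two states does not require access to the original rows
    public MeanState Merge(MeanState other) =&gt;
        new MeanState(Sum + other.Sum, Count + other.Count);

    public double Metric =&gt; Count == 0 ? 0 : Sum / Count;
}

// Usage: merge the state of the initial data-set D with the state of a new batch &#x394;D
// var mean = new MeanState(10, 4).Merge(new MeanState(6, 2)).Metric; // (10 + 6) / (4 + 2)
</code></pre>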
<p>The incremental computation method we described is achievable using the APIs exposed by deequ.</p>
<p>The following example demonstrates how to implement the incremental computation using the following sample:</p>
<table>
<thead>
<tr>
<th style="text-align:center">id</th>
<th style="text-align:center">manufacturerName</th>
<th style="text-align:center">countryCode</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center">1</td>
<td style="text-align:center">ManufacturerA</td>
<td style="text-align:center">DE</td>
</tr>
<tr>
<td style="text-align:center">2</td>
<td style="text-align:center">ManufacturerB</td>
<td style="text-align:center">DE</td>
</tr>
<tr>
<td style="text-align:center">3</td>
<td style="text-align:center">ManufacturerC</td>
<td style="text-align:center">DE</td>
</tr>
<tr>
<td style="text-align:center">4</td>
<td style="text-align:center">ManufacturerD</td>
<td style="text-align:center">US</td>
</tr>
<tr>
<td style="text-align:center">5</td>
<td style="text-align:center">ManufacturerE</td>
<td style="text-align:center">US</td>
</tr>
<tr>
<td style="text-align:center">6</td>
<td style="text-align:center">ManufacturerG</td>
<td style="text-align:center">CN</td>
</tr>
</tbody>
</table>
<p>and the snippet of code defined here:</p>
<script src="https://gist.github.com/samueleresca/14edcc54c6d5766e339a4e29e44f7817.js"></script>
<p>The <code>LoadData</code> method loads data with the schema defined in the table above into three different data-sets, using the <code>countryCode</code> as a partition key. Also, the code defines a new check using the following constraint methods: <code>IsComplete</code>, <code>ContainsURL</code>, <code>IsContainedIn</code>. The resulting analyzers (obtained by calling the <code>RequiredAnalyzers()</code> method) are then passed into a new instance of the <code>Analysis</code> class.<br>
The code also defines 3 different <code>InMemoryStateProvider</code> instances, and it executes the <code>AnalysisRunner.Run</code> method for each country code (<code>DE</code>, <code>US</code>, <code>CN</code>) by passing the corresponding <code>InMemoryStateProvider</code>.</p>
<p>The mechanism of aggregated states (<code>AnalysisRunner.RunOnAggregatedStates</code> method) provides a way to merge the 3 in-memory states: <code>dataSetDE</code>, <code>dataSetUS</code> and <code>dataSetCN</code> into a unique table of metrics. It is important to notice that the operation does not trigger any re-computation of the data sample.</p>
<p>Once we have a unique table of metrics, it is also possible to increment only one partition of the data. For example, let&apos;s assume that the <code>US</code> partition changes and the data increases: the system only recomputes the state of the changed partition to update the metrics for the whole table:</p>
<script src="https://gist.github.com/samueleresca/799104fb8acef5b65143093f1b662118.js"></script>
<p>It is essential to notice that the schema of the data must be the same for every data-set state you need to aggregate. This approach results in a lighter computational effort when you have to refresh the metrics of a single partition of your data-set.</p>
<h2 id="handlethescalafunctionalapproach">Handle the Scala functional approach</h2>
<p>The official <a href="https://github.com/awslabs/deequ">awslabs/deequ</a> implementation is written in Scala, which is also the official language of Apache Spark. The strong object-oriented nature of C# adds more difficulties in replicating some of the concepts used by the original Scala <a href="https://github.com/awslabs/deequ">deequ</a> library. An example is the widespread use of the <code>Try</code> and <code>Option</code> monads. Fortunately, it is not the first time that someone ports a Scala library to C#/.NET: <a href="https://github.com/akkadotnet/akka.net">Akka.NET</a> (a port of Akka) has a <a href="https://github.com/akkadotnet/akka.net/wiki/Scala-to-C%23-Conversion-Guide">handy guide</a> that gives some conversion suggestions for doing that. The Akka.NET repository also provides some implementation utilities, such as the <code>Try&lt;T&gt;</code> and <code>Option&lt;T&gt;</code> monads for C#, which are also used by the deequ.NET code.</p>
<h2 id="finalthoughts">Final thoughts</h2>
<p>This post described the initial work that I did to port the deequ library into the .NET ecosystem. We have seen an introduction to some of the components that are part of the architecture of the library, such as the <em>checks</em> part, the <em>constraint API</em>, the <em>analyzers layer</em>, and the <em>batch vs. incremental computation</em> approach.</p>
<p><img src="https://samueleresca.net/content/images/2020/08/Screenshot-2020-08-25-at-22.36.05-2.jpg" alt="Large-Scale Data Quality Verification in .NET PT.1" loading="lazy"></p>
<p>I&apos;m going to cover the rest of the core topics of the library in a future post, such as the <em>metrics history</em>, the <em>anomaly detectors</em>, and the <em>deployment part</em>.</p>
<p>In the meantime, this is the repository where you can find the actual library implementation, <a href="https://github.com/samueleresca/deequ.net">samueleresca/deequ.NET</a>, and this is the original <a href="https://github.com/awslabs/deequ">awslabs/deequ</a> library.</p>
<h3 id="references">References</h3>
<p><a href="http://www.vldb.org/pvldb/vol11/p1781-schelter.pdf">Automating large-scale data quality verification.</a></p>
<p><a href="https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf">Hidden Technical Debt in Machine Learning Systems.</a></p>
<p><a href="https://github.com/akkadotnet/akka.net/wiki/Scala-to-C%23-Conversion-Guide">akka.net - Scala to C# Conversion Guide.</a></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Notes on threading]]></title><description><![CDATA[ This post shares some personal notes related to multithreading in C#. ]]></description><link>https://samueleresca.net/multithreading-in-c-notes/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac18c</guid><category><![CDATA[.NET Core]]></category><category><![CDATA[C#]]></category><category><![CDATA[threading]]></category><category><![CDATA[ASP.NET]]></category><category><![CDATA[ASP.NET Core]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Thu, 07 May 2020 13:02:25 GMT</pubDate><media:content url="https://samueleresca.net/content/images/2020/08/Screenshot-2020-08-25-at-11.45.03.png" medium="image"/><content:encoded><![CDATA[<img src="https://samueleresca.net/content/images/2020/08/Screenshot-2020-08-25-at-11.45.03.png" alt="Notes on threading"><p>This post shares some personal notes related to multithreading in C#. Although it describes several concepts of multithreading in C#, some of the topics can be similar in other typed languages. The following notes are the results of readings, interview prep that I had in the past years. The post implements the concepts on a concrete <code>LRUCache</code> example. The code related to the post is <a href="https://github.com/samueleresca/Blog.LRUCacheThreading">available on GitHub</a>.</p><!--kg-card-begin: markdown--><p>Below the topics described in this post:</p>
<ul>
<li><a href="#lowlevelapithreadinitialization">Low-level API - thread initialization</a></li>
<li><a href="#threadpoolthreadinitialization">Thread Pool - thread initialization</a></li>
<li><a href="#threadpoolfromataskperspective">Thread pool from a Task perspective</a>
<ul>
<li><a href="#starvationissue">Starvation issue</a></li>
</ul>
</li>
<li><a href="#threadsafetyandsynchronization">Thread safety and synchronization</a>
<ul>
<li><a href="#lrucacheandlockingconstructs">LRUCache and locking constructs</a></li>
</ul>
</li>
<li><a href="#exampleofsignalingbetweenthreads">Example of signaling between threads</a></li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="lowlevelapithreadinitialization">Low-level API - thread initialization</h2>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>There are different ways of initializing threads. .NET provides a different level of abstractions for the thread initializations. First of all, an intuitive approach is to declare a <code>new Thread</code> instance in your code:</p>
<pre><code class="language-csharp">using System;
using System.Threading;

namespace Blog.LRUCacheThreadSafe
{
    class Program
    {
        static void Main(string[] args)
        {
            new Thread(myMethod).Start();
        }

        static void myMethod() =&gt; Console.WriteLine(&quot;Running myMethod&quot;);
    }
}
</code></pre>
<p>The <code>Thread</code> class exposes properties to check the state of the instance of the thread. At the same time, it is possible to alter the state of the thread by calling the methods exposed by the instance, e.g., the <code>Start()</code> method starts the execution of the thread. It is also possible to wait for the completion of a thread by using the <code>Join()</code> method exposed by the <code>Thread</code> instance:</p>
<pre><code class="language-csharp">using System;
using System.Threading;

namespace Blog.LRUCacheThreadSafe
{
    class Program
    {
        static void Main(string[] args)
        {
            Thread thread = new Thread(myMethod);
            thread.Start();
            thread.Join();
            Console.WriteLine(&quot;myMethod execution ended&quot;);
        }

        static void myMethod() =&gt; Console.WriteLine(&quot;Running myMethod&quot;);
    }
}
</code></pre>
<p>The <code>Join()</code> method provides a way to wait until the execution of the <code>thread</code> instance is completed. After that, the execution proceeds on the main thread, in this case, in the <code>Program.Main</code> method. Every time we use the <code>Join()</code> method, the calling thread is blocked.</p>
<p>One important fact to notice is that the <code>new Thread</code> initialization does not use the thread pool; we can verify that by checking the <code>IsThreadPoolThread</code> property exposed by the <code>thread</code> object.</p>
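<p>For example, the following snippet shows the difference between a manually created thread and a thread pool thread:</p>
<pre><code class="language-csharp">using System;
using System.Threading;

class Program
{
    static void Main()
    {
        Thread thread = new Thread(() =&gt;
            Console.WriteLine($&quot;Pool thread: {Thread.CurrentThread.IsThreadPoolThread}&quot;)); // prints False
        thread.Start();
        thread.Join();

        ThreadPool.QueueUserWorkItem(_ =&gt;
            Console.WriteLine($&quot;Pool thread: {Thread.CurrentThread.IsThreadPoolThread}&quot;)); // prints True
        Thread.Sleep(100); // give the pool thread time to run before the process exits
    }
}
</code></pre>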
<p>Because of the lack of abstraction over the low-level nature of the <code>Thread</code> class, Microsoft introduced multiple abstractions over this class. Most of them rely on the thread pool to improve the efficiency of the code. Furthermore, it is strongly suggested to use the higher-level APIs to achieve a parallel workflow in .NET.<br>
As we will see later, the <code>Thread</code> class is still useful to cover some of the core concepts related to synchronization and messaging between threads.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="threadpoolthreadinitialization">Thread Pool - thread initialization</h2>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><p>The creation of a new <code>Thread</code> comes with a considerable overhead in terms of time and resources. For this reason, Microsoft built an abstraction that delegates the management of the OS threads to the <em>Thread Pool</em> implemented in the CLR. One of the <em>Thread Pool</em>&apos;s main features is to optimize the creation and destruction of threads by re-using them. Therefore, it is more suitable for short-running tasks.<br>
The <em>Thread Pool</em> way of working can be described in the following way:</p>
<p><img src="https://samueleresca.net/content/images/2020/03/Screenshot-2020-03-15-at-19.31.06.png" alt="Notes on threading" loading="lazy"></p>
<p>There is a queue of tasks that are processed by the N threads present in the <em>Thread Pool</em>. The queuing operation is implemented using the following method:</p>
<p><code>ThreadPool.QueueUserWorkItem(state =&gt; { ... });</code></p>
<p>More in detail, the queuing process uses the <em>thread injection and retirement algorithm</em> described in <a href="http://aviadezra.blogspot.com/2009/06/net-clr-thread-pool-work.html">this post</a>, which proceeds with the following workflow:</p>
<p><img src="https://samueleresca.net/content/images/2020/03/Screenshot-2020-03-15-at-21.45.46.png" alt="Notes on threading" loading="lazy"></p>
<p>When a new task is queued and there are no available threads, the Thread Pool verifies whether the running threads have reached the maximum number of threads; if so, it waits until a running thread completes. Otherwise, it checks if the number of running threads is less than the minimum, and in that case, it creates a new thread immediately.<br>
Furthermore, if the number of running threads has already reached the allowed minimum, the creation of new threads is throttled.</p>
<p>The thread pool can be configured with a maximum / minimum number of threads using the following approach:</p>
<p><code>ThreadPool.SetMaxThreads(int workerThreads, int completionPortThreads);</code><br>
<code>ThreadPool.SetMinThreads(int workerThreads, int completionPortThreads);</code></p>
<p>To understand these methods, it is also essential to distinguish between two different types of threads:</p>
<ul>
<li>the <code>workerThreads</code> parameter sets the number of <em>worker threads</em> which refers to all the CPU-bound threads used by the application logic;</li>
<li>the <code>completionPortThreads</code> (not used in UNIX-like OS) parameter sets the amount of <em>I/O threads</em> which refers to all the I/O-bound operations that use resources other than CPUs;</li>
</ul>
<p>These two types were unified with the introduction of the <code>System.Threading.Tasks</code> namespace.</p>
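<p>For example, the current limits can be inspected and adjusted as follows:</p>
<pre><code class="language-csharp">using System;
using System.Threading;

class Program
{
    static void Main()
    {
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);
        Console.WriteLine($&quot;Min: {minWorker}/{minIo} - Max: {maxWorker}/{maxIo}&quot;);

        // Raising the minimum allows the pool to inject threads without delay
        // up to the new value (use with care: it can hide design issues)
        ThreadPool.SetMinThreads(minWorker * 2, minIo);
    }
}
</code></pre>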
<h2 id="threadpoolfromataskperspective">Thread pool from a Task perspective</h2>
<p>Starting from CLR 4, the Thread Pool&apos;s way of queuing tasks from the <code>System.Threading.Tasks</code> perspective has been optimized to improve the handling of multiple nested asynchronous tasks. CLR 4 introduced a double-queue architecture:</p>
<ul>
<li>the <em>global queue</em>, used by the main thread to queue tasks in FIFO order;</li>
<li>the <em>local queues</em>, one for each worker thread, used to queue the work in the context of a specific task; a local queue works like a stack (LIFO order);</li>
</ul>
<p><img src="https://samueleresca.net/content/images/2020/03/Screenshot-2020-03-17-at-21.13.13.png" alt="Notes on threading" loading="lazy"></p>
<p>The above image (1) describes the process of queuing two tasks: <code>Task 1</code> and <code>Task 2</code>. <code>Task 2</code> has a nested task, called <code>Task 3</code>. <code>Task 1</code> and <code>Task 2</code> are queued in the <em>global queue</em> and picked up respectively by the <code>Worker Thread 1</code> and the <code>Worker Thread 2</code>, while <code>Task 3</code> is queued in the local queue of the <code>Worker Thread 1</code> (2).<br>
Now two things can happen:</p>
<ul>
<li>if the <code>Worker Thread 1</code> ends processing <code>Task 1</code> before the other worker threads are free, it is going to pick up <code>Task 3</code> from its <em>local queue</em>;</li>
<li>if the <code>Worker Thread 2</code> ends processing <code>Task 2</code> before the other workers, it is going to look into its <em>local queue</em> for another task; if the local queue is empty, it will proceed by checking the global queue; and finally, if the global queue is also empty, it is going to pick up <code>Task 3</code> from the <em>local queue</em> of the <code>Worker Thread 1</code>. This is defined as <em>work-stealing</em>.</li>
</ul>
<p>Recently, the corefx repository of .NET Core implemented the possibility to queue work items directly to the local queues through the <code>preferLocal</code> parameter of the <code>ThreadPool.QueueUserWorkItem</code> method: <a href="https://github.com/dotnet/corefx/pull/24396">dotnet/corefx/pull/24396</a>.</p>
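<p>A sketch of the overload introduced by that pull request looks like this:</p>
<pre><code class="language-csharp">// Queues the callback to the current thread&apos;s local queue, when possible,
// instead of the global queue
ThreadPool.QueueUserWorkItem(
    state =&gt; Console.WriteLine(state), &quot;hello&quot;, preferLocal: true);
</code></pre>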
<h3 id="starvationissue">Starvation issue</h3>
<p>The <em>Thread pool starvation</em> issue became more frequent with the increasing need for high-scale async services. Its main symptom is a service that is not able to handle the increasing number of incoming requests while the CPUs are still not fully utilized.</p>
<p>This symptom can point your attention to 3rd-party component calls. Furthermore, it is also possible to see a method call that takes a long time to complete its execution.</p>
<p>However, sometimes the stall can be caused by the lack of threads that can run the next step in servicing the request, so it merely stalls, waiting for the availability of a thread. This time is registered as execution time for the asynchronous method; therefore, it seems that the operation randomly takes longer than it should.</p>
<p>There is the following <a href="https://labs.criteo.com/2018/10/net-threadpool-starvation-and-how-queuing-makes-it-worse/">blog post by Criteo Labs</a> that explains the issue in detail. Let&apos;s take the example they describe in the post:</p>
<script src="https://gist.github.com/samueleresca/273f0660746de41c67d8d1b068896d9a.js"></script>
<p>The above implementation uses the <code>System.Threading.Task</code> APIs to simulate a thread starvation issue through some blocking calls. Mainly, this set of rules describes the enqueuing behaviors:</p>
<ul>
<li>
<p>Every time you execute <code>Task.Factory.StartNew</code> from the main thread, the task will be added to the <em>global queue</em>;</p>
</li>
<li>
<p>The <code>ThreadPool.QueueUserWorkItem</code> method adds the execution to the <em>global queue</em> unless you specify the <code>preferLocal</code> parameter, see <a href="https://github.com/dotnet/corefx/pull/24396">#24396</a>;</p>
</li>
<li>
<p>The <code>Task.Yield</code> instruction adds the <code>Task</code> to the <em>global queue</em>, unless a non-default task scheduler is in use;</p>
</li>
<li>
<p>Every time you execute <code>Task.Factory.StartNew</code> from inside a worker thread, it will add the task to <em>local queue</em>;</p>
</li>
<li>
<p>In all other cases, the <code>Task</code> is scheduled on the <em>local queue</em>;</p>
</li>
</ul>
<p>In the code above, the main thread spawns a new <code>Process</code> method execution every <code>200 msec</code> in the global queue. The <code>Process</code> method triggers another task in the local queue using the <code>Task.Run</code> method, and it waits for the completion of the execution. When a thread pool thread looks for work, it follows this check sequence, as mentioned previously:</p>
<ul>
<li>check if there are items in its <em>local queue</em>;</li>
<li>check if there are items in the <em>global queue</em>;</li>
<li>check the <em>local queues</em> of other threads (work-stealing);</li>
</ul>
<p>The <code>Producer</code> method has already queued another set of tasks in the global queue, and because the rate at which tasks are queued into the global queue is higher than the rate at which the thread pool creates new threads, there is a saturation of tasks.<br>
You can see it by executing the application and checking the output, which in most cases is blocked:</p>
<pre><code>Ended - 18:38:20
Ended - 18:38:20
Ended - 18:38:20
Ended - 18:38:21
Ended - 18:38:21
Ended - 18:38:21
Ended - 18:38:22
Ended - 18:38:22
</code></pre>
<p>Furthermore, we can see that the number of threads of the process increases during the execution time without stabilizing, while the CPU usage stays low:</p>
<p><img src="https://samueleresca.net/content/images/2020/03/Screenshot-2020-03-21-at-19.09.07.png" alt="Notes on threading" loading="lazy"></p>
<p>In this case, the <code>Process</code> method is queued into the global queue at a regular interval. The <code>Main</code> method queues tasks more quickly than the thread pool can create new threads; therefore, the application reaches a condition of starvation. Because the bottleneck derives from the global queue, a common thought could be to de-prioritize it, see <a href="https://github.com/dotnet/runtime/issues/28295">#28295</a>, but the work-stealing between local queues comes with a cost: in some cases, it may cause performance degradation in the system. Asynchronous operations are becoming more common these days; it is essential to pay attention to this kind of issue and to avoid mixing asynchronous with synchronous stacks. Moving from a synchronous stack to async can be dangerous, see the <a href="https://devblogs.microsoft.com/devopsservice/?p=17665">post-mortem of the Azure DevOps Services outages in October 2018</a>.</p>
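<p>As a rule of thumb, blocking on asynchronous work keeps a pool thread busy while it waits, whereas awaiting releases the thread back to the pool. A minimal sketch:</p>
<pre><code class="language-csharp">using System;
using System.Threading.Tasks;

class Program
{
    static async Task&lt;string&gt; FetchAsync()
    {
        await Task.Delay(100); // simulates an I/O-bound call
        return &quot;done&quot;;
    }

    static async Task Main()
    {
        // string blocked = FetchAsync().Result; // sync-over-async: blocks a pool thread
        Console.WriteLine(await FetchAsync());   // releases the thread while waiting
    }
}
</code></pre>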
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="threadsafetyandsynchronization">Thread safety and synchronization</h2>
<p>A snippet of code is <em>thread-safe</em> when, called by multiple threads, it does not cause race conditions. Race conditions mainly happen when multiple threads share the same resources. It is crucial to highlight that:</p>
<ul>
<li>every variable stored in the <em>stack</em> is thread-safe, since each stack belongs to a single thread;</li>
<li>everything else, stored in the <em>heap</em>, can potentially be shared by multiple threads;</li>
</ul>
<p>Furthermore, we can take some simple examples as a reference. Everything declared as a local variable is in general thread-safe:</p>
<script src="https://gist.github.com/samueleresca/44b3e47dc12bdec391e1d63488fd869b.js"></script>
<p>In this case, the variable has a primitive type, and it is locally declared. Therefore, it is thread-safe out of the box because it is stored in the stack.</p>
<p>Let&apos;s proceed with the following example that declares a reference type inside the method:</p>
<script src="https://gist.github.com/samueleresca/c390db286074976357c684d22a7ea0e0.js"></script>
<p>The pointer stored in the stack refers to an instance of the object stored in the heap memory. Since the object is never returned and it is only used by the <code>Initializer</code> and the <code>Handler</code> methods, you can declare it thread-safe. Let&apos;s consider the situation where the <code>Handler</code> assigns the object to an attribute of the class, as follows:</p>
<script src="https://gist.github.com/samueleresca/2f068c5ba284291785915ee641ce4d5e.js"></script>
<p>The attributes of a class are stored by default along with the object in the heap. Therefore, the above <code>Handler(string value)</code> method is not thread-safe; if it is used by multiple threads, it can lead to a <em>race condition</em>:</p>
<pre><code class="language-csharp">ThreadUnsafeClass threadUnsafeClass = new ThreadUnsafeClass();
new Thread(() =&gt; threadUnsafeClass.Handler(&quot;value1&quot;)).Start();
new Thread(() =&gt; threadUnsafeClass.Handler(&quot;value2&quot;)).Start();
</code></pre>
<p>On the opposite side, you can use techniques such as immutability in order to avoid race conditions:</p>
<pre><code class="language-csharp">new Thread(() =&gt; new ThreadUnsafeClass().Handler(&quot;value1&quot;)).Start();
new Thread(() =&gt; new ThreadUnsafeClass().Handler(&quot;value2&quot;)).Start();
</code></pre>
<p>In the above snippet, each <code>Thread</code> deals with a separate instance of the <code>ThreadUnsafeClass</code> type, so the two threads do not share any data.<br>
Another way to avoid race conditions is to use the <em>locking constructs</em> provided by the framework.</p>
<h3 id="lrucacheandlockingconstructs">LRUCache and locking constructs</h3>
<p>The locking constructs are used to coordinate the usage of shared resources. To describe the locking constructs provided by .NET Core, we are going to use the following example:</p>
<script src="https://gist.github.com/samueleresca/351a1fc6e15485467ffcefb3dd0ecafa.js"></script>
<p>The above code describes a common <em>LRU cache</em> implementation. The code defines a <code>Dictionary</code> called <code>_records</code>, which contains the <code>id -&gt; value</code> mapping of each cache record. The <code>_freq</code> attribute stores the order in which records were last accessed by referring to the keys of the records. The <code>LRUCache&lt;T&gt;</code> type defines two methods, <code>Get</code> and <code>Set</code>, and a <code>_capacity</code> attribute that represents the capacity of the LRU cache. In this example, we deliberately ignore the concurrent collections provided by the CLR.<br>
The following tests verify the behaviors implemented in the <code>LRUCache&lt;T&gt;</code>:</p>
<script src="https://gist.github.com/samueleresca/90887e796221f5a641486bfc493eb04a.js"></script>
<p>The tests lock in some of the behaviors of the <code>LRUCache</code> by exercising the following actions:</p>
<ul>
<li>the cache removes the least-accessed records when the capacity is reached;</li>
<li>the cache correctly stores integer values;</li>
<li>the cache prioritization changes when a record is read;</li>
</ul>
<p>Let&apos;s suppose that we want to access the collection from multiple threads, as follows:</p>
<script src="https://gist.github.com/samueleresca/348970b6c6ebfdbc8150d6ba58fb08ba.js"></script>
<p>The code mentioned above executes a set of <code>Task</code>s on the same <code>LRUCache</code> instance; the execution results in the following exception:</p>
<pre><code>Unhandled exception. System.InvalidOperationException: Operations that change non-concurrent collections must have exclusive access. A concurrent update was performed on this collection and corrupted its state. The collection&apos;s state is no longer correct.
   at System.Collections.Generic.Dictionary`2.FindEntry(TKey key)
   at System.Collections.Generic.Dictionary`2.get_Item(TKey key)
   at Blog.LRUCacheThreadSafe.LRUCache`1.Set(Int32 key, T val) in /Users/samuele.resca/Projects/Blog.LRUCacheThreadSafe/Blog.LRUCacheThreadSafe/LRUCache.cs:line 52
   at Blog.LRUCacheThreadSafe.Program.&lt;&gt;c__DisplayClass0_0.&lt;Main&gt;b__0() in /Users/samuele.resca/Projects/Blog.LRUCacheThreadSafe/Blog.LRUCacheThreadSafe/Program.cs:line 13
   at System.Threading.Tasks.Task.InnerInvoke()
   at System.Threading.Tasks.Task.&lt;&gt;c.&lt;.cctor&gt;b__274_0(Object obj)
</code></pre>
<p>Each <code>Task</code> runs the <code>Set</code> operation on the same <code>LRUCache</code> instance, which results in a race condition between the tasks. To avoid this kind of exception, we can implement a <em>locking process</em> using the constructs available in .NET. The <code>lock</code> statement guarantees that a single thread at a time can access the code block it encloses. It relies on the <code>Monitor.Enter</code> and <code>Monitor.Exit</code> constructs, and it requires a reference-type instance that works as a <em>synchronization object</em>. Therefore, we can wrap our <code>Get</code> and <code>Set</code> methods in the following way:</p>
<script src="https://gist.github.com/samueleresca/1180078ea7fdb57920d423b6a93fc5cd.js"></script>
<p>Now the <code>LRUCache&lt;T&gt;</code> class defines a <code>_locker</code> object that is used to perform the locking operation. Any reference type can be used as a synchronization object; in the case above, we use a plain <code>object</code>. The <code>lock</code> constructs wrap the implementation of the <code>Get</code> and <code>Set</code> methods to guarantee that only one thread accesses them at a time.</p>
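<p>As a minimal sketch of the same pattern (the <code>SafeStore</code> class below is hypothetical and much simpler than the article&apos;s <code>LRUCache&lt;T&gt;</code>), every public member takes the lock on the same private object before touching the shared dictionary:</p>

```csharp
using System.Collections.Generic;

// Minimal sketch: a private synchronization object guards every access
// to the shared dictionary, so concurrent Set calls cannot corrupt it.
public class SafeStore
{
    private readonly object _locker = new object();
    private readonly Dictionary<int, int> _records = new Dictionary<int, int>();

    public void Set(int key, int value)
    {
        lock (_locker)        // Monitor.Enter/Exit under the hood:
        {                     // only one thread at a time mutates _records
            _records[key] = value;
        }
    }

    public int Count
    {
        get { lock (_locker) { return _records.Count; } }
    }
}
```

<p>With this wrapping, many tasks can call <code>Set</code> concurrently without corrupting the dictionary&apos;s internal state.</p>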
<p>An alternative is to use the <code>Semaphore</code> / <code>SemaphoreSlim</code> types. They implement a count-based restriction that allows concurrent access to at most the number of threads defined in the constructor, queuing any further threads. The <code>Semaphore</code> constructor accepts a name to use in case you need a cross-process lock. The <code>SemaphoreSlim</code> type was introduced more recently: it is optimized for parallel programming and exposes asynchronous methods, but it does not support cross-process usage. The following implementation describes the use of <code>SemaphoreSlim</code>:</p>
<script src="https://gist.github.com/samueleresca/1eaafc152aee1e3415a1c141b53c9ecb.js"></script>
<p>The initialization of the <code>_sem</code> attribute defines the number of threads that can access the resource concurrently: in this case, we allow a maximum of 1 thread per operation. Furthermore, we use a <code>try {} finally {}</code> block to manage the locking/unlocking of the resource:</p>
<ul>
<li><code>_sem.Wait()</code> blocks the thread until the semaphore count allows it to enter;</li>
<li><code>_sem.Release()</code> exits the semaphore and releases the lock;</li>
</ul>
<p>It is important to notice that the <code>try {} finally {}</code> block is used to ensure that the code always releases the semaphore. Once we have entered the semaphore using <code>_sem.Wait()</code>, we must always call <code>_sem.Release()</code>; otherwise, the application will get stuck waiting for the semaphore to be released, e.g.:</p>
<script src="https://gist.github.com/samueleresca/3a1bfd50b4f75a9c5be4d15b52ed2666.js"></script>
<p>In the case described above, the <code>_sem.Release()</code> method is only called in the main execution branch of the application. Therefore, the application gets stuck whenever the condition at line 7 (<code>!_records.ContainsKey(key)</code>) is satisfied, because execution never exits the semaphore.</p>
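<p>A minimal sketch of the correct pattern (the <code>GuardedResource</code> class below is hypothetical): the <code>Release()</code> call lives in the <code>finally</code> block, so every exit path, including the early return, leaves the semaphore:</p>

```csharp
using System.Collections.Generic;
using System.Threading;

// Sketch: Release() in finally guarantees the semaphore is exited on
// every branch, including the early-return "key not found" path.
public class GuardedResource
{
    private readonly SemaphoreSlim _sem = new SemaphoreSlim(1, 1);
    private readonly Dictionary<int, string> _records = new Dictionary<int, string>();

    public string Get(int key)
    {
        _sem.Wait();
        try
        {
            if (!_records.ContainsKey(key))
                return null;              // safe: finally still releases
            return _records[key];
        }
        finally
        {
            _sem.Release();               // guaranteed on every exit path
        }
    }

    public void Set(int key, string value)
    {
        _sem.Wait();
        try { _records[key] = value; }
        finally { _sem.Release(); }
    }
}
```

<p>Even after a miss on a missing key, subsequent calls still enter the semaphore instead of deadlocking.</p>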
<p>Another synchronization construct that enables more granular control over threads is the <code>ReaderWriterLockSlim</code>. This type provides a way of differentiating locking depending on the type of access performed on the resource. In most cases, instances are thread-safe for read operations but not for concurrent updates. The <code>ReaderWriterLockSlim</code> provides the <code>EnterReadLock</code> / <code>ExitReadLock</code> methods, which allow other concurrent read operations, and the <code>EnterWriteLock</code> / <code>ExitWriteLock</code> methods, which exclude both read and write operations on a specific resource. We can use it in the following way:</p>
<script src="https://gist.github.com/samueleresca/1d661d15acbf1aab48b8943ff6acfc79.js"></script>
<p>The code mentioned above describes the implementation of the <code>ReaderWriterLockSlim</code> into the <code>Get</code> method of the <code>LRUCache</code> type. The lock is applied in the following way:</p>
<ol>
<li>Apply the read lock using the <code>EnterReadLock</code> method;</li>
<li>Check if the <code>_records</code> member contains the key;</li>
<li>Exit from the read lock using <code>ExitReadLock</code>;</li>
<li>Apply the write lock using the <code>EnterWriteLock</code> method;</li>
<li>Update the frequency table by moving the element to the last position;</li>
<li>Exit from the write lock using <code>ExitWriteLock</code>;</li>
<li>Finally, return the <code>CacheValue</code> by re-entering the read lock;</li>
</ol>
<p>As you can see, there are a couple of steps where a transition happens between the <em>read lock</em> and the <em>write lock</em>. For this reason, the <code>ReaderWriterLockSlim</code> introduces a lock type that can be activated using the <code>EnterUpgradeableReadLock</code> method. The upgradeable lock starts as a normal read lock, and it can be upgraded to a write lock, e.g.:</p>
<script src="https://gist.github.com/samueleresca/77d618fca0f6e5466455916386af81ed.js"></script>
<p>The code above uses <code>EnterUpgradeableReadLock</code> to acquire the read lock in the <code>Get</code> method, and it upgrades the lock to write mode only when a mutation is needed:</p>
<ol>
<li>Acquire the upgradeable read lock using the <code>EnterUpgradeableReadLock</code> method;</li>
<li>Check if the <code>_records</code> member contains the key;</li>
<li>Apply the write lock using the <code>EnterWriteLock</code> method;</li>
<li>Update the frequency table by moving the element to the last position;</li>
<li>Exit from the write lock using <code>ExitWriteLock</code>;</li>
<li>Finally, return the <code>CacheValue</code> and exit from the upgradeable read lock;</li>
</ol>
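<p>A hedged sketch of the upgradeable-lock pattern described above (the <code>UpgradeableCache</code> shape below is hypothetical and much simplified compared to the article&apos;s <code>LRUCache&lt;T&gt;</code>):</p>

```csharp
using System.Collections.Generic;
using System.Threading;

// Sketch: the upgradeable read lock coexists with plain read locks and
// is upgraded to a write lock only for the brief frequency-table mutation.
public class UpgradeableCache
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Dictionary<int, string> _records = new Dictionary<int, string>();
    private readonly List<int> _freq = new List<int>();

    public string Get(int key)
    {
        _lock.EnterUpgradeableReadLock();       // read lock that can be upgraded
        try
        {
            if (!_records.ContainsKey(key)) return null;

            _lock.EnterWriteLock();             // upgrade only for the mutation
            try
            {
                _freq.Remove(key);              // move key to the most-recent slot
                _freq.Add(key);
            }
            finally { _lock.ExitWriteLock(); }  // back to upgradeable read mode

            return _records[key];
        }
        finally { _lock.ExitUpgradeableReadLock(); }
    }

    public void Set(int key, string value)
    {
        _lock.EnterWriteLock();
        try
        {
            _records[key] = value;
            if (!_freq.Contains(key)) _freq.Add(key);
        }
        finally { _lock.ExitWriteLock(); }
    }
}
```

<p>Note that only one thread at a time may hold the upgradeable lock, but while it stays in read mode, other threads can still hold plain read locks.</p>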
<p>The implementation above scales the lock restriction up or down depending on the type of access the next instruction is about to perform.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="exampleofsignalingbetweenthreads">Example of signaling between threads</h2>
<p>The following example implements some multithreading tests on the <code>LRUCache&lt;T&gt;</code> implementation. The test uses the <code>ManualResetEventSlim</code> class to coordinate multiple concurrent operations on the <code>LRUCache</code> type. The <code>ManualResetEventSlim</code> can be used as a signaling event between threads to block and unblock their execution.</p>
<p>The following code describes the implementation of the test:</p>
<script src="https://gist.github.com/samueleresca/9129081c532c7ad4c97d9c8e1f8d137f.js"></script>
<p>The above test class declares a <code>should_supports_operations_from_multiple_threads</code> test method. The method performs the following operations:</p>
<ol>
<li>Split the test into two phases by declaring the <code>setPhase</code> and <code>getPhase</code> using the <code>ManualResetEventSlim</code> instance type;</li>
<li>Initializes an array of type <code>Thread</code>. For each thread in the array, it performs a <code>Set</code> and a <code>Get</code> operation on the <code>LRUCache&lt;T&gt;</code> type, and it uses the <code>ManualResetEventSlim</code> instances for blocking the threads;</li>
<li>Once the array of <code>Thread</code> is declared, the code calls the <code>setPhase.Set()</code> method to trigger the <code>_cache.Set</code> operations. The same approach is taken for the <code>_cache.Get</code> operations;</li>
</ol>
<p>Furthermore, each thread proceeds as follows:</p>
<p><img src="https://samueleresca.net/content/images/2020/05/Screenshot-2020-05-02-at-14.48.43.png" alt="Notes on threading" loading="lazy"></p>
<p>The threads wait until the <code>setPhase.Set</code> method is called by the main thread; at this point, every thread runs the <code>_cache.Set</code> operation on the <code>LRUCache&lt;T&gt;</code> instance. Each executed set is counted by incrementing a <code>progressCounter</code>. The main thread proceeds by calling the <code>getPhase.Set</code> method once the <code>progressCounter</code> reaches the number of initialized threads.<br>
This way, <code>nThreads</code> threads execute the <code>_cache.Set</code> operation at the same time; after that, they are blocked until the last set operation has been performed. Finally, the <code>_cache.Get</code> method is called to read back all the values.</p>
<p>As expected, the implementation executes all the <code>_cache.Set</code> operations before the <code>_cache.Get</code> execution.</p>
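<p>The coordination mechanism can be reduced to the following sketch (a simplified, single-gate version with hypothetical names, not the article&apos;s full test):</p>

```csharp
using System.Threading;

// Sketch: worker threads block on a ManualResetEventSlim gate until the
// main thread opens it, then all proceed and report their progress.
public static class TwoPhase
{
    public static int Run(int nThreads)
    {
        var setPhase = new ManualResetEventSlim(false);
        int progressCounter = 0;
        var threads = new Thread[nThreads];

        for (int i = 0; i < nThreads; i++)
        {
            threads[i] = new Thread(() =>
            {
                setPhase.Wait();                        // block until signaled
                Interlocked.Increment(ref progressCounter);
            });
            threads[i].Start();
        }

        setPhase.Set();                                 // release all workers at once
        foreach (var t in threads) t.Join();
        return progressCounter;                         // equals nThreads
    }
}
```

<p>The article&apos;s test adds a second gate (<code>getPhase</code>) on top of this pattern to separate the set phase from the get phase.</p>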
<p>Depending on which <code>LRUCache&lt;T&gt;</code> implementation we initialize, the test either fails or passes.<br>
Because the <a href="https://github.com/samueleresca/Blog.LRUCacheThreading/blob/master/Blog.LRUCacheThreading/LRUCache.cs">LRUCache</a> type doesn&apos;t support multithreading, we receive a <code>System.InvalidOperationException</code>. If we use the <a href="https://github.com/samueleresca/Blog.LRUCacheThreading/blob/master/Blog.LRUCacheThreading/LRUCacheReaderWriterLock.cs">LRUCacheReaderWriterLock</a> type, we receive the following outcome from the <code>_outputConsole</code> helper:</p>
<script src="https://gist.github.com/samueleresca/dc01da1417ef81707f2bd0e84b02922e.js"></script>
<p>Another important aspect to consider is that we are incrementing the <code>progressCounter</code> in the following way:</p>
<pre><code>Interlocked.Increment(ref progressCounter);
</code></pre>
<p>Because the counter is declared outside the scope of the threads and is used by multiple threads, we use <code>Interlocked.Increment</code> to force an atomic operation. <code>Interlocked.Increment</code> takes a reference to the variable to increment atomically.</p>
<p>Although reads and writes on <code>int</code> can be considered atomic:</p>
<blockquote>
<p>CLI shall guarantee that read and write access to properly aligned memory<br>
locations no larger than the native word size (the size of type native int) is atomic when all the write accesses to a location are the same size.<br>
<a href="http://www.ecma-international.org/publications/standards/Ecma-335.htm">ECMA-335 standard</a></p>
</blockquote>
<p>the increment operation (a read followed by a write) is not atomic; for this reason, the <code>Interlocked</code> class can be helpful in these cases.</p>
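<p>A small, self-contained demonstration of the idea (the <code>AtomicIncrement</code> helper is hypothetical): two tasks hammer the same counter, and <code>Interlocked.Increment</code> keeps the total exact, whereas a plain <code>counter++</code> could lose updates under contention:</p>

```csharp
using System.Threading;
using System.Threading.Tasks;

// Sketch: Interlocked.Increment turns the read-modify-write into a single
// atomic operation, so no increments are lost between competing tasks.
public static class AtomicIncrement
{
    public static int Count(int iterations)
    {
        int counter = 0;
        var t1 = Task.Run(() =>
        {
            for (int i = 0; i < iterations; i++)
                Interlocked.Increment(ref counter);   // atomic; counter++ would race
        });
        var t2 = Task.Run(() =>
        {
            for (int i = 0; i < iterations; i++)
                Interlocked.Increment(ref counter);
        });
        Task.WaitAll(t1, t2);
        return counter;   // always 2 * iterations
    }
}
```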
<h2 id="finalthoughts">Final thoughts</h2>
<p>This post provides some of the notions around threading in C#. It describes topics like <em>starvation</em>, <em>event handling</em>, and <em>blocking synchronization</em>, and it applies these concepts to an <code>LRUCache</code> implementation example. You can find the code in the <a href="https://github.com/samueleresca/Blog.LRUCacheThreading/">following repository</a>. The information provided in this post is the foundation of some topics that are NOT covered, like <em>Dataflow</em>, <em>Channels</em>, <em>Blocking collections</em>, and, more in general, the Task Parallel Library provided by .NET.</p>
<p>Below are some references related to this post:</p>
<ul>
<li><em><a href="https://github.com/dotnet/corefx/pull/24396">#24396 - Expose/test ThreadPool.QueueUserWorkItem(..., bool preferLocal)</a></em></li>
<li><em><a href="https://devblogs.microsoft.com/devopsservice/?p=17665">Postmortem: Azure DevOps Service Outages in October 2018</a></em></li>
<li><em><a href="https://labs.criteo.com/2018/10/net-threadpool-starvation-and-how-queuing-makes-it-worse/">.NET Threadpool starvation, and how queuing makes it worse</a></em></li>
<li><em><a href="http://aviadezra.blogspot.com/2009/06/net-clr-thread-pool-work.html">.NET CLR Thread Pool Internals</a></em></li>
<li><em><a href="http://www.ecma-international.org/publications/standards/Ecma-335.htm">ECMA-335 standard</a></em></li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Getting started with Apache Spark using .NET Core]]></title><description><![CDATA[ The following article covers what I have learned about Apache Spark core architecture.]]></description><link>https://samueleresca.net/getting-started-with-apache-spark-using-net-core/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac18b</guid><category><![CDATA[.NET Core]]></category><category><![CDATA[Apache Spark]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Mon, 30 Sep 2019 06:44:51 GMT</pubDate><media:content url="https://samueleresca.net/content/images/wordpress/2019/09/EB6kFM_WwAAT6yF.jpeg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html-->
<img src="https://samueleresca.net/content/images/wordpress/2019/09/EB6kFM_WwAAT6yF.jpeg" alt="Getting started with Apache Spark using .NET Core"><p>For a few months now, I have been focusing my attention on Data / Big Data technologies, both for work and for personal reasons. </p>



<p>The following post covers what I have learned about Apache Spark Core and its architecture. Above all, it introduces the <a href="https://dotnet.microsoft.com/apps/data/spark">.NET Core library for Apache Spark</a>, which aims to bring the Apache Spark tools into the .NET ecosystem.  I wrote about .NET in the data space a few months ago: <a aria-label="https://samueleresca.net/2019/04/data-analysis-using-f-and-jupyter-notebook/ (opens in a new tab)" href="https://samueleresca.net/2019/04/data-analysis-using-f-and-jupyter-notebook/">Data analysis using F# and Jupyter Notebook</a>.</p>



<p>This post covers the following topics:</p>



<ul><li>Intro to Apache Spark;</li><li>Apache Spark architecture: RDDs, DataFrames, and the driver-workers way of working;</li><li>Querying with Spark SQL using the .NET Core library;</li><li>Considerations around the use of .NET with Apache Spark;</li></ul>



<h2>Intro to Apache Spark</h2>



<p>Apache Spark is a framework for processing data in a distributed way. Since 2014, it has been considered the successor of Hadoop for handling and manipulating Big Data. Spark&#x2019;s success can be attributed not only to its performance but also to the rich and ever-growing ecosystem that supports and contributes to the evolution of this technology. Moreover, the Apache Spark APIs are readable, testable and easy to understand.</p>



<p>Nowadays, Apache Spark and its libraries provide an ecosystem that can support a team or a company in data analysis, streaming analysis, machine learning and finally graph processing.</p>



<p>Apache Spark provides a wide set of modules:</p>



<ul><li><a href="https://github.com/apache/spark/tree/master/sql">Spark SQL</a> for data analysis over relational data;</li><li><a href="https://spark.apache.org/docs/latest/streaming-programming-guide.html#overview">Spark Streaming</a> for the streaming of data;</li><li><a href="https://spark.apache.org/mllib/">MLlib</a> for the machine learning;</li><li><a href="https://spark.apache.org/graphx/">GraphX</a> for distributed graph processing;</li></ul>



<p>All these modules and libraries stand on top of the Apache Spark Core API.</p>



<h2>Introduction to Spark for .NET Core</h2>



<p>.NET Core is the multi-purpose, open-source and cross-platform framework built by Microsoft.</p>



<p>Microsoft is investing a lot in the .NET Core ecosystem, and the .NET team is bringing .NET technologies into the data world. Last year, Microsoft released a machine learning framework for .NET Core, <a href="https://github.com/dotnet/machinelearning">available on GitHub</a>, and recently they shipped the APIs for Apache Spark, also <a href="https://github.com/dotnet/spark">available on GitHub</a>. </p>



<p>For this reason, I directed my attention on Apache Spark and its structure.</p>



<h3>Setup the project</h3>



<p>Let&#x2019;s see how to manage Apache Spark using .NET Core framework.</p>



<p>The example described in this post uses the following code available <a href="https://github.com/samueleresca/blog.apachesparkgettingstarted">on GitHub</a> and the Seattle Cultural Space Inventory dataset available on <a href="https://www.kaggle.com/city-of-seattle/seattle-cultural-space-inventory">Kaggle</a>. Moreover, the project is a simple console template created by using the following .NET Core command:</p>



<p><code>dotnet new console -n blog.apachesparkgettingstarted</code></p>



<p>The command mentioned above creates a new .NET Core console application project. In addition, we should also add the Apache Spark APIs, by executing the following CLI command in the root of the project:</p>



<p><code>dotnet add package Microsoft.Spark</code></p>



<p>The instruction adds the Apache Spark for .NET to the current project.</p>



<h2>Apache Spark core architecture</h2>



<p>As said earlier, Apache Spark Core APIs are the foundation of the additional modules and features provided in the framework. The following section will cover some of the base concepts of the Spark architecture. </p>



<p>First of all, let&#x2019;s take an overview of the Apache Spark <em>driver-workers</em> concept; the following schema describes the mechanics behind Spark:</p>



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-05-at-22.07.27.png?fit=720%2C242&amp;ssl=1" alt="Getting started with Apache Spark using .NET Core" class="wp-image-4168" srcset="https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-05-at-22.07.27.png 2194w, https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-05-at-22.07.27-320x107.png 320w, https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-05-at-22.07.27-768x258.png 768w, https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-05-at-22.07.27-960x322.png 960w" sizes="(max-width: 2194px) 100vw, 2194px"><figcaption>Driver &#x2013; Worker mechanics</figcaption></figure>



<p>Every Spark application is usually composed of a <em>Driver</em> and a set of <em>Workers</em>. The <em>Driver</em> is the coordinator entity of the system, and it distributes chunks of work across the different <em>Workers</em>. From Apache Spark 2.0.0 onwards, the <code>SparkSession</code> type provides the single entry point to interact with the underlying Spark functionality. The <code>SparkSession</code> type has the following responsibilities:</p>



<ul><li>Create the tasks to assign to the different workers;</li><li>Locate and load the data used by the application;</li><li>Handle eventual failures;</li></ul>



<p>This is an example of a C# code that initializes a <code>SparkSession</code>:</p>



<script src="https://gist.github.com/samueleresca/fbeaa67912332e6b4a2c33eff4878e24.js"></script>




<p>The code mentioned above gets or creates a <code>SparkSession</code> instance with the name &#x201C;My application&#x201D;. The retrieved <code>SparkSession</code> instance provides all the necessary APIs.</p>



<h3>RDDs, DataFrames core concept</h3>



<p>Apache Spark supports developers with a great set of APIs and collection objects. The following section describes the two main sets of APIs provided by Spark: <em>RDDs</em> and <em>DataFrames</em>.</p>



<p>The <em>RDD</em> is the foundation of the <em>DataFrame</em> collection. RDD stands for <em>Resilient Distributed Dataset</em>, and it is the basic abstraction in Spark. An <em>RDD</em> represents an immutable, partitioned collection of elements that can be operated on in parallel. RDDs are designed to be resilient and distributed: they distribute the work across multiple nodes and partitions, and they can handle failure by re-calculating the partition that failed. </p>



<p><em>DataFrames</em> are built on top of RDDs. A <em>DataFrame</em> organizes the collection into named columns, which provides a higher level of abstraction.</p>



<p>The subsequent example shows the definition of a <em>DataFrame</em>:</p>



<script src="https://gist.github.com/samueleresca/358776472f364a125798130983334843.js"></script>




<p>The code mentioned above reads data from the <a href="https://www.kaggle.com/city-of-seattle/seattle-cultural-space-inventory">Seattle Cultural Space Inventory dataset</a>. The snippet loads the data into the <em>DataFrame</em> by automatically inferring the columns. After that, the <em>DataFrame</em> collection provides a rich set of APIs which can be used to apply operations on the data, for example:</p>



<script src="https://gist.github.com/samueleresca/4d5e629879a2d4318cced785a9336f76.js"></script>




<p>The preceding snippet combines the <code>Select</code> and <code>Filter</code> operations by selecting the <em>Name</em> and the <em>Phone</em> columns from the Dataframe, and by filtering out all the rows without the <em>Phone</em> column populated. Both the <code>Select</code> and <code>Filter</code> methods are part of the <em>transformation</em> APIs of  Apache Spark. </p>



<h4>Transformation and Action APIs</h4>



<p>We can group the <em>RDD</em> and <em>DataFrame</em> APIs into two categories: </p>



<ul><li><em>Transformation</em> APIs are all the functions that manipulate the data, such as <code>Select</code>, <code>Filter</code>, <code>GroupBy</code>, and <code>Map</code>. In general, every function that returns another <em>RDD</em> or <em>DataFrame</em> is part of the <em>transformation</em> APIs;</li><li><em>Action</em> APIs are all the methods that perform an action on the data. They usually return a void result;</li></ul>



<p>Moreover, we should notice that all the <em>transformation</em> APIs use a <em>lazy-evaluation</em> approach. Hence, the transformations on the data are not immediately executed. In other words, the Apache Spark engine stores the set of operations in a <em>DAG (directed acyclic graph)</em> and evaluates them lazily.</p>
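<p>Although this is only a rough analogy and not Spark itself, LINQ&#x2019;s deferred execution in C# follows the same idea: building the query describes the transformations, and nothing runs until a terminal operation enumerates the result (the <code>LazyPipeline</code> helper below is a hypothetical illustration):</p>

```csharp
using System;
using System.Linq;

// Rough analogy (not Spark itself): LINQ queries are also lazily evaluated.
// Where/Select only describe the pipeline; work happens on enumeration.
public static class LazyPipeline
{
    public static int Run()
    {
        var evaluations = 0;
        var numbers = Enumerable.Range(1, 10);

        // Building the query executes nothing yet (like Spark transformations).
        var query = numbers
            .Where(n => { evaluations++; return n % 2 == 0; })
            .Select(n => n * n);

        if (evaluations != 0) throw new Exception("query ran eagerly?");

        // Enumerating the query (like a Spark action) triggers the pipeline.
        var total = query.Sum();
        return total;   // 4 + 16 + 36 + 64 + 100 = 220
    }
}
```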



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-08-at-21.47.43-960x513.png" alt="Getting started with Apache Spark using .NET Core" class="wp-image-4186" srcset="https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-08-at-21.47.43-960x513.png 960w, https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-08-at-21.47.43-320x171.png 320w, https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-08-at-21.47.43-768x410.png 768w, https://samueleresca.net/content/images/wordpress/2019/09/Screenshot-2019-09-08-at-21.47.43.png 1506w" sizes="(max-width: 960px) 100vw, 960px"></figure>



<p>The schema mentioned above describes the distributed data transformation approach used by Apache Spark. Let&#x2019;s suppose that our code executes an action on the <em>Dataframe</em> collection, such as the <code>Show()</code> method:</p>



<script src="https://gist.github.com/samueleresca/1d736fea4e221adb48d60bdd1ea90a8c.js"></script>




<p>As you can see from the code mentioned above, the <code>Show()</code> doesn&#x2019;t return anything, which means that, unlike the <code>Select()</code> and <code>Filter()</code> methods, it is an action that effectively triggers the execution of the <em>DataFrame</em> evaluation.</p>



<p>The <code>Show()</code> method simply displays rows of the <code>DataFrame</code> data in tabular form.  </p>



<h3>Apache Spark execution and tooling</h3>



<p>Let&#x2019;s proceed by executing the snippet of code we implemented in the previous section. Apache Spark provides an out-of-the-box tool called <code>spark-submit</code> to submit and execute the code. </p>



<p>Besides, it is possible to submit our .NET Core code to Apache Spark using the following command:</p>



<p><code>spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner  --master local microsoft-spark-2.4.x-0.4.0.jar dotnet &lt;compiled_dll_filename&gt;</code></p>



<p>The above command uses the <code>spark-submit</code> tool to send and execute the dll provided as a parameter:</p>



<script src="https://gist.github.com/samueleresca/79226b09571ce1cc8ba0b337c28fe6dd.js"></script>




<p>As you can see, the result of the execution displays the first 20 rows with a populated phone number.</p>



<h2>Spark.NET under the hood</h2>



<p>You may have noticed that every time you add the <code>Microsoft.Spark</code> package, it also brings the following jar packages: <code>microsoft-spark-2.4.x-0.4.0.jar</code> and <code>microsoft-spark-2.3.x-0.4.0.jar</code>. Both of them are used to communicate with the native Scala APIs of Apache Spark. Furthermore, if we take a closer look at the source code, <a href="https://github.com/dotnet/spark/tree/master/src/scala">available on GitHub</a>, we can understand the purpose of these two packages:</p>



<script src="https://gist.github.com/samueleresca/fbaa26cb5f7aa8d68361e8bacf86dde7.js"></script>




<p>The <code>DotnetBackend</code> Scala class behaves as an interpreter between the .NET Core APIs and the native Scala APIs of Apache Spark. In more detail, this approach is also taken by other non-native languages that use Spark, such as the R language.</p>



<p>The .NET team and community are already pushing for the native adoption of   .NET binding as part of Apache Spark: <a href="https://issues.apache.org/jira/browse/SPARK-27006">https://issues.apache.org/jira/browse/SPARK-27006</a>. Moreover, it seems that they are making <a href="https://issues.apache.org/jira/browse/SPARK-27006?focusedCommentId=16792939&amp;page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16792939">some progress</a>:</p>



<blockquote class="wp-block-quote"><p>Thank you for reading this far and we look forward to seeing you at the SF Spark Summit in April where we will be presenting our early progress on enabling .NET bindings for Apache Spark.&#xA0;</p></blockquote>



<h2>Final thoughts</h2>



<p>This article gives a quick introduction to Apache Spark and a quick getting started with Apache Spark using .NET Core. Microsoft is investing a lot in .NET Core and, most importantly, in open source. Languages like Python, Scala, and R are definitely more established in the data world; however, the Spark library for .NET and the ML.NET framework are further signs that Microsoft is investing a lot to bring .NET into the data world. </p>



<p>Cover image by <a href="https://twitter.com/tubemapper">Tube mapper</a>.</p>
<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Web assembly and Blazor: state of the art]]></title><description><![CDATA[ I had a first look to Blazor and, more in general, to the Web assembly technologies in 2017. The same year, I've written about this topic in the following blog post: Web assembly in .NET Core. After two years, Blazor is]]></description><link>https://samueleresca.net/web-assembly-and-blazor-state-of-the-art/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac18a</guid><category><![CDATA[.NET]]></category><category><![CDATA[.NET Core]]></category><category><![CDATA[dotnet]]></category><category><![CDATA[dotnetcore]]></category><category><![CDATA[WebAssembly]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Tue, 11 Jun 2019 07:05:18 GMT</pubDate><media:content url="https://samueleresca.net/content/images/wordpress/2019/06/Screenshot-2019-06-09-at-21.03.10.png" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html-->
<img src="https://samueleresca.net/content/images/wordpress/2019/06/Screenshot-2019-06-09-at-21.03.10.png" alt="Web assembly and Blazor: state of the art"><p>I had a first look at Blazor and, more in general, at the Web assembly technologies in 2017. That same year, I wrote about this topic in the following blog post: <a href="https://samueleresca.net/2017/08/web-assembly-in-net/">Web assembly in .NET Core</a>. After two years, Blazor is nearing its first official release: it is no longer experimental, and it is becoming part of the .NET ecosystem. The following article gives some quick updates on the Blazor framework.</p>



<h2>How does Blazor work?</h2>



<p>First of all, let&#x2019;s have a look at what&#x2019;s behind Blazor and how it works using the new Web assembly. The following schema shows the foundations of Blazor:</p>



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-27-at-22.47.24.png?fit=720%2C476&amp;ssl=1" alt="Web assembly and Blazor: state of the art" class="wp-image-4128" srcset="https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-27-at-22.47.24.png 1722w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-27-at-22.47.24-320x211.png 320w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-27-at-22.47.24-768x508.png 768w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-27-at-22.47.24-960x634.png 960w" sizes="(max-width: 1722px) 100vw, 1722px"></figure>



<p><a href="https://webassembly.org/">Web assembly</a> stands at the base of the pyramid: it defines a standard binary format which allows running bytecode in the browser.</p>



<p>Web assembly is a standard that is not tied to the .NET ecosystem, but it was the first step toward bringing .NET into client-side development.</p>



<p>The other core actor behind Blazor is the Mono framework. Mono is one of the .NET runtimes maintained by Microsoft and the community. It is designed for portability, and it has been compiled to web assembly starting with the following PR: <a href="https://github.com/mono/mono/pull/5924">https://github.com/mono/mono/pull/5924</a> </p>



<p>Finally, at the top of the pyramid there is Blazor. Blazor is the UI framework that defines the startup process of the UI, and it also implements the infrastructure that allows components to communicate with each other. Starting from .NET Core 3.0, Blazor will be shipped as part of the framework.</p>



<h2>Overview of a Blazor app</h2>



<p>It is possible to create a new Blazor project from the template using the following instructions:</p>



<pre class="wp-block-preformatted">dotnet new -i Microsoft.AspNetCore.Blazor.Templates::3.0.0-preview5-19227-01</pre>



<pre class="wp-block-preformatted">dotnet new blazor -n &lt;web_app_name&gt;</pre>






<p>The first command installs the Blazor template pack using version 3.0.0-<code>preview5-19227-01</code> of .NET Core. The second command creates a new base project named <code>web_app_name</code> in the current folder. </p>



<p>The resulting project and file system will be similar to this:</p>



<script src="https://gist.github.com/samueleresca/1e021f2f5c734878422d09635c2a9e9c.js"></script>




<p>There are some key parts to notice in the project structure. First of all, the <code>Program</code> and the <code>Startup</code> classes: the first one has the following implementation:</p>



<script src="https://gist.github.com/samueleresca/3724a775c04badc1d568101bbc1e7e4b.js"></script>




<p>As you can see, the above-mentioned snippet uses the <code>BlazorWebAssemblyHost</code> class to initialize a new host using the <code>Startup</code> class. This approach works in a very similar manner to the one used in ASP.NET Core applications, but instead of returning an <code>IWebHost</code> type it returns an implementation of the <code>IWebAssemblyHostBuilder</code> interface.</p>



<p>The builder lives in the <code>Microsoft.AspNetCore.Blazor.Hosting</code> namespace and resolves the <code>Startup</code> class using the <a href="https://github.com/aspnet/AspNetCore/blob/cb21edc5009b254388fc6da93d212f679087c326/src/Components/Blazor/Blazor/src/Hosting/WebAssemblyHostBuilderExtensions.cs#L33">following code</a>.</p>



<p>Let&#x2019;s proceed by having a look at the Startup class which is decidedly simpler compared with the <code>Startup</code> class of an ASP.NET Core application:</p>



<div class="wp-block-embed__wrapper">
<script src="https://gist.github.com/samueleresca/ff49af4002c0555a366f1f552ac29016.js"></script>
</div>



<p>The <code>Configure</code> method resolves an instance of the <code>IComponentsApplicationBuilder</code> interface, and it invokes the <code>AddComponent</code> method in order to initialize the <code>App</code> component.</p>



<p>The <code>AddComponent</code> method accepts a generic type which represents the main component, and a DOM selector which corresponds to the tag used in the <code>index.html</code> page to render the component.</p>
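<p>As an illustration, a minimal <code>index.html</code> from the default template might look like the following sketch (the <code>app</code> element name and the script path are the template defaults, assumed here rather than taken from the article):</p>

```html
<!-- The <app> element matches the "app" DOM selector passed to AddComponent -->
<body>
    <app>Loading...</app>
    <!-- blazor.webassembly.js bootstraps the runtime and the application -->
    <script src="_framework/blazor.webassembly.js"></script>
</body>
```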



<h3>Component-centric structure</h3>



<p>Blazor, just like any common UI framework, has a component-centric structure. Components are all the UI elements that compose the pages. Components can be nested and reused in other parts of the UI.</p>



<p>Every file with the <code>.razor</code> extension is a component. Components render HTML elements but can also contain UI logic and event handling. For example, let&#x2019;s have a look at the <code>FetchData.razor</code> file:</p>



<script src="https://gist.github.com/samueleresca/378c5e4873b30ce3dce7a6c28ab52b60.js"></script>




<p>The component fetches some weather forecast data using an AJAX request, and it renders the data in the form of a table. As a first step, the component uses the <code>@inject</code> directive to inject an HTTP client. Secondly, it declares the HTML elements to render in the page, e.g. the table which contains the forecast data. Finally, it declares the UI logic:</p>



<script src="https://gist.github.com/samueleresca/7b2f6284262d9be9b846f8af4885ae16.js"></script>




<p>The code mentioned above defines a <code>WeatherForecast</code> type and an array which will contain the fetched information. It then declares an <code>override async Task OnInitAsync()</code> function that uses the <code>HttpClient</code> injected in the component to perform an HTTP call for the data. The <code>OnInitAsync</code> function is one of the built-in lifecycle methods implemented by default in the base class of the component.</p>



<h4>Built-in lifecycle methods</h4>



<p>The following table describes the lifecycle methods which are part of <a href="https://github.com/aspnet/AspNetCore/blob/a36a57df656e7a4bb39901beca5068bc5acc83b3/src/Components/Components/src/ComponentBase.cs">ComponentBase.cs</a> and can be overridden by derived classes: </p>



<table class="wp-block-table is-style-stripes"><tbody><tr><td>Lifecycle methods</td><td>Description</td></tr><tr><td><code>OnInit/OnInitAsync</code></td><td>These methods execute code at the initialization step of the component.</td></tr><tr><td><code>OnParametersSet/OnParametersSetAsync</code></td><td>These methods are called when a component has received parameters from its parent caller and the values have been assigned to properties. They are executed every time the component is rendered.</td></tr><tr><td><code>OnAfterRender/OnAfterRenderAsync</code></td><td>These methods are called after a component has finished rendering. The element and component references are populated at this point.</td></tr><tr><td><code>SetParameters</code></td><td>This method can run custom code that interprets the incoming parameter values in any way required.</td></tr></tbody></table>
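<p>As a sketch, overriding one of these methods inside a component&#x2019;s <code>@functions</code> block could look like the following (the <code>forecasts</code> field and the data URL are illustrative, based on the default template rather than the article):</p>

```csharp
@functions {
    WeatherForecast[] forecasts;

    // Runs once, when the component is initialized
    protected override async Task OnInitAsync()
    {
        forecasts = await Http.GetJsonAsync<WeatherForecast[]>("sample-data/weather.json");
    }
}
```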






<h3>Routing</h3>



<p>Another essential aspect to notice from the above-described component is the <code>@page &quot;/fetchdata&quot;</code> directive. This directive is part of the routing mechanism of Blazor. Using the same approach as the routing of ASP.NET Core, it is also possible to add custom parameters to the <code>@page</code> value: something like <code>@page &quot;/fetchdata/{day}&quot;</code>.</p>
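<p>A minimal sketch of such a parameterized route might look like this (the <code>Day</code> parameter is hypothetical; route values are bound to component parameters by name, and the exact binding rules changed across the preview releases):</p>

```razor
@page "/fetchdata/{day}"

<h1>Forecast for @Day</h1>

@functions {
    // Bound from the {day} route value by name
    [Parameter] string Day { get; set; }
}
```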



<h2>Client-side vs Server-side Hosting model</h2>



<p>Blazor provides two different hosting models: the <em>client-side</em> one and the <em>server-side</em> one.</p>



<p>The <em>client-side</em> hosting model downloads all the .NET dependencies on the client, therefore it doesn&#x2019;t have any server-side dependency. It provides full web assembly support and also supports offline scenarios. It is possible to create a client-side Blazor app using the following command:</p>



<pre class="wp-block-preformatted">dotnet new blazor -n &lt;webapp_name&gt;</pre>






<p>The <em>server-side</em> hosting model is more lightweight in terms of resources downloaded to the client. It uses SignalR and web socket technologies to create a communication channel between the client and the server: the code runs on the server, and the client sends a message for each operation. It also supports old browsers, but it doesn&#x2019;t offer offline support. It is possible to create a server-side Blazor app using the following command:</p>



<pre class="wp-block-preformatted">dotnet new blazorserverside -n &lt;webapp_name&gt;</pre>






<p>The main concrete difference between the client-side and server-side hosting models resides in the <code>Program.Main</code> method. The following is the snippet related to a client-side app:</p>



<script src="https://gist.github.com/samueleresca/5620911b98c6e30fbb2c38e8fe1bb2d1.js"></script>




<p>This one is related to a server-side app:</p>



<script src="https://gist.github.com/samueleresca/946f39b18bcb6ec5ffc93c2a8e8e8484.js"></script>




<p>As you can see, the first one returns a reference to an <code>IWebAssemblyHost</code> instance, while the second one returns an <code>IHostBuilder</code> instance.</p>



<p>Plus, in the case of a server-side app, the <code>Startup</code> class also registers a service in the <code>IServiceProvider</code> collection using <code>services.AddServerSideBlazor()</code>:</p>



<script src="https://gist.github.com/samueleresca/674d7d93f31b44fe72a2d4b90f12e079.js"></script>




<p>The two hosting models behave differently at run time. In the case of the client-side approach, we can see the following network behavior:</p>



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.07.52.png?fit=720%2C409&amp;ssl=1" alt="Web assembly and Blazor: state of the art" class="wp-image-4113" srcset="https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.07.52.png 2096w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.07.52-320x182.png 320w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.07.52-768x436.png 768w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.07.52-960x545.png 960w" sizes="(max-width: 2096px) 100vw, 2096px"><figcaption>Client-side network behavior</figcaption></figure>



<p>The client-side app downloads the <code>blazor.webassembly.js</code> file, the <code>mono.wasm</code> file (the Mono framework compiled for web assembly), and all the .NET DLLs used by the application: <code>System.dll</code>, <code>System.Core.dll</code>, <code>System.Net.Http.dll</code>, etc.</p>



<p>On the other side, the server-side app uses a web-socket approach, therefore the payload downloaded with the page is minimal:</p>



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.18.png?fit=720%2C194&amp;ssl=1" alt="Web assembly and Blazor: state of the art" class="wp-image-4116" srcset="https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.18.png 2096w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.18-320x86.png 320w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.18-768x207.png 768w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.18-960x259.png 960w" sizes="(max-width: 2096px) 100vw, 2096px"><figcaption>Server-side network behavior<br></figcaption></figure>



<p>Each interaction with the page triggers a new message in the web socket channel:</p>



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.31.png?fit=720%2C236&amp;ssl=1" alt="Web assembly and Blazor: state of the art" class="wp-image-4117" srcset="https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.31.png 2094w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.31-320x105.png 320w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.31-768x251.png 768w, https://samueleresca.net/content/images/wordpress/2019/05/Screenshot-2019-05-26-at-14.09.31-960x314.png 960w" sizes="(max-width: 2094px) 100vw, 2094px"><figcaption>Web-socket channel messages</figcaption></figure>



<h2>Final thoughts</h2>



<p>Since 2017, Blazor has been steadily becoming a first-class citizen of the .NET ecosystem. Both the Microsoft .NET team and the community are investing a lot of time in this project. You can find 3rd-party libraries and other material about Blazor here: <a href="https://github.com/AdrienTorris/awesome-blazor#libraries--extensions">https://github.com/AdrienTorris/awesome-blazor#libraries&#x2013;extensions</a>. </p>



<p><em>Cover image by </em><a href="http://www.corradozeni.it/"><em>Corrado Zeni</em></a></p>
<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Data analysis using F# and Jupyter notebook]]></title><description><![CDATA[In the last hackathon at @justeattech, I've played a lot around machine learning using ML.NET and .NET Core. Furthermore, the idea that a .NET dev can implement machine learning without switching language is cool.]]></description><link>https://samueleresca.net/data-analysis-using-f-and-jupyter-notebook/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac189</guid><category><![CDATA[.NET]]></category><category><![CDATA[.NET Core]]></category><category><![CDATA[FSharp]]></category><category><![CDATA[ML.NET]]></category><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Tue, 23 Apr 2019 07:27:13 GMT</pubDate><media:content url="https://samueleresca.net/content/images/wordpress/2019/04/image.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html-->
<img src="https://samueleresca.net/content/images/wordpress/2019/04/image.jpg" alt="Data analysis using F# and Jupyter notebook"><p>During the last hackathon at <a href="https://medium.com/just-eat-tech">@justeattech</a>, I played a lot with machine learning using <a href="https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet">ML.NET</a> and .NET Core. The idea that a .NET developer is able to implement machine learning without switching language is cool. <a href="https://dotnet.microsoft.com/apps/machinelearning-ai/ml-dotnet">ML.NET</a> still has a lot of room for improvement, but it could be a powerful framework for machine learning. </p>



<figure style="margin: 0 auto;width: 80%;" class="wp-block-embed-twitter aligncenter wp-block-embed is-type-rich is-provider-twitter"><div class="wp-block-embed__wrapper">
<blockquote class="twitter-tweet" data-width="550" data-dnt="true"><p lang="en" dir="ltr">I&apos;ve played a lot around <a href="https://twitter.com/MLdotnet?ref_src=twsrc%5Etfw">@MLdotnet</a>, during the current hackathon in <a href="https://twitter.com/justeat_tech?ref_src=twsrc%5Etfw">@justeat_tech</a>. The idea that a .NET dev can implement machine learning without switching language is cool, but there is a lot of space of improvement, such as <a href="https://t.co/SbjpU2PdkV">https://t.co/SbjpU2PdkV</a><a href="https://twitter.com/hashtag/MachineLearning?src=hash&amp;ref_src=twsrc%5Etfw">#MachineLearning</a> <a href="https://twitter.com/hashtag/MLNET?src=hash&amp;ref_src=twsrc%5Etfw">#MLNET</a></p>&#x2014; Samuele Resca (@samueleresca) <a href="https://twitter.com/samueleresca/status/1106542236165132289?ref_src=twsrc%5Etfw">March 15, 2019</a></blockquote><script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
    </div>
    </figure>


<p>The following post focuses on some general knowledge around data gathering and data analysis. Furthermore, it explains some basic tools to perform data analysis using F# and Jupyter notebook.</p>



<h2>The importance of the data foundations</h2>



<p>Gathering the data set is the most critical step in machine learning. Data is the foundation of the whole process, and it is the first step of the ML workflow. </p>



<p>Therefore, it is crucial to understand the data that we are going to use to train a machine learning model. For that reason, it is important to prototype and explore the data before starting.</p>



<h2>Lyrics data analysis</h2>



<p>The purpose of the following example is to give some basic notions about the data analysis process. As a software engineer mainly focused on .NET Core, I will use technologies from the .NET ecosystem: the example uses F# as the primary language and some related libraries to handle data. The example is also available at the following GitHub repository: <a href="https://github.com/samueleresca/LyricsClassifier">https://github.com/samueleresca/LyricsClassifier</a></p>



<p>It is essential to consider that most of the concepts in the following steps are independent of the language or the libraries we use. Moreover, almost all languages and development frameworks come with open-source tools for machine learning and data analysis. Here is a complete list of machine learning libraries and frameworks: <a href="https://github.com/josephmisiti/awesome-machine-learning">https://github.com/josephmisiti/awesome-machine-learning</a></p>



<p>More specifically, the example will use the following libraries:</p>



<ul><li><code>XPlot.Plotly</code>: XPlot is a cross-platform data visualization library that supports creating charts using Google Charts and Plotly. The library provides a composable domain-specific language for building charts and specifying their properties;</li><li><code>MathNet.Numerics</code>: Math.NET Numerics aims to provide methods and algorithms for numerical computations in science, engineering, and everyday use. Covered topics include special functions, linear algebra, probability models, random numbers, interpolation, integration, regression, optimization problems and more;</li><li><code>FSharp.Data</code>: the F# Data library implements everything you need to access data in your F# applications and scripts. It contains F# type providers for working with structured file formats (CSV, HTML, JSON, and XML) and for accessing the WorldBank data. It also includes helpers for parsing CSV, HTML and JSON files and for sending HTTP requests;</li><li><code>ML.NET</code>:  ML.NET is a machine learning framework built for .NET developers;</li></ul>



<h3>Data schema</h3>



<p>The example will use a dataset of lyrics available on <a href="https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics">Kaggle</a>. The data set contains a list of songs of different genres and from several artists. The data has a straightforward schema, which can be represented using the following F# type:</p>



<script src="https://gist.github.com/samueleresca/81f0962bded837d06fb057f04b2a5c73.js"></script>




<p>The <code>Song</code> field refers to the title of the song, the <code>Artist</code> field contains the artist name, the <code>Year</code> field is the release year, the <code>Genre</code> field contains the genre of the song, and finally the <code>Lyrics</code> field contains the lyrics of the song.</p>
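<p>As a reference, the schema described above can be sketched as an F# record (the field types are assumptions based on the description, not taken from the repository):</p>

```fsharp
// Sketch of the dataset schema described above
type LyricInput = {
    Song   : string   // title of the song
    Artist : string   // artist name
    Year   : int      // release year
    Genre  : string   // genre of the song
    Lyrics : string   // lyrics text
}
```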



<p>Finally, let&#x2019;s take a look at a preview of the input data:</p>



<script src="https://gist.github.com/samueleresca/24839b600764d648098be12da258b96b.js"></script>







<h3>A first look at the data</h3>






<p>We should always keep in mind that data is the primary and critical input for all the subsequent steps. Just like some software engineering design processes use the structure of the data to build the domain model of the system, we should start from our data to get a global view of its content.</p>



<p>Let&#x2019;s start by analyzing the lyrics dataset to find possible correlations. For this purpose, the example uses <a href="https://jupyter.org">Jupyter notebook</a>, a useful tool which allows you to create and share documents that contain live code, equations, visualizations, and narrative text. You can find the source code of Jupyter notebook on GitHub: <a href="https://github.com/jupyter/notebook">https://github.com/jupyter/notebook</a>. By default, Jupyter notebook supports Python as the primary language. For this example, we can enable F# support by using the IfSharp library: <a href="https://github.com/fsprojects/IfSharp">https://github.com/fsprojects/IfSharp</a>.</p>



<p>As a first step, we can start Jupyter and create a new notebook in our preferred folder. Then, we can proceed by importing the F# libraries described above in the first cell of the notebook:</p>



<script src="https://gist.github.com/samueleresca/680a4afff910c03048b51f0df4b21134.js"></script>




<p>The snippet uses the Paket package manager to load the libraries used in the notebook. After that, we can proceed by opening the namespaces used by the notebook and defining the input type which reflects the structure of the dataset:</p>



<script src="https://gist.github.com/samueleresca/8dd653ede26d06a72a084eeeb7d34073.js"></script>




<p>Once we have defined the <code>LyricInput</code> type, we can proceed by reading the <code>lyrics.csv</code> file and cleaning up our dataset:</p>



<script src="https://gist.github.com/samueleresca/531b636c2ada582d2218372575eb6fe7.js"></script>




<p>The following snippet uses the <code>FSharp.Data</code> library to load the CSV file, and it performs some filtering and data cleaning on our lyrics:</p>



<ol><li>It removes all the samples with empty lyrics;</li><li>It removes all the samples whose lyrics equal <code>[Instrumental]</code>;</li><li>Finally, it maps the rows to the <code>LyricInput</code> type defined above;</li></ol>
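<p>The three cleaning steps above could be sketched with <code>FSharp.Data</code> roughly as follows (the CSV column names are assumptions based on the schema described earlier):</p>

```fsharp
open FSharp.Data

let lyrics =
    CsvFile.Load("lyrics.csv").Rows
    // 1. drop samples with empty lyrics
    |> Seq.filter (fun row -> row.GetColumn "lyrics" <> "")
    // 2. drop instrumental tracks
    |> Seq.filter (fun row -> row.GetColumn "lyrics" <> "[Instrumental]")
    // 3. map the remaining rows to the LyricInput type
    |> Seq.map (fun row ->
        { Song = row.GetColumn "song"
          Artist = row.GetColumn "artist"
          Year = int (row.GetColumn "year")
          Genre = row.GetColumn "genre"
          Lyrics = row.GetColumn "lyrics" })
```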



<p>Let&#x2019;s proceed by making a quick analysis of the key features of the dataset to see the possible correlations. The following code renders two <code>Chart.Pie</code> charts related to the <code>Genre</code> and <code>Year</code> features:</p>



<script src="https://gist.github.com/samueleresca/8cf7e73c909a59427c262f3604198502.js"></script>




<p>The above snippet uses the <code>XPlot.Plotly</code> library to render the following charts:</p>



<figure class="wp-block-image is-resized"><img src="https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-22.59.28.png?fit=720%2C500&amp;ssl=1" alt="Data analysis using F# and Jupyter notebook" class="wp-image-4026" width="580" height="401" srcset="https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-22.59.28.png 1644w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-22.59.28-320x222.png 320w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-22.59.28-960x666.png 960w" sizes="(max-width: 580px) 100vw, 580px"></figure>



<p>The above charts describe the lyrics by genre. In the same way, we can group the songs by the <code>Year</code> field in order to understand the distribution over time:</p>



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-23.00.09.png?fit=720%2C469&amp;ssl=1" alt="Data analysis using F# and Jupyter notebook" class="wp-image-4027" srcset="https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-23.00.09.png 1620w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-23.00.09-320x208.png 320w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-23.00.09-768x500.png 768w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-19-at-23.00.09-960x625.png 960w" sizes="(max-width: 1620px) 100vw, 1620px"></figure>






<h3>Feature and data engineering using ML.NET </h3>



<p>Let&#x2019;s continue by focusing on the <code>Lyrics</code> field and visualizing the frequency of the words used in the lyrics, both at the global level and by genre. First of all, we should start a tokenization process, which runs using the following snippet of code:</p>



<script src="https://gist.github.com/samueleresca/74bf581877a5971ad5aded7a4a2e4bc1.js"></script>




<p>The code snippet defines a list of <code>stopwords</code> and a list of <code>symbols</code>. These variables are used by the <code>tokenizeLyrics</code> function, which returns the list of words related to a lyric.</p>



<p>Besides, the <code>tokenizeLyrics</code> function uses the text transformation methods provided by ML.NET. In more detail, it creates a new <code>MLContext</code> object, provided by the <code>Microsoft.ML</code> namespace. Next, it runs the <code>mlContext.Data.LoadFromEnumerable</code> method to load the lyrics sequence into the <code>mlContext</code>. Finally, the function calls some utilities provided by the <code>mlContext.Transforms.Text</code> catalog:</p>


<ul>
<li><code>FeaturizeText(&quot;FeaturizedLyrics&quot;, &quot;Lyrics&quot;)</code> transforms the input text column, in this case the <code>Lyrics</code> field, into a featurized float array that represents counts of n-grams and char-grams;</li>
<li><code>NormalizeText(&quot;NormalizedLyrics&quot;, &quot;Lyrics&quot;)</code> normalizes the incoming text of the input column by changing case and removing diacritical marks, punctuation marks and/or numbers, and outputs the new text in the output column;</li>
<li><code>TokenizeWords(&quot;TokenizedLyric&quot;, &quot;NormalizedLyrics&quot;, symbols)</code> tokenizes the incoming text in the input column using the separators provided as input, and assigns the resulting tokens to the output column;</li>
<li><code>RemoveStopWords(&quot;LyricsWithNoCustomStopWords&quot;, &quot;TokenizedLyric&quot;, stopwords)</code> removes the given list of stopwords from the incoming token streams and outputs the filtered token streams in the output column;</li>
<li><code>RemoveDefaultStopWords(&quot;LyricsWithNoStopWords&quot;, &quot;TokenizedLyric&quot;)</code> behaves in the same way as <code>RemoveStopWords</code>, except that it uses a default list of stopwords, which is also available for several languages;</li>
</ul>


<p>It is also important to notice that in these calls the column on the right is the input column, and the one on the left receives the output. Furthermore, it is also possible to use the <code>.Append</code> method to compose a pipeline of multiple transforms, each of which contributes its own output column.</p>
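<p>The composition described above can be sketched like this (a partial pipeline using the column names from the list above; <code>lyrics</code> and <code>symbols</code> are assumed to be defined earlier, and the exact ML.NET method names may differ slightly between versions):</p>

```fsharp
open Microsoft.ML

let mlContext = MLContext()
let dataView = mlContext.Data.LoadFromEnumerable lyrics

// Each transform reads its input column (right) and adds an output column (left);
// Append chains the transforms into a single pipeline.
let pipeline =
    (mlContext.Transforms.Text.NormalizeText("NormalizedLyrics", "Lyrics"))
        .Append(mlContext.Transforms.Text.TokenizeWords("TokenizedLyric", "NormalizedLyrics", symbols))
        .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("LyricsWithNoStopWords", "TokenizedLyric"))

// Fit the pipeline and materialize the transformed data
let transformed = pipeline.Fit(dataView).Transform(dataView)
```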



<p>Finally, the last step of the <code>tokenizeLyrics</code> function is to transform the data and put all the tokenized words together using the following instructions:</p>



<script src="https://gist.github.com/samueleresca/a97e368bf931a2499ccaaf200bc90ff6.js"></script>




<p>After that, it is possible to call the <code>tokenizeLyrics</code> function as follows:</p>



<script src="https://gist.github.com/samueleresca/eab2a771476cbaa7bfab5d4ab1ed4523.js"></script>




<p>The resulting chart shows the top 20 most frequent words present in all the lyrics:</p>



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.36.35.png?fit=720%2C371&amp;ssl=1" alt="Data analysis using F# and Jupyter notebook" class="wp-image-4042" srcset="https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.36.35.png 1694w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.36.35-320x165.png 320w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.36.35-768x395.png 768w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.36.35-960x494.png 960w" sizes="(max-width: 1694px) 100vw, 1694px"></figure>



<p>Furthermore, it is also possible to check the top 20 most frequent words by <code>Genre</code> field using the following snippet:</p>



<script src="https://gist.github.com/samueleresca/fe60dc6b205133a22a8c9aca392f654f.js"></script>




<p>For example, in that case, the resulting chart shows the most frequent words in Hip-Hop lyrics:</p>



<figure class="wp-block-image"><img src="https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.41.31.png?fit=720%2C353&amp;ssl=1" alt="Data analysis using F# and Jupyter notebook" class="wp-image-4044" srcset="https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.41.31.png 1870w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.41.31-320x157.png 320w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.41.31-768x376.png 768w, https://samueleresca.net/content/images/wordpress/2019/04/Screenshot-2019-04-20-at-13.41.31-960x470.png 960w" sizes="(max-width: 1870px) 100vw, 1870px"></figure>



<h2>Final thoughts</h2>



<p>This post provided some general knowledge around data analysis using Jupyter notebook and F#. It showed how Jupyter notebook can be used to quickly prototype and understand data models. Moreover, ML.NET provides the tools to perform feature engineering on our data and prepare the data model. In the next post, we will see how to train a model that predicts the genre from the song lyrics. The above example is available on GitHub at the following URL: <a href="https://github.com/samueleresca/LyricsClassifier">https://github.com/samueleresca/LyricsClassifier</a></p>



<h3>References:</h3>



<p><a href="https://www.kaggle.com/corizzi/lyrics-genre-analysis-machine-learning">https://www.kaggle.com/corizzi/lyrics-genre-analysis-machine-learning</a></p>



<p><a href="https://medium.com/luteceo-software-chemistry/statistical-analysis-using-f-and-jupyter-notebooks-2e2f31ee4cc1">https://medium.com/luteceo-software-chemistry/statistical-analysis-using-f-and-jupyter-notebooks-2e2f31ee4cc1</a></p>






<p><em>Cover image by </em><a href="https://www.benschneiderphoto.com/"><em>&#xA0;Benjamin Benschneider</em></a></p>
<!--kg-card-end: html-->]]></content:encoded></item><item><title><![CDATA[Test .NET Core AWS Lambda]]></title><description><![CDATA[ The following post shows some techniques about test .NET Core AWS Lambda, more in specific, it focuses on testing AWS Lambda]]></description><link>https://samueleresca.net/testing-net-core-aws-lambda/</link><guid isPermaLink="false">63d9a4dc2bc5da0fb16ac188</guid><dc:creator><![CDATA[Samuele Resca]]></dc:creator><pubDate>Wed, 20 Feb 2019 21:27:36 GMT</pubDate><media:content url="https://samueleresca.net/content/images/wordpress/2019/02/cutty-sark-7-national-maritime-museum.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: html--><img src="https://samueleresca.net/content/images/wordpress/2019/02/cutty-sark-7-national-maritime-museum.jpg" alt="Test .NET Core AWS Lambda"><p>The following post shows some techniques about test .NET Core AWS Lambda, more in specific, it focuses on testing .NET Core AWS Lambda using the <a href="https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool">LambdaTestTool</a>. I&#x2019;ve already spoken about serverless, .NET Core, and AWS Lambda in the following article: <a href="https://samueleresca.net/2018/12/fast-growing-architectures-with-serverless-and-net-core/">Fast growing architectures with serverless and .NET Core</a>. This post will be focused more on the testing side.</p>
<h2>The testing issues in a serverless computing</h2>
<p>As said in the <a href="https://samueleresca.net/2018/12/fast-growing-architectures-with-serverless-and-net-core/">previous post about serverless</a>, it is hard to test serverless applications. Especially when their complexity grows, it is hard to detect issues before running them in the cloud. Furthermore, from a development perspective, it is hard to deal with problems if we need to re-deploy our services every time to verify the result.</p>
<h2>Existing testing tools</h2>
<p>Some existing testing tools help us deal with serverless systems. Speaking about AWS, tools like <a href="https://github.com/localstack/localstack">https://github.com/localstack/localstack</a> provide a way to emulate the AWS stack on your machine, either by installing it or by running a Docker image. The drawback of these frameworks is that they consume a lot of resources, and they are not always suitable for integration testing.</p>
<h2>Testing lambdas using LambdaTestTool</h2>
<p>The <a href="https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool">LambdaTestTool</a> is a utility produced by the AWS .NET team which provides a useful and lightweight approach to Lambda testing. Furthermore, it provides a way to debug your lambda locally by triggering input events and attaching the debugger of your preferred IDE or code editor.</p>
<p>Recently, the AWS .NET Team also released a new version of the&#xA0;<a href="https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool">LambdaTestTool</a>&#xA0;with the following PR:</p>
<p><a href="https://github.com/aws/aws-lambda-dotnet/pull/364">https://github.com/aws/aws-lambda-dotnet/pull/364</a></p>
<p>which also enables debugging for lambdas that use&#xA0;the <a href="https://serverless.com">serverless framework</a>.</p>
<h2>Installation</h2>
<p>The&#xA0;<a href="https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool">LambdaTestTool</a> is implemented as a dotnet tool (I&#x2019;ve already written about dotnet tools in the following post: <a href="https://samueleresca.net/2019/02/artless-http-server-using-net-core/">Artless HTTP server using .NET Core</a>). It is cross-platform and can be installed using the following command:</p>
<p><code>dotnet tool install -g Amazon.Lambda.TestTool-2.1</code></p>
<p>Moreover,&#xA0;it is possible to update the tool to the latest version using the following command:</p>
<p><code>dotnet tool update -g Amazon.Lambda.TestTool-2.1</code></p>
<p>The above-mentioned commands install the dotnet tool on your local machine. Afterwards, it is possible to use it to run the lambda locally by configuring your IDE or code editor properly.</p>
<h2>Debug the lambda</h2>
<p>The&#xA0;<a href="https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool">LambdaTestTool</a> runs an ASP.NET instance which serves the following interface:</p>
<p><img class="aligncenter size-large wp-image-3953" src="https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-19-at-22.22.40-960x761.png" alt="Test .NET Core AWS Lambda" width="720" height="571" srcset="https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-19-at-22.22.40-960x761.png 960w, https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-19-at-22.22.40-320x254.png 320w, https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-19-at-22.22.40-768x609.png 768w, https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-19-at-22.22.40.png 2004w" sizes="(max-width: 720px) 100vw, 720px"></p>
<p>The UI provides a way to select the <em>config file</em> of the lambda, the <em>function</em> in the project to execute, the <em>credentials</em>, the <em>AWS region</em>, and the message to send as the input of the function.</p>
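<p>The <em>config file</em> mentioned above is typically the <code>aws-lambda-tools-defaults.json</code> file at the root of the lambda project. As a rough sketch (the values below, in particular the handler string, are illustrative assumptions, not taken from a real project), it looks like this:</p>
<pre><code>{
  "profile": "default",
  "region": "eu-west-1",
  "configuration": "Release",
  "framework": "netcoreapp2.1",
  "function-runtime": "dotnetcore2.1",
  "function-handler": "MyLambda::MyLambda.Function::FunctionHandler"
}</code></pre>
<p>The <code>function-handler</code> field is what the tool uses to locate the function to invoke, so it is worth double-checking it matches the assembly, namespace, class, and method of your lambda.</p>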
<p>It is possible to configure the&#xA0;<a href="https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool">LambdaTestTool</a>&#xA0;in multiple editors, such as Visual Studio, Visual Studio Code, Rider, and Visual Studio for Mac. The next sections show the configurations for Visual Studio Code and Rider.</p>
<h3>Visual Studio Code</h3>
<p>In Visual Studio Code, the tool can be configured by adding the following section to the <code>launch.json</code> file:</p>
<script src="https://gist.github.com/samueleresca/0e86f841e93ab3041499c8c816054b7a.js"></script>
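<p>As a sketch of what such a <code>launch.json</code> section might contain (the <code>program</code> path below is an assumption based on the default dotnet tool install location; adjust it to your machine and tool version):</p>
<pre><code>{
  "name": ".NET Core Lambda Test Tool",
  "type": "coreclr",
  "request": "launch",
  "program": "&lt;home-directory&gt;/.dotnet/tools/dotnet-lambda-test-tool-2.1",
  "args": [],
  "cwd": "${workspaceFolder}",
  "env": {}
}</code></pre>
<p>Setting <code>cwd</code> to the lambda project folder lets the tool pick up the project&#x2019;s config file automatically.</p>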
<h3>Rider</h3>
<p>In the case of the Rider IDE, it is possible to use the tool by creating a new run configuration:</p>
<p><img class="aligncenter  wp-image-3945" src="https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-18-at-08.41.08-960x610.png" alt="Test .NET Core AWS Lambda" width="464" height="295" srcset="https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-18-at-08.41.08-960x610.png 960w, https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-18-at-08.41.08-320x203.png 320w, https://samueleresca.net/content/images/wordpress/2019/02/Screenshot-2019-02-18-at-08.41.08-768x488.png 768w" sizes="(max-width: 464px) 100vw, 464px"></p>
<p>The <em>Exe path</em> field refers to the path to the <code>dotnet tool</code>:</p>
<p><code>&lt;home-directory&gt;/.dotnet/tools/.store/amazon.lambda.testtool-2.1/&lt;nuget-version&gt;/amazon.lambda.testtool-2.1/&lt;nuget-version&gt;/tools/netcoreapp2.1/any/Amazon.Lambda.TestTool.dll</code></p>
<p>The <em>working directory</em> field is the root folder of the lambda project.</p>
<h2>Summary</h2>
<p>It is possible to get more information about the <em>LambdaTestTool</em> on GitHub:&#xA0;<a href="https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool">https://github.com/aws/aws-lambda-dotnet/tree/master/Tools/LambdaTestTool</a>. The tool is still in a preview phase, but it is a very useful and fast way to verify and test a .NET Core AWS Lambda. What&#x2019;s still missing is a quick way to integrate the package and run it directly from code, for example as part of a unit testing process.</p>
<p>Cover image by&#xA0;<a href="https://www.artfund.org/">The Cutty Sark &#x2013; &#xA9; National Maritime Museum</a></p>
<!--kg-card-end: html-->]]></content:encoded></item></channel></rss>