Memory management optimization techniques

This article builds upon the previous one from this series and delves into the optimizations outlined in the "What a programmer can do" section of the "What every programmer should know about memory" paper by Ulrich Drepper. The code examples are implemented in Rust, and the execution is analyzed using perf.

💡 Understanding perf

This post uses perf to explore how the code is executed on the CPU. perf is the official Linux profiler and it is included in the kernel source code, see perf Tutorial.

Let's take a moment to go through some of the main perf events that will be used in this post:

  • instructions: how many instructions our code issued.
  • cycles: how many CPU cycles our code took. The number of cycles is usually lower than the number of instructions because the CPU can issue multiple instructions per cycle. The ratio of instructions to cycles is called the IPC (Instructions Per Cycle). The IPC is a good metric for understanding how efficiently the CPU is running your code: the higher the IPC, the better. It is also an indicator of how well the code is vectorized.
  • cache-references: the number of times the CPU accessed the cache. If the data is not already in the cache, the CPU has to access the main memory.
  • cache-misses: the number of times the CPU had to access the main memory. As mentioned in the previous post of this series, this is a very expensive operation, and we want to minimize it.
  • task-clock: how much CPU time our task took, usually reported in milliseconds. It differs from the total runtime because, if the program uses multiple CPUs or threads, the task-clock is the sum of the time spent on each CPU or thread.
  • cs: the number of context switches happening during the code execution.
  • page-faults: the number of page faults happening during the code execution. Usually, there are dedicated events to distinguish between minor-page-faults and major-page-faults.

Note that, depending on the CPU model and manufacturer, the events available in perf will differ. In some cases, dedicated cache-layer events, such as L1 cache misses/references, will be available. In other cases, you will have to use the generic cache-misses / cache-references events.

Bypassing cache

One of the ways of influencing memory proposed by the paper is to bypass CPU caches when data is initialized and not reused immediately. This approach avoids pushing out data that is already cached in favour of data that will not be read soon.

Processors provide intrinsics for non-temporal write operations. A non-temporal write flags the data as non-temporal, so it is not stored in the CPU cache.

Below is the implementation of a standard vs non-temporal matrix initialization:
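The full code is available in the article's repository; a minimal sketch of the two functions might look like this, assuming a square matrix of i32 values (the dimensions and the stored value are placeholders):

```rust
use std::arch::x86_64::_mm_stream_si32;

const SIZE: usize = 4096; // assumed matrix dimension

/// Standard initialization: every store goes through the cache hierarchy.
fn standard_initialize(matrix: &mut [Vec<i32>]) {
    for row in matrix.iter_mut() {
        for cell in row.iter_mut() {
            *cell = 1;
        }
    }
}

/// Non-temporal initialization: _mm_stream_si32 writes the value directly
/// to memory, flagging it as non-temporal so it is not kept in the caches.
fn nocache_initialize(matrix: &mut [Vec<i32>]) {
    for row in matrix.iter_mut() {
        for cell in row.iter_mut() {
            unsafe { _mm_stream_si32(cell as *mut i32, 1) };
        }
    }
}

fn main() {
    let mut matrix = vec![vec![0i32; SIZE]; SIZE];
    standard_initialize(&mut matrix);
    nocache_initialize(&mut matrix);
}
```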

The standard_initialize function initializes a matrix in a standard way. The nocache_initialize function takes the non-temporal approach by using the _mm_stream_si32 intrinsic function.

The _mm_stream_si32 function is defined in the std::arch::x86_64 module and is backed by the corresponding Intel intrinsic.

The following analysis uses the perf stat command to understand the impact of non-temporal writes on the process execution. Note that the code is built in optimized mode using the --release flag.

Event Name                       standard_initialize    nocache_initialize
l1d.replacement (in millions)    1,052,585              532,639
l2_lines_in.all                  1,497,033              91,486

The table compares the event counts for the standard and non-temporal writes. The l1d.replacement event counts the cache lines replaced in the L1 data cache. The l2_lines_in.all event counts the cache lines filling the L2 cache.

Note that the numbers in the tables are the average event counts over 10 executions. A single execution is monitored using the perf stat command.
The events are part of the Intel Ice Lake V1.00 event definitions.

The analysis above shows how initialization using the non-temporal approach roughly halves the number of L1d cache-line replacements. On top of that, the number of L2 cache lines filled is ~16x lower than with the standard initialization.

Cache optimization with sequential access and loop tiling

Sequential access is crucial for improving L1d cache performance because the processor prefetches data.

The paper takes as an example a matrix multiplication described as:

$$c_{i,j} = a_{i,1} b_{1,j} + a_{i,2} b_{2,j} + \cdots + a_{i,n} b_{n,j}$$

Matrix multiplication can be performed as follows:
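The original implementation is available in the article's repository; a minimal sketch of the non_optimized function, assuming square matrices of f64 values, could look like this:

```rust
/// Naive multiplication: m1 is read row by row, but m2 is read column by
/// column, so every access to m2 jumps a full row ahead in memory.
fn non_optimized(m1: &[Vec<f64>], m2: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = m1.len();
    let mut result = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            let mut sum = 0.0;
            for k in 0..n {
                sum += m1[i][k] * m2[k][j];
            }
            result[i][j] = sum;
        }
    }
    result
}
```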

The matrix multiplication above is an example of an operation that can be optimized by taking advantage of sequential access. The m1 matrix is accessed sequentially (row by row), while m2 is not: the inner loop walks down its columns, so each access jumps a full row ahead in memory.

It is possible to transpose the m2 matrix to access it sequentially. This is done in the optimized function below:
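A sketch of the optimized function under the same assumptions (the actual implementation in the repository may differ in its details):

```rust
/// Transpose m2 so that the inner loop reads both matrices sequentially.
fn optimized(m1: &[Vec<f64>], m2: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = m1.len();

    // Transpose m2: its columns become rows.
    let mut m2_t = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            m2_t[j][i] = m2[i][j];
        }
    }

    let mut result = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            let mut sum = 0.0;
            for k in 0..n {
                // Both m1[i] and m2_t[j] are now walked left to right.
                sum += m1[i][k] * m2_t[j][k];
            }
            result[i][j] = sum;
        }
    }
    result
}
```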

Transposing the m2 matrix before multiplying results in better performance. The table below shows the cycles and instructions perf events resulting from the execution of the non_optimized and optimized functions (again, the code has been compiled in release mode).

Event                       non optimized    optimized
cycles (in billion)         13.66            9.47
instructions (in billion)   26.41            24.97
ins. per cycle              1.94             2.64

The ins. per cycle metric shows how many instructions are executed per cycle; the higher the value, the better the performance. The optimized function has a higher ins. per cycle value because it takes advantage of sequential access for both the m1 and m2 matrices.

Another way to improve matrix multiplication speed is to increase the usage of the L1d cache by applying techniques such as loop tiling. The following code demonstrates how loop tiling is applied to matrix multiplication:
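A sketch of what the optimized_tiled function could look like; here tiling is applied on top of the transposed m2, and the tile size is an assumption that should be tuned to the L1d cache of the target CPU:

```rust
/// Assumed block size; choose it so the blocks being worked on fit in L1d.
const TILE: usize = 64;

/// Loop-tiled multiplication over the transposed m2: the work is done in
/// TILE x TILE blocks so that the data of a block stays in the L1d cache
/// while it is reused.
fn optimized_tiled(m1: &[Vec<f64>], m2_t: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = m1.len();
    let mut result = vec![vec![0.0; n]; n];
    for ii in (0..n).step_by(TILE) {
        for jj in (0..n).step_by(TILE) {
            for kk in (0..n).step_by(TILE) {
                // Multiply the current blocks of m1 and m2_t, accumulating
                // partial sums into the result block.
                for i in ii..(ii + TILE).min(n) {
                    for j in jj..(jj + TILE).min(n) {
                        let mut sum = result[i][j];
                        for k in kk..(kk + TILE).min(n) {
                            sum += m1[i][k] * m2_t[j][k];
                        }
                        result[i][j] = sum;
                    }
                }
            }
        }
    }
    result
}
```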

The loop tiling technique enhances cache locality by computing the multiplication in blocks that fit into the L1d cache, which maximizes cache usage and reduces cache misses. The results of the performance tests I ran on the three functions, non_optimized, optimized, and optimized_tiled, are presented in the table below:

Event                                non optimized    optimized    optimized with loop tiling
cycles (in billion)                  13.66            9.47         7.05
instructions (in billion)            26.47            24.97        27.77
ins. per cycle                       1.94             2.64         3.94
L1d-cache-load-misses (in billion)   1.48             0.13         0.05

As seen in the table, the optimized_tiled function outperforms the other two functions, with a higher ins. per cycle metric and fewer L1d cache misses. Note that additional optimization can probably be achieved using SIMD instructions; again, that would involve the use of the intrinsics provided by the CPU vendor.

Optimizing using huge pages

Another aspect to consider when optimizing an application is the use of huge pages. Huge pages are pages larger than the default 4 KB page size. The advantage of using huge pages is that each TLB entry covers more memory, so the CPU can address more memory without having to go back to the page table in main memory and redo the page-table walk. There is an awesome post by Evan Jones that explains the advantages of using huge pages.
The post refers to the Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator paper by Google. The paper explains how they extended malloc to make it huge-page aware, improving their requests-per-second (RPS) by 7.7% while reducing RAM usage by 2.4%.

Evan Jones provides the evanj/hugepagedemo repository, containing a Rust demo that shows the advantages of huge pages. Again, the demo uses perf to analyze the performance of the code.

Some other techniques for enabling huge pages are described in the Low Latency Optimization: Using Huge Pages on Linux (Part 2) post by Guillaume Morin.
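As a concrete illustration (not taken from the posts above), the sketch below asks the Linux kernel to back an anonymous mapping with transparent huge pages via madvise(MADV_HUGEPAGE); it assumes Linux and the libc crate:

```rust
use std::ptr;

/// Allocate `len` bytes with mmap and hint the kernel to use huge pages.
fn alloc_with_huge_pages(len: usize) -> *mut u8 {
    unsafe {
        // Anonymous, private mapping; initially backed by regular 4 KB pages.
        let addr = libc::mmap(
            ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert_ne!(addr, libc::MAP_FAILED, "mmap failed");
        // Hint the kernel to back this range with transparent huge pages,
        // so each TLB entry covers 2 MB instead of 4 KB.
        libc::madvise(addr, len, libc::MADV_HUGEPAGE);
        addr as *mut u8
    }
}

fn main() {
    // 64 MB region; sizes that are multiples of 2 MB map cleanly to huge pages.
    let len = 64 * 1024 * 1024;
    let region = alloc_with_huge_pages(len);
    // Touch the memory so the pages are actually faulted in.
    unsafe { region.write_bytes(0, len) };
}
```

Whether the kernel actually backed the region with huge pages can be checked in the AnonHugePages field of /proc/<pid>/smaps.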

Wrap-Up

This post covered some of the techniques that can be used to perform ad hoc optimizations on code: bypassing the cache using non-temporal writes, cache optimization with sequential access and loop tiling, and using huge pages to improve TLB performance. These techniques can be applied in most programming languages. It is worth noting that these optimizations strongly depend on the CPU architecture; the same applies to the perf events tracked in this post.
Finally, most of the theoretical concepts are explained in the What every programmer should know about memory paper. The full code samples are available in the samueleresca/what-every-programmer-should-know-about-memory repository.

References

What every programmer should know about memory (What a programmer can do) - Ulrich Drepper

Hugepages are a good idea - Evan Jones

Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator - A.H. Hunter, Chris Kennelly, Paul Turner, Darryl Gove, Tipp Moseley, and Parthasarathy Ranganathan

Low Latency Optimization: Using Huge Pages on Linux (Part 2) - Guillaume Morin