Memory management optimization techniques
This article builds upon the previous one from this series and delves into the optimizations outlined in the "What a programmer can do" section of the "What every programmer should know about memory" paper by Ulrich Drepper. The code examples are implemented in Rust, and the execution is analyzed using perf.
Understanding perf
This post uses `perf` to explore how the code is executed on the CPU. `perf` is the official Linux profiler and it is included in the kernel source code, see the perf Tutorial.
Let's take a moment to go through some of the main `perf` events that will be used in this post:
- `instructions`: tells us how many instructions our code issued.
- `cycles`: tells us how many CPU cycles our code took. The number of cycles is usually lower than the number of instructions because the CPU can issue several instructions per cycle. The ratio of `instructions` to `cycles` is called the IPC (Instructions Per Cycle). The IPC is a good metric to understand how efficiently the CPU is running your code: the higher the IPC, the better. It is also an indicator of how well the code is vectorized.
- `cache-references`: represents the number of times the CPU accessed the cache. If the data is not already in the cache, the CPU has to access the main memory.
- `cache-misses`: represents the number of cache accesses that missed and had to go to main memory. As mentioned in the previous post of this series, this is a very expensive operation, and we want to minimize it.
- `task-clock`: refers to how much CPU time our task took, usually reported in milliseconds. It differs from the total runtime because, if the program uses many CPUs or threads, the `task-clock` is the sum of the time spent on each CPU or thread.
- `cs`: represents the context switches happening during the code execution.
- `page-faults`: represents the number of page faults happening during the code execution. There are usually dedicated events to distinguish between `minor-page-faults` and `major-page-faults`.
Note that, depending on the CPU model and the manufacturer, the events available in `perf` will be different. In some cases, dedicated cache layer events, such as L1 cache misses / references, will be available. In other cases, you will have to use the generic `cache-misses` / `cache-references` events.
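For reference, a run similar to the ones in this post can be monitored with a command along these lines (the binary path is a placeholder; `-r 10` asks `perf` to repeat the measurement ten times and report averaged counts):

```sh
perf stat -r 10 \
  -e cycles,instructions,cache-references,cache-misses,task-clock,context-switches,page-faults \
  ./target/release/your-binary
```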
Bypassing cache
One of the ways of influencing memory proposed by the paper is to bypass CPU caches when data is initialized and not reused immediately. This approach avoids pushing out data that is already cached in favour of data that will not be read soon.
Processors provide intrinsics for non-temporal write operations. A non-temporal write flags the data as non-temporal, so it is not stored in the CPU cache.
Below is the implementation of a standard vs non-temporal matrix initialization:
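(The exact listing lives in the companion repository; the sketch below is an approximation that assumes a flat `Vec<i32>` buffer, and the matrix dimensions are illustrative.)

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::_mm_stream_si32;

const ROWS: usize = 4096;
const COLS: usize = 4096;

/// Standard initialization: every store goes through the cache hierarchy,
/// potentially evicting data that is still useful.
fn standard_initialize(matrix: &mut [i32]) {
    for cell in matrix.iter_mut() {
        *cell = 1;
    }
}

/// Non-temporal initialization: `_mm_stream_si32` hints the CPU that the
/// stored data will not be reused soon, so the write bypasses the caches.
#[cfg(target_arch = "x86_64")]
fn nocache_initialize(matrix: &mut [i32]) {
    for cell in matrix.iter_mut() {
        // SAFETY: `cell` is a valid, aligned pointer to an i32 we own.
        unsafe { _mm_stream_si32(cell as *mut i32, 1) };
    }
}

fn main() {
    let mut matrix = vec![0i32; ROWS * COLS];
    standard_initialize(&mut matrix);
    #[cfg(target_arch = "x86_64")]
    nocache_initialize(&mut matrix);
}
```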
The `standard_initialize` function initializes a matrix in the standard way. The `nocache_initialize` function takes the non-temporal approach by using the `_mm_stream_si32` intrinsic.
The `_mm_stream_si32` function is defined in `std::arch::x86_64::_mm_stream_si32` and is backed by the corresponding Intel intrinsic.
The following analysis uses the `perf stat` command to understand the impact of non-temporal writes on the process execution. Note that the code is built in optimized mode using the `--release` flag.
Event name | `standard_initialize` | `nocache_initialize` |
---|---|---|
`l1d.replacement` (in millions) | 1,052,585 | 532,639 |
`l2_lines_in.all` | 1,497,033 | 91,486 |
The table contains the event count comparison between standard and non-temporal writes. The `l1d.replacement` event counts the cache lines replaced in the L1 data cache. The `l2_lines_in.all` event counts the cache lines filling the L2 cache.
Note that the numbers in the table are the average event counts over 10 executions; each single execution is monitored using the `perf stat` command.
The events are part of the Intel Ice Lake V1.00 event definitions.
The analysis above shows how the non-temporal initialization roughly halves the L1d cache-line replacements. On top of that, it fills about 16x fewer L2 cache lines than the standard initialization.
Cache optimization with sequential access and loop tiling
Sequential access is crucial for improving L1d cache performance because the processor prefetches data.
The paper takes as an example a matrix multiplication described as:
$$c_{i,j} = a_{i,1} b_{1,j} + a_{i,2} b_{2,j} + \cdots + a_{i,n} b_{n,j}$$
Matrix multiplication can be performed as follows:
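(A minimal sketch of the naive version; the repository's `non_optimized` function may differ in details such as the matrix representation.)

```rust
/// Naive matrix multiplication: `m1` is read row by row, but `m2` is read
/// column by column, so its accesses jump across memory and defeat the
/// hardware prefetcher.
fn non_optimized(m1: &[Vec<f64>], m2: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = m1.len();
    let mut result = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            let mut sum = 0.0;
            for k in 0..n {
                sum += m1[i][k] * m2[k][j];
            }
            result[i][j] = sum;
        }
    }
    result
}
```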
The matrix multiplication above is an example of an operation that can be optimized by taking advantage of sequential access. The `m1` matrix is accessed sequentially (row by row), while `m2` is not (the implementation iterates over its columns first).
It is possible to transpose the `m2` matrix to access it sequentially. This is done in the `optimized` function below:
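(Again a sketch under the same assumptions; here the transposition is performed inside the function itself.)

```rust
/// Transpose `m2` first, so the innermost loop walks both operands
/// left to right and the prefetcher can keep the L1d cache warm.
fn optimized(m1: &[Vec<f64>], m2: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = m1.len();

    // Build the transposed copy of m2: m2_t[j][i] == m2[i][j].
    let mut m2_t = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            m2_t[j][i] = m2[i][j];
        }
    }

    let mut result = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..n {
            let mut sum = 0.0;
            for k in 0..n {
                // Both rows are now contiguous in memory.
                sum += m1[i][k] * m2_t[j][k];
            }
            result[i][j] = sum;
        }
    }
    result
}
```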
Transposing the `m2` matrix before multiplying results in better performance. The table below shows the `cycles` and `instructions` perf events resulting from the execution of the `non_optimized` and `optimized` functions (again, the code has been compiled in release mode).
Event | non optimized | optimized |
---|---|---|
cycles (in billions) | 13.66 | 9.47 |
instructions (in billions) | 26.41 | 24.97 |
ins. per cycle | 1.94 | 2.64 |
The `ins. per cycle` metric shows how many instructions are executed per cycle: the higher the value, the better the performance. The `optimized` function has a higher `ins. per cycle` value because it takes advantage of sequential access for both the `m1` and `m2` matrices.
Another way to improve matrix multiplication speed is to increase the usage of L1d cache by implementing techniques such as loop tiling. The following code demonstrates how loop tiling is applied in matrix multiplication:
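(A sketch of a tiled version that works on the already transposed operand; the tile size is a tuning assumption and the repository's `optimized_tiled` function may be structured differently.)

```rust
/// Edge of the square block, chosen so that one block of each operand
/// fits comfortably in the L1d cache (a tuning parameter, not a fixed rule).
const TILE: usize = 64;

/// Tiled multiplication: the work is carved into TILE x TILE blocks so the
/// data touched by a block stays resident in the L1d cache while it is reused.
fn optimized_tiled(m1: &[Vec<f64>], m2_t: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let n = m1.len();
    let mut result = vec![vec![0.0; n]; n];
    for ii in (0..n).step_by(TILE) {
        for jj in (0..n).step_by(TILE) {
            for kk in (0..n).step_by(TILE) {
                for i in ii..(ii + TILE).min(n) {
                    for j in jj..(jj + TILE).min(n) {
                        // Accumulate the partial dot product for this block.
                        let mut sum = result[i][j];
                        for k in kk..(kk + TILE).min(n) {
                            sum += m1[i][k] * m2_t[j][k];
                        }
                        result[i][j] = sum;
                    }
                }
            }
        }
    }
    result
}
```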
The loop tiling technique enhances cache locality by computing the multiplication in blocks that fit into the L1d cache, which maximizes cache usage and reduces cache misses. The results of the performance tests I ran on the three functions, `non_optimized`, `optimized`, and `optimized_tiled`, are presented in the table below:
Event | non optimized | optimized | optimized with loop tiling |
---|---|---|---|
cycles (in billions) | 13.66 | 9.47 | 7.05 |
instructions (in billions) | 26.47 | 24.97 | 27.77 |
ins. per cycle | 1.94 | 2.64 | 3.94 |
L1d-cache-load-misses (in billions) | 1.48 | 0.13 | 0.05 |
As seen in the table, the `optimized_tiled` function outperforms the other two, with a higher `ins. per cycle` value and fewer L1d cache misses. Note that additional optimization could probably be achieved using SIMD instructions; again, that would involve the intrinsics provided by the CPU vendor.
Optimizing using huge pages
Another aspect to consider when optimizing an application is the use of huge pages. Huge pages are pages that are larger than the usual 4 KiB page size. The advantage of using huge pages is that each TLB entry covers a larger range of memory, so the TLB can map more memory with the same number of entries and the CPU can access more memory without falling back to the page table and re-doing the page table walk. There is an awesome post by Evan Jones that explains the advantages of using huge pages.
The post refers to a Beyond malloc efficiency to fleet efficiency: a hugepage-aware memory allocator paper by Google. The paper explains how they extended malloc to make it huge page aware and improve their requests-per-second (RPS) by 7.7% while reducing RAM usage by 2.4%.
Evan Jones provides the evanj/hugepagedemo repository containing a Rust demo showing the advantages of huge pages. Again, the demo uses `perf` to analyze the performance of the code.
Some other techniques for obtaining huge pages are described in the Low Latency Optimization: Using Huge Pages on Linux (Part 2) post by Guillaume Morin.
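As a rough illustration of the transparent huge pages approach described in those posts, the sketch below maps a large anonymous region and asks the kernel to back it with huge pages via `madvise(MADV_HUGEPAGE)`. It is Linux-specific, relies on the `libc` crate, and the mapping size is arbitrary:

```rust
use libc::{
    madvise, mmap, MADV_HUGEPAGE, MAP_ANONYMOUS, MAP_FAILED, MAP_PRIVATE, PROT_READ, PROT_WRITE,
};
use std::ptr;

// 1 GiB: large enough that backing it with 2 MiB pages noticeably
// reduces the number of TLB entries needed to cover it.
const SIZE: usize = 1 << 30;

fn main() {
    unsafe {
        // Reserve an anonymous private mapping.
        let addr = mmap(
            ptr::null_mut(),
            SIZE,
            PROT_READ | PROT_WRITE,
            MAP_ANONYMOUS | MAP_PRIVATE,
            -1,
            0,
        );
        assert_ne!(addr, MAP_FAILED, "mmap failed");

        // Ask the kernel to back the region with transparent huge pages.
        if madvise(addr, SIZE, MADV_HUGEPAGE) != 0 {
            eprintln!("madvise(MADV_HUGEPAGE) failed; falling back to 4 KiB pages");
        }

        // Touch one byte per 4 KiB page so the region is actually populated.
        let region = std::slice::from_raw_parts_mut(addr as *mut u8, SIZE);
        for offset in (0..SIZE).step_by(4096) {
            region[offset] = 1;
        }
    }
}
```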
Wrap-Up
This post covered some of the techniques that can be used to perform ad hoc optimizations on code: bypassing the cache using non-temporal writes, optimizing cache usage with sequential access and loop tiling, and using huge pages to improve TLB performance. These techniques can be applied in most programming languages. It is worth noting that these optimizations are strongly dependent on the CPU architecture, and the same applies to the `perf` events tracked in this post.
Finally, most of the theoretical concepts are explained in the What every programmer should know about memory paper. The full code samples are available in the samueleresca/what-every-programmer-should-know-about-memory repository.
References
What every programmer should know about memory (What a programmer can do) - Ulrich Drepper
Hugepages are a good idea - Evan Jones
Low Latency Optimization: Using Huge Pages on Linux (Part 2) - Guillaume Morin