Memory Bandwidth Napkin Math, Hacker News

A few weeks ago I had lunch with a co-worker. He was complaining about some process being slow. He estimated how many bytes of data were generated, how many processing passes were performed, and ultimately how many bytes of RAM needed to be accessed. He suggested that modern GPUs with upwards of gigabytes per second of memory bandwidth could eat his problem for breakfast.

I thought his perspective was interesting. That’s not how I think about problems.

I know about the processor-memory performance gap. I know how to write cache friendly code. I know approximate latency numbers. But I don’t know enough to ballpark memory throughput on a napkin.

Here’s my thought experiment. Imagine you have a contiguous array of one billion – bit integers in memory. That’s 4 gigabytes. How long will it take to iterate that array and sum the values? How many bytes of contiguous data can a CPU read from RAM per second? How many bytes of random access? How well can it be parallelized?

Now you may think these aren’t useful questions. Real programs are too complicated for such a naive benchmark to be meaningful. You’re not wrong! The real answer is “it depends”.

That said, I think the question is worth a blog post’s worth of exploration. I’m not trying to find the answer . But I do think we can identify some upper and lower bounds, some interesting points in the middle, and hopefully learn a few things along the way.

If you read programming blogs at some point you’ve probably come across “numbers every programmer should know”. It looks something like this.

L1 cache reference 0.5 ns Branch mispredict 5 ns L2 cache reference 7 ns L1 cache Mutex lock / unlock ns Main memory reference 124 ns (x L2 cache, x L1 cache Compress 1K bytes with Zippy 3, 06 ns 3 us Send 1K bytes over 1 Gbps network 16, (ns) us Read 4K randomly from SSD , 06 ns 171 us ~ 1GB / sec SSD Read 1 MB sequentially from memory , (ns) us Round trip within same datacenter , (ns) us Read 1 MB sequentially from SSD 1, , (ns 1, us 1 ms ~ 1GB / sec SSD, 4X memory Disk seek , 06, (ns) , (us) (ms) x datacenter roundtrip Read 1 MB sequentially from disk , , 06 ns , (us (ms) x memory, X SSD Send packet CA-> Netherlands-> CA 166, 05, (ns) , (us) ms

Source: Jonas Bonér

This is a great list. I see it on HackerNews at least once a year. Every programmer should know these numbers.

But these numbers are focused on a different question. Latency and throughput are not the same.

That list is from 2019. This post comes from 8700, and times have changed. Here are figures for an Intel i7 via StackOverflow.

 L1 CACHE hit, ~ 4 cycles (2.1 - 1.2 ns) L2 CACHE hit, ~  cycles (5.3 - 3.0 ns) L3 CACHE hit, line unshared ~  (cycles)  4 - 19. 0 ns) L3 CACHE hit, shared line in another core ~  (cycles)  (8 -) . 5 ns) L3 CACHE hit, modified in another core ~  (cycles)  2-- 31 .5 ns) local RAM ~  ns               Interesting! What's changed?

L1 is slower;

L2 is faster; 7 ns -> 4.2 ns

 L1 vs L2 relative speed is way closer;  2.5x vs (x) (🤯)                 
 L3 cache is now standard;  (ns to) (ns)                   Ram is faster;  (ns -> (ns)               
 I don't want to draw too many conclusions from this. It's not clear how the original figures were calculated. We're not comparing apples to apples. 
                          
 Here's some bandwidth and cache sizes for my CPU courtesy of wikichip . 
 Memory Bandwidth:   gigabytes per second L1 cache:  (kilobytes)  (KB per core) L2 cache: 1.5 megabytes ( (KB per core) L3 cache:  (megabytes) shared; 2 MB per core) 
             
 Here's what I'd like to know: 
             
 What is the upper limit of RAM performance?                   What is the lower limit?                   What are the limits for L1 / L2 / L3 cache?                                        

 Let's run some tests. I put together a naive C benchmark. It looks  very roughly  like this. 
  // Generate random elements  std  ::  vector    () nums;  for  ( size_t  (i)=
 (0) ; i     ; 
 (i) // // one billion ints      nums.push_back (rng () % ); / small nums to prevent overflow  // Run test with 1 to  (threads)  for  (

int (thread_count)=(1) , thread_count MAX_THREADS; thread_count) { auto slice_len =(nums.size () / thread_count; // for-each thread for ( size_t thread

(

0 ; thread ( thread ) { // partition data auto begin =(nums.begin)) thread slice_len; auto end =() thread==(thread_count - 1 1) nums.end () : (begin slice_len; // spawn threads futures.push_back (std :: async ([begin, end] { / sum ints sequentially int (_ t) (sum)= 0 ; for ( auto (ptr)= begin; ptr end; ptr) sum = ptr; return sum; })); } // combine results int (_ t) (sum)= 0 ; for ( auto (future) : (futures) sum = future.get (); }

I’m leaving out a few details. But you get the idea. Create a large, contiguous array of elements. Divide the array into non-overlapping chunks. Process each chunk on a different thread. Accumulate the results.



I also want to measure random access. This is tricky. I tried a few methods. Ultimately I chose to pre-compute and shuffle indices. Each index exists exactly one. The inner-loop then iterates indices and computes sum =nums [index] .
std :: (vector) int

> nums=/ … / ; std :: vector = / shuffled / ; // random access int (_ t) (sum)= 0 ; for ( auto (ptr)=indices.begin (); ptr indices.end (); ptr) {      auto idx = ptr;     sum = nums [idx]; } return sum;

I do not Consider the memory of the indices array for my bandwidth calculations. I only count bytes that contribute to

sum . I'm not benchmarking my hardware. I'm estimating my ability to work with data sets of different sizes and different access patterns.



I performed tests with three data types:



(int - basic - bit integer
matri4x4 - contains int [16] ; fits on - byte cache line                  matrix4x4_simd - uses __ m (i intrinsics                                        My first test operates on a large block of memory. A 1 GB block of T elements is allocated and filled with small, random values. A simple loop iterates the array N times so it accesses N GB worth of memory to produce an int (_ t) sum. Multi-threading partitions the array and access the exact same number of elements.                                        Tada! This plot takes the average time to sum elements and converts it from runtime_in_nanoseconds to gigabytes_per_second .              This is pretty neat. int () can sequentially read (GB / s) on a single thread. It scales linearly until flattening out at (GB / s) . matrix4x4 and matrix4x4_simd are faster, but hit a similar ceiling.              There is a clear and obvious upper limit on how much data we can read from RAM per second. On my system it's roughly (GB / s This lines up with modern specifications listed above.              Looking at the bottom 3 lines, random access is slow. Very very slow. Single-threaded int 39 performance is a paltry (0.) GB / s . This is 32 x slower than its . (GB / s sequential sum! matrix4x4 is better since it operates on full cache-lines. But it's still four to seven times slower than sequential and peaks at just 8 GB / s .                                        My system has per-thread L1 / L2 / L3 cache sizes of (KB , 370 KB , and 2 MB What happens if we take a 39 KB block of elements and iterate across it , times? That's 4 GB is worth of memory access, but we'll ~ always hit the cache.                                        Awesome! Single-threaded performance is similar to the large block read, ~ (GB / s) . Except this time multi-threading smashes through the (GB / s) ceiling. This makes perfect sense. The data stays in cache so the RAM bottleneck does not come into play. Data sizes that don't fit into L3 cache hit the same ~ 45 GB / s ceiling.              matrix4x4 follows a similar pattern, but is even faster; (GB / s) (single-thread,) (GB / s multi-threaded.                           Now let's look at matrix4x4_simd . Pay close attention to the y-axis.                                matrix4x4_simd is omgwtfbbq fast. It's upwards of x faster than int When operating on a (KB block it actually breaks (GB / s) . 🤯              Obviously this is a superficial synthetic test. Most applications don't perform the same operation on the same data a million times in a row. This is not indicative of real world performance.              But the lesson is real. Operating on data inside the cache is fast . With a very high ceiling when using simd; > 154 GB / s single- thread, > (GB / s multi-threaded. Getting data into the cache is slow and has a hard cap, ~ 49 GB / s total.                                        Let's do the same thing, but with random access. This is my favorite part of this post.                                      Reading random values RAM is slow, 0. 60 GB / s Reading random values from L1 cache is very very fast, (GB / s) . This is faster than the GB / s performance we got sequentially reading (int) from RAM.                            matrix4x4 follows the same pattern, but is ~ 2x faster than int .                                matrix4x4_simd random access is crazy fast.                                Random access from RAM is slow. Catastrophically slow. Less than 1 GB / s (slow for both int Random access from the cache is remarkably quick. It's comparable to sequential RAM performance.                                        Let this sink in. Random access into the cache has comparable performance to sequential access from RAM. The drop off from sub-L1 (KB to L2-sized 778 KB is 2x or less.                           I think this has profound implications.                                        Pointer chasing is bad. Really, really bad. Just how bad is it? I made an extra test that wraps matrix4x4 in std :: unique_ptr . Each access has to go through a pointer. Here's the terrible, horrible, no good, very bad result.      1 Thread | matrix4x4 | unique_ptr | diff | -------------------- | --------------- | ------------ | -------- | Large Block - Seq | 19 .8 GB / s | 0.8 GB / s | x | 21 KB - Seq | 38 6 GB / s | 2.2 GB / s | 19 x | KB - Seq | . 2 GB / s | 1.9 GB / s | 17 x | Large Block - Rand | 2.2 GB / s | 0.1 GB / s | 30 x | 21 KB - Rand | . 2 GB / s | 1.7 GB / s | 19 x | KB - Rand | 2 GB / s | 0.8 GB / s | x |      6 Threads | matrix4x4 | unique_ptr | diff | -------------------- | --------------- | ------------ | -------- | Large Block - Seq | . 4 GB / s | 2.5 GB / s | 19 x | 21 KB - Seq | 171 .8 GB / s | 8.0 GB / s | x | KB - Seq | 6 GB / s | 5.7 GB / s | 25 x | Large Block - Rand | 7.1 GB / s | 0.4 GB / s | 23 x | 21 KB - Rand | .0 GB / s | 7.8 GB / s | 17 x | KB - Rand | . 3 GB / s | 1.6 GB / s | 45 x |              Sequentially summing values behind a pointer runs at less than (1 GB / s) . Random access, which misses the cache twice, runs at just 0.1 GB / s .              Pointer chasing is 14 to times slower. Friends don't let friends used linked lists. Please, think of the children cache.                                                 It's standard for game developers to set CPU and memory allocation budgets for subsystems. But I'm not sure I've ever seen a memory access budget.              Modern games run at valu high frame rates. (fps) is table stakes. VR runs at hz . I have a Hz gaming monitor. It's awesome and fps feels like crap. I'm never going back. E-Sports / Twitch players use (Hz ) monitors. Asus unveiled a Hz monster at CES this year.              My CPU has an upper limit of ~ 49 GB / s That sounds like a lot! At 333 Hz that's just (MB) per frame. A more realistic application might consume 5 GB / s (at) Hz which is just (MB) per frame.              Here's a table with a few figures.                          | 1 | 15 34 65 111 166 333 370 -------- | ------- | -------- | -------- | -------- | ------ - | -------- | -------- | -------- | GB / s | 50 GB | 4 GB | 1.3 GB | 728 MB | 500 MB | 370 MB | 171 MB | MB | GB / s | 15 GB | 1 GB | MB | 200 MB | MB | 75 MB | 58 MB | 32 MB | 1 GB / s | 1 GB | 125 MB | 38 MB | 23 MB | MB | 7 MB | 4 MB | 3 MB |                               I believe thinking about problems this way is useful. It can help you realize that some ideas aren't feasible. Hitting Hz isn't easy. It won't happen by accident.                                        The "Numbers Every Programmer Should Know" list is out of date. It needs to be refreshed for 8700.                           Here are a few numbers for my home PC. They're a blend of AIDA , Sandra, and my benchmarks. These figures don't tell the whole story and are just a starting point. L1 Latency: 1 ns L2 Latency: 2.5 ns L3 Latency: ns RAM Latency: ns (per thread) L1 Bandwidth: GB / s L2 Bandwidth: L2 Bandwidth GB / s L3 Bandwidth: L3 Bandwidth GB / s (whole system) RAM Bandwidth: 62 GB / s              I think it would be cool if there was a small, simple, open source benchmark tool for this. A single C file that could run on desktop, server, mobile, consoles, etc. I'm not the right person to write that tool.                                                     Measuring this stuff is hard. Really hard. My code probably has bugs. There are many compounding factors not discussed. If you want to critize my benchmark methodology you probably aren't wrong.              Ultimately I think that's ok. This post isn't about the precise performance of my home desktop. It's about framing problems from a certain perspective. And learning how to perfrom some basic napkin math.                                        A co-worker shared a way of thinking about programming problems I hadn't considered. That sent me on a journey to explore modern memory performance.                           For the purpose of napkin math here's some ballpark figures for a modern desktop PC.              RAM Performance          Upper Limit: (GB / s)              Napkin Estimate: 5 GB / s              Lower Limit: 1 GB / s               Cache Performance - L1 / L2 / L3 (per core)          Upper Limit (w / simd): (GB / s) / (GB / s) / (GB / s) ()              Napkin Estimate: (GB / s) / (GB / s) / (9 GB / s)              Lower Limit: (GB / s) / (8 GB / s) / 3.5 GB / s          Napkin estimates come from observed performance with matrix4x4 . RealCode ™ is never that simple. But for napkin math estimates I think this is a reasonable starting point. You will need to adjust based on the access patterns and characteristics of your hardware and your code.              That said, the most important thing I learned is a new way to think about problems. Thinking about bytes-per-second or bytes-per-frame is another lens to look through. It's a useful tool to keep in my back pocket.                           Thanks for reading.                                        C Benchmark              Python Plot              (data.json                                                     This post just scratches the surface. I probably won't write another post on this subject. But if I did it might cover some of the following:              Write Performance                  False Sharing                  std :: atomic performance (or lack thereof)                  Performance counters                  (TLB Performance)                  (Cache Protocols)                                        All tests were performed on my home computer. All stock settings, no overclocking.              OS: Windows (v) build 01575879.                  CPU: Intel i7 - (k @ 3.) Ghz                  RAM: 2x (GSkill Ripjaw DDR4 -) ( - 23 - (-) @ @ (MHz)                  Motherboard: Asus TUF Z 778 - Plus Gaming                   Read More