in ,

Memory Bandwidth Napkin Math, Hacker News

Memory Bandwidth Napkin Math, Hacker News
                  

            

A few weeks ago I had lunch with a co-worker. He was complaining about some process being slow. He estimated how many bytes of data were generated, how many processing passes were performed, and ultimately how many bytes of RAM needed to be accessed. He suggested that modern GPUs with upwards of gigabytes per second of memory bandwidth could eat his problem for breakfast.

            

I thought his perspective was interesting. That’s not how I think about problems.

                         

I know about the processor-memory performance gap. I know how to write cache friendly code. I know approximate latency numbers. But I don’t know enough to ballpark memory throughput on a napkin.

            

Here’s my thought experiment. Imagine you have a contiguous array of one billion – bit integers in memory. That’s 4 gigabytes. How long will it take to iterate that array and sum the values? How many bytes of contiguous data can a CPU read from RAM per second? How many bytes of random access? How well can it be parallelized?

            

Now you may think these aren’t useful questions. Real programs are too complicated for such a naive benchmark to be meaningful. You’re not wrong! The real answer is “it depends”.

            

That said, I think the question is worth a blog post’s worth of exploration. I’m not trying to find the answer . But I do think we can identify some upper and lower bounds, some interesting points in the middle, and hopefully learn a few things along the way.

                                      

If you read programming blogs at some point you’ve probably come across “numbers every programmer should know”. It looks something like this.

L1 cache reference 0.5 ns Branch mispredict 5 ns L2 cache reference 7 ns L1 cache Mutex lock / unlock ns Main memory reference 124 ns (x L2 cache, x L1 cache Compress 1K bytes with Zippy 3, 06 ns 3 us Send 1K bytes over 1 Gbps network 16, (ns) us Read 4K randomly from SSD , 06 ns 171 us ~ 1GB / sec SSD Read 1 MB sequentially from memory , (ns) us Round trip within same datacenter , (ns) us Read 1 MB sequentially from SSD 1, , (ns 1, us 1 ms ~ 1GB / sec SSD, 4X memory Disk seek , 06, (ns) , (us) (ms) x datacenter roundtrip Read 1 MB sequentially from disk , , 06 ns , (us (ms) x memory, X SSD Send packet CA-> Netherlands-> CA 166, 05, (ns) , (us) ms             

Source: Jonas Bonér

            

This is a great list. I see it on HackerNews at least once a year. Every programmer should know these numbers.

            

But these numbers are focused on a different question. Latency and throughput are not the same.

                                      

That list is from 2019. This post comes from 8700, and times have changed. Here are figures for an Intel i7 via StackOverflow.

 L1 CACHE hit, ~ 4 cycles (2.1 - 1.2 ns) L2 CACHE hit, ~  cycles (5.3 - 3.0 ns) L3 CACHE hit, line unshared ~  (cycles)  4 - 19. 0 ns) L3 CACHE hit, shared line in another core ~ ​​ (cycles)  (8 -) . 5 ns) L3 CACHE hit, modified in another core ~ ​​ (cycles)  2-- 31 .5 ns) local RAM ~  ns               Interesting! What's changed?               

    L1 is slower;

    int (thread_count)=(1) , thread_count MAX_THREADS; thread_count) {      auto slice_len =(nums.size () / thread_count;           // for-each thread      for ( size_t thread

      (

    0 ; thread ( thread ) {                   // partition data          auto begin =(nums.begin)) thread slice_len;          auto end =() thread==(thread_count - 1 1)              nums.end () : (begin slice_len;          // spawn threads         futures.push_back (std :: async ([begin, end] {                           / sum ints sequentially              int (_ t) (sum)= 0 ;              for ( auto (ptr)= begin; ptr end; ptr)                 sum = ptr;              return sum;         }));     }      // combine results      int (_ t) (sum)= 0 ;      for ( auto (future) : (futures)         sum = future.get (); }             

    I’m leaving out a few details. But you get the idea. Create a large, contiguous array of elements. Divide the array into non-overlapping chunks. Process each chunk on a different thread. Accumulate the results.

                

    I also want to measure random access. This is tricky. I tried a few methods. Ultimately I chose to pre-compute and shuffle indices. Each index exists exactly one. The inner-loop then iterates indices and computes sum =nums [index] .             

     std  ::  (vector)    int 

    > nums=/ … / ; std :: vector = / shuffled / ; // random access int (_ t) (sum)= 0 ; for ( auto (ptr)=indices.begin (); ptr indices.end (); ptr) {      auto idx = ptr;     sum = nums [idx]; } return sum;             

    I do not Consider the memory of the indices array for my bandwidth calculations. I only count bytes that contribute to

      sum . I'm not benchmarking my hardware. I'm estimating my ability to work with data sets of different sizes and different access patterns.

                

    I performed tests with three data types:

                

    (int - basic CPU vs Memory Bandwidth - bit integer                 

    What do you think?

    Leave a Reply

    Your email address will not be published.

    GIPHY App Key not set. Please check settings

    Polkadot users can now claim DOT tokens on Coinbase Custody

    Josh Hawley's plan for the FTC, Facebook, Google, and tech makes sense, Recode

    Josh Hawley's plan for the FTC, Facebook, Google, and tech makes sense, Recode