Wednesday, 30 January 2013

Cache Money Hoes


Before we begin, the underlined words are briefly described in the glossary at the end. If there are any terms you don't understand, let me know and I'll place them in the glossary as well.

As the wise contemporary philosophers the Wu-Tang Clan once said: 
"Cache Rules Everything Around Me ( C.R.E.A.M ) "
Many people believe this is just typical rappers exhibiting the materialistic nature of modern life. However, they were in fact visionaries describing the importance of memory hierarchy utilisation on processor performance.

Traditionally, the time complexity of an algorithm has been based on the RAM model. The model assumes that access to any memory location takes the same constant time. In reality a cache hit is an order of magnitude faster than a cache miss.

Therefore it should be possible to significantly improve an algorithm by reordering its memory accesses to be cache friendly. This isn't a particularly novel idea, and it has a name: cache-aware algorithms.

The problem with cache-aware algorithms is that they depend on the cache size: the algorithm must be tuned for every different cache it runs on. This unportable tuning was the inspiration for Harald Prokop's cache-oblivious algorithms. They are algorithms that use the cache efficiently but don't require tuning.

To illustrate cache-oblivious and cache-aware algorithms we first need an algorithm we can improve. The key is that we don't change the time complexity at all. That means we still do the same number of operations, so the processor does the same work. The only thing that varies is the order of memory access, and therefore the cache utilisation.

Algorithms that access memory often are ripe for the picking. I decided to implement the matrix transpose because it is one of the simplest memory-heavy algorithms. The matrix transpose simply takes the rows of the input matrix and makes them the columns of the output matrix, like so:



You don't need to understand matrices at all to understand this algorithm. You only need to understand one definition:
[\mathbf{A}^\mathrm{T}]_{ij} = [\mathbf{A}]_{ji}

This simply states that the entry in the jth row and ith column of the input matrix is equal to the entry in the ith row and jth column of the output matrix.

This cache-unaware or naive algorithm directly uses this definition, as shown below:
      for( int i = 0; i < N; i++ ){
           for( int j = 0; j < N; j++ ){
                out[j][i] = in[i][j];  /* read row by row, write column by column */
           }
      }

Just a side note: the matrices here are large, e.g. in this blog's associated code N=1024, i.e. the matrix is 4MB ( each entry is 4 bytes, so 4x1024x1024 ). Anyway, back to the analysis: 2D arrays in C are stored in row-major order.

This means that an entry's horizontal neighbours ( i.e. row neighbours ) will be on the same cache line, unless the entry itself is at the end of a cache line. Therefore accessing a horizontal neighbour will likely be a cache hit. Conversely, accessing a vertical neighbour will likely be a cache miss ( unless the matrix is relatively small and its rows are small enough to fit in a cache line ).
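
To make this concrete, here is a tiny check ( my own snippet, not part of the attached source ) showing that row neighbours sit next to each other in memory while column neighbours are a whole row apart:

      #include <stdio.h>

      #define N 1024

      static int in[N][N];

      int main( void ){
           /* Pointer subtraction gives the distance in ints: row neighbours are 1
              apart ( usually the same cache line ), column neighbours are N ints
              apart ( 4KB here, so almost always a different line ).              */
           printf( "row neighbour distance    : %td ints\n", &in[0][1] - &in[0][0] );
           printf( "column neighbour distance : %td ints\n", &in[1][0] - &in[0][0] );
           return 0;
      }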

The naive algorithm will often cache hit when reading the input matrix because it traverses it in row order ( the inner iterator j walks along a row ). However, writing to the output matrix will likely cache miss due to the column order traversal.

How can this be improved? Well, we can change the order of writing to minimize the cache misses. That is exactly what the cache-aware algorithm does: it splits the matrix into a set of sub-matrices whose size matches the cache line, then runs the transpose on each sub-matrix:


Don't let the sub-matrices confuse you. We are still essentially following the [\mathbf{A}^\mathrm{T}]_{ij} = [\mathbf{A}]_{ji} rule. The only difference is that we restrict it to sub-blocks of the matrix. Once a block has been transposed with minimal cache misses we move onto the next block. Use the source, Luke, if you want to know more.
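
To make that concrete, here's a minimal sketch of a blocked transpose ( my own sketch in the spirit of the attached source; the names transpose_aware, N and BLOCK are mine, and it assumes N is a multiple of BLOCK ):

      #define N     1024   /* matrix dimension, as used in this blog's tests        */
      #define BLOCK 16     /* tuned so one block row fills a cache line:
                              16 entries * 4 bytes = 64 bytes                       */

      /* Cache-aware transpose: visit the matrix one BLOCK x BLOCK sub-matrix at a
         time, so both the reads and the writes stay within a few cache lines.     */
      void transpose_aware( int out[N][N], const int in[N][N] ){
           for( int bi = 0; bi < N; bi += BLOCK ){
                for( int bj = 0; bj < N; bj += BLOCK ){
                     for( int i = bi; i < bi + BLOCK; i++ ){
                          for( int j = bj; j < bj + BLOCK; j++ ){
                               out[j][i] = in[i][j];  /* same rule, restricted to a block */
                          }
                     }
                }
           }
      }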

The problem with the previous cache-aware algorithm is that it takes the cache line size as a parameter, so we would have to tune it for each different target. Can we eliminate this tuning? Sure we can! The cache-oblivious (CO) way.

The CO algorithm basically keeps splitting the matrix into sub-blocks until we reach blocks of size 1. In reality you set the base case slightly larger than one to minimize the cost of the recursion, then apply the transpose to the blocks. So the CO algorithm is basically the cache-aware algorithm, but the tuning is done automatically through divide and conquer.
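
As a rough sketch ( again my own, not the attached source; CUTOFF is the "slightly larger than one" base case ), the recursion looks something like this:

      #define N      1024  /* matrix dimension, as before                           */
      #define CUTOFF 16    /* base-case block size, kept above 1 to limit recursion */

      /* Cache-oblivious transpose of the rows r..r+rows-1 and columns c..c+cols-1
         of `in` into `out`: keep halving the larger dimension until the block is
         small, then transpose it directly.                                         */
      void transpose_oblivious( int out[N][N], const int in[N][N],
                                int r, int c, int rows, int cols ){
           if( rows <= CUTOFF && cols <= CUTOFF ){
                for( int i = r; i < r + rows; i++ )
                     for( int j = c; j < c + cols; j++ )
                          out[j][i] = in[i][j];
           } else if( rows >= cols ){
                transpose_oblivious( out, in, r,            c, rows / 2,        cols );
                transpose_oblivious( out, in, r + rows / 2, c, rows - rows / 2, cols );
           } else {
                transpose_oblivious( out, in, r, c,            rows, cols / 2        );
                transpose_oblivious( out, in, r, c + cols / 2, rows, cols - cols / 2 );
           }
      }

      /* called for the whole matrix as: transpose_oblivious( out, in, 0, 0, N, N ); */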

But wait, isn't setting the block size too small going to cost a cache miss? No, that's the clever bit! If the block is smaller than the cache line then the siblings, and eventually the parent block ( the block that encompasses the sub-block ), will still be in cache. As shown below:



The initial block being accessed is the sub-block in the top left, consisting of the red and blue blocks. Its entries are shown in the cache line above. Once it's been accessed, its sibling block ( the top right block, consisting of green and yellow ) is accessed. Since the blocks are smaller than the cache line, the sibling's entries are also in cache.

So let's see if the Wu-Tang hypothesis is correct. The average results of running the transpose on a 1024x1024 matrix 100 times are:

Naive       Cache Aware   Cache Oblivious
5704 ms     763.33 ms     1212 ms

Of course the Wu-Tang Clan was right on the money. The algorithms are doing the exact same number of operations, yet the difference is huge. The cache-oblivious algorithm is slightly slower than the cache-aware one because it's paying for the recursion. As the size of the matrix grows the cost of the recursion becomes relatively smaller, so we can say that the cache-oblivious algorithm is asymptotically equivalent to the cache-aware one.

This blog has given a very broad overview. For more information please read Harald Prokop's MIT master's thesis and the attached source code.

For the people that do read the thesis, one thing that had me confused was that the cache complexity was often stated as O( 1 + k / N ) and I couldn't figure out where the 1 came from. It's because big-O represents the worst case, and the worst case in terms of cache is when the memory address is misaligned, so you have to pay for an additional cache miss per object access.



So the take-home for today is: when your code runs slow, don't forget the cache! Until next time, keep it real...



Glossary 

Time Complexity - A way of comparing algorithms mathematically. Instead of comparing them by actual time measurements, which are affected by the computer the algorithm is run on, we compare them by the expected number of operations. The number of operations is typically dependent on the size of the input n, so we say the complexity is a function of n.

Cache Hit - When the CPU attempts to access an object that's already in cache.

Cache Miss - When the CPU attempts to access an object that's not in cache, so it must be copied from memory to cache ( one cache line at a time ).

Row Major - A 2D ( or n-dimensional ) array in C is actually stored as one large single array. In row-major order the rows are laid out one after another. So, for example, if the array is defined as x[3][5] and you access element x[2][4], it is actually at offset 2*5 + 4, because it has to skip the first two rows and then move 4 elements into the third row.
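
For example, this hypothetical snippet just confirms the offset calculation:

      #include <assert.h>

      int main( void ){
           int x[3][5];
           int *flat = &x[0][0];            /* the same storage viewed as one flat array */

           x[2][4] = 42;
           assert( flat[ 2*5 + 4 ] == 42 ); /* row-major offset: row * width + column    */
           return 0;
      }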

12 comments:

  1. Cool article with good illustration, please keep writing!

  2. thank-you ... I will try to blog about every Sunday project I take on. Next Sun is going to be on linear programming or indexing. I may write an article on JIT from the Sunday before last :)

  3. I believe that cache-oblivious algorithms are asymptotically optimal as the number of cache levels in the hierarchy increases, not asymptotically optimal as the input size increases.

    In "The Cost of Cache-Oblivious Searching" (Proceedings. 44th Annual IEEE Symposium on Foundations of Computer Science, 2003), Bender et al. prove that cache-oblivious search structures take (lg e) or approximately 44% more block transfers than the optimal cache-aware algorithm in the 2-level memory model (the DAM model) and can do no better. As the number of levels in the memory model increases, the performance loss of the cache-oblivious algorithm approaches zero.

    Of course real hardware does not have a 2-level memory hierarchy nor does it have an infinite level memory hierarchy. But I have found that in practice, the 2-level memory model is a good abstraction for application performance. There is usually one "jump" in the memory hierarchy that has a significant performance impact. Either the gap from disk to main memory, or the gap from main memory to highest cache level, depending on the application.

    Replies
    1. Thank you for letting me know. I'll read that on Sat and update the correction.

  4. Awesome article! Interesting note: when I ran the code, the cache-oblivious was faster than the cache-aware. (compiled with gcc -O3 and ran on an i7)

    naive : 1696 1707 1707
    cache aware : 798 800 803
    cache oblivious: 364 366 364

    Replies
    1. Thanks Danny ... A friend of mine also noticed similar results. Unfortunately I must restrict myself to Sundays for tangent work so I really can't do more research on the cause.

      If you are interested in finding out, before studying the optimised output make sure:

      1) You have set the cache line parameter for the aware algorithm appropriately
      2) Verify that the matrix begins on a page boundary using mmap ( so that it's cache aligned )

      If you do find out, please let me know and I'll update this blog.

      Thanks

    2. Worth noting is that the ratios of performance increase in the three cases are as follows:

      naive: 3.35 times faster
      cache aware: 0.954 times as fast
      cache oblivious: 3.32 times faster

      So that supports the hypothesis that your test machine is around 3.3 times faster than the one used for the article, but you have incorrectly configured the cache size or alignment in the cache-aware case, in one of the ways Iain describes above.

      Incidentally, this is a good anecdotal argument as to why the cache-oblivious algorithm is a better choice in general, because it's harder to misconfigure.

  5. get the memory, mega mega bytes ya'll!

  6. Well-written. Made my day.

    (memory) Word!
