Dense Linear Algebra
This pattern is trying to address that fact that memory access is much more expensive that CPU processing, so once a piece of data has been loaded, we need to make the most use of it.
“The BLAS level 3 routines stand out because of the smaller order of complexity for data movement (O(N^2)) than for computation (O(N^3)). Due to this fact, BLAS Level 3 routines can approach peak performance on most parallel systems. For this reason, a programmer would greatly benefit from structuring an algorithm to favor BLAS Level 3 functions. For example, replacing a large number of dot products (a BLAS level 1 function) with a single matrix-matrix multiplication (a BLAS level 3 function) can turn a problem bound by memory bandwidth into a compute-bound problem.” I’m still having trouble figuring out what this paragraph is saying, because at first I wasn’t sure why you’d want to take O(N)/O(N) vector operations and turn them into O(N^2)/O(N^3) operations, which are obviously much more expensive. However, once I actually turned by brain on, I realized that an NxN matrix is an aggregation of N 1xN vectors. Things still don’t quite click because it should take O(N^2) time to perform N BLAS level 1 operations on 1xN vectors. I realize that the point is to make the process compute-bound instead of the bottleneck being data movement, but what I’m focusing on is that it takes O(N^3) as a BLAS level 3 operation when it should take O(N^2) either as one BLAS level 2 operation or N BLAS level 1 operations. I know I’m missing something because the key seems to be BLAS level 3’s different bounds on data movement and computation, whereas BLAS levels 1 & 2 have the same bound on both, but I still don’t see how O(N^3) is better than O(N^2). The O(N^2) BLAS level 1 & 2 operations should be able to be parallelized, too, right?
Naïve implementations of matrix multiplications will repeatedly access data in a way that is suboptimal from a data caching and persistence strategy. A matrix is typically stored in either row-major or column-major order. Let’s say that we are using column-major order. That means accessing items that are adjacent in a matrix row is an expensive operation because the data are likely not stored in the same page on the disk and are likely not cached together. This becomes a problem when naively computing a matrix product cell by cell because we must traverse the same row many times. Instead of computing the product cell by cell, the paper proposes using an outer product, “in which the product of a column of A and a row of B acts as a rank one update to the entire product matrix C.” In this way, we do not have to repeatedly traverse each row and/or column, so spatial locality is still a problem, but a less frequent one.
An extension of the above idea is rather than to update all of C by using an entire column from A and an entire row from B, a smaller block that uses a subset of each is computed. An autotuning library is needed to determine what is the best block size for the given hardware, as block size is dependent on everything from the TLB to the number of floating point units available.
The idea of blocking above is typically optimized towards the quantity and speed of registers in the system. It can be extended further to create other layers of blocking optimized towards L1, L2, and, if it exists, L3 caches. The high-level blocks move around within the higher level blocks until the higher level blocks have been fully traversed, at which point they move.
One noteworthy optimization mentioned in the paper is to “use local variables to explicitly remove false dependencies.” If the value of a matrix cell is stored in a local variable, then operations can be reordered in a more optimal way than if there was a write-after-read/anti-dependency.
Nothing new here as far as creating the abstraction goes. This was covered very well in my algorithms class during my undergrad at UIUC (CS 473). My main take away from the class was that almost any problem could be solved by converting it to a graph and using max flow, though the solution may be less than ideal.
I think the list of graph structures (bipartite, connected, complete, k-regular, tree, forest, DAG) and parallelization opportunities listed make a nice starting point when it comes to doing some research on optimizing an algorithm. Same thing with the graph storage structure (vertex list, edge list, adjacency list, adjacency matrix), traversal options (breadth-first search, depth-first search, topological sort, minimum spanning trees [Kruskal, Prim], single source shortest paths [Bellman-Ford, Dijkstra], all-pairs shortest paths [Floyd-Warshall, Johnson], single source longest paths in DAG [Floss], max flow [Edmonds-Karp, push-relabel, relabel to front], graph coloring [Chaitin, Bomen]), and partitioning operations (Kernighan-Lin, Fiduccia-Matheyses, ParMetis). I found myself scouring my notes from CS 225 & CS473 and searching the Web for all the algorithms that I’ve used to know back during my undergrad but have since forgotten.
This method of choosing random points in the sample space can be beneficial when it is difficult to arrive at the answer via analytical methods.
I’ve seen SIMD mentioned frequently in this class during the last few weeks’ readings. I read up on it a bit on Wikipedia before, but I’ll have to dig a little deeper and see if/how it can impact my day job.
As mentioned in the paper, map-reduce and Monte Carlo often work well together because the random samplings are independent tasks that can be performed in the map phase and then have their statistics aggregated in the reduce phase.
I think the paper spends a bit too much time (it’s not much time in total, but these patterns seem to favor brevity) on how random number generation is difficult when not relying on any sampling of physical phenomenon. The more interesting discussion is around how to get pseudo random number generation and parallelized systems to play nicely together, which is a problem because the generator relies on state that can’t be shared by different threads without having them produce the same “random” number. The paper presents two possible solutions: either generate random numbers in batches so blocking/locking is not needed as frequently or “[g]enerate multiple unique sequences in parallel, using advanced parallel version of the Mersenne Twister PRNG or SPRNG Ver4.0…”
I found both the “Experiment Execution” and “Result Aggregation” sections lacking. The former focuses on nothing but SIMD lanes, while the later does not dig into any depth or many alternatives for combining results other than “use locks to synchronize threads.”