Saturday, October 24, 2009

OPL: Dense Linear Algebra, Graph Algorithms, & Monte Carlo

Dense Linear Algebra

This pattern addresses the fact that memory access is much more expensive than CPU processing, so once a piece of data has been loaded, we need to make the most use of it.

“The BLAS level 3 routines stand out because of the smaller order of complexity for data movement (O(N^2)) than for computation (O(N^3)). Due to this fact, BLAS Level 3 routines can approach peak performance on most parallel systems. For this reason, a programmer would greatly benefit from structuring an algorithm to favor BLAS Level 3 functions. For example, replacing a large number of dot products (a BLAS level 1 function) with a single matrix-matrix multiplication (a BLAS level 3 function) can turn a problem bound by memory bandwidth into a compute-bound problem.” I’m still having trouble figuring out what this paragraph is saying, because at first I wasn’t sure why you’d want to take O(N)/O(N) vector operations and turn them into O(N^2)/O(N^3) operations, which are obviously much more expensive. However, once I actually turned my brain on, I realized that an NxN matrix is an aggregation of N 1xN vectors. Things still don’t quite click because it should take O(N^2) time to perform N BLAS level 1 operations on 1xN vectors. I realize that the point is to make the process compute-bound instead of the bottleneck being data movement, but what I’m focusing on is that it takes O(N^3) as a BLAS level 3 operation when it should take O(N^2) either as one BLAS level 2 operation or N BLAS level 1 operations. I know I’m missing something because the key seems to be BLAS level 3’s different bounds on data movement and computation, whereas BLAS levels 1 & 2 have the same bound on both, but I still don’t see how O(N^3) is better than O(N^2). The O(N^2) BLAS level 1 & 2 operations should be able to be parallelized, too, right?
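A back-of-envelope sketch of the flop-to-data-movement ratios helps untangle my confusion above. These are the standard asymptotic counts, not tied to any particular BLAS implementation — the point is that only level 3 does more work per word moved as N grows:

```python
# Rough flop and data-movement counts for one representative routine at
# each BLAS level, to show why only level 3 can become compute-bound.
# (Illustrative sketch; constants are approximate.)

def blas1_dot(n):
    # dot product: reads 2n words, does ~2n flops
    return {"flops": 2 * n, "words_moved": 2 * n}

def blas2_gemv(n):
    # matrix-vector multiply: moves ~n^2 words, does ~2n^2 flops
    return {"flops": 2 * n**2, "words_moved": n**2 + 2 * n}

def blas3_gemm(n):
    # matrix-matrix multiply: moves ~3n^2 words, does ~2n^3 flops
    return {"flops": 2 * n**3, "words_moved": 3 * n**2}

def intensity(stats):
    # arithmetic intensity: flops performed per word of memory traffic
    return stats["flops"] / stats["words_moved"]

n = 1000
print(intensity(blas1_dot(n)))   # O(1) -> memory-bound
print(intensity(blas2_gemv(n)))  # O(1) -> memory-bound
print(intensity(blas3_gemm(n)))  # O(n) -> can be compute-bound
```

So the extra O(N^3) work isn't the selling point by itself; it's that the same O(N^2) data gets reused O(N) times once loaded, which levels 1 and 2 can never do.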

Naïve implementations of matrix multiplication will repeatedly access data in a way that is suboptimal from a data-caching and persistence standpoint. A matrix is typically stored in either row-major or column-major order. Let’s say that we are using column-major order. That means accessing items that are adjacent in a matrix row is an expensive operation because the data are likely not stored in the same memory page and are likely not cached together. This becomes a problem when naively computing a matrix product cell by cell because we must traverse the same row many times. Instead of computing the product cell by cell, the paper proposes using an outer product, “in which the product of a column of A and a row of B acts as a rank one update to the entire product matrix C.” In this way, we do not have to repeatedly traverse each row and/or column, so spatial locality is still a problem, but a less frequent one.
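A minimal sketch of the outer-product formulation (pure Python, small matrices only; the structure is what matters — each pass makes a rank-one update to all of C rather than finishing one cell at a time):

```python
# Compute C = A*B as a sum of outer products (rank-one updates).
# Each iteration of k streams through one column of A and one row of B
# and updates the entire product matrix C.

def matmul_outer(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for k in range(m):                      # column k of A, row k of B
        for i in range(n):
            a_ik = A[i][k]
            for j in range(p):
                C[i][j] += a_ik * B[k][j]   # rank-one update of all of C
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_outer(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```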

An extension of the above idea: rather than updating all of C using an entire column from A and an entire row from B, a smaller block that uses a subset of each is computed. An autotuning library is needed to determine the best block size for the given hardware, as block size depends on everything from the TLB to the number of floating-point units available.

The idea of blocking above is typically optimized toward the quantity and speed of registers in the system. It can be extended further to create other layers of blocking optimized toward the L1, L2, and, if it exists, L3 caches. The lower-level blocks move around within the higher-level blocks until the higher-level blocks have been fully traversed, at which point the higher-level blocks themselves move.
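Here is a rough sketch of what one level of blocking looks like (pure Python; a real implementation would autotune the block size `bs` per the paragraphs above rather than take it as a parameter):

```python
def matmul_blocked(A, B, bs):
    """Multiply n x n matrices in bs x bs blocks, so each block of A, B,
    and C can stay resident in a given level of the memory hierarchy
    while it is being reused."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for jj in range(0, n, bs):
            for kk in range(0, n, bs):
                # update block C[ii..][jj..] with A[ii..][kk..] * B[kk..][jj..]
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a_ik = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a_ik * B[k][j]
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_blocked(A, B, 2))
```

Multi-level blocking is this same loop nest repeated: register-sized blocks moving inside L1-sized blocks moving inside L2-sized blocks.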

One noteworthy optimization mentioned in the paper is to “use local variables to explicitly remove false dependencies.” If the value of a matrix cell is stored in a local variable, then operations can be reordered in a more optimal way than if there was a write-after-read/anti-dependency.
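A tiny sketch of what that optimization looks like (my own illustration, not code from the paper):

```python
# Accumulating into a local variable instead of the C[i][j] cell removes
# the write-after-read (anti) dependency on the array element, leaving
# the compiler/hardware free to reorder the additions.

def cell_no_local(A, B, C, i, j):
    for k in range(len(B)):
        C[i][j] += A[i][k] * B[k][j]   # C[i][j] read & written every pass

def cell_with_local(A, B, C, i, j):
    acc = 0.0                          # local accumulator, no aliasing with C
    for k in range(len(B)):
        acc += A[i][k] * B[k][j]       # these adds can be freely reordered
    C[i][j] = acc                      # single store at the end

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[0.0, 0.0], [0.0, 0.0]]
cell_with_local(A, B, C, 0, 1)
print(C[0][1])  # 22.0
```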



Graph Algorithms

Nothing new here as far as creating the abstraction goes. This was covered very well in my algorithms class during my undergrad at UIUC (CS 473). My main takeaway from the class was that almost any problem can be solved by converting it to a graph and using max flow, though the solution may be less than ideal.

I think the list of graph structures (bipartite, connected, complete, k-regular, tree, forest, DAG) and the parallelization opportunities listed make a nice starting point when it comes to doing some research on optimizing an algorithm. Same with the graph storage structures (vertex list, edge list, adjacency list, adjacency matrix), traversal options (breadth-first search, depth-first search, topological sort, minimum spanning trees [Kruskal, Prim], single-source shortest paths [Bellman-Ford, Dijkstra], all-pairs shortest paths [Floyd-Warshall, Johnson], single-source longest paths in a DAG [Floss], max flow [Edmonds-Karp, push-relabel, relabel-to-front], graph coloring [Chaitin, Bomen]), and partitioning operations (Kernighan-Lin, Fiduccia-Mattheyses, ParMetis). I found myself scouring my notes from CS 225 & CS 473 and searching the Web for all the algorithms that I used to know back during my undergrad but have since forgotten.



Monte Carlo

This method of choosing random points in the sample space can be beneficial when it is difficult to arrive at the answer via analytical methods.

I’ve seen SIMD mentioned frequently in this class during the last few weeks’ readings. I read up on it a bit on Wikipedia before, but I’ll have to dig a little deeper and see if/how it can impact my day job.

As mentioned in the paper, map-reduce and Monte Carlo often work well together because the random samplings are independent tasks that can be performed in the map phase and then have their statistics aggregated in the reduce phase.
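A small sketch of that pairing, estimating π with independent map tasks whose hit counts are summed in the reduce step (the per-task seeding scheme here is my own simplification):

```python
import random

def sample_batch(seed, n):
    """Map task: one independent batch of Monte Carlo samples,
    counting points that land inside the unit quarter-circle."""
    rng = random.Random(seed)          # private generator per task
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(num_tasks, samples_per_task):
    # map phase: independent batches; reduce phase: sum the statistics
    hits = map(sample_batch, range(num_tasks), [samples_per_task] * num_tasks)
    total_hits = sum(hits)             # reduce: a simple aggregation
    return 4.0 * total_hits / (num_tasks * samples_per_task)

print(estimate_pi(8, 100_000))         # ~3.14
```

Because each batch only needs a seed in and a count out, the map tasks are embarrassingly parallel and the reduce is trivial.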

I think the paper spends a bit too much time (it’s not much time in total, but these patterns seem to favor brevity) on how random number generation is difficult when not relying on any sampling of physical phenomena. The more interesting discussion is around how to get pseudo-random number generation and parallelized systems to play nicely together, which is a problem because the generator relies on state that can’t be shared by different threads without having them produce the same “random” number. The paper presents two possible solutions: either generate random numbers in batches so blocking/locking is not needed as frequently, or “[g]enerate multiple unique sequences in parallel, using advanced parallel version of the Mersenne Twister PRNG or SPRNG Ver4.0…”
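A minimal sketch of the spirit of the second solution — Python’s standard library doesn’t expose the parallel Mersenne Twister variants or SPRNG, so giving each worker its own deterministically seeded generator stands in for proper jump-ahead schemes here:

```python
import random

def make_streams(master_seed, n_workers):
    """Give each worker its own deterministically seeded generator so no
    generator state is shared across threads and no two workers replay
    the same "random" sequence. (A simple stand-in for real parallel
    PRNG schemes like jump-ahead Mersenne Twister or SPRNG.)"""
    master = random.Random(master_seed)
    return [random.Random(master.getrandbits(64)) for _ in range(n_workers)]

streams = make_streams(42, 4)
draws = [s.random() for s in streams]
print(draws)  # each worker's first draw; reproducible across runs
```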

I found both the “Experiment Execution” and “Result Aggregation” sections lacking. The former focuses on nothing but SIMD lanes, while the latter does not dig into any depth or offer many alternatives for combining results other than “use locks to synchronize threads.”

Friday, October 23, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 4: Programming Techniques

I took the time to read Chapter 3 on the language itself because I am quite interested in learning new languages whenever possible. One “best practice” for self-development that I often read about is to learn one new programming language every year. The goal is not to become a master of the language and use it in production systems, but to become competent with it so you can understand its patterns and paradigms and have those influence your day job. Functional languages have been all the rage lately and have had significant impacts on modern OO languages, e.g. LINQ in C#. An example of applying functional concepts to your OO programming could be to create generic functions that let you program more declaratively and to use immutable value objects.

The one section that I found interesting outside the syntax of the language was the treatment on page 65 of tail recursion. Although I am familiar with the concept, it was an interesting refresher on how it works technically with return pointers & how you can convert to tail recursion by introducing an auxiliary function.
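A quick sketch of that conversion (Python doesn’t actually eliminate tail calls, but the transformation via an auxiliary function with an accumulator is the same one the thesis shows for Erlang):

```python
# Converting a recursive function to tail-recursive form by introducing
# an auxiliary function that carries the partial result.

def length(xs):
    if not xs:
        return 0
    return 1 + length(xs[1:])        # not a tail call: the +1 happens after

def length_tail(xs):
    def aux(rest, acc):              # auxiliary function with accumulator
        if not rest:
            return acc
        return aux(rest[1:], acc + 1)  # tail call: nothing left to do after
    return aux(xs, 0)

print(length_tail([1, 2, 3, 4]))  # 4
```

In a language with tail-call elimination, the second version runs in constant stack space because the recursive call is the last thing the function does.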



Abstracting Concurrency

The author advocates putting all concurrency-related code into generic reusable modules written by expert programmers and putting all sequential code into plug-ins that anyone can write. Obviously concurrent programming is harder, even in a language that natively supports it like Erlang, than writing sequential code due to the complexities of deadlocks, starvation, interprocess communication, etc. He also advocates creating a plug-in architecture because he claims that you cannot write the concurrent code in a side-effect free manner, whereas you can easily do this with sequential code. To answer Ben’s question, I do believe this statement to be true. The language does not allow side effects in terms of mutating state, so the author is not referring to that. He is referring to the fact that the functions in the concurrent modules need to affect the environment by spawning & exiting processes.

I am not sure why the “find” method in the VSHLR responds with an error when it cannot find the person. I thought that each process was supposed to complete successfully or exit. Granted, this is a central server and we do not want it to die, but although it makes logical sense, it seems to go against the whole paradigm that the author has laid out thus far in the paper. (After reading the rest of the chapter & re-reading this section, the server should not need to exit here. The point of the “fail fast” and “let another process handle the problem” philosophies here means that the server should simply return an error and let the client figure out what to do rather than take any corrective action.)

I like how the author was able to “factor out” concurrency in the example of the VSHLR server. I am wondering if this pattern is always so easy to apply, though.
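Here’s a rough Python analogy of that factoring, with a thread and queue standing in for an Erlang process and mailbox (names and message shapes are my own, not Armstrong’s): all the concurrency lives in the reusable loop, while the behaviour is a plugged-in, purely sequential function of (request, state).

```python
import queue
import threading

def server_loop(handler, state, mailbox):
    """Reusable concurrent part: mailbox loop plus reply plumbing."""
    while True:
        msg = mailbox.get()
        if msg is None:                      # shutdown sentinel
            return
        request, reply_box = msg
        reply, state = handler(request, state)
        reply_box.put(reply)

def rpc(mailbox, request):
    """Client helper: send a request and wait for the reply."""
    reply_box = queue.Queue()
    mailbox.put((request, reply_box))
    return reply_box.get()

# Sequential plug-in: a tiny key-value store (VSHLR-style logic),
# written with no knowledge of threads or queues.
def kv_handler(request, state):
    op, key, *rest = request
    if op == "store":
        return "ok", {**state, key: rest[0]}
    return state.get(key, ("error", key)), state

mailbox = queue.Queue()
threading.Thread(target=server_loop, args=(kv_handler, {}, mailbox),
                 daemon=True).start()
print(rpc(mailbox, ("store", "joe", "sics")))  # ok
print(rpc(mailbox, ("find", "joe")))           # sics
mailbox.put(None)
```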

The idea of having supervisor processes monitor the health of the nodes in the system so a client does not have to worry about its server dying seems like it would work as long as there was a supervisor for the supervisor and a supervisor for that supervisor and so on. So far, the application seems durable, but I think that this same level of durability (except in the face of hardware failures) could be easily replicated in Java or C# without having separate JVMs running in separate processes, as the author said was necessary earlier. I have a “catch (Exception ex) {/*show error message*/}” at the top level in most of my apps that works just fine for errors that are not due to hardware failures. There are obviously many other benefits to Erlang that would be missed, but the durability argument is not shaping up to be quite what I thought it would be.
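As a rough in-process sketch of the supervision idea (a real Erlang supervisor watches truly isolated processes; this only illustrates the shape — the worker contains no error handling of its own, and a separate piece of code observes failures and restarts it):

```python
def supervise(worker, max_restarts=3):
    """Run worker to completion, restarting it on failure up to
    max_restarts times. The worker itself stays free of try/except."""
    for attempt in range(max_restarts + 1):
        try:
            return worker()                # run the worker to completion
        except Exception as exc:           # worker "died"; log and restart
            print(f"worker failed ({exc!r}), restarting ({attempt + 1})")
    raise RuntimeError("restart limit reached")

# A worker that fails twice before succeeding.
attempts = {"n": 0}
def flaky_worker():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("transient failure")
    return "result"

print(supervise(flaky_worker))  # result, after two restarts
```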

The whole idea of swapping code out on the fly is much simpler here than I imagined it to be. Here the server is just sent a new function/closure/lambda to execute instead of the one that it was using before. I know there is support for a much more extensive upgrade given what was mentioned in Chapter 3, and I’m interested to see how that works.
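A toy Python analogy of that swap: the server’s behaviour is just a value in its loop state, and a special message replaces it while the server keeps running (the message tags here are mine):

```python
import queue
import threading

def swappable_server(initial_fn, mailbox, results):
    fn = initial_fn
    while True:
        msg = mailbox.get()
        if msg is None:                   # shutdown sentinel
            return
        tag, payload = msg
        if tag == "swap":
            fn = payload                  # start using the new code
        else:
            results.put(fn(payload))      # normal request

mailbox, results = queue.Queue(), queue.Queue()
threading.Thread(target=swappable_server,
                 args=(lambda x: x + 1, mailbox, results),
                 daemon=True).start()
mailbox.put(("call", 10))
print(results.get())                      # 11 (old behaviour)
mailbox.put(("swap", lambda x: x * 2))    # upgrade in place
mailbox.put(("call", 10))
print(results.get())                      # 20 (new behaviour)
mailbox.put(None)
```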



Error Handling Philosophy

Even after the author lists out four reasons why having another process handle the errors or the faulty process, I still do not see why it is much more advantageous than having the faulty process handle its own errors unless the errors are hardware related and/or unrecoverable and the faulty process dies. In most systems that I build, we rely on failover systems to handle hardware failures, but I imagine most of the systems that go to the lengths of using Erlang want a more bulletproof Plan B than that. None of his arguments related to code clutter or code distribution sell me at all, though (I really want to be sold on the idea & think outside my Java/.NET box, but the author isn’t a convincing salesman thus far or the pitch is going right over my head).

Later…

I guess that after reading a bit farther, I should recant what I said above. Having the error handling completely removed from the code the worker processes will be executing, even if it is not generic, seems much nicer than having to rely on “catch (Exception ex)” blocks sprinkled throughout the code. If the error handling code in the supervisors can be made generic enough to work with multiple different types of worker processes, all the better.

Monday, October 19, 2009

OPL: Event-based Implicit invocation & Map-reduce

Event-based, Implicit invocation

I’m not seeing a difference between this pattern and the observer pattern at the OOD level. This pattern is more general and can be applied to communication between systems, e.g. an SOA where the pricing module fires an event to indicate a sale but is not concerned with which other modules (billing, CRM) are interested. Now that I’m re-reading Richard’s questions, that last example fits in with the pub-sub architectural model. I’m not sure that I see a difference between these patterns other than levels of abstraction. I think that the different examples at the end of the “Solutions” section do a good job of showing other levels of abstraction where the pattern works.

I’m not sure why you would prefer a synchronous dispatch over an asynchronous one. I’m sure there is a valid reason, but other than hardware limitations, I can’t think of one. Since the announcer shouldn’t know anything about its subscribers, why should it have to wait for them? It’s typically not expecting to get anything back. If a subscriber does want to respond to a message, it won’t be via a return value or anything like that; it will respond by raising an event of its own.

As far as error handling goes, the announcer should certainly not be notified of errors with the subscribers. The entire point of this pattern is to decouple the two types of entities. Since the announcer won’t know anything about the subscribers, it will likely have no idea how to handle any error that it receives from them.



Map-reduce

This pattern has been all the rage due to the fact that Google uses it. One example that they give, which I find much clearer than any of the examples in the reading, is counting the number of times a string appears in a document. If there are w workers and p pages in the document, each worker can take p/w pages and find the number of occurrences of that string. Then, since the reduce method is a simple summation, which is a distributive function (the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning), even the reduce part of the algorithm can be done in parallel.
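That counting example can be sketched directly — the chunking into p/w pages and the distributive sum are the whole trick:

```python
from functools import reduce

def count_in_pages(pages, target):
    """One worker's map task: count occurrences of target in its pages."""
    return sum(page.count(target) for page in pages)

def chunk(pages, w):
    """Split p pages into w roughly equal chunks, one per worker."""
    size = (len(pages) + w - 1) // w
    return [pages[i:i + size] for i in range(0, len(pages), size)]

pages = ["the cat sat", "the dog", "cat and cat", "dog the end"]
partials = [count_in_pages(c, "cat") for c in chunk(pages, 2)]  # map phase
total = reduce(lambda a, b: a + b, partials, 0)                 # reduce phase
print(total)  # 3
```

Because addition is associative, the reduce can itself be done as a parallel tree of pairwise sums rather than a single sequential pass.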

I think that the map portion of the pattern is simple and obvious (divide and conquer your independent tasks), but I think there is a significant amount of complexity in reducing in a non-naïve manner that was glossed over. Maybe it’s simpler than I think, though, because I haven’t used the pattern before, or maybe doing it naïvely is just fine.

This pattern reminds me a lot of the fork-join pattern, and I’m really thinking hard about what the differences are. The work stealing method presented in the fork-join paper seems like a very novel idea, but that is just an implementation detail that I imagine map-reduce could also use.

I don’t know if I like the idea of the framework handling errors. I usually prefer to have the system let the programmer specify exactly how errors should be handled so none get swallowed inadvertently. I think that if the framework attempts to handle the errors itself, it is violating the “fail fast” principle discussed in the other reading for this class. The framework can keep the computation going using data that is potentially incorrect, and one error can lead to drastically compounded problems in the output.

Sunday, October 18, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 2: The Architectural Model

I’m very excited for this paper. I’ve heard a lot of buzz about how Erlang facilitates the building of highly reliable systems. I was a little disappointed when the introduction said “The thesis does not cover in detail many of the algorithms used as building blocks for constructing fault-tolerant systems—it is not the algorithms themselves which are the concern of this thesis, but rather the programming language in which such algorithms are expressed.” This gives me the impression that the paper will focus on how Erlang helps you develop reliable systems but leave out lessons that I can apply in other languages. We’ll see how things go…

I’m in complete agreement with the definition of architecture presented. The author says that there are myriad such definitions, but this one contains all the elements that I find noteworthy. It’s by Grady Booch, though, and he usually has some good ideas </understatement> .

I think that a philosophy or set of guiding principles is a commonly overlooked portion of a software architecture, and I’m glad to see it called out here. As cheesy as this and related ideas like a system metaphor from Extreme Programming may sound, I have found that teams bigger than four or five people really do benefit from having a foundation that architects and developers can refer back to throughout the project’s lifecycle in an attempt to keep a uniform feel. A good example from a developer’s perspective, which bleeds over into a “set of construction guidelines” that still touches on some architectural decisions, can be found here: http://www.cauldwell.net/patrick/blog/ThisIBelieveTheDeveloperEdition.aspx.

This sounds like an extremely interesting challenge. Each of the ten requirements would make for a challenging system, but mixing high concurrency, high reliability, in-place upgrades, and complex features sounds like quite the daunting task.

I’ve heard that Erlang isolates very small pieces of code in independent processes to restrict how far errors can propagate & that each process/module should always assume that any other module it talks to will fail. However, truly using a full process at the operating-system level for this sounds like a very heavyweight solution due to the overhead of having so many processes and the cost of interprocess versus intraprocess communication. It’s good to see that the processes used by Erlang in this sense “are part of the programming language and are not provided by the host operating system.” It’s interesting that towards the end of the chapter the author presents the idea that the isolation from errors that lightweight, hardware-independent Erlang processes provide can only be matched in Java by running applications in different JVMs in different operating-system processes, which is obviously a tad more resource intensive.

Concurrency-oriented programming doesn’t feel like it has too many differences from OOP to me. It just sounds like OOP in a COPL. Real-world activities are mapped to concurrent processes in much the same way that real-world objects and actions are mapped to classes and methods. Nothing about the characteristics of a COPL seems new, but it is a novel idea to have all the functionality listed (support for processes, process isolation, PIDs, communication only via unreliable message passing, failure detection) implemented completely within constructs of the programming language that can be completely independent of the operating system. It does seem like it could be a problem to have to copy data between processes, though. I can’t imagine that processes could always be organized so that large chunks of data never need to be passed between them.

I’m never a fan of security through obscurity, so I was put off when the author mentioned that not knowing the PID of a process means you can’t interact with it and thus can’t compromise it in any way. I’m sure that any decent architect would put more layers of security in place, but I dislike security through obscurity so much that I wouldn’t even bother bringing it up.

Message passing seems to be the best way to build reliable systems at any level of abstraction from my recent readings. At the enterprise architecture scale, event driven architectures and message passing systems are at the crux of many of the most durable sites as seen on High Scalability, e.g. http://highscalability.com/amazon-architecture. At the complete opposite end of the spectrum from systems integration is Erlang, which builds these exact same concepts, e.g. asynchronous message passing, into the programming language itself. Since I hear good things about reliable systems being coded in Erlang and reliable enterprise architectures building SOA stacks on EDAs and message passing systems, I’m guessing there’s something to this pattern.

The idea of failing fast as soon as you detect an unrecoverable error seems like common sense to me. Maybe I’m being naïve or not entirely comprehending the details of the idea, but I don’t understand why anyone would have their system continue after encountering an error which they can’t handle. I’ve come across this idea many times in the past but still have yet to feel like I’m not missing something.

Saturday, October 17, 2009

OPL Patterns: Pipes & Filters, Layered Architectures, and Iterative Refinement

Pipes & filters

The version of the pipes-and-filters pattern for this assignment was much shorter than the last reading because there was no detailed example and no extra fluff. The original had an overly complex example that I found useless. However, some omissions jumped out: differentiating push vs. pull filters and addressing the problems with error handling and global state. Also, the original made four points on why the gains from parallel processing are often an illusion. The new one didn’t raise or counter any of those concerns.

The “Forces” sections contained some practical considerations that I don’t believe were mentioned in the original. It was nice to have those considerations called out in a short, independent section rather than buried in dozens of pages of text as in the POSA reading, in my opinion. Overall, I found all three patterns more concise because they dropped some verbiage and context, but I feel they retained 90% of the useful information in 10% of the space.

Layered architectures

I think there’s too much emphasis on how layers hurt performance. While true, it doesn’t matter for most applications, which don’t involve systems programming or intense computations; in the latter type of system, the critical paths can have their layers removed while the rest remains layered to maintain modularity.

I liked the point about keeping the API smaller on lower layers to decrease the work necessary to change the underlying implementation.

Iterative Refinement

It sounds like this is saying that you should find high-level steps that need to be done sequentially, and then inside each step, find tasks that can be performed in parallel. It sounds similar to an article that I just read in the November/December issue of IEEE Software: “Parallelizing Bzip2: A Case Study in Multicore Software Engineering.” Four pairs of students were tasked with taking Bzip2 and parallelizing it. Most teams found that simple, low-level optimizations such as parallelizing loops were insufficient for achieving substantial gains. The successful teams found that the best approach was to refactor the code into high-level modules that were independent but needed to be executed in sequence. They then optimized the code within each module (if I’m remembering the article correctly), which sounds just like this pattern.

Thursday, October 15, 2009

Chess

I was instantly interested in this paper as soon as I read the title. Though my experience with concurrent programming is limited, many of the bugs I encountered during that time were “Heisenbugs.” However, the last time I read about an interesting testing tool that was based on a research paper and that Microsoft had created an internal version of, Randoop, I was disappointed to find that the .NET code was not available.

Chess focuses on using only a few preemptions in a sequence but interleaving those preemptions in as many different places in the sequence as possible. The tool also allows you to specify which areas to test more and which areas to run atomically.

I find it nice that the tool runs deterministically & that you can instantly get the last thread interleaving, though those are obvious features without which the tool wouldn’t be very useful. It’s also quite nice that the tool doesn’t need a custom VM to be able to intercept calls.

Good old Lamport timestamps back from my distributed systems class. They sound like they give a good amount of flexibility to the scheduling algorithm by only requiring a partial ordering. It just has to focus on interleaving a few key events at the right times with respect to each other but not worry about absolute times.

Threadpool work items, async callbacks, timer callbacks

Chess just needs to know if a synchronization operation “isWrite,” writing to shared state, or “isRelease,” releasing other tasks that are blocked on a synchronization mechanism. “isWrite” causes edges to be created in the happens-before graph backwards to all previous nodes and forwards to all subsequent nodes. If the API that Chess has to work with can’t be well understood, program correctness will be maintained by setting both to true; only performance and efficiency will be affected.

Chess avoids blocking calls by wrapping them & using the “try” version instead, which does not block. The tool can then move that task off into the “disabled” queue until a release event occurs on that synchronization resource. This prevents Chess from being blocked.
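A minimal sketch of that wrap-with-try idea using a plain Python lock (illustrative only — CHESS’s real wrappers live inside its scheduler, and the names here are mine):

```python
import threading

# Instead of letting a task block on acquire(), the scheduler uses the
# non-blocking "try" variant; if the resource is busy, the task is parked
# in a "disabled" queue until a release event occurs on that resource.

lock = threading.Lock()
disabled = []

def try_schedule(task_id):
    if lock.acquire(blocking=False):     # the "try" version: never blocks
        try:
            return f"{task_id} ran"
        finally:
            lock.release()
    disabled.append(task_id)             # parked until a release occurs
    return f"{task_id} disabled"

print(try_schedule("t1"))                # t1 ran
lock.acquire()                           # simulate another holder
print(try_schedule("t2"))                # t2 disabled
lock.release()                           # release event: t2 becomes runnable
print(disabled)                          # ['t2']
```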

One problem with Chess is that it can’t guarantee the “real-time” semantics of timers, but few programs’ correctness or control flow rely on threads being scheduled and executed in real time.

Data races for non-synchronized shared memory must be put into the “happens-before” graph, too. Chess enforces that the threads run in a single-threaded fashion so that Chess can deterministically execute the program and capture replay info.

With n threads each performing k atomic operations, the number of possible thread interleavings explodes exponentially. Chess tames this explosion of the search space by keeping c, the number of preemptions per run, low (usually around two) and by keeping k low by ignoring system functions and base libraries. Chess has found great success keeping c low but exhaustively examining all possible interleavings given that small value of c.

When running Chess in a test harness, things implemented by devs to randomize the program, like a large number of threads and sleep calls, need to be undone to allow Chess to thoroughly and deterministically search the interleaving space.

Wednesday, October 14, 2009

Reentrancer

Reentrancer automatically converts shared, global state into thread-local state. While the refactorings that the paper presents will indeed make the program safer to run concurrently, it may not always be easy to convert shared state to thread-local state. Often the shared state is the result of lazy programming and can be refactored away with a bit of determination. However, there are many times where tasks cannot be run completely in isolation on their own threads and do truly need global state as a coordination mechanism. Given that, I’m not sure how useful I really think Reentrancer is.

The refactoring takes place in three parts: surrounding static fields with accessors, converting static initializers to lazy initializers, and finally making the data thread-local. The authors did point out that this process can lead to incorrect behavior. It doesn’t seem like it would be worth the risk of making your application both incorrect and slower (according to the authors’ own benchmarks). Not being able to preserve behavior in the face of exceptions could cause an enormous amount of rework to be necessary. Database and file system access were also mentioned as blockers to this refactoring, but I find those to be less problematic because in the applications that I work with, those external accesses are normally separated from the key processing of business logic.
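The three steps can be sketched in Python, with `threading.local` standing in for Reentrancer’s Java rewrites (this is my illustration, not the paper’s code):

```python
import threading

# Step 1: hide the former static field behind an accessor.
# Step 2: initialize it lazily instead of at load time.
# Step 3: back the accessor with thread-local storage so each thread
#         gets its own copy of the formerly shared state.

_local = threading.local()

def get_counter():
    if not hasattr(_local, "counter"):   # lazy init, once per thread
        _local.counter = 0               # was: a shared global/static field
    return _local.counter

def bump_counter():
    get_counter()
    _local.counter += 1
    return _local.counter

results = {}
def worker(name, bumps):
    for _ in range(bumps):
        results[name] = bump_counter()   # each thread counts independently

t1 = threading.Thread(target=worker, args=("a", 3))
t2 = threading.Thread(target=worker, args=("b", 5))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # a -> 3, b -> 5: no cross-thread interference
```

Of course, this only helps when the tasks genuinely don’t need the state as a coordination mechanism, which was exactly my reservation above.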

I found that this paper had overlap with the previous two on automated refactoring tools, but I found it less readable and less useful in practice.