Friday, November 13, 2009

OPL: Shared Queue, Speculation, & Digital Circuits

Shared Queue:

What experience have you had with shared queues, and what factors did you use to decide whether to use single vs. multiple queues?

+I do not have any experience with these. I have not had to do concurrent programming yet.

An enqueue or dequeue seems like a very small operation. Why is there so much concern over contention to the queue? What factors cause this?

+Although the act of enqueuing or dequeuing may seem quick or simple, there is obviously a drastically negative impact of having contention in these situations that could easily cause the queue to behave unexpectedly or enter an inconsistent state if locking is not done properly. You could also easily end up with a single queue being a severe bottleneck if the tasks are small, there are many workers, or both.

Seeing as how the authors lay out quite a bit of code to support the variations of these patterns, they seem to mitigate their own concern as to the value of "simple protocols.” Why would someone opt for the less efficient implementation, given that the concerns are already outlined, and sample code available?

+I cannot think of a good reason to use the simple protocols other than the fact that it may be easier for n00bs to debug.


Speculation seems like a straightforward pattern, though potentially a quite powerful one. What factors would you consider to decide if speculation is an appropriate/beneficial pattern to apply?

+The pattern boils down to optimistic parallelization. This reminds me of my computer arch class, 433, and our examination of branch prediction.

+You break sequential dependencies assuming that the majority of the time that breakage will not affect the correctness of the problem. In the cases where this does cause a problem, which should be a small percentage if you are applying this pattern correctly, the cost of going back and fixing those problems should be outweighed by the benefits gained from the optimistic parallelization.

+Regarding what factors need to be considered, I would like the ability either to know the statistical distributions of the inputs or to prototype a system in a reasonable amount of time to check the validity of apply this pattern to the given data or domain.

+I think the HTML lexer was extremely helpful edifying this pattern for me. While the abstract discussion all makes sense and seems well written, I just could not figure out what the tasks and dependencies could look like. I wish there were a few more examples that are concrete.

Seeing as how you may not have data about the statistical distribution of inputs to expect before you build the program, and therefore perhaps not know the average cost of rollback, etc., is it prudent to implement the speculation pattern at the outset, or wait to evolve the application to include speculative optimizations down the road? What if the performance increase that speculation has the potential to generate is necessary to meet the minimum requirements of the specification? Is speculation still a viable pattern to apply, or would effort be better spent on other parallel optimizations with more certain outcome?

+I do not think that it is prudent to implement this pattern if you have no concept of whether it will be successful or not. The same applies to almost any architectural or design decisions that you would make.


Was this pattern what you expected from the name?

+Nope. I am not sure how the name fits the pattern that well. Sure, we are operating on the bits as circuits in a CPU would, but I feel that I had to think about how the name & pattern related rather than it being something more obvious. Then again, it does attempt to create an analogy with something that we all know.

It seems like a great deal of this pattern is described through examples of clever transformations for specific problems. Do these examples give you enough information to go about applying this pattern to a problem dissimilar to the given examples? Can you think of a way of describing these transformations at a more abstract yet actionable level than the examples do?

+I could theoretically think of creative ways to map other problems onto this pattern, but this pattern seems like such an extreme way to implement a pattern, that I imagine it would be a last resort that would only be considered if all other parallelization attempts were inadequate or unsuccessful. Though I agree that using 32 or 64 bits to represent what only needs 1, I feel that you would really have to be trying to wring every idle cycle out of your systems for this pattern to be something that you would turn to.

OPL: Branch-and-Bound & Graphical Models

Branch-and-Bound computational patterns generally involve finding an optimal solution to a problem by iteratively searching through a large space of all possible solutions. Typically, to perform such a search, no known polynomial time algorithm is known. The Branch-and-Bound pattern comprises of four steps: Branching, Evaluation, Bounding, Pruning.

1. Examples: The authors discuss 3-SAT and Traveling Salesman Problem as examples. Yet, a more fundamental example to grasp is the knapsack problem. That said, what might be some other simpler examples of branch-and-bound that you are familiar with?

+One example that I will steal from Wikipedia is a chess program. The incredibly large search space of possible moves 1 to n moves ahead explodes at an exponential rate, and being able to prune the search space efficiently is key for the system to be able to identify the optimal or almost optimal move.

2. Trade-offs: How does the nature of branch-and-bound pattern make it especially hard to deal with incurring low communication overhead while maximizing concurrent computation?

+When branch finds an optimal solution or constraints to apply to boundaries, that must be communicated to the other branches in order to aid their pruning. This could potentially mean that every branch needs to communicate with every other branch, which in turn would lead to a significant amount of chatter. There is a direct relationship between the amount of communication and the pruning efficiency.

3. Solutions and Parallelization:
a. The authors discuss three possible techniques for parallelization: operation-based, structure-based, and construction-based. What makes structure-based parallelization preferable? Are there any scenarios where you would use the operation-based or construction-based technique instead?

+I think this question should be answered by the authors of the paper. I do not feel that they are giving the topic a thorough analysis if they simply skip operation- and construction-based patterns in favor of structure-based ones without much justification.

b. For the structure-based technique, the authors describe four different methods for implementation: Synchronous Single Pool, Asynchronous Single Pool, Synchronous Multiple Pool, and Asynchronous Multiple Pool. Which method do you think would be hardest to debug for correctness? Which method do you think would be hardest to optimize for performance and scalability?

+Correctness: AMP. This complicates the scenario both by distributing data and by having updates happen with no definite period.

+Optimize: ASP. This may be hard to optimize for efficiency because the shared task queue needs synchronized access. It is also complicated by the fact that each processor can request data at any time. If all processors were updated at once, synchronization on the queue would likely be easier because just one process could handle the global lock & coordination.

c. The authors discuss two categories of task assignment: static assignment and dynamic assignment. In real-world applications, how important do you think the choice of the task assignment method is to performance?

+I think it depends on the problem domain. Static assignment can likely be more efficient if tasks size can be known with a reasonable degree of accuracy a priori. That way, the overhead of load balancing does not need to be incurred. It thus follows that dynamic assignment is beneficial in the opposite case. Given that, I think that choosing static assignment in the wrong situations can severely hurt performance by causing many processors to sit idle, but I do not imagine that choosing dynamic assignment in the wrong cases would lead to much of a loss in efficiency.

4. Pattern Comparison: How might fork/join or task queue pattern relate or fit with the branch-and-bound pattern?

+Task queue is an easy one since it is explicitly called out in the paper as being useful as the central data store in the SSP & ASP styles of structure-based parallelism. As for fork-join, each branch can be a fork until you get down to a low threshold, at which point it is best to compute the result sequentially.

Graphical Models are often times used in probability theory to represent uncertainty of a collection of (possibly independent) events. Given the probability of occurrence of some events in the network, one can use the model to infer probabilities of other events within the network.

1. Examples: Did the authors' examples (e.g. hidden Markov Models, Bayesian Netoworks) help illuminate what a graphical model is? Can you think of other simple example problems or scenarios that you could model as a network/chain of probabilistic events?

+The examples made sense only because of my prior exposure to HMMs and Bayesian networks in CS 512. One example of how they could be used is determining how likely you are to see certain sequences of genes or proteins. If the authors wish to talk about HMMs and Bayesian networks, they should really give examples. Although many readers will be familiar with basic graph theory, these concepts go beyond that.

2. Solutions and Parallelization:
a. The authors make a distinction between exact and approximate estimation of probability. In what example problems do you think approximate estimation would work best? In what examples exact estimation?

+It depends on the definition of "work best." Exact algorithms would always be ideal, but a faster approximate algorithm can typically be found & be beneficial from a computational resources perspective. The example above of finding or predicting certain patterns in a genetic sequence is a good candidate for using an approximate algorithm over an exact one due to the immense number of nucleotides that need to be processed.

b. The authors mention the Junction Tree Algorithm, indicating that searching for the correct factorizations can be done in parallel. What other characteristics of graphical models do you think could benefit from parallelism?

+Not sure. The paper is a bit too short for me to really wrap my head around parallelization opportunities, and the mention of the JTA does not help since I am unfamiliar with the pattern.

3. Pattern Comparison: What might this pattern have in common with the Branch-and-Bound pattern? Do you think it has similarity to the Graph Algorithms patterns?

Not sure.

Tuesday, November 10, 2009

PRES: Probabilistic Replay with Execution Sketching on Multiprocessors

- What do you expect from a debugging tool?
+stepping through the code line by line examining which branches are taken
+examining the state of variables
+examining the call stack

- What debugging tools do you usually use? What are good and bad about them?
+Visual Studio
+Covers all of the above. Has the ability to move backwards in a function; if you set a breakpoint & need to inspect what happened two lines earlier, you can drag it back and reply those lines. Conditional breakpoints. Can execute commands against the system's current state. Can modify the system's state & observer what impact that has.
+Doesn't have good support for "edit and continue," where you can change the inner workings of a method without needing to recompile & relaunch the application

- What are the challenges of debugging a sequential program? What are additional challenges of debugging a parallel program?
Debugging parallel programs can be extremely difficult due to heisenbugs that are timing related. These bugs happen only with certain interleavenings of the code, which are typically difficult to reproduce. The paper we read earlier in the semester on Chess from Microsoft Research is a good treatment of this problem and a possible solution.

- How important is replay for debugging?
Replay is essential for effective debugging. If the error cannot be consistently reproduced, it is often very difficult to diagnose and cure.

- If you need to design a deterministic replay system, how are you going to do it for sequential programs? Does it work for parallel programs? If now, how to make it work?
- What are the additional challenges to make your tool work on multi-core or multi-processor systems?
These questions seems like they would need a ridiculously long answer. The sequential problem is not too difficult, and you can refer to the Chess paper from earlier in the semester for the parallel case.

- Given a replay system, what else should we do to make it really help debugging?
Solve the halting problem.

- How would virtual machine technologies help?
Virtual machines could help by being able to completely capture every aspect of the system's state, providing you with a snapshot of the system's state that includes active processes, RAM utilization, etc. The environment can often have an impact on a system that behaves as expected internally, and those factors are often difficult to capture & especially difficult to repeat.

- What are the state of arts?
Not sure what the question is...

What the paper mentions in section 1.2 on "State of the Art" is based around systems to record executions. This is different from Chess, which did not attempt to record bugs but instead tried to find a way to efficiently and exhaustively examine different thread interleavenings. This seems to fall into the authors' bucket of existing software practices that repeatedly replay an execution to search for bugs caused by interleavenings. Granted, Chess should be drastically more efficient than the common naïve practices, but it is still in that category. PRES attempts to record enough information that a small amount of replays are needed to reproduce the bug, but it limits the amount of observations that are recorded in order to minimize the overhead of running a production system with PRES active.

PRES has three key components:
-recording only a subset of events. Although there may not be enough information to replay every intermediate event exactly as it occurred, enough information is recorded to reproduce the bug. Once PRES has reproduced the bug once, it can then consistently reproduce it 100% of the time.
-a replay system that reproduces unrecorded actions & events
-using data from unsuccessful replays to help guide subsequent attempts

OPL: Loop Parallelism, Task Queue, & Graph Partitioning

Loop Parallelism
1. The author says the goal of incremental parallelization is to "evolve" a sequential program into a parallel program. If you have experience in refactoring code from sequential to parallel, how did your approach differ from the steps presented in the paper?

I do not have experience doing this, but I do not imagine that you would have great results. I believe we talked about how evolutionary parallelism can achieve minor performance gains, but the architecture really needs to support the parallelism using the appropriate patterns for the problem domain for the speedups to be significant. Amdahl's Law says that we are limited by the amount of code that must be done sequentially, and I imagine refactoring loops will not parallelize a large percentage of a typical codebase. Though I am not a fan of premature optimization, there is a threshold above which you can be certain that an area of the code will be significantly computationally expensive relative to the rest of the code & look for parallelization opportunities there.

2. Many of us have used Eclipse or another IDE, with their respective plugins, for execution profiling. What has been your experience with these tools? What is your preferred method of locating the code in question (automatic or manual)?

I have had very good success with profiling tools in the .NET world. They have helped easily identify down to the line where performance bottlenecks are.

3. The author states, "The application of many transformations to many loops may be necessary before efficient performance is achieved." What is your opinion on this statement?

That is true. See my response to 1.

Task Queue Implementation Pattern
1. How would you compare the Task Queue pattern to other similar distribution patterns like divide-and-conquer and fork/join pattern?

I think the task queue is by definition a divide-and-conquer algorithm. Otherwise, there would be no need for a queue because there would be only one, large task. Perhaps James meant recursive splitting & fork-join.

The distinction with recursive splitting is that here the work items in the queue are broken down into their smallest-level pieces, which are then grabbed by threads that need more work. The problem that recursive splitting has that the task queue does not is that some effort should be spent to make sure that the splitting points for dividing the work into recursive tasks should split the work-load evenly. Otherwise, it can be quite easy for the load to become unbalanced, e.g. a worst-case quicksort that takes O(N^2) instead of the average O(N log N).

This pattern interests me because I have read several blog posts recently talking about it in different contexts. The typical example for explaining its optimality in queuing theory is people queuing at a grocery store. If customers divide themselves and form N queues for N cashiers, customers can become blocked for an inordinately long time behind a difficult/slow customer. In addition, adding and removing cashiers causes a significant amount of churn in the customers. By contrast, if all customers form a single queue and the customer at the head goes to the first available cashier, no customer is unfairly stuck behind another difficult/slow one, and cashiers can be added and removed depending on the queue length without affecting the waiting customers. The former is the distributed queue version and the later is the centralized queue version of this pattern.

Perhaps my memory is failing me a bit, but fork-join seems like a specific implementation of the decentralized version task queue pattern. In fact, the distributed task-stealing pattern that was first presented in the fork-join paper is explicitly mentioned here.

2. It was nice to see a pattern with some actual code examples. How important were the examples in understanding the concepts presented in this paper?

I did not find the code samples useful. I thought that the concept was explained quite well in the first few sections of the paper; better than most of the others, in fact.

Graph Partitioning Implementation Strategy Pattern
1. This was probably one of the more math-oriented patterns I have seen and/or read. How often have you experienced a situation where the graph partition strategy would have been useful? If never, can you think of any future situations where this pattern might be come in handy and could be applied?

I felt that the majority of the coverage was dedicated to the nitty-gritty details of several graph-partitioning algorithms. I think more high-level analysis would have been beneficial. That being said, I do think that this is a useful pattern to have in your toolbox because of the large number of problems that could be mapped to a set of nodes and weighted, optionally directed edges. Examples could be physical distance between points, latency between nodes on a network, or monetary costs to ship a package between two cities.

Friday, November 6, 2009

OPL: Geometric Decomposition, Pipeline, & Data Parallelism

Geometric Decomposition:

In geometric decomposition, they mention recursive data structures as good candidates for recursive algorithms and then mention arrays as good candidates for geometric decomposition. Arrays are recursive data structures, though, and have been used as examples in almost every one of the recursive algorithm patters, e.g. quicksort. I think the authors need to give more context to when arrays are good candidates for being geometrically decomposed as opposed to recursively processed...which they seem to do in the next paragraph. According to the authors, recursive algorithms are fine when the range of data that you're dealing with doesn't need to know about any of the other data in the structure. When there is some locality, though, then geometric decomposition is the way to go.

It's interesting to note in the multidimensional array examples that the surface area resulting from the data decomposition is important to consider. Since each block of data may need to communicate with adjacent blocks, the more surface area there is, the more communication is necessary. A similar idea can likely be extended to most data structures. In graphs, for example, you can consider the number of outgoing edges or the cut size from a section of the graph. There is a graph partitioning paper coming up, so hopefully that's yet another layer deep and not something parallel to this. Otherwise, I think I'm missing the point of this paper.

Another novel idea is to duplicate the neighbors of cells on the boundary of a chunk. In certain scenarios, the overhead incurred by duplicating data can make up for the fact that chunks should not need to communicate with each other. However, this only seems practical to me if the cells on the boundary only need the initial values of their duplicated neighbors. If they need the mostly updated data from their neighbors, then I don't see how this eases the communication burden. Also, if each border chunk has access to all the needed information when the process begins, then I think it is a data parallelism and not geometric decomposition situation.

Load balancing can become a problem if the data is not evenly distributed amongst the chunks of the data structure or if the processing causes the distribution to become uneven. The solution presented here, which makes sense to me, is to create many more chunks than UEs so no UE is likely to sit idle while another has a significant amount of processing to do.


It seems strange that the pipeline paper only differentiates itself (in the authors' words) from the pipes and filters pattern, which we've already read twice, simply by explicitly addressing parallelism. I think to avoid confusion, this should be made as a section of the pipes and filters pattern. Given that, it takes a while before this paper adds anything new to the two that we've already covered

Since each stage can typically only process one piece of data at a time, the more stages there are, the more operations can be performed in parallel. However, each stage comes at the cost of increasing communication overhead and latency. In order to make optimal use of the resources available at each stage, they should all require roughly the same amount of work so quicker stages don't leave resources idle for long while waiting for the longer ones to finish. This later point (and perhaps the former, too) was made in the previous treatments of this topic, though.

Data Parallelism:

Is this even worth discussing? The entire treatment of the topic was only three sentences.

Thursday, November 5, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 6: Building an Application

I am getting the sense that the author loves his hierarchies. When it comes to structuring a system, we throw a few more layers on top of the supervisor-worker trees mentioned in the last chapter. We group systems into applications that are either independent or loosely coupled. Each application then consists of supervisor-worker trees.

I find it very interesting that just by passing a 'global' flag into the server's start function, the server can be transparently accessed from any machine. I wonder what the setup is at the system or application level to make the communication between machines theoretically anywhere in the world transparent. How does the system avoid naming collisions? If you simply pass {global, Name} as an argument to start up a server, I imagine it could be very easy to get collisions. Sure, the process will likely fail fast, but what are good options for recovery? Restarting it will not help because it will have the same name. I imagine that changing the name will break other processes that want to communicate with it...Oh, wait. I think it is the PID that is used for IPC, not the name. What is the point of the name, then?

I am really overwhelmed by the API documentation. Although I can follow it somewhat (Chapter 3 is my only exposure to Erlang), I do not understand what benefit I am supposed to be getting from it. Although I love having concrete examples to back up the abstract concepts, I feel like this is conveying anything interesting, e.g. the structure of an .app file.

I am still somewhat confused on how SSRS is supposed to solve all your reliability problems. If there is a problem with some data or with some data and a certain piece of code, restarting & reprocessing it again will continually cause the same problem. The only solution is to completely ignore that data. I suppose the goal is just to not crash things at the application or system level, but simply throwing your hands up and saying "I don't know what to do" seems like it wouldn't be good enough in many cases.

I've been impressed by how Erlang can abstract out a lot of the tricky bits like managing concurrency & supervision while making it easy for every-day developers to plug into that. I think that is key for building reliable systems. If every-day programmers are building the code that is supposed to ensure reliability & fault tolerance, odds are that they will mess it up. I'm not sure if the examples lend themselves to this extremely well or things are just much easier in Erlang, but separating error handling concerns this easily in OO languages that I use is not quite so easy. It reminds me of AOP, though I have not seen AOP really take off as I was hoping it would, though there are a few good AOP tools out there.

Wednesday, November 4, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 5: Programming Fault-Tolerant Systems

“... [I]f we cannot do what we want to do, then try to do something simpler." I would like to see some examples of this because I cannot think of many ways to implement this in the systems that I have been working on lately. For example, one system is supposed to place orders for financial instruments. I cannot think of any way to break this down into simpler, alternative tasks. If the system cannot thoroughly and completely validate the order, then it cannot place the ord. If the order cannot be persisted to the database, then it cannot be placed. There is nothing simpler to try. Most key business task seem to be all-or-nothing at a cursory glance.

Here is another attempt to differentiate errors, exceptions, failures, faults, bugs, etc. I know there are some official, IEEE-backed definitions for these terms, but the problem I have is usually the terms are too similar and creating the distinctions seem like an academic exercise and the definitions usually do not clearly separate the terms. Here is my stab at interpreting what the author was trying to say: Errors occur at run-time when the system does not know what to do. Dividing by zero is an error to the VM since it cannot perform the computation, but this may or may not be an error to the parent process depending on whether or not the developer accounted for this scenario. If the code corrects the divide by zero error, it ceases to be an error because the system now knows what to do with it. (I had to read ahead a bit to section 5.3 get a decent definition.) Exceptions are a mechanism to communicate about errors between processes. When an error occurs in the VM, it raises an exception to let the calling process know what happened. That process may pass the exception on or create a new one to communicate to its parent that an error has occurred that it cannot correct. If a process receives an exception for which it does not have a catch handler, the process fails (shuts down) and notifies its linked peers why. So in a nutshell, an error is a run-time problem that the system does not know how to handle that is communicated between processes by exceptions that cause failures in processes that do not have catch blocks for them. Now let’s see how those definitions help up build a reliable system...

The author recommends creating supervision trees.
+Supervisors monitor other supervisors and worker nodes.
+Worker nodes execute well-behaved functions that raise exceptions when errors occur.
+Supervisors are responsible for detecting exceptions in child nodes & implementing the SSRS (stop, start, restart specification) for each node that it supervises.
+Supervisors are responsible for stopping all children if their parent stops them & restarting children that fail.
+Supervisors can be branded either AND or OR. If they are of the AND type & one child fails, they stop all other children & then restart them all. These are used for coordinated processes where a process cannot continue if any of its siblings fail. If they are of the OR type & one child fails, they restart that node and leave the rest of the children alone.

"The above discussion is all rather vague, since we have never said what an error is, nor have we said how we can in practice distinguish between a correctable and an uncorrectable error." The author absolutely read my mind! Time for some concrete examples...

Rules for well-behaved functions:
+"The program should be isomorphic to the specification": everything that is in the specification must be in the function and everything that is in the function must be in the specification. Nothing from the specification should be omitted in the function, and nothing should be added to the function that is not in the specification.
+"If the specification doesn’t say what to do, raise an exception": Don't guess; fail fast.
+Be sure the exceptions contain enough useful information so that you can isolate and fix what caused it.
+"Turn non-functional requirements into assertions (invariants) that can be checked at run-time": Not sure how I would do this in general. The author's example is timing function calls to make sure none go into uncontrolled infinite loops.