Friday, November 13, 2009

OPL: Shared Queue, Speculation, & Digital Circuits

Shared Queue:

What experience have you had with shared queues, and what factors did you use to decide whether to use single vs. multiple queues?

+I do not have any experience with these. I have not had to do concurrent programming yet.

An enqueue or dequeue seems like a very small operation. Why is there so much concern over contention to the queue? What factors cause this?

+Although the act of enqueuing or dequeuing may seem quick or simple, there is obviously a drastically negative impact of having contention in these situations that could easily cause the queue to behave unexpectedly or enter an inconsistent state if locking is not done properly. You could also easily end up with a single queue being a severe bottleneck if the tasks are small, there are many workers, or both.

Seeing as how the authors lay out quite a bit of code to support the variations of these patterns, they seem to mitigate their own concern as to the value of "simple protocols.” Why would someone opt for the less efficient implementation, given that the concerns are already outlined, and sample code available?

+I cannot think of a good reason to use the simple protocols other than the fact that it may be easier for n00bs to debug.


Speculation seems like a straightforward pattern, though potentially a quite powerful one. What factors would you consider to decide if speculation is an appropriate/beneficial pattern to apply?

+The pattern boils down to optimistic parallelization. This reminds me of my computer arch class, 433, and our examination of branch prediction.

+You break sequential dependencies assuming that the majority of the time that breakage will not affect the correctness of the problem. In the cases where this does cause a problem, which should be a small percentage if you are applying this pattern correctly, the cost of going back and fixing those problems should be outweighed by the benefits gained from the optimistic parallelization.

+Regarding what factors need to be considered, I would like the ability either to know the statistical distributions of the inputs or to prototype a system in a reasonable amount of time to check the validity of apply this pattern to the given data or domain.

+I think the HTML lexer was extremely helpful edifying this pattern for me. While the abstract discussion all makes sense and seems well written, I just could not figure out what the tasks and dependencies could look like. I wish there were a few more examples that are concrete.

Seeing as how you may not have data about the statistical distribution of inputs to expect before you build the program, and therefore perhaps not know the average cost of rollback, etc., is it prudent to implement the speculation pattern at the outset, or wait to evolve the application to include speculative optimizations down the road? What if the performance increase that speculation has the potential to generate is necessary to meet the minimum requirements of the specification? Is speculation still a viable pattern to apply, or would effort be better spent on other parallel optimizations with more certain outcome?

+I do not think that it is prudent to implement this pattern if you have no concept of whether it will be successful or not. The same applies to almost any architectural or design decisions that you would make.


Was this pattern what you expected from the name?

+Nope. I am not sure how the name fits the pattern that well. Sure, we are operating on the bits as circuits in a CPU would, but I feel that I had to think about how the name & pattern related rather than it being something more obvious. Then again, it does attempt to create an analogy with something that we all know.

It seems like a great deal of this pattern is described through examples of clever transformations for specific problems. Do these examples give you enough information to go about applying this pattern to a problem dissimilar to the given examples? Can you think of a way of describing these transformations at a more abstract yet actionable level than the examples do?

+I could theoretically think of creative ways to map other problems onto this pattern, but this pattern seems like such an extreme way to implement a pattern, that I imagine it would be a last resort that would only be considered if all other parallelization attempts were inadequate or unsuccessful. Though I agree that using 32 or 64 bits to represent what only needs 1, I feel that you would really have to be trying to wring every idle cycle out of your systems for this pattern to be something that you would turn to.

OPL: Branch-and-Bound & Graphical Models

Branch-and-Bound computational patterns generally involve finding an optimal solution to a problem by iteratively searching through a large space of all possible solutions. Typically, to perform such a search, no known polynomial time algorithm is known. The Branch-and-Bound pattern comprises of four steps: Branching, Evaluation, Bounding, Pruning.

1. Examples: The authors discuss 3-SAT and Traveling Salesman Problem as examples. Yet, a more fundamental example to grasp is the knapsack problem. That said, what might be some other simpler examples of branch-and-bound that you are familiar with?

+One example that I will steal from Wikipedia is a chess program. The incredibly large search space of possible moves 1 to n moves ahead explodes at an exponential rate, and being able to prune the search space efficiently is key for the system to be able to identify the optimal or almost optimal move.

2. Trade-offs: How does the nature of branch-and-bound pattern make it especially hard to deal with incurring low communication overhead while maximizing concurrent computation?

+When branch finds an optimal solution or constraints to apply to boundaries, that must be communicated to the other branches in order to aid their pruning. This could potentially mean that every branch needs to communicate with every other branch, which in turn would lead to a significant amount of chatter. There is a direct relationship between the amount of communication and the pruning efficiency.

3. Solutions and Parallelization:
a. The authors discuss three possible techniques for parallelization: operation-based, structure-based, and construction-based. What makes structure-based parallelization preferable? Are there any scenarios where you would use the operation-based or construction-based technique instead?

+I think this question should be answered by the authors of the paper. I do not feel that they are giving the topic a thorough analysis if they simply skip operation- and construction-based patterns in favor of structure-based ones without much justification.

b. For the structure-based technique, the authors describe four different methods for implementation: Synchronous Single Pool, Asynchronous Single Pool, Synchronous Multiple Pool, and Asynchronous Multiple Pool. Which method do you think would be hardest to debug for correctness? Which method do you think would be hardest to optimize for performance and scalability?

+Correctness: AMP. This complicates the scenario both by distributing data and by having updates happen with no definite period.

+Optimize: ASP. This may be hard to optimize for efficiency because the shared task queue needs synchronized access. It is also complicated by the fact that each processor can request data at any time. If all processors were updated at once, synchronization on the queue would likely be easier because just one process could handle the global lock & coordination.

c. The authors discuss two categories of task assignment: static assignment and dynamic assignment. In real-world applications, how important do you think the choice of the task assignment method is to performance?

+I think it depends on the problem domain. Static assignment can likely be more efficient if tasks size can be known with a reasonable degree of accuracy a priori. That way, the overhead of load balancing does not need to be incurred. It thus follows that dynamic assignment is beneficial in the opposite case. Given that, I think that choosing static assignment in the wrong situations can severely hurt performance by causing many processors to sit idle, but I do not imagine that choosing dynamic assignment in the wrong cases would lead to much of a loss in efficiency.

4. Pattern Comparison: How might fork/join or task queue pattern relate or fit with the branch-and-bound pattern?

+Task queue is an easy one since it is explicitly called out in the paper as being useful as the central data store in the SSP & ASP styles of structure-based parallelism. As for fork-join, each branch can be a fork until you get down to a low threshold, at which point it is best to compute the result sequentially.

Graphical Models are often times used in probability theory to represent uncertainty of a collection of (possibly independent) events. Given the probability of occurrence of some events in the network, one can use the model to infer probabilities of other events within the network.

1. Examples: Did the authors' examples (e.g. hidden Markov Models, Bayesian Netoworks) help illuminate what a graphical model is? Can you think of other simple example problems or scenarios that you could model as a network/chain of probabilistic events?

+The examples made sense only because of my prior exposure to HMMs and Bayesian networks in CS 512. One example of how they could be used is determining how likely you are to see certain sequences of genes or proteins. If the authors wish to talk about HMMs and Bayesian networks, they should really give examples. Although many readers will be familiar with basic graph theory, these concepts go beyond that.

2. Solutions and Parallelization:
a. The authors make a distinction between exact and approximate estimation of probability. In what example problems do you think approximate estimation would work best? In what examples exact estimation?

+It depends on the definition of "work best." Exact algorithms would always be ideal, but a faster approximate algorithm can typically be found & be beneficial from a computational resources perspective. The example above of finding or predicting certain patterns in a genetic sequence is a good candidate for using an approximate algorithm over an exact one due to the immense number of nucleotides that need to be processed.

b. The authors mention the Junction Tree Algorithm, indicating that searching for the correct factorizations can be done in parallel. What other characteristics of graphical models do you think could benefit from parallelism?

+Not sure. The paper is a bit too short for me to really wrap my head around parallelization opportunities, and the mention of the JTA does not help since I am unfamiliar with the pattern.

3. Pattern Comparison: What might this pattern have in common with the Branch-and-Bound pattern? Do you think it has similarity to the Graph Algorithms patterns?

Not sure.

Tuesday, November 10, 2009

PRES: Probabilistic Replay with Execution Sketching on Multiprocessors

- What do you expect from a debugging tool?
+stepping through the code line by line examining which branches are taken
+examining the state of variables
+examining the call stack

- What debugging tools do you usually use? What are good and bad about them?
+Visual Studio
+Covers all of the above. Has the ability to move backwards in a function; if you set a breakpoint & need to inspect what happened two lines earlier, you can drag it back and reply those lines. Conditional breakpoints. Can execute commands against the system's current state. Can modify the system's state & observer what impact that has.
+Doesn't have good support for "edit and continue," where you can change the inner workings of a method without needing to recompile & relaunch the application

- What are the challenges of debugging a sequential program? What are additional challenges of debugging a parallel program?
Debugging parallel programs can be extremely difficult due to heisenbugs that are timing related. These bugs happen only with certain interleavenings of the code, which are typically difficult to reproduce. The paper we read earlier in the semester on Chess from Microsoft Research is a good treatment of this problem and a possible solution.

- How important is replay for debugging?
Replay is essential for effective debugging. If the error cannot be consistently reproduced, it is often very difficult to diagnose and cure.

- If you need to design a deterministic replay system, how are you going to do it for sequential programs? Does it work for parallel programs? If now, how to make it work?
- What are the additional challenges to make your tool work on multi-core or multi-processor systems?
These questions seems like they would need a ridiculously long answer. The sequential problem is not too difficult, and you can refer to the Chess paper from earlier in the semester for the parallel case.

- Given a replay system, what else should we do to make it really help debugging?
Solve the halting problem.

- How would virtual machine technologies help?
Virtual machines could help by being able to completely capture every aspect of the system's state, providing you with a snapshot of the system's state that includes active processes, RAM utilization, etc. The environment can often have an impact on a system that behaves as expected internally, and those factors are often difficult to capture & especially difficult to repeat.

- What are the state of arts?
Not sure what the question is...

What the paper mentions in section 1.2 on "State of the Art" is based around systems to record executions. This is different from Chess, which did not attempt to record bugs but instead tried to find a way to efficiently and exhaustively examine different thread interleavenings. This seems to fall into the authors' bucket of existing software practices that repeatedly replay an execution to search for bugs caused by interleavenings. Granted, Chess should be drastically more efficient than the common naïve practices, but it is still in that category. PRES attempts to record enough information that a small amount of replays are needed to reproduce the bug, but it limits the amount of observations that are recorded in order to minimize the overhead of running a production system with PRES active.

PRES has three key components:
-recording only a subset of events. Although there may not be enough information to replay every intermediate event exactly as it occurred, enough information is recorded to reproduce the bug. Once PRES has reproduced the bug once, it can then consistently reproduce it 100% of the time.
-a replay system that reproduces unrecorded actions & events
-using data from unsuccessful replays to help guide subsequent attempts

OPL: Loop Parallelism, Task Queue, & Graph Partitioning

Loop Parallelism
1. The author says the goal of incremental parallelization is to "evolve" a sequential program into a parallel program. If you have experience in refactoring code from sequential to parallel, how did your approach differ from the steps presented in the paper?

I do not have experience doing this, but I do not imagine that you would have great results. I believe we talked about how evolutionary parallelism can achieve minor performance gains, but the architecture really needs to support the parallelism using the appropriate patterns for the problem domain for the speedups to be significant. Amdahl's Law says that we are limited by the amount of code that must be done sequentially, and I imagine refactoring loops will not parallelize a large percentage of a typical codebase. Though I am not a fan of premature optimization, there is a threshold above which you can be certain that an area of the code will be significantly computationally expensive relative to the rest of the code & look for parallelization opportunities there.

2. Many of us have used Eclipse or another IDE, with their respective plugins, for execution profiling. What has been your experience with these tools? What is your preferred method of locating the code in question (automatic or manual)?

I have had very good success with profiling tools in the .NET world. They have helped easily identify down to the line where performance bottlenecks are.

3. The author states, "The application of many transformations to many loops may be necessary before efficient performance is achieved." What is your opinion on this statement?

That is true. See my response to 1.

Task Queue Implementation Pattern
1. How would you compare the Task Queue pattern to other similar distribution patterns like divide-and-conquer and fork/join pattern?

I think the task queue is by definition a divide-and-conquer algorithm. Otherwise, there would be no need for a queue because there would be only one, large task. Perhaps James meant recursive splitting & fork-join.

The distinction with recursive splitting is that here the work items in the queue are broken down into their smallest-level pieces, which are then grabbed by threads that need more work. The problem that recursive splitting has that the task queue does not is that some effort should be spent to make sure that the splitting points for dividing the work into recursive tasks should split the work-load evenly. Otherwise, it can be quite easy for the load to become unbalanced, e.g. a worst-case quicksort that takes O(N^2) instead of the average O(N log N).

This pattern interests me because I have read several blog posts recently talking about it in different contexts. The typical example for explaining its optimality in queuing theory is people queuing at a grocery store. If customers divide themselves and form N queues for N cashiers, customers can become blocked for an inordinately long time behind a difficult/slow customer. In addition, adding and removing cashiers causes a significant amount of churn in the customers. By contrast, if all customers form a single queue and the customer at the head goes to the first available cashier, no customer is unfairly stuck behind another difficult/slow one, and cashiers can be added and removed depending on the queue length without affecting the waiting customers. The former is the distributed queue version and the later is the centralized queue version of this pattern.

Perhaps my memory is failing me a bit, but fork-join seems like a specific implementation of the decentralized version task queue pattern. In fact, the distributed task-stealing pattern that was first presented in the fork-join paper is explicitly mentioned here.

2. It was nice to see a pattern with some actual code examples. How important were the examples in understanding the concepts presented in this paper?

I did not find the code samples useful. I thought that the concept was explained quite well in the first few sections of the paper; better than most of the others, in fact.

Graph Partitioning Implementation Strategy Pattern
1. This was probably one of the more math-oriented patterns I have seen and/or read. How often have you experienced a situation where the graph partition strategy would have been useful? If never, can you think of any future situations where this pattern might be come in handy and could be applied?

I felt that the majority of the coverage was dedicated to the nitty-gritty details of several graph-partitioning algorithms. I think more high-level analysis would have been beneficial. That being said, I do think that this is a useful pattern to have in your toolbox because of the large number of problems that could be mapped to a set of nodes and weighted, optionally directed edges. Examples could be physical distance between points, latency between nodes on a network, or monetary costs to ship a package between two cities.

Friday, November 6, 2009

OPL: Geometric Decomposition, Pipeline, & Data Parallelism

Geometric Decomposition:

In geometric decomposition, they mention recursive data structures as good candidates for recursive algorithms and then mention arrays as good candidates for geometric decomposition. Arrays are recursive data structures, though, and have been used as examples in almost every one of the recursive algorithm patters, e.g. quicksort. I think the authors need to give more context to when arrays are good candidates for being geometrically decomposed as opposed to recursively processed...which they seem to do in the next paragraph. According to the authors, recursive algorithms are fine when the range of data that you're dealing with doesn't need to know about any of the other data in the structure. When there is some locality, though, then geometric decomposition is the way to go.

It's interesting to note in the multidimensional array examples that the surface area resulting from the data decomposition is important to consider. Since each block of data may need to communicate with adjacent blocks, the more surface area there is, the more communication is necessary. A similar idea can likely be extended to most data structures. In graphs, for example, you can consider the number of outgoing edges or the cut size from a section of the graph. There is a graph partitioning paper coming up, so hopefully that's yet another layer deep and not something parallel to this. Otherwise, I think I'm missing the point of this paper.

Another novel idea is to duplicate the neighbors of cells on the boundary of a chunk. In certain scenarios, the overhead incurred by duplicating data can make up for the fact that chunks should not need to communicate with each other. However, this only seems practical to me if the cells on the boundary only need the initial values of their duplicated neighbors. If they need the mostly updated data from their neighbors, then I don't see how this eases the communication burden. Also, if each border chunk has access to all the needed information when the process begins, then I think it is a data parallelism and not geometric decomposition situation.

Load balancing can become a problem if the data is not evenly distributed amongst the chunks of the data structure or if the processing causes the distribution to become uneven. The solution presented here, which makes sense to me, is to create many more chunks than UEs so no UE is likely to sit idle while another has a significant amount of processing to do.


It seems strange that the pipeline paper only differentiates itself (in the authors' words) from the pipes and filters pattern, which we've already read twice, simply by explicitly addressing parallelism. I think to avoid confusion, this should be made as a section of the pipes and filters pattern. Given that, it takes a while before this paper adds anything new to the two that we've already covered

Since each stage can typically only process one piece of data at a time, the more stages there are, the more operations can be performed in parallel. However, each stage comes at the cost of increasing communication overhead and latency. In order to make optimal use of the resources available at each stage, they should all require roughly the same amount of work so quicker stages don't leave resources idle for long while waiting for the longer ones to finish. This later point (and perhaps the former, too) was made in the previous treatments of this topic, though.

Data Parallelism:

Is this even worth discussing? The entire treatment of the topic was only three sentences.

Thursday, November 5, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 6: Building an Application

I am getting the sense that the author loves his hierarchies. When it comes to structuring a system, we throw a few more layers on top of the supervisor-worker trees mentioned in the last chapter. We group systems into applications that are either independent or loosely coupled. Each application then consists of supervisor-worker trees.

I find it very interesting that just by passing a 'global' flag into the server's start function, the server can be transparently accessed from any machine. I wonder what the setup is at the system or application level to make the communication between machines theoretically anywhere in the world transparent. How does the system avoid naming collisions? If you simply pass {global, Name} as an argument to start up a server, I imagine it could be very easy to get collisions. Sure, the process will likely fail fast, but what are good options for recovery? Restarting it will not help because it will have the same name. I imagine that changing the name will break other processes that want to communicate with it...Oh, wait. I think it is the PID that is used for IPC, not the name. What is the point of the name, then?

I am really overwhelmed by the API documentation. Although I can follow it somewhat (Chapter 3 is my only exposure to Erlang), I do not understand what benefit I am supposed to be getting from it. Although I love having concrete examples to back up the abstract concepts, I feel like this is conveying anything interesting, e.g. the structure of an .app file.

I am still somewhat confused on how SSRS is supposed to solve all your reliability problems. If there is a problem with some data or with some data and a certain piece of code, restarting & reprocessing it again will continually cause the same problem. The only solution is to completely ignore that data. I suppose the goal is just to not crash things at the application or system level, but simply throwing your hands up and saying "I don't know what to do" seems like it wouldn't be good enough in many cases.

I've been impressed by how Erlang can abstract out a lot of the tricky bits like managing concurrency & supervision while making it easy for every-day developers to plug into that. I think that is key for building reliable systems. If every-day programmers are building the code that is supposed to ensure reliability & fault tolerance, odds are that they will mess it up. I'm not sure if the examples lend themselves to this extremely well or things are just much easier in Erlang, but separating error handling concerns this easily in OO languages that I use is not quite so easy. It reminds me of AOP, though I have not seen AOP really take off as I was hoping it would, though there are a few good AOP tools out there.

Wednesday, November 4, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 5: Programming Fault-Tolerant Systems

“... [I]f we cannot do what we want to do, then try to do something simpler." I would like to see some examples of this because I cannot think of many ways to implement this in the systems that I have been working on lately. For example, one system is supposed to place orders for financial instruments. I cannot think of any way to break this down into simpler, alternative tasks. If the system cannot thoroughly and completely validate the order, then it cannot place the ord. If the order cannot be persisted to the database, then it cannot be placed. There is nothing simpler to try. Most key business task seem to be all-or-nothing at a cursory glance.

Here is another attempt to differentiate errors, exceptions, failures, faults, bugs, etc. I know there are some official, IEEE-backed definitions for these terms, but the problem I have is usually the terms are too similar and creating the distinctions seem like an academic exercise and the definitions usually do not clearly separate the terms. Here is my stab at interpreting what the author was trying to say: Errors occur at run-time when the system does not know what to do. Dividing by zero is an error to the VM since it cannot perform the computation, but this may or may not be an error to the parent process depending on whether or not the developer accounted for this scenario. If the code corrects the divide by zero error, it ceases to be an error because the system now knows what to do with it. (I had to read ahead a bit to section 5.3 get a decent definition.) Exceptions are a mechanism to communicate about errors between processes. When an error occurs in the VM, it raises an exception to let the calling process know what happened. That process may pass the exception on or create a new one to communicate to its parent that an error has occurred that it cannot correct. If a process receives an exception for which it does not have a catch handler, the process fails (shuts down) and notifies its linked peers why. So in a nutshell, an error is a run-time problem that the system does not know how to handle that is communicated between processes by exceptions that cause failures in processes that do not have catch blocks for them. Now let’s see how those definitions help up build a reliable system...

The author recommends creating supervision trees.
+Supervisors monitor other supervisors and worker nodes.
+Worker nodes execute well-behaved functions that raise exceptions when errors occur.
+Supervisors are responsible for detecting exceptions in child nodes & implementing the SSRS (stop, start, restart specification) for each node that it supervises.
+Supervisors are responsible for stopping all children if their parent stops them & restarting children that fail.
+Supervisors can be branded either AND or OR. If they are of the AND type & one child fails, they stop all other children & then restart them all. These are used for coordinated processes where a process cannot continue if any of its siblings fail. If they are of the OR type & one child fails, they restart that node and leave the rest of the children alone.

"The above discussion is all rather vague, since we have never said what an error is, nor have we said how we can in practice distinguish between a correctable and an uncorrectable error." The author absolutely read my mind! Time for some concrete examples...

Rules for well-behaved functions:
+"The program should be isomorphic to the specification": everything that is in the specification must be in the function and everything that is in the function must be in the specification. Nothing from the specification should be omitted in the function, and nothing should be added to the function that is not in the specification.
+"If the specification doesn’t say what to do, raise an exception": Don't guess; fail fast.
+Be sure the exceptions contain enough useful information so that you can isolate and fix what caused it.
+"Turn non-functional requirements into assertions (invariants) that can be checked at run-time": Not sure how I would do this in general. The author's example is timing function calls to make sure none go into uncontrolled infinite loops.

Tuesday, November 3, 2009

OPL: Task Parallelism, Recursive Splitting, & Discrete Event

Task Parallelism

When you first saw this pattern title, did you feel like you knew the subject beforehand? After reading the pattern, did you confirm your impression or did it surprise you? how?
+Yes & yes. It's basically a layer of abstraction on top of the implementation patterns. The Monte Carlo method is mentioned several times since it is a computational pattern that uses task parallelism.

In what category falls this pattern?
+Parallel algorithm strategy pattern. Although the different categorizations of the patterns on the parlab website can help group similar patterns, sometimes it feels like there's a little too much overlap or one to many layers of abstraction.

What platform do you have more experience programming parallel applications on? And what Task Management Mechanism? (Jobs over Networks, OS Processes, SW Task Queue, HW Task Queue)
+I have no experience writing any applications that use parallel programming to solve a problem. Here and there in industry and class projects, I'll make some async calls, but that's always just so the main thread of execution doesn't block when doing something like sending an email or having another single thread create a long-running report.

Do you think that the lack of new hardware is preventing us of reaching better run times?
+No. I don't think we utilize the number of cores in modern servers and PCs to their full extent. Language support that would make parallel programming accessible to the masses is the major obstacle right now.

Do you think that this patterns requires you to learn more about hardware platforms in order to make a correct implementation?
+No. You could implement this pattern in Erlang and never have to know a thing about the hardware.

Do you agree with the mapping the author made for the Monte Carlo example? Can you think of something else?
+Yes, the Monte Carlo VaR example seems like a set of independent tasks, and not a problem that is broken down into smaller recursive pieces, which would be the Recursive Splitting pattern, or small tasks performed in sequence, which would be the Pipeline pattern. After starting to read the recursive splitting paper where they say that recursive splitting is a specific form of task parallelism, I don't understand why there are so many different layers of abstraction inside the one Parlab category. What in here isn't considered task parallelism? A "sequential" pattern like pipline seems like the only one. If they could be arranged nicely enough, I think I would rather see the patterns presented here in a tree form showing which patterns are more specific versions of their parents.

Recursive Splitting

I thought this paper went into a good amount of detail, and I was very pleased by the fact that it showed how it was related to many of the other patterns & grouped those related pattern into higher-, same, and lower-level groupings.

How to control task granularity in this pattern?
+You just have to define a base case beyond which the problem is not split any further. The problem is sloved in a more straight-forward, usually sequential manner after crossing the threshold. The threshold is tied to both the characteristics of the problem domain, the data, and the hardware.

Can you think of any other popular algorithm (besides Selection Sort) which this pattern can not be applied to?
+An algorithm to find the median value in a set of numbers.

The authors mention the idea of composing the Data Parallelism pattern inside of the Recursive Splitting pattern. Can you think of other patterns composition?
+Task parallelism + pipeline.

The authors mention that the ideas behind the Fork/Join and Task-queue strategy patterns are essential to any developer who wants an efficient implementation of the Recursive Splitting pattern. Why?
+Although nascent, they are the most mature and tested implementation patterns for recursive splitting problems. Reinventing the wheel can lead you to make naïve mistakes. The work stealing algorithm in the fork/join pattern is especially novel and worth reusing.

Discrete Event

In the context of asynchronous communication of events, the authors mention two environments: message-passing and shared-memory. Which environment are you more familiarized with? Which one you like most?
+I'm not extremely familiar with either, but I'm a much bigger fan of message passing. Recently I've been coming across a lot of readings and OSS code that use message passing to build decoupled, reliable systems. Erlang is an example of using this pattern at a low level, and enterprise service buses are an example of doing it at a high level. In both instances, the systems are touted as being extremely reliable since components are isolated from each other. With shared memory, there is a greater probability of contention and especially corruption between different tasks and processes.

Do you agree with the authors on that, often, the best approach to deal with deadlocks is to use timeouts instead of accurate deadlock detection?
+I do agree. The timeout method requires little overhead/latency, and it's easy to roll your own implementation. Accurate deadlock checks will take more resources to execute & likely a library for the implementation that an everyday programmer would need to use.

Saturday, October 24, 2009

OPL: Dense Linear Algebra, Graph Algorithms, & Monte Carlo

Dense Linear Algebra

This pattern is trying to address that fact that memory access is much more expensive that CPU processing, so once a piece of data has been loaded, we need to make the most use of it.

“The BLAS level 3 routines stand out because of the smaller order of complexity for data movement (O(N^2)) than for computation (O(N^3)). Due to this fact, BLAS Level 3 routines can approach peak performance on most parallel systems. For this reason, a programmer would greatly benefit from structuring an algorithm to favor BLAS Level 3 functions. For example, replacing a large number of dot products (a BLAS level 1 function) with a single matrix-matrix multiplication (a BLAS level 3 function) can turn a problem bound by memory bandwidth into a compute-bound problem.” I’m still having trouble figuring out what this paragraph is saying, because at first I wasn’t sure why you’d want to take O(N)/O(N) vector operations and turn them into O(N^2)/O(N^3) operations, which are obviously much more expensive. However, once I actually turned by brain on, I realized that an NxN matrix is an aggregation of N 1xN vectors. Things still don’t quite click because it should take O(N^2) time to perform N BLAS level 1 operations on 1xN vectors. I realize that the point is to make the process compute-bound instead of the bottleneck being data movement, but what I’m focusing on is that it takes O(N^3) as a BLAS level 3 operation when it should take O(N^2) either as one BLAS level 2 operation or N BLAS level 1 operations. I know I’m missing something because the key seems to be BLAS level 3’s different bounds on data movement and computation, whereas BLAS levels 1 & 2 have the same bound on both, but I still don’t see how O(N^3) is better than O(N^2). The O(N^2) BLAS level 1 & 2 operations should be able to be parallelized, too, right?

Naïve implementations of matrix multiplications will repeatedly access data in a way that is suboptimal from a data caching and persistence strategy. A matrix is typically stored in either row-major or column-major order. Let’s say that we are using column-major order. That means accessing items that are adjacent in a matrix row is an expensive operation because the data are likely not stored in the same page on the disk and are likely not cached together. This becomes a problem when naively computing a matrix product cell by cell because we must traverse the same row many times. Instead of computing the product cell by cell, the paper proposes using an outer product, “in which the product of a column of A and a row of B acts as a rank one update to the entire product matrix C.” In this way, we do not have to repeatedly traverse each row and/or column, so spatial locality is still a problem, but a less frequent one.

An extension of the above idea is rather than to update all of C by using an entire column from A and an entire row from B, a smaller block that uses a subset of each is computed. An autotuning library is needed to determine what is the best block size for the given hardware, as block size is dependent on everything from the TLB to the number of floating point units available.

The idea of blocking above is typically optimized towards the quantity and speed of registers in the system. It can be extended further to create other layers of blocking optimized towards L1, L2, and, if it exists, L3 caches. The high-level blocks move around within the higher level blocks until the higher level blocks have been fully traversed, at which point they move.

One noteworthy optimization mentioned in the paper is to “use local variables to explicitly remove false dependencies.” If the value of a matrix cell is stored in a local variable, then operations can be reordered in a more optimal way than if there was a write-after-read/anti-dependency.

Graph Algorithms

Nothing new here as far as creating the abstraction goes. This was covered very well in my algorithms class during my undergrad at UIUC (CS 473). My main take away from the class was that almost any problem could be solved by converting it to a graph and using max flow, though the solution may be less than ideal.

I think the list of graph structures (bipartite, connected, complete, k-regular, tree, forest, DAG) and parallelization opportunities listed make a nice starting point when it comes to doing some research on optimizing an algorithm. Same thing with the graph storage structure (vertex list, edge list, adjacency list, adjacency matrix), traversal options (breadth-first search, depth-first search, topological sort, minimum spanning trees [Kruskal, Prim], single source shortest paths [Bellman-Ford, Dijkstra], all-pairs shortest paths [Floyd-Warshall, Johnson], single source longest paths in DAG [Floss], max flow [Edmonds-Karp, push-relabel, relabel to front], graph coloring [Chaitin, Bomen]), and partitioning operations (Kernighan-Lin, Fiduccia-Matheyses, ParMetis). I found myself scouring my notes from CS 225 & CS473 and searching the Web for all the algorithms that I’ve used to know back during my undergrad but have since forgotten.

Monte Carlo

This method of choosing random points in the sample space can be beneficial when it is difficult to arrive at the answer via analytical methods.

I’ve seen SIMD mentioned frequently in this class during the last few weeks’ readings. I read up on it a bit on Wikipedia before, but I’ll have to dig a little deeper and see if/how it can impact my day job.

As mentioned in the paper, map-reduce and Monte Carlo often work well together because the random samplings are independent tasks that can be performed in the map phase and then have their statistics aggregated in the reduce phase.

I think the paper spends a bit too much time (it’s not much time in total, but these patterns seem to favor brevity) on how random number generation is difficult when not relying on any sampling of physical phenomenon. The more interesting discussion is around how to get pseudo random number generation and parallelized systems to play nicely together, which is a problem because the generator relies on state that can’t be shared by different threads without having them produce the same “random” number. The paper presents two possible solutions: either generate random numbers in batches so blocking/locking is not needed as frequently or “[g]enerate multiple unique sequences in parallel, using advanced parallel version of the Mersenne Twister PRNG or SPRNG Ver4.0…”

I found both the “Experiment Execution” and “Result Aggregation” sections lacking. The former focuses on nothing but SIMD lanes, while the later does not dig into any depth or many alternatives for combining results other than “use locks to synchronize threads.”

Friday, October 23, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 4: Programming Techniques

I took the time to read Chapter 3 on the language itself because I am quite interested in learning new languages whenever possible. One “best practice” for self-development that I often read is to learn one new programming language every year. The goal is not to become a master in the language and use it in production systems, but to become competent with it so you can understand its patterns and paradigms and have those influence your day job. Functional languages have been all the rage lately and have had significant impacts on modern OO languages, e.g. LINQ in C#. An example of applying functional concepts to your OO programming could be to create generic functions that let you program more declaratively and using immutable value objects.

The one section that I found interesting outside the syntax of the language was the treatment on page 65 of tail recursion. Although I am familiar with the concept, it was an interesting refresher on how it works technically with return pointers & how you can convert to tail recursion by introducing an auxiliary function.

Abstracting Concurrency

The author advocates putting all concurrency-related code into generic reusable modules written by expert programmers and putting all sequential code into plug-ins that anyone can write. Obviously concurrent programming is harder, even in a language that natively supports it like Erlang, than writing sequential code due to the complexities of deadlocks, starvation, interprocess communication, etc. He also advocates creating a plug-in architecture because he claims that you cannot write the concurrent code in a side-effect free manner, whereas you can easily do this with sequential code. To answer Ben’s question, I do believe this statement to be true. The language does not allow side effects in terms of mutating state, so the author is not referring to that. He is referring to the fact that the functions in the concurrent modules need to affect the environment by spawning & exiting processes.

I am not sure why the “find” method in the VSHLR responds with an error when it cannot find the person. I thought that each process was supposed to complete successfully or exit. Granted, this is a central server and we do not want it to die, but although it makes logical sense, it seems to go against the whole paradigm that the author has laid out thus far in the paper. (After reading the rest of the chapter & re-reading this section, the server should not need to exit here. The point of the “fail fast” and “let another process handle the problem” philosophies here means that the server should simply return an error and let the client figure out what to do rather than take any corrective action.)

I like how the author was able to “factor out” concurrency in the example of the VSHLR server. I am wondering if this pattern is always so easy to apply, though.

The idea of having supervisor processes monitor the health of the nodes in the system so a client does not have to worry about its server dying seems like it would work as long as there was a supervisor for the supervisor and a supervisor for that supervisor and so on. So far, the application seems durable, but I think that this same level of durability (except in the face of hardware failures) could be easily replicated in Java or C# without having separate JVMs running in separate processes, as the author said was necessary earlier. I have a “catch (Exception ex) {/*show error message*/}” at the top level in most of my apps that works just fine for errors that are not due to hardware failures. There are obviously many other benefits to Erlang that would be missed, but the durability argument is not shaping up to be quite what I thought it would be.

The whole idea of swapping code out on the fly is much simpler here than I imagined it to be. Here the server is just sent a new function/closure/lambda to execute instead of the one that it was using before. I know there is support for a much more extensive upgrade given what was mentioned in Chapter 3, and I’m interested to see how that works.

Error Handling Philosophy

Even after the author lists out four reasons why having another process handle the errors or the faulty process, I still do not see why it is much more advantageous than having the faulty process handle its own errors unless the errors are hardware related and/or unrecoverable and the faulty process dies. In most systems that I build, we rely on failover systems to handle hardware failures, but I imagine most of the systems that go to the lengths of using Erlang want a more bulletproof Plan B than that. None of his arguments related to code clutter or code distribution sell me at all, though (I really want to be sold on the idea & think outside my Java/.NET box, but the author isn’t a convincing salesman thus far or the pitch is going right over my head).


I guess that after reading a bit farther, I should recant what I said above. Having the error handling completely removed from the code the worker processes will be executing, even if it is not generic, seems much nicer than having to rely on “catch (Exception ex)” blocks sprinkled throughout the code. If the error handling code in the supervisors can be made generic enough to work with multiple different types of worker processes, all the better.

Monday, October 19, 2009

OPL: Event-based Implicit invocation & Map-reduce

Event-based, Implicit invocation

I’m not seeing a difference between this pattern and the observer pattern at the OOD level. The pattern is more general and can be applied to communication between systems, e.g. a SOA where the pricing module will fire an event to indicate a sale but not be concerned with what other modules (billing, CRM) are interested. Now that I’m re-reading Richard’s questions, that last example fits in with the pub-sub architectural model. I’m not sure that I see a difference between these patterns other than levels of abstraction. I think that the different examples at the end of the “Solutions” section do a good job of showing other levels of abstraction where the pattern works.

I’m not sure why you would prefer a synchronous dispatch over an asynchronous one. I’m sure there is a valid reason, but other than hardware limitations, I can’t think of one. Since the announcer shouldn’t know anything about its subscribers, why should it have to wait for them? It’s typically not expecting to get anything back. If a sender does want to reply to a message, it won’t be via a return value or anything like that; it will reply by raising an event of its own.

As far as error handling goes, the announcer should certainly not be notified of errors with the subscribers. The entire point of this pattern is to decouple the two types of entities. Since the announcer won’t know anything about the subscribers, it will likely have no idea how to handle any error that it receives from them.


This pattern has been all the rage due to the fact that Google uses it. One example that they give, which I find much clearer than any of the examples given in the reading, is counting the number of times a string appears in a document. If there are w workers and p pages in the document, each worker can take p/w pages and find the number of occurrences of that string. Then since the reduce method is a simple summation, which is a distributive function (the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning), so even the reduce part of the algorithm can be done in parallel.

I think that the map portion of the pattern is simple and obvious (divide and conquer your independent tasks), but I think there is a significant amount of complexity in reducing in non-naïve manner that was glossed over. Maybe it’s simpler than I think, though, because I haven’t used the pattern before, or maybe doing naïvely is just fine.

This pattern reminds me a lot of the fork-join pattern, and I’m really thinking hard about what the differences are. The work stealing method presented in the fork-join paper seems like a very novel idea, but that is just an implementation detail that I imagine map-reduce could also use.

I don’t know if I like the idea of the framework handling errors. I usually prefer to have the system let the programmer specify exactly how errors should be handled so none get swallowed inadvertently. I think that if the framework attempts to handle the errors itself, it is violating the “fail fast” principal discussed in the other reading for this class. The framework can keep the computation going using data that is potentially incorrect, and one error can lead to drastically compounded problems in the output.

Sunday, October 18, 2009

Joe Armstrong's Thesis "Making reliable distributed systems in the presence of software errors", Chapter 2: The Architectural Model

I’m very excited for this paper. I’ve heard a lot of buzz about how Erlang facilitates the building of highly reliable systems. I was a little disappointed when the introduction said “The thesis does not cover in detail many of the algorithms used as building blocks for construction fault-tolerant systems—it is not the algorithms themselves which are the concern of this thesis, but rather the programming language in which such algorithms are expressed.” This gives me the impression that the paper will focus on how Erlang helps you develop reliable systems but leave out lessons that I can apply in other languages. We’ll see how things go…

I’m in complete agreement with the definition of architecture presented. The author says that there are myriad such definitions, but this one contains all the elements that I find noteworthy. It’s by Grady Booch, though, and he usually has some good ideas </understatement> .

I think that a philosophy or set of guiding principles is a commonly overlook portion of a software architecture, and I’m glad to see it called out here. As cheesy as this and related ideas like a system metaphor from Extreme Programming may sound, I have found that teams bigger than four or five people really do benefit from having a foundation that architects and developers can refer back to throughout the project’s lifecycle in an attempt to keep a uniform feel. A good example from a developer’s perspective, which bleed over into a “set of construction guidelines,” that still touches on some architectural decisions can be found here:

This sounds like an extremely interesting challenge. Each of the ten requirements would make for a challenging system, but mixing high concurrency, high reliability, in-place upgrades, and complex features sounds like quite the daunting task.

I’ve heard that Erlang isolates very small pieces of code in independent processes to restrict how far errors can propagate & that each process/module should always make the assumption any other modules that it will talk to will fail. However, the thought of truly to use a full process at the operating system level, this sounds like a very heavy-weight solution due to the overhead of having so many processes and the cost of interprocess communication versus intraprocess communication. It’s good to see that the processes used by Erlang in this sense “are part of the programming language and are not provided by the host operating system.” It’s interesting that towards the end of the chapter the author presents the idea that the isolation from errors that lightweight, hardware independent Erlang processes provide can only be matched in Java by running applications in different JVMs in different operating system processes, which is obviously a tad bit more resource intensive.

Concurrency orient programming doesn’t feel like it has too many differences from OOP to me. It just sounds like OOP in a COPL. Real-world activities are mapped to concurrent processes in much the same way the real-world objects and actions are mapped to classes and methods. Nothing about the characteristics of a COPL seem new, but it is a novel idea to have all the functionality listed (support for processes, process isolation, PIDs, communication only via unreliable message passing, failure detection) implemented completely within constructs of the programming language that can be completely independent of the operating system. It does seem like it could be a problem to have to copy data between processes, though. I can’t imagine that processes could always be organized so that large chunks of data never need to be passed between processes.

I’m never a fan of security through obscurity, so when the author mentions that not knowing the PID of a process means you can’t interact with it and thus can’t compromise it in any way. I’m sure that any decent architect would put more layers of security in place, but I dislike security through obscurity so much that I wouldn’t even bother bringing it up.

Message passing seems to be the best way to build reliable systems at any level of abstraction from my recent readings. At the enterprise architecture scale, event driven architectures and message passing systems are at the crux of many of the most durable sites as seen on High Scalability, e.g. At the complete opposite end of the spectrum from systems integration is Erlang, which builds these exact same concepts, e.g. asynchronous message passing, into the programming language itself. Since I hear good things about reliable systems being coded in Erlang and reliable enterprise architectures building SOA stacks on EDAs and message passing systems, I’m guessing there’s something to this pattern.

The idea of failing fast as soon as you detect an unrecoverable error seems like common sense to me. Maybe I’m being naïve or not entirely comprehending the details of the idea, but I don’t understand why anyone would have their system continue after encountering an error which they can’t handle. I’ve come across this idea many times in the past but still have yet to feel like I’m not missing something.

Saturday, October 17, 2009

OPL Patterns: Pipes & Filters, Layered Architectures, and Iterative Refinement

Pipes & filters

The version of the pipes and filters pattern for this assignment was much shorter than the last reading because there was no detailed example and no extra fluff. The original had an overly complex example that I found useless. However, some omissions that jumped out differentiating push vs pull filters and problems with error handling and global state. Also, the original made four points on why the gains from parallel processing are often an illusion. The new one didn’t raise or counter any of those concerns.

The “Forces” sections contained some practical considerations that I don’t believe were mentioned in the original. It was nice to have those considerations in a short, independent section to call them out rather than burying them in dozens of pages of text as the POSA reading did, in my opinion. Overall, I found all three more concise because they lacked some verbiage and context, but I feel they retained 90% of the useful information in 10% of the space.

Layered architectures

I think there’s too much emphasis on how layers hurt performance. While true, it doesn’t matter for most applications that don’t involve systems programming or intense computations, though the critical paths in the later type of system can have the layers removed while the rest is layered to maintain modularity.

I liked the point about keeping the API smaller on lower layers to decrease the work necessary to change the underlying implementation.

Iterative Refinement

It sounds like this is saying that you should find high-level steps that need to be done sequentially, and then inside each step, find tasks that can be performed in parallel. It sounds similar to an article that I just read in the November/December issue of IEEE Software: “Parallelizing Bzip2: A Case Study in Multicore Software Engineering.” Four pairs of students were tasked with taking Bzip2 and parallelizing it. Most teams found that simple, low-level optimizations such as parallelizing loops were insufficient for achieving substantial gains. The successful teams found that the best approach was to refactor the code into high-level modules that were independent but needed to be executed in sequence. They then optimized the code within each module (if I’m remembering the article correctly), which sounds just like this pattern.

Thursday, October 15, 2009


I was instantly interested in this paper as soon as I read the title. Though my experience with concurrent programming is limited, many of the bugs I encountered during that time were “Heisenbugs.” However, the last time I read about an interesting testing library that was based off of a research paper and Microsoft had created a version for its internal testing, Randoop, I was disappointed to find that the .NET code was not available.

Chess focuses on using only a few preemptions in a sequence but interleavening those preemptions in as many different places in the sequence. The tool also allows you to specify which areas to test more and which areas to run atomically.

I find it nice that the tool runs deterministically & that you can instantly get the last thread interleavening, though those are obvious features without which the tool wouldn’t be very useful. It’s also quite nice tht the tool doesn’t need a custom VM to be able to intercept calls.

Good old Lamport timestamps back from my distributed systems class. They sound like they give a good amount of flexibility to the scheduling algorithm by only requiring a partial ordering. It just has to focus on interleavening a few key events at the right times with respect to each other but not worry about absolute times.

Threadpool work items, async callbacks, timer callbacks

Chess just needs to know if a synchronization operation “isWrite,” writing to shared state, or “isRelease,” releasing other tasks that are blocked on a synchronization mechanism. “isWrite” causes edges to be created in the happens-before graph backwards to all previous nodes and forwards to all subsequent nodes. If the API that Chess has to work with can’t be well understood, program correctness will be maintained by setting both to true; only performance and efficiency will be affected.

Chess avoids blocking calls by wrapping them & using the “try” version instead, which does not block. The tool can then move that task off into the “disabled” queue until a release event occurs on that synchronization resource. This prevents Chess from being blocked.

One problem with Chess is that it can’t guarantee the “real-time” semantics of timers, but few programs correctness or control flow rely on the threads being scheduled and executed real-time.

Data races for non-synchronized shared memory must be put into the “happens-before” graph, too. Chess enforces that the threads run in a single-threaded fashion so that Chess can deterministically execute the program and capture replay info.

There are n to the (k to the c) power thread interleavenings with n threads and k atomic operations performed by each and c interleavenings performed with each run. Chess solves this explosion of the search space by: keeping c low, usually around two, and by keeping k low by ignoring system functions and base libraries. Chess has found that keeping c low but examining all possible interleavening give the small magnitude of c has had great success.

When running Chess in a test harness, things implemented by devs to randomize the program like a large number of threads and sleep calls need to be undone to allow Chess to thoroughly and deterministically search the interleaving space.

Wednesday, October 14, 2009


Reentrancer automatically converts shared, global state into thread-local state. While the refactorings that the paper presents will indeed make the program safer to run concurrently, it may not always be easy to convert shared state to thread-local state. Often the shared state is the result of lazy programming and can be refactored away with a bit of determination. However, there are many times where tasks cannot be run completely in isolation on their own threads and do truly need global state as a coordination mechanism. Given that, I’m not sure how useful I really think Reentrancer is.

The refactoring takes place in three parts: surrounding static fields with accessors, converting static initializers to lazy initializers, and finally making the data thread local. The author did point out that this process can lead to incorrect behavior. It doesn’t seem like it would be worth the risk of making your application both incorrect and slower (according to the authors’ own benchmarks). Not being able to preserve behavior in the face of exceptions could cause an enormous amount of rework to be necessary. Database and file system access were also mentioned as blockers to this refactoring, but I find those to be less problematic because in the applications that I work with, those external accesses are normally separated from key processing of business logic

I found that this paper had overlap with the previous two on automated refactoring tools, but I found it less readable and less useful in practice.

Beautiful Architecture Chapter 14: Rereading the Classics

I think expressing control flow in Smalltalk using an extremely small set of primitive operations provides the language with an elegance and flexibility that I had never seen before I started working with it. However, one downfall that the author points out is the inability of Smalltalk to correctly handle the order of operations for mathematical symbols. This wouldn’t bother me too much because I don’t write many expressions on a daily basis that involve mathematical expressions that I couldn’t easily group with parentheses.

I think that inheritance is such a critical and common piece of OOD that it would be next to impossible to find any decent system that wasn’t using inheritance. I’ve had good experiences with inheritance used to model “is-a” relationships, and I’ve had good experiences using it to implement design patterns. I’ve had bad experiences where the inheritance trees have gotten overly deep and wide, though I’d much prefer a wide one to a deep one. I don’t think there’s much to say about this question because it seems so obvious.

I’m not sure what the argument about the Shapir-Whorf Hypothesis is supposed to be about. I have had experience implementing features in the language that were not natively supported. After working with Squeak, I fell in love with iterator blocks. I implemented them in C# using anonymous delegates and came up with a solution that wasn’t as elegant as Smalltalk, but it was better than what I had before. Fortunately, LINQ came out in .NET 3.5 and introduced lambda expressions and extension methods to facilitate declarative programming against lists using iterator block-like constructs. I have read Steve McConnel’s “Code Complete,” which was mentioned in this chapter, and the distinction between programming into and programming in a language was something that really stuck with me.

I think the dynamic dispatch is scalable. In my vision of how it works, there’s simply a dictionary lookup for each class to see if the method exists. I imagine this would be very fast unless you had an obnoxious class with millions of methods. You’d lose the ability to inline methods like an optimizing compiler for static compiled languages could, but I don’t think the performance implications would be too bad.

I think it’s a good thing that you can modify any class. I don’t like being protected from myself by language designers. If I tinker with a core class and mess it up, that’s my problem. It’s very nice to have the ability to tweak those classes that don’t work out quite right for you. .NET has introduced a great feature in the last version of the framework called extension methods. They allow you to create methods that extend the functionality of any class, whether it’s a sealed core library class or your own code, as long as you don’t hide any existing method of the class. Although this doesn’t let you override existing methods that don’t work quite right for you, the ability to add what look like native methods to the class is extremely convenient.

I don’t think the metaprogramming features provided by Smalltalk, Ruby, or Python are good enough to convince me that a dynamic language is as easily maintainable in an enterprise application as a static one. In theory I believe the quote in the chapter than mentioned that your code is only correct if it passes tests that define the correctness of your program. However, in enterprise applications that are extremely large and have developers of varying skill levels working on them, the compile-time checks provided by static typing along with the productivity boost provided by Intellisense/autocompletion are essential. The example provided by the author of checking whether the parameters respond to a certain type of message results in code that is much uglier than its statically typed equivalent. I do think the great flexibility provided by dynamic typing is something that I would love if I was developing an application all by myself, but I think the safety net provided by static typing is something that I couldn’t live without in my day job as a consultant building enterprise applications in collaboration with unskilled developers on my clients’ teams.

I was hoping that the chapter would touch on a few more of the classics other than the Gang of Four book and Smalltalk. I did like, though, that the author emphasized that the GOF book is great not only because of its classic catalog of patterns but also because of the general guidelines it sets out for good OOD and how to avoid abusing the powerful features that it provides you. I was disappointed, though, by the value extracted from all of the code samples provided. If you don’t know Smalltalk (luckily I do thanks to Prof. Johnson’s CS 598), I think the return on investment for spending the time to read those samples would be quite low.

The Shapir-Whorf Hypothesis is an incredibly interesting theory that I touched on above. I’ve heard this theory mentioned during my days as an undergraduate in both my CS and my French classes. I also find it odd that I came across the SWH twice yesterday, once while reading this book and once while looking at a presentation on Ruby that was explaining how the flexibility of the language allows you to program *into* it very easily, much like Smalltalk.

I thought that I understood the Strategy design pattern until the author mentioned that add:withOccurrences: was a concise implementation of it. I just can’t see it.

I love the quote that Maillart “found that innovation…came not from laboratory work and mathematical theories, but from design offices and construction sites.”

Saturday, October 10, 2009


Although I read this whole paper, I can’t find too much to say about it that wasn’t already covered in the blogs and discussion about Concurrencer. This is because I see a lot of similarity between what ReLooper does and how Concurrencer will convert an algorithm to use the Fork-Join framework automatically for you. What ReLooper does actually seems like an easier and more naïve change that what Concurrencer does with the fork-join framework, so I’m surprised that Relooper gets its own paper and Eclipse plug-in, while Concurrencer does the fork-join optimization as only one of three of its features.

I would be happy with a compiler making these optimizations for me automatically if there were an iron-clad guarantee that no potential problems could be introduced. However, I think that is a hard guarantee to make and would prefer to either have a tool like Relooper imperatively define what I’d like to happen, but I would prefer an option to tell the compiler declaratively, via an attribute in .NET for example (an annotation in Java?), that the marked code is safe to be automatically parallelized.

Beautiful Architecture Chapter 13: Object-Oriented Versus Functional

I found this chapter pretty interesting because the functional and OO debate is alive and well in most of the communities of which I am a part. I feel that this chapter was pitting functional against OO in an all-or-nothing matchup, whereas most of the communities that I follow talk about what specific set of constraints on a problem give functional languages an edge. When I think of functional languages in the enterprise, I do see a place for them but only as libraries to help with certain specific problems. However, I would never think to build a major pillar of an enterprise architecture in a pure functional language. As we saw from the survey that Jerry put together, 100% of the people in our class think that OO is better for the enterprise, and though it is a biased group, there were several respondents who claimed to have experience with functional languages.

I think that in comparing functional languages with OO purely on the terms of modularity is doing a major disservice to functional languages. Meyer does call out that he is ignoring the elegance of a declarative programming style of the equivalent imperative one. Although I am not a big fan of developing purely in a functional language, I respect functional languages for the positive impact that they have had on most modern OO languages. Being a C#/.NET developer by day, most of my examples will focus on that. C#’s support for (anonymous) delegates and declarative programming via extension methods (e.g. LINQ) make the code I write on a daily basis extremely flexible and concise without sacrificing readability. I do things similar to Meyer’s example of making the visitor pattern much more concise via agents.

Meyer also mentions lazy evaluation as a nice feature of functional languages. However, C# fully supports this in a managed OO language via the “yield return” command. That being said, I don’t want to put functional languages down as being useless in the face of C# because I appreciate the fact that this incredibly useful feature was most likely inspired by the functional programming style.f

In the situations that Meyer describes functional languages lacking, such as the fact that a time module that only parses ISO8601 basic cannot be easily extended to support ISO8601 extended, I agree with Meyer and would definitely use OO languages in such a situation. That seems like an almost absurd shortcoming of functional languages.

Meyer makes a good point when he says that it is unnatural to not have a concept of state anywhere in the system. As Danny mentioned during the conversation, when you paint a house red, you don’t make a copy of the original house with red paint and end up with two houses, you just change the original one. I’ve heard people say ad nauseum that immutable objects help with parallel programming. While I’m sure this is true in some respects, I don’t think that it happens automatically. There are many scenarios where you truly do need to share state between threads, and if a change creates a new object, both threads will need to be synchronized in some manner to both use the new object.

Command-query separation is something that I try to adhere to in my day to day programming. As Meyer mentions, I feel that the referential transparency helps with the readability and maintainability of the code. When hidden side effects happen from a query operation, it can be very difficult to tell from reading a client of the guilty method. The only time that I return a value from a state-modifying command is when it’s some sort of error code or success/failure flag.

Eiffel’s nonconforming inheritance sounds, which disallows polymorphism, sounds extremely bizarre to me. I definitely plan on reading up on Meyer’s decision on doing things that way in Eifel since he seems to have many interesting ideas around OO design from various books and articles of this that I’ve read, though I do often disagree with him or find his coverage to be a bit academic and verbose.

Meyer mentioned a benefit of OO languages being that they support dynamic binding and thus “remove the need for client classes to perform multibranch discriminations to perform operations.” Although the pattern matching statements that functional languages like OCaml provide are extremely powerful, I find that they can be much more verbose and, as a result, more confusing compared to a solution that used dynamic dispatch.

Wednesday, October 7, 2009

Refactoring Sequential Java Code for Concurrency via Concurrent Libraries

I’m not sure if there is any task in software engineering that is easier to retrofit into an application than it is to implement at the start of the project that was designed for it. If the task, e.g. introducing parallelism, takes more time than the alternatives, e.g. writing sequential code, it may be wise from a cost/benefit analysis to go back and implement the more costly method only in the spots that start to cause you pain, e.g. excessive slowness, or once you cross the threshold where the cost of delaying the implementation of the more costly method overtakes the cost of the implementation of the costly method (I’m trying to set a world record for the most uses of the word “cost” in one sentence), which I believe is called “the last responsible moment” in the Lean methodology.

The big advantage of using parallel libraries is that very smart people spend a very long time solving very hard problems for you. The disadvantage is the APIs presented by a library are necessarily generic and may not be the absolutely most ideal solution to your problem. However, it is usually worth taking the minor hit of using a somewhat suboptimal solution in order to leverage the knowledge of the experts and avoid making costly mistakes yourself.

The disadvantage to having these types of refactorings done completely automatically is that there will always be some edge cases where for one reason or another you do not want to follow the usual pattern. Though I can’t think of any specific examples against automating the three refactorings presented, there’s *always* some exception to the rule. The advantage is that it saves you time & effort.

Beautiful Architecture Chapter 12: When the Bazaar Sets Out to Build Cathedrals

Making the call to completely redesign KDE to fully utilize the power of Qt 4 was, as the author says, a very risky decision. I have worked on a few projects that contemplated such a task, some which made the decision to do a rewrite and some which didn’t.

A project that made the decision to rewrite the system from scratch was a ColdFusion web application that was relatively feature complete. The decision was made to rewrite the system using the most modern .NET framework, ORM tools, and AJAX. The decision makers felt that the project could not successfully continue on as a ColdFusion app due to the relative scarcity and skill set of CF developers compared to those who work in Java or .NET. A major factor in the decision to rewrite the system was the fact that a freeze has been put on existing functionality in the current CF system. This meant that we had a fixed target to shoot at for our base functionality.

A project that did not make the decision was a VB.NET (yuck) web application that had a crippling dependency on hundreds of stored procedures that were hundreds of lines long for all business logic. The complete absence of unit tests made it terrifying to have to implement any change to the system, no matter how small. The system was also constantly in a state of flux, adding dozens of new features every month. The decision was made to make a focused effort to re-architect and upgrade the system in place both because the ideal target framework for their developers was .NET, which entailed few fixes to upgrade from 2.0 to 3.5, and also because the amount of new functionality constantly being added to the system meant that we had a moving target to hit whose base functionality was constantly moving farther and farther away.

I think the bazaar model works for open source projects because the labor is free to the project owners. That means that different implementations of a solution can be implemented and accepted or rejected with no cost to the owners. With an enterprise application where someone is paying the bill for every hour spent on development, a more cathedral-like style of management is necessary up front to ensure that minimal time is wasted on solutions that would not be acceptable in the end project. That isn’t to say that cathedral-style projects never to any prototyping or always get it right on the first try, but the design and architectural guidance provided by the key members should keep waste to a minimum.

That said, I didn’t get the impression from this chapter that KDE doesn’t have strong cathedral influences. The chapter said that high-level design decisions would be made by a core set of decision makers on an infrequent basis and lower-level decisions would be made on a more frequent basis by a more distributed set of decision makers. Although the decision making isn’t entirely centralized, there still sounds like there is quite a bit of cathedral-style planning taking place.

I found that having the chapter focus on both the design process as well as technical implementation details gave quite a bit more context than other chapters, but I also feel that the volumes of information made the chapter feel as if it would never end. I found the ideas of keeping data and control pipelines separate & initially using IMAP to transfer data quite interesting, though I’m sure the former is likely a common practice.

I didn’t get as much from the ThreadWeaver section, nor was it talked about as much during the class discussion. Although it was noticeable shorter than the Akonadi section, it presented an interesting interface for parallelizing multiple instances of tasks while still making it easy for each task to execute sequentially.

Saturday, October 3, 2009

Beautiful Architecture Chapter 11: GNU Emacs

I think that a system like Emacs could be created in a closed-source system give the right integration points.

I don’t think Firefox will replace Emacs. It seems like an unnatural hack to develop Lips or C++ programs in Firefox. Sure it could be done, but that doesn’t mean that it should be done.

This chapter sucked. Thankfully it was shorter than the last few boring chapters & the other reading for the day was quite interesting. If I had to come up with something that was maybe useful it would be that simplicity and openness can be a plus for certain types of apps. The only thing that I found noteworthy in the whole chapter was how terrible LISP looks. After listening to the class discussion, I still wasn’t able to get anything useful from other people’s experience.

Thursday, October 1, 2009

A Java Fork/Join Framework

I was scared at first by Alexander's first question that this paper would be obsolete, but I was pleasantly surprised to find that the concepts in the paper and the concepts in the reading on the new Java functionality that he provided complemented each other nicely. I think fork/join and other similar patterns like the ones outlined in the OPL pattern will be necessary in modern languages. As parallelizability across threads, cores, processors, & machines becomes the main mode of increasing performance and scalability, the standard mutex and monitor paradigms offered by most languages today are insufficient to prevent the average developer from running into severe problems with deadlock, starvation, and other concurrency-related problems.

I think that it is definitely a worth-while trade-off to present developers with canned approaches to concurrency problems that may not align perfectly with their given tasks. C++ developers were initially skeptical that managed languages like Java and C# could ever perform well, but advances in computer hardware, garbage collection, and virtualization technologies to name a few have overcome most performance problems to make managed languages perform beyond the needs of typical developers. I think something similar will happen with concurrency-related patterns & frameworks; the need to have absolute control will be overshadowed by advances in hardware and software technology so that the general patterns will work well enough.

I'm honestly not sure if I would use the fork/join API presented by the author because I skipped all the code samples. I was more interested in the underlying ideas of how fork/join frameworks work, especially novel ideas such as work stealing. Also, since Java will be coming out with support for fork/join, I didn't see any need to take a look at an interface that won't make it beyond research.

I have the feeling that faster interfaces for just about anything can be developed for C++ than a managed language like Java. I'm sure this rule-of-thumb breaks down for some specific scenarios, but it seems reasonable in general. However, the convenience offered by Java in terms of automatic garbage collection and portability can often outweigh the performance gains that C++ would give you. It depends if you need a bleedingly fast system like something that processes trades for the New York Stock Exchange, or if you just need to crunch a lot of numbers for business reporting purposes and a general fork/join implementation would give you some performance gains over a single-threaded app. But I'm a consultant; my standard answer to questions like this is always "It depends."

I wasn't exactly clear on how forking, dividing work up, breaking it down, and stealing all worked. I got the feeling that some of the large blocks that got stolen would be dependent on smaller tasks that weren't (e.g. one of the big tasks at the top is merge sorting all the data and one of the smaller tasks is directly sorting 5% of the total data), but that doesn't seem like it would work at all so my understanding is wrong somewhere. I'll have to re-read the paper and listen to the class discussion to get a feel for the pros and cons of this model.

As for patterns in the paper, I believe that fork/join is a pattern in and of itself, though that's a bit of a cop-out answer.

I found it odd that the fork/join pattern becomes so problematic when many of the workers empty out their own queues and can't find much work to steal. I was naively thinking that using a monitor to block them and then signaling the monitor when data is ready would work, but I'll have to think a little harder on why that won't work.

This paper was much better than reading about how Guardian did a bad job of naming things and allocating bits :)

Tuesday, September 29, 2009

Our Pattern Language

I liked the idea of crating patterns that can be used as modules to abstract the parallel programming problems, and I liked the tone of the paper. However, I don't feel that it ever got to a level of detail that would provide me with any useful information. Some of the patterns listed (backtrack, branch, and bound; dynamic programming; process control) will be jumping-off points for further reading for myself, but I need to get somewhat deeper into the patterns.

I was a fan of the way that the paper created patterns for different types of programmers rather than trying to completely abstract the parallelism from the typical developer or exposing too much detail in order to target parallel programming framework developers.

Parallelism is becoming a hot topic in computer science and software engineering today because processors are reaching the limits of how fast they can get on a single core. Increases in speed now must come from focusing on utilizing multiple processors or processor cores.

I think my personal challenges in programming parallel processors come from the fact that these applications do not execute deterministically. It's often next to impossible to duplicate timing-related bugs in a controlled development environment. If there were better tools and support from languages and patters, these bugs would be encountered less often because more robust code would typically be written & these difficult bugs would be less frequent.

Sunday, September 27, 2009

Beautiful Architecture Chapter 10: The Strength of Metacircular Virtual Machines: Jikes JVM

I found this chapter terribly boring. I don't understand how anyone who's not building a compiler or VM would glean any useful information from this chapter, which will likely never be me nor the majority of this book's readers.

The only thing that I found noteworthy was the challenging of the idea that garbage collection is slower than memory management. The author asserts that not only do managed environments perform at least as well as explicitly managed memory models, but they may even perform better. This was based not only on the fact that automatic garbage collection is an advanced area of research in computer science that has been around for many years now but also on the fact that explicit memory mangers are much more likely to have problems with memory fragmentation.

One advantage of having the JVM be self-hosting is that communication between the runtime and its applications is made easier by the fact that they both use the same language and care share data structures and communication paradigms with no need for a translation layer.

If Jikes were rewritten today, it should be able to leverage the threading model of modern operating systems rather than having to implement a "green threading" model. The author says that "The primary disadvantage is that the operating system is not aware of what the JVM is doing, which can lead to a number of performance pathologies, especially if the Java application is interacting heavily with native code (which itself may be making assumptions about the threading model)." There is no need for this disadvantage given modern hardware and software support for multithreading, multicore, and multiprocessor systems.

Saturday, September 26, 2009

Beautiful Architecture Chapter 9: JPC

Emulating an x86 in a JVM and getting speeds that are in any way decent sounds just about impossible, but this chapter walked through a lot of details on how they made it work.

I'm still not too clear on the difference between an emulator and a VM. I wish the author would have clarified that a little. I wish the author would have put the "Ultimate Flexibility" and "Ultimate Security" sections up front, because the whole time that I was reading the chapter, I was wondering what the big benefits are. Even after reading those sections, though, I'm still not clear on the emulator's advantages over a VM. One thing that they were touting as a feature was being able to snapshot the system in just about any state. However, I know that the infrastructure team at my company does something similar with VMs all the time. I'm guessing that the difference is that the VM snapshots don't capture active processes but the emulator snapshots do. Pretty cool features, though I'm still not sold on the practicality of the whole system.

The fact that this emulator would be running in the JVM's sandbox seemed like it would be a major problem because the JVM would restrict many operations that I imagine the emulator would try to perform. Some of these problems were mentioned in the Xen chapter on why hardware or software support was necessary for the hypervisors to work correctly and trap restricted operating system calls...though as I'm writing this I'm realizing that the problem there was calls failing silently without the hypervisor being able to trap them, but the emulator is going to see every single instruction. I'm still not sure how the emulator would be able to handle requests for access that the JVM prevents the emulator from carrying out.

Although I don't typically work in Java on a day to day basis, I found all of the low-level optimization techniques interesting. However, from a higher level design and architecture point of view, I didn't take away as much from this chapter. I did become much more familiar with the JVM, class loaders, etc., but that wasn't quite what I was expecting.

Thursday, September 24, 2009

The Adaptive Object-Model Architectural Style

The Adaptive Object-Model architecture sounds like a very interesting concept, but I have not have much success with it in the past.

The AOM architecture moves the definition of entities, their attributes, and their relationships into the database. The code deals with very generic types and has to know how to interpret the model found in the database. In my experience, this can lead to a very flexible data model whose flexibility is not realized at all in the code. Every system that I have ever worked on has ended up with too many special cases too be able to be successfully modeled by such a generic system. However, the system that I worked on did not try to model strategies and rule objects in the database, which was likely its primary reason for not being able to work generically with the data model.

Tuesday, September 22, 2009

Big Ball of Mud

The big ball of mud that I worked on became a terrible mess for one key reason: no one on the original development team knew anything about good design. They were a bunch of DB guys who wrote 1000 line long stored procedures from hell and 1000 line long methods that were absolutely impossible to comprehend as a whole. Over time, subsequent developers just stuck with the anti-patterns in place because consistency seemed better than having many different paradigms littered throughout the code. The complete absence of unit testing made refactoring practically impossible.

I think that it is obviously more involved to build a system with a good architecture up front than to have no architecture at all. Projects that I’ve worked on have had good success by being Agile/Lean about it and implementing just enough architecture at the last responsible moment. We typically come up with a layered architecture and an idea of what levels of abstraction & functionality will go in each and how they will be coupled. Other than that, we try to let the architecture evolve somewhat organically. Pair programming, peer code reviews, and architectural reviews all look for opportunities to refactor to patterns. As long as we keep as we keep the public interfaces as small as possible and keep unit test coverage high, this refactoring usually isn’t too difficult.

I disagree with the “Make it work, make it right, make it fast” mantra. I think that repeatedly cycling between “make it work” & “make it right” in very small increments leads to good systems. However adding “make it fast” onto the end implies to me that after you finish one round of “making it right” and before you begin “making it work” again, you should do performance optimization, which I believe should typically be delayed until it’s needed. I much prefer the TDD mantra of “red, green, refactor,” which boils down to “make it, make it work, make it right,” with the crucial assumption that the “red, green, refactor” loops are as small as possible.

Throwaway code is rarely thrown away. I detest implementing quick prototypes where deadlines are tight and design is naught because the code always makes its way into a production system and it rarely gets refactored because “it just works” and the design often prohibits adding unit tests to facilitate refactoring.

Sweeping it under the rug is never a good thing.

I am currently trying to lead an effort to reconstruct the big ball of mud that I was stuck working on for two days per week for two years. The project has taken on so much technical debt, that it takes an immense amount of time to implement the simplest features. It is terrifying to make any changes to the behemoth because there are no automated unit tests, and it is typically very difficult to figure out where the code is actually used by an end user and if the output has changed.

Monday, September 21, 2009

Beautiful Architecture Chapter 8: Guardian

The disadvantage to Guardian’s naming system was that it was overcomplicated, inconsistent, and all around sucked. There were no advantages to it.

I’ve never worked on a system that has needed much fault tolerance. However, when reading the section about the messaging system, it reminded me immensely of the good things that I’ve heard about Erlang leading to very reliable systems due to its message passing style of communication.

Guardian was terribly insecure.

Guardian’s checkpoint system seemed terrible for several reasons. It struck me as a huge flaw that the programmer had to explicitly invoke checkpoints. A decent system should hide that. If it forces you to do this manually, you think that would remove some overhead so that the system could perform well.

I thought it was very interesting that EXPAND and FOX allowed the system to scale to more processors with trivial ease, though apparently the performance gains were not quite what you would expect.

Overall, this chapter was a big waste of time. The book is named “Beautiful Architecture,” not “Grotesque Architecture.” There was very little to take away other than the fact that this OS failed because it was poorly designed. No good lessons, though.

Friday, September 18, 2009


This paper may have had some significance back in its day, but it seemed like nothing more than Common Sense 101. It didn’t add anything to what I consider the standard pattern of layering your architecture to group certain concerns and layers of abstraction together and decouple different layers.

Refactoring to this pattern will become exponentially more difficult as the project progresses. Trying to refactor a big ball of mud would be extremely difficult because of all the coupling.

I’m not quite sure of the difference between this pattern and pipes and filters other than pipes and filters seemed to be talking about having all the code at the same low-level of abstraction with quick & small filters, while this paper was talking about larger modules at different levels of abstraction. The OSI example seemed to wrap the data from each layer, while the streaming multimedia examples of pipes and filters operated on the same data over and over.

My professional experience as far as planning the architecture has matched the authors’ recommendations. It is extremely difficult to work from the bottom up and predict what low-level services you will need to provide the higher layers without knowing the services that the higher layers will provide. The yo-yo way of working from top to bottom back to top & repeating has served me well.

I have also found that having layer J depend on any layers other than J-1, whether they be above J or below J-1, is a slippery slope and degrades most of the benefits of this pattern.