Debugging and monitoring a SOA
Flexibility and easy adaptation to changing business needs are often stated to be essential properties of a service-oriented architecture. These properties can indeed be found at a high level. By using a tool to model business processes, one can gain a lot of flexibility. You can, for instance, model a business process in the Mendix Business Modeler and, assuming the services you need already exist, deploy this process on the Mendix Application Server in a few clicks.
Sounds easy! But the realists among us will have some reservations: how do we check whether our orchestration is correct?
This is a legitimate question, because at a more technical level we are dealing with an asynchronous distributed system. A serious challenge in such a system is the detection of stable properties such as deadlock or termination. To debug and monitor an asynchronous distributed system we need some notion of a global state. In a centralized system with a single processor, determining the global state is trivial: the processor can simply inspect the contents of its memory, or the application the values of its variables. If you have ever used an IDE such as Eclipse to debug an application, you are familiar with its convenient debugging facilities: just add some breakpoints and inspect the contents of every variable at any desired moment. Unfortunately, for a distributed system this is not so easy.
Global states in asynchronous, distributed systems
When trying to determine the global state of an asynchronous system, two main problems can be distinguished. First, simply sending each process (i.e. each autonomous service) a message telling it to record its own local state won't work, because there is no way to synchronize these recordings. Second, there can be messages in transit that should be included in the state. By state we mean the joint states of all processes and all communication channels in the system. The state of a process can be defined as the contents of (a part of) its memory, and the state of a channel is some subsequence of the sequence of messages sent along it.
For instance, consider two bank accounts, P1 and P2. P1 transfers money to P2, as depicted in figure 1; each number is a money transfer of that amount from P1 to P2. If P1 records its own state after sending 78, 759 and 23, and P2 records its own state after receiving 78, the state of the channel is [759, 23]: the messages sent by P1 but not yet received by P2, in FIFO order.
Figure 1 – Example bank accounts
This example shows that the state of a channel should be defined as the sequence of messages still in the pipeline, i.e. the messages already sent by P1 but not yet received by P2. Research has shown that determining a global state the asynchronous system has actually been in is, in general, not feasible without halting the system. Chandy and Lamport, however, designed an algorithm that finds a global state the system might have been in [1].
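The channel state in the bank-account example can be sketched in a few lines of Python. The lists and names below are illustrative, not part of any real system:

```python
# Sketch of the bank-account example: the state of the channel is the
# sequence of messages sent by P1 but not yet received by P2.
sent_by_p1 = [78, 759, 23]   # P1 records its state after these three sends
received_by_p2 = [78]        # P2 records its state after this one receive

# With a FIFO channel, the in-transit messages are exactly the suffix of
# the sent sequence that P2 has not yet consumed.
channel_state = sent_by_p1[len(received_by_p2):]
print(channel_state)  # [759, 23]
```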
Chandy and Lamport algorithm
The idea of the algorithm is as follows. We model each process as a node in a graph. The directed edges in this graph are the unidirectional channels between the processes. An example can be found in figure 2. We assume that every process is connected to every other process and that the channels are FIFO.
Figure 2 – Graph model used in algorithm
Each process can start the algorithm when it wants to record the global state. A process starting the algorithm first records its own local state and then sends a special message, a marker, along every outgoing channel. When a process receives a marker along a channel while its own state has not yet been recorded, it records the state of that channel as empty and records its own local state. Next, it sends a marker along every outgoing channel to notify the other processes, and it creates a buffer for each of its incoming channels (except for the one along which the marker was received). Every message the process subsequently receives along such a channel is entered into the corresponding buffer. If a process receives a marker while its own state has already been recorded, it records the state of that channel as the sequence of messages in the corresponding buffer. A process is finished with its part of the algorithm when it has received a marker along every incoming channel; the algorithm is finished when every process is finished.
In pseudo-code it looks like this [2]:
I. Spontaneously recording a process's state

record_and_send_markers:
    record local state
    loc_state_recorded = true
    for every outgoing channel c do
        send marker along c
    for every incoming channel c do
        create message buffer Bc

II. Receiving a marker along a channel

upon receipt of marker along channel c do
    if (not loc_state_recorded) then
        record state of c as empty
        record_and_send_markers
    else
        record state of c as contents of Bc

III. Receiving a message along a channel

upon receipt of m along channel c do
    if (loc_state_recorded and state of c not yet recorded) then
        append m to Bc
    process m
A full description of the algorithm can be found in the original work of Chandy and Lamport [1].
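To make the algorithm concrete, here is a minimal single-threaded Python sketch that runs it on the two bank accounts of figure 1. All class and variable names are illustrative, and message delivery is simulated by draining FIFO queues, so this is a toy model of the algorithm rather than a real distributed implementation:

```python
from collections import deque

MARKER = object()  # the special marker message

class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance      # local state: an account balance
        self.recorded_state = None  # snapshot of the local state
        self.channel_states = {}    # recorded state per incoming channel
        self.buffers = {}           # the buffers Bc for incoming channels
        self.outgoing = {}          # peer name -> FIFO queue (deque)
        self.incoming = {}          # peer name -> FIFO queue (deque)

    def start_snapshot(self):
        # I. Record the local state, send a marker along every outgoing
        # channel, and create a buffer for every incoming channel.
        self.recorded_state = self.balance
        for chan in self.outgoing.values():
            chan.append(MARKER)
        for src in self.incoming:
            self.buffers.setdefault(src, [])

    def receive(self, src, msg):
        if msg is MARKER:
            if self.recorded_state is None:
                # II. First marker seen: this channel's state is empty.
                self.channel_states[src] = []
                self.start_snapshot()
            else:
                # II. Already recorded: channel state = buffered messages.
                self.channel_states[src] = self.buffers.pop(src, [])
        else:
            self.balance += msg  # normal processing of a money transfer
            # III. Buffer messages that arrive after our own snapshot but
            # before the marker on this channel.
            if self.recorded_state is not None and src not in self.channel_states:
                self.buffers[src].append(msg)

def connect(a, b):
    q = deque()
    a.outgoing[b.name] = q
    b.incoming[a.name] = q

p1, p2 = Process("P1", 1000), Process("P2", 0)
connect(p1, p2)
connect(p2, p1)

# P1 sends three transfers; P2 has received only the first (figure 1).
for amount in (78, 759, 23):
    p1.balance -= amount
    p1.outgoing["P2"].append(amount)
p2.receive("P1", p1.outgoing["P2"].popleft())

p2.start_snapshot()  # P2 spontaneously starts recording the global state

# Drain all queues in FIFO order until nothing is in transit.
procs = [p1, p2]
while any(q for p in procs for q in p.incoming.values()):
    for p in procs:
        for src, q in list(p.incoming.items()):
            while q:
                p.receive(src, q.popleft())

print(p1.recorded_state)        # 140  (1000 - 78 - 759 - 23)
print(p2.recorded_state)        # 78
print(p2.channel_states["P1"])  # [759, 23]
```

Note that the recorded states are consistent: the two balances plus the in-transit transfers add up to the original 1000, even though no process ever saw the whole system at once.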
Conclusion
As the previous section showed, determining the global state of an asynchronous distributed system is not trivial, but it can be very useful to gain more insight into the system. In fact, debugging or monitoring an asynchronous distributed system like a service-oriented architecture is not possible without an algorithm such as that of Chandy and Lamport. Of course the middleware used can help solve parts of the problem, but the messages in transit should not be forgotten. Moreover, the middleware itself may well be a distributed system (otherwise it could easily become a bottleneck). Conclusion: as software architects and engineers we will always have some nice challenges.
[1] K. Mani Chandy, Leslie Lamport, Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems, Vol. 3, No. 1, February 1985, Pages 63-75.
[2] D.H.J. Epema, Distributed Algorithms. Parallel and Distributed Systems Group, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology, the Netherlands, December 2005.