This article was first published in 2014
The Proposition
The mirroring of software execution behavior, as performed by Simz (online) and Stenos (offline), has the potential to be one of the most significant advances in software systems engineering. Its impact could be as far-reaching as that of distributed computing.
Data is routinely replicated across machine boundaries today, but what about execution behavior?
While remote procedure call (RPC) middleware has allowed us to move processing across process and machine boundaries at a very coarse granularity, these calls do not necessarily represent the replication of software behavior but merely a form of service delegation.
Mirroring in this article refers to the simulated online or offline playback of a software’s execution behavior.
When a thread performs a local method call, the call is mirrored in one or more “paired” runtimes. Within such a runtime, a paired thread is created, if one does not already exist, to mirror the actual application thread and, when online, push and pop stack frames in tandem as they occur within the application.
This does not mean that the mirroring runtime needs to be implemented in the same programming language as the application. A C# application can be mirrored within a Java runtime if the mapping maintains the exact representation of the execution flow, consisting of events marking entry and exit into methods within a particular thread. As you will read later, there is no reason the flow cannot represent organizational and human activity.
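To make the mechanism a little more concrete, here is a minimal sketch, in Java and not the actual Simz or Stenos API, of how a mirroring runtime might consume a stream of method entry and exit events and maintain a paired call stack per application thread. The event shape, class names, and method names are all invented for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// A hypothetical, language-neutral event: a thread either enters or exits a named method.
record FrameEvent(long threadId, String method, boolean enter) {}

// A minimal mirroring runtime: it knows nothing about the application's code;
// it only replays stack frame pushes and pops for each paired thread.
class MirrorRuntime {
    private final Map<Long, Deque<String>> pairedStacks = new ConcurrentHashMap<>();

    void accept(FrameEvent e) {
        // Create the paired thread's stack lazily, on first sight of the thread.
        Deque<String> stack = pairedStacks.computeIfAbsent(e.threadId(), id -> new ArrayDeque<>());
        if (e.enter()) {
            stack.push(e.method());   // mirror the push of a stack frame
        } else {
            stack.pop();              // mirror the pop of a stack frame
        }
    }

    Deque<String> stackOf(long threadId) {
        return pairedStacks.getOrDefault(threadId, new ArrayDeque<>());
    }
}

public class MirrorDemo {
    public static void main(String[] args) {
        MirrorRuntime mirror = new MirrorRuntime();
        // Events as they might be streamed, online or read back from an offline recording.
        mirror.accept(new FrameEvent(1, "OrderService.place", true));
        mirror.accept(new FrameEvent(1, "PaymentGateway.charge", true));
        System.out.println(mirror.stackOf(1));   // [PaymentGateway.charge, OrderService.place]
        mirror.accept(new FrameEvent(1, "PaymentGateway.charge", false));
        mirror.accept(new FrameEvent(1, "OrderService.place", false));
    }
}
```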
While some contextual data that drives execution behavior can be mirrored, replicating the software execution behavior does not produce the same side effects or outputs as the actual application runtime. The playback is immutable in that a mirrored action cannot change the course of what has already occurred, but mutable in that it can augment what has already occurred with new side effects that potentially extend beyond the application’s reach.
In the mirrored runtime, software engineers can inject new code routines into interception points, typically the entry and exit of stack frames, and perform additional processing of such behavior as if the injected code had been within the application when the execution behavior occurred (the illusion of time). This is not to say that the simulated environment fully constitutes the state of the actual application runtime. The application code, itself a form of state, does not exist here. The simulation merely plays back thread stack frame push and pop operations.
Within the frames, nothing happens other than more nested stack frame operations, and these are driven by further streamed events, not by the content of a frame or code block, which does not exist in the simulation.
What an engineer sees at the point of interception is the thread(s), the call stack, including the current call frame, and some contextual data accessible from the thread’s environment interface. The playback is like a video recording of execution behavior, and as such, we can’t directly touch, feel, or change what is captured in a frame. We can, however, augment a frame, much as Hollywood movies requiring special effects are augmented in post-production.
Augmentation is made far more straightforward when the mirroring and playback are based not on the capturing and rendering of pixels but on event recordings taken from actual motion detectors. The instrumentation probes, added into applications at runtime, are the motion sensors of software behavior. Likewise, we don’t need to be concerned with every minor and possibly unobservable change in state or behavior between such recorded motion points, much as the video equipment used to record the original scene does not need to be of the same make and model as that used for playback and augmentation. The augmentation that can be performed within the mirrored runtime allows us to mash up behaviors across space and time.
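As an illustration of such an interception point, the following hypothetical sketch (again, not the product API) shows a callback that a playback engine could invoke at each mirrored frame push and pop, adding a new side effect, here an audit line, without altering the recorded behavior itself.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A hypothetical interception point: called by a playback engine at each
// mirrored stack frame push (enter) and pop (exit).
interface FrameInterceptor {
    void onEnter(long threadId, String method, Deque<String> callStack);
    void onExit(long threadId, String method, Deque<String> callStack);
}

// An example augmentation: emit an audit line whenever a sensitive call is replayed.
// The recorded behavior itself is untouched; only new side effects are produced.
class AuditInterceptor implements FrameInterceptor {
    @Override
    public void onEnter(long threadId, String method, Deque<String> callStack) {
        if (method.startsWith("PaymentGateway.")) {
            System.out.printf("AUDIT thread=%d call=%s depth=%d caller=%s%n",
                threadId, method, callStack.size(),
                callStack.isEmpty() ? "<root>" : callStack.peek());
        }
    }

    @Override
    public void onExit(long threadId, String method, Deque<String> callStack) {
        // No augmentation on exit in this example.
    }
}

public class InterceptDemo {
    public static void main(String[] args) {
        FrameInterceptor audit = new AuditInterceptor();
        Deque<String> stack = new ArrayDeque<>();
        stack.push("OrderService.place");                  // a frame already replayed
        audit.onEnter(1, "PaymentGateway.charge", stack);  // augmentation fires here
    }
}
```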
Before going further, it is worthwhile to compare software execution recording and simulated playback with metric monitoring consisting of measurement collection and reporting.
Metrics are questions decided on before an actual event or behavior occurs. They can be formulated well before the software is even developed or deployed. Metrics typically sample a counter that tracks a series of events or measures over some time window, e.g., transactions per second. Metrics don’t record execution; they count.
Many users of metrics don’t understand the what and how of such counting, and rarely are metrics tested to the same degree as the functionality within an application or service. Metrics are far removed from the underlying software execution behavior. A metric could count the number of scenes in which an actor appeared in a movie, but not the behavior of that actor and of others present in the scene, how each interacted, or the continuations across scenes. There is no playback of action with metric monitoring, only the rendering of metric measurement samples.
Even if metrics were simulated, the playback would only be a replay of the metric collection itself and not of the actual software execution behavior indirectly represented by, and hidden behind, the metric. On the other hand, the simulated playback of a recording allows us to reconstruct the entire software execution behavior and create new measurements and metrics on the fly from the replayed behavior. Questions can be formulated after the event or execution and continuously refined across multiple playbacks. What, when, and how such actions should be counted is deferred until needed or known. When played back, we experience the behavior through observation, not a collection of counters and gauges, as is the case with metrics.
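A small sketch of this deferral, using an invented recording of method entry events, shows a metric (submissions per second) being formulated only at playback time, after the behavior has already occurred.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A hypothetical recorded event: method entry within a thread at a wall-clock time.
record Enter(long threadId, String method, long epochMillis) {}

public class DeferredMetricDemo {
    public static void main(String[] args) {
        // A tiny stand-in for a recorded playback stream.
        List<Enter> recording = List.of(
            new Enter(1, "CheckoutService.submit", 1_000),
            new Enter(2, "CheckoutService.submit", 1_400),
            new Enter(1, "CatalogService.search", 1_600),
            new Enter(3, "CheckoutService.submit", 2_200));

        // The "metric" is formulated after the fact, against the replayed behavior:
        // count submissions per wall-clock second, a question never asked when recording.
        Map<Long, Long> submitsPerSecond = recording.stream()
            .filter(e -> e.method().equals("CheckoutService.submit"))
            .collect(Collectors.groupingBy(e -> e.epochMillis() / 1000, Collectors.counting()));

        System.out.println(submitsPerSecond);   // {1=2, 2=1}
    }
}
```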
Let’s explore how simulation can change the Past, Present, and Future of how we engineer and manage.
The Past
Assume we have previously mirrored the software execution behavior of an application to a file or to a socket that streams to a file or some other persistent store; here are a few possible usage scenarios:
Recruits in Ops are placed in a flight simulator, a dashboard fed by the simulated playback of execution, and tasked with observing one or more applications perform. During the simulated playback, they are questioned on what they can perceive, comprehend, and predict.
After encountering a problem in production, Ops uses the simulated playback to check whether new alerting rules created as a result of the incident will indeed fire at the appropriate point in the past, when the incident occurred, and hopefully will fire again should it reoccur. A similar use case exists for new metrics or software analytical insights.
After a failed IT audit, the development team uses the simulation to go back in time and recreate the necessary audit traces that were omitted in the source. Here, the new output is generated from past input.
The performance engineering team uses hooks into the simulated playback to schedule the execution of load test scripts at more realistic volumes and velocity.
The test team creates a recording, with some contextual data capturing results or behaviors not directly visible from the unit test code, and uses the simulated playback to perform delta analysis, not just on the returned values or state changes but on the software’s resulting behavior in executing the tests. Instead of asserting the values exposed at function call boundaries, the team creates deep inspection rules on the expected call behavior.
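Such a deep inspection rule might, for example, assert the ordering of calls replayed within a test, as in this hypothetical sketch with invented service and method names.

```java
import java.util.List;

// A hypothetical deep inspection rule evaluated against a replayed call sequence
// (method names in entry order within one test thread), rather than return values.
public class BehaviorAssertionDemo {
    // Rule: a replayed order placement must reserve inventory before charging payment.
    static boolean reservesBeforeCharging(List<String> replayedCalls) {
        int reserve = replayedCalls.indexOf("InventoryService.reserve");
        int charge  = replayedCalls.indexOf("PaymentGateway.charge");
        return reserve >= 0 && charge >= 0 && reserve < charge;
    }

    public static void main(String[] args) {
        List<String> replayed = List.of(
            "OrderService.place",
            "InventoryService.reserve",
            "PaymentGateway.charge");
        System.out.println(reservesBeforeCharging(replayed));   // true
    }
}
```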
The development team tasked with modularizing a monolithic system into multiple micro-services uses the simulated playback of past execution behaviors to identify candidate services using captured runtime call dependencies across components and packages as well as a cost impact assessment based on the frequency of interaction across proposed service boundaries.
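For illustration, a crude form of such a cost impact assessment could simply count how many recorded caller-to-callee interactions would cross a proposed service boundary; the packages and call pairs below are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// A hypothetical cross-boundary cost assessment: given recorded caller -> callee
// pairs (by package) and a proposed service boundary, count how many recorded
// interactions would become remote calls.
public class BoundaryCostDemo {
    record Call(String callerPackage, String calleePackage) {}

    public static void main(String[] args) {
        List<Call> recordedCalls = List.of(
            new Call("shop.catalog", "shop.pricing"),
            new Call("shop.checkout", "shop.pricing"),
            new Call("shop.checkout", "shop.payment"),
            new Call("shop.checkout", "shop.payment"));

        // Proposed service: pull these packages out of the monolith.
        Set<String> proposedService = Set.of("shop.payment", "shop.pricing");

        Map<Boolean, Long> byCrossing = recordedCalls.stream().collect(
            Collectors.partitioningBy(
                c -> proposedService.contains(c.calleePackage()) != proposedService.contains(c.callerPackage()),
                Collectors.counting()));

        System.out.println("calls crossing proposed boundary: " + byCrossing.get(true));   // 4
        System.out.println("calls staying local:              " + byCrossing.get(false));  // 0
    }
}
```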
The performance engineering team uses simulated playback to assess a proposed external service integration that would require the development team to add integration calls directly into their application code. The integration is first done in the simulation so that performance engineering can determine the impact of request latency and the additional resource consumption involved. Because playing back a recording consumes minimal system resources itself, this assessment is accurate. The team plays back the simulation with and without the integration and then compares resource consumption across simulated playbacks. The team also uses the playback to test the performance and reliability of the integration endpoint before the code is deployed to production.
The business plans to move to another service provider after serious availability issues with a SaaS APM solution. Ops uses the simulated playback runtime to feed the proposed new SaaS APM with past data and then compares the reporting of both vendors on the same underlying software execution behavior. For a period, they feed both services the same daily recorded software execution behavior, allowing operations staff to transition gradually to the new visualization and reporting capabilities. This is made possible by extensions to the simulation that make the necessary API calls to the backends at the point of a stack frame operation.
To help resolve an intermittent problem in production with a particular third-party library, Ops creates a filtered recording from a simulated playback of production, which only includes those calls to the library itself. This limited recording is then sent to the third-party vendor for analysis by the support team using the same simulated playback engine and observation tools. A similar use case exists for internal component and platform teams.
The Present
Finding it impossible to get a handle on what is happening in their distributed systems, the Ops team decides to mirror and project the software execution behavior of every application process into a single simulated runtime that is augmented with simple but powerful sensors and alerts, offering near real-time automated diagnosis across the entire space of the distributed system.
After firefighting many performance and reliability issues with non-critical or non-functional service integration, the development team decides to move integration code out of the application and into a simulated playback environment and, in doing so, defer the playback to a less busy time window. This is achieved with minimal change to the original integration code.
The engineering team partitions the system into two domains to increase agility in developing enhancements and integrations while ensuring reliability. The first domain is far more stable and reliable, and the second domain runs as a real-time mirrored simulation, allowing for greater ad-hoc and online experimentation via integrating dynamic business rules into the interception points of the simulation.
After significant delays in resolving production problems, Ops is pressured to allow developers access to production, including installing developer tools within the environment. Reluctant to allow such unfettered access, Ops creates a near real-time mirrored simulated environment from which developers can inspect the behavior of their code within production at any moment without giving them direct access to the application and the machines running them.
The system engineering team is frustrated by sub-optimal network load balancing in the routing of service requests to different nodes, because the load balancer is blind to the state of internal processing queues within the applications as well as to the chain of service interactions behind each particular service. It decides to develop a new load balancer that uses the mirrored simulation environment as the primary source in determining the present outstanding queued work items and the estimated time at which such items will be scheduled and completed on a per-node basis. The simulated environment drives the workload to the real applications, and the applications project their execution behavior back to the simulated environment, which in turn drives the routing of more or fewer service requests: a feedback loop between the actual and simulated worlds.
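A deliberately simplified sketch of the routing decision, with invented node names and queue depths standing in for the mirrored view, might look like this.

```java
import java.util.Map;

// A hypothetical routing decision driven by the mirrored view: pick the node
// whose mirror currently reports the least outstanding queued work.
public class MirrorAwareRouterDemo {
    public static void main(String[] args) {
        // Outstanding work items per node, as projected by each node's mirror.
        Map<String, Integer> mirroredQueueDepth = Map.of(
            "node-a", 12,
            "node-b", 3,
            "node-c", 7);

        String target = mirroredQueueDepth.entrySet().stream()
            .min(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .orElseThrow();

        System.out.println("route next request to " + target);   // node-b
    }
}
```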
Following the repeated crashing of applications without sufficient capture of diagnostic data, Ops creates a universal launch script that starts a mini-me simulated mirrored process before starting the actual application process. When the application process does crash, the engineering team only needs to inspect the mini-me simulated process to determine what was happening on all thread stacks within the application before it crashed.
Eventually, the engineering team extends the mini-me process to take on a supervisory role in assessing the likelihood that an incident will occur, alerting Ops and an application management solution that preemptively readies a new service instantiation.
The Future
When an application starts up, it will connect to the simulated world and download a digest of past software execution behaviors, which it will then use to train its internal self-adaptive and self-regulating systems. Self-awareness will extend across the life cycles of an application process.
All devices and users connected to a software service will be mirrored in a simulation world. The software execution is a reflection of the user’s actions. The simulated playback is a mirroring of the software. The simulated reproduction is thus a mirror of the user and his device. There will be many simulated parallel worlds.
Each world will be mined more naturally and immediately to assess the effectiveness of dynamically injected behavioral influences, with the results signaled back to the actual applications. These worlds will serve as a proxy to the physical world, though the time and space dimensions may be altered to make each world appear whole and current, and to circumvent abuse and unwanted alteration. Companies will offer paid access to such parallel worlds, both online and offline. Active agent technology will be deployed into the simulated worlds.
The push for faster “real-time” feedback loops between software machines and man will result in the projection of both their behaviors into the same simulated universe. Within this simulation, a typical behavioral model consisting of actors, activities, and resources will unify both worlds sufficiently, allowing a business to monitor and manage operations oblivious to the actual nature of an actor.