We had Claude.ai compose a review, in the style of Papers We Love, of an article from 2014. The goal was to see how well the article predicted the current state of observability ten years on. Here’s a short abstract of the article:
The Past, Present, and Future Will Be Simulated: Mirroring software execution behavior, as performed by Simz (online) and Stenos (offline), has the potential to be one of the most significant advances in software systems engineering. Its impact could be as substantial as that of distributed computing.
Opening
When I first read this title from 2014, I couldn’t help but think it might be a bit over the top. However, as we’ll discover today, this paper wasn’t merely making technical predictions; it was presenting a fundamentally novel perspective on software systems, one that draws striking parallels to human consciousness and memory.
Why This Paper Matters
When this paper was written in 2014, most of us were drowning in data and metrics. We were like doctors trying to assess a patient’s health solely from vital signs – heart rate, blood pressure, temperature. Vital signs matter, but they hardly give a complete picture. This paper proposed an extraordinary concept: what if our systems could keep not just metrics, but memories? Not just logs, but experiences? What if they could learn from the past, be fully present in the moment, and envision potential futures? The author was essentially suggesting a kind of consciousness for our systems – not in the sci-fi AI sense, but in the practical sense of retaining and learning from experience, just as humans do.
The Nature of System Memory
Let me share a story about debugging a production incident. We’ve all been there, poring over logs and trying to piece together what went wrong. It’s like trying to understand a car crash by examining skid marks and vehicle damage. We’re reconstructing past events from the traces they left behind. But imagine if you could simply replay the entire incident, from any perspective and at any speed. Not just error messages and stack traces, but the complete behavior of the system. That’s the concept proposed in this paper, which introduced the idea of software episodic memory. Think about human memory for a moment. When you recall your last birthday, you don’t remember it as a series of metrics – “happiness level 8/10, cake consumption 2 slices.” Instead, you remember it as an experience, a sequence of events that you can mentally replay. This paper suggested that our systems could possess a similar kind of memory.
Past Episodes: When your system processes a transaction, it doesn’t merely record the beginning and end points; it documents the entire process. Each method call, every decision point, and every interaction becomes an episode that can be replayed and examined later (a minimal code sketch of such an episode follows these three aspects).
Present Awareness: Imagine an “omniscient observer”—a consciousness that perceives every aspect of your system’s current state, encompassing all services and components. This observer goes beyond mere metric monitoring; it comprehends the dynamic behavior of the entire system.
Future Projection: Here’s where things get truly intriguing. Just as humans can envision potential futures by recombining past experiences, the paper proposes that systems could project future behaviors based on patterns observed in their episodic memories. This isn’t merely prediction against predefined metrics; it’s simulation grounded in actual past behavior.
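To make the idea concrete, here is a minimal sketch in Java of what a software episode might look like. Everything here – the names EpisodeEvent, Episode, record, replay – is invented for illustration; the paper does not prescribe an API. The point is simply that an episode is an ordered, replayable sequence of events rather than a set of counters.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// One recorded step in an episode: a method entry/exit, a decision, an interaction.
// (Hypothetical shape - the paper does not define this structure.)
record EpisodeEvent(Instant at, String thread, String operation, String detail) {}

// An episode is an ordered, replayable sequence of events - a memory, not a count.
final class Episode {
    private final List<EpisodeEvent> events = new ArrayList<>();

    void record(String thread, String operation, String detail) {
        events.add(new EpisodeEvent(Instant.now(), thread, operation, detail));
    }

    // Replay the episode through any observer, at any later time.
    void replay(Consumer<EpisodeEvent> observer) {
        events.forEach(observer);
    }
}

public class EpisodicMemoryDemo {
    public static void main(String[] args) {
        Episode txn = new Episode();
        txn.record("worker-1", "enter", "OrderService.place");
        txn.record("worker-1", "branch", "inventory available -> reserve");
        txn.record("worker-1", "exit", "OrderService.place");

        // Later - during an incident review - the whole experience can be replayed.
        txn.replay(e -> System.out.printf("%s %s %s: %s%n",
                e.at(), e.thread(), e.operation(), e.detail()));
    }
}
```

Replaying through a Consumer rather than emitting a fixed report is the essential design choice here: the questions can be decided after the experience was recorded.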
Technical Innovation: Beyond Simple Metrics
The paper makes a distinction that initially seems subtle but is revolutionary. Let me read you this quote: “Metrics are questions decided on before an actual event or behavior occurs… Metrics don’t record execution; they count.” Consider that for a moment. When we instrument our systems with metrics, we’re pre-determining the questions we want to ask. It’s like attending a party with a clipboard, already set on counting the number of people wearing blue shirts, the number of drinks consumed, and the number of songs played. But what about the interesting, unexpected things that slip past us unnoticed? What about the patterns and behaviors that emerge without warning? This is where behavioral recording comes in. The proposed approach captures the actual behavior – not just the predetermined items we intend to count, but everything that transpires. It’s like having a video recording of the party instead of a clipboard filled with tallies.
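The contrast is easy to show in code. Below is a small, hypothetical Java example (none of these names come from the paper): the metric answers one question fixed at instrumentation time, while the behavioral recording keeps raw events so that new questions can be asked after the fact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class MetricsVsRecording {
    // A metric: the question ("how many errors?") was decided before any event occurred.
    static final AtomicLong errorCount = new AtomicLong();

    // A recording: raw events are kept, so questions can be asked after the fact.
    record Event(long atMillis, String service, String outcome, long durationMillis) {}
    static final List<Event> recording = new ArrayList<>();

    static void handleRequest(String service, String outcome, long durationMillis) {
        if (outcome.equals("error")) errorCount.incrementAndGet(); // counts, nothing more
        recording.add(new Event(System.currentTimeMillis(), service, outcome, durationMillis));
    }

    public static void main(String[] args) {
        handleRequest("checkout", "ok", 40);
        handleRequest("checkout", "error", 900);
        handleRequest("search", "ok", 12);

        System.out.println("Metric answer: errors = " + errorCount.get());

        // A question nobody thought to pre-define: which requests were slow?
        recording.stream()
                .filter(e -> e.durationMillis() > 500)
                .forEach(e -> System.out.println("Slow event: " + e));
    }
}
```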
The Mirroring Concept
Let me introduce you to one of the paper’s most elegant ideas: mirroring. Imagine watching a dance performance. The dancers move across the stage, each movement seamlessly transitioning into the next. Now imagine a second stage where shadow dancers mirror every movement of the original performers – except these shadow dancers can pause, rewind, or branch into variations of the dance. This is essentially what the paper proposes for software systems. Every executing thread in your application has a mirror – a shadow version that follows its every move. Unlike simple logging or tracing, these mirrors maintain the complete context of, and relationships between, actions. The ingenious part is that the mirror dancers don’t have to be the same kind as the originals: a C# application could be mirrored by Java components, as long as they faithfully reproduce the original’s choreography. It’s like having ballet dancers mirror break dancers – different styles, same pattern.
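Here is a toy illustration, with entirely invented interfaces, of the shape such mirroring might take: the application emits enter/exit probes, and a mirror – which need not share the original’s runtime or language – reconstructs the same choreography from the probe stream.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// The probe protocol: all a mirror needs is the choreography, not the runtime.
// (Invented interface - not the paper's actual API.)
interface Mirror {
    void enter(String thread, String method);
    void exit(String thread, String method);
}

// A shadow that rebuilds a thread's call stack from the probe stream.
// Because it only consumes events, it could as easily live in another
// process, another machine, or another language entirely.
final class ShadowThread implements Mirror {
    private final Deque<String> stack = new ArrayDeque<>();

    @Override public void enter(String thread, String method) {
        stack.push(method);
        System.out.println(thread + " mirror: entered " + method + " depth=" + stack.size());
    }

    @Override public void exit(String thread, String method) {
        stack.pop();
        System.out.println(thread + " mirror: exited  " + method + " depth=" + stack.size());
    }
}

public class MirroringDemo {
    public static void main(String[] args) {
        Mirror mirror = new ShadowThread();
        // In a real system, instrumentation would emit these probes automatically.
        mirror.enter("main", "PaymentService.charge");
        mirror.enter("main", "Gateway.authorize");
        mirror.exit("main", "Gateway.authorize");
        mirror.exit("main", "PaymentService.charge");
    }
}
```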
The Omniscient Observer
Here’s where things get philosophical. In your current production environment, who comprehends the entire picture? No one. Each monitoring tool sees only its portion, each log file narrates its part of the story, and each metric captures its specific measurement. It’s the parable of the blind men and the elephant – each touching a different part and reaching a different conclusion about what it is. The paper suggests creating a system consciousness – an omniscient observer that perceives and comprehends everything occurring across your entire system. Not just observing it, but comprehending it in context. Consider how your own consciousness functions: you don’t merely receive raw sensory data; you experience it in context, relate it to past experiences, and use it to predict what comes next.
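One way to picture the observer, again with invented types rather than anything from the paper: every service’s mirror publishes events to a single subscriber that merges them into one ordered timeline, so no vantage point is missing. (A production version would need logical clocks rather than wall-clock timestamps; this sketch glosses over that.)

```java
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class OmniscientObserverDemo {
    record SystemEvent(long at, String service, String description) {}

    // The observer's whole trick: one subscriber sees every stream, in order,
    // instead of each tool seeing only its own slice of the elephant.
    static void observe(List<List<SystemEvent>> perServiceStreams) {
        PriorityQueue<SystemEvent> timeline =
                new PriorityQueue<>(Comparator.comparingLong(SystemEvent::at));
        perServiceStreams.forEach(timeline::addAll);
        while (!timeline.isEmpty()) {
            SystemEvent e = timeline.poll();
            System.out.printf("t=%d [%s] %s%n", e.at(), e.service(), e.description());
        }
    }

    public static void main(String[] args) {
        observe(List.of(
                List.of(new SystemEvent(1, "gateway", "request received"),
                        new SystemEvent(5, "gateway", "response sent")),
                List.of(new SystemEvent(2, "orders", "order validated"),
                        new SystemEvent(3, "orders", "inventory reserved")),
                List.of(new SystemEvent(4, "billing", "card authorized"))));
    }
}
```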
The Time Machine
Let’s delve into the concept of time. In our current systems, time is usually just a timestamp – a marker of when an event occurred. The paper, however, envisions time as a dimension we can traverse. Imagine you’re investigating a production issue. Instead of sifting through logs to piece together what happened, you simply rewind the system’s experiential memory to any point in time. This doesn’t just show you the state at that moment; it lets you comprehend the journey the system took, the decisions it made, and the paths taken or not taken. But the benefits extend further. With a complete record of behavior, you can generate “what-if” scenarios. What if this service had responded more swiftly? What if this queue had been more congested? What if this circuit breaker had been configured differently? You can explore alternative realities based on actual system behavior, rather than relying solely on theoretical models.
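As a sketch of both ideas – rewinding and what-if forking – consider the following toy Java example. The Step record and the latency numbers are made up; the point is that a complete history supports both replay to any instant and counterfactual reruns under a transformation of that history.

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class TimeMachineDemo {
    record Step(long at, String service, long latencyMillis) {}

    // Rewind: show everything the system experienced up to time t.
    static void replayUntil(List<Step> history, long t) {
        history.stream()
               .filter(s -> s.at() <= t)
               .forEach(s -> System.out.printf("t=%d %s took %dms%n",
                       s.at(), s.service(), s.latencyMillis()));
    }

    // What-if: rerun the same history through a transformation of reality.
    static long totalLatency(List<Step> history, UnaryOperator<Step> whatIf) {
        return history.stream().map(whatIf).mapToLong(Step::latencyMillis).sum();
    }

    public static void main(String[] args) {
        List<Step> history = List.of(
                new Step(1, "gateway", 10),
                new Step(2, "orders", 250),
                new Step(3, "billing", 40));

        replayUntil(history, 2); // travel back to just before billing ran

        // What if the orders service had responded twice as fast?
        long actual = totalLatency(history, UnaryOperator.identity());
        long faster = totalLatency(history,
                s -> s.service().equals("orders")
                        ? new Step(s.at(), s.service(), s.latencyMillis() / 2) : s);
        System.out.printf("actual=%dms, what-if=%dms%n", actual, faster);
    }
}
```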
The Divergent Path
Now, you might be wondering why we didn’t build our observability tools this way if this vision was so compelling. Why did we end up with the current landscape of logs, metrics, and traces? The answer lies partly in our industry’s tendency toward incrementalism. It was easier to gradually improve our existing tools than to implement this revolutionary vision. It’s similar to how we kept improving horse-drawn carriages instead of immediately transitioning to automobiles. Sometimes, evolution feels safer than revolution. Additionally, there were significant technical challenges. Storing and processing complete behavioral recordings required substantial resources. Maintaining precise mirrors of complex distributed systems was arduous. Synchronizing state across real and simulated environments was complex. However, I would argue that the most significant barrier wasn’t technical – it was conceptual. We were stuck thinking about our systems as machines to be monitored rather than organisms to be understood. Our focus was on collecting data rather than capturing experiences.
The Road Back
Here’s the exciting part: I believe we’re now better positioned to implement this vision than we were in 2014. We have advanced storage technologies, sophisticated compression algorithms, powerful processing capabilities, and most importantly, a growing recognition that our current approaches to observability are inadequate for the complexity of modern systems. Consider how we utilize AI and ML today. These systems learn from experience, identify patterns, and make predictions. Isn’t that precisely what this paper advocated for our software systems? The key difference is that instead of training on synthetic or sampled data, we could train on comprehensive behavioral recordings of our actual systems.
The Wild Ideas
Now, let me share some of the paper’s most ambitious predictions. Imagine your application launching and immediately establishing a simulated environment where it downloads a comprehensive summary of past experiences—not just configuration settings, but actual behavioral patterns that it can learn from. This is akin to a newborn animal inheriting instincts from its species’ evolutionary history. Alternatively, envision a unified simulation world where both human and machine behaviors are accurately reflected and can interact. This would allow you to observe and comprehend not only how your systems function, but also how they interact with human operators and users. Furthermore, you could replay past incidents from any perspective—system, operator, or user.
Technical Challenges
Let’s examine why this vision was never fully realized. It’s like attempting to build a time machine – the concept is captivating, but the engineering obstacles are formidable. First, there’s data storage. Imagine recording every movement of every individual in a city – not just their locations, but every gesture, interaction, and decision. The storage requirements would be astronomical, and capturing complete system behavior poses a comparable challenge. But it isn’t only about storage capacity. Consider the observer effect in quantum physics, where the act of measurement alters the behavior of the thing observed. Our current APM tools already influence system performance; comprehensive behavioral recording could exacerbate that effect. It’s like studying an athlete’s performance while they’re encumbered by heavy recording equipment – the act of observation distorts what you’re trying to observe. Finally, there’s the synchronization challenge. In a distributed system, maintaining precise mirrors is like choreographing thousands of dancers across multiple stages, each with different lag and occasional outages. Even a minor desynchronization can lead to significant divergence.
The Path Forward
So, how do we proceed from here? I believe our journey requires three interconnected paths:
Technical Foundation: We need “lightweight omniscience”—ways to capture complete behavior without incurring excessive overhead. Think about how our brains function—we don’t store exact recordings of every experience, but rather the fundamental patterns and relationships. We need the equivalent for our systems.
Cultural Evolution: Our industry has been conditioned to think in terms of metrics and thresholds. We constantly ask, “Is this number too high? Is that number too low?” We need to shift our focus to patterns and behaviors. Instead of asking, “What’s the error rate?”, we should be asking, “How is the system deviating from its usual patterns?” (a toy sketch of this kind of deviation detection follows this list).
Practice Integration: This isn’t merely about adding new tools; it’s about fundamentally altering our development, testing, and operational processes. Imagine if every code review included not only a thorough examination of the code but also simulations of its behavior in production based on real historical patterns.
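To ground the second path, here is a deliberately naive sketch of what “deviating from its usual patterns” could mean in practice: profile how often each operation transition occurs in a baseline window, then score a new window by how far its frequencies drift. Real systems would use far richer behavioral models; every name below is invented.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BehavioralDeviationDemo {
    // Profile: relative frequency of each observed operation transition (bigram).
    static Map<String, Double> profile(List<String> ops) {
        Map<String, Double> freq = new HashMap<>();
        for (int i = 0; i + 1 < ops.size(); i++) {
            freq.merge(ops.get(i) + "->" + ops.get(i + 1), 1.0, Double::sum);
        }
        int total = Math.max(1, ops.size() - 1);
        freq.replaceAll((k, v) -> v / total);
        return freq;
    }

    // Deviation: total drift between baseline and current frequencies (L1 distance).
    static double deviation(Map<String, Double> baseline, Map<String, Double> current) {
        Set<String> keys = new HashSet<>(baseline.keySet());
        keys.addAll(current.keySet());
        return keys.stream()
                .mapToDouble(k -> Math.abs(baseline.getOrDefault(k, 0.0)
                        - current.getOrDefault(k, 0.0)))
                .sum();
    }

    public static void main(String[] args) {
        List<String> usual = List.of("recv", "auth", "query", "render", "send");
        List<String> today = List.of("recv", "auth", "retry", "auth", "retry", "send");
        double score = deviation(profile(usual), profile(today));
        System.out.printf("behavioral deviation score: %.2f%n", score);
        // Alert on drift in behavior, not on a single pre-chosen threshold metric.
    }
}
```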
Why It Matters Now
Why is this more relevant now than ever? Look at the systems we’re building today. Microservices, serverless, edge computing – we’re creating systems of unprecedented complexity. Our current tools are like trying to understand a rainforest through temperature and humidity readings. We need to understand the ecosystem, not just the measurements. And then there’s AI. As we integrate more AI components into our systems, understanding behavior becomes crucial. AI systems don’t fail like traditional software – they don’t throw exceptions or return error codes; they simply start behaving differently. Traditional monitoring tools just aren’t designed for that.
The New Paradigm
Imagine starting your day with a conversation with your system’s consciousness, rather than staring at dashboards filled with metrics. Instead of just receiving alerts based on predefined thresholds, you could observe behavioral patterns that deviate from the norm. When deploying new code, you could visualize how it would interact with real production behavior patterns. During incidents, you could rewind and replay the event from any angle and speed, gaining complete clarity on the root cause of the problem.
Practical Steps
So, how do we begin moving in this direction? I believe we should start with small, focused implementations: select a critical service, record its complete behavior, create its mirror, and learn from the experience. We also need more advanced compression techniques for behavioral recording. Instead of storing every detail, we should identify and store the essential patterns, much as video compression stores keyframes and movement deltas (a naive sketch of that idea follows). Finally, we need better tools for behavioral analysis – not only identifying patterns, but understanding why they arise and what they imply.
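Following the video-compression analogy, a naive sketch of keyframes and deltas applied to recorded system state might look like this (invented names, simplistic encoding): store the full state every few frames, and in between store only the entries that changed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KeyframeDeltaDemo {
    static final int KEYFRAME_INTERVAL = 3;

    // Encode a series of state snapshots as keyframes plus deltas.
    static List<Map<String, String>> encode(List<Map<String, String>> states) {
        List<Map<String, String>> encoded = new ArrayList<>();
        Map<String, String> previous = Map.of();
        for (int i = 0; i < states.size(); i++) {
            Map<String, String> state = states.get(i);
            if (i % KEYFRAME_INTERVAL == 0) {
                encoded.add(new HashMap<>(state)); // keyframe: the full state
            } else {
                Map<String, String> delta = new HashMap<>();
                for (var e : state.entrySet()) {
                    if (!e.getValue().equals(previous.get(e.getKey()))) {
                        delta.put(e.getKey(), e.getValue()); // only what changed
                    }
                }
                encoded.add(delta); // (key removals omitted for brevity)
            }
            previous = state;
        }
        return encoded;
    }

    public static void main(String[] args) {
        List<Map<String, String>> states = List.of(
                Map.of("queueDepth", "10", "circuit", "closed"),
                Map.of("queueDepth", "11", "circuit", "closed"),
                Map.of("queueDepth", "11", "circuit", "open"),
                Map.of("queueDepth", "12", "circuit", "open"));
        encode(states).forEach(System.out::println);
    }
}
```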
The Future Horizon
Looking ahead, I believe this approach has the potential to revolutionize how we perceive software systems. Instead of viewing them as machines we construct and monitor, we’ll come to see them as living entities that we cultivate and support – systems that learn from experience, adapt to changing circumstances, and help us comprehend their behavior. The paper’s vision of simulating past, present, and future goes beyond better monitoring and debugging. It aims at systems that understand their own actions and, in doing so, let us understand them too: a shift from measurement to awareness, from monitoring to understanding.
Closing Thoughts
As I conclude, I want to highlight a crucial aspect of this paper: It wasn’t merely predicting technological trends; it proposed a fundamental paradigm shift in our understanding of software systems. The author advocated for a change in our perspective, moving away from viewing our systems as mere machines to be monitored, and towards recognizing them as living entities capable of learning, retaining, and evolving. Although we haven’t fully realized this vision yet, its value remains undiminished. The increasing complexity of our systems underscores its relevance, as we reach the boundaries of what conventional monitoring and observability can provide. Perhaps it’s time to revisit this paper from 2014 and reconsider its revolutionary vision.
Historical Implementation
Before we move to questions, there’s a fascinating historical note I need to share. While we’ve been discussing this paper’s vision as something ahead of its time, what many people don’t know is that the author built and demonstrated this technology. In a remarkable presentation at Google’s YouTube headquarters in Stockholm, the author showcased “Simz” – a working implementation of the concepts we’ve been discussing. This wasn’t just theoretical – it was real, working software that demonstrated behavioral recording, mirroring, and simulation in practice.
Think about that for a moment. While most of us were still struggling with logging and metrics, there was a working system that could:
– Record complete software behavior
– Mirror execution across different runtimes
– Simulate and replay system behavior
– Create parallel worlds for analysis and testing
This makes our earlier discussion even more poignant. We had a working prototype of this approach, proof that it was possible. Yet somehow, as an industry, we took a different path. It raises some fascinating questions:
– What if we had embraced this approach back then?
– Why did we choose more conventional observability approaches instead?
– How might our systems be different today if we had followed this path?
It’s like the computing history equivalent of Nikola Tesla’s alternating current – a technology that was perhaps too revolutionary for its time, yet contained insights we’re still catching up to today. This historical implementation proves something crucial – the paper wasn’t just theoretical speculation. These concepts were implementable with the technology available in 2014. They still are today, and arguably, they’re more relevant than ever. Knowing this was built and demonstrated, what do you think held us back from adopting this approach more broadly? And more importantly, how might we take these proven concepts and apply them to our current challenges?
So, I’ll leave you with some additional questions to consider:
– How would your approach to system operations change with perfect recall of past behavior?
– What would you do differently in development with future behavior simulation?
– How might this change our relationship with systems?
These aren’t just theoretical questions. They’re about the future of our field and how we’ll handle the growing complexity of our systems. The past, present, and future will be simulated – perhaps not exactly as this paper envisioned, but in ways that will fundamentally change how we build and operate software systems.
Thank you.