This article was originally posted in 2020 on the OpenSignals website, which is now defunct.
Lost in a Fog
The current centralized approach to observability is not sustainable. The volume of useless data is growing daily, while our ability and capacity to make sense of it are shrinking at an alarming rate. Sensibility and significance must come back into the fray; otherwise, we are destined to wander around in a fog of data, wondering how we ever got to this place and became so lost.
We need to rethink our current approach of moving data. Instead, we should look to distribute the cognitive computation of the situation, an essential concept lost in all of this, back to the machines, or at least to what constitutes the unit of execution today.
A Focus on Significance
We must relearn how to focus on operational significance: stability, systems, signals, states, scenes, scenarios, and situations. Instead of moving data and details, we should enable the communication of a collective assessment of operational status based on behavioral signals and local contextual inferencing from each computing node. The rest is noise, and any attention given to it is a waste of time and counterproductive.
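To make the idea of local inferencing a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the signal names (succeed, delay, fail), the status names (ok, degraded, defective), the window size, and the thresholds are all hypothetical and not taken from any particular toolkit. Each node keeps a small window of its own behavioral signals, infers a status from them, and communicates only when that status changes.

```python
from collections import Counter, deque

# Hypothetical vocabularies, chosen for illustration only.
SIGNALS = ("succeed", "delay", "fail")


class LocalAssessor:
    """Infers a node's status from its most recent behavioral signals."""

    def __init__(self, window=20):
        self.recent = deque(maxlen=window)  # bounded local memory
        self.status = "ok"

    def _infer(self):
        counts = Counter(self.recent)
        total = len(self.recent)
        # Assumed thresholds: half or more failures means defective,
        # a fifth or more failures and delays means degraded.
        if counts["fail"] * 2 >= total:
            return "defective"
        if (counts["fail"] + counts["delay"]) * 5 >= total:
            return "degraded"
        return "ok"

    def observe(self, signal):
        """Record a signal; return the new status only when it changes."""
        if signal not in SIGNALS:
            raise ValueError(f"unknown signal: {signal}")
        self.recent.append(signal)
        inferred = self._infer()
        if inferred != self.status:
            self.status = inferred
            return inferred
        return None


def run(signals, window=20):
    """Feed signals to one assessor and collect the status changes it emits."""
    assessor = LocalAssessor(window)
    emitted = []
    for signal in signals:
        change = assessor.observe(signal)
        if change is not None:
            emitted.append(change)
    return emitted
```

Feeding this sketch ten successes, ten failures, and twenty successes with a window of ten produces four status changes (degraded, defective, degraded, ok) from forty local observations. Only those four assessments would need to leave the node.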
Let us now get to the crux of many problems we face in managing complex systems – time.
At the heart of humankind’s major problems is our ability to conceive and perceive time (in passing) and to project forward (to time-travel mentally), while falling short of experiencing such projections or past recollections at the same cognitive level and emotional intensity as we do the present.
We stole fire from the gods but have yet to wield it in a less destructive and far more conserving way.
Still, we have yet to fully appreciate and accept that any (in)sight we are offered in doing so is only a light shimmer of what lies ahead or behind us. We are always situated in the present, and the context of the past and the consequences of the future are invariably experienced, diluted, and distilled. We have yet to be able to step into the same river twice. We can look forward and backward, but it is never genuinely experienced like the present. Our current observability tools have not addressed this omission in our cognitive development and evolution. There can be no time travel without memory.
Past to Predictive
The table above is not necessarily a timeline of progression, as observability initially started with the in-the-moment experience of direct human-to-machine communication of performance-related data, when both human operator and machine were spatially collocated. That said, the trend has been a move from the past to the present, with near-real-time telemetry data collection replacing the logging technology of yesteryear. Today, even near-real-time is insufficient, with organizations moving from reactive to proactive and demanding predictive capabilities.
Observability deals with the past; it measures, captures, records, and collects some executed activity or generated event representing an operation or outcome. When humans consider the past, they are not thinking about metrics or logs; instead, they recall (decaying) memories of experiences. When a human operator does recall watching a metric dashboard, they do not remember the data points but instead the experience of observing. An operator might be able to recall one or two facts about the data, but this will be wrapped in the context of episodic memory.
A machine is entirely different; the past is never reconstructed in the same manner as the actual execution. A counter is not code or (work)flow. Historical data does not decay naturally, though it can be purged, and the precision diminishes over time. Instead, a log file or other historical store contains callouts, signposts, metrics, or messages that allude to what has happened. An operator must make sense of the past from a list of echoes.
A challenge here is when there are multiple separate historical data sources. So, at the beginning of the evolution of monitoring and observability, much of the engineering effort was fusing data, resulting in the marketing-generated requirement of a single pane. Unfortunately, fusion was simplistic and superficial; there was hardly any semantic-level integration. Instead, the much-hyped data fusion capabilities manifested merely as juxtaposing data tiles in a dashboard devoid of a situation representation.
Dealing with time becomes a far more complex matter when shifting from the past to the present. Again, there is never really a present when it comes to observability. The movement into the present is achieved by reducing the interval between recording an observation and rendering it in some form of visual communication to an operator. Once observability moved into the near-real-time space of the present, the visualizations and underlying models changed. Instead of listing logs or charting metric samples, observability tooling concentrated more on depicting structures of networks of nodes and services and up-to-the-minute health indicators.
But as engineering teams competed further to reduce the time lag from minutes to seconds and below, other problems started to surface, particularly the difference in speeds between pulled and pushed data collection. Nowadays, modern observability pipelines are entirely push-based, which is also necessary when dealing with cloud computing dynamics and elasticity.
But time is still an ever-present problem. The amount of measurement data collected for each instrumented event has increased, especially when employing distributed tracing instrumentation; as a result, it has become necessary to sample, buffer, batch, and drop payloads. Under heavy load, bloated observability data pipelines cannot keep up with the transmission of payloads.
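The tension between buffering and dropping can be seen in a toy simulation. The following Python sketch is an illustration only: the function name, its parameters, and the tick-based model are assumptions made for this example, not a description of any real pipeline. Payloads arrive each tick, a bounded buffer holds what cannot be sent, and anything that does not fit is dropped.

```python
from collections import deque


def simulate(arrivals_per_tick, send_per_tick, buffer_size):
    """Toy model of a bounded telemetry buffer drained at a fixed rate."""
    buffer = deque()  # holds the tick at which each payload was created
    sent = dropped = max_lag = 0
    for tick, arrivals in enumerate(arrivals_per_tick):
        for _ in range(arrivals):
            if len(buffer) < buffer_size:
                buffer.append(tick)
            else:
                dropped += 1  # buffer full: the payload is lost
        for _ in range(min(send_per_tick, len(buffer))):
            created = buffer.popleft()
            max_lag = max(max_lag, tick - created)
            sent += 1
    return {
        "sent": sent,
        "dropped": dropped,
        "backlog": len(buffer),
        "max_lag_ticks": max_lag,
    }


burst = [5, 5, 20, 20, 5, 5]
small = simulate(burst, send_per_tick=10, buffer_size=15)
large = simulate(burst, send_per_tick=10, buffer_size=100)
```

With the small buffer, 15 of the 60 payloads are dropped during the burst. With the large buffer, nothing is dropped, but the oldest payload sent is two ticks stale and ten payloads are still waiting at the end. In this model, enlarging the buffer trades missing data for late data.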
The need to send everything and anything and the need to keep the experience near-real-time are incompatible. In the end, we have the worst possible scenario – uncertainty about the situation and uncertainty about the quality (completeness) of the data that is meant to help us recognize the situation. Not to mention that any engineering intervention at the data pipeline level only brings us back to dealing with even more significant latency variance. You cannot have the whole cake, consume it centrally, and keep it near real-time.
Observability vendors no longer describe latency when discussing their sub-second monitoring solutions. Instead, they speak of resolution, while the data displayed can be seconds or even minutes old before it catches an operator’s attention. It must be pointed out that events can only be counted or timed after completion or closure, so it is incorrect to consider the dashboard a near-real-time view if you have a trace call lasting longer than a few seconds. An event must be decomposed into smaller events if a near-real-time view is what is desired.
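A small Python sketch illustrates the point about completion. The event names and timings below are invented for the example, and the helper function is hypothetical; the only rule it encodes is the one stated above, that an event can be reported once it has closed.

```python
def visible_at(events, now):
    """Return the names of events that have completed by time `now`.

    Each event is a (name, start, end) tuple in seconds; an event can
    only be counted or timed once it has ended.
    """
    return [name for name, _start, end in events if end <= now]


# One long-running call reported as a single event (hypothetical timings).
whole = [("checkout", 0.0, 12.0)]

# The same call decomposed into smaller events.
decomposed = [
    ("validate", 0.0, 0.5),
    ("reserve", 0.5, 4.0),
    ("charge", 4.0, 12.0),
]

seen_whole = visible_at(whole, now=5.0)
seen_decomposed = visible_at(decomposed, now=5.0)
```

Five seconds into the call, the single event has reported nothing, while the decomposed version has already reported its first two steps. Neither view says anything about the step still in flight.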
Ahead of Time
What do you do when it is impossible to experience the present in the present? You cheat by skipping ahead to predict what is coming next, which has probably already happened but has yet to be communicated. Here we anticipate the changing of the current situation. Unfortunately, this is a pipe dream with the current approach taken by observability because of its focus on data and detail in the form of traces, metrics, and (event) logs. These are not easy to predict. No solution will predict the occurrence of any one of these phenomena, nor should it. Such low-level phenomena will happen naturally and in large quantities, but what does that tell us? Nothing, when the data we use for analysis is too far removed from what is significant.
By not solving the problem at the source with local sensing, signaling, and status inference, we made it impossible to experience the present. The natural workaround for such a time lag is prediction, which is unsuitable for the type of data transmitted. That has not stopped vendors from claiming to offer machine learning and artificial intelligence. But in reality, and much like some current AI approaches, it increasingly looks like a dead end as we try to scale cognitive capacities to rapidly rising system complexities.
Blind to Change
The low-level data captured in volume by observability instruments has closed our eyes to salient change. We’ve built a giant wall of white noise. The human mind’s perception and prediction capabilities evolved to detect changes significant to our survival. Observability has no steering mechanism to guide effective and efficient measurement, modeling, and memory processes. Companies are gorging on ever-growing mounds of collected observability data that should be of secondary concern.
Ghosts of Machines
Perception, action, and attention are tightly integrated within the human mind. Yet, when looking at what observability is today, we see no consideration of how controllability can be employed and cognition constructed. It is a tall order to ask machine learning to deliver on the hype of AIOps by feeding a machine non-curated sensory data and expecting some informed prediction of value.
Where are the prior beliefs to direct top-down inference when awareness and assessment of a situation are absent from the model? How can a machine of some intelligence communicate with a human when no standard conceptual model readily supports knowledge transfer in either direction?
Suppose a prediction is to be made by artificial intelligence in support of human operators. In that case, we need an explanation of the reasoning and, more importantly, the ability to continuously train the prediction (inference) engine when it misses the mark. There are no answers to these from the point we are at; it has not even been considered.
In the past (2013), I claimed that simulation would eventually be the future of observability. Now I see that future as a projection (of a situation), with simulation as a possible way to explore scenarios.