Organizational silos form in complex systems when collaboration becomes costly or uncertain, leading to inefficiencies and communication barriers. Effective integration requires balancing standardization with simplification, fostering collaboration across units, and managing the tension between short-term metrics and long-term transformative work.
This article presents an A-Z glossary of key concepts related to observability in complex systems and software engineering. It covers topics ranging from Attention and Boundaries to Topologies, emphasizing the importance of intelligent data analysis, contextual understanding, and adaptive learning in monitoring and managing modern distributed systems.
In an era where AI is rapidly transforming our digital landscape, how can we ensure that human-AI collaboration reaches its full potential? The answer lies in a paradigm shift towards task-centricity.
The observability community should move away from traditional metaphors like pillars and pipelines and adopt new ones like substrates and circuits. By doing this, we can gain a new and innovative outlook on tools and techniques, leaving behind outdated thinking that prioritizes data over decisions and content over control.
The prevailing metaphors of pillars and pipelines in observability have limited our understanding and hindered progress. These metaphors promote siloed thinking and a focus on data collection over actionable insights.
Abstraction and simplification are two fundamental principles that often work together in the design of systems. With abstraction, we reduce system complexity by focusing on the essential aspects of structure, elements, and behavior.
Here we explore why the industry needs to move beyond legacy tools and embrace a more dynamic and adaptable approach to gleaning genuine value from the ever-growing ocean of collected data.
As engineering systems grow ever more complex, the engineering community’s focus on simplistic measurement and reporting hinders achieving operational scalability by way of sensemaking and steering of such systems of systems.
To acquire the knowledge of suitable software performance heuristics, developers must experience software execution in a new manner – a simulated environment of episodic machine memory replay.
The mirroring of software execution behavior, as performed by Simz (online) and Stenos (offline), has the potential to be one of the most significant advances in software systems engineering. Its impact could be as significant as that of distributed computing.
This post introduces the reasoning, thinking, and concepts behind a technology we call Signals, which we believe has the potential to have a profound impact on the design and development of software, the performance engineering of systems, and the management of distributed interconnected applications and services.
There is always tension between adaptability and structural stability in engineering and possibly life. We want our designs to be highly adaptable. With adaptation, our designs attempt to respond to change, sensed within the environment, intelligently with more change, though far more confined and possibly transient, at least initially. But there are limits to how far we can accelerate adaptation without putting incredible stress on the environment and the very system contained within.
Today, the stimulus used to develop machine intelligence is sensory data, which is transferred between devices and the cloud – the same data that concerns many consumers. But what if instead of sending data related to such things as a thermostat’s temperature set point, what was transmitted mostly concerned the action taken by the embedded software machine – an episodic memory of the algorithm itself?
Our brain houses billions of neurons (nerve cells) that communicate with each other through intricate networks of neural circuits. These circuits play a fundamental role in various cognitive functions, sensory processing, motor control, and generating thoughts and emotions. Why should it be different for Observability?
Most current observability technologies don’t fare well as a source of behavioral signals or inferred states. They are not designed to reconstruct behavior at the level of inspection needed to translate effectively from measurement to signal and, in turn, to state. They are designed with event data collection and reporting in mind, not the signal or state.
Whether an agent is deployed should not be the differentiator, especially with companies electing to manually instrument parts of an application’s codebase using open-source observability libraries. Instead, we should consider whether the observer, agent or library, is stateless concerning what and how it observes, measures, composes, collects, and transmits observations.
Reducing and compressing measurements is critical, and it is greatly aided by representations extracted from the environment via hierarchical boundary determination. When this is not done automatically, the custom dashboard capabilities of the Observability solution must be used to reconstruct some form of structure that mirrors the boundaries all but lost in the data fog. Naturally, this is extremely costly and inefficient for an organization.
The overemphasis on data instead of signals and states has created a great fog. This data fog leads many organizations to lose their way and overindulge in data exploration instead of exploiting acquired knowledge and understanding. This has come about while the community remains largely unconcerned with a steering process, such as monitoring or cybernetics.
There are many perspectives one could take in considering the observability and monitoring of software services and systems of services, but below are a few, stacked in layers, that should be included.
Observability is effectively a process of tracking change. At the level of a measurement device, software or hardware-based, change is the difference in the value of two observations taken at distinct points in time. This change detection via differencing is sometimes called static or happened change. Observability is all about happenings.
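To make differencing concrete, here is a minimal sketch in Java – my own illustration, not code from any Humainary library – where change is simply the delta between the current and prior observation:

```java
// A minimal sketch of change detection via differencing:
// change is the difference in value of two observations taken at distinct points in time.
record Observation(long timestamp, double value) {}

final class ChangeDetector {
  private Observation previous;

  // Returns the delta against the prior observation, or 0.0 for the first one.
  double difference(Observation current) {
    double delta = (previous == null) ? 0.0 : current.value() - previous.value();
    previous = current;
    return delta;
  }
}
```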
Once upon a time, there was a period in the world where humans watched over applications and services by proxy via dashboards housed on multiple screens hoisted in front of them – a typical mission control center. The interaction between humans and machines was relatively static and straightforward, like the environment and systems enclosed.
Substrates changed everything by introducing the concept of a Circuit consisting of multiple Conduits fed by Instruments, which allowed Observers to subscribe to Events and, in processing such Events, generate further Events by calling into another Instrument. With the introduction of Percept and Adjunct, it is now possible for Observers attached to a Circuit and its locally registered Sources to process Events that have come from a far-off Circuit within another process.
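As a rough sketch of how such a circuit composes, the interfaces below borrow the names used above (Conduit, Event, Observer), but their shapes are illustrative assumptions of mine, not the actual Substrates API:

```java
// Illustrative only: these mirror the concepts named above but are NOT the Substrates API.
interface Event<T> { T payload(); }

interface Observer<T> { void accept(Event<T> event); }

interface Conduit<T> {
  void subscribe(Observer<T> observer); // observers receive events flowing through the conduit
  void publish(T payload);              // instruments feed events into the conduit
}

// An observer that, in processing an event, generates a further event
// by calling into another conduit - the circuit style of composition.
final class Relay<T> implements Observer<T> {
  private final Conduit<T> downstream;
  Relay(Conduit<T> downstream) { this.downstream = downstream; }
  @Override public void accept(Event<T> event) {
    downstream.publish(event.payload()); // forward, possibly after transformation
  }
}
```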
With the latest update to the Substrates API, the metamorphosis to a general-purpose event-driven data flow library interface supporting the capture, collection, communication, conversion, and compression of perceptual data through a network of circuits and conduits has begun.
It is time for a new direction more closely aligned with goals, focused on the dynamics of systems that humans are already highly adapted to through their social intelligence, within which situation is a crucial conceptual element of the cognitive model. Understanding and appropriately responding to different social situations is fundamental to social cognition and effective interpersonal interactions.
Disruptions are one factor affecting the maintenance of service quality levels. A disruption is an interruption in the flow of (work) items through a network that can, for a while, render the network inoperable or leave its flow performance subpar. Depending on the severity of the disruption, a network may need to replan and restructure itself for a period afterward. There are two main categories of disruptions: disturbance and deviation.
The low-level data captured in volume by observability instruments has closed our eyes to salient change. We’ve built a giant wall of white noise. The human mind’s perception and prediction capabilities evolved to detect changes significant to our survival. Observability has no steering mechanism to guide effective and efficient measurement, modeling, and memory processes. Companies are gorging on ever-growing mounds of collected observability data that should be of secondary concern.
The Recognition-Primed Decision (RPD) model asserts that individuals assess the situation, generate a plausible course of action (CoA), and then evaluate it using mental simulation. The authors claim that decision-making is primed by recognizing the situation and not entirely determined by recognition. The model contradicts the common thinking that individuals employ an analytical model in complex time-critical operational contexts.
The OODA loop emphasizes two critical environmental factors – time constraints and information uncertainty. The time factor is addressed by executing through the loop as fast as possible. Information uncertainty is tackled by acting accurately. The model’s typical presentation is popular because it closes the loop between sensing (observe and orient) and acting (decide and act).
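A schematic rendering of the loop’s shape, purely illustrative and not drawn from the post, might look like this in Java, with iteration speed addressing the time factor and a running orientation estimate addressing uncertainty:

```java
// A schematic OODA loop sketch (my own illustration): observe, orient, decide, act,
// repeating as fast as possible while folding new observations into the current model.
interface Sensor { double observe(); }
interface Actuator { void act(double command); }

final class OodaLoop {
  void run(Sensor sensor, Actuator actuator) {
    double orientation = 0.0;
    while (!Thread.currentThread().isInterrupted()) {
      double observation = sensor.observe();               // Observe the environment
      orientation = 0.9 * orientation + 0.1 * observation; // Orient: update the running estimate
      double decision = orientation > 1.0 ? 1.0 : 0.0;     // Decide: choose a course of action
      actuator.act(decision);                              // Act: feed back into the environment
    }
  }
}
```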
Data and information are not surrogates for a model. Likewise, a model is not a dashboard built lazily and naively on top of a lake of data and information. A dashboard and many metrics, traces, and logs that come with it are not what constitutes a situation. A situation is formed and shaped by changing signals and states of structures and processes within an environment of nested contexts (observation points of assessment) – past, present, and predicted.
Verbal Protocol Analysis (VPA) is a technique used by researchers across many domains, including psychology, engineering, and architecture. The basic idea is that during a task, such as solving a problem, a subject concurrently verbalizes – thinks aloud – what is resident in their working memory: what they are thinking during the doing. Using Protocol Analysis, researchers can elicit the cognitive processes from the start to the completion of a task. After further processing, the information captured is analyzed to provide insights that can improve performance.
The next generation of Observability technologies and tooling will most likely take two distinctly different trajectories from the ever-faltering middle ground that distributed tracing and event logging currently represent. The first trajectory, the high-value road, will introduce new techniques and models to address complex and coordinated system dynamics in a collective social context, rebuilding a proper foundation geared to aiding both human and artificial agents.
When designing observability and controllability interfaces for systems of services, or any system, it is essential to consider how the interface connects the operator to the operational domain in terms of information content, structure, and visual form. What representation is most effective in the immediate grounding of an operator within a situation?
Because of limited processing resource capacities, brains focus more on some signals than others – signals compete for the brain’s attention. This internal competition is partially under the bottom-up influence of a sensory stimuli model and somewhat under the top-down control of other mental states, including goals – this is very similar to how situational awareness is theorized to operate optimally.
Unfortunately, many of the solutions promoted in the Observability space, such as distributed tracing, metrics, and logging, have not offered a suitable mental model in any form whatsoever. The level of situation awareness is still sorely lacking in most teams, who appear to be permanently stalled at ground zero and overtly preoccupied with data and details.
Looking back over 20 years of building application performance monitoring and management tooling, little has changed, though today’s tooling does collect more data from far more data sources. But effectiveness and efficiency have not improved; it could be argued that both have regressed.
Science and technology have made it possible to observe the motion of atoms, but humans don’t actively watch such atomic movements when navigating physical spaces. Our perception, attention, and cognition have evolved into an effective model for us in most situations. Distributed tracing spans, and the data items attached to them, are the atoms of observability.
Two distinct hemispheres seem to be forming within the application monitoring and observability space – one dominated by measurement, data collection, and decomposition, the other by meaning, system dynamics, and (re)construction of the whole.
The underlying observability model is the primary reason distributed tracing, metrics, and event logging fail to deliver much-needed capabilities and benefits to systems engineering teams. There is no natural or inherent way to transform and scale such observability data collection and analysis into generated signals and inferred states.
Humanism is a philosophical stance at the heart of what Humainary aims to bring to service management operations. It runs counter to the misguided trend of wanton and wasteful extensive data collection so heavily touted by those focused on selling a service rather than solving a problem, now and in the future.
As computing and complexity scaled up, the models and methods should have reduced and simplified the communication and control surface area between man and machine. Instead, monitoring (passive) and management (reactive) solutions have lazily reflected the complexity at a level devoid of simplicity and significance, polluted instead with noise.
There are at least two distinct paths to the future of observability. One path would continue increasing the volume of collected data in an attempt to reconstruct reality in high definition on a single plane, with little consideration for effectiveness or efficiency. Another would focus on seeing the big picture in near-real-time from the perspective of human or artificial agents.
We propose a model that can better serve site reliability engineering and service operations by being foundational to developing situational awareness capabilities and system resilience capacities, particularly adaptability and experimentation, as in dynamic configuration and chaos engineering.
The Double Cone Model is a valuable conceptualization in thinking about more efficient and effective methods to handle data overload and generate far more actionable insight from a model much closer to how the human mind reasons about physical and social spaces.
All points of experience within a topology offer some visibility, but the language (codes, syntax) and model (concepts) employed can differ greatly. This is problematic when the goal is to determine the intent and outcome of an interaction’s operation(s).
Today’s data, such as logs, traces, and metrics, are too far removed to be the basis for a language and model that illuminates the dynamic nature of service interaction and supports system stability inference and state prediction across distributed agents.
Observability is purposefully seeing a system in terms of operations and outcomes. In control theory, this is sometimes simplified to monitoring inputs and outputs, with the comparative prediction of the output from input, possibly factoring in history.
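A toy example of that control-theoretic view, assumed rather than taken from the source, compares the observed output against a model’s prediction from the input:

```java
// A toy illustration of the control-theoretic framing: observe inputs and outputs,
// predict the output from the input, and compare the two.
import java.util.function.DoubleUnaryOperator;

final class OutputMonitor {
  private final DoubleUnaryOperator model; // predicts output from input, possibly fitted on history

  OutputMonitor(DoubleUnaryOperator model) { this.model = model; }

  // Returns the residual between the observed output and the model's prediction;
  // a persistently large residual signals the system is drifting from expectation.
  double residual(double input, double observedOutput) {
    return observedOutput - model.applyAsDouble(input);
  }
}
```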
It could be argued that no one fully understands what AIOps pertains to now in its aspirational rise within the IT management industry and community. AIOps is a moving target and a term hijacked by Observability vendor marketing. It’s hard to pin down.
In interpreting a script or a scene within a movie, humans must identify the setting and actors and understand the dialog from the multi-sensory feed flowing into the brain. Observability is somewhat similar, except that solutions today have not had a billion years to evolve efficient and effective ways of detecting the salient features.
Context is crucial when it comes to the Observability of systems. But Context is an abstract term that is hard to pin down. Does it represent structure as in the configuration of software components? Does it represent behavior as in tracing a service request? Does it represent some attributes associated with a metric? Does it encompass purpose?
Many software systems have self-regulation routines that must be scheduled regularly. Observability libraries and toolkits are no different in this regard, with the sampling of metric values or resource states being notable examples; another less common one would be the status of inflight workflow constructs.
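As a minimal sketch of such a self-regulation routine, assuming nothing about Humainary’s own scheduling machinery, a resource sampler built on Java’s standard ScheduledExecutorService might look like this:

```java
// A minimal sketch of a regularly scheduled sampling routine:
// periodically sample a resource state and hand it to the observability pipeline.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class MemorySampler {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  void start() {
    scheduler.scheduleAtFixedRate(
        () -> {
          long used = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
          record(used); // forward the sampled value downstream
        },
        0, 1, TimeUnit.SECONDS);
  }

  private void record(long bytesUsed) {
    System.out.println("heap.used=" + bytesUsed);
  }
}
```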
With an event-driven architectural approach such as Substrates, whenever a value needs to be calculated from a series of events, a stateful consumer invariably must use another circuit component to continuously publish the result after each event processing. But there is an alternative option in the Substrates API.
The Substrates API has two basic categories of Instrument interfaces. The first category includes Instrument interfaces that offer a direct means of interaction by a caller. The second type of Instrument is one with no direct means of triggering Event publishing in the interface.
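The shapes below are assumptions meant only to illustrate the two categories, not the actual Substrates interfaces:

```java
// Illustrative only - these are NOT the Substrates Instrument interfaces.
// Category 1: an instrument the caller interacts with directly;
// each call can trigger the publishing of an event.
interface Counter {
  void increment();
}

// Category 2: an instrument with no direct trigger in its interface;
// the runtime samples or polls it to produce events.
interface Gauge {
  double sample(); // invoked by the runtime, not by application code paths
}
```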
Today, the approach to observability at the various stages within the data pipeline, from Application to Console, has been to create different models in terms of concepts, structure, features, and values. But what if the model employed were the same across all stages of a data pipeline?
The typical flow of execution for an observability Instrument is for instrumentation within the application code to make a single call to a method defined in the Instrument interface. But there are cases where a single call into an Instrument interface causes the dispatching of multiple events.
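A hypothetical illustration of that fan-out, not the Substrates API itself: a timer instrument whose single stop() call dispatches both a duration event and a count event to an event sink:

```java
// Hypothetical sketch: one call into an instrument interface dispatching multiple events.
import java.util.function.Consumer;

final class Timer {
  private final Consumer<String> events; // stand-in for an event outlet
  private long startedAt;

  Timer(Consumer<String> events) { this.events = events; }

  void start() { startedAt = System.nanoTime(); }

  // A single call into the instrument interface dispatches two events.
  void stop() {
    long elapsed = System.nanoTime() - startedAt;
    events.accept("timer.duration.ns=" + elapsed);
    events.accept("timer.count=1");
  }
}
```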
A significant challenge in building observability agents or heavily instrumented applications or frameworks is in scaling, both up and down, the resources consumed. There is a trade-off here in terms of time and cost.
In this post, we walk through one of the showcases in the project’s code repository that demonstrates how the complexity of hooking up components in a Circuit is greatly simplified.
The interfaces defined within Substrates API are designed with extensibility and evolution in mind, both from a client library and provider implementation perspective.
An objective of the Substrates API is that developers should be location independent, in that code can execute without change within the local process itself or on a remote server being served the data via a Relay.
Three overriding principles are applied in the design of the Substrates API – consistency (standardizing), conciseness (simplifying), and correctness (sequencing).
Good design takes time, over many iterations (of converging and diverging design paths), in developing, discovering, discussing, discounting, and occasionally destroying.
The stages within a local data pipeline managed by the Substrates runtime are detailed, from a client call to an Instrument interface method to dispatching an Event to an Outlet registered by a Subscriber.
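Condensed into a hypothetical sketch, with names of my own choosing rather than the Substrates runtime’s, the stages reduce to this: a subscriber registers an outlet, and an instrument call dispatches an event to every registered outlet:

```java
// A hypothetical condensation of the pipeline stages described above.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class Pipeline<T> {
  private final List<Consumer<T>> outlets = new ArrayList<>();

  // A subscriber registers an outlet that will receive dispatched events.
  void subscribe(Consumer<T> outlet) { outlets.add(outlet); }

  // An instrument method captures a value and dispatches it as an event.
  void emit(T value) { outlets.forEach(outlet -> outlet.accept(value)); }
}
```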
A walkthrough of one of the Substrates showcase examples hosted on GitHub, demonstrating two of the most critical aspects of pipelines – the production and consumption of captured data and published events.
Using Substrates, the fusion of multiple streams of data from multiple sources, an essential process of any monitoring and management solution, can be done in-process and in real-time.
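As an assumed illustration of in-process fusion, not Substrates code, two sources below feed one fused sink that re-emits the combined latest values on every update:

```java
// An assumed in-process fusion sketch: two streams (latency, errors) are combined,
// and the fused pair is re-emitted in real time on every update from either source.
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BiConsumer;

final class Fusion {
  private final AtomicReference<Double> latency = new AtomicReference<>(0.0);
  private final AtomicReference<Double> errors  = new AtomicReference<>(0.0);
  private final BiConsumer<Double, Double> sink;

  Fusion(BiConsumer<Double, Double> sink) { this.sink = sink; }

  // Each source pushes into its own channel; every update re-emits the fused pair.
  void onLatency(double v) { latency.set(v); sink.accept(latency.get(), errors.get()); }
  void onErrors(double v)  { errors.set(v);  sink.accept(latency.get(), errors.get()); }
}
```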
The first official release of the Substrates API is nearing completion. With that, it is time to explore one of the toolkit’s most significant differences compared to other approaches, such as OpenTelemetry.
For Humainary, the goal is to encourage as much as possible the analytical processing of observations at the source of event emittance and in the moment of the situation. To propagate assessments, not data.
Since the very beginning of the hype of Observability, we have contended that the link with Controllability must be maintained for there ever to be a return on investment (ROI) that matches the extravagant claims from vendors pushing a message of more-is-better.
This two-part series will discuss critical factors that weighed heavily in our rethinking of Observability and how they manifest in our toolkit under the headings: conceptualization, communication, coordination, collaboration, and cognition.
The Humainary project aims to bring much-needed sensibility, streamlining, simplicity, and sophistication back to an area that seems to fight forcefully not to move past yesteryear technologies like logging and tracing.