Observability – The Significant Parts

This article was first published in 2019.

Every day, observability gets attached to yet another aspect of software and systems engineering. Observability pipelines. Observability platforms. It seems many want the love shown to observability to shine a light into their secluded corners. Unfortunately, in the process, observability has lost much of its meaning and is now all but divorced from its origins.

“Observability is a notion that plays a major role in filtering and reconstruction of states from inputs and outputs.
Together with reachability, observability is central to the understanding of feedback control systems.”
Lectures on Dynamic Systems and Control, MIT

You could argue that the above definition pertains to cybernetics, particularly control theory.

But much of what is happening today in the industry is centered around streamlining and stabilizing a much bigger feedback loop – one that is managing change and complexity within businesses and the enclosing environment.

The critical elements of cybernetics are feedback, flow, control, and communication, which many would agree are the essential aspects of successful systems operations. The big difference is that we must now somehow bring together the worlds of man and machine in a way that plays to the strengths of each without significantly taxing the capabilities and capacities of either. Talk of exploring the infinite space of the “unknown unknown” is, for the most part, not at all helpful or productive. There are natural limits and constraints in practice and, most certainly, in production. We must refocus on what should be observed and, more importantly, controlled.

What matters most is what we can infer from observations and the significance of such observations to the system (of systems) state we are attempting to see, sense, and steer. The problem with observability is that we have not defined a model consisting of a relatively small set of universal signals and states that reflect the nature of modern application systems of services, flows, streams, etc. Instead, observability vendors continue to focus on raw data.

We need to bring the fundamentals into focus, extracting the signal from the noise. What system (health) states are of concern, and at what granularity? How is the state determination made? By the system or service itself, or by external dependents? How reliable, durable, and predictive can such determinations be? What is the set of qualitative signals that can describe much of the interaction (conversation, as opposed to transaction) that occurs between systems and parts of a system? What is the mapping from signal to state, and how sensitive should it be at various exchange points? What mechanisms and policies can be employed to scale up (aggregate) and down (decompose) to the appropriate level of operational attention?
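To make these questions a little more concrete, here is a minimal sketch of one possible signal-to-state mapping. Everything in it is an assumption made for illustration: the signal names (SUCCEED, RETRY, FAIL), the state names (OK, DEGRADED, DOWN), the window size, the thresholds, and the "worst state wins" aggregation rule. It is not a proposed standard, only one way the mapping and its sensitivity could be expressed.

```python
from collections import deque
from enum import Enum, IntEnum


class Signal(Enum):
    # Hypothetical qualitative signals emitted at an exchange point.
    SUCCEED = "succeed"
    RETRY = "retry"
    FAIL = "fail"


class State(IntEnum):
    # Hypothetical health states, ordered from best to worst.
    OK = 0
    DEGRADED = 1
    DOWN = 2


class StateTracker:
    """Infers a state from the most recent signals seen at one exchange point."""

    def __init__(self, window=10, degraded_at=0.2, down_at=0.5):
        # The window and thresholds set the sensitivity (assumed values).
        self.signals = deque(maxlen=window)
        self.degraded_at = degraded_at
        self.down_at = down_at

    def emit(self, signal):
        # Record a signal; the oldest one falls out once the window is full.
        self.signals.append(signal)

    def state(self):
        if not self.signals:
            return State.OK
        total = len(self.signals)
        fail_ratio = sum(s is Signal.FAIL for s in self.signals) / total
        trouble_ratio = sum(s is not Signal.SUCCEED for s in self.signals) / total
        if fail_ratio >= self.down_at:
            return State.DOWN
        if trouble_ratio >= self.degraded_at:
            return State.DEGRADED
        return State.OK


def aggregate(states):
    """Scales up: a group of parts is reported as its worst member state."""
    return max(states, default=State.OK)
```

A narrower window or lower thresholds make an exchange point more sensitive; a different aggregation policy (for example, a quorum rather than the worst member) would change how quickly a problem in one part draws operational attention at the level above.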

Most current observability technologies don’t fare well as a source of behavioral signals or inferred states. They are not designed to reconstruct behavior at the level of inspection we would need to translate effectively from measurement to signal and, in turn, to state. They are designed with the collection and reporting of event data in mind, not the signal or state. They are unaware of the resilience mechanisms many service-to-service communication libraries now employ. They count errors at the request level rather than failures of an exchange within a workflow, even though an exchange can include one or more errors and still be regarded as a successful completion. The default should be to surface only signals and states in tooling and to drill down into low-level events only rarely.
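The gap between request-level errors and exchange-level outcomes can be shown with a small sketch. The names here (ExchangeResult, run_exchange, flaky) and the simple retry policy are hypothetical, standing in for whatever resilience mechanism a real communication library provides.

```python
from dataclasses import dataclass


@dataclass
class ExchangeResult:
    succeeded: bool       # outcome of the exchange as the caller experiences it
    request_errors: int   # errors on individual attempts within the exchange


def run_exchange(attempt, max_attempts=3):
    """Calls attempt() until it returns without raising, up to max_attempts."""
    errors = 0
    for _ in range(max_attempts):
        try:
            attempt()
            return ExchangeResult(succeeded=True, request_errors=errors)
        except Exception:
            errors += 1
    return ExchangeResult(succeeded=False, request_errors=errors)


def flaky(failures_before_success):
    """Builds a hypothetical request that fails a fixed number of times first."""
    remaining = [failures_before_success]

    def attempt():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("simulated transient error")

    return attempt
```

With these definitions, run_exchange(flaky(2)) yields an exchange that succeeded with two request errors along the way. A request-level counter would report two errors; a view based on signals and states would report one successful exchange, perhaps with a retry signal attached.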

When we talk only of signals and states, then and only then can we say there is observability of systems.