Contextualizing Observability

Context

Context is crucial when it comes to the Observability of systems. But Context is an abstract term that is hard to pin down. Does it represent structure (form) as in the configuration of software components? Does it represent behavior (flow) as in tracing a service request? Does it represent some attributes (features) associated with a metric? Does it encompass purpose (function)?

Is Context, the elusive situational awareness, still so lacking in the Observability solutions of today? How does one begin to define and model Context if it is compositional, composed of many nested or related Contexts? How do the systems concepts of Boundary and Observer relate to the concept of Context? There are many inquiries, few explanations, and even fewer solutions.

Names

Most of today’s Observability solutions approach the capture and construction of Context simply by using hierarchical namespaces, such as the String value of a Metric identifier. Below is a small selection from the Kubernetes Metrics listing.

kubelet_runtime_operations_errors_total
kubelet_runtime_operations_latency_microseconds
kubelet_runtime_operations_total
kubelet_runtime_operations_duration_seconds
apiserver_audit_event_total
apiserver_audit_requests_rejected_total
apiserver_client_certificate_expiration_seconds
apiserver_client_certificate_rotation_duration_seconds
apiserver_client_certificate_rotation_total
apiserver_current_inflight_requests
apiserver_request_duration_seconds
apiserver_request_latencies
apiserver_request_total

Embedded within the Metric’s String identifier is a tree-like node path as follows:

+ kubelet
  + runtime
    + operations
+ apiserver
  + audit
  + client
  + current
  + request

The above partial Context definition is typically augmented with attributes captured from the source of the Metric collection and publication. These attributes, for example, port and host, identify a computing resource, such as an application, process, or service. This way, the Metric name list is fixed, but the resources that emit Metric measures reflect the deployment’s specifics, including resource naming and node topology. This might seem sufficient to newcomers, but it is inadequate.
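To make this concrete, here is a minimal Java sketch, not any vendor's API, of what this shallow form of Context amounts to: an identifier split into a node path plus a bag of source attributes. The host and port values are, of course, hypothetical.

import java.util.List;
import java.util.Map;

// A minimal sketch, not any vendor's API, of the shallow Context a metric carries:
// an identifier split into a node path plus a bag of source attributes.
public final class MetricContext {

  record Sample(List<String> path, Map<String, String> attributes, double value) {}

  static Sample contextOf(String identifier, Map<String, String> attributes, double value) {
    // The "tree-like node path" is nothing more than the identifier split on '_'.
    return new Sample(List.of(identifier.split("_")), attributes, value);
  }

  public static void main(String[] args) {
    Sample sample = contextOf(
        "kubelet_runtime_operations_errors_total",
        Map.of("host", "10.0.0.7", "port", "10250"),  // hypothetical source attributes
        3.0);

    System.out.println(sample.path());        // [kubelet, runtime, operations, errors, total]
    System.out.println(sample.attributes());  // {host=10.0.0.7, port=10250} (order may vary)
  }
}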

Whatever a Context might be, it must have a small number of active computational components at some level of abstraction; otherwise, it will be too unwieldy for a human operator to reason about and act upon. Examples of such components would be those with names similar to Service, Controller, Adaptor, Processor, Task, Connector, Transactor, Gateway, Executor, etc.

But name[space]s in themselves are not enough. What if there are multiple instances of a component with the same name? What are the relationships between the components? Container or Contained. Supervisor or Supervised. Proxy or Delegate. Producer or Consumer?
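A conceptual sketch, in Java, of what a component-and-relationship model might minimally look like; the component names follow the list above, while the wiring and the Relation enum are hypothetical illustrations rather than a proposed API.

import java.util.List;

// A conceptual sketch, not a proposed API: components as first-class nodes,
// with the relationships that a flat name[space] cannot express.
public final class ComponentModel {

  enum Relation { CONTAINS, SUPERVISES, DELEGATES_TO, PRODUCES_FOR }

  record Component(String name, int instance) {}          // same name, multiple instances
  record Link(Component from, Relation relation, Component to) {}

  public static void main(String[] args) {
    Component service    = new Component("Service", 1);
    Component processorA = new Component("Processor", 1);
    Component processorB = new Component("Processor", 2);  // same name, different instance
    Component connector  = new Component("Connector", 1);

    List<Link> topology = List.of(
        new Link(service, Relation.CONTAINS, processorA),
        new Link(service, Relation.CONTAINS, processorB),
        new Link(service, Relation.SUPERVISES, processorA),
        new Link(processorA, Relation.PRODUCES_FOR, connector));  // hypothetical wiring

    topology.forEach(System.out::println);
  }
}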

Attributes and Applications

The problem with the above can be seen in vendors offering users of their tooling the ability to define application perspectives: essentially complicated attribute selections that attempt to collect several differently named and structured data points under a grouping commonly referred to as an Application. Utterly absent from all this are the computational components, which are more than a few surface-level metrics, such as queue length and request latency, and are largely independent of resource topologies.
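For illustration, a deliberately simplified and hypothetical rendering of such a perspective in Java: an "Application" reduced to a named set of attribute predicates, with nothing of the underlying computational components in sight. The attribute keys and values are invented for the example.

import java.util.Map;
import java.util.function.Predicate;

// A hypothetical, much-simplified "application perspective": nothing more than
// a set of attribute predicates grouping otherwise unrelated data points.
public final class Perspective {

  record DataPoint(String metric, Map<String, String> attributes) {}

  static Predicate<DataPoint> selector(String key, String value) {
    return point -> value.equals(point.attributes().get(key));
  }

  public static void main(String[] args) {
    // The "Orders" application is anything tagged with these attribute values.
    Predicate<DataPoint> orders =
        selector("namespace", "orders").or(selector("service", "order-gateway"));

    DataPoint point = new DataPoint(
        "apiserver_request_duration_seconds",
        Map.of("namespace", "orders", "host", "10.0.0.7"));

    System.out.println(orders.test(point));  // true: grouped under "Orders" by attributes alone
  }
}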

Application perspectives are, in effect, headless dashboards with all the convoluted configuration of attribute selection sets. They are artificial constructs divorced from the reality of the code. Those tasked with creating application perspectives rarely have the internal know-how to model effectively and efficiently. Knowledge of the code base is not enough because metrics are not the code; they are something else. Even then, perspectives are nearly impossible to maintain because they are not integral to the code.

Everyone imagines they will maintain the perspectives and other dashboard configurations. Still, things quickly drift because of the significant effort required to keep pace with change and the smaller-than-expected return on investment. Best intentions don’t last long.

Incidentally, labels or tags are no different from attributes; they merely drop the name part of the name-value pairing.

Traces

Do Traces and Logs get us closer to the Context we need? Both are about capturing happenings in a more flow-centric way.

Unfortunately, they both have the same issue: the names of Spans and Loggers are taken from the namespace of the method where the instrumentation code was added, typically by an instrumentation agent. The result is another list equivalent to the metric listing above, with some added relational structure regarding caller-to-callee interactions that, while different, does little to help identify the core computational components within a process; it is like trying to discern human behavior by watching a single neuron fire.
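A small Java sketch of the naming convention in question, assuming the agent simply derives the Span (or Logger) name from the instrumented call site:

// A sketch of what a typical agent produces: Span (or Logger) names derived
// from the instrumented call sites, yielding yet another flat name listing.
public final class CallSiteNaming {

  static String spanNameFor(StackTraceElement callSite) {
    // The "context" is just the class-and-method namespace of the call site.
    return callSite.getClassName() + "." + callSite.getMethodName();
  }

  public static void main(String[] args) {
    StackTraceElement here = Thread.currentThread().getStackTrace()[1];
    System.out.println(spanNameFor(here));  // e.g. CallSiteNaming.main
  }
}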

Now, tooling users must employ even more complicated graph-like temporal queries over trace trees or log timelines to re-create, from call sites, the component boundaries required to understand, monitor, and manage. The call sites are not the contextual components we seek. Again, such configuration is external and disconnected, and it will drift even more rapidly than the metric configuration above.

Though trace trees can help detect divergence from the expected flow, such cases should be scarce and found early in the develop-deploy lifecycle. Effective management is severely limited if there are many paths through the processing of a particular service request. While a machine (learning model) can handle significant call path variation, a human-in-the-loop actor cannot. When there is extensive path variation, configuring external tooling to identify the principal components (boundaries of interest) from paths in a large call graph model becomes impractical. Yet most site reliability engineers do precisely that, for lack of an alternative.
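The following Java sketch, using an invented miniature trace tree, shows how quickly distinct call paths accumulate even in a trivial case, which is what the human-in-the-loop actor is being asked to reason over:

import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// A rough sketch of why extensive call-path variation overwhelms a human operator:
// even a small trace tree yields a set of distinct paths that grows quickly.
public final class PathVariation {

  record Span(String name, List<Span> children) {}

  static void collectPaths(Span span, String prefix, Set<String> paths) {
    String path = prefix.isEmpty() ? span.name() : prefix + " -> " + span.name();
    if (span.children().isEmpty()) paths.add(path);
    else span.children().forEach(child -> collectPaths(child, path, paths));
  }

  public static void main(String[] args) {
    Span trace = new Span("gateway", List.of(          // hypothetical service request
        new Span("auth", List.of()),
        new Span("orders", List.of(
            new Span("inventory", List.of()),
            new Span("pricing", List.of()))),
        new Span("audit", List.of())));

    Set<String> paths = new TreeSet<>();
    collectPaths(trace, "", paths);
    paths.forEach(System.out::println);                // each distinct call path
    System.out.println(paths.size() + " distinct paths in a single small trace");
  }
}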

Correlation

The above instrumentation approaches to Observability are primarily built as separate, siloed pipelines, even if shipped in a single package. They might converge in transmission within a collector process or cloud service endpoint. Still, they are data collections with distinct conventions and constructs, like blindfolded humans asked to describe an animal, such as an elephant, from a single touch point. These are not pieces of the same puzzle; the pieces in question are incompatible. Trying to harmonize and connect the dots after the fact has been a considerable engineering effort for many vendors, and the results are not great.

Even with correlation identifiers passed between instruments, the problem of component boundary detection is not resolved: correlation only hints at boundaries across data and execution flows; it does not capture and model the component itself.
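A minimal Java sketch of correlation as practiced: one identifier stamped across a span, a log record, and a metric sample. The record shapes and field names are hypothetical; the point is that the shared identifier links data, not components.

import java.util.Map;
import java.util.UUID;

// A sketch of correlation as practiced today: the same identifier is stamped on
// a span, a log record, and a metric sample, yet no component is ever modeled.
public final class Correlation {

  record SpanData(String traceId, String name) {}
  record LogRecord(String traceId, String message) {}
  record MetricSample(String name, double value, Map<String, String> exemplar) {}

  public static void main(String[] args) {
    String traceId = UUID.randomUUID().toString();   // hypothetical propagated identifier

    SpanData span = new SpanData(traceId, "OrderService.submit");
    LogRecord log = new LogRecord(traceId, "order accepted");
    MetricSample metric = new MetricSample(
        "apiserver_request_total", 1.0, Map.of("trace_id", traceId));

    // The three records share an identifier, hinting at a flow, but nothing here
    // captures the component whose boundary that flow crossed.
    System.out.println(span + "\n" + log + "\n" + metric);
  }
}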

Change

Today’s Observability is no different from yesteryear’s Monitoring, which is unsurprising considering the technologies, such as tracing, logging, and metrics, are essentially the same. Nothing has been learned from what was known 20-plus years ago.

Change is a massive topic of interest in Observability, yet the community seems utterly resistant to change beyond perhaps rewording references and rewriting definitions. One reason is that the solution takes so much control and value away from the vendors. Achieving lasting value requires embedding more contextual intelligence and computational processing into the instrumented processes and toolkits; this is the initial step before Controllability, the end game, can be introduced.

The process and the instruments must be fully integrated, with bi-directional feedback loops, i.e., Circuits, at various levels of Context and Clock time resolutions; this is impossible with generic instrumentation agent toolkits from commercial vendors.
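Purely as a conceptual sketch, and emphatically not an existing toolkit's API, the following Java fragment hints at what such a Circuit might minimally involve: an observation read from the process and an adjustment fed back into it, cycled at some clock resolution. All names and the control rule are hypothetical.

import java.util.function.Consumer;
import java.util.function.DoubleSupplier;

// A purely conceptual sketch (hypothetical names, no existing toolkit's API) of a
// bi-directional circuit: the instrument observes a signal within a Context and
// feeds an adjustment back into the process, at a chosen clock resolution.
public final class Circuit {

  private final DoubleSupplier signal;        // observation: read from the process
  private final Consumer<Double> adjustment;  // actuation: write back into the process

  Circuit(DoubleSupplier signal, Consumer<Double> adjustment) {
    this.signal = signal;
    this.adjustment = adjustment;
  }

  void cycle(double target) {
    double observed = signal.getAsDouble();
    adjustment.accept(target - observed);     // feedback proportional to the deviation
  }

  public static void main(String[] args) {
    double[] concurrency = {16.0};            // a hypothetical in-process control variable
    Circuit circuit = new Circuit(
        () -> concurrency[0],
        delta -> concurrency[0] += 0.5 * delta);

    for (int tick = 0; tick < 3; tick++) {    // one clock resolution; others would nest
      circuit.cycle(8.0);
      System.out.println("concurrency = " + concurrency[0]);
    }
  }
}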