Streamlining Observability Pipelines

This article was originally posted in 2020 on the OpenSignals website, which is now defunct.

Shipping 

The first generation of Observability instrumentation libraries, toolkits, and agents has combined the big data pipelines of application performance monitoring (traces, metrics) and event collection (logs) into one enormous pipe to the cloud. Beyond some primitive sampling and buffer management, these measuring and data collection components are focused almost exclusively on shipping bloated data payloads over the network. Minimal intelligence in adaptivity or selectivity, pertaining to context and situation, is baked into the software. In some cases where an agent is deployed, there is automatic discovery and instrumentation but little runtime regulation.

Pipelines

You might imagine the reason for making such components and pipelines as simplistic as possible is to reduce the consumption of processor power at the source, but this would be somewhat naive. Most of the time, the overriding principle is to move the data away from the origin and into the service provider’s space, keeping the components, connectors, and channels as dumb as possible. All the magic happens in the backend processes and services under the operational and change management of the service provider. The engineers building such primitive data collection components do not think about services, situations, signals, status, or significance. 

Duplication

Because of this, much of the data collected is redundant or of little value or importance. The amount of duplication in transmission, and the inefficiency of converting from one encoding to another as data moves from service to library, from agent to endpoint, and from queue to store, is extraordinarily high by any enterprise standard. Data is invariably sampled or dropped when it is needed most.

Garbage

Instrumentation libraries such as distributed tracing and logging are, at the core of their design, just data sinks or black holes where much of the data disappears, never to be seen again. Little regard is given to meaning or relevance with respect to an active goal or evolving situation. Value is generated only when the data is required and accessed, if at all. Any attempted reconstruction of the problem happens after the data has been transmitted to the cloud, if the situation is even considered within the reporting tooling.

Because the primary consumer of observability is painted as a human aimlessly wandering a dataverse, waiting for an unknown unknown to plop right in front of their face, there is simply no way to curate this data at the source before transfer. Garbage in, garbage out!

Leaks

And what do data pipelines do with such waste? Much like waste depots in the real world, they use batching and compression, adding layers and layers of data on top of an ever-growing wasteland of computation and storage that heavily taxes human cognition.

This is how it has appeared to us, coming from a low-latency, high-frequency computing background. It is shocking and amazing how convoluted and complicated it is to achieve any type of “intelligence” at the backend in trying to make all this seem somewhat coherent, consistent, and credible when clearly it is not and cannot ever be.

AIOps

Invariably, most vendors and customers give up on offering or expecting automatic expert advice and assistive operations, resorting to providing ad hoc querying and custom dashboard capabilities, effectively pushing the problem elsewhere, and then claiming this dereliction of service is a new thing – a platform.

The community accepts this because, for most, data collection and dashboard creation is rewarding, and no one seems to know any better. When someone does challenge such efforts’ effectiveness and efficiency, the hundreds, if not thousands, of dashboards are wheeled out to spraypaint a picture of just how complex reality is, not how lost we have become.

Today’s site reliability engineers (SREs) are fearful of not collecting every piece of data, yet at the same time they accept that their situational awareness remains at ground zero, blind to what matters – signals and states.

Trajectories

The next generation of Observability technologies and tooling will most likely take two distinctly different trajectories away from the ever-faltering middle ground that distributed tracing and event logging currently represent. The first trajectory, the high-value road, will introduce new techniques and models to address complex and coordinated system dynamics in a collective social context, rebuilding a proper foundation geared to aiding both humans and artificial agents in assessing and understanding the current and predicted (projected) state of a system of services, resources, and schedulers, in terms of operational goals such as system service level management.

Playbacks

Simultaneously, there will be a strong push to capture and reconstruct software execution flow within and across services via episodic machine memory. This is the high-fidelity road: near-real-time simulated playback, activated on demand by situational awareness tooling following the signal and status changes that direct both human and machine operator attention.

This second trajectory will not be bloated like today’s pipelines because it will focus exclusively on replicating fine-grained method executions and not invocation parameters, request payloads, and whatever baggage items developers like to package in with tracing and logging records without much thought for utility or overhead. The goal is to mimic behavior, not yet another failed attempt at journaling data across boundaries. We need to restore the balance between collection, cognition, and control.
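
To make the second trajectory concrete, the following is a minimal sketch of what recording execution flow as episodic memory might look like – a ring buffer of compact method enter/exit tokens, with playback available on demand and no parameters or payloads captured. The names and structure are purely illustrative assumptions, not part of any product API.

```java
// Illustrative sketch only: an episodic recorder that captures method
// execution flow as compact enter/exit tokens in a ring buffer, deliberately
// omitting invocation parameters and payloads. All names are hypothetical.
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

final class EpisodicRecorder {

  // Each episode entry packs a method id and an enter/exit bit into one long.
  private final long[] episodes;
  private final AtomicLong cursor = new AtomicLong();

  EpisodicRecorder(int capacity) {
    this.episodes = new long[capacity];
  }

  void enter(int methodId) { record(methodId, 1L); }

  void exit(int methodId) { record(methodId, 0L); }

  private void record(int methodId, long enterBit) {
    // Overwrite the oldest entry once the buffer wraps: memory, not a journal.
    int slot = (int) (cursor.getAndIncrement() % episodes.length);
    episodes[slot] = ((long) methodId << 1) | enterBit;
  }

  // Playback hands the recorded flow to a consumer on demand, for example when
  // a status change directs attention to this service.
  void playback(LongConsumer consumer) {
    long end = cursor.get();
    long start = Math.max(0L, end - episodes.length);
    for (long i = start; i < end; i++) {
      consumer.accept(episodes[(int) (i % episodes.length)]);
    }
  }
}
```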

Signals

Humainary offers a solution that vastly streamlines the Observability data pipeline by shifting much of the computation from the backend to the source (systems and services) while being substantially more efficient than current data collection components.

Instead of collecting many arbitrary data values alongside captured events (log records or trace spans), Humainary concentrates on what is significant and relevant to the situation and operational goal, in the form of services, signals, and inferred states.

Inference processing, where a service’s status is derived from the sequencing of emitted signals, a set of sixteen behavioral codes, is performed locally. No data transmission needs to occur at the signaling level, though Humainary allows for such cases via callbacks.
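
As a rough illustration of what such local inference could look like, the sketch below folds a window of recent signals into a status value. The signal names, status ladder, and scoring are hypothetical stand-ins for demonstration, not the actual sixteen Humainary behavioral codes or any provider’s inference logic.

```java
// Illustrative sketch only: deriving a service status locally from the
// sequence of signals it has emitted. Signal and Status values here are a
// hypothetical subset, not the actual Humainary behavioral codes.
import java.util.List;

enum Signal { CALL, SUCCEED, FAIL, RETRY, DROP }    // hypothetical subset
enum Status { OK, DEVIATING, DEGRADED, DEFECTIVE }  // hypothetical ladder

final class StatusInference {

  // Signals are ordered oldest to newest; older evidence decays so that the
  // most recent behavior dominates the inferred status.
  static Status infer(List<Signal> signals) {
    double score = 0.0;
    for (Signal signal : signals) {
      score *= 0.9;                                 // decay earlier evidence
      switch (signal) {
        case FAIL, DROP -> score += 1.0;
        case RETRY      -> score += 0.5;
        case SUCCEED    -> score -= 0.25;
        default         -> { }                      // CALL carries no judgment
      }
    }
    if (score > 3.0) return Status.DEFECTIVE;
    if (score > 1.5) return Status.DEGRADED;
    if (score > 0.5) return Status.DEVIATING;
    return Status.OK;
  }
}
```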

Compression

Even when a signal is transmitted remotely as an event, it needs only three fields – the service name, the orientation token, and the signal token. This is a tiny fraction of what is involved in sending a log record or distributed trace span, especially when stack traces, tags, labels, events, and fields are factored in.

Humainary deals in events sized in single-digit bytes, whereas the instrumentation libraries of yesteryear size their events in kilobytes.
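
A back-of-the-envelope sketch of such an event shows why the wire footprint stays so small once names are interned as small integer tokens. The field names and encoding below are assumptions for illustration, not a defined Humainary wire format.

```java
// Illustrative sketch only: a signal event reduced to three tokenized fields.
// With interned names, a provider could encode this in a handful of bytes,
// versus the kilobytes of a typical log record or trace span.
record SignalEvent(
  int serviceToken,   // interned service name
  byte orientation,   // the orientation token, e.g. emitted versus received
  byte signal         // one of the sixteen behavioral codes
) { }
```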

You would be mistaken to believe that observability is significantly reduced by focusing on signals and status tokens. It is quite the opposite. Because of the efficient design of Humainary, a service need not be just an endpoint or exit point in some distributed workflow. A service can be as small as a code block within a method executed millions of times a second within a process or runtime.

Micronization

Humainary allows you to decompose a microservice into hundreds of sub-services without significantly perturbing processing times, as would be the case with distributed tracing or event logging. Of course, the degree of new code coverage will depend on the underlying service provider implementation of Humainary deployed and the configuration of the plugins installed.
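
The sketch below shows the general idea of carving sub-services out of a single request handler, with each code block signaling its own behavior. The Services and Service interfaces and their methods are hypothetical stand-ins for illustration, not the actual Humainary API.

```java
// Illustrative sketch only: treating code blocks inside one process as
// sub-services that emit signals. Interfaces and names are hypothetical.
interface Service {
  void call();
  void succeed();
  void fail();
}

interface Services {
  Service service(String name);   // look up or create a named (sub-)service
}

final class OrderHandler {

  private final Service pricing;
  private final Service inventory;

  OrderHandler(Services services) {
    // Each block of work gets its own sub-service identity.
    this.pricing   = services.service("orders.pricing");
    this.inventory = services.service("orders.inventory");
  }

  void handle(Order order) {
    pricing.call();
    try { price(order); pricing.succeed(); }
    catch (RuntimeException e) { pricing.fail(); throw e; }

    inventory.call();
    try { reserve(order); inventory.succeed(); }
    catch (RuntimeException e) { inventory.fail(); throw e; }
  }

  private void price(Order order)   { /* domain logic */ }

  private void reserve(Order order) { /* domain logic */ }

  record Order(String id) { }
}
```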

It is expected that for most large-scale systems, Humainary will only transmit status changes across the network for collective intelligence, where the inferred status values for a service, taken from multiple sources, will be aggregated by way of set policies.

Further, status changes will be propagated into higher system contexts and forwarded onto more sophisticated supervision and control routines for profiling and prediction purposes – fast, cheap, scalable, and effective.
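
As an illustration of such a policy, the sketch below collapses the statuses reported by multiple sources into a single collective status using a worst-of-quorum rule. The rule, the status ladder, and the names are assumptions for demonstration, not prescribed Humainary behavior.

```java
// Illustrative sketch only: aggregating inferred statuses from multiple
// sources under a set policy. Names and the policy itself are hypothetical.
import java.util.Collection;
import java.util.Map;
import java.util.stream.Collectors;

enum ServiceStatus { OK, DEVIATING, DEGRADED, DEFECTIVE }  // hypothetical ladder

final class StatusAggregation {

  // Escalate to the worst non-OK status reported by at least `quorum` sources.
  static ServiceStatus aggregate(Collection<ServiceStatus> reported, int quorum) {
    Map<ServiceStatus, Long> counts = reported.stream()
        .collect(Collectors.groupingBy(s -> s, Collectors.counting()));

    ServiceStatus worst = ServiceStatus.OK;
    for (ServiceStatus status : ServiceStatus.values()) {
      boolean meetsQuorum = counts.getOrDefault(status, 0L) >= quorum;
      if (status != ServiceStatus.OK && meetsQuorum && status.ordinal() > worst.ordinal()) {
        worst = status;
      }
    }
    return worst;
  }
}
```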