AIOps – A Postmodern Observability Model

Legacy Logging

The model employed today across various observability technologies and solutions is unsuitable for effective and efficient information technology service management and distributed systems operations. Data structures such as traces, spans, baggage items, logs, tags, counters, and gauges have been around for decades, yet service management outcomes have not improved. One reason is that when technologies such as distributed tracing and event logging emerged, the computing landscape was far simpler, smaller, and more stable.

While there might still be some use for such data collection techniques for diagnostic purposes, the conceptual gap between what instruments collect and what distributed cognition (humans and machines) and situational awareness (attention, assessment) require is so vast that it cannot be bridged without enormous computational and storage costs.

For many, tracing is another form of logging, with correlation identifiers added. Some even say the same holds for metrics. Whether or not you fully agree, this widespread thinking highlights that much of the data collected has little operational value in managing services. These are not signals. These do not lend themselves to aiding a machine or human in discerning the situation, locally or globally.

For a long time, the industry has hoped that some magical transformation and translation mechanism will reduce the ever-growing gap between the saturating input channel (collection) and the shrinking output channel (cognition). It has not come about because so much engineering effort is spent trying to expand capacities to contain the avalanche of data flowing downstream through leaking pipes.

No observability vendor will admit the truth – the data-details-centric approach is unsustainable, and a rethink and re-education are required. Calling a trace a signal does not make it a signal. The same goes for logging and metrics. More data will only worsen matters. The entire ecosystem is dysfunctional because vendors are rewarded for accumulating data instead of offering value by extrapolating system dynamics, elevating situational awareness, and exposing operational states to those in most need, such as site reliability engineers.

Postmodern Observability

Instead of yesteryear’s concepts of traces and logs, we propose the following model, which can better serve site reliability engineering and service operations by being foundational to developing situational awareness capabilities and system resilience capacities, particularly adaptability and experimentation, as in dynamic configuration and chaos engineering. The concepts within the model are concise, clear, and comprehensive because the design focused on how best to synthesize effective and efficient layered cognitive structures and processes that more readily support communication, coordination, and cooperation between machines and humans.

We start with the Subject, the system component under observation and possibly control. A Subject can have one or more traits, such as Resource, Service, or Scheduler. The trait, a behavioral perspective taken by a Source, dictates the possible Signs emitted through an Event. Subjects have a Name and can be nested within other Subjects depending on how a Source or Observer has chosen to represent a system.
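As a rough illustration of named, nested Subjects carrying traits, consider the following minimal Python sketch. The class layout, the `path` helper, the `/` separator, and the example names (`cluster-a`, `checkout`, `db-pool`) are assumptions made for illustration only; the model itself does not prescribe them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Trait(Enum):
    # Traits named in the text; the list is not meant to be exhaustive.
    RESOURCE = "resource"
    SERVICE = "service"
    SCHEDULER = "scheduler"


@dataclass(frozen=True)
class Subject:
    name: str
    traits: frozenset = frozenset()
    parent: Optional["Subject"] = None  # enclosing Subject, if nested

    def path(self) -> str:
        """Qualified name built by walking up the nesting chain (hypothetical helper)."""
        if self.parent is None:
            return self.name
        return f"{self.parent.path()}/{self.name}"


# Hypothetical representation chosen by some Source or Observer.
cluster = Subject("cluster-a")
checkout = Subject("checkout", frozenset({Trait.SERVICE}), parent=cluster)
pool = Subject("db-pool", frozenset({Trait.RESOURCE}), parent=checkout)

print(pool.path())
```

How deep the nesting goes is a choice of whoever represents the system; a different Observer could model the same pool as a top-level Subject.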

A Source or Observer publishes an Event that pertains to a single Subject reference, though it should be noted that a Subject can be a collection of Subjects as in a system. An Observer is a Source that also consumes Events published by Sources, including other Observers.
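The relationship between Source and Observer can be sketched as a small publish/subscribe arrangement. This is only one possible reading: the method names (`subscribe`, `publish`, `observe`, `on_event`) and the dictionary used as an Event placeholder are hypothetical, and a real Observer would assess what it receives rather than simply republish it.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]  # placeholder shape: {"subject": ..., "sign": ...}


class Source:
    """Publishes Events to whoever has subscribed."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Event], None]] = []

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, event: Event) -> None:
        for callback in self._subscribers:
            callback(event)


class Observer(Source):
    """A Source that also consumes Events from other Sources."""

    def __init__(self) -> None:
        super().__init__()
        self.seen: List[Event] = []

    def observe(self, source: Source) -> None:
        source.subscribe(self.on_event)

    def on_event(self, event: Event) -> None:
        self.seen.append(event)
        # Republish unchanged; an actual Observer would assess before publishing.
        self.publish(dict(event))


instrument = Source()
first, second = Observer(), Observer()
first.observe(instrument)   # an Observer consuming a plain Source
second.observe(first)       # an Observer consuming another Observer
instrument.publish({"subject": "checkout", "sign": "ALIVE"})
```

After the final call, both `first.seen` and `second.seen` hold one Event about the same Subject, which shows Observers stacking on top of one another.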

A Source can be the Subject or an Instrument that a Subject uses to support observability. A Source emits an Event that has a Signal as its emittance value. A Signal consists of a Sign, an enum-like token, and an Orientation. The Orientation value can be EMIT or RECEIPT.

An Orientation value of EMIT indicates that the phenomenon that the Sign stands for is happening presently (temporal) and locally (spatial). An Orientation value of RECEIPT indicates the phenomenon has already occurred external to the spatial scope of the Source and that the Source is only reporting receipt of the Signal communicated and determined by other retrospective means.
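The structure described above (an Event enveloping a Signal, which pairs a Sign with an Orientation) might be modeled as follows. This is a sketch under assumptions: the Subject is simplified to a plain name, and the small `Sign` vocabulary is limited to tokens that appear in this article.

```python
from dataclasses import dataclass
from enum import Enum


class Orientation(Enum):
    EMIT = "emit"        # happening now, within the Source's own scope
    RECEIPT = "receipt"  # happened elsewhere; the Source reports learning of it


class Sign(Enum):
    # Illustrative subset; a real sign language would be defined per trait.
    ALIVE = "alive"
    EMPTY = "empty"
    FULL = "full"


@dataclass(frozen=True)
class Signal:
    sign: Sign
    orientation: Orientation


@dataclass(frozen=True)
class Event:
    subject: str    # Subject reference, simplified here to a name
    signal: Signal


# "queue-7" is a hypothetical Subject name.
event = Event("queue-7", Signal(Sign.FULL, Orientation.EMIT))
```

Keeping the types immutable makes an Event a plain value that can be compared, stored, and replayed without concern for later mutation.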

Consider an active component within a system that publishes UDP packets, indicating it is running. While broadcasting a UDP message, it publishes a Signal with the Sign value of ALIVE and an Orientation value of EMIT via a Source, the Instrument here. The Subject of the Event, enveloping the Signal, refers to the component via a Name and an instance Identifier. This can be considered self-reporting.

Now when a listener receives the UDP message, that listener, a Source itself, can, if instrumented, report the receipt by publishing a Signal where the Orientation value is RECEIPT, and the Sign value is ALIVE. The Subject is the reference used by the listener for the sender. Here the signaling of Signs creates a replica of an observable reality relevant for operational purposes, whether emitting or receiving.
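The heartbeat exchange can be approximated in a few lines. To keep the sketch self-contained, an in-memory list stands in for the UDP network, and a shared `journal` list stands in for whatever carries published Events; both are assumptions, as are the function names and the `node-1#42` name-plus-identifier format.

```python
from collections import namedtuple

# source: who published; subject: who the Event is about.
Event = namedtuple("Event", "source subject sign orientation")

journal = []  # stand-in for the channel on which Events are published


def send_heartbeat(component: str, network: list) -> None:
    network.append(component.encode())  # stand-in for the UDP broadcast
    # Self-reporting: the component is both Source and Subject.
    journal.append(Event(component, component, "ALIVE", "EMIT"))


def receive_heartbeats(listener: str, network: list) -> None:
    while network:
        sender = network.pop(0).decode()
        # The listener reports receipt; the Subject is still the sender.
        journal.append(Event(listener, sender, "ALIVE", "RECEIPT"))


network = []
send_heartbeat("node-1#42", network)
receive_heartbeats("monitor", network)

for e in journal:
    print(e.source, e.subject, e.sign, e.orientation)
```

Both Events carry the same Sign and refer to the same Subject; they differ only in which Source published them and in the Orientation.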

Signals, Signs, and Sources represent the low-level inputs into the observability model. Signals, which can typically be grouped into operations or outcomes, are explicitly designed to influence the reasoning of Observers and the behavior of Actuators (controllability).

Before moving up to the next layer, it is vital to note that a Sign, a token, is a meaningful and self-explanatory term within a small language of signs. It is not a metric such as the length of a queue, because a length value only has meaning if the Observer knows much more, such as whether the queue is bounded and, if so, its capacity. Here EMPTY and FULL are Signs.
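To make the contrast with a metric concrete, here is a hypothetical bounded queue that emits Signs at the point where the meaning is known. The class and its `signs` list are illustrative assumptions; the point is that the capacity stays private to the Source, and downstream Observers see only EMPTY and FULL.

```python
from collections import deque


class BoundedQueue:
    """Queue that emits Signs instead of exposing its length."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items = deque()
        self.signs = []  # Signs emitted so far (stand-in for publishing)

    def put(self, item) -> bool:
        if len(self.items) >= self.capacity:
            return False  # rejected: the queue is already full
        self.items.append(item)
        if len(self.items) == self.capacity:
            self.signs.append("FULL")  # only the queue knows its bound
        return True

    def get(self):
        item = self.items.popleft()
        if not self.items:
            self.signs.append("EMPTY")
        return item


q = BoundedQueue(capacity=2)
q.put("a")
q.put("b")   # reaches capacity
q.get()
q.get()      # drains the queue
print(q.signs)
```

An Observer reading a raw length of 2 could not tell whether the queue was nearly idle or saturated; the Sign carries that judgment with it.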

Let us move on to the Observer, which, as stated above, is a Source that also subscribes to Events published by other Sources. Observers, in general, publish Signs that reflect an assessment of a Subject’s Status (internal state). A Status has a Confidence score as most assessments are an inference or scoring of a collection or sequencing of Signals received by an Observer for a Subject. The scoring is also necessary when a Subject represents an aggregation of multiple Subjects. Our thinking about scoring is that it is qualitative and not quantitative. Instead of some confidence percentage value, it should be mapped to a set of terms.
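One way to picture a qualitative assessment is a function that turns a window of Signs into a Status paired with a Confidence term. Everything specific here is assumed for illustration: the `Status` and `Confidence` vocabularies, the `SUCCEED` and `FAIL` Signs, and the window-size thresholds are not defined by the model.

```python
from enum import Enum
from typing import List, Tuple


class Status(Enum):       # hypothetical status terms
    STABLE = "stable"
    DEGRADED = "degraded"
    DOWN = "down"


class Confidence(Enum):   # hypothetical qualitative terms, not percentages
    TENTATIVE = "tentative"
    MEASURED = "measured"
    CONFIRMED = "confirmed"


def assess(signs: List[str]) -> Tuple[Status, Confidence]:
    """Assess one Subject from a window of 'SUCCEED' / 'FAIL' Signs."""
    fails = signs.count("FAIL")
    if fails == 0:
        status = Status.STABLE
    elif fails < len(signs):
        status = Status.DEGRADED
    else:
        status = Status.DOWN

    # More evidence yields a stronger term; the cut-offs are arbitrary.
    if len(signs) < 3:
        confidence = Confidence.TENTATIVE
    elif len(signs) < 10:
        confidence = Confidence.MEASURED
    else:
        confidence = Confidence.CONFIRMED
    return status, confidence


print(assess(["SUCCEED", "FAIL", "SUCCEED", "SUCCEED"]))
```

The consumer of the assessment receives two terms it can act on directly, rather than a number that would need its own interpretation.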

An Observer can publish Signals, not merely as a relay, but especially when a set of Signals from other Sources for a particular Subject represents a higher-order operation or outcome. The goal of Observers is to make sense of and shorten a sequence of Signs, by way of pattern mining and matching, into a single Event, with Observers layered on top of Observers up to situated intelligence.
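The shortening of a Sign sequence can be sketched as simple pattern matching. The patterns, the Sign names (`CALL`, `FAIL`, `RETRY`, `SUCCEED`), and the higher-order Signs they collapse into are hypothetical, and the matching is deliberately naive (longest fixed pattern first); pattern mining, that is, discovering such patterns from data, is outside this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical patterns mapping a run of Signs to one higher-order Sign.
PATTERNS: Dict[Tuple[str, ...], str] = {
    ("CALL", "FAIL", "RETRY", "CALL", "SUCCEED"): "RECOVERED",
    ("CALL", "SUCCEED"): "SUCCEEDED",
    ("CALL", "FAIL"): "FAILED",
}


def condense(signs: List[str]) -> List[str]:
    """Replace the longest matching pattern at each position with one Sign."""
    ordered = sorted(PATTERNS, key=len, reverse=True)
    out: List[str] = []
    i = 0
    while i < len(signs):
        for pattern in ordered:
            if tuple(signs[i:i + len(pattern)]) == pattern:
                out.append(PATTERNS[pattern])
                i += len(pattern)
                break
        else:
            out.append(signs[i])  # no pattern matched; pass the Sign through
            i += 1
    return out


raw = ["CALL", "SUCCEED",
       "CALL", "FAIL", "RETRY", "CALL", "SUCCEED",
       "CALL", "FAIL"]
print(condense(raw))
```

Nine low-level Signs collapse into three, and a further Observer could apply the same idea to the condensed sequence in turn.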