What AI SRE Tells Us About Observability

The AI SRE product category appeared across the observability industry within a remarkably compressed period. Established platforms converged on agents that read telemetry, correlate changes, investigate incidents, and produce an account of what went wrong. The convergence is telling. It means that a general-purpose reasoning layer could be attached to each vendor’s existing telemetry estate and produce a recognizably similar proposition.

Two things made this possible. The first was a standardized estate of trace, log, and metric records. The second was the newly available foundation model, capable of reading such records and operating the tools around them.

What the vendors lacked between the two was the differentiator that should have taken years to reproduce: a proprietary operational understanding of the systems they observed.

This essay treats the speed of convergence as a diagnostic for the observability industry, one that exposes how little operational intelligence the platforms had accumulated beneath their interfaces. It locates the reason in an architectural omission that predates the foundation model by a decade.

What Was Never Built

The traditional observability stack runs in a fixed order: instrumentation, telemetry, storage, query, visualization, human interpretation. Each stage transforms the prior one, and the chain terminates at a human. Everything above the data line was supplied by the operator: what the system was trying to accomplish, which relationships mattered, what counted as normal, whether a change was causal or incidental, how a disturbance would propagate, which future states were becoming likely, and which intervention was safe. The operator was the one who held the model. It lived in the operator’s mind, was refreshed after each event, and disappeared when the operator moved on.

The platform accumulated representations, records not signals, and never made the operating whole its primary subject of understanding. A trace represents the path of a request. A metric represents a quantity aggregated over time. A log preserves a selected assertion. A service map presents inferred connectivity. Each is a partial projection produced through a particular instrument and for a particular purpose. None of them, alone or merely accumulated, constitutes a model of the living operational system. A living system has properties that records do not automatically preserve: nested operational boundaries, flows crossing those boundaries, transformations performed along those flows, capacities and constraints, feedback loops, delays, accumulation, state transitions, regulatory mechanisms, intended purposes, and alternative possible futures. These are the constituents of an operational world-model. Observability products can hold fragments of them but rarely as an integrated, first-class model of the operating whole. The fragments must still be assembled into a coherent world at investigation time, by whoever, or whatever, is reading.

The Substitution

A foundation model can read records and construct a plausible explanation. That capability fits the vacant slot exactly. The AI SRE replaces the human interpreter with an artificial one, so the chain becomes: instrumentation, telemetry, storage, query, foundation-model interpretation, narrative.

With AI SRE, the interpreter changed identity while the position remained the same. The reconstruction is still a reconstruction, performed on each incident, from the same categories of fragments. The intelligence is forensic, assembled after a disturbance from the available evidence. It is intelligence by repeated reconstruction.

Every vendor reached for the same components because the components were the only material the data offered. The differentiator collapsed to search, correlation, and narration, which are capacities of a foundation model rather than theories of the observed system. For AI SRE, the product became the narrator, one that is available to everyone.

AI SRE is therefore an admission of a past failure. A system was never taught to understand itself, so an LLM is asked to reconstruct an understanding whenever something goes wrong. The agent is a prosthesis fitted to a missing limb.

The Movement Toward Action

Once every platform produces a competent incident story, the next differentiating claim becomes “ours can act.” The industry trajectory follows directly: data presentation, data interpretation, action generation, and operational control.

For instance, Dash0 introduced Agent0 in November 2025 as a family of specialized agents designed to assist engineers with troubleshooting, querying, instrumentation, and dashboarding. By March 2026, the company unveiled a new vision: an autonomous nervous system for production, with agents spanning operations, deployment, security, cost, migration, and observability maintenance. The AI SRE strategy was not abandoned. It was folded into a broader AIOps proposition because assisted investigation alone could not provide enough differentiation. The repositioning reads as a belated recognition of that limit, one the company had not foreseen.

AI SRE for All

When the underlying asset is telemetry, the intelligence layer is portable. An organization connects its own agent to its OpenTelemetry stores, deployment systems, source repositories, incident records, configuration, runbooks, ticketing, and institutional knowledge. The organization’s agent operates inside the organization’s boundary and carries context the vendor agent cannot reach: internal architecture, ownership, policy, business priority, risk tolerance.

An organization may permit a vendor to recommend a query or summarize an incident. It is more cautious about permitting that vendor to restart services, alter routing, change capacity, modify code, or advance a deployment. Authority remains close to accountability. The party that answers for the outcome retains the right to cause it.
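The distinction between advisory and mutating actions can be made concrete with a small sketch. Everything named here is an assumption introduced for illustration: the `Principal` enum, the action names (taken from the examples in the paragraph above), and the rule that only the organization's own agent may perform mutating actions. It is not a description of any vendor's permission model.

```python
from enum import Enum


class Principal(Enum):
    """Hypothetical parties that might request an operational action."""
    VENDOR_AGENT = "vendor_agent"
    ORG_AGENT = "org_agent"


# Illustrative action names drawn from the surrounding text.
ADVISORY = {"recommend_query", "summarize_incident"}
MUTATING = {
    "restart_service",
    "alter_routing",
    "change_capacity",
    "modify_code",
    "advance_deployment",
}


def is_permitted(principal: Principal, action: str) -> bool:
    """Return True if the principal may perform the action.

    Assumed rule: advisory actions are open to any principal, mutating
    actions are reserved for the accountable party, and anything
    unrecognized is denied by default.
    """
    if action in ADVISORY:
        return True
    if action in MUTATING:
        return principal is Principal.ORG_AGENT
    return False
```

The deny-by-default branch reflects the essay's point that authority stays with the accountable party: an action nobody has classified is not one a vendor agent should be able to take.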

The Missing Layer

The industry has never built the essential components for an AI SRE: semantics, system models, situational awareness, and controllability. Instead, it has focused on telemetry, improved search, LLMs, dashboards, and templated runbooks.

An architecture that closes the omission would carry an operational model within the observing system. Coherent signal circulation preserves ordering and relation. Judgment placed near the point of emission carries more operational meaning forward, reducing what must later be inferred from generic records. A holonic flow model supplies the persistent object: nested operating wholes, each with inlets, outlets, internal transformations, states, constraints, and responsibilities to larger wholes. Situational awareness integrates those judgments across boundaries and time. Controllability intervenes in a represented dynamic. Operability governs whether an intervention is legitimate, safe, and sustainable. In such an arrangement, intelligence is not confined to a narrator stationed above a warehouse of records. It becomes a property progressively constructed through the architecture of the system itself.
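A minimal sketch may help fix what a persistent holonic model could look like as a data structure. It is an illustration under stated assumptions, not an implementation of any existing product: the `Holon` class, its fields, the three state names, and the 80 percent threshold are all hypothetical. Each holon judges its own state locally (judgment near the point of emission), and a parent integrates those judgments across its boundary (situational awareness).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional

# Hypothetical state vocabulary, ordered from least to most severe.
SEVERITY = {"nominal": 0, "strained": 1, "saturated": 2}


@dataclass(eq=False)
class Holon:
    """An operating whole that is also a part of a larger whole."""
    name: str
    inlets: list[str] = field(default_factory=list)
    outlets: list[str] = field(default_factory=list)
    capacity: float = 0.0  # constraint: maximum sustainable flow
    load: float = 0.0      # current flow through the holon
    parent: Optional[Holon] = field(default=None, repr=False)
    children: list[Holon] = field(default_factory=list)

    def add(self, child: Holon) -> Holon:
        """Nest a child holon and record its responsibility to this whole."""
        child.parent = self
        self.children.append(child)
        return child

    def assess(self) -> str:
        """Local judgment: derive this holon's state from load and capacity."""
        if self.load > self.capacity:
            return "saturated"
        if self.load > 0.8 * self.capacity:  # assumed threshold
            return "strained"
        return "nominal"

    def situation(self) -> str:
        """Integrate judgments: the worst state of this holon or any descendant."""
        states = [self.assess()] + [c.situation() for c in self.children]
        return max(states, key=SEVERITY.__getitem__)


# Usage: a checkout holon containing a payments holon that is over capacity.
checkout = Holon("checkout", inlets=["orders"], outlets=["receipts"],
                 capacity=100.0, load=50.0)
payments = checkout.add(Holon("payments", capacity=10.0, load=12.0))
```

In this arrangement the model persists between incidents. An investigator, human or artificial, queries `checkout.situation()` instead of reassembling the relationship between checkout and payments from traces each time.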

The Diagnosis

Observability accumulated records and remained principally a data discipline. It collected representations of what systems emitted, but it did not make the organized dynamics of those systems its central object. The foundation model now constructs temporary operational narratives from that evidence, while vendors move from reconstruction toward action without first establishing the models that accountable situational awareness requires. AI SRE is the prosthesis.