Notes

  • Observability: The Interpretive Ladder
    Do we have metrics, logs, and traces? This is the question most teams use to assess their observability. It is answered with an inventory: count the instrument types, and when all three are present, the assessment is complete. The outputs are counted; what the outputs let anyone do goes unexamined. This post works from another question: What interpretive function are we performing here? This question is answered by naming an operation and placing it on a scale. The scale runs from acknowledging that something happened to projecting where a situation is heading. Each rung is an interpretive operation, and each operation determines what can be inferred from an output and where that inference takes place.
  • Observability: The Unsustainable Path
    The observability industry must unequivocally state that its current approach is unsustainable. The reason is not that observability is redundant; on the contrary, it is becoming increasingly crucial. The systems we construct are becoming increasingly complex, unpredictable, and unreliable, and the need to comprehend their operations has never been more pressing. However, the industry’s persistent solution is fundamentally flawed. When faced with uncertainty, the industry resorts to collecting more data. This approach relies on the assumption that by gathering more data, preserving context, and transmitting enough records to various backends, a deeper understanding will eventually emerge. It will not.
  • What Blind Systems Leave Behind
    Modern software has learned to see itself through traces. To the software industry, this approach feels natural. Systems are complex, so we trace them. Services are numerous, so we correlate them. Events are dispersed, so we collect them into a path that we can maintain. The trace appears to be the obvious starting point for comprehending a system. However, it is an unusual thing to construct. Surprisingly, most systems in the world comprehend flow and function far more effectively without the need for a trace.
  • To Observe Agents we need Signs not Traces
    For many, traces seem to be a fundamental building block of telemetry. Computers operate through a series of calls, with each procedure triggering the next, resulting in a cascading chain of invocations. When telemetry is organized around traces, it appears to be an intrinsic part of the machine’s architecture. Tracing essentially records the machine’s calls. While this intuitive understanding is helpful, it falls far short of providing a comprehensive grasp of systems.
  • OpenTelemetry is not Observability
    An organization can adopt OpenTelemetry (OTel): every service emits spans, every host reports metrics, and data flows perfectly through the collector into a store. The pipelines are healthy, and the dashboards are full. Yet, if you ask whether the system itself is healthy, the data alone cannot answer. Observability is a property of a system: it means a system’s internal state can be inferred from its outputs. OTel transmits data records, but stops there. The vital work of inference, situational intelligence, is left unaddressed.
  • Observability: Searchable Ignorance
    Observability vendors have reduced observability to the ability to ask questions of data. They collect telemetry, store it, index it, visualize it, correlate it, and now verbalize it through AI agents. Each capability answers a question posed within the data. The system itself stays outside the frame. The uncomfortable truth is that enterprises are satisfied with data answerability because system answerability would expose the absence of system understanding.
  • Between the Threshold and the Sentence
    Look at the current observability stack and two layers stand out, sitting one on top of the other. At the bottom is the threshold layer. It decides what is true. Error rate crossed a line. A connection wasn’t closed. The ninety-ninth percentile exceeded its budget. At the top is the language layer. It decides how to say what the bottom layer found. It takes a true thing and renders it as a fluent sentence. Stack those two and you get a system that detects with a comparator and speaks with a transformer. The detection is decades old, while the speech is brand new. The missing seam between them is precisely where understanding was supposed to reside.
  • Putting the AI in Failing – DataDog and Dash0
    The post argues that Datadog’s Bits AI and Dash0’s Agent0, despite the obvious polish, primarily automate evidence retrieval, telemetry record navigation, threshold-based insights, and natural-language summarization. They do not demonstrate a genuine “system model” or real-time situational intelligence that participates in regulation as incidents unfold.
  • What AI SRE Tells Us About Observability
    Observability, mainly a data field, kept track of what happened and showed off the system’s outputs. But it didn’t really look at how those systems worked together. Now, the language model is putting together quick stories about what’s happening based on that info. Companies are moving from just figuring out what happened to actually doing something, but they haven’t set up the models they need to really understand the situation.
  • Why AI Forces Bigger Bets
    There was a time when software strategy could hide inside delivery. A company could say it had a roadmap, and the roadmap would be made of features. Some would be small. Some would be ambitious. Most would take months to design, build, test, integrate, and release. The effort itself created weight. The fact that something was hard to build gave it a kind of strategic seriousness. That world is fading.
  • AI SRE – The Verbalization Layer
    Today’s AI SRE products aren’t autonomous operators or nervous systems. Instead, they’re verbalization layers that overlay telemetry, tickets, runbooks, and dashboards. These products are useful for summarizing known information but are structurally incapable of replacing the system model, engineering judgment, and situational intelligence required for actual operational regulation. Essentially, these products connect language models to existing operational interfaces, generating fluent summaries while leaving the underlying system model untouched.
  • The False Promise of the AI Nervous System
    Promising an “AI nervous system” for production infrastructure is fashionable. The pitch is enticing: centralize raw telemetry, let an AI process it, and observe autonomous monitoring and repairs. However, adding an AI to a centralized database doesn’t create a nervous system; it merely automates an external observer’s role. A true nervous system isn’t a remote brain processing and exporting data.
  • The Thinking Arrow
    This technical note emphasizes that true operational resilience hinges on an often-overlooked aspect: the “thinking arrow” within us. This internal process transforms raw actions, such as incidents, memory traces, and data, into valuable knowledge, including models, runbooks, and a deeper comprehension of the system. This crucial step—abduction, model-building, and quick thinking—generates the “interior” (a personal, reconstructible mental model of the system), which is essential for effective steering, particularly in unfamiliar situations.