Observability 101 – Introduction


In a previous post, we took issue with the use of pillars as a concept within the Observability, Application Performance Monitoring (APM), and Site Reliability Engineering (SRE) communities. In this post, we sketch out the main conceptual areas of concern that we regard as necessary for making sense of the field, keeping one’s sanity, and avoiding the noise generated by vendor marketing departments.

Instead of chattering on about pillars, we will be (re)framing our discussions of Observability under the following headings:

# Measure, Model, and Memory
# Past, Present, and Projected
# Local, Outbound, and Remote
# Perception, Cognition, and Action

# Form, Function, and Flow
# System, Structure, and State

# Instrumentation, Interception, and Injection
# Static, Dynamic, and Adaptive

# Measurement, Collection, and Transmission
# Sampling, Aggregating, and Dropping

# Activity, Action, and Operation
# Caller, Callee, and Call
# Synch, Asynch, and Scheduled

# Ingress, Transit, and Egress
# Request, Context, and Conversation
# Enqueuing, Waiting, and Servicing

# Time, Clock, and Timing
# Process, Stack, and Frame
# Probe, Resource, and Meter

# Context, Environment, and Change
# Instrument, Referent, and Emittance
# Source, Sink, and Store

# Sign, State, and Symptom
# Signal, Service, and Status
# Objective, Indicator, and Level
# Deviation, Degradation, and Deficient

# Execution, Expectation, and Error
# Attention, Awareness, and Action
# Scene, Situation, and Scenario

# Reliability, Resilience, and Robustness
# Restart, Rollback, and Recovery

# Controller, Supervisor, and Valve
# Resource, Pool, and Reservation
# Require, Reserve, and Release

# Communication, Cognition, and Control
# Coordination, Cooperation, and Collaboration

# Steering, Synthesis, and Simulation
# Observability, Controllability, and Operability

But before we dive into the above headings, we will address the big question that keeps coming up: Observability versus Monitoring.

Monitoring / Observability

The debates over the difference between Monitoring and Observability have, for the most part, been driven by the need of small niche players to differentiate themselves from the more established application performance monitoring vendors. The Wikipedia definition of Observability is coupled with Control Theory (Controllability). Then there is the work being done in the community under the umbrella of OpenTelemetry (OTel), which is mainly about collecting more and more data and detail, whether logs or distributed traces (which OTel treats as another form of logging).

Strategic / Operative

If we take the OTel view, then Observability is the operative part of the Observability-Monitoring relationship. Observability collects and transmits data along pipes that reach back to a cloud endpoint. We call such endpoints blackholes. There, data is stored, analyzed, and converted into reports and dashboards used for “Monitoring” and, occasionally, diagnostic purposes. Monitoring itself is not talked about much; instead, it is all about exploring or debugging data and marveling at its high dimensionality, without a care in the world for the limited value on offer (99% of logging is irrelevant).

Steering Observability

We, on the other hand, view Monitoring, as in attention and awareness, as giving purpose and direction to Observability. Monitoring is strategic; it steers, or at least should steer, the instrumentation, measurement, and collection efforts of Observability. Without Monitoring, Observability would be just a wasteful collector of anything and everything. Sadly, that is what is happening today, because Observability has been forcibly divorced and disconnected from Monitoring by a very vocal few charlatans, their CNCF cohorts, and OTel coders. The attention and acuity of engineering have been hijacked, databoards being the result. Site reliability engineers and other monitoring specialists are doing what is easy, not what is right.
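To make the steering relationship concrete, here is a minimal sketch, not a real library or anyone's production design. The `Monitor` and `Collector` classes, their methods, and the signal names are all hypothetical, invented purely for illustration. The assumption is simply that the monitoring side declares what deserves attention and the observability side measures and keeps only that.

```python
from dataclasses import dataclass, field


@dataclass
class Monitor:
    """Hypothetical monitoring policy: names the signals currently worth attention."""
    watched: set[str] = field(default_factory=set)

    def attend(self, signal: str) -> None:
        self.watched.add(signal)

    def ignore(self, signal: str) -> None:
        self.watched.discard(signal)


@dataclass
class Collector:
    """Hypothetical observability layer: keeps only what the monitor asks for."""
    monitor: Monitor
    collected: list[tuple[str, float]] = field(default_factory=list)
    dropped: int = 0

    def measure(self, signal: str, value: float) -> None:
        # Collection is steered: anything outside the monitor's attention is dropped.
        if signal in self.monitor.watched:
            self.collected.append((signal, value))
        else:
            self.dropped += 1


monitor = Monitor()
monitor.attend("checkout.latency_ms")            # illustrative signal name

collector = Collector(monitor)
collector.measure("checkout.latency_ms", 182.0)
collector.measure("debug.cache_key_dump", 1.0)   # never asked for, so not kept
monitor.attend("checkout.error_rate")            # attention shifts at runtime
collector.measure("checkout.error_rate", 0.02)
```

The point of the sketch is the direction of the dependency: the collector consults the monitor before keeping anything, rather than shipping everything to a remote endpoint and hoping a dashboard later finds a use for it.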