Scaling Observability for IT Ops

This article was originally posted in 2020 on the OpenSignals website, which is now defunct.

Scaling: Abstract • Aggregate • Compress

Scaling through abstraction, aggregation, and compression is critical to the effective and efficient service management of large-scale, highly connected systems. Scaling here does not mean merely storing vast amounts of observability data, much of it of questionable value. We have seen that play out in the application monitoring space, where the movement to cloud hosting brought little improvement in our day-to-day ability to monitor and manage. No, scaling here pertains to the operators’ (human or machine) ability to quickly observe and assess service quality across hundreds, if not thousands, of interconnected systems, services, and endpoints.

Tracing: Failing at Scaling

The underlying observability model is the primary reason distributed tracing, metrics, and event logging fail to deliver much-needed capabilities and benefits to systems engineering teams. There is no natural or inherent way to transform and scale the collection and analysis of such observability data into generated signals and inferred states.

Passing the Buck(et)

Current observability technologies and techniques make the fundamental mistake of assuming that someone, somewhere, will haplessly spend an enormous amount of time and money wading through vast amounts of quantitative data, and somehow miraculously churn out the many aggregations and rules that generate signals and alerts afterward. Of course, this assumes it is possible to combine and relate metrics, logs, or traces into something a classification function can evaluate. In practice, this is impractical, if not impossible, to build and maintain, even if it were feasible to aggregate diverse quantitative measurements across different systems, services, and endpoints.

Data-Driven vs. Model Managed

Vendor demonstrations of service-level management and alerting, often touted as smart or intelligent, work only with toy systems and applications because the underlying model is inadequate for qualitative performance analysis. A diagnostic-like model will not transform into a helpful service-level model, no matter how much data or engineering effort you wastefully throw at it. Stop. Rethink.

A Goal-Oriented Model

Humainary, however, has been designed with the end goal in mind: a model that scales by way of qualitative measures (signals), composition (naming), scoring (status), and scoping (context). There are no expensive intermediary steps in getting to a signal from a mixed bag of structured, semi-structured, and unstructured data. Humainary concerns itself only with the emission or receipt of signals.
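To make the model more concrete, here is a minimal sketch of how qualitative signals, named sources, status scoring, and scoped contexts could fit together. This is an illustration of the idea, not the actual Humainary API: the `Signal`, `Status`, and `Context` names, the two-signal vocabulary, and the failure-ratio scoring rule are all assumptions made for the example.

```python
from collections import defaultdict, deque
from enum import Enum


class Signal(Enum):
    """A qualitative measure emitted by a named source (illustrative vocabulary)."""
    SUCCEED = "succeed"
    FAIL = "fail"


class Status(Enum):
    """A score inferred from recent signals, not from raw quantitative data."""
    OK = "ok"
    DEVIATING = "deviating"
    DEFECTIVE = "defective"


class Context:
    """Scopes signal collection to one environment (e.g. a cluster or region).

    Signals are kept in a small sliding window per dotted name, so assessment
    stays cheap no matter how much raw telemetry the underlying systems produce.
    """

    def __init__(self, window: int = 10):
        self._signals: dict[str, deque] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def emit(self, name: str, signal: Signal) -> None:
        """Record a signal against a composed (dotted) source name."""
        self._signals[name].append(signal)

    def status(self, name: str) -> Status:
        """Score the source's recent signals into a status (hypothetical rule)."""
        recent = self._signals[name]
        if not recent:
            return Status.OK
        ratio = sum(1 for s in recent if s is Signal.FAIL) / len(recent)
        if ratio == 0:
            return Status.OK
        if ratio < 0.5:
            return Status.DEVIATING
        return Status.DEFECTIVE
```

The point of the sketch is the shape of the pipeline: operators (or machines) deal only in emitted signals and inferred statuses, with no intermediate stage of aggregating and classifying heterogeneous metrics, logs, or traces.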