This article was originally posted in 2020 on the OpenSignals website, which is now defunct.
Much of the initial motivation underlying the design of Humainary reflected a concern with the wanton proliferation of metric instrumentation and custom dashboards. Tools like Prometheus and Grafana have created, and continue to create, an enormous, ever-growing gap between data and information, collection and analysis, and perception and projection – many early adopters of such tools are no better off. We can all remember walking around large office spaces and being bewildered by the significant differences across the giant television monitors that delineated each team boundary as we passed from one squad to another.
It is clear that Grafana and Prometheus have broken the circular feedback link between doing and knowing.
Disorder: Dashboard Myopia
Problem identification and resolution skills seem irrelevant here; instead, the individuals most prized by an organization’s operational side are those who know which of the thousands of squad dashboards to navigate when issues escalate across service boundaries.
When incidents arise in production, the disarray is depressing to watch as engineers waste time interpreting a multitude of custom team dashboards. Inconsistency is everywhere: in metric naming, data conversion, time resolution, aggregation, representation, visualization, placement, and so on. For the most part, the situation itself is utterly absent from the minds of those involved. After a protracted period, most engineers jump back and forth between logs and source code lines – wandering down into the dungeons of data and logging detail.
Illusion of Control
After witnessing one incident after another, what was clear was how little was known and understood during these events beyond the cue that brought the situation to engineering’s attention. It was no wonder that not much changed between incidents; the level of awareness made each production incident look unique. The data values were distinct, the dashboards unique, and the rendered pixels different.
The tooling failed spectacularly in assisting engineering in assessing the situation. The tooling’s design and usage were at odds with human cognitive capacities, never playing to a human’s strength in visual and contextual pattern matching. The overly large television monitors paraded by squads were displays of vanity built on an illusion of control. On reflection, passing the buck from one unit or microservice to another results from differences in situation awareness levels across teams and the lack of a shared mental model.
Awakening: Situation Awareness
What is awareness of the current situation regarding systems of services? One widely cited definition states that situation awareness is the perception of the elements in the environment within a volume of time and space, the comprehension of their meaning, and the projection of their status in the near future (Endsley, 1987).
Perception, level 1, involves sensing and collecting data, such as status and dynamics related to an environment’s elements.
Comprehension, level 2, consists of integrating collected data transformed into information and, in turn, understanding. Comprehension is essential in assessing the significance of the elements and events and acquiring a big-picture perspective.
Projection, level 3, involves applying acquired knowledge and analytical capabilities to predict subsequent states and possible interventions (if applicable). The accuracy of projection depends, for the most part, on the accuracy and quality of the lower levels. Processing at every level reflects current goals.
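The three levels described above can be sketched in code. This is a minimal, illustrative model only – the service name, error-rate thresholds, and status labels are assumptions for the sketch, not part of any real tooling:

```python
# A sketch of Endsley's three levels of situation awareness applied to a
# single service. All names and thresholds are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Perception:
    """Level 1: raw sensed data about an element of the environment."""
    service: str
    error_rate: float  # fraction of failed calls in the last window


def comprehend(p: Perception) -> str:
    """Level 2: integrate data into meaning -- an operational status."""
    if p.error_rate < 0.01:
        return "STABLE"
    if p.error_rate < 0.10:
        return "DEGRADED"
    return "DEFECTIVE"


def project(history: list[str]) -> str:
    """Level 3: project the next likely status from the recent trend."""
    if history[-2:] == ["STABLE", "DEGRADED"]:
        return "DEFECTIVE"  # worsening trend, expect further decline
    return history[-1]      # otherwise expect no change


status = comprehend(Perception("checkout", 0.04))
print(status)                       # DEGRADED
print(project(["STABLE", status]))  # DEFECTIVE
```

Note how level 3 consumes the output of level 2, which in turn consumes level 1 – an error in perception propagates upward, which is why projection accuracy depends on the quality of the lower levels.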
Situation awareness is knowing what is happening within an environment and, more importantly, what is essential. We significantly enhance situational awareness by developing internal mental models of systems managed. Such models direct attention efficiently and offer a means to integrate information effectively while providing a future state projection mechanism.
Unfortunately, many of the solutions promoted in the Observability space, such as distributed tracing, metrics, and logging, have not offered a suitable mental model in any form whatsoever. The level of situation awareness is still sorely lacking in most teams, who appear to be permanently stalled at ground zero and overtly preoccupied with data and details.
Bringing Significance into Focus
It is now widely recognized that more data does not equate to more information. The problem with today’s operational support systems, such as application performance monitoring, is not data but finding what is needed and significant when required. What something means is paramount to awareness, subjective interpretation, and the construction of the situation. Engineering needs a suitable situational model.
Minds and Machines Need Models
Working memory is the human bottleneck in situational awareness, particularly in predicting future states – this is especially true for non-experts or in novel situations. Mental models can circumvent such limitations by generating descriptions and explanations of systems, especially the status of their elements.
A model acts as a schema for a plan, with a situation model representing the system model’s current state, much like a snapshot. A model provides a means of achieving a much higher level of situation awareness without overloading working memory.
Models should play to humans’ superior abilities in pattern matching: directing attention (a precious cognitive resource), noting critical cues, allowing for the projection of system states, and linking the current system state and its classification to an appropriate intervention.
A model’s selection reflects an operator’s goals, plans, and tasks like a template or class. It must be populated or instantiated, like an object, with data captured within the operating environment. Goals facilitate a top-down process of decision-making and planning.
In contrast, patterns and cues within an environment allow for bottom-up processing to change goals and plans and the system model employed. In Humainary, the signals fired by services and the status changes resulting from such firings aid situation awareness.
Top-down processing is underpinned by the setting and ongoing monitoring of service level objectives (goals), and by the scoring of signal and status change patterns at various attended collective scales (of aggregation and event propagation).
The presentation of information is a vital factor in how much information can be acquired accurately, effectively assessed and understood, and related to operational needs, goals, and plans. An optimal design seeks to convey as much information as possible without undue cognitive effort – attention and working memory must be carefully conserved. Much of what drives today’s observability dashboards is data collection-centric, with minimal consideration for orienting users to the current situation.
Information about the current goal, such as reliability, is rarely presented directly, being lost amongst the many metrics and charts haphazardly thrown together on limited screen real estate. Today, to determine a service’s operational status – an essential piece of information – an operator must combine multiple metrics, sometimes mistakenly called signals, within their internal model, at great expense and with frequent error.
A Lost Signal
Most dashboards are a poor proxy for an operational and situational model. Critical cues, such as signals and states, must be perceptually prominent. Unfortunately, while much of the product and marketing literature around site reliability engineering mentions signals and states, these are not to be found or accessible in a manner suitable for pattern matching along space and time dimensions.
Distributed tracing, metrics, and logging don’t lend themselves to the type of transformation, presentation, and communication needed here. These are measurements and data collection technologies, not situational models.
Such yesteryear observability instrumentation techniques cannot be operationalized, except when further diagnostics are required – and that should always be guided by assessment and awareness of the situation. The ever-growing problem of information overload in today’s application performance monitoring tooling needs to be tackled along the data pipeline, starting at the source with filtering and data reduction. Here, Humainary limits the value domain to a small set of signals and an even smaller set of status values.
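To make the constrained value domain concrete, here is a sketch of what such a vocabulary might look like. The exact signal and status names below are hypothetical – they illustrate the idea of a small, closed domain, not the project’s actual API:

```python
# A hypothetical constrained value domain: a small set of signals emitted
# by services, and an even smaller set of inferred status values. The
# names are assumptions for illustration only.

from enum import Enum


class Signal(Enum):  # emitted by services at the source
    START = 1
    SUCCEED = 2
    FAIL = 3
    RECOURSE = 4


class Status(Enum):  # a smaller set, inferred from signal patterns
    OK = 1
    DEGRADED = 2
    DOWN = 3


def infer(recent: list[Signal]) -> Status:
    """Reduce a window of signals to one status -- data reduction at the source."""
    fails = sum(1 for s in recent if s is Signal.FAIL)
    if fails == 0:
        return Status.OK
    return Status.DEGRADED if fails < len(recent) // 2 else Status.DOWN


print(infer([Signal.SUCCEED, Signal.SUCCEED, Signal.SUCCEED, Signal.FAIL]))
```

Because the domain is closed and tiny, a wall of such statuses is pattern-matchable at a glance – unlike thousands of free-form metric names.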
Projecting the Future
Semantics and dynamics must win over data and details; otherwise, there is little hope of scaling to increasing complexity and rates of change. Achieving level 3 situation awareness is difficult if the measurement models employed are not suited to projection into the near future. It is hard to imagine how a model consisting of the data and details collected by tracing and logging can project future events and states; there is minimal compatibility and alignment here. The model promoted by Humainary, by contrast, provides immediacy concerning significance: seeing and understanding the past and present, predicting possible future transitions, and offering assisted interventions.
Less Data, More Signs
The only goal that would seem to be addressed by distributed tracing, metrics, and logging is secondary at best – collecting data instead of detecting cues, signals, and salient changes. Until we abandon the view that all data is equal and that more of it is better, it is unlikely that operators will ever move to the next level. If we organize signals into meaningful patterns of the current situation, we stand a much better chance of projecting and predicting a system’s future events and states.
Control: The Origin of Observability
Observability originated in Control Theory, where analysis of system behavior rests on states (status) and goals (objective). Feedback control mechanisms are constructed and configured to reduce the difference between a goal (state or set of conditions) and the current state.
Feedback control responds to sensory stimuli: signals infer state changes, and interventions and actions are designed to correct divergence between the current state and the goal. Feed-forward control, on the other hand – a form of proactive, anticipatory behavior – uses an extended model to predict future states before executing actions. The goal is the state: the operational status of one or more systems of services.
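The feedback half of this can be sketched in a few lines. This is a toy proportional controller under assumed numbers (a success-rate state, a reliability goal, a gain of 0.5) – purely to show the loop that closes the gap between state and goal:

```python
# A minimal feedback-control sketch (illustrative, not a real controller):
# each iteration measures the divergence between the goal and the current
# state and applies a proportional correction to reduce it.

def feedback_step(current: float, goal: float, gain: float = 0.5) -> float:
    """One control iteration: move the state a fraction of the way to the goal."""
    error = goal - current          # divergence between goal and state
    return current + gain * error   # corrective intervention


state = 0.80   # e.g. current success rate
goal = 0.99    # the objective
for _ in range(5):
    state = feedback_step(state, goal)
print(round(state, 3))  # 0.984 -- the error halves each iteration
```

The point of the loop is the shrinking error term: each pass through doing (the correction) updates knowing (the measured state) – the circular feedback link between doing and knowing that dashboards, which only display, do not close.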
It is time to get back to the basics of awareness, attention, actioning, and adaptability within the operational management of systems of services: context, signals, services, (inferred or scored) status, scoring, sequencing, situation, and simulation.
Humainary is the beginning of a complete overhaul of reliability engineering.