Observability: Disruptions

Systems are Networks

This article views a distributed system, whether it consists of microservices or of data processing steps, stages, and pipelines, as a network of data transmission nodes. A transmission can be a microservice request invocation or the movement of data along a machine learning ingestion pipeline.

While we see a difference between the observability of software and the observability of data at the instrumentation level, this is not necessarily so when discussing service-level management and reliability engineering.

Disruptions

Let’s now focus on the factors affecting the maintenance of service quality levels, which we will lump under disruption. A disruption is an interruption in the flow of (work) items through a network that, for a while, either makes the network inoperable or leaves its flow performance subpar. Depending on the severity of the disruption, a network may need to replan and restructure itself for a period afterward.

There are two main categories of disruptions: disturbance and deviation.

Disturbances

A disturbance is an event (change) within the environment that significantly impairs the network’s ability to transfer items into and out of the network or from one node to another.

Disturbances are typically not under the control of the network. Examples include a surge in ingress flow or the failure of an infrastructural component, such as a power failure or a loss of egress connectivity.

Generally, it isn’t easy to forecast and plan for such events. Beyond continuous situation and risk assessment, the best course of action (recovery and mitigation) is to ensure that the network can adapt and realign swiftly and efficiently around the replanning of transportation movements, resource capacities, work schedules, etc. Here, all processing nodes within the network must have access to crucial situational information and share a mental model of the situation, in particular its inferences and projections.

The challenge with disturbances is their significant impact on the entire network, as work is reassigned following possibly complicated coordination over possibly multiple communication channels and control levels. Such changes are not without risk, as they are likely to induce further failures and further changes.

Most, if not all, of the observability solutions on the market focus exclusively on disturbances. This can be seen in the large volume of data they collect and in the diagnostic nature of its structure and content.

Deviations

A deviation is detected while actively monitoring the flow of items throughout the network and its nodes. A deviation involves a detected difference between an actual measurement value and an expected value at some particular point in time within an operational window. The compared value can be a network quality attribute such as latency or throughput for items moving from one node or stage to another or a value derived from a scheduled work plan.
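The comparison described above can be sketched in a few lines. This is only a minimal illustration, not a prescribed method: the `Deviation` record, the `detect_deviation` function, the relative-error measure, and the 10% default tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deviation:
    """Hypothetical record of a detected difference at one node."""
    node: str
    attribute: str
    expected: float
    actual: float

    @property
    def relative_error(self) -> float:
        # Signed divergence as a fraction of the expected value.
        return (self.actual - self.expected) / self.expected


def detect_deviation(node: str, attribute: str, expected: float,
                     actual: float, tolerance: float = 0.10) -> Optional[Deviation]:
    """Return a Deviation when the actual value differs from the expected
    value by more than `tolerance` (an assumed relative threshold), else None."""
    if expected == 0:
        raise ValueError("expected must be non-zero for a relative comparison")
    candidate = Deviation(node, attribute, expected, actual)
    if abs(candidate.relative_error) > tolerance:
        return candidate
    return None


# A stage expected to take 200 ms that actually takes 260 ms is flagged;
# one that takes 210 ms stays within the assumed tolerance.
slow = detect_deviation("ingest", "latency_ms", expected=200.0, actual=260.0)
fine = detect_deviation("ingest", "latency_ms", expected=200.0, actual=210.0)
```

The expected value could equally come from a scheduled work plan rather than a quality target; only the comparison matters here.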

The need for replanning and the scope of change will reflect the degree of divergence from expectations and the sensitivity of the network and its nodes, which can be policy- or goal-driven.

An objective of interventions for detected deviations should be to steer the network back on course with the minimum agitation to ongoing plans and schedules. Here, credible feedback must be immediately reflected in the assessment and awareness of the current situation to allow operators to reinforce learning and slowly escalate intervention levels.

It is also critical that deviations, once detected, can be projected into the future to understand the window of opportunity for less severe interventions and to decide more effectively on the course of action.
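One simple way to picture such a projection is to fit a straight line to recent measurements and estimate how long it takes to reach a limit. The sketch below assumes a linear trend and an upper limit (as with latency); the function name `time_to_threshold` and its return conventions are hypothetical, and a real system may need a far richer forecasting model.

```python
from typing import List, Optional, Tuple


def time_to_threshold(samples: List[Tuple[float, float]],
                      threshold: float) -> Optional[float]:
    """Estimate the time remaining until a rising value reaches `threshold`.

    `samples` is a list of (time, value) pairs in time order. A least-squares
    line is fitted and extrapolated from the last sample time. Returns 0.0 if
    the fitted value is already at or above the threshold, and None if the
    trend is flat or falling (no crossing is projected).
    """
    if len(samples) < 2:
        raise ValueError("at least two samples are required")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t

    last_t = samples[-1][0]
    fitted_now = intercept + slope * last_t
    if fitted_now >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - fitted_now) / slope


# Latency rising by 10 ms per minute from 100 ms, with an assumed limit of 150 ms.
window = time_to_threshold([(0, 100.0), (1, 110.0), (2, 120.0)], threshold=150.0)
```

The returned window is the time available for a milder intervention before the limit is projected to be breached.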

Network observability and controllability must go hand in hand to limit the scope and severity of further deviations and improve the effectiveness and efficiency in responding to events.

Doing so requires that the network, or its controller, has a form of memory that can associate deviations with the approved or scripted interventions that effectively resolved the problematic situations those deviations represent.
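As a rough illustration of such a memory, the sketch below keeps a tally of which interventions resolved which kinds of deviation and recommends the one with the best observed success rate. The class name `InterventionMemory`, the tuple-shaped deviation signature, and the intervention names are assumptions made for the example, not part of any particular product.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional


class InterventionMemory:
    """Hypothetical memory linking deviation signatures to intervention outcomes."""

    def __init__(self) -> None:
        # signature -> intervention name -> [successes, attempts]
        self._outcomes: Dict[Hashable, Dict[str, List[int]]] = defaultdict(
            lambda: defaultdict(lambda: [0, 0])
        )

    def record(self, signature: Hashable, intervention: str, resolved: bool) -> None:
        """Store the outcome of applying an intervention to a deviation."""
        stats = self._outcomes[signature][intervention]
        stats[1] += 1
        if resolved:
            stats[0] += 1

    def recommend(self, signature: Hashable) -> Optional[str]:
        """Return the intervention with the highest observed success rate,
        or None if this kind of deviation has not been seen before."""
        candidates = self._outcomes.get(signature)
        if not candidates:
            return None
        return max(candidates,
                   key=lambda name: candidates[name][0] / candidates[name][1])


memory = InterventionMemory()
high_latency = ("ingest", "latency_ms", "above_expected")  # assumed signature shape
memory.record(high_latency, "scale_out", resolved=True)
memory.record(high_latency, "scale_out", resolved=True)
memory.record(high_latency, "restart_node", resolved=False)
memory.record(high_latency, "restart_node", resolved=True)
suggestion = memory.recommend(high_latency)
```

A success rate is the simplest possible ranking; it ignores recency, cost, and side effects, all of which a real controller would likely weigh.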

It is not good enough to bring back order (or stability) to network planning and quality; knowledge and learning must be acquired. While a deviation might appear more local and initially have less impact on the network, it can rapidly spread like a contagion throughout the entire network, especially when nodes and plans are highly dependent, interconnected, and time-sensitive.

Finally, a challenge is distinguishing the source (origin) of a deviation from the knock-on effects of that deviation. This is further complicated because disturbances generate deviations and deviations can lead to disturbances, as when a slowdown in processing causes queuing and then surging.
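The slowdown-then-surge pattern can be made concrete with a toy single-node queue. The model below is a deliberately simplified assumption (constant arrivals, one service rate per tick, hypothetical function name `simulate_node`) and is meant only to show how a local deviation in processing speed becomes a surge for the node downstream.

```python
from typing import List, Tuple


def simulate_node(arrivals_per_tick: int,
                  service_rates: List[int]) -> List[Tuple[int, int]]:
    """Return (queue_depth, items_released) for each tick.

    Each tick, `arrivals_per_tick` items arrive and the node releases at most
    that tick's service rate; whatever it cannot process stays queued.
    """
    depth = 0
    history = []
    for rate in service_rates:
        available = depth + arrivals_per_tick
        released = min(available, rate)
        depth = available - released
        history.append((depth, released))
    return history


# The node normally keeps pace with 10 arrivals per tick. It slows to 6 for
# three ticks (the deviation), recovers to 10, then runs at 14 to catch up.
history = simulate_node(10, [10, 10, 6, 6, 6, 10, 14])
depths = [depth for depth, _ in history]
released = [count for _, count in history]
```

In this run the backlog persists after the node recovers, and the catch-up tick releases more items than the downstream node normally receives, which that node would experience as an ingress surge.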

It can be tricky to separate so many possible strands of causality in a network, which on paper might look like a chain of transmission and transformation processes and stages but is in reality a web of nodes, links, and feedback loops.