The Reign of Data
The observability industry is dominated by solutions built from a bottom-up perspective: collect as much data as possible and then figure out what can be done with it. This can be seen in the marketing messaging on vendors’ websites. Here are some current examples pulled from vendors:
- Quickly sift through billions of events to see your application’s hidden inner workings – Honeycomb
- Data for engineers to monitor, debug, and improve their entire stack – New Relic
- Aggregate all your observability data in one solution – Elastic
- Intelligent observability turns data into answers – Dynatrace
- Identify root causes quickly by correlating traces with logs and infrastructure metrics – Datadog
- Delivers continuous high-fidelity data at 1-second granularity and end-to-end traces – IBM
Data, data everywhere, and not a situation in sight.
The Era of Situations
It is time for a new direction, one more closely aligned with goals and focused on the dynamics of systems. Humans are already highly adapted to such dynamics through their social intelligence, in which the situation is a crucial conceptual element of the cognitive model. Understanding and appropriately responding to different social situations is fundamental to social cognition and effective interpersonal interactions.
Why would this be any different for sociotechnical systems, where machines are tools of human agency or agents themselves? Sociotechnical systems are dynamic and constantly evolving, so operating them requires recognizing how changes on the human side and on the technological side affect each other and the overall system behavior. This is effectively situation awareness.
Situation awareness is the perception and comprehension of the current state of a dynamic environment, including the recognition of relevant events, their context, and their potential implications. It involves predicting the future states of a system and its components, helping to anticipate potential issues, failures, or opportunities, and allowing proactive decision-making and problem-solving.
The scope of the previous paragraph is broad and could apply to many domains and industries. Yet, it seems far more relatable to the goals of site reliability engineering than the above vendor messaging.
Networks and Supply Chains
At an abstract level, managing a network of systems and services and managing a supply chain share remarkable similarities in their operational paradigms. Both involve integrating disparate entities seamlessly to enhance efficiency, scalability, and resilience. Networked systems typically rely on virtualized resources and distributed architectures to provide on-demand services, analogous to how supply chain management orchestrates the flow of goods, information, and processes across partners.
Drawing parallels between the two domains, we see common goals: optimizing resource utilization, minimizing latency, and adapting to dynamic demands, all aided by shared situation awareness.
How is this situation awareness achieved in supply chain management? The standard answer is a control tower – a platform or system that provides real-time visibility, coordination, and control over various aspects of a complex supply chain network. It serves as a command center for monitoring, analyzing, and managing the flow of goods, information, and processes across the entire supply chain ecosystem. For observability, we can substitute events, requests, responses, and payloads for goods.
A control tower provides a holistic situational view of a complex adaptive system that is a network.
Setting the Scene
Before listing the generic aspects of a control tower that serve the needs of site reliability engineering, we will elaborate on some of the terminology used in the following functional descriptions.
Event: An event refers to a specific occurrence or happening that takes place within a given environment. Events are the actions, incidents, or changes that individuals need to perceive and understand to be aware of what is happening around them. Events are the building blocks of situation awareness, as they are the concrete elements individuals monitor and interpret.
Context: A context encompasses the broader information and factors surrounding an event, providing additional meaning and understanding. It includes details about the environment, circumstances, conditions, and relationships contributing to interpreting events. Contextual information helps individuals make sense of events by placing them within a larger framework.
Snapshot: A snapshot refers to a concise and comprehensive representation of the current state of a situation or environment. It captures the most relevant and critical information without overwhelming the observer with unnecessary details. A snapshot is understood within the context of the broader environment. It considers the relationships between elements and their implications for the situation. While concise, a snapshot includes all the essential elements and factors for understanding the situation and making informed decisions.
Scene: A scene describes a snapshot of the environment and its elements at a particular moment. Elements can be static and dynamic, as well as abstract and concrete. A scene frames a control tower operator’s view of the environment. Within a scene, situations unfold from past to predicted. A scene is the primary communication interface for both human and machine operators. Many possible scenes can be constructed depending on the scope of responsibilities.
Situation: A situation refers to the overall state or condition of the environment, which emerges from the combination of events, context, and other relevant factors. It is a dynamic snapshot that captures the current status of the environment and the ongoing activities within it. A situation is an extended scene model focused on behavioral aspects, such as actions, events, and potentials, occurring across scene snapshots and related to objects of interest and concern. While a scene is far more structural and objective, a situation is behavioral and more subjective. Many situations can exist within a scene; signification, memory, and recognition extract situations from a scene. Situations can consist of smaller nested situations that narrow the scope of operator attention.
Setting: A setting provides the backdrop against which events and scenes unfold. It includes details about the situation’s physical, temporal, and environmental aspects. The setting helps individuals understand where and when events happen, contributing to a comprehensive understanding of the situation.
Subject: The subject is the individual or entity actively perceiving and interpreting events and the overall situation. The subject strives to achieve awareness by processing information about events, context, scenes, situations, settings, and other elements. The subject’s cognitive processes and capabilities are crucial in forming an accurate and timely understanding of the situation.
Scenario: A scenario describes the temporal trajectory across a sequence of snapshots reflecting the unfolding of one or more situations. A scenario details various events and actions that, directly or indirectly, alter the state of elements within a scene. A scenario is a storyline or storyboard that can be reconstructed from past experiences or projected trajectories. Depending on how a scenario is used, it can be sufficient to describe it purely in terms of situations. While a scenario can span multiple snapshots, and in this regard frame situations, the subject nearly always starts with a situation and projects forward in time along different scenario paths.
Script: A script is a stereotypical representation of a sequence of actions oriented towards steering a scene and its situation from one state to another to attain a goal or maintain an operational plan. Scripts are brought into play, that is, activated in terms of operational attention, by one or more recognized situations within a scene. A situation is a pre-condition for evaluating a script as a potential intervention. Before selection and execution, a script is simulated to evaluate its consequences with respect to the goal(s).
Awareness: The state of being conscious and cognizant of the events, context, and overall situation. It involves understanding the significance and implications of the information gathered from the environment, events, and other elements.
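To make the relationships between these terms more concrete, here is a minimal Python sketch of how events, snapshots, situations, and scripts might fit together. Everything in it is an illustrative assumption rather than a reference design: the class and function names, the status values "ok" and "degraded", the single toy recognition rule, and the hypothetical restart action are all invented for this example.

```python
from dataclasses import dataclass
from typing import Callable

Status = dict[str, str]  # element name -> status label (assumed representation)


@dataclass(frozen=True)
class Event:
    """A specific occurrence within the environment."""
    source: str
    kind: str
    at: int  # logical timestamp (hypothetical unit)


@dataclass
class Snapshot:
    """Concise representation of the environment's state at one moment."""
    at: int
    status: Status


@dataclass(frozen=True)
class Situation:
    """A behavioral reading of a snapshot, tied to an object of concern."""
    name: str
    concern: str


@dataclass
class Script:
    """A stereotypical sequence of actions with a situation as pre-condition."""
    name: str
    precondition: str  # name of the situation that activates this script
    actions: list[Callable[[Status], Status]]

    def simulate(self, snapshot: Snapshot) -> Snapshot:
        """Apply the actions to a copy of the snapshot; nothing is executed."""
        status = dict(snapshot.status)
        for action in self.actions:
            status = action(status)
        return Snapshot(at=snapshot.at + 1, status=status)


def apply_events(snapshot: Snapshot, events: list[Event]) -> Snapshot:
    """Fold events into a new snapshot; the event kind becomes the status."""
    status = dict(snapshot.status)
    latest = snapshot.at
    for event in sorted(events, key=lambda e: e.at):
        status[event.source] = event.kind
        latest = max(latest, event.at)
    return Snapshot(at=latest, status=status)


def recognize(snapshot: Snapshot) -> list[Situation]:
    """Toy recognizer: any element reporting 'degraded' is a situation."""
    return [Situation("degraded-service", name)
            for name, state in sorted(snapshot.status.items())
            if state == "degraded"]


def candidate_scripts(situations: list[Situation],
                      scripts: list[Script]) -> list[Script]:
    """Scripts are activated only by recognized situations."""
    names = {s.name for s in situations}
    return [s for s in scripts if s.precondition in names]


def restart(service: str) -> Callable[[Status], Status]:
    """Hypothetical action: model a restart as returning a service to 'ok'."""
    return lambda status: {**status, service: "ok"}


baseline = Snapshot(at=0, status={"checkout": "ok", "payments": "ok"})
current = apply_events(baseline, [Event("payments", "degraded", at=3)])
situations = recognize(current)
scripts = [Script("restart-payments", "degraded-service", [restart("payments")])]

for script in candidate_scripts(situations, scripts):
    projected = script.simulate(current)
    print(script.name, "resolves:", not recognize(projected))
```

The sketch keeps the distinction drawn above: the snapshot is structural, the situation is an interpretation extracted from it, and a script is only simulated, not executed, when a matching situation has been recognized.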
A Control Tower
With the above terms defined, it is time to present some of the functions and features of a control tower that apply to observability in general. These will get us closer to site reliability engineering and IT operations goals and, in doing so, allow us to divorce such system-steering processes from the noise of data structures such as traces, spans, logs, and metrics.
- Delivering informative visualizations that allow observation of past, present, and projected scenes
- Extracting and classifying situations within scenes for recurring scenarios
- Assisting in the decision-making process when problematic situations are recognized
- Maintaining near real-time situation awareness throughout all operational periods
- Communicating state changes across scenes and attributing causality to events and actions
- Projecting the current scene into the future and foreseeing possible arising situations
- Guiding in the selection of one or more scripts for resolving problematic situations
- Recording the execution of scripts along with scene and situation changes for post-mortem analysis
- Replaying the history of scenes for a particular period with enhanced situation recognition routines
- Reconstructing scenes and situations simulated alongside scripted interventions
- Expanding the domain-specific knowledge of scenes, situations, scenarios, and scripts
- Reporting the effectiveness and efficiency in the recognition and resolution of problematic situations
- Facilitating information quality analysis of data employed in the (re)construction of scenes
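Two of the functions above, recording scene changes and replaying history with enhanced situation recognition routines, can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the SceneHistory class, the representation of a snapshot as a mapping from element name to a utilization level between 0 and 1, and the 0.7 and 0.9 thresholds are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

Snapshot = dict[str, float]                   # element -> utilization (assumed)
Recognizer = Callable[[Snapshot], list[str]]  # snapshot -> situation names


@dataclass
class SceneHistory:
    """Append-only record of (time, snapshot) pairs for one scene."""
    entries: list[tuple[int, Snapshot]] = field(default_factory=list)

    def record(self, at: int, snapshot: Snapshot) -> None:
        # Store a copy so later mutation of the caller's dict cannot alter history.
        self.entries.append((at, dict(snapshot)))

    def replay(self, start: int, end: int,
               recognizer: Recognizer) -> list[tuple[int, list[str]]]:
        """Re-run a recognizer over the recorded period [start, end]."""
        return [(at, recognizer(snap))
                for at, snap in self.entries
                if start <= at <= end]


def original(snap: Snapshot) -> list[str]:
    """Recognition routine in use when the history was recorded (hypothetical)."""
    return [f"saturated:{name}"
            for name, level in sorted(snap.items()) if level >= 0.9]


def enhanced(snap: Snapshot) -> list[str]:
    """Later routine that also recognizes an approaching saturation."""
    nearing = [f"nearing-saturation:{name}"
               for name, level in sorted(snap.items()) if 0.7 <= level < 0.9]
    return original(snap) + nearing


history = SceneHistory()
for at, level in enumerate([0.4, 0.75, 0.95]):
    history.record(at, {"queue": level})

before = history.replay(0, 2, original)
after = history.replay(0, 2, enhanced)
print(before)
print(after)
```

Replaying the same recorded scenes with the enhanced routine surfaces a situation at time 1 that the original routine missed, which is the kind of comparison a post-mortem analysis could draw on.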