Observability: A to Z

This article was first published in 2019.

While creating this glossary of terms, I faced several choices that required careful consideration. In such cases, I took a systematic approach, giving priority to terms with a wider scope and a more comprehensive reach. For example, when choosing a term for the letter M, I opted for Memory instead of Metric, because Memory is a fundamental concept that applies to all instruments, measurements, and models in the context of observability.

Attention, a limited cognitive resource, demands careful monitoring and management. It’s vital to treat it as a valuable service and to prioritize it over data. Attention naturally focuses on important signals and filters out distractions; we can’t focus on everything simultaneously. Intelligence transcends mere data accumulation or activity: it lies in actively and adaptively directing attention, taking appropriate action, and discerning what to focus on and what to disregard.

Boundaries are fundamental in defining and tracking the structure of complex systems. They establish discernible structures, hierarchies, nested enclosures, and scopes. Boundaries don’t merely define objects of permanence; they also demarcate events that can be composed independently. By creating hierarchical structures, boundaries make it easier to understand and monitor systems at various levels of granularity. In observability, boundaries delimit events, enabling us to isolate problems when they arise. By understanding the boundaries between components, we can swiftly pinpoint the source of errors or performance bottlenecks. Well-defined boundaries also enable better abstraction of complex systems, making it simpler to reason about the system as a whole while still being able to delve into specific components when necessary.
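
To make the idea concrete, here is a minimal sketch of nested boundaries giving an observation a hierarchical context, so a failure is attributed to a specific enclosure. The names (boundary, report) and the stack-based mechanism are hypothetical, not taken from any particular toolkit.

```python
# A minimal sketch of hierarchical boundaries: each enclosure pushes a name
# onto a path, so any event observed inside it carries its full context.
# All names here (boundary, report, _path) are hypothetical.
from contextlib import contextmanager

_path = []  # the stack of currently open boundaries


@contextmanager
def boundary(name):
    """Open a named enclosure; events observed inside inherit its path."""
    _path.append(name)
    try:
        yield
    except Exception as error:
        # The failure is attributed to the innermost boundary it escaped from.
        report(f"error in {'/'.join(_path)}: {error}")
        raise
    finally:
        _path.pop()


def report(message):
    print(message)


if __name__ == "__main__":
    with boundary("payments"):
        with boundary("card-authorization"):
            report(f"observed inside {'/'.join(_path)}")
```

The value is less in the mechanism than in the path: an observation surfaces already framed by the enclosures it crossed, which is the isolation described above.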

Complexity is a pervasive feature of modern systems, comprising numerous interconnected components that interact across various spatial and temporal scales. The inherent uncertainty in complex systems poses significant challenges in comprehending and controlling events. These systems often exhibit emergent behaviors—patterns or issues arising from interactions rather than from any single component. Predicting or diagnosing these behaviors requires comprehensive observability practices. The goal is to transform this inherent uncertainty into actionable intelligence, enabling teams to maintain control and optimize performance in an ever-evolving technological landscape.

Dynamics of a system are often overlooked, with data and debugging taking precedence. However, it’s crucial to recognize the significance of observing and visualizing how system behaviors, states, and characteristics evolve under the influence of various environmental factors. While instruments provide data at specific points in time, observation and analysis should focus on the motion within a changing state space. Emphasize causes over effects rather than content or counts, and place context above code; this is key to understanding, planning, and predicting system behavior. Several additional concepts further enrich our understanding of system dynamics in the context of observability. Both positive and negative feedback loops significantly influence system behavior over time, and identifying and understanding these loops is crucial for predicting system evolution. It’s equally important to recognize and analyze non-linear relationships between components: small changes in one area can lead to disproportionate effects elsewhere.
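
As a rough illustration of attending to motion rather than content, the sketch below reports the rate of change of a measurement instead of its raw value. The function name, sample data, and the threshold used to label a trend are illustrative assumptions.

```python
# A sketch of observing motion rather than content: instead of reporting the
# raw measurement, report its rate of change between observations.
# Names, data, and thresholds are illustrative only.

def rates(samples):
    """Yield (time, rate of change) pairs from (time, value) samples."""
    previous_time, previous_value = None, None
    for time, value in samples:
        if previous_time is not None:
            yield time, (value - previous_value) / (time - previous_time)
        previous_time, previous_value = time, value


if __name__ == "__main__":
    queue_depth = [(0, 10), (1, 12), (2, 30), (3, 75)]  # (seconds, items)
    for time, rate in rates(queue_depth):
        trend = "accelerating" if rate > 10 else "steady"
        print(f"t={time}s rate={rate:+.1f}/s ({trend})")
```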

Enterprise-wide scaling of observability goes beyond simply moving data from one machine to another and then querying and rendering it within a dashboard. It’s crucial to meticulously structure, group, link, and categorize data to enhance the relevance of derived information for specific teams within an organization. By ensuring data presentation aligns contextually with the task at hand, teams can avoid relying solely on users to navigate through the data, which can lead to inefficiencies, especially under stress and tight time constraints. Therefore, teams should be able to focus their initial investigations on the systems, services, groups, and hierarchies within their scope of responsibility before venturing beyond their self-defined boundaries of influence and control.
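
One way to read this, sketched below with hypothetical team and service names, is that telemetry should arrive already scoped: tagged with the owning team so an investigation starts inside that team’s boundary of responsibility before widening out.

```python
# A sketch of scoping telemetry by ownership: signals carry the owning team,
# and an investigation starts with that team's slice before widening out.
# The ownership map and the signal shape are illustrative assumptions.

OWNERSHIP = {
    "checkout": "team-payments",
    "catalog": "team-storefront",
    "billing": "team-payments",
}

signals = [
    {"service": "checkout", "signal": "DEGRADED"},
    {"service": "catalog", "signal": "OK"},
    {"service": "billing", "signal": "DEFECTIVE"},
]


def scope(team, signals):
    """Return only the signals within a team's boundary of responsibility."""
    return [s for s in signals if OWNERSHIP.get(s["service"]) == team]


if __name__ == "__main__":
    for entry in scope("team-payments", signals):
        print(entry["service"], entry["signal"])
```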

Fusion of data from diverse sensory sources is crucial for observability solutions that must effectively gather, transmit, and analyze information. The goal is to construct accurate and meaningful representations of objects within an environment. Just like the human brain, these solutions need to make informed decisions on how to integrate and differentiate structural and behavioral groupings from the data feeds. Moreover, they must address temporal inconsistencies by implementing various reconciliation techniques. These techniques bridge the measurement gap and mitigate uncertainties associated with state transitions across different domains and timeframes.
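
A crude sketch of temporal reconciliation across two feeds follows; the tolerance-based pairing, the feed names, and the sample data are assumptions made for illustration, not a prescription.

```python
# A sketch of fusing two sensory feeds by reconciling their timestamps:
# readings are paired when they fall within a tolerance window, bridging
# the measurement gap between domains. Names and data are illustrative.

def fuse(feed_a, feed_b, tolerance=0.5):
    """Pair (time, value) readings from two feeds whose times are close."""
    fused = []
    for time_a, value_a in feed_a:
        # Find the reading in feed_b closest in time to this one.
        time_b, value_b = min(feed_b, key=lambda r: abs(r[0] - time_a))
        if abs(time_b - time_a) <= tolerance:
            fused.append((time_a, value_a, value_b))
    return fused


if __name__ == "__main__":
    host_cpu = [(0.0, 0.62), (1.0, 0.71), (2.0, 0.93)]       # (seconds, load)
    service_latency = [(0.2, 120), (1.1, 140), (2.3, 480)]   # (seconds, ms)
    for time, cpu, latency in fuse(host_cpu, service_latency):
        print(f"t={time:.1f}s cpu={cpu:.2f} latency={latency}ms")
```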

Guidance, both intelligent and assistive, has become more critical than ever before in today’s fast-paced and complex environment. The success of an observability initiative greatly depends on its operational effectiveness. Simply leaving users to sift through a massive volume of logs, events, traces, and metrics is likely to fail. To address this challenge, it’s essential to provide users with tools that can intelligently and contextually direct their attention towards significant events. These tools should streamline the process of shifting focus across different stack layers, systems, and services. Moreover, any interaction with information should be task and situation-specific, ensuring that it reflects and optimizes workflows.

Health is a state, not a signal. To effectively understand the status of a system, service, component, or endpoint, it’s crucial to rely on a well-defined and limited set of higher-order signals. These signals allow us to easily infer the system’s state, which is essential for managing complexity and guiding change. It’s important to assess the overall health of the system at various levels of aggregation, framing, and contextual bounding. This requires the ability to effortlessly analyze historical data, align it with the present, and make projections for the near future. This level of insight can’t be achieved by simply reviewing traces, metrics, logs, and events. Emphasizing quality over quantity and prioritizing qualitative analysis over quantitative metrics is key to achieving these goals.
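
A toy sketch of inferring a qualitative state from a small, well-defined set of higher-order signals follows; the signal vocabulary and the inference rules are hypothetical, chosen only to show the folding of signals into a state.

```python
# A sketch of health as an inferred state rather than a signal: a small,
# well-defined vocabulary of signals is folded into a qualitative status.
# The vocabulary and inference rules are illustrative assumptions.
from collections import Counter

SIGNALS = ("SUCCEED", "SLOW", "FAIL")


def infer_health(signals):
    """Fold recent signals into a qualitative state for the component."""
    counts = Counter(signals)
    total = sum(counts.values()) or 1
    if counts["FAIL"] / total > 0.2:
        return "DEFECTIVE"
    if (counts["FAIL"] + counts["SLOW"]) / total > 0.3:
        return "DEGRADED"
    return "STABLE"


if __name__ == "__main__":
    recent = ["SUCCEED"] * 14 + ["SLOW"] * 5 + ["FAIL"] * 2
    print(infer_health(recent))  # -> DEGRADED
```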

Instruments, both software and sometimes hardware, are at the core of today’s observability. They serve to measure, collect, and record various events, including actions, outcomes, and state changes, along with their accompanying properties and contextual information. Instrumentation, on the other hand, involves the manual or dynamic insertion of these instruments into the code flow and communication pathways across different threads, processes, and machines. By responding to event callbacks, much like a wire being tripped, instruments provide valuable insights into the inner workings of a computational process. However, many of the instruments in use today originate from a simpler era of limited complexity and infrequent change, and they struggle to address the challenges faced by developers and operational staff in modern environments. Today’s instrumentation toolkits also restrict developers: the type of instrument and the model for data collection and transmission must be hardcoded, which prevents dynamically adjusting and modifying workflow introspection and state capture. Additionally, there’s often little consideration of the value and significance of the measurements taken; instruments are frequently added without proper evaluation of their costs and benefits. Instruments need to evolve and overcome these limitations to improve their effectiveness.
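
As a minimal illustration of the instrument/instrumentation distinction, here is a sketch of a software instrument inserted into the code flow as a decorator. The event shape and the emit() sink are hypothetical stand-ins for whatever a real toolkit would provide.

```python
# A sketch of a software instrument inserted into the code flow: it fires on
# entry and exit of a call, recording outcome, duration, and context.
# The event shape and the emit() sink are hypothetical stand-ins.
import functools
import time


def emit(event):
    print(event)  # a real instrument would hand this to a collector


def instrument(func):
    """Wrap a function so each call is observed as an event."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            outcome = "SUCCEED"
            return result
        except Exception:
            outcome = "FAIL"
            raise
        finally:
            emit({
                "name": func.__name__,
                "outcome": outcome,
                "duration_ms": (time.perf_counter() - started) * 1000,
            })
    return wrapper


@instrument
def reserve_stock(item, quantity):
    return {"item": item, "reserved": quantity}


if __name__ == "__main__":
    reserve_stock("sku-42", 3)
```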

Just-in-time delivery and analysis of behavioral signals and qualitative states are essential for effective monitoring and management of systems and services. This allows operators, whether human or machine, to make informed decisions and take appropriate actions when necessary. The significance of timeliness will increase as we move from passive data observation to more automated and adaptive responses. The future of observability lies in controllability, although when it’ll fully arrive in the industry is still uncertain.

Knowledge, a valuable resource constantly evolving, isn’t static but created, consumed, updated, and discarded as needed. It relies on knowers, humans or machines, and serves as a bridge between them and the truth. In observability, knowledge is the acceptance of sensory data transformed into valuable information. This acceptance is based on reliable observations or inferred states of affairs. To effectively monitor and manage systems involving multiple parties, it’s crucial to explain observations, operations, outcomes, and the current situation based on prior knowledge and learning. This explanation aids in understanding perceptions, patterns, and predictions using factual information. Framing this factual information within a comprehensive contextual framework that considers spatial and temporal boundaries deepens our understanding and analysis of the data and its implications. Knowledge acquisition involves both passive acceptance of facts through feeds and active construction and generation through experimentation. However, it’s important not to confuse communicated observations or testimony with knowledge or truth. In our increasingly complex and rapidly changing world, contextualizing knowledge becomes a challenge. Many propositions in knowledge depend on specific circumstances and shouldn’t be treated as universally true.

Learning, a fundamental objective of observability, involves acquiring information through experience or communication, leading to behavioral changes or enhanced subject understanding. Humans and machines must simultaneously learn from the phenomena they sense, transmit, and record. Learning shouldn’t be limited to simple data querying; it should encompass insightful information acquisition and behavioral changes. The learning process goes beyond data collection; it requires humans and machines to develop an understanding of how they perceive and reason about what they observe. Learning is a continuous feedback loop between humans and software machines, extending beyond system components to encompass broader operational goals and constraints. By collaborating, humans and machines can complement and enhance each other’s strengths.

Memory, a crucial aspect of both human and machine cognition, enables the recognition, identification, and detection of change. In the realm of observability tooling, memory plays a pivotal role in organizing, recalling, and displaying diverse data types. From low-level information like traces, metrics, events, and logs, to more significant data such as signals and states, memory facilitates efficient data searches and provides valuable insights for prioritization. While memory is often associated with retrospective analysis, modern application monitoring tools should adopt a prospective perspective, anticipating future events like system state transitions and behavioral changes that may occur after deploying new software releases. Both humans and machines require the ability to effortlessly navigate their cognition backward and forward along a timeline. Currently, memory primarily focuses on collecting sensory data. However, in the future, there’ll be a shift towards a memory that encompasses human-machine interactions, communications, and learning. This expanded memory capacity is essential for adapting, planning, and predicting future outcomes.
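
A small sketch of machine memory as a navigable timeline of state transitions rather than a pile of raw data follows; the bounded buffer, the lookup, and the idea of projecting the last known state forward are illustrative assumptions.

```python
# A sketch of memory as a navigable timeline: state transitions are retained
# in a bounded buffer so a tool can answer "what was the state at time t?"
# backward and, by carrying the last transition forward, prospectively.
# The structure and names are illustrative assumptions.
from collections import deque


class Timeline:
    def __init__(self, capacity=1000):
        self.transitions = deque(maxlen=capacity)  # (time, state) pairs

    def record(self, time, state):
        self.transitions.append((time, state))

    def state_at(self, time):
        """Return the last state recorded at or before the given time."""
        state = None
        for when, recorded in self.transitions:
            if when > time:
                break
            state = recorded
        return state


if __name__ == "__main__":
    memory = Timeline()
    for when, state in [(0, "STABLE"), (5, "DEGRADED"), (9, "STABLE")]:
        memory.record(when, state)
    print(memory.state_at(7))   # -> DEGRADED
    print(memory.state_at(20))  # -> STABLE (last known state, carried forward)
```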

Noise and a lack of curation based on relevance are significant issues in the collection of observability data, resulting in dashboards cluttered with unnecessary information. In the monitoring and management of systems, less is more. To be effective, it’s vital to focus on meaningful signals that accurately indicate or infer a change in behavior and state. Unfortunately, extracting these signals from a vast volume of unqualified and unclassified data is an extremely challenging task that often overwhelms engineering teams. Machine learning isn’t a magical solution that can simply remove the noise created by metrics, traces, and logs. Moreover, the increasing complexity and continuous evolution of systems amplify this challenge. Currently, most available tools are inadequate at differentiating between a change in measurement due to dynamics or environmental factors and an actual change in the underlying code or configuration of a component. What is required at every stage of an observability pipeline is intelligent contextual filtering, augmentation, and classification, in addition to (simple) sampling. Alongside this, higher-order measures need to be synthesized, predictive signals extracted, and states inferred. It’s important to note that the true essence lies in the change itself rather than the model or the data it is built upon.
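
A sketch of the kind of contextual filtering argued for here: a pipeline stage that forwards a measurement only when it represents a meaningful change, classifying it as it passes. The threshold, the baseline rule, and the labels are assumptions made for illustration.

```python
# A sketch of contextual filtering in an observability pipeline: a stage
# forwards a measurement only when it represents a meaningful change from
# the last forwarded value, and classifies the change as it passes.
# The threshold and labels are illustrative assumptions.

def filter_changes(measurements, relative_threshold=0.25):
    """Yield (value, classification) only for significant changes."""
    baseline = None
    for value in measurements:
        if baseline is None:
            baseline = value
            yield value, "BASELINE"
            continue
        change = (value - baseline) / baseline if baseline else float("inf")
        if abs(change) >= relative_threshold:
            yield value, "RISE" if change > 0 else "DROP"
            baseline = value
        # otherwise: noise relative to the current baseline, not forwarded


if __name__ == "__main__":
    latencies = [100, 104, 98, 101, 150, 148, 95]
    for value, label in filter_changes(latencies):
        print(value, label)  # 100 BASELINE, 150 RISE, 95 DROP
```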

Organization is paramount to the success of an observability initiative. It’s crucial to properly coordinate and structure not only the data from various systems and services but also the communication and control flow across different teams within the organization. Organization plays a key role in enhancing operational efficiency and simplifying tasks, while also aligning with existing service and social structures, processes, functions, forms, and groupings. Additionally, observability solutions should be designed to organize displays and interactions based on common patterns of perception and processing. These patterns include temporal, sequential, spatial, similarity, cause-and-effect, depth, topics, tags, and markers. By organizing information in this manner, organizations can shape their operations more effectively and efficiently.

Perspective is a fundamental tool for understanding and analyzing events or situations from various viewpoints. It’s of great importance for organizations to cultivate the ability to observe systems from diverse perspectives to develop critical thinking skills. This approach proves to be more effective and productive than solely relying on high-dimensional data or deep systems exploration. By adopting multiple perspectives, conflict resolution is enhanced, and problem-solving capabilities are significantly improved, particularly in complex and uncertain systems. This method relies on the integration of different sensory data from various sources, rather than simply scanning through datasets. Therefore, organizations should prioritize the practice of switching between perspectives rather than solely relying on data analysis.

Quality data and information are necessary for effective observability. Organizations place importance on various aspects of data quality, such as relevance, accuracy, timeliness, coherence, consistency, comparability, completeness, comprehension, periodicity, and representativeness. Information quality includes additional factors like context, generalizations, integration, resolution, structure, temporal relevance, operational considerations, and effective communication. While data quality deals with intrinsic aspects of measurement and collection, information quality focuses on the derived value that helps achieve specific goals like situational awareness, understanding, causality analysis, control, or effective representation. Ultimately, quality in this context emphasizes intellectual rigor.

Resolution, primarily referring to data rather than information, is a concept often misunderstood. Unfortunately, a more detailed measurement scale frequently leads to increased noise. In well-established organizations and teams, the focus shifts from raw data payloads and events to the ability to adjust the resolution of various perspectives, representations, aggregations, and dimensions of space and time. To advance as an engineering discipline, the priority should be on how to automatically and adaptively scale the resolution of information based on the specific task and the diverse operational and environmental contexts. Currently, most measurements are relatively coarse-grained and ordinary. In the future, there’ll be two approaches in tooling: one will offer accurate playback of software execution, while the other will use signaling theory and social cognition to provide high-level contextual reasoning about systems, services, and segments in the service supply chain. Tracing, metrics, and logging will only occasionally be used as a middle ground between service management operations and deep diagnostics modes of awareness, acknowledgment, acquisition, and action.
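
A rough sketch of adapting resolution to context: coarse aggregation while a component is stable, much finer buckets when its inferred state turns suspect. The bucket sizes and the state input are assumptions chosen only to show the idea.

```python
# A sketch of adaptive resolution: samples are aggregated into coarse
# buckets while the component is stable, and into much finer buckets when
# its inferred state turns suspect. Bucket sizes are illustrative.
from collections import defaultdict
from statistics import mean


def aggregate(samples, state):
    """Group (time, value) samples into buckets sized by the current state."""
    bucket_seconds = 60 if state == "STABLE" else 5
    buckets = defaultdict(list)
    for time, value in samples:
        buckets[int(time // bucket_seconds) * bucket_seconds].append(value)
    return {start: mean(values) for start, values in sorted(buckets.items())}


if __name__ == "__main__":
    samples = [(t, 100 + t % 7) for t in range(0, 120, 5)]
    print("stable :", aggregate(samples, "STABLE"))    # two coarse buckets
    print("suspect:", aggregate(samples, "DEGRADED"))  # many fine buckets
```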

Signal usage forms the foundation of purposive communication. Signals are the visible characteristics of a participant (service or agent) in communication that are presented to maximize the likelihood that the receiver will ascribe a specific state of affairs to the producer, environment, or situation. Unlike events or messages, signals immediately convey meaning without any loss or cost associated with translation. A signal provides information through its observability, which is the classification of a phenomenon related to the operation or outcome of a specific execution or call. Signals are the primary source of information; understanding derives from their sequential arrangement and from the state inferred or learned over time. Signals serve to synthesize information and prompt action. In both humans and animals, signals have evolved to modify the behavior of others by relying on a shared, albeit not necessarily perfect, understanding of their meaning. Systems of services need to adopt a similar approach.
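
To ground the distinction between events and signals, here is a sketch in which an operation’s raw outcome is classified into a small shared vocabulary that a receiver can act on without translation; a state-inference routine like the one sketched under Health could then fold these into a status. The vocabulary and the deadline threshold are hypothetical.

```python
# A sketch of signalling: the raw outcome of a call is classified into a
# small shared vocabulary that receivers can act on without translation.
# The vocabulary and thresholds are illustrative assumptions.

def signal(outcome, duration_ms, deadline_ms=250):
    """Classify one execution outcome into a signal."""
    if outcome == "error":
        return "FAIL"
    if duration_ms > deadline_ms:
        return "SLOW"
    return "SUCCEED"


if __name__ == "__main__":
    calls = [("ok", 120), ("ok", 480), ("error", 30), ("ok", 90)]
    emitted = [signal(outcome, duration) for outcome, duration in calls]
    print(emitted)  # ['SUCCEED', 'SLOW', 'FAIL', 'SUCCEED']
```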

Topology plays a vital role in understanding distributed systems and modern architectures such as microservices; maps of the topology form the basis for this comprehension. These maps focus on the shape, connection, and relative position of various elements within the system. Topology, which is the study of places, provides a spatial and temporal context for entities or enclosures with some permanence. However, the definition of permanence is constantly evolving due to the rapid changes happening nowadays. In the context of network topology, which encompasses various networks in their broadest sense, the importance lies in understanding what is connected to what rather than the physical distance between them. Topology isn’t primarily concerned with distance, and the same principle applies to engineering in most cases. Instead, the emphasis is on the degree of connection within networks, which are prevalent in today’s software systems. These networks span from the physical network layer to the call graph of a service workflow, encompassing multiple service nodes of execution. By reducing complex systems to an underlying architecture of nodes and links, topologies simplify the interpretation of high-level structural changes. This simplification enables easier analysis and comprehension of the system as a whole.
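
The reduction to nodes and links can be sketched directly with an adjacency list (the service names are hypothetical); with that in hand, questions such as “what depends, directly or indirectly, on this node?” become simple graph walks.

```python
# A sketch of topology as nodes and links: services form an adjacency list,
# and structural questions become graph traversals. Names are illustrative.

# caller -> services it calls (the links of the call graph)
TOPOLOGY = {
    "web": ["checkout", "catalog"],
    "checkout": ["billing", "inventory"],
    "catalog": ["inventory"],
    "billing": [],
    "inventory": [],
}


def dependents_of(target):
    """Return every node connected to the target, directly or indirectly."""
    affected = set()
    changed = True
    while changed:
        changed = False
        for caller, callees in TOPOLOGY.items():
            if caller not in affected and (
                target in callees or affected & set(callees)
            ):
                affected.add(caller)
                changed = True
    return affected


if __name__ == "__main__":
    print(dependents_of("inventory"))  # {'checkout', 'catalog', 'web'}
```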

Usability becomes increasingly critical as machine learning and predictive technologies continue to advance and evolve. The key to achieving this is through efficient and effective communication and learning. An integral aspect of usability lies in the ease of using and learning from tooling. Simply adding an intuitive interface on top of data sources, streams, and sinks isn’t enough to achieve usability. While it may be satisfying to create visually appealing interfaces, this satisfaction is often short-lived and more prominent in the early stages of adoption and use. To truly enhance enjoyment, growth, and productivity, it’s essential to introduce advanced models of system and process introspection, interfacing, and intervention. These models should possess immeasurable capabilities in communication and cognition while being adaptable to different contexts and adjusting to factors such as confidence and control.

Visualization of the physical world, as well as virtual and simulated environments, is something humans rely on heavily for perception. Our mind processes sensory data and translates it into representations and symbols within our consciousness. These mental models often take the form of conceptual imagery and visual metaphors. To effectively convey information, particularly the significance of past events or ongoing situations, an observability solution must leverage this human strength. The primary aim of information visualization is to facilitate a comprehensive understanding of even the most complex systems and services, within the context of change and control. This enables us to visually perceive patterns of behavior and states across vast temporal and spatial domains. The importance of good design in communication, understanding, and learning is most noticeable in the realm of visualization. However, current observability tools still rely on basic data presentation through custom dashboards. Moving forward, future tooling must shift focus towards capturing knowledge and wisdom about the motion and dynamics of systems and services, which are constantly evolving due to rapid changes. Rather than simply inspecting or pivoting data, we should aim to observe motion and the information it conveys about processes at different scales, adjusting our frames of context accordingly. To ensure continued progress in the future, we must move away from an excessive preoccupation with data and its distribution and instead prioritize essential signals, states, situations, structures, and systems.

Workflow complexity in organizations cannot be adequately captured or modeled using tracing methods, whether distributed or local. Workflows consist of structured and repeatable patterns of activity that are orchestrated and managed to create value. The nature of work has evolved significantly over the past few years, yet we still rely on outdated techniques of observation and measurement to understand the intent, objectives, and context of business processes from technical, low-level data flows. Even at the technical level, our current methods of runtime introspection lack an understanding of resilience mechanisms that significantly impact service quality assessment. The traditional mindset of sequential request-response closures no longer aligns with the reality of modern work, which involves multiple loosely coupled systems of services operating across time and space. It’s now evident that tracing, as currently defined, fails to accurately represent and comprehend the complex workflow processes that utilize diverse technical and social resources. The concept of time is just one aspect of the intricate web of interconnected systems, processes, and activities that require constant monitoring and control. Instead of focusing solely on time spans, those developing software for instrumentation should consider schedules, stages, states, signals, and synchronization as crucial components.
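
One way to picture the alternative to time spans is sketched below: work modeled as stages with states, advanced by signals rather than bounded by a single request-response closure. The stage names, the signal vocabulary, and the transition rules are hypothetical.

```python
# A sketch of modeling work as staged state rather than time spans: a
# workflow advances through stages in response to signals, and its state
# can be observed at any moment. Stage names and rules are illustrative.

STAGES = ["received", "validated", "fulfilled", "settled"]


class Workflow:
    def __init__(self, name):
        self.name = name
        self.stage = STAGES[0]
        self.state = "IN_PROGRESS"

    def on_signal(self, signal):
        """Advance or halt the workflow in response to a signal."""
        if signal == "FAIL":
            self.state = "STALLED"
        elif signal == "SUCCEED":
            index = STAGES.index(self.stage)
            if index + 1 < len(STAGES):
                self.stage = STAGES[index + 1]
            else:
                self.state = "COMPLETED"


if __name__ == "__main__":
    order = Workflow("order-1001")
    for signal in ["SUCCEED", "SUCCEED", "FAIL"]:
        order.on_signal(signal)
    print(order.stage, order.state)  # fulfilled STALLED
```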

Xeroxed replicas of past software states form the cornerstone of observability’s ultimate goal: meticulously reconstructing an episodic machine memory that captures the intricate history of system dynamics and execution, setting aside concerns of controllability. While data replication across machines is now a common practice, the replication of execution behavior remains a challenge. Although remote procedure call (RPC) middleware enables performance transfer across process and machine boundaries, these calls merely delegate services, rather than replicate software behavior at a fine granularity. In the future, observability will capture and reproduce the observable behavior and state changes of systems and services, resembling a comprehensive replica of the machine universe within a single observable space. This space will enable the safe examination of work processes, flow, resource consumption, costs, and states. Eventually, the simulation of memories from the past, present, and future will become a reality.

Yogi-like mindset and adherence to personal guidelines such as cleanliness, contentment, discipline, introspection, and surrender must be adopted to ensure effective and efficient observability. It’s essential to clean and filter data, ensuring that sensors and measurements are free from impurities and noise. Additionally, the organization and design of information must prioritize hygiene, creating a tidy and minimal surface area for decision-making and facilitating the embrace of change. Contentment plays an important role in observability, fostering a deeper appreciation and understanding of the system as a whole. Recognizing that the system is undergoing constant change, it’s important to steer it in a positive direction, acknowledging its flaws and imperfections. Simply increasing the volume of data doesn’t yield better insights—invariably, more is less when it comes to observability. Therefore, discipline is necessary for instrumenting and collecting data, preventing overload and operational impairment of the system and its agents, whether human or machine. Observability not only enables understanding but also guides action. This is achieved through continuous reflection on past events, actions taken, and their consequences. By engaging in introspection, we develop greater awareness and attention, allowing for a more holistic view of the system that goes beyond mere machines and mechanics. Finally, we must surrender to the complexity of the system in part, recognizing that attempting to exercise complete control is an illusion that doesn’t serve us well in the face of ever-changing complex systems.

Zenith refers to an abstract point directly above a specific location, situated in the vertical direction opposite to the gravitational force at that spot. The gravitational force represents a significant volume of constantly increasing measurement data. However, it’s common for observability teams to become overly fixated on the production and consumption of low-level data, neglecting the bigger picture of service and operation. Instead of solely focusing on the creation and maintenance of detailed “databoards,” teams should prioritize situational screens and predictive projections. Teams must maintain some level of detachment from the various types and sources of data, recognizing the transient nature of code, coupling, and containment. Organizations must never lose sight of the strategic goals in monitoring and management: to facilitate swift adaptation and agility while ensuring satisfactory levels of quality, reliability, stability, and predictability. It’s time to break away from the constant pursuit of data.