This article was originally posted in 2020 on the OpenSignals website, which is now defunct.
Much of what is published in the observability space these days claims, without much in the way of independent cost-benefit analysis, that to monitor highly interconnected systems effectively one must trace and correlate every request across every hop (span) in the network. Deep and extensive data collection is actively encouraged. One could argue this reflects the lack of an appropriate model of perception rather than the utility of the data, or the ability of any solution to transform such collections into information and, in turn, insight.
Science and technology have made it possible to observe the motion of atoms, yet humans do not actively watch atomic motion while navigating physical spaces. Our perception, attention, and cognition have evolved into a model that scales effectively for us in most situations. Distributed tracing spans, and the data items attached to them, are the atoms of observability.
After many years of designing, developing, and deploying distributed tracing systems and solutions, I believe the reason tracing gets far more attention than it rightfully deserves, given the engineering effort involved and the conceptual complexity introduced for both customer and vendor, is that (local) code call tracing can in itself be helpful: most developers can understand a profile consisting of a tree of nodes representing chains of method calls. But with increasing decoupling in time and space, extending this approach across a distributed system is questionable unless one is utterly blind to everything else; and that may well be the crux of a problem that has been engineered without much critical reflection on the effectiveness and efficiency of deploying distributed tracing.
The emphasis on distributed tracing spans, attributes, paths, and the data payloads they capture is misguided. It contradicts the very encapsulation engineers introduce into their system, service, and library designs. Again, local tracing can be helpful in some cases, but distributed tracing is even less so. Distribution should not be the default case; code should not be separated by process boundaries merely so that engineering can obtain some degree of observability of a service. Don't deploy hundreds of microservices or introduce a service mesh solely to profile code execution over and via sockets. Observability is not a reason for distributing.
The engineering community must consider changing course to reflect the original purpose and definition of observability: to infer the process state and introduce control measures where and when necessary to stabilize systems. Data exploration is mostly a stop-gap measure employed when engineering cannot steer systems and services to a stable state during execution flow or while transitioning between change points. SRE and DevOps teams must step back from the dark-data abyss if observability initiatives are to achieve success beyond painting pretty but mostly meaningless charts on dashboards.
The more data you collect, the more you realize how much you don't know; not because the data has shown this to be the case, but because the decision to collect everything has hijacked precious attention, overloaded cognitive capacity, and delayed decisive action.
At the same time, such a seemingly simple decision is anything but simple. Simplicity, sensibility, and significance must return, along with greater awareness, a deeper understanding of the situation, and timely, intelligent intervention. Observability assists in the operational monitoring and management of services through clear and concise communication centered on change and controllability.
Situation: A State-Space
The situation, a state space, should dictate what other forms of observability are dynamically enabled. Deep and detailed data collection, such as tracing, logging, and events, should follow and be framed by the situation, one derived from and described in terms of services, signals, and states. The situation we seek cannot easily be found at the atomic level of data. Higher-order thinking, reasoning, and modeling focused on the dynamics rather than the data payloads is a mandatory requirement and a foundational framework for effectiveness, efficiency, and execution at scale in collection, communication, change, and control.
Only when there is a divergence from an expectation (past behavior) or a prediction (planned change) should retention of detailed diagnostics be activated. Tracing is invariably transient data, not a model suitable for managing a process or a system of processes.
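The gating described above can be sketched in a few lines. This is a minimal illustration, not Humainary's actual mechanism; the class name, threshold, and EWMA expectation model are all assumptions made for the example.

```python
class DivergenceGate:
    """Retain detailed diagnostics only when behavior diverges from expectation."""

    def __init__(self, threshold: float = 2.0, alpha: float = 0.1):
        self.threshold = threshold   # allowed ratio of observed to expected latency
        self.alpha = alpha           # EWMA smoothing factor
        self.expected = None         # expectation derived from past behavior

    def observe(self, latency_ms: float) -> bool:
        """Update the expectation; return True if detailed traces should be kept."""
        if self.expected is None:
            self.expected = latency_ms
            return False
        diverged = latency_ms > self.threshold * self.expected
        # Fold the new measurement into the expectation (an EWMA of past latencies).
        self.expected = (1 - self.alpha) * self.expected + self.alpha * latency_ms
        return diverged


gate = DivergenceGate()
for latency in (100, 105, 98, 102, 450):
    if gate.observe(latency):
        print(f"divergence at {latency} ms: retain detailed diagnostics")
```

Here the detailed, expensive data is discarded by default; only the 450 ms outlier, a genuine divergence from the expectation built out of past behavior, would trigger retention.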
At the atomic level, (data) differences are everywhere, yet in any practical sense there is no divergence to be seen and responded to. A quantitative measurement technique like tracing should not be the starting point for any enterprise observability or operational initiative. The communication model between machines and humans scales best with qualitative analysis and modeling. Data obesity and addiction must be fought with a renewed focus on abstraction, communication, and dynamics.
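One way to read the qualitative-over-quantitative argument is as a collapsing of raw measurements into a small vocabulary of states that machines and humans can both reason about. The sketch below assumes an invented `Quality` enum and ratio thresholds; it is not drawn from any Humainary or OpenSignals API.

```python
from enum import Enum


class Quality(Enum):
    """A small qualitative vocabulary in place of raw numbers (hypothetical)."""
    STABLE = "stable"
    DEVIATING = "deviating"
    DEGRADED = "degraded"


def assess(observed_ms: float, expected_ms: float) -> Quality:
    """Collapse a quantitative measurement into a qualitative state."""
    ratio = observed_ms / expected_ms
    if ratio < 1.5:
        return Quality.STABLE
    if ratio < 3.0:
        return Quality.DEVIATING
    return Quality.DEGRADED
```

A dashboard or controller that consumes three states instead of millions of span durations communicates change far more concisely, which is the point being made about scale.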
Conversations in Context
With distributed tracing, the request and the relationships between multiple chained requests take center stage; services are an afterthought, so much so that the concept of a service name was a post-release addition and is still not readily supported by all tracing providers today.
With Humainary, the rich, contextual, and dynamic nature of the conversation between one service and another over multiple interactions is given priority in the model. Distributed tracing is oblivious, if not blind, to how services differ in their sensitivity to service levels and in the resilience mechanisms they employ when expectations are unmet. In contrast, Humainary focuses on capturing the locality of assessment and representing the prevailing service quality levels.
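To make "locality of assessment" concrete: each caller can maintain its own running judgment of the services it converses with, instead of reconstructing a global request path after the fact. The following is a speculative sketch under that reading; the class, the `Status` vocabulary, and the smoothing scheme are illustrative inventions, not Humainary's model.

```python
from enum import Enum


class Status(Enum):
    """A caller's local judgment of a peer service (hypothetical vocabulary)."""
    OK = "ok"
    DEGRADED = "degraded"
    DEFECTIVE = "defective"


class LocalAssessment:
    """Each service keeps its own smoothed view of the peers it talks to."""

    def __init__(self):
        self.scores = {}   # service name -> smoothed success score in [0, 1]

    def record(self, service: str, ok: bool):
        # Exponentially weight recent outcomes of the conversation.
        prev = self.scores.get(service, 1.0)
        self.scores[service] = 0.8 * prev + 0.2 * (1.0 if ok else 0.0)

    def status(self, service: str) -> Status:
        score = self.scores.get(service, 1.0)
        if score > 0.9:
            return Status.OK
        if score > 0.5:
            return Status.DEGRADED
        return Status.DEFECTIVE
```

Note how the assessment lives where the expectation lives, at the caller: two services consuming the same peer can legitimately hold different judgments of it, which is exactly the sensitivity to service levels that a globally stitched trace cannot express.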