Verbal Protocol Analysis for Observability

This article was originally posted in 2020 on the OpenSignals website, which is now defunct.

Think Aloud

We always look for new ways to explain and relate to the Humainary conceptual model of signals and states. So it was a pleasant surprise to stumble across (Verbal) Protocol Analysis during a recent Design Thinking certification and some situational awareness readings.

VPA is a technique used by researchers across many domains, including psychology, engineering, and architecture. The basic idea is that during a task, such as solving a problem, a subject concurrently verbalizes (thinks aloud) whatever is resident in their working memory: what they are thinking while doing. Using Protocol Analysis, researchers can elicit the cognitive processes at work from the start of a task to its completion. The captured information is then processed and analyzed to provide insights that can improve performance.

Voluminous Data

An advantage of verbal protocol analysis over other techniques for investigating cognition is the richness of the data recorded. Unfortunately, this richness, unstructured and diverse in expression, can quickly become voluminous, requiring post-processing such as transcription and coding before it can be analyzed.

Sound familiar? Site reliability engineering (SRE) teams face the same issue when their primary data sources for monitoring and observability are event logging and its sibling, distributed tracing.

Protocol Analysis: The Steps

The basic steps to Protocol Analysis are (1) recording the verbalization, (2) transcribing the recording, (3) segmenting the transcription, (4) aggregating the segments into episodes, (5) encoding the episodes, and finally (6) analyzing the code sequencing patterns.

While transcribing, researchers interpret the recording using a glossary of domain-relevant terms. The segmentation step breaks the verbalization into text units, called segments, where each segment expresses one idea or action statement. In the aggregation step, some segments are collapsed and combined into episodes to simplify further coding and data analysis, especially when the recording volume is large enough to require sampling to reduce human effort and cost.
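The segmentation and aggregation steps can be sketched in a few lines of Python. This is only an illustration under loud assumptions: segments are approximated by sentence boundaries, and the glossary terms and the `topic_of` helper are hypothetical stand-ins for the judgment a human transcriber would apply.

```python
import re
from itertools import groupby

# Hypothetical glossary of domain-relevant terms (an assumption for this sketch).
GLOSSARY = ("cache", "database", "retry")


def segment(transcript: str) -> list[str]:
    """Split a transcript into segments, naively one sentence per segment."""
    parts = re.split(r"(?<=[.!?])\s+", transcript.strip())
    return [part for part in parts if part]


def topic_of(seg: str) -> str:
    """Return the first glossary term found in the segment, or 'other'."""
    lowered = seg.lower()
    for term in GLOSSARY:
        if term in lowered:
            return term
    return "other"


def aggregate(segments: list[str]) -> list[tuple[str, list[str]]]:
    """Collapse runs of consecutive segments on the same topic into episodes."""
    return [(topic, list(run)) for topic, run in groupby(segments, key=topic_of)]


transcript = (
    "I check the cache first. The cache is empty. "
    "So I query the database. Then I retry the request."
)
segments = segment(transcript)
episodes = aggregate(segments)
```

Here four segments collapse into three episodes, because the two consecutive statements about the cache are combined. Real segmentation is rarely this mechanical, which is part of why the human effort grows so quickly with volume.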

Coding Scheme

The most crucial step in this process, the one that dictates the success of the analysis, is the coding of statements. The coding scheme, in which statements are mapped to processes of interest, is driven by the question or goal the researchers are pursuing. A coding scheme must therefore be effective and reliable in translation, and it must express the aspects of concern for the investigation.

Typically, a small, fixed set of concept variables is encoded for each statement, with each variable having a predefined set of possible codes. In the case of an investigation into how designers think, the variables might be the design step, knowledge, activity, and object. A more abstract variable set would be subject, predicate, and object.
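A coding scheme of this shape might be represented as a mapping from each variable to its allowed codes. The sketch below uses the four design-study variables named above; the individual code values and the `encode` helper are invented for illustration and do not come from any published scheme.

```python
# Each variable has a predefined, closed set of codes.
# The code values below are hypothetical examples.
SCHEME = {
    "step": {"analysis", "synthesis", "evaluation"},
    "knowledge": {"domain", "procedural"},
    "activity": {"sketching", "calculating", "reading"},
    "object": {"requirement", "component", "constraint"},
}


def encode(scheme: dict[str, set[str]], **values: str) -> tuple[str, ...]:
    """Validate one statement's codes against the scheme and return them as a tuple."""
    if values.keys() != scheme.keys():
        raise ValueError(f"expected exactly these variables: {sorted(scheme)}")
    for variable, code in values.items():
        if code not in scheme[variable]:
            raise ValueError(f"{code!r} is not a valid code for {variable!r}")
    # Emit codes in the scheme's variable order so sequences are comparable.
    return tuple(values[variable] for variable in scheme)


coded = encode(
    SCHEME,
    step="analysis",
    knowledge="domain",
    activity="reading",
    object="requirement",
)
```

Rejecting anything outside the predefined sets keeps every coded statement comparable with every other, which later sequence analysis depends on.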

Reliability and Effectiveness

A coding scheme is reliable when different people tasked with the coding map the same statement or real-world event to the same code, with ambiguity kept to a minimum.
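One common way to put a number on this kind of reliability is Cohen's kappa, which compares the observed agreement between two coders with the agreement expected by chance. The article itself does not name a measure, so treat kappa here as an illustrative choice; the code labels in the example are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders who labeled the same statements."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same non-empty set of statements")
    n = len(coder_a)
    # Observed agreement: share of statements given the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal frequency per code.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(
        (counts_a[code] / n) * (counts_b[code] / n)
        for code in counts_a.keys() | counts_b.keys()
    )
    if expected == 1.0:
        # Both coders used a single identical code throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


coder_a = ["start", "call", "fail", "call"]
coder_b = ["start", "call", "call", "call"]
kappa = cohens_kappa(coder_a, coder_b)
```

A value of 1.0 means perfect agreement, while values near 0 mean the coders agree no more often than chance would predict.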

A scheme is effective when the coding focuses on the proper aspects of the domain, at the right level of granularity, to answer questions via sequencing patterns. In the last step, analysis, researchers perform script analysis, sometimes introducing further higher-level process groupings and categorizations that can in turn be sequenced and analyzed: a scaling up. An example of scaling up would be the inference of service status from signaling.
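As a rough sketch of such scaling up, a status can be derived from a sliding window over a sequence of coded signals. The signal names, status names, window size, and thresholds below are all assumptions chosen for illustration; they are not the Humainary API or its actual inference rules.

```python
from collections import deque


def infer_status(signals: list[str], window: int = 5) -> list[str]:
    """Infer a status after each signal from the failure ratio in a sliding window."""
    recent: deque[str] = deque(maxlen=window)
    statuses = []
    for signal in signals:
        recent.append(signal)
        fail_ratio = sum(1 for s in recent if s == "FAIL") / len(recent)
        # Hypothetical thresholds: half or more failures means DOWN.
        if fail_ratio >= 0.5:
            statuses.append("DOWN")
        elif fail_ratio >= 0.2:
            statuses.append("DEGRADED")
        else:
            statuses.append("OK")
    return statuses


signals = ["SUCCEED", "SUCCEED", "FAIL", "FAIL", "FAIL"]
statuses = infer_status(signals)
# statuses == ["OK", "OK", "DEGRADED", "DOWN", "DOWN"]
```

The point is the change in level: the input is a sequence of low-level codes, and the output is a smaller vocabulary that is itself sequenced and can be analyzed in the same way.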

Logging: Tribulations in Transcribing

If you have ever worked with vast amounts of logs, metrics, and distributed tracing data, you will immediately recognize some of the above steps in turning recordings into something reliable and effective for monitoring and managing applications and systems of services.

These days, most site reliability engineers get stuck in the transcribing phase, trying to bring uniformity and meaning to many different machine utterances, especially in logs and events. We’ve witnessed many organizations start an elaborate and ambitious initiative to remap all log records into something more relatable to service-level management or situational awareness via various record-level rules and pattern matches, only to abandon the initiative once the scale of the problem and the human effort involved are fully recognized.
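To make the fragility concrete, here is a minimal sketch of record-level remapping with pattern matches. The log lines, patterns, and codes are all hypothetical and not taken from any real system.

```python
import re
from typing import Optional

# Hypothetical record-level rules: (pattern, code). First match wins.
RULES = [
    (re.compile(r"connection (refused|reset)", re.IGNORECASE), "FAIL"),
    (re.compile(r"request .* completed in \d+ ?ms", re.IGNORECASE), "SUCCEED"),
    (re.compile(r"retrying", re.IGNORECASE), "RETRY"),
]


def transcribe(line: str) -> Optional[str]:
    """Map one free-form log line to a code, or None when no rule matches."""
    for pattern, code in RULES:
        if pattern.search(line):
            return code
    return None


lines = [
    "Request /orders completed in 42ms",
    "Connection refused by upstream inventory",
    "Retrying call to inventory (attempt 2)",
    "Upstream inventory unavailable, giving up",
]
codes = [transcribe(line) for line in lines]
unmatched = [line for line, code in zip(lines, codes) if code is None]
```

The last line is clearly a failure to a human reader, yet it slips through because nobody anticipated that wording. Every reworded message in every release demands another rule, which is how the effort spirals.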

Smoke and Mirrors

These tasks only look good in vendor demonstrations, which never reflect the rate of change that all software undergoes now and will undergo in the future. You might ask how Protocol Analysis attempts to optimize the steps before coding. It does so by bringing the coding itself forward to some degree: transcribers are already familiar with the coding scheme before they begin. It should be noted that for many doing VPA in new domains, the coding scheme is defined much later in the process. Fortunately, in the Observability space of software systems we deal with machines instead of humans, so it is far easier to introduce appropriate coding into the transcribing process.
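A minimal sketch of what coding at the source could look like follows: the service emits a code from a closed vocabulary at the moment something happens, so there is nothing to transcribe afterwards. The `Signal` values and the `Recorder` class are hypothetical illustrations, not the Humainary API.

```python
from enum import Enum


class Signal(Enum):
    """Hypothetical closed vocabulary of codes a service may emit."""
    START = "start"
    SUCCEED = "succeed"
    FAIL = "fail"
    STOP = "stop"


class Recorder:
    """Collects (service, signal) pairs instead of free-form log text."""

    def __init__(self) -> None:
        self.events: list[tuple[str, Signal]] = []

    def emit(self, service: str, signal: Signal) -> None:
        # Reject anything outside the coding scheme at the point of emission.
        if not isinstance(signal, Signal):
            raise TypeError("signal must be a member of Signal")
        self.events.append((service, signal))


recorder = Recorder()
recorder.emit("inventory", Signal.START)
recorder.emit("inventory", Signal.FAIL)
recorder.emit("inventory", Signal.STOP)
```

Because the vocabulary is fixed where the event originates, downstream analysis works directly on code sequences and needs no pattern rules to recover meaning from text.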

Template

Humainary’s Serventis is a template for a protocol analysis model and coding scheme for understanding and reasoning about the processing and performance of microservices involved in the coordination and cooperation of distributed work. It is time for software services to think aloud with Humainary and abandon sending meaningless blobs of data to massive event data black holes in the cloud.

It is time to standardize a model that serves site reliability engineering and not some manufactured data addiction. Let’s have machines and humans communicate regarding service, sign, signal, and status.