Joint optimisation on sociotechnical systems
How do you modify, or even improve, any non-simple system?
Sociotechnical systems vision is the term I use for viewing systems through the individual lenses of social and technical work, then overlaying them to perceive a more complex system. I'm pretty sure that there's a more appropriate term somewhere in the literature for this, but work with me for a moment, m'kay? The field of sociotechnical systems is fascinating to me, and I wholeheartedly recommend reading more about it.
In this article, we'll practice sociotechnical systems vision on a simple example. Along the way we'll (hopefully) demonstrate that system optimisation should work on both the social/organisational and technical dimensions. But first…
Stereoscopy and Sociotechnical principles
Do you remember the 3D glasses that were somewhat popular around 2000?
Irisblixten, CC BY-SA 4.0, via Wikimedia Commons
They are called red-cyan stereoscopic glasses. They work by filtering out colour ranges so that your left and your right eye end up seeing slightly different pictures, both of which have partial depth information. When you look through both eyes, the images are overlaid and your brain perceives a more complex picture (the stereoscopic illusion) than the individual pictures. This leads us to the first of the two sociotechnical principles:
The interaction of social and technical factors creates the conditions for successful (or unsuccessful) organizational performance. This interaction consists partly of linear "cause and effect" relationships (the relationships that are normally "designed") and partly from "non-linear", complex, even unpredictable relationships (the good or bad relationships that are often unexpected). Whether designed or not, both types of interaction occur when socio and technical elements are put to work.
Let's say you have a picture like the following, which can be viewed with these red-cyan glasses. What if you want to improve it by making the hammer bigger? You don't have to be a stereoscopy expert to understand that changing only the red layer or only the cyan layer of the picture will surely fail: you need to change both.
Gabriel Rollenhagen, CC BY-SA 2.0, via Wikimedia Commons
This gets us to the second principle of sociotechnical systems (and a corollary of the above):
Optimisation of each aspect alone (socio or technical) tends to increase not only the quantity of unpredictable, "un-designed" relationships, but those relationships that are injurious to the system's performance.
Now it's time for our practical example. Let's say you have created an alert.
Technical
Let's turn on the cyan technical eye, which solely sees technical systems and interactions. In that case we'd see something like this:
The above diagram maps a fairly typical alert delivery pipeline:
- PromQL, the query language to write the alert in
- Repository that controls a simple alert manager config
- Prometheus, the system that does the monitoring
- Prometheus alert manager
- Monitored service
- Grafana, which can display your alert and other metrics
- OpsGenie, a tool to manage and reach on-call staff
- Responder phone
- Slack channel to receive alerts in
- Feature platform, which we will assume can disable the feature causing the alert
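To make the technical half a bit more concrete, here is a minimal sketch of what such an alert could look like as a Prometheus alerting rule, combining the PromQL query with the alert definition that lives in the repository. The service name, metric, threshold and labels are all hypothetical, not taken from a real system:

```yaml
# rules.yml — a hypothetical Prometheus alerting rule.
groups:
  - name: example-service
    rules:
      - alert: HighErrorRate
        # PromQL: ratio of 5xx responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{service="example", status=~"5.."}[5m]))
            /
          sum(rate(http_requests_total{service="example"}[5m])) > 0.05
        for: 10m            # must hold for 10 minutes before firing
        labels:
          severity: page    # Alertmanager can route "page" to OpsGenie
        annotations:
          summary: "High 5xx error rate on example-service"
```

Alertmanager would then match on the `severity` label to decide whether the alert reaches Slack, OpsGenie, or both.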
Social
Now let's turn on the other eye, the red one, and look at the social side of the same space:
This diagram maps some of the organisational aspects of alert creation, response and delivery:
- Alert created after incident due to policy
- Alert has to follow company conventions
- Alert requires approval by o11y team
- Human on-call rotation
- Group of engineers following the alerts channel
- Grafana is not accessible by all team members
- Feature platform is unclear to many engineers; they haven't used it
Notice that the human side has a lot more actions. One of the powerful features of humans is that they can generally adapt to unexpected scenarios, and they are usually authorised to act accordingly. Most incidents I've observed are solved by humans taking action.
Sociotechnical
Let's overlay these diagrams, and switch to a vertical layout or we'll run out of space.
Okay now, this is indeed a lot of complexity. This diagram helps our brains perceive a picture that's more than the sum of its parts.
To describe the diagram in short:
- Humans create an alert
- Machines manage and deliver the alert
- Humans have to be available to respond
- Humans have to access, know and use technical systems to respond successfully
Observations
A few things stand out for me in the diagram above:
- Notice the terminal states highlighted in grey. These are fail states. Typically an operator arriving there is out of immediate options, besides escalating to an unknown other, if they exist and are reachable
- Notice the spectacular amount of wasted effort if our operator reaches the final step, the feature management service, and cannot use it
- Notice the mingling of social and technical actions. This system cannot reach a success state without the collaboration of humans and machines
- The moment when control passes from machine to human is quite important. In aviation, they call it "transfer of control", and in their field this is often where the failures that kill people actually happen
Equipped with the above, we are now able to perform "joint optimisation", the design and improvement of both sides of a system to maximise our desired output. Some examples of joint optimisation questions that are meaningful in the alerting context are:
How can we improve average time to understanding an issue?
We'd need to make sure our alert evaluation interval is fast, the alert fires as soon as it has enough data, and the alert is clear and unique enough for the responder to parse or look up. That last part is usually the most time-consuming, yet it is invisible if you're looking purely at the technical side.
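The technical half of that answer is a handful of explicit knobs in the rule definition; the social half is a summary and a runbook link that make the alert fast to parse. A sketch of those knobs, with all names and the URL purely illustrative:

```yaml
# Faster time-to-understanding: a short evaluation interval,
# a `for:` window only as long as the signal needs, and
# annotations that tell the responder what is wrong and
# where to look. All names and URLs here are hypothetical.
groups:
  - name: example-service
    interval: 15s           # evaluate the rule every 15 seconds
    rules:
      - alert: HighErrorRate
        expr: job:request_errors:ratio5m{job="example"} > 0.05
        for: 2m             # fire as soon as there is enough data
        annotations:
          summary: "example-service is returning >5% errors"
          runbook_url: "https://wiki.example.internal/runbooks/example-errors"
```

The `interval` and `for:` settings bound the machine's share of the delay; the annotations bound the human's.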
How can we improve average time to mitigation?
Besides the above, we'd need to make sure that all responders can access the feature platform, know how to use it, and have permissions to hit the "disable" button on a feature. We'd need to have an on-call rotation to reliably have a responder available. Besides the main on-call responder, if the alert successfully propagates to Grafana and Slack and the responders can access and follow those platforms, we may be able to enlist additional responders.
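The delivery part of that answer lives in the Alertmanager routing config: one alert fanning out to both the on-call rotation (OpsGenie) and the shared Slack channel, so secondary responders can self-enlist. Receiver names, file paths and the channel below are placeholders:

```yaml
# alertmanager.yml — hypothetical routing: page the on-call
# responder via OpsGenie AND post to the team Slack channel.
route:
  receiver: team-slack            # default: everything goes to Slack
  routes:
    - matchers: ['severity="page"']
      receiver: oncall-opsgenie
      continue: true              # keep matching so Slack also fires
    - matchers: ['severity="page"']
      receiver: team-slack
receivers:
  - name: oncall-opsgenie
    opsgenie_configs:
      - api_key_file: /etc/alertmanager/opsgenie-key
  - name: team-slack
    slack_configs:
      - channel: "#alerts"
        api_url_file: /etc/alertmanager/slack-webhook
```

The `continue: true` on the first route is what lets a single alert reach both receivers instead of stopping at the first match.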
Conclusions
Sociotechnical systems vision is the ability to see a system through a social, a technical, and finally a combined sociotechnical lens. Typically only the last one actually represents the system's complexity. Joint optimisations are the meaningful ways to improve the system, which usually surface only when you see that full complexity.
In short, failing or refusing to see the organisational aspect of any tech system is like working with one eye permanently closed.