The Cannery — The Joy of a Low-Hazard Development Environment

Which of these two boardwalks is more difficult to navigate?

Mountain biking on a tall skinny feature Mountain biking on a boardwalk

Of course, the one on the left is more difficult to navigate. But why would the height matter? It’s a similar width and material and is even straighter than the one on the right.

It’s easier when it’s safe to fail. If there is no tolerance for error, the task is more demanding.

High-Consequence is High Cost

The bigger the potential harm, the more care we need to take. For instance, if the lever in your car that activates the windshield wipers on was a multi-function lever that popped the wheels off when pushed, you would think twice before turning on your wipers. It’s easy to remember to pull and not to push, but you would not pull without due care, especially when you’re zipping down the highway. You may even choose to go without your windshield wipers if your windshield was only slightly obstructed — better not risk it.

High-Consequence Software Workflows

As we create software, we face many low-key hazards. I say “low-key” because we’re not losing our wheels at highway speeds or falling 10 feet into spikey branches. When I am talking here about hazards, I’m talking about the danger or risk of something undesirable happening — something that we want to avoid.

Examples of workflow hazards

These are listed in no particular order. Some are major (they can cause production outages with significant financial cost to the business and customers) and some are minor (they can cause some lost time). All adversely impact our stress and productivity.

right process, but wrong environment — when we connect to our pre-production environment, if it’s the same process as connecting to our production environment, we might unintentionally change production when we think we are changing pre-production. This is like the windshield wiper lever that controls both low-risk (wipers) and high-risk (wheel ejection).
- for example, see the Summary of the AWS S3 service disruption in 2017
scripts running as admin — we change our infrastructure with code (Terraform, CloudFormation, etc.) but the process running the Infrastructure as Code (IaC) has full admin access. If we make any mistake in our script, we break adjacent systems.
shared hosts / shared cloud infrastructure accounts — with containers and cloud service providers, it’s cheaper to have multiple hosts and multiple accounts. On the other hand, when we have more workloads running in the same environment (same AWS account or same server), we increase the risk that one interferes with the other. If they’re separate, they can’t interfere with one another.
too long between “save points” — the longer we work before reaching a “save point,” the bigger the risk that we’ve gone off course (and the harder it is to correct our mistakes). Examples of this hazard:
- trying multiple code experiments at once (if it doesn’t work, we don’t know what change caused the problem)
- 10s of minutes or even hours between version control commits (the more uncommitted code we have, the more we have to lose if something goes wrong)
- building a big batch of functionality without exposing it to customers for feedback (the more we build without feedback, the more we have to lose if a bad assumption invalidates our design)
low-cohesion, highly-coupled code (spaghetti code) — when we have to trace control flow through many files to understand how to make the change we need to make, then our success depends on not losing focus (see also: “why developers hate being interrupted” comic on Reddit.) By contrast, if our code is well factored, our working mental model is small, and we can manage interruptions without losing our work and our focus.
process requires good memory — to successfully complete a process, you need to do multiple steps in the correct order. For example: before pushing a change, you may have a series of manual checks to run:
- run linter
- run test
- start servers to make sure they start
If we skip one or more of these steps, then we must later redo our work when we discover that a minor linter error blocked our pipeline. That ties into the next example.
slow feedback loop — it takes a long time to get feedback (“long” is context-dependent here), so whenever you make a change, you are incentivized to avoid mistakes because every check is expensive. Common examples include unit test suites that take minutes to run, or pipelines that take tens of minutes to run. Again, “long” depends on context, but it’s whatever duration is long enough that you’re worried about avoiding mistakes to avoid wasting a “long” run.

Adrenaline-seeking Software Developers

In mountain biking, we intentionally build hazardous features for the thrill of riding them. The goal is to make them epic and to push our limits.

At work, we sometimes run into folks who revel in complexity. They have the memory for the process steps, the skill to make the most of each expensive feedback loop, and the experience to avoid connecting to the production environment by accident. Like the mountain biker ripping down the double black diamond with a GoPro on their chest, these highly skilled software developers navigate hazards with great satisfaction.

Low-hazard work environments

We can reduce the hazards in our workplace so that we apply the full force of our skill and experience to building value for our customers and our business. Instead of devoting part of our mental processes to mitigating hazards, we can form our tools and processes to avoid them entirely.

Examples of hazard-reducing changes

Limit our access to production to read-only, with a high-friction escalation process to get elevated access
Apply the principle of least privilege: limit scripts to the minimal permissions they need to do their work
- e.g., with AWS, create a new IAM role for your script with zero permissions. As you develop the script, every time you get an “insufficient permissions” error, add those permissions to the IAM role. By the time you have written your script, you will also have a bare-minimum permission set for that script to do what it needs to do.
Script your manual processes
- if you always invoke a command with similar arguments, wrap it in a zero-argument script that provides those arguments
- if you often invoke multiple commands in sequence, write a script that invokes those commands one after the other
Work in small chunks to reduce the risk of throwing away your work because of a mistake, a loss of focus, or an assumption-changing feedback
- small units of code (high cohesion, low coupling)
- small commits (as soon as the code is better and no worse, commit)
- small slices of functionality delivered to customers for feedback
Make a tool self-documenting
- throw a descriptive error if a required input is missing
- catch common errors and print a helpful instruction instead (translate cryptic errors into concrete advice)
- change script and parameter names to clarify intent

Objections and challenges

To do this, you’ll likely have to get more comfortable with Bash, PowerShell, AWS IAM, your build pipeline tool’s specific syntax, and other unglamorous technologies. These are the things that many people choose not to put first on their résumés. They are the things that are tempting to ignore.

The outcome of not reducing hazards

If you choose not to address your scripting, permissions management, and other hazard-reducing technologies, you are likely to create systems and processes that feel more like double black diamond mountain biking trails, and less like peaceful cross-country bike rides. To change those systems, you’ll have to allocate some of your creative energy to paying attention to — and mitigating — hazards while you work. This gives the feeling that if you slip up, something bad will happen.

The joy of a low-hazard workplace

When we identify and mitigate hazards, we create an environment where we can fully focus on our creative work. Whereas a hazardous environment requires vigilance, a low-hazard environment allows for safely pausing at almost any time. In a low-hazard environment, mistakes are less common and less costly, increasing that satisfying feeling of success. The reduced cost of error makes it cheaper to experiment, which accelerates learning. It’s good for productivity, and it’s good for our mental health. Win-win!

Steps to reduce hazards in your process

You can make incremental improvements in this area by paying attention to how you feel about your work, identifying hazards, and taking steps to mitigate them.

Pay attention to how you feel while working

It can be hard to identify hazards like this because we get used to working around them. Here are some indicators that you may have a process hazard:

while you’re working, you feel like you must not forget something (else something bad will happen)
while you’re working, you tell yourself to be careful to avoid something
when someone new joins you, there are “gotchas” to explain about the process mdash; “yes, I can see why you would think we do X here, but we must actually do Y instead”
you have a sense that you can’t stop what you’re doing until you get to a certain milestone
you feel stressed by the details of the work (you feel like there is “a lot going on”)
you have to remind yourself of your process after the intuitive approach goes wrong (you think you know the process, but when you do it and it doesn’t work, you remember some detail about it)

Identify hazards

When your feelings indicate a potential hazard, pause and reflect or discuss with teammates.

While you’re working, you can say things like:

“This feels risky, can we make it safer?”
“There is a lot to remember here; can we make this simpler?”
“Can we automate this?”
“I’m concerned that we wouldn’t know to do that if you weren’t here to tell us; can we write that down?”
“We’re going to have to do these steps again soon; can we make this easier to repeat?”

During a retrospective, you can say things like:

“I felt like there was a lot to keep in my head while we were working”
“We did X and everything worked, but we were close to doing Y which would have caused trouble. Can we find a way to make it clearer that we should do X and not Y in the future?”
“There were some close calls, recently (like X and Y). Could we change something to reduce the risk for activities like X and Y?”

Mitigate the risks

Just like a construction worker doesn’t call their boss to ask permission to put on a hard hat, you can make your work safer without escalating up your chain of command. Many hazards can be reduced in a single small step. If you take single small steps towards reducing hazards each day, you’ll find you’re facing fewer and fewer hazards in your day.

Some examples:

If your process is ad-hoc, discuss it and agree on a process. Discussed is better than implicit.
If your agreed-upon process is unwritten, jot it down in a README (e.g., Markdown) file. Rough notes are better than a verbal agreement.
If you have a written process, write a script. You don’t have to automate everything — you can automate one step and for all the other steps insert print statements that say “do X”, “do Y”
If you have an unreliable or incomplete script, add a missing step or handle a new edge case
If you have a reliable script that you manually run, can you find a way to trigger it automatically or incorporate it into another automated workflow step?
If you have a process or script that needs carefully crafted inputs, can you add input validation to protect against errors?

All these changes are small, atomic improvements that compound over time. If you want to eliminate all the hazards, that’s probably a sizeable project that would require budget approval. On the other hand, if you’re trying to marginally reduce one specific hazard, that’s as lighweight as putting on a hard hat, and can be done as part of your regular work. You can start today!