Running A Benevolent Puppet Regime

In March I presented the talk “Running a Benevolent Puppet Regime” at the first-ever PuppetCamp Denver, held in the Code Talent building in Denver’s RiNo neighborhood.

The presentation content is available on SlideShare, but since the slides were designed to be an aid to my verbal storytelling and not the primary method of communication, I figured a port to a blog post would also be helpful.

The whole presentation is also available in video form, if you want to sit through my yammering for 45 minutes.

Road To Production

If you don’t know much about me - I’m interested in building elegant systems in life and in my career. I don’t just care about how a system looks from the outside but also how every part of the system works inside - individually as parts, and together to create the whole. I care about how and why things work and interact. I strive to leverage machines in such a way that they work for me and those around me instead of the other way around.

I’m the DevTools Team Lead at ReadyTalk, and my team is all about paving a smooth and fast road to production for all of our infrastructure and applications. Our mission statement says it best: “We empower teams to continuously deliver software with increasing efficiency.”

When it comes to my work with Puppet over the years, I like to describe it as taking steps to transform what we had - a “Puppet Regime” - into something better: a “Benevolent Puppet Regime.” What’s the difference, you ask?

Let’s define a puppet regime to start:

Puppet Regime: a system that is directed by an outside authority that leads to hardships on those governed

The benevolent spin is this:

Benevolent Puppet Regime: a system that is directed by an outside authority that leads to improved operations on those governed.

It’s important to note that a Benevolent Puppet Regime has two aspects of operational dynamics: human and computer. If you run a Benevolent Puppet Regime, you’ll see improved operations by people as well as computers - and that’s what we’re looking for, right? People and computers working together well as one cohesive system.

ReadyTalk has been around since 2001. It’s definitely an established company at an age of about 14 years, but still holds on to some of its cultural roots developed in its younger (and smaller) years. I’ve only been a part of the company for about 4 years now, and Puppet has only been a part of ReadyTalk’s (and my own) story for 3 years. In the past 4 years, I’ve seen the company double the size of the engineering team and grow the number of managed servers from several hundred to a couple thousand.

Back in the day, our sysadmin would provision machines with a collection of Perl scripts - something I refer to as Ol’ Perly. These Perl scripts were complicated, hard to maintain, and most certainly would cause some hassle when used. Whenever we’d hire a new engineer, it would be their responsibility to build their machine with the sysadmin’s (and nearly the whole engineering team’s) help. Some people would have their workstation up in a week but others might not have an operational workstation within 3 weeks! This amount of delay was caused by technical troubles with Ol’ Perly for sure, but some of the time was honestly spent waiting - waiting for the right people to be available at the right times. Building a workstation for a new engineer was quite a task - but it shouldn’t have been this hard - and definitely not this dangerous!

Ol' Perly

Thankfully our sysadmin saw some chatter about Puppet on Reddit, and spent some of his free time trying it out and provisioning a system from scratch. When he presented his work at a demo session - everyone was excited about the possibilities of using Puppet at ReadyTalk, so he started to use it in more places, including during the build-out process of new developer workstations.

However, after some time, the excitement waned and the accumulated troubles of managing hosts with this new automated system began to overwhelm and frustrate people. I overheard a developer say one day:

“I won’t allow Puppet on my machine because it will destroy my dotfiles!”

or

“Why do we have to use Puppet when what we had before worked really well for us?”

After these comments became more common at ReadyTalk, my team (the DevTools team) and I decided to dive into investigation mode: what exactly was frustrating people about Puppet? Two years ago we released a survey to the whole engineering team to suss out some details regarding this Puppet angst.

 Which statement best describes your ability to work on Puppet modules?

a. I try to avoid working on Puppet modules.

b. I can read Puppet code and generally figure out what a Puppet module is supposed to do.

c. I can modify existing Puppet modules with little assistance to fix a bug or add functionality for a related development story.

d. I can add new Puppet modules to support a new feature or add functionality for a related story and I am familiar with how to test Puppet modules.

As you can see, a majority of the respondents chose either a or b.

Everyone Avoids Puppet

People Can Read Puppet

Some of the free-form responses cemented this quantitative data as well:

I don’t use Puppet

and

Are the currently used Puppet modules available in a repository for viewing and/or editing by any team member?

I wasn’t aware that I had access …

and

I feel like our Puppet modules are guarded. I am happy and able to contribute but don’t feel like there is a culture that supports this.

Clearly these responses showed that we had some work to do. But it got worse with the second question…

 What do you do when you run into issues with Puppet?

a. I say nothing and hope someone else will run into the problem and fix it later.

b. I file a Jira issue with steps to reproduce the problem.

c. I investigate and resolve the issue myself using online documentation or other resources.

d. I ask a fellow team member for assistance.

e. I ask a subject matter expert for assistance.

The answer we didn’t want to see was obviously a, while c would show pretty good autonomy. Unfortunately, what we saw was:

People File Issues

People Ask For Lots of Help

Then the free-form responses brought it home…

I’ve never had to do anything with Puppet.
I do know that Puppet has never worked on my dev environment and when I asked for help people ran away from me.

and

… also Puppet attacked and killed my family.

and

I like the idea of Puppet, but I have no idea how to get started or how it works. I can blow through the examples and tutorials online, but I have no idea how to contribute or improve our internal Puppet infrastructure.

The results from this question started to show something really clear: Puppet was a huge maintenance burden not only for engineers but also for the infrastructure and sysadmin team that had to respond to all these support requests.

That was two years ago. Over the last two years, I made it a point to turn the following phrase into a commonly repeated mantra:

It’s not Puppet’s fault

it’s how we are using Puppet that’s the problem

My team saw a few fundamental drawbacks to our Puppet infrastructure at that time:

 People could help but “can’t”

which led to a culture of low involvement and engagement with our Puppet code

 People who know how to maintain/advance Puppet are too tied up in fixing problems

which led to the creation of a strained group of infrastructure personnel and sysadmins

 People perceive Puppet as something that gets in the way and is inconvenient.

which made it common practice to subvert established processes and teams in order to “just get it done.”

This created a vicious cycle and made our infrastructure support team work really hard… but never really go anywhere.

Vicious Cycle

A lot of the time, as engineers, we look straight to a tool to solve our problems and jump at whatever promises a quick fix.

W. Edwards Deming has some thoughts that are quite pertinent when thinking about how to solve the problems that plague you:

Hard work will not ensure quality. Best efforts will not ensure quality. Gadgets, computers, or investment in machinery will not ensure quality.

It is not enough to do your best; you must know what to do and then do your best.

So how do we develop this “know how?” How do we transform a normal Puppet Regime, one that leads to hardships on those governed (which might include you!) into a Benevolent Puppet Regime - one that improves the operation of both people and computers?

How might we take our Puppet game to the next level?

Next Level Puppet

At ReadyTalk we approached our troubles with Puppet from 3 angles. I’ll outline these three angles by treating each one as a tool for success in your Benevolent Puppet Regime toolbox.

Commit To Collaboration


 Intended Use

This tool will help you improve involvement in advancing your Puppet infrastructure and modules.

 Fundamental Idea

Collaborate.
Commit.

 Implementation Details

 Commit

 Collaboration

This allows various trades to contribute to the quality, design, architecture, and structure of your Puppet code.

If you start using Puppet Forge modules, you might be tempted to start tweaking the “standard” modules to conform to your own practices. Don’t! Instead, look into ways you can start managing parts of your infrastructure in more standard ways (according to the software infrastructure community at large). The reality is that most of your problems are common problems in the industry. Chances are someone has spent more time thinking about the minutiae of certain infrastructure components - why not benefit from someone else’s hard work?

 Overall Idea

Treat your Puppet modules, manifests, code, etc. as a normal software project. Get all sorts of trades involved (QA, dev, SDET, CI, CD, DevTools, etc.) for the best chance of highest quality. Make your Puppet code define the common language between these trades. Refer to changes to systems as changes to Puppet code, not changes to individual boxes.

Automate Authority


 Intended Use

This tool will help you reduce the strain on your overworked (and probably inaccessible) team.

 Fundamental Idea

Trust but Verify.

 Implementation Details

Consider how a city’s streets are laid out, intersections striped, and stoplights positioned and timed. A really well-planned street system has intersections with lights that allow a car to pass smoothly through the main (busiest) thoroughfare during rush hour. Traffic light control and coordination built into the design of a city and its streets will always be more efficient than a traffic cop. Yet too often we manage our infrastructure (the flow of cars) the way a traffic cop would. Being in control of the operation of a whole intersection can be fun for a time… it’s definitely an adrenaline rush and fills you with purpose. But it’s not a sustainable way of living. For anyone. It gets old fast - just like manually testing your company’s Puppet modules.

I’ll admit - software is as much an art as a science. However, there’s no excuse for a test process that’s a work of art - one that’s manually executed and blessed by “those in charge.”

Don’t be a traffic cop for your Puppet modules. Let the system show its status: red, yellow, or green… through automated testing and a continuous integration server.

At ReadyTalk we have an automated testing pipeline for all of our Puppet modules that looks something like the following:

  1. Someone commits code to the Puppet module repository.
  2. Jenkins is notified that there are changes to the Puppet module and kicks off a verification job.
  3. This verification job runs some basic syntax checks, runs puppet-lint, and reports on the trend of these syntax problems.
  4. Then the meat of the automated test process starts.

For this part of the process we have been using Test-Kitchen, which uses Vagrant to spin up virtual machines, applies a Puppet module or two using the kitchen-puppet Puppet provisioner, and then runs some ServerSpec tests to verify that the virtual machines are in the expected state.
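If you haven’t seen Test-Kitchen pointed at Puppet code before, here’s a minimal sketch of what a .kitchen.yml for this kind of setup might look like. Take it as an illustration rather than our exact configuration: the platform name, paths, and suite layout are placeholders, and puppet_apply is the provisioner name supplied by the kitchen-puppet gem.

```yaml
---
driver:
  name: vagrant            # Test-Kitchen drives Vagrant to create the VM

provisioner:
  name: puppet_apply       # provided by the kitchen-puppet gem
  manifests_path: manifests
  modules_path: modules
  manifest: site.pp        # entry-point manifest that applies the module under test

platforms:
  - name: debian-7.8       # placeholder platform; use whatever your fleet runs

suites:
  - name: default
    # ServerSpec tests live under test/integration/default/serverspec/
    # and are run inside the VM after the Puppet run converges.
```

With something like that in place, `kitchen test` will create the VM, converge it with Puppet, run the ServerSpec suite, and destroy the instance - which gives Jenkins a clean pass/fail signal to report.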

There’s a really awesome walkthrough of Test-Kitchen on the Kitchen.ci website that will help you understand the workings of all these pieces better. Disclaimer: test-kitchen was originally created for Chef Cookbooks, but works fairly well for Puppet too! You may want to look at the Beaker project from PuppetLabs as well.

Having this Jenkins job with Test-Kitchen delivers a bunch of benefits:

The most beneficial part of pulling this tool out of our toolbox was seeing ops folks able to spend more time building and automating tests, now and in the future, instead of fixing things that “were working just yesterday.”

One of my coworkers used to hate fixing Puppet problems, but once he was up and running with a local Puppet development environment (we Puppetized that too, of course), he created one of our most robust and well-tested Puppet modules. I actually used this module as part of my example in my Puppet Camp Presentation (writing tests for our Postgres setup)!

Rebuild Ruthlessly


 Intended Use

This tool will help you:

 Fundamental Idea

Prove your infrastructure systems are well designed and architected by rebuilding them continuously.

 Implementation Details

The way I see configuration-management-done-well is through a collection of practices (keep in mind I’m talking about mutable infrastructure here):

What’s undesired state cleanup, you ask? Build servers provide an excellent example of what I’m referring to - they’re manual tampering magnets. Imagine you’re a developer (or … anyone) with a project that works perfectly on your machine, but the build server can’t build it because it’s missing an extra-cool-provides-all-the-functionality-you-want-library.so. You’ve got to get that onto the build server - and fast - people are waiting for your work to be complete and testable, right? This was exactly the situation at ReadyTalk. Since the Puppet modules were historically guarded and hard to improve, the easiest route to resolution was usually to log in to the server that was missing the system library, install it, and then rerun the build. Success! Right…?

In Denver there’s a park with this sign:

No Shortcuts

We want people to take the right path - the path that’s safer for everyone. The path that’s paved for maximum success. We don’t want shortcuts in the process of creating a reproducible infrastructure. Shortcuts may feel good and shave seconds off your task today, but they can hurt everyone in the months ahead.

So what should we practice consistently?

At ReadyTalk we rebuild 2 of our build servers (seen below) every night. This has discouraged all sorts of manual system tampering (of which even I’m guilty).

Build Servers

A Jenkins job is the point of entry for the process and uses Ansible to orchestrate the dance of rebuilds (a rough playbook sketch follows the list):

  1. Ansible uses the Jenkins REST API to take two servers out of commission.
  2. Ansible uses the Foreman REST API to tell those two servers to re-provision from scratch when they’re next rebooted.
  3. Ansible SSHes to those two servers and reboots them.
  4. Puppet and Foreman work together after the boxes PXE boot and go through the Debian install process.
  5. Ansible waits for the machines to open up SSH access and allow logins through LDAP.
  6. Ansible uses the Jenkins REST API to put those two servers back in the build cluster.
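
To make that flow a bit more concrete, here’s a rough, hypothetical sketch of the orchestration as an Ansible playbook. The hostnames, URLs, group names, and credential variables are placeholders (and the exact Jenkins and Foreman API endpoints may differ from what we actually call); the uri, shell, and wait_for modules are standard Ansible modules.

```yaml
---
# Hypothetical nightly rebuild playbook - hosts, URLs, and credentials are placeholders.
# Credentials (jenkins_user, jenkins_api_token, foreman_user, ...) are assumed to come
# from group_vars or a vault.
- hosts: localhost
  gather_facts: false
  vars:
    jenkins_url: https://jenkins.example.com
    foreman_url: https://foreman.example.com
    rebuild_hosts:
      - build01.example.com
      - build02.example.com
  tasks:
    # 1. Take the two build slaves out of the Jenkins cluster
    - name: Mark Jenkins nodes as offline
      uri:
        url: "{{ jenkins_url }}/computer/{{ item }}/toggleOffline"
        method: POST
        user: "{{ jenkins_user }}"
        password: "{{ jenkins_api_token }}"
        force_basic_auth: yes
        status_code: [200, 302]
      with_items: "{{ rebuild_hosts }}"

    # 2. Flag the hosts for re-provisioning ("build mode") in Foreman
    - name: Enable build mode in Foreman
      uri:
        url: "{{ foreman_url }}/api/hosts/{{ item }}"
        method: PUT
        body_format: json
        body: {"host": {"build": true}}
        user: "{{ foreman_user }}"
        password: "{{ foreman_password }}"
        force_basic_auth: yes
      with_items: "{{ rebuild_hosts }}"

- hosts: build_slaves        # inventory group containing the two servers above
  gather_facts: false
  tasks:
    # 3. Reboot so the boxes PXE boot into the installer (steps 4-5 then happen on their own)
    - name: Reboot into re-provisioning
      shell: sleep 2 && /sbin/reboot
      async: 1
      poll: 0

    # 5. Wait from the control host for SSH (and LDAP logins) to come back
    #    once the Debian install and first Puppet run have finished
    - name: Wait for SSH after the rebuild
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22
        delay: 600
        timeout: 5400
      delegate_to: localhost

    # 6. Toggle the node back online so Jenkins can schedule builds on it again
    - name: Bring Jenkins node back online
      uri:
        url: "https://jenkins.example.com/computer/{{ inventory_hostname }}/toggleOffline"
        method: POST
        user: "{{ jenkins_user }}"
        password: "{{ jenkins_api_token }}"
        force_basic_auth: yes
        status_code: [200, 302]
      delegate_to: localhost
```

In practice you’d add retries and error handling around each step, but the skeleton is the same six steps as the list above.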

This process is really exciting to watch. It takes a while (about an hour) but it makes you feel like quite the omnipotent Puppet Master.


This last tool is the practice that has been the most beneficial to us as of late. Besides uncovering layers and layers of problems in our whole infrastructure such as:

We also have seen much more consistency and reliability in using Puppet at ReadyTalk. In the past there was a good chance that our Puppet modules wouldn’t work so well when we needed them most. However, a few weeks ago I had a developer come running up to me and say:

I rebuilt my development environment
today
by myself
without any problems!

What a win. That used to take weeks!

But that’s not all. Just the other day I saw that the Jenkins slave rebuild process failed due to Puppet’s inability to drop files in the Jenkins user’s home directory. I filed an issue with our system administrators and later heard back from one of them:

At first I had no idea what was going on. It was such a strange problem to be having. No other systems were exhibiting that same problem. However, when I realized what was happening, I had a small heart attack - I found that the whole LDAP cluster was down in the office, and the only reason people were still logging into boxes was that sssd caches users. By finding this issue today you may have saved a HUGE headache from happening tomorrow. Thanks!

At this point, it’s important to reflect back on how our Puppet-backed infrastructure has transformed from a painful regime into a benevolent one. We now have:

That’s a way better, more energizing feedback cycle, isn’t it?

Energizing Cycle

By

  1. committing to collaboration,
  2. automating authority of our Puppet code, and
  3. rebuilding our infrastructure ruthlessly

we have seen a drastic improvement in our ability to respond to change, both internally and in the competitive market that is the web/audio/video conferencing space. We’ve driven a shift from an oppressive Puppet Regime to a Benevolent Puppet Regime - a regime in which everyone wins. With this type of regime, we’re driving on a much smoother road to production. And who doesn’t like the ability to go faster?

Better Road to Production

 
