Running A Benevolent Puppet Regime
In March I presented the talk “Running a Benevolent Puppet Regime” at Denver’s first-ever PuppetCamp, held in the Code Talent building in Denver’s RiNo neighborhood:
It’s #puppetcamp Denver! Great turnout to learn about @puppetlabs and hopefully some @solidfire #fueledbySF pic.twitter.com/8kROKfbiIM— Josh Atwell (@Josh_Atwell) March 12, 2015
The presentation content is available on SlideShare, but since the slides were designed to be an aid to my verbal storytelling and not the primary method of communication, I figured a port to a blog post would also be helpful.
The whole presentation is also available in video form, if you want to sit through my yammering for 45 minutes.
If you don’t know much about me - I’m interested in building elegant systems in life and in my career. I don’t just care about how a system looks from the outside but also how every part of the system works inside - individually as parts, and together to create the whole. I care about how and why things work and interact. I strive to leverage machines in such a way that they work for me and those around me instead of the other way around.
I’m the DevTools Team Lead at ReadyTalk, and my team is all about paving a smooth and fast road to production for all of our infrastructure and applications. Our mission statement says it best: “We empower teams to continuously deliver software with increasing efficiency.”
When it comes to my work with Puppet over the years - I like to refer to it as taking steps to transform what we had: “A Puppet Regime” into something better, such as a “Benevolent Puppet Regime.” What’s the difference, you ask?
Let’s define a puppet regime to start:
Puppet Regime: a system that is directed by an outside authority that leads to hardships on those governed
The benevolent spin is this:
Benevolent Puppet Regime: a system that is directed by an outside authority that leads to improved operations on those governed.
It’s important to note that a Benevolent Puppet Regime has two aspects of operational dynamics: human and computer. If you run a Benevolent Puppet Regime, you’ll see improved operations by people as well as computers - and that’s what we’re looking for, right? People and computers working together well as one cohesive system.
ReadyTalk has been around since 2001. It’s definitely an established company at an age of about 14 years, but still holds on to some of its cultural roots developed in its younger (and smaller) years. I’ve only been a part of the company for about 4 years now, and Puppet has only been a part of ReadyTalk’s (and my own) story for 3 years. In the past 4 years, I’ve seen the company double the size of the engineering team and grow the number of managed servers from several hundred to a couple thousand.
Back in the day, our sysadmin would provision machines with a collection of Perl scripts - something I refer to as Ol’ Perly. These Perl scripts were complicated, hard to maintain, and most certainly would cause some hassle when used. Whenever we’d hire a new engineer, it would be their responsibility to build their machine with the sysadmin’s (and nearly the whole engineering team’s) help. Some people would have their workstation up in a week, but others might not have an operational workstation even after 3 weeks! This amount of delay was caused by technical troubles with Ol’ Perly for sure, but some of the time was honestly spent waiting - waiting for the right people to be available at the right times. Building a workstation for a new engineer was quite a task - but it shouldn’t have been this hard - and definitely not this dangerous!
Thankfully our sysadmin saw some chatter about Puppet on Reddit, and spent some of his free time trying it out and provisioning a system from scratch. When he presented his work at a demo session - everyone was excited about the possibilities of using Puppet at ReadyTalk, so he started to use it in more places, including during the build-out process of new developer workstations.
However, after some time, excitement waned and the accumulating troubles of managing hosts with this new automated system began to overwhelm and frustrate people. I overheard developers say things like:
“I won’t allow Puppet on my machine because it will destroy my dotfiles!”
“Why do we have to use Puppet when what we had before worked really well for us?”
After these comments became more common at ReadyTalk, my team (the DevTools team) and I decided to dive into investigation mode: what exactly was frustrating people about Puppet? Two years ago we released a survey to the whole engineering team to suss out some details regarding this Puppet angst.
Which statement best describes your ability to work on Puppet modules?
a. I try to avoid working on Puppet modules.
b. I can read Puppet code and generally figure out what a Puppet module is supposed to do.
c. I can modify existing Puppet modules with little assistance to fix a bug or add functionality for a related development story.
d. I can add new Puppet modules to support a new feature or add functionality for a related story and I am familiar with how to test Puppet modules.
A majority of the respondents chose either a or b.
Some of the free-form responses cemented this quantitative data as well:
I don’t use Puppet
Are the currently used Puppet modules available in a repository for viewing and/or editing by any team member?
I wasn’t aware that I had access …
I feel like our Puppet modules are guarded. I am happy and able to contribute but don’t feel like there is a culture that supports this.
Clearly these responses showed that we had some work to do. But it got worse with the second question…
What do you do when you run into issues with Puppet?
a. I say nothing and hope someone else will run into the problem and fix it later.
b. I file a Jira issue with steps to reproduce the problem.
c. I investigate and resolve the issue myself using online documentation or other resources.
d. I ask a fellow team member for assistance.
e. I ask a subject matter expert for assistance.
The answer we didn’t want to see was obviously a, while c would show pretty good autonomy. Unfortunately, the results were not what we hoped for.
Then the free-form responses brought it home…
I’ve never had to do anything with Puppet.
I do know that Puppet has never worked on my dev environment and when I asked for help people ran away from me.
… also Puppet attacked and killed my family.
I like the idea of Puppet, but I have no idea how to get started or how it works. I can blow through the examples and tutorials online, but I have no idea how to contribute or improve our internal Puppet infrastructure.
The results from this question started to show something really clear: Puppet was a huge maintenance burden not only for engineers but also for the infrastructure and sysadmin team that had to respond to all these support requests.
That was 2 years ago. During the last two years, I made it a point to make the following phrase a commonly repeated mantra:
It’s not Puppet’s fault
it’s how we are using Puppet that’s a problem
My team saw a few fundamental drawbacks to our Puppet infrastructure at that time:
- People could help but “can’t,” which led to a culture of low involvement and engagement with our Puppet code.
- People who knew how to maintain and advance Puppet were too tied up fixing problems, which led to a strained group of infrastructure personnel and sysadmins.
- People perceived Puppet as something that got in the way and was inconvenient, which made it common practice to subvert established processes and teams in order to “just get it done.”
This created a vicious cycle and made our infrastructure support team work really hard… but never really get anywhere.
As engineers, we often look straight to a tool and jump to use whatever promises a quick fix for our problems.
W. Edwards Deming has some thoughts that are quite poignant to thinking about how to solve problems that plague you:
Hard work will not ensure quality. Best efforts will not ensure quality. Gadgets, computers, or investment in machinery will not ensure quality.
It is not enough to do your best; you must know what to do and then do your best.
So how do we develop this “know how?” How do we transform a normal Puppet Regime, one that leads to hardships on those governed (which might include you!) into a Benevolent Puppet Regime - one that improves the operation of both people and computers?
How might we take our Puppet game to the next level?
At ReadyTalk we approached our troubles with Puppet from three angles. I’ll outline them by treating each one as a tool for success in your Benevolent Puppet Regime toolbox.
Commit to Collaboration
This tool will help you improve involvement to advance your Puppet infrastructure and modules.
- Put your Puppet code in a repository or set of repositories. Now. Just do it.
  - We didn’t, and this was a significant reason why people later thought that the Puppet modules were guarded and inaccessible (because they were, but we never advertised their transition to full availability for everyone).
- Keep an open door to your Puppet infrastructure.
  - We didn’t at the start, because our pre-Puppet habits around hardware ownership leaked into our behavior with Puppet… even though our tools enabled us to act differently (and honestly with more maturity).
  - An open door allows various trades to contribute to the quality, design, architecture, and structure of your Puppet code.
- Keep an open door between your Puppet infrastructure and the external Puppet community.
  - Use and contribute back to Puppet modules in the Puppet Forge.
If you start using Puppet Forge modules, you might be tempted to start tweaking the “standard” modules to conform to your own practices. Don’t! Instead, look into ways you can start managing parts of your infrastructure in more standard ways (according to the software infrastructure community at large). The reality is that most of your problems are common problems in the industry. Chances are someone has spent more time thinking about the minutiae of certain infrastructure components - why not benefit from someone else’s hard work?
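One low-friction way to consume Forge modules without forking them is to pin versions in a Puppetfile, which tools like r10k or librarian-puppet can resolve for you. The module names and versions below are illustrative, not our actual dependency list:

```ruby
# Puppetfile -- declare Forge dependencies instead of copying them into your repo.
# Module names and versions here are examples only.
forge 'https://forgeapi.puppetlabs.com'

mod 'puppetlabs-stdlib', '4.6.0'
mod 'puppetlabs-postgresql', '4.3.0'
mod 'puppetlabs-ntp', '3.3.0'
```

Pinning versions this way keeps upstream improvements one version bump away, instead of stranding you on a locally patched copy.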
Treat your Puppet modules, manifests, code, etc. as a normal software project. Get all sorts of trades involved (QA, dev, SDET, CI, CD, DevTools, etc.) for the best chance of highest quality. Make your Puppet code define the common language between these trades. Refer to changes to systems as changes to Puppet code, not changes to individual boxes.
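To make that concrete: the “change to a box” of installing a missing library becomes a small, reviewable change to Puppet code. This is a hypothetical sketch - the class and package names are made up:

```puppet
# Instead of SSHing to the build server and running the package manager
# by hand, express (and review) the change as Puppet code:
class buildserver::libraries {
  package { 'libextra-cool1':   # hypothetical package name
    ensure => installed,
  }
}
```

Now the change has an author, a diff, and a history - and every build server that applies this class gets the library, not just the one someone happened to log into.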
Trust but Verify

This tool will help you reduce the strain on your overworked (and probably inaccessible) team.
Consider how a city’s streets are laid out, intersections striped, and stoplights positioned and timed. A really well planned street system has intersections with lights that let a car smoothly pass through the main (busiest) thoroughfare during rush hour. Traffic light control and coordination built into the design of a city and its streets will always be more efficient than a traffic cop. However, too often we manage our infrastructure (the flow of cars) like a traffic cop would. Being in control of the operation of a whole intersection can be fun for a time… it’s definitely an adrenaline rush and fills you with purpose. But it’s not a sustainable way of living. For anyone. It gets old fast - just like manually testing your company’s Puppet modules.
I’ll admit - software is as much an art as a science. However, there’s no excuse for a test process that’s a work of art - one that’s manually executed and blessed by “those in charge.”
Don’t be a traffic cop for your Puppet modules. Let the system show its status: red, yellow, or green… through automated testing and a continuous integration server.
At ReadyTalk we have an automated testing pipeline for all of our Puppet modules that looks something like the following:
- Someone commits code to the Puppet module repository.
- Jenkins is notified that there are changes to the Puppet module and kicks off a verification job.
- This verification job runs some basic syntax checks, runs puppet-lint, and reports on the trend of these syntax problems.
- Then the meat of the automated test process starts.
For this part of the process we have been using Test-Kitchen, which uses Vagrant to spin up virtual machines, applies a Puppet module or two using the kitchen-puppet provisioner, and then runs some ServerSpec tests to verify the virtual machines are in the expected states.
There’s a really awesome walkthrough of Test-Kitchen on the Kitchen.ci website that will help you understand the workings of all these pieces better. Disclaimer: test-kitchen was originally created for Chef Cookbooks, but works fairly well for Puppet too! You may want to look at the Beaker project from PuppetLabs as well.
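As a rough sketch, a `.kitchen.yml` for this kind of pipeline might look like the following - the platform name and paths are assumptions, not our exact configuration:

```yaml
# .kitchen.yml -- minimal sketch: Vagrant driver plus the kitchen-puppet
# provisioner. ServerSpec tests live under test/integration/<suite>/serverspec/
# and are picked up by Test-Kitchen's verify step.
driver:
  name: vagrant

provisioner:
  name: puppet_apply           # provided by the kitchen-puppet gem
  manifests_path: manifests
  modules_path: modules

platforms:
  - name: debian-7.8           # illustrative box name

suites:
  - name: default
```

With that in place, `kitchen test` converges a throwaway VM and runs the specs against it - exactly the red/yellow/green signal you want Jenkins to report.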
Having this Jenkins job with Test-Kitchen delivers a bunch of benefits:
- higher visibility into the state of our Puppet modules. Developers and ops personnel are on Jenkins quite a bit throughout the day and see when projects aren’t in a stable state. Additionally, everyone in charge of a project of some sort (including Puppet modules) gets a HipChat notification if something they care about is unstable or broken.
- software breakage is reported by an emotionless system. There’s no more arguing about “who broke what” and who should be in charge of fixing something.
- anyone can contribute to Puppet modules and can even apply the modules to a disposable virtual machine - essentially a sandbox to play in - with no consequences.
The most beneficial part of pulling this tool out of our toolbox was seeing ops folks spend more time building and automating tests instead of fixing things that “were working just yesterday.”
One of my coworkers used to hate fixing Puppet problems, but once he was up and running with a local Puppet development environment (we Puppetized that too, of course), he created one of our most robust and well-tested Puppet modules. I actually used this module as an example in my PuppetCamp presentation (writing tests for our Postgres setup)!
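To give a flavor of what such tests look like, here’s a hypothetical ServerSpec spec in the spirit of that Postgres example - the package name, version, and port are assumptions, not our actual suite:

```ruby
# test/integration/default/serverspec/postgres_spec.rb
# Hypothetical checks -- illustrative only.
require 'serverspec'
set :backend, :exec

describe package('postgresql-9.1') do   # illustrative package/version
  it { should be_installed }
end

describe service('postgresql') do
  it { should be_enabled }
  it { should be_running }
end

describe port(5432) do
  it { should be_listening }
end
```

Specs like these run against the freshly provisioned VM, so “the module works” stops being an opinion and becomes a green build.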
This tool will help you:
- reduce the strain on your overworked (and probably inaccessible) team
- encourage people to follow the established “best practice” of managing your infrastructure and changes to it
Prove your infrastructure systems are well designed and architected by rebuilding them continuously
The way I see configuration-management-done-well is through a collection of practices (keep in mind I’m talking about mutable infrastructure here):
- automated provisioning (Puppet + Foreman’s job, or Puppet Enterprise if you like to make it rain)
- desired state enforcement (Puppet’s job)
- undesired state cleanup
What’s undesired state cleanup, you ask? Build servers provide an excellent example to explain what I’m referring to - they’re manual tampering magnets. Imagine you’re a developer (or … anyone) who has a project that works perfectly on your machine but the build server can’t build your project because it’s missing an extra-cool-provides-all-the-functionality-you-want-library.so. You’ve got to get that on the build server - and fast - people are waiting for your work to be complete and testable, right? This was exactly the situation at ReadyTalk. Since the Puppet modules were historically guarded and hard to improve, the easiest route to resolution was usually to log in to the server that had a missing system library, install it, and then rerun the build. Success! Right…?
In Denver there’s a park with this sign:
We want people to take the right path - the path that’s safer for everyone. The path that’s paved for maximum success. We don’t want shortcuts in the process of creating a reproducible infrastructure. Shortcuts may feel good and shave seconds off your task today, but they can hurt everyone in the months ahead.
So what should we practice consistently?
At ReadyTalk we rebuild 2 of our build servers every night. This has discouraged all sorts of manual system tampering (of which even I’m guilty).
A Jenkins job is the point of entry for the process, which uses Ansible to orchestrate the dance of rebuilds:
- Ansible uses the Jenkins REST API to take 2 servers out of commission
- Ansible uses the Foreman REST API to tell those 2 servers to re-provision from scratch when they’re rebooted next
- Ansible SSHes to those two servers and reboots them.
- Puppet and Foreman work together after the boxes PXE boot and go through the Debian install process
- Ansible waits until the machines open up SSH access and allow logins through LDAP
- Ansible uses the Jenkins REST API to put those 2 servers back in the build cluster
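The steps above could be sketched as an Ansible playbook like the following. This is a hypothetical sketch: the hostnames, URLs, and endpoints are made up, and the real job would also handle authentication and error cases.

```yaml
# rebuild-slaves.yml -- illustrative orchestration sketch, not our actual playbook
- hosts: build_slaves
  serial: 2                    # two servers at a time, as described above
  tasks:
    - name: Take the node offline in Jenkins
      delegate_to: localhost
      uri:
        url: "https://jenkins.example.com/computer/{{ inventory_hostname }}/toggleOffline"
        method: POST

    - name: Tell Foreman to re-provision the host on its next boot
      delegate_to: localhost
      uri:
        url: "https://foreman.example.com/api/hosts/{{ inventory_hostname }}"
        method: PUT
        body: '{"host": {"build": true}}'

    - name: Reboot into PXE boot and the Debian installer
      command: /sbin/reboot

    - name: Wait for SSH to come back after provisioning
      delegate_to: localhost
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22
        timeout: 3600

    - name: Put the node back online in Jenkins
      delegate_to: localhost
      uri:
        url: "https://jenkins.example.com/computer/{{ inventory_hostname }}/toggleOffline"
        method: POST
```

The `serial: 2` setting is what keeps the rest of the build cluster serving jobs while two machines rebuild.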
This process is really exciting to watch. It takes a while (about an hour) but it makes you feel like quite the omnipotent Puppet Master.
This last tool is the practice that has been the most beneficial to us as of late. It has uncovered layers and layers of problems in our whole infrastructure, such as:
- preseed file problems
- approx problems
- initial Puppet run problems
Beyond that, we’ve seen much more consistency and reliability in using Puppet at ReadyTalk. In the past there was a good chance that our Puppet modules wouldn’t work so well when we needed them most. However, a few weeks ago a developer came running up to me and said:
I rebuilt my development environment
without any problems!
What a win. That used to take weeks!
But that’s not all. Just the other day I saw that the Jenkins slave rebuild process had failed due to Puppet’s inability to drop files in the Jenkins user’s home directory. I submitted an issue to our system administrators and later heard back:
At first I had no idea what was going on. It was such a strange problem to be having. No other systems were exhibiting that same problem. However, when I realized what was happening, I had a small heart attack - I found that the whole LDAP cluster was down in the office, and the only reason people were still logging into boxes was that sssd caches users. By finding this issue today you may have saved a HUGE headache from happening tomorrow. Thanks!
At this point, it’s important to reflect back on how our Puppet-backed infrastructure has transformed from a painful regime into a benevolent one. We now have:
- a growing involvement in the development and testing of modules
- a more performant team that can spend time being proactive instead of reactive
- a rewarding process of catching possibly disastrous problems before they expand to catastrophes
That’s a way better, more energizing feedback cycle, isn’t it?
By:

- committing to collaboration,
- automating authority over our Puppet code, and
- rebuilding our infrastructure ruthlessly
we have seen a drastic improvement in our ability to respond to change, both internally and in the competitive web/audio/video conferencing market. We’ve driven a shift from an oppressive Puppet Regime to a Benevolent Puppet Regime - a regime in which everyone wins. By having this type of regime, we’re driving on a much smoother road to production. And who doesn’t like the ability to go faster?