Rebooting Linux systems with Ansible has always been possible, but was tricky and error-prone. In Ansible 2.7, I am happy to say that rebooting Linux hosts with Ansible is finally easy and can be done with a single task using the newly minted reboot plugin.
The win_reboot module was written by Matt Davis and included with Ansible 2.1. Rebooting Windows hosts is a much more common occurrence than rebooting Linux hosts. Necessity is the mother of invention, so it made sense that win_reboot appeared before the equivalent for Linux. And while less than elegant, it was already possible to reboot Linux hosts without a dedicated plugin.
Rebooting Linux systems with Ansible never felt right to me: it was much too error-prone and finicky. It finally bugged me enough that I refactored reboot so Linux hosts could join the reboot party with their Windows counterparts.
When I set out to make the reboot plugin², the goal was to create a common class that win_reboot (and potentially others) could easily subclass to override specific parts of the reboot process. I was also working in reverse, deconstructing win_reboot into a new base class that it would then subclass.
After reading through the existing code in win_reboot, I came up with a general outline for how to break things up:
- construct the command to run on the remote host
- do the reboot
- validate that the reboot was good
I wanted the class methods to be reusable between Linux and Windows simply by redefining appropriate variables. They also needed to be modular enough that individual aspects of the reboot process could be overridden when needed, rather than all of it.
For example, I could see up front that there were more reboot edge cases to code around in Windows than Linux, but constructing the command and its arguments, as well as the basic reboot and validation mechanics, was the same. Therefore, I broke up command construction, reboot, validation, and the retry or timeout logic into separate pieces.
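To make the idea concrete, here is a minimal sketch of that kind of split. It is not the plugin's actual code: the class names RebootBase and WindowsReboot, the validate_reboot() method, and the argument templates are assumptions made for illustration. Only the names construct_command(), perform_reboot(), and SHUTDOWN_COMMAND_ARGS come from the text.

```python
class RebootBase:
    """Hypothetical base class: every phase of the reboot is its own method."""

    # Defaults live in class attributes so a subclass can simply redefine them.
    SHUTDOWN_COMMAND = "shutdown"
    SHUTDOWN_COMMAND_ARGS = '-r +{delay_min} "{message}"'

    def construct_command(self, delay_sec=0, message="Reboot requested"):
        # Pass every time unit; each template uses only the ones it needs.
        args = self.SHUTDOWN_COMMAND_ARGS.format(
            delay_sec=delay_sec, delay_min=delay_sec // 60, message=message
        )
        return "{0} {1}".format(self.SHUTDOWN_COMMAND, args)

    def perform_reboot(self, run, delay_sec=0):
        # `run` is any callable that executes a command string on the target
        # and returns its exit code.
        command = self.construct_command(delay_sec=delay_sec)
        return {"command": command, "rc": run(command)}

    def validate_reboot(self, boot_time_before, boot_time_after):
        # The reboot counts as good once the last boot time has changed.
        return boot_time_before != boot_time_after


class WindowsReboot(RebootBase):
    """Hypothetical subclass: redefines one variable and overrides one phase."""

    SHUTDOWN_COMMAND_ARGS = '/r /t {delay_sec} /c "{message}"'

    def perform_reboot(self, run, delay_sec=0):
        result = super().perform_reboot(run, delay_sec=delay_sec)
        # Placeholder for Windows-only edge cases; the real handling is not shown.
        result["platform"] = "windows"
        return result
```

The subclass touches only what differs, so command construction and validation are inherited unchanged.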
The overall strategy of what the plugin does has not changed:
- construct a reboot command
- capture the current last boot time of the target system
- continuously check for the connection to be reestablished, or time out
- continuously validate the system is actually up by running a command, or time out
But the implementation is now more modular and less procedural.
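The strategy above can be sketched roughly as follows. This is an illustration under stated assumptions, not the plugin's implementation: the function name reboot_and_wait and all of its parameters are hypothetical, the connection check and the boot time comparison are folded into one loop for brevity, and a fixed one second pause stands in for whatever retry logic is used.

```python
import time


def reboot_and_wait(reboot, get_boot_time, run_test_command, timeout=600,
                    sleep=time.sleep, clock=time.monotonic):
    """Reboot, then wait until the host is back and passes a test command."""
    before = get_boot_time()          # capture the current last boot time
    reboot()                          # run the constructed reboot command
    deadline = clock() + timeout

    # Phase 1: wait for the connection to return with a new boot time.
    while True:
        try:
            if get_boot_time() != before:
                break
        except ConnectionError:
            pass                      # host is still down, keep waiting
        if clock() >= deadline:
            raise TimeoutError("host did not reboot in time")
        sleep(1)

    # Phase 2: validate the system is really up by running a command.
    while run_test_command() != 0:
        if clock() >= deadline:
            raise TimeoutError("test command never succeeded")
        sleep(1)
    return True
```

Passing the sleep and clock functions in as arguments is only a convenience so the loop can be exercised without waiting on a real machine.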
Accounting for Different Operating Systems
I wanted this plugin to work on as many Linux (and Linux-like) operating systems as possible (and keep working the same on Windows!). To do this, I had to come up with a method to identify the operating system of the target host and account for subtle differences between distributions and versions.
While there are standard command line tools for rebooting and getting system boot time, the flags and exact syntax these commands accept vary enough that I could not use the same flags for everything. Also, the PATH in some operating systems does not contain the shutdown command, so I had to solve for that as well.
The first thing I needed to do was probe the target system to figure out what operating system was running. I did this by running self._execute_module(module_name='setup'). While this does work, it’s a bit “heavy” since it copies over and runs all of the Python code associated with gathering facts on the target system. All I really needed was the distribution name.
I decided to give uname a try. The output of uname -a varies widely across operating systems, but plain old uname gave me what I needed: it reports a short kernel name such as Linux, FreeBSD, SunOS, or Darwin³. I used the _low_level_execute_command() method to run the command on the target host and capture the output, lowercasing it to avoid any ambiguity.
If the output from uname doesn’t match any of these, it defaults to using the Linux values. The win_reboot module overrides the construct_command() method, so it does not probe the target system.
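A minimal sketch of that probe, assuming a run_command callable that returns a dict with a stdout key (the shape _low_level_execute_command() returns). The function name detect_distribution and the KNOWN_DISTRIBUTIONS tuple are hypothetical and only illustrate the lowercase lookup with a Linux fallback.

```python
# Hypothetical lookup keys: lowercased `uname` output for the systems
# mentioned in the text.
KNOWN_DISTRIBUTIONS = ("linux", "freebsd", "sunos", "darwin")
DEFAULT_DISTRIBUTION = "linux"


def detect_distribution(run_command):
    """Run plain `uname` on the target and map it to a known lookup key."""
    result = run_command("uname")
    # Strip the trailing newline and lowercase to avoid any ambiguity.
    name = result.get("stdout", "").strip().lower()
    # Anything unrecognized falls back to the Linux values.
    return name if name in KNOWN_DISTRIBUTIONS else DEFAULT_DISTRIBUTION
```

The returned string can then serve as the key for per-system command and argument tables.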
Once I had a value to use as a lookup key, I needed to determine the command and parameters needed to actually reboot the system. Of course almost all of them are different.
Let’s start with the commands. Linux and FreeBSD were able to use shutdown without a full path, but Solaris and macOS (darwin) do not have the shutdown command in the PATH in the sudo environment. The parameters passed to the command varied even more.
The values in
SHUTDOWN_COMMAND_ARGS are strings that will be run through
str.format(). One nice feature of
str.format() that I’m taking advantage of is the fact that you can pass in arguments that you don’t use in the format string. This let me have a different time value — minutes or seconds, and 0 or 1 minutes for Linux or macOS, respectively — and still build the command arguments in a single line:
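As a hedged illustration of the str.format() behavior described here (not the plugin's real tables), the sketch below uses two assumed templates and an assumed full path for macOS. The function name build_reboot_command and the default message are hypothetical.

```python
# Assumed, illustrative tables; the real plugin's values may differ.
SHUTDOWN_COMMANDS = {
    "linux": "shutdown",
    "darwin": "/sbin/shutdown",   # full path, since PATH may lack it under sudo
}
SHUTDOWN_COMMAND_ARGS = {
    "linux": '-r +{delay_min} "{message}"',
    "darwin": '-r +{delay_min_macos} "{message}"',
}


def build_reboot_command(distribution, delay_sec=0,
                         message="Reboot initiated by Ansible"):
    delay_min = delay_sec // 60
    # Every possible value is passed in; str.format() silently ignores the
    # keyword arguments a given template does not reference.
    args = SHUTDOWN_COMMAND_ARGS[distribution].format(
        delay_sec=delay_sec,
        delay_min=delay_min,
        delay_min_macos=delay_min | 1,  # the bitwise Or the text describes
        message=message,
    )
    return "{0} {1}".format(SHUTDOWN_COMMANDS[distribution], args)
```

Because unused keyword arguments are ignored, adding another system only means adding another template, not another code path.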
I want to talk briefly about the calculations for delay_min_macos. The value for pre_reboot_delay is an integer in seconds that defaults to 0 but can be overridden. Since this number may not divide cleanly by 60 and it needs to be a valid integer when passed to the shutdown command, I use the // operator, which performs integer division (or floor division) and rounds the result down to a whole number⁴. This gives me a nice clean integer I can pass to the shutdown command, and it will return 0 for any value less than 60 (I did some defensive programming earlier to fall back to 0 if for some reason a negative number is passed in).
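A small sketch of that conversion, with a hypothetical helper name delay_minutes. It also shows why clamping negative input matters: floor division rounds toward negative infinity, so a negative delay would otherwise produce a negative number of minutes.

```python
def delay_minutes(pre_reboot_delay):
    """Convert a delay in seconds to whole minutes, never below zero."""
    seconds = max(0, int(pre_reboot_delay))  # defensive: negative becomes 0
    return seconds // 60                     # floor division yields an int


# Without the clamp, floor division rounds down past zero.
unclamped = -30 // 60
```

Here unclamped is -1 rather than 0, which would not be a valid argument for the shutdown command.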
This worked great on everything except macOS. Passing shutdown -r +0 to macOS terminates the connection so abruptly that Ansible fails the play. The easy thing to do would be to just default to a 1 minute delay for everything. But that one minute seems like an eternity when you are watching a playbook run. Plus, one minute multiplied by thousands (maybe millions?) of Ansible users rebooting their systems starts to add up to a lot of person-years really fast. So hopefully I’m collectively saving humanity years with this optimization.
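In order to default to 1 for macOS, I used the bitwise Or operator |, affectionately known as the “pipe” character, to set delay_min_macos. Because 0 | 1 evaluates to 1, a zero-minute delay becomes one minute on macOS. Note that this is bit arithmetic rather than the “falsy” fallback that the logical or keyword provides: | always sets the lowest bit, so an even value such as 2 becomes 3, while odd values are unchanged.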
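The difference between the bitwise operator and the logical keyword is easy to check in plain Python. The two helper names below are hypothetical and exist only for comparison; this is a general language fact, not a statement about which form the plugin should use.

```python
def macos_delay_bitwise(delay_min):
    # Bitwise Or sets the lowest bit: 0 -> 1, 1 -> 1, 2 -> 3, 4 -> 5.
    return delay_min | 1


def macos_delay_logical(delay_min):
    # Logical `or` falls back to 1 only when delay_min is falsy (zero).
    return delay_min or 1
```

Both turn a zero-minute delay into one minute, but only the logical form leaves an even, nonzero delay such as 2 untouched.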
Accounting for Windows was done by subclassing and overriding perform_reboot(), as well as defining appropriate defaults for the shutdown command flags and the command to get the last boot time⁵.
win_reboot is using all the same code for capturing last boot time, validating the system came back up, and continuously checking the connection or timing out. Nice!
Once I had it working and tested it on as many different operating systems as I could find virtual machines for, I started asking around for others to test.
In the course of code review, another of my amazing teammates suggested that I use exponential backoff for polling rather than just hitting the system once a second repeatedly until it successfully rebooted. I had never heard of this before, so it was another great opportunity to learn something new.
After doing some reading, I learned that exponential backoff is a technique for gradually increasing the time between each check. I found some examples as inspiration, and one interesting thing I read was that it’s a good idea to introduce a bit of randomness in the algorithm to prevent the same code running on distributed systems potentially all hitting the same central service in lock step. I don’t believe that was entirely necessary in this scenario, but I put it in there just in case.
Armed with a general understanding of the technique and a few good examples, I experimented and tuned the algorithm to get acceptable behavior for the plugin. Here is the algorithm I came up with.
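For readers who want the general shape of the technique, here is a minimal sketch of exponential backoff with a cap and a little random jitter. It is an illustration under assumptions rather than the plugin's listing: the function name, its parameters, and the injectable sleep and clock arguments are hypothetical, while max_fail_sleep and its value of twelve come from the text.

```python
import random
import time


def do_until_success_or_timeout(action, timeout, max_fail_sleep=12,
                                sleep=time.sleep, clock=time.monotonic):
    """Retry `action` with capped exponential backoff until the deadline."""
    deadline = clock() + timeout
    fail_count = 0
    while clock() < deadline:
        try:
            return action()
        except Exception:
            # Wait 1, 2, 4, 8, ... seconds, capped at max_fail_sleep, plus
            # up to one second of jitter so many hosts do not poll in lock step.
            jitter = random.random()
            fail_sleep = min(2 ** fail_count, max_fail_sleep) + jitter
            sleep(fail_sleep)
            fail_count += 1
    raise TimeoutError("Timed out after {0} seconds".format(timeout))
```

With the cap in place, the waits settle at roughly twelve seconds instead of doubling indefinitely.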
I set an upper bound with max_fail_sleep to prevent the wait time between each test from getting huge. I didn’t want the system to come back up in the middle of a really long sleep. I arrived at twelve seconds just by experimenting and seeing what felt right and behaved well with my test systems. The end result is one to three queries before the play continues rather than ten or more. Thanks, Sviat, for the suggestion!
Friends Who Break Your Beautiful Code
I’m very fortunate to have some former coworkers who are still good friends, Ansible users, and very savvy with Linux. I asked if they would help test my plugin, helped them get set up to test, and very quickly got a report that “It looks stuck at the reboot task”.
It’s good to have friends that break your code.
After a few hours of late night troubleshooting, we determined that the output of who -b from his system was the epoch: 1970-01-01 00:00⁶. That threw a wrench in my “did the system actually reboot?” logic since it was comparing that value continuously and waiting for it to change. Since that value was the same both before and after reboot, the plugin assumed the system had not yet rebooted and eventually timed out.
After some research, it turns out that systems that lack a real time clock, such as the Orange Pi in my friend’s test, do not properly set the last boot time. I ended up using uptime -s on those particular systems to work around this.
Ideally, I could set the default uptime check command to uptime -s, but the -s flag to uptime is far from universally available. It is, however, available on all recent versions of Armbian and Raspbian, which are the most likely systems to lack a real time clock and report an incorrect last boot time.
I added a check in get_system_boot_time() to account for this scenario, and the plugin now works quite well on several Pi flavors.
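A rough sketch of such a check, assuming a run_command callable that returns a dict with rc and stdout keys. Only the function name get_system_boot_time and the two commands come from the text; the EPOCH_MARKER constant, the parameters, and the error handling are assumptions for illustration.

```python
EPOCH_MARKER = "1970-01-01 00:00"   # what `who -b` shows without a real time clock


def get_system_boot_time(run_command, boot_time_command="who -b"):
    """Return the last boot time string, falling back to `uptime -s`."""
    result = run_command(boot_time_command)
    if EPOCH_MARKER in result.get("stdout", ""):
        # The value would never change across reboots, so ask uptime instead.
        result = run_command("uptime -s")
    if result.get("rc", 1) != 0:
        raise RuntimeError("failed to get the last boot time")
    return result.get("stdout", "").strip()
```

The fallback only runs when the epoch value is seen, so systems without the -s flag are unaffected as long as their boot time is sane.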
Here is an example of what rebooting Linux systems looked like before the reboot plugin:
If you want to adjust the timeout for systems that take longer to boot, or run a different command to verify the system came back up, you can do that easily with a few parameters:
The reboot module will wait for the system to come back up, then run the test_command until it returns an exit code of 0 or the timeout value is reached. Since this is an action plugin, it runs on the control machine, so there is no need to worry about delegating the task to the appropriate host. There is even a failsafe in the plugin to prevent accidentally rebooting the control node.
If you want to reboot both Windows and Linux hosts with the same task, you can do this using the action keyword. Configure group_vars with the appropriate action plugin name, privilege escalation settings, and any additional arguments that you want to be group specific.
I’m very happy with how this turned out and hope it will make rebooting Linux systems with Ansible much easier than it is today. I already have some ideas for future features, such as support for pre-authenticated reboots for FileVault encrypted volumes. I would love to hear from anyone using the reboot plugin and welcome your feedback and pull requests.
This was a pretty tough project for me during which I learned a lot, and I absolutely did not do it alone. I relied heavily on my teammates for input and guidance. Thank you to Matt Martz (sivel), Matt Davis (nitzmahone) (the original author of win_reboot whose code I mostly rearranged and polished), Toshio Kuratomi (abadger), and Sviatoslav Sydorenko (webknjaz) for the detailed conversations, wonderful feedback, and answering all my dumb questions with kindness and insight.
¹ The wait_for_connection module uses the exact same validation code that was in
² It’s an action plugin, not a module. Action plugins are run on the controller, while modules are copied to the managed host and executed there. It wouldn’t make sense for this to be a module since that would mean rebooting the system running the module, leaving nothing behind to verify the machine came back up.
³ This is just what I had available to test on. I’d love to add more operating systems if anyone has systems to test against and can send me the relevant command output.
⁴ In Ansible, we use from __future__ import division to make division consistent between Python 2 and Python 3.
⁵ Believe it or not, Windows and Linux actually use the same name for the shutdown command: shutdown. It’s very original.