Announcing a new, Docker-based Hanlon-Microkernel

For several months now, we’ve been working hard on a new version of the Hanlon-Microkernel project with the goal of switching from our original Tiny-Core Linux based Microkernel to something that would be simpler to maintain and extend over time. This change was driven in part by the need to provide a simpler path for users who wanted to construct custom Hanlon Microkernels and in part by our own experience over time supporting the ‘standard’ Hanlon Microkernel. This post describes the changes that we have made to this end, as well as the corresponding changes that we had to make to the Hanlon project to support this new Microkernel type.

Why change?

When we were writing Razor (the tool that would become Hanlon) a few years ago, we searched long and hard for an in-memory Linux kernel that we could use for node discovery. Using a Linux kernel for node discovery gave us the ability to take advantage of tools that were already out there in the Linux/UNIX community (lshw, lscpu, facter, etc.) to discover the capabilities of the nodes being managed by Hanlon. Using an in-memory Linux kernel meant that we could iPXE-boot any node into our Microkernel without any side-effects that might damage whatever was already installed on the node — an important consideration if we were to manage nodes that had already been provisioned with an operating system in an existing datacenter. As we have discussed previously, we eventually settled on Tiny-Core Linux as the base OS for our Microkernel.

Tiny-Core Linux (TCL) had several advantages over the other in-memory alternatives that were available at the time, including the fact that it was very small (the ISO for the ‘Core’ version of TCL weighed in at a mere 7 megabytes) and that, out of the box, it provided pre-built versions of most of the packages that we needed to run our Ruby-based discovery agent in the form of Tiny-Core Extensions (TCEs). All that was left was to construct a shell-script based approach that would make it simpler for the typical user, with limited knowledge of Linux or UNIX system administration, to remaster the standard TCL ISO in order to build a ‘Microkernel ISO’ suitable for use with Hanlon. Things went quite well initially, but over time we started to notice issues with the approach we had chosen for building our Microkernel, and those issues became harder and harder to resolve.

Those issues boiled down to a few limitations in the way we were building our Microkernel — remastering a standard TCL ISO in order to construct a Microkernel ISO that included all of the dependencies needed for our discovery agent to run — and to the static nature of that process. In short, they fell into a few key areas of weakness:

  • Hardware support: when users started trying to use our Microkernel with some of the newer servers coming out on the market, they discovered that those nodes, when booted into the pre-packaged Microkernel that we had posted online, were not able to check in and register with the Hanlon server. When we dug deeper, we realized that the kernel modules for the NICs on those servers weren’t included in our pre-built Microkernel. We spent some time developing a mechanism that would give users the ability to add kernel modules to the Microkernel during the remastering process so they could build a custom Microkernel that worked with their hardware, but that meant they had to use our remastering process to create their own custom ISOs (something specific to their hardware). In spite of our efforts to make this process as simple as possible, we found that it wasn’t easy for an inexperienced user to follow (to say the least).
  • Customizing the TCL DHCP client: Things got a bit worse when we started trying to define a scale-out strategy for Razor. The team we were working with wanted to set up a hardware load-balancer in front of a set of Razor servers and then route requests to the various Razor servers using a round-robin algorithm. Unfortunately, the hardware load-balancer that was chosen wasn’t capable of running a PXE-boot server locally, and as a result our Microkernel was not able to discover the location of the Razor server using the next-server parameter it received back from the DHCP server (which pointed to the PXE-boot server, not the hardware load-balancer). We knew we could get around this by customizing the DHCP client in our Microkernel to support parsing additional options from the reply it received from the DHCP server, but because TCL’s DHCP client is part of BusyBox, that meant we would have to build our own customized version of BusyBox and replace the BusyBox binary embedded in the standard TCL ISO with our new, customized build during our remastering process. While we were able to modify the remastering process to support this change fairly quickly, rebuilding BusyBox itself is not an exercise for the faint of heart, since it requires cross-compilation on a separate Linux machine.
  • Updating our Microkernel to support newer versions of TCL: At the same time, we started to find bugs in Razor that were the result of known issues in a few of the TCEs maintained by the TCL community. Because we were using an older version of TCL, the TCEs we were downloading during the remastering process were built from older versions of the packages they contained. We resolved many of these issues by moving to a newer version of TCL, but that wasn’t an easy process, since it required significant changes to the remastering process itself to support changes in the boot process that had occurred between TCL 4.x and TCL 5.x (a process that took several weeks to get right).
  • Building custom TCEs: Not all of the issues we had with TCEs from the standard TCE repositories could be resolved by updating the TCL version our Microkernel was based on, and we also found ourselves wanting to include packages that we couldn’t find pre-built in the standard TCE repositories. As a result, we quickly found ourselves in the business of building our own TCEs, then modifying our remastering process to allow these locally-built TCEs to be bundled into the remastered Microkernel ISOs. As was the case with rebuilding a customized version of BusyBox, this was not an easy process for an inexperienced user to follow, and it led to even more time being spent on things that were not related to development of the Microkernel itself.

So, we knew we needed to change how we built our Microkernel, which left us with the question of what we should use as the basis for our new Microkernel platform. We knew we didn’t want to lose the features that had initially led us to choose TCL (a small, in-memory Linux kernel that provided us with a repository of the tools we needed for node discovery), but what, really, was our best alternative?

Times had changed

Fortunately for us, several technologies had come to the forefront in the two or three years since we conducted our original search. After giving the problem some thought, we realized that one of the easiest solutions, particularly from the point of view of a casual user of the Hanlon Microkernel, might actually be to convert our Microkernel Controller (the Ruby-based daemon running in the Microkernel that communicated with the Hanlon server) from a service running directly in a dynamically provisioned, in-memory Linux kernel to a service running in a Docker container within that same sort of dynamically provisioned, in-memory Linux kernel. By converting our Microkernel to a Docker image and running our Microkernel Controller in a Docker container based on that image, it would be very simple for a user to build their own version of the Hanlon Microkernel, customized for use in their environment. Plus, it would be even simpler for us to define an Automated Build for the Hanlon-Microkernel project in our cscdock organization on DockerHub, so users who wanted to use the standard Hanlon Microkernel could do so via a simple pair of ‘docker pull’ and ‘docker save’ commands.

With that thought in mind, we started looking more deeply at how much work it would be to convert our Microkernel Controller into something that could run in a Docker container. The answer, as it turned out, was “not much”. The Microkernel Controller was already set up to run as a daemon process in a Linux environment, and it didn’t have any significant dependencies on other, external services, so setting up a Docker container that could run it was a very simple task. The most difficult part of the process was setting things up so that facter would discover and report ‘facts’ about the host operating system instance rather than the ‘facts’ associated with the container environment it was running in. The solution turned out to be a combination of: a bit of sed-magic run against the facter gem after it was installed during the docker build process (so that it would look for the facts it reports in a non-standard location); cross-mounting the /proc, /dev, and /sys filesystems from the host as local directories in the Docker container’s filesystem; starting the container in privileged mode; and setting the container’s network to host mode so that the details of the host’s network were visible from within the container.
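
For reference, starting the resulting container along those lines would look something like the following. This is just a minimal sketch based on the description above (the /host-proc, /host-dev, and /host-sys mount points match the paths referenced in the Dockerfile shown later in this post); the exact invocation Hanlon uses, and whether the image starts the controller via a default command or an explicit one, may differ:

# run the Microkernel image in privileged, host-network mode, with the host's
# /proc, /dev, and /sys cross-mounted where the patched facter expects them
docker run -d \
  --privileged \
  --net=host \
  -v /proc:/host-proc:ro \
  -v /dev:/host-dev \
  -v /sys:/host-sys:ro \
  cscdock/hanlon-microkernel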

With those changes in place, we had a working instance of our Microkernel Controller running in a Docker container. All that remained was to determine which Docker image we wanted to base our Docker Microkernel Image on and which operating system we wanted to use as the host operating system that the node would be iPXE-booted into.

It actually took a bit of digging to answer both of these questions, but the first was easier to answer than the second. As was the case with our initial analysis, we had some criteria in mind when making this decision:

  • The Docker image should be smaller than 256MB in size (to speed up delivery of the image to the node); smaller was considered better
  • Only Docker images that were being actively developed were considered
  • The Docker image should be based on a relatively recent Linux kernel so that we could be fairly confident that it would support the newer hardware we knew we would find in many modern data-centers
  • Since we knew we would be using facter as part of the node discovery process, the distribution that the Docker image was based on needed to include a standard package for a relatively recent release of Ruby
  • The distribution should also provide standard packages for the other tools needed for the node discovery process (lshw, lscpu, dmidecode, ipmitool, etc.) and provide access to tools that could be used to discover the network topology around the node using the Link Layer Discovery Protocol (LLDP)
  • The distribution that the Docker image was based on should be distributed under a commercial-friendly open-source license in order to support development of commercial versions of any extensions that might be developed moving forward

After looking at several of the alternatives available to us, we eventually settled on the GliderLabs Alpine Linux Docker image, which is:

  • very small (weighing in at a mere 5.25MB in size)
  • actively being developed (the most recent release was made about three months ago at the time this was being written)
  • based on a recent release of the Linux kernel (v3.18.20)
  • distributed under a relatively commercial-friendly GPLv2 license, a license that allows for development of commercial extensions of our Microkernel so long as those extensions are not bundled directly into the ISO.

Additionally, it provides pre-built packages for all of the tools needed by our Microkernel Controller (including recent versions of ruby, lshw, lscpu, dmidecode and ipmitool) through its apk package management tool.

For those interested in more details regarding this image, the GitHub page for the project used to build this image can be found here, and the README.md file on that page includes links to additional pages and documentation on the project.

Of course, we still needed an operating system

Now that we had a strategy for migrating our Microkernel Controller from a service running in an operating system to a service running in a Docker container, we were left with the question of which operating system we should use as the base for the new Hanlon Microkernel. Of course, we still had to consider the criteria we mentioned above (small, under active development, distributed under a commercial-friendly license, etc.) when choosing the Linux distribution to use as the host operating system for our Microkernel container. Not only that, but we wanted a standard, in-memory distribution that could be used to iPXE-boot a node, with no modifications to the ISO necessary to run our Microkernel container.

With those constraints in mind, we started looking at alternatives. Initially, we felt CoreOS would provide us with the best small platform for our Microkernel (small being a relative concept here; even though a CoreOS ISO weighs in at 190MB, that’s still much smaller than the 450+MB LiveCD images of most major distributions). When we mentioned our search for a suitable, small OS that could run Docker containers to Aaron Huslage (@huslage) from Docker, he recommended we take a look at a relatively recent entry amongst small, in-memory Linux distributions, RancherOS. While it is still in beta, it is significantly smaller than the other distributions we were looking at (weighing in at a mere 22MB), it runs Docker natively (even the system services are run in their own Docker containers in RancherOS), and it’s distributed under a very commercial-friendly Apache v2 (APLv2) license. Given these advantages, we decided to use RancherOS rather than CoreOS as the base operating system for our Microkernel.

Building a new Microkernel

With the new platform selected, it was time to modify our Microkernel Controller so that it could be run in a Docker container. Since all of the tools required by our Microkernel Controller were available out of the box under Alpine Linux, this was really more of an exercise in getting rid of the code in the Microkernel that we didn’t need (mostly code that was specific to initializing the old TCL platform) than one of making any real modifications to the Microkernel Controller itself.

Specifically we:

  • Removed the code that was associated with the process of building the ‘bundle file’ and replaced it with a Dockerfile
  • Removed the code that was used to configure the old, TCL-based Microkernel during the boot process (this code was replaced by a cloud-config that was returned to the new Microkernel by Hanlon during the iPXE-boot process)

Overall, when these changes were made, we were able to reduce the size of the Hanlon-Microkernel codebase by more than 1400 lines of code. Not only that, but there were a few unexpected benefits, including:

  • Removing the need to use custom parameters in the DHCP response to pass parameters into our Microkernel so that it could check in with the Hanlon server. Because RancherOS (like CoreOS) supports the use of a cloud-config (passed to the kernel as a URL during the iPXE-boot process), we could pass all of the parameters that we used to pass to the Microkernel via DHCP directly to the Microkernel from the Hanlon server as part of that same cloud-config.
  • Configuring the Microkernel Controller correctly from the start. Again, we are able to pass the configuration of the Microkernel directly from the Hanlon server using that same cloud-config, so the Microkernel Controller is correctly configured from the start. Previously, we burned a default configuration into every Microkernel instance and then updated that configuration after the Microkernel checked in with Hanlon for the first time. Being able to pass the initial configuration to the Microkernel directly from the Hanlon server makes it much simpler to debug any issues that might arise prior to first check-in, since the log-level of the Microkernel Controller can be set to Logger::DEBUG from the start, not just after the first check-in succeeds.

Not only that, but the shift from an ISO-based Microkernel to a Docker container-based Microkernel has also simplified distribution of new releases of the Hanlon-Microkernel project. Since the Hanlon-Microkernel project is now built as a Docker image, we can set up an Automated Build on DockerHub (under our cscdock organization in the cscdock/hanlon-microkernel repository) that will trigger whenever we merge changes into the master branch of the Hanlon-Microkernel project. In fact, we’ve already set up a build there, and obtaining a local copy of the Hanlon Microkernel image that is suitable for use with the Hanlon server is as simple as running the following pair of commands:

$ docker pull cscdock/hanlon-microkernel
Using default tag: latest
latest: Pulling from cscdock/hanlon-microkernel
3857f5237e43: Pull complete
9606ec958876: Pull complete
42b186ff3b3c: Pull complete
4d46659c683d: Pull complete
Digest: sha256:19dcb9c0f5d4e55202c46eaff7f4b3cc5ac1d2e90e033ae1e81412665ab6a240
Status: Downloaded newer image for cscdock/hanlon-microkernel:latest
$ docker save cscdock/hanlon-microkernel > new_mk_image.tar

The result of that docker save command will be a tarfile that you can use as one of the inputs (along with a RancherOS ISO) when adding a Microkernel to Hanlon (more on this, below).

We are also creating standard Docker images from the Hanlon-Microkernel project (starting with the v3.0.0 release) under that same repository on DockerHub. To retrieve a specific build of the Docker Microkernel Image you’d simply modify the commands shown above to include the tag for that version. The tags we use for these version-specific builds in the DockerHub repository will be the same as those in the GitHub repository, but without the ‘v’ prefix, so the commands to retrieve (and save that image in a form usable with the Hanlon server) the build from the v3.0.0 Hanlon-Microkernel release would look like the following:

$ docker pull cscdock/hanlon-microkernel:3.0.0
3.0.0: Pulling from cscdock/hanlon-microkernel
3857f5237e43: Pull complete
40806b4dc54b: Pull complete
ed09cd42dec4: Pull complete
d346b8255728: Pull complete
Digest: sha256:45206e7407251a18db5ddd88b1d1198106745c43e92cd989bae6d38263b43665
Status: Downloaded newer image for cscdock/hanlon-microkernel:3.0.0
$ docker save cscdock/hanlon-microkernel:3.0.0 > new_mk_image-3.0.0.tar

As was the case in the previous example, the output of the docker save command will be a tarfile suitable for use as one of the arguments (along with a RancherOS ISO) when adding a Microkernel instance to a Hanlon server.

Building your own (Docker-based) Hanlon Microkernel

As we mentioned earlier, one of our goals in shifting from an ISO-based Hanlon Microkernel to a Docker container-based Hanlon Microkernel was to drastically simplify the process for users who were interested in creating their own, custom Microkernel images. In short, after a few weeks of experience with the new process ourselves, we think we’ve met, and hopefully even surpassed, that goal with the new Hanlon-Microkernel release.

Customizing the Microkernel is now as simple as cloning down a copy of the Hanlon-Microkernel project to a local directory (using a git clone command), making your modifications to the codebase, and then running a ‘docker build’ command to build your new, custom version of the standard Hanlon-Microkernel. The changes you make might be changes to the source code for the Microkernel Controller itself (to fix a bug or add additional capabilities to it) or they might involve modifications to the Dockerfile (eg. to add additional kernel modules needed for some specialized hardware only used locally), but no longer will users have to understand all of the details of the process of remastering a Tiny-Core Linux ISO to build their own version of the Hanlon-Microkernel. Now, building a new custom version of the Microkernel is as simple as the following:

$ docker build -t hanlon-mk-image:3.0.0 .
Sending build context to Docker daemon 57.51 MB
Step 0 : FROM gliderlabs/alpine
---> 2cc966a5578a
Step 1 : RUN apk update && apk add bash sed dmidecode ruby ruby-irb open-lldp util-linux open-vm-tools sudo && apk add lshw ipmitool --update-cache --repository http://dl-3.alpinelinux.org/alpine/edge/testing/ --allow-untrusted && echo "install: --no-rdoc --no-ri" > /etc/gemrc && gem install facter json_pure daemons && find /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.4 -type f -exec sed -i 's:/proc/:/host-proc/:g' {} + && find /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.4 -type f -exec sed -i 's:/dev/:/host-dev/:g' {} + && find /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.4 -type f -exec sed -i 's:/host-dev/null:/dev/null:g' {} + && find /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.4 -type f -exec sed -i 's:/sys/:/host-sys/:g' {} +
---> Running in 4bfa520b64f9
fetch http://alpine.gliderlabs.com/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
v3.2.3-105-ge9ebe94 [http://alpine.gliderlabs.com/alpine/v3.2/main]
OK: 5290 distinct packages available
(1/35) Installing ncurses-terminfo-base (5.9-r3)
(2/35) Installing ncurses-libs (5.9-r3)
(3/35) Installing readline (6.3.008-r0)
(4/35) Installing bash (4.3.33-r0)
(5/35) Installing dmidecode (2.12-r0)
(6/35) Installing libconfig (1.4.9-r1)
(7/35) Installing libnl (1.1.4-r0)
(8/35) Installing open-lldp (0.9.45-r2)
(9/35) Installing fuse (2.9.4-r0)
(10/35) Installing libgcc (4.9.2-r5)
(11/35) Installing libffi (3.2.1-r0)
(12/35) Installing libintl (0.19.4-r1)
(13/35) Installing glib (2.44.0-r1)
(14/35) Installing libstdc++ (4.9.2-r5)
(15/35) Installing icu-libs (55.1-r1)
(16/35) Installing libproc (3.3.9-r0)
(17/35) Installing libcom_err (1.42.13-r0)
(18/35) Installing krb5-conf (1.0-r0)
(19/35) Installing keyutils-libs (1.5.9-r1)
(20/35) Installing libverto (0.2.5-r0)
(21/35) Installing krb5-libs (1.13.1-r1)
(22/35) Installing libtirpc (0.3.0-r1)
(23/35) Installing open-vm-tools (9.4.6_p1770165-r4)
Executing open-vm-tools-9.4.6_p1770165-r4.pre-install
(24/35) Installing gdbm (1.11-r0)
(25/35) Installing yaml (0.1.6-r1)
(26/35) Installing ruby-libs (2.2.2-r0)
(27/35) Installing ruby (2.2.2-r0)
(28/35) Installing ruby-irb (2.2.2-r0)
(29/35) Installing sed (4.2.2-r0)
(30/35) Installing sudo (1.8.15-r0)
(31/35) Installing libuuid (2.26.2-r0)
(32/35) Installing libblkid (2.26.2-r0)
(33/35) Installing libmount (2.26.2-r0)
(34/35) Installing ncurses-widec-libs (5.9-r3)
(35/35) Installing util-linux (2.26.2-r0)
Executing busybox-1.23.2-r0.trigger
Executing glib-2.44.0-r1.trigger
OK: 63 MiB in 50 packages
fetch http://dl-3.alpinelinux.org/alpine/edge/testing/x86_64/APKINDEX.tar.gz
fetch http://alpine.gliderlabs.com/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
(1/2) Installing ipmitool (1.8.13-r0)
(2/2) Installing lshw (02.17-r1)
Executing busybox-1.23.2-r0.trigger
OK: 70 MiB in 52 packages
Successfully installed facter-2.4.4
Successfully installed json_pure-1.8.3
Successfully installed daemons-1.2.3
3 gems installed
---> e7a8344fda5a
Removing intermediate container 4bfa520b64f9
Step 2 : ADD hnl_mk*.rb /usr/local/bin/
---> c963bb236983
Removing intermediate container 0a42b371b2e9
Step 3 : ADD hanlon_microkernel/*.rb /usr/local/lib/ruby/hanlon_microkernel/
---> ac4cdf004a25
Removing intermediate container 1b66c3efd788
Successfully built ac4cdf004a25
$ docker save hanlon-mk-image:3.0.0 > hanlon-mk-image.tar

As was the case in the examples shown previously, the result of the ‘docker save’ command will be a tarfile suitable for use as one of the inputs required when adding a new Microkernel instance to a Hanlon server.

One final note on building your own Microkernel… it is critical that any Microkernel image you build be tagged with a version compatible with the semantic versioning used internally by Hanlon. In the example shown above, you can see that we tagged the Docker image we built using a fixed string (3.0.0) for the version.

Of course, instead of using a fixed string you could use the git describe command, combined with a few awk or sed commands, to generate a string that would be quite suitable for use as a tag in a docker build command. Here is an example of just such a command pipeline:

git describe --tags --dirty --always | sed -e 's@-@_@' | sed -e 's/^v//'

This command pipeline returns a string that includes information from the most recent GitHub tag, the number of commits since that tag, the most recent commit ID for the repository, and a ‘-dirty’ suffix if there are currently uncommitted changes in the repository. For example, if this command pipeline returns the following string:

2.0.1_13-g3eade33-dirty

that would indicate that the repository is 13 commits ahead of the commit that is tagged as ‘v2.0.1’, that the abbreviated ID of the latest commit is ‘3eade33’ (the ‘g’ prefix simply marks it as a git hash), and that there are currently uncommitted changes in the repository. Of course, if you use the same command in a repository that has just been tagged as v3.0.0, then the output of that command pipeline would be much simpler:

3.0.0

So, the ‘git describe’ command pipeline shown above provides us with a mechanism for generating a semantic version compatible tag for images that are built using a ‘docker build’ command. Here’s an example:

docker build -t hanlon-mk-image:`git describe --tags --dirty --always | sed -e 's@-@_@' | sed -e 's/^v//'` .

Using our new Microkernel with Hanlon

So now we’ve got a tarfile containing our new Docker Microkernel Image; what’s the next step? How exactly do we turn that into a Microkernel that the Hanlon server can actually use? This is where the changes to Hanlon (v3.0.0) come in, so perhaps a brief description of those changes is in order.

The first thing we had to change in Hanlon was its concept of exactly what a Microkernel image was. Prior to this release, an image in Hanlon always consisted of one and only one input file, the ISO that represented the image in question. A Hanlon image was built from a single ISO, regardless of whether it was an OS image, an ESX image, a Xen-server image, or a Microkernel image. The only difference as far as Hanlon was concerned was that the contents of the ISO (eg. the location of the kernel and ramdisk files) would change from one type of ISO to another, but up until the latest release a Hanlon image was built from a single ISO, period.

With this new release, a Microkernel image is significantly different from the other image types defined in Hanlon. A Microkernel image now consists of two input files, the RancherOS ISO containing the boot image for a node and the Docker image file containing the Microkernel Controller. So, while the command to add a Microkernel in previous versions of Hanlon (v2.x and older) looked like this:

hanlon image add -t mk -p ~/iso-build/v2.0.1/hnl_mk_debug-image.2.0.1.iso

(note the single argument, passed using the -p flag, that provides Hanlon with the path on the local filesystem where Hanlon can find the Microkernel ISO), the new Hanlon-Microkernel requires an additional argument:

hanlon image add -t mk -p /tmp/rancheros-v0.4.1.iso -d /tmp/cscdock-mk-image.tar.bz2

In this example you can see that not only must the user provide the path on the local filesystem where Hanlon can find an instance of a RancherOS ISO (using the -p flag) when adding a new Microkernel instance to Hanlon, but they must also provide the path to a tarfile containing an instance of the Docker Microkernel Image file that we saved previously (using the -d flag). These two files, together, constitute a Hanlon Microkernel in the new version of Hanlon, and both pieces must be provided to successfully add a Microkernel instance to a Hanlon server.

So, what does the future hold?

Hopefully, it’s apparent that our shift from an ISO-based Hanlon Microkernel to a Docker container-based Hanlon Microkernel has successfully resolved the issues we set out to resolve. It is now much simpler for even an inexperienced Hanlon user to rebuild a standard Docker Microkernel Image locally or to build their own custom Docker Microkernel Images. Not only that, but it is now much easier to extend the existing Microkernel or update it (eg. moving the Microkernel to a newer Alpine Linux build in order to support newer hardware). Finally, shifting over to a modern OS that can be configured at boot time using a cloud-config URL, and that can run our Microkernel Controller in a Docker container, has meant that we could significantly simplify the codebase in our Hanlon-Microkernel project.

This same, modern platform may also provide us with opportunities to extend the behavior of the Hanlon Microkernel at runtime, something that we previously could only imagine. For example, there are a number of ideas for the Microkernel that we have discussed over the past two or three years that we really couldn’t imagine implementing, given the static nature of the ISO-based Microkernel we were using. Now that we’re working with a much more dynamic platform for our Microkernel, perhaps it’s time to revisit some of those ideas — eg. creating Microkernel ‘stacks’ so that a Microkernel can behave differently, but only for a single boot or a finite sequence of boots.

Only time will tell, but it’s a brave new world for Hanlon and the Hanlon Microkernel…

Hanlon does Windows!

One of the most often requested features whenever we’ve talked with Hanlon users (going all the way back to when Nick Weaver and I released Hanlon under its original name, Razor) has been Windows support. We’ve struggled with how to add support for Windows provisioning to both Razor and Hanlon for a couple of years now, and we’ve even had a few false starts at providing support for the feature, but somehow the implementations we tried never really made it into a production Razor/Hanlon server.

The issue hasn’t been a technical one; instead, it’s been an issue of how to fit the provisioning of Windows and the management of Windows images into the provisioning workflow used by Hanlon. Windows is, well, how shall we put this, just a bit different from the other operating systems and hypervisors that we’ve supported to date, and we have struggled all along to find a way to integrate the workflow used for the unattended Windows install process with the workflow Hanlon uses to automate the install of the operating systems and/or hypervisors that we already support.

That being said, today we are formally announcing that Hanlon now provides fully-automated provisioning of Windows instances in your datacenter using a workflow that should be familiar to both Hanlon users and Windows administrators. There are still a few features that remain to be implemented (mainly around when notification is sent back to Hanlon that a node is “OS Complete”, and around support for “broker handoffs” to Windows nodes), but we felt that it was better to get these features out in public so that we could get feedback (and pull requests?) from the community as soon as possible.

In this blog posting, I’d like to walk you through the new Windows support we’ve added to Hanlon. Along the way, I’ll talk through some of the features we had to add to Hanlon in order to support Windows provisioning and highlight some of the differences in workflow for those Windows administrators who have not used Hanlon before and those Hanlon users not familiar with how unattended Windows installs work. For those of you who don’t have the time or patience to read the rest of this blog posting, you can find a screencast of yours truly using these new features posted here. As always, feedback and comments are more than welcome, and if anyone would like to help us improve these features, please do let us know.

Bare-metal Provisioning of Windows

As I mentioned in my introductory remarks, the process of bare-metal provisioning Windows via Hanlon is slightly different from the process that Hanlon follows when provisioning a Linux-based OS or a VMware or Xen Hypervisor. This is due to differences in how the Windows ISO is structured when compared to the Linux or Hypervisor ISOs that Hanlon has supported to date, and differences between how an unattended Windows install works and how an automated Linux or Hypervisor install works. As a result, an unattended Windows install requires that a few external components be set up and configured before Hanlon can successfully iPXE-boot a node into a Windows install.

Hanlon’s new static ‘slice’

The first big difference for unattended Windows installs is that the iPXE-boot process for those installs relies on the ability to download a number of components from a web server that is available on the iPXE-boot network. Rather than requiring that users set up (and configure) an external web server, we have decided to add a new static area to Hanlon itself. To make use of this new capability, simply add a hanlon_static_path parameter to your Hanlon server configuration that points to the directory you wish to use for serving up static content through Hanlon. Any content placed under that directory will then be available via a GET operation against the /static RESTful endpoint.
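
As an example, assuming a Hanlon server at 192.168.1.2 listening on port 8026 (the values used in the DHCP examples later in this post), and assuming the /static endpoint is served under the same base URI as the rest of the RESTful API, a file placed at the top of that directory (say, the wimboot binary described below) could be retrieved with something like the following; the path under /static simply mirrors the layout of the hanlon_static_path directory:

curl -O http://192.168.1.2:8026/hanlon/api/v1/static/wimboot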

With that new static area configured in your Hanlon server, the next step is to set up the appropriate structure under that area to support iPXE-booting of a server into WinPE via Hanlon. The tree structure that you are setting up should look something like this:

$ tree
.
├── boot
│   ├── bcd
│   └── boot.sdi
├── sources
│   └── boot.wim
└── wimboot

2 directories, 4 files
$

The files that are placed into this directory tree (the bcd, boot.sdi, boot.wim, and wimboot files) come from a variety of sources. The boot.wim file under the sources directory is the WinPE image you wish to use to boot your hardware. This WinPE image will have to be built separately, and will likely have to be customized to suit your hardware (for those interested, Joe Callen has put together a blog posting of his own that describes this process; you can find his posting here). Unfortunately, licensing restrictions don’t allow for redistribution of ‘pre-built’ WinPE images but, as you can see in Joe’s post, we’ve tried to make the process of building this image as simple as possible (even for non-Windows developers).

The boot/bcd and boot/boot.sdi files can probably be obtained from any Windows ISO, although the easiest location to grab them from is probably the ISO you are going to install Windows from. These files can be obtained by mounting a Windows ISO and copying them over, or they can be copied out of the directory created when you add a Windows ISO to Hanlon (more on this later). When copying over these files, keep in mind that while Windows is not case-sensitive when it comes to filenames, the server you are running your Hanlon server on likely is. As such, make sure that the filenames you create in the static area match the case of the corresponding files in the tree structure shown above (the only real issue is probably the boot/bcd file, which will appear as boot/BCD on a Windows ISO).

Lastly, the wimboot file can be obtained from any recent build of the wimboot project (the latest version is typically available here, but older versions can be found through the project’s GitHub repository, which can be found here). Once these files are in place, your Hanlon server’s static area is ready to be used.
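
Putting it all together, populating the static area might look something like the following rough sketch, where ${STATIC_PATH} stands in for the directory pointed to by hanlon_static_path and the source paths are placeholders for your environment:

# lay out the directory structure shown above
mkdir -p ${STATIC_PATH}/boot ${STATIC_PATH}/sources
# copy the bcd and boot.sdi files from a mounted Windows ISO; note the
# lower-case 'bcd' target name (the file appears as boot/BCD on the ISO)
cp /mnt/winiso/boot/BCD ${STATIC_PATH}/boot/bcd
cp /mnt/winiso/boot/boot.sdi ${STATIC_PATH}/boot/boot.sdi
# copy in the WinPE image you built separately (see Joe's post, linked above)
cp /path/to/your/boot.wim ${STATIC_PATH}/sources/boot.wim
# and a copy of a recent wimboot build (see the links above)
cp /path/to/wimboot ${STATIC_PATH}/wimboot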

Changes to the DHCP server configuration

Since the DHCP client used by WinPE does not support passing of DHCP options in the same way as the DHCP client used in most Linux/Hypervisor distributions, some minor changes to your DHCP server configuration are probably necessary. Specifically, the section of your DHCP server configuration that looks like this:

# specify a few server-defined DHCP options
option hanlon_server code 224 = ip-address;
option hanlon_port code 225 = unsigned integer 16;
option hanlon_base_uri code 226 = text;

will have to be modified to support both Linux and Windows PE clients as follows:

# specify a few server-defined DHCP options
option hanlon_server code 224 = ip-address;
option hanlon_port code 225 = unsigned integer 16;
option hanlon_base_uri code 226 = text;

# options used for Windows provisioning
option space hanlon;
option hanlon.server code 224 = ip-address;
option hanlon.port code 225 = unsigned integer 16;
option hanlon.base_uri code 226 = text;

Note that in the case of Windows PE clients, we rely on a hanlon space to pass through the parameters that the Windows PE client will need to successfully connect back to the Hanlon server and retrieve the active_model parameters that it needs to continue with the appropriate Windows install (based on the model that it was bound to).

Without these additional runtime parameters, we would have to customize our Windows PE image so that it knew how to contact Hanlon in order to retrieve the active_model instance that it has been bound to (which contains information needed by the WinPE instance to perform an unattended Windows install). With these additional parameters, it is actually quite simple to put together a generic Windows PE image that can connect back to the Hanlon server (via a simple PowerShell script) to obtain this information.

To finish off the task of reconfiguring your DHCP server, you’ll also have to make use of the new space that was defined, above. To accomplish this, simply track down the lines that look like this in your current DHCP server configuration file:

  option hanlon_server 192.168.1.2;
  option hanlon_port 8026;
  option hanlon_base_uri "/hanlon/api/v1";

and modify that section of your DHCP server configuration file so that it looks like this instead:

  class "MSFT" {
    match if substring (option vendor-class-identifier, 0, 4) = "MSFT";
    option hanlon.server 192.168.1.2;
    option hanlon.port 8026;
    option hanlon.base_uri "/hanlon/api/v1";
    vendor-option-space hanlon;
  }
  class "OTHER" {
    match if substring (option vendor-class-identifier, 0, 4) != "MSFT";
    option hanlon_server 192.168.1.2;
    option hanlon_port 8026;
    option hanlon_base_uri "/hanlon/api/v1";
  }

With those changes in place, your DHCP server should now be ready to support chain booting of your machines into a Hanlon-based Windows install.

Building a WinPE image that supports your hardware

In order to make the process of provisioning Windows as painless as possible, Joe Callen (@jcpowermac) created a simple PowerShell script that can be used (on a Windows machine) to build a WinPE image that is suitable for use with your hardware. Drivers for specific networking and storage devices are typically needed to successfully iPXE-boot a node and install an OS instance on bare-metal hardware, and in the case of Windows the drivers for these networking and storage devices must be included as part of the WinPE image that is used to drive the installation process.

Do keep in mind, however, that our goal here is to make the resulting WinPE image as generic as possible so that it can be easily reused with all of your hardware. Rather than embedding a lot of model-specific logic into the WinPE image, we’ve worked quite hard to define a few custom PowerShell scripts that can be used to download appropriate versions of the key files/scripts directly from Hanlon (with appropriate values filled in based on the active_model instance that was bound to the current node). I’ll avoid going into the specifics here; Joe has provided a much more complete discussion of this process in his recent blog post, which is available here.

Loading a Windows ISO into Hanlon

As was mentioned earlier, the structure of a Windows ISO is quite different from the structure of the typical Linux/Hypervisor ISO. The biggest difference is that while the typical Linux/Hypervisor ISO contains a single image, a Windows ISO actually contains a number of different Windows images packaged up into a single install.wim file. This difference in structure shows up in the way that an automated install of Windows actually works. As part of the “autounattend” file you present to the Windows installer, you have to specify not only the location of the install.wim file to use for the install but also the WIM Index of the image in that file that you wish to use for the install.

So what does all of this mean for our image slice under Hanlon? To put it quite simply, we’ve had to make a significant change to how Hanlon handles images in order to support loading of Windows ISOs by the image slice. Specifically, the process of adding a single Windows ISO to Hanlon has the effect of creating multiple (linked) Hanlon images, a concept that was not needed for any of the Linux/Hypervisor ISOs that were already supported by Hanlon.

In order to minimize the space that the ‘unpacked’ ISO takes on disk, we maintained the requirement that each ISO would create a single directory on disk (containing the contents of that ISO). What had to change to support Windows was to add the concept of multiple, linked image objects in Hanlon (each of which is linked to a single base image that actually maps to the underlying directory created during the image add process). This base image then becomes a ‘hidden’ image (one that cannot be seen or used directly), and it is this base image that is actually used by any of the linked images in order to access resources that are associated with those images (like the install.wim file).

To facilitate this new concept, we modified the Hanlon image object type to take advantage of a ‘hidden’ field that was already defined for all Hanlon objects. When a Windows ISO is imported into Hanlon, the following sequence of operations is performed:

  • The ISO is unpacked into a locally accessible directory and a Windows ‘base image’ is created. This image is ‘hidden’ and, as such, will not appear in the list of Hanlon images returned by a GET against the /image endpoint by default
  • A standard Linux utility (the wiminfo utility) is then used to parse the install.wim file associated with that base image; see the example command after this list. It should be noted that Hanlon will look for this install.wim file under the sources subdirectory for any given base image (eg. under the ${IMAGE_PATH}/windows/6hQHOAhWyuiLpUmuZlfQAa/sources directory for a base image with the uuid of 6hQHOAhWyuiLpUmuZlfQAa)
  • The information returned from the wiminfo command is then used to create a set of image objects that are linked to the underlying base image that was created when the ISO was unpacked. These linked images are only valid image objects so long as the underlying base image remains intact. Each of these linked images corresponds to one of the images found when the underlying install.wim file from the base image was parsed using the wiminfo command.
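
As a rough illustration of that second step, the command Hanlon runs under the hood looks something like the following (wiminfo is provided by the wimlib package; the path shown is just the example base-image directory mentioned above). Its output lists, for each image packaged into the install.wim file, the WIM index, name and description that Hanlon uses to build the corresponding linked image objects:

wiminfo ${IMAGE_PATH}/windows/6hQHOAhWyuiLpUmuZlfQAa/sources/install.wim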

The result of adding a single Windows ISO to Hanlon as an image (using the hanlon image add command or its RESTful equivalent) will be something like the following:

$ hanlon image add -t win -p /tmp/win2012r2.iso
Attempting to add, please wait...
Images:
         UUID                Type                               Name/Filename                          Status
2GLbzghe1VABDmYVtlLQe0  Windows Install  Windows Server 2012 R2 Standard (Server Core Installation)    Valid
2GLcLzQbh6Uxid9Nei84cy  Windows Install  Windows Server 2012 R2 Standard (Server with a GUI)           Valid
2GLccHsh97YE5BA9H0ghqi  Windows Install  Windows Server 2012 R2 Datacenter (Server Core Installation)  Valid
2GLcp9biBf0uMz80EwGBmK  Windows Install  Windows Server 2012 R2 Datacenter (Server with a GUI)         Valid
$  hanlon image 2GLbzghe1VABDmYVtlLQe0
Image:
 UUID =>  2GLbzghe1VABDmYVtlLQe0
 Type =>  Windows Install
 Name/Filename =>  Windows Server 2012 R2 Standard (Server Core Installation)
 Status =>  Valid
 OS Name =>  Windows Server 2012 R2 Standard (Server Core Installation)
 WIM Index =>  1
 Base Image =>  eeQtUH3Id2xs96i0Cft7w
$

As you can see from this example, the ISO was unpacked into a base image (with a UUID of eeQtUH3Id2xs96i0Cft7w) and four linked images (with the UUIDs shown in the output of the hanlon image add command shown above). While the linked images are independent of each other, all of them depend on the underlying base image for their ‘contents’. As such, removing the filesystem associated with the underlying base image will render all four (in the above example) of these linked images invalid.

Methods used to ‘unpack’ Windows ISOs

It should be noted here that the process for unpacking Windows ISOs may also differ from that used to unpack Linux or Hypervisor ISOs. In the Linux/Hypervisor case, fuseiso is used to ‘mount’ the ISO if it is available. If the fuseiso command cannot be found, then a regular mount command is attempted as a fallback. If both of those commands fail, then an error is thrown indicating that the user cannot add the image to Hanlon. If either of those commands succeeds, then the contents of the ISO are copied over from the mount-point to a directory on the Hanlon filesystem that is under the ${IMAGE_PATH} directory.

Unfortunately, in the case of Windows, the fuseiso command cannot be used to mount a Windows ISO, since the UDF (Universal Disk Format) filesystem used with Windows ISOs is not supported by the fuseiso command (even though it does support the ISO 9660 and Joliet formats). While the mount command could still be used to mount and copy over the contents of a Windows ISO, we didn’t want to add a requirement to Hanlon that the Hanlon server be run as a user with sudo rights to execute the mount/unmount commands.

To get around this limitation of the fuseiso command, we have added support to Hanlon for use of the 7z command to unpack a Windows ISO into a subdirectory of the ${IMAGE_PATH} directory. 7z was chosen as an alternative because, even though it cannot properly read the Joliet filesystem used with some Hypervisor ISOs (specifically ESX 5.x ISOs), it does support the UDF filesystem used with Windows ISOs. The biggest difference between using 7z and the previously supported methods is that the ISO is never mounted on the Hanlon filesystem; instead, the contents of the ISO are extracted directly to the target directory.
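
Conceptually, the extraction Hanlon performs is equivalent to something like the following (the target directory shown is illustrative; Hanlon manages the actual ${IMAGE_PATH}/windows/<uuid> layout itself):

# extract the full contents of the ISO directly into the target directory,
# no mount step required
7z x -o${IMAGE_PATH}/windows/<image-uuid> /tmp/win2012r2.iso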

A note on removing Windows images

As you can see, this single hanlon image add command created a set of 4 linked images. The underlying base image is not visible in this view, nor is it visible in any of the default views provided by Hanlon. To see the underlying base image details, we actually need to request the base image details specifically (using a command like hanlon image eeQtUH3Id2xs96i0Cft7w in the example shown above) or we need to make use of the new --hidden flag that we’ve added to the hanlon image command to facilitate the display of all of the images currently available, including the hidden ones:

$ hanlon image --hidden
Images:
         UUID                Type                               Name/Filename                          Status
2GLbzghe1VABDmYVtlLQe0  Windows Install  Windows Server 2012 R2 Standard (Server Core Installation)    Valid
2GLcLzQbh6Uxid9Nei84cy  Windows Install  Windows Server 2012 R2 Standard (Server with a GUI)           Valid
2GLccHsh97YE5BA9H0ghqi  Windows Install  Windows Server 2012 R2 Datacenter (Server Core Installation)  Valid
eeQtUH3Id2xs96i0Cft7w   Windows Install  Windows (Base Image)                                          Valid
2GLcp9biBf0uMz80EwGBmK  Windows Install  Windows Server 2012 R2 Datacenter (Server with a GUI)         Valid
$

So, given that we now have images in Hanlon that are linked together, how do we handle removal of these images? Previously, Hanlon just removed the underlying directory containing the unpacked version of the ISO that was used to create the image, then removed the corresponding image object from Hanlon. There is also a check to ensure that the image you are removing is not part of a model that is currently defined in Hanlon. If it is part of a model, then removal of the underlying image is prohibited (so that removing an image cannot break any existing models defined in Hanlon that might be using that image).

Hopefully the new rules for removing Windows images are fairly apparent (the rules for removal of Microkernel, Linux or Hypervisor images remain unchanged), but in case they are not, here’s a short summary:

  • if an image is a base image, removal of that image will remove all of the images that are linked to that image, the base image itself, and the directory containing the contents of the ISO that was created when that base image was added to Hanlon
  • if an image is a linked image, then removal of that image will only result in removal of the linked image object itself; after that linked image is removed, if there are no other images remaining that are linked to the base image of that linked image, then the underlying base image (and the directory containing the contents of the ISO that was created when that base image was added to Hanlon) will also be removed
  • requests to remove a linked image will be blocked if the image in question is used in a model currently defined in Hanlon
  • requests to remove a base image will be blocked if any of the images that link to that base image are used in a model currently defined in Hanlon

Creating a Windows model

Once a Windows ISO has been loaded, it is relatively simple to use one of the images created by that process to create a Windows model. The command to add a new Windows model to Hanlon will look something like the following:

$ hanlon model add -t windows_2012_r2 -l windows_2012_r2_dc -i 2GLcc
--- Building Model (windows_2012_r2):

Please enter Windows License Key (example: AAAAA-BBBBB-CCCCC-DDDDD-EEEEE)
(QUIT to cancel)
 > XXXXX-XXXXX-XXXXX-XXXXX-XXXXX
Please enter node hostname prefix (will append node number) (example: node)
default: node
(QUIT to cancel)
 >
Please enter local domain name (will be used in /etc/hosts file) (example: example.com)
default: localdomain
(QUIT to cancel)
 >
Please enter admin password (> 8 characters) (example: P@ssword!)
default: test1234
(QUIT to cancel)
 >
Please enter User Name (not blank) (example: My Full Name)
default: Windows User
(QUIT to cancel)
 >
Please enter Organization (not blank) (example: My Organization Name)
default: Windows Organization
(QUIT to cancel)
 >
Model Created:
 Label =>  windows_2012_r2_dc
 Template =>  windows_deploy
 Description =>  Windows 2012 R2
 UUID =>  3xFEKusakDJKBrYXeQPtYG
 Image UUID =>  2GLccHsh97YE5BA9H0ghqi

$

As you can see, using one of these linked Windows images is exactly the same as using a Linux or Hypervisor image; the only differences from creating a Linux model are the template name and the additional Windows License Key, User Name, and Organization fields that must be entered for a Windows model.

Creating a Windows policy

Creating a Windows policy is even simpler. The arguments for a Windows policy are exactly the same as those for a Linux or Hypervisor deployment policy (except, of course, for the use of the windows_deploy template in the hanlon policy add ... command):

$ hanlon policy add -p windows_deploy -t 'ebig_disk,memsize_2GiB' -l windows_2012_r2_dc -m 3xFEK -e true
Policy Created:
 UUID =>  31GY0H7Dohh7hkjhGhDXXs
 Line Number =>  10
 Label =>  windows_2012_r2_dc
 Enabled =>  true
 Template =>  windows_deploy
 Description =>  Policy for deploying a Windows operating system.
 Tags =>  [ebig_disk, memsize_2GiB]
 Model Label =>  windows_2012_r2_dc
 Broker Target =>  none
 Currently Bound =>  0
 Maximum Bound =>  0
 Bound Counter =>  0

$

As you can see, the result is a Windows deployment policy that can be used to bind the underlying Windows model to a node that matches this policy (based on the tags assigned to that node).

Booting your Windows machine

Once you’ve loaded a Windows ISO into Hanlon as a set of linked Windows images and you’ve created a model and policy based on one of those linked images, the rest is simple and works exactly like it does for the Linux/Hypervisor installs you have already done. Simply configure a node that will be matched to your Windows model by your Windows policy so that it will network boot on the Hanlon server’s network, then power it on. The node will then chain boot (from a PXE-boot via TFTP to an iPXE-boot via Hanlon) and Hanlon will send back an iPXE-boot script that will trigger a WinPE-based unattended Windows install.

The workflow may be a bit different — requiring an extra reboot versus what you’re used to with the Linux/Hypervisor installs you’ve already done with Hanlon to date — but the process internally remains the same. The WinPE image reaches back to Hanlon via a RESTful request and obtains a PowerShell script that it uses to automate the process of setting up an unattended Windows install. In that script, it downloads the appropriate install.wim file from Hanlon along with an autounattend.xml file that will be used to control the unattended Windows install process. The autounattend.xml file that it downloads will be customized based on the specific Hanlon model that the node in question was bound to, and will contain all of the details that are needed to complete the unattended Windows install (the location of the install.wim file, the WIM Index of the image in that install.wim file that should be installed, the Windows license key, the hostname, the Administrator password, etc.). Finally, that script will also download a set of drivers from Hanlon and inject them into the downloaded install.wim. Currently, these drivers are assumed to be packaged in a single drivers.zip file that is available at the root of the static area that we added to Hanlon in our most recent release, but down the line we may end up extending how Hanlon downloads and injects these drivers in order to make this part of the process a bit easier to configure and extend.
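
Since that drivers.zip file is simply served from the static area described earlier, making drivers available to this step is just a matter of dropping the archive at the top of that directory. A rough sketch, where ${STATIC_PATH} again stands in for the hanlon_static_path directory and the layout inside the archive will depend on the drivers your hardware needs:

# package the drivers needed by your hardware and place the archive at the
# root of the Hanlon static area
zip -r drivers.zip drivers/
cp drivers.zip ${STATIC_PATH}/drivers.zip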

With these tasks complete, the script then starts the standard setup process that will manage the unattended Windows install. When the unattended install is complete, you’ll have a fully-functional Windows instance that has been configured to match the parameters from the underlying Hanlon model that it was bound to.

In conclusion…

As always, we hope you all find this new set of features in Hanlon useful in your day-to-day work. We welcome feedback on these new features from anyone in the Hanlon or Windows communities, and look forward to your contributions as well. There are features we haven’t implemented yet (we still haven’t sorted out the process for handing off these Windows nodes to a Hanlon broker, for example), but we thought that even in this early stage of the game the features we’ve added to Hanlon are significant enough that we should release them into the wild, so to speak.

I’d also like to take this opportunity to thank Joe Callen (@jcpowermac) specifically. Without his unending support on the Windows side of this process (and patience with an old Linux/Unix dweeb like me), the seamless interaction between Hanlon and WinPE shown in these changes wouldn’t be nearly so seamless. Joe has worked long and hard on this set of changes, and deserves a great deal of the credit for where we are today.

Announcing the release of Hanlon v2.0

We recently released a major update to Hanlon that is focused on making Hanlon more usable in a production environment, and I’d like to go through some of the changes that we’ve made in that release here, which include:

  • Support for the use of the recently added ‘Hardware ID’ as an identifier for the node in the node and active_model slices
  • Changes that allow for deployment of the tools that make up the Hanlon CLI (the client) separately from the Hanlon server (which provides the RESTful API used by the CLI)
  • Support for new models and new model types
  • A simplified interface for creation of new models, with support for additional (model-specific) parameters in the Hanlon CLI using an answers file (to allow for automation of what was, up until now, an interactive process)
  • Additional power-status and power-control functionality in the node slice for nodes with an attached Baseboard Management Controller (or BMC)

Overall, our focus in putting together this new release has been on adding features to Hanlon that will make it easier than ever to use Hanlon as part of the suite of tools that you already use to manage the infrastructure in your datacenter. Hopefully this release starts to realize that goal. To help you get started with the new version, here’s a brief outline of what was added in each of these categories.

Hardware ID support in the node and active_model slices

The first big change in the latest version of Hanlon is that the ‘Hardware ID’ of a node can now be used to identify the node in the GET and POST operations supported by the node slice’s RESTful API (and the CLI equivalents to these commands). To accomplish this, a new --hw_id command line flag (or the corresponding -i short form of this flag) has been added to the node slice’s CLI. An example of using this flag might look something like this:

$ hanlon node --hw_id 564DC8E3-22AC-0D46-6001-50B003AECE0B -f attrib

which will return the attributes registered with Hanlon by the Hanlon Microkernel for the specified node (the node with an SMBIOS UUID value of 564DC8E3-22AC-0D46-6001-50B003AECE0B). The corresponding RESTful command would look something like this:

GET /hanlon/v1/node?uuid=564DC8E3-22AC-0D46-6001-50B003AECE0B

As you can see, the Hardware ID value is included as a query parameter in the corresponding GET command against the /node endpoint in the RESTful API.
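
If you’re driving Hanlon from scripts rather than the CLI, that same query can be issued with any plain HTTP client. Here’s a minimal sketch using curl; the host and port are assumptions based on the hanlon_server and api_port values used in the configuration examples later in this post:

$ curl "http://192.168.78.2:8036/hanlon/v1/node?uuid=564DC8E3-22AC-0D46-6001-50B003AECE0B"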

It should be noted that this capability is supported by all of the node subcommands previously supported by the Hanlon CLI, including those that display detailed information about, or the field values for, a specific node. The only difference is that you can now use this field (the Hardware ID, which typically maps to the SMBIOS UUID of a node, a unique string assigned to the node by the manufacturer) to identify the node you are interested in. Previously, the only identifier you could use with these commands was the UUID assigned to the node by Hanlon during the node registration process (a value that can change over time). Being able to obtain the same information using the Hardware ID should make it much easier to gather the node-related information you need when managing your environment with (external) automation systems.

It should also be noted that this same capability has also been added to the active_model slice, providing users with the ability to search for the active_model associated with a given node based on that node’s ‘Hardware ID’. An example of this sort of command from the CLI would be something like the following:

$ hanlon active_model --hw_id 564DC8E3-22AC-0D46-6001-50B003AECE0B

and the corresponding RESTful operation would look like this:

GET /hanlon/v1/active_model?uuid=564DC8E3-22AC-0D46-6001-50B003AECE0B

As was the case with the changes made to the node slice, adding the ability to search for an active_model instance based on the Hardware ID associated with a given node should make it much simpler for external systems to use the Hanlon API to determine which active_model instance (if any) is bound to a given node. This should make automated handling of nodes throughout their lifecycle much simpler.

Separation of the Hanlon client and server

Another big change in this version of Hanlon is that the client and server are now completely decoupled from each other. Previously, because the client directly executed some server-side code (rather than relying on a RESTful request) and made use of server-side configuration information, it was not possible to run an instance of the Hanlon CLI on a machine that was truly remote from the one running the Hanlon server. With the changes in this release, such remote execution is now possible. While most of the changes needed to accomplish this were behind the scenes (and, as such, shouldn’t be apparent to the end user), there were some significant changes made to how the Hanlon configuration is managed that users will have to concern themselves with. We will discuss those changes here.

First (and foremost), changes were made to truly separate the Hanlon configuration into two separate files: a client configuration and a server configuration. These files are the cli/config/hanlon_client.conf and web/config/hanlon_server.conf files, respectively (all file paths in this posting are relative to the location where Hanlon was installed on your system). Examples of these two files are shown here; first, the client configuration:

$ cat cli/config/hanlon_client.conf
#
# This file is the main configuration for ProjectHanlon
#
# -- this was system generated --
#
#
--- !ruby/object:ProjectHanlon::Config::Client
noun: config
admin_port: 8025
api_port: 8036
api_version: v1
base_path: /hanlon
hanlon_log_level: Logger::ERROR
hanlon_server: 192.168.78.2
http_timeout: 90
$ 

As you can see, this client configuration has been reduced to the minimal set of parameters necessary for the CLI to do its job (all dependencies on the underlying server configuration parameters have been removed unless they were absolutely necessary). The server configuration file is much more complete. An example of that configuration (which is much closer to the original Razor configuration file in form) is shown here:

$ cat web/config/hanlon_server.conf
#
# This file is the main configuration for ProjectHanlon
#
# -- this was system generated --
#
#
--- !ruby/object:ProjectHanlon::Config::Server
noun: config
admin_port: 8025
api_port: 8036
api_version: v1
base_path: /hanlon
daemon_min_cycle_time: 30
force_mk_uuid: ''
hanlon_log_level: Logger::ERROR
hanlon_server: 192.168.78.2
image_path: /mnt/hgfs/Hanlon/image
ipmi_password: junk2
ipmi_username: test2
ipmi_utility: freeipmi
mk_checkin_interval: 30
mk_checkin_skew: 5
mk_gem_mirror: http://localhost:2158/gem-mirror
mk_gemlist_uri: /gems/gem.list
mk_kmod_install_list_uri: /kmod-install-list
mk_log_level: Logger::ERROR
mk_tce_install_list_uri: /tce-install-list
mk_tce_mirror: http://localhost:2157/tinycorelinux
node_expire_timeout: 180
persist_host: 127.0.0.1
persist_mode: :mongo
persist_password: ''
persist_port: 27017
persist_timeout: 10
persist_username: ''
register_timeout: 120
rz_mk_boot_debug_level: Logger::ERROR
rz_mk_boot_kernel_args: ''
sui_allow_access: 'true'
sui_mount_path: /docs
$ 

Note that it is this configuration file that contains all of the sensitive information that end users shouldn’t be concerned with (where the server is persisting its data and the username and password used by the persistence layer, for example). By separating out these parameters into a separate configuration file we can ensure that the sensitive information contained in this server configuration file can be properly protected, while still providing the ability to connect to the Hanlon server from a remote location (using either the CLI or the RESTful API).
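
How you protect that file is, of course, deployment-specific, but as a simple sketch, on a Unix-like host you might restrict read access on the server configuration to the account that runs the Hanlon server (the path below is the same one shown above):

$ chmod 600 web/config/hanlon_server.conf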

As an aside, we also added a few new configuration parameters to these two files that didn’t appear in previous releases of Hanlon (or Razor). Specifically, we added an http_timeout parameter to the client configuration file (that controls how long the CLI will wait for a response from the RESTful API before timing out, something that is quite useful to have control over when uploading large ISOs through the image slice). This value defaults to 60 seconds (the default for HTTP requests in Ruby). We also added two new server-side configuration parameters:

  • a new ipmi_utility parameter, which controls which IPMI utility should be used to query for and control the power state of a node (more on that later in this posting), and
  • a new persist_dbname parameter to the server configuration, which controls the name of the database that should be used for persistence by the Hanlon server (a useful parameter to be able to set when running spec tests, for example).

Reasonable default values are set for these two new server-side configuration parameters (an empty string and ‘project_hanlon’ respectively), preserving the existing behavior provided by previous versions of Hanlon (and Razor).
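
For reference, here is a quick sketch of how these new parameters might look if you chose to set them explicitly; the format follows the configuration files shown above, and the values are purely illustrative (persist_dbname is shown at its documented default):

# in cli/config/hanlon_client.conf
http_timeout: 300    # example value; wait up to five minutes for large image uploads

# in web/config/hanlon_server.conf
ipmi_utility: ipmitool    # or freeipmi, the other recognized utility
persist_dbname: project_hanlon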

With these changes in place it is now possible to deploy the CLI for Hanlon (the cli/hanlon script, its configuration, and all of its dependencies) on a machine remote from the one on which the Hanlon server is being run (whether the server is running as a Rackup application under a framework like Puma or as a WAR file under a Java servlet container like Tomcat or an application server like JBoss). Provided the server allows remote access to the RESTful endpoint used by the CLI, and provided the CLI is configured properly, all of the functionality in the CLI should be usable in such a remote deployment scenario.
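
As a concrete (if hypothetical) example, a remote CLI deployment really only needs the client configuration shown earlier with hanlon_server pointed at the machine that is actually running the server; the address below is just a placeholder:

# in cli/config/hanlon_client.conf on the remote CLI host
hanlon_server: 10.0.0.10    # placeholder; the address of the machine running the Hanlon server
api_port: 8036              # must match the api_port the Hanlon server is configured to use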

New models and new model types

In this new version of Hanlon, we have added some interesting new ‘no-op’ models to the framework and have also extended some existing models to provide support for new features during the OS (or Hypervisor) deployment process. As such, we felt it would be helpful to users (new and old) to summarize some of these changes.

Two new ‘no-op’ model types (and corresponding policies)

From the beginning, the only way to add a node to Hanlon (or Razor, for that matter) was to let Hanlon discover that node using the Hanlon (or Razor) Microkernel. If Hanlon knew nothing about a given node, then when that node powered up and network booted it would be booted into the Microkernel, and the Microkernel would then check in and register the node with Hanlon. Hanlon could then make a policy-based decision as to what model (if any) should be bound to a given node based on its hardware profile.

Unfortunately, that made it rather difficult to build an inventory of nodes using Hanlon (and the Hanlon Microkernel) in two scenarios that are quite common in large datacenters:

  • if you didn’t already know what sort of operating system or hypervisor you wanted to provision to a node during the node discovery process, or
  • if the node had already been provisioned with an existing operating system (or hypervisor) and you didn’t want to overwrite that OS/Hypervisor instance with something new

In either of those two scenarios, since an active_model instance was never bound to such nodes (because an operating system or hypervisor was not deployed onto them by Hanlon), any information gathered about those nodes by the Microkernel would simply disappear from Hanlon shortly after they were powered off (and the Microkernel that they were booted into stopped checking in with Hanlon). In an attempt to resolve this issue, there have been suggestions over the past couple of years that perhaps we should come up with a way of ‘manually’ adding nodes to Hanlon (or Razor) to cover these sorts of scenarios, but we felt that this didn’t fit well into the philosophy behind Hanlon (we try to keep everything automated and policy-driven so that it can scale as easily as possible to thousands, or tens of thousands, of nodes). How then could we support adding these sorts of nodes to Hanlon?

The answer, as it turns out, was quite simple. We just added two new ‘no-op’ models (and two corresponding policy types) to the list of models supported by Hanlon. Those two new models (the discover_only and boot_local models) are best described as follows:

  • discover_only — when a model of this type is bound to a node, the node will boot into the Microkernel (every time) whenever the node is (re)booted; this allows the node inventory in Hanlon to be updated (simply by powering on nodes bound to this type of model), since the Microkernel these nodes are booted into will check in with Hanlon, register any new facts that it finds during the boot/check-in process, and the node will then power off again.
  • boot_local — when a model of this type is bound to a node, the node in question will boot locally (every time) in response to any (re)boot; no changes are made to the underlying node, but it is added to the inventory of nodes maintained by Hanlon (and that information will be preserved until the boot_local model that was bound to the node is removed).

Keep in mind that before either of these model types is bound to a node, the node would have booted up into the Microkernel, and the Microkernel would have checked in and registered the node with Hanlon. So binding a node to either of these models will ensure that the node in question has been added to the inventory of nodes maintained by Hanlon. The only difference is how those nodes behave on subsequent (re)boots, as outlined above.
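
As a purely illustrative sketch (the labels below are made up, and the exact set of flags each no-op model template requires may differ, so check the model slice’s help output), creating these models from the CLI would look much like creating any other model, with a policy of the corresponding new policy type then used to bind them to nodes:

$ hanlon model add -t discover_only -l discover-new-nodes
$ hanlon model add -t boot_local -l keep-existing-os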

Support for a new set of ‘optional parameters’ that can be defined for models of a given type

Previous releases of Hanlon supported the concept of a ‘required metadata hash’ for use in gathering (and storing) any metadata specific to a given model type during the model creation process. For models like the ‘redhat_6’ model, this metadata is quite simple (consisting of the root password, node name prefix, and domainname to use during deployment of the OS instance to a node), while for other models (like the ‘vmware_esxi_5’ model) this metadata could get quite involved. Not only that, but there were several requests by members of the community over the years to provide a mechanism for specifying additional meta-data parameters for some model instances of a given type but not for others. As an example, in the case of a ‘redhat_6’ model one user might want to specify a partitioning plan (complete with partitions, volume groups, and logical volumes) that they would like to have created, or an additional package group that they would like to have installed during the node provisioning process, while other users might not want to specify any of these ‘optional metadata parameters’ (preferring a simpler deployment). These sorts of optional metadata parameters were not supported in previous versions of Hanlon (or Razor), since users would be asked for a value for any field that was added to the set of required metadata parameters (and that was the only mechanism provided for specifying model-specific parameters that should be used during the OS/Hypervisor provisioning process).

This version of Hanlon changes all of that. It is now possible to define a set of ‘optional meta-data parameters’ for which a user can provide values when constructing a new model instance (how they do so isn’t specified here; more on that later). If values for these ‘optional parameters’ are not provided, then they are simply not assigned a value (unlike the ‘required parameters’, which will always be gathered from the user when a model instance is being created and which are always assigned a value, even if it is a default value). If, on the other hand, values are provided for these parameters when creating a given model instance, then the values assigned to those parameters can be used to add additional features to the OS (or Hypervisor) instances deployed to any nodes bound to that model instance.

Currently, we are only supporting the use of these parameters in the ‘vmware_esxi_5’ model (it’s how we’re allowing for installation of additional VIBs or for the creation of a VSAN using a set of ESXi 5.x nodes), but we have no doubt that this new feature will quickly be added to additional models. The set of optional parameters supported by models of a given type is still constrained by what is defined in the code for those models, but this does provide a nice compromise between flexibility (in terms of what sorts of features can be enabled during the OS provisioning process for a given model type) and constraint (giving us the ability to keep the number of model templates to a minimum and keep the process for creating new model instances as simple as possible).

I’ll leave it to Joe Callen (@jcpowermac), who did the vast majority of this work, to explain how he is using these new features to enable the creation of VSANs and the installation of additional VIBs while deploying ESXi using the ‘vmware_esxi_5’ model under this new version of Hanlon. He has a nice blog posting of his own that explains the details, here. This is a new and exciting area of development in the Hanlon codebase, and one that we feel will lead to additional model development (as other models are added to or extended by the community).

Support for use of an ‘Answers File’ when creating new models

The second part of adding the ‘optional parameters’ described above is how to actually provide values for the optional parameters that a user wants to specify when creating a new model. Rather than try to walk the user through some sort of interactive dialog to collect values for these optional parameters (something that we found to be confusing, at best), we decided to combine this requirement with a previous enhancement request from another Hanlon user and collect these values using an external ‘Answers File’. That file can be used during the model creation/update process to provide values both for these optional parameters and for the required metadata hash parameters that the user must also supply during the model creation process.

To accomplish this, a new command-line flag was added to the Hanlon model slice’s CLI (the --option flag, which can be shortened to -o for convenience) that takes a single argument: the name of the YAML file containing the answers that the user wants to provide. Any optional parameters not included in that answers file are simply left unspecified, but if the user leaves any of the required metadata hash parameters for a given model out of that answers file, an interactive session will be started on the command line to collect those unspecified required metadata parameters (they are required, after all). The end result is a system that gives users a great deal of flexibility when it comes to creating an answers file. They could create a very generic file and ‘fill in’ the instance-specific required metadata parameters interactively or, if they are trying to drive the model creation process through an external tool of some sort, they could provide an answers file that is specific to a given model (one that specifies all of the parameters needed to create that model instance) in order to avoid the need to provide answers interactively. Of course, the old style of providing answers interactively via the CLI is still supported, but only required metadata parameters can be specified this way (not optional ones).

As an example, here’s what the new ‘hanlon model add’ command might look like:

$ hanlon model add -t vmware_esxi_5 -l test-esxi -i 6ZK -o esx-model.yaml

where the esx-model.yaml file looks like this:

$ cat esx-model.yaml
root_password: "test1234"
ip_range_network: "10.53.252"
esx_license: "AAAAA-BBBBB-CCCCC-DDDDD-EEEEE"
ip_range_subnet: "255.255.255.0"
ip_range_start: "50"
ip_range_end: "60"
hostname_prefix: "esxi-node"
nameserver: "10.53.252.123"
ntpserver: "10.53.252.246"
vcenter_name: "foovc"
vcenter_datacenter_path: "dc1"
vcenter_cluster_path: "cluster"
packages:
- {url: "http://foo.org/foo1.vib", force: false }
- {url: "http://foo.org/foo2.vib", force: true }
$ 

Note that the YAML file shown above contains a mix of required metadata hash parameters (like the ‘ip_range_network’ and ‘hostname_prefix’ parameters) and optional metadata hash parameters (like the ‘vcenter_name’ and ‘vcenter_datacenter_path’ parameters). Since values are provided for all of the required metadata hash parameters in this answers file, the user would not be asked for any additional information when using it to create a new ‘vmware_esxi_5’ model instance.

Added BMC support

This new release of Hanlon also provides users with the ability to query the power state of a node or to control the power state of a node using the Hanlon node slice (either via the CLI or through the RESTful API). From the CLI, this new functionality is provided via the new ‘--bmc’ command-line flag (which can be abbreviated using the shorter ‘-b’ form if you wish to do so). To obtain the power state of a node, simply include that flag as part of a ‘hanlon node’ command, for example:

$ hanlon node -i 564DC8E3-22AC-0D46-6001-50B003AECE0B -b -u test -p junk

or

$ hanlon node 52GX2NDEBiTY47IqTbsjMu --bmc -u test -p junk

which correspond to the following RESTful operations:

GET /hanlon/v1/node/power?ipmi_username=test&ipmi_password=junk&hw_id=564DC8E3-22AC-0D46-6001-50B003AECE0B

or

GET /hanlon/v1/node/52GX2NDEBiTY47IqTbsjMu/power?ipmi_username=test&ipmi_password=junk

Notice that you can include an IPMI username and/or password directly through the command-line interface. Alternatively, you can specify the values to use for these parameters in the Hanlon server configuration file (by assigning values to the ipmi_username and ipmi_password fields in that file). If you provide values for these two fields in the server configuration file and also specify values for them when invoking this functionality via the CLI (or in the body/query parameters of the corresponding RESTful API call), then the values on the CLI override those provided in the server configuration file (giving you the ability to use a different username and password with each BMC in your network if you are so inclined, or to use the same username and password with every BMC if you are not).

So far we’ve described how you can get the current power state of a given node using the node slice. There is also a corresponding set of commands that can be used to (re)set the power state of a node. To power a node on, for example, you would run a command that looks like one of these two commands (if you were using the Hanlon CLI):

$ hanlon node -i 564DC8E3-22AC-0D46-6001-50B003AECE0B --bmc on -u test -p junk

or

$ hanlon node update 52GX2NDEBiTY47IqTbsjMu --bmc on -u test -p junk

which would correspond to the following pair of RESTful operations:

POST /hanlon/v1/node/power

or

POST /hanlon/v1/node/52GX2NDEBiTY47IqTbsjMu/power

Since these are POST commands, the new power-state, IPMI username, IPMI password, and Hardware ID (if necessary) are all specified as fields in the JSON string that makes up the body of the request. For the first request (where you want to change the power-state of a node with a given Hardware ID) that body would look like this:

{"power_command":"on", "ipmi_username":"test", "ipmi_password":"junk", "hw_id":"564DC8E3-22AC-0D46-6001-50B003AECE0B"}

while for the second (where the node is identified by UUID, not Hardware ID) the body would look like this:

{"power_command":"on", "ipmi_username":"test", "ipmi_password":"junk"}

It should be noted here that for this functionality in the node slice, an update command from the CLI corresponds to a RESTful POST operation, not a PUT operation. This differs from the update commands for the other slices in the Hanlon CLI (which map to a PUT, not a POST), but we felt this was the right mapping to make. The reason behind this choice is that a PUT operation is assumed to be idempotent in a RESTful interface (which is true for the update commands supported by the other slices in Hanlon), but for the node slice the update command (which updates the power state of a given node) is not an operation that we can assume to be idempotent.

It should also be noted that for this functionality to work, not only does the node in question have to have a Baseboard Management Controller (BMC), but you also must have discovered that node using a relatively new (v2.0.0 or later) version of the Hanlon Microkernel (older versions will not report the facts necessary to map a given node to its BMC), and you’ll have to have one of the two recognized IPMI utilities (ipmitool or freeipmi) installed locally on the Hanlon server node. Without one of these two utilities available on the Hanlon server, an error will be thrown if you try to execute one of these commands (to determine the power status of a node or to change the power state of a node).

Finally, there are a limited number of power states supported when updating the power-state of a node via the node slice. The complete list is as follows: ‘on’, ‘off’, ‘reset’, ‘cycle’ or ‘softShutdown’. Attempting to use an unrecognized state will result in an error being thrown by the node slice’s RESTful API. Attempting to transition a node into an incompatible state (attempting a ‘softShutdown’ on a node that is already powered off, for example) will likely also result in an error being thrown.

In closing

After several weeks of intense development to add the functionality needed to support a NextGen datacenter lab environment we are managing with Hanlon, we feel that the changes we’ve made are ready for prime time. I’d like to thank several individuals who made it all possible, specifically three of my CSC colleagues who have put in long hours on this project over the last month or two:

  • Joe Callen (@jcpowermac)
  • Russell Callen (@mtnbikenc) and
  • Sankar Vema (@sankarvema)

Without their tireless work, we would not have nearly as polished a product as you see here. In addition, I’d like to thank the following community members for their help in patching a few of the holes that we found in the previous release, specifically:

  • Cody Bunch (@bunchc), from Rackspace
  • JJ Asghar (@jjasghar), from Chef and
  • Seth Thomas (@cheeseplus), also from Chef

Their contributions, while smaller, are no less significant. Thanks again to everyone who made this release possible, and we look forward to building out this community of developers further moving forward.

Announcing Hanlon and the Hanlon-Microkernel

Today, we are making an important announcement about two new open-source projects that we are releasing as part of the launch of our new CSC Open Source Program: Hanlon and the Hanlon Microkernel. These projects are the next-generation versions of two projects that some of you might already be familiar with, Razor and Razor-Microkernel.

For those of you who don’t know me, my name is Tom McSweeney and I am now working as a Senior Principal in the Office of the CTO at CSC. I joined CSC last November; since then I’ve been leading the team that has been defining the processes and procedures behind a new Open Source Program at CSC. I am also one of the co-creators of the Razor and Razor-Microkernel projects, which Nick Weaver and I wrote together when we were at EMC – projects that we open-sourced through Puppet Labs almost exactly two years ago today.

So, with that announcement, I’m sure that those of you who have been following the Razor and Razor-Microkernel projects from the beginning have a number of questions for us. I’ll take my best shot at answering a few of them here. If there are others that you have, you know how to reach me…

What’s in a name?

To start, many of you might be asking yourselves: “Why the name change (from Razor to Hanlon) if it is basically the same project?” There are really two explanations for the name change, each of which carried equal weight when we were making this decision. First, we decided to use a different name for “our Razor” in order to avoid confusion with the existing (Puppet Labs) Razor project. Without a name change we would always be left with a discussion of “our Razor” and “their Razor” (or worse, the “original Razor” and the “new Razor”). A simple change of names for our project removes that confusion completely.

Second, we felt that a name change would quickly highlight that “our Razor” was taking a new approach to solving the same problem as the “original Razor” that we released two years ago. We haven’t changed our emphasis on using an automated, policy-based approach for the discovery and provisioning of compute nodes, nor have we changed the basic structure of the interface: for example, we still talk of slices and we still support a RESTful API along with a CLI.

What has changed, however, is the structure and organization of the underlying codebase, along with how the RESTful API and CLI are implemented. There is a long tradition in many cultures of using name changes to highlight significant changes in the life of an individual, or in this case a project, and we felt that a name change needed to be made to signify this shift in how our server did what it did.

The next question that might come to mind is “Why Hanlon?” Of all of the possible names we could have chosen for these projects, why would we pick the last name of an American writer from Scranton, PA? To put it quite simply, we felt that the name we chose for the project should be tied to the original name (Razor) in some way, shape, or form. As those of you who have been with us from the beginning might recall, the original name (Razor) was chosen because the journey that Nick and I set out on when we wrote the original Razor was very much inspired by Occam’s (or Ockham’s) Razor, which for us was best represented by the concept that, when you are seeking an explanation or solution to a problem, “Everything should be made as simple as possible, but no simpler”. Unfortunately, we couldn’t use the name Occam (or Ockham), because that name had already been trademarked and we didn’t want to start out CSC’s first foray into the world of open-source by contributing two new projects whose names had to be changed shortly after they were released. After giving a bit of thought to many possible names for these two projects, we decided that we could easily link this project to the original Razor project by choosing another “Razor” from the many “Razors” that have been written down (in both modern and ancient times), and “Hanlon’s Razor” seemed to be a good fit.

Finally, many of you may be asking yourselves the following question: “If these two projects are really just the next-generation versions of Razor and the Razor-Microkernel why didn’t you just contribute your changes to the existing Puppet Labs projects?” The answer to this question is a bit more involved, and to provide an adequate answer, a bit of history is necessary.

In the beginning…

To say that Nick and I were pleasantly surprised by the reception that Razor received from the open-source community when we released the project two years ago would be an understatement. Nick and I were both familiar with using open-source software, but neither of us had spent much time contributing to open-source projects, much less creating software to release under an open-source license, so we really had no idea what we were getting ourselves into when we decided that Razor was something that should be released to the world as an open-source project. From the start, the response from the community to the open-source announcement was overwhelming. The first pull request for the Razor project was received a mere four hours after the announcement that we were open-sourcing the project, and by the end of the first month we had almost 100 forks of the project and many more watchers. It quickly became obvious that, whatever the gaps or weaknesses in the project were, the community longed for a solution like the one we had put together.

Over the next six months, there were many changes in Razor. The community continued to build and we went through a major rewrite of the RESTful API to make it more RESTful and to remove inconsistencies in both the CLI and RESTful API that existed from slice to slice. The documentation for the project was greatly improved, and pull requests continued to pour in from the community. By the end of the year, we even had a pull request from the Chef community that added a Chef broker to Razor, although I have to say that the concept of providing support for both Puppet and Chef in a Puppet Labs open-source project did strike some users as odd, at least initially. Nick Weaver demonstrated a proof-of-concept implementation of changes he’d made to support Windows provisioning during his keynote presentation at PuppetConf 2012, but left EMC shortly after that to take on a key (leadership) role on the Zombie team at VMware. At VMware, he and his team built an automation platform – Project Zombie – that is still being used today to automate the discovery and provisioning of servers for VMware’s VCloud Hybrid Services product. Deep down under the covers of that automation platform they are still using Razor to automate the discovery of servers added to their datacenters and to provision those servers with an ESX image so that they can be used to support customer workloads. I left EMC in early 2013, first to join Nick on the Zombie team at VMware and then to join Dan Hushon’s OCTO team at CSC. Throughout that time, in spite of the fact that we did not contribute much to the projects we had created (due to issues with CLAs that hadn’t been signed by the companies we were now working for), we were pleased with how Razor continued to grow and evolve, with features that we’d only dreamed of (or hadn’t even imagined) being added by the community.

A turning point was reached

All of that began to change last summer. Last June, the Puppet Labs team sent Nick and me a brief email outlining the changes that they wanted to make to Razor in order to improve it. Almost from day one, the Puppet Labs team that supported the Razor project had expressed grave concerns over some of the components that Nick and I had selected for the project. Most of their concern centered around our use of MongoDB and Node.js, which made bundling of Razor into a commercially supported product difficult.

There was also a serious scalability issue that we were aware of when we launched Razor as an open-source project that was caused by the design of Razor’s Node.js-based RESTful API. That RESTful API actually handled requests by forking off Ruby processes that used the Razor CLI to handle those requests, something that we knew would be a performance bottleneck but that we had planned on fixing after Razor was released. Now, a year after the launch of Razor, the Puppet Labs team was proposing that these “unsupportable” components be removed from Razor (and replaced by components that were more easily supported as part of a commercial offering) and they were proposing that the call order be reworked so that the RESTful API was called by the CLI, instead of the CLI being called by the RESTful API.

While these changes were being made, the Puppet Labs team also suggested that a number of other improvements should be made to Razor, and while Nick and I agreed that some of these changes were necessary, there were others that we simply did not agree with. In the end, the Puppet Labs team decided to move on with their changes to Razor, with or without the support of the project’s creators, and since we couldn’t reach agreement on the changes Nick and I parted ways with the Puppet Labs team.

Since then, the Puppet Labs team has gone on to significantly rewrite the original Razor project under the name “razor-server” and it bears very little resemblance to the project Nick and I co-wrote two years ago. They’ve removed support for the underlying key-value store we were using to maintain Razor’s state and replaced it with a fixed-schema relational database. They’ve removed the state machines from our “model” slice and replaced them with an “installer” (which uses a simple templating mechanism to “control” the install process for a node). They removed the underlying Node.js server (something we applauded), and replaced it with a Torquebox instance (something we thought of doing a bit differently). In short, the Puppet Labs team made the Razor project into something that was much easier for them to include in and support as part of their Puppet Enterprise commercial offering, but Nick and I felt that with these changes they were leaving a significant portion of the Razor community behind.

CSC and Razor

About the time that I left EMC to join Nick at VMware, Dan Hushon left to join the CSC team as their new CTO. At CSC, Dan quickly became involved in discussions that led to the acquisition of InfoChimps by CSC. As part of that deal, Dan and his team were looking for a way to use DevOps-style techniques to automate the deployment of Big Data clouds and, naturally, they turned to Razor as part of that solution (Dan’s blog entry describing the Razor part of the solution that they built out can be found here).

And so a few seeds of change within CSC were planted. By using Razor, Dan and his team were able to quickly bootstrap the infrastructure they needed to build out Big Data clouds in an automated fashion, passing off the resulting systems to Puppet for final configuration as Hadoop clusters. The result of that groundbreaking work by Dan and his team last year, and the interest that it generated, was that there was already a community of potential Razor users and developers in place when I joined CSC last November, and that community of users and developers has continued to build since we started work on the server we would come to call Hanlon.

The rebirth of Razor as Hanlon

So, how did we get where we are today (from Razor to a new project named Hanlon)?  As is usually the case in these sorts of situations, it all started with knowledge and experience that was picked up as part of another, only partly related, project. During my brief sojourn as part of the Zombie team, it became all too apparent that there were a few tools and techniques that we were using as part of Project Zombie that could solve some of the issues we were having with Razor. Specifically:

  • We used Rackup/Sinatra for the underlying server (rather than the Ruby daemon that we had used in building out Razor)
  • We built a Grape::API-based RESTful API for that server (an interface provided by the grape gem), instead of trying to build that RESTful API using Node.js and then integrating that API with the underlying Ruby server
  • We based the server we wrote on JRuby instead of Ruby, and
  • We used the warbler gem to allow for distribution of that server as a WAR file to any servlet container we might want to use (including the Apache Tomcat server)

After a bit of thought, it wasn’t too hard to see how we could take this same set of tools and techniques and, with a bit of work, use them to redesign Razor and remove many of the issues we’d been struggling with over the previous 18 months, especially the performance issues we had struggled with from the beginning.

So, late last December, I set out to rewrite large chunks of Razor and, in the process, created the server that we would come to call Hanlon. The underlying Ruby daemon that we had used in Razor was removed, along with the associated Node.js image and server services. In their place, I constructed a Grape::API-based RESTful API for our new Rackup/Sinatra-based server. I also inverted the call order between our two APIs (the RESTful API and the CLI) so that the CLI called the RESTful API, instead of the other way around. The dependencies on components that wouldn’t translate well to a JRuby-based server were removed (like the underlying reliance on native database drivers and the reliance on the daemon gem for some services), and the warbler gem was introduced to give us the ability to build a distributable version of the Hanlon server in the form of a WAR file. In the end, what was left was a greatly simplified and much more performant codebase than we had started with in Razor.

Since the CSC Hanlon team was now building the Hanlon server as a WAR file, we also decided that we should do a bit of refactoring to separate out the parts of the codebase that were used by the CLI – a simple Ruby script – from the parts of the codebase that were used by the Rackup/Sinatra-based server. The result was a much simpler and significantly flatter directory structure for the project. Finally, we simplified the Hanlon server’s configuration file by removing many unused or redundant configuration parameters that were contained in the Razor server’s configuration file. In the end, we feel that we struck a good balance between reworking the codebase to make it more supportable and performant while maintaining the existing functionality from the old Razor project. In short, Hanlon should support the needs of most of the users of the original Razor project, with very little change needed.
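
For those curious about what building that WAR file looks like in practice, the generic warbler workflow is roughly the following; this is only a sketch under the assumption that the project doesn’t wrap the build in its own task (the install path is hypothetical, and the resulting WAR name depends on the directory and warbler configuration):

$ jruby -S gem install warbler
$ cd /path/to/hanlon          # hypothetical install location
$ jruby -S warble war         # builds a WAR that can be dropped into a servlet container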

…and of the Razor-Microkernel as the Hanlon-Microkernel

Of course, for any of these changes to work, we also had to make some changes to the Microkernel that we had been using with Razor so that it would support our new Hanlon server; hence, a new Hanlon-Microkernel project. The biggest changes that we made to the Hanlon-Microkernel were changes to support the new URI structure used by the Hanlon server. We also made a few bug-fix type changes to properly support deployment of the Hanlon server as a WAR file to various servers (where the context of the RESTful API might change) and added support for a few new DHCP options to the Hanlon-Microkernel that were not supported in the old Razor-Microkernel project.

Finally, we added experimental support for gathering of BMC-related facts from the underlying node (if the node has a Baseboard Management Controller, of course). Our thought is that this will lead to changes to the node slice in Hanlon to support power-control of the node using that BMC-related meta-data, but that is a feature that will have to be added in the future; currently the facts are gathered, but the changes to the node slice have not yet been made. Of course, as was the case with the Hanlon project, the documentation for the Hanlon-Microkernel project in the project wiki was updated to reflect the changes that we had made.

In closing

We hope that those of you who have been using Razor to date will find Hanlon to be a preferable replacement. There are still a few rough edges to the project, but we have no doubt that with a bit of work most of the remaining gaps will be closed in short order.

The changes that we have made are a good start, but there are still other changes that are needed and that you, as the Razor community, can help with. Among them are the following:

  • A script that can be used to migrate an existing Razor database (under either MongoDB or PostgreSQL) to a Hanlon database. Since the serialized objects in a Razor or Hanlon database contain the class names of the objects that were serialized, and since the root of that object hierarchy changed when the root classes/modules were renamed (from Razor to Hanlon), an existing Razor database (and its objects) is not visible to a Hanlon server
  • Changes to the node slice to support power-control of a node using the node’s BMC (and the BMC-related meta-data that is gathered from the node by the Hanlon Microkernel)
  • Modifications to add support for the use of PostgreSQL for Hanlon’s underlying object store (up until now, our development and testing has been done with a MongoDB-based object store; the code to support the use of PostgreSQL is still in place, but we haven’t added in the appropriate non-native drivers to the project to support the use of PostgreSQL under JRuby).
  • Adding support for provisioning of Windows using Hanlon

However, in spite of these gaps, we still feel that Hanlon is ready to release into the wild. We hope that you find it as useful as you found our initial foray – Razor – and we look forward to working with you to rebuild the formerly diverse and active Razor community around our two new CSC open-source projects: Hanlon and the Hanlon-Microkernel.