
Announcing a new, Docker-based Hanlon-Microkernel

For several months now, we’ve been working hard on a new version of the Hanlon-Microkernel project with the goal of switching from our original Tiny-Core Linux based Microkernel to something that would be simpler to maintain and extend over time. This change was driven in part by the need to provide a simpler path for users who wanted to construct custom Hanlon Microkernels and in part by our own experience over time supporting the ‘standard’ Hanlon Microkernel. This post describes the changes that we have made to this end, as well as the corresponding changes that we had to make to the Hanlon project to support this new Microkernel type.

Why change?

When we were writing Razor (the tool that would become Hanlon) a few years ago, we searched long and hard for an in-memory Linux kernel that we could use for node discovery. Using a Linux kernel for node discovery gave us the ability to take advantage of tools that were already out there in the Linux/UNIX community (lshw, lscpu, facter, etc.) to discover the capabilities of the nodes being managed by Hanlon. Using an in-memory Linux kernel meant that we could iPXE-boot any node into our Microkernel without any side-effects that might damage anything that might already be on the node — an important consideration if we were to manage nodes that had already been provisioned with an operating system in an existing datacenter. As we have discussed previously, we eventually settled on Tiny-Core Linux as the base OS for our Microkernel.

Tiny-Core Linux (TCL) had several advantages over the other in-memory alternatives that were available at the time, including the fact that it was very small (the ISO for the ‘Core’ version of TCL weighed in at a mere 7 megabytes) and that, out of the box, it provided pre-built versions of most of the packages that we needed to run our Ruby-based discovery agent in the form of Tiny-Core Extensions (TCEs). All that was left was to construct a shell-script based approach that would make it simpler for the typical user, with limited knowledge of Linux or UNIX system administration, to remaster the standard TCL ISO in order to build a ‘Microkernel ISO’ suitable for use with Hanlon. Things went quite well initially, but over time we started to notice issues with the approach we had chosen for building our Microkernel, and those issues became harder and harder to resolve.

Those issues boiled down to a few limitations in the way we were building our Microkernel — remastering a standard TCL ISO in order to construct a Microkernel ISO that included all of the dependencies needed for our discovery agent to run — and to the static nature of that process. In short, that approach had a few key areas of weakness:

  • Hardware support: when users started trying to use our Microkernel with some of the newer servers coming out on the market, they discovered that those nodes, when booted into the pre-packaged Microkernel that we had posted online, were not able to check in and register with the Hanlon server. When we dug deeper, we realized that the kernel modules for the NICs on those servers weren’t included in our pre-built Microkernel. We spent some time developing a mechanism that would give users the ability to add kernel modules to the Microkernel during the remastering process so they could build a custom Microkernel that worked with their hardware, but that meant they had to use our remastering process to create their own custom ISOs (something specific to their hardware). In spite of our efforts to make this process as simple as possible, we found that it wasn’t easy (to say the least) for an inexperienced user to follow.
  • Customizing the TCL DHCP client: Things got a bit worse when we started trying to define a scale-out strategy for Razor. The team we were working with wanted to set up a hardware load-balancer in front of a set of Razor servers and then route requests to the various Razor servers using a round-robin algorithm. Unfortunately, the hardware load-balancer that was chosen wasn’t capable of running a PXE-boot server locally, and as a result our Microkernel was not able to discover the location of the Razor server using the next-server parameter it received back from the DHCP server (which pointed to the PXE-boot server, not the hardware load-balancer). We knew we could get around this by customizing the DHCP client in our Microkernel to support the parsing of additional options from the reply it received back from the DHCP server, but because that DHCP client is part of the BusyBox binary that TCL is built around, that meant we would have to build our own customized version of BusyBox and replace the BusyBox binary embedded in the standard TCL ISO with our customized version during our remastering process. While we were able to modify the remastering process to support this change fairly quickly, rebuilding BusyBox itself is not an exercise for the faint of heart since it requires cross-compilation on a separate Linux machine.
  • Updating our Microkernel to support newer versions of TCL: At the same time, we started to find bugs in Razor that were the result of known issues in a few of the TCEs maintained by the TCL community. Because we were using an older version of TCL, the TCEs we were downloading during the remastering process were built from older versions of the packages they contained. We resolved many of these issues by moving to a newer version of TCL, but that move wasn’t an easy one since it required significant changes to the remastering process itself to support changes in the boot process that had occurred between TCL 4.x and TCL 5.x (changes that took several weeks to get right).
  • Building custom TCEs: Not all of the issues we had with TCEs from the standard TCE repositories could be resolved by updating the TCL version we were basing our Microkernel on and we also found ourselves wanting to include packages that we couldn’t find pre-built in the standard TCE repositories. As a result, we quickly found ourselves in the business of building our own TCEs, then modifying our remastering process to allow for bundling of these locally-built TCEs into the remastered Microkernel ISOs. As was the case with rebuilding a customized version of the BusyBox kernel used in our Microkernel, this was not an easy process to follow for an inexperienced user, and it led to even more time being spent on things that were not related to development of the Microkernel itself.

So, we knew we needed to make a change to how we built our Microkernel, and that left us with the question of what we should use as the basis for our new Microkernel platform. We knew we didn’t want to lose the features that had initially led us to choose TCL (a small, in-memory Linux kernel that provided us with a repository of the tools we needed for node discovery), but what, really, was our best alternative?

Times had changed

Fortunately for us, several technologies had come to the forefront in the two or three years since we conducted our original search. After giving the problem some thought, we realized that one of the easiest solutions, particularly from the point of view of a casual user of the Hanlon Microkernel, might actually be to convert our Microkernel Controller (the Ruby-based daemon running in the Microkernel that communicated with the Hanlon server) from a service running directly in a dynamically provisioned, in-memory Linux kernel to a service running in a Docker container within such a kernel. By converting our Microkernel to a Docker image and running our Microkernel Controller in a Docker container based on that image, it would be very simple for a user to build their own version of the Hanlon Microkernel, customized for use in their environment. Plus, it would be even simpler for us to define an Automated Build for the Hanlon-Microkernel project in our cscdock organization on DockerHub so that users who wanted to use the standard Hanlon Microkernel could do so via a simple pair of ‘docker pull’ and ‘docker save’ commands.

With that thought in mind, we started looking more deeply at how much work it would be to convert our Microkernel Controller to something that could be run in a Docker container. The answer, as it turned out, was “not much”. The Microkernel Controller was already set up to run as a daemon process in a Linux environment and it didn’t really have any significant dependencies on other, external services, so setting up a Docker container that could run it was a very simple task. The most difficult part of the process was arranging things so that facter could discover and report ‘facts’ about the host operating system instance rather than the ‘facts’ associated with the container environment it was running in. The solution was a combination of a bit of sed-magic run against the facter gem after it was installed during the docker build process (so that it would look for the facts it reported in a non-standard location), cross-mounting the /proc, /dev, and /sys filesystems from the host as local directories in the Docker container’s filesystem, starting up the container in privileged mode, and setting the container’s network to host mode so that the details of the host’s network were visible from within the container.
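
For reference, the net effect of those last three settings is roughly what you would get from a docker run invocation along the lines of the sketch below (the controller start-up script named here is illustrative; the actual hnl_mk*.rb scripts are the ones added to /usr/local/bin by the project’s Dockerfile):

$ docker run -d --privileged --net=host \
    -v /proc:/host-proc -v /dev:/host-dev -v /sys:/host-sys \
    cscdock/hanlon-microkernel /usr/local/bin/hnl_mk_control_server.rb

The /host-proc, /host-dev, and /host-sys mount points match the paths that the sed commands substitute into the facter gem during the docker build.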

With those changes in place, we had a working instance of our Microkernel Controller running in a Docker container. All that remained was to determine which Docker image we wanted to base our Docker Microkernel Image off of and which operating system we wanted to use for the host operating system that the node would be iPXE-booted into.

It actually took a bit of digging to answer both of these questions, but the first was easier to answer than the second. As was the case with our initial analysis, we had some criteria in mind when making this decision:

  • The Docker image should be smaller than 256MB in size (to speed up delivery of the image to the node); smaller was considered better
  • Only Docker images that were being actively developed were considered
  • The Docker image should be based on a relatively recent Linux kernel so that we could be fairly confident that it would support the newer hardware we knew we would find in many modern data-centers
  • Since we knew we would be using facter as part of the node discovery process, the distribution that the Docker image was based on needed to include a standard package for a relatively recent release of Ruby
  • The distribution should also provide standard packages for the other tools needed for the node discovery process (lshw, lscpu, dmidecode, ipmitool, etc.) and provide access to tools that could be used to discover the network topology around the node using the Link Layer Discovery Protocol (LLDP)
  • The distribution that the Docker image was based on should be distributed under a commercial-friendly open-source license in order to support development of commercial versions of any extensions that might be developed moving forward

After looking at several of the alternatives available to us, we eventually settled on the GliderLabs Alpine Linux Docker image, which is:

  • very small (weighing in at a mere 5.25MB in size)
  • actively being developed (at the time of this writing, the most recent release had been made about three months earlier)
  • based on a recent release of the Linux kernel (v3.18.20)
  • distributed under a relatively commercial-friendly GPLv2 license, a license that allows for development of commercial extensions of our Microkernel so long as those extensions are not bundled directly into the ISO.

Additionally, it provides pre-built packages for all of the tools needed by our Microkernel Controller (including recent versions of ruby, lshw, lscpu, dmidecode and ipmitool) through its apk package management tool.

For those interested in more details regarding this image, the GitHub page for the project used to build this image can be found here, and the README.md file on that page includes links to additional pages and documentation on the project.

Of course, we still needed an operating system

Now that we had a strategy for migrating our Microkernel Controller from a service running in an operating system to a service running in a Docker container, we were left with the question of which operating system we should use as the base for the new Hanlon Microkernel. Of course, we still had to consider the criteria we mentioned above (small, under active development, distributed under a commercial-friendly license, etc.) when choosing the Linux distribution to use as an operating system for our Microkernel container. Not only that, but we wanted a standard, in-memory distribution that could be used to iPXE-boot a node, with no modifications to the ISO necessary to run our Microkernel container.

With those constraints in mind, we started looking at alternatives. Initially, we felt CoreOS would provide us with the best small platform for our Microkernel (small being a relative concept here: a CoreOS ISO weighs in at 190MB, which is still much smaller than the 450+MB LiveCD images of most major distributions). When we mentioned our search for a suitable, small OS that could run Docker containers to Aaron Huslage (@huslage) from Docker, he recommended we take a look at a relatively recent entry amongst small, in-memory Linux distributions: RancherOS. While it is still in beta, it is significantly smaller than the other distributions we were looking at (weighing in at a mere 22MB), it runs Docker natively (even the system services are run in their own Docker containers in RancherOS), and it’s distributed under a very commercial-friendly Apache v2 (APLv2) license. Given these advantages, we decided to use RancherOS rather than CoreOS as the base operating system for our Microkernel.

Building a new Microkernel

With the new platform selected, it was time to modify our Microkernel Controller so that it could be run in a Docker container. Since all of the tools required by our Microkernel Controller were available out of the box under Alpine Linux, this was really more an exercise in getting rid of code in the Microkernel that we no longer needed (mostly code that was specific to the work we had to do in the past to initialize the TCL platform) than in making any real modifications to the Microkernel Controller itself.

Specifically we:

  • Removed the code that was associated with the process of building the ‘bundle file’ and replaced it with a Dockerfile (an abridged sketch of that Dockerfile appears just after this list)
  • Removed the code that was used to configure the old, TCL-based Microkernel during the boot process (this code was replaced by a cloud-config that was returned to the new Microkernel by Hanlon during the iPXE-boot process)
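
For reference, that Dockerfile is quite small. The sketch below is an abridged reconstruction based on the build output shown later in this post; the full RUN instruction also uses sed to patch the installed facter gem so that it reads its facts from the cross-mounted /host-proc, /host-dev, and /host-sys directories:

FROM gliderlabs/alpine
# install the tools needed for node discovery (lshw and ipmitool come from
# the Alpine 'edge/testing' repository), along with Ruby and the gems we need
RUN apk update && \
    apk add bash sed dmidecode ruby ruby-irb open-lldp util-linux open-vm-tools sudo && \
    apk add lshw ipmitool --update-cache \
      --repository http://dl-3.alpinelinux.org/alpine/edge/testing/ --allow-untrusted && \
    echo "install: --no-rdoc --no-ri" > /etc/gemrc && \
    gem install facter json_pure daemons
# add the Microkernel Controller scripts and their supporting library files
ADD hnl_mk*.rb /usr/local/bin/
ADD hanlon_microkernel/*.rb /usr/local/lib/ruby/hanlon_microkernel/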

Overall, when these changes were made, we were able to reduce the size of the Hanlon-Microkernel codebase by more than 1400 lines of code. Not only that, but there were a few unexpected benefits, including:

  • Removing the need to use custom parameters in the DHCP response to pass parameters into our Microkernel so that it could check-in with the Hanlon server. Because RancherOS (like CoreOS) supports the use of a cloud-config (passed to the kernel as a URL during the iPXE-boot process), we could pass all of the parameters that we used to pass to the Microkernel via DHCP directly to the Microkernel from the Hanlon server as part of that same cloud-config.
  • Configuring the Microkernel Controller correctly from the start. Again, we are able to pass the configuration of the Microkernel directly from the Hanlon server using that same cloud-config, so the Microkernel Controller is correctly configured from the start. Previously, we burned a default configuration into every Microkernel instance and then updated that configuration after the Microkernel checked in with Hanlon for the first time. Being able to pass the initial configuration to the Microkernel directly from the Hanlon server makes it much simpler to debug any issues that might arise prior to first checkin since the log-level of the Microkernel controller can be set to Logger::DEBUG from the start, not just after the first check-in succeeds.

Not only that, but the shift from an ISO-based Microkernel to a Docker container-based Microkernel also simplified distribution of new releases of the Hanlon-Microkernel project. Since the Hanlon-Microkernel project is now built as a Docker image, we can now set up an Automated Build on DockerHub (under our cscdock organization in the cscdock/hanlon-microkernel repository) that will trigger whenever we merge changes into the master branch of the Hanlon-Microkernel project. In fact, we’ve already set up a build there, and obtaining a local copy of the Hanlon Microkernel image that is suitable for use with the Hanlon server is as simple as running the following pair of commands:

$ docker pull cscdock/hanlon-microkernel
Using default tag: latest
latest: Pulling from cscdock/hanlon-microkernel
3857f5237e43: Pull complete
9606ec958876: Pull complete
42b186ff3b3c: Pull complete
4d46659c683d: Pull complete
Digest: sha256:19dcb9c0f5d4e55202c46eaff7f4b3cc5ac1d2e90e033ae1e81412665ab6a240
Status: Downloaded newer image for cscdock/hanlon-microkernel:latest
$ docker save cscdock/hanlon-microkernel > new_mk_image.tar

The result of that docker save command will be a tarfile that you can use as one of the inputs (along with a RancherOS ISO) when adding a Microkernel to Hanlon (more on this, below).

We are also creating standard Docker images from the Hanlon-Microkernel project (starting with the v3.0.0 release) under that same repository on DockerHub. To retrieve a specific build of the Docker Microkernel Image, you’d simply modify the commands shown above to include the tag for that version. The tags we use for these version-specific builds in the DockerHub repository will be the same as those in the GitHub repository, but without the ‘v’ prefix, so the commands to retrieve the build from the v3.0.0 Hanlon-Microkernel release (and save that image in a form usable with the Hanlon server) would look like the following:

$ docker pull cscdock/hanlon-microkernel:3.0.0
3.0.0: Pulling from cscdock/hanlon-microkernel
3857f5237e43: Pull complete
40806b4dc54b: Pull complete
ed09cd42dec4: Pull complete
d346b8255728: Pull complete
Digest: sha256:45206e7407251a18db5ddd88b1d1198106745c43e92cd989bae6d38263b43665
Status: Downloaded newer image for cscdock/hanlon-microkernel:3.0.0
$ docker save cscdock/hanlon-microkernel:3.0.0 > new_mk_image-3.0.0.tar

As was the case in the previous example, the output of the docker save command will be a tarfile suitable for use as one of the arguments (along with a RancherOS ISO) when adding a Microkernel instance to a Hanlon server.

Building your own (Docker-based) Hanlon Microkernel

As we mentioned earlier, one of our goals in shifting from an ISO-based Hanlon Microkernel to a Docker container-based Hanlon Microkernel was to drastically simplify the process for users who were interested in creating their own, custom Microkernel images. In short, after a few weeks of experience with the new process ourselves, we think we’ve met, and hopefully even surpassed, that goal with the new Hanlon-Microkernel release.

Customizing the Microkernel is now as simple as cloning a copy of the Hanlon-Microkernel project to a local directory (using a git clone command), making your modifications to the codebase, and then running a ‘docker build’ command to build your new, custom version of the standard Hanlon-Microkernel. The changes you make might be changes to the source code for the Microkernel Controller itself (to fix a bug or add additional capabilities) or they might involve modifications to the Dockerfile (e.g. to add additional kernel modules needed for some specialized hardware only used locally), but no longer will users have to understand all of the details of the process of remastering a Tiny-Core Linux ISO to build their own version of the Hanlon-Microkernel. Now, building a new custom version of the Microkernel is as simple as the following:

$ docker build -t hanlon-mk-image:3.0.0 .
Sending build context to Docker daemon 57.51 MB
Step 0 : FROM gliderlabs/alpine
---> 2cc966a5578a
Step 1 : RUN apk update && apk add bash sed dmidecode ruby ruby-irb open-lldp util-linux open-vm-tools sudo && apk add lshw ipmitool --update-cache --repository http://dl-3.alpinelinux.org/alpine/edge/testing/ --allow-untrusted && echo "install: --no-rdoc --no-ri" > /etc/gemrc && gem install facter json_pure daemons && find /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.4 -type f -exec sed -i 's:/proc/:/host-proc/:g' {} + && find /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.4 -type f -exec sed -i 's:/dev/:/host-dev/:g' {} + && find /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.4 -type f -exec sed -i 's:/host-dev/null:/dev/null:g' {} + && find /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.4 -type f -exec sed -i 's:/sys/:/host-sys/:g' {} +
---> Running in 4bfa520b64f9
fetch http://alpine.gliderlabs.com/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
v3.2.3-105-ge9ebe94 [http://alpine.gliderlabs.com/alpine/v3.2/main]
OK: 5290 distinct packages available
(1/35) Installing ncurses-terminfo-base (5.9-r3)
(2/35) Installing ncurses-libs (5.9-r3)
(3/35) Installing readline (6.3.008-r0)
(4/35) Installing bash (4.3.33-r0)
(5/35) Installing dmidecode (2.12-r0)
(6/35) Installing libconfig (1.4.9-r1)
(7/35) Installing libnl (1.1.4-r0)
(8/35) Installing open-lldp (0.9.45-r2)
(9/35) Installing fuse (2.9.4-r0)
(10/35) Installing libgcc (4.9.2-r5)
(11/35) Installing libffi (3.2.1-r0)
(12/35) Installing libintl (0.19.4-r1)
(13/35) Installing glib (2.44.0-r1)
(14/35) Installing libstdc++ (4.9.2-r5)
(15/35) Installing icu-libs (55.1-r1)
(16/35) Installing libproc (3.3.9-r0)
(17/35) Installing libcom_err (1.42.13-r0)
(18/35) Installing krb5-conf (1.0-r0)
(19/35) Installing keyutils-libs (1.5.9-r1)
(20/35) Installing libverto (0.2.5-r0)
(21/35) Installing krb5-libs (1.13.1-r1)
(22/35) Installing libtirpc (0.3.0-r1)
(23/35) Installing open-vm-tools (9.4.6_p1770165-r4)
Executing open-vm-tools-9.4.6_p1770165-r4.pre-install
(24/35) Installing gdbm (1.11-r0)
(25/35) Installing yaml (0.1.6-r1)
(26/35) Installing ruby-libs (2.2.2-r0)
(27/35) Installing ruby (2.2.2-r0)
(28/35) Installing ruby-irb (2.2.2-r0)
(29/35) Installing sed (4.2.2-r0)
(30/35) Installing sudo (1.8.15-r0)
(31/35) Installing libuuid (2.26.2-r0)
(32/35) Installing libblkid (2.26.2-r0)
(33/35) Installing libmount (2.26.2-r0)
(34/35) Installing ncurses-widec-libs (5.9-r3)
(35/35) Installing util-linux (2.26.2-r0)
Executing busybox-1.23.2-r0.trigger
Executing glib-2.44.0-r1.trigger
OK: 63 MiB in 50 packages
fetch http://dl-3.alpinelinux.org/alpine/edge/testing/x86_64/APKINDEX.tar.gz
fetch http://alpine.gliderlabs.com/alpine/v3.2/main/x86_64/APKINDEX.tar.gz
(1/2) Installing ipmitool (1.8.13-r0)
(2/2) Installing lshw (02.17-r1)
Executing busybox-1.23.2-r0.trigger
OK: 70 MiB in 52 packages
Successfully installed facter-2.4.4
Successfully installed json_pure-1.8.3
Successfully installed daemons-1.2.3
3 gems installed
---> e7a8344fda5a
Removing intermediate container 4bfa520b64f9
Step 2 : ADD hnl_mk*.rb /usr/local/bin/
---> c963bb236983
Removing intermediate container 0a42b371b2e9
Step 3 : ADD hanlon_microkernel/*.rb /usr/local/lib/ruby/hanlon_microkernel/
---> ac4cdf004a25
Removing intermediate container 1b66c3efd788
Successfully built ac4cdf004a25
$ docker save hanlon-mk-image:3.0.0 > hanlon-mk-image.tar

As was the case in the examples shown previously, the result of the ‘docker save’ command will be a tarfile suitable for use as one of the inputs required when adding a new Microkernel instance to a Hanlon server.

One final note on building your own Microkernel… it is critical that any Microkernel image you build be tagged with a version compatible with the semantic versioning used internally by Hanlon. In the example shown above, you can see that we tagged the Docker image we built using a fixed string (3.0.0) for the version.

Of course, instead of using a fixed string you could use the git describe command, combined with a few awk or sed commands, to generate a string that would be quite suitable for use as a tag in a docker build command. Here is an example of just such a command pipeline:

git describe --tags --dirty --always | sed -e 's@-@_@' | sed -e 's/^v//'

This command pipeline returns a string that includes information from the most recent GitHub tag, the number of commits since that tag, the most recent commit ID for the repository, and a ‘-dirty’ suffix if there are currently uncommitted changes in the repository. For example, if this command pipeline returns the following string:

2.0.1_13-g3eade33-dirty

that would indicate that the repository is 13 commits ahead of the commit that is tagged as ‘v2.0.1’, that the abbreviated ID of the latest commit is ‘3eade33’ (the ‘g’ prefix added by git describe simply stands for ‘git’), and that there are currently uncommitted changes in the repository. Of course, if you use the same command in a repository that has just been tagged as v3.0.0, then the output of that command pipeline would be much simpler:

3.0.0

So, the ‘git describe’ command pipeline shown above provides us with a mechanism for generating a semantic version compatible tag for images that are built using a ‘docker build’ command. Here’s an example:

docker build -t hanlon-mk-image:`git describe --tags --dirty --always | sed -e 's@-@_@' | sed -e 's/^v//'` .

Using our new Microkernel with Hanlon

So now we’ve got a tarfile containing our new Docker Microkernel Image; what’s the next step? How exactly do we turn that tarfile into a Microkernel that Hanlon can use? This is where the changes to Hanlon (v3.0.0) come in, so perhaps a brief description of those changes is in order.

The first thing we had to change in Hanlon was its concept of exactly what a Microkernel image was. Prior to this release, an image in Hanlon always consisted of one and only one input file: the ISO that represented the image in question. A Hanlon image was built from a single ISO, regardless of whether it was an OS image, an ESX image, a Xen-server image, or a Microkernel image. The only difference, as far as Hanlon was concerned, was that the contents of the ISO (e.g. the location of the kernel and ramdisk files) would change from one type of ISO to another; up until the latest release, a Hanlon image was built from a single ISO, period.

With this new release, a Microkernel image is significantly different from the other image types defined in Hanlon. A Microkernel image now consists of two input files, the RancherOS ISO containing the boot image for a node and the Docker image file containing the Microkernel Controller. So, while the command to add a Microkernel in previous versions of Hanlon (v2.x and older) looked like this:

hanlon image add -t mk -p ~/iso-build/v2.0.1/hnl_mk_debug-image.2.0.1.iso

(note the single argument, passed using the -p flag, that provides Hanlon with the path on the local filesystem where Hanlon can find the Microkernel ISO), the new Hanlon-Microkernel requires an additional argument:

hanlon image add -t mk -p /tmp/rancheros-v0.4.1.iso -d /tmp/cscdock-mk-image.tar.bz2

In this example you can see that not only must the user provide the path on the local filesystem where Hanlon can find an instance of a RancherOS ISO (using the -p flag) when adding a new Microkernel instance to Hanlon, but they must also provide the path to a tarfile containing an instance of the Docker Microkernel Image file that we saved previously (using the -d flag). These two files, together, constitute a Hanlon Microkernel in the new version of Hanlon, and both pieces must be provided to successfully add a Microkernel instance to a Hanlon server.

So, what does the future hold?

Hopefully, it’s apparent that our shift from an ISO-based Hanlon Microkernel to a Docker container-based Hanlon Microkernel has successfully resolved the issues we set out to resolve. It is now much simpler for even an inexperienced Hanlon user to rebuild a standard Docker Microkernel Image locally or to build their own custom Docker Microkernel Images. Not only that, but it is now much easier to extend the existing Microkernel or update it (e.g. moving the Microkernel to a newer Alpine Linux build in order to support newer hardware). Finally, shifting over to a modern OS that can be configured at boot time using a cloud-config URL and that can run our Microkernel Controller in a Docker container has meant that we could significantly simplify the codebase in our Hanlon-Microkernel project.

This same, modern platform may also provide us with opportunities to extend the behavior of the Hanlon Microkernel at runtime, something that we previously could only imagine. For example, there have been a number of ideas for the Microkernel that we have discussed over the past two or three years that we really couldn’t imagine implementing, given the static nature of the ISO-based Microkernel we were using. Now that we’re working with a much more dynamic platform for our Microkernel, perhaps it’s time to revisit some of those ideas, e.g. creating Microkernel ‘stacks’ so that a Microkernel can behave differently, but only for a single boot or a finite sequence of boots.

Only time will tell, but it’s a brave new world for Hanlon and the Hanlon Microkernel…


Announcing Hanlon and the Hanlon-Microkernel

Today, we are making an important announcement about two new open-source projects that we are releasing as part of the launch of our new CSC Open Source Program: Hanlon and the Hanlon Microkernel. These projects are the next-generation versions of two projects that some of you might already be familiar with, Razor and Razor-Microkernel.

For those of you who don’t know me, my name is Tom McSweeney and I am now working as a Senior Principal in the Office of the CTO at CSC. I joined CSC last November; since then I’ve been leading the team that has been defining the processes and procedures behind a new Open Source Program at CSC. I am also one of the co-creators of the Razor and Razor-Microkernel projects, which Nick Weaver and I wrote together when we were at EMC – projects that we open-sourced through Puppet Labs almost exactly two years ago today.

So, with that announcement, I’m sure that those of you who have been following the Razor and Razor-Microkernel projects from the beginning have a number of questions for us. I’ll take my best shot at answering a few of them here. If there are others that you have, you know how to reach me…

What’s in a name?

To start, many of you might be asking yourselves: “Why the name change – from Razor to Hanlon – if it is basically the same project?” There are really two explanations for the name change, and both had equal weight when we were making this decision. First, we decided to use a different name for “our Razor” in order to avoid confusion with the existing (Puppet Labs) Razor project. Without a name change we would always be left with a discussion of “our Razor” and “their Razor” (or worse, the “original Razor” and the “new Razor”). A simple change of name for our project removes that confusion completely.

Second, we felt that a name change would quickly highlight that “our Razor” was taking a new approach to solving the same problem as the “original Razor” that we released two years ago. We haven’t changed our emphasis on using an automated, policy-based approach for the discovery and provisioning of compute nodes, nor have we changed the basic structure of the interface: for example, we still talk of slices and we still support a RESTful API along with a CLI.

What has changed, however, is the structure and organization of the underlying codebase, along with how the RESTful API and CLI are implemented. There is a long tradition in many cultures of using name changes to highlight significant changes in the life of an individual, or in this case a project, and we felt that a name change needed to be made to signify this shift in how our server did what it did.

The next question that might come to mind is “Why Hanlon?” Of all of the possible names we could have chosen for these projects, why would we pick the last name of an American writer from Scranton, PA? To put it quite simply, we felt that the name we chose for the project should be tied to the original name (Razor) in some way, shape, or form. As those of you who have been with us from the beginning might recall, the original name (Razor) was chosen because the journey that Nick and I set out on when we wrote the original Razor was very much inspired by Occam’s (or Ockham’s) Razor, which for us was best represented by the concept that, when you are seeking an explanation or solution to a problem, “Everything should be made as simple as possible, but no simpler”. Unfortunately, we couldn’t use the name Occam (or Ockham), because that name had already been trademarked, and we didn’t want to start CSC’s first foray into the world of open source by contributing two new projects whose names had to be changed shortly after they were released. After giving a bit of thought to many possible names for these two projects, we decided that we could easily link this project to the original Razor project by choosing another “Razor” from the many “Razors” that have been written down (in both modern and ancient times), and “Hanlon’s Razor” seemed to be a good fit.

Finally, many of you may be asking yourselves the following question: “If these two projects are really just the next-generation versions of Razor and the Razor-Microkernel why didn’t you just contribute your changes to the existing Puppet Labs projects?” The answer to this question is a bit more involved, and to provide an adequate answer, a bit of history is necessary.

In the beginning…

To say that Nick and I were pleasantly surprised by the reception that Razor received from the open-source community when we released the project two years ago would be an understatement. Nick and I were both familiar with using open-source software, but neither of us had spent much time contributing to open-source projects, much less creating software to release under an open-source license, so we really had no idea what we were getting ourselves into when we decided that Razor was something that should be released to the world as an open-source project. From the start, the response from the community to the open-source announcement was overwhelming. The first pull request for the Razor project was received a mere four hours after the announcement that we were open-sourcing the project, and by the end of the first month we had almost 100 forks of the project and many more watchers. It quickly became obvious that, whatever the gaps or weaknesses in the project were, the community longed for a solution like the one we had put together.

Over the next six months, there were many changes in Razor. The community continued to build and we went through a major rewrite of the RESTful API to make it more RESTful and to remove inconsistencies in both the CLI and the RESTful API that existed from slice to slice. The documentation for the project was greatly improved, and pull requests continued to pour in from the community. By the end of the year, we even had a pull request from the Chef community that added a Chef broker to Razor, although I have to say that the concept of providing support for both Puppet and Chef in a Puppet Labs open-source project did strike some users as odd, at least initially. Nick Weaver demonstrated a proof-of-concept implementation of changes he’d made to support Windows provisioning during his keynote presentation at PuppetConf 2012, but left EMC shortly after that to take on a key (leadership) role on the Zombie team at VMware. At VMware, he and his team built an automation platform – Project Zombie – that is still being used today to automate the discovery and provisioning of servers for VMware’s VCloud Hybrid Services product. Deep down under the covers of that automation platform they are still using Razor to automate the discovery of servers added to their datacenters and to provision those servers with an ESX image so that they can be used to support customer workloads. I left EMC in early 2013, first to join Nick on the Zombie team at VMware and then to join Dan Hushon’s OCTO team at CSC. Throughout that time, in spite of the fact that we did not contribute much to the projects we had created (due to issues with CLAs that hadn’t been signed by the companies we were now working for), we were pleased with how Razor continued to grow and evolve, with features that we’d only dreamed of (or hadn’t even imagined) being added by the community.

A turning point was reached

All of that began to change last summer. Last June, the Puppet Labs team sent Nick and me a brief email outlining the changes that they wanted to make to Razor in order to improve it. Almost from day one, the Puppet Labs team that supported the Razor project had expressed grave concerns over some of the components that Nick and I had selected for the project. Most of their concern centered around our use of MongoDB and Node.js, which made bundling of Razor into a commercially supported product difficult.

There was also a serious scalability issue that we were aware of when we launched Razor as an open-source project that was caused by the design of Razor’s Node.js-based RESTful API. That RESTful API actually handled requests by forking off Ruby processes that used the Razor CLI to handle those requests, something that we knew would be a performance bottleneck but that we had planned on fixing after Razor was released. Now, a year after the launch of Razor, the Puppet Labs team was proposing that these “unsupportable” components be removed from Razor (and replaced by components that were more easily supported as part of a commercial offering) and they were proposing that the call order be reworked so that the RESTful API was called by the CLI, instead of the CLI being called by the RESTful API.

While these changes were being made, the Puppet Labs team also suggested that a number of other improvements should be made to Razor, and while Nick and I agreed that some of these changes were necessary, there were others that we simply did not agree with. In the end, the Puppet Labs team decided to move on with their changes to Razor, with or without the support of the project’s creators, and since we couldn’t reach agreement on the changes Nick and I parted ways with the Puppet Labs team.

Since then, the Puppet Labs team has gone on to significantly rewrite the original Razor project under the name “razor-server” and it bears very little resemblance to the project Nick and I co-wrote two years ago. They’ve removed support for the underlying key-value store we were using to maintain Razor’s state and replaced it with a fixed-schema relational database. They’ve removed the state machines from our “model” slice and replaced them with an “installer” (which uses a simple templating mechanism to “control” the install process for a node). They removed the underlying Node.js server (something we applauded), and replaced it with a Torquebox instance (something we thought of doing a bit differently). In short, the Puppet Labs team made the Razor project into something that was much easier for them to include in and support as part of their Puppet Enterprise commercial offering, but Nick and I felt that with these changes they were leaving a significant portion of the Razor community behind.

CSC and Razor

About the time that I left EMC to join Nick at VMware, Dan Hushon left to join the CSC team as their new CTO. At CSC, Dan quickly became involved in discussions that led to the acquisition of InfoChimps by CSC. As part of that deal, Dan and his team were looking for a way to use DevOps-style techniques to automate the deployment of Big Data clouds and, naturally, they turned to Razor as part of that solution (Dan’s blog entry describing the Razor part of the solution that they built out can be found here).

And so a few seeds of change within CSC were planted. By using Razor, Dan and his team were able to quickly bootstrap the infrastructure they needed to build out Big Data clouds in an automated fashion, passing off the resulting systems to Puppet for final configuration as Hadoop clusters. The result of that groundbreaking work by Dan and his team last year, and of the interest that it generated, was that there was already a community of potential Razor users and developers in place when I joined CSC last November, and that community of users and developers has continued to build since we started work on the server we would come to call Hanlon.

The rebirth of Razor as Hanlon

So, how did we get where we are today (from Razor to a new project named Hanlon)?  As is usually the case in these sorts of situations, it all started with knowledge and experience that was picked up as part of another, only partly related, project. During my brief sojourn as part of the Zombie team, it became all too apparent that there were a few tools and techniques that we were using as part of Project Zombie that could solve some of the issues we were having with Razor. Specifically:

  • We used Rackup/Sinatra for the underlying server (rather than the Ruby daemon that we had used in building out Razor)
  • We built a Grape::API-based RESTful API for that server (an interface provided by the grape gem), instead of trying to build that RESTful API using Node.js and then integrating that API with the underlying Ruby server
  • We based the server we wrote on JRuby instead of Ruby, and
  • We used the warbler gem to allow for distribution of that server as a WAR file to any servlet container we might want to use (including the Apache Tomcat server)

After a bit of thought, it wasn’t too hard to see how we could take this same set of tools and techniques and, with a bit of work, use them to redesign Razor and remove many of the issues we’d been wrestling with over the previous 18 months, especially the performance issues that had dogged us from the beginning.

So, late last December, I set out to rewrite large chunks of Razor and, in the process, created the server that we would come to call Hanlon. The underlying Ruby daemon that we had used in Razor was removed, along with the associated Node.js image and server services. In their place, I constructed a Grape::API-based RESTful API for our new Rackup/Sinatra-based server. I also inverted the call order between our two APIs (the RESTful API and CLI) so that the CLI called the RESTful API, instead of the other way around. The dependencies on components that wouldn’t translate well to a JRuby-based server were removed (like the underlying reliance on native database drivers and the reliance on the daemon gem for some services) and the warbler gem was introduced to give us the ability to build a distributable version of the Hanlon server in the form of a WAR file. In the end, what was left was a greatly simplified and much more performant codebase than we had started with in Razor.

Since the CSC Hanlon team was now building the Hanlon server as a WAR file, we also decided that we should do a bit of refactoring to separate out the parts of the codebase that were used by the CLI – a simple Ruby script – from the parts of the codebase that were used by the Rackup/Sinatra-based server. The result was a much simpler and significantly flatter directory structure for the project. Finally, we simplified the Hanlon server’s configuration file by removing many unused or redundant configuration parameters that had been carried over from the Razor server’s configuration file. In the end, we feel that we struck a good balance between reworking the codebase to make it more supportable and performant and maintaining the existing functionality from the old Razor project. In short, Hanlon should support the needs of most users of the original Razor project with very little change needed.
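
For anyone who wants to try the WAR-based packaging themselves, it is typically just a matter of running warbler’s warble command under JRuby from the project’s top-level directory; a minimal sketch (assuming the warbler gem has been installed and the project contains a suitable warbler configuration) would look something like this:

$ jruby -S gem install warbler
$ jruby -S warble war

The resulting WAR file can then be deployed to whichever servlet container you prefer (including the Apache Tomcat server mentioned above).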

…and of the Razor-Microkernel as the Hanlon-Microkernel

Of course, for any of these changes to work, we also had to make some changes to the Microkernel that we had been using with Razor so that it would support our new Hanlon server; hence, a new Hanlon-Microkernel project. The biggest changes that we made to the Hanlon-Microkernel were changes to support the new URI structure used by the Hanlon server. We also made a few bug-fix type changes to properly support deployment of the Hanlon server as a WAR file to various servers (where the context of the RESTful API might change) and added support for a few new DHCP options to the Hanlon-Microkernel that were not supported in the old Razor-Microkernel project.

Finally, we added experimental support for gathering of BMC-related facts from the underlying node (if the node has a Baseboard Management Controller, of course). Our thought is that this will lead to changes to the node slice in Hanlon to support power-control of the node using that BMC-related meta-data, but that is a feature that will have to be added in the future; currently the facts are gathered, but the changes to the node slice have not yet been made. Of course, as was the case with the Hanlon project, the documentation for the Hanlon-Microkernel project in the project wiki was updated to reflect the changes that we had made.

In closing

We hope that those of you who have been using Razor to date will find Hanlon to be a preferable replacement. There are still a few rough edges to the project, but we have no doubt that with a bit of work most of the remaining gaps will be closed in short order.

The changes that we have made are a good start, but there are still other changes that are needed and that you, as the Razor community, can help with. Among them are the following:

  • A script that can be used to migrate an existing Razor database (under either MongoDB or PostgreSQL) to a Hanlon database. Since the serialized objects in a Razor or Hanlon database contain the class names of the objects that were serialized, and since the root of that object hierarchy changed when the root classes/modules were renamed (from Razor to Hanlon), an existing Razor database (and the objects it contains) is not visible to a Hanlon server
  • Changes to the node slice to support power-control of a node using the node’s BMC (and the BMC-related meta-data that is gathered from the node by the Hanlon Microkernel)
  • Modifications to add support for the use of PostgreSQL for Hanlon’s underlying object store (up until now, our development and testing has been done with a MongoDB-based object store; the code to support the use of PostgreSQL is still in place, but we haven’t added in the appropriate non-native drivers to the project to support the use of PostgreSQL under JRuby).
  • Adding support for provisioning of Windows using Hanlon

However, in spite of these gaps, we still feel that Hanlon is ready to release into the wild. We hope that you find it as useful as you found our initial foray – Razor – and we look forward to working with you to rebuild the formerly diverse and active Razor community around our two new CSC open-source projects: Hanlon and the Hanlon-Microkernel.