An HPC admin in a Cloud world
Disclaimer: Personal experience may alter your perception of reality. This also applies to yours truly.
This post is also somewhat generalized; there are tons of approaches to both HPC and IaaS.
I have a background in administering High Performance Computing (HPC) systems. As I have a lot of colleagues working with HPC, and IaaS is a big topic, I thought I'd write a post about some of my experiences with cloud. Some of these lessons have been easy to learn: you can just logic your way to the conclusions. Others have come at a larger cost.
HPC?
HPC systems are basically hundreds of nodes connected by a fast network and sharing a large pool of networked storage. HPC systems run batch jobs: user-defined computational tasks with runtimes from minutes to days. The jobs are queued up in a batch scheduler, which then schedules them onto free CPUs as they become available.
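To make the batch model concrete, here's a toy sketch of the idea (nothing like what Slurm or any real scheduler does internally, just the basic loop): jobs wait in a queue and get placed on nodes that have enough free CPUs.

```python
# Toy illustration only: a FIFO queue of jobs placed onto nodes with free CPUs.
# Real schedulers (Slurm, PBS, ...) add priorities, fairshare, backfill,
# reservations, topology awareness and much more.
from collections import deque

nodes = {"node01": 64, "node02": 64, "node03": 64}            # free CPUs per node
queue = deque([("job-1", 32), ("job-2", 64), ("job-3", 16)])  # (job, CPUs needed)

running = []
while queue:
    job, cpus = queue[0]
    # Pick the first node with enough free CPUs for the job at the head of the queue.
    target = next((n for n, free in nodes.items() if free >= cpus), None)
    if target is None:
        break  # nothing fits right now; the job waits until resources free up
    nodes[target] -= cpus
    running.append((job, target))
    queue.popleft()

print("running:", running)
print("still queued:", list(queue))
```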
On the surface HPC systems look a lot like IaaS cloud systems.
- A few controller nodes.
- A lot of compute nodes for user load.
- Shared storage.
- A shared network.
- End-user defined load running on the system.
So managing these is quite similar, right?
Similarities
Let's start with one major point that strongly applies to both kinds of systems: scale. You don't manage hundreds of individual nodes. You automate and handle them as a group. You don't update them individually and you don't configure them individually. This goes for both HPC and IaaS systems, even if HPC leans towards even larger scale and IaaS systems lean towards more complex configuration.
The tooling might also be different, but the same basics apply.
Differences
The major difference in operations is easy to summarize.
The HPC jobs have a definite end time. The cloud ones don't. This means that you can never schedule a downtime.
This one small point changes everything.
In HPC you have non-interactive batch jobs that queue for execution. You can free up single nodes or the whole cluster by not scheduling more jobs onto them. After that you're free to maintain the node, reboot the storage system, fix the switches, or whatever. You can do this on a per-node basis, a per-switch basis, or for the whole system.
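As a sketch of how simple this is on the HPC side, assuming a Slurm-based cluster (the node names and reason string below are made-up examples): mark the nodes as draining so no new jobs land on them, wait for the running jobs to finish, then do whatever maintenance you need.

```python
# Sketch, assuming Slurm. Node names and the reason string are made-up examples.
import subprocess

nodelist = "node[001-016]"  # hypothetical nodes going down for maintenance

# Stop scheduling new jobs onto the nodes; running jobs keep going.
subprocess.run(
    ["scontrol", "update", f"NodeName={nodelist}",
     "State=DRAIN", "Reason=kernel update"],
    check=True,
)

# Show what is still running there; once this list is empty, the nodes are
# free to reboot, reinstall, or repair.
subprocess.run(["squeue", "--nodelist", nodelist], check=True)
```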
In cloud systems, once you get user load, you can't get rid of it. Let's say a user starts a virtual machine which they want to run for 2 years. How do you update kernels on the compute host? Or change a hard disk when it starts to fail? Maybe you contact the customer and schedule it? How do you do that for 100 compute hosts?
Consider the central storage system. You configure the cloud so that users can create and attach virtual disks to their machines. Cool. You have disk space on demand. Except now you have a lot of virtual machines using the central storage, and the storage system can never be down. Unless you have a well designed highly available setup without single points of failure, you basically can't even reboot your storage nodes.
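If you do have that kind of setup, for example Ceph with enough replicas and sensible failure domains, taking one storage node down becomes a routine operation. A rough sketch, assuming Ceph (the node itself gets rebooted out of band):

```python
# Rough sketch, assuming a Ceph cluster redundant enough that one node can be
# down without clients noticing. The commands are the standard Ceph CLI.
import subprocess

def ceph(*args):
    subprocess.run(["ceph", *args], check=True)

ceph("osd", "set", "noout")    # don't start rebalancing during a short outage
# ... reboot the storage node via your out-of-band management ...
# ... wait until its OSDs are back up and the cluster reports healthy ...
ceph("osd", "unset", "noout")
```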
Repeat this exercise with all the components of the cloud. Want to reboot a switch? Is the switch a single point of failure? Tough luck, you can't. That would take away the customers' network and any storage behind that switch. Power line maintenance? Better think about how you can keep your system up.
Fundamental changes
These differences change the design of the system drastically. First of all, technical debt comes easy and hits hard. Whenever you take a new component (software, hardware, service, etc.) into use, whatever it may be, you need to make sure it can be down without affecting the customers. You must have a plan for how to maintain it before you take it into use. You should also already have a plan for how to migrate away from it.
How you do this depends on what part of the system you're working on. It can be the ability to use live migration. It may be Ceph for storage. It can be a redundant switching layer. Whatever you do, it usually starts with "Software Defined".
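As an example of what "the ability to use live migration" buys you, here's a rough sketch of emptying a compute host before maintenance. It assumes the openstacksdk Python client and admin credentials; the cloud name and hostname are made up, and the exact method and parameter names may differ between SDK releases, so check them against your version.

```python
# Rough sketch, assuming openstacksdk and admin credentials. Cloud name and
# hostname are made-up examples; verify method/parameter names against your
# SDK release.
import openstack

conn = openstack.connect(cloud="mycloud")   # entry in clouds.yaml
host = "compute-17.example.org"             # hypervisor going down for maintenance

# Find every VM currently running on that host (admin view across all projects).
servers = [
    s for s in conn.compute.servers(all_projects=True)
    if getattr(s, "compute_host", None) == host
]

for server in servers:
    print(f"live-migrating {server.name} ({server.id}) off {host}")
    # Let the scheduler pick the target host; block migration behaviour
    # depends on your storage setup and API microversion.
    conn.compute.live_migrate_server(server, host=None, block_migration="auto")
```

Once the host is empty you can patch and reboot it without anyone noticing, which is exactly the property you want to design for.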
Because of the impact of changes, testing your changes also plays a much bigger role. Having a QA system and a development system is almost a must. Things starting with "Software Defined" are great here too, since you can play around and test to your heart's content until you're sure everything works.
I can never have downtime?
Well, there are different types of downtime. API service downtime is usually more tolerated: running VMs don't notice a thing, but you can't launch new ones. This is common during e.g. big version upgrades of the cloud stack.
Disturbing running VMs is a different thing. If it's downtime for the whole platform, it's disruptive to all customers at once, and that's not easy to communicate properly.
Even if you only have downtime for a subset of your compute nodes, you probably need to up your game when it comes to customer communication. It goes from the HPC version:
A compute node died and a compute job was lost. Well, an email to the one affected customer is enough, and they can resubmit the job.
To the cloud version:
A compute node died. There were 13 customers with VMs on that compute node. We need to contact all of them and single out which VMs died. We possibly need to coordinate some recovery, or at least offer support.
I hope you have the customer communication tooling in place. Doing this manually is not fun.
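As a rough sketch of the minimum you want to automate, again assuming openstacksdk and admin credentials (names are made-up examples, and the calls should be checked against your SDK version): given a dead hypervisor, list the affected VMs grouped by project so support knows whom to contact.

```python
# Rough sketch, assuming openstacksdk and admin credentials: list the VMs that
# were on a dead hypervisor, grouped by project, as input for customer
# communication. Names are made-up examples.
from collections import defaultdict
import openstack

conn = openstack.connect(cloud="mycloud")
dead_host = "compute-42.example.org"

affected = defaultdict(list)
for server in conn.compute.servers(all_projects=True):
    if getattr(server, "compute_host", None) == dead_host:
        affected[server.project_id].append(server)

for project_id, servers in affected.items():
    project = conn.identity.get_project(project_id)
    print(f"project {project.name} ({project_id}):")
    for s in servers:
        print(f"  - {s.name} ({s.id}) status={s.status}")

# From here, feed the list into whatever ticketing or mailing tooling you have.
```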
Conclusions
This post might sound a bit like "HPC is easy, IaaS is hard". That's not the point I want to make though. HPC has other challenges which don't even come up with IaaS.
However, operations-wise an IaaS cloud service is very different from an HPC cluster. You are your own little IT department, with everything that comes with that, not just a service operator.
This requires a radical shift in thinking and, most likely, tooling. Your typical one-and-a-half-admin academic HPC cluster team will find itself severely under-resourced and overwhelmed if you decide:
Hey guys (and/or gals), you know your stuff. Let's also run one of these OpenStack things, since our customers want it.