It's hard to find enough relevant Magic the Gathering card names for the blog post. It's much easier to come up with puns. And they're almost as awesome. So here we are.
Anyway, to the point. We're not on Icehouse anymore. Yay.
OpenStack upgrade Icehouse -> Juno
We recently updated our OpenStack service from Icehouse to Juno. Well, we had a few more things bundled into that. We needed to upgrade our frontend hosts to CentOS7, plus move them to new hardware. Also, we changed our architecture. Oh, and we had a major power break when we had to shut everything down at the same time.
You might see where this is going. We had a big service break, where we thought we'd make big changes, since we had the break anyway. This caused big problems.
We basically bundled five updates into one.
- Update API and network machines to CentOS7
- Update to Juno
- Make architecture changes (move to Galera)
- Take new hardware into use
- Large power break
In retrospect, they should all have been done separately (e.g. in the order above). With all the changes, planning for the upgrade took a long time, and while we were preparing for it, it was hard to get other things into production. Even with all the planning, each of the updates/maintenances caused one or more larger problems (some of them below). Debugging the issues was hard due to the sheer volume of changes, which meant that fixing them was slower than it should have been. It also meant long hours, stress, and varying levels of unhappiness among our admins.
Lesson learned. Small incremental updates. Unless you really really have to. Which you probably don't.
Technical Stuff
If you don't read beyond this, that's fine. As long as you repeat the mantra "Small incremental updates. Small incremental updates." Below are some technical things that went awry; knowing about them might help someone else.
Icehouse -> Juno Update
We had tested the Juno upgrade quite well, and we only hit one major problem: the neutron database upgrade. It didn't work.
Trying to run
neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head
gave this
sqlalchemy.exc.OperationalError: (OperationalError) (1005, "Can't create table 'neutron.routerroutes_mapping' (errno: 150)") '\nCREATE TABLE routerroutes_mapping (\n\trouter_id VARCHAR(36) NOT NULL, \n\tnuage_route_id VARCHAR(36), \n\tFOREIGN KEY(router_id) REFERENCES routers (id) ON DELETE CASCADE\n)ENGINE=InnoDB\n\n' ()
After some debugging and running this in MariaDB
show engine innodb status;
I saw it was some foreign key error. More digging found this. After a sarcastic
Of course, why didn't I think to check our database table collations!
I started fixing them. Our database contents have tagged along with us since Folsom, so I'm not too surprised that there are some issues. Basically, our old neutron tables had an old collation type (apparently collations are used for string comparisons), while the new tables were created with a new one. Tables with mismatched collations can't have foreign key dependencies between them. So the fix was to:
- Take the dump file made before the upgrade
- Drop the neutron tables in the database (make sure you have the dump before you do this, ok?)
- Modify all the "CREATE TABLE" statements in the neutron dump file from
CREATE TABLE `agents` (
...
...
PRIMARY KEY (`id`),
UNIQUE KEY `uniq_agents0agent_type0host` (`agent_type`,`host`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
to (add COLLATE=utf8_unicode_ci)
CREATE TABLE `agents` (
...
...
PRIMARY KEY (`id`),
UNIQUE KEY `uniq_agents0agent_type0host` (`agent_type`,`host`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
- Import the fixed dump
- Run the neutron database upgrade
Voilà!
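For reference, the whole dance in shell form looks roughly like the sketch below. The database name, dump file name and implicit MySQL credentials are assumptions from our setup, so adjust them before running anything.
# 1. Check the collations of the existing tables (database name is an example)
mysql -e "SELECT TABLE_NAME, TABLE_COLLATION FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'neutron';"
# 2. Add the collation to every CREATE TABLE in the dump taken before the upgrade
#    (assumes the table definitions end exactly like the example above)
sed -i 's/DEFAULT CHARSET=utf8;/DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;/' neutron-dump.sql
# 3. Drop and recreate the neutron database, then import the fixed dump
mysql -e "DROP DATABASE neutron; CREATE DATABASE neutron;"
mysql neutron < neutron-dump.sql
# 4. Run the schema upgrade again
neutron-db-manage --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/plugin.ini upgrade head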
CentOS7 Upgrade
The CentOS7 upgrade didn't bring many unexpected problems. This was mostly due to tons of preparation and testing. However, one thing stood out. After running puppet on our network nodes once, we couldn't do it again. It just hung. Some debugging later we found that it was actually facter that died. Some more debugging led me to this file
/usr/share/ruby/vendor_ruby/facter/dhcp_servers.rb
It's a facter fact we don't do anything with. It goes through each interface with nmcli to look for some DHCP information. With 300+ routers/DHCP agents and 700+ interfaces, you're going to have a bad time. For now, it's fixed by editing the file so the fact doesn't run.
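If you want to confirm you're hitting the same thing, timing that single fact is enough. A rough sketch of one way to poke at it (the fact name is the one above; the numbers obviously depend on your node):
time facter dhcp_servers
# while that crawls along, watch it spawn an nmcli call for each interface in turn
watch 'pgrep -af nmcli'
# and count the interfaces on a network node to see why it takes forever
ip -o link | wc -l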
Poweroff
This is a freaky non-OpenStack issue. We run HP SL230 scale-out nodes in HP SL6500 chassis (chassises? what is the plural anyway?) as our compute nodes. The HP firmware management is annoying at best, but this time it was even worse. We tried to upgrade the chassis power controller firmware on two of the chassis, but it failed. Trying to run it again just said
/usr/lib/i386-linux-gnu/hp-firmware-cdalevgen8-6.2-1.1/cpqsetup
Flash Engine Version: Linux-1.5.4-2
Name: Online ROM Flash Component for SL Chassis Firmware for Linux - HP ProLiant SL230s/SL250s Gen8 Servers
New Version: 6.2
The software will not be installed on this system because the required
hardware is not present in the system or the software/firmware doesn't
apply to this system.
This server does not have a Chassis Power Controller.
The nodes in the chassis couldn't see the power or fan status. Fine, it had no immediate effect, and since the power break was coming, we thought a power cycle might fix it.
That's a big Nope. It made it worse in an annoying, non-obvious way. The nodes came back. They worked. But they were 10-20 times slower than the other nodes. There was no load on the system, but the nodes were slooow, and in practice unusable. And we were still unable to patch the firmware issue. We're still figuring out whether we have two large bricks in our rack. Really HP, really?
Nova API Memory Consumption
After we thought we had everything working, we were happy. It didn't last long: slowly but surely the system deteriorated and broke. Soon you couldn't launch new VMs. Gradually, all our nova-api processes started taking up 8 GB of memory, and after that they were pretty much dead.
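Nothing fancy was needed to spot this, just sorting processes by resident memory (a sketch; how nova-api shows up in the process list depends on your packaging):
ps -eo rss,etime,cmd --sort=-rss | grep '[n]ova-api' | head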
The cause for this was one of our accounting scripts. It was calling the nova-api with
/v2/service/os-simple-tenant-usage?start=2014-06-20T01:01:06.454077&end=2015-11-06T02:01:06.201048&detailed=1
So due to a bug, it requested all usage data for everyone for the last year and a half. That didn't use to be a problem, but now it was. We don't know which change made it blow up now (again: small incremental updates), but fixing the script took care of it.
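The fix itself was boring: ask for a bounded time window instead of everything since mid-2014. For reference, the equivalent sane query through the nova CLI looks roughly like this (it hits the same os-simple-tenant-usage API; the dates are just an example of a one-day window):
nova usage-list --start 2015-11-05 --end 2015-11-06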
Conclusion
Small. Incremental. Updates.