Here's my traditional OpenStack upgrade blog post. I write one of these after each of our major OpenStack upgrades.
Our stack: CentOS 7, RDO, Puppet + Ansible, Linuxbridges + VLAN, Ceph
We're a dinosaur that still runs monolithic API nodes (most services live on a pair of VMs), which is relevant for this procedure.
Credit where credit is due: I'm just a secretary writing up the report; most of the work was done by @carloscar and Darren Glynn.
We had fallen a bit behind in our updates, so for the last few upgrades we have jumped a version. We went from Juno to Liberty, and now from Liberty to Newton.
"You can't do that anymore!"
Yeah, you're right. But if you twiddle and fix a bit, you can almost do that.
The High Level
You can't easily jump over OpenStack versions any more, mainly because of Nova's "online data migrations".
Nova does more than database schema migrations in an upgrade. It also does actual data migrations, usually in the background after the upgrade is done. What data gets migrated depends on what changed in Nova. The important thing is that these data migrations must be finished before moving on to the next version. Luckily, you can trigger them with nova-manage without running the daemons.
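As a rough illustration of what the trigger looks like (check the nova-manage documentation for your exact release, the options have shifted over time):
# after the schema migrations (db sync / api_db sync) are done
nova-manage db online_data_migrations   # run until it reports nothing left to migrate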
To work around this, our plan was:
* Install a virtual environment with the Mitaka components on a Liberty API node (rough sketch after this list).
* Install a virtual environment with Newton Nova (explained later) on a Liberty API node.
* Stop all services.
* Do a database backup dump.
* Do all Mitaka DB upgrades and the Nova online data migration in the Mitaka environment.
* Do the Newton Nova DB upgrade and the online data migration in the Newton environment.
* Redeploy API nodes. This includes running puppet on them to get everything to Newton.
* Update network nodes.
* Update compute nodes.
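The virtualenv part is nothing exotic. A minimal sketch for the Mitaka Nova environment, assuming you install from PyPI and pin to the Mitaka series (the same idea applies to keystone, glance, neutron and cinder; the path and pins here are just examples):
# on a Liberty API node, a throwaway environment used only for the DB work
virtualenv /opt/upgrade/mitaka
/opt/upgrade/mitaka/bin/pip install -U pip setuptools
/opt/upgrade/mitaka/bin/pip install 'nova>=13,<14'   # 13.x is the Mitaka series
# plus whichever MySQL driver your connection string expects (PyMySQL or MySQL-python)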
So we installed Mitaka in a virtualenv and did all the Mitaka DB upgrades for all OpenStack services. You might ask why we didn't just do a full Mitaka upgrade and then a Newton upgrade?
- The Mitaka virtualenv installation is much more lightweight than a whole Mitaka upgrade.
- We don't run any Mitaka services at any point, so we don't need to do functionality tests, as long as the DB gets updated correctly.
- We don't have to update our Puppet code for Mitaka; we just use the Liberty config files. We mainly need the DB connection info from them (see the sketch after this list).
- We don't have to worry about compute or network nodes.
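Because the Liberty config files already contain the DB connection info, the virtualenv tools can be pointed straight at them. Roughly, for Nova (the other services have equivalent *-manage commands):
# inside the Mitaka virtualenv, reusing the existing Liberty configs
nova-manage --config-file /etc/nova/nova.conf api_db sync
nova-manage --config-file /etc/nova/nova.conf db sync
nova-manage --config-file /etc/nova/nova.conf db online_data_migrations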
Liberty -> Newton issues
We did face a bunch of issues - as normal - when preparing the upgrade. Here's a non-exhaustive list of the problems and the fixes.
Newton nova database procedure
Now to the reason we ran the Newton Nova online data migrations in a virtual environment, instead of just doing them after deploying Newton.
We hit a database timezone issue in the upgrade scripts. We needed to fix the tzinfo handling in objects/aggregate.py and objects/flavor.py (example below). Since we already had the code for setting up a virtual environment, it was simpler to patch a one-off virtualenv than to carry the changes with us.
@db_api.api_context_manager.writer
def _aggregate_create_in_db(context, values, metadata=None):
    query = context.session.query(api_models.Aggregate)
    query = query.filter(api_models.Aggregate.name == values['name'])
    aggregate = query.first()

    if not aggregate:
        aggregate = api_models.Aggregate()
        # our added lines: strip the tzinfo so the datetimes insert cleanly
        created_at = values['created_at'].replace(tzinfo=None)
        updated_at = values['updated_at'].replace(tzinfo=None)
        values['created_at'] = created_at
        values['updated_at'] = updated_at
Database collation
I think we've had database collation issues in every upgrade so far. Collation basically decides how string comparisons are done. The main problem is that it has to be consistent when you use foreign key constraints: when a database upgrade adds tables and they get a different collation than the existing tables, the upgrade will fail.
Since utf8_general_ci is the default, once we had shut down all API services we changed all collations to utf8_general_ci. Note that collations can also be set at the table and column level, so you need to update all of them.
Roughly, as a shell sketch (assumes the mysql client can already authenticate, e.g. via ~/.my.cnf):
for db in $(mysql -N -e "SHOW DATABASES" | grep -Ev '^(information_schema|mysql|performance_schema)$'); do
  {
    echo "SET foreign_key_checks = 0;"
    echo "ALTER DATABASE $db CHARACTER SET utf8 COLLATE utf8_general_ci;"
    mysql -N -e "SELECT table_name FROM information_schema.tables WHERE table_schema = '$db' AND table_type = 'BASE TABLE'" |
      while read tbl; do
        echo "ALTER TABLE $db.$tbl CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"
      done
    echo "SET foreign_key_checks = 1;"
  } | mysql
done
RPM dependency issue on network node
There was a small silly thing during the upgrade. The Newton RDO neutron-linuxbridge package had a problem with its dependencies: it did not require a new enough version of python2-pecan, which caused neutron-linuxbridge-agent to fail. Manually updating python2-pecan solved it.
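On the affected network nodes that boiled down to something like this (package and service names as packaged in RDO Newton, to the best of my knowledge):
yum upgrade -y python2-pecan
systemctl restart neutron-linuxbridge-agent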
Keystone v3 update for services
We updated all our services' keystone_authtoken config sections to Keystone v3. This was a bigger change in Puppet than in the config files themselves, as the module had changed a bit.
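For reference, the end result looks roughly like this (hostnames and values are placeholders; the option names are the usual Newton-era keystonemiddleware ones):
# grep -A8 '^\[keystone_authtoken\]' /etc/nova/nova.conf
[keystone_authtoken]
auth_uri = https://keystone.example.com:5000
auth_url = https://keystone.example.com:35357
auth_type = password
project_domain_name = Default
user_domain_name = Default
project_name = services
username = nova
password = <secret>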
New launch dialog
Newton also brought a new VM launch dialog. It had silly defaults; for example, the default was to boot from volume. We didn't want to deal with that right now, so we set these Horizon flags to use the old VM launch dialog.
# grep ^LAUNCH_INSTANCE /etc/openstack-dashboard/local_settings
LAUNCH_INSTANCE_LEGACY_ENABLED = True
LAUNCH_INSTANCE_NG_ENABLED = False
Non-migrated compute nodes in db
The nova online migrations we talked about didn't handle deleted compute nodes correctly, and refused to migrate them. Our solution was just to delete the old hypervisors from the database.
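We did that straight in the nova database. A hedged sketch of the idea, assuming the stragglers are the soft-deleted rows (check what the statement matches, and try it against a backup first):
mysql nova -e "DELETE FROM compute_nodes WHERE deleted != 0;"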
Duplicate compute nodes in Newton
Then we get to one of the bugs where the upgrade went well, but we noticed issues after the fact. Newton seems like an awesome release, since it almost doubled our compute capacity. For some reason we got duplicate hypervisors in the compute_nodes table in the nova database. This might be because we never ran Mitaka on the compute nodes?
The difference was the "host" column for each entry: the old ones had it as NULL, the new ones had it set to the hypervisor_hostname. We brute-force resolved it with
delete from compute_nodes where host is NULL
Live migration woes
We also upgraded libvirt while we were at it, and didn't notice that this actually broke our live migrations. Libvirt used to do live migration over just port 16509. Newer versions use on-demand ports in the range 49152-49215 (by default), so that range has to be open between compute nodes as well.
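If you firewall traffic between compute nodes with firewalld, the fix is something along these lines (the range matches the libvirt defaults; adjust it if you have changed migration_port_min/max in qemu.conf):
firewall-cmd --permanent --add-port=49152-49215/tcp
firewall-cmd --reload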
Upgrade testing recommendations
Read the upgrade notes for each release you're upgrading to or jumping over. Then read them again. And once more. It might seem boring, but it pays itself back, I promise.
Do a database dump of your production environment, import it into a development environment, and test the DB update procedure against the production data. We started doing this a few releases ago, and we have always found problems. These problems are so much nicer to fix with a coffee cup in your hand and your feet on the table than in a panic on upgrade day.
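The mechanics are nothing fancy; something like this (file names and hosts are placeholders):
# on the production DB host
mysqldump --all-databases --single-transaction | gzip > openstack-db.sql.gz
# on the development DB host
zcat openstack-db.sql.gz | mysql
# then run the whole upgrade procedure against the imported data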
For this upgrade we ansiblized most of our upgrade procedure. We use Ansible for a lot of our infra management anyway, so it's a natural fit for us. It might not work for you, but trying to automate as many of the upgrade steps as possible is a good idea. It reduces the chance of mistakes, and you'll have a starting point for the next upgrade. Automating the procedure also speeds up testing it, which makes it easier to run through the upgrade several times and make sure it works. It also reduces the reliance on written instructions, which are easy to skim and easy to miss steps in.
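Even ad-hoc commands get you a long way before the playbooks exist. For example, stopping the API services across the control plane looks roughly like this (the inventory group and service names here are made up for the example):
ansible api_nodes -b -m service -a "name=openstack-nova-api state=stopped"
ansible api_nodes -b -m service -a "name=openstack-nova-conductor state=stopped"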
So, how did it go?
This upgrade preparation took quite a while. A big reason was that the work was mainly done by team members who had never prepared a complete upgrade before. This meant it was a bit slow, but it was also a great learning experience, so we're confident the effort will pay itself back.
I'm happy to say that the largest problems we faced were with the functionality tests after the upgrade. Not getting them to pass, but actually getting them to run. We are moving to Rally/Tempest, so even that should go smoother in the future.