Generic disclaimer: this applies to OpenStack Juno (though up to Mitaka I haven't seen a built-in way to do this).
Specific disclaimer: The code for this was written by my colleague Risto Laurikainen, and the credit goes to him.
Let's just jump right into it. We have a small scheduling problem in OpenStack Nova. To understand it, let's explain our offerings a bit.
We're traditionally an HPC house, so HPC is a large part of our OpenStack offering too. We offer non-oversubscribed flavors where you can really use the CPU. These flavors range from one CPU core to full-node instances. That is, you get every single core and all the memory of a compute host.
Scheduling these full-node VMs naturally requires completely free compute hosts. This means we need to set our Nova scheduling policy to "pack" (ram_weight_multiplier = -1.0), so the nova scheduler fills up compute nodes before moving on to the next. As long as there is any room left on the cluster, we can probably schedule full-node instances.
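For reference, the "pack" policy is just the standard Nova scheduler option in nova.conf (the option name is real; the value shown is the one described above):

```ini
# nova.conf on the scheduler node
[DEFAULT]
# Negative: fill hosts up before moving on ("pack").
# Positive (default 1.0): prefer hosts with the most free RAM ("spread").
ram_weight_multiplier = -1.0
```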
Now enter our new offering (not quite yet in production), IO intensive compute nodes. These run VMs on local SSDs for fast IO performance. This is for anything from Hadoop to IOPS intensive DB work.
Suddenly packing instances seems like a horrible thing to do. Spreading them out would allow the use of the whole IOPS capacity of the cluster, while packing just increases contention.
Or I'll throw you into a cell!
Currently we use host aggregates for partitioning hardware. So HPC flavors get scheduled to the HPC aggregate, which consists of the HPC nodes, and IO flavors get scheduled to the IO aggregate.
The problem is, ram_weight_multiplier is a nova.conf option, and can't be set per aggregate. So either we pack everywhere or spread everywhere. To have different scheduling policies, we'd have to use nova cells with different configs for the schedulers in different cells. However, we'd prefer to wait for cellsv2 before we start using them.
Per aggregate scheduling weights for nova-scheduler
So we (well, Risto) ended up writing this.
```python
from oslo.config import cfg

from nova.i18n import _LW
from nova.openstack.common import log as logging
from nova.scheduler import weights
from nova.scheduler.filters import utils

ram_weight_opts = [
    cfg.FloatOpt('ram_weight_multiplier',
                 default=1.0,
                 help='Multiplier used for weighing ram. Negative '
                      'numbers mean to stack vs spread.'),
]

CONF = cfg.CONF
CONF.register_opts(ram_weight_opts)

LOG = logging.getLogger(__name__)


class AggregateRAMWeigher(weights.BaseHostWeigher):
    def weight_multiplier(self):
        # Twice the configured multiplier, so that our weight can
        # cancel out the standard RAMWeigher and flip its decision.
        return 2 * CONF.ram_weight_multiplier

    def _weigh_object(self, host_state, weight_properties):
        aggregate_vals = utils.aggregate_values_from_db(
            weight_properties['context'],
            host_state.host,
            'pack_instances')

        if not aggregate_vals:
            # No per-aggregate policy set: stay out of the way.
            return 0

        # The nova.conf default: a negative multiplier means "pack".
        default_pack_instances = CONF.ram_weight_multiplier < 0

        try:
            # min() over booleans defaults to spread (False) if the
            # host's aggregates conflict.
            pack_instances = min(
                {"true": True, "false": False}[val.lower()]
                for val in aggregate_vals)
        except KeyError as e:
            LOG.warning(_LW("Could not decode pack_instances value: %s"), e)
            return 0

        if default_pack_instances == pack_instances:
            # Same policy as nova.conf: this weigher has no effect.
            return 0
        # Different policy: combined with the doubled multiplier this
        # reverses the pack/spread decision for this host.
        return -1.0 * host_state.free_ram_mb
```
A single new weigher that you drop into
/usr/lib/python2.7/site-packages/nova/scheduler/weights/
With the default config, this gets used by nova-scheduler automatically after a restart.
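That works because the Juno scheduler's default weigher list loads every weigher class found in that package. The option below is real in Juno and shown only to explain the auto-discovery; you normally don't need to set it:

```ini
[DEFAULT]
# Juno default: load all weigher classes found in nova.scheduler.weights,
# which is why a file dropped into that directory is picked up.
scheduler_weight_classes = nova.scheduler.weights.all_weighers
```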
What it does
This is a bit hacky (we settled on it after long discussions), since we don't want to touch the scheduler code itself if we can help it.
First of all, weight_multiplier returns 2 * the configured ram_weight_multiplier. This is so that we can (if necessary) cancel out and reverse the decision made by the standard RAMWeigher.
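To see why doubling works, here's a back-of-the-envelope calculation. The numbers are made up, and it assumes weigher contributions combine as multiplier * value, as in the Juno-era scheduler:

```python
# Illustrative numbers only: how 2x the multiplier flips pack into spread.
ram_weight_multiplier = -1.0     # nova.conf says "pack"
free_ram_mb = 8192               # free RAM on some host

# Standard RAMWeigher contribution: multiplier * free_ram_mb
ram_weigher = ram_weight_multiplier * free_ram_mb

# AggregateRAMWeigher on a host whose aggregate says pack_instances=false:
# weight_multiplier() returns 2 * multiplier, _weigh_object returns -free_ram_mb
aggregate_weigher = (2 * ram_weight_multiplier) * (-1.0 * free_ram_mb)

# Sum: -8192 + 16384 = 8192, i.e. +1.0 * free_ram_mb,
# exactly what ram_weight_multiplier = 1.0 ("spread") would have produced.
combined = ram_weigher + aggregate_weigher
```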
Then comes the interesting part in _weigh_object. This code should really live in weight_multiplier, but that function doesn't have all the necessary information available (weight_properties).
In _weigh_object we check whether the host's aggregate(s) have a scheduling policy set (if they conflict, we default to spread). If a policy is set, we check whether it matches the one configured in nova.conf. If no policy is set, or it matches nova.conf, we return 0, which means this weigher has no effect.
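The conflict handling relies on min() over booleans (False < True in Python), as this hypothetical two-aggregate example shows:

```python
# Sketch of the conflict rule: min() means any aggregate that says
# "false" (spread) overrides one that says "true" (pack).
decode = {"true": True, "false": False}
aggregate_vals = ["True", "false"]   # a host in two aggregates that disagree
pack_instances = min(decode[v.lower()] for v in aggregate_vals)
# pack_instances is False: the host ends up weighed for spreading.
```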
If the policy differs from the default, we return -1.0 * free_ram_mb.
Since the multiplier is 2 * ram_weight_multiplier, we basically revert the spread/pack decision, and keep the values correct in case you have other weighers too.
How to use it
So, how do you actually use this? At your own risk!
Ok, more seriously. Just drop this file into
/usr/lib/python2.7/site-packages/nova/scheduler/weights/
Restart nova-scheduler.
Then you can add host aggregate metadata. Set either pack_instances=true or pack_instances=false for your aggregate, and Bob's your uncle.
nova aggregate-set-metadata <id> pack_instances=false
We haven't done performance testing, but it hasn't broken anything for us. We're also not sure how maintainable this is across releases, but since it's just one file you drop in, keeping it working should be lightweight. There might of course be problems with newer versions of OpenStack that ship more weighers, but I don't see why there would be.
Geek. Product Owner @CSCfi