Agile Services - Part 6: The Long Term
This is the sixth post in the series about how to apply Agile methods to run services based on existing software, and I think this will be the conclusion of the series.
The first post, describing the problem at some length, is here, but the TL;DR is:
I don't think there are great resources on how to apply agile methods to run and develop services based on existing (open source) software. We have struggled with it, and I try to write down practices that work for us, mostly taken from Scrum and SRE.
The second post discussed how the service lifecycle looks and when, what kind, and how much work we need to apply to the service.
The third post tried to classify the work that goes into the day-to-day administration of a production service.
The fourth post discussed teams vs. individual admins and some important aspects of a team.
The fifth post discussed the team processes, and was basically the original post I wanted to write.
In this final post, I will tackle something challenging: the 2-5 year view of a service.
Normal disclaimer: these are my own opinions based on my own experiences.
Predicting the Future
There is an old quote that I like (attributed to tons of different people)
It’s difficult to make predictions, especially about the future.
I think our field is hit hard by this. It takes a lot of experience to estimate the long term impact of our actions. Decisions we make (or don't make) today can have huge unintended costs or consequences several years into the future.
So we should make good - or at least less bad - decisions. Not all decisions have a long-term impact, and we should identify which ones do.
The importance of long-term planning of course differs per service. Is our service something that is used in the long term? Do we already have an end date for the service? How much integration effort have others put into our service? What is the cost of migrating away?
I'll go through a few examples using diagrams. There are two main things to focus on in each. The upper picture shows where our team effort goes; the important part is how much dev bandwidth the team has, as this affects the service value with a longish delay. The lower picture shows the service value over time.
But on to the topic. Let's start with a trivial example.
A Good, Well-Planned, Constantly Developing Service
I guess we all wish we had this one. I'm mainly adding this to have a reference picture of what I think is pretty much optimal.
Here we see that we have a good mix of operational and development work in the team. We seem to do enough internal development work to keep the ops work in check, while constantly increasing the value of the service.
Please note! Even if you feel like you don't need to increase the service value, it doesn't mean that your development effort goes to zero. You still have to do internal development to keep up with the changing world, or you start to erode the service value.
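
Since the diagrams themselves don't translate to text, here is a minimal toy sketch in Python of the dynamics the graphs describe. Every number in it (the ops growth rate, the value gained per unit of feature work, the simple one-quarter delay) is an assumption chosen for illustration, not a measurement.

# Toy model of the effort/value dynamics behind these graphs.
# All numbers are invented; only the general shape matters.

def simulate(quarters=12, ops_start=0.3, ops_growth=0.0, internal_share=0.2):
    """Simulate the team's effort split and the service value per quarter.

    ops_start:      fraction of team time eaten by ops at the start
    ops_growth:     how much ops work would grow per quarter if left unchecked
    internal_share: fraction of dev time spent on internal work (automation,
                    debt payback) that pushes ops work back down
    """
    ops, value = ops_start, 1.0
    history = []
    for quarter in range(quarters):
        dev = max(0.0, 1.0 - ops)            # whatever ops leaves over
        internal = dev * internal_share      # keeps ops work in check
        feature = dev - internal             # grows the service value
        history.append((quarter, round(ops, 2), round(dev, 2), round(value, 2)))
        # Effects land with a delay: this quarter's work shows up later.
        value += 0.3 * feature
        ops = max(0.05, ops + ops_growth - 0.5 * internal)
    return history

# The healthy case: ops growth is roughly cancelled out by internal work.
for row in simulate(ops_growth=0.02, internal_share=0.25):
    print(row)

Cranking up ops_growth and dropping internal_share towards zero turns the same loop into the curve of the next section: dev bandwidth shrinks quarter by quarter until the value growth flatlines.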
Looming Engineering Bankruptcy
Or: Failing due to success.
So let's take a case that I've seen a few times and that is hard to identify early on. Let's say we have a service that gets popular and is growing strongly. Growth should not cause a linear increase in ops work, as we can mitigate it with internal development. But ops work still grows.
We have faced problems ourselves by not reacting to gradual ops growth.
How does this affect our work? As time goes on, let's say a year or two, our development effort gets an ever smaller share of the time. This happens gradually, and the lack of dev work takes a while to affect the service.
This seems to be a hard problem to tackle. I think the main reason is that the problem is only visible within the team for the longest time. After a while, our stakeholders start noticing that development requests move slowly. Last of all, it affects the service itself.
When the problems become visible, we may already be in quite a bad state. We can't spend time on development, which would ease our ops burden, and maybe the ops work is even hard to automate away. We have probably collected a lot of technical debt that needs to be paid off. In the worst case, we're heading towards engineering bankruptcy.
If we're already in this situation, it will take time and resources to climb out, and how we do it really depends on the circumstances. Adding people helps - with a delay (see next graph). Delegating ops work away to other teams can help immediately.
I haven't personally seen a situation where declaring bankruptcy - shutting down the service and starting from scratch - has been the preferred option, but I'm sure you can get there too. I have the feeling it's usually a very, very expensive way to solve the problem.
Handling Admin Churn and Training New People
Competent admins leaving often has a big impact on the team's performance. In the best case we can mitigate the impact, but it can't be avoided.
First of all, you lose a productive team member. As you can't make your ops work go away, you have to reduce the dev work. This makes the team slower, but it may not be too bad - yet.
Then you hire a new person. Of course we all want to try to hire the unicorn, who magically has the same knowledge of the internal system as the person who left, but that's not possible. So we should train them (*). This takes even more effort from the team. That effort comes from dev work - as it's the only place to take it from. Now your dev work is crawling ahead, while you're training the new person. The new person's productivity is probably very low in the beginning. How fast it grows depends on the training and complexity of the service.
(*) We can of course decide to focus on dev work and not train the new person; they can learn on their own. This may have slight short-term gains, but in the longer term it hurts the team in many ways, from productivity to team cohesion. And team cohesion can be a large factor in people leaving in the first place.
A less drastic version of this situation is growing the team. If nobody left, you don't lose total team productivity when you hire new people, but again you have to move focus from dev work to training. This also explains the age-old adage that throwing more people at a late project makes it finish later.
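
To make the adage concrete, here is a back-of-the-envelope sketch with invented numbers: the existing team loses some capacity to mentoring while the new person slowly ramps up, so total dev capacity dips before it grows.

# Rough arithmetic on why a new hire can slow the team down at first.
# All numbers are assumptions for illustration only.

team_dev_capacity = 3.0                    # person-equivalents of dev work today
training_cost     = 0.5                    # capacity spent on mentoring at the start
new_hire_output   = [0.1, 0.3, 0.6, 0.9]   # assumed ramp-up per quarter

for quarter, output in enumerate(new_hire_output, start=1):
    mentoring = training_cost * (1.0 - output)   # tapers off as they ramp up
    total = team_dev_capacity - mentoring + output
    print(f"Q{quarter}: dev capacity ~ {total:.2f} (was {team_dev_capacity})")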
Dev Decisions Affecting Ops and Taking Tech Debt ...
This is quite a simple graph - and one of the more important ones to justify team autonomy.
You know the drill. We're working on something, but stakeholders/customers/managers need it done yesterday for good/bad/non-existent/silly reasons. This is when we take shortcuts and take on tech debt. Probably people are tired or stressed and say something like...
Let's just ignore processes this time, if we get this out, everybody will stop shouting at us.
(Spoiler: no they won't, there is always the next thing).
There are reasons to take on tech debt, but then we need to pay it off, or it will affect our dev capacity in the future. If we don't pay it off, we'll land ourselves in a mess similar to the engineering-bankruptcy one, but instead of getting there because our service is popular and growing, we get there because we chose to.
Taking on tech debt also requires clear communication because, as in the other case, the problems are rarely visible outside the team early on, and it may require the team to slow down business development to work more on paying the tech debt back.
... and What to Do About the Tech Debt
After the previous graph, this one is almost self-explanatory. If we've taken on tech debt, at some point we should pay it off. Yes, it'll slow down the business development in the short term, but it will pay itself back quite quickly.
Even outside known tech debt, we should keep an eye out for good ways to reduce long-term ops work with some internal development. This often has an excellent payoff.
We do need to keep our estimates of the effort/payoff ratio of our internal development realistic. It's really annoying to go down an ever-deepening rabbit hole for weeks or months, only to finally fix something that didn't bring that much improvement.
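
One way to keep ourselves honest is a quick payback estimate before starting. The numbers below are pure assumptions; the shape of the calculation is the point.

# Back-of-the-envelope payback estimate for a piece of internal development.
# The numbers are assumptions; the point is to sanity-check the ratio
# before disappearing down a rabbit hole.

effort_hours       = 3 * 40   # three weeks of focused dev work
ops_saved_per_week = 4        # hours of recurring ops toil the change removes

payback_weeks = effort_hours / ops_saved_per_week
print(f"Pays for itself in about {payback_weeks:.0f} weeks "
      f"({payback_weeks / 52:.1f} years)")
# 120 / 4 = 30 weeks: probably worth doing. If the same effort only saved
# half an hour a week, the payback would take years - a warning sign.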
Hardware Decisions and Other Possible Time Bombs
This is where we really can mess up our service in the long term, and de-saturate the color of our admins' hair.
There are a lot of possible decisions that can affect operations. For those of us who still play with hardware, hardware decisions are one of them. Unlike the other graphs, the impact of these decisions is not always gradual; it may hit pretty much all at once.
Of course, if we know our service will be pushed into Horizon 0 by then, it may not matter. But if we don't have plans for shutting our service down, it matters.
I have a general rule of thumb for any new hardware, or any new tight integrations to the service.
We must have a feasible exit plan - or a plan for several hardware generations - before we make the decision.
Let's say we buy an appliance of some kind - e.g. storage for our service. It immediately improves the value of the service and the customers are happy. We (and maybe the customers) start tightly integrating with it - it's not useful if it's not used.
Now a lot of our integrations rely on this hardware, and we realize it reaches End of Life in a year. If we shut it down, we significantly cripple our service. We don't have a plan for getting rid of it. We don't know how to move the integrations/load away from the appliance. Maybe we need to redo a lot of the integrations from scratch. Maybe we need to handle every case manually.
I have had this happen to me. It's really not fun, and it makes me curse the idiot (me) who went with these solutions in the first place.
So we should plan well in advance. The lifecycle of a hardware generation passes faster than we can imagine, and uptake may grow slowly over a few years. This means we only get a few really valuable years out of the hardware.
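
As a rough illustration with assumed numbers, the window of really valuable years can be surprisingly narrow once you subtract the uptake ramp and the migration buffer needed before End of Life.

# Rough timeline arithmetic for a hardware purchase (assumed numbers).
# The "really valuable" window is what is left after the uptake ramp
# and the migration buffer we need before End of Life.

lifetime_years  = 5.0   # purchase to End of Life
rampup_years    = 1.5   # until integrations and uptake are really in place
migration_years = 1.0   # buffer needed to execute the exit plan before EOL

valuable_years = lifetime_years - rampup_years - migration_years
print(f"Really valuable years: {valuable_years} out of {lifetime_years}")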
This effect is of course not only limited to hardware, but also other integrations we may need to get rid of. Hardware is just a very useful example, as it has a built-in expiration date.
Wrapping Things Up
Some of these graphs are both simplified and exaggerated at the same time, and the timeline is not perfect for all of them.
In real life, we probably have a constant mix of many of these situations going on at once. However, I hope the examples serve to convey the main point of the post: the decisions we make have a large impact on our workload and, by proxy, on our service. This doesn't mean we never need to make uncomfortable decisions, but when we do, we should have a good view of their impact. If we can prepare for the possible consequences in advance, everybody involved - from the team to the stakeholders and the customers - will be happier.
Really Wrapping Things Up
This was the last blog post in this series! I really hope it has helped somebody! Writing this whole thing down has also clarified my own views on the topic, and I've gone back and forth on many points trying to see if I can actually justify them. While I've gotten a lot of suggestions for improvements and had discussions about the finer points, in general I think I'm quite happy with how it turned out.
But - as always - I am just one person with one point of view. So if you want to share your own experiences, I would be more than glad to hear them! It's always better to try to learn from others' experiences than to do everything by ourselves.