Hassan: Hello, and welcome to Ctrl Alt Deliver. I'm your host, Hassan. Today, we're exploring one of the most challenging decisions faced by tech leaders: whether to rewrite a legacy system or find a way to modernise it without starting over.
My guest is Tom Hill. He's an experienced technology leader who has guided multiple teams through some of these critical choices. We'll discuss how to avoid the rewrite trap and make decisions that are in line with what's best for the business. So welcome to the show, Tom.
Tom: Thank you for having me.
Hassan: Why don't you tell us about yourself and what you're working on these days?
Tom: My background is pretty checkered, really. I've taken a non-traditional path as a technologist. I started as a sound engineer and a session musician, completely unrelated, then moved into graphic design, went from graphic design to web design, then into digital marketing, before eventually finding my way into software engineering.
From there I worked my way up to being a software architect and then joined the dark side of people management, where I've been ever since, holding different levels and styles of positions at different types of companies.
I now find myself as the Director of Engineering for branding content protection at a company called Coresearch, which is a legal tech business, and part of my remit is managing the teams that look after what we now call our legacy systems, as well as looking after one of the new strategic directions for our business in content protection. So, the topic of being tempted by rewrite and legacy is certainly a pertinent one for me right now.
Hassan: What an interesting background, and I must say some of the signs of the sound engineering are sort of still visible on your wall.
Hassan: It's interesting. So let's dive in then. Let me ask you this: why do you think teams sort of gravitate towards this option of rewriting a system from scratch in the first place?
Tom: I think it comes down to two things. One is the idea that starting from scratch and building something new, now that you have more information, is gonna give you this magic ability to deliver a better outcome at the end of it. As humans, we're terrible at estimating and predicting time and complexity. The other side to it is time. Obviously, most systems are built in very small sections, piece by piece, so you only have a very limited bit of context, and you don't know what's gonna come in the future. And then, as with any system, as it develops over time, you get more and more emergent properties, some good, some bad.
But normally people are tempted by the rewrite because, by the time they've gathered enough context and information to really understand the customer requirement properly and to know where they are today, they get this false impression that all the decisions made in the past were bad, that they were suboptimal and won't work, and that "now I know how to build it perfectly".
The honest answer is most of the time people don't, even if you are to go and rewrite afterwards, which seems appealing because you feel like you now know all the answers. Things continue to develop, and actually, you'll just find yourself back in the same position again several years later, because there will be new emerging properties, there'll be new requirements, there'll be changes in direction, there'll be new technology that exists now that didn't exist 4 years ago. And the same will be true whether you rewrite or not.
Hassan: So, have you been in a similar situation, where there was very strong momentum towards a rewrite? And if so, how was that conversation triggered?
Tom: 100%. And to be honest, as with every person that's ever worked in software engineering, I've been the person firmly leading the charge on, hey, I want to rewrite this thing, let's build it from scratch, it'll be so much better at the end of it. A lot of the time, the thing that triggers that conversation is complexity and slowness. The longer a system has been in place, the more complex it becomes to add new things to it. And if there's not been a really intelligent and deliberate approach to predicting change (not what the change is going to be, but just writing software in a way that it is changeable, that it's soft, that it's not hardware, it's not set in place), you end up with brittle code.
So you need to do a simple change, something that feels like it should be really easy, and you feel like you're constantly battling against this bit of your system, or even the whole system itself. You feel like you're on a treadmill: no matter how fast you run, you don't get anywhere any quicker. So people just get frustrated, and they reach a point where they're like, I'm better off rewriting this. It's gonna take me so many days to add such a simple feature. I might as well spend that same amount of time just rebuilding the whole thing from scratch.
Hands up, I've definitely been guilty of being the person who got stuck behind complex legacy code, got frustrated and fed up with trying to battle against it, and decided I should tell everyone we need to rewrite it instead.
Hassan: Well, it's good that you're experienced because we're here to learn a few things from that. And if I were to ask you, you know, what are some of the hidden costs when someone is considering a rewrite? What is it that someone should be really conscious about?
Tom: I think the main thing is your customer at the end of it. The thing that we often lose sight of as technologists is what we're ultimately building systems for, which is to solve a problem for customers. What's very easy to do is get sucked into this idea that in order to be able to build something new faster, we have to completely change the landscape on which we're building right now.
So yeah, it's 100% possible you find yourself in a situation where things are so brittle that small changes do break things. They impact the customer, they prevent you from adding new features, they break existing features, and that has an impact on how the business is perceived. You might start to get churn as a result. Rewrites 100% look very appealing at this point in time. The risk is that by rewriting, you effectively leave people in the situation they are in now, which is potentially with a shaky platform that's maybe not delivering what they need.
And also, a full rewrite requires substantial effort. There is no miracle solution where you can continue to build all this luxury new stuff for your customers whilst building a completely new platform under the hood and eventually catch up. You'd have to double, if not triple, your engineering capacity and team size to do both. So the risk is always gonna be that if you dive into a full rewrite, you're potentially gonna do more harm than good. The last thing you wanna do is end up with the most perfectly written system and software architecture that anyone's ever seen, but no customers.
Hassan: In your experience, what are the signals and conditions that would truly justify a complete rebuild?
Tom: To justify a complete rebuild, I think you'd have to find yourself in a situation where the future for the existing platform is non-existent. For example, you might have built your entire platform and your entire business around a proprietary piece of tech, something owned by a Microsoft or an Oracle or someone else, and that company has turned around and decided, hey, we're not supporting this anymore.
A great example would be, not too long ago, you could have had your entire application written in Flash. And now all of a sudden everyone has universally agreed, hey, we're not supporting Flash anymore. Your application is just gonna stop dead in its tracks. In those situations, a complete rewrite is absolutely going to be necessary.
You can't even strangle it out, because you can't leave that piece of the application there for the foreseeable future: it's just gonna stop working at a given point in time. In that situation, the rewrite is 100% necessary because, beyond a given fixed point in time, your application has no future. It can't run, it won't be supported. Your users can't use it; it would be the end of your business without that complete rewrite. That's probably the one situation I can think of where it would be, hey, we've got to rewrite or we're done.
Hassan: Oh, that's very true. I guess one of the decisions that needs to be made is which of the team members would be responsible for doing the rewrite and which ones will continue to support the current system until the new system is live, because business goes on. Customer requirements are not gonna pause one day because you're working on this grand rewrite. So, how do you split the team members? What's the strategy around that?
Tom: I think it depends ultimately on the situation you find yourself in. You might be lucky enough to be in a situation where you can kind of piggyback ahead, which is almost where we've been at Coresearch. We have a team of engineers who were key knowledge holders on what we now dub our legacy platforms. But we're not rewriting the legacy platforms; we've built an entirely new product, and we've been able to leapfrog ahead through mergers and acquisitions.
So we've actually ended up with a new platform where we're just building new features on top, and we had a platform that was almost at parity with our legacy one anyway. So that legacy team are basically the critical knowledge holders who understood our old platforms. They're helping us keep those things alive. And we're able to move quite quickly from those to the new one.
If you're in a situation of a true rewrite, I wouldn't try to take that approach. I wouldn't dub one team the legacy team and some other team or group of engineers the people working on the forward-looking thing. You'd have to be more careful with your approach. The pattern that I find most useful in these situations is the strangler fig pattern. Rather than saying we're gonna do a full rewrite, you accept that you have a piece of your platform that works as it is. It solves the customer's problems. Maybe it's getting too brittle, maybe it's becoming too fragile, but instead of trying to rewrite it, you ring-fence it. You just say, OK, we're not gonna add any new code to this service, to this bit of the code base.
Instead, we're gonna build something new on the side. So we need a new feature: let's write that new feature in a clean code base, and then find a way to integrate that with our existing code base. Over time, you can build all of the new features in something clean, and when you need to touch or update an old feature, instead of rebuilding the whole thing, you rebuild just that piece of the feature in the new code base and then cut the old one out of the ring-fenced legacy system.
So what you end up with is, rather than one team of people that looks after legacy, every team does. Every team that owns a service that you've ring-fenced off continues to maintain and own that service. They just happen to now build new stuff in a new, separate, clean code base and starting point.
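The strangler fig approach Tom describes can be illustrated with a small routing facade. This is only a minimal sketch under stated assumptions, not how Coresearch does it: the names `StranglerFacade`, `legacy_handle` and `new_search`, and the "search" and "billing" features, are all hypothetical. The idea shown is that callers always go through the facade, the legacy system stays untouched behind it, and features move to the clean code base one at a time.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


def legacy_handle(feature: str, request: dict) -> dict:
    """Hypothetical stand-in for the ring-fenced legacy system (no new code goes here)."""
    return {"served_by": "legacy", "feature": feature}


class StranglerFacade:
    """Routes each feature to the new code base if migrated, else to legacy."""

    def __init__(self, legacy: Callable[[str, dict], dict]) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, feature: str, handler: Handler) -> None:
        # Register the clean implementation; from now on legacy is bypassed
        # for this one feature only.
        self._migrated[feature] = handler

    def handle(self, feature: str, request: dict) -> dict:
        handler = self._migrated.get(feature)
        if handler is not None:
            return handler(request)
        return self._legacy(feature, request)


def new_search(request: dict) -> dict:
    """Hypothetical feature rebuilt in the clean code base."""
    return {"served_by": "new", "feature": "search"}


facade = StranglerFacade(legacy_handle)
before = facade.handle("search", {})      # still served by legacy
facade.migrate("search", new_search)      # cut this one feature over
after = facade.handle("search", {})       # now served by the new code base
untouched = facade.handle("billing", {})  # not migrated, still legacy
```

In a real system the facade would more likely be an API gateway, reverse proxy or routing layer than an in-process class, but the shape of the decision is the same: one feature at a time, with the legacy path left in place until nothing routes to it.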
Hassan: Understood. You talked about the strangler fig pattern. When you're figuring out which parts of the system absolutely need a rework, whereas others may work, how do you make those decisions?
Tom: The approach I take comes back to the customer again. So it's looking at the parts of the system that are most frequently used by our customers, the parts of the system that are most important to our customers as far as cost. And you just want to be coming back to what's gonna have the greatest return on investment. There's no point in spending engineering time, effort and money to modernise something simply because we think it could be more optimal. If it works, and it works for our customers, and it's not having a negative impact on them, we should probably just choose to leave it alone for now.
We might still want to ring-fence it and say, hey, look, we don't want to do any more development on this because we know it's suboptimal, we know it's flaky, and we know it's gonna break. But that doesn't mean you should start modernising it. It just means that you should potentially ring-fence it off and wait. If we find ourselves in a situation where we need to add a new feature because it is a benefit, or this thing is breaking so often that it is causing pain, that's the signal that you then use to modernise or to build something else instead.
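The decision rule Tom outlines here can be written down as a small function. This is a rough sketch only: the `Component` fields and the three `Action` values are hypothetical names chosen to mirror his wording, and real prioritisation would weigh usage and return on investment rather than three booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LEAVE_ALONE = "leave alone"
    RING_FENCE = "ring-fence"
    MODERNISE = "modernise"


@dataclass
class Component:
    name: str
    known_fragile: bool            # we know it's suboptimal or flaky
    causing_customer_pain: bool    # it breaks often enough to hurt customers
    needs_beneficial_feature: bool # a new feature with clear benefit must land here


def decide(component: Component) -> Action:
    """Assumed reading of the rule: modernise only when there is a signal."""
    if not component.known_fragile:
        # It works and it works for customers: leave it alone for now.
        return Action.LEAVE_ALONE
    if component.causing_customer_pain or component.needs_beneficial_feature:
        # The signal to rebuild this piece in the clean code base.
        return Action.MODERNISE
    # Fragile but not hurting anyone: stop adding code to it and wait.
    return Action.RING_FENCE


stable = decide(Component("reports", False, False, False))
fenced = decide(Component("exports", True, False, False))
rebuild = decide(Component("checkout", True, True, False))
```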
Hassan: So, what would be your biggest piece of advice to technical leaders who are sort of struggling with legacy systems or frustrated teams who are in a way, being slowed down by legacy code and being unproductive?
Tom: To go back and start from first principles. You want to look at it and truly understand what is causing these issues and why, because it's very easy to assume we should just rewrite the whole thing. And actually, if you start to look into it, there might be other things that can be done instead. Maybe we're blaming the system, and it's not the fault of the system. Maybe the true reason that things keep breaking is that we lack test coverage. We lack a proper process around quality control. We've not got the correct process around education and knowledge sharing. Maybe it's not that the system is bad, maybe it's that the system is old and we no longer carry with us the knowledge of why things were built the way they were and what problems they solved for the customer. In that case it's not the fault of the system, it's a gap in knowledge, at which point you can start to dig in and attempt to learn it.
There's an interesting book, I think the title of it was Kill It with Fire. I think that might actually be the name of the book, now I'm trying to remember. But that's all about legacy systems and legacy system modernisation. And one of the suggestions in there is, if there are pieces of the platform that you don't know or understand, to almost deliberately break them. Not in production, obviously, but whilst running it locally: experiment, turn things off, turn them back on. See what happens upstream and downstream of the system to understand and learn its behaviour, and then document it. Because 9 times out of 10, the older a system becomes and the longer it is in production, the more out of date and out of tune the documentation becomes, if it even existed in the first place. So I dare say it's to start not by thinking about what needs to change within the system, but by thinking about the things that surround the system: human knowledge, the processes you have in place, the policies you have in place. Try to understand it first, because once you understand it, you can begin to make more intelligent decisions about what to do with it.
Hassan: Well, that's good advice. So that leads us to our quickfire round. Don't think too much about it. Just answer with whatever comes first to your mind.
Hassan: Remote work: love it or hate it?
Hassan: An app you can't live without?
Tom: I dare say at this point, as bad as it sounds, it probably has to be Slack.
Hassan: Favourite way to recharge outside of work?
Tom: Listening to music, reading a book, or just being in the middle of nowhere. Ideally a combination of them.
Hassan: If you weren't in tech, what would you be doing?
Tom: Something creative involving solving problems. I think that's clear from my career history. I would have found a problem somewhere that needed a creative solution; I'd be doing that.
Hassan: Well, that's it for this episode of Ctrl Alt Deliver. Thank you so much again, Tom, for joining us. It was insightful, and I look forward to catching up with you, maybe on a different topic someday.
Tom: Absolutely, I appreciate you having me on, Hassan. Enjoyed the conversation.