The True Costs of Architecture Complexity by Chris Shaffer

Cloud hosting bills are the tip of the iceberg


Let’s start by walking through an only slightly simplified version of a mistake I’ve seen three times in the past year, each time at a different company.

A developer needs to write a SQL query:

select thing_id, sum(my_column)
from my_table
where thing_id in (select id from @ids)
group by thing_id

Your developer either doesn’t know enough SQL to write that, or (more likely) they abuse their ORM. They end up with a piece of code that ultimately runs millions of queries like this one:

select * from my_table where id = @id

These results are all pulled to the application server and the summation is performed there.

That’s bad. (We all know that’s bad, right?)
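To make the difference concrete, here is a minimal sketch of both approaches. It makes several assumptions that are not part of the story above: an in-memory SQLite database, 100 rows of made-up data, and the function names sum_per_row_lookup and sum_set_based. SQLite has no equivalent of the @ids table parameter, so the sketch passes the ids as ordinary placeholders instead.

```python
import sqlite3

# Hypothetical setup: an in-memory SQLite table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table my_table ("
    "id integer primary key, thing_id integer, my_column integer)"
)
conn.executemany(
    "insert into my_table (id, thing_id, my_column) values (?, ?, ?)",
    [(i, i % 5, i) for i in range(1, 101)],
)

row_ids = list(range(1, 101))   # every row the application was handed
wanted_thing_ids = [1, 2, 3]    # stands in for the @ids parameter


def sum_per_row_lookup(conn, row_ids, wanted):
    """The mistake: one query per row, summation on the application server."""
    totals = {}
    queries = 0
    for row_id in row_ids:
        row = conn.execute(
            "select * from my_table where id = ?", (row_id,)
        ).fetchone()
        queries += 1
        _, thing_id, value = row
        if thing_id in wanted:
            totals[thing_id] = totals.get(thing_id, 0) + value
    return totals, queries


def sum_set_based(conn, wanted):
    """The set-based version: one query, summation in the database."""
    placeholders = ",".join("?" * len(wanted))
    sql = (
        "select thing_id, sum(my_column) from my_table "
        f"where thing_id in ({placeholders}) group by thing_id"
    )
    return dict(conn.execute(sql, wanted).fetchall()), 1


slow_totals, slow_queries = sum_per_row_lookup(conn, row_ids, wanted_thing_ids)
fast_totals, fast_queries = sum_set_based(conn, wanted_thing_ids)
print(slow_totals == fast_totals, slow_queries, fast_queries)
```

Both functions return the same totals. The only difference is how many round trips it takes to get there, and that number grows with the size of the table in the first version but not in the second.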


Somehow, and exactly how is a topic for a psychologist or anthropologist, the developer ends up concluding that the way to improve this piece of code is to use Redis (or ElasticSearch or MongoDB or a graph database) instead of SQL.

They run a “scientific experiment,” replete with loads of metrics down to the millisecond, and sure enough, conclude that the code is faster with the new database technology. Of course it is: they wrote a key-value document retrieval algorithm instead of a set-based one. At no point did they reconsider their algorithm, just their brand of alphabet soup.


Okay, so now we’ve added an entirely new database to our system. Our AWS/Azure bill has jumped from $6k to $10k per month for our 12-person team.

At this point, you might be tempted to think:

“Chris, you don’t get it. A full-time developer costs a lot more than that. It’s not worth it in the big picture to pay someone to improve that code.”


That would be true if we were looking at some nested loops on an application server, and the end result was having 10 identical machines in an auto-scale group instead of 8. But in this example, we’ve fundamentally changed our architecture in a way that doesn’t scale linearly on the human side. Think about:

  • How does that new database get populated?

  • How do records updated in one database flow through to the other? Records created? Records deleted?

  • How does it interface with other components of the application?

  • What about security?

  • Documentation?
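The update and delete questions are the ones that tend to get missed. The following is a deliberately naive sketch, not anyone’s real synchronization code: two plain dictionaries stand in for the two databases, and the function names (create, update, delete, find_drift) are hypothetical. Only the create path remembers to write to both stores.

```python
_MISSING = object()  # sentinel meaning "this store has no such record"

primary = {}    # stand-in for the original SQL database
secondary = {}  # stand-in for the newly added store


def create(record_id, value):
    primary[record_id] = value
    secondary[record_id] = value  # the one sync path somebody remembered


def update(record_id, value):
    primary[record_id] = value    # no propagation: secondary is now stale


def delete(record_id):
    del primary[record_id]        # no propagation: secondary keeps an orphan


def find_drift():
    """Return the ids whose copies differ or exist in only one store."""
    all_ids = primary.keys() | secondary.keys()
    return sorted(
        i for i in all_ids
        if primary.get(i, _MISSING) != secondary.get(i, _MISSING)
    )


create(1, "a")
create(2, "b")
create(3, "c")
update(2, "B")
delete(3)

drift = find_drift()
print(drift)  # [2, 3]
```

Record 2 is stale and record 3 is an orphan, and nothing in the application will say so until a user notices. Every write path needs a matching propagation path, plus something like find_drift to catch the ones that were forgotten.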


What you thought was a $4k increase in server costs actually represents a large, difficult, and error-prone new development task. It will levy taxes at every level of your organization:

  • It will have bugs, making QA more expensive.

  • It’s another thing to deploy, making DevOps more expensive.

  • It’s another thing to learn, making training more expensive.

  • It’s another thing to protect, making security more expensive.

  • It’s another thing to configure, making development more expensive.

  • It’s another thing to break, making debugging more expensive.

To say nothing of the cost to initially build a component that populates the new database and keeps its data synchronized!

  • Any errors in that synchronization process will bubble up to users, via inconsistent or out-of-date information, making your end product actively worse.

  • Working around those limitations will likely pull in your UI/UX designers and project managers.

  • If those issues are bad enough, it could even place limitations on sales, marketing, and strategic decisions.


You make a few decisions like this one, and a year later, you’ve got 3 “DevOps Engineers” on a team of 12 people. The DevOps team has the smartest, most experienced, and most expensive people in the engineering department.

Over 1/4 of your engineering budget is devoted to people who spend their days concocting elaborate architecture to support rather than improve your algorithms; their skills now aren’t available to be applied directly to your customers’ needs.

Flash forward another year and there’s some dashboard that shows synchronization status, and someone has to log in at random hours of the night to figure out why the “Cache Bus” is backed up and manually re-order the queue.

The sales team knows to switch back to PowerPoint at this point in the demo, lest the data they just entered not appear on the dashboard. There’s some “recompute” button that you have to train users on, but they just press it at random any time something isn’t working as expected, right before they restart their computer.


It all feels right: engineers hitting buttons in real time to keep the Warp Core online and calibrated. That’s what they do on TV, and it looks a lot like something you read about on the blog of a company that was experimenting with blockchain and AI before it went out of business.

But it’s not right.

Of course, sometimes you do need that new technology and that complexity. Keeping a cache up to date doesn’t have to be that difficult, and it doesn’t have to leak through so obviously to an end user. Sometimes it might be unavoidable (after all, you’re reading this in a web browser with a refresh button). But if the only reason it exists is that you didn’t try to avoid it, expect to make the same kinds of mistakes with the new technology that you made with the old one.


How do you prevent and/or respond to this?

Developers:

  • Always start an investigation with your own code.

  • Issues are in your own usage until they’re proven to be in the framework itself.

  • Ask “why is my SQL slow?” not “why is SQL slow?”
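One way to start investigating your own query is to ask the database how it plans to run it. The sketch below uses SQLite’s EXPLAIN QUERY PLAN statement; the in-memory table, the index name idx_my_table_thing_id, and the helper query_plan are assumptions made for illustration, and a missing index is only one of many possible causes of a slow query. Other databases expose similar information through their own EXPLAIN variants.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' text of each row of SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("explain query plan " + sql, params)]


# Hypothetical setup: the table from the opening example, without any index
# on thing_id.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table my_table ("
    "id integer primary key, thing_id integer, my_column integer)"
)

sql = "select sum(my_column) from my_table where thing_id = ?"

# Without an index on thing_id, the plan reports a scan of the whole table.
plan_before = query_plan(conn, sql, (1,))

# After adding the index, the plan reports a search that uses it.
conn.execute("create index idx_my_table_thing_id on my_table (thing_id)")
plan_after = query_plan(conn, sql, (1,))

print(plan_before)
print(plan_after)
```

The exact wording of the plan varies between SQLite versions, so read it for the shift from a scan to an indexed search rather than for a specific string. The point is that this takes a minute and happens entirely inside the database you already have.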

DevOps:

  • Your job is not to produce more architecture. It’s to produce the architecture that best solves the business needs.

  • You’re allowed to review code. In fact, you should be considering the code that led to a given architectural change.

  • You work for the customer first, your development team second.

Managers:

  • Introducing new technology for new technology’s sake doesn’t make you “cutting edge.”

  • Be very suspicious of anything that adds architectural complexity.

  • The burden of proof is always on the person who claims they need new tools. They must show why the task can’t be done with existing ones.