Problem #1/100: Why your "clean" codebase still takes two weeks to change one thing
Your team paid for the refactor. Three months later every estimate is still "two weeks". Here is the invisible pattern eating your velocity — and how to spot it without reading code.
Your team did the refactor. They moved logic out of controllers. They created service classes. They introduced proper error codes and dedicated database tables for tracking failures. The pull request got approved. Everyone felt good about it.
Three months later, the product owner asks: "Can we add automatic retries when a workflow step fails?" The tech lead estimates two weeks. For a retry mechanism. On code that was just refactored.
You're confused — you paid for the cleanup. Why is the estimate still high?
Because the refactor moved the mess from one room to another. The mess is still there. It's just wearing nicer clothes.
This is Problem #1 in a series of 100 I'm documenting from real SaaS projects. Every one of these is something I've seen cost a founder weeks of velocity, thousands in wasted dev hours, or — in the worst cases — a rewrite that didn't need to happen.
What you're seeing from your seat
You don't need to read code to diagnose this. Here's what it looks like:
Small infrastructure changes take surprisingly long. "Add Slack notifications when something fails" should be a morning's work. Your team says three days because they need to add it in eleven places.
Switching providers is always a bigger deal than it should be. You want to move from one email service to another. The estimate is two weeks — not because the new API is complex, but because your business logic is tangled up with the old one in ways nobody mapped.
The same bug appears in different disguises. You fix a failure-logging bug in the email step. Two weeks later, the same bug shows up in the document generation step. Because the same code was copied, not shared.
Every new workflow step takes longer than the last. Not because the business logic is complex — but because each step has to replicate a growing list of infrastructure rituals: log this, track that, notify here, retry there.
If two or more of these sound familiar, your codebase has what I call infrastructure bleed — infrastructure responsibilities that have leaked into business logic, even in code that looks clean.
The factory floor analogy
Imagine a factory with ten assembly stations. Each station does something different — one cuts metal, one welds, one paints.
Now imagine that every station also handles its own safety incident reporting. When something goes wrong at Station 3, the operator at Station 3 fills out the incident form, calls the safety office, logs the event in the tracking system, and schedules a follow-up inspection. Station 7 does the same thing. Station 1 does the same thing. Each one has its own copy of the procedure, slightly different.
One day, management says: "We're switching to a new incident tracking system." Now someone has to visit all ten stations, retrain every operator, and update ten different procedures.
A well-run factory doesn't work this way. Each station has one job: if something goes wrong, press the red button. One central safety team handles the rest — the form, the call, the log, the follow-up. Change the tracking system? You retrain one team, not ten.
That red button is the separation between business logic and infrastructure. The station's job is to know what went wrong: "the weld didn't hold," "the paint was the wrong color." The safety team's job is to know what to do about it: log it, escalate it, schedule a fix.
In your codebase, each workflow step is a station. And right now, every station is doing its own incident reporting.
What this looks like in code (simplified)
You don't need to understand the syntax — look at the shape.
Here's what a workflow step looks like before separation. This is the "clean" version that already passed code review:
final class SendAutomatedEmail
{
    public function execute(Workflow $workflow, Deal $deal): void
    {
        $content = $this->mailer->render($workflow->templateId(), $deal);

        if (!$content->hasBody()) {
            // These three lines are the "incident reporting"
            $this->noteService->addNote($workflow, $deal, 'Email body empty.');
            $this->failureRepository->persist($workflow, $this, 'EMPTY_BODY');
            throw new \RuntimeException('Email body empty.');
        }

        $this->mailer->send($content, $deal->contactEmail());
    }
}
See those three lines in the middle? That's the station operator filling out the incident form. Every other step in the system has the same three lines, with slightly different parameters. There are eleven steps. That's thirty-three lines of duplicated infrastructure, spread across eleven files, that all need to change together.
Here's the same step after separation:
final class SendAutomatedEmail
{
    public function execute(Workflow $workflow, Deal $deal): void
    {
        $content = $this->mailer->render($workflow->templateId(), $deal);

        if (!$content->hasBody()) {
            throw WorkflowStepFailed::because('Email body empty.');
        }

        $this->mailer->send($content, $deal->contactEmail());
    }
}
The step presses the red button — "email body was empty" — and that's it. A single workflow runner catches all failures and handles the logging, tracking, and retries in one place. Add Slack notifications? One file. Change the failure tracking system? One file. Add automatic retries? One file.
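To make "a single workflow runner catches all failures" concrete, here is a minimal sketch of what that runner could look like. This is an illustration, not code from the project: WorkflowRunner, the $onFailure callback, and this version of WorkflowStepFailed are hypothetical names, assuming plain PHP with no framework.

```php
<?php

// The red button: a step throws this to declare what went wrong.
final class WorkflowStepFailed extends \RuntimeException
{
    public static function because(string $reason): self
    {
        return new self($reason);
    }
}

// The safety team: one place that decides what to do about failures.
final class WorkflowRunner
{
    private \Closure $onFailure;

    public function __construct(\Closure $onFailure)
    {
        // The single hook where logging, tracking, Slack alerts,
        // or retries would live.
        $this->onFailure = $onFailure;
    }

    /** @param array<string, callable(): void> $steps */
    public function run(array $steps): array
    {
        $failures = [];

        foreach ($steps as $name => $step) {
            try {
                $step(); // the step only declares outcomes...
            } catch (WorkflowStepFailed $e) {
                // ...the runner does the incident reporting, once.
                ($this->onFailure)($name, $e->getMessage());
                $failures[$name] = $e->getMessage();
            }
        }

        return $failures;
    }
}
```

With this shape, "add Slack notifications on failure" means editing the one $onFailure callback, not eleven step classes.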
The business impact
Here's what infrastructure bleed costs in real terms:
Change amplification. A change that should touch one file touches ten. At typical SaaS dev rates, that's not a 10x cost — it's worse, because each touch needs its own testing, its own review, and its own chance to introduce a regression. A one-day task becomes a week-long pull request.
Onboarding drag. A new developer joining the project has to learn the same infrastructure ritual repeated in eleven slightly different ways. They'll ask "why does this step do it differently from that step?" The answer is usually "because different people wrote them at different times." That's not architecture — that's archaeology.
Regression gravity. Fix a logging bug in one step, and the same bug lives on in the other ten. You'll see the same ticket reopened three months later, filed against a different step, and your team will feel like they're running in circles.
Vendor lock-in surface area. The more places your business logic touches an external service directly, the more expensive it is to switch. I've seen a payment provider migration estimated at eight weeks — not because the new API was hard, but because the old one was referenced in forty-three files that mixed business rules with API calls.
Multiply any of these by your monthly burn rate, and you'll see why a "clean" codebase can still feel expensive to change.
What to do about it — without a rewrite
The fix is not a rewrite. Rewrites kill more SaaS startups than architectural debt ever will. Here's what works:
Identify the ritual. Look for code patterns that repeat across multiple files but aren't business logic. Failure handling, logging, notifications, retry logic, audit trails — these are the usual suspects. Ask your tech lead: "If we wanted to add Slack alerts on failure, how many files would we touch?" If the answer is more than two, you've found the bleed.
Extract the ritual once. Move the repeated infrastructure into one place — a runner, a middleware, an event handler. The mechanism depends on your framework, but the principle is the same: business actions declare what happened, one central place decides what to do about it.
Apply the red-button rule going forward. Every new workflow step, every new action, every new handler — it only declares outcomes. It presses the button. It doesn't fill out forms.
Migrate existing steps by attrition. When a bug comes in on Step 7, fix it by removing Step 7's infrastructure ritual and routing it through the central handler. Over six months, the duplication quietly dies without a dedicated refactoring sprint.
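Mechanically, "extract the ritual once" can be as small as a decorator that wraps any step. The names below (Step, WithIncidentReporting) are hypothetical, assuming PHP 8; the same shape works as middleware or an event listener in most frameworks:

```php
<?php

// The red button, thrown by steps to declare a failure.
final class WorkflowStepFailed extends \RuntimeException
{
}

interface Step
{
    // A step only declares outcomes; it throws WorkflowStepFailed on failure.
    public function execute(): void;
}

// The extracted ritual: wraps any step and reports failures in one place.
final class WithIncidentReporting implements Step
{
    public function __construct(
        private Step $inner,
        private \Closure $report, // stand-in for the logger/tracker/notifier
    ) {
    }

    public function execute(): void
    {
        try {
            $this->inner->execute();
        } catch (WorkflowStepFailed $e) {
            // The ritual lives here, once, for every wrapped step.
            ($this->report)($e->getMessage());
            throw $e;
        }
    }
}
```

Migrating by attrition then becomes a two-line change per step: delete its inline ritual, wrap it in the decorator when it is registered.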
The test that tells you it's working: your tech lead estimates "add Slack notifications on failure" at two hours instead of three days. That's the sound of infrastructure bleed stopping.
What's the invisible pattern that's slowing your team down right now? Drop it in the comments — I pick the next problem in this series from what founders and developers are actually living with.
If you want someone to find the infrastructure bleed in your codebase before it compounds further, that's exactly what I do at mvpforstartup.com — architecture reviews and fixed-price MVP builds for SaaS founders.