Designing Distributed Systems That Don't Collapse Under Reality
Date: 2026-04-15 Category: Architecture Read Time: ~10 min
The Lie We Tell Ourselves
We like to believe our systems work.
Requests go in, responses come out, jobs get processed, and dashboards stay green.
Until they don't.
In real systems:
- Networks fail silently
- Messages arrive twice
- Services restart mid-operation
- Databases lag behind reality
Failure is not an edge case. It is the default state.
If your system only works when everything works, it doesn't work.
The Only Model That Matters: Partial Failure
A distributed system is not "a system".
It's a collection of independent processes pretending to be one.
At any moment:
- One service is slow
- Another is down
- Kafka is temporarily unavailable
- Your consumer lost its group coordinator
And yet...
The system must continue behaving correctly enough.
Not perfectly. Not instantly. But safely.
Idempotency: The First Line of Defense
If you retry without idempotency, you are not fixing failure.
You are multiplying it.
The problem
Let's say your API publishes a task:
```go
PublishTask(taskID)
```

Then your service crashes before acknowledging success.
Now what?
You retry.
Now the task exists twice.
The solution: Idempotent operations
Every operation must be safe to execute multiple times.
Example:
```go
// Safe to call repeatedly: a duplicate create becomes a no-op.
// (The check-then-insert still races under concurrency; the
// unique-constraint approach below closes that gap.)
func CreateTask(taskID string) error {
	if db.Exists(taskID) {
		return nil
	}
	return db.Insert(taskID)
}
```

Better:
- Use unique constraints
- Use idempotency keys
- Use state transitions instead of blind writes
Retries without idempotency turn small failures into data corruption.
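One way to get atomic insert-if-absent semantics is shown in the sketch below. It uses an in-memory `sync.Map` as a stand-in for a database table with a unique constraint on the task ID (the `TaskStore` type and its methods are illustrative, not from any library): `LoadOrStore` inserts atomically only if the key is new, mirroring `INSERT ... ON CONFLICT DO NOTHING`.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskStore stands in for a database table with a unique
// constraint on taskID (hypothetical helper for illustration).
type TaskStore struct {
	tasks sync.Map
}

// CreateTask is idempotent: LoadOrStore atomically inserts the task
// only if it does not already exist, so there is no window between
// a separate existence check and the insert. Retrying is safe.
func (s *TaskStore) CreateTask(taskID string) (created bool) {
	_, loaded := s.tasks.LoadOrStore(taskID, struct{}{})
	return !loaded
}

func main() {
	s := &TaskStore{}
	fmt.Println(s.CreateTask("task-1")) // first attempt: true
	fmt.Println(s.CreateTask("task-1")) // retry: false, no duplicate
}
```

In a real service the same property comes from the database itself: let the unique constraint reject the duplicate and treat that rejection as success.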
Retries: Necessary, Dangerous, Inevitable
Retries are not optional.
But naive retries are one of the fastest ways to melt your system.
Bad retry
```
retry immediately -> fail -> retry immediately -> fail -> repeat
```

This creates:
- Traffic spikes
- Thundering herd problems
- Downstream collapse
Correct retry strategy
- Exponential backoff
- Jitter (random delay)
- Retry limits
Example:
```
1s -> 2s -> 4s -> 8s (+ jitter)
```

Why jitter matters:
Without it, all clients retry at the same time.
With it, retries spread out.
Message Queues Don't Save You
Kafka doesn't magically make your system reliable.
It just moves the problem.
What actually goes wrong
- Consumer crashes after processing but before commit
- Message is processed twice
- Out-of-order events break state logic
- Lag grows silently
What you actually need
- Idempotent consumers
- Offset management awareness
- Dead-letter queues (DLQ)
- Observability on lag and failure
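The idempotent-consumer part can be sketched without Kafka itself. Below, an in-memory set of processed message IDs stands in for a dedup table; the `Message` and `Consumer` types are hypothetical. The key property: a redelivery caused by crash-before-commit is detected and skipped, so the side effect happens at most once.

```go
package main

import "fmt"

// Message is a minimal stand-in for a Kafka record; ID would be a
// business key or event ID carried in the payload (assumption).
type Message struct {
	ID      string
	Payload string
}

// Consumer applies each message at most once by recording processed
// IDs. In production, the processed set lives in the same datastore
// as the side effects and is updated in the same transaction.
type Consumer struct {
	processed map[string]bool
	applied   []string
}

func (c *Consumer) Handle(m Message) {
	if c.processed[m.ID] {
		return // duplicate delivery: replay after crash-before-commit
	}
	c.applied = append(c.applied, m.Payload) // the real side effect
	c.processed[m.ID] = true
}

func main() {
	c := &Consumer{processed: map[string]bool{}}
	c.Handle(Message{ID: "evt-1", Payload: "charge $10"})
	c.Handle(Message{ID: "evt-1", Payload: "charge $10"}) // redelivered
	fmt.Println(len(c.applied)) // applied once
}
```

At-least-once delivery plus an idempotent handler is what gives you effectively-once processing; neither half works alone.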
Dead Letter Queues: Where Failures Go to Be Seen
A DLQ is not optional.
It is your last line of truth.
Without it:
- Failed messages disappear
- Bugs become invisible
- You lose data silently
With it:
- You have a recovery path
- You can inspect failures
- You can replay safely
If you don't know what failed, you don't have a system.
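A DLQ doesn't have to be exotic infrastructure to start with. A minimal sketch, using a buffered channel as the dead-letter destination (the `process` handler and `FailedMessage` type are illustrative assumptions): anything that still fails gets routed somewhere inspectable instead of being dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// FailedMessage keeps the payload together with the error, so the
// failure stays inspectable and replayable.
type FailedMessage struct {
	Payload string
	Err     error
}

// process is a hypothetical handler that rejects malformed input.
func process(payload string) error {
	if payload == "" {
		return errors.New("empty payload")
	}
	return nil
}

// consume routes anything that fails to dlq instead of dropping it.
func consume(payloads []string, dlq chan<- FailedMessage) {
	for _, p := range payloads {
		if err := process(p); err != nil {
			dlq <- FailedMessage{Payload: p, Err: err}
		}
	}
}

func main() {
	dlq := make(chan FailedMessage, 10)
	consume([]string{"ok", "", "also ok"}, dlq)
	close(dlq)
	for m := range dlq {
		fmt.Printf("dead-lettered %q: %v\n", m.Payload, m.Err)
	}
}
```

With Kafka, the same shape becomes a dedicated topic plus an alert on its depth; the principle is identical.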
State Machines > Random Status Updates
Most systems fail not because of infrastructure...
...but because of invalid state transitions.
Example of chaos:
```
PENDING -> FAILED -> COMPLETED
```

Fix: Explicit state machines
Define allowed transitions:
```
PENDING -> RUNNING -> COMPLETED
PENDING -> RUNNING -> FAILED
FAILED -> RETRYING -> RUNNING
```

Reject everything else.
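The transitions above can be encoded as a whitelist: a map from each state to the states it may legally reach, with everything else rejected. A minimal sketch (the type names are illustrative):

```go
package main

import "fmt"

type State string

const (
	Pending   State = "PENDING"
	Running   State = "RUNNING"
	Completed State = "COMPLETED"
	Failed    State = "FAILED"
	Retrying  State = "RETRYING"
)

// allowed encodes the legal transitions listed above; anything
// absent from this map is rejected.
var allowed = map[State][]State{
	Pending:  {Running},
	Running:  {Completed, Failed},
	Failed:   {Retrying},
	Retrying: {Running},
}

// Transition returns an error for any move the state machine does
// not explicitly permit, so FAILED -> COMPLETED can never happen.
func Transition(from, to State) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s", from, to)
}

func main() {
	fmt.Println(Transition(Pending, Running))  // allowed: <nil>
	fmt.Println(Transition(Failed, Completed)) // rejected with error
}
```

Enforce this at the single place where status is written (ideally inside the same transaction as the update), and the chaos transitions become impossible rather than merely discouraged.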
Observability Is Not Optional
If you can't see it, you can't fix it.
Minimum requirements:
Metrics
- Request rate
- Error rate
- Latency
- Queue lag
Logs
- Structured
- Correlated (trace ID)
Tracing
- Request -> Kafka -> Worker -> DB -> Response
The Real Goal: Controlled Degradation
A good system does not "stay up".
It degrades gracefully.
Final Thought
Distributed systems are not about scale.
They are about surviving reality.
Key Takeaways
- Failure is the default state
- Idempotency is non-negotiable
- Retries must be controlled
- Kafka does not solve correctness
- DLQs are mandatory
- State machines prevent chaos
- Observability is survival
_If your system works perfectly in your head but breaks in production..._
_you didn't design a system._
_you designed a demo._