Designing Distributed Systems That Don't Collapse Under Reality

Date: 2026-04-15 Category: Architecture Read Time: ~10 min


The Lie We Tell Ourselves

We like to believe our systems work.

Requests go in, responses come out, jobs get processed, and dashboards stay green.

Until they don't.

In real systems:

  • Networks fail silently
  • Messages arrive twice
  • Services restart mid-operation
  • Databases lag behind reality

Failure is not an edge case. It is the default state.

If your system only works when everything works, it doesn't work.


The Only Model That Matters: Partial Failure

A distributed system is not "a system".

It's a collection of independent processes pretending to be one.

At any moment:

  • One service is slow
  • Another is down
  • Kafka is temporarily unavailable
  • Your consumer lost its group coordinator

And yet...

The system must continue behaving correctly enough.

Not perfectly. Not instantly. But safely.


Idempotency: The First Line of Defense

If you retry without idempotency, you are not fixing failure.

You are multiplying it.

The problem

Let's say your API publishes a task:

PublishTask(taskID)

Then your service crashes before acknowledging success.

Now what?

You retry.

Now the task exists twice.


The solution: Idempotent operations

Every operation must be safe to execute multiple times.

Example:

func CreateTask(taskID string) error {
    // Check-then-insert is only safe if taskID also has a unique
    // constraint: two concurrent calls can both pass the Exists check.
    if db.Exists(taskID) {
        return nil // already created; the retry is a no-op
    }
    return db.Insert(taskID)
}

Better:

  • Use unique constraints
  • Use idempotency keys
  • Use state transitions instead of blind writes

Retries without idempotency turn small failures into data corruption.
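
A minimal sketch of idempotency-key semantics, using an in-memory map as a stand-in for the database unique constraint (the `TaskStore` type and its methods are illustrative, not a real API; in production the same guarantee comes from a unique index, with duplicate-key errors treated as success):

```go
package main

import (
	"fmt"
	"sync"
)

// TaskStore sketches idempotency-key semantics with an in-memory map.
// A real system gets the same guarantee from a database unique
// constraint on the key.
type TaskStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewTaskStore() *TaskStore {
	return &TaskStore{seen: make(map[string]bool)}
}

// CreateTask returns (true, nil) on the first insert and (false, nil)
// on any replay of the same idempotency key, so retries are harmless.
func (s *TaskStore) CreateTask(key string) (created bool, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[key] {
		return false, nil // duplicate: already applied, not an error
	}
	s.seen[key] = true
	return true, nil
}

func main() {
	store := NewTaskStore()
	first, _ := store.CreateTask("task-42")
	second, _ := store.CreateTask("task-42") // retry of the same key
	fmt.Println(first, second)               // true false
}
```

The key point: a replayed request is not an error. It is the expected path, and it must change nothing.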


Retries: Necessary, Dangerous, Inevitable

Retries are not optional.

But naive retries are one of the fastest ways to melt your system.

Bad retry

retry immediately -> fail -> retry immediately -> fail -> repeat

This creates:

  • Traffic spikes
  • Thundering herd problems
  • Downstream collapse

Correct retry strategy

  • Exponential backoff
  • Jitter (random delay)
  • Retry limits

Example:

1s -> 2s -> 4s -> 8s (+ jitter)

Why jitter matters:

Without it, all clients retry at the same time.

With it, retries spread out.


Message Queues Don't Save You

Kafka doesn't magically make your system reliable.

It just moves the problem.

What actually goes wrong

  • Consumer crashes after processing but before commit
  • Message is processed twice
  • Out-of-order events break state logic
  • Lag grows silently

What you actually need

  • Idempotent consumers
  • Offset management awareness
  • Dead-letter queues (DLQ)
  • Observability on lag and failure
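
An idempotent consumer under at-least-once delivery can be sketched like this (the `Message` and `Consumer` types are illustrative stand-ins for a real client library; a production system would persist the set of applied keys rather than hold it in memory):

```go
package main

import "fmt"

// Message stands in for a queue record; Offset and Key are the
// fields an idempotent consumer actually needs.
type Message struct {
	Offset int64
	Key    string
}

// Consumer sketches at-least-once consumption made safe: processing
// is deduplicated by key, and the offset is recorded only after the
// message has been fully applied.
type Consumer struct {
	applied   map[string]bool // processed keys (persist this in reality)
	committed int64           // last committed offset
	Processed int             // messages that actually mutated state
}

func NewConsumer() *Consumer {
	return &Consumer{applied: make(map[string]bool), committed: -1}
}

func (c *Consumer) Handle(m Message) {
	if !c.applied[m.Key] { // redelivered message: skip the side effect
		c.applied[m.Key] = true
		c.Processed++
	}
	c.committed = m.Offset // commit only after processing succeeds
}

func main() {
	c := NewConsumer()
	// The broker redelivers offset 1 after a crash-before-commit.
	for _, m := range []Message{{0, "a"}, {1, "b"}, {1, "b"}, {2, "c"}} {
		c.Handle(m)
	}
	fmt.Println(c.Processed, c.committed) // 3 2
}
```

Crash after processing, before commit? The message comes back, the dedupe check absorbs it, and state stays correct.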

Dead Letter Queues: Where Failures Go to Be Seen

A DLQ is not optional.

It is your last line of truth.

Without it:

  • Failed messages disappear
  • Bugs become invisible
  • You lose data silently

With it:

  • You have a recovery path
  • You can inspect failures
  • You can replay safely

If you don't know what failed, you don't have a system.


State Machines > Random Status Updates

Most systems fail not because of infrastructure...

...but because of invalid state transitions.

Example of chaos:

PENDING -> FAILED -> COMPLETED

Fix: Explicit state machines

Define allowed transitions:

PENDING -> RUNNING -> COMPLETED
PENDING -> RUNNING -> FAILED
FAILED -> RETRYING -> RUNNING

Reject everything else.
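
The allowed transitions above fit in one table; everything else is rejected by construction (a minimal sketch, with states as plain strings for brevity):

```go
package main

import "fmt"

// allowed encodes the legal transitions; anything not listed here
// is rejected.
var allowed = map[string][]string{
	"PENDING":  {"RUNNING"},
	"RUNNING":  {"COMPLETED", "FAILED"},
	"FAILED":   {"RETRYING"},
	"RETRYING": {"RUNNING"},
}

// Transition returns nil only for a legal state change.
func Transition(from, to string) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("illegal transition %s -> %s", from, to)
}

func main() {
	fmt.Println(Transition("PENDING", "RUNNING")) // <nil>
	fmt.Println(Transition("FAILED", "COMPLETED")) // illegal transition FAILED -> COMPLETED
}
```

Put this check in front of every status write. A duplicate "task completed" event against a FAILED task now bounces instead of corrupting state.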


Observability Is Not Optional

If you can't see it, you can't fix it.

Minimum requirements:

Metrics

  • Request rate
  • Error rate
  • Latency
  • Queue lag

Logs

  • Structured
  • Correlated (trace ID)

Tracing

  • Request -> Kafka -> Worker -> DB -> Response

The Real Goal: Controlled Degradation

A good system does not "stay up".

It degrades gracefully.


Final Thought

Distributed systems are not about scale.

They are about surviving reality.


Key Takeaways

  • Failure is the default state
  • Idempotency is non-negotiable
  • Retries must be controlled
  • Kafka does not solve correctness
  • DLQs are mandatory
  • State machines prevent chaos
  • Observability is survival

_If your system works perfectly in your head but breaks in production..._

_you didn't design a system._

_you designed a demo._