Designing Distributed Systems That Don't Collapse Under Reality
Date: 2026-04-15 Category: Architecture Read Time: ~10 min
The Lie We Tell Ourselves
We like to believe our systems work.
Requests go in, responses come out, jobs get processed, and dashboards stay green.
Until they don't.
In real systems:
- Networks fail silently
- Messages arrive twice
- Services restart mid-operation
- Databases lag behind reality
Failure is not an edge case. It is the default state.
If your system only works when everything works, it doesn't work.
The Only Model That Matters: Partial Failure
A distributed system is not "a system".
It's a collection of independent processes pretending to be one.
At any moment:
- One service is slow
- Another is down
- Kafka is temporarily unavailable
- Your consumer lost its group coordinator
And yet...
The system must continue behaving correctly enough.
Not perfectly. Not instantly. But safely.
Idempotency: The First Line of Defense
If you retry without idempotency, you are not fixing failure.
You are multiplying it.
The problem
Let's say your API publishes a task:
```go
PublishTask(taskID)
```

Then your service crashes before acknowledging success.
Now what?
You retry.
Now the task exists twice.
The solution: Idempotent operations
Every operation must be safe to execute multiple times.
Example:
```go
// Safe to call repeatedly: a duplicate create becomes a no-op.
// (The check-then-insert still races under concurrency; the
// unique-constraint approach below closes that gap.)
func CreateTask(taskID string) error {
	if db.Exists(taskID) {
		return nil
	}
	return db.Insert(taskID)
}
```

Better:
- Use unique constraints
- Use idempotency keys
- Use state transitions instead of blind writes
Retries without idempotency turn small failures into data corruption.
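One way to get atomic insert-if-absent semantics is shown in the sketch below. It uses an in-memory `sync.Map` as a stand-in for a database table with a unique constraint on the task ID (the `TaskStore` type and its methods are illustrative, not from any library): `LoadOrStore` inserts atomically only if the key is new, mirroring `INSERT ... ON CONFLICT DO NOTHING`.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskStore stands in for a database table with a unique
// constraint on taskID (hypothetical helper for illustration).
type TaskStore struct {
	tasks sync.Map
}

// CreateTask is idempotent: LoadOrStore atomically inserts the task
// only if it does not already exist, so there is no window between
// a separate existence check and the insert. Retrying is safe.
func (s *TaskStore) CreateTask(taskID string) (created bool) {
	_, loaded := s.tasks.LoadOrStore(taskID, struct{}{})
	return !loaded
}

func main() {
	s := &TaskStore{}
	fmt.Println(s.CreateTask("task-1")) // first attempt: true
	fmt.Println(s.CreateTask("task-1")) // retry: false, no duplicate
}
```

In a real service the same property comes from the database itself: let the unique constraint reject the duplicate and treat that rejection as success.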
Retries: Necessary, Dangerous, Inevitable
Retries are not optional.
But naive retries are one of the fastest ways to melt your system.
Bad retry
```
retry immediately -> fail -> retry immediately -> fail -> repeat
```

This creates:
- Traffic spikes
- Thundering herd problems
- Downstream collapse
Correct retry strategy
- Exponential backoff
- Jitter (random delay)
- Retry limits
Example:
```
1s -> 2s -> 4s -> 8s (+ jitter)
```

Why jitter matters:
Without it, all clients retry at the same time.
With it, retries spread out.
Message Queues Don't Save You
Kafka doesn't magically make your system reliable.
It just moves the problem.
What actually goes wrong
- Consumer crashes after processing but before commit
- Message is processed twice
- Out-of-order events break state logic
- Lag grows silently
What you actually need
- Idempotent consumers
- Offset management awareness
- Dead-letter queues (DLQ)
- Observability on lag and failure
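The idempotent-consumer part can be sketched without Kafka itself. Below, an in-memory set of processed message IDs stands in for a dedup table; the `Message` and `Consumer` types are hypothetical. The key property: a redelivery caused by crash-before-commit is detected and skipped, so the side effect happens at most once.

```go
package main

import "fmt"

// Message is a minimal stand-in for a Kafka record; ID would be a
// business key or event ID carried in the payload (assumption).
type Message struct {
	ID      string
	Payload string
}

// Consumer applies each message at most once by recording processed
// IDs. In production, the processed set lives in the same datastore
// as the side effects and is updated in the same transaction.
type Consumer struct {
	processed map[string]bool
	applied   []string
}

func (c *Consumer) Handle(m Message) {
	if c.processed[m.ID] {
		return // duplicate delivery: replay after crash-before-commit
	}
	c.applied = append(c.applied, m.Payload) // the real side effect
	c.processed[m.ID] = true
}

func main() {
	c := &Consumer{processed: map[string]bool{}}
	c.Handle(Message{ID: "evt-1", Payload: "charge $10"})
	c.Handle(Message{ID: "evt-1", Payload: "charge $10"}) // redelivered
	fmt.Println(len(c.applied)) // applied once
}
```

At-least-once delivery plus an idempotent handler is what gives you effectively-once processing; neither half works alone.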
Dead Letter Queues: Where Failures Go to Be Seen
A DLQ is not optional.
It is your last line of truth.
Without it:
- Failed messages disappear
- Bugs become invisible
- You lose data silently
With it:
- You have a recovery path
- You can inspect failures
- You can replay safely
If you don't know what failed, you don't have a system.
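A DLQ doesn't have to be exotic infrastructure to start with. A minimal sketch, using a buffered channel as the dead-letter destination (the `process` handler and `FailedMessage` type are illustrative assumptions): anything that still fails gets routed somewhere inspectable instead of being dropped.

```go
package main

import (
	"errors"
	"fmt"
)

// FailedMessage keeps the payload together with the error, so the
// failure stays inspectable and replayable.
type FailedMessage struct {
	Payload string
	Err     error
}

// process is a hypothetical handler that rejects malformed input.
func process(payload string) error {
	if payload == "" {
		return errors.New("empty payload")
	}
	return nil
}

// consume routes anything that fails to dlq instead of dropping it.
func consume(payloads []string, dlq chan<- FailedMessage) {
	for _, p := range payloads {
		if err := process(p); err != nil {
			dlq <- FailedMessage{Payload: p, Err: err}
		}
	}
}

func main() {
	dlq := make(chan FailedMessage, 10)
	consume([]string{"ok", "", "also ok"}, dlq)
	close(dlq)
	for m := range dlq {
		fmt.Printf("dead-lettered %q: %v\n", m.Payload, m.Err)
	}
}
```

With Kafka, the same shape becomes a dedicated topic plus an alert on its depth; the principle is identical.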
State Machines > Random Status Updates
Most systems fail not because of infrastructure...
...but because of invalid state transitions.
Example of chaos:
```
PENDING -> FAILED -> COMPLETED
```

Fix: Explicit state machines
Define allowed transitions:
```
PENDING -> RUNNING -> COMPLETED
PENDING -> RUNNING -> FAILED
FAILED -> RETRYING -> RUNNING
```

Reject everything else.
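The transitions above can be encoded as a whitelist: a map from each state to the states it may legally reach, with everything else rejected. A minimal sketch (the type names are illustrative):

```go
package main

import "fmt"

type State string

const (
	Pending   State = "PENDING"
	Running   State = "RUNNING"
	Completed State = "COMPLETED"
	Failed    State = "FAILED"
	Retrying  State = "RETRYING"
)

// allowed encodes the legal transitions listed above; anything
// absent from this map is rejected.
var allowed = map[State][]State{
	Pending:  {Running},
	Running:  {Completed, Failed},
	Failed:   {Retrying},
	Retrying: {Running},
}

// Transition returns an error for any move the state machine does
// not explicitly permit, so FAILED -> COMPLETED can never happen.
func Transition(from, to State) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid transition %s -> %s", from, to)
}

func main() {
	fmt.Println(Transition(Pending, Running))  // allowed: <nil>
	fmt.Println(Transition(Failed, Completed)) // rejected with error
}
```

Enforce this at the single place where status is written (ideally inside the same transaction as the update), and the chaos transitions become impossible rather than merely discouraged.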
Observability Is Not Optional
If you can't see it, you can't fix it.
Minimum requirements:
Metrics
- Request rate
- Error rate
- Latency
- Queue lag
Logs
- Structured
- Correlated (trace ID)
Tracing
- Request -> Kafka -> Worker -> DB -> Response
The Real Goal: Controlled Degradation
A good system does not "stay up".
It degrades gracefully.
Final Thought
Distributed systems are not about scale.
They are about surviving reality.
Key Takeaways
- Failure is the default state
- Idempotency is non-negotiable
- Retries must be controlled
- Kafka does not solve correctness
- DLQs are mandatory
- State machines prevent chaos
- Observability is survival
_If your system works perfectly in your head but breaks in production..._
_you didn't design a system._
_you designed a demo._