How I Broke Down WhatsApp's System Design — And What I Learned Along the Way

By Saurabh Prajapati | Full-Stack Engineer
So, I recently sat down to design WhatsApp from scratch. Not the real thing — obviously — but the kind of "design this system" challenge you'd face in a senior-level system design interview.
And honestly? It was one of the most fun and mind-bending exercises I've done in a while.
I wanted to document the whole journey here — not just the final architecture, but the thinking process, the mistakes, and those "wait, that's actually brilliant" moments along the way.
Let's get into it.
Why WhatsApp? Why Now?
I kept seeing WhatsApp pop up as a classic system design interview question. And every time, I'd think — "Yeah, it's just a messaging app, how hard can it be?"
Spoiler alert: incredibly hard.
Once I actually started breaking it down — 500 million daily active users, real-time messaging, group chats, end-to-end encryption — I realized this single app touches almost every concept in modern system design.
So I thought: why not try it myself and see where I get stuck?
That's exactly what happened. And this blog is me sharing everything — the good, the confusing, and the "I wish someone had told me this earlier" parts.
Step 1: Start With the Requirements (Seriously, Don't Skip This)
Before touching any architecture diagram, the first thing we did was define what WhatsApp actually needs to do.
This is split into two buckets:
Functional Requirements (What the app does)
Real-time one-on-one messaging
Group chats
Media sharing (images, videos, documents)
End-to-end encryption (E2EE)
Non-Functional Requirements (How well it does it)
Low latency — messages should feel instant (< 200ms)
High availability — the app should almost never go down (99.99% uptime)
Massive scale — we're talking 500 million+ daily active users
I'll be honest — when I first wrote down "500 million DAU," my brain just kind of short-circuited for a second. That's not a small number. That number changes everything about how you design the system.
💡 I wish I knew this earlier: Always nail down the scale first. It's the single biggest factor that shapes every decision you make — from database choices to caching strategies.
Step 2: The High-Level Architecture
Okay, here's where it gets exciting. We mapped out the big picture — all the major components and how they talk to each other.
Here's the flow:
Client Apps → Load Balancer → Chat Servers → Message Queue (Kafka) → Database Layer → Cache (Redis) → Push Notification Service
Let me break down what each piece does:
Client Apps — The WhatsApp app on your phone or desktop. This is where the user interacts.
Load Balancer — Spreads incoming traffic across multiple chat servers so no single server gets overwhelmed.
Chat Servers — The brain of the operation. They handle incoming messages, route them, and manage WebSocket connections.
Message Queue (Kafka) — A buffer that holds messages before they get written to the database. This is huge for reliability (more on that later).
Database Layer — Where all the data actually lives. (We used more than one database here — and the reason why is interesting.)
Cache (Redis) — Stores frequently accessed data in memory so we don't hit the database every single time.
Push Notification Service (FCM/APNs) — Sends notifications to your phone when you get a new message and you're not actively using the app.
The moment this all clicked together for me was when I realized — every single piece here exists for a reason. Nothing is extra. Nothing is "just there." Each component solves a specific problem at scale.
Step 3: How Do Devices Actually Talk? (Protocols)
This was one of the parts I found really interesting to think about.
Not all communication in WhatsApp uses the same protocol. Different situations call for different tools:
WebSockets — For real-time messaging. WebSockets keep a persistent, two-way connection open between your device and the server. So when you type a message, it goes through instantly — no need to make a new request every time.
HTTP/REST — For things like signing up, updating your profile, or fetching your contact list. These are one-time requests, so a simple HTTP call works perfectly.
FCM / APNs — For push notifications when you're offline. FCM is Google's service (Android), APNs is Apple's (iOS).
🤔 A question I asked myself: "Why not just use WebSockets for everything?" The answer? WebSockets are great for real-time stuff, but they're overkill for simple request-response tasks like fetching your profile. Using the right tool for the right job = less complexity, less resource usage.
Step 4: Choosing the Right Databases (This One Blew My Mind)
Okay, this was a big learning moment for me. WhatsApp doesn't use just one database. It uses multiple databases, each built for a different kind of data.
Here's the breakdown:
Cassandra — For Messages
Messages are written constantly and in massive volumes
Cassandra is built for exactly this — insane write throughput
It's also great for time-series data (messages are naturally ordered by time)
Think of it as: "I need to store a river of messages, fast and reliably"
PostgreSQL — For User & Group Data
User profiles, group info, contact lists — this data is relational
You need to join tables, run complex queries
PostgreSQL is a classic, reliable relational database — perfect for this
Redis — For Caching & Online Status
"Is this user online right now?" — that's a question that gets asked millions of times per second
Hitting PostgreSQL every single time would be way too slow
Redis stores this in memory — lightning fast reads
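To make the online-status idea concrete, here's a minimal in-memory sketch that mimics the Redis pattern of setting a presence key with a TTL (like `SETEX`): clients heartbeat periodically, and a user counts as online while their key hasn't expired. The class and key layout are illustrative, not WhatsApp's actual schema.

```python
import time

class PresenceStore:
    """In-memory sketch of Redis-style presence tracking with TTL expiry."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._expiry = {}  # user_id -> timestamp when presence expires

    def heartbeat(self, user_id, now=None):
        """Client pings periodically; refresh the key's TTL (like SETEX)."""
        now = time.time() if now is None else now
        self._expiry[user_id] = now + self.ttl

    def is_online(self, user_id, now=None):
        """Online if the presence key exists and hasn't expired yet."""
        now = time.time() if now is None else now
        return self._expiry.get(user_id, 0) > now

store = PresenceStore(ttl_seconds=60)
store.heartbeat("alice", now=1000)
print(store.is_online("alice", now=1030))   # True: within the TTL window
print(store.is_online("alice", now=1100))   # False: heartbeat expired
```

With real Redis, the expiry check is free: the key simply disappears when the TTL lapses, so a plain `EXISTS` answers "is this user online?".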
💡 The "aha" moment: I always thought of apps using "a database." But at this scale, you pick the best database for each type of data. It's like choosing the right tool from a toolbox — not just grabbing the first one you see.
Key Data Models We Defined
| Table | Database | Purpose |
| --- | --- | --- |
| users | PostgreSQL | Stores user profiles |
| messages | Cassandra | Stores all chat messages |
| groups | PostgreSQL | Group chat metadata |
| group_members | PostgreSQL | Who's in which group |
| media_files | Cassandra / Object Store | Images, videos, docs |
Step 5: The Message Flow — Where the Magic Happens
This was my favorite part. Let's trace what actually happens when you send a message to a friend.
Case 1: Both Users Are Online
You type "Hey!" → Your app sends it via WebSocket
→ Chat Server receives it
→ Message is written to Kafka (queued)
→ Message is saved to Cassandra (database)
→ Chat Server forwards it to your friend's WebSocket connection
→ Your friend sees "Hey!" on their screen
Total time for this? Around 150–200ms. That's basically instant to a human.
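The online path above can be sketched as a tiny simulation. The component names (`kafka_queue`, `message_store`, `connections`) are illustrative stand-ins for Kafka, Cassandra, and the WebSocket layer, not real APIs:

```python
from collections import deque

# Hypothetical stand-ins for the real components (names are illustrative).
kafka_queue = deque()          # Kafka: durable buffer before the DB write
message_store = []             # Cassandra: persistent message log
connections = {"bob": []}      # active WebSocket connections, user_id -> inbox

def send_message(sender, recipient, text):
    msg = {"from": sender, "to": recipient, "text": text}
    kafka_queue.append(msg)                       # 1. enqueue for durability
    message_store.append(kafka_queue.popleft())   # 2. consumer persists it
    if recipient in connections:                  # 3. recipient online?
        connections[recipient].append(msg)        #    push over their socket
        return "delivered"
    return "stored"   # offline: fetched later, push notification sent

print(send_message("alice", "bob", "Hey!"))    # delivered
print(send_message("alice", "carol", "Hi"))    # stored (carol is offline)
```

In production these steps run on separate services connected by the queue; collapsing them into one function just makes the ordering easy to see.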
Case 2: Your Friend Is Offline
You type "Hey!" → Same flow as above...
→ But your friend has no active WebSocket connection
→ Message is stored in the database (kept for up to 30 days)
→ A push notification (FCM/APNs) is sent to their phone
→ When they open the app, the message is fetched and delivered
Makes sense, right? The message doesn't disappear just because your friend isn't online.
Case 3: Group Messages
This one is trickier. When you send a message to a group of, say, 50 people:
Your message → Chat Server → Fan-out to all 50 members
"Fan-out" just means the server sends the message out to everyone individually. At scale, this can be expensive — but that's where optimizations like Kafka partitioning come in handy.
Case 4: What If the Network Fails?
This is where things get really interesting. What if your message gets lost mid-send?
Your app uses exponential backoff — it retries sending, but waits a little longer each time (so it doesn't spam the server)
Deduplication makes sure that even if a message is sent multiple times during retries, it only gets stored once
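Both ideas fit in a few lines. The backoff schedule doubles the wait each attempt (with a little random jitter), and dedup works because the client attaches a unique message ID, so the server can drop retries it has already stored. Function and field names here are illustrative:

```python
import random

def backoff_schedule(max_attempts=5, base_delay=0.5):
    """Exponential backoff: delay doubles each attempt, plus random jitter
    so retrying clients don't all hit the server at the same instant."""
    return [base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            for attempt in range(max_attempts)]

# Server-side deduplication keyed on a client-generated message ID.
seen_ids = set()
stored = []
def store_once(message):
    if message["id"] not in seen_ids:
        seen_ids.add(message["id"])
        stored.append(message)

msg = {"id": "m-42", "text": "Hey!"}
store_once(msg)
store_once(msg)          # a retry of the exact same message
print(len(stored))       # 1 -- the duplicate was dropped
print(backoff_schedule(3))   # roughly [0.5, 1.0, 2.0] plus jitter
```

The two mechanisms only work together: retries guarantee the message arrives at least once, and dedup turns "at least once" into "exactly once" in storage.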
🤯 Wait, this is actually really cool because... The system is designed to expect failures. It doesn't hope everything works perfectly — it plans for things to break and handles it gracefully. That's a whole mindset shift.
Step 6: Scaling to 500 Million Users (The Hard Part)
Okay, here's where the rubber meets the road. How do you actually make this thing work for half a billion people?
We talked about a bunch of strategies:
Database Sharding
You can't fit all 500M users' messages on one database server
Sharding splits the data across multiple servers
We used hash-based sharding — each user's messages go to a specific shard based on a hash of their user ID
One cool detail: we planned for growing from 16 shards to 32 shards — and discussed how to migrate data smoothly during that transition
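Hash-based sharding is a one-liner, and the 16-to-32 migration has a neat property worth seeing: with mod-N hashing, doubling the shard count means each user either stays on shard k or moves to exactly shard k + 16, so you know in advance where every row goes. A sketch (MD5 here is just a convenient stable hash, not a claim about WhatsApp's choice):

```python
import hashlib

def shard_for(user_id, num_shards):
    """Hash-based sharding: stable hash of the user ID, mod shard count."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Doubling from 16 to 32 shards: any value v satisfies
# v % 32 == v % 16  or  v % 32 == (v % 16) + 16,
# so each user's new shard is one of exactly two predictable places.
for uid in ["alice", "bob", "carol"]:
    old, new = shard_for(uid, 16), shard_for(uid, 32)
    assert new in (old, old + 16)
    print(uid, old, "->", new)
```

That predictability is what makes the smooth migration possible: you can copy each old shard's data into its two successor shards in the background, then flip routing over.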
Read Replicas
Most of the time, people are reading messages (scrolling through chats) more than writing new ones
Read replicas are copies of the database that only handle read requests — spreading the load
Multi-Level Caching
L1 Cache — Super fast, for the most recent messages
L2 Cache — A bit larger, for slightly older data
L3 Cache — Even larger, catches anything L1 and L2 miss
Think of it like layers of a safety net — each one catches what the previous one missed
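The lookup logic behind those layers is a simple fall-through with backfill: check each level in order, and on a hit, promote the value into the faster levels so the next read is cheaper. A minimal sketch using plain dicts as stand-ins for the cache tiers:

```python
def cached_get(key, caches, database):
    """Check each cache level in order; on a miss, fall through and
    backfill every level on the way back out."""
    for i, cache in enumerate(caches):
        if key in cache:
            value = cache[key]
            for upper in caches[:i]:      # promote into the faster levels
                upper[key] = value
            return value, f"L{i + 1}"
    value = database[key]                  # full miss: hit the database
    for cache in caches:
        cache[key] = value
    return value, "DB"

l1, l2, l3 = {}, {}, {}
db = {"msg:1": "Hey!"}
print(cached_get("msg:1", [l1, l2, l3], db))  # ('Hey!', 'DB')  cold miss
print(cached_get("msg:1", [l1, l2, l3], db))  # ('Hey!', 'L1')  now cached
```

Real tiers would also need eviction policies (L1 smallest and hottest, L3 largest), which is exactly why the levels exist at different sizes.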
Other Scaling Tricks
GeoDNS — Routes users to the nearest server based on their location (lower latency!)
Kafka Partitioning — Splits message processing across multiple workers
Auto-Scaling — Automatically adds more servers when traffic spikes
Rate Limiting (Token Bucket) — Prevents any single user or bot from overwhelming the system
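The token bucket mentioned above is compact enough to write out in full. Tokens refill at a fixed rate up to a burst capacity, and each message spends one token; the parameter values below are illustrative:

```python
class TokenBucket:
    """Token-bucket rate limiter: tokens refill at a fixed rate up to a
    burst capacity; each allowed request spends one token."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity   # start full, allowing an initial burst
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=1, capacity=2)
print([bucket.allow(now=0) for _ in range(3)])  # [True, True, False]
print(bucket.allow(now=1))                      # True: one token refilled
```

The nice property is that it allows short bursts (the capacity) while still enforcing a long-run average rate, which fits chat traffic well.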
💰 The Cost Optimization That Surprised Me
Here's a stat that genuinely surprised me: deleting messages after they're delivered saves 99.2% of storage costs.
Think about it — once a message is delivered and the recipient has it, do you really need to keep it on the server forever? For most messages, no. That single optimization saves an insane amount of money at scale.
Step 7: End-to-End Encryption — The Security Layer
This one is a big deal. WhatsApp uses the Signal Protocol for E2EE, and it's honestly fascinating.
Here's the simplified version of how it works:
Double Ratchet Algorithm — Every single message is encrypted with a different key. So even if someone somehow cracks one key, they can't read any other messages.
Forward Secrecy — If a key is compromised today, it can't be used to decrypt past messages. Only future messages could potentially be affected.
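A heavily simplified sketch of the "different key per message" idea: a symmetric hash ratchet. This is NOT the full Signal double ratchet (which also mixes in fresh Diffie-Hellman exchanges and uses HMAC-based key derivation), just an illustration of why a one-way chain gives each message its own key and protects past messages:

```python
import hashlib

def ratchet(chain_key):
    """Advance the chain one step: derive a one-time message key and the
    next chain key from the current one via a one-way hash. Simplified;
    the real Signal protocol uses HKDF/HMAC plus DH ratchet steps."""
    message_key = hashlib.sha256(chain_key + b"\x01").digest()
    next_chain_key = hashlib.sha256(chain_key + b"\x02").digest()
    return message_key, next_chain_key

chain = b"shared-secret-from-initial-key-exchange"   # illustrative value
keys = []
for _ in range(3):
    mk, chain = ratchet(chain)
    keys.append(mk)

# Every message gets a distinct key, and because SHA-256 is one-way,
# compromising the current chain key reveals nothing about earlier keys.
assert len(set(keys)) == 3
print("3 distinct per-message keys derived")
```

Walking the chain forward is easy; walking it backward would require inverting the hash, which is exactly the forward-secrecy property described above.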
🤔 My honest reaction: I spent a good chunk of time just trying to understand the double ratchet. It's one of those things that sounds complex but once you get the "why" behind it — protecting each message independently — it actually makes a lot of sense.
Step 8: Cool Advanced Features We Explored
Beyond basic messaging, WhatsApp has a bunch of features that each have their own interesting design challenges:
Disappearing Messages — Uses a TTL (Time-To-Live) value. After X seconds/days, the message is automatically deleted. Simple concept, but enforcing it reliably across all devices? That's the tricky part.
View-Once Media — Images/videos that can only be opened once. Requires careful client-side enforcement + server-side tracking.
Status / Stories — These use a pull-based model (your app fetches statuses when you open it) and expire after 24 hours.
Voice Messages — Encoded using the Opus codec, which is great for compressing audio without losing quality.
Live Location Sharing — Your phone sends a GPS update every 30 seconds to the server, which forwards it to whoever you're sharing with. Simple idea, but at scale, that's a lot of location updates flying around.
Step 9: What Happens When Things Break? (Failure Handling)
No system is perfect. The real test of a well-designed system is: what happens when something goes wrong?
We covered a bunch of failure scenarios:
| What Breaks | How We Handle It |
| --- | --- |
| A chat server crashes | Another server picks up within 30–90 seconds |
| Database goes down | Automatically falls back to a replica + retries |
| Network partition | Uses circuit breakers to stop cascading failures |
| Message might get lost | Write to Kafka before sending the ACK (acknowledgment) |
| Thundering herd (tons of users reconnecting at once) | Jittered backoff — each client retries at a slightly random time |
Disaster Recovery Numbers
RTO (Recovery Time Objective): 15 minutes — the system should be back up within 15 minutes of a major failure
RPO (Recovery Point Objective): 5 minutes — we can tolerate losing at most 5 minutes of data
💡 Key Insight: The phrase "write to Kafka before ACK" is one of the most important reliability patterns here. It means: don't tell the user their message was sent until it's safely in the queue. Even if the next step fails, the message isn't lost.
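The write-before-ACK pattern can be shown in a few lines. The queue here is a stand-in for Kafka, and the handler shape is illustrative, not a real chat-server API:

```python
from collections import deque

queue = deque()   # stands in for Kafka: the durable buffer

def handle_send(message, enqueue=queue.append):
    """Only ACK the sender after the message is safely in the queue.
    If enqueueing fails, the client gets no ACK and retries."""
    try:
        enqueue(message)
    except Exception:
        return {"ack": False}   # client will retry with backoff
    return {"ack": True}        # durable now; DB write and delivery
                                # can proceed asynchronously

print(handle_send({"text": "Hey!"}))                  # ACK after enqueue

def broken_enqueue(msg):
    raise RuntimeError("queue unavailable")
print(handle_send({"text": "Hey!"}, broken_enqueue))  # no ACK, so retry
```

The ordering is the whole point: the double tick the user sees corresponds to durability, not to the full delivery pipeline having finished.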
Step 10: The Gaps — What We Didn't Fully Cover
Here's the part I actually appreciated the most from our session. The interviewer flagged areas where a real interview would expect deeper answers. Being honest about gaps is part of growth, so here they are:
Message Search — Searching through millions of messages. Tools like Elasticsearch can help, but it gets complicated when messages are encrypted (you can't search encrypted text easily).
Spam & Abuse Prevention — A 5-layer approach is needed to catch spam, scams, and abusive content at scale.
Media Processing Pipeline — Generating thumbnails, transcoding videos, serving via CDN — this is its own mini-system.
User Authentication — OTP-based login flow, JWT tokens for session management, and how to authenticate WebSocket connections.
API Design & Versioning — How to structure and version the APIs cleanly.
Cost Estimation — Rough estimate for running this whole thing: ~$650K/month for 500 million users. Yeah, WhatsApp is not cheap to run.
These are areas I want to dive deeper into next. Especially message search and the media pipeline — they sound like they'd make great blog posts on their own!
What I'd Do Differently Next Time
Looking back, here are a few things I'd change or improve:
Start with the message flow earlier. The message flow diagram is the heart of the system. Everything else is built around it. I'd draw that first, then layer in the components.
Think about failure scenarios from the start. I added failure handling at the end, but in reality, you should be asking "what if this breaks?" at every step.
Estimate costs earlier. Numbers like $650K/month give you a sense of scale — and that helps you make better design decisions earlier on.
🎯 Interview Tip: Structure your system design answer like this: Requirements → High-Level Design → Deep Dive → Scaling → Wrap-Up
Don't jump straight to solutions. Don't hand-wave on scale. And don't ignore edge cases. These are the three most common mistakes candidates make.
About the Author
Saurabh Prajapati is a Full-Stack Software Engineer at IBM India Software Lab, specializing in GenAI, React, and modern web technologies. He loves exploring new tools, building things, and sharing what he learns along the way.
📧 saurabhprajapati120@gmail.com 🐙 GitHub: prajapatisaurabh 💼 LinkedIn: saurabh-prajapati
Currently available for work. Let's build something cool together.




