
How I Broke Down WhatsApp's System Design — And What I Learned Along the Way


By Saurabh Prajapati | Full-Stack Engineer


So, I recently sat down to design WhatsApp from scratch. Not the real thing — obviously — but the kind of "design this system" challenge you'd face in a senior-level system design interview.

And honestly? It was one of the most fun and mind-bending exercises I've done in a while.

I wanted to document the whole journey here — not just the final architecture, but the thinking process, the mistakes, and those "wait, that's actually brilliant" moments along the way.

Let's get into it.


Why WhatsApp? Why Now?

I kept seeing WhatsApp pop up as a classic system design interview question. And every time, I'd think — "Yeah, it's just a messaging app, how hard can it be?"

Spoiler alert: incredibly hard.

Once I actually started breaking it down — 500 million daily active users, real-time messaging, group chats, end-to-end encryption — I realized this single app touches almost every concept in modern system design.

So I thought: why not try it myself and see where I get stuck?

That's exactly what happened. And this blog is me sharing everything — the good, the confusing, and the "I wish someone had told me this earlier" parts.


Step 1: Start With the Requirements (Seriously, Don't Skip This)

Before touching any architecture diagram, the first thing we did was define what WhatsApp actually needs to do.

This is split into two buckets:

Functional Requirements (What the app does)

  • Real-time one-on-one messaging

  • Group chats

  • Media sharing (images, videos, documents)

  • End-to-end encryption (E2EE)

Non-Functional Requirements (How well it does it)

  • Low latency — messages should feel instant (< 200ms)

  • High availability — the app should almost never go down (99.99% uptime)

  • Massive scale — we're talking 500 million+ daily active users

I'll be honest — when I first wrote down "500 million DAU," my brain just kind of short-circuited for a second. That's not a small number. That number changes everything about how you design the system.

💡 I wish I knew this earlier: Always nail down the scale first. It's the single biggest factor that shapes every decision you make — from database choices to caching strategies.
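To see why scale shapes everything, here's a quick back-of-envelope calculation for those requirements. The per-user message count and average message size are my own assumed figures, not official WhatsApp numbers:

```python
# Back-of-envelope sizing for the requirements above.
# DAU comes from the requirements; the other two figures are assumptions.
DAU = 500_000_000            # daily active users
MSGS_PER_USER_PER_DAY = 40   # assumed average
AVG_MSG_BYTES = 100          # assumed average message size

msgs_per_day = DAU * MSGS_PER_USER_PER_DAY
msgs_per_sec = msgs_per_day // 86_400                # seconds in a day
storage_per_day_gb = msgs_per_day * AVG_MSG_BYTES / 1e9

print(f"{msgs_per_day:,} messages/day")              # 20 billion
print(f"~{msgs_per_sec:,} messages/sec (average)")   # and peaks are far higher
print(f"~{storage_per_day_gb:,.0f} GB/day of raw message text")
```

Twenty billion writes a day, before media, is why "just use one Postgres box" stops being an answer.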


Step 2: The High-Level Architecture

Okay, here's where it gets exciting. We mapped out the big picture — all the major components and how they talk to each other.

Here's the flow:

Client Apps → Load Balancer → Chat Servers → Message Queue (Kafka) → Database Layer → Cache (Redis) → Push Notification Service

Let me break down what each piece does:

  • Client Apps — The WhatsApp app on your phone or desktop. This is where the user interacts.

  • Load Balancer — Spreads incoming traffic across multiple chat servers so no single server gets overwhelmed.

  • Chat Servers — The brain of the operation. They handle incoming messages, route them, and manage WebSocket connections.

  • Message Queue (Kafka) — A buffer that holds messages before they get written to the database. This is huge for reliability (more on that later).

  • Database Layer — Where all the data actually lives. (We used two different databases here — and the reason why is interesting.)

  • Cache (Redis) — Stores frequently accessed data in memory so we don't hit the database every single time.

  • Push Notification Service (FCM/APNs) — Sends notifications to your phone when you get a new message and you're not actively using the app.

The moment this all clicked together for me was when I realized — every single piece here exists for a reason. Nothing is extra. Nothing is "just there." Each component solves a specific problem at scale.


Step 3: How Do Devices Actually Talk? (Protocols)

This was one of the parts I found really interesting to think about.

Not all communication in WhatsApp uses the same protocol. Different situations call for different tools:

  • WebSockets — For real-time messaging. WebSockets keep a persistent, two-way connection open between your device and the server. So when you type a message, it goes through instantly — no need to make a new request every time.

  • HTTP/REST — For things like signing up, updating your profile, or fetching your contact list. These are one-time requests, so a simple HTTP call works perfectly.

  • FCM / APNs — For push notifications when you're offline. FCM is Google's service (Android), APNs is Apple's (iOS).

🤔 A question I asked myself: "Why not just use WebSockets for everything?" The answer? WebSockets are great for real-time stuff, but they're overkill for simple request-response tasks like fetching your profile. Using the right tool for the right job = less complexity, less resource usage.


Step 4: Choosing the Right Databases (This One Blew My Mind)

Okay, this was a big learning moment for me. WhatsApp doesn't use just one database. It uses multiple databases, each built for a different kind of data.

Here's the breakdown:

Cassandra — For Messages

  • Messages are written constantly and in massive volumes

  • Cassandra is built for exactly this — insane write throughput

  • It's also great for time-series data (messages are naturally ordered by time)

  • Think of it as: "I need to store a river of messages, fast and reliably"

PostgreSQL — For User & Group Data

  • User profiles, group info, contact lists — this data is relational

  • You need to join tables, run complex queries

  • PostgreSQL is a classic, reliable relational database — perfect for this

Redis — For Caching & Online Status

  • "Is this user online right now?" — that's a question that gets asked millions of times per second

  • Hitting PostgreSQL every single time would be way too slow

  • Redis stores this in memory — lightning fast reads

💡 The "aha" moment: I always thought of apps using "a database." But at this scale, you pick the best database for each type of data. It's like choosing the right tool from a toolbox — not just grabbing the first one you see.

Key Data Models We Defined

Table         | Database                 | Purpose
--------------|--------------------------|--------------------------
users         | PostgreSQL               | Stores user profiles
messages      | Cassandra                | Stores all chat messages
groups        | PostgreSQL               | Group chat metadata
group_members | PostgreSQL               | Who's in which group
media_files   | Cassandra / Object Store | Images, videos, docs
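The messages model above can be sketched as a record shaped for Cassandra-style access. The field names here are my own illustration, not the actual schema — the point is the key structure:

```python
from dataclasses import dataclass, field
import time
import uuid

@dataclass
class Message:
    """Sketch of a messages row keyed for Cassandra-style access.

    Partitioning by chat_id keeps one conversation together on one node;
    clustering by sent_at returns messages in time order — exactly the
    "river of messages ordered by time" access pattern described above.
    """
    chat_id: str        # partition key
    sender_id: str
    body: str
    sent_at: float = field(default_factory=time.time)                 # clustering key
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```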

Step 5: The Message Flow — Where the Magic Happens

This was my favorite part. Let's trace what actually happens when you send a message to a friend.

Case 1: Both Users Are Online

You type "Hey!" → Your app sends it via WebSocket
→ Chat Server receives it
→ Message is written to Kafka (queued)
→ Message is saved to Cassandra (database)
→ Chat Server forwards it to your friend's WebSocket connection
→ Your friend sees "Hey!" on their screen

Total time for this? Around 150–200ms. That's basically instant to a human.

Case 2: Your Friend Is Offline

You type "Hey!" → Same flow as above...
→ But your friend has no active WebSocket connection
→ Message is stored in the database (kept for up to 30 days)
→ A push notification (FCM/APNs) is sent to their phone
→ When they open the app, the message is fetched and delivered

Makes sense, right? The message doesn't disappear just because your friend isn't online.

Case 3: Group Messages

This one is trickier. When you send a message to a group of, say, 50 people:

Your message → Chat Server → Fan-out to all 50 members

"Fan-out" just means the server sends the message out to everyone individually. At scale, this can be expensive — but that's where optimizations like Kafka partitioning come in handy.

Case 4: What If the Network Fails?

This is where things get really interesting. What if your message gets lost mid-send?

  • Your app uses exponential backoff — it retries sending, but waits a little longer each time (so it doesn't spam the server)

  • Deduplication makes sure that even if a message is sent multiple times during retries, it only gets stored once

🤯 Wait, this is actually really cool because... The system is designed to expect failures. It doesn't hope everything works perfectly — it plans for things to break and handles it gracefully. That's a whole mindset shift.
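Those two ideas — exponential backoff on the client, deduplication on the server — are small enough to sketch. This is a minimal illustration of the pattern, not WhatsApp's actual code; the function and class names are mine:

```python
import random
import time

def send_with_backoff(send_fn, message, max_attempts=5, base_delay=0.5):
    """Retry send_fn with exponential backoff plus random jitter."""
    for attempt in range(max_attempts):
        try:
            return send_fn(message)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Wait 0.5s, 1s, 2s, ... plus jitter so thousands of clients
            # don't all retry at the same instant.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

class DedupStore:
    """Server side: store each message at most once, keyed by a client-generated ID."""
    def __init__(self):
        self._seen = {}

    def store(self, message_id, body):
        if message_id in self._seen:   # a retry re-sent this message
            return False
        self._seen[message_id] = body
        return True
```

Retries make delivery at-least-once; dedup by message ID turns that into effectively exactly-once storage.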


Step 6: Scaling to 500 Million Users (The Hard Part)

Okay, here's where the rubber meets the road. How do you actually make this thing work for half a billion people?

We talked about a bunch of strategies:

Database Sharding

  • You can't fit all 500M users' messages on one database server

  • Sharding splits the data across multiple servers

  • We used hash-based sharding — each user's messages go to a specific shard based on a hash of their user ID

  • One cool detail: we planned for growing from 16 shards to 32 shards — and discussed how to migrate data smoothly during that transition
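Hash-based sharding fits in a few lines. One assumption worth calling out: I use a deterministic digest rather than Python's built-in `hash()`, because `hash()` is randomized per process and every server must agree on the mapping:

```python
import hashlib

NUM_SHARDS = 16  # planned to grow to 32

def shard_for_user(user_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a user ID to a shard via a stable hash."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_shards
```

A nice property of doubling the shard count: because `x % 32` is always either `x % 16` or `x % 16 + 16`, each of the 16 old shards splits into exactly two new ones — migration stays local instead of reshuffling every user.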

Read Replicas

  • Most of the time, people are reading messages (scrolling through chats) more than writing new ones

  • Read replicas are copies of the database that only handle read requests — spreading the load

Multi-Level Caching

  • L1 Cache — Super fast, for the most recent messages

  • L2 Cache — A bit larger, for slightly older data

  • L3 Cache — Even larger, catches anything L1 and L2 miss

  • Think of it like layers of a safety net — each one catches what the previous one missed
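The safety-net behaviour can be sketched with plain dictionaries standing in for the real cache tiers (in practice L1 might be in-process memory and L2/L3 Redis clusters — that split is my assumption, not something we pinned down):

```python
class MultiLevelCache:
    """Check each level in order; on a hit, backfill the faster levels above it."""
    def __init__(self, num_levels=3):
        self.levels = [{} for _ in range(num_levels)]  # L1, L2, L3

    def get(self, key, load_from_db):
        for i, level in enumerate(self.levels):
            if key in level:
                # Promote the value into the faster levels above this one.
                for j in range(i):
                    self.levels[j][key] = level[key]
                return level[key]
        # Full miss: fall through to the database and populate every level.
        value = load_from_db(key)
        for level in self.levels:
            level[key] = value
        return value
```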

Other Scaling Tricks

  • GeoDNS — Routes users to the nearest server based on their location (lower latency!)

  • Kafka Partitioning — Splits message processing across multiple workers

  • Auto-Scaling — Automatically adds more servers when traffic spikes

  • Rate Limiting (Token Bucket) — Prevents any single user or bot from overwhelming the system
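The token bucket from that last bullet is a classic, so here's a minimal version. The numbers are illustrative — real limits would be tuned per endpoint:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refills `rate` tokens/sec, bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The capacity allows short bursts (sending a few messages quickly is fine), while the refill rate caps sustained throughput — which is exactly what stops a bot without annoying a normal user.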

💰 The Cost Optimization That Surprised Me

Here's a stat that genuinely surprised me: deleting messages after they're delivered saves 99.2% of storage costs.

Think about it — once a message is delivered and the recipient has it, do you really need to keep it on the server forever? For most messages, no. That single optimization saves an insane amount of money at scale.


Step 7: End-to-End Encryption — The Security Layer

This one is a big deal. WhatsApp uses the Signal Protocol for E2EE, and it's honestly fascinating.

Here's the simplified version of how it works:

  • Double Ratchet Algorithm — Every single message is encrypted with a different key. So even if someone somehow cracks one key, they can't read any other messages.

  • Forward Secrecy — If a key is compromised today, it can't be used to decrypt past messages. Only future messages could potentially be affected.

🤔 My honest reaction: I spent a good chunk of time just trying to understand the double ratchet. It's one of those things that sounds complex but once you get the "why" behind it — protecting each message independently — it actually makes a lot of sense.
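To make the "why" concrete, here's a toy version of the symmetric-ratchet idea — emphatically *not* the real Signal Double Ratchet (which also mixes in Diffie-Hellman steps), just the one-way chain that gives each message its own key:

```python
import hashlib

def ratchet_step(chain_key: bytes) -> tuple[bytes, bytes]:
    """One symmetric ratchet step (a toy illustration, not Signal itself).

    Derive this message's key and the next chain key from the current chain
    key with one-way hashes. Because a hash can't be reversed, leaking one
    message key tells an attacker nothing about earlier keys.
    """
    message_key = hashlib.sha256(chain_key + b"msg").digest()
    next_chain_key = hashlib.sha256(chain_key + b"chain").digest()
    return message_key, next_chain_key
```

Both sides start from the same shared secret and turn the crank in lockstep, so they derive identical per-message keys without ever sending a key over the wire.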


Step 8: Cool Advanced Features We Explored

Beyond basic messaging, WhatsApp has a bunch of features that each have their own interesting design challenges:

  • Disappearing Messages — Uses a TTL (Time-To-Live) value. After X seconds/days, the message is automatically deleted. Simple concept, but enforcing it reliably across all devices? That's the tricky part.

  • View-Once Media — Images/videos that can only be opened once. Requires careful client-side enforcement + server-side tracking.

  • Status / Stories — These use a pull-based model (your app fetches statuses when you open it) and expire after 24 hours.

  • Voice Messages — Encoded using the Opus codec, which is great for compressing audio without losing quality.

  • Live Location Sharing — Your phone sends a GPS update every 30 seconds to the server, which forwards it to whoever you're sharing with. Simple idea, but at scale, that's a lot of location updates flying around.
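The TTL idea behind disappearing messages can be sketched with lazy expiry — expired messages are filtered out at read time rather than deleted by a background job (a real system would likely do both; this store is my simplification):

```python
import time

class DisappearingStore:
    """Store messages with a TTL; expired ones are dropped on read (lazy expiry)."""
    def __init__(self):
        self._messages = []  # list of (expires_at, text)

    def add(self, text: str, ttl_seconds: float):
        self._messages.append((time.monotonic() + ttl_seconds, text))

    def visible(self):
        now = time.monotonic()
        # Drop anything past its deadline before returning the rest.
        self._messages = [(t, m) for t, m in self._messages if t > now]
        return [m for _, m in self._messages]
```

The server-side part is the easy half — the tricky part the bullet above mentions is getting every device, including ones that were offline, to honour the same deadline.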


Step 9: What Happens When Things Break? (Failure Handling)

No system is perfect. The real test of a well-designed system is: what happens when something goes wrong?

We covered a bunch of failure scenarios:

What Breaks                                          | How We Handle It
-----------------------------------------------------|------------------------------------------------------------------
A chat server crashes                                | Another server picks up within 30–90 seconds
Database goes down                                   | Automatically falls back to a replica + retries
Network partition                                    | Uses circuit breakers to stop cascading failures
Message might get lost                               | Write to Kafka before sending the ACK (acknowledgment)
Thundering herd (tons of users reconnecting at once) | Jittered backoff — each client retries at a slightly random time
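The circuit breaker from the network-partition row deserves a sketch, because the pattern is so counterintuitive: when a dependency is struggling, you *stop calling it*. This is a minimal consecutive-failure version; thresholds and timings are illustrative:

```python
import time

class CircuitBreaker:
    """Open the circuit after `threshold` consecutive failures, so callers
    fail fast instead of piling more load onto a struggling dependency."""
    def __init__(self, threshold=3, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open — failing fast")
            self.opened_at = None   # half-open: let one trial call through
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0           # success resets the failure count
        return result
```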

Disaster Recovery Numbers

  • RTO (Recovery Time Objective): 15 minutes — the system should be back up within 15 minutes of a major failure

  • RPO (Recovery Point Objective): 5 minutes — we can tolerate losing at most 5 minutes of data

💡 Key Insight: The phrase "write to Kafka before ACK" is one of the most important reliability patterns here. It means: don't tell the user their message was sent until it's safely in the queue. Even if the next step fails, the message isn't lost.
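The ordering is the whole pattern, so here's the shape of it with an in-memory list standing in for Kafka (a real producer would wait for broker acknowledgments where the comment indicates):

```python
class DurableQueue:
    """Stand-in for Kafka: append() only returns once the write is durable."""
    def __init__(self):
        self.log = []

    def append(self, message) -> bool:
        self.log.append(message)   # real Kafka: block here until brokers ack
        return True

def handle_send(queue, message):
    """ACK to the sender only after the message is safely in the queue.

    Downstream steps (the Cassandra write, delivery to the recipient) can
    fail and be replayed from the queue — the message itself is never lost.
    """
    if not queue.append(message):
        return {"status": "retry"}   # durable write failed: don't claim success
    return {"status": "acked", "id": message["id"]}
```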


Step 10: The Gaps — What We Didn't Fully Cover

Here's the part I actually appreciated the most from our session. The interviewer flagged areas where a real interview would expect deeper answers. Being honest about gaps is part of growth, so here they are:

  • Message Search — Searching through millions of messages. Tools like Elasticsearch can help, but it gets complicated when messages are encrypted (you can't search encrypted text easily).

  • Spam & Abuse Prevention — A 5-layer approach is needed to catch spam, scams, and abusive content at scale.

  • Media Processing Pipeline — Generating thumbnails, transcoding videos, serving via CDN — this is its own mini-system.

  • User Authentication — OTP-based login flow, JWT tokens for session management, and how to authenticate WebSocket connections.

  • API Design & Versioning — How to structure and version the APIs cleanly.

  • Cost Estimation — Rough estimate for running this whole thing: ~$650K/month for 500 million users. Yeah, WhatsApp is not cheap to run.

These are areas I want to dive deeper into next. Especially message search and the media pipeline — they sound like they'd make great blog posts on their own!


What I'd Do Differently Next Time

Looking back, here are a few things I'd change or improve:

  1. Start with the message flow earlier. The message flow diagram is the heart of the system. Everything else is built around it. I'd draw that first, then layer in the components.

  2. Think about failure scenarios from the start. I added failure handling at the end, but in reality, you should be asking "what if this breaks?" at every step.

  3. Estimate costs earlier. Numbers like $650K/month give you a sense of scale — and that helps you make better design decisions earlier on.


🎯 Interview Tip: Structure your system design answer like this: Requirements → High-Level Design → Deep Dive → Scaling → Wrap-Up

Don't jump straight to solutions. Don't hand-wave on scale. And don't ignore edge cases. These are the three most common mistakes candidates make.


About the Author

Saurabh Prajapati is a Full-Stack Software Engineer at IBM India Software Lab, specializing in GenAI, React, and modern web technologies. He loves exploring new tools, building things, and sharing what he learns along the way.

📧 saurabhprajapati120@gmail.com 🐙 GitHub: prajapatisaurabh 💼 LinkedIn: saurabh-prajapati

Currently available for work. Let's build something cool together.