Designing WhatsApp at Scale
Engineer a messaging system optimized for reliability, offline sync, and extreme efficiency with billions of users worldwide.
Problem Statement
Design WhatsApp: a messaging platform serving 2+ billion users with emphasis on reliability, minimal infrastructure, offline message sync, and guaranteed delivery. The system must work well on low-end devices with poor network connectivity, support E2E encryption, and handle billions of messages daily with remarkably small engineering teams.
Why This Problem Matters
- WhatsApp is famous for extreme efficiency: roughly 50 engineers were supporting 900M users by 2015, the year after the acquisition. Understanding their architecture reveals powerful simplicity.
- Tests ability to design for unreliable networks and offline-first usage patterns.
- Demonstrates how constraints (small team, global scale) drive architectural decisions.
Thought Process
Prioritize reliability over features
WhatsApp's core promise is message delivery. Every architectural decision prioritizes this. Messages must never be lost, even if the user is offline for days. Delivery receipts (single check, double check, blue check) provide transparency.
Design for poor network conditions
Many users are on 2G networks or have intermittent connectivity. The protocol must be efficient (minimal bytes), resumable (handle dropped connections), and offline-capable (queue messages locally, sync when connected).
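The "queue locally, sync when connected" idea can be sketched as a small client-side outbox. This is a minimal illustration, not WhatsApp's actual client: `Outbox`, `Transport`, and the string payloads are hypothetical names, and a real client would persist the queue to disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical client-side outbox: messages stay queued until the server ACKs them. */
final class Outbox {
    /** Assumed transport hook; returns false when the connection has dropped. */
    interface Transport {
        boolean send(String localId, String ciphertext);
    }

    // Insertion-ordered so messages are sent in the order the user wrote them.
    private final Map<String, String> pending = new LinkedHashMap<>();

    /** Queues a message; re-queuing the same local ID is a no-op. */
    void enqueue(String localId, String ciphertext) {
        pending.putIfAbsent(localId, ciphertext);
    }

    /** Sends queued messages in order, stopping at the first failure. Returns how many were sent. */
    int flush(Transport transport) {
        int sent = 0;
        // Iterate over a copy so an ACK arriving during the loop cannot break iteration.
        for (Map.Entry<String, String> entry : new LinkedHashMap<>(pending).entrySet()) {
            if (!transport.send(entry.getKey(), entry.getValue())) {
                break; // connection dropped: keep the rest for the next attempt
            }
            sent++;
        }
        return sent;
    }

    /** Only a server ACK removes a message, so a dropped connection never loses data. */
    void onServerAck(String localId) {
        pending.remove(localId);
    }

    int size() {
        return pending.size();
    }
}
```

Because entries are removed only on ACK, a message sent just before a disconnect is simply sent again on the next flush; the server is expected to deduplicate by local ID.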
Choose the right technology stack
WhatsApp famously uses Erlang/BEAM for its servers. Erlang excels at concurrent connections (millions per server), fault tolerance (supervisor trees), and hot code reloading. This enables massive scale with small teams.
Implement guaranteed delivery
Use a store-and-forward model. When a message is sent, it's stored on the server until the recipient acknowledges receipt. The sender sees "sent" once the server confirms it has stored the message, and "delivered" only after the recipient's device ACKs. Messages are deleted from the server only after the delivery ACK.
Handle multi-device and sync
Originally WhatsApp was single-device (phone as source of truth). Multi-device requires each message to be delivered to all user devices. Use message sequence numbers and sync protocols to ensure consistency across devices.
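One way to picture sequence-number sync is a per-user log with one cursor per device. The sketch below is an assumption-laden illustration (the class `DeviceSyncLog` and its methods are hypothetical, and it keeps everything in memory): each device fetches what lies beyond its cursor, and an entry is pruned once every registered device has acknowledged it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-user message log with one acknowledgement cursor per device. */
final class DeviceSyncLog {
    private final TreeMap<Long, String> log = new TreeMap<>();    // sequence number -> ciphertext
    private final Map<String, Long> ackedUpTo = new HashMap<>();  // deviceId -> highest ACKed sequence
    private long nextSeq = 1;

    void registerDevice(String deviceId) {
        ackedUpTo.putIfAbsent(deviceId, 0L);
    }

    /** Appends a message and returns its sequence number. */
    long append(String ciphertext) {
        long seq = nextSeq++;
        log.put(seq, ciphertext);
        return seq;
    }

    /** Messages the device has not yet acknowledged, in sequence order. */
    List<String> pendingFor(String deviceId) {
        long cursor = ackedUpTo.getOrDefault(deviceId, 0L);
        return new ArrayList<>(log.tailMap(cursor, false).values());
    }

    /** Advances the device cursor, then prunes entries every device has ACKed. */
    void ack(String deviceId, long seq) {
        ackedUpTo.merge(deviceId, seq, Long::max);
        long minAcked = ackedUpTo.values().stream()
                .mapToLong(Long::longValue).min().orElse(0L);
        log.headMap(minAcked, true).clear();
    }

    int storedCount() {
        return log.size();
    }
}
```

A cumulative cursor keeps the ACK small (one number per device) and makes reconnect cheap: the device only needs to report the last sequence number it processed.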
Implement end-to-end encryption
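Use the Signal Protocol. Clients upload prekeys to the server so that a sender can establish a session even while the recipient is offline. Each message is encrypted with its own message key derived from the session state, so the server cannot read message content. Group chats use sender keys for efficiency.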
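To make the per-message key idea concrete, here is a sketch of only the symmetric chain step described in the Double Ratchet specification, which suggests HMAC with the chain key as the key and distinct constants (for example 0x01 and 0x02) as input. The class name `ChainRatchet` is hypothetical, and this is not a usable encryption scheme on its own: the Diffie-Hellman ratchet, prekey handshake, and authenticated encryption are all omitted.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative symmetric KDF chain: each step yields a fresh message key. */
final class ChainRatchet {
    private byte[] chainKey;

    ChainRatchet(byte[] initialChainKey) {
        this.chainKey = initialChainKey.clone();
    }

    /** Derives the next message key and advances the chain so old keys cannot be re-derived. */
    byte[] nextMessageKey() {
        byte[] messageKey = hmacSha256(chainKey, (byte) 0x01);
        chainKey = hmacSha256(chainKey, (byte) 0x02);
        return messageKey;
    }

    private static byte[] hmacSha256(byte[] key, byte input) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(new byte[] { input });
        } catch (GeneralSecurityException e) {
            // HmacSHA256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Both sides start from the same chain key and step in lockstep, so they derive identical message keys without ever sending a key over the wire. In production, use an audited Signal Protocol library rather than hand-rolled code like this.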
Step-by-Step Reasoning
- Connection: Client establishes persistent connection via custom protocol (or WebSocket) → Server tracks connection state in memory → Heartbeat every 30s to detect disconnection.
- Send Message: Client encrypts message → Sends to server with local message ID → Server stores in Messages DB → Returns ACK with server message ID → Client shows single checkmark (sent).
- Deliver Message: Server checks if recipient connected → If yes, pushes immediately → Recipient ACKs receipt → Server forwards ACK to sender → Sender shows double checkmark (delivered).
- Offline Handling: If recipient offline, message stays in server queue → When recipient connects, server pushes all pending messages → Client processes in order and ACKs each.
- Read Receipt: When recipient opens chat, client sends "read" event for latest message → Server notifies sender → Blue checkmarks shown.
- Multi-device Sync: Each device has device_id → Messages delivered to all registered devices → Each device ACKs independently → Message deleted from server only when all devices ACK.
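The checkmark states in the steps above form a small one-way state machine. The following sketch assumes hypothetical names (`ReceiptStatus`, `ReceiptTracker`) and shows one way the sender's client could apply receipts so that a late or reordered event never moves a message backwards, for example a "delivered" receipt that arrives after "read".

```java
import java.util.HashMap;
import java.util.Map;

/** Ordered so that a later constant means further progress. */
enum ReceiptStatus { PENDING, SENT, DELIVERED, READ }

/** Hypothetical sender-side tracker for single, double, and blue checkmarks. */
final class ReceiptTracker {
    private final Map<String, ReceiptStatus> statusByMessage = new HashMap<>();

    /** Applies a reported status; the stored status only ever moves forward. */
    ReceiptStatus advance(String messageId, ReceiptStatus reported) {
        return statusByMessage.merge(messageId, reported,
                (current, next) -> next.compareTo(current) > 0 ? next : current);
    }

    ReceiptStatus statusOf(String messageId) {
        return statusByMessage.getOrDefault(messageId, ReceiptStatus.PENDING);
    }
}
```

Making the update monotonic also makes it idempotent, so a receipt that is redelivered after a reconnect is harmless.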
Dry Run
Alice sends "Hi" to Bob (both online, good connectivity)
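Message encrypted → Sent to server (50ms) → Stored → Pushed to Bob (30ms) → Bob ACKs → Alice sees double check. Bob receives the message in under 100ms; Alice's double check follows once Bob's ACK has travelled back through the server.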
Alice sends "Hello?" to Bob (Bob offline)
Message stored on server → Alice sees single check → 2 hours later Bob connects → Server pushes message → Bob ACKs → Alice sees double check.
Alice sends message while on flaky 2G network
Client queues message locally → Retries with exponential backoff → Connection restored → Message sent → Server ACKs → Local queue cleared.
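The retry step in the flaky-network case can be sketched as capped exponential backoff with full jitter. The class name `Backoff` and the 500 ms base and 30 s cap used in the usage note are illustrative assumptions, not values taken from WhatsApp.

```java
import java.util.Random;

/** Capped exponential backoff with full jitter. */
final class Backoff {
    private final long baseMillis;
    private final long maxMillis;
    private final Random random;

    Backoff(long baseMillis, long maxMillis, Random random) {
        if (baseMillis <= 0 || maxMillis < baseMillis) {
            throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
        }
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
        this.random = random;
    }

    /** Upper bound for a 0-based attempt: base * 2^attempt, capped at maxMillis. */
    long ceilingMillis(int attempt) {
        int shift = Math.min(Math.max(attempt, 0), 62);
        // Compare against the shifted cap first so the multiplication cannot overflow.
        if (baseMillis > (maxMillis >> shift)) {
            return maxMillis;
        }
        return baseMillis << shift;
    }

    /** Full jitter: a uniformly random delay in [0, ceiling] to avoid synchronized retries. */
    long nextDelayMillis(int attempt) {
        long ceiling = ceilingMillis(attempt);
        return (long) (random.nextDouble() * (ceiling + 1));
    }
}
```

With `new Backoff(500, 30_000, new Random())`, the ceilings grow as 500 ms, 1 s, 2 s, and so on until they reach the 30 s cap. The jitter matters at scale: after a regional outage, millions of clients reconnecting on the same schedule would otherwise hit the servers in waves.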
Complexity Analysis
Time
Send: O(1) server processing. Delivery: O(devices) for multi-device. Offline sync: O(pending_messages).
Space
O(pending_messages) server storage—messages deleted after delivery. O(message_history) on device.
Why
Store-and-forward bounds server storage to in-flight messages only. Device stores full history.
Annotated Solution
// WhatsApp-style message delivery pseudocode.
// MessageStore, ConnectionManager, DeviceRegistry, Connection, EncryptedMessage,
// SendResult, and MessageStatus are assumed collaborator types, not defined here.
import java.util.List;

public class MessageDeliveryService {
    private final MessageStore messageStore;
    private final ConnectionManager connectionManager;
    private final DeviceRegistry deviceRegistry;

    public MessageDeliveryService(MessageStore messageStore,
                                  ConnectionManager connectionManager,
                                  DeviceRegistry deviceRegistry) {
        this.messageStore = messageStore;
        this.connectionManager = connectionManager;
        this.deviceRegistry = deviceRegistry;
    }

    public SendResult sendMessage(String senderId, String recipientId,
                                  EncryptedMessage message) {
        // 1. Store message (source of truth) before any delivery attempt
        String messageId = messageStore.store(message);

        // 2. Attempt delivery to all recipient devices
        List<String> devices = deviceRegistry.getDevices(recipientId);
        for (String deviceId : devices) {
            attemptDelivery(messageId, recipientId, deviceId);
        }
        // SENT means "stored on server"; DELIVERED is reported later via onDeliveryAck
        return new SendResult(messageId, MessageStatus.SENT);
    }

    private void attemptDelivery(String messageId, String recipientId,
                                 String deviceId) {
        Connection conn = connectionManager.getConnection(recipientId, deviceId);
        if (conn != null && conn.isActive()) {
            // Online: push immediately; the ACK arrives asynchronously
            EncryptedMessage message = messageStore.get(messageId);
            conn.send(message);
        }
        // If offline: message stays in store, delivered on connect
    }

    public void onDeviceConnected(String userId, String deviceId) {
        Connection conn = connectionManager.getConnection(userId, deviceId);
        if (conn == null || !conn.isActive()) {
            return; // Connection dropped again; pending messages stay in the store
        }
        // Deliver all pending messages for this device, in stored order
        List<EncryptedMessage> pending = messageStore.getPending(userId, deviceId);
        for (EncryptedMessage message : pending) {
            conn.send(message);
        }
    }

    public void onDeliveryAck(String messageId, String deviceId) {
        // Mark delivered to this device
        messageStore.markDelivered(messageId, deviceId);
        // If delivered to all devices, notify sender
        if (messageStore.isFullyDelivered(messageId)) {
            notifySenderDelivered(messageId);
            // Can now delete from server (optional, for privacy)
        }
    }

    private void notifySenderDelivered(String messageId) {
        // Placeholder: push a "delivered" receipt to the sender's devices
    }
}

The store-and-forward pattern ensures no message loss. Messages persist on the server until all recipient devices ACK, then can be deleted to minimize storage.
Common Pitfalls
- Message ordering: Network delays can cause reordering. Use timestamp + sender + sequence for deterministic ordering.
- Duplicate messages: Client retries can cause duplicates. Use idempotency keys (local message ID) for deduplication.
- Offline duration limits: Can't store pending messages forever. Set reasonable TTL (30 days) with user notification.
- Presence accuracy: "Last seen" can be stale. Use eventual consistency—accuracy within minutes is acceptable.
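The first two pitfalls lend themselves to a short sketch: deduplication keyed on (sender, local message ID), and a comparator that orders by timestamp, then sender, then sequence. All names here (`InboundDeduplicator`, `ChatMessage`, `DISPLAY_ORDER`) are hypothetical, and the in-memory set is an assumption; a real server would bound it, for instance with the same TTL it applies to pending messages.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Drops client retries by remembering (senderId, localMessageId) pairs. */
final class InboundDeduplicator {
    private final Set<List<String>> seen = new HashSet<>();

    /** Returns true only the first time a given pair is observed. */
    boolean firstTime(String senderId, String localMessageId) {
        return seen.add(Arrays.asList(senderId, localMessageId));
    }
}

/** Minimal message header carrying just the fields needed for ordering. */
final class ChatMessage {
    final long timestampMillis;
    final String senderId;
    final long senderSeq;

    ChatMessage(long timestampMillis, String senderId, long senderSeq) {
        this.timestampMillis = timestampMillis;
        this.senderId = senderId;
        this.senderSeq = senderSeq;
    }

    /** Deterministic order: timestamp first, ties broken by sender, then by sender sequence. */
    static final Comparator<ChatMessage> DISPLAY_ORDER =
            Comparator.comparingLong((ChatMessage m) -> m.timestampMillis)
                    .thenComparing(m -> m.senderId)
                    .thenComparingLong(m -> m.senderSeq);
}
```

Because the comparator is a total order over fields every device sees identically, two devices that receive the same messages in different arrival orders still render the same conversation.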
Follow-Up Questions
- How would you implement message backup to cloud? (Encrypt with user key, store in Google Drive/iCloud, restore on new device)
- How do you handle group chats with 1000+ members? (Sender keys for encryption, fan-out through message queues, delivery receipts sampled)
- How would you implement voice/video calls? (WebRTC for media, signaling through existing message channel)