Meta System Design

Designing Facebook Messenger

Architect a real-time messaging platform supporting billions of messages daily with presence, delivery receipts, and end-to-end encryption.

Real-time · WebSocket · Message Queue · Encryption · Presence

Problem Statement

Design a messaging system like Facebook Messenger that supports 1:1 and group chats, real-time message delivery, read receipts, typing indicators, presence (online/offline), message history, and end-to-end encryption. The system must handle billions of messages per day and support both mobile and web clients.

Why This Problem Matters

  • Real-time messaging is foundational to Meta's ecosystem. Understanding it demonstrates expertise in bidirectional communication, consistency, and offline support.
  • Tests knowledge of WebSocket management, message ordering, delivery guarantees, and encryption.
  • Reveals how to handle edge cases like offline users, message sync across devices, and group chat scalability.

Thought Process

Define message delivery semantics

Messages should be delivered at-least-once with client-side deduplication. Order must be preserved within a conversation. Consider the states: sent, delivered (reached server), received (reached recipient device), read (opened by recipient).
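The at-least-once semantics above can be sketched on the client side. This is a minimal illustration (the class name `DedupingInbox` is hypothetical, not part of the original design): the server may redeliver a message after a timeout, so the client drops anything whose ID it has already seen.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// At-least-once delivery means the same message may arrive twice;
// the client deduplicates by tracking message IDs it has seen.
public class DedupingInbox {
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> displayed = new ArrayList<>();

    // Returns true if the message was new and displayed.
    public boolean receive(String messageId, String body) {
        if (!seenIds.add(messageId)) {
            return false; // duplicate redelivery: ignore
        }
        displayed.add(body);
        return true;
    }

    public List<String> getDisplayed() {
        return displayed;
    }
}
```

A redelivered message is simply a no-op, which is why at-least-once plus deduplication is so much simpler than true exactly-once delivery.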

Choose the communication protocol

Real-time delivery requires persistent connections. WebSocket is standard for web clients. For mobile, consider MQTT (lightweight, battery-efficient) or long-polling fallback. Each connected client maintains a session with a Gateway server.

Design the message flow

Sender → Gateway → Message Service → Store in DB → Lookup recipient's Gateway → Push to recipient. If recipient offline, queue the message for later delivery. Use message queues (Kafka) between services for reliability.

Handle presence and typing indicators

Presence (online/offline) can be tracked by Gateway heartbeats. Store last-seen timestamp. Typing indicators are ephemeral—broadcast via WebSocket without persistence. Use pub/sub for efficient fan-out in group chats.
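Heartbeat-based presence can be sketched as follows. This is a simplified in-memory version (the class name `PresenceTracker` and the timeout parameter are illustrative assumptions): a user is "online" if their last heartbeat falls within the timeout window, and the same timestamp serves as "last seen".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Gateways record a heartbeat timestamp per user; a user is "online"
// if a heartbeat arrived within the timeout window. The timestamp
// doubles as the last-seen value shown to other users.
public class PresenceTracker {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    public PresenceTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void heartbeat(String userId, long nowMillis) {
        lastHeartbeat.put(userId, nowMillis);
    }

    public boolean isOnline(String userId, long nowMillis) {
        Long last = lastHeartbeat.get(userId);
        return last != null && nowMillis - last <= timeoutMillis;
    }

    public Long lastSeen(String userId) {
        return lastHeartbeat.get(userId);
    }
}
```

Passing the clock in explicitly (`nowMillis`) keeps the logic testable; in production the Gateway would read the system clock and expire entries for disconnected users.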

Plan message storage and sync

Store messages in a distributed database partitioned by conversation ID. Support pagination for history. For multi-device sync, assign sequence numbers to messages and let clients fetch messages since their last known sequence.
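The sequence-number sync scheme can be sketched with a per-conversation log. This in-memory version (the class name `ConversationLog` is an illustrative stand-in for the partitioned message store) shows the core contract: a device that last saw sequence N fetches everything after N, in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Per-conversation log keyed by sequence number. A device that last
// saw sequence N fetches everything with sequence > N, in order.
public class ConversationLog {
    private final NavigableMap<Long, String> messages = new TreeMap<>();
    private long nextSequence = 1;

    // Assign the next sequence number and store the message.
    public long append(String body) {
        long seq = nextSequence++;
        messages.put(seq, body);
        return seq;
    }

    // Exclusive lower bound: return messages strictly after lastKnownSequence.
    public List<String> fetchSince(long lastKnownSequence) {
        return new ArrayList<>(messages.tailMap(lastKnownSequence, false).values());
    }
}
```

Because every device tracks only one number per conversation, multi-device sync reduces to "give me everything after my cursor", which also serves as the pagination primitive for history.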

Add end-to-end encryption

Use the Signal Protocol: each user has identity keys, and each conversation has session keys. Messages are encrypted on device before sending. Server stores ciphertext only. Key exchange happens during first message via prekeys stored on server.
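The key property, "the server stores ciphertext only", can be illustrated with a toy example. This is emphatically not the Signal Protocol (no identity keys, no ratcheting, a single static AES key shared out of band); it only shows the shape of encrypt-on-device before send, using the standard `javax.crypto` API.

```java
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

// Toy illustration of "encrypt on device, server sees ciphertext only"
// using AES-GCM. The real system would use the Signal Protocol's
// ratcheting session keys, not a single static key.
public class DeviceCrypto {
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;
    private final SecretKey key;
    private final SecureRandom random = new SecureRandom();

    public DeviceCrypto(SecretKey key) {
        this.key = key;
    }

    public static SecretKey newKey() {
        try {
            KeyGenerator gen = KeyGenerator.getInstance("AES");
            gen.init(256);
            return gen.generateKey();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns IV || ciphertext; only this opaque blob reaches the server.
    public byte[] encrypt(String plaintext) {
        try {
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ct = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] out = new byte[iv.length + ct.length];
            System.arraycopy(iv, 0, out, 0, iv.length);
            System.arraycopy(ct, 0, out, iv.length, ct.length);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public String decrypt(byte[] blob) {
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, blob, 0, IV_BYTES));
            byte[] pt = cipher.doFinal(blob, IV_BYTES, blob.length - IV_BYTES);
            return new String(pt, StandardCharsets.UTF_8);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

What Signal adds on top of this sketch is the hard part: establishing the shared key without the server ever seeing it (via prekeys) and rotating it per message (the double ratchet) for forward secrecy.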

Step-by-Step Reasoning

  1. Connection: Client opens WebSocket to Gateway → Gateway authenticates and registers session → Session stored in distributed cache (user_id → gateway_id).
  2. Send Message: Client sends message → Gateway forwards to Message Service → Store in Messages DB → Lookup recipient session → Route to recipient's Gateway → Push via WebSocket.
  3. Offline Handling: If recipient has no active session, message stored in "pending" queue → When recipient connects, Gateway fetches pending messages → Delivered in order.
  4. Group Chat: Message Service looks up all group members → Fan-out to each member's Gateway (or pending queue if offline) → Use Kafka topics per group for ordering.
  5. Read Receipts: Recipient sends "read" event → Message Service updates message status → Notifies sender via WebSocket.
  6. Sync Across Devices: Each message has conversation_id + sequence_num → Client requests messages where sequence > last_known → Server returns batch.
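The session registry from step 1 can be sketched as a simple map. This in-memory version stands in for the distributed cache (in production it would be Redis or similar, and the annotated solution below wraps the gateway ID in a richer `Session` object):

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the distributed session cache:
// user_id -> gateway_id, written on connect, cleared on disconnect.
public class SessionRegistry {
    private final Map<String, String> sessions = new ConcurrentHashMap<>();

    public void register(String userId, String gatewayId) {
        sessions.put(userId, gatewayId);
    }

    public void unregister(String userId) {
        sessions.remove(userId);
    }

    // Empty result means the user is offline: route to the pending queue.
    public Optional<String> getGateway(String userId) {
        return Optional.ofNullable(sessions.get(userId));
    }
}
```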

Dry Run

Alice sends "Hello" to Bob (both online)

Message encrypted on Alice's device → Sent to Gateway A → Message Service stores in DB → Looks up Bob's session (Gateway B) → Pushes to Bob → Bob's client decrypts and displays.

Alice sends "Are you there?" to Bob (Bob offline)

Message stored in DB with status "sent" → No active session for Bob → Added to Bob's pending queue → When Bob connects, pending messages delivered in order.

Group chat with 50 members, Alice sends message

Message Service fans out to 50 recipients in parallel → Online members receive immediately → Offline members get pending delivery → Sequence number ensures all see same order.
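The parallel fan-out in this dry run can be sketched with a bounded thread pool. The class name `GroupFanout` and the pool size are illustrative assumptions; the delivery callback stands in for the per-recipient Gateway push or pending-queue enqueue.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;

// Fans one message out to all members in parallel; a fixed pool bounds
// concurrency so a 50-member send does not spawn 50 threads.
public class GroupFanout {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    public void fanOut(List<String> memberIds, String message,
                       BiConsumer<String, String> deliver) {
        for (String member : memberIds) {
            pool.submit(() -> deliver.accept(member, message));
        }
    }

    public void shutdown() {
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Note that parallel delivery is safe precisely because ordering comes from the sequence number assigned at store time, not from delivery timing.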

Complexity Analysis

Time

Send message: O(1) for 1:1, O(group_size) for groups. Fetch history: O(log n) with indexed queries.

Space

O(messages) for storage, O(active_users) for session registry, O(pending_messages) for offline queues.

Why

Partitioning by conversation_id distributes load. Pending queues are bounded by offline duration.

Annotated Solution

// Message delivery pseudocode (Java-flavored; helper methods like
// getNextSequence and getRecipients are assumed to exist)
import java.util.List;
import java.util.Optional;

public class MessageService {
    private MessageStore messageStore;
    private SessionRegistry sessionRegistry;
    private PendingQueue pendingQueue;
    private GatewayClient gatewayClient;

    public void sendMessage(Message message) {
        // 1. Persist message
        message.setSequence(getNextSequence(message.getConversationId()));
        message.setStatus(MessageStatus.SENT);
        messageStore.save(message);

        // 2. Find recipient sessions
        List<String> recipientIds = getRecipients(message);

        for (String recipientId : recipientIds) {
            Optional<Session> session = sessionRegistry.getSession(recipientId);

            if (session.isPresent()) {
                // Online: push immediately
                gatewayClient.push(session.get().getGatewayId(), message);
            } else {
                // Offline: queue for later
                pendingQueue.enqueue(recipientId, message);
            }
        }
    }

    public void onUserConnected(String userId, String gatewayId) {
        // Register session
        sessionRegistry.register(userId, gatewayId);

        // Deliver pending messages
        List<Message> pending = pendingQueue.drain(userId);
        for (Message message : pending) {
            gatewayClient.push(gatewayId, message);
        }
    }
}

This shows the core send path: persist first, then route. Online users get immediate delivery via their Gateway, while offline users have messages queued.

Common Pitfalls

  • Message ordering: Network delays can cause out-of-order delivery. Use vector clocks or sequence numbers per conversation.
  • Thundering herd: When a user with many pending messages comes online, fetching all at once can overwhelm the system. Use pagination and rate limiting.
  • Gateway hotspots: Popular users' sessions might concentrate on one Gateway. Use consistent hashing to distribute.
  • E2E encryption key management: Lost device keys mean lost messages. Support key backup (encrypted with user password) or accept data loss.
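The consistent-hashing fix for Gateway hotspots can be sketched as a hash ring with virtual nodes. This is a minimal illustration (the class name `GatewayRing` and the use of MD5 for ring positions are assumptions for the sketch, not prescribed by the original):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hash ring with virtual nodes: each gateway appears many times on the
// ring, so load spreads evenly and removing one gateway only remaps
// the sessions that hashed to it.
public class GatewayRing {
    private final NavigableMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public GatewayRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addGateway(String gatewayId) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(gatewayId + "#" + i), gatewayId);
        }
    }

    public void removeGateway(String gatewayId) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(gatewayId + "#" + i));
        }
    }

    // Walk clockwise from the user's hash to the next virtual node.
    public String gatewayFor(String userId) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(userId));
        if (entry == null) {
            entry = ring.firstEntry(); // wrap around the ring
        }
        return entry.getValue();
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF);
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Adding a gateway moves only the keys that land between its new virtual nodes and their predecessors; every other user keeps their existing Gateway assignment.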

Follow-Up Questions

  • How would you implement message search? (Store encrypted index on device, or use server-side search with user consent)
  • How do you handle 10,000-member group chats? (Hierarchical fan-out, lazy delivery, sampling for read receipts)
  • How would you implement disappearing messages? (TTL on storage, client-side deletion confirmation)

Key Takeaways

Persistent connections (WebSocket/MQTT) enable real-time delivery but require careful session management.
At-least-once delivery with client deduplication is simpler than exactly-once and sufficient for messaging.
Partition by conversation_id to keep related messages together and maintain ordering.
E2E encryption shifts trust from server to client—the server only sees ciphertext.