
The moment you add AI to a product, users expect streaming responses. They expect to see tokens appear word by word. They expect the typing effect. They expect it because ChatGPT trained them to expect it.
And honestly? They are right to expect it. The difference between waiting 8 seconds for a complete response and seeing text stream in after 200 milliseconds is enormous. Not in total time. In perceived responsiveness. The streaming version feels fast even when it takes the same total time.
WebSockets make this possible. HTTP is request-response. WebSocket is a persistent, bidirectional channel. The server can push data to the client whenever it wants, as fast as it wants, without the client asking. Perfect for streaming AI outputs.
But streaming text is just the beginning. The real power of WebSockets in AI applications goes much deeper.
The basic pattern is simple. Client sends a message over WebSocket. Server forwards it to the AI provider with streaming enabled. As tokens arrive from the provider, the server immediately pushes them to the client. The client appends each token to the display.
Simple in concept. Tricky in practice.
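Here is a minimal server-side sketch of that relay, assuming a Node backend with the `ws` package and the `openai` SDK. Both are illustrative choices; the same shape works with any provider that supports streaming.

```typescript
import { WebSocketServer } from "ws";
import OpenAI from "openai";

const openai = new OpenAI();
const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  ws.on("message", async (raw) => {
    const { requestId, prompt } = JSON.parse(raw.toString());
    try {
      // Ask the provider for a streaming completion.
      const stream = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }],
        stream: true,
      });
      // Push each token to the client the moment it arrives.
      for await (const chunk of stream) {
        const token = chunk.choices[0]?.delta?.content;
        if (token) {
          ws.send(JSON.stringify({ type: "token", requestId, payload: token }));
        }
      }
      ws.send(JSON.stringify({ type: "done", requestId }));
    } catch (err) {
      ws.send(JSON.stringify({ type: "error", requestId, payload: String(err) }));
    }
  });
});
```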
First, you need message framing. Each WebSocket message should include a type, a request ID, and a payload. Types include "token" for individual tokens, "done" for end of response, and "error" for failures. The request ID lets you match tokens to the correct conversation when a user has multiple requests in flight.
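In TypeScript, that framing falls out naturally as a discriminated union. The exact field names below are one reasonable layout, not a standard:

```typescript
// One frame per WebSocket message. The requestId ties every token
// back to the conversation that asked for it.
type ServerMessage =
  | { type: "token"; requestId: string; payload: string }
  | { type: "done"; requestId: string }
  | { type: "error"; requestId: string; payload: { message: string } };
```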
Second, handle partial words. AI models sometimes split tokens in the middle of words. "Hel" followed by "lo" followed by " world." Your frontend needs to buffer and render this smoothly without visual glitches. A simple approach: buffer tokens and only render complete words. A better approach: render everything immediately but use a monospace or consistent-width font to prevent layout shift.
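A sketch of the word-boundary buffering approach, using a hypothetical `createWordBuffer` helper:

```typescript
// Buffer incoming tokens and emit only completed words, holding back the
// trailing fragment until the next token or the "done" signal arrives.
function createWordBuffer(render: (text: string) => void) {
  let pending = "";
  return {
    push(token: string) {
      pending += token;
      const lastSpace = pending.lastIndexOf(" ");
      if (lastSpace >= 0) {
        render(pending.slice(0, lastSpace + 1)); // emit finished words
        pending = pending.slice(lastSpace + 1);  // keep the fragment
      }
    },
    flush() {
      // Called on "done" so the final word is not left in the buffer.
      render(pending);
      pending = "";
    },
  };
}
```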
Third, handle the done signal correctly. When the model finishes generating, send a "done" message so the client knows the response is complete. This triggers actions like re-enabling the input field, saving the message to history, and updating the conversation state.
Backpressure is the issue nobody thinks about until production. If your AI provider sends tokens faster than your WebSocket can transmit them, you need a strategy. Buffer on the server side with a maximum buffer size. If the buffer fills, you are sending data faster than the client can consume it, which usually means the client is disconnected or on an extremely slow connection.
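One way to enforce that limit with the `ws` package is to check `bufferedAmount`, the number of bytes queued on a connection but not yet written to the socket. The threshold and close code below are arbitrary choices:

```typescript
import type { WebSocket } from "ws";

const MAX_BUFFERED_BYTES = 1_000_000; // ~1 MB of unsent frames; tune for your traffic

function sendWithBackpressure(ws: WebSocket, frame: string): boolean {
  if (ws.bufferedAmount > MAX_BUFFERED_BYTES) {
    // The client is not draining the socket: likely dead or extremely slow.
    ws.close(1011, "backpressure limit exceeded");
    return false;
  }
  ws.send(frame);
  return true;
}
```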
WebSocket connections die. Constantly. Mobile users switch between WiFi and cellular. Laptops go to sleep. Network hiccups interrupt connections for a few seconds. Your application needs to handle all of this gracefully.
Heartbeat mechanisms detect dead connections. Send a ping every 30 seconds. If you do not receive a pong within 10 seconds, consider the connection dead. Clean up server-side resources. The alternative is ghost connections that consume memory and file descriptors indefinitely.
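The `ws` package supports this directly with ping/pong frames. A sketch, folding the pong timeout into the ping interval for brevity:

```typescript
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });
const alive = new WeakMap<WebSocket, boolean>();

wss.on("connection", (ws) => {
  alive.set(ws, true);
  ws.on("pong", () => alive.set(ws, true)); // client answered the last ping
});

// Every 30 seconds: terminate connections that never answered the
// previous ping, then ping the survivors.
setInterval(() => {
  for (const ws of wss.clients) {
    if (!alive.get(ws)) {
      ws.terminate(); // ghost connection; free its descriptor and memory
      continue;
    }
    alive.set(ws, false);
    ws.ping();
  }
}, 30_000);
```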
Automatic reconnection is non-negotiable. When the connection drops, reconnect immediately. If the reconnection fails, retry with exponential backoff: 1 second, 2 seconds, 4 seconds, 8 seconds, capped at 30 seconds. Show the user a subtle "reconnecting" indicator. Do not show an error modal. Reconnection is normal, not exceptional.
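On the client, a browser-side sketch of that backoff schedule (the `connectWithBackoff` wrapper is hypothetical):

```typescript
function connectWithBackoff(url: string, onOpen: (ws: WebSocket) => void) {
  let attempt = 0;
  const open = () => {
    const ws = new WebSocket(url);
    ws.onopen = () => {
      attempt = 0; // a successful connection resets the backoff
      onOpen(ws);
    };
    ws.onclose = () => {
      // 1s, 2s, 4s, 8s ... capped at 30s, then try again.
      const delay = Math.min(1000 * 2 ** attempt, 30_000);
      attempt += 1;
      setTimeout(open, delay);
    };
  };
  open();
}
```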
State synchronization after reconnection is where most implementations fall apart. The user was in the middle of receiving a streaming response. The connection dropped. They reconnected. What happens to the partial response?
Two strategies. Optimistic: resume the stream from where it left off using sequence numbers. Each token gets a sequence number. On reconnect, the client sends the last sequence number it received. The server replays from that point. This is clean but requires server-side buffering of recent streams (sketched below).
Pragmatic: mark the partial response as incomplete and let the user retry. Simpler to implement. Slightly worse user experience. But it works reliably, and users understand "connection lost, please try again."
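A sketch of the optimistic strategy's server side, assuming each token is recorded with a sequence number as it is sent:

```typescript
// Server side: keep a short replay window of recent tokens per request.
// (A real implementation would also expire buffers after the stream ends.)
const replayBuffers = new Map<string, { seq: number; token: string }[]>();

function recordToken(requestId: string, seq: number, token: string) {
  const buffer = replayBuffers.get(requestId) ?? [];
  buffer.push({ seq, token });
  replayBuffers.set(requestId, buffer);
}

// On reconnect, the client reports the last sequence number it received;
// replay everything after that point, then resume live streaming.
function replayFrom(requestId: string, lastSeq: number) {
  return (replayBuffers.get(requestId) ?? []).filter((t) => t.seq > lastSeq);
}
```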
When the connection drops, the user might keep typing. Maybe they submit a new message. Maybe they switch conversations. These actions need to be queued and sent when the connection restores.
Implement a client-side message queue. When the WebSocket is disconnected, push messages to a local queue instead of the socket. When the connection restores, drain the queue in order. If any queued message fails, retry it before processing subsequent messages.
This sounds simple. The complexity is in edge cases. What if the user sends the same message twice because they thought the first one failed? Deduplicate by client-generated request IDs. What if the queued messages are no longer relevant because the user navigated away? Allow queue cancellation by conversation ID.
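A sketch of such a queue, with deduplication by request ID and cancellation by conversation ID (the class and field names are illustrative):

```typescript
type Outgoing = { requestId: string; conversationId: string; body: string };

class MessageQueue {
  private queue: Outgoing[] = [];
  private seen = new Set<string>();

  enqueue(msg: Outgoing) {
    if (this.seen.has(msg.requestId)) return; // drop client-side duplicates
    this.seen.add(msg.requestId);
    this.queue.push(msg);
  }

  // Drop queued messages the user no longer cares about.
  cancelConversation(conversationId: string) {
    this.queue = this.queue.filter((m) => m.conversationId !== conversationId);
  }

  // Drain in order once the socket is open again.
  drain(ws: WebSocket) {
    while (this.queue.length > 0 && ws.readyState === WebSocket.OPEN) {
      ws.send(JSON.stringify(this.queue.shift()));
    }
  }
}
```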
Here is the fundamental challenge. HTTP is stateless. WebSocket is stateful. That single difference makes scaling dramatically harder.
When a user connects via WebSocket, they connect to a specific server. Their connection lives on that server. If you have four servers behind a load balancer, the user's WebSocket is on server 2. But a message intended for that user might originate from server 4.
Sticky sessions are the simplest solution. Route each user to the same server consistently. Works until that server goes down and all its connections die simultaneously.
A pub/sub layer is the production solution. Redis Pub/Sub or similar. When server 4 needs to send a message to a user on server 2, it publishes to a channel. Server 2 subscribes to that channel and forwards the message over the WebSocket. Every server subscribes to channels for its connected users.
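A sketch of that fan-out with the node-redis v4 client, using one Redis channel per user. The channel naming scheme and the `localConnections` map are assumptions, not a prescribed design:

```typescript
import { createClient } from "redis";
import type { WebSocket } from "ws";

// Each server keeps a local map of the users connected to it.
const localConnections = new Map<string, WebSocket>();

const publisher = createClient();
const subscriber = publisher.duplicate();
await publisher.connect();
await subscriber.connect();

// When a user connects to this server, subscribe to their channel and
// forward anything published there over their WebSocket.
async function onUserConnected(userId: string, ws: WebSocket) {
  localConnections.set(userId, ws);
  await subscriber.subscribe(`user:${userId}`, (message) => {
    localConnections.get(userId)?.send(message);
  });
}

// Any server can address any user without knowing where they are connected.
async function sendToUser(userId: string, message: string) {
  await publisher.publish(`user:${userId}`, message);
}
```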
This adds latency. A few milliseconds. For streaming AI tokens, that is imperceptible. For real-time collaboration where multiple users see each other's cursors, it might matter. Test your specific use case.
Connection limits per server matter more than you think. Each WebSocket connection consumes a file descriptor and some memory. A typical server handles 10,000-50,000 concurrent WebSocket connections comfortably. Plan your infrastructure accordingly.
The investment in WebSocket infrastructure pays off. Real-time AI experiences feel fundamentally different from request-response ones. Users notice. Engagement metrics prove it.
