How We Stream SSE Through 3 Proxies Without Buffering
When a user sends a message to their OpenClaw agent on KiwiClaw, the LLM response streams through three separate proxy layers before it reaches the browser. Each chunk of the Server-Sent Events stream passes through all three in real time, without any layer buffering the full response. Getting this right was one of the harder infrastructure problems we solved, and the answer was surprisingly simple: stop using HTTP frameworks.
This post covers why most SSE proxy implementations break streaming, the architecture of our three-proxy chain, the specific code patterns that make chunk-by-chunk forwarding work, and the non-obvious gotchas we hit along the way.
The Architecture: Three Hops
Here is the full path an LLM response takes from the upstream provider to the user's browser:
Browser (EventSource / fetch)
|
v
Dashboard (Next.js on Vercel) ---- hop 1
|
v
Orchestrator (Node.js on Fly.io) ---- hop 2
| extracts tenant slug from Host header
| looks up Fly machine ID
| proxies to tenant's private IP
v
LLM Proxy (Node.js on Fly.io internal network) ---- hop 3
| verifies per-tenant JWT
| checks usage caps in Redis
| injects stream_options for usage tracking
v
Upstream LLM API (Moonshot / Anthropic)
Three proxies. Each one has legitimate business logic that cannot be removed: the dashboard handles authentication and session management, the orchestrator handles tenant routing and machine lifecycle, and the LLM proxy handles JWT verification, usage cap enforcement, and API key injection.
The constraint is simple: none of these layers can buffer the response. The user needs to see tokens appearing in real time, not wait 30 seconds for a buffered blob.
Why Most Frameworks Buffer SSE
If you build an SSE proxy with Express, you will likely get buffering by default. Here is why.
Express applies middleware in a pipeline, and several common pieces of that pipeline break streaming. The compression middleware holds response chunks in an internal zlib buffer until it decides to flush, so SSE events sit in memory instead of going out on the wire. The default response handling in many frameworks sets a Content-Length header, which requires knowing the full body size upfront. Some frameworks use Transfer-Encoding: chunked but still buffer internally for throughput.
Fastify has similar issues. Its reply serialization pipeline is designed for request-response patterns, not long-lived streams. You can work around it with reply.raw, but then you are bypassing the framework entirely and using raw Node.js anyway.
The pattern we see in many SSE proxy tutorials is fundamentally broken:
// BROKEN: This buffers the entire response
app.get('/proxy/stream', async (req, res) => {
  const upstream = await fetch(upstreamUrl, { ... });
  const body = await upstream.text(); // waits for full response
  res.send(body);
});
Even if you use upstream.body.pipe(res), the framework's middleware stack may still interfere. And if any middleware calls res.end() or sets Content-Length, the stream breaks.
Our Solution: Raw Node.js http
All three of our proxy services use the raw node:http (or node:https) module. No Express. No Fastify. No Koa. No Hono. The entire KiwiClaw backend is built on createServer from node:http.
Here is the core pattern for SSE passthrough at the LLM proxy layer:
import { request as httpsRequest } from "node:https";
import { createServer, type IncomingMessage, type ServerResponse } from "node:http";
// Inside the proxy handler:
const proxyReq = httpsRequest(upstreamUrl, { method: "POST", headers }, (proxyRes) => {
  // SSE streaming passthrough
  res.writeHead(proxyRes.statusCode || 200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
  });
  proxyRes.on("data", (chunk: Buffer) => {
    res.write(chunk); // forward immediately, no buffering
  });
  proxyRes.on("end", () => {
    res.end();
  });
});
proxyReq.end(body); // send the (possibly modified) request body upstream
That is the entire streaming proxy in its essential form. The proxyRes.on("data") callback fires for every chunk the upstream sends, and res.write(chunk) forwards it immediately to the downstream client. No buffering, no transformation, no middleware interference.
The Usage Tracking Problem
Here is the catch: we need to track how many tokens each request consumed for usage cap enforcement. For non-streaming responses, this is trivial since the usage object is in the JSON body. But for SSE streams, the usage data arrives in the final data chunk, just before the [DONE] marker.
We solved this without buffering the full stream by keeping a rolling tail buffer:
const SSE_TAIL_BUFFER_SIZE = 8192; // 8KB

let tailBuffer = "";

proxyRes.on("data", (chunk: Buffer) => {
  const chunkStr = chunk.toString();
  tailBuffer += chunkStr;
  // Trim to keep only the tail
  if (tailBuffer.length > SSE_TAIL_BUFFER_SIZE) {
    tailBuffer = tailBuffer.slice(-SSE_TAIL_BUFFER_SIZE);
  }
  res.write(chunk); // still forwarded immediately
});

proxyRes.on("end", () => {
  res.end();
  // Parse the tail for usage data
  const usage = parseOpenAIStreamUsage(tailBuffer);
  if (usage) {
    recordUsage(accountId, {
      inputTokens: usage.inputTokens,
      outputTokens: usage.outputTokens,
      estimatedCostUsd: calculateCost(usage),
    });
  }
});
We keep the last 8KB of the stream in memory. The usage chunk from OpenAI-compatible APIs is typically under 500 bytes. After the stream ends, we parse the tail buffer to extract token counts. The stream itself is never delayed: every chunk is handed to res.write() before any parsing happens.
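The post references parseOpenAIStreamUsage without showing it. Here is a minimal sketch of what such a tail parser might look like; the interface, field mapping, and backwards scan are our assumptions, not KiwiClaw's actual code:

```typescript
// Sketch of a tail parser for OpenAI-compatible SSE streams.
// Scans `data: {...}` lines from the end of the tail buffer and
// returns the first chunk that carries a `usage` object.
interface StreamUsage {
  inputTokens: number;
  outputTokens: number;
}

function parseOpenAIStreamUsage(tail: string): StreamUsage | null {
  const lines = tail.split("\n");
  // Walk backwards: the usage chunk sits near the end of the stream.
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i].trim();
    if (!line.startsWith("data: ") || line === "data: [DONE]") continue;
    try {
      const chunk = JSON.parse(line.slice("data: ".length));
      if (chunk.usage) {
        return {
          inputTokens: chunk.usage.prompt_tokens ?? 0,
          outputTokens: chunk.usage.completion_tokens ?? 0,
        };
      }
    } catch {
      // A truncated line is expected when the 8KB tail buffer starts
      // mid-chunk; skip anything that does not parse.
    }
  }
  return null;
}
```

Parsing only after `end` fires keeps the hot path untouched: the JSON work happens once per request, not once per chunk.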
One critical detail: we inject stream_options: { include_usage: true } into the request body before forwarding it upstream. Without this flag, most OpenAI-compatible APIs do not include token counts in SSE streams at all. We learned this the hard way when our usage dashboard showed zero for all streaming requests.
// Inject usage tracking into the request before proxying
let modifiedBody = body;
const parsed = JSON.parse(body);
if (parsed.stream === true && !parsed.stream_options?.include_usage) {
  parsed.stream_options = {
    ...(parsed.stream_options || {}),
    include_usage: true,
  };
  modifiedBody = JSON.stringify(parsed);
}
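Wrapped as a self-contained helper, the injection step might look like the sketch below. The function name injectUsageTracking is ours, and passing non-JSON bodies through untouched is an assumption about how the proxy should behave:

```typescript
// Sketch: ensure streaming requests ask the upstream for usage data.
// Non-JSON and non-streaming bodies are forwarded unchanged.
function injectUsageTracking(body: string): string {
  let parsed: any;
  try {
    parsed = JSON.parse(body);
  } catch {
    return body; // not JSON; let the upstream reject it instead
  }
  if (parsed?.stream === true && !parsed.stream_options?.include_usage) {
    parsed.stream_options = {
      ...(parsed.stream_options || {}),
      include_usage: true,
    };
    return JSON.stringify(parsed);
  }
  return body;
}
```

Keeping this pure (string in, string out) makes it trivial to unit test without spinning up the proxy.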
The Orchestrator Proxy Layer
The orchestrator does not just forward HTTP. It also handles WebSocket upgrades for the OpenClaw control UI. The same slug-based routing applies to both.
For HTTP requests (including SSE), the orchestrator extracts the tenant slug from the Host header, looks up the Fly machine ID in the database, and proxies to the machine's private IP on the Fly internal network:
function extractSlugFromHost(host: string): string | null {
  const m = host.match(/^([a-z0-9-]+)\.kiwiclaw\.app(?::\d+)?$/i);
  if (!m) return null;
  const slug = m[1].toLowerCase();
  if (slug === "app" || slug === "kiwiclaw-orchestrator") return null;
  return slug;
}
// Proxy uses the raw http module, same pattern as LLM proxy
const proxyReq = httpRequest({
  hostname: `${machineId}.vm.kiwiclaw-orchestrator.internal`,
  port: 18789,
  path: req.url || "/",
  method: req.method,
  headers: proxyHeaders,
});

proxyReq.on("response", (proxyRes) => {
  res.writeHead(proxyRes.statusCode!, responseHeaders);
  proxyRes.pipe(res, { end: true });
});

req.pipe(proxyReq, { end: true });
The .pipe() calls handle chunk-by-chunk forwarding in both directions. For SSE specifically, proxyRes.pipe(res) does the same thing as the manual on("data") / res.write() pattern but in a single line.
Gotchas We Hit
1. Transfer-Encoding conflicts
If you set both Content-Length and Transfer-Encoding: chunked on the response, the message framing is ambiguous, and some clients stall or drop the stream. We strip Content-Length from proxied responses and let Node.js handle chunked encoding automatically.
2. Connection header propagation
HTTP/1.1 proxies must not forward hop-by-hop headers like Connection, Keep-Alive, and Transfer-Encoding. When proxying SSE, we set our own Connection: keep-alive on the downstream response rather than copying the upstream header.
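The header handling both gotchas call for can be sketched as a single sanitizer, assuming a simple name/value header map. The helper name is ours; Content-Length is dropped per gotcha 1 even though it is not technically hop-by-hop:

```typescript
// Hop-by-hop headers (RFC 7230 section 6.1) must not be forwarded by an
// HTTP/1.1 proxy. Content-Length is also dropped so Node can apply
// chunked encoding to the proxied stream.
const HOP_BY_HOP = new Set([
  "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
  "te", "trailer", "transfer-encoding", "upgrade", "content-length",
]);

function sanitizeProxyHeaders(
  upstream: Record<string, string | string[] | undefined>
): Record<string, string | string[]> {
  const out: Record<string, string | string[]> = {};
  for (const [name, value] of Object.entries(upstream)) {
    if (value === undefined || HOP_BY_HOP.has(name.toLowerCase())) continue;
    out[name] = value;
  }
  // Set our own hop-by-hop policy for the downstream SSE connection.
  out["connection"] = "keep-alive";
  return out;
}
```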
3. Proxy timeouts killing long streams
LLM responses can take 30-60 seconds for complex tasks. Default proxy timeouts in nginx, Cloudflare, and even Node.js itself (two minutes for the legacy socket timeout in older versions) can kill the connection mid-stream. We set aggressive timeouts only on the initial connection, not on the stream duration. Fly.io's proxy has a generous streaming timeout; nginx needs its proxy_read_timeout raised, and Cloudflare enforces its own idle timeout if you use it as a CDN layer.
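On the Node side, the relevant knobs live directly on the node:http server. A sketch with illustrative values; the post does not state the exact values KiwiClaw uses, and the handler here is a placeholder:

```typescript
import { createServer } from "node:http";

const server = createServer((_req, res) => {
  res.writeHead(200, { "Content-Type": "text/event-stream" });
});

// Be strict while the request is still arriving...
server.headersTimeout = 10_000;
// ...but never kill a response mid-stream: 0 disables the limit.
server.requestTimeout = 0;
// Disable the legacy per-socket inactivity timeout as well.
server.setTimeout(0);
// Outlive typical upstream load-balancer idle timeouts (often 60s).
server.keepAliveTimeout = 65_000;
```

This matches the "aggressive on connection setup, unlimited on stream duration" policy described above.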
4. Client disconnection cleanup
When a user closes their browser tab mid-stream, the upstream LLM keeps generating tokens you are paying for but will never deliver. Both the orchestrator and LLM proxy attach error handlers that destroy the upstream connection when the client disconnects:
proxyReq.on("error", (err) => {
  if (!res.headersSent) {
    res.writeHead(502, { "Content-Type": "application/json" });
    res.end(JSON.stringify({ error: "Upstream provider error" }));
  }
});

// If client disconnects, stop the upstream request
req.on("close", () => {
  if (!proxyReq.destroyed) proxyReq.destroy();
});
5. Machine wake-up delays
Fly machines can be suspended to save costs. When a request arrives for a suspended machine, the orchestrator needs to wake it first. This adds 2-5 seconds before any SSE data flows. We handle this transparently since the ensureMachineRunning() call completes before the proxy connection is established, so the SSE stream starts cleanly after the machine is ready.
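A hypothetical sketch of that wake-before-proxy step, with the Fly Machines API calls abstracted behind injected functions. The name ensureMachineRunning appears in the post, but the polling shape, parameters, and defaults here are our assumptions:

```typescript
// Sketch: wake a suspended machine and poll until it is ready,
// before the SSE proxy connection is established. `startMachine`
// and `isMachineReady` stand in for Fly Machines API calls.
async function ensureMachineRunning(
  startMachine: () => Promise<void>,
  isMachineReady: () => Promise<boolean>,
  { intervalMs = 250, timeoutMs = 10_000 } = {}
): Promise<void> {
  if (await isMachineReady()) return; // fast path: already running
  await startMachine();
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await isMachineReady()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error("machine did not become ready before the SSE stream could start");
}
```

Because this resolves before any proxy bytes flow, the client sees a longer time-to-first-token but never a broken stream.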
Why Not WebSockets?
The OpenClaw control UI actually uses WebSockets for its primary communication channel. But the LLM API layer uses SSE because that is what upstream providers (Anthropic, Moonshot, OpenAI) expose. We proxy both:
- WebSocket for the OpenClaw gateway control channel (chat, tool approvals, session management)
- SSE for LLM API completions (token streaming from the model)
For WebSocket proxying, we use raw TCP socket piping at the orchestrator level, bypassing HTTP entirely after the upgrade handshake. The pattern is similar to SSE but operates at a lower level: targetSocket.pipe(socket) and socket.pipe(targetSocket).
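A sketch of what that upgrade handler can look like. The function names and the idea of replaying the upgrade head by hand are our illustration of the pattern, not KiwiClaw's actual module:

```typescript
import type { IncomingMessage } from "node:http";
import type { Duplex } from "node:stream";
import { connect } from "node:net";

// Rebuild the raw HTTP upgrade request so it can be replayed to the
// target machine over a plain TCP connection.
function buildUpgradeHead(req: IncomingMessage): string {
  const lines = [`${req.method} ${req.url} HTTP/1.1`];
  for (const [name, value] of Object.entries(req.headers)) {
    for (const v of Array.isArray(value) ? value : [value]) {
      if (v !== undefined) lines.push(`${name}: ${v}`);
    }
  }
  return lines.join("\r\n") + "\r\n\r\n";
}

// Handler for the server's "upgrade" event: after replaying the
// handshake, both sockets are piped raw and HTTP is out of the picture.
function proxyUpgrade(
  req: IncomingMessage,
  socket: Duplex,
  head: Buffer,
  targetHost: string,
  targetPort: number
) {
  const targetSocket = connect(targetPort, targetHost, () => {
    targetSocket.write(buildUpgradeHead(req));
    if (head.length) targetSocket.write(head); // bytes read past the handshake
    // The 101 response and all WebSocket frames pass through untouched.
    targetSocket.pipe(socket);
    socket.pipe(targetSocket);
  });
  const teardown = () => { socket.destroy(); targetSocket.destroy(); };
  targetSocket.on("error", teardown);
  socket.on("error", teardown);
}
```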
Performance in Production
With three proxy hops, you would expect significant latency overhead. In practice, the added latency is negligible because all three hops are within the same Fly.io region (iad), communicating over the Fly private network (WireGuard mesh). The intra-region latency is sub-millisecond.
The real latency comes from the upstream LLM API's time-to-first-token, which is typically 500ms to 2 seconds depending on the model and prompt length. Our proxy chain adds less than 5ms to that.
The Takeaway
If you are building an SSE proxy, especially one with multiple hops, do not reach for an HTTP framework. The raw node:http module gives you exactly what you need: a request object that is a readable stream and a response object that is a writable stream. Pipe them together and get out of the way.
The entire LLM proxy service is about 400 lines of TypeScript. The orchestrator's proxy module is about 350 lines. No dependencies beyond the Node.js standard library and our shared JWT package. The simplicity is the feature: with no framework layers in between, there is nothing that can buffer.
Frequently Asked Questions
Why does Express buffer SSE streams?
Commonly used Express middleware, such as compression, buffers response data before sending it. This breaks SSE because the browser needs to receive chunks as they arrive. The raw Node.js http module gives you direct control over the response stream, letting you pipe chunks through without buffering.
How do you track token usage on a streaming SSE response?
We inject stream_options: { include_usage: true } into the request body before proxying to the upstream LLM. This tells the provider to include a usage object in the final SSE chunk. We keep a rolling tail buffer of the last 8KB of the stream and parse it after the stream ends to extract token counts, without buffering the entire response.
What happens if one proxy in the chain goes down?
Each proxy has error handlers on both the client and upstream sockets. If the upstream connection fails, the proxy returns a 502 Bad Gateway error. If the client disconnects mid-stream, the proxy destroys the upstream connection to stop the LLM from generating tokens you will never deliver.
Why not use a single reverse proxy like nginx for SSE?
A single nginx reverse proxy works for simple SSE, but our architecture requires application-level logic at each hop: JWT verification at the LLM proxy, tenant routing at the orchestrator, and authentication at the dashboard. Each layer adds business logic that a generic reverse proxy cannot provide. The key is making sure none of them buffer.
Written by Amogh Reddy