Per-Tenant Subdomains with Wildcard DNS on Fly.io
Every KiwiClaw customer gets a subdomain: jarvis.kiwiclaw.app, research-bot.kiwiclaw.app, my-agent.kiwiclaw.app. These are not cosmetic. Each subdomain routes to a specific Firecracker microVM running that customer's OpenClaw agent. No shared agent runtime, no path-based routing, no relying on container-level isolation alone. Real subdomain-per-tenant architecture.
This post explains how we set it up with one wildcard DNS record, Fly.io certificate management, and a slug-based routing proxy.
The DNS Layer
The entire routing scheme starts with a single Cloudflare DNS record:
*.kiwiclaw.app CNAME kiwiclaw-orchestrator.fly.dev (DNS only, no proxy)
This wildcard CNAME routes every subdomain of kiwiclaw.app to our orchestrator service on Fly.io. Whether the request is for jarvis.kiwiclaw.app or nonexistent.kiwiclaw.app, it arrives at the same Fly application.
One critical detail: the Cloudflare proxy must be disabled for this record. It must be DNS-only (gray cloud, not orange cloud). Here is why:
- If Cloudflare proxies the traffic, it terminates TLS with its own certificate. Fly.io never sees the TLS handshake and cannot present the per-tenant certificate.
- Cloudflare's free Universal SSL certificate covers only the apex domain and one level of subdomain. Anything deeper requires their Advanced Certificate Manager (ACM).
- WebSocket connections from the OpenClaw control UI need to pass through without Cloudflare's intermediation, which can buffer or transform WS frames.
With DNS-only mode, the client connects directly to Fly.io's edge. Fly terminates TLS with the per-tenant certificate and forwards the request to our orchestrator.
The Full Request Path
1. User opens https://jarvis.kiwiclaw.app
|
2. DNS: *.kiwiclaw.app → CNAME → kiwiclaw-orchestrator.fly.dev
|
3. Fly Edge: TLS termination with cert for jarvis.kiwiclaw.app
|
4. Orchestrator receives request with Host: jarvis.kiwiclaw.app
|
5. extractSlugFromHost("jarvis.kiwiclaw.app") → "jarvis"
|
6. Database lookup: slug "jarvis" → machine ID "6830397f450568"
|
7. Proxy to 6830397f450568.vm.kiwiclaw-orchestrator.internal:18789
|
8. OpenClaw agent responds, proxied back to user
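Steps 4 through 7 can be condensed into one pure function. The following is a minimal sketch, not the orchestrator's actual code: the `tenants` map is a hypothetical in-memory stand-in for the database, and `resolveRoute` and the `Route` type are names invented here for illustration.

```typescript
// Hypothetical stand-in for the tenant table (slug -> Fly machine ID).
const tenants = new Map<string, string>([["jarvis", "6830397f450568"]]);

const GATEWAY_PORT = 18789; // OpenClaw gateway port from step 7

type Route =
  | { kind: "tenant"; upstream: string }
  | { kind: "not-found" }
  | { kind: "not-tenant" };

function resolveRoute(hostHeader: string, table: Map<string, string>): Route {
  // Step 5: the first DNS label is the slug (an optional port suffix is tolerated).
  const m = hostHeader.match(/^([a-z0-9-]+)\.kiwiclaw\.app(?::\d+)?$/i);
  if (!m) return { kind: "not-tenant" };
  const slug = m[1].toLowerCase();

  // Step 6: slug -> machine ID.
  const machineId = table.get(slug);
  if (!machineId) return { kind: "not-found" };

  // Step 7: address of that machine on the Fly private network.
  return {
    kind: "tenant",
    upstream: `${machineId}.vm.kiwiclaw-orchestrator.internal:${GATEWAY_PORT}`,
  };
}

console.log(resolveRoute("jarvis.kiwiclaw.app", tenants));
console.log(resolveRoute("nonexistent.kiwiclaw.app", tenants));
```

Keeping the resolution step free of I/O makes it easy to unit test separately from the proxy itself.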
TLS Certificate Management
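Each tenant subdomain gets its own TLS certificate. We issue per-hostname certificates via Fly.io's Let's Encrypt integration rather than a single wildcard certificate, because a wildcard certificate on Fly.io requires DNS-01 validation with extra DNS records, while per-hostname certificates validate automatically.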
During tenant provisioning, we call the Fly Certificates API to request a certificate for the new subdomain:
// Step 9 of provisioning: issue TLS certificate
async function addCertificate(hostname: string): Promise<void> {
  await flyApiRequest("POST", `/apps/${FLY_APP}/certificates`, {
    hostname,
  });
}

// Called during provisioning:
await addCertificate(`${slug}.kiwiclaw.app`);
Fly handles the ACME challenge automatically. Because our wildcard DNS already points to Fly, domain validation succeeds without any per-tenant DNS records. Certificate issuance typically completes in 1-2 seconds.
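Because the slug becomes a DNS label in the certificate hostname, it may be worth validating it before requesting a certificate. The sketch below is an assumption about what such a check could look like, not the provisioning code itself: `isValidSlug`, `certificateHostname`, and `RESERVED_SLUGS` are hypothetical names, the reserved list mirrors the infrastructure subdomains that the router excludes, and slugs are assumed to be lower-cased already.

```typescript
// Assumption: reserved names mirror the infrastructure subdomains excluded by the router.
const RESERVED_SLUGS = new Set(["app", "kiwiclaw-orchestrator", "kiwiclaw-caddy"]);

// A DNS label: 1-63 characters, lower-case letters, digits, and hyphens,
// with no leading or trailing hyphen.
const DNS_LABEL = /^[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?$/;

function isValidSlug(slug: string): boolean {
  return DNS_LABEL.test(slug) && !RESERVED_SLUGS.has(slug);
}

// Builds the hostname passed to the certificate request, rejecting bad slugs early.
function certificateHostname(slug: string): string {
  if (!isValidSlug(slug)) {
    throw new Error(`invalid tenant slug: ${JSON.stringify(slug)}`);
  }
  return `${slug}.kiwiclaw.app`;
}

console.log(certificateHostname("research-bot"));
console.log(isValidSlug("-leading-hyphen"), isValidSlug("app"));
```

Rejecting a bad slug here avoids spending a certificate request on a hostname that could never route to a tenant.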
When a tenant is destroyed, we clean up the certificate:
// During tenant teardown (best-effort)
try {
  await deleteCertificate(`${tenant.slug}.kiwiclaw.app`);
} catch {
  // Certificate may not exist; best-effort cleanup
}
Slug Extraction and Routing
The routing logic lives in the orchestrator's request handler. For every incoming request, the handler checks the Host header to determine whether it is a tenant request or an internal API call:
function extractSlugFromHost(host: string): string | null {
  const m = host.match(/^([a-z0-9-]+)\.kiwiclaw\.app(?::\d+)?$/i);
  if (!m) return null;
  const slug = m[1].toLowerCase();
  // Exclude infrastructure subdomains
  if (slug === "app" || slug === "kiwiclaw-orchestrator" || slug === "kiwiclaw-caddy") {
    return null;
  }
  return slug;
}
Infrastructure subdomains (app.kiwiclaw.app for the dashboard, kiwiclaw-orchestrator for the API) are explicitly excluded. Everything else is treated as a tenant slug and routed to the corresponding Fly machine.
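To make the matching rules concrete, here is the same function run against a few sample Host values. The sample hosts are made up for illustration; the expected values follow from the regular expression and the exclusion list.

```typescript
function extractSlugFromHost(host: string): string | null {
  const m = host.match(/^([a-z0-9-]+)\.kiwiclaw\.app(?::\d+)?$/i);
  if (!m) return null;
  const slug = m[1].toLowerCase();
  // Exclude infrastructure subdomains
  if (slug === "app" || slug === "kiwiclaw-orchestrator" || slug === "kiwiclaw-caddy") {
    return null;
  }
  return slug;
}

// [Host header, expected slug]
const samples: Array<[string, string | null]> = [
  ["jarvis.kiwiclaw.app", "jarvis"],
  ["Research-Bot.kiwiclaw.app:443", "research-bot"], // case-insensitive, port tolerated
  ["app.kiwiclaw.app", null],                        // infrastructure subdomain
  ["kiwiclaw.app", null],                            // apex has no slug label
  ["a.b.kiwiclaw.app", null],                        // nested subdomains are rejected
  ["jarvis.kiwiclaw.app.evil.example", null],        // the suffix must end the host
];

for (const [host, expected] of samples) {
  console.log(`${host} -> ${extractSlugFromHost(host)} (expected ${expected})`);
}
```

The anchored pattern matters: without the trailing `$`, a host such as jarvis.kiwiclaw.app.evil.example would be treated as a tenant request.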
The routing priority is important. Slug-based routing is checked before internal API authentication. This means tenant requests bypass the internal API secret check entirely since they are user-facing and authenticated by the OpenClaw gateway token instead.
async function handler(req: IncomingMessage, res: ServerResponse) {
  // Host header and URL path used by the checks below (assumed derivation)
  const host = req.headers.host ?? "";
  const path = (req.url ?? "/").split("?")[0];

  // 1. Health check (unauthenticated)
  if (path === "/health") { ... }

  // 2. Slug-based proxy (checked BEFORE internal auth)
  const slug = extractSlugFromHost(host);
  if (slug) {
    await proxyHttpToTenant(req, res, slug);
    return;
  }

  // 3. Internal API (requires Bearer token)
  if (!validateInternalAuth(req)) { ... }
  // ... route to internal handlers
}
The Proxy Layer
Once the slug is extracted and the machine ID is resolved from the database, the orchestrator proxies the request over the Fly private network. Tenant machines are not publicly accessible. They have no Fly services configured and are only reachable via their private IP:
// Machine hostname on Fly private network
function machineHost(machineId: string): string {
  return `${machineId}.vm.kiwiclaw-orchestrator.internal`;
}

// Proxy the request
const proxyReq = httpRequest({
  hostname: machineHost(tenant.flyMachineId),
  port: 18789, // OpenClaw gateway port
  path: req.url || "/",
  method: req.method,
  headers: proxyHeaders,
});
The {machine-id}.vm.{app-name}.internal hostname format is Fly's machine-specific DNS. It resolves to the machine's Fly private IPv6 address (fdaa:...), which is routable only within the Fly WireGuard mesh. No internet exposure.
WebSocket Handling
The OpenClaw control UI communicates over WebSockets for real-time chat, tool approvals, and session management. WebSocket upgrades are handled separately from HTTP requests in Node.js:
server.on("upgrade", (req, socket, head) => {
  const host = req.headers.host ?? "";
  const slug = extractSlugFromHost(host);
  if (slug) {
    proxyWebSocketToTenant(req, socket, head, slug);
  } else {
    socket.write("HTTP/1.1 404 Not Found\r\n\r\n");
    socket.destroy();
  }
});
WebSocket proxying uses raw TCP socket piping rather than an HTTP client. Once the upgrade request arrives, the orchestrator opens a TCP connection to the tenant machine and pipes data in both directions: targetSocket.pipe(socket) and socket.pipe(targetSocket). Because the orchestrator never parses WebSocket frames, it adds very little latency to real-time agent communication.
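For raw piping to work, the tenant machine still has to receive the client's upgrade request so it can complete the WebSocket handshake itself. One way to do that, sketched here as an assumption rather than a description of proxyWebSocketToTenant, is to re-serialize the request line and headers and write them to the target socket before piping. `serializeUpgradeRequest` and `HeaderMap` are hypothetical names.

```typescript
type HeaderMap = Record<string, string | string[] | undefined>;

// Rebuilds the raw HTTP/1.1 request head from its parsed parts.
function serializeUpgradeRequest(method: string, url: string, headers: HeaderMap): string {
  const lines = [`${method} ${url} HTTP/1.1`];
  for (const [name, value] of Object.entries(headers)) {
    if (value === undefined) continue;
    // A header may carry several values; emit one line per value.
    for (const v of Array.isArray(value) ? value : [value]) {
      lines.push(`${name}: ${v}`);
    }
  }
  // A blank line terminates the header block.
  return lines.join("\r\n") + "\r\n\r\n";
}

const raw = serializeUpgradeRequest("GET", "/ws", {
  host: "127.0.0.1",
  upgrade: "websocket",
  connection: "Upgrade",
  "sec-websocket-key": "dGhlIHNhbXBsZSBub25jZQ==",
  "sec-websocket-version": "13",
});
console.log(JSON.stringify(raw));

// In a proxy, the next steps would be roughly:
//   targetSocket.write(raw);       // replay the handshake request
//   targetSocket.write(head);      // bytes the client sent after the headers
//   targetSocket.pipe(socket);     // tenant -> client
//   socket.pipe(targetSocket);     // client -> tenant
```

With this approach the tenant's 101 Switching Protocols response flows back through the pipe untouched, so the orchestrator never needs to compute the Sec-WebSocket-Accept value.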
Handling Edge Cases
Machine wake-up on first request
If a tenant's machine is suspended (stopped to save costs), the first request triggers a wake-up. The proxy calls ensureMachineRunning(), which starts the machine and waits for it to reach the "started" state. This adds 2-5 seconds to the first request. Subsequent requests are fast.
async function ensureMachineRunning(machineId: string): Promise<void> {
  const machine = await getMachine(machineId);
  if (machine.state === "started" || machine.state === "running") return;
  await startMachine(machineId);
  await waitForState(machineId, "started", 30_000);
  // Small delay for the process inside the machine to bind its port
  await new Promise((r) => setTimeout(r, 2_000));
}
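A page load usually fires several requests at once (HTML, assets, a WebSocket), so a suspended machine may receive multiple wake-up attempts in parallel. One possible refinement, shown here as a sketch and not as something the orchestrator is known to do, is to share a single in-flight wake-up per machine. `singleFlight` and `inFlight` are hypothetical names.

```typescript
// Wake-ups currently in progress, keyed by machine ID.
const inFlight = new Map<string, Promise<void>>();

// Runs `task` at most once per key at a time; concurrent callers share the same promise.
function singleFlight(key: string, task: () => Promise<void>): Promise<void> {
  const existing = inFlight.get(key);
  if (existing) return existing;

  // Forget the entry once the task settles so a later suspend can trigger a new wake-up.
  const p = task().finally(() => inFlight.delete(key));
  inFlight.set(key, p);
  return p;
}

// Hypothetical usage inside the proxy:
//   await singleFlight(machineId, () => ensureMachineRunning(machineId));

let starts = 0;
const fakeWake = () => {
  starts += 1;
  return new Promise<void>((resolve) => setTimeout(resolve, 10));
};
const first = singleFlight("6830397f450568", fakeWake);
const second = singleFlight("6830397f450568", fakeWake);
console.log(first === second, starts);
```

The map only deduplicates within one orchestrator process; with several orchestrator instances, each would still issue its own start call.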
Certificate delays on new tenants
If someone hits a tenant subdomain within seconds of provisioning (before the TLS certificate finishes issuing), they will see a certificate error. This is rare in practice because the onboarding flow takes longer than certificate issuance. If the dashboard iframe fails to load, we show a "Your agent is still being set up" page instead.
Nonexistent slugs
A request to nonexistent.kiwiclaw.app passes DNS resolution (wildcard) and TLS termination (Fly uses a default certificate for unknown hostnames) but fails at the database lookup. The orchestrator returns a 404:
const tenant = await getTenantBySlug(slug);
if (!tenant || !tenant.flyMachineId) {
  res.writeHead(404, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ error: "Agent not found or not provisioned" }));
  return;
}
Header security
When proxying to tenant machines, we strip Fly forwarding headers (X-Forwarded-For, X-Real-IP, Fly-Request-Id) and override the Host header to 127.0.0.1. With no forwarding headers and a loopback Host, OpenClaw treats the request as a local one, which is important for its internal auth model. The dashboard iframe's origin is authenticated via the allowedOrigins config and gateway token, not via IP-based trust.
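The header rewrite can be expressed as a small pure function. This is a minimal sketch under the rules stated above; `buildProxyHeaders`, `HeaderMap`, and `STRIPPED` are hypothetical names, and the result corresponds to the proxyHeaders value passed to the proxy request earlier.

```typescript
type HeaderMap = Record<string, string | string[] | undefined>;

// Forwarding headers named above, lower-cased for comparison.
const STRIPPED = new Set(["x-forwarded-for", "x-real-ip", "fly-request-id"]);

function buildProxyHeaders(incoming: HeaderMap): HeaderMap {
  const out: HeaderMap = {};
  for (const [name, value] of Object.entries(incoming)) {
    const key = name.toLowerCase();
    // Drop forwarding headers and the original Host; Host is replaced below.
    if (STRIPPED.has(key) || key === "host") continue;
    out[key] = value;
  }
  out["host"] = "127.0.0.1";
  return out;
}

const proxied = buildProxyHeaders({
  host: "jarvis.kiwiclaw.app",
  "x-forwarded-for": "203.0.113.7",
  "Fly-Request-Id": "abc123",
  authorization: "Bearer example-token",
});
console.log(proxied);
```

Everything not on the strip list, including the Authorization header that carries the gateway token, passes through unchanged.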
Cost
The DNS and routing infrastructure adds almost no incremental cost:
- Cloudflare DNS: Free tier (wildcard CNAME is supported on free plans)
- Fly TLS certificates: Free (included with Fly apps, uses Let's Encrypt)
- Orchestrator compute: Already running for API handling, proxy routing adds negligible CPU
The cost is entirely in the per-tenant VMs, not in the routing layer.
Frequently Asked Questions
How does wildcard DNS work with per-tenant subdomains?
A wildcard CNAME record (*.kiwiclaw.app) routes all subdomains to the same destination, our Fly.io orchestrator. The orchestrator extracts the tenant slug from the Host header, looks up the corresponding machine in the database, and proxies to that specific VM over the Fly private network. No per-tenant DNS records needed.
How are TLS certificates handled for each tenant?
Fly.io handles TLS certificate issuance via Let's Encrypt. During provisioning, we call the Fly Certificates API to request a cert for the tenant's subdomain. Fly handles the ACME challenge automatically. Certificates issue in 1-2 seconds. Cloudflare's proxy must be disabled (DNS-only) so Fly can terminate TLS directly.
What happens on the first request to a new tenant subdomain?
If the TLS certificate has not finished issuing (rare, since issuance takes 1-2 seconds), the request will fail with a certificate error. If the tenant's VM is suspended, the first request triggers a wake-up that adds 2-5 seconds of latency. Subsequent requests are fast.
Written by Amogh Reddy