Multi-Tenant Isolation on Fly.io: One VM Per Customer Without Kubernetes

Every KiwiClaw customer gets their own virtual machine. Not a container. Not a namespace. A real Firecracker microVM with its own kernel, its own filesystem, and its own network stack. We provision these in 30-45 seconds using the Fly Machines API, without Kubernetes, without a container orchestrator, and without a dedicated DevOps team. This post explains how.

Why Not Kubernetes

The honest answer: operator burden. Kubernetes is a powerful abstraction, but running it as a solo founder means becoming a Kubernetes operator first and a product builder second. EKS costs money. GKE costs money. Self-managed K8s on bare metal costs sanity. For a product where each tenant runs a single long-lived process (an OpenClaw agent), the scheduling, service mesh, and pod orchestration features of Kubernetes are overhead we do not need.

Fly.io gives us the isolation properties we want (Firecracker microVMs) with a REST API for machine lifecycle management. One API call creates a VM. One API call starts it. One API call destroys it. No YAML manifests, no Helm charts, no operators.

Why VM Isolation Matters for AI Agents

AI agents are not typical web applications. They execute arbitrary code, browse websites, run shell commands, read and write files, and install third-party skills that can do all of the above. Shared containers with Linux namespace isolation are not sufficient because a container escape gives access to other tenants' data and processes.

Firecracker microVMs run on top of KVM with hardware virtualization. Each VM gets its own Linux kernel instance. A full VM escape would require a hypervisor vulnerability, not just a container misconfiguration. This is the same isolation model AWS Lambda uses.

The Provisioning Flow

When a new customer signs up and completes payment on our dashboard, the orchestrator service kicks off provisioning. Here is the complete flow:

1. Generate gateway token (UUID)           ~0s
2. Generate per-tenant JWT for LLM proxy    ~0s
3. Create Fly volume (1GB persistent)      2-3s
4. Generate OpenClaw config JSON            ~0s
5. Create Fly machine                      5-10s
   - Stock OpenClaw Docker image
   - Config injected as base64 env var
   - Boot script handles setup
6. Wait for "started" state               10-15s
7. Install browser dependencies            10-15s
   (Chromium system libs via fly exec)
8. Issue TLS certificate for subdomain     1-2s
9. Update database with machine ID          ~0s
                                    TOTAL: 30-45s

The key insight is that we provision the machine with a stock Docker image. We do not build custom images per tenant. Every tenant runs the exact same ghcr.io/openclaw/openclaw:latest image. Tenant-specific configuration is injected entirely through environment variables and a boot script.
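
As a sanity check on the step timings listed above, the per-step ranges can be summed. This is a minimal sketch: the `Step` type and `totalRange` helper are illustrative names, not part of our orchestrator, and the "~0s" steps are assumed to contribute zero.

```typescript
// Illustrative only: the Step type and helper names are assumptions.
type Step = { name: string; minSec: number; maxSec: number };

// Ranges copied from the provisioning flow; "~0s" steps are treated as 0.
const steps: Step[] = [
  { name: "gateway token", minSec: 0, maxSec: 0 },
  { name: "proxy JWT", minSec: 0, maxSec: 0 },
  { name: "create volume", minSec: 2, maxSec: 3 },
  { name: "generate config", minSec: 0, maxSec: 0 },
  { name: "create machine", minSec: 5, maxSec: 10 },
  { name: "wait for started", minSec: 10, maxSec: 15 },
  { name: "browser dependencies", minSec: 10, maxSec: 15 },
  { name: "TLS certificate", minSec: 1, maxSec: 2 },
  { name: "database update", minSec: 0, maxSec: 0 },
];

// Adds up the lower and upper bounds of every step.
function totalRange(list: Step[]): { minSec: number; maxSec: number } {
  return list.reduce(
    (acc, s) => ({ minSec: acc.minSec + s.minSec, maxSec: acc.maxSec + s.maxSec }),
    { minSec: 0, maxSec: 0 },
  );
}

const total = totalRange(steps);
console.log(`${total.minSec}-${total.maxSec}s`);
```

The listed ranges add up to 28-45 seconds, which lines up with the 30-45 second total once the near-zero steps and API round trips are counted.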

The Machine Creation Request

const machineRequest = {
  name: `kc-${slug}`,
  region: "iad",
  config: {
    image: "ghcr.io/openclaw/openclaw:latest",
    env: {
      OPENCLAW_GATEWAY_TOKEN: gatewayToken,
      OPENCLAW_STATE_DIR: "/data",
      OPENCLAW_CONFIG_B64: Buffer.from(configJson).toString("base64"),
      PLAYWRIGHT_BROWSERS_PATH: "/data/.browsers",
      KIWICLAW_PROXY_TOKEN: proxyJwt,  // managed tenants only
    },
    init: {
      exec: ["/bin/sh", "-c", bootScript],
    },
    guest: {
      cpu_kind: "shared",
      cpus: 1,
      memory_mb: 2048,
    },
    mounts: [{ volume: volumeId, path: "/data" }],
    // No "services" section — machines are private-network only
  },
};
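
The object above is the JSON body of the create call. As a rough sketch of how it could be wrapped into an HTTP request for the Machines API (`POST /v1/apps/{app}/machines` on `api.machines.dev`, with a bearer token): the `HttpRequest` type, the `buildCreateMachineRequest` helper, and the placeholder token are hypothetical, and the app name is inferred from the internal hostname used later in this post.

```typescript
// Hypothetical helper: describes the HTTP request without sending it.
// Pass the result to fetch() or any other HTTP client.
type HttpRequest = {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
};

function buildCreateMachineRequest(
  appName: string,
  apiToken: string,
  machineRequest: object,
): HttpRequest {
  return {
    url: `https://api.machines.dev/v1/apps/${encodeURIComponent(appName)}/machines`,
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiToken}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(machineRequest),
  };
}

// Trimmed-down body for illustration; the real one carries env, init, guest, and mounts.
const req = buildCreateMachineRequest("kiwiclaw-orchestrator", "fly-token-placeholder", {
  name: "kc-acme",
  region: "iad",
  config: { image: "ghcr.io/openclaw/openclaw:latest" },
});
console.log(req.method, req.url);
```

Keeping request construction separate from sending makes the body easy to unit test without touching Fly.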

The Config Injection Bug

This was the most frustrating bug we hit during development, and it taught us something important about how Fly.io works internally.

Fly Machines have a files property that lets you inject files into the machine filesystem at creation time. This seemed like the natural way to deliver the OpenClaw config file. Write /data/openclaw.json via the files mechanism and the agent reads it on boot.

The problem: Fly writes files before it mounts volumes. Our tenant data volume mounts at /data. The file injection writes /data/openclaw.json to the root filesystem. Then the volume mounts over /data, masking the config file entirely. The agent boots with no config and uses defaults.

The fix: encode the config as a base64 environment variable (OPENCLAW_CONFIG_B64) and decode it in the boot script, which runs after the volume is mounted:

# Boot script step 1: decode config from env var to volume
printf '%s' "$OPENCLAW_CONFIG_B64" | base64 -d > /data/openclaw.json

We could not find this ordering documented anywhere in Fly's docs. We discovered it after three hours of debugging why machines booted with empty configs.
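
To illustrate why base64 is a convenient transport here, the following sketch round-trips a config through the same `Buffer` call used in the machine request and then decodes it the way `base64 -d` would. The config contents are invented for the example; only the encode call mirrors the real code.

```typescript
// Placeholder config: these field names are invented for illustration.
const configJson = JSON.stringify({
  agent: { name: "demo" },
  note: 'quotes " and $VARS and\nnewlines survive',
});

// Encoding side (orchestrator): same call as in the machine request.
const encoded = Buffer.from(configJson, "utf8").toString("base64");

// Decoding side: what `base64 -d` does in the boot script, expressed in Node.
const decoded = Buffer.from(encoded, "base64").toString("utf8");

// Standard base64 output contains only A-Z, a-z, 0-9, "+", "/" and "=",
// so the env var value has no quotes, whitespace, or "$" to escape.
const plainAscii = /^[A-Za-z0-9+/=]*$/.test(encoded);

console.log(decoded === configJson, plainAscii);
```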

The Boot Script

Each machine runs a multi-step boot script that handles everything between "machine started" and "OpenClaw is accepting connections":

# 1. Decode config from base64 env var (volume is mounted at this point)
printf '%s' "$OPENCLAW_CONFIG_B64" | base64 -d > /data/openclaw.json

# 2. Download Chromium if not cached on volume (first boot only, ~60s)
test -f /data/.browsers/chromium_headless_shell-*/chrome-headless-shell-linux64/... \
  || (cd /app && PLAYWRIGHT_BROWSERS_PATH=/data/.browsers \
     npx playwright install chromium) || true

# 3. Create stable symlink for Chromium (OpenClaw needs a fixed path)
#    ($CHROME_HS holds the resolved binary path; its assignment is omitted from this excerpt)
mkdir -p /data/bin && ln -sf "$CHROME_HS" /data/bin/chrome-headless-shell

# 4. Start IPv6 bridge (Fly private net → loopback for OpenClaw)
node -e "require('net').createServer(function(c){
  var t=require('net').connect(18789,'127.0.0.1');
  c.pipe(t);t.pipe(c);
  ...
}).listen({port:18789,host:process.env.FLY_PRIVATE_IP})" &

# 5. Launch OpenClaw gateway
exec node --max-old-space-size=1536 /app/openclaw.mjs gateway \
  --allow-unconfigured --bind loopback

The IPv6 bridge is necessary because OpenClaw binds to loopback (127.0.0.1) by default, but Fly's private networking uses IPv6 addresses (fdaa:...). The bridge listens on the Fly private IP and forwards TCP connections to localhost, making the agent reachable from other Fly machines in the same organization.

Private Network Architecture

Tenant machines have no public services. They are not enrolled in Fly's load balancer, and they do not have public-facing ports. The only way to reach a tenant machine is through the Fly private network (WireGuard mesh):

Internet
  |
  v
Fly Edge (TLS termination)
  |
  v
Orchestrator (kiwiclaw-orchestrator.fly.dev)
  |  extracts slug from Host header
  |  looks up machine ID in database
  v
Fly Private Network (WireGuard)
  |
  v
Tenant Machine ({machineId}.vm.kiwiclaw-orchestrator.internal:18789)

This is a deliberate security decision. If we added a services section to the machine config, Fly would enroll the machine in the app's HTTP proxy. Requests to kiwiclaw-orchestrator.fly.dev could then be load-balanced to a tenant machine instead of the orchestrator. By keeping tenant machines service-less, they are completely invisible from the public internet.
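
The routing step in the diagram (Host header to slug to internal address) can be sketched as a pair of pure functions. This is an assumption-laden illustration, not our orchestrator code: the in-memory `Map` stands in for the database lookup, the machine ID is a placeholder, and `slugFromHost` and `internalTarget` are hypothetical names.

```typescript
// Hypothetical lookup table standing in for the tenants database.
const machineIdsBySlug = new Map<string, string>([["acme", "0123456789abcd"]]);

const BASE_DOMAIN = "kiwiclaw.app";
const INTERNAL_APP = "kiwiclaw-orchestrator";
const GATEWAY_PORT = 18789;

// Returns the tenant slug for "<slug>.kiwiclaw.app", or null for anything else.
function slugFromHost(hostHeader: string): string | null {
  const host = (hostHeader.split(":")[0] ?? "").toLowerCase(); // drop optional port
  const suffix = `.${BASE_DOMAIN}`;
  if (!host.endsWith(suffix)) return null;
  const slug = host.slice(0, -suffix.length);
  // Accept a single DNS label only, so "a.b.kiwiclaw.app" is rejected.
  return /^[a-z0-9]([a-z0-9-]*[a-z0-9])?$/.test(slug) ? slug : null;
}

// Maps a public Host header to the tenant's private-network address.
function internalTarget(hostHeader: string): string | null {
  const slug = slugFromHost(hostHeader);
  if (slug === null) return null;
  const machineId = machineIdsBySlug.get(slug);
  if (machineId === undefined) return null;
  return `${machineId}.vm.${INTERNAL_APP}.internal:${GATEWAY_PORT}`;
}

console.log(internalTarget("acme.kiwiclaw.app"));
```

Returning `null` for unknown or malformed hosts lets the caller answer with a 404 instead of proxying to an address derived from untrusted input.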

Machine Lifecycle: Suspend, Resume, Destroy

Not every machine needs to run 24/7. We support three lifecycle operations beyond initial provisioning:

Suspend stops the machine and deallocates compute. The persistent volume remains intact. The tenant's data, conversation history, and downloaded browser binaries survive. Compute cost drops to zero. We suspend machines when customers churn or when idle detection fires.

Resume starts a suspended machine, waits for it to reach the "started" state, polls the health check endpoint, and reinstalls Chromium system dependencies. That last step is necessary because Fly machine root filesystems are ephemeral. System packages installed via apt-get are lost on restart. Only data on the persistent volume survives.

Destroy permanently deletes the machine, its volume, and the TLS certificate. This is irreversible. We destroy machines when customers explicitly delete their agent or when their account is permanently closed.

// Cleanup on agent deletion or account closure; best-effort, tolerates already-destroyed resources
async function destroyTenant(tenantId: string): Promise<Tenant> {
  const tenant = await getTenant(tenantId);
  if (tenant.flyMachineId) {
    await destroyMachine(tenant.flyMachineId);
  }
  if (tenant.flyVolumeId) {
    await deleteVolume(tenant.flyVolumeId);
  }
  try {
    await deleteCertificate(`${tenant.slug}.kiwiclaw.app`);
  } catch { /* best-effort */ }

  return updateTenant(tenantId, {
    status: "deleted",
    flyMachineId: null,
    flyVolumeId: null,
  });
}

Cost Optimization

Running one VM per customer sounds expensive. Here are the actual unit economics:

Resource	Spec	Monthly Cost
Compute	shared-cpu-1x, 2GB RAM	~$5-7
Volume	1GB persistent SSD	~$0.15
TLS cert	Fly-managed	Free
Bandwidth	Internal only	Free (private net)
Total per tenant		~$5-7/month

At our $15/month BYOK and $39/month Standard price points, the compute cost is sustainable. The LLM cost for Standard tenants is the bigger variable, which is why we have usage caps.

Suspended machines cost nothing for compute. Only the $0.15/month volume storage persists. This means churned customers who might return cost us almost nothing to retain.
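
The arithmetic behind these numbers is simple enough to spell out. The sketch below uses the figures from the table and the two price points; it covers infrastructure only, so LLM spend for Standard tenants and any payment fees are deliberately left out. The constant and function names are illustrative.

```typescript
// All amounts are in US cents to avoid floating-point rounding.
const COMPUTE_CENTS = { low: 500, high: 700 }; // ~$5-7 per month, from the table
const VOLUME_CENTS = 15; // 1GB persistent volume

// Monthly margin for a plan after infrastructure cost, before LLM costs.
function infraMargin(priceCents: number): { worst: number; best: number } {
  return {
    worst: priceCents - (COMPUTE_CENTS.high + VOLUME_CENTS),
    best: priceCents - (COMPUTE_CENTS.low + VOLUME_CENTS),
  };
}

const byok = infraMargin(1500); // $15/month BYOK
const standard = infraMargin(3900); // $39/month Standard
const dollars = (cents: number): string => (cents / 100).toFixed(2);

console.log(`BYOK: $${dollars(byok.worst)} to $${dollars(byok.best)}`);
console.log(`Standard: $${dollars(standard.worst)} to $${dollars(standard.best)}`);
```

On these figures a BYOK tenant leaves roughly $7.85 to $9.85 per month after infrastructure, and a suspended tenant costs only the 15 cents of volume storage.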

Failure Handling

Provisioning involves multiple external API calls, any of which can fail. Our strategy: provision forward, clean up on failure.

If machine creation fails after the volume was created, we keep the volume (cheap at $0.15/GB/month) so a retry can reuse it. If the health check fails, we log it but still mark the machine as provisioned since it may just be slow to boot. The dashboard shows a "Retry Provisioning" button that re-runs the full flow.

For destruction, every step tolerates 404 errors (resource already deleted). This makes the destroy operation idempotent and safe to retry.
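
One way to express "a 404 counts as success" is a small wrapper around each delete step. This is a sketch under an assumption: it presumes the HTTP client throws errors that carry a numeric `status` field, which may not match a given client. `isNotFound` and `deleteIfExists` are hypothetical helpers.

```typescript
// Assumption: failed API calls throw an object with a numeric `status` property.
function isNotFound(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: unknown }).status === 404
  );
}

// Runs one delete step. A 404 means the resource is already gone, which is
// the desired end state; any other failure is rethrown so a retry can happen.
async function deleteIfExists(
  step: () => Promise<void>,
): Promise<"deleted" | "already-gone"> {
  try {
    await step();
    return "deleted";
  } catch (err) {
    if (isNotFound(err)) return "already-gone";
    throw err;
  }
}
```

Inside `destroyTenant`, a call would then look like `await deleteIfExists(() => destroyMachine(tenant.flyMachineId))`, and running the whole function twice ends in the same state.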

What We Would Change

If we were starting over, we would pre-warm a pool of unassigned machines. The 30-45 second provisioning time is acceptable for onboarding but feels slow compared to instant-deploy products. Pre-warming 5-10 machines in each region would let us assign one to a new customer in under 5 seconds, then replenish the pool asynchronously.
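
A warm pool like this could be modeled roughly as follows. This is a hypothetical in-memory sketch, not something we run: a real version would persist pool state in the database, and it would still need a way to deliver tenant-specific config to a machine that already exists, which the sketch ignores.

```typescript
// Hypothetical model of a pool of pre-created, unassigned machines.
class WarmPool {
  private idle: string[] = [];
  private readonly targetSize: number;

  constructor(targetSize: number) {
    this.targetSize = targetSize;
  }

  // Number of machines a background job must create to reach the target size.
  deficit(): number {
    return Math.max(0, this.targetSize - this.idle.length);
  }

  // Called after the background job has created an unassigned machine.
  add(machineId: string): void {
    this.idle.push(machineId);
  }

  // Hands out the oldest idle machine, or null if the pool is empty
  // (the caller then falls back to the normal provisioning flow).
  claim(): string | null {
    return this.idle.shift() ?? null;
  }
}

const pool = new WarmPool(5);
for (let i = 0; i < 5; i++) pool.add(`warm-${i}`);
const assigned = pool.claim();
console.log(assigned, pool.deficit());
```

Claiming is a constant-time queue operation, so the signup path stays fast while replenishment runs asynchronously whenever `deficit()` is above zero.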

We would also explore Fly's machine standby feature for cost optimization, since standby machines consume minimal resources while remaining quick to start.

Frequently Asked Questions

Why use one VM per tenant instead of shared containers?

AI agents execute arbitrary code, browse websites, and run shell commands. Shared containers with namespace isolation are not sufficient. A container escape gives access to other tenants' data. Firecracker microVMs provide hardware-level isolation with separate kernels, filesystems, and network stacks.

How long does it take to provision a new tenant VM?

The full provisioning flow takes 30-45 seconds. This includes creating a persistent volume, creating the Fly machine, waiting for it to start, installing browser dependencies via fly exec, and issuing a TLS certificate for the tenant's subdomain.

Why inject config via environment variables instead of config files?

Fly's files mechanism writes files before volume mounts happen. Since our tenant data volume mounts at /data, any config file written via file injection gets masked by the volume mount. We base64-encode the config as an environment variable and decode it in the boot script, which runs after the volume is mounted.

How much does per-tenant VM isolation cost?

Each tenant machine runs on Fly's shared-cpu-1x with 2GB RAM, costing roughly $5-7 per month per tenant. Persistent volumes add $0.15/GB/month. Suspended machines cost nothing for compute, making it cheap to retain churned customers' data. The unit economics work at our $15-39/month price points.

Written by Amogh Reddy