Enforcing LLM Usage Caps with Redis: Atomic Operations at Scale

9 min read

KiwiClaw Standard includes managed LLM access at $39/month. We pay the upstream API providers for every token our customers consume. Without usage caps, a single customer running a tight loop of Opus 4.6 requests could burn through hundreds of dollars of API costs in an hour. Caps are non-negotiable. The question is how to enforce them without ruining the user experience.

We use Redis with atomic INCRBYFLOAT operations and a soft-limit model. This post covers the key patterns, the trade-offs, and the specific bugs we hit.

The Problem: You Cannot Pre-Check Exact Costs

LLM usage billing has a fundamental timing problem. You do not know how many tokens a request will consume until after the response is complete. A seemingly simple prompt might trigger a 4,000-token response. A complex one might get a 200-token answer. You cannot pre-authorize the exact cost.

This rules out hard limits where you check-and-decrement atomically before the request. If you reserved pessimistic amounts ($0.50 per request when the average is $0.008), customers would hit their cap 60x earlier than expected. If you blocked mid-stream, the agent would produce truncated, useless responses.

Our approach: check before, record after. Before proxying the request to the upstream LLM, we estimate the cost and verify the customer has headroom. After the response completes, we record the actual cost. This creates a window where concurrent requests can push usage slightly over the cap.
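The pattern can be sketched in a few lines. This is a minimal illustration with injected dependencies, not the production API; all names here are hypothetical:

```typescript
// Minimal sketch of the check-then-record flow. Dependencies are injected
// so the shape of the pattern is the focus; names are illustrative.
type CapCheck = { allowed: boolean; reason?: string };

async function proxyRequest(
  checkCaps: () => Promise<CapCheck>,
  callUpstream: () => Promise<{ costUsd: number }>,
  recordUsage: (costUsd: number) => Promise<void>,
): Promise<{ costUsd: number }> {
  const check = await checkCaps(); // 1. estimate + headroom check
  if (!check.allowed) {
    throw new Error(check.reason ?? "usage cap reached");
  }
  const response = await callUpstream(); // 2. actual cost known only now
  await recordUsage(response.costUsd);   // 3. record after the fact
  return response;
}
```

The window between step 1 and step 3 is exactly where concurrent requests can overshoot the cap.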

Redis Key Schema

We track usage at the account level (not the tenant level) because a single customer may have multiple agents. The key pattern is simple:

account:{accountId}:usage_week    — weekly spend in USD (float)
account:{accountId}:usage_month   — monthly spend in USD (float)
account:{accountId}:credits       — purchased credit balance in USD (float)

Each key stores a floating-point dollar amount. Weekly usage resets every Monday at 00:00 UTC. Monthly usage resets on the billing cycle. Credits are purchased separately and deducted when usage exceeds the cap.
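The `weekKey`, `monthKey`, and `creditsKey` helpers used throughout the code below are thin formatters over this schema. A sketch (the exact production helpers may differ):

```typescript
// Key builders for the schema above (a sketch; exact helpers may differ).
const weekKey = (accountId: string) => `account:${accountId}:usage_week`;
const monthKey = (accountId: string) => `account:${accountId}:usage_month`;
const creditsKey = (accountId: string) => `account:${accountId}:credits`;
```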

The Check: Before the Request

When an LLM request hits the proxy, we estimate the cost and check it against both weekly and monthly caps:

async function checkCaps(accountId: string, modelClass: ModelClass): Promise<CapCheckResult> {
  const estimate = estimateCost(modelClass);
  // Auto model: ~$0.007 per request, MAX model: ~$0.09 per request

  const [weeklyUsedStr, monthlyUsedStr, creditsStr] = await Promise.all([
    redis.get(weekKey(accountId)),
    redis.get(monthKey(accountId)),
    redis.get(creditsKey(accountId)),
  ]);

  const weeklyUsed = parseFloat(weeklyUsedStr || "0");
  const monthlyUsed = parseFloat(monthlyUsedStr || "0");
  const credits = parseFloat(creditsStr || "0");

  // Over weekly cap? Check for credits
  if (weeklyUsed + estimate > weeklyCap) {
    if (credits > estimate) {
      return { allowed: true, ... }; // credits cover it
    }
    return { allowed: false, reason: "Weekly limit reached" };
  }

  // Over monthly cap? Same check
  if (monthlyUsed + estimate > monthlyCap) {
    if (credits > estimate) {
      return { allowed: true, ... };
    }
    return { allowed: false, reason: "Monthly limit reached" };
  }

  return { allowed: true, weeklyUsed, weeklyCap, monthlyUsed, monthlyCap, credits };
}

The three Redis reads are parallelized with Promise.all. No pipeline needed here because reads are independent and non-mutating. Total latency: one Redis round trip (~1-2ms to Upstash).

Default caps for Standard tier: $15/week and $50/month. These are tuned for typical usage patterns during beta and will adjust as we learn more about actual consumption.

The Record: After the Response

After the LLM response completes and we know the actual token counts, we record the cost using an atomic pipeline:

async function recordUsage(accountId: string, event: UsageEvent): Promise<void> {
  const cost = event.actualCostUsd; // finalized from real token counts, not the pre-request estimate

  // Atomic pipeline: increment both counters and check TTLs
  const pipeline = redis.pipeline();
  pipeline.incrbyfloat(weekKey(accountId), cost);
  pipeline.incrbyfloat(monthKey(accountId), cost);
  pipeline.ttl(weekKey(accountId));
  pipeline.ttl(monthKey(accountId));
  const results = await pipeline.exec();

  // Set TTL on weekly key if it is new: TTL returns -1 when the key exists
  // but has no expiry yet (expires next Monday 00:00 UTC)
  const weeklyTtl = results[2] as number;
  if (weeklyTtl < 0) {
    await redis.expire(weekKey(accountId), secondsUntilMonday());
  }

  // Monthly key: 35-day TTL (billing reset handled separately)
  const monthlyTtl = results[3] as number;
  if (monthlyTtl < 0) {
    await redis.expire(monthKey(accountId), 35 * 24 * 60 * 60);
  }
}

The key detail here is INCRBYFLOAT. This is an atomic Redis operation that reads the current value, adds the increment, and writes the result in a single step. No read-modify-write race condition is possible, even if two LLM requests finish simultaneously.
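The difference is easy to see with an in-memory sketch, using a plain Map standing in for Redis purely for illustration. A non-atomic read-then-write loses updates when two requests interleave; a single-step increment does not:

```typescript
// Plain Map standing in for Redis, purely to illustrate the lost update.
const store = new Map<string, number>();

// One indivisible step, in the spirit of INCRBYFLOAT.
function incrByFloat(key: string, delta: number): number {
  const next = (store.get(key) ?? 0) + delta;
  store.set(key, next);
  return next;
}

// The race a non-atomic GET + SET would allow:
store.set("usage", 0);
const a = store.get("usage") ?? 0; // request A reads 0.00
const b = store.get("usage") ?? 0; // request B reads 0.00 before A writes
store.set("usage", a + 0.09);      // A writes 0.09
store.set("usage", b + 0.09);      // B writes 0.09 — A's update is lost
// store.get("usage") is 0.09, not 0.18

// With the atomic increment, both updates survive:
store.set("usage", 0);
incrByFloat("usage", 0.09);
incrByFloat("usage", 0.09);
// store.get("usage") is 0.18 (within float tolerance)
```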

The pipeline bundles four operations into one HTTP round trip: two increments and two TTL checks. Upstash bills each command individually whether or not it is pipelined, so the win here is latency: one round trip instead of four, on the hot path in front of every LLM request.

Weekly Reset via TTL

Rather than running a cron job to reset all usage counters every Monday, we let Redis handle it. When we first write to a weekly key, we set a TTL that expires at the next Monday 00:00 UTC:

function secondsUntilMonday(): number {
  const now = new Date();
  const day = now.getUTCDay(); // 0=Sun, 1=Mon
  const daysUntilMonday = day === 0 ? 1 : 8 - day;
  const nextMonday = new Date(now);
  nextMonday.setUTCDate(now.getUTCDate() + daysUntilMonday);
  nextMonday.setUTCHours(0, 0, 0, 0);
  return Math.max(Math.ceil((nextMonday.getTime() - now.getTime()) / 1000), 1);
}

When the key expires, it is deleted. The next redis.get() returns null, which parses to 0. The counter is effectively reset. No cron job, no batch operation, no distributed coordination.
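The boundary math is easy to verify by parameterizing `now`. This is a test-only variant I'm sketching here, not the production helper:

```typescript
// Test-only variant of secondsUntilMonday that takes `now` explicitly,
// so the Monday-boundary math can be checked against fixed dates.
function secondsUntilMondayFrom(now: Date): number {
  const day = now.getUTCDay(); // 0=Sun, 1=Mon
  const daysUntilMonday = day === 0 ? 1 : 8 - day;
  const nextMonday = new Date(now);
  nextMonday.setUTCDate(now.getUTCDate() + daysUntilMonday);
  nextMonday.setUTCHours(0, 0, 0, 0);
  return Math.max(Math.ceil((nextMonday.getTime() - now.getTime()) / 1000), 1);
}

// Saturday 12:00 UTC → 36 hours until Monday 00:00 UTC (36 * 3600 seconds)
// Monday 00:00 UTC exactly → a full 7 days (604800 seconds)
```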

Why Soft Limits Are Acceptable

Between the cap check and the usage recording, concurrent requests can push the total slightly over the cap. Here is the worst case:

  1. Customer is at $14.95 of their $15.00 weekly cap
  2. Two requests arrive simultaneously, both pass the cap check
  3. Both requests complete, each costing $0.09
  4. Final usage: $15.13 — overage of $0.13

The maximum overage from any single in-flight request is the cost of the most expensive request (Opus 4.6 can cost up to $0.50 for a long conversation), and the theoretical worst case scales with the number of requests in flight at once. In practice, typical overages are under $0.10. We accept this because:

  • The alternative (hard limits) requires either pessimistic reservations or mid-stream cancellation, both of which are worse for the user
  • AI agents are not high-frequency traders. Concurrent requests are uncommon since agents are typically conversational
  • The overage cost is absorbed into our margins, not billed to the customer
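The worked example above can be checked with a throwaway helper (not production code, just the arithmetic):

```typescript
// Overage when `used` is the recorded total at check time, all k in-flight
// requests pass the check, and each then records `costPerRequest`.
function overage(used: number, cap: number, k: number, costPerRequest: number): number {
  return used + k * costPerRequest - cap;
}

// The scenario above: $14.95 used, $15 cap, two $0.09 requests → ~$0.13
```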

The Credit System

When a customer hits their weekly or monthly cap, they are not stuck. They can purchase credit packs ($5, $10, or $20) through the dashboard. Credits are stored in Redis and checked during cap enforcement:

// In checkCaps: if over cap but has credits, allow the request
if (weeklyUsed + estimate > weeklyCap) {
  if (credits > estimate) {
    return { allowed: true, ... };
  }
  return { allowed: false, reason: "Weekly limit reached" };
}

After a request that exceeds the cap, credits are deducted:

async function deductCredits(accountId: string, amount: number): Promise<void> {
  const current = parseFloat((await redis.get(creditsKey(accountId))) || "0");
  if (current <= 0) return;
  const deduction = Math.min(amount, current); // never go below zero
  await redis.incrbyfloat(creditsKey(accountId), -deduction);
}

There is a subtle race condition here: between reading the balance and deducting, another request could also deduct. The Math.min prevents the balance from going negative, but two requests could both deduct from the same balance. We accept this since the worst case is a customer gets slightly more usage than they paid for, which is preferable to declining a valid request.
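If this race ever needed closing, one option would be to push the clamp server-side with a Lua script via Redis EVAL, so the read and the deduction happen in one atomic step. A sketch; the script and helper names are illustrative, not from our codebase:

```typescript
// Lua script: read, clamp, and deduct in one atomic server-side step.
// Would be invoked along the lines of:
//   redis.eval(CLAMPED_DEDUCT, [creditsKey(accountId)], [String(amount)])
const CLAMPED_DEDUCT = `
  local bal = tonumber(redis.call('GET', KEYS[1]) or '0')
  if bal <= 0 then return '0' end
  local d = math.min(tonumber(ARGV[1]), bal)
  redis.call('INCRBYFLOAT', KEYS[1], -d)
  return tostring(d)
`;

// The clamp the script computes, as a pure function:
function clampDeduction(balance: number, amount: number): number {
  return balance <= 0 ? 0 : Math.min(amount, balance);
}
```

We have not needed this: the non-atomic version's worst case (a small amount of free usage) is cheaper than the extra moving part.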

Multi-Agent Cap Scaling

A customer with three agents should have a higher cap than a customer with one. We scale caps using a simple formula:

// Scale: baseCap * (1 + 0.5 * (agentCount - 1))
// 1 agent: $15/week
// 2 agents: $22.50/week
// 3 agents: $30/week
const scaleFactor = 1 + CAP_SCALING_FACTOR * (agentCount - 1);
const weeklyCap = baseCaps.weekly * scaleFactor;

Each additional agent adds 50% of the base cap. This is deliberate since agents share context and often work on related tasks, so total usage does not scale linearly with agent count.
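As a function, using the constants from this post (the production code may differ):

```typescript
// Cap scaling: each agent beyond the first adds 50% of the base cap.
const CAP_SCALING_FACTOR = 0.5;

function scaledWeeklyCap(baseWeekly: number, agentCount: number): number {
  return baseWeekly * (1 + CAP_SCALING_FACTOR * (agentCount - 1));
}

// scaledWeeklyCap(15, 1) === 15
// scaledWeeklyCap(15, 2) === 22.5
// scaledWeeklyCap(15, 3) === 30
```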

What We Learned

Track at account level, not tenant level. We initially tracked usage per tenant (agent). When we added multi-agent support, customers hit their cap on one agent and could not use another. Moving to account-level tracking with scaled caps solved this.

Upstash REST API, not raw Redis. We use Upstash's REST-based Redis client, which works in serverless environments (Vercel, Fly.io) without maintaining persistent TCP connections. The trade-off is slightly higher latency (HTTP vs TCP), but the simplicity is worth it for our scale.

Pipeline everything you can. Four Redis commands in a pipeline cost the same as one round trip. This matters more for latency than for Upstash billing, since every millisecond of cap-checking delays the LLM request.

Frequently Asked Questions

Why use soft limits instead of hard limits for LLM usage caps?

Hard limits require checking and incrementing the counter atomically before the LLM request starts. But you do not know the exact token cost until the response completes. You would have to either reserve pessimistic amounts (bad UX) or block mid-stream (even worse UX). Soft limits check before the request and record after, accepting small overages as a reasonable trade-off.

How does INCRBYFLOAT prevent race conditions?

Redis INCRBYFLOAT is an atomic operation. It reads, adds, and writes in a single step that cannot be interleaved with other operations. Even if two LLM requests finish simultaneously, each increment is applied correctly with no read-modify-write race.

How do weekly usage counters reset automatically?

We set a TTL on the Redis key that expires at the next Monday 00:00 UTC. When the key expires, it is deleted. The next usage check reads a non-existent key as zero, effectively resetting the counter. No cron job needed.


Written by Amogh Reddy



Ready for managed OpenClaw hosting?

Managed LLM access, usage caps that just work. Your agent is live in 60 seconds.