I run 208 active agent sessions on OpenClaw — a self-hosted AI gateway that handles everything from WhatsApp-based property management to automated research workflows. These agents run 24/7, constantly processing messages, performing heartbeat checks, and executing tasks.
Originally, everything was powered by Anthropic Claude Opus 4.6. It's an incredible model, but for high-volume background tasks, the costs are brutal:
| Metric | Claude Opus 4.6 |
|---|---|
| Input cost | $15.00 / M tokens |
| Output cost | $75.00 / M tokens |
| Daily spend | ~$250 |
| Monthly spend | ~$7,500 |
Something had to change.
## The Solution: Tiered Model Routing on Vertex AI MaaS
Google Cloud's Vertex AI Model-as-a-Service (MaaS) offers access to top-tier open-source models at a fraction of the cost. The key insight: not every task needs the most expensive model. Instead of one-size-fits-all, I set up tiered routing with automatic failover:
| Tier | Model | Provider | Purpose | Cost (Input/Output) |
|---|---|---|---|---|
| Primary | GPT-OSS 120B | Vertex (us-central1) | The Workhorse — high-level reasoning | $0.09 / $0.19 per M |
| Fallback | Qwen3 80B Thinking | Vertex (global) | The Specialist — deep logical reasoning | ~$0.12 / $0.24 per M |
| Heartbeat | GPT-OSS 20B | Vertex (us-central1) | The Sentry — keeps 208 sessions alive | $0.04 / $0.19 per M |
The result: Daily costs dropped from $250 to roughly $5 — a 98% reduction.
## Prerequisites
- A Mac (this guide uses macOS with Homebrew; adapt paths for Linux)
- OpenClaw installed (`npm install -g openclaw`)
- A Google Cloud Platform account with a project
- The `gcloud` CLI installed (`brew install google-cloud-sdk`)
- An active OpenClaw gateway with at least one agent
## Step 1: Set Up Google Cloud Authentication
First, authenticate with Google Cloud and set your project:

```shell
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
```
### Gotcha #1: The GCP Organization Wall
When I first tried to enable API access, Google blocked me. My Google Cloud Organization had a strict policy disabling API keys by default.
The fix: You may need to create an environment tag and bind it to your project to override the organization's security constraint. Check with your org admin, or if you control the org:

```shell
gcloud resource-manager org-policies describe constraints/iam.disableServiceAccountKeyCreation \
  --project=YOUR_PROJECT_ID
```
### Generate an Access Token

Unlike Anthropic's permanent API keys, Vertex AI MaaS uses OAuth 2.0 access tokens that expire every 60 minutes:

```shell
gcloud auth print-access-token
```
## Step 2: Test Your Models Directly
Before touching OpenClaw, verify that your models actually work via curl. This saves hours of debugging config issues.
### Find the Right Endpoints
This is where I burned the most time. Not all models are available in all regions. Here's what I discovered:
- `openai/gpt-oss-120b-maas` → works on `us-central1`
- `openai/gpt-oss-20b-maas` → works on `us-central1`
- `qwen/qwen3-next-80b-a3b-thinking-maas` → NOT available in `us-central1`; only works on the global endpoint
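A quick smoke test can be sketched like this. `YOUR_PROJECT_ID` is a placeholder, and the exact endpoint path is an assumption based on Vertex AI's OpenAI-compatible `chat/completions` route, so double-check it against the current Vertex AI docs:

```shell
# Build the OpenAI-compatible chat/completions URL for a regional MaaS model.
# PROJECT_ID and the path shape are assumptions; verify against Vertex AI docs.
PROJECT_ID="YOUR_PROJECT_ID"
REGION="us-central1"
MODEL="openai/gpt-oss-120b-maas"
ENDPOINT="https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/openapi/chat/completions"
echo "Testing ${MODEL} at ${ENDPOINT}"

# Uncomment to send a real request (needs a fresh gcloud token):
# curl -s "$ENDPOINT" \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d '{"model": "'"$MODEL"'", "messages": [{"role": "user", "content": "Say hi"}]}'
```

If the response is JSON with a `choices` array, the model works in that region; a 404 or region error means you need a different endpoint.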
## Step 3: Configure OpenClaw Providers
Now that you know which endpoints work for which models, configure OpenClaw. You'll need to edit two files:
- `~/.openclaw/openclaw.json` — main gateway config
- `~/.openclaw/agents/main/agent/models.json` — agent-specific model config
This is critical: OpenClaw reads models from both files. If a model is in the main config but not the agent config, you'll get "Unknown model" errors at runtime.
### Gotcha #3: The `reasoning` Parameter
Qwen is a "thinking" model, so it's natural to set `"reasoning": true`. Don't. Vertex AI's MaaS endpoints don't support the reasoning/thinking parameter in the OpenAI-compatible API. Setting it to `true` causes a 400 error. Always set `"reasoning": false`.
### Gotcha #4: OpenClaw's Strict Schema Validation
OpenClaw 2026.2.x is very particular about its JSON schema. Fields that seem reasonable will cause "Config Invalid" errors:
- `"alias"` inside model objects → not allowed. Use the top-level `agents.defaults.models` section for aliases instead.
- Missing `baseUrl` → every provider must have a `baseUrl`.
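Putting those rules together, a provider entry might look like this. This is a sketch: field names beyond `baseUrl` and `reasoning` are assumptions about OpenClaw's schema, and `YOUR_PROJECT_ID` is a placeholder:

```json
{
  "providers": {
    "vertex-maas": {
      "baseUrl": "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/YOUR_PROJECT_ID/locations/us-central1/endpoints/openapi",
      "apiKey": "PASTE_FRESH_ACCESS_TOKEN",
      "models": [
        { "id": "openai/gpt-oss-120b-maas", "reasoning": false },
        { "id": "openai/gpt-oss-20b-maas", "reasoning": false }
      ]
    },
    "vertex-maas-global": {
      "baseUrl": "https://aiplatform.googleapis.com/v1beta1/projects/YOUR_PROJECT_ID/locations/global/endpoints/openapi",
      "apiKey": "PASTE_FRESH_ACCESS_TOKEN",
      "models": [
        { "id": "qwen/qwen3-next-80b-a3b-thinking-maas", "reasoning": false }
      ]
    }
  }
}
```

Note what is absent: no `alias` inside the model objects, and every provider carries a `baseUrl`.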
## Step 4: Set Up Model Aliases and Fallbacks

With the providers in place, point the default model at the primary and list the fallbacks in order of preference:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "vertex-maas/openai/gpt-oss-120b-maas",
        "fallbacks": [
          "vertex-maas-global/qwen/qwen3-next-80b-a3b-thinking-maas",
          "vertex-maas/openai/gpt-oss-20b-maas"
        ]
      }
    }
  }
}
```
## Step 5: Automate Token Refresh
This is the most critical piece. Vertex AI tokens expire every 60 minutes. Without automation, your entire 208-session fleet goes down every hour.
Create a refresh script that runs every 30 minutes via a LaunchAgent:

```shell
TOKEN=$(gcloud auth print-access-token)
# Update both openclaw.json and models.json with the new token
```
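A fuller version of that script might look like the sketch below. The JSON key path (`providers.*.apiKey`) is an assumption about the config schema; adjust it to match your actual files. It uses `python3` for the JSON rewrite so there is no dependency on `jq`:

```shell
# refresh-vertex-token.sh (sketch): rewrites providers.*.apiKey in an OpenClaw
# config file. The key path is an assumption; adjust to your real schema.
# Real run: TOKEN=$(/opt/homebrew/bin/gcloud auth print-access-token)
TOKEN="${TOKEN:-ya29.EXAMPLE-TOKEN}"

update_file() {
  python3 - "$1" "$TOKEN" <<'EOF'
import json, sys
path, token = sys.argv[1], sys.argv[2]
with open(path) as f:
    cfg = json.load(f)
# Write the fresh token into every provider entry that carries an apiKey.
for provider in cfg.get("providers", {}).values():
    if "apiKey" in provider:
        provider["apiKey"] = token
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
}

# Demo on a throwaway file; in production, run it on both
# ~/.openclaw/openclaw.json and ~/.openclaw/agents/main/agent/models.json.
DEMO=$(mktemp)
echo '{"providers":{"vertex-maas":{"baseUrl":"https://x","apiKey":"OLD"}}}' > "$DEMO"
update_file "$DEMO"
grep -q "ya29.EXAMPLE-TOKEN" "$DEMO" && echo "token updated"
```

In production you would replace the demo lines with calls to `update_file` on both real config paths, then restart or signal the gateway if it does not re-read config on its own.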
### Gotcha #7: Binary Paths in LaunchAgents

On macOS with Homebrew, `gcloud` lives at `/opt/homebrew/bin/gcloud`, not `/usr/local/bin/gcloud`. Use absolute paths for everything.
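A LaunchAgent plist for the 30-minute schedule could be sketched like this. The label, script path, and username are placeholders, not values from the original setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Label and script path are hypothetical; use your own. -->
  <key>Label</key><string>com.example.openclaw.token-refresh</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/.openclaw/refresh-vertex-token.sh</string>
  </array>
  <key>StartInterval</key><integer>1800</integer>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
```

Save it under `~/Library/LaunchAgents/` and load it with `launchctl load` so the refresh survives reboots.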
## The Architecture
```
┌──────────────────────────────────────────────────────────┐
│                        Mac Mini                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │              OpenClaw Gateway                      │  │
│  │   208 Active Sessions (WhatsApp, Web, etc.)        │  │
│  └─────────────────┬──────────────────────────────────┘  │
│                    │                                     │
│  ┌─────────────────┴──────────────────────────────────┐  │
│  │        Token Refresh (every 30 min)                │  │
│  └─────────────────┬──────────────────────────────────┘  │
└────────────────────┼─────────────────────────────────────┘
       ┌─────────────┼──────────────┐
       ▼             ▼              ▼
 ┌──────────┐  ┌──────────┐  ┌──────────┐
 │ GPT-OSS  │  │  Qwen    │  │ GPT-OSS  │
 │  120B    │  │  80B     │  │  20B     │
 │ PRIMARY  │  │ FALLBACK │  │HEARTBEAT │
 │ $0.09/M  │  │ $0.12/M  │  │ $0.04/M  │
 └──────────┘  └──────────┘  └──────────┘
```
Daily cost: ~$5 (down from $250)
Monthly cost: ~$150 (down from $7,500)
## Key Takeaways
- Not every task needs the biggest model. Tiered routing can cut costs by 90%+ without sacrificing quality.
- Test models with curl before configuring your gateway. It takes 30 seconds and saves hours of debugging.
- Region availability is real. Don't assume a model works everywhere. The global endpoint is your friend.
- Token refresh automation is non-negotiable. A single expired token will take your entire agent fleet down within the hour.
- Always update both config files. The gateway config and the agent config must stay in sync.
In Part 2, I go further — adding MiniMax M2.5 via OpenRouter, Claude Opus 4.6 as a premium fallback, and building a true multi-provider architecture.
Ready to cut your AI costs by 90%+? Whether you're running a few agents or hundreds, we can help you set up tiered model routing and optimize your AI spending. Book an AI-First Fit Call and we'll show you exactly what's possible.
