I run 208 active agent sessions on OpenClaw — a self-hosted AI gateway that handles everything from WhatsApp-based property management to automated research workflows. These agents run 24/7, constantly processing messages, performing heartbeat checks, and executing tasks.
Originally, everything was powered by Anthropic Claude Opus 4.6. It's an incredible model, but for high-volume background tasks, the costs are brutal:
| Metric | Claude Opus 4.6 |
|---|---|
| Input cost | $15.00 / M tokens |
| Output cost | $75.00 / M tokens |
| Daily spend | ~$250 |
| Monthly spend | ~$7,500 |
Something had to change.
## The Solution: Tiered Model Routing on Vertex AI MaaS
Google Cloud's Vertex AI Model-as-a-Service (MaaS) offers access to top-tier open-source models at a fraction of the cost. The key insight: not every task needs the most expensive model. Instead of one-size-fits-all, I set up tiered routing with automatic failover:
| Tier | Model | Provider | Purpose | Cost (Input/Output) |
|---|---|---|---|---|
| Primary | GPT-OSS 120B | Vertex (us-central1) | The Workhorse — high-level reasoning | $0.09 / $0.19 per M |
| Fallback | Qwen3 80B Thinking | Vertex (global) | The Specialist — deep logical reasoning | ~$0.12 / $0.24 per M |
| Heartbeat | GPT-OSS 20B | Vertex (us-central1) | The Sentry — keeps 208 sessions alive | $0.04 / $0.19 per M |
The result: Daily costs dropped from $250 to roughly $5 — a 98% reduction.
## Prerequisites
- A Mac (this guide uses macOS with Homebrew; adapt paths for Linux)
- OpenClaw installed (`npm install -g openclaw`)
- A Google Cloud Platform account with a project
- The `gcloud` CLI installed (`brew install google-cloud-sdk`)
- An active OpenClaw gateway with at least one agent
## Step 1: Set Up Google Cloud Authentication
First, authenticate with Google Cloud and set your project:

```shell
gcloud auth login
gcloud config set project YOUR_PROJECT_ID
```
### Gotcha #1: The GCP Organization Wall
When I first tried to enable API access, Google blocked me. My Google Cloud Organization had a strict policy disabling API keys by default.
The fix: You may need to create an environment tag and bind it to your project to override the organization's security constraint. Check with your org admin, or if you control the org:

```shell
gcloud resource-manager org-policies describe constraints/iam.disableServiceAccountKeyCreation \
  --project=YOUR_PROJECT_ID
```
### Generate an Access Token

Unlike Anthropic's permanent API keys, Vertex AI MaaS uses OAuth 2.0 access tokens that expire every 60 minutes:

```shell
gcloud auth print-access-token
```
## Step 2: Test Your Models Directly
Before touching OpenClaw, verify that your models actually work via curl. This saves hours of debugging config issues.
### Find the Right Endpoints
This is where I burned the most time. Not all models are available in all regions. Here's what I discovered:
- `openai/gpt-oss-120b-maas` → works on `us-central1`
- `openai/gpt-oss-20b-maas` → works on `us-central1`
- `qwen/qwen3-next-80b-a3b-thinking-maas` → NOT available in `us-central1`; only works on the global endpoint
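A quick smoke test can be sketched like this. `YOUR_PROJECT_ID` is a placeholder, and the exact endpoint path is an assumption based on Vertex AI's OpenAI-compatible `chat/completions` route, so double-check it against the current Vertex AI docs:

```shell
# Build the OpenAI-compatible chat/completions URL for a regional MaaS model.
# PROJECT_ID and the path shape are assumptions; verify against Vertex AI docs.
PROJECT_ID="YOUR_PROJECT_ID"
REGION="us-central1"
MODEL="openai/gpt-oss-120b-maas"
ENDPOINT="https://${REGION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${REGION}/endpoints/openapi/chat/completions"
echo "Testing ${MODEL} at ${ENDPOINT}"

# Uncomment to send a real request (needs a fresh gcloud token):
# curl -s "$ENDPOINT" \
#   -H "Authorization: Bearer $(gcloud auth print-access-token)" \
#   -H "Content-Type: application/json" \
#   -d '{"model": "'"$MODEL"'", "messages": [{"role": "user", "content": "Say hi"}]}'
```

If the response is JSON with a `choices` array, the model works in that region; a 404 or region error means you need a different endpoint.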
## Step 3: Configure OpenClaw Providers
Now that you know which endpoints work for which models, configure OpenClaw. You'll need to edit two files:
- `~/.openclaw/openclaw.json` — main gateway config
- `~/.openclaw/agents/main/agent/models.json` — agent-specific model config
This is critical: OpenClaw reads models from both files. If a model is in the main config but not the agent config, you'll get "Unknown model" errors at runtime.
### Gotcha #3: The `reasoning` Parameter
Qwen is a "thinking" model, so it's natural to set `"reasoning": true`. Don't. Vertex AI's MaaS endpoints don't support the reasoning/thinking parameter in the OpenAI-compatible API. Setting it to `true` causes a 400 error. Always set `"reasoning": false`.
### Gotcha #4: OpenClaw's Strict Schema Validation
OpenClaw 2026.2.x is very particular about its JSON schema. Fields that seem reasonable will cause "Config Invalid" errors:
- `"alias"` inside model objects → not allowed. Use the top-level `agents.defaults.models` section for aliases instead.
- Missing `baseUrl` → every provider must have a `baseUrl`.
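Putting those rules together, a provider entry might look like this. This is a sketch: field names beyond `baseUrl` and `reasoning` are assumptions about OpenClaw's schema, and `YOUR_PROJECT_ID` is a placeholder:

```json
{
  "providers": {
    "vertex-maas": {
      "baseUrl": "https://us-central1-aiplatform.googleapis.com/v1beta1/projects/YOUR_PROJECT_ID/locations/us-central1/endpoints/openapi",
      "apiKey": "PASTE_FRESH_ACCESS_TOKEN",
      "models": [
        { "id": "openai/gpt-oss-120b-maas", "reasoning": false },
        { "id": "openai/gpt-oss-20b-maas", "reasoning": false }
      ]
    },
    "vertex-maas-global": {
      "baseUrl": "https://aiplatform.googleapis.com/v1beta1/projects/YOUR_PROJECT_ID/locations/global/endpoints/openapi",
      "apiKey": "PASTE_FRESH_ACCESS_TOKEN",
      "models": [
        { "id": "qwen/qwen3-next-80b-a3b-thinking-maas", "reasoning": false }
      ]
    }
  }
}
```

Note what is absent: no `alias` inside the model objects, and every provider carries a `baseUrl`.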
## Step 4: Set Up Model Aliases and Fallbacks

With the providers in place, point the default model at the primary and list the fallbacks in order of preference:
```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "vertex-maas/openai/gpt-oss-120b-maas",
        "fallbacks": [
          "vertex-maas-global/qwen/qwen3-next-80b-a3b-thinking-maas",
          "vertex-maas/openai/gpt-oss-20b-maas"
        ]
      }
    }
  }
}
```
## Step 5: Automate Token Refresh
This is the most critical piece. Vertex AI tokens expire every 60 minutes. Without automation, your entire 208-session fleet goes down every hour.
Create a refresh script that runs every 30 minutes via a LaunchAgent:

```shell
TOKEN=$(gcloud auth print-access-token)
# Update both openclaw.json and models.json with the new token
```
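A fuller version of that script might look like the sketch below. The JSON key path (`providers.*.apiKey`) is an assumption about the config schema; adjust it to match your actual files. It uses `python3` for the JSON rewrite so there is no dependency on `jq`:

```shell
# refresh-vertex-token.sh (sketch): rewrites providers.*.apiKey in an OpenClaw
# config file. The key path is an assumption; adjust to your real schema.
# Real run: TOKEN=$(/opt/homebrew/bin/gcloud auth print-access-token)
TOKEN="${TOKEN:-ya29.EXAMPLE-TOKEN}"

update_file() {
  python3 - "$1" "$TOKEN" <<'EOF'
import json, sys
path, token = sys.argv[1], sys.argv[2]
with open(path) as f:
    cfg = json.load(f)
# Write the fresh token into every provider entry that carries an apiKey.
for provider in cfg.get("providers", {}).values():
    if "apiKey" in provider:
        provider["apiKey"] = token
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
EOF
}

# Demo on a throwaway file; in production, run it on both
# ~/.openclaw/openclaw.json and ~/.openclaw/agents/main/agent/models.json.
DEMO=$(mktemp)
echo '{"providers":{"vertex-maas":{"baseUrl":"https://x","apiKey":"OLD"}}}' > "$DEMO"
update_file "$DEMO"
grep -q "ya29.EXAMPLE-TOKEN" "$DEMO" && echo "token updated"
```

In production you would replace the demo lines with calls to `update_file` on both real config paths, then restart or signal the gateway if it does not re-read config on its own.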
### Gotcha #7: Binary Paths in LaunchAgents

On macOS with Homebrew, `gcloud` lives at `/opt/homebrew/bin/gcloud`, not `/usr/local/bin/gcloud`. Use absolute paths for everything.
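A LaunchAgent plist for the 30-minute schedule could be sketched like this. The label, script path, and username are placeholders, not values from the original setup:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- Label and script path are hypothetical; use your own. -->
  <key>Label</key><string>com.example.openclaw.token-refresh</string>
  <key>ProgramArguments</key>
  <array>
    <string>/bin/bash</string>
    <string>/Users/YOUR_USER/.openclaw/refresh-vertex-token.sh</string>
  </array>
  <key>StartInterval</key><integer>1800</integer>
  <key>RunAtLoad</key><true/>
</dict>
</plist>
```

Save it under `~/Library/LaunchAgents/` and load it with `launchctl load` so the refresh survives reboots.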
## The Architecture
```
┌──────────────────────────────────────────────────────────┐
│                        Mac Mini                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │              OpenClaw Gateway                      │  │
│  │   208 Active Sessions (WhatsApp, Web, etc.)        │  │
│  └─────────────────┬──────────────────────────────────┘  │
│                    │                                     │
│  ┌─────────────────┴──────────────────────────────────┐  │
│  │        Token Refresh (every 30 min)                │  │
│  └─────────────────┬──────────────────────────────────┘  │
└────────────────────┼─────────────────────────────────────┘
       ┌─────────────┼──────────────┐
       ▼             ▼              ▼
 ┌──────────┐  ┌──────────┐  ┌──────────┐
 │ GPT-OSS  │  │  Qwen    │  │ GPT-OSS  │
 │  120B    │  │  80B     │  │  20B     │
 │ PRIMARY  │  │ FALLBACK │  │HEARTBEAT │
 │ $0.09/M  │  │ $0.12/M  │  │ $0.04/M  │
 └──────────┘  └──────────┘  └──────────┘
```
Daily cost: ~$5 (down from $250)
Monthly cost: ~$150 (down from $7,500)
## Key Takeaways
- Not every task needs the biggest model. Tiered routing can cut costs by 90%+ without sacrificing quality.
- Test models with curl before configuring your gateway. It takes 30 seconds and saves hours of debugging.
- Region availability is real. Don't assume a model works everywhere. The global endpoint is your friend.
- Token refresh automation is non-negotiable. A single expired token will take your entire agent fleet down within the hour.
- Always update both config files. The gateway config and the agent config must stay in sync.
In Part 2, I go further — adding MiniMax M2.5 via OpenRouter, Claude Opus 4.6 as a premium fallback, and building a true multi-provider architecture.
Ready to cut your AI costs by 90%+? Whether you're running a few agents or hundreds, we can help you set up tiered model routing and optimize your AI spending. Book an AI-First Fit Call and we'll show you exactly what's possible.
