🛠️ Engine Room · February 24, 2026 · 4 min read

Hyper-Local AI: Enforcing Orchestration, Edge Caching, and Model-Source Tracking

A full AI infrastructure migration — every Gemini call now routes through the Hybrid AI Orchestrator with 3-tier local model routing, 1-hour AI Gateway edge caching, and per-request model-source diagnostics.

By TradeStance Engineering
ai-infrastructure · workers-ai · gemini · edge-caching

Why This Matters

TradeStance uses AI for HS code classification, fraud detection audits, sentiment scoring, and consensus pricing. Previously, several services bypassed the central Hybrid AI Orchestrator and called the Gemini API directly — meaning no caching, no fallback, no cost tracking, and no local-model offloading.

1. Enforced Orchestration

Both hsMapper.ts (HS code classification) and anomalyDetector.ts (fraud detection audit) have been refactored to route through askAI() instead of making direct Gemini API calls. This means every AI request now benefits from KV caching, automatic fallback, and unified logging.
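The refactor pattern can be sketched as follows. This is a minimal illustration, not the actual `hsMapper.ts` code: the `askAI()` signature and the `AskAIOptions` shape shown here are assumptions.

```typescript
// Hypothetical shape of the orchestrator entry point; the real askAI()
// signature may differ.
interface AskAIOptions {
  priority: "LOW" | "MEDIUM" | "HIGH";
  cacheKey?: string;
}

type AskAI = (prompt: string, opts: AskAIOptions) => Promise<string>;

// Before: a direct fetch() to the Gemini API — no caching, no fallback,
// no cost tracking. After: one call into the orchestrator, which layers
// in KV caching, provider fallback, and unified logging for free.
async function classifyHsCode(description: string, askAI: AskAI): Promise<string> {
  return askAI(`Classify the HS code for: ${description}`, {
    priority: "LOW",               // HS classification is a LOW-tier task
    cacheKey: `hs:${description}`, // identical descriptions hit the KV cache
  });
}
```

Passing `askAI` in as a parameter (rather than importing it) is purely to keep the sketch self-contained and testable.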

2. Three-Tier Local Model Routing

The orchestrator now supports three priority tiers:

  • LOW — Mistral 7B Instruct (@cf/mistral/mistral-7b-instruct-v0.1) for HS classification, price summaries, and sentiment scoring
  • MEDIUM — Llama 3.1 8B Instruct (@cf/meta/llama-3.1-8b-instruct) for anomaly detection and fraud audits
  • HIGH — Gemini 2.0 Flash via AI Gateway for consensus pricing and complex tasks

LOW and MEDIUM tasks run entirely on Cloudflare Workers AI at zero marginal cost. Only HIGH-priority tasks hit the paid Gemini API, and even those benefit from 1-hour edge caching.
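The tier table above amounts to a simple priority-to-provider lookup. A minimal sketch (the `Route` shape and `resolveRoute` helper are illustrative, not the orchestrator's internals; the model IDs are the ones listed above):

```typescript
type Priority = "LOW" | "MEDIUM" | "HIGH";

interface Route {
  provider: "workers-ai" | "gemini-api";
  model: string;
}

const ROUTES: Record<Priority, Route> = {
  // LOW: HS classification, price summaries, sentiment — free local inference
  LOW: { provider: "workers-ai", model: "@cf/mistral/mistral-7b-instruct-v0.1" },
  // MEDIUM: anomaly detection and fraud audits — larger local model, still free
  MEDIUM: { provider: "workers-ai", model: "@cf/meta/llama-3.1-8b-instruct" },
  // HIGH: consensus pricing and complex tasks — paid Gemini API via AI Gateway
  HIGH: { provider: "gemini-api", model: "gemini-2.0-flash" },
};

function resolveRoute(priority: Priority): Route {
  return ROUTES[priority];
}
```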

3. AI Gateway Edge Caching

All Gemini API requests routed through Cloudflare AI Gateway now include the cf-aig-cache-ttl: 3600 header, enabling 1-hour edge caching for repeated prompts. This dramatically reduces Gemini API costs for identical consensus queries and HS verification lookups.
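Attaching the header looks roughly like this. The `cf-aig-cache-ttl` header is Cloudflare AI Gateway's documented cache control; the helper function and auth header below are a sketch, not the orchestrator's actual request builder.

```typescript
// Build headers for a Gemini request routed through Cloudflare AI Gateway.
// A TTL of 3600 seconds asks the gateway to serve identical requests from
// the edge cache for the next hour.
function gatewayHeaders(apiKey: string, ttlSeconds = 3600): Headers {
  const headers = new Headers();
  headers.set("Content-Type", "application/json");
  headers.set("x-goog-api-key", apiKey);
  headers.set("cf-aig-cache-ttl", String(ttlSeconds));
  return headers;
}
```

Because caching keys on the request body, only byte-identical prompts (e.g. repeated consensus queries and HS verification lookups) are served from the edge.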

4. Model-Source Diagnostics

Every AI request now logs which provider handled it — workers-ai, gemini-api, or cache. The Hybrid AI Dashboard displays color-coded source badges per log entry, letting operators see at a glance how traffic splits across providers. A new SQL migration adds the model_source column with an index for fast dashboard queries.
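Conceptually, each log entry now carries its source tag before being written. The field names below are assumptions for illustration, not the actual Supabase schema:

```typescript
// Sketch of a log row carrying the new model_source column.
type ModelSource = "workers-ai" | "gemini-api" | "cache";

interface AiLogEntry {
  prompt_hash: string;
  latency_ms: number;
  created_at: string;
  model_source: ModelSource; // indexed column for fast dashboard queries
}

// Tag a pending log entry with the provider that actually handled it.
function tagSource(
  entry: Omit<AiLogEntry, "model_source">,
  source: ModelSource,
): AiLogEntry {
  return { ...entry, model_source: source };
}
```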

5. Fire-and-Forget Logging

All AI logging now uses ctx.waitUntil() so log writes never block the response. The orchestrator accepts an optional ExecutionContext and defers Supabase inserts and KV cache writes to background execution, keeping response latency minimal.



#ai-infrastructure #workers-ai #gemini #edge-caching #observability #orchestrator