Local Inference Setup

Verdify’s public traffic does not call the local model. Launch readers hit Cloudflare, the static Quartz site, public read-only API endpoints, Grafana, and warmed static dashboard renders. Cortex is for greenhouse planning events: sunrise, sunset, forecast deltas, band transitions, deviations, daily reviews, and manual operator runs.

This page is an operational snapshot of the local inference stack as verified on 2026-05-07. It uses public-safe service labels; private DNS names, exact kernel build strings, and internal storage mounts are intentionally omitted where they do not help readers understand the design.

Iris (our OpenClaw AI agent) normally has two planner routes: Gemma 4 26B A4B (MoE), served locally under the gemma4-26b alias, for routine work, and a cloud peer for heavier reviews. This is local-first, not local-only. OpenClaw chooses the agent instance, MCP validates writes, Slack Operations explains the result to humans, and the ESP32 keeps the safety boundary by owning relay decisions every 5 seconds.

Model Inventory

The planner-facing source of truth is config/ai.yaml.

| Purpose | Provider | Model label | Temperature | Output cap | Notes |
| --- | --- | --- | --- | --- | --- |
| Planner | vLLM | gemma4-26b | 0.3 | 4096 tokens | Gemma 4 26B A4B (MoE), served locally under the gemma4-26b alias, for tactical planning through OpenClaw. |
| Vision | Gemini | gemini-3.1-pro-preview | 0.2 | 4096 tokens | Greenhouse camera snapshot analysis. |
| Embeddings | Gemini | gemini-embedding-2-preview | n/a | 3072 dimensions | Observation similarity search. |

The local model server is Cortex, a private-LAN vLLM host. OpenClaw reaches it through a private OpenAI-compatible /v1 endpoint; that endpoint currently advertises gemma4-26b as a vLLM-owned model with max_model_len=131072.

The OpenClaw fleet label for the local planner is gemma4-26b @ 131k ctx. The planner contract does not try to spend that entire window. make planner-dry enforces a local prompt budget of roughly 60k Gemma tokens for the stable preamble, using a 208k character cap as the offline guard.
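
The character-cap half of that guard is simple to reason about. The sketch below is illustrative only; the preamble path and variable names are assumptions, not the actual make planner-dry recipe:

# Offline guard sketch: fail when the assembled stable preamble exceeds the 208k character cap
PREAMBLE=build/planner_preamble.txt   # hypothetical path to the assembled preamble
CAP=208000
CHARS=$(wc -c < "$PREAMBLE")
if [ "$CHARS" -gt "$CAP" ]; then
  echo "Preamble is $CHARS characters, over the $CAP cap" >&2
  exit 1
fi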

MoE means mixture-of-experts: the checkpoint has expert sub-networks and a router that activates a subset of them for each token. That extra capacity helps the planning task without paying dense-model compute on every token, but it does not make throughput unlimited. Long-context prefill, tool-call parsing, PCIe tensor-parallel traffic, and KV-cache pressure still determine whether a required full-plan cycle can finish cleanly.

Cortex Host

[Image: Cortex home-lab GPU rack with stacked compute nodes and illuminated cooling fans]
Cortex is the local home-lab inference target for routine Verdify planner runs. Public launch traffic reads already-written evidence; it does not call this rack directly.
| Resource | Current value |
| --- | --- |
| Host identity | Private Cortex VM |
| Public exposure | Not public; reached from OpenClaw and local services inside the home network |
| OS / kernel | Debian 13 trixie; exact kernel build omitted from public docs |
| Virtualization | KVM full virtualization on an AMD Ryzen Threadripper 2950X host |
| CPU | 28 vCPUs, single socket, one thread per core, one NUMA node |
| RAM | 110 GiB total |
| Swap | Effectively none |
| Active model storage | Local model volume mounted into the vLLM container |
| Model cache shelf | Local model-cache volume for non-active checkpoints |
| GPUs | 2 x GeForce RTX 4070 Ada, 16 GiB VRAM each |
| GPU topology | PHB over PCIe, no NVLink |
| NVIDIA stack | Driver 575.51.02, CUDA 12.9 |

The two 4070s are the practical ceiling for this box. Tensor-parallel all-reduces cross PCIe through the host bridge, so the service is tuned for predictable routine planning rather than unconstrained interactive throughput. Published throughput numbers are launch observations, not a capacity guarantee for arbitrary prompts.
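
The PCIe claim can be re-checked from the host without touching the service. In nvidia-smi topo -m, a PHB entry between the two GPUs means their traffic crosses a PCIe host bridge rather than NVLink, and the link query shows the generation and lane width that bound all-reduce bandwidth:

# GPU-to-GPU path: expect PHB (PCIe host bridge), not an NV# NVLink entry
nvidia-smi topo -m

# PCIe generation and lane width currently negotiated per GPU
nvidia-smi --query-gpu=name,pcie.link.gen.current,pcie.link.width.current --format=csv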

vLLM Service

cortex-serve.service owns the vLLM process. It is a systemd service that stops any prior vllm-serve container, removes it, and starts a fresh vLLM OpenAI-compatible server with host networking.

| Runtime detail | Current value |
| --- | --- |
| systemd unit | cortex-serve.service |
| Container name | vllm-serve |
| Image | vllm/vllm-openai:gemma4 |
| API shape | OpenAI-compatible completions API |
| Endpoint | Private OpenAI-compatible /v1 endpoint on the Cortex host |
| Served model | gemma4-26b |
| Model root in container | Local AWQ Gemma model volume |
| Tensor parallelism | --tensor-parallel-size 2 |
| Context window | --max-model-len 131072 |
| GPU memory policy | --gpu-memory-utilization 0.92 |
| KV cache | --kv-cache-dtype fp8 |
| Compute dtype | --dtype bfloat16 |
| Batching | --max-num-seqs 16, --max-num-batched-tokens 8192 |
| Prefix handling | --enable-prefix-caching, --enable-chunked-prefill |
| Tool parsing | --enable-auto-tool-choice, --tool-call-parser gemma4 |
| Reasoning parsing | --reasoning-parser gemma4 |
| Multimodal policy | --limit-mm-per-prompt {"image":0,"audio":0} |
| IPC / shared memory | --ipc=host, --shm-size=16g |
| Restart policy | systemd Restart=always, RestartSec=10, TimeoutStartSec=300 |

The launch-safe command shape is:

docker run --rm --name vllm-serve \
  --gpus all --ipc=host --network host --shm-size=16g \
  -v <model-cache-volume>:/models -v <active-model-volume>:/srv-models \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -e VLLM_WORKER_MULTIPROC_METHOD=spawn \
  vllm/vllm-openai:gemma4 \
  --model /srv-models/<gemma4-awq-checkpoint> \
  --served-model-name gemma4-26b \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching --enable-chunked-prefill \
  --enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
  --trust-remote-code \
  --max-num-seqs 16 --max-num-batched-tokens 8192 \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --dtype bfloat16 \
  --host 0.0.0.0 --port <private-port>
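
The systemd wrapper around that command follows the stop/remove/start pattern and restart policy described above. The unit below is a hedged sketch, not a verbatim copy of the deployed cortex-serve.service; binary paths are typical defaults and the full docker run argument list is elided:

# Illustrative shape of cortex-serve.service (sketch, assuming Docker is managed by systemd)
[Unit]
Description=vLLM OpenAI-compatible server for gemma4-26b
After=docker.service network-online.target
Requires=docker.service

[Service]
# Clear any prior container before starting fresh
ExecStartPre=-/usr/bin/docker stop vllm-serve
ExecStartPre=-/usr/bin/docker rm -f vllm-serve
# Full launch command as shown above
ExecStart=/usr/bin/docker run --rm --name vllm-serve ... vllm/vllm-openai:gemma4 ...
ExecStop=/usr/bin/docker stop vllm-serve
Restart=always
RestartSec=10
TimeoutStartSec=300

[Install]
WantedBy=multi-user.target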

Why These Flags Matter

| Setting | Why it is there |
| --- | --- |
| --tensor-parallel-size 2 | One rank per 4070; the active checkpoint does not fit comfortably on one 16 GiB card. |
| --max-model-len 131072 | Keeps the service aligned with the 128K context capability advertised by the locally served gemma4-26b route. |
| --kv-cache-dtype fp8 | Reduces the KV cache footprint enough to make long-context planning practical on this hardware. |
| --enable-prefix-caching | Critical for repeated Cortex planner prompts; the system/template prefix is reused heavily. |
| --enable-chunked-prefill | Lets large prefills share scheduler time with generation. |
| --tool-call-parser gemma4 | Converts the model's native tool-call markup into OpenAI-compatible tool calls for OpenClaw clients. |
| --reasoning-parser gemma4 | Surfaces reasoning blocks through the OpenAI-compatible response fields instead of leaking parser syntax into content. |
| --limit-mm-per-prompt {"image":0,"audio":0} | Disables vision/audio towers so VRAM goes to weights and KV cache. |
| --ipc=host and --shm-size=16g | Provides the shared-memory space vLLM needs for tensor-parallel workers. |
| PYTORCH_ALLOC_CONF=expandable_segments:True | Reduces fragmentation risk under long-context prefill and warm KV cache. |
| TimeoutStartSec=300 | Cold starts are long enough that the default 90 second timeout would kill the service during warmup. |

Model File

| Detail | Current value |
| --- | --- |
| Host path | Private active-model volume |
| Container path | Local AWQ model mount |
| On-disk size | 17 GB |
| Shards | 4 safetensors shards |
| Modified | 2026-04-04 |
| Architecture | Gemma4ForConditionalGeneration |
| Model type | gemma4 |
| Weight dtype | bfloat16 for non-quantized layers |
| Quantization | AWQ, 4-bit, group size 32, symmetric, MSE observer |
| Quantized target | Linear layers |
| Full-precision exclusions | MoE expert projections and router projections stay unquantized |
| Tokenizer / template | Standard tokenizer files plus chat_template.jinja |

The model directory includes the safetensors shards, model.safetensors.index.json, config.json, generation_config.json, recipe.yaml, chat_template.jinja, tokenizer files, multimodal processor metadata, and the upstream model card.

Querying the live /v1/models endpoint confirms the active route:

{
  "id": "gemma4-26b",
  "object": "model",
  "owned_by": "vllm",
  "root": "local AWQ model mount",
  "max_model_len": 131072
}

Endpoint Surface

vLLM binds to a private LAN endpoint. The internal port choice is kept stable for Ollama-era local clients, but public launch traffic never reaches it.

| Path | Use |
| --- | --- |
| GET /v1/models | Model listing; returns gemma4-26b. |
| POST /v1/chat/completions | Primary chat endpoint for OpenClaw, including tool calls and reasoning fields. |
| POST /v1/completions | Legacy completions endpoint. |
| GET /health | Liveness probe. |
| GET /metrics | Prometheus metrics for vLLM and scheduler behavior. |
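
For readers who have not used the OpenAI-compatible shape before, a minimal local chat call looks like the sketch below. The port placeholder matches the rest of this page, and the messages are illustrative; real planner calls go through OpenClaw with tool definitions attached:

curl -s http://127.0.0.1:<private-port>/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "gemma4-26b",
        "messages": [
          {"role": "system", "content": "You are the Verdify tactical planner."},
          {"role": "user", "content": "Summarize the next band transition."}
        ],
        "temperature": 0.3,
        "max_tokens": 256
      }' | jq '.choices[0].message'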

One migration note matters operationally: Ollama-era clients that probe GET /api/tags will receive a 404 from this service. Those local clients should probe /v1/models instead.
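
The replacement probe is a one-line change, sketched here with the same private-port placeholder used elsewhere on this page:

# Ollama-era probe: now returns 404 from this service
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:<private-port>/api/tags

# vLLM-era replacement: list served models instead
curl -s http://127.0.0.1:<private-port>/v1/models | jq '.data[].id'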

Boot and Capacity Observations

The running service was inspected live on 2026-05-07.

| Observation | Current value |
| --- | --- |
| vLLM build | v0.18.2rc1.dev73+gdb7a17ecc, V1 engine |
| Quantization detected | compressed-tensors AWQ pack-quantized |
| Worker world size | 2 tensor-parallel workers |
| Cold init | About 108 seconds before serving |
| CUDA graph capture sizes | [1,2,4,8,16,24,32] |
| Per-worker KV cache | About 4.42 GiB available |
| Full-context concurrency | About 4.15x at the configured 131K context, per vLLM logs |
| Steady batch-1 throughput | Roughly 90 to 100 tokens per second observed |
| Prefix cache hit rate | Climbs into roughly 90 to 96 percent under repeated planner prompts |
| CPU thread warning | vLLM reduces Torch CPU threads from 28 to 1; set OMP_NUM_THREADS if CPU-side work becomes a bottleneck |
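
The cache and scheduler numbers above come from vLLM logs and the /metrics endpoint. A read-only way to spot-check them is to scrape the Prometheus surface; exact metric names vary across vLLM versions, so grep for the metric families rather than hard-coding names:

# Prefix-cache, KV-cache, and request counters from the vLLM Prometheus endpoint
curl -s http://127.0.0.1:<private-port>/metrics | grep -iE 'prefix_cache|kv_cache|num_requests' | head -20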

This is enough for routine greenhouse planning because public traffic does not create model calls, and planner triggers are sparse compared with page views. The main launch capacity risk is the public proof layer, not Cortex.

Throughput caveat: batch-1 token speed is not the same thing as guaranteed end-to-end planning latency. A full plan includes context assembly, long-context prefill, tool calls, validation, and Slack/archive writes; required full plans currently use the cloud peer until the local full-plan context is trimmed and baked.

Cortex API Gateway

Cortex also has a FastAPI gateway that sits above vLLM in the design: a unified POST /v1/cortex/infer endpoint with profile-based routing across local backends, cloud fallback, job queueing, and budget checks. That gateway talks to the local model through the same private route.

The gateway exposes these surfaces:

| Gateway surface | Role |
| --- | --- |
| /v1/cortex/health | Service health. |
| /v1/cortex/profiles | Available inference profiles. |
| /v1/cortex/costs | Budget and cost view. |
| /v1/cortex/gpu | GPU state and residency. |
| /v1/cortex/queue | Async queue status. |
| /v1/cortex/ready | Readiness gate. |
| /v1/cortex/infer | Main profile-based inference endpoint. |
| /v1/cortex/jobs/{id} | Async job result lookup. |

The current public documentation treats the vLLM service as verified and the Cortex gateway deployment as a thing to re-check before making stronger claims. If the gateway moves hosts or changes process manager, this page should be refreshed.

Adjacent Services

The inference box also runs related observability and local-AI services:

| Service | Role |
| --- | --- |
| tei-embeddings | CPU text-embedding inference on a local Qwen embedding model. |
| open-webui | Local UI for OpenAI-compatible access to the model. |
| dcgm-exporter | GPU metrics. |
| node-exporter | Host metrics. |
| alloy and promtail | Metrics and log shipping. |
| portainer_agent | Remote management agent. |

LMCache configuration exists on disk, but the running vLLM command line does not currently wire LMCache into the service. Prefix caching is live; LMCache should be treated as dormant until the vLLM launch arguments prove otherwise.
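
A quick, read-only way to confirm that dormant status is to reuse the inspect pattern from the verification section and look for LMCache flags or environment variables on the running container:

# Confirm LMCache is absent from the live launch arguments and environment
docker inspect vllm-serve --format '{{json .Args}}' | grep -io lmcache || echo "no LMCache flags in args"
docker inspect vllm-serve --format '{{range .Config.Env}}{{println .}}{{end}}' | grep -i lmcache || echo "no LMCache env vars"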

Planning-Loop Boundary

The local inference route is not the safety mechanism. If the locally served gemma4-26b route is slow, unavailable, or wrong, the planner SLA opens an alert rather than silently pretending the cycle worked. The ESP32 still owns real-time control and enforces the last valid bounded setpoints plus hard safety rails.

The end-to-end loop is:

1. Trigger: solar milestone, forecast delta, transition, deviation, or manual operator request.
2. Route: OpenClaw chooses the local gemma4-26b route for routine checks or a cloud peer for heavier reviews.
3. Infer: Cortex serves gemma4-26b through vLLM's OpenAI-compatible API.
4. Write: Iris writes bounded tactical intent through MCP tools and audited plan records.
5. Enforce: the dispatcher and ESP32 validate, clamp, and enforce deterministic state-machine behavior.
6. Brief: Slack gets the human-readable plan summary, watch items, and operator tasks.
7. Score: telemetry, cost, stress, compliance, and plan outcomes judge what actually happened.

For launch, required_full_plan_instance is set to opus in config/ai.yaml. Required SUNRISE/SUNSET/MIDNIGHT full-plan events use the cloud planner path until the local full-plan context is trimmed and a clean local bake passes. Routine FORECAST and TRANSITION work stays local.

That is why the public archive can show both local-first architecture and a cloud-written catch-up full plan without contradiction: the route is explicit, stamped, and audited.

Read-Only Verification

These are the private-operator command categories that refresh the operational snapshot. Exact host paths and private endpoint details stay in the operator runbook.

# Service state
systemctl status cortex-serve.service
docker ps --filter name=vllm-serve --format '{{.Status}}'
docker logs vllm-serve --tail 100
 
# Endpoint
curl -s http://127.0.0.1:<private-port>/v1/models | jq
curl -s http://127.0.0.1:<private-port>/health
 
# Hardware
nvidia-smi
nvidia-smi topo -m
free -h
lscpu | head -20
df -hT
 
# Model file
ls -la <active-model-volume>/gemma4-26b-awq/
du -sh <active-model-volume>/gemma4-26b-awq/
 
# Effective vLLM args
docker inspect vllm-serve --format '{{json .Args}}' | jq
docker inspect vllm-serve --format '{{range .Config.Env}}{{println .}}{{end}}'

Open Questions

  • Confirm where the Cortex API gateway process actually runs in production before documenting it as a verified systemd-managed service.
  • Decide whether LMCache is intentionally dormant or should be wired into vLLM for cross-request KV reuse beyond prefix caching.
  • Find and update any Ollama-era client still probing /api/tags; vLLM serves /v1/models.

Related pages: