Local Inference Setup
Verdify’s public traffic does not call the local model. Launch readers hit Cloudflare, the static Quartz site, public read-only API endpoints, Grafana, and warmed static dashboard renders. Cortex is for greenhouse planning events: sunrise, sunset, forecast deltas, band transitions, deviations, daily reviews, and manual operator runs.
This page is an operational snapshot of the local inference stack as verified on 2026-05-07. It uses public-safe service labels; private DNS names, exact kernel build strings, and internal storage mounts are intentionally omitted where they do not help readers understand the design.
Iris (our OpenClaw AI agent) normally has two planner routes: the local Gemma 4 26B A4B (MoE), served under the gemma4-26b alias, for routine work, and a cloud peer for heavier reviews. This is local-first, not local-only. OpenClaw chooses the agent instance, MCP validates writes, Slack Operations explains the result to humans, and the ESP32 keeps the safety boundary by owning relay decisions every 5 seconds.
Model Inventory
The planner-facing source of truth is config/ai.yaml.
| Purpose | Provider | Model label | Temperature | Output cap | Notes |
|---|---|---|---|---|---|
| Planner | vLLM | gemma4-26b | 0.3 | 4096 tokens | Gemma 4 26B A4B (MoE), served locally under the gemma4-26b alias, for tactical planning through OpenClaw. |
| Vision | Gemini | gemini-3.1-pro-preview | 0.2 | 4096 tokens | Greenhouse camera snapshot analysis. |
| Embeddings | Gemini | gemini-embedding-2-preview | n/a | 3072 dimensions | Observation similarity search. |
The local model server is Cortex, a private-LAN vLLM host. OpenClaw reaches it through a private OpenAI-compatible /v1 endpoint; that endpoint currently advertises gemma4-26b as a vLLM-owned model with max_model_len=131072.
The OpenClaw fleet label for the local planner is gemma4-26b @ 131k ctx. The planner contract does not try to spend that entire window. make planner-dry enforces a local prompt budget of roughly 60k Gemma tokens for the stable preamble, using a 208k character cap as the offline guard.
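The offline guard can be sketched as a plain character check, since the point of make planner-dry is to catch an oversized preamble without loading the Gemma tokenizer. This is a minimal sketch: the 208k character cap and ~60k token budget come from this page, while the function name and the linear chars-to-tokens estimate are illustrative assumptions, not the real target's implementation.

```python
# Sketch of the offline prompt-budget guard. The 208_000 character cap and
# the ~60k Gemma-token budget are the figures from this page; the function
# name and the linear token estimate are illustrative assumptions.

CHAR_CAP = 208_000      # offline guard used by make planner-dry
TOKEN_BUDGET = 60_000   # approximate Gemma-token budget for the stable preamble

def check_preamble_budget(preamble: str) -> dict:
    """Return a pass/fail verdict without needing the Gemma tokenizer."""
    chars = len(preamble)
    # Linear estimate implied by pairing the two caps (~3.5 chars per token).
    est_tokens = chars * TOKEN_BUDGET // CHAR_CAP
    return {"chars": chars, "est_tokens": est_tokens, "ok": chars <= CHAR_CAP}

print(check_preamble_budget("x" * 100_000)["ok"])  # True: under the cap
print(check_preamble_budget("x" * 250_000)["ok"])  # False: would trip the guard
```

The guard deliberately overshoots in neither direction: a preamble that passes the character cap may still be trimmed later, but one that fails it never reaches the tokenizer at all.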
MoE means mixture-of-experts: the checkpoint has expert sub-networks and a router that activates a subset for a given token. That helps capacity fit the planning task, but it does not make throughput unlimited. Long-context prefill, tool-call parsing, PCIe tensor-parallel traffic, and KV-cache pressure still determine whether a required full-plan cycle can finish cleanly.
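The routing idea reduces to a small amount of code. The sketch below shows generic top-k expert selection for one token; the expert count and k value are illustrative, not Gemma 4's actual configuration.

```python
# Minimal sketch of mixture-of-experts routing: a router scores every expert
# for each token and only the top-k experts actually run. Eight experts and
# k=2 are illustrative numbers, not Gemma 4's real configuration.

def route_token(router_scores: list[float], k: int = 2) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical router output for one token: experts 1 and 3 win.
scores = [0.1, 0.9, 0.05, 0.7, 0.2, 0.3, 0.0, 0.4]
print(route_token(scores))  # [1, 3]
```

Only the selected experts' weights participate in the matmuls for that token, which is why active compute per token is much smaller than total parameter count, while prefill and KV-cache costs are unaffected by the expert split.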
Cortex Host
| Resource | Current value |
|---|---|
| Host identity | Private Cortex VM |
| Public exposure | Not public; reached from OpenClaw and local services inside the home network |
| OS / kernel | Debian 13 trixie; exact kernel build omitted from public docs |
| Virtualization | KVM full virtualization on an AMD Ryzen Threadripper 2950X host |
| CPU | 28 vCPUs, single socket, one thread per core, one NUMA node |
| RAM | 110 GiB total |
| Swap | Effectively none |
| Active model storage | Local model volume mounted into the vLLM container |
| Model cache shelf | Local model-cache volume for non-active checkpoints |
| GPUs | 2 x GeForce RTX 4070 Ada, 16 GiB VRAM each |
| GPU topology | PHB over PCIe, no NVLink |
| NVIDIA stack | Driver 575.51.02, CUDA 12.9 |
The two 4070s are the practical ceiling for this box. Tensor-parallel all-reduces cross PCIe through the host bridge, so the service is tuned for predictable routine planning rather than unconstrained interactive throughput. Published throughput numbers are launch observations, not a capacity guarantee for arbitrary prompts.
vLLM Service
cortex-serve.service owns the vLLM process. It is a systemd service that stops any prior vllm-serve container, removes it, and starts a fresh vLLM OpenAI-compatible server with host networking.
| Runtime detail | Current value |
|---|---|
| systemd unit | cortex-serve.service |
| Container name | vllm-serve |
| Image | vllm/vllm-openai:gemma4 |
| API shape | OpenAI-compatible completions API |
| Endpoint | Private OpenAI-compatible /v1 endpoint on the Cortex host |
| Served model | gemma4-26b |
| Model root in container | Local AWQ Gemma model volume |
| Tensor parallelism | --tensor-parallel-size 2 |
| Context window | --max-model-len 131072 |
| GPU memory policy | --gpu-memory-utilization 0.92 |
| KV cache | --kv-cache-dtype fp8 |
| Compute dtype | --dtype bfloat16 |
| Batching | --max-num-seqs 16, --max-num-batched-tokens 8192 |
| Prefix handling | --enable-prefix-caching, --enable-chunked-prefill |
| Tool parsing | --enable-auto-tool-choice, --tool-call-parser gemma4 |
| Reasoning parsing | --reasoning-parser gemma4 |
| Multimodal policy | --limit-mm-per-prompt {"image":0,"audio":0} |
| IPC / shared memory | --ipc=host, --shm-size=16g |
| Restart policy | systemd Restart=always, RestartSec=10, TimeoutStartSec=300 |
The launch-safe command shape is:
docker run --rm --name vllm-serve \
--gpus all --ipc=host --network host --shm-size=16g \
-v <model-cache-volume>:/models -v <active-model-volume>:/srv-models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm/vllm-openai:gemma4 \
--model /srv-models/<gemma4-awq-checkpoint> \
--served-model-name gemma4-26b \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--gpu-memory-utilization 0.92 \
--kv-cache-dtype fp8 \
--enable-prefix-caching --enable-chunked-prefill \
--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
--trust-remote-code \
--max-num-seqs 16 --max-num-batched-tokens 8192 \
--limit-mm-per-prompt '{"image":0,"audio":0}' \
--dtype bfloat16 \
  --host 0.0.0.0 --port <private-port>
Why These Flags Matter
| Setting | Why it is there |
|---|---|
| --tensor-parallel-size 2 | One rank per 4070; the active checkpoint does not fit comfortably on one 16 GiB card. |
| --max-model-len 131072 | Keeps the service aligned with the 128K context capability advertised by the locally served gemma4-26b route. |
| --kv-cache-dtype fp8 | Reduces KV cache footprint enough to make long-context planning practical on this hardware. |
| --enable-prefix-caching | Critical for repeated Cortex planner prompts; the system/template prefix is reused heavily. |
| --enable-chunked-prefill | Lets large prefills share scheduler time with generation. |
| --tool-call-parser gemma4 | Converts the model’s native tool-call markup into OpenAI-compatible tool calls for OpenClaw clients. |
| --reasoning-parser gemma4 | Surfaces reasoning blocks through the OpenAI-compatible response fields instead of leaking parser syntax into content. |
| --limit-mm-per-prompt {"image":0,"audio":0} | Disables vision/audio towers so VRAM goes to weights and KV cache. |
| --ipc=host and --shm-size=16g | Provides the shared-memory space vLLM needs for tensor-parallel workers. |
| PYTORCH_ALLOC_CONF=expandable_segments:True | Reduces fragmentation risk under long-context prefill and warm KV cache. |
| TimeoutStartSec=300 | Cold starts are long enough that a default 90 second timeout would kill the service during warmup. |
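With auto tool choice and the gemma4 parsers enabled, clients talk to the service in plain OpenAI chat-completions shape. The sketch below builds such a request payload; the tool definition (set_vent_band) is a hypothetical placeholder, not a real MCP tool schema, and only the model alias, temperature, and output cap come from this page.

```python
# Sketch of the OpenAI-compatible chat request an OpenClaw-style client
# would send to the private /v1/chat/completions endpoint. The tool schema
# below is a hypothetical placeholder, not the production MCP tool set.

payload = {
    "model": "gemma4-26b",   # served-model alias from this page
    "temperature": 0.3,      # planner setting from config/ai.yaml
    "max_tokens": 4096,      # planner output cap
    "messages": [
        {"role": "system", "content": "You are the greenhouse planner."},
        {"role": "user", "content": "Review the current vent band."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "set_vent_band",  # hypothetical MCP-validated tool
            "parameters": {
                "type": "object",
                "properties": {
                    "low_pct": {"type": "number"},
                    "high_pct": {"type": "number"},
                },
            },
        },
    }],
    "tool_choice": "auto",   # pairs with --enable-auto-tool-choice
}
print(payload["model"], payload["tool_choice"])  # gemma4-26b auto
```

Because --tool-call-parser gemma4 runs server-side, the client never sees the model's native tool markup; tool calls come back in the standard tool_calls response field.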
Model File
| Detail | Current value |
|---|---|
| Host path | Private active-model volume |
| Container path | Local AWQ model mount |
| On-disk size | 17 GB |
| Shards | 4 safetensors shards |
| Modified | 2026-04-04 |
| Architecture | Gemma4ForConditionalGeneration |
| Model type | gemma4 |
| Weight dtype | bfloat16 for non-quantized layers |
| Quantization | AWQ, 4-bit, group size 32, symmetric, MSE observer |
| Quantized target | Linear layers |
| Full-precision exclusions | MoE expert projections and router projections stay unquantized |
| Tokenizer / template | Standard tokenizer files plus chat_template.jinja |
The model directory includes the safetensors shards, model.safetensors.index.json, config.json, generation_config.json, recipe.yaml, chat_template.jinja, tokenizer files, multimodal processor metadata, and the upstream model card.
The live model endpoint verifies the active route:
{
"id": "gemma4-26b",
"object": "model",
"owned_by": "vllm",
"root": "local AWQ model mount",
"max_model_len": 131072
}
Endpoint Surface
vLLM binds to a private LAN endpoint. The internal port choice is kept stable for Ollama-era local clients, but public launch traffic never reaches it.
| Path | Use |
|---|---|
| GET /v1/models | Model listing; returns gemma4-26b. |
| POST /v1/chat/completions | Primary chat endpoint for OpenClaw, including tool calls and reasoning fields. |
| POST /v1/completions | Legacy completions endpoint. |
| GET /health | Liveness probe. |
| GET /metrics | Prometheus metrics for vLLM and scheduler behavior. |
One migration note matters operationally: Ollama-era clients that probe GET /api/tags will receive a 404 from this service. Those local clients should probe /v1/models instead.
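The migration logic is simple enough to express as a pure function over the status codes a client observes, which keeps it testable without touching the network. The function name is illustrative.

```python
# Sketch of the Ollama-to-vLLM migration check: given the HTTP status each
# probe path returned, decide which model-listing path a local client should
# use. The function name is illustrative.

def pick_models_path(status_by_path: dict[str, int]) -> str:
    if status_by_path.get("/api/tags") == 200:
        return "/api/tags"   # legacy Ollama-style server still answering
    if status_by_path.get("/v1/models") == 200:
        return "/v1/models"  # OpenAI-compatible vLLM server
    raise RuntimeError("no known model-listing endpoint responded")

# Against this vLLM service, /api/tags 404s and /v1/models answers:
print(pick_models_path({"/api/tags": 404, "/v1/models": 200}))  # /v1/models
```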
Boot and Capacity Observations
The running service was inspected live on 2026-05-07.
| Observation | Current value |
|---|---|
| vLLM build | v0.18.2rc1.dev73+gdb7a17ecc, V1 engine |
| Quantization detected | compressed-tensors AWQ pack-quantized |
| Worker world size | 2 tensor-parallel workers |
| Cold init | About 108 seconds before serving |
| CUDA graph capture sizes | [1,2,4,8,16,24,32] |
| Per-worker KV cache | About 4.42 GiB available |
| Full-context concurrency | About 4.15x at the configured 131K context, per vLLM logs |
| Steady batch-1 throughput | Roughly 90 to 100 tokens per second observed |
| Prefix cache hit rate | Climbs into roughly 90 to 96 percent under repeated planner prompts |
| CPU thread warning | vLLM reduces Torch CPU threads from 28 to 1; set OMP_NUM_THREADS if CPU-side work becomes a bottleneck |
This is enough for routine greenhouse planning because public traffic does not create model calls, and planner triggers are sparse compared with page views. The main launch capacity risk is the public proof layer, not Cortex.
Throughput caveat: batch-1 token speed is not the same thing as guaranteed end-to-end planning latency. A full plan includes context assembly, long-context prefill, tool calls, validation, and Slack/archive writes; required full plans currently use the cloud peer until the local full-plan context is trimmed and baked.
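The observed figures above are internally consistent, which is a useful sanity check when re-verifying the service. The sketch below back-computes the implied fp8 KV footprint per token per tensor-parallel worker from the reported cache size, context length, and concurrency; it is arithmetic over this page's observed values, not a statement about Gemma's actual per-layer KV layout.

```python
# Back-of-envelope check of the "about 4.15x" full-context concurrency
# figure: what fp8 KV footprint per token per worker does it imply?
# All three inputs are the observed values from the table above.

kv_cache_gib = 4.42       # per-worker KV cache available
context_tokens = 131_072  # --max-model-len
concurrency = 4.15        # vLLM's reported full-context concurrency

bytes_per_token = kv_cache_gib * 2**30 / (context_tokens * concurrency)
print(round(bytes_per_token))  # just under 9 KB of KV per token per worker
```

If a driver or vLLM upgrade shifts the reported concurrency, recomputing this number quickly shows whether the KV dtype or cache allocation actually changed.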
Cortex API Gateway
Cortex also has a FastAPI gateway conceptually above vLLM: a unified POST /v1/cortex/infer endpoint with profile-based routing across local backends, cloud fallback, job queueing, and budget checks. That gateway talks to the local model through the same private route.
The gateway role is:
| Gateway surface | Role |
|---|---|
| /v1/cortex/health | Service health. |
| /v1/cortex/profiles | Available inference profiles. |
| /v1/cortex/costs | Budget and cost view. |
| /v1/cortex/gpu | GPU state and residency. |
| /v1/cortex/queue | Async queue status. |
| /v1/cortex/ready | Readiness gate. |
| /v1/cortex/infer | Main profile-based inference endpoint. |
| /v1/cortex/jobs/{id} | Async job result lookup. |
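Conceptually, the gateway's routing decision combines three inputs: the requested profile, local backend health, and remaining cloud budget. The sketch below shows that shape only; the profile names, backend labels, and budget semantics are illustrative assumptions, since the real gateway schema is not public and its deployment is still listed as unverified below.

```python
# Conceptual sketch of profile-based routing with budget checks and cloud
# fallback. Profile names, backend labels, and the budget model are all
# illustrative assumptions; the real gateway schema is not public.

def route_infer(profile: str, local_healthy: bool, cloud_budget_left: float) -> str:
    local_profiles = {"planner-routine", "embeddings"}  # hypothetical set
    if profile in local_profiles and local_healthy:
        return "local-vllm"
    if cloud_budget_left > 0:
        return "cloud-peer"
    return "queue"  # park the job until a backend or budget frees up

print(route_infer("planner-routine", local_healthy=True, cloud_budget_left=5.0))  # local-vllm
print(route_infer("full-plan", local_healthy=True, cloud_budget_left=5.0))        # cloud-peer
print(route_infer("full-plan", local_healthy=False, cloud_budget_left=0.0))       # queue
```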
The current public documentation treats the vLLM service as verified and the Cortex gateway deployment as a thing to re-check before making stronger claims. If the gateway moves hosts or changes process manager, this page should be refreshed.
Adjacent Services
The inference box also runs related observability and local-AI services:
| Service | Role |
|---|---|
| tei-embeddings | CPU text-embedding inference on a local Qwen embedding model. |
| open-webui | Local UI for OpenAI-compatible access to the model. |
| dcgm-exporter | GPU metrics. |
| node-exporter | Host metrics. |
| alloy and promtail | Metrics and log shipping. |
| portainer_agent | Remote management agent. |
LMCache configuration exists on disk, but the running vLLM command line does not currently wire LMCache into the service. Prefix caching is live; LMCache should be treated as dormant until the vLLM launch arguments prove otherwise.
Planning-Loop Boundary
The local inference route is not the safety mechanism. If the locally served gemma4-26b planner is slow, unavailable, or wrong, the planner SLA opens an alert rather than silently pretending the cycle worked. The ESP32 still owns real-time control and enforces the last valid bounded setpoints plus hard safety rails.
The end-to-end loop is:
1. Solar milestone, forecast delta, transition, deviation, or manual operator request.
2. OpenClaw chooses the local gemma4-26b route for routine checks or a cloud peer for heavier reviews.
3. Cortex serves gemma4-26b through vLLM's OpenAI-compatible API.
4. Iris writes bounded tactical intent through MCP tools and audited plan records.
5. The dispatcher and ESP32 validate, clamp, and enforce deterministic state-machine behavior.
6. Slack gets the human-readable plan summary, watch items, and operator tasks.
7. Telemetry, cost, stress, compliance, and plan outcomes judge what actually happened.
For launch, required_full_plan_instance is set to opus in config/ai.yaml. Required SUNRISE/SUNSET/MIDNIGHT full-plan events use the cloud planner path until the local full-plan context is trimmed and a clean local bake passes. Routine FORECAST and TRANSITION work stays local.
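The launch-time route split is small enough to sketch directly. The event names and the opus setting come from this page's description of config/ai.yaml; the function name and string return values are illustrative.

```python
# Sketch of the launch-time planner route split: required full-plan events
# go to the cloud instance, routine events stay local. Event names and the
# "opus" value reflect config/ai.yaml as described on this page; the
# function itself is illustrative.

REQUIRED_FULL_PLAN_EVENTS = {"SUNRISE", "SUNSET", "MIDNIGHT"}
REQUIRED_FULL_PLAN_INSTANCE = "opus"  # current config/ai.yaml setting

def planner_route(event: str) -> str:
    if event in REQUIRED_FULL_PLAN_EVENTS:
        return REQUIRED_FULL_PLAN_INSTANCE  # cloud peer until local bake passes
    return "gemma4-26b"                     # local vLLM route

print(planner_route("SUNRISE"))   # opus
print(planner_route("FORECAST"))  # gemma4-26b
```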
That is why the public archive can show both local-first architecture and a cloud-written catch-up full plan without contradiction: the route is explicit, stamped, and audited.
Read-Only Verification
These are the private-operator command categories that refresh the operational snapshot. Exact host paths and private endpoint details stay in the operator runbook.
# Service state
systemctl status cortex-serve.service
docker ps --filter name=vllm-serve --format '{{.Status}}'
docker logs vllm-serve --tail 100
# Endpoint
curl -s http://127.0.0.1:<private-port>/v1/models | jq
curl -s http://127.0.0.1:<private-port>/health
# Hardware
nvidia-smi
nvidia-smi topo -m
free -h
lscpu | head -20
df -hT
# Model file
ls -la <active-model-volume>/gemma4-26b-awq/
du -sh <active-model-volume>/gemma4-26b-awq/
# Effective vLLM args
docker inspect vllm-serve --format '{{json .Args}}' | jq
docker inspect vllm-serve --format '{{range .Config.Env}}{{println .}}{{end}}'
Open Questions
- Confirm where the Cortex API gateway process actually runs in production before documenting it as a verified systemd-managed service.
- Decide whether LMCache is intentionally dormant or should be wired into vLLM for cross-request KV reuse beyond prefix caching.
- Find and update any Ollama-era client still probing /api/tags; vLLM serves /v1/models.
Related pages: