Local Inference Setup
Verdify’s public traffic does not call the local model. Launch readers hit Cloudflare, the static Quartz site, public read-only API endpoints, Grafana, and warmed static dashboard renders. Cortex is for greenhouse planning events: sunrise, sunset, forecast deltas, band transitions, deviations, daily reviews, and manual operator runs.
This page is an operational snapshot of the local inference stack as verified on 2026-05-07. It uses public-safe service labels; private DNS names, exact kernel build strings, and internal storage mounts are intentionally omitted where they do not help readers understand the design.
Iris (our OpenClaw AI agent) normally has two planner routes: the local Gemma 4 26B A4B (MoE), served under the gemma4-26b alias, for routine work, and a cloud peer for heavier reviews. This is local-first, not local-only. OpenClaw chooses the agent instance, MCP validates writes, Slack Operations explains the result to humans, and the ESP32 keeps the safety boundary by owning relay decisions every 5 seconds.
Model Inventory
The planner-facing source of truth is config/ai.yaml.
| Purpose | Provider | Model label | Temperature | Output cap | Notes |
|---|---|---|---|---|---|
| Planner | vLLM | gemma4-26b | 0.3 | 4096 tokens | Gemma 4 26B A4B (MoE), served locally under the gemma4-26b alias, for tactical planning through OpenClaw. |
| Vision | Gemini | gemini-3.1-pro-preview | 0.2 | 4096 tokens | Greenhouse camera snapshot analysis. |
| Embeddings | Gemini | gemini-embedding-2-preview | n/a | 3072 dimensions | Observation similarity search. |
The local model server is Cortex, a private-LAN vLLM host. OpenClaw reaches it through a private OpenAI-compatible /v1 endpoint; that endpoint currently advertises gemma4-26b as a vLLM-owned model with max_model_len=131072.
The OpenClaw fleet label for the local planner is gemma4-26b @ 131k ctx. The planner contract does not try to spend that entire window. make planner-dry enforces a local prompt budget of roughly 60k Gemma tokens for the stable preamble, using a 208k character cap as the offline guard.
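The offline guard can be sketched as a plain character check, since the point of make planner-dry is to catch an oversized preamble without loading the Gemma tokenizer. This is a minimal sketch: the 208k character cap and ~60k token budget come from this page, while the function name and the linear chars-to-tokens estimate are illustrative assumptions, not the real target's implementation.

```python
# Sketch of the offline prompt-budget guard. The 208_000 character cap and
# the ~60k Gemma-token budget are the figures from this page; the function
# name and the linear token estimate are illustrative assumptions.

CHAR_CAP = 208_000      # offline guard used by make planner-dry
TOKEN_BUDGET = 60_000   # approximate Gemma-token budget for the stable preamble

def check_preamble_budget(preamble: str) -> dict:
    """Return a pass/fail verdict without needing the Gemma tokenizer."""
    chars = len(preamble)
    # Linear estimate implied by pairing the two caps (~3.5 chars per token).
    est_tokens = chars * TOKEN_BUDGET // CHAR_CAP
    return {"chars": chars, "est_tokens": est_tokens, "ok": chars <= CHAR_CAP}

print(check_preamble_budget("x" * 100_000)["ok"])  # True: under the cap
print(check_preamble_budget("x" * 250_000)["ok"])  # False: would trip the guard
```

The guard deliberately overshoots in neither direction: a preamble that passes the character cap may still be trimmed later, but one that fails it never reaches the tokenizer at all.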
MoE means mixture-of-experts: the checkpoint has expert sub-networks and a router that activates a subset for a given token. That helps capacity fit the planning task, but it does not make throughput unlimited. Long-context prefill, tool-call parsing, PCIe tensor-parallel traffic, and KV-cache pressure still determine whether a required full-plan cycle can finish cleanly.
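The routing idea reduces to a small amount of code. The sketch below shows generic top-k expert selection for one token; the expert count and k value are illustrative, not Gemma 4's actual configuration.

```python
# Minimal sketch of mixture-of-experts routing: a router scores every expert
# for each token and only the top-k experts actually run. Eight experts and
# k=2 are illustrative numbers, not Gemma 4's real configuration.

def route_token(router_scores: list[float], k: int = 2) -> list[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical router output for one token: experts 1 and 3 win.
scores = [0.1, 0.9, 0.05, 0.7, 0.2, 0.3, 0.0, 0.4]
print(route_token(scores))  # [1, 3]
```

Only the selected experts' weights participate in the matmuls for that token, which is why active compute per token is much smaller than total parameter count, while prefill and KV-cache costs are unaffected by the expert split.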
Cortex Host
| Resource | Current value |
|---|---|
| Host identity | Private Cortex VM |
| Public exposure | Not public; reached from OpenClaw and local services inside the home network |
| OS / kernel | Debian 13 trixie; exact kernel build omitted from public docs |
| Virtualization | KVM full virtualization on an AMD Ryzen Threadripper 2950X host |
| CPU | 28 vCPUs, single socket, one thread per core, one NUMA node |
| RAM | 110 GiB total |
| Swap | Effectively none |
| Active model storage | Local model volume mounted into the vLLM container |
| Model cache shelf | Local model-cache volume for non-active checkpoints |
| GPUs | 2 x GeForce RTX 4070 Ada, 16 GiB VRAM each |
| GPU topology | PHB over PCIe, no NVLink |
| NVIDIA stack | Driver 575.51.02, CUDA 12.9 |
The two 4070s are the practical ceiling for this box. Tensor-parallel all-reduces cross PCIe through the host bridge, so the service is tuned for predictable routine planning rather than unconstrained interactive throughput. Published throughput numbers are launch observations, not a capacity guarantee for arbitrary prompts.
vLLM Service
cortex-serve.service owns the vLLM process. It is a systemd service that stops any prior vllm-serve container, removes it, and starts a fresh vLLM OpenAI-compatible server with host networking.
| Runtime detail | Current value |
|---|---|
| systemd unit | cortex-serve.service |
| Container name | vllm-serve |
| Image | vllm/vllm-openai:gemma4 |
| API shape | OpenAI-compatible completions API |
| Endpoint | Private OpenAI-compatible /v1 endpoint on the Cortex host |
| Served model | gemma4-26b |
| Model root in container | Local AWQ Gemma model volume |
| Tensor parallelism | --tensor-parallel-size 2 |
| Context window | --max-model-len 131072 |
| GPU memory policy | --gpu-memory-utilization 0.92 |
| KV cache | --kv-cache-dtype fp8 |
| Compute dtype | --dtype bfloat16 |
| Batching | --max-num-seqs 16, --max-num-batched-tokens 8192 |
| Prefix handling | --enable-prefix-caching, --enable-chunked-prefill |
| Tool parsing | --enable-auto-tool-choice, --tool-call-parser gemma4 |
| Reasoning parsing | --reasoning-parser gemma4 |
| Multimodal policy | --limit-mm-per-prompt {"image":0,"audio":0} |
| IPC / shared memory | --ipc=host, --shm-size=16g |
| Restart policy | systemd Restart=always, RestartSec=10, TimeoutStartSec=300 |
The launch-safe command shape is:
docker run --rm --name vllm-serve \
--gpus all --ipc=host --network host --shm-size=16g \
-v <model-cache-volume>:/models -v <active-model-volume>:/srv-models \
-e PYTORCH_ALLOC_CONF=expandable_segments:True \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm/vllm-openai:gemma4 \
--model /srv-models/<gemma4-awq-checkpoint> \
--served-model-name gemma4-26b \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--gpu-memory-utilization 0.92 \
--kv-cache-dtype fp8 \
--enable-prefix-caching --enable-chunked-prefill \
--enable-auto-tool-choice --tool-call-parser gemma4 --reasoning-parser gemma4 \
--trust-remote-code \
--max-num-seqs 16 --max-num-batched-tokens 8192 \
--limit-mm-per-prompt '{"image":0,"audio":0}' \
--dtype bfloat16 \
  --host 0.0.0.0 --port <private-port>
Why These Flags Matter
| Setting | Why it is there |
|---|---|
| --tensor-parallel-size 2 | One rank per 4070; the active checkpoint does not fit comfortably on one 16 GiB card. |
| --max-model-len 131072 | Keeps the service aligned with the 128K context capability advertised by the locally served gemma4-26b route. |
| --kv-cache-dtype fp8 | Reduces KV cache footprint enough to make long-context planning practical on this hardware. |
| --enable-prefix-caching | Critical for repeated Cortex planner prompts; the system/template prefix is reused heavily. |
| --enable-chunked-prefill | Lets large prefills share scheduler time with generation. |
| --tool-call-parser gemma4 | Converts the model’s native tool-call markup into OpenAI-compatible tool calls for OpenClaw clients. |
| --reasoning-parser gemma4 | Surfaces reasoning blocks through the OpenAI-compatible response fields instead of leaking parser syntax into content. |
| --limit-mm-per-prompt {"image":0,"audio":0} | Disables vision/audio towers so VRAM goes to weights and KV cache. |
| --ipc=host and --shm-size=16g | Provides the shared-memory space vLLM needs for tensor-parallel workers. |
| PYTORCH_ALLOC_CONF=expandable_segments:True | Reduces fragmentation risk under long-context prefill and warm KV cache. |
| TimeoutStartSec=300 | Cold starts are long enough that a default 90 second timeout would kill the service during warmup. |
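With auto tool choice and the gemma4 parsers enabled, clients talk to the service in plain OpenAI chat-completions shape. The sketch below builds such a request payload; the tool definition (set_vent_band) is a hypothetical placeholder, not a real MCP tool schema, and only the model alias, temperature, and output cap come from this page.

```python
# Sketch of the OpenAI-compatible chat request an OpenClaw-style client
# would send to the private /v1/chat/completions endpoint. The tool schema
# below is a hypothetical placeholder, not the production MCP tool set.

payload = {
    "model": "gemma4-26b",   # served-model alias from this page
    "temperature": 0.3,      # planner setting from config/ai.yaml
    "max_tokens": 4096,      # planner output cap
    "messages": [
        {"role": "system", "content": "You are the greenhouse planner."},
        {"role": "user", "content": "Review the current vent band."},
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "set_vent_band",  # hypothetical MCP-validated tool
            "parameters": {
                "type": "object",
                "properties": {
                    "low_pct": {"type": "number"},
                    "high_pct": {"type": "number"},
                },
            },
        },
    }],
    "tool_choice": "auto",   # pairs with --enable-auto-tool-choice
}
print(payload["model"], payload["tool_choice"])  # gemma4-26b auto
```

Because --tool-call-parser gemma4 runs server-side, the client never sees the model's native tool markup; tool calls come back in the standard tool_calls response field.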
Model File
| Detail | Current value |
|---|---|
| Host path | Private active-model volume |
| Container path | Local AWQ model mount |
| On-disk size | 17 GB |
| Shards | 4 safetensors shards |
| Modified | 2026-04-04 |
| Architecture | Gemma4ForConditionalGeneration |
| Model type | gemma4 |
| Weight dtype | bfloat16 for non-quantized layers |
| Quantization | AWQ, 4-bit, group size 32, symmetric, MSE observer |
| Quantized target | Linear layers |
| Full-precision exclusions | MoE expert projections and router projections stay unquantized |
| Tokenizer / template | Standard tokenizer files plus chat_template.jinja |
The model directory includes the safetensors shards, model.safetensors.index.json, config.json, generation_config.json, recipe.yaml, chat_template.jinja, tokenizer files, multimodal processor metadata, and the upstream model card.
The live model endpoint verifies the active route:
{
"id": "gemma4-26b",
"object": "model",
"owned_by": "vllm",
"root": "local AWQ model mount",
"max_model_len": 131072
}
Endpoint Surface
vLLM binds to a private LAN endpoint. The internal port choice is kept stable for Ollama-era local clients, but public launch traffic never reaches it.
| Path | Use |
|---|---|
| GET /v1/models | Model listing; returns gemma4-26b. |
| POST /v1/chat/completions | Primary chat endpoint for OpenClaw, including tool calls and reasoning fields. |
| POST /v1/completions | Legacy completions endpoint. |
| GET /health | Liveness probe. |
| GET /metrics | Prometheus metrics for vLLM and scheduler behavior. |
One migration note matters operationally: Ollama-era clients that probe GET /api/tags will receive a 404 from this service. Those local clients should probe /v1/models instead.
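The migration logic is simple enough to express as a pure function over the status codes a client observes, which keeps it testable without touching the network. The function name is illustrative.

```python
# Sketch of the Ollama-to-vLLM migration check: given the HTTP status each
# probe path returned, decide which model-listing path a local client should
# use. The function name is illustrative.

def pick_models_path(status_by_path: dict[str, int]) -> str:
    if status_by_path.get("/api/tags") == 200:
        return "/api/tags"   # legacy Ollama-style server still answering
    if status_by_path.get("/v1/models") == 200:
        return "/v1/models"  # OpenAI-compatible vLLM server
    raise RuntimeError("no known model-listing endpoint responded")

# Against this vLLM service, /api/tags 404s and /v1/models answers:
print(pick_models_path({"/api/tags": 404, "/v1/models": 200}))  # /v1/models
```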
Boot and Capacity Observations
The running service was inspected live on 2026-05-07.
| Observation | Current value |
|---|---|
| vLLM build | v0.18.2rc1.dev73+gdb7a17ecc, V1 engine |
| Quantization detected | compressed-tensors AWQ pack-quantized |
| Worker world size | 2 tensor-parallel workers |
| Cold init | About 108 seconds before serving |
| CUDA graph capture sizes | [1,2,4,8,16,24,32] |
| Per-worker KV cache | About 4.42 GiB available |
| Full-context concurrency | About 4.15x at the configured 131K context, per vLLM logs |
| Steady batch-1 throughput | Roughly 90 to 100 tokens per second observed |
| Prefix cache hit rate | Climbs into roughly 90 to 96 percent under repeated planner prompts |
| CPU thread warning | vLLM reduces Torch CPU threads from 28 to 1; set OMP_NUM_THREADS if CPU-side work becomes a bottleneck |
This is enough for routine greenhouse planning because public traffic does not create model calls, and planner triggers are sparse compared with page views. The main launch capacity risk is the public proof layer, not Cortex.
Throughput caveat: batch-1 token speed is not the same thing as guaranteed end-to-end planning latency. A full plan includes context assembly, long-context prefill, tool calls, validation, and Slack/archive writes; required full plans currently use the cloud peer until the local full-plan context is trimmed and baked.
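The observed figures above are internally consistent, which is a useful sanity check when re-verifying the service. The sketch below back-computes the implied fp8 KV footprint per token per tensor-parallel worker from the reported cache size, context length, and concurrency; it is arithmetic over this page's observed values, not a statement about Gemma's actual per-layer KV layout.

```python
# Back-of-envelope check of the "about 4.15x" full-context concurrency
# figure: what fp8 KV footprint per token per worker does it imply?
# All three inputs are the observed values from the table above.

kv_cache_gib = 4.42       # per-worker KV cache available
context_tokens = 131_072  # --max-model-len
concurrency = 4.15        # vLLM's reported full-context concurrency

bytes_per_token = kv_cache_gib * 2**30 / (context_tokens * concurrency)
print(round(bytes_per_token))  # just under 9 KB of KV per token per worker
```

If a driver or vLLM upgrade shifts the reported concurrency, recomputing this number quickly shows whether the KV dtype or cache allocation actually changed.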
Cortex API Gateway
Cortex also has a FastAPI gateway conceptually above vLLM: a unified POST /v1/cortex/infer endpoint with profile-based routing across local backends, cloud fallback, job queueing, and budget checks. That gateway talks to the local model through the same private route.
The gateway role is:
| Gateway surface | Role |
|---|---|
| /v1/cortex/health | Service health. |
| /v1/cortex/profiles | Available inference profiles. |
| /v1/cortex/costs | Budget and cost view. |
| /v1/cortex/gpu | GPU state and residency. |
| /v1/cortex/queue | Async queue status. |
| /v1/cortex/ready | Readiness gate. |
| /v1/cortex/infer | Main profile-based inference endpoint. |
| /v1/cortex/jobs/{id} | Async job result lookup. |
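Conceptually, the gateway's routing decision combines three inputs: the requested profile, local backend health, and remaining cloud budget. The sketch below shows that shape only; the profile names, backend labels, and budget semantics are illustrative assumptions, since the real gateway schema is not public and its deployment is still listed as unverified below.

```python
# Conceptual sketch of profile-based routing with budget checks and cloud
# fallback. Profile names, backend labels, and the budget model are all
# illustrative assumptions; the real gateway schema is not public.

def route_infer(profile: str, local_healthy: bool, cloud_budget_left: float) -> str:
    local_profiles = {"planner-routine", "embeddings"}  # hypothetical set
    if profile in local_profiles and local_healthy:
        return "local-vllm"
    if cloud_budget_left > 0:
        return "cloud-peer"
    return "queue"  # park the job until a backend or budget frees up

print(route_infer("planner-routine", local_healthy=True, cloud_budget_left=5.0))  # local-vllm
print(route_infer("full-plan", local_healthy=True, cloud_budget_left=5.0))        # cloud-peer
print(route_infer("full-plan", local_healthy=False, cloud_budget_left=0.0))       # queue
```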
The current public documentation treats the vLLM service as verified and the Cortex gateway deployment as a thing to re-check before making stronger claims. If the gateway moves hosts or changes process manager, this page should be refreshed.
Adjacent Services
The inference box also runs related observability and local-AI services:
| Service | Role |
|---|---|
| tei-embeddings | CPU text-embedding inference on a local Qwen embedding model. |
| open-webui | Local UI for OpenAI-compatible access to the model. |
| dcgm-exporter | GPU metrics. |
| node-exporter | Host metrics. |
| alloy and promtail | Metrics and log shipping. |
| portainer_agent | Remote management agent. |
LMCache configuration exists on disk, but the running vLLM command line does not currently wire LMCache into the service. Prefix caching is live; LMCache should be treated as dormant until the vLLM launch arguments prove otherwise.
Planning-Loop Boundary
The local inference route is not the safety mechanism. If the locally served gemma4-26b planner is slow, unavailable, or wrong, the planner SLA opens an alert rather than silently pretending the cycle worked. The ESP32 still owns real-time control and enforces the last valid bounded setpoints plus hard safety rails.
The end-to-end loop is:
1. Solar milestone, forecast delta, transition, deviation, or manual operator request.
2. OpenClaw chooses the local gemma4-26b route for routine checks or a cloud peer for heavier reviews.
3. Cortex serves gemma4-26b through vLLM's OpenAI-compatible API.
4. Iris writes bounded tactical intent through MCP tools and audited plan records.
5. The dispatcher and ESP32 validate, clamp, and enforce deterministic state-machine behavior.
6. Slack gets the human-readable plan summary, watch items, and operator tasks.
7. Telemetry, cost, stress, compliance, and plan outcomes judge what actually happened.
For launch, required_full_plan_instance is set to opus in config/ai.yaml. Required SUNRISE/SUNSET/MIDNIGHT full-plan events use the cloud planner path until the local full-plan context is trimmed and a clean local bake passes. Routine FORECAST and TRANSITION work stays local.
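The launch-time route split is small enough to sketch directly. The event names and the opus setting come from this page's description of config/ai.yaml; the function name and string return values are illustrative.

```python
# Sketch of the launch-time planner route split: required full-plan events
# go to the cloud instance, routine events stay local. Event names and the
# "opus" value reflect config/ai.yaml as described on this page; the
# function itself is illustrative.

REQUIRED_FULL_PLAN_EVENTS = {"SUNRISE", "SUNSET", "MIDNIGHT"}
REQUIRED_FULL_PLAN_INSTANCE = "opus"  # current config/ai.yaml setting

def planner_route(event: str) -> str:
    if event in REQUIRED_FULL_PLAN_EVENTS:
        return REQUIRED_FULL_PLAN_INSTANCE  # cloud peer until local bake passes
    return "gemma4-26b"                     # local vLLM route

print(planner_route("SUNRISE"))   # opus
print(planner_route("FORECAST"))  # gemma4-26b
```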
That is why the public archive can show both local-first architecture and a cloud-written catch-up full plan without contradiction: the route is explicit, stamped, and audited.
Read-Only Verification
These are the private-operator command categories that refresh the operational snapshot. Exact host paths and private endpoint details stay in the operator runbook.
# Service state
systemctl status cortex-serve.service
docker ps --filter name=vllm-serve --format '{{.Status}}'
docker logs vllm-serve --tail 100
# Endpoint
curl -s http://127.0.0.1:<private-port>/v1/models | jq
curl -s http://127.0.0.1:<private-port>/health
# Hardware
nvidia-smi
nvidia-smi topo -m
free -h
lscpu | head -20
df -hT
# Model file
ls -la <active-model-volume>/gemma4-26b-awq/
du -sh <active-model-volume>/gemma4-26b-awq/
# Effective vLLM args
docker inspect vllm-serve --format '{{json .Args}}' | jq
docker inspect vllm-serve --format '{{range .Config.Env}}{{println .}}{{end}}'
Open Questions
- Confirm where the Cortex API gateway process actually runs in production before documenting it as a verified systemd-managed service.
- Decide whether LMCache is intentionally dormant or should be wired into vLLM for cross-request KV reuse beyond prefix caching.
- Find and update any Ollama-era client still probing /api/tags; vLLM serves /v1/models.
Related pages: