Skip to content

LLM endpoint setup

Status: Operator runbook. Applies to MVP1 onward. Audience: Operators configuring RelyLoop's LLM provider — local, remote, cloud, or any mix.

RelyLoop talks to any OpenAI-compatible endpoint via a single environment variable. That one knob is the entire LLM integration surface — no per-provider SDK, no per-provider auth flow, no code change. Set OPENAI_BASE_URL to wherever your model is served and OPENAI_MODEL to whichever model name that endpoint serves, and RelyLoop talks to it.

This guide walks through the most common configurations side by side.

Not the only judgment source. This guide configures the LLM endpoint that powers LLM-as-judge generation, the digest narrative, and the chat agent. If your cluster has captured click/dwell traffic, User Behavior Insights (UBI) generates judgments from real user behavior with no external LLM — see the UBI judgment-generation runbook.

The pattern in one paragraph

RelyLoop uses the openai Python SDK pointed at whatever URL you set in OPENAI_BASE_URL. The SDK speaks the OpenAI Chat Completions wire protocol (POST /v1/chat/completions). Anything that serves that protocol — OpenAI itself, Azure OpenAI's OpenAI-compatible mode, Ollama, LM Studio, vLLM, HuggingFace TGI, OpenRouter, LiteLLM proxy, or any other compatible server — works without changing a line of RelyLoop code. The API key is mounted at ./secrets/openai_key (or whatever path you point OPENAI_API_KEY_FILE at) per CLAUDE.md Absolute Rule #2 on secrets handling.

The operator's three env vars, all in .env:

OPENAI_BASE_URL=https://api.openai.com/v1    # or whichever endpoint
OPENAI_MODEL=gpt-4o-2024-08-06               # or whichever model name
OPENAI_MODEL_CHAT=gpt-4o-mini-2024-07-18     # optional override for chat orchestrator

Plus the mounted secret file:

./secrets/openai_key    # contents: your-api-key-or-any-placeholder-for-local-endpoints

That's the whole integration surface.

Configurations side by side

OpenAI cloud (default)

# .env
OPENAI_BASE_URL=https://api.openai.com/v1
OPENAI_MODEL=gpt-4o-2024-08-06
# ./secrets/openai_key
sk-proj-...your-real-key...

This is the default. Newest models, fastest, paid per-token.

Ollama (local, air-gapped)

Ollama runs models locally on your laptop or server. Install Ollama, pull a model, and point RelyLoop at it.

# Install + pull a model (one-time)
brew install ollama        # or download the installer
ollama pull llama3.1:70b   # or qwen2.5:72b, mistral-large, etc.
ollama serve               # starts the daemon on localhost:11434
# .env
OPENAI_BASE_URL=http://host.docker.internal:11434/v1   # from inside Docker
OPENAI_MODEL=llama3.1:70b
# ./secrets/openai_key
ollama   # placeholder — Ollama ignores the key

Zero data leaves your machine. Best fit for compliance-bound shops and offline evaluation.

Note: host.docker.internal resolves to the host from inside Docker on macOS and Windows. On Linux, use --add-host=host.docker.internal:host-gateway in your Docker command or set OPENAI_BASE_URL=http://172.17.0.1:11434/v1 (the default docker0 bridge gateway).

LM Studio (local, GUI)

LM Studio is a desktop app for running local LLMs with a GUI for model selection. Start LM Studio, load a model in the Local Server tab, and point RelyLoop at the URL it shows.

# .env
OPENAI_BASE_URL=http://host.docker.internal:1234/v1
OPENAI_MODEL=lmstudio-community/Meta-Llama-3.1-70B-Instruct-GGUF
# ./secrets/openai_key
lm-studio   # placeholder — LM Studio ignores the key

Same air-gapped properties as Ollama, but with a GUI for model browsing. Good for experimentation across multiple models without command-line work.

vLLM (local or remote, production-grade)

vLLM is the production-grade inference server for self-hosting open-weight models with high throughput. Best fit when you want your own LLM serving infrastructure on a GPU server.

# Start vLLM on a GPU box (one-time)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct --port 8000
# .env
OPENAI_BASE_URL=http://your-gpu-server.internal:8000/v1
OPENAI_MODEL=meta-llama/Meta-Llama-3.1-70B-Instruct
# ./secrets/openai_key
vllm   # placeholder; vLLM can be configured with a real key if needed

HuggingFace TGI (local or remote)

HuggingFace Text Generation Inference — similar role to vLLM, popular in HF-centric stacks.

# Start TGI on a GPU box (one-time)
docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference \
  --model-id meta-llama/Meta-Llama-3.1-70B-Instruct
# .env
OPENAI_BASE_URL=http://your-gpu-server.internal:8080/v1
OPENAI_MODEL=meta-llama/Meta-Llama-3.1-70B-Instruct

Azure OpenAI (cloud, OpenAI-compatible mode)

Azure OpenAI exposes an OpenAI-compatible endpoint mode that works without the Azure-specific SDK.

# .env
OPENAI_BASE_URL=https://your-resource.openai.azure.com/openai/v1
OPENAI_MODEL=gpt-4o-deployment-name   # whatever you named the deployment
# ./secrets/openai_key
your-azure-openai-key

OpenRouter (cloud, multi-model routing)

OpenRouter is a single endpoint that routes to many providers (Anthropic Claude, Google Gemini, Mistral, DeepSeek, etc.) using the OpenAI-compatible protocol. Lets you experiment with non-OpenAI models without writing per-provider code.

# .env
OPENAI_BASE_URL=https://openrouter.ai/api/v1
OPENAI_MODEL=anthropic/claude-3.5-sonnet   # or google/gemini-pro-1.5, mistralai/mistral-large, etc.
# ./secrets/openai_key
sk-or-v1-...your-openrouter-key...

LiteLLM proxy (your-own-proxy in front of Bedrock / Vertex / Anthropic native)

If you need to use AWS Bedrock, Google Vertex, or Anthropic's native API (each with its own non-OpenAI wire protocol), run LiteLLM proxy in front of them. LiteLLM exposes an OpenAI-compatible endpoint that translates to each provider's native API.

# Configure LiteLLM with whichever providers you need (see LiteLLM docs)
litellm --config config.yaml --port 4000
# .env
OPENAI_BASE_URL=http://your-litellm-host:4000/v1
OPENAI_MODEL=bedrock-claude-3-5-sonnet   # whichever model alias you defined in LiteLLM
# ./secrets/openai_key
sk-litellm-master-key

This is the unblocking pattern for shops with strict Bedrock-only or Vertex-only policies — no native non-OpenAI provider SDK in RelyLoop required, today.

What changes between providers

The wire protocol is identical, so RelyLoop's code doesn't change. What changes:

Concern Notes
Model name format OpenAI uses gpt-4o-2024-08-06. Ollama uses llama3.1:70b. LM Studio uses HuggingFace-style names. Always read your endpoint's docs for the exact OPENAI_MODEL value.
Latency Local models on consumer GPUs are slower than cloud APIs. Budget overnight-study timing accordingly.
Tool calling / structured output quality OpenAI's GPT-4-class models handle tool calling reliably. Smaller local models (7B–13B) often need careful prompt tuning. RelyLoop's search-space proposal and judgment generation prompts target GPT-4-class capability; if you swap to a smaller local model, watch the eval metrics carefully.
Cost OpenAI cloud bills per token. Local is "free" once you've paid for the hardware. Azure OpenAI is your Azure bill. Per-tenant cost rollups (per Langfuse) ship with MVP3 observability.

What's deliberately NOT supported (yet)

Some providers don't expose an OpenAI-compatible endpoint and have their own wire protocol:

  • Anthropic native API (Anthropic's messages shape, not OpenAI's chat/completions shape) — use Anthropic via OpenRouter, or run LiteLLM in front.
  • AWS Bedrock native API (Bedrock has its own auth via SigV4 + per-model request shapes) — use Bedrock via LiteLLM proxy.
  • Google Vertex AI native API (Vertex's own request shape) — use Vertex via LiteLLM proxy.

Native non-OpenAI provider SDKs are in the backlog — the unblocking pattern (LiteLLM proxy or OpenRouter) covers most adopters today, so this is an ergonomics + cost-attribution upgrade rather than a "can't talk to that LLM" gap.

Capability check at startup

RelyLoop's API runs a one-shot capability probe at startup against whatever endpoint OPENAI_BASE_URL points at, checking that:

  • The endpoint responds at /v1/models
  • The configured OPENAI_MODEL is in the returned model list
  • A trivial POST /v1/chat/completions round-trip succeeds

The result is cached in Redis and reported via /healthz under subsystems.openai. If the probe fails (wrong URL, wrong model name, missing key), /healthz reports subsystems.openai: down and a structured WARN log explains the failure. LLM-dependent features (judgment generation, digest narrative, chat agent) refuse to run until the probe passes; everything else (cluster registration, query-set CRUD, manual study setup) continues to work.

See docs/01_architecture/llm-orchestration.md for the orchestration layer details and docs/03_runbooks/local-dev.md for the broader local-dev runbook.

Cross-references