
Generate judgments via LLM

About this walkthrough

Estimated time: 5 minutes (mostly waiting on the worker)

Tags: judgments, llm, ground-truth

Trigger the LLM-as-judge worker against a query set — every (query, top-K doc) pair is rated 0-3 with a real OpenAI call. The deterministic alternative is the import path (guide 05).
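As a rough illustration of what one generated judgment carries, the sketch below models a single rated (query, doc) pair and enforces the 0-3 scale. The `Judgment` class and its field names are hypothetical and are not the tool's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    """One rated (query, doc) pair. Field names are illustrative assumptions."""

    query: str
    doc_id: str
    rating: int
    reasoning: str = ""

    def __post_init__(self) -> None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(self.rating, bool) or not isinstance(self.rating, int):
            raise TypeError("rating must be an integer")
        # The walkthrough describes a 0-3 relevance scale.
        if not 0 <= self.rating <= 3:
            raise ValueError(f"rating must be between 0 and 3, got {self.rating}")
```

Rejecting out-of-range values at construction time is one way to catch a malformed judge response before it reaches a judgment list.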


Step 1 — Open a query set's detail page

Open a query set's detail page. The 'Associated judgment lists' card on the right shows what's already been generated — or, on a fresh set, an empty state with the 'Generate judgments' button as the next action.

Step 2 — Click 'Generate judgments' to open the dialog

Click 'Generate judgments' to open the dialog. The form has four fields: a name for the judgment list, the target index/collection on the cluster, the query template that produces candidate docs, and the rubric the LLM judge uses to score relevance.

Step 3 — Fill the text fields

Fill in the text fields: enter a unique name for the judgment list and the target index/collection on the cluster (e.g., 'products'), then customize the rubric if needed. The default rubric is a 0-3 relevance scale; tailor it to your domain (for example, 'Rate ecommerce search relevance' vs. 'Rate documentation search relevance').
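The four dialog fields can be pictured as a small record that is checked before submission. This is only a sketch: the class name, the field names, and the rubric text are assumptions made for illustration, and the product's actual default rubric is not reproduced here.

```python
from dataclasses import dataclass

# Placeholder text only; not the product's actual default rubric.
EXAMPLE_RUBRIC = "Rate ecommerce search relevance on a 0-3 scale."


@dataclass
class GenerateJudgmentsForm:
    """Hypothetical stand-in for the four fields of the dialog."""

    name: str       # unique name for the judgment list
    target: str     # index/collection on the cluster, e.g. "products"
    template: str   # query template that produces candidate docs
    rubric: str = EXAMPLE_RUBRIC

    def missing_fields(self) -> list:
        # Report every field that is empty or whitespace-only, in field order.
        return [field for field, value in vars(self).items() if not value.strip()]
```

For example, `GenerateJudgmentsForm("shoes-v1", "products", "").missing_fields()` reports that the template has not been chosen yet.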

Step 4 — Open the template dropdown and pick the query template

Open the template dropdown and pick the query template that produces candidate docs for each query. The worker renders this template per query, runs the search, and rates every returned doc, so the template choice and top-K together determine both the cost of generation and how useful the judgments are downstream.
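Because every returned doc is rated, the number of LLM calls grows as queries times top-K. The sketch below makes that arithmetic explicit; the function names are hypothetical, and the per-call price is an input you supply, not a figure taken from this guide.

```python
def estimate_llm_calls(num_queries: int, top_k: int) -> int:
    """Upper bound on judge calls: one per (query, doc) pair.

    A search may return fewer than top_k docs, so the real count can be lower.
    """
    if num_queries < 0 or top_k < 0:
        raise ValueError("num_queries and top_k must be non-negative")
    return num_queries * top_k


def estimate_cost_usd(num_queries: int, top_k: int, cost_per_call_usd: float) -> float:
    """Upper-bound cost given an assumed average price per judge call."""
    return estimate_llm_calls(num_queries, top_k) * cost_per_call_usd


if __name__ == "__main__":
    # 50 queries rated at top-10 means at most 500 judge calls.
    print(estimate_llm_calls(50, 10))
```

Doubling top-K doubles the upper bound on calls, which is why it is worth settling on the template and top-K before submitting a large query set.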

Step 5 — Submit

Submit. The job is enqueued immediately (202 ACCEPTED) and the worker begins calling OpenAI once per (query, doc) pair. The /judgments/{id} page polls for status.

On success (complete), the per-judgment rows populate with a rating and brief reasoning. On failure (failed), check the worker logs; common causes are the cluster being unreachable from the API container, the daily budget being exceeded, or LLM_PROVIDER_INCAPABLE (the configured model lacks structured output).

Expect a cost of roughly $0.01-0.05 with gpt-4o-mini for a small set.
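If you want to wait for the job from a script instead of watching the page, a polling loop along these lines could work. It is a sketch under assumptions: `fetch_status` is a hypothetical callable you would write to return the current status string, and only the two terminal states named above (complete, failed) are taken from this guide. Any other value is treated as still in progress.

```python
import time
from typing import Callable

# Terminal states named in this walkthrough.
TERMINAL_STATUSES = {"complete", "failed"}


def wait_for_judgments(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reaches a terminal status or the timeout elapses.

    fetch_status is a caller-supplied function (hypothetical here) that
    returns the job's current status string.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting; in normal use the defaults are enough. A result of `failed` is the cue to check the worker logs.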
