Quick tour: what RelyLoop does, in 10 minutes
A short, click-through tour for a search-relevance engineer who has never seen the product. Skips the operator setup ceremony (registering a cluster, importing samples, generating judgments) and goes straight to the value prop: the overnight optimization loop, the metric lift it produces, and the PR-based ship discipline that wraps it.
If you want the hands-on 30-minute end-to-end including setup, follow
tutorial-first-study.md instead. Use this
guide when you want a quick narrative tour against pre-baked data.
Before you start
You need:
- The stack up: make up, make migrate, make seed-clusters, make seed-es (these populate two clusters and 1,000 sample products; see tutorial-first-study.md Steps 1–4 for details)
- One LLM provider configured (an OpenAI key or a local model per tutorial-first-study.md Step 0)
Then run the demo seed:
make seed-demo
This seeds five real demo scenarios:
- Four small-data scenarios (e-commerce, knowledge base, news, jobs) — each with 5 indexed docs and 5–10 LLM-graded judgments. The optimizer runs a real 12-trial study against each. These demonstrate the system's mechanics and produce real (though usually flat) metrics.
- One rich-data scenario (acme-products-rich-prod): uses the full 1000-product ESCI dataset from samples/products.json, real LLM-generated judgments (5 queries × top-K rated), and a real 15-trial study with 3 boost knobs (title_boost, description_boost, bullet_points_boost). This is the headline demo target: it produces a real, non-zero metric lift with populated parameter importance.
Total runtime: ~7–10 minutes (the LLM judgment-generation step adds ~30–60s,
the rich 15-trial study adds ~2–3 min). Cost: ~$0.05 in LLM tokens.
Repeatability: fixed config.seed=42 + stable judgments + pinned ES =
same metric values run after run.
Recommended demo target: the acme-products-rich-prod cluster's
study (tune-acme-products-rich-boosts). What you'll see on it:
- Real metric_delta ({ndcg@10: baseline 0.682 → achieved 0.698, +2.3%})
- Populated parameter_importance (title_boost ≈ 58%, bullet_points_boost ≈ 34%, description_boost ≈ 8%)
- Three actionable followups: narrow, widen, swap_template (the LLM auto-suggests swapping to the simpler title-only template used by the small acme scenario, which makes for emergent demo storytelling)
The four small scenarios make the studies dashboard look populated and demonstrate the "no-headroom" honest-reporting case (their digests correctly say "no change in performance" on sparse data).
When the script finishes, find the rich scenario's proposal URL with:
RICH_CLUSTER=$(docker compose exec -T postgres psql -U relyloop -d relyloop -At \
-c "SELECT id FROM clusters WHERE name='acme-products-rich-prod'")
RICH_PROPOSAL=$(docker compose exec -T postgres psql -U relyloop -d relyloop -At \
-c "SELECT id FROM proposals WHERE cluster_id='$RICH_CLUSTER' AND status='pending' ORDER BY created_at DESC LIMIT 1")
echo "Open: http://localhost:3000/proposals/$RICH_PROPOSAL"
Keep that URL open in a tab; you'll visit it at Stop 3. Also keep
http://localhost:3000/studies open
for Stop 1.
Stop 1: /studies ("what the loop did overnight")
Open http://localhost:3000/studies.
The studies table shows the five scenarios make seed-demo produced.
Each row is a real, completed Optuna study with its own metric value —
not a hardcoded fixture. Values you'll typically see:
| Scenario | Study | Best ndcg@10 |
|---|---|---|
| acme-products-rich-prod | tune-acme-products-rich-boosts | 0.6979 (16 trials: 1 baseline + 15 Optuna; real +2.3% lift) |
| acme-products-prod | tune-product-title-boost-baseline | 1.0 (small-data ceiling) |
| corp-docs-search | reduce-fuzziness-helpcenter-search | 0.7305 (small data, no movement) |
| news-search-staging | add-7day-freshness-decay-news | 0.9060 (small data, no movement) |
| jobs-marketplace-prod | tune-jobtitle-vs-company-boost | 1.0 (small-data ceiling) |
Click into tune-acme-products-rich-boosts — that's the demo target.
What this says: five studies ran against real datasets. The rich scenario found a real +2.3% lift on 1000 ESCI products. The small scenarios mostly hit a metric ceiling (5 docs each is too small to give the optimizer headroom) — and the system correctly reports that. The honest "no-change" message on small-data scenarios is itself a credibility signal: RelyLoop tells you when your tuning surface is too small instead of fabricating a lift.
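To make the "metric ceiling" concrete, here is a minimal sketch of nDCG@k. It assumes linear gains and a log2 rank discount, and it computes the ideal ordering from the returned list alone (reasonable when every judged document is returned, as in a 5-doc index). RelyLoop's exact formula is not stated in this tour, and the grade lists below are hypothetical.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded results."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains, k=10):
    """DCG of the returned order divided by DCG of the ideal (sorted) order."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# Hypothetical 5-doc index: relevance grades in the order the engine returned them.
perfect_order = [3, 2, 1, 0, 0]
swapped_order = [2, 3, 1, 0, 0]  # top two results swapped

print(ndcg_at_k(perfect_order))  # 1.0
print(ndcg_at_k(swapped_order))  # below 1.0
```

Once the few judged documents are already in the right order, the score sits at 1.0 and no parameter change can raise it, which is the "small-data ceiling" shown in the table.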
Click into the row to open the study detail.
Stop 2: Study detail (the lift, the trials, the parameter importance)
A few orientation surfaces sit above the panels:
- Linked entities row: named, clickable links to the cluster, query set, judgment list, and template the study ran against. Hover any to confirm names; click to drill into the source of truth. This is how operators answer "what did this study actually test against?" without grepping UUIDs.
- View-proposal link: when a proposal has been promoted from this study, a "Proposal: view proposal (<status>)" link appears below the header. Click it for the round trip from study back to proposal, mirroring the proposal-to-study link on the proposal page.
- Glossary tooltips: small (i) icons next to Target, Trials, Best metric, and other column headings. Hover for the short definition; the Guide button (bottom-right) opens the full glossary.
The study detail page itself has four panels worth narrating:
Metric delta
Baseline vs. best on the headline metric. The rich scenario's study
shows baseline 0.682 → achieved 0.698 (+2.3%) — a real, non-trivial
improvement found by Optuna's TPE sampler exploring the 3-D
(title_boost × description_boost × bullet_points_boost) space.
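The +2.3% figure is a relative lift over the baseline. A short sketch of the arithmetic, using the two values quoted above (the function name is illustrative, not a RelyLoop API):

```python
def relative_lift(baseline, achieved):
    """Relative change of the achieved metric over the baseline."""
    return (achieved - baseline) / baseline


baseline, achieved = 0.682, 0.698
print(f"absolute delta: {achieved - baseline:+.3f}")              # +0.016
print(f"relative lift:  {relative_lift(baseline, achieved):+.1%}")  # +2.3%
```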
Trials table
16 rows (1 baseline + 15 Optuna trials), sorted by primary_metric DESC.
The top trial is highlighted. Trial values typically range from ~0.55
to ~0.70 — real, varied scores, not all identical.
What this says: the loop is auditable. You can see every trial it ran and exactly which parameter values produced which metric. No black box.
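The ordering the trials table applies can be sketched as below. The Trial class, its number and params fields, and every parameter value are hypothetical; only the primary_metric sort key and the 0.682 baseline and 0.698 best scores come from this tour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    number: int            # 0 stands for the baseline run in this sketch
    params: dict           # parameter values the trial evaluated
    primary_metric: float  # e.g. ndcg@10


def rank_trials(trials):
    """Sort trials by primary metric, best first."""
    return sorted(trials, key=lambda t: t.primary_metric, reverse=True)


# Illustrative rows only; a real study would have 16 of them.
trials = [
    Trial(0, {"title_boost": 1.0, "description_boost": 1.0, "bullet_points_boost": 1.0}, 0.682),
    Trial(3, {"title_boost": 0.6, "description_boost": 4.2, "bullet_points_boost": 0.9}, 0.561),
    Trial(7, {"title_boost": 3.4, "description_boost": 0.8, "bullet_points_boost": 2.1}, 0.698),
]

ranked = rank_trials(trials)
best = ranked[0]
print(best.number, best.params, best.primary_metric)
```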
Parameter importance bars
A bar chart showing how much each parameter contributed to variance in the metric. For the rich scenario you'll see roughly:
- title_boost ≈ 58%
- bullet_points_boost ≈ 34%
- description_boost ≈ 8%
What this says: you don't just get a winning config. You get an explanation of which knobs the optimizer actually leaned on and which ones didn't matter. Here: title matches dominate, description matches barely move the needle. That's actionable insight for the search engineer.
Confidence panel
This is the panel that answers "is this winner statistically reliable, or did Optuna get lucky on one trial?" It has four parts:
- Headline metric + 95% CI band. The rich scenario typically shows the central estimate above the baseline with a CI that's wide because the query set is small (5 queries). Wide-but-honestly-above-baseline is the right read, not a problem to hide.
- Per-query outcome chips: X Improved · Y Unchanged · Z Regressed vs. runner-up (or baseline). This is where the audience sees that the +2.3% lift is broad-based, not driven by one query.
- Queries that improved and Queries that regressed tables: named query text + winner score + comparison score + signed delta. The improver table is green; the regressor table is red. This is the point in the demo where a relevance engineer reading the panel says "oh, so I can see exactly which workloads gained and lost, not just the aggregate."
- Secondary callouts — runner-up gap (robust plateau vs. sharp peak), late-trial 1σ, and convergence regime (early-and-held vs. late-rising vs. noisy). Together these distinguish a winner the optimizer is confident about from one it's still hunting.
Every (i) icon opens a glossary definition for the term it's next to.
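A rough sketch of how the outcome chips and a 95% CI could be derived from per-query scores. This tour does not say how RelyLoop computes its interval, so the percentile bootstrap here is a stand-in. The per-query values are made up, chosen only so their means match the 0.698 and 0.682 headline numbers.

```python
import random
import statistics


def outcome_chips(winner, comparison, tol=1e-9):
    """Count queries that improved, stayed flat, or regressed vs. the comparison."""
    counts = {"improved": 0, "unchanged": 0, "regressed": 0}
    for query, score in winner.items():
        delta = score - comparison[query]
        if delta > tol:
            counts["improved"] += 1
        elif delta < -tol:
            counts["regressed"] += 1
        else:
            counts["unchanged"] += 1
    return counts


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for the mean per-query score."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-query ndcg@10 values for 5 queries.
winner = {"q1": 0.74, "q2": 0.81, "q3": 0.55, "q4": 0.69, "q5": 0.70}
baseline = {"q1": 0.70, "q2": 0.81, "q3": 0.57, "q4": 0.65, "q5": 0.68}

chips = outcome_chips(winner, baseline)
ci_low, ci_high = bootstrap_ci(list(winner.values()))
print(chips)
print(f"mean={statistics.fmean(winner.values()):.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

With only five queries each resample draws from very few distinct values, so the interval comes out wide. That is the behavior the panel is described as reporting honestly.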
Stop 3: Proposal (the ship gate)
Click "View proposal" from the study detail (or open the proposals page and find the matching row).
The proposal page shows:
Config diff
A small, scannable before/after diff for the winning parameter values. This is what would land in the operator's search-config repo.
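As an illustration of what such a before/after diff might contain, the sketch below renders two parameter dictionaries as a unified diff with Python's standard difflib. The parameter values and the "baseline"/"proposed" labels are hypothetical; the format RelyLoop actually renders is not specified in this tour.

```python
import difflib
import json


def config_diff(before, after):
    """Unified diff of two configs, serialized as sorted, indented JSON."""
    a = json.dumps(before, indent=2, sort_keys=True).splitlines()
    b = json.dumps(after, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(a, b, fromfile="baseline", tofile="proposed", lineterm="")
    )


# Hypothetical parameter values for the three boost knobs.
baseline_cfg = {"title_boost": 1.0, "description_boost": 1.0, "bullet_points_boost": 1.0}
proposed_cfg = {"title_boost": 3.4, "description_boost": 1.0, "bullet_points_boost": 2.1}

diff = config_diff(baseline_cfg, proposed_cfg)
print(diff)
```

Sorting the keys keeps the diff stable between runs, so only the values that actually changed show up as added or removed lines.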
Metric delta
The same baseline → best comparison shown on the study detail, presented in the proposal context (with delta-pct breakdown).
Digest narrative
Two to three LLM-generated paragraphs explaining what the loop learned: which parameters moved the needle (or which didn't, when the data is flat), what the winning configuration says about the optimizer's exploration, and what the operator should investigate next. This narrative becomes the PR body.
"Open PR" button
Clicking this opens a real pull request against the operator's configured search-config repo on GitHub. The operator's existing CI, reviewers, and branch protection all stay in charge of what reaches production.
What this says: RelyLoop is not in the production-serving path. It proposes; the operator's existing ship discipline disposes. No surprise config changes.
Stop 4: Suggested followups (the loop continues)
Scroll the proposal page to the "Suggested followups" panel. The digest worker (real LLM call against the actual study data) generates the followups. What kinds appear depends on what the data shows — the LLM looks at the trial distribution and picks the most-informative next experiment(s):
- narrow (when the optimizer found a stable winning region): "Tighten boost bounds around the winning value to extract the last few percentage points." Same template, narrower search space.
- widen (when the winner sat near a search-space boundary): "Test boost values further from the winner to confirm the optimum isn't a local maximum." Same template, wider search space.
- swap_template (when the data suggests a different query shape might do better): "Test whether the alternative template beats this one." Cross-template exploration.
- text (when no actionable change is warranted): prose recommending the operator rethink the rubric, judgment density, or query selection. No Run button (text-kind suggestions aren't one-click runnable).
For the rich scenario you'll see three cards: narrow (tighten
the boost bounds), widen (test boundaries further), and
swap_template (the LLM auto-suggests swapping to the
title-only multi-match-title-boost-v1 template — emergent insight
from the parameter-importance breakdown showing title dominates). All
three have a "Run this followup" button.
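To give a feel for what narrow and widen mean for a search space, here is a hypothetical sketch. In RelyLoop the followup bounds are suggested by the LLM, so the rules below (shrink to half the width around the winner; extend a side the winner sits within 10% of) and the 0.5 to 5.0 boost range are illustrative assumptions, not the product's logic.

```python
def narrow_bounds(low, high, winner, shrink=0.5):
    """Shrink the range to `shrink` of its width, centered on the winner,
    clipped so it never leaves the original bounds."""
    half = (high - low) * shrink / 2
    return max(low, winner - half), min(high, winner + half)


def widen_bounds(low, high, winner, factor=2.0, edge_fraction=0.1):
    """Extend whichever side of the range the winner sits close to.
    Assumes boosts are non-negative, so the lower bound is floored at 0."""
    width = high - low
    new_low, new_high = low, high
    if winner - low <= edge_fraction * width:
        new_low = max(0.0, low - (factor - 1) * width)
    if high - winner <= edge_fraction * width:
        new_high = high + (factor - 1) * width
    return new_low, new_high


# Hypothetical title_boost range of 0.5 to 5.0.
print(narrow_bounds(0.5, 5.0, winner=3.0))  # (1.875, 4.125)
print(widen_bounds(0.5, 5.0, winner=4.8))   # (0.5, 9.5)
```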
What this says: the loop didn't stop at one winner. It analyzed where the optimizer's curiosity remained and proposed a concrete next-experiment. The relevance engineer goes from "tune one thing and ship" to "continuous, automated experimentation with a steady stream of small wins for review."
Stop 5 (optional): Kick off a followup live
If the demo has time, click "Run this followup" on the swap-template card. A modal opens, pre-filled:
- Template: the swap target (for the rich scenario, the title-only multi-match-title-boost-v1 template suggested on the card)
- Search space: LLM-suggested narrower bounds carrying the parent's winning insight forward
- Cluster / query set / judgment list: inherited from the parent
- Name:
<parent name> — followup #1 (swap_template)
Walk through the five wizard steps with step-next. Submit.
The new study appears in /studies as queued. Within a few seconds
it flips to running and trials start accumulating.
Stop 6 (optional): Drive via chat
Open http://localhost:3000/chat.
Type something operator-shaped:
"I'm seeing poor recall on long-tail queries against the products index. What should I try?"
The agent will:
- Use get_cluster_status to look at the configured cluster
- Suggest creating a new judgment list against a recent query slice
- Offer to start a study with a recall-oriented metric (e.g., recall@100)
What this says: a relevance engineer doesn't need to learn a new UI to drive the loop. Describe the problem in plain language and the agent picks the right tools.
Closing pitch
Three points to land, in order:
- Engine-neutral by design. What you saw runs against Elasticsearch and OpenSearch today. The Solr adapter is on the MVP2 roadmap. The optimization engine doesn't care which backend implements the search; it tunes whatever knobs the adapter exposes. Your existing search query pipelines map directly.
- Real signals, available now. User Behavior Insights (UBI) is a first-class judgment source alongside LLM-as-judge: when your cluster has captured click/dwell traffic, you grade studies against your users' real behavior instead of an LLM's guess. UBI is engine-neutral (Elasticsearch, OpenSearch, Solr) and complements the LLM path — the demo above uses LLM because it needs no traffic to run.
- Open source, self-hosted. Apache 2.0. Run it on a laptop, run it on your own infra. The PR-based ship workflow keeps your existing CI and existing reviewers in charge of production.
Where to go next
- tutorial-first-study.md: full 30-min hands-on from git clone through "PR opened in GitHub"
- workflows-overview.md: inventory of all 30 distinct workflows the product supports
- In-app guides 01–10 (open the /guide page in the UI): per-workflow 60-second screenshot decks
- docs/01_architecture/: architecture docs covering adapters, optimization, agent tools, and the apply path