A/B Testing LLM Prompts in Production: A Practical Guide

Authors
  • Benjamin Lee

Why Prompt Testing Is Harder Than It Looks

Changing a prompt feels low-stakes. It's a string, not a binary. No compilation, no type errors, no failing unit tests. But a prompt change can meaningfully shift model behavior — output format, tone, reasoning depth, failure modes — in ways that are invisible until they hit real users.

The most reliable way to know whether a prompt change improves or regresses your system is to test it against live production traffic. That means A/B testing, the same way you'd test any other user-facing change. Per Traceloop's production guide, this is the only method that accounts for real query distributions, real user behavior, and real downstream effects.

The Basic Setup: Canary Deployment

The standard pattern is a canary rollout: deploy the new prompt alongside the existing one, route a small percentage of traffic to the new variant, and measure both side by side.

import hashlib

from langchain_anthropic import ChatAnthropic
from langchain_core.prompts import ChatPromptTemplate

CONTROL_PROMPT = """You are a helpful assistant. Answer concisely."""
VARIANT_PROMPT = """You are a helpful assistant. Think step by step before answering. Be concise."""

def get_prompt_variant(user_id: str, traffic_split: float = 0.1) -> tuple[str, str]:
    # Deterministic assignment by user_id: the same user always gets the same variant.
    # The 128-bit MD5 digest is scaled to [0, 1); MD5 is used for bucketing only, not security.
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16) / (2**128)
    if hash_val < traffic_split:
        return VARIANT_PROMPT, "variant"
    return CONTROL_PROMPT, "control"

def run_with_tracking(user_id: str, user_input: str) -> str:
    prompt_text, variant_name = get_prompt_variant(user_id)
    prompt = ChatPromptTemplate.from_messages([
        ("system", prompt_text),
        ("human", "{input}"),
    ])
    chain = prompt | ChatAnthropic(model="claude-sonnet-4-6")
    response = chain.invoke({"input": user_input})

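    # Log for analysis. log_experiment_event is a placeholder for your own
    # logging or analytics sink; it is not defined in this snippet.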
    log_experiment_event(
        variant=variant_name,
        user_id=user_id,
        input_tokens=response.usage_metadata["input_tokens"],
        output_tokens=response.usage_metadata["output_tokens"],
        response=response.content,
    )
    return response.content

Hashing the user ID for variant assignment is important: it keeps the same user on the same variant across multiple requests, which reduces noise and makes per-user analysis possible.
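
One refinement worth considering: if the hash depends only on the user ID, the same users land in the variant group for every experiment you run. A minimal sketch of a salted version follows, assuming a hypothetical `experiment` name is mixed into the hash key. The function name, the experiment name, and the switch to SHA-256 are illustrative choices, not part of the setup above.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, experiment: str, traffic_split: float = 0.1) -> str:
    """Deterministically bucket a user, salted by a (hypothetical) experiment name."""
    key = f"{experiment}:{user_id}".encode()
    # Scale the 256-bit SHA-256 digest to a float in [0, 1)
    bucket = int(hashlib.sha256(key).hexdigest(), 16) / 2**256
    return "variant" if bucket < traffic_split else "control"


# Sanity check on synthetic user IDs: the observed split should sit near the target.
users = [f"user-{i}" for i in range(10_000)]
counts = Counter(assign_variant(u, "step-by-step-v1") for u in users)
variant_share = counts["variant"] / len(users)
print(counts, round(variant_share, 3))
```

Checking the observed split against the configured one before reading any metrics is a cheap way to catch assignment bugs.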

What to Measure

Standard software A/B testing optimizes a single primary metric (click-through rate, conversion). LLM evaluation requires a mix of automated, human, and operational metrics. Per Galileo's evaluation framework:

Operational metrics (cheap, automatic):

  • Latency (p50, p95, p99)
  • Cost per request (input tokens × input price + output tokens × output price)
  • Error rate and refusal rate
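
As a rough sketch, the operational metrics above can be computed from logged events with the standard library alone. The prices below are hypothetical placeholders, not real rates for any model; substitute your provider's current pricing.

```python
import statistics

# Hypothetical prices in USD per million tokens (assumption, not real rates).
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one request, pricing input and output tokens separately."""
    return (
        input_tokens * PRICE_PER_MTOK["input"]
        + output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """p50/p95/p99 from a list of per-request latencies."""
    # quantiles(n=100) returns the 99 cut points p1..p99
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Compute these per variant rather than globally, so that a variant producing longer outputs shows up as a cost and latency difference.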

Quality metrics (require evaluation):

  • Relevance: does the response address the user's actual question?
  • Faithfulness: are factual claims grounded in provided context?
  • Format compliance: does output match the expected structure?

LLM-as-judge is the current practical answer for quality at scale — use a separate, powerful model to score responses on a rubric. The key is consistency: the judge model and scoring prompt should be frozen for the duration of the experiment.

import json

from langchain_anthropic import ChatAnthropic

judge = ChatAnthropic(model="claude-opus-4-7")

JUDGE_PROMPT = """Rate the following response on Relevance (1-5) and Conciseness (1-5).

User question: {question}
Response: {response}

Return JSON: {{"relevance": <score>, "conciseness": <score>, "reasoning": "<one sentence>"}}"""

def score_response(question: str, response: str) -> dict:
    # Assumes the judge returns bare JSON; json.loads raises if it adds prose or code fences.
    result = judge.invoke(JUDGE_PROMPT.format(question=question, response=response))
    return json.loads(result.content)
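
Judge models do not always return bare JSON; they may wrap it in a code fence or add a sentence around it. A defensive parser, sketched here under the assumption that the judge follows the two-score rubric above, extracts the JSON object and rejects out-of-range scores. `parse_judge_output` is a hypothetical helper name.

```python
import json
import re


def parse_judge_output(raw: str) -> dict:
    """Extract and validate the judge's JSON, tolerating surrounding text."""
    # Grab everything from the first "{" to the last "}"
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge output")
    scores = json.loads(match.group(0))
    for key in ("relevance", "conciseness"):
        value = scores.get(key)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not 1 <= value <= 5:
            raise ValueError(f"{key} must be an integer from 1 to 5, got {value!r}")
    return scores


example = 'Here is my rating: {"relevance": 4, "conciseness": 5, "reasoning": "Direct answer."}'
print(parse_judge_output(example))
```

Counting how often parsing fails per variant is useful in itself: a sudden rise usually points at a changed judge prompt or model.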

Tooling: Langfuse

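Langfuse is well suited to this. It manages prompt versions and labels, and it surfaces per-variant metrics (latency, cost, evaluation scores) in a dashboard. The traffic split itself typically stays in your application code: you label the competing prompt versions and choose between them per user or per request.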

The integration is a decorator:

from langfuse.decorators import langfuse_context, observe
from langfuse import Langfuse

langfuse = Langfuse()

@observe()
def run_agent(user_input: str, user_id: str):
    # Fetch the prompt version currently labeled "production"
    prompt = langfuse.get_prompt("my-agent-prompt", label="production")
    langfuse_context.update_current_trace(
        user_id=user_id,
        tags=[f"prompt-version:{prompt.version}"],
    )
    # ... run model ...

Langfuse then lets you compare variants across any metric you've logged — you can filter to a specific date range, model version, or user cohort.

Statistical Significance

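A common mistake is calling a test too early. LLM responses have high variance, so a sample of 50 queries rarely separates a real effect from noise. Rules of thumb:

  • Run each variant on at least 500–1,000 requests before drawing conclusions; the exact number depends on the metric's variance and the smallest effect you care about
  • Use a two-sample t-test (Welch's version if the variances differ) or a Mann-Whitney U test for continuous metrics (latency, scores); Mann-Whitney is the safer choice for skewed latencies and ordinal judge scores
  • Apply a Bonferroni correction if you're testing multiple metrics simultaneously; otherwise the chance of at least one false positive grows with every metric you add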
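
To see where numbers like 500–1,000 come from, here is a back-of-the-envelope sketch using the standard normal-approximation formula for comparing two means. It assumes equal-sized groups and a known standard deviation; the inputs (a standard deviation of 1.0 and a minimum effect of 0.2 points on a judge score) are illustrative assumptions, not measurements.

```python
import math
from statistics import NormalDist


def required_n_per_variant(
    sigma: float, min_effect: float, alpha: float = 0.05, power: float = 0.8
) -> int:
    """Approximate per-variant sample size for a two-sided comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / min_effect) ** 2)


def bonferroni_alpha(alpha: float, num_metrics: int) -> float:
    """Per-metric significance threshold when testing several metrics at once."""
    return alpha / num_metrics


# One metric versus three metrics with a Bonferroni-adjusted threshold
n_single = required_n_per_variant(sigma=1.0, min_effect=0.2)
n_three = required_n_per_variant(
    sigma=1.0, min_effect=0.2, alpha=bonferroni_alpha(0.05, 3)
)
print(n_single, n_three)
```

With a 10% canary split, the variant group is the bottleneck: it has to reach this count on its own, which takes roughly ten times as much total traffic.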

For low-traffic applications, consider offline evaluation: build a representative test set of 200–500 real queries, run both prompt variants against all of them, and score the outputs. It's less rigorous than live A/B testing but much faster to run.
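
Because both variants answer the same queries offline, the comparison is paired, and a paired analysis is more sensitive than treating the two score lists as independent. A minimal sketch using a bootstrap over per-query score differences follows; the function name and the 2,000-resample default are assumptions for illustration.

```python
import random


def paired_bootstrap_ci(
    control_scores: list[float],
    variant_scores: list[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Mean per-query score difference with a 95% bootstrap confidence interval."""
    if len(control_scores) != len(variant_scores):
        raise ValueError("scores must be paired: one control and one variant per query")
    diffs = [v - c for c, v in zip(control_scores, variant_scores)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    # Resample the differences with replacement and record each resample's mean
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper
```

If the interval excludes zero, the variant's offline advantage is unlikely to be resampling noise, though it says nothing about how live users will react.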

The Recommendation Systems Angle

For agents that power recommendation or ranking systems, prompt optimization takes on additional dimensions. Recent research from LFAI & Data shows that generative recommenders can collapse multi-stage ranking pipelines into a single model call — but the prompt engineering for ranking is non-trivial. You're optimizing for diversity, novelty, and fairness alongside pure relevance.

For these systems, A/B test against business metrics (click-through rate, conversion, session depth), not just LLM quality scores. The LLM judge can tell you if a response is relevant; only real user behavior can tell you if it's useful.
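
For a binary business metric such as click-through, a two-proportion z-test is a common first check. The sketch below uses the pooled-variance normal approximation, which assumes reasonably large counts in both groups; the function name and the example counts are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(
    clicks_a: int, n_a: int, clicks_b: int, n_b: int
) -> tuple[float, float]:
    """z statistic and two-sided p-value for a difference in click-through rate."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control 100 clicks in 1,000 sessions, variant 150 in 1,000
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
print(round(z, 2), p_value < 0.05)
```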

When to Ship

Call the experiment when:

  1. You have sufficient sample size (per above)
  2. The primary quality metric shows a meaningful improvement (not just within noise)
  3. Latency and cost haven't regressed unacceptably
  4. The improvement is consistent across user segments (not driven by one cohort)

Prompt changes that win on quality but add 300 ms of latency or 40% to token cost require a deliberate tradeoff decision; document it explicitly before shipping.
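
The checklist can be encoded as a small gate so the decision is applied the same way every time. This is a sketch, not a recommended policy: the field names are hypothetical, and the default thresholds simply echo the figures mentioned in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentSummary:
    n_per_variant: int        # requests seen by the smaller group
    quality_ci_lower: float   # lower CI bound of (variant - control) quality
    latency_delta_ms: float   # variant p95 latency minus control p95
    cost_delta_pct: float     # relative change in cost per request, in percent
    segment_lifts: dict[str, float] = field(default_factory=dict)


def ship_decision(
    s: ExperimentSummary,
    min_n: int = 500,
    max_latency_ms: float = 300.0,
    max_cost_pct: float = 40.0,
) -> str:
    if s.n_per_variant < min_n:
        return "keep running"            # 1. not enough data yet
    if s.quality_ci_lower <= 0:
        return "no clear win"            # 2. improvement is within noise
    if any(lift < 0 for lift in s.segment_lifts.values()):
        return "investigate segments"    # 4. a cohort regressed
    if s.latency_delta_ms >= max_latency_ms or s.cost_delta_pct >= max_cost_pct:
        return "needs tradeoff review"   # 3. quality win with a real cost
    return "ship"
```

Returning a reason rather than a boolean keeps the documented tradeoff decision attached to the experiment record.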


Sources: