A Diagnostic Framework for Choosing a RAG Architecture

Authors
  • Benjamin Lee

Most teams pick a RAG architecture the way they pick a JavaScript framework: by what is fashionable, not by what their system actually needs. They read about agentic retrieval, assume it is the destination, and skip the cheaper patterns that would have solved their problem. The result is a slow, expensive pipeline that is harder to debug than the failure it was meant to fix.

A better approach treats architecture selection as a diagnostic process. You measure how your current retrieval fails, then reach for the simplest pattern that addresses that specific failure. This article walks through eight patterns, from Simple RAG to Agentic RAG, and frames each one by the problem it is designed to solve rather than by its place on a sophistication ladder.

Start by Measuring Failure, Not by Choosing a Pattern

Before adopting any architecture more complex than the baseline, instrument your pipeline so you know where it breaks. Two bodies of work give you the vocabulary for this.

The RGB benchmark (Chen et al., 2023) evaluates RAG along four distinct abilities, each of which names a failure mode: noise robustness, negative rejection (declining to answer when no relevant context exists), information integration across documents, and counterfactual robustness. These axes matter because each one points at a different architectural remedy. A system that hallucinates when retrieval returns nothing has a negative-rejection problem, not a chunking problem.

The RAGAS framework (Es et al., 2023) gives you the operational metrics to watch: faithfulness, answer relevancy, context precision, and context recall. Run these against your baseline on a representative sample of real queries. The metric that is lowest tells you which architecture to consider next. Choosing without this measurement is how teams end up with sophisticated pipelines that do not move their actual numbers.
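As a rough illustration of that last step, the sketch below picks the weakest metric and looks up the patterns this article associates with it. The metric keys follow the RAGAS names above; the helper names, the mapping table, and the scores are hypothetical and only mirror the decision path given later in this article.

```python
# Hypothetical helper: metric keys follow the RAGAS names, the mapping follows
# this article's decision path, and the scores are made-up illustration values.
CANDIDATES = {
    "context_precision": ["Branched RAG", "HyDE"],
    "faithfulness": ["CRAG", "Self-RAG (if you can fine-tune)"],
}


def weakest_metric(scores: dict[str, float]) -> str:
    """Return the name of the lowest-scoring metric."""
    if not scores:
        raise ValueError("scores must not be empty")
    return min(scores, key=scores.get)


def suggest_patterns(scores: dict[str, float]) -> tuple[str, list[str]]:
    """Pair the weakest metric with the patterns this article maps to it."""
    metric = weakest_metric(scores)
    # An empty list means this article names no pattern for that metric.
    return metric, CANDIDATES.get(metric, [])


baseline = {
    "faithfulness": 0.82,
    "answer_relevancy": 0.88,
    "context_precision": 0.54,
    "context_recall": 0.79,
}
print(suggest_patterns(baseline))
```

In practice the scores would come from an evaluation run over real queries rather than a hand-written dictionary.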

Simple RAG and Simple RAG with Memory: The Correct Starting Point

Lewis et al. (2020) established the retrieve-then-generate paradigm that defines Simple RAG: a retriever fetches relevant documents from a static index, and a generator conditions on them to produce a grounded answer. Izacard and Grave (2021) showed that conditioning generation on retrieved passages substantially outperforms a closed-book model of comparable size, which is the empirical reason retrieval is worth the added infrastructure.

Simple RAG is the right place to begin every project. It is fast to build, cheap to run, and handles single-hop factual questions over a clean corpus well. Its limits are predictable rather than mysterious: single-pass retrieval struggles on multi-hop questions, and a static index goes stale without re-indexing. Those limits are the signal to move on, not a reason to skip the tier.

Simple RAG with Memory adds a store for prior conversational turns, so the system can resolve follow-up questions that depend on earlier context. You can implement this with prompt caching or an external store such as Redis or DynamoDB. Ram et al. (2023) showed that in-context retrieval augmentation, without modifying the underlying model, is enough to deliver consistent gains, which is why memory at this tier is usually a storage and prompt-assembly concern rather than a training one. Chunk size and overlap remain the highest-leverage parameters you can tune before changing architecture at all.
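To make the storage-and-prompt-assembly point concrete, here is a minimal sketch. ConversationMemory and build_prompt are hypothetical names, and the in-process deque stands in for an external store such as Redis or DynamoDB; the prompt layout is an assumption, not a recommended template.

```python
from collections import deque


class ConversationMemory:
    """In-process stand-in for an external store such as Redis or DynamoDB."""

    def __init__(self, max_turns: int = 4) -> None:
        # deque(maxlen=...) drops the oldest turn once the limit is reached.
        self._turns = deque(maxlen=max_turns)

    def add_turn(self, user: str, assistant: str) -> None:
        self._turns.append((user, assistant))

    def render(self) -> str:
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self._turns)


def build_prompt(memory: ConversationMemory, passages: list[str], question: str) -> str:
    """Assemble earlier turns, retrieved context, and the new question."""
    history = memory.render() or "(no earlier turns)"
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Conversation so far:\n{history}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"Question: {question}"
    )


memory = ConversationMemory(max_turns=2)
memory.add_turn("What is our refund window?", "30 days from delivery.")
prompt = build_prompt(
    memory,
    ["Refunds are issued to the original payment method."],
    "Does it apply to gift cards?",
)
print(prompt)
```

The follow-up question only makes sense because the earlier turn about the refund window travels with it in the prompt.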

Branched RAG and HyDE: Fixing Retrieval Precision

If your diagnosis shows low context precision, you are retrieving irrelevant material. Two patterns target this directly.

Branched RAG routes a query to the most relevant source or sources instead of searching everything. Querying every connected system for every question pollutes the context window with noise, and routing removes that noise at its origin. This pattern fits organizations with several distinct corpora, such as separate stores for policies, code, and support tickets, where most queries belong to exactly one of them. LangChain's routing documentation notes a production caveat worth heeding: an LLM-based router needs a fallback path, or novel query types get silently misrouted.
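The fallback caveat can be sketched without any framework. The source names and keyword rules below are hypothetical, and keyword matching is only a stand-in for whatever router you actually use (an LLM or a trained classifier); the point is the explicit path for queries the router does not recognize.

```python
import re

# Hypothetical sources and keyword rules, for illustration only.
ROUTES = {
    "policies": {"policy", "leave", "compliance", "refund"},
    "code": {"function", "exception", "stacktrace", "repository"},
    "tickets": {"ticket", "customer", "outage", "escalation"},
}


def route(query: str, min_hits: int = 1) -> list[str]:
    """Return the sources to search, falling back to all of them when unsure."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    hits = {name: len(tokens & keywords) for name, keywords in ROUTES.items()}
    best = max(hits.values())
    if best < min_hits:
        # Fallback path: an unrecognized query searches every source instead
        # of being silently sent to an arbitrary single one.
        return sorted(ROUTES)
    # Ties are kept, so an ambiguous query may search more than one source.
    return sorted(name for name, count in hits.items() if count == best)


print(route("What is the parental leave policy?"))
print(route("How tall is Mount Everest?"))
```

Searching everything on a miss trades precision for safety; logging those misses is one way to find the query types the router needs to learn.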

HyDE, or Hypothetical Document Embeddings (Gao et al., 2022), attacks a different precision problem: the vocabulary gap between how users phrase questions and how source documents phrase answers. Instead of embedding the raw query, the model first drafts a hypothetical ideal answer and embeds that, retrieving real documents that resemble it. It helps most when queries are underspecified or the corpus uses specialized language, and it can hurt on short, exact-match lookups where the raw query is already well aligned. Apply it selectively, not as a default.
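The vocabulary gap is easy to see in a toy setting. In the sketch below the bag-of-words embed function stands in for a dense embedding model, and draft_hypothetical_answer is a hard-coded stand-in for the LLM call that drafts the ideal answer; both, along with the two-document corpus, are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a dense model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def draft_hypothetical_answer(query: str) -> str:
    """Stand-in for the LLM call that drafts an ideal answer (hard-coded here)."""
    return "Caffeine is a stimulant that can induce tachycardia and palpitations."


def retrieve(text: str, corpus: list[str]) -> str:
    """Return the corpus document most similar to the given text."""
    vector = embed(text)
    return max(corpus, key=lambda doc: cosine(vector, embed(doc)))


def hyde_retrieve(query: str, corpus: list[str]) -> str:
    """Embed the drafted answer instead of the raw query, then retrieve."""
    return retrieve(draft_hypothetical_answer(query), corpus)


CORPUS = [
    "Caffeine can induce tachycardia and palpitations in sensitive adults.",
    "Coffee shops in the city center open early after the holiday race.",
]
QUERY = "Why does my heart race after coffee?"

print("raw query :", retrieve(QUERY, CORPUS))
print("HyDE      :", hyde_retrieve(QUERY, CORPUS))
```

The raw query shares surface words with the wrong document, while the drafted answer shares clinical vocabulary with the right one, which is the effect HyDE relies on.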

Adaptive, Corrective, and Self-RAG: Fixing Faithfulness and Wasted Retrieval

When your faithfulness metric is low, the model is generating claims the retrieved context does not support. When latency is uneven, you are likely retrieving on queries that did not need it. The self-reflective patterns address both.

Adaptive RAG (Jeong et al., 2024) classifies each query by complexity and routes it accordingly: no retrieval for questions the model can answer directly, single-step retrieval for moderate ones, and multi-step retrieval for the hardest. The motivation is grounded in Mallen et al. (2023), who showed that retrieval can actually degrade answers on well-known facts while remaining essential for long-tail ones. Adaptive RAG turns that finding into a routing policy.
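The routing policy itself is small once a complexity label exists. The sketch below is an assumption-heavy illustration: classify_complexity is a placeholder heuristic (the published method trains a classifier), the hop counts are arbitrary, and the way follow-up searches are formed is deliberately naive.

```python
from typing import Callable

# Retrieval rounds allowed per complexity label (illustrative values).
HOPS = {"no_retrieval": 0, "single_step": 1, "multi_step": 3}


def classify_complexity(query: str) -> str:
    """Placeholder heuristic; Adaptive RAG itself uses a trained classifier."""
    words = query.lower().split()
    if any(marker in words for marker in ("and", "then", "compare", "both")):
        return "multi_step"
    if len(words) <= 4:
        return "no_retrieval"
    return "single_step"


def gather_context(query: str, retrieve: Callable[[str], list[str]]) -> list[str]:
    """Call the retriever zero, one, or several times depending on the label."""
    context: list[str] = []
    search = query
    for _ in range(HOPS[classify_complexity(query)]):
        passages = retrieve(search)
        context.extend(p for p in passages if p not in context)
        if passages:
            # Naive follow-up: extend the search with the latest passage.
            search = f"{query} {passages[-1]}"
    return context
```

A heuristic this crude will misroute queries, and a misfire on the skip-retrieval branch leaves the generator with no context at all, so the classifier deserves its own evaluation before it is trusted.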

Corrective RAG, or CRAG (Yan et al., 2024), inserts a grading step between retrieval and generation. It scores retrieved documents for relevance, decomposes them into finer-grained pieces, and triggers a fallback such as web search when the primary index returns weak results. Because the grader acts as a quality gate before generation rather than a check afterward, it is a strong choice for high-stakes domains where an unsupported answer is costly. The tradeoff is the web-search fallback, which adds non-deterministic latency and an external dependency that may rule it out for air-gapped systems.

Self-RAG (Asai et al., 2023) goes further by training the model to emit reflection tokens that decide, mid-generation, whether to retrieve and whether its own output is supported. This reduces retrieval on queries that do not need it and improves self-consistency, but it requires fine-tuning on a curated reflection-token dataset, which makes it materially more expensive to adopt than the prompt-based patterns above.

A minimal version of CRAG's grading step looks like this in LangChain:

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

grader_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a grader assessing whether a retrieved document is relevant "
               'to the user question. Output JSON with a single key "score" whose '
               'value is "yes" or "no".'),
    ("human", "Document:\n{document}\n\nQuestion: {question}")
])

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# Parse the reply into a dict so the check below does not depend on the
# exact whitespace or quoting the model happens to emit.
grader = grader_prompt | llm | JsonOutputParser()

def grade_documents(documents: list[str], question: str) -> list[str]:
    """Return only the documents the grader marks as relevant."""
    relevant = []
    for doc in documents:
        result = grader.invoke({"document": doc, "question": question})
        if isinstance(result, dict) and str(result.get("score", "")).strip().lower() == "yes":
            relevant.append(doc)
    return relevant

If grade_documents returns an empty list, the pipeline triggers its fallback before generation instead of answering from weak context.
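That control flow can be sketched independently of any framework. In the version below, answer_with_fallback and its callable parameters are hypothetical names; grade is assumed to have the same shape as grade_documents, and grading the fallback results with the same gate is a choice made for this sketch rather than something the CRAG paper prescribes.

```python
from typing import Callable

NO_CONTEXT_ANSWER = "I could not find supporting context for that question."


def answer_with_fallback(
    question: str,
    retrieve: Callable[[str], list[str]],
    grade: Callable[[list[str], str], list[str]],
    fallback: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> tuple[str, str]:
    """Return (answer, route), where route is 'primary', 'fallback', or 'none'."""
    documents = grade(retrieve(question), question)
    if documents:
        return generate(question, documents), "primary"
    # Primary retrieval produced nothing relevant: try the fallback source
    # and hold its results to the same gate (an assumption of this sketch).
    documents = grade(fallback(question), question)
    if documents:
        return generate(question, documents), "fallback"
    # Decline rather than generate from context that failed grading.
    return NO_CONTEXT_ANSWER, "none"
```

Returning the route alongside the answer makes it easy to log how often the fallback fires, which is a useful signal about index coverage.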

Agentic RAG: For Genuine Multi-Step Synthesis

Agentic RAG is the most capable pattern and the one most often adopted prematurely. It assigns document-level agents and coordinates them with a meta-agent that plans multi-step retrieval across heterogeneous sources and tools. The use cases that genuinely need it are narrow but high-value: synthesizing across many corpora, executive decision support that aggregates several systems, and complex multi-document questions where no single retrieval pass can surface the evidence.

GraphRAG (Edge et al., 2024) is a close relative that builds an entity-and-relationship graph over the corpus, which helps on global, thematic questions that span a whole document set. The cost is real: it runs LLM inference over every chunk at index time to extract the graph, which is hard to justify on large corpora that update frequently. RAPTOR (Sarthi et al., 2024) takes a related approach, building a tree of recursive summaries so retrieval can operate at multiple levels of abstraction, which helps when answers require synthesizing information spread across long documents.

The defining tradeoff of agentic patterns is captured well by Harrison Chase of LangChain: agentic RAG is compelling because it collapses the line between retrieval and reasoning, but you are no longer debugging a pipeline, you are debugging an autonomous system that can take paths you did not anticipate. Multi-step orchestration multiplies both cost and latency per query, and it demands observability and loop guards that simpler patterns do not. Adopt it only when your evaluation data shows multi-hop synthesis failures that nothing cheaper resolves.
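A loop guard can be as simple as a step cap plus a check for repeated actions. The LoopGuard class below is a hypothetical minimal sketch, not a feature of any particular agent framework, and the planned actions are made up.

```python
class LoopGuard:
    """Stops an agent after too many steps or when it repeats an action."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.steps = 0
        self.seen: set[tuple[str, str]] = set()

    def allow(self, tool: str, tool_input: str) -> bool:
        """Return True and record the action if the agent may proceed."""
        action = (tool, tool_input)
        if self.steps >= self.max_steps or action in self.seen:
            return False
        self.seen.add(action)
        self.steps += 1
        return True


# Made-up action sequence: the second action repeats the first.
planned = [("search", "q1"), ("search", "q1"), ("lookup", "q2")]
guard = LoopGuard(max_steps=5)
executed = []
for tool, tool_input in planned:
    if not guard.allow(tool, tool_input):
        break  # stop instead of re-running an identical call
    executed.append((tool, tool_input))
print(executed)
```

Exact-match repeat detection is the crudest possible check; it will not catch an agent that rephrases the same search, which is part of why observability matters here.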

Combining Patterns: When Layering Pays Off

These eight patterns are not mutually exclusive. The strongest production systems usually compose two or three of them, because the patterns address different stages of the pipeline: routing decides where to look, query transformation decides how to look, grading decides whether what you found is good enough, and orchestration decides how many times to repeat the loop. Combinations work well when they stack across those stages and work badly when they stack within one stage, adding cost without adding a new capability.

Agentic plus Branched RAG is the combination worth understanding first, because it is both common and frequently misjudged. On its own, Branched RAG makes a single routing decision: it picks the most relevant source for a query and retrieves once. Agentic RAG makes routing a repeatable, reasoned step inside a multi-turn loop. Composing them means the agent treats each branch as a tool it can choose, observe the result of, and choose again. This is a good combination when your knowledge lives in several genuinely distinct stores (a vector index, a SQL warehouse, a document API) and a single query may need to consult more than one of them in sequence. The agent routes to the policy store, reads what it found, realizes it also needs the contract store, and routes again. Branched RAG alone cannot do that second hop, and a non-branched agent wastes steps searching one undifferentiated index.
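The route, read, route-again loop from that example can be sketched with each branch exposed as a lookup the agent may call. Everything here is hypothetical: the two stores, their contents, and plan_next, which hard-codes the reasoning an LLM would perform in a real agent.

```python
from typing import Optional

# Hypothetical stores; in practice each branch would be its own retriever.
STORES = {
    "policy": {"termination": "Termination terms are defined in contract C-17."},
    "contract": {"C-17": "Contract C-17 requires 60 days written notice."},
}


def search(store: str, key: str) -> Optional[str]:
    """One branch call: look a key up in the named store."""
    return STORES[store].get(key)


def plan_next(question: str, evidence: list[str]) -> Optional[tuple[str, str]]:
    """Stand-in for the agent's reasoning step (an LLM call in practice)."""
    if not evidence:
        return ("policy", "termination")
    if len(evidence) == 1 and "C-17" in evidence[-1]:
        # The policy text points at a contract, so route to the second store.
        return ("contract", "C-17")
    return None  # enough evidence gathered


def run_agent(question: str, max_hops: int = 3) -> tuple[list[str], list[str]]:
    """Route, read, and route again until the planner stops or hops run out."""
    evidence: list[str] = []
    visited: list[str] = []
    for _ in range(max_hops):
        step = plan_next(question, evidence)
        if step is None:
            break
        store, key = step
        visited.append(store)
        hit = search(store, key)
        if hit:
            evidence.append(hit)
    return evidence, visited


evidence, visited = run_agent("How much notice does termination require?")
print(visited, evidence)
```

A single routing decision would stop after the policy store; the second hop exists only because the planner read the first result before choosing again.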

The same combination is a bad fit when your corpus is effectively one source. If everything lives in a single index, the branch selection is a no-op, and you are paying for agentic orchestration (the extra latency, the loop guards, the observability) to make a routing decision that has only one possible answer. In that case Branched RAG adds nothing, and a simpler agentic loop, or no agent at all, is the better choice. The rule of thumb: add Branched RAG to an agent only when you have multiple heterogeneous sources and at least some queries need more than one of them.

A few other combinations are reliably worth it. Branched RAG plus HyDE pairs cleanly because they act on different stages: route to the right source first, then use a hypothetical-document embedding to search precisely within it, which matters most when the chosen source uses specialized vocabulary. Adaptive RAG plus CRAG is a strong general-purpose stack: Adaptive decides whether a query needs retrieval at all, and CRAG grades whatever comes back, so you spend retrieval and grading effort only where it earns its keep. Agentic RAG plus CRAG is close to a default for high-stakes agents, because the grading step gives the agent a principled signal for its retrieve-again decision rather than letting it loop on vibes.

Some combinations are redundant and should be avoided. Self-RAG already learns adaptive retrieval and self-grading as trained behaviors, so bolting external Adaptive or CRAG logic on top usually duplicates what the model was fine-tuned to do, adding latency for little gain. Stacking HyDE on short exact-match lookups, even inside an otherwise sophisticated system, still degrades precision for the same reason it does alone. The governing principle is the same one that drives single-pattern selection: each layer you add should resolve a distinct, measured failure mode. If you cannot name the metric a combination improves, you are buying complexity, not capability.

A Controlled Ablation: What Actually Moved the Numbers

To pressure-test this framework, I ran seven of these configurations through a single evaluation harness on three datasets: HotpotQA (multi-hop question answering, 60 questions over 2 trials); a merged BEIR corpus (nfcorpus, scidocs, and scifact combined into 933 documents, 60 queries over 2 trials); and a composed cross-source set built from the same BEIR datasets (40 two-source queries over 2 trials), described below. Retrieval used a dense embedding model (bge-small-en-v1.5), answers were LLM-generated where the dataset has gold answers, and the agentic configurations ran a real multi-step agent loop. These are outputs from one harness on small samples, not published leaderboard results, so read them as directional evidence for the diagnostic argument rather than a definitive ranking. Every figure is reproducible from the harness (for example python -m agents.experiments.run_rigorous --benchmark hotpotqa --retriever dense --llm --agent); the per-query dollar costs are derived from operator-set token prices, not a vendor quote.

A methodological note matters here, because it changed the conclusions. The unit of statistical analysis is the query, with repeated trials averaged within a query before any test, since trials of a deterministic pipeline are not independent observations and treating them as such understates the uncertainty. And every configuration is held to the same retrieval budget: each one may place at most the same number of candidates (k) in front of the reader, with multi-hop configs fusing their hops by rank rather than being allowed a larger pool. An earlier version of this harness violated both rules, and the apparent wins it produced for the sophisticated patterns did not survive fixing them.
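Both rules are short to express in code. The sketch below is an illustration rather than the harness itself: per_query_means and fuse_hops are hypothetical names, and the round-robin interleave is one simple way to fuse hops by rank that may differ from the exact fusion the harness uses.

```python
from collections import defaultdict
from statistics import mean


def per_query_means(rows: list[tuple[str, int, float]]) -> dict[str, float]:
    """Collapse (query_id, trial, score) rows to one mean score per query."""
    by_query: dict[str, list[float]] = defaultdict(list)
    for query_id, _trial, score in rows:
        by_query[query_id].append(score)
    return {query_id: mean(scores) for query_id, scores in by_query.items()}


def fuse_hops(hops: list[list[str]], k: int) -> list[str]:
    """Interleave hop rankings by rank position, keeping at most k unique ids."""
    if k <= 0:
        return []
    fused: list[str] = []
    depth = max((len(hop) for hop in hops), default=0)
    for rank in range(depth):
        for hop in hops:
            if rank < len(hop) and hop[rank] not in fused:
                fused.append(hop[rank])
                if len(fused) == k:
                    # Budget reached: the reader sees k candidates, no more.
                    return fused
    return fused


rows = [("q1", 0, 1.0), ("q1", 1, 0.0), ("q2", 0, 1.0), ("q2", 1, 1.0)]
print(per_query_means(rows))
print(fuse_hops([["a", "b", "c"], ["b", "d", "e"]], k=4))
```

Any significance test then runs on the per-query means, so two trials of sixty queries count as sixty observations, not one hundred and twenty.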

On HotpotQA, ranked by retrieval recall at 5:

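| Pattern | recall@5 | token F1 | exact match | cost/query |
| --- | --- | --- | --- | --- |
| HyDE | 0.963 | 0.455 | 0.267 | $0.0041 |
| Simple | 0.883 | 0.405 | 0.233 | $0.0025 |
| Agentic | 0.883 | 0.395 | 0.225 | $0.0063 |
| CRAG | 0.850 | 0.429 | 0.250 | $0.0044 |
| Agentic + Branched | 0.821 | 0.431 | 0.283 | $0.0050 |
| Adaptive | 0.671 | 0.297 | 0.158 | $0.0025 |
| Branched | 0.342 | 0.205 | 0.133 | $0.0010 |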

The honest headline on HotpotQA is that nothing beat Simple by a statistically significant margin on answer quality. HyDE scored highest on raw retrieval recall and led on token F1 (+0.050 over Simple), but with trials treated correctly the difference was not significant (p=0.20). Agentic, once given the same retrieval budget as Simple, matched it almost exactly (recall@5 0.883 for both, token F1 difference -0.010, not significant) rather than trailing it. That is the expected result on a single-pool benchmark: when one retrieval pass already surfaces the evidence, an agent that reasons over the same candidates has nothing left to add, and it pays roughly 2.5 times the cost for the privilege.

Two effects were real. Branched alone was the worst configuration on every quality metric, because HotpotQA's distractor passages behave as a single pool and routing to one branch discards passages the answer needs. Agentic + Branched then recovered most of that loss (token F1 +0.226 over Branched, p < 0.001), which is worth understanding precisely: layering the agent rescued a pattern that was wrong for this corpus, but it left recall@5 below the plain Simple baseline (0.821 versus 0.883), and its nominal edge on token F1 (0.431 versus 0.405) falls under the same no-significant-difference finding as every other configuration. Paying for orchestration to climb back to roughly where you started is not a gain. This is the single-source warning from the section on combining patterns, measured.
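For readers unfamiliar with the three HotpotQA metrics, the sketch below shows one common way to compute them. The normalization (lowercase, alphanumeric tokens only) is a simplifying assumption and may not match what the harness does, and the function names are hypothetical.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric tokens (a simplified normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 when the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant passage ids that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


print(token_f1("the Eiffel Tower", "Eiffel Tower"))
print(exact_match("Paris.", "paris"))
print(recall_at_k(["d1", "d9", "d3"], {"d1", "d2"}, k=5))
```

Token F1 rewards partially correct answers that exact match scores as zero, which is why the two columns can rank configurations differently.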

The BEIR multi-source run isolates retrieval: that corpus has no gold short answers, so token F1 and exact match are zero by construction. There, Simple, Agentic, and HyDE tied at the top on recall@5 (0.58 to 0.59) and all reached full source recall, while Adaptive collapsed (recall@5 0.213, source recall 0.325). Adaptive's complexity gate decided to skip retrieval far too often on a corpus where every query needed it, which is the precise failure the pattern risks: when the skip-retrieval decision misfires, the answer has nothing to stand on. Again, no combination beat Simple, for the same reason as on HotpotQA: although the BEIR corpus is built from several datasets, each query's answer still lives in exactly one of them, so a single good retrieval pass suffices and there is no second source for an agent to go find.

That observation is the whole point, and it exposes a limit in the first two benchmarks: neither contains a query that genuinely requires evidence from more than one source. So I built one. The cross-source set pairs a real query from one BEIR dataset with a real query from another and concatenates them, so that answering the composed question requires retrieving from two distinct sources. Only the pairing is synthetic; the passages and sub-questions are real, and the set is labeled illustrative. Here source recall, the fraction of required sources actually covered, is the metric that matters.
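The construction and the metric can be sketched in a few lines. The function names, the space used to join the two questions, and the example queries are assumptions made for illustration; only the idea (pair two real queries from different datasets, then score the fraction of required sources covered) comes from the description above.

```python
def compose_query(first: dict, second: dict) -> dict:
    """Concatenate two single-source queries into one two-source query."""
    return {
        "question": f"{first['question']} {second['question']}",
        "required_sources": sorted({first["source"], second["source"]}),
    }


def source_recall(retrieved_sources: list[str], required_sources: list[str]) -> float:
    """Fraction of required sources that appear among the retrieved passages."""
    required = set(required_sources)
    if not required:
        raise ValueError("required_sources must not be empty")
    return len(required & set(retrieved_sources)) / len(required)


# Made-up example questions; the dataset names are the BEIR sets named earlier.
composed = compose_query(
    {"question": "Does zinc shorten colds?", "source": "nfcorpus"},
    {"question": "Do statins reduce stroke risk?", "source": "scifact"},
)
print(composed["required_sources"])
print(source_recall(["scifact", "scifact", "scidocs"], composed["required_sources"]))
```

A configuration that retrieves five passages from a single dataset scores 0.5 here no matter how relevant those passages are, which is what makes the metric sensitive to the second hop.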

On those genuinely two-source queries, the combination finally earns its cost:

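| Pattern | source recall | recall@5 | hops | cost/query |
| --- | --- | --- | --- | --- |
| Agentic + Branched | 0.981 | 0.095 | 2.91 | $0.0036 |
| Simple | 0.812 | 0.093 | 1.00 | $0.0000 |
| Agentic | 0.812 | 0.093 | 2.83 | $0.0035 |
| CRAG | 0.781 | 0.093 | 1.44 | $0.0043 |
| HyDE | 0.688 | 0.100 | 1.00 | $0.0039 |
| Adaptive | 0.662 | 0.077 | 1.98 | $0.0006 |
| Branched | 0.475 | 0.068 | 1.00 | $0.0000 |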

Agentic + Branched covered both required sources 98 percent of the time, against 81 percent for Simple (+0.169, p < 0.001) and just 48 percent for Branched alone (+0.506, p < 0.001). This is the one place in the study where the most sophisticated combination is also the best choice, and it is exactly the place the framework predicts: a query that must consult several distinct stores in sequence. Note too that single-hop Branched is the worst pattern here, because one routing decision can only ever reach one of the two sources a composed query needs. The agent's ability to route, read, and route again is what closes the gap.

Read together, the three datasets tell a consistent and slightly deflating story. On single-source workloads, which are most workloads, Simple is the bar to beat, and the elaborate patterns mostly fail to clear it once budget and statistics are controlled. The combination's decisive win appears only when the benchmark has the genuine multi-source structure the pattern was designed for. That is not a knock on the sophisticated patterns; it is the argument of this article, now visible in the measurements. Match the pattern to the failure, and confirm the failure is real before you pay for the cure.

A few caveats keep these numbers in their lane. HotpotQA is drawn from Wikipedia, which the answer model has very likely seen in pretraining, so its answer-quality figures partly reflect what the model already knows rather than what retrieval surfaced; the BEIR datasets, built from scientific abstracts, are a cleaner test of retrieval and are where the cross-source result was measured. The agentic, Adaptive, and CRAG configurations are faithful reimplementations of each pattern's mechanism, not the exact published systems, and they use a stock model rather than one fine-tuned for the method. And every result rests on one embedding model, one generator, and small samples. The effects that reproduce the framework's predictions are the ones to trust; the rest is directional.

The Decision Path

Map the diagnosis to the pattern:

  • Low context precision, irrelevant retrieval: Branched RAG to route by source, or HyDE to close the query-document vocabulary gap.
  • Low faithfulness, unsupported claims: CRAG to grade context before generation, or Self-RAG if you can fine-tune.
  • Uneven latency, retrieving when you should not: Adaptive RAG to route by query complexity.
  • Multi-hop or thematic questions a single pass cannot answer: Agentic RAG, GraphRAG, or RAPTOR, accepting the cost and operational overhead.
  • None of the above failing on real queries: stay on Simple RAG and tune chunking and memory.

Combinations follow the same logic: layer patterns only when they act on different stages of the pipeline. Agentic plus Branched RAG when an agent must consult several distinct sources per query; Branched plus HyDE to route then search precisely; Adaptive plus CRAG to retrieve selectively and grade what returns. Avoid stacking patterns that solve the same problem, such as adding external Adaptive or CRAG logic on top of Self-RAG.

The architecture that resolves your measured bottleneck always beats the one that sounds most advanced. Measure first, add complexity only against evidence, and you will spend your latency and cost budget where it actually buys accuracy.

References

Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023). Self-RAG: Learning to retrieve, generate, and critique through self-reflection. arXiv. https://arxiv.org/abs/2310.11511

Chen, J., Lin, H., Han, X., & Sun, L. (2023). Benchmarking large language models in retrieval-augmented generation. arXiv. https://arxiv.org/abs/2309.01431

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., ... & Larson, J. (2024). From local to global: A graph RAG approach to query-focused summarization. arXiv. https://arxiv.org/abs/2404.16130

Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAS: Automated evaluation of retrieval augmented generation. arXiv. https://arxiv.org/abs/2309.15217

Gao, L., Ma, X., Lin, J., & Callan, J. (2022). Precise zero-shot dense retrieval without relevance labels. arXiv. https://arxiv.org/abs/2212.10496

Humanloop. (2025, February 1). RAG architectures. Humanloop. https://humanloop.com/blog/rag-architectures

Izacard, G., & Grave, E. (2021). Leveraging passage retrieval with generative models for open domain question answering. arXiv. https://arxiv.org/abs/2007.01282

Jeong, S., Baek, J., Cho, S., Hwang, S. J., & Park, J. C. (2024). Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. arXiv. https://arxiv.org/abs/2403.14403

LangChain. (2024). How to route between sub-chains. LangChain. https://python.langchain.com/docs/how_to/routing/

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv. https://arxiv.org/abs/2005.11401

Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023). When not to trust language models: Investigating the effectiveness of parametric and non-parametric memories. arXiv. https://arxiv.org/abs/2212.10511

Ram, O., Levine, Y., Dalmedigos, I., Muhlgay, D., Shashua, A., Leyton-Brown, K., & Shoham, Y. (2023). In-context retrieval-augmented language models. arXiv. https://arxiv.org/abs/2302.00083

Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., & Manning, C. D. (2024). RAPTOR: Recursive abstractive processing for tree-organized retrieval. arXiv. https://arxiv.org/abs/2401.18059

Yan, S.-Q., Gu, J.-C., Zhu, Y., & Ling, Z.-H. (2024). Corrective retrieval augmented generation. arXiv. https://arxiv.org/abs/2401.15884