
RAG Fusion

1. Why does this topic exist?

Multi-Query Retriever (Chapter 7) generates N paraphrases and merges results via set union — losing rank information. RAG Fusion improves this by preserving and combining ranks across retrievals.

Industry pain example: A 2023 RAG system used Multi-Query. It returned the right docs but in effectively arbitrary order: the LLM saw the most relevant chunk at rank 5 and less relevant chunks at ranks 1-2, so answer quality was uneven. Switching the merge to RRF (Reciprocal Rank Fusion) made quality consistent.

The intuition: documents that rank high in MULTIPLE queries are probably more relevant than those ranking high in one.


2. What is it?

Simple explanation

RAG Fusion = Multi-Query Retriever + smart rank-based merging (RRF). It rewards documents that show up consistently across many paraphrases.

Technical explanation

RAG Fusion is a retrieval pattern that:

  1. Generates sub-queries from the user query (via LLM).
  2. Runs the base retriever on each sub-query.
  3. Merges the resulting ranked lists using Reciprocal Rank Fusion (RRF).

Mental model

Imagine 3 expert recommenders each give you a top-5 list of restaurants. A restaurant that appears in all 3 lists (even if not #1 in any) is probably better than one that's #1 in just one list.

Analogy

Olympic gymnastics scoring: multiple judges, ranks aggregated to reduce single-judge bias.


3. How does it work?

The full workflow

flowchart LR
    Q[User Query] --> LLM[LLM]
    LLM --> SQ1[Sub-Query 1]
    LLM --> SQ2[Sub-Query 2]
    LLM --> SQ3[Sub-Query 3]
    SQ1 --> R[Retriever]
    SQ2 --> R
    SQ3 --> R
    R --> L1[Ranked list 1]
    R --> L2[Ranked list 2]
    R --> L3[Ranked list 3]
    L1 --> RRF[Reciprocal Rank Fusion]
    L2 --> RRF
    L3 --> RRF
    RRF --> FINAL[Final ranked docs]

Reciprocal Rank Fusion — the math

For document \(d\) appearing across \(N\) ranked lists:

\[ \text{RRF}(d) = \sum_{i=1}^{N} \frac{1}{k + \text{rank}_i(d)} \]
  • \(\text{rank}_i(d)\) — position of \(d\) in list \(i\) (1-indexed; lower rank = better)
  • \(k\) — smoothing constant (typically 60, from the original 2009 paper)
  • If \(d\) doesn't appear in list \(i\), contribution is 0
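The formula maps almost line for line onto code. The following is a minimal, dependency-free sketch; the function name `rrf_scores` and the use of plain string IDs are illustrative assumptions, not part of any library.

```python
from collections import defaultdict


def rrf_scores(ranked_lists, k=60):
    """Return {doc_id: RRF score} for a list of ranked lists of doc IDs."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        # rank is 1-indexed, matching the definition of rank_i(d)
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # A doc absent from a list simply receives no contribution from it.
    return dict(scores)


scores = rrf_scores([["x", "y"], ["y", "z"]])
print(sorted(scores, key=scores.get, reverse=True))  # ['y', 'x', 'z']
```

Here `y` wins because it is the only document that appears in both lists.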

Why k? Without smoothing, rank 1 contributes 1.0 and rank 2 contributes 0.5 — a massive 2× gap. With k=60, rank 1 contributes 1/61 ≈ 0.0164 and rank 2 contributes 1/62 ≈ 0.0161 — gentler gradient.
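To see the effect of the smoothing constant, the sketch below compares a document that is rank 1 in a single list against one that is rank 3 in two lists, for a few values of k. The helper name `contribution` and the chosen k values are illustrative assumptions.

```python
def contribution(rank, k):
    """Score contributed by one appearance at the given 1-indexed rank."""
    return 1.0 / (k + rank)


for k in (0, 10, 60):
    ratio = contribution(1, k) / contribution(2, k)
    single_top = contribution(1, k)      # rank 1 in one list only
    consensus = 2 * contribution(3, k)   # rank 3 in two lists
    winner = "consensus" if consensus > single_top else "single top hit"
    print(f"k={k:>2}  rank1/rank2={ratio:.3f}  winner={winner}")
```

With k=0 the single top hit outscores the consensus document; with k=10 or k=60 the consensus document wins, which is the behavior RAG Fusion relies on.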

Worked example

Three retrievals (one per sub-query) return:

| List | Rank 1 | Rank 2 | Rank 3 |
|------|--------|--------|--------|
| L1   | A      | B      | C      |
| L2   | B      | D      | A      |
| L3   | A      | E      | B      |

With k=60:

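| Doc | L1 score      | L2 score      | L3 score      | Total  |
|-----|---------------|---------------|---------------|--------|
| A   | 1/61 = 0.0164 | 1/63 = 0.0159 | 1/61 = 0.0164 | 0.0487 |
| B   | 1/62 = 0.0161 | 1/61 = 0.0164 | 1/63 = 0.0159 | 0.0484 |
| C   | 1/63 = 0.0159 | 0             | 0             | 0.0159 |
| D   | 0             | 1/62 = 0.0161 | 0             | 0.0161 |
| E   | 0             | 0             | 1/62 = 0.0161 | 0.0161 |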

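Final: A > B > D = E > C. A and B both appear in all three lists, so they far outscore the single-list documents; A edges out B because it holds rank 1 twice while B holds it once. D and E tie exactly at 1/62.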
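The worked example can be checked with a few lines of plain Python. This is a verification sketch only; the variable names are illustrative.

```python
from collections import defaultdict

ranked_lists = {
    "L1": ["A", "B", "C"],
    "L2": ["B", "D", "A"],
    "L3": ["A", "E", "B"],
}
k = 60

totals = defaultdict(float)
for docs in ranked_lists.values():
    for rank, doc in enumerate(docs, start=1):
        totals[doc] += 1.0 / (k + rank)

# Sort by score (descending), breaking ties alphabetically for stable output.
for doc, score in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
    print(doc, round(score, 4))
# A 0.0487
# B 0.0484
# D 0.0161
# E 0.0161
# C 0.0159
```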

Code

from collections import defaultdict
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field


class SubQuerySchema(BaseModel):
    sub_queries: list[str] = Field(..., description="Generated paraphrases.")


class RAGFusion:
    def __init__(self, retriever: BaseRetriever, llm_chain, n=3, k=5, rrf_k=60):
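        self.retriever = retriever
        self.llm_chain = llm_chain
        self.n = n          # number of paraphrases to generate
        self.k = k          # number of fused documents to return
        self.rrf_k = rrf_k  # RRF smoothing constant (the k in the formula)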

    @classmethod
    def from_llm(cls, llm, retriever, n=3, k=5, rrf_k=60):
        # n is bound once via partial(), so callers only supply the query.
        prompt = ChatPromptTemplate.from_messages([
            ("system",
             "Generate {n} different short paraphrases of the user's query. "
             "Return JSON: {{\"sub_queries\": [\"...\", \"...\", \"...\"]}}."),
            ("human", "{query}"),
        ]).partial(n=n)
        chain = prompt | llm.with_structured_output(SubQuerySchema)
        # Pass rrf_k through so the smoothing constant is configurable here too.
        return cls(retriever, chain, n, k, rrf_k)

    def _gen(self, query):
        r = self.llm_chain.invoke({"query": query})
        # Keep the original query as an anchor and cap the paraphrases at n.
        return [query] + list(r.sub_queries)[: self.n]

    def _retrieve(self, queries):
        return [self.retriever.invoke(q) for q in queries]

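    def _rrf(self, ranked_lists):
        scores = defaultdict(float)
        lookup = {}
        for rl in ranked_lists:
            seen = set()  # count each doc once per list, at its best rank
            for rank, doc in enumerate(rl, start=1):
                # page_content is a simple dedup key; prefer a stable chunk ID
                # from metadata when one is available.
                key = doc.page_content
                if key in seen:
                    continue
                seen.add(key)
                scores[key] += 1.0 / (self.rrf_k + rank)
                lookup[key] = doc
        # Highest fused score first, then keep the top self.k documents.
        sorted_keys = sorted(scores, key=scores.get, reverse=True)
        return [lookup[x] for x in sorted_keys[: self.k]]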

    def invoke(self, query):
        subqueries = self._gen(query)
        return self._rrf(self._retrieve(subqueries))

4. Visual Learning

Architecture

flowchart LR
    subgraph EXP[Query Expansion]
        Q[Query] --> SUB[LLM generates N sub-queries]
    end
    subgraph PAR[Parallel Retrieval]
        SUB --> R1[Retriever query 1]
        SUB --> R2[Retriever query 2]
        SUB --> R3[Retriever query 3]
    end
    subgraph MERGE[Rank Fusion]
        R1 --> RRF[RRF]
        R2 --> RRF
        R3 --> RRF
    end
    RRF --> FINAL[Top-k unified]

Sequence — RAG Fusion call

sequenceDiagram
    participant App
    participant LLM as LLM expander
    participant R as Retriever
    participant RRF as Fusion
    App->>LLM: generate N paraphrases
    LLM-->>App: [q1, q2, q3]
    par
        App->>R: retrieve q1
        R-->>App: list1
    and
        App->>R: retrieve q2
        R-->>App: list2
    and
        App->>R: retrieve q3
        R-->>App: list3
    end
    App->>RRF: merge ranked lists
    RRF-->>App: top-k
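The `par` block in the sequence diagram can be approximated with a thread pool, since retrieval calls are typically I/O-bound. This is a sketch under assumptions: `retrieve_all` and `fake_retrieve` are hypothetical names, and `fake_retrieve` stands in for a real retriever call.

```python
from concurrent.futures import ThreadPoolExecutor


def retrieve_all(retrieve, queries, max_workers=4):
    """Run retrieve(query) for every query concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as `queries`.
        return list(pool.map(retrieve, queries))


# Hypothetical stand-in for a real retriever call (e.g. a vector-store lookup).
def fake_retrieve(query):
    return [f"{query}-doc{i}" for i in range(1, 4)]


lists = retrieve_all(fake_retrieve, ["q1", "q2", "q3"])
print(lists[0])  # ['q1-doc1', 'q1-doc2', 'q1-doc3']
```

Because the result order matches the query order, the ranked lists can be handed straight to the fusion step.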

Multi-Query vs RAG Fusion

flowchart LR
    subgraph MQ[Multi-Query]
        A1[Set union] --> R1[Deduped chunks - no rank info]
    end
    subgraph RF[RAG Fusion]
        A2[RRF] --> R2[Ranked top-k - rank preserved]
    end

5-7. Pros / Cons / Trade-offs

| Pros | Cons | Trade-off |
|------|------|-----------|
| Preserves rank signal | Extra LLM call plus N× retrievals per query | Use cheap fast model |
| Surfaces consensus docs | Latency overhead | Cache common queries |
| Robust to phrasing | Adds dependency on expander LLM | Tune k, n_subqueries |
| Easy to bolt onto existing retrievers | RRF parameters need tuning | Defaults often good (k=60) |

Multi-Query vs RAG Fusion:

|  | Multi-Query | RAG Fusion |
|--|-------------|------------|
| Generate paraphrases | Yes | Yes |
| Retrieve per paraphrase | Yes | Yes |
| Merge results | Set union (dedup) | RRF (rank-aware) |
| Rank info preserved | No | Yes |

RAG Fusion is almost strictly better — same LLM cost, better merging.
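The difference between the two merges is easiest to see on the same input. The sketch below uses made-up document IDs, and it assumes the set-union merge keeps first-seen order, which is one plausible behavior rather than a guarantee of any particular library.

```python
from collections import defaultdict

# Hypothetical results for three paraphrases of one query.
results = [
    ["d7", "d2", "d9"],
    ["d4", "d2", "d7"],
    ["d2", "d5", "d7"],
]

# Multi-Query style merge: deduplicate, keeping first-seen order (assumption).
union = list(dict.fromkeys(doc for lst in results for doc in lst))

# RAG Fusion style merge: sum 1/(k + rank) per document.
k = 60
scores = defaultdict(float)
for lst in results:
    for rank, doc in enumerate(lst, start=1):
        scores[doc] += 1.0 / (k + rank)
fused = sorted(scores, key=lambda d: (-scores[d], d))

print("union:", union)  # ['d7', 'd2', 'd9', 'd4', 'd5']
print("fused:", fused)  # ['d2', 'd7', 'd4', 'd5', 'd9']
```

The union places `d9` third only because it was seen early, while RRF moves the consistently retrieved `d2` to the top and pushes `d9` to the bottom.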


8. Real-world Industry Usage

  • Anthropic Contextual Retrieval pipeline includes RRF-style merging.
  • Cohere documents RRF as the default ensemble strategy.
  • Elastic and Vespa support RRF natively.
  • OpenAI Assistants internal retrieval uses fusion-style merging.

When teams use RAG Fusion

| Use case | Why |
|----------|-----|
| Multi-faceted queries | Sub-queries surface different aspects |
| Long-form Q&A | Need broad recall |
| Comparative questions | "Compare X and Y" naturally decomposes |
| When Multi-Query gives wrong-ranked results | Switch to RAG Fusion |

9. Interview Questions

Beginner

  1. What's RAG Fusion? — Multi-query expansion + RRF merge.
  2. What's RRF? — Sum of 1/(k + rank) across retrievals; rewards docs ranking high in multiple lists.
  3. Why not just set union? — Loses rank info; the LLM sees random order.

Intermediate

  1. What's k for in RRF? — Smoothing — prevents top rank from dominating.
  2. Multi-Query vs RAG Fusion? — Same expansion; different merge. RRF wins.
  3. Tune n sub-queries? — Eval on a set; usually 3-5.

Advanced

  1. What if all paraphrases retrieve the same chunks? — RRF degenerates to ranking those chunks. Acceptable but no improvement over single-query in that case.
  2. Why use original query as one of the sub-queries? — Anchors retrieval to user's exact phrasing.
  3. Score-based fusion (CombSum) vs RRF — when which? — Score-based when retrievers' score distributions are comparable. RRF is safer when they're not (e.g., dense + sparse).
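The scale problem behind the CombSum vs RRF answer can be illustrated with two small runs. The scores below are invented for illustration, and `comb_sum` here adds raw scores without normalization, which is the naive variant rather than a tuned implementation.

```python
from collections import defaultdict

# Hypothetical scored results: (doc_id, score), best first.
dense = [("a", 0.82), ("b", 0.80), ("c", 0.41)]  # bounded, cosine-like scale
sparse = [("c", 14.2), ("b", 9.1), ("d", 3.3)]   # BM25-like, unbounded


def comb_sum(*runs):
    """Score-based fusion: add raw scores per document."""
    total = defaultdict(float)
    for run in runs:
        for doc, score in run:
            total[doc] += score
    return sorted(total, key=lambda d: (-total[d], d))


def rrf(*runs, k=60):
    """Rank-based fusion: ignore scores, use positions only."""
    total = defaultdict(float)
    for run in runs:
        for rank, (doc, _score) in enumerate(run, start=1):
            total[doc] += 1.0 / (k + rank)
    return sorted(total, key=lambda d: (-total[d], d))


print("CombSum:", comb_sum(dense, sparse))  # ['c', 'b', 'd', 'a']
print("RRF:    ", rrf(dense, sparse))       # ['c', 'b', 'a', 'd']
```

With raw CombSum the sparse scores dominate, so the dense retriever's top hit `a` lands last; RRF keeps it ahead of `d`, which was only third in one list.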

System design

  1. Design RAG Fusion at 1000 QPS. — Cache common query → sub-queries expansions. Batch sub-query embedding. Run retrievals in parallel. Keep the RRF computation in the app, not in the vector store.
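The "cache common expansions" point can be sketched with an in-process cache. This is a minimal illustration: `expand_with_llm` is a hypothetical stand-in for the real LLM call, and a production system at this load would more likely use a shared cache than `functools.lru_cache`.

```python
from functools import lru_cache

calls = {"llm": 0}


def expand_with_llm(query):
    """Hypothetical stand-in for the LLM expansion call."""
    calls["llm"] += 1
    return (query, f"{query} (rephrased)", f"what is {query}")


@lru_cache(maxsize=10_000)
def cached_expand(normalized_query):
    # Returns a tuple so the cached value cannot be mutated by callers.
    return expand_with_llm(normalized_query)


def expand(query):
    # Normalize so trivially different spellings share one cache entry.
    return cached_expand(" ".join(query.lower().split()))


expand("RAG  Fusion")
expand("rag fusion")
print(calls["llm"])  # 1
```

Both calls normalize to the same key, so the expander runs only once.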

10. Common Mistakes

  • ❌ Sub-query count > 5: diminishing returns, linear cost.
  • ❌ Not including the original query — losing the anchor.
  • ❌ Using unstable keys for dedup (for example, keying on page_content when chunks with identical text carry different metadata). Prefer a stable chunk ID.
  • ❌ Using expensive LLM for sub-query generation (use mini/haiku).
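For the dedup-key mistake, one possible approach is to derive the key from an explicit ID when present and fall back to a hash of source plus content. The metadata field names `chunk_id` and `source` are assumptions for this sketch, not a standard.

```python
import hashlib


def stable_key(page_content, metadata):
    """Prefer an explicit chunk ID; otherwise hash source + content."""
    # "chunk_id" and "source" are assumed metadata fields, not a standard.
    if "chunk_id" in metadata:
        return str(metadata["chunk_id"])
    raw = f"{metadata.get('source', '')}\x00{page_content}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


same_text = "Refunds are processed within 5 days."
key_a = stable_key(same_text, {"source": "policy_v1.pdf"})
key_b = stable_key(same_text, {"source": "policy_v2.pdf"})
print(key_a != key_b)  # True
```

Keying on page_content alone would have merged these two chunks even though they come from different documents.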

11. Best Practices

  • Default n_subqueries=3, rrf_k=60, final_k=5.
  • Use structured output (Pydantic schema) for sub-query generation.
  • Cache the LLM expansion step per query.
  • Combine with reranking for max quality (RAG Fusion broadens recall, reranker tightens precision).
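The last practice, fusing for recall and then reranking for precision, could look roughly like this. The `overlap_score` function is a toy stand-in for a real reranker such as a cross-encoder, and all names here are hypothetical.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order fused candidates by a (query, text) relevance score."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # list.sort is stable, so ties keep their fused order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _score, text in scored[:top_n]]


def overlap_score(query, text):
    # Toy stand-in for a cross-encoder: fraction of query words found in text.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)


fused = [
    "rrf merges ranked lists",
    "bm25 is a sparse scorer",
    "rrf uses rank not score",
]
print(rerank("how rrf uses rank", fused, overlap_score, top_n=2))
# ['rrf uses rank not score', 'rrf merges ranked lists']
```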

12. Evolution Story

flowchart LR
    A[Single-query retrieval<br/>misses paraphrases] --> B[Multi-Query<br/>paraphrase coverage]
    B --> C[Set union merge<br/>loses rank signal]
    C --> D[RAG Fusion + RRF<br/>preserves rank]
    D --> E[+ Reranking<br/>RRF for recall, rerank for precision]

Next: HyDE — a different angle on query expansion. Instead of generating PARAPHRASES, HyDE generates a hypothetical ANSWER, then searches for chunks similar to the answer.


Practice

What does this print?

Expected: 0.04866

score = 1/(60+1) + 1/(60+3) + 1/(60+1)
print(round(score, 5))

Compute RRF correctly (divide by k + rank, not just rank)

Expected: True

rank = 1
k = 60
naive = 1 / rank                # bug
correct = 1 / (k + rank)
print(naive > correct * 10)

Quiz — Quick check

What you remember

Q1. Core RRF formula?

  • sum of 1/(k + rank) across lists
  • sum of ranks
  • max of ranks
  • 1/rank

Q2. What does k=60 do?

  • Smooths the contribution curve; prevents rank-1 dominance
  • Number of retrievers
  • Number of docs returned
  • Learning rate

Q3. RAG Fusion vs Multi-Query?

  • Same expansion; RAG Fusion uses RRF, Multi-Query uses set union
  • Different LLMs
  • Different vector stores
  • No difference

Common doubts

Should I always use RAG Fusion over Multi-Query?

In most cases, yes: the LLM cost is the same and the merge is rank-aware. The main reason to keep Multi-Query (set union) is that downstream code expects dedup-only behavior.

Can RRF combine dense + sparse retrievals?

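Yes, and it is a natural fit. Fusing raw scores is fragile here because dense similarity scores (bounded, often within [0, 1]) and BM25 scores (unbounded) live on different scales, so one retriever dominates unless the scores are normalized. RRF uses ranks, which are scale-free. The EnsembleRetriever in LangChain uses weighted RRF under the hood.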
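A weighted variant of RRF for a dense + sparse pair might look like the sketch below. It illustrates the idea only and is not LangChain's actual implementation; the function name, weights, and document IDs are assumptions.

```python
from collections import defaultdict


def weighted_rrf(runs, weights, k=60):
    """Fuse ranked lists of doc IDs, scaling each list by its weight."""
    if len(runs) != len(weights):
        raise ValueError("need one weight per ranked list")
    scores = defaultdict(float)
    for run, weight in zip(runs, weights):
        for rank, doc in enumerate(run, start=1):
            scores[doc] += weight / (k + rank)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_ids = ["a", "b", "c"]  # hypothetical dense-retriever ranking
bm25_ids = ["d", "c", "a"]   # hypothetical BM25 ranking

print(weighted_rrf([dense_ids, bm25_ids], weights=[0.5, 0.5]))
print(weighted_rrf([dense_ids, bm25_ids], weights=[0.8, 0.2]))
```

With equal weights, the BM25-only hit `d` ranks above the dense-only hit `b`; shifting weight toward the dense list swaps them, while the documents found by both retrievers stay on top.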
