RAG Fusion
1. Why does this topic exist?
Multi-Query Retriever (Chapter 7) generates N paraphrases and merges results via set union — losing rank information. RAG Fusion improves this by preserving and combining ranks across retrievals.
Industry pain example: A 2023 RAG system used Multi-Query. It returned the right docs but in arbitrary order: the LLM saw the most relevant chunk at rank 5 and less relevant chunks at ranks 1-2, so answer quality was uneven. Switching the merge to Reciprocal Rank Fusion (RRF) made quality consistent.
The intuition: documents that rank high in MULTIPLE queries are probably more relevant than those ranking high in one.
2. What is it?
Simple explanation
RAG Fusion = Multi-Query Retriever + smart rank-based merging (RRF). It rewards documents that show up consistently across many paraphrases.
Technical explanation
RAG Fusion is a retrieval pattern that:
1. Generates sub-queries from the user query (via LLM).
2. Runs the base retriever on each sub-query.
3. Merges the resulting ranked lists using Reciprocal Rank Fusion (RRF).
Mental model
Imagine 3 expert recommenders each give you a top-5 list of restaurants. A restaurant that appears in all 3 lists (even if not #1 in any) is probably better than one that's #1 in just one list.
Analogy
Olympic gymnastics scoring: multiple judges score each routine, and their marks are aggregated to reduce single-judge bias.
3. How does it work?
The full workflow
```mermaid
flowchart LR
    Q[User Query] --> LLM[LLM]
    LLM --> SQ1[Sub-Query 1]
    LLM --> SQ2[Sub-Query 2]
    LLM --> SQ3[Sub-Query 3]
    SQ1 --> R[Retriever]
    SQ2 --> R
    SQ3 --> R
    R --> L1[Ranked list 1]
    R --> L2[Ranked list 2]
    R --> L3[Ranked list 3]
    L1 --> RRF[Reciprocal Rank Fusion]
    L2 --> RRF
    L3 --> RRF
    RRF --> FINAL[Final ranked docs]
```
Reciprocal Rank Fusion: the math
For document \(d\) appearing across \(N\) ranked lists, the fused score is:
\[\text{RRF}(d) = \sum_{i=1}^{N} \frac{1}{k + \text{rank}_i(d)}\]
where:
- \(\text{rank}_i(d)\) — position of \(d\) in list \(i\) (1-indexed; lower rank = better)
- \(k\) — smoothing constant (typically 60, from the original 2009 paper)
- If \(d\) doesn't appear in list \(i\), contribution is 0
Why k? Without smoothing, rank 1 contributes 1.0 and rank 2 contributes 0.5 — a massive 2× gap. With k=60, rank 1 contributes 1/61 ≈ 0.0164 and rank 2 contributes 1/62 ≈ 0.0161 — gentler gradient.
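The effect of the smoothing constant can be checked numerically. The snippet below is a minimal sketch; `rrf_contribution` is a hypothetical helper name used only for illustration, not part of any library.

```python
def rrf_contribution(rank: int, k: int = 60) -> float:
    """Contribution of one ranked list to a document's RRF score (rank is 1-indexed)."""
    return 1.0 / (k + rank)


# Compare the gap between rank 1 and rank 2 with and without smoothing.
for k in (0, 60):
    top = rrf_contribution(1, k)
    second = rrf_contribution(2, k)
    print(f"k={k:>2}: rank 1 = {top:.4f}, rank 2 = {second:.4f}, ratio = {top / second:.3f}")
```

With k=0 the rank-1 contribution is twice the rank-2 contribution; with k=60 the ratio drops to about 1.016, so agreement across lists matters more than any single top position.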
Worked example
Three retrievals (one per sub-query) return:
| List | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| L1 | A | B | C |
| L2 | B | D | A |
| L3 | A | E | B |
With k=60:
| Doc | L1 score | L2 score | L3 score | Total |
|---|---|---|---|---|
| A | 1/61 = 0.0164 | 1/63 = 0.0159 | 1/61 = 0.0164 | 0.0487 |
| B | 1/62 = 0.0161 | 1/61 = 0.0164 | 1/63 = 0.0159 | 0.0484 |
| C | 1/63 = 0.0159 | 0 | 0 | 0.0159 |
| D | 0 | 1/62 = 0.0161 | 0 | 0.0161 |
| E | 0 | 0 | 1/62 = 0.0161 | 0.0161 |
Final: A > B > D = E > C. A and B both appear in all three lists; A edges ahead because it ranks first in two of them, while B ranks first in only one.
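The table can be reproduced in a few lines of plain Python. This is a sketch that uses bare document IDs instead of real documents; `rrf_scores` is a hypothetical helper name.

```python
from collections import defaultdict


def rrf_scores(ranked_lists: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) for each document over every list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return dict(scores)


# The three ranked lists from the worked example.
lists = [["A", "B", "C"], ["B", "D", "A"], ["A", "E", "B"]]
scores = rrf_scores(lists)

for doc_id, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(doc_id, round(score, 4))
```

The printed order is A, B, D, E, C. D and E have identical scores, so their relative order here only reflects which one was seen first.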
Code
```python
from collections import defaultdict

from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.retrievers import BaseRetriever
from pydantic import BaseModel, Field


class SubQuerySchema(BaseModel):
    sub_queries: list[str] = Field(..., description="Generated paraphrases.")


class RAGFusion:
    def __init__(self, retriever: BaseRetriever, llm_chain, n=3, k=5, rrf_k=60):
        self.retriever = retriever
        self.llm_chain = llm_chain
        self.n = n          # number of paraphrases to generate
        self.k = k          # number of fused documents to return
        self.rrf_k = rrf_k  # RRF smoothing constant

    @classmethod
    def from_llm(cls, llm, retriever, n=3, k=5):
        prompt = ChatPromptTemplate.from_messages([
            ("system",
             "Generate {n} different short paraphrases of the user's query. "
             "Return JSON: {{\"sub_queries\": [\"...\", \"...\", \"...\"]}}."),
            ("human", "{query}"),
        ]).partial(n=n)
        chain = prompt | llm.with_structured_output(SubQuerySchema)
        return cls(retriever, chain, n, k)

    def _gen(self, query: str) -> list[str]:
        r = self.llm_chain.invoke({"query": query})
        # Keep the original query as an anchor alongside the paraphrases.
        return [query] + r.sub_queries

    def _retrieve(self, queries: list[str]) -> list[list[Document]]:
        # One ranked list per query, in the same order as `queries`.
        return [self.retriever.invoke(q) for q in queries]

    def _rrf(self, ranked_lists: list[list[Document]]) -> list[Document]:
        scores = defaultdict(float)
        lookup = {}
        for rl in ranked_lists:
            for rank, doc in enumerate(rl, start=1):
                # Dedup key is the raw text; prefer a stable document ID if one exists.
                key = doc.page_content
                scores[key] += 1.0 / (self.rrf_k + rank)
                lookup[key] = doc
        sorted_keys = sorted(scores, key=scores.get, reverse=True)
        return [lookup[x] for x in sorted_keys[: self.k]]

    def invoke(self, query: str) -> list[Document]:
        subqueries = self._gen(query)
        return self._rrf(self._retrieve(subqueries))
```
4. Visual Learning
Architecture
```mermaid
flowchart LR
    subgraph EXP[Query Expansion]
        Q[Query] --> SUB[LLM generates N sub-queries]
    end
    subgraph PAR[Parallel Retrieval]
        SUB --> R1[Retriever query 1]
        SUB --> R2[Retriever query 2]
        SUB --> R3[Retriever query 3]
    end
    subgraph MERGE[Rank Fusion]
        R1 --> RRF[RRF]
        R2 --> RRF
        R3 --> RRF
    end
    RRF --> FINAL[Top-k unified]
```
Sequence: RAG Fusion call
```mermaid
sequenceDiagram
    participant App
    participant LLM as LLM expander
    participant R as Retriever
    participant RRF as Fusion
    App->>LLM: generate N paraphrases
    LLM-->>App: [q1, q2, q3]
    par
        App->>R: retrieve q1
        R-->>App: list1
    and
        App->>R: retrieve q2
        R-->>App: list2
    and
        App->>R: retrieve q3
        R-->>App: list3
    end
    App->>RRF: merge ranked lists
    RRF-->>App: top-k
```
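The parallel retrieval step in the sequence can be sketched with a thread pool, which suits I/O-bound retriever calls. Both `retrieve_all` and `fake_retrieve` are hypothetical names; `fake_retrieve` merely stands in for a real retriever call such as a vector-store lookup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def retrieve_all(
    queries: list[str],
    retrieve: Callable[[str], list[str]],
    max_workers: int = 8,
) -> list[list[str]]:
    """Run `retrieve` on every query concurrently; results keep the order of `queries`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(retrieve, queries))


def fake_retrieve(query: str) -> list[str]:
    """Hypothetical stand-in for a real retriever: returns three ranked doc IDs."""
    return [f"{query}-doc{i}" for i in range(1, 4)]


ranked_lists = retrieve_all(["q1", "q2", "q3"], fake_retrieve)
print(ranked_lists[0])
```

Because `pool.map` returns results in input order, each ranked list still lines up with the sub-query that produced it, which is all the fusion step needs.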
Multi-Query vs RAG Fusion
```mermaid
flowchart LR
    subgraph MQ[Multi-Query]
        A1[Set union] --> R1[Deduped chunks - no rank info]
    end
    subgraph RF[RAG Fusion]
        A2[RRF] --> R2[Ranked top-k - rank preserved]
    end
```
5-7. Pros / Cons / Trade-offs
| Pros | Cons | Trade-off |
|---|---|---|
| Preserves rank signal | One extra LLM call plus N× retrieval calls per query | Use cheap fast model |
| Surfaces consensus docs | Latency overhead | Cache common queries |
| Robust to phrasing | Adds dependency on expander LLM | Tune k, n_subqueries |
| Easy to bolt onto existing retrievers | RRF parameters need tuning | Defaults often good (k=60) |
Multi-Query vs RAG Fusion:
| | Multi-Query | RAG Fusion |
|---|---|---|
| Generate paraphrases | ✅ | ✅ |
| Retrieve per paraphrase | ✅ | ✅ |
| Merge results | Set union (dedup) | RRF (rank-aware) |
| Rank info preserved | ❌ | ✅ |
RAG Fusion is almost strictly better — same LLM cost, better merging.
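The difference between the two merge strategies shows up on the same input. In this sketch, `union_merge` is one simple way to implement a deduplicating union (first-seen order), which is an assumption about how the union is built rather than a specific library's behavior; `rrf_merge` follows the RRF formula with k=60.

```python
from collections import defaultdict

# Same ranked lists as in the worked example.
ranked_lists = [["A", "B", "C"], ["B", "D", "A"], ["A", "E", "B"]]


def union_merge(lists: list[list[str]]) -> list[str]:
    """Multi-Query style: deduplicate in first-seen order, ignoring rank agreement."""
    seen: set[str] = set()
    merged: list[str] = []
    for ranked in lists:
        for doc_id in ranked:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def rrf_merge(lists: list[list[str]], k: int = 60) -> list[str]:
    """RAG Fusion style: order documents by summed 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranked in lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


print(union_merge(ranked_lists))
print(rrf_merge(ranked_lists))
```

The union keeps C in third place only because it happened to appear in the first list, while RRF moves it to last because no other list supports it.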
8. Real-world Industry Usage
- Anthropic Contextual Retrieval pipeline includes RRF-style merging.
- Cohere documents RRF as the default ensemble strategy.
- Elastic and Vespa support RRF natively.
- OpenAI Assistants internal retrieval uses fusion-style merging.
When teams use RAG Fusion
| Use case | Why |
|---|---|
| Multi-faceted queries | Sub-queries surface different aspects |
| Long-form Q&A | Need broad recall |
| Comparative questions | "Compare X and Y" — naturally decomposes |
| When Multi-Query gives wrong-ranked results | Switch to RAG Fusion |
9. Interview Questions
Beginner
- What's RAG Fusion? — Multi-query expansion + RRF merge.
- What's RRF? — Sum of `1/(k + rank)` across retrievals; rewards docs ranking high in multiple lists.
- Why not just set union? — Loses rank info; the LLM sees random order.
Intermediate
- What's `k` for in RRF? — Smoothing: prevents the top rank from dominating.
- Multi-Query vs RAG Fusion? — Same expansion; different merge. RRF wins.
- Tune `n` sub-queries? — Eval on a set; usually 3-5.
Advanced
- What if all paraphrases retrieve the same chunks? — RRF degenerates to ranking those chunks. Acceptable but no improvement over single-query in that case.
- Why use original query as one of the sub-queries? — Anchors retrieval to user's exact phrasing.
- Score-based fusion (CombSum) vs RRF — when which? — Score-based when retrievers' score distributions are comparable. RRF is safer when they're not (e.g., dense + sparse).
System design
- Design RAG Fusion at 1000 QPS. — Cache common (query → sub-queries) expansions. Batch sub-query embedding. Run retrievals in parallel. Compute RRF in the application layer, not in the vector store.
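The caching idea can be sketched with an in-process LRU cache keyed on a normalized query. Everything here is illustrative: `expand_with_llm` is a hypothetical placeholder for the real expansion call, and a deployment at this scale would more likely use a shared cache across replicas, which is an assumption beyond the text.

```python
from functools import lru_cache

# Counts how often the (expensive) expansion is actually executed.
calls = {"count": 0}


def expand_with_llm(query: str) -> tuple[str, ...]:
    """Hypothetical placeholder for the real LLM expansion call."""
    calls["count"] += 1
    return (f"{query} (paraphrase 1)", f"{query} (paraphrase 2)")


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a cache entry."""
    return " ".join(query.lower().split())


@lru_cache(maxsize=10_000)
def _cached_expand(normalized_query: str) -> tuple[str, ...]:
    return expand_with_llm(normalized_query)


def cached_sub_queries(query: str) -> tuple[str, ...]:
    return _cached_expand(normalize(query))


cached_sub_queries("What is RRF?")
cached_sub_queries("  what is   rrf? ")  # same normalized key, served from the cache
print(calls["count"])
```

Returning tuples rather than lists keeps cached values immutable, so one caller cannot accidentally modify what another caller receives.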
10. Common Mistakes
- ❌ Sub-query count > 5: diminishing returns, linear cost.
- ❌ Not including the original query — losing the anchor.
- ❌ Using unstable IDs for dedup (e.g. keying on `page_content` when chunks with the same text carry different metadata).
- ❌ Using an expensive LLM for sub-query generation (use mini/haiku).
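One way to avoid the dedup pitfall is to derive the fusion key from a stable identifier instead of raw text. The sketch below uses a small `Doc` dataclass as a stand-in for a retrieved chunk; the metadata keys `id` and `source` are assumptions for illustration and should be replaced by whatever your ingestion pipeline actually stores.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved chunk (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def doc_key(doc: Doc) -> str:
    """Prefer an explicit ID; otherwise hash the source together with the text."""
    if "id" in doc.metadata:
        return str(doc.metadata["id"])
    source = str(doc.metadata.get("source", ""))
    digest = hashlib.sha256(f"{source}\x00{doc.page_content}".encode("utf-8"))
    return digest.hexdigest()


a = Doc("same text", {"source": "a.pdf"})
b = Doc("same text", {"source": "b.pdf"})
print(doc_key(a) == doc_key(b))  # False: identical text, different sources
```

Whether identical text from two sources should count as one document or two depends on the application; the point is to make that choice explicit in the key.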
11. Best Practices
- Default `n_subqueries=3`, `rrf_k=60`, `final_k=5`.
- Use structured output (Pydantic schema) for sub-query generation.
- Cache the LLM expansion step per query.
- Combine with reranking for max quality (RAG Fusion broadens recall, reranker tightens precision).
12. Evolution Story
```mermaid
flowchart LR
    A[Single-query retrieval<br/>misses paraphrases] --> B[Multi-Query<br/>paraphrase coverage]
    B --> C[Set union merge<br/>loses rank signal]
    C --> D[RAG Fusion + RRF<br/>preserves rank]
    D --> E[+ Reranking<br/>RRF for recall, rerank for precision]
```
Next: HyDE — a different angle on query expansion. Instead of generating PARAPHRASES, HyDE generates a hypothetical ANSWER, then searches for chunks similar to the answer.
Practice
What does this print?
Expected: 0.04865
Compute RRF correctly (divide by k + rank, not just rank)
Expected: True
Quiz: Quick check
What you remember
Q1. Core RRF formula?
- `sum of 1/(k + rank) across lists`
- `sum of ranks`
- `max of ranks`
- `1/rank`
Q2. What does k=60 do?
- Smooths the contribution curve; prevents rank-1 dominance
- Number of retrievers
- Number of docs returned
- Learning rate
Q3. RAG Fusion vs Multi-Query?
- Same expansion; RAG Fusion uses RRF, Multi-Query uses set union
- Different LLMs
- Different vector stores
- No difference
Common doubts
Should I always use RAG Fusion over Multi-Query?
Yes — same cost, better results. The only reason to use Multi-Query (set union) is if downstream code expects dedup-only behavior.
Can RRF combine dense + sparse retrievals?
Absolutely, and it should. Naive score-based fusion breaks down because dense similarity scores (bounded, e.g. cosine similarity) and BM25 scores (unbounded) live on different scales. RRF uses only ranks, which are scale-free. The EnsembleRetriever in LangChain uses weighted RRF under the hood.
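The scale problem is easy to see with made-up numbers. In this sketch the scores are hypothetical (cosine-like values for the dense retriever, BM25-like values for the sparse one), and `raw_score_sum` and `rrf` are illustrative helper names.

```python
from collections import defaultdict

# Hypothetical scores: bounded similarity for dense, unbounded BM25 for sparse.
dense = {"A": 0.82, "B": 0.80, "C": 0.35}
sparse = {"C": 14.2, "B": 9.1, "D": 8.7}


def raw_score_sum(*score_maps: dict[str, float]) -> list[str]:
    """Naive fusion: add raw scores, so the larger-scale retriever dominates."""
    totals: dict[str, float] = defaultdict(float)
    for score_map in score_maps:
        for doc_id, score in score_map.items():
            totals[doc_id] += score
    return sorted(totals, key=totals.get, reverse=True)


def rrf(*score_maps: dict[str, float], k: int = 60) -> list[str]:
    """Rank-based fusion: convert each retriever's scores to ranks, then sum 1 / (k + rank)."""
    totals: dict[str, float] = defaultdict(float)
    for score_map in score_maps:
        ranked = sorted(score_map, key=score_map.get, reverse=True)
        for rank, doc_id in enumerate(ranked, start=1):
            totals[doc_id] += 1.0 / (k + rank)
    return sorted(totals, key=totals.get, reverse=True)


print(raw_score_sum(dense, sparse))
print(rrf(dense, sparse))
```

With raw sums, the dense retriever's top hit A lands last, behind D, purely because BM25 magnitudes are larger. With RRF, both retrievers carry equal weight, so A (dense rank 1) places ahead of D (sparse rank 3).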