Advanced Retrievers — Reranking, Multi-Query, Compression, Parent-Child, Self-Query¶
1. Why does this topic exist?¶
Hybrid retrieval typically gets you to roughly 70-80% recall@k, but production-grade RAG usually needs 90%+. Each advanced technique targets a specific residual failure mode.
| Symptom | Root cause | Advanced technique |
|---|---|---|
| Query rephrasing misses results | Single embedding misses paraphrases | Multi-Query Retriever |
| Top-k returns relevant docs but noisy | Chunks are long, mostly irrelevant text | Contextual Compression |
| Small chunks match but lack context | Granularity ↔ context trade-off | Parent-Document Retriever |
| Users describe metadata in natural language | "Show me 2024 docs" — vector ignores year | Self-Query Retriever |
| Top-1 isn't actually the best match | Embeddings are rough, miss subtleties | Reranking with cross-encoder |
| Retrieved chunks all duplicate the same fact | No diversity | MMR (covered in Ch 6) |
Industry pain example: A legal-tech company had 0.7 recall@5 with hybrid retrieval. After adding Cohere Rerank, recall@5 jumped to 0.92. The change was a 50-line code addition. It paid for itself in customer retention within a week.
2. What are they?¶
Simple explanation¶
Advanced retrievers are filters and refinements on top of base retrieval. Each fixes a specific failure mode.
Mental model¶
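Think of retrieval as a pipeline: query understanding (Self-Query, Multi-Query), base retrieval, reranking, and post-processing (Contextual Compression, Parent-Document expansion) before the results reach the LLM.
Each "advanced retriever" plugs into one of these stages.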
3. How does each work?¶
We'll cover five techniques. Each follows the standard 12-section breakdown, but compressed since the pattern is the same.
Technique 1: Multi-Query Retriever¶
Why?¶
One user query embeds in one specific way. Paraphrases get different embeddings. Vector search may miss them.
Example: the user asks "how to fix the issue", but the docs say "solving the bug" or "resolving the error". A single-query retrieval can miss both of those phrasings.
What?¶
Generate N paraphrases of the user's query, run retrieval for each, merge results.
How?¶
```mermaid
flowchart LR
    Q[User Query] --> LLM[LLM]
    LLM --> Q1[Variant 1]
    LLM --> Q2[Variant 2]
    LLM --> Q3[Variant 3]
    Q1 & Q2 & Q3 --> R[Base Retriever]
    R --> M[Deduplicated Union]
    M --> OUT[Final docs]
```
```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# `vector_store` is assumed to be an already-populated vector store.
mq = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),  # cheap, fast model for query expansion
)
docs = mq.invoke("How do I fix the issue?")
```
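To make the merge step concrete, here is a minimal, library-free sketch of the same flow. The toy corpus, the keyword retriever, and `fake_paraphraser` are hypothetical stand-ins (a real setup would call an LLM and a vector store); only the union-and-deduplicate logic is the point.

```python
STOPWORDS = {"the", "to", "in", "at", "how", "with"}

# Toy corpus; in practice these would be chunks in a vector store.
CORPUS = {
    "d1": "solving the bug in the payment service",
    "d2": "resolving the error raised at login",
    "d3": "how to fix the issue with slow queries",
    "d4": "quarterly revenue report",
}

def tokens(text: str) -> set[str]:
    """Lowercased words minus a tiny stopword list."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def keyword_retriever(query: str, k: int = 1) -> list[str]:
    """Stand-in for a base retriever: rank docs by words shared with the query."""
    q = tokens(query)
    scored = [(len(q & tokens(text)), doc_id) for doc_id, text in CORPUS.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in ranked[:k]]

def fake_paraphraser(query: str) -> list[str]:
    """Hypothetical stand-in for the LLM call that generates query variants."""
    return ["solving the bug", "resolving the error"]

def multi_query_retrieve(query, paraphrase, retrieve, k=1):
    """Retrieve for the original query plus each variant, then deduplicate."""
    seen, merged = set(), []
    for q in [query, *paraphrase(query)]:
        for doc_id in retrieve(q, k):
            if doc_id not in seen:  # deduplicated union, first-seen order
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

single = keyword_retriever("how to fix the issue")
multi = multi_query_retrieve("how to fix the issue", fake_paraphraser, keyword_retriever)
```

In this toy setup the single query finds only `d3`, while the variants also surface `d1` and `d2`.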
Pros / Cons / Trade-offs¶
| Pros | Cons | Trade-off |
|---|---|---|
| +10-15% recall on vague queries | 1 extra LLM call plus N× retrievals per query | Use cheap fast model |
| Robust to phrasing | Latency overhead | Cache aggressively |
| Easy to add | Doesn't help when query is already clear | Skip for well-formed queries |
Industry usage¶
OpenAI Assistants File Search uses query rewriting internally. Anthropic's Contextual Retrieval takes the complementary route: rather than expanding the query, it prepends chunk-specific context to each chunk before indexing.
Technique 2: Contextual Compression Retriever¶
Why?¶
A retrieved chunk has 800 chars, but only 100 chars are relevant to the query. The LLM wastes tokens reading the irrelevant 700.
What?¶
Run each retrieved chunk through an extractor (LLM or reranker) that keeps only the parts relevant to the query.
How?¶
```mermaid
flowchart LR
    Q[Query] --> R[Base Retriever]
    R --> C[Raw chunks - long]
    C --> EX[LLM/Reranker Extractor]
    EX --> S[Compressed chunks - relevant parts only]
```
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# The extractor asks the LLM to keep only the query-relevant parts of each chunk.
compressor = LLMChainExtractor.from_llm(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)
cc = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 6}),
)
docs = cc.invoke("How does pgvector handle filtering?")
```
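The idea of "keep only the relevant parts" can be illustrated without an LLM. The sketch below is an assumption-laden simplification: it uses word overlap as a stand-in for the LLM extractor, and the chunk text, stopword list, and function names are made up for illustration.

```python
import re

STOPWORDS = {"a", "at", "does", "how", "in", "is", "our", "the", "to"}

def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens minus a tiny stopword list."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def compress_chunk(chunk: str, query: str) -> str:
    """Keep only sentences that share at least one content word with the query."""
    q_terms = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if q_terms & content_words(s)]
    return " ".join(kept)

CHUNK = (
    "Our office moved in March. "
    "pgvector filtering uses a WHERE clause next to the distance operator. "
    "Lunch is served at noon."
)
compressed = compress_chunk(CHUNK, "How does pgvector handle filtering?")
```

Only the middle sentence survives here. A chunk with no overlapping sentence would compress to an empty string, which is the signal to drop it from the prompt entirely.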
Pros / Cons / Trade-offs¶
| Pros | Cons | Trade-off |
|---|---|---|
| Tighter prompts | One LLM call per chunk | Use small model |
| Easier reasoning for LLM | Costlier than trimming chunks once at indexing time | For high-value queries only |
| Better fit in prompt budget | Risk of losing useful context | Tune carefully |
Industry usage¶
Common in customer-support RAG where chunks are long-form articles. Stripe Docs AI uses contextual compression to fit more relevant content in the prompt.
Technique 3: Parent-Document Retriever¶
Why?¶
Trade-off: Small chunks match precisely (good recall@k) but lack surrounding context. Large chunks have context but match imprecisely.
What?¶
Index small chunks for matching. When matched, return the LARGER parent chunk for context.
How?¶
```mermaid
flowchart LR
    DOC[Original Document] --> P[Parent chunks 2000]
    P --> C[Child chunks 200]
    C --> VS[Vector store - small chunks]
    Q[Query] --> VS
    VS --> MATCH[Matching child IDs]
    MATCH --> RET[Return PARENT chunks]
```
```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # returned for context
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)    # embedded for matching
retriever = ParentDocumentRetriever(
    vectorstore=vector_store,   # holds the small child chunks
    docstore=InMemoryStore(),   # holds the parent chunks; use a persistent store in production
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)  # splits into parents and children, then indexes both
```
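The child-to-parent lookup is the core of this pattern, and it fits in a few lines. The following is a rough sketch under simplifying assumptions: chunks are split by word count instead of characters, and a substring test stands in for vector similarity. All names and the sample text are hypothetical.

```python
def split_words(text: str, size: int) -> list[str]:
    """Split text into consecutive groups of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(document: str, parent_size: int, child_size: int):
    """Return (parents, children): a parent 'docstore' and child 'vector store'."""
    parents = dict(enumerate(split_words(document, parent_size)))
    children = [
        (child, parent_id)  # each child remembers which parent it came from
        for parent_id, parent in parents.items()
        for child in split_words(parent, child_size)
    ]
    return parents, children

def retrieve_parents(query: str, parents: dict, children: list) -> list[str]:
    """Match on small children, but return each matching parent once."""
    hit_ids = []
    for child, parent_id in children:
        # Substring match is a stand-in for embedding similarity.
        if query.lower() in child.lower() and parent_id not in hit_ids:
            hit_ids.append(parent_id)
    return [parents[i] for i in hit_ids]

DOCUMENT = (
    "Refunds are issued within five days. Contact billing. "
    "Shipping takes two weeks. Customs fees may apply."
)
parents, children = build_index(DOCUMENT, parent_size=8, child_size=4)
result = retrieve_parents("customs", parents, children)
```

The query matches only the small child "Customs fees may apply.", yet the caller receives the whole parent, including the shipping sentence that gives it context.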
Pros / Cons / Trade-offs¶
| Pros | Cons | Trade-off |
|---|---|---|
| Precise match + rich context | Two stores to maintain | Use persistent docstore |
| Best of both worlds | More implementation complexity | LangChain handles most of it |
| Production-favorite pattern | Slightly more memory | Worth it |
Industry usage¶
GitHub Copilot Chat uses a variant: small code-symbol chunks match queries, surrounding function/class context is returned. Almost every production RAG with hierarchical documents uses this pattern.
Technique 4: Self-Query Retriever¶
Why?¶
Users describe filters in natural language. "Show me docs about RAG from 2024" — vector search ignores "2024" unless your metadata is queryable.
What?¶
An LLM parses the user query into a structured form: vector query + metadata filter.
How?¶
```mermaid
flowchart LR
    Q[NL query: RAG docs from 2024] --> LLM[LLM Parser]
    LLM --> VQ[vector_query = RAG]
    LLM --> MF[filter = year=2024]
    VQ --> R[Retriever]
    MF --> R
    R --> D[Filtered semantic results]
```
```python
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain_openai import ChatOpenAI

# Describe the metadata fields the LLM is allowed to filter on.
metadata_fields = [
    AttributeInfo(name="year", description="Year of publication", type="integer"),
    AttributeInfo(name="topic", description="Article topic", type="string"),
]
sq = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    vectorstore=vector_store,
    document_contents="Technical articles on AI",
    metadata_field_info=metadata_fields,
)
docs = sq.invoke("Find articles about RAG from 2024")
```
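Because the parsed filter comes from an LLM, it is worth checking it against the metadata schema before it reaches the store. The sketch below shows one possible validation step; `ParsedQuery`, `ALLOWED_FIELDS`, and the sample LLM output are hypothetical, and dropping bad keys (rather than rejecting the query) is just one design option.

```python
from dataclasses import dataclass, field

# Mirrors the metadata schema: field name -> expected Python type.
ALLOWED_FIELDS = {"year": int, "topic": str}

@dataclass
class ParsedQuery:
    vector_query: str
    filters: dict = field(default_factory=dict)

def validate(parsed: ParsedQuery) -> ParsedQuery:
    """Drop filter keys that are unknown or carry a value of the wrong type."""
    clean = {
        key: value
        for key, value in parsed.filters.items()
        if key in ALLOWED_FIELDS and isinstance(value, ALLOWED_FIELDS[key])
    }
    return ParsedQuery(parsed.vector_query, clean)

def apply_filter(docs: list[dict], filters: dict) -> list[dict]:
    """Keep docs whose metadata matches every filter (semantic search runs separately)."""
    return [d for d in docs if all(d["metadata"].get(k) == v for k, v in filters.items())]

DOCS = [
    {"text": "RAG evaluation basics", "metadata": {"year": 2023, "topic": "rag"}},
    {"text": "RAG reranking in practice", "metadata": {"year": 2024, "topic": "rag"}},
]

# Pretend the LLM hallucinated an unknown field and a wrongly typed one.
llm_output = ParsedQuery("RAG", {"year": 2024, "author": "unknown", "topic": 7})
parsed = validate(llm_output)
results = apply_filter(DOCS, parsed.filters)
```

Only the `year` filter survives validation, so the 2024 article is returned and the malformed keys never reach the store.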
Pros / Cons / Trade-offs¶
| Pros | Cons | Trade-off |
|---|---|---|
| NL filters | One LLM call to parse | Use small model |
| Cleaner UX | Requires structured metadata | Set up at indexing |
| Multi-field filters | LLM may parse wrong | Validate parsed filter |
Industry usage¶
Stripe Docs AI uses Self-Query to filter by API version. Notion AI extracts page-type and date filters from "find my latest meeting notes" type queries.
Technique 5: Reranking (the most impactful)¶
Why?¶
Vector retrieval gives APPROXIMATE top-k. The actual best match may be at rank 7, not rank 1. Bi-encoder embeddings (dense retrievers) are FAST but ROUGH.
What?¶
Use a cross-encoder model that scores (query, document) PAIRS — more accurate but slower than bi-encoders.
Pattern: retrieve broadly with bi-encoder (k=20-50), rerank with cross-encoder, keep top-k (4-8).
How?¶
```mermaid
flowchart LR
    Q[Query] --> R[Retriever k=20 wide]
    R --> CE[Cross-encoder reranker]
    CE --> SCORE[Pair-wise scores]
    SCORE --> SORT[Re-sort]
    SORT --> TOP[Top-5 to LLM]
```
Why bi-encoder vs cross-encoder?¶
```mermaid
flowchart LR
    subgraph BI[Bi-encoder fast]
        Q1[Query] --> E1[Encode]
        D1[Doc] --> E2[Encode]
        E1 --> COS[Cosine similarity]
        E2 --> COS
    end
    subgraph CE[Cross-encoder accurate]
        Q2[Query] --> CAT[Concatenate]
        D2[Doc] --> CAT
        CAT --> CEM[Single transformer pass]
        CEM --> SCORE2[Score 0-1]
    end
```
| Encoder | Speed | Accuracy | Use |
|---|---|---|---|
| Bi-encoder | Fast (precompute doc embeddings) | Rough | Retrieval (k=20-50) |
| Cross-encoder | Slow (one pass per pair) | Precise | Reranking (re-score top 20→5) |
Code¶
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# CohereRerank reads the API key from the COHERE_API_KEY environment variable.
rerank = CohereRerank(model="rerank-english-v3.0", top_n=4)
retriever = ContextualCompressionRetriever(
    base_compressor=rerank,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 20}),  # retrieve wide
)
docs = retriever.invoke("query")  # keeps the 4 highest-scoring of the 20 candidates
```
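The two-stage shape (cheap wide retrieval, then a costlier pair-wise re-score) can be shown with toy scorers. In this sketch both scoring functions are deliberately simple stand-ins, not real encoders: term counting plays the bi-encoder, and an order-aware bigram check plays the cross-encoder. The documents and names are invented for illustration.

```python
DOCS = [
    "password policy and password length rules for reset tokens",
    "how to reset password from the login page",
    "billing and invoices",
]

def first_stage_score(query: str, doc: str) -> int:
    """Cheap, order-blind score: how often query words occur in the doc."""
    q = set(query.lower().split())
    return sum(1 for w in doc.lower().split() if w in q)

def pair_score(query: str, doc: str) -> int:
    """Stand-in for a cross-encoder: reads query and doc together and
    rewards query word pairs that appear adjacent in the doc."""
    q, d = query.lower().split(), doc.lower().split()
    doc_bigrams = set(zip(d, d[1:]))
    return sum(1 for bigram in zip(q, q[1:]) if bigram in doc_bigrams)

def first_stage(query: str, docs: list[str], k: int) -> list[str]:
    """Retrieve wide: top-k by the cheap score."""
    return sorted(docs, key=lambda d: first_stage_score(query, d), reverse=True)[:k]

def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    """Re-score only the candidates with the more careful scorer."""
    return sorted(candidates, key=lambda d: pair_score(query, d), reverse=True)[:top_n]

wide = first_stage("reset password", DOCS, k=2)
final = rerank("reset password", wide, top_n=1)
```

The cheap scorer ranks the policy document first because it repeats "password", while the pair-wise scorer promotes the document that actually contains "reset password". That is the same rank-7-should-be-rank-1 correction described above, in miniature.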
Reranker options:
- Cohere Rerank: paid API, fast, top-tier.
- bge-reranker-v2-m3 (BAAI): open-source, runs locally.
- Jina Reranker: open + API.
- ColBERT v2: token-level scoring, strong but heavier.
Pros / Cons / Trade-offs¶
| Pros | Cons | Trade-off |
|---|---|---|
| +20-30% precision over pure vector | Extra latency (100-300ms) | Worth it for production |
| Catches subtle relevance | Per-query cost | Cohere ~$1 per 1000 |
| Easy to plug in | Requires model choice | Test 2-3 rerankers |
Industry usage¶
Most production-grade RAG (Cohere customers, Anthropic Contextual Retrieval) uses reranking. The gain from adding a reranker is often among the largest single improvements reported in published RAG benchmarks.
4. Visual Learning — combining techniques¶
In production, you stack them:
```mermaid
flowchart LR
    Q[Query] --> SQ[Self-Query: extract metadata filter]
    SQ --> MQ[Multi-Query: generate paraphrases]
    MQ --> H[Hybrid retrieval: dense + BM25]
    H --> RR[Cross-encoder Rerank]
    RR --> CC[Contextual Compression]
    CC --> FINAL[Final docs to LLM]
```
Each layer fixes a specific failure mode. Add ONE at a time. Measure each.
Real-world example — legal contract retrieval¶
```mermaid
flowchart LR
    Q[Show me indemnity clauses<br/>from NDAs signed in 2024] --> SQ[Self-Query]
    SQ --> F[filter=type:NDA, year:2024]
    SQ --> VQ[vector_query=indemnity clauses]
    F & VQ --> H[Hybrid retrieve]
    H --> T20[Top 20 chunks]
    T20 --> CR[Cohere Rerank]
    CR --> T5[Top 5]
    T5 --> PD[Parent-Document expand to full clauses]
    PD --> LLM
```
Self-Query extracts the date+type filter, hybrid retrieves, rerank refines, parent-doc gives full clause context.
5-7. Pros / Cons / Trade-offs (consolidated)¶
| Technique | Best for | Cost | Quality gain |
|---|---|---|---|
| Multi-Query | Vague queries | 1 LLM call + N× retrievals | +10-15% recall |
| Contextual Compression | Long chunks | N× LLM calls | Cleaner prompts |
| Parent-Document | Small-chunk recall + big-chunk context | Free at runtime | +5-15% answer quality |
| Self-Query | NL filters over metadata | 1 LLM call | Better UX, fewer wrong-domain hits |
| Reranking | High-precision needs | $1/1000 queries (Cohere) | +20-30% precision |
Combined effect: going from naive vector retrieval to "hybrid + rerank + parent-doc" typically yields +30-40% recall@k.
8. Real-world Industry Usage¶
OpenAI¶
- File Search in Assistants API: dense retrieval + rerank.
- ChatGPT browse: web search + content rerank.
Anthropic¶
- Contextual Retrieval post (2024) recommends contextual chunking + hybrid retrieval (embeddings + BM25) + reranking, and reports a 67% reduction in retrieval failures for that combination.
- Claude Projects use a similar stack internally.
Google¶
- Vertex AI Search has built-in reranking.
- Gemini Code Assist uses contextual compression on code chunks before answering.
Enterprise¶
- Stripe Docs AI: hybrid + Cohere Rerank + Self-Query for API version filtering.
- GitHub Copilot Chat: parent-document retrieval + cross-encoder rerank.
- JPMorgan DocLLM: Self-Query for filings; rerank for relevance.
- Notion AI: Multi-Query for paraphrase robustness on its narrative content.
9. Interview Questions¶
Beginner¶
- What's reranking? — Re-scoring top-k candidates with a more accurate model (cross-encoder).
- Bi-encoder vs cross-encoder? — Bi: encodes each side independently (fast). Cross: encodes the pair together (slow but more accurate).
- What's Parent-Document Retriever? — Index small chunks, return large parents.
Intermediate¶
- Why Multi-Query helps recall? — Vector embeddings are sensitive to phrasing. Multiple paraphrases broaden coverage.
- Self-Query — what does the LLM do? — Parses NL query → vector query + structured filter.
- Why retrieve k=20 then rerank to k=5? — Bi-encoder gives fast recall (broad). Cross-encoder gives slow precision (narrow). Combine for best of both.
Advanced¶
- How do you tune the reranker's top_n? Evaluate on a labeled set, trading recall@top_n against the LLM prompt budget; the result is usually 4-8.
- When does reranking hurt? On tiny corpora (<1000 chunks), where rough retrieval already wins, or when the reranker's training distribution differs from yours.
- Contextual Compression vs Reranking — overlap? — Reranking re-orders chunks. Compression modifies them (strips irrelevant parts). Use both for max quality + minimum tokens.
System design¶
- Design retrieval for a 1000-product e-commerce search. — Hybrid (BM25 for SKUs, dense for descriptions) + Self-Query (filters: brand, color, price) + Cohere Rerank.
- Plan a rollout: dense → hybrid → rerank. — Phase 1: add BM25 retriever, measure. Phase 2: add Cohere Rerank, measure. Phase 3: add Self-Query if needed. Compare metrics at each step.
10. Common Mistakes¶
Beginners¶
- ❌ Adding all 5 at once — can't tell which helps.
- ❌ Using GPT-4 for Multi-Query expansion (overkill, expensive).
- ❌ Reranker with k=100 — slow, wasteful.
- ❌ Skipping eval — guessing improvements.
Production¶
- ❌ No latency budget for the stack — Multi-Query + rerank + compression adds 500-1500ms.
- ❌ Multi-Query without caching common queries.
- ❌ Self-Query without validating the LLM's parsed filter.
- ❌ Rerank top_n equal to retrieve k (the reranker only reorders; nothing is filtered out, so prompt size and noise stay the same).
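For the caching point in the list above, one lightweight option is to memoize query expansion on a normalized form of the query. This is a minimal sketch: `expand_with_llm` is a hypothetical stand-in for the real LLM call, and an in-process `lru_cache` is assumed to be enough (a shared cache such as Redis would be the multi-instance equivalent).

```python
from functools import lru_cache

CALLS = {"llm": 0}  # counts how often the (fake) LLM is actually invoked

def expand_with_llm(query: str) -> tuple[str, ...]:
    """Hypothetical stand-in for the paraphrase-generating LLM call."""
    CALLS["llm"] += 1
    return (query, f"{query} troubleshooting", f"steps for {query}")

def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=10_000)
def _cached_expand(normalized: str) -> tuple[str, ...]:
    return expand_with_llm(normalized)

def expand(query: str) -> tuple[str, ...]:
    """Public entry point: normalize first, then hit the cache."""
    return _cached_expand(normalize(query))

first = expand("How do I fix the issue?")
second = expand("  how do I FIX the issue? ")
```

Both calls resolve to the same cache key, so the LLM is invoked once. Returning a tuple keeps the cached value immutable.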
11. Best Practices¶
- Add ONE technique at a time. Measure recall@k. Keep if better.
- Use cheap fast models for non-final-answer steps (gpt-4o-mini, claude-haiku).
- Cache LLM-augmented retrieval outputs aggressively.
- Set strict latency budgets per layer.
- Run RAGAS (Chapter 12) after every change.
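Since every practice above depends on measuring recall@k, here is a small sketch of the metric itself. The eval data is made up for illustration: each run pairs a ranked list of retrieved IDs with the set of IDs labeled relevant.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over a labeled evaluation set."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)

# Hypothetical results for the same two queries, before and after a change.
baseline = [
    (["a", "x", "y", "b"], {"a", "b"}),
    (["z", "c", "q"], {"c"}),
]
with_rerank = [
    (["a", "b", "x", "y"], {"a", "b"}),
    (["c", "z", "q"], {"c"}),
]

before = mean_recall_at_k(baseline, k=2)
after = mean_recall_at_k(with_rerank, k=2)
```

In this toy data the change moves mean recall@2 from 0.75 to 1.0. Running the same comparison after each added layer is what makes "keep if better" a measurable decision.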
12. Evolution Story¶
```mermaid
flowchart LR
    A[Pure vector retrieval] --> B[+ BM25 hybrid]
    B --> C[+ Multi-Query]
    C --> D[+ Self-Query NL filters]
    D --> E[+ Parent-Document]
    E --> F[+ Contextual Compression]
    F --> G[+ Cross-encoder Reranking]
    G --> H[+ Anthropic Contextual chunking]
```
Where we are: Each layer fixes a specific failure mode. Production RAG typically uses 3-4 layers, not all.
Where we're going (next chapter): RAG Fusion — a specific advanced pattern that combines Multi-Query with Reciprocal Rank Fusion. We'll see the math in detail and understand why it beats vanilla Multi-Query.
Quiz — Quick check¶
What you remember
Q1. Parent-Document Retriever solves what problem?
- Mismatch between matching granularity (small) and context size (large)
- Slow embeddings
- Multilingual
- Filtering
Q2. Why retrieve broad (k=20) then rerank to k=5?
- Bi-encoder = fast recall, cross-encoder = precise scoring — combine
- Costs less
- Required by sklearn
- Avoids OOM
Q3. Cross-encoder vs bi-encoder?
- Cross encodes (query, doc) PAIR together; bi encodes each separately
- No difference
- Cross is deprecated
- Bi is more accurate
Common doubts¶
Do I always need a reranker?
For chat/Q&A over a non-trivial corpus, yes — typically the biggest quality win you'll get. Skip only for tiny corpora or latency-critical paths.
Multi-Query vs RAG Fusion — what's the difference?
Multi-Query merges retrieval results via set union (loses rank info). RAG Fusion uses Reciprocal Rank Fusion (preserves rank info). RAG Fusion typically wins. We cover it in detail in Chapter 8.
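As a quick preview of that difference, the sketch below merges the same three ranked lists both ways. It assumes the usual RRF formula, score(d) = Σ 1/(k + rank), with the commonly used constant k=60; the rankings are invented for illustration.

```python
def union_merge(rankings: list[list[str]]) -> list[str]:
    """Set union in first-seen order: rank information is discarded."""
    seen, out = set(), []
    for ranking in rankings:
        for doc in ranking:
            if doc not in seen:
                seen.add(doc)
                out.append(doc)
    return out

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Results for three paraphrases of the same question.
rankings = [
    ["a", "b", "c"],
    ["b", "c", "d"],
    ["b", "a", "e"],
]
by_union = union_merge(rankings)
by_rrf = rrf_merge(rankings)
```

The union keeps `a` first simply because it appeared first, while RRF promotes `b`, which ranks highly in all three lists.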
Should I run all 5 techniques together?
No. Each adds latency + cost. Start with hybrid retrieval. Add reranking. Measure. If recall is still inadequate, add Multi-Query or Self-Query based on diagnosis. Stacking everything is over-engineering.