Advanced Retrievers — Reranking, Multi-Query, Compression, Parent-Child, Self-Query¶

1. Why does this topic exist?¶

Hybrid retrieval typically gets you to roughly 70-80% recall@k. Production-grade RAG usually needs 90% or more. Each advanced technique below targets a specific residual failure mode.

Symptom	Root cause	Advanced technique
Query rephrasing misses results	Single embedding misses paraphrases	Multi-Query Retriever
Top-k returns relevant docs but noisy	Chunks are long, mostly irrelevant text	Contextual Compression
Small chunks match but lack context	Granularity ↔ context trade-off	Parent-Document Retriever
Users describe metadata in natural language	"Show me 2024 docs" — vector ignores year	Self-Query Retriever
Top-1 isn't actually the best match	Embeddings are rough, miss subtleties	Reranking with cross-encoder
Retrieved chunks all duplicate the same fact	No diversity	MMR (covered in Ch 6)

Industry pain example: A legal-tech company had 0.7 recall@5 with hybrid retrieval. After adding Cohere Rerank, recall@5 jumped to 0.92. The change was a 50-line code addition. It paid for itself in customer retention within a week.

2. What are they?¶

Simple explanation¶

Advanced retrievers are filters and refinements on top of base retrieval. Each fixes a specific failure mode.

Mental model¶

Think of retrieval as a pipeline:

Query → Expand queries → Base retrieval → Filter junk → Rerank → Best top-k

Each "advanced retriever" plugs into one of these stages.

3. How does each work?¶

We'll cover five techniques. Each follows the standard 12-section breakdown, but compressed since the pattern is the same.

Technique 1: Multi-Query Retriever¶

Why?¶

One user query embeds in one specific way. Paraphrases get different embeddings. Vector search may miss them.

Example: the user asks "how to fix the issue", but the docs say "solving the bug" or "resolving the error". A single query embedding may sit far from both of those phrasings, so the matching docs never reach the top-k.

What?¶

Generate N paraphrases of the user's query, run retrieval for each, merge results.

How?¶

flowchart LR
    Q[User Query] --> LLM[LLM]
    LLM --> Q1[Variant 1]
    LLM --> Q2[Variant 2]
    LLM --> Q3[Variant 3]
    Q1 & Q2 & Q3 --> R[Base Retriever]
    R --> M[Deduplicated Union]
    M --> OUT[Final docs]

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

mq = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)
docs = mq.invoke("How do I fix the issue?")
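The merge step can be sketched without any LLM or vector store. The example below is a minimal illustration, not LangChain's implementation: the corpus, the stopword list, and `keyword_retriever` (a word-overlap stand-in for a real vector retriever) are all assumptions made up for the demo, and the query variants are hard-coded where a real system would ask an LLM for them.

```python
from typing import Callable

# Hypothetical toy corpus; the doc IDs and texts are invented for illustration.
CORPUS = {
    "d1": "solving the bug in the payment module",
    "d2": "resolving the error raised at startup",
    "d3": "how to fix the issue with login",
    "d4": "release notes for version 2",
}
STOPWORDS = {"the", "to", "how", "in", "at", "for", "with", "do", "i"}


def content_words(text: str) -> set[str]:
    """Lowercase, split on whitespace, drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def keyword_retriever(query: str, k: int = 3) -> list[str]:
    """Toy stand-in for a vector retriever: rank docs by shared content words."""
    q = content_words(query)
    scored = [(len(q & content_words(text)), doc_id) for doc_id, text in CORPUS.items()]
    hits = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [doc_id for _, doc_id in hits[:k]]


def multi_query(variants: list[str], retrieve: Callable[[str], list[str]]) -> list[str]:
    """Retrieve once per variant and return the de-duplicated union (first-seen order)."""
    seen: dict[str, None] = {}
    for variant in variants:
        for doc_id in retrieve(variant):
            seen.setdefault(doc_id, None)
    return list(seen)


# In a real pipeline these variants would come from one LLM call.
variants = ["fix the issue", "solving the bug", "resolving the error"]
single = keyword_retriever(variants[0])
merged = multi_query(variants, keyword_retriever)
print(single, merged)
```

With only the original phrasing, the toy retriever finds one document; the union over the three variants finds all three relevant ones. Note that the union keeps no rank information, which is the gap RAG Fusion addresses.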

Pros / Cons / Trade-offs¶

Pros	Cons	Trade-off
+10-15% recall on vague queries	1 extra LLM call + N retrievals per query	Use cheap fast model
Robust to phrasing	Latency overhead	Cache aggressively
Easy to add	Doesn't help when query is already clear	Skip for well-formed queries

Industry usage¶

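OpenAI Assistants File Search rewrites user queries internally before searching. Anthropic's Contextual Retrieval works on the other side of the pipeline: it prepends chunk-specific context at indexing time, which complements query expansion rather than replacing it.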

Technique 2: Contextual Compression Retriever¶

Why?¶

Suppose a retrieved chunk has 800 characters, but only 100 of them are relevant to the query. The LLM spends tokens reading the irrelevant 700, and that noise can dilute the answer.

What?¶

Run each retrieved chunk through an extractor (LLM or reranker) that keeps only the parts relevant to the query.

How?¶

flowchart LR
    Q[Query] --> R[Base Retriever]
    R --> C[Raw chunks - long]
    C --> EX[LLM/Reranker Extractor]
    EX --> S[Compressed chunks - relevant parts only]

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

compressor = LLMChainExtractor.from_llm(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)
cc = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 6}),
)
docs = cc.invoke("How does pgvector handle filtering?")
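To make the idea of "keep only the relevant parts" concrete, here is a minimal extractive sketch. It is an assumption-laden stand-in: instead of an LLM extractor it keeps sentences that share at least one content word with the query, and the `compress` helper, the stopword list, and the sample chunk are invented for illustration.

```python
import re

STOPWORDS = {"how", "does", "the", "a", "is", "to", "of", "and", "in", "it", "on", "at"}


def content_words(text: str) -> set[str]:
    """Lowercased alphanumeric tokens minus stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}


def compress(chunk: str, query: str, min_overlap: int = 1) -> str:
    """Keep only sentences sharing at least `min_overlap` content words with the query."""
    q = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if len(q & content_words(s)) >= min_overlap]
    return " ".join(kept)


# Hypothetical chunk: two relevant sentences followed by two unrelated ones.
chunk = (
    "pgvector is a Postgres extension for vector search. "
    "Filtering is expressed with ordinary SQL WHERE clauses. "
    "The office is closed on public holidays. "
    "Lunch is served at noon."
)
compressed = compress(chunk, "How does pgvector handle filtering?")
print(compressed)
```

Word overlap is much cruder than an LLM extractor (it misses synonyms entirely), but the shape is the same: chunk in, shorter query-relevant chunk out.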

Pros / Cons / Trade-offs¶

Pros	Cons	Trade-off
Tighter prompts	One LLM call per chunk	Use small model
Easier reasoning for LLM	Costlier than cleaning chunks at indexing time	Reserve for high-value queries
Better fit in prompt budget	Risk of losing useful context	Tune carefully

Industry usage¶

Common in customer-support RAG where chunks are long-form articles. Stripe Docs AI uses contextual compression to fit more relevant content in the prompt.

Technique 3: Parent-Document Retriever¶

Why?¶

Trade-off: Small chunks match precisely because each embedding covers one idea, but they lack surrounding context. Large chunks carry the context, but their embeddings average over many ideas and match imprecisely.

What?¶

Index small chunks for matching. When matched, return the LARGER parent chunk for context.

How?¶

flowchart LR
    DOC[Original Document] --> P[Parent chunks 2000]
    P --> C[Child chunks 200]
    C --> VS[Vector store - small chunks]
    Q[Query] --> VS
    VS --> MATCH[Matching child IDs]
    MATCH --> RET[Return PARENT chunks]

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

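parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # chunks returned to the LLM
child_splitter  = RecursiveCharacterTextSplitter(chunk_size=200)   # chunks embedded and matched

retriever = ParentDocumentRetriever(
    vectorstore=vector_store,   # holds the child-chunk embeddings
    docstore=InMemoryStore(),   # holds the parent chunks; use a persistent store in production
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)   # docs: list of Document objects loaded earlier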
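The child-to-parent lookup can be sketched in a few lines of plain Python. This is an illustrative toy, not what LangChain does internally: parents are paragraphs, children are sentences, matching uses word overlap instead of embeddings, and the sample document and helper names are made up.

```python
import re


def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def build_index(document: str) -> tuple[dict[int, str], list[tuple[str, int]]]:
    """Parents are paragraphs; children are sentences that remember their parent ID."""
    parents = dict(enumerate(p.strip() for p in document.split("\n\n")))
    children = [
        (sentence, pid)
        for pid, parent in parents.items()
        for sentence in re.split(r"(?<=[.!?])\s+", parent)
    ]
    return parents, children


def retrieve_parents(query: str, parents: dict[int, str],
                     children: list[tuple[str, int]], k: int = 1) -> list[str]:
    """Match against small children, then return the de-duplicated parents."""
    q = words(query)
    scored = [(len(q & words(text)), pid) for text, pid in children]
    scored = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    parent_ids: list[int] = []
    for _, pid in scored:
        if pid not in parent_ids:
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids[:k]]


# Hypothetical two-paragraph document.
document = (
    "Refunds are processed within five days. Contact support to start a refund.\n\n"
    "Shipping is free over fifty dollars. Orders ship from two warehouses."
)
parents, children = build_index(document)
result = retrieve_parents("start a refund", parents, children)
print(result)
```

The query matches only the short sentence about starting a refund, yet the caller receives the whole paragraph, including the processing time the user probably wants next.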

Pros / Cons / Trade-offs¶

Pros	Cons	Trade-off
Precise match + rich context	Two stores to maintain	Use persistent docstore
Best of both worlds	More implementation complexity	LangChain handles most of it
Production-favorite pattern	Slightly more memory	Worth it

Industry usage¶

GitHub Copilot Chat uses a variant: small code-symbol chunks match queries, and the surrounding function/class context is returned. Many production RAG systems over hierarchical documents use this pattern.

Technique 4: Self-Query Retriever¶

Why?¶

Users describe filters in natural language. "Show me docs about RAG from 2024" — vector search ignores "2024" unless your metadata is queryable.

What?¶

An LLM parses the user query into a structured form: vector query + metadata filter.

How?¶

flowchart LR
    Q[NL query: RAG docs from 2024] --> LLM[LLM Parser]
    LLM --> VQ[vector_query = RAG]
    LLM --> MF[filter = year=2024]
    VQ --> R[Retriever]
    MF --> R
    R --> D[Filtered semantic results]

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain_openai import ChatOpenAI

metadata_fields = [
    AttributeInfo(name="year", description="Year of publication", type="integer"),
    AttributeInfo(name="topic", description="Article topic", type="string"),
]

sq = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    vectorstore=vector_store,
    document_contents="Technical articles on AI",
    metadata_field_info=metadata_fields,
)

docs = sq.invoke("Find articles about RAG from 2024")
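Because the LLM can parse a filter incorrectly, it helps to check the parsed output against the metadata schema before using it. The sketch below assumes a hypothetical parser output (a plain dict) and hypothetical helper names (`validate_filter`, `apply_filter`); it is not the structure LangChain produces, only an illustration of the validate-then-filter step.

```python
# Schema mirroring the two metadata fields declared above.
ALLOWED_FIELDS = {"year": int, "topic": str}


def validate_filter(parsed: dict) -> dict:
    """Keep only known fields whose values have the expected type."""
    clean = {}
    for key, value in parsed.items():
        expected = ALLOWED_FIELDS.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if expected is not None and isinstance(value, expected) and not isinstance(value, bool):
            clean[key] = value
    return clean


def apply_filter(docs: list[dict], flt: dict) -> list[dict]:
    """Exact-match metadata filter; semantic ranking would run on the survivors."""
    return [d for d in docs if all(d["metadata"].get(k) == v for k, v in flt.items())]


# Hypothetical LLM parse of "Find articles about RAG from 2024".
# The "author" key is a hallucinated field that the schema does not contain.
parsed = {"query": "RAG", "filter": {"year": 2024, "author": "unknown"}}

corpus = [
    {"id": "a", "metadata": {"year": 2024, "topic": "RAG"}},
    {"id": "b", "metadata": {"year": 2023, "topic": "RAG"}},
    {"id": "c", "metadata": {"year": 2024, "topic": "agents"}},
]

safe = validate_filter(parsed["filter"])
matches = [d["id"] for d in apply_filter(corpus, safe)]
print(safe, matches)
```

Dropping unknown fields silently is one design choice; logging or rejecting the whole query are equally reasonable, depending on how costly a wrong filter is.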

Pros / Cons / Trade-offs¶

Pros	Cons	Trade-off
NL filters	One LLM call to parse	Use small model
Cleaner UX	Requires structured metadata	Set up at indexing
Multi-field filters	LLM may parse wrong	Validate parsed filter

Industry usage¶

Stripe Docs AI uses Self-Query to filter by API version. Notion AI extracts page-type and date filters from "find my latest meeting notes" type queries.

Technique 5: Reranking (the most impactful)¶

Why?¶

Vector retrieval gives APPROXIMATE top-k. The actual best match may be at rank 7, not rank 1. Bi-encoder embeddings (dense retrievers) are FAST but ROUGH.

What?¶

Use a cross-encoder model that scores (query, document) PAIRS — more accurate but slower than bi-encoders.

Pattern: retrieve broadly with bi-encoder (k=20-50), rerank with cross-encoder, keep top-k (4-8).

How?¶

flowchart LR
    Q[Query] --> R[Retriever k=20 wide]
    R --> CE[Cross-encoder reranker]
    CE --> SCORE[Pair-wise scores]
    SCORE --> SORT[Re-sort]
    SORT --> TOP[Top-5 to LLM]

Why bi-encoder vs cross-encoder?¶

flowchart LR
    subgraph BI[Bi-encoder fast]
        Q1[Query] --> E1[Encode]
        D1[Doc] --> E2[Encode]
        E1 --> COS[Cosine similarity]
        E2 --> COS
    end
    subgraph CE[Cross-encoder accurate]
        Q2[Query] --> CAT[Concatenate]
        D2[Doc] --> CAT
        CAT --> CEM[Single transformer pass]
        CEM --> SCORE2[Score 0-1]
    end

Encoder	Speed	Accuracy	Use
Bi-encoder	Fast (precompute doc embeddings)	Rough	Retrieval (k=20-50)
Cross-encoder	Slow (one pass per pair)	Precise	Reranking (re-score top 20→5)

Code¶

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

rerank = CohereRerank(model="rerank-english-v3.0", top_n=4)

retriever = ContextualCompressionRetriever(
    base_compressor=rerank,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 20}),
)
docs = retriever.invoke("query")
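The retrieve-wide-then-rerank pattern can be illustrated with two toy scorers. These are stand-ins, not real encoders: `rough_score` ignores word order (a crude analogue of independent bi-encoder embeddings), while `pair_score` compares query and document bigrams jointly (a crude analogue of a cross-encoder seeing the pair together). The corpus and function names are assumptions for the demo.

```python
def rough_score(query: str, doc: str) -> int:
    """Order-insensitive word overlap: cheap, but blind to who did what to whom."""
    return len(set(query.split()) & set(doc.split()))


def pair_score(query: str, doc: str) -> int:
    """Shared bigrams: looks at query and doc together and respects word order."""
    def bigrams(text: str) -> set[tuple[str, str]]:
        w = text.split()
        return set(zip(w, w[1:]))
    return len(bigrams(query) & bigrams(doc))


def retrieve_wide(query: str, corpus: list[str], k: int) -> list[str]:
    return sorted(corpus, key=lambda d: -rough_score(query, d))[:k]


def rerank(query: str, candidates: list[str], top_n: int) -> list[str]:
    return sorted(candidates, key=lambda d: -pair_score(query, d))[:top_n]


corpus = [
    "dog bites man in the park",
    "man bites dog on the street",
    "cat sleeps on the sofa",
    "man walks dog",
]
query = "man bites dog"

wide = retrieve_wide(query, corpus, k=3)   # broad, rough candidate set
reranked = rerank(query, wide, top_n=2)    # precise re-scoring of the candidates
print(wide[0], "->", reranked[0])
```

The rough scorer ties the two "bites" documents and happens to put the wrong one first; the pairwise scorer promotes the correct one. The expensive scorer only ever sees the 3 candidates, never the full corpus.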

Other rerankers:
- Cohere Rerank: paid API, fast, top-tier.
- bge-reranker-v2-m3 (BAAI): open-source, runs locally.
- Jina Reranker: open + API.
- ColBERT v2: token-level scoring, strong but heavier.

Pros / Cons / Trade-offs¶

Pros	Cons	Trade-off
+20-30% precision over pure vector	Extra latency (100-300ms)	Worth it for production
Catches subtle relevance	Per-query cost	Cohere ~$1 per 1,000 searches (verify current pricing)
Easy to plug in	Requires model choice	Test 2-3 rerankers

Industry usage¶

Many production-grade RAG stacks (Cohere customers, Anthropic's Contextual Retrieval recipe) use reranking. It is often reported as one of the largest single improvements in published RAG evaluations.

4. Visual Learning — combining techniques¶

In production, you stack them:

flowchart LR
    Q[Query] --> SQ[Self-Query: extract metadata filter]
    SQ --> MQ[Multi-Query: generate paraphrases]
    MQ --> H[Hybrid retrieval: dense + BM25]
    H --> RR[Cross-encoder Rerank]
    RR --> CC[Contextual Compression]
    CC --> FINAL[Final docs to LLM]

Each layer fixes a specific failure mode. Add ONE at a time. Measure each.

Real-world example — legal contract retrieval¶

flowchart LR
    Q[Show me indemnity clauses<br/>from NDAs signed in 2024] --> SQ[Self-Query]
    SQ --> F[filter=type:NDA, year:2024]
    SQ --> VQ[vector_query=indemnity clauses]
    F & VQ --> H[Hybrid retrieve]
    H --> 20[Top 20 chunks]
    20 --> CR[Cohere Rerank]
    CR --> 5[Top 5]
    5 --> PD[Parent-Document expand to full clauses]
    PD --> LLM

Self-Query extracts the date+type filter, hybrid retrieves, rerank refines, parent-doc gives full clause context.
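One way to keep "add one at a time, measure each" practical is to model every technique as a stage that can be switched on or off. The sketch below is a toy under stated assumptions: the documents are invented, `self_query` uses string rules where a real system would use an LLM parser, and `rerank` uses word overlap instead of a cross-encoder.

```python
import re
from typing import Callable

State = dict
Stage = Callable[[State], State]

# Hypothetical contract chunks with metadata.
DOCS = [
    {"id": "nda-1", "type": "NDA", "year": 2024, "text": "indemnity clause limits liability"},
    {"id": "nda-2", "type": "NDA", "year": 2023, "text": "indemnity clause covers third parties"},
    {"id": "msa-1", "type": "MSA", "year": 2024, "text": "indemnity clause for services"},
    {"id": "nda-3", "type": "NDA", "year": 2024, "text": "confidentiality term is five years"},
]


def self_query(state: State) -> State:
    """Toy filter extraction with string rules (stand-in for an LLM parser)."""
    flt = {}
    year = re.search(r"\b(20\d{2})\b", state["query"])
    if year:
        flt["year"] = int(year.group(1))
    if "NDA" in state["query"]:
        flt["type"] = "NDA"
    return {**state, "filter": flt}


def retrieve(state: State) -> State:
    """Apply the metadata filter if an earlier stage produced one."""
    flt = state.get("filter", {})
    hits = [d for d in DOCS if all(d[k] == v for k, v in flt.items())]
    return {**state, "docs": hits}


def rerank(state: State, top_n: int = 2) -> State:
    """Toy rerank by word overlap (stand-in for a cross-encoder)."""
    q = set(state["query"].lower().split())
    ranked = sorted(state["docs"], key=lambda d: -len(q & set(d["text"].split())))
    return {**state, "docs": ranked[:top_n]}


def run(query: str, stages: list[Stage]) -> list[str]:
    state: State = {"query": query}
    for stage in stages:
        state = stage(state)
    return [d["id"] for d in state["docs"]]


query = "Show me indemnity clauses from NDAs signed in 2024"
baseline = run(query, [retrieve, rerank])
with_filter = run(query, [self_query, retrieve, rerank])
print(baseline, with_filter)
```

Without the Self-Query stage a 2023 NDA slips into the results; with it, only 2024 NDAs survive. Because each stage has the same signature, comparing two configurations is a one-line change.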

5-7. Pros / Cons / Trade-offs (consolidated)¶

Technique	Best for	Cost	Quality gain
Multi-Query	Vague queries	1 LLM call + N retrievals	+10-15% recall
Contextual Compression	Long chunks	1 LLM call per chunk	Cleaner prompts
Parent-Document	Small-chunk recall + big-chunk context	No LLM calls at runtime	+5-15% answer quality
Self-Query	NL filters over metadata	1 LLM call	Better UX, fewer wrong-domain hits
Reranking	High-precision needs	$1/1000 queries (Cohere)	+20-30% precision

Combined effect: going from naive vector retrieval to "hybrid + rerank + parent-doc" typically yields +30-40% recall@k.

8. Real-world Industry Usage¶

OpenAI¶

File Search in Assistants API: hybrid (keyword + semantic) retrieval + rerank.
ChatGPT browse: web search + content rerank.

Anthropic¶

Contextual Retrieval post (2024) recommends contextual chunking + hybrid retrieval (embeddings + BM25) + reranking, and reports a 67% reduction in retrieval failures for the full stack.
Claude Projects use a similar stack internally.

Google¶

Vertex AI Search has built-in reranking.
Gemini Code Assist uses contextual compression on code chunks before answering.

Enterprise¶

Stripe Docs AI: hybrid + Cohere Rerank + Self-Query for API version filtering.
GitHub Copilot Chat: parent-document retrieval + cross-encoder rerank.
JPMorgan DocLLM: Self-Query for filings; rerank for relevance.
Notion AI: Multi-Query for paraphrase robustness on its narrative content.

9. Interview Questions¶

Beginner¶

What's reranking? — Re-scoring top-k candidates with a more accurate model (cross-encoder).
Bi-encoder vs cross-encoder? — Bi: encodes each side independently (fast). Cross: encodes the pair together (slow but more accurate).
What's Parent-Document Retriever? — Index small chunks, return large parents.

Intermediate¶

Why does Multi-Query help recall? — Vector embeddings are sensitive to phrasing. Multiple paraphrases broaden coverage.
Self-Query — what does the LLM do? — Parses NL query → vector query + structured filter.
Why retrieve k=20 then rerank to k=5? — Bi-encoder gives fast recall (broad). Cross-encoder gives slow precision (narrow). Combine for best of both.

Advanced¶

How do you tune the reranker's top_n? — Evaluate on a labeled set; weigh recall@top_n against the LLM prompt budget; the result is usually 4-8.
When does reranking hurt? — Tiny corpora (<1000 chunks) where rough retrieval already wins. Or when reranker's training distribution differs from yours.
Contextual Compression vs Reranking — overlap? — Reranking re-orders chunks. Compression modifies them (strips irrelevant parts). Use both for max quality + minimum tokens.
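Tuning top_n on a labeled set comes down to computing recall@k for a few candidate values and picking the smallest one that is good enough. A minimal sketch follows; the rankings and relevance labels are made-up data, and `recall_at_k` is written out here rather than taken from any evaluation library.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant docs that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Hypothetical labeled set: (reranked doc IDs, IDs a human marked relevant).
labeled = [
    (["d1", "d7", "d3", "d9", "d2", "d5"], {"d1", "d3"}),
    (["d4", "d8", "d6", "d2", "d1", "d9"], {"d6", "d1"}),
]


def mean_recall(k: int) -> float:
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in labeled) / len(labeled)


scores = {k: mean_recall(k) for k in (2, 4, 6)}
print(scores)
```

On this toy data recall climbs from 0.25 at k=2 to 1.0 at k=6, so the choice is between paying for two more chunks in the prompt and accepting 0.75 at k=4.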

System design¶

Design retrieval for a 1000-product e-commerce search. — Hybrid (BM25 for SKUs, dense for descriptions) + Self-Query (filters: brand, color, price) + Cohere Rerank.
Plan a rollout: dense → hybrid → rerank. — Phase 1: add BM25 retriever, measure. Phase 2: add Cohere Rerank, measure. Phase 3: add Self-Query if needed. Compare metrics at each step.

10. Common Mistakes¶

Beginners¶

❌ Adding all 5 at once — can't tell which helps.
❌ Using GPT-4 for Multi-Query expansion (overkill, expensive).
❌ Reranker with k=100 — slow, wasteful.
❌ Skipping eval — guessing improvements.

Production¶

❌ No latency budget for the stack — Multi-Query + rerank + compression adds 500-1500ms.
❌ Multi-Query without caching common queries.
❌ Self-Query without validating the LLM's parsed filter.
❌ Rerank top_n equal to retrieve k (nothing is filtered out; the same set is only reordered).
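A latency budget is simple arithmetic, which makes it easy to check before shipping a new layer. The per-stage numbers below are hypothetical placeholders, not measurements; in practice you would substitute your own p95 timings.

```python
# Hypothetical p95 latencies per stage, in milliseconds.
STAGE_LATENCY_MS = {
    "multi_query": 400,
    "hybrid_retrieval": 80,
    "rerank": 200,
    "compression": 600,
}


def check_budget(stages: list[str], budget_ms: int) -> tuple[int, bool]:
    """Sum sequential stage latencies and report whether they fit the budget."""
    total = sum(STAGE_LATENCY_MS[s] for s in stages)
    return total, total <= budget_ms


full_stack = ["multi_query", "hybrid_retrieval", "rerank", "compression"]
lean_stack = ["multi_query", "hybrid_retrieval", "rerank"]

full_total, full_ok = check_budget(full_stack, budget_ms=1000)
lean_total, lean_ok = check_budget(lean_stack, budget_ms=1000)
print(full_total, full_ok, lean_total, lean_ok)
```

The sum assumes stages run one after another; stages that can run in parallel (for example, the N retrievals in Multi-Query) should be counted by their slowest member instead.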

11. Best Practices¶

Add ONE technique at a time. Measure recall@k. Keep if better.
Use cheap fast models for non-final-answer steps (gpt-4o-mini, claude-haiku).
Cache LLM-augmented retrieval outputs aggressively.
Set strict latency budgets per layer.
Run RAGAS (Chapter 12) after every change.
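Caching the LLM-generated query variants is often the cheapest win, since popular queries repeat. The sketch below uses `functools.lru_cache` from the standard library; `fake_llm_expand` is a hypothetical stand-in for the real LLM call, and the call counter exists only to show that the second lookup never reaches it.

```python
from functools import lru_cache

CALLS = {"n": 0}


def fake_llm_expand(query: str) -> tuple[str, ...]:
    """Hypothetical stand-in for an LLM call that returns query variants."""
    CALLS["n"] += 1
    return (query, f"{query} (rephrased)", f"steps to {query}")


def normalize(query: str) -> str:
    """Collapse case and whitespace so trivially different queries share a cache key."""
    return " ".join(query.lower().split())


@lru_cache(maxsize=1024)
def _expand_cached(normalized_query: str) -> tuple[str, ...]:
    return fake_llm_expand(normalized_query)


def expand(query: str) -> tuple[str, ...]:
    return _expand_cached(normalize(query))


first = expand("Fix the issue")
second = expand("  fix   THE issue ")
print(CALLS["n"], first == second)
```

An in-process cache like this disappears on restart and is not shared across workers; a shared store with a TTL is the usual next step once the hit rate justifies it.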

12. Evolution Story¶

flowchart LR
    A[Pure vector retrieval] --> B[+ BM25 hybrid]
    B --> C[+ Multi-Query]
    C --> D[+ Self-Query NL filters]
    D --> E[+ Parent-Document]
    E --> F[+ Contextual Compression]
    F --> G[+ Cross-encoder Reranking]
    G --> H[+ Anthropic Contextual chunking]

Where we are: Each layer fixes a specific failure mode. Production RAG typically uses 3-4 layers, not all.

Where we're going (next chapter): RAG Fusion — a specific advanced pattern that combines Multi-Query with Reciprocal Rank Fusion. We'll see the math in detail and understand why it beats vanilla Multi-Query.

Practice¶

What does this print?

Expected: True

techniques = ["multi_query", "compression", "parent_document", "self_query", "reranking"]
print(len(techniques) == 5)

Use a CHEAP model for Multi-Query expansion (not GPT-4)

Expected: True

expansion_model = "gpt-4"        # bug: too expensive per query
is_cheap = expansion_model in ("gpt-4o-mini", "claude-haiku")
print(not is_cheap)

Quiz — Quick check¶

What you remember

Q1. Parent-Document Retriever solves what problem?

Mismatch between matching granularity (small) and context size (large)
Slow embeddings
Multilingual
Filtering

Q2. Why retrieve broad (k=20) then rerank to k=5?

Bi-encoder = fast recall, cross-encoder = precise scoring — combine
Costs less
Required by sklearn
Avoids OOM

Q3. Cross-encoder vs bi-encoder?

Cross encodes (query, doc) PAIR together; bi encodes each separately
No difference
Cross is deprecated
Bi is more accurate

Common doubts¶

Do I always need a reranker?

For chat/Q&A over a non-trivial corpus, yes — typically the biggest quality win you'll get. Skip only for tiny corpora or latency-critical paths.

Multi-Query vs RAG Fusion — what's the difference?

Multi-Query merges retrieval results via set union (loses rank info). RAG Fusion uses Reciprocal Rank Fusion (preserves rank info). RAG Fusion typically wins. We cover it in detail in Chapter 8.
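The difference is easy to see on a tiny example. The sketch below contrasts a plain union with Reciprocal Rank Fusion, where each document scores the sum of 1/(k + rank) over the lists it appears in; k=60 is the commonly used constant. The three rankings are made-up data.

```python
def union(rankings: list[list[str]]) -> list[str]:
    """De-duplicated union in first-seen order: rank information is discarded."""
    seen: dict[str, None] = {}
    for ranking in rankings:
        for doc_id in ranking:
            seen.setdefault(doc_id, None)
    return list(seen)


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by ID to break ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for three query variants.
rankings = [
    ["a", "b", "c"],
    ["b", "c", "d"],
    ["c", "b", "e"],
]
merged_union = union(rankings)
merged_rrf = rrf(rankings)
print(merged_union, merged_rrf)
```

The union leaves "a" first simply because it was seen first, even though only one variant retrieved it. RRF moves "b" and "c" to the top because every variant ranked them highly.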

Should I run all 5 techniques together?

No. Each adds latency + cost. Start with hybrid retrieval. Add reranking. Measure. If recall is still inadequate, add Multi-Query or Self-Query based on diagnosis. Stacking everything is over-engineering.
