Advanced Retrievers — Reranking, Multi-Query, Compression, Parent-Child, Self-Query

1. Why does this topic exist?

Hybrid retrieval gets you to 70-80% recall. Production-grade RAG needs 90%+. Each advanced technique targets a specific residual failure mode.

| Symptom | Root cause | Advanced technique |
|---|---|---|
| Query rephrasing misses results | Single embedding misses paraphrases | Multi-Query Retriever |
| Top-k returns relevant docs but noisy | Chunks are long, mostly irrelevant text | Contextual Compression |
| Small chunks match but lack context | Granularity ↔ context trade-off | Parent-Document Retriever |
| Users describe metadata in natural language | "Show me 2024 docs" (vector search ignores the year) | Self-Query Retriever |
| Top-1 isn't actually the best match | Embeddings are rough, miss subtleties | Reranking with cross-encoder |
| Retrieved chunks all duplicate the same fact | No diversity | MMR (covered in Ch 6) |

Industry pain example: A legal-tech company had 0.7 recall@5 with hybrid retrieval. After adding Cohere Rerank, recall@5 jumped to 0.92. The change was a 50-line code addition. It paid for itself in customer retention within a week.


2. What are they?

Simple explanation

Advanced retrievers are filters and refinements on top of base retrieval. Each fixes a specific failure mode.

Mental model

Think of retrieval as a pipeline:

Query → Expand queries → Base retrieval → Filter junk → Rerank → Best top-k

Each "advanced retriever" plugs into one of these stages.
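
To make the pipeline shape concrete, here is a minimal sketch in plain Python. Every stage is a hypothetical stand-in: word overlap replaces embeddings and cross-encoders, a string substitution replaces the LLM, and `CORPUS`, `expand`, `filter_junk` and the other names are invented for illustration. Only the composition of stages mirrors the diagram above.

```python
from typing import List

# Toy corpus; "the error" is a junk fragment that the filter stage should drop.
CORPUS = [
    "Resolving the login error on mobile",
    "Quarterly sales report",
    "How to fix a failed payment",
    "the error",
]

def overlap(query: str, doc: str) -> int:
    # Toy relevance score: number of shared lowercase words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def expand(query: str) -> List[str]:
    # Stand-in for LLM query expansion: the original plus one crude variant.
    return [query, query.replace("fix", "resolve")]

def base_retrieve(queries: List[str], corpus: List[str]) -> List[str]:
    # Keep every document that shares at least one word with any variant.
    return [d for d in corpus if any(overlap(q, d) > 0 for q in queries)]

def filter_junk(docs: List[str]) -> List[str]:
    # Drop fragments shorter than three words.
    return [d for d in docs if len(d.split()) >= 3]

def rerank(query: str, docs: List[str], top_k: int) -> List[str]:
    # Re-score against the original query and keep the best top_k.
    return sorted(docs, key=lambda d: overlap(query, d), reverse=True)[:top_k]

def pipeline(query: str, top_k: int = 2) -> List[str]:
    candidates = base_retrieve(expand(query), CORPUS)
    return rerank(query, filter_junk(candidates), top_k)

result = pipeline("how to fix the login error")
print(result)
```

Each technique in this chapter replaces one of these toy functions with something stronger; the order of the stages stays the same.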


3. How does each work?

We'll cover five techniques. Each follows the standard 12-section breakdown, but compressed since the pattern is the same.


Technique 1: Multi-Query Retriever

Why?

A query is embedded as one vector. Text that expresses the same idea in different words gets a different embedding, so a single vector search can miss it.

Example: User asks "how to fix the issue". The docs say "solving the bug" or "resolving the error". A single-query search may match only the phrasing closest to the user's wording and miss the others.

What?

Generate N paraphrases of the user's query, run retrieval for each, merge results.

How?

flowchart LR
    Q[User Query] --> LLM[LLM]
    LLM --> Q1[Variant 1]
    LLM --> Q2[Variant 2]
    LLM --> Q3[Variant 3]
    Q1 & Q2 & Q3 --> R[Base Retriever]
    R --> M[Deduplicated Union]
    M --> OUT[Final docs]

# Assumes `vector_store` is an already-populated LangChain vector store.
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

mq = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(search_kwargs={"k": 3}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)
docs = mq.invoke("How do I fix the issue?")
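
The merge step is easy to see without any library. The following is a minimal sketch of "run retrieval per variant, then take the deduplicated union"; it is not LangChain's implementation. `FAKE_VARIANTS` and `FAKE_INDEX` are hypothetical stand-ins for the LLM and the vector store.

```python
from typing import Callable, List

def multi_query_retrieve(
    query: str,
    generate_variants: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Retrieve for the query and each variant; return a deduplicated union."""
    seen = set()
    merged: List[str] = []
    for q in [query, *generate_variants(query)]:
        for doc_id in retrieve(q):
            if doc_id not in seen:  # first occurrence wins
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Hypothetical stand-ins: canned paraphrases and a lookup-table "retriever".
FAKE_VARIANTS = {"how to fix the issue": ["solving the bug", "resolving the error"]}
FAKE_INDEX = {
    "how to fix the issue": ["doc_fix"],
    "solving the bug": ["doc_bug", "doc_fix"],
    "resolving the error": ["doc_error"],
}

docs = multi_query_retrieve(
    "how to fix the issue",
    generate_variants=lambda q: FAKE_VARIANTS.get(q, []),
    retrieve=lambda q: FAKE_INDEX.get(q, []),
)
print(docs)  # ['doc_fix', 'doc_bug', 'doc_error']
```

The original query alone finds one document; the two paraphrases add two more, and the duplicate `doc_fix` is kept only once.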

Pros / Cons / Trade-offs

| Pros | Cons | Trade-off |
|---|---|---|
| +10-15% recall on vague queries | One extra LLM call plus N× retrievals per query | Use cheap fast model |
| Robust to phrasing | Latency overhead | Cache aggressively |
| Easy to add | Doesn't help when query is already clear | Skip for well-formed queries |

Industry usage

OpenAI Assistants File Search rewrites queries internally before searching. Anthropic's Contextual Retrieval works on the other side of the problem: it adds context to each chunk at indexing time rather than expanding the query.


Technique 2: Contextual Compression Retriever

Why?

A retrieved chunk has 800 chars, but only 100 chars are relevant to the query. The LLM wastes tokens reading the irrelevant 700.

What?

Run each retrieved chunk through an extractor (LLM or reranker) that keeps only the parts relevant to the query.

How?

flowchart LR
    Q[Query] --> R[Base Retriever]
    R --> C[Raw chunks - long]
    C --> EX[LLM/Reranker Extractor]
    EX --> S[Compressed chunks - relevant parts only]

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI  # needed for the compressor LLM below

compressor = LLMChainExtractor.from_llm(
    ChatOpenAI(model="gpt-4o-mini", temperature=0)
)
cc = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 6}),
)
docs = cc.invoke("How does pgvector handle filtering?")
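
To show what "keep only the relevant parts" means mechanically, here is a minimal sketch that uses keyword overlap per sentence as a stand-in for the LLM extractor. The function `compress_chunk`, its stop-word list and the sample chunk are all assumptions made for illustration; a real extractor judges relevance semantically.

```python
import re
from typing import Optional

STOP_WORDS = {"the", "a", "an", "of", "to", "how", "does", "is", "in", "and"}

def compress_chunk(query: str, chunk: str, min_overlap: int = 1) -> Optional[str]:
    """Keep sentences sharing at least `min_overlap` content words with the query."""
    q_words = set(re.findall(r"\w+", query.lower())) - STOP_WORDS
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [
        s for s in sentences
        if len(q_words & set(re.findall(r"\w+", s.lower()))) >= min_overlap
    ]
    # None signals that the whole chunk is irrelevant and should be dropped.
    return " ".join(kept) if kept else None

chunk = (
    "pgvector is a Postgres extension. "
    "Filtering is applied with an ordinary WHERE clause. "
    "Unrelated installation notes follow here."
)
compressed = compress_chunk("How does pgvector handle filtering?", chunk)
print(compressed)
```

The third sentence shares no content word with the query, so it is removed before the chunk reaches the prompt. The same shape (chunk in, shorter chunk or nothing out) applies when an LLM does the extraction.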

Pros / Cons / Trade-offs

| Pros | Cons | Trade-off |
|---|---|---|
| Tighter prompts | One LLM call per chunk | Use small model |
| Easier reasoning for LLM | Indexing-time alternatives are cheaper | For high-value queries only |
| Better fit in prompt budget | Risk of losing useful context | Tune carefully |

Industry usage

Common in customer-support RAG where chunks are long-form articles. Stripe Docs AI uses contextual compression to fit more relevant content in the prompt.


Technique 3: Parent-Document Retriever

Why?

Trade-off: Small chunks match precisely (good recall@k) but lack surrounding context. Large chunks have context but match imprecisely.

What?

Index small chunks for matching. When matched, return the LARGER parent chunk for context.

How?

flowchart LR
    DOC[Original Document] --> P[Parent chunks 2000]
    P --> C[Child chunks 200]
    C --> VS[Vector store - small chunks]
    Q[Query] --> VS
    VS --> MATCH[Matching child IDs]
    MATCH --> RET[Return PARENT chunks]

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter  = RecursiveCharacterTextSplitter(chunk_size=200)

retriever = ParentDocumentRetriever(
    vectorstore=vector_store,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(docs)  # docs: a list of LangChain Document objects loaded beforehand
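
The child-to-parent lookup behind this retriever can be sketched in a few lines. This is an illustration under simplifying assumptions, not LangChain's implementation: `split` is a fixed-width character splitter, substring matching stands in for vector similarity, and the two dictionaries play the roles of docstore and vector store.

```python
from typing import Dict, List, Tuple

def split(text: str, size: int) -> List[str]:
    # Fixed-width character split; a stand-in for a real text splitter.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(doc: str, parent_size: int, child_size: int):
    parents: Dict[int, str] = {}           # docstore: parent_id -> parent text
    children: List[Tuple[str, int]] = []   # "vector store": (child text, parent_id)
    for pid, parent in enumerate(split(doc, parent_size)):
        parents[pid] = parent
        children.extend((child, pid) for child in split(parent, child_size))
    return parents, children

def retrieve_parents(query: str, parents, children) -> List[str]:
    # Match on small children, but return each matching parent once.
    hit_ids: List[int] = []
    for child, pid in children:
        if query.lower() in child.lower() and pid not in hit_ids:
            hit_ids.append(pid)
    return [parents[pid] for pid in hit_ids]

doc = "Section A covers billing rules. " * 3 + "Section B covers refund windows. " * 3
parents, children = build_index(doc, parent_size=100, child_size=25)
results = retrieve_parents("refund", parents, children)
print(len(results), len(results[0]))
```

Several 25-character children match "refund", yet the caller receives one parent chunk of roughly 100 characters, which is the behavior the diagram describes.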

Pros / Cons / Trade-offs

| Pros | Cons | Trade-off |
|---|---|---|
| Precise match + rich context | Two stores to maintain | Use persistent docstore |
| Best of both worlds | More implementation complexity | LangChain handles most of it |
| Production-favorite pattern | Slightly more memory | Worth it |

Industry usage

GitHub Copilot Chat uses a variant: small code-symbol chunks match queries, surrounding function/class context is returned. Almost every production RAG with hierarchical documents uses this pattern.


Technique 4: Self-Query Retriever

Why?

Users describe filters in natural language. "Show me docs about RAG from 2024" — vector search ignores "2024" unless your metadata is queryable.

What?

An LLM parses the user query into a structured form: vector query + metadata filter.

How?

flowchart LR
    Q[NL query: RAG docs from 2024] --> LLM[LLM Parser]
    LLM --> VQ[vector_query = RAG]
    LLM --> MF[filter = year=2024]
    VQ --> R[Retriever]
    MF --> R
    R --> D[Filtered semantic results]

from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.schema import AttributeInfo
from langchain_openai import ChatOpenAI  # needed for the parser LLM below

metadata_fields = [
    AttributeInfo(name="year", description="Year of publication", type="integer"),
    AttributeInfo(name="topic", description="Article topic", type="string"),
]

sq = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
    vectorstore=vector_store,
    document_contents="Technical articles on AI",
    metadata_field_info=metadata_fields,
)

docs = sq.invoke("Find articles about RAG from 2024")
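
Because the LLM can return a filter that does not match your schema, it helps to validate the parsed filter before it reaches the store. The following is a minimal sketch of that check, assuming the same `year` and `topic` fields as above; `validate_filter`, `apply_filter` and the sample parser outputs are hypothetical and not part of LangChain.

```python
from typing import Any, Dict, List

# Allowed filter fields and their expected Python types.
ALLOWED_FIELDS = {"year": int, "topic": str}

def validate_filter(parsed_filter: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only known fields whose values have the expected type."""
    return {
        field: value
        for field, value in parsed_filter.items()
        if field in ALLOWED_FIELDS and isinstance(value, ALLOWED_FIELDS[field])
    }

def apply_filter(docs: List[Dict[str, Any]], flt: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Exact-match metadata filter; a real vector store applies this server-side."""
    return [d for d in docs if all(d["metadata"].get(k) == v for k, v in flt.items())]

# Hypothetical parser outputs for "Find articles about RAG from 2024".
good = validate_filter({"year": 2024, "author": "someone"})  # unknown field dropped
bad = validate_filter({"year": "2024"})                      # wrong type dropped

docs = [
    {"text": "RAG in practice", "metadata": {"year": 2024, "topic": "rag"}},
    {"text": "RAG basics", "metadata": {"year": 2022, "topic": "rag"}},
]
hits = apply_filter(docs, good)
print(good, bad, [d["text"] for d in hits])
```

When validation leaves an empty filter, the search silently becomes unfiltered, so it is worth logging that case or asking the user to restate the constraint.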

Pros / Cons / Trade-offs

| Pros | Cons | Trade-off |
|---|---|---|
| NL filters | One LLM call to parse | Use small model |
| Cleaner UX | Requires structured metadata | Set up at indexing |
| Multi-field filters | LLM may parse wrong | Validate parsed filter |

Industry usage

Stripe Docs AI uses Self-Query to filter by API version. Notion AI extracts page-type and date filters from "find my latest meeting notes" type queries.


Technique 5: Reranking (the most impactful)

Why?

Vector retrieval gives APPROXIMATE top-k. The actual best match may be at rank 7, not rank 1. Bi-encoder embeddings (dense retrievers) are FAST but ROUGH.

What?

Use a cross-encoder model that scores (query, document) PAIRS — more accurate but slower than bi-encoders.

Pattern: retrieve broadly with bi-encoder (k=20-50), rerank with cross-encoder, keep top-k (4-8).

How?

flowchart LR
    Q[Query] --> R[Retriever k=20 wide]
    R --> CE[Cross-encoder reranker]
    CE --> SCORE[Pair-wise scores]
    SCORE --> SORT[Re-sort]
    SORT --> TOP[Top-5 to LLM]

Why bi-encoder vs cross-encoder?

flowchart LR
    subgraph BI[Bi-encoder fast]
        Q1[Query] --> E1[Encode]
        D1[Doc] --> E2[Encode]
        E1 --> COS[Cosine similarity]
        E2 --> COS
    end
    subgraph CE[Cross-encoder accurate]
        Q2[Query] --> CAT[Concatenate]
        D2[Doc] --> CAT
        CAT --> CEM[Single transformer pass]
        CEM --> SCORE2[Score 0-1]
    end

| Encoder | Speed | Accuracy | Use |
|---|---|---|---|
| Bi-encoder | Fast (precompute doc embeddings) | Rough | Retrieval (k=20-50) |
| Cross-encoder | Slow (one pass per pair) | Precise | Reranking (re-score top 20→5) |
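
The speed advantage of the bi-encoder comes from encoding documents once, ahead of time. Here is a minimal sketch of that idea, using a toy bag-of-words `embed` function as a stand-in for a real embedding model; all names and sample documents are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. A real bi-encoder returns a dense vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "how to reset your password",
    "billing and invoices",
    "password policy for admins",
]
doc_vectors = [embed(d) for d in docs]  # computed once, at indexing time

def search(query: str, k: int) -> List[str]:
    query_vector = embed(query)  # the only encoding done at request time
    ranked = sorted(
        range(len(docs)),
        key=lambda i: cosine(query_vector, doc_vectors[i]),
        reverse=True,
    )
    return [docs[i] for i in ranked[:k]]

top = search("reset password", k=2)
print(top)
```

A cross-encoder cannot precompute anything, because its input is the (query, document) pair, which is why it is applied only to the short candidate list.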

Code

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

rerank = CohereRerank(model="rerank-english-v3.0", top_n=4)

retriever = ContextualCompressionRetriever(
    base_compressor=rerank,
    base_retriever=vector_store.as_retriever(search_kwargs={"k": 20}),
)
docs = retriever.invoke("query")
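
The retrieve-wide-then-rerank pattern is independent of any vendor. Below is a minimal sketch with injected functions: `retrieve` and `score_pair` are hypothetical stand-ins for the first-stage retriever and the cross-encoder, and the Jaccard scorer is only a toy that looks at query and document together.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    score_pair: Callable[[str, str], float],
    fetch_k: int = 20,
    top_n: int = 4,
) -> List[Tuple[str, float]]:
    """Fetch fetch_k candidates, score each (query, doc) pair, keep the best top_n."""
    candidates = retrieve(query, fetch_k)
    scored = [(doc, score_pair(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# First-stage order as a rough retriever might return it: best match is last.
FIRST_STAGE = [
    "vector databases overview",
    "index tuning tips",
    "how pgvector handles metadata filtering",
]

def fake_retrieve(query: str, k: int) -> List[str]:
    return FIRST_STAGE[:k]

def jaccard(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

top = retrieve_then_rerank(
    "pgvector metadata filtering", fake_retrieve, jaccard, fetch_k=3, top_n=2
)
print(top)
```

The document that the first stage ranked third ends up first after pair-wise scoring, and only `top_n` results move on to the LLM.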

Reranker options:

  • Cohere Rerank: paid API, fast, top-tier.
  • bge-reranker-v2-m3 (BAAI): open-source, runs locally.
  • Jina Reranker: open + API.
  • ColBERT v2: token-level scoring, strong but heavier.

Pros / Cons / Trade-offs

| Pros | Cons | Trade-off |
|---|---|---|
| +20-30% precision over pure vector | Extra latency (100-300ms) | Worth it for production |
| Catches subtle relevance | Per-query cost | Cohere ~$1 per 1000 queries |
| Easy to plug in | Requires model choice | Test 2-3 rerankers |

Industry usage

Most production-grade RAG stacks (for example, those built on Cohere Rerank or following Anthropic's Contextual Retrieval recipe) use reranking. It is often the largest single retrieval improvement reported when it is added to an existing pipeline.


4. Visual Learning — combining techniques

In production, you stack them:

flowchart LR
    Q[Query] --> SQ[Self-Query: extract metadata filter]
    SQ --> MQ[Multi-Query: generate paraphrases]
    MQ --> H[Hybrid retrieval: dense + BM25]
    H --> RR[Cross-encoder Rerank]
    RR --> CC[Contextual Compression]
    CC --> FINAL[Final docs to LLM]

Each layer fixes a specific failure mode. Add ONE at a time. Measure each.
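
"Measure each" needs a concrete metric. Here is a minimal sketch of recall@k over a small labeled set, comparing a pipeline before and after one added layer. The query IDs, document IDs and both result sets are made-up illustration data.

```python
from typing import Dict, List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("need at least one relevant document per query")
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mean_recall_at_k(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k across all labeled queries."""
    return sum(recall_at_k(runs[q], labels[q], k) for q in labels) / len(labels)

# Made-up labels and ranked results for two queries.
labels = {"q1": {"d1", "d2"}, "q2": {"d7"}}
baseline = {"q1": ["d1", "d9", "d8", "d2"], "q2": ["d3", "d4", "d7"]}
with_rerank = {"q1": ["d1", "d2", "d9", "d8"], "q2": ["d7", "d3", "d4"]}

before = mean_recall_at_k(baseline, labels, k=2)
after = mean_recall_at_k(with_rerank, labels, k=2)
print(before, after)  # 0.25 1.0
```

Running the same function after every added layer gives a like-for-like number, which is what makes the "keep if better" decision possible.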

flowchart LR
    Q[Show me indemnity clauses<br/>from NDAs signed in 2024] --> SQ[Self-Query]
    SQ --> F[filter=type:NDA, year:2024]
    SQ --> VQ[vector_query=indemnity clauses]
    F & VQ --> H[Hybrid retrieve]
    H --> 20[Top 20 chunks]
    20 --> CR[Cohere Rerank]
    CR --> 5[Top 5]
    5 --> PD[Parent-Document expand to full clauses]
    PD --> LLM

Self-Query extracts the date+type filter, hybrid retrieves, rerank refines, parent-doc gives full clause context.


5-7. Pros / Cons / Trade-offs (consolidated)

| Technique | Best for | Cost | Quality gain |
|---|---|---|---|
| Multi-Query | Vague queries | 1 LLM call + N× retrievals | +10-15% recall |
| Contextual Compression | Long chunks | N× LLM calls (one per chunk) | Cleaner prompts |
| Parent-Document | Small-chunk recall + big-chunk context | Free at runtime | +5-15% answer quality |
| Self-Query | NL filters over metadata | 1 LLM call | Better UX, fewer wrong-domain hits |
| Reranking | High-precision needs | $1/1000 queries (Cohere) | +20-30% precision |

Combined effect: going from naive vector retrieval to "hybrid + rerank + parent-doc" typically yields +30-40% recall@k.


8. Real-world Industry Usage

OpenAI

  • File Search in Assistants API: query rewriting, keyword + semantic search, then rerank.
  • ChatGPT browse: web search + content rerank.

Anthropic

  • Contextual Retrieval post (2024): contextual embeddings + contextual BM25 cut failed retrievals by 49%; adding reranking brought the reduction to 67%.
  • Claude Projects use a similar stack internally.

Google

  • Vertex AI Search has built-in reranking.
  • Gemini Code Assist uses contextual compression on code chunks before answering.

Enterprise

  • Stripe Docs AI: hybrid + Cohere Rerank + Self-Query for API version filtering.
  • GitHub Copilot Chat: parent-document retrieval + cross-encoder rerank.
  • JPMorgan DocLLM: Self-Query for filings; rerank for relevance.
  • Notion AI: Multi-Query for paraphrase robustness on its narrative content.

9. Interview Questions

Beginner

  1. What's reranking? — Re-scoring top-k candidates with a more accurate model (cross-encoder).
  2. Bi-encoder vs cross-encoder? — Bi: encodes each side independently (fast). Cross: encodes the pair together (slow but more accurate).
  3. What's Parent-Document Retriever? — Index small chunks, return large parents.

Intermediate

  1. Why does Multi-Query help recall? Vector embeddings are sensitive to phrasing. Multiple paraphrases broaden coverage.
  2. Self-Query — what does the LLM do? — Parses NL query → vector query + structured filter.
  3. Why retrieve k=20 then rerank to k=5? — Bi-encoder gives fast recall (broad). Cross-encoder gives slow precision (narrow). Combine for best of both.

Advanced

  1. How do you tune the reranker's top_n? Evaluate on a labeled set; weigh recall@top_n against the LLM prompt budget; usually 4-8.
  2. When does reranking hurt? — Tiny corpora (<1000 chunks) where rough retrieval already wins. Or when reranker's training distribution differs from yours.
  3. Contextual Compression vs Reranking — overlap? — Reranking re-orders chunks. Compression modifies them (strips irrelevant parts). Use both for max quality + minimum tokens.

System design

  1. Design retrieval for a 1000-product e-commerce search. — Hybrid (BM25 for SKUs, dense for descriptions) + Self-Query (filters: brand, color, price) + Cohere Rerank.
  2. Plan a rollout: dense → hybrid → rerank. — Phase 1: add BM25 retriever, measure. Phase 2: add Cohere Rerank, measure. Phase 3: add Self-Query if needed. Compare metrics at each step.

10. Common Mistakes

Beginners

  • ❌ Adding all 5 at once — can't tell which helps.
  • ❌ Using GPT-4 for Multi-Query expansion (overkill, expensive).
  • ❌ Reranker with k=100 — slow, wasteful.
  • ❌ Skipping eval — guessing improvements.

Production

  • ❌ No latency budget for the stack — Multi-Query + rerank + compression adds 500-1500ms.
  • ❌ Multi-Query without caching common queries.
  • ❌ Self-Query without validating the LLM's parsed filter.
  • ❌ Rerank top_n equal to retrieve k (the reranker only reorders; nothing is filtered out, so the LLM still sees every candidate).
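
A latency budget is easier to enforce when each stage is timed separately. The following is a minimal sketch: `timed` records wall-clock milliseconds per stage and `over_budget` reports the stages that exceeded their limit. The budget values and the sample measurements are assumptions, not recommendations.

```python
import time
from contextlib import contextmanager
from typing import Dict, List

# Hypothetical per-stage budgets in milliseconds.
BUDGET_MS = {"multi_query": 400.0, "rerank": 300.0, "compression": 600.0}

@contextmanager
def timed(stage: str, sink: Dict[str, float]):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[stage] = (time.perf_counter() - start) * 1000.0

def over_budget(measured_ms: Dict[str, float], budget_ms: Dict[str, float]) -> List[str]:
    """Return the stages whose measured time exceeds their budget."""
    return [
        stage for stage, ms in measured_ms.items()
        if ms > budget_ms.get(stage, float("inf"))
    ]

# Timing a real stage: wrap the call in `timed`.
timings: Dict[str, float] = {}
with timed("rerank", timings):
    sorted(range(1000), reverse=True)  # placeholder for the actual rerank call

# Checking sample measurements (made-up numbers) against the budget.
sample = {"multi_query": 350.0, "rerank": 420.0, "compression": 500.0}
violations = over_budget(sample, BUDGET_MS)
print(violations)  # ['rerank']
```

Logging `violations` per request shows which layer to optimize or remove first when the total drifts past the target.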

11. Best Practices

  • Add ONE technique at a time. Measure recall@k. Keep if better.
  • Use cheap fast models for non-final-answer steps (gpt-4o-mini, claude-haiku).
  • Cache LLM-augmented retrieval outputs aggressively.
  • Set strict latency budgets per layer.
  • Run RAGAS (Chapter 12) after every change.
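
Caching is the cheapest of these practices to apply. Here is a minimal sketch that caches query expansions with `functools.lru_cache`, keyed on a normalized query. `fake_llm_expand` is a hypothetical stand-in for the LLM call, and the call counter exists only to make the effect visible.

```python
from functools import lru_cache
from typing import Tuple

CALLS = {"llm": 0}

def fake_llm_expand(query: str) -> Tuple[str, ...]:
    # Stand-in for an LLM call; counts invocations so cache hits are visible.
    CALLS["llm"] += 1
    return (query, f"{query} (rephrased)")

def normalize(query: str) -> str:
    # Collapse case and whitespace so trivially different inputs share a key.
    return " ".join(query.lower().split())

@lru_cache(maxsize=1024)
def _expand_cached(normalized_query: str) -> Tuple[str, ...]:
    return fake_llm_expand(normalized_query)

def expand(query: str) -> Tuple[str, ...]:
    return _expand_cached(normalize(query))

first = expand("How do I fix the issue?")
second = expand("  how do I FIX the issue? ")
print(CALLS["llm"])  # 1
```

An in-process cache like this is lost on restart and is not shared between workers; a shared store with an expiry time is the usual next step, at the cost of an extra network hop.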

12. Evolution Story

flowchart LR
    A[Pure vector retrieval] --> B[+ BM25 hybrid]
    B --> C[+ Multi-Query]
    C --> D[+ Self-Query NL filters]
    D --> E[+ Parent-Document]
    E --> F[+ Contextual Compression]
    F --> G[+ Cross-encoder Reranking]
    G --> H[+ Anthropic Contextual chunking]

Where we are: Each layer fixes a specific failure mode. Production RAG typically uses 3-4 layers, not all.

Where we're going (next chapter): RAG Fusion — a specific advanced pattern that combines Multi-Query with Reciprocal Rank Fusion. We'll see the math in detail and understand why it beats vanilla Multi-Query.


Practice

What does this print?

Expected: True

techniques = ["multi_query", "compression", "parent_document", "self_query", "reranking"]
print(len(techniques) == 5)

Use a CHEAP model for Multi-Query expansion (not GPT-4)

Expected: True

expansion_model = "gpt-4"        # bug: too expensive per query
is_cheap = expansion_model in ("gpt-4o-mini", "claude-haiku")
print(is_cheap)  # prints False until expansion_model is switched to a cheap model

Quiz — Quick check

What you remember

Q1. Parent-Document Retriever solves what problem?

  • Mismatch between matching granularity (small) and context size (large)
  • Slow embeddings
  • Multilingual
  • Filtering

Q2. Why retrieve broad (k=20) then rerank to k=5?

  • Bi-encoder = fast recall, cross-encoder = precise scoring — combine
  • Costs less
  • Required by sklearn
  • Avoids OOM

Q3. Cross-encoder vs bi-encoder?

  • Cross encodes (query, doc) PAIR together; bi encodes each separately
  • No difference
  • Cross is deprecated
  • Bi is more accurate

Common doubts

Do I always need a reranker?

For chat/Q&A over a non-trivial corpus, yes — typically the biggest quality win you'll get. Skip only for tiny corpora or latency-critical paths.

Multi-Query vs RAG Fusion — what's the difference?

Multi-Query merges retrieval results via set union (loses rank info). RAG Fusion uses Reciprocal Rank Fusion (preserves rank info). RAG Fusion typically wins. We cover it in detail in Chapter 8.
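
The difference shows up on a tiny example. Below is a minimal sketch of both merge strategies over three ranked lists; the document IDs are made up, and `k=60` is the constant commonly used with Reciprocal Rank Fusion, where each document scores the sum of `1 / (k + rank)` over the lists it appears in.

```python
from collections import defaultdict
from typing import Dict, List

def union_merge(rankings: List[List[str]]) -> List[str]:
    """Deduplicated union in first-seen order; rank agreement is ignored."""
    seen, merged = set(), []
    for ranking in rankings:
        for doc in ranking:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def rrf_merge(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal Rank Fusion: sum 1 / (k + rank) across lists, highest first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Results for three query variants; "b" ranks high in all of them.
rankings = [["a", "b", "c"], ["b", "c", "d"], ["b", "a", "e"]]

by_union = union_merge(rankings)
by_rrf = rrf_merge(rankings)
print(by_union[:3], by_rrf[:3])
```

The union keeps "a" first simply because the first list happened to start with it, while RRF promotes "b", the document every variant agreed on.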

Should I run all 5 techniques together?

No. Each adds latency + cost. Start with hybrid retrieval. Add reranking. Measure. If recall is still inadequate, add Multi-Query or Self-Query based on diagnosis. Stacking everything is over-engineering.
