Retrievers
1. Why this matters
In RAG, the LLM only sees what the retriever feeds it. If the right document doesn't show up in top-k, the LLM can't answer correctly — no amount of prompt engineering fixes that.
LangChain unifies all retrieval strategies behind one Runnable interface, so you can A/B test (similarity vs MMR vs multi-query) without rewiring your chain.
2. Mental model
A retriever is a black box: feed it a query string, get back the most relevant Documents.
flowchart LR
Q[Query string] --> R[Retriever<br/>Runnable]
R --> D[List of Document<br/>top-k results]
Different retrievers use different strategies to populate that list:
| Strategy | How it ranks |
|---|---|
| Vector similarity | Cosine sim of query embedding vs chunk embeddings |
| MMR | Similarity minus redundancy (more diverse top-k) |
| BM25 / keyword | Classic IR — TF-IDF style |
| Multi-Query | Generates 3–5 paraphrases of the query, unions the results |
| Contextual Compression | Retrieves more, then LLM-filters the chunks |
| Self-Query | LLM extracts metadata filters from the natural-language query |
| Ensemble | Combines multiple retrievers with reciprocal-rank fusion |
| Wikipedia / ArXiv | Hits a third-party search API |
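To make the first row of the table concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The three-dimensional vectors and chunk names are made up for illustration; a real retriever would get embeddings from a model and would typically use an approximate-nearest-neighbor index rather than a full scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (hypothetical values, not produced by a real model).
chunks = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.0],
    "returns_faq": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

result = top_k(query, chunks, k=2)
print(result)
```

The two refund-related chunks point in roughly the same direction as the query, so they rank ahead of the shipping chunk.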
3. Architecture / Flow
flowchart TD
Q[Question] --> V{Strategy?}
V -->|simple| A[VectorStore retriever<br/>top-k similarity]
V -->|diverse top-k| B[MMR retriever]
V -->|paraphrase queries| C[MultiQueryRetriever<br/>LLM generates variants]
V -->|filter then rerank| D[ContextualCompressionRetriever<br/>LLM filters or LongContextReorder]
V -->|extract filters from NL| E[SelfQueryRetriever<br/>LLM → filter expression]
V -->|combine signals| F[EnsembleRetriever<br/>vector + BM25]
V -->|external API| G[WikipediaRetriever / ArxivRetriever]
A --> R[Documents]
B --> R
C --> R
D --> R
E --> R
F --> R
G --> R
4. Core concepts
- BaseRetriever: abstract class. Every retriever implements .invoke(query) -> list[Document].
- vectorstore.as_retriever(...): the bridge from a vector store to a Runnable retriever.
- search_type: "similarity" (default), "mmr", or "similarity_score_threshold".
- search_kwargs: backend-specific, e.g. {"k": 4, "filter": {...}, "fetch_k": 20, "lambda_mult": 0.5}.
- Hybrid retrieval: combining dense (embedding) + sparse (keyword/BM25) signals. Often the single highest-impact upgrade.
- Reranking: after retrieving, re-score top-N with a cross-encoder (e.g., Cohere Rerank) and keep top-K. Big quality boost for ambiguous queries.
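The difference between "similarity" and "similarity_score_threshold" is easiest to see on a list of already-scored results. The sketch below is plain Python, not LangChain code: the `select` helper and the scores are hypothetical, and it assumes relevance scores in [0, 1] where higher is better.

```python
def select(scored, k=4, score_threshold=None):
    """Pick up to k doc ids, optionally dropping low-scoring ones first.

    scored: list of (doc_id, relevance) pairs, relevance in [0, 1].
    """
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if score_threshold is not None:
        ranked = [pair for pair in ranked if pair[1] >= score_threshold]
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical relevance scores for one query.
scored = [("a", 0.91), ("b", 0.78), ("c", 0.52), ("d", 0.30), ("e", 0.12)]

plain = select(scored, k=4)                              # fills k whenever it can
thresholded = select(scored, k=4, score_threshold=0.75)  # may return fewer than k
print(plain, thresholded)
```

With a threshold, a weak query returns fewer (possibly zero) documents instead of padding the context with poor matches.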
5. Code: minimal working example
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
vs = Chroma(persist_directory="./chroma_db",
embedding_function=OpenAIEmbeddings())
retriever = vs.as_retriever(search_kwargs={"k": 4})
docs = retriever.invoke("What is our refund policy?")
for d in docs:
    print(d.metadata.get("source"), "→", d.page_content[:80])
6. Code: real-world pattern
MMR for diverse top-k:
retriever = vs.as_retriever(
search_type="mmr",
search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
# fetch_k = pool size to choose from; lambda_mult = relevance vs diversity (0..1)
)
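What fetch_k and lambda_mult do is easier to see in a from-scratch version. This is a minimal sketch of greedy MMR in plain Python, with made-up two-dimensional embeddings; the candidate dict plays the role of the fetch_k pool. It illustrates the idea and is not the vector store's own implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR: repeatedly pick the chunk with the best
    lambda_mult * relevance - (1 - lambda_mult) * redundancy.

    lambda_mult=1.0 is pure relevance; lower values favor diversity.
    """
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def score(cid):
            relevance = cosine(query_vec, remaining[cid])
            # Redundancy = similarity to the closest already-selected chunk.
            redundancy = max(
                (cosine(remaining[cid], candidates[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy pool: two near-duplicate refund chunks and one different chunk.
pool = {
    "refund_v1": [1.0, 0.1],
    "refund_v2": [1.0, 0.12],
    "exchange": [0.6, -0.8],
}
query = [1.0, 0.0]

diverse = mmr(query, pool, k=2, lambda_mult=0.5)
relevance_only = mmr(query, pool, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With lambda_mult=1.0 the two near-duplicates win; at 0.5 the second slot goes to the less similar but non-redundant chunk.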
Multi-Query — the LLM rewrites the query into 3–5 variants, then unions the results:
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI
mq_retriever = MultiQueryRetriever.from_llm(
retriever=vs.as_retriever(search_kwargs={"k": 4}),
llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)
docs = mq_retriever.invoke("how can I get my money back?")
# under the hood: also tries "what is the refund procedure", "how to request a return", ...
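The "rewrite, retrieve, union" flow can be sketched without any LLM. Everything below is a stand-in: `fake_paraphrase` replaces the LLM rewrite step, `fake_retrieve` replaces the vector store with a keyword lookup, and searching with the original question as well is a choice made for this sketch, not a claim about the library's default.

```python
def multi_query_retrieve(question, paraphrase, retrieve):
    """Retrieve for the question and each paraphrase, then return a
    de-duplicated union that preserves first-seen order."""
    queries = [question] + paraphrase(question)
    seen, merged = set(), []
    for q in queries:
        for doc_id in retrieve(q):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

def fake_paraphrase(question):
    """Hypothetical canned rewrites standing in for the LLM."""
    return ["what is the refund procedure", "how to request a return"]

# Hypothetical keyword index standing in for a vector store.
INDEX = {
    "refund": ["refund_policy"],
    "return": ["returns_faq", "refund_policy"],
}

def fake_retrieve(query):
    """Return the doc ids indexed under any word of the query."""
    hits = []
    for word in query.lower().split():
        hits.extend(INDEX.get(word.strip("?"), []))
    return hits

original_only = fake_retrieve("how can I get my money back?")
docs = multi_query_retrieve("how can I get my money back?", fake_paraphrase, fake_retrieve)
print(original_only, docs)
```

The original wording matches nothing in the index, while the paraphrases surface both relevant documents, which is the recall gain this strategy is after.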
Contextual Compression — retrieve broad, then have an LLM keep only relevant sentences:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
base = vs.as_retriever(search_kwargs={"k": 8})
compressor = LLMChainExtractor.from_llm(ChatOpenAI(model="gpt-4o-mini"))
retriever = ContextualCompressionRetriever(
base_retriever=base,
base_compressor=compressor,
)
Ensemble — combine vector + BM25:
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
bm25 = BM25Retriever.from_documents(chunks)
bm25.k = 4
vec = vs.as_retriever(search_kwargs={"k": 4})
retriever = EnsembleRetriever(
retrievers=[bm25, vec],
weights=[0.4, 0.6], # tune
)
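For intuition about how the two ranked lists and the weights interact, here is a minimal sketch of weighted reciprocal-rank fusion in plain Python. The document ids are made up, and c=60 is a conventional smoothing constant assumed here; treat this as an illustration of the idea rather than the exact EnsembleRetriever code path.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists: each doc scores sum(weight / (c + rank)),
    with rank starting at 1 in each list."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from each retriever, best first.
bm25_hits = ["sku_123_manual", "refund_policy", "shipping_times"]
vector_hits = ["refund_policy", "returns_faq", "sku_123_manual"]

fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6])
print(fused)
```

Documents that appear in both lists rise to the top, and the weights decide which retriever wins when the lists disagree.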
External — Wikipedia:
from langchain_community.retrievers import WikipediaRetriever
wiki = WikipediaRetriever(top_k_results=3, lang="en")
docs = wiki.invoke("Theory of relativity")
Use in an LCEL chain (retriever feeds the prompt):
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
prompt = ChatPromptTemplate.from_template(
"Answer using only the context.\n\nContext: {context}\nQ: {question}"
)
format_docs = lambda ds: "\n\n".join(d.page_content for d in ds)
chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| ChatOpenAI(model="gpt-4o-mini", temperature=0)
| StrOutputParser()
)
print(chain.invoke("What is our refund window?"))
7. Common pitfalls
- ❗ Trusting top-k=1. Retrieve several chunks (k=4 is a reasonable starting point). Recall improves substantially, and the LLM can usually ignore a few irrelevant chunks.
- ❗ Forgetting MMR or rerankers when the corpus has duplicates. Pure cosine can return 4 versions of the same chunk, so the LLM effectively sees only one piece of info.
- ❗ Calling the retriever inside the LLM prompt step by string concat. Use the LCEL pattern — it's traceable, batchable, and tested.
- ❗ Ignoring metadata filters. Filtering before retrieval (filter={"year": 2024}) is faster and more accurate than retrieving broadly and post-filtering.
- ❗ Putting too many chunks in context. More ≠ better. After ~6–8 chunks, the LLM gets confused. Use rerank + compression instead.
- ❗ MultiQueryRetriever in production with a slow LLM. It adds an extra LLM call to generate the query variants, then runs one retrieval per variant, all before the answering call. Use a fast/cheap model for the rewrite step.
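The metadata-filter pitfall is easy to reproduce on toy data. In this sketch the documents, scores, and the two helper functions are hypothetical; the point is only the order of "filter" and "take top-k".

```python
def post_filter(scored_docs, k, year):
    """Take top-k by score first, then drop docs that fail the filter."""
    top = sorted(scored_docs, key=lambda d: d["score"], reverse=True)[:k]
    return [d["id"] for d in top if d["year"] == year]

def pre_filter(scored_docs, k, year):
    """Restrict to matching docs first, then take top-k by score."""
    eligible = [d for d in scored_docs if d["year"] == year]
    top = sorted(eligible, key=lambda d: d["score"], reverse=True)[:k]
    return [d["id"] for d in top]

# Hypothetical similarity scores for one query.
docs = [
    {"id": "policy_2022", "year": 2022, "score": 0.95},
    {"id": "policy_2023", "year": 2023, "score": 0.93},
    {"id": "policy_2024", "year": 2024, "score": 0.90},
    {"id": "faq_2024", "year": 2024, "score": 0.70},
]

post = post_filter(docs, k=2, year=2024)
pre = pre_filter(docs, k=2, year=2024)
print(post, pre)
```

Post-filtering spends both slots on outdated policies and returns nothing, while pre-filtering returns the two 2024 documents.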
8. When to use vs not use
| Use this retriever | When |
|---|---|
| vs.as_retriever() (similarity) | Default: start here |
| MMR | Corpus has many near-duplicates |
| MultiQuery | User queries are vague / paraphrased |
| ContextualCompression | Long chunks; want only the relevant sentences |
| SelfQuery | Queries naturally contain filterable structure (dates, categories) |
| Ensemble | You can run both vector + BM25 affordably |
| Wikipedia / Arxiv | Need world knowledge, not private docs |
| Re-ranker (Cohere / cross-encoder) | Quality matters more than latency |
9. Cheatsheet
# From a vector store
retriever = vs.as_retriever(
search_type="similarity", # or "mmr"
search_kwargs={
"k": 4,
"fetch_k": 20, # MMR-only
"lambda_mult": 0.5, # MMR-only
"filter": {"year": 2024, "doc_type": "policy"},
"score_threshold": 0.75, # with similarity_score_threshold
},
)
# Pre-built strategies
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import (
ContextualCompressionRetriever,
EnsembleRetriever,
SelfQueryRetriever,
ParentDocumentRetriever,
)
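from langchain.retrievers.document_compressors import LLMChainExtractor
# LongContextReorder is a document transformer, not a compressor
from langchain_community.document_transformers import LongContextReorder
# CohereRerank ships in the separate langchain-cohere package
from langchain_cohere import CohereRerank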
from langchain_community.retrievers import (
BM25Retriever,
WikipediaRetriever,
ArxivRetriever,
TFIDFRetriever,
)
# Use anywhere a Runnable is expected
retriever.invoke("query")
docs_per_query = await retriever.abatch(["q1", "q2"])  # inside an async function; one list of Documents per query
# As part of an LCEL chain
chain = ({"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | model)
10. Q&A: recall test
- Q: What's the difference between a vector store and a retriever?
  A: A vector store is a storage + ANN-search engine; a retriever is the higher-level Runnable interface (query → docs). A vector store can become a retriever via .as_retriever(), but retrievers can also wrap APIs (Wikipedia), keyword search (BM25), or compose multiple sources.
- Q: When does MMR beat plain similarity?
  A: When the corpus has redundancy. Plain similarity returns the 4 chunks most similar to the query, often near-duplicates. MMR adds a diversity term so the 4 returned chunks cover different aspects.
- Q: What does MultiQueryRetriever solve?
  A: User queries are often vague or use different vocabulary than the docs. MultiQuery has the LLM paraphrase the query 3–5 ways, retrieves for each, and unions the results, which improves recall.
- Q: Why use a re-ranker?
  A: Embedding similarity is a fast first-pass but inexact. A cross-encoder (looks at query + doc together) is slower but much more accurate. Pattern: retrieve top-20 cheap, rerank to top-4 expensive.
- Q: Difference between contextual compression and reranking?
  A: Reranking reorders the existing chunks by quality. Contextual compression modifies chunks: it removes irrelevant sentences inside each one, keeping only the parts useful for the query.
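The last distinction is easier to keep straight with a toy version of each operation side by side. In this sketch the word-overlap scorer is a deliberately crude stand-in for a cross-encoder or an LLM, and the chunks are made up.

```python
def overlap(query, text):
    """Toy relevance score: how many query words appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().replace(".", " ").split())
    return len(query_words & text_words)

def rerank(query, chunks, top_k):
    """Reorder whole chunks by score and keep the best top_k.
    The text of each chunk is left unchanged."""
    return sorted(chunks, key=lambda c: overlap(query, c), reverse=True)[:top_k]

def compress(query, chunks):
    """Keep only sentences that share a word with the query.
    Chunks with no surviving sentence are dropped."""
    compressed = []
    for chunk in chunks:
        sentences = [s.strip() for s in chunk.split(".") if s.strip()]
        kept = [s for s in sentences if overlap(query, s) > 0]
        if kept:
            compressed.append(". ".join(kept) + ".")
    return compressed

chunks = [
    "Our office is in Berlin. We were founded in 2015.",
    "Shipping takes five days. Refunds are issued within 30 days.",
]
query = "refunds issued"

reranked = rerank(query, chunks, top_k=1)
compressed = compress(query, chunks)
print(reranked)
print(compressed)
```

Reranking returns the second chunk intact, shipping sentence included; compression returns only the refund sentence.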
Practice
Use a retriever's invoke method (not the deprecated get_relevant_documents)
Expected: True
Quiz: Quick check
What you remember
Q1. What does a retriever do in a RAG pipeline?
- Takes a query string, returns the most relevant documents from a knowledge base
- Generates answers
- Trains the model
- Stores embeddings
Why: The retriever is the lookup step. Different retrievers use different strategies (vector similarity, BM25, hybrid). All return a list of Documents for the LLM to use.
Q2. What's MMR (Maximal Marginal Relevance) retrieval?
- Returns the most relevant chunk only
- Returns chunks that are relevant BUT diverse — avoids returning 5 near-duplicates
- Excludes the most relevant
- A reranking technique
Why: Naive vector search may return 5 highly similar chunks. MMR balances "relevant to query" and "different from already-selected chunks", which gives better coverage with the same k.
Q3. Why use hybrid (vector + BM25) retrieval?
- Always faster
- Combines semantic search (vectors) with exact keyword matching (BM25) — catches both "what they meant" and "exact terms they used"
- Required for production
- Cheaper than vector search
Why: Pure vector search can miss exact-keyword matches (e.g., a specific product code). Pure keyword can miss semantic equivalents. Hybrid (e.g., EnsembleRetriever) gets the best of both.
Common doubts
How do I know my retrieval is working well?
Build a small test set: questions + the docs that should answer them. Run retrieval; measure recall@k (was the correct doc in the top-k?). Aim for recall@5 ≥ 0.9. Below that, your generation will struggle no matter how good the LLM is.
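A recall@k check along these lines fits in a few lines of Python. The sketch below uses a canned lookup table in place of a real retriever, and the questions and document ids are hypothetical; `retrieve` is any function that maps a question to a ranked list of document ids.

```python
def recall_at_k(test_set, retrieve, k=5):
    """Fraction of questions whose expected doc appears in the top-k results."""
    hits = 0
    for question, expected_doc in test_set:
        if expected_doc in retrieve(question)[:k]:
            hits += 1
    return hits / len(test_set)

# Hypothetical canned results standing in for a real retriever.
CANNED = {
    "refund window?": ["refund_policy", "returns_faq"],
    "shipping time?": ["returns_faq", "shipping_times"],
    "warranty length?": ["refund_policy", "shipping_times"],
}

# Each entry: (question, id of the doc that should answer it).
test_set = [
    ("refund window?", "refund_policy"),
    ("shipping time?", "shipping_times"),
    ("warranty length?", "warranty_terms"),
]

recall_at_2 = recall_at_k(test_set, CANNED.get, k=2)
recall_at_1 = recall_at_k(test_set, CANNED.get, k=1)
print(recall_at_2, recall_at_1)
```

With a real retriever, `retrieve` could be something like `lambda q: [d.metadata.get("source") for d in retriever.invoke(q)]`, assuming each chunk carries a "source" entry in its metadata.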
Should I use a reranker?
For top quality, yes. Pattern: vector search retrieves top-20 candidates → cross-encoder reranker (Cohere Rerank, BAAI/bge-reranker) scores each (query, doc) pair more accurately → keep top-5 for the LLM. It adds some latency and cost, but typically improves how relevant the final top-5 are.
What's the right value of k?
Start with k=4. Too few = miss relevant info; too many = irrelevant noise dilutes the prompt. Tune by retrieval quality and LLM context budget. For multi-hop questions or long contexts, k=8-10 may be better.