Retrievers

1. Why this matters

In RAG, the LLM only sees what the retriever feeds it. If the right document doesn't show up in top-k, the LLM can't answer correctly — no amount of prompt engineering fixes that.

LangChain unifies all retrieval strategies behind one Runnable interface, so you can A/B test (similarity vs MMR vs multi-query) without rewiring your chain.

2. Mental model

A retriever is a black box: feed it a query string, get back the most relevant Documents.

flowchart LR
    Q[Query string] --> R[Retriever<br/>Runnable]
    R --> D[List of Document<br/>top-k results]

Different retrievers use different strategies to populate that list:

| Strategy | How it ranks |
| --- | --- |
| Vector similarity | Cosine similarity of the query embedding vs chunk embeddings |
| MMR | Similarity minus redundancy (more diverse top-k) |
| BM25 / keyword | Classic IR, TF-IDF style |
| Multi-Query | Generates 3–5 paraphrases of the query, unions the results |
| Contextual Compression | Retrieves more, then LLM-filters the chunks |
| Self-Query | LLM extracts metadata filters from the natural-language query |
| Ensemble | Combines multiple retrievers with reciprocal-rank fusion |
| Wikipedia / ArXiv | Hits a third-party search API |
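
To make the first row concrete, here is a minimal pure-Python sketch of cosine-similarity top-k. It is an illustration only, not how any vector store implements search; the function names and the 2-d "embeddings" are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k_by_similarity(query_vec, chunk_vecs, k=4):
    """Return the ids of the k chunks whose vectors are most similar to the query."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]), reverse=True)
    return ranked[:k]

# Toy 2-d "embeddings" (hypothetical values, not produced by a real model).
chunk_vecs = {
    "refunds": [0.9, 0.1],
    "shipping": [0.1, 0.9],
    "returns": [0.8, 0.3],
}
top2 = top_k_by_similarity([1.0, 0.0], chunk_vecs, k=2)
print(top2)
```

A real vector store does the same ranking with an approximate nearest-neighbor index instead of a full sort.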

3. Architecture / Flow

flowchart TD
    Q[Question] --> V{Strategy?}
    V -->|simple| A[VectorStore retriever<br/>top-k similarity]
    V -->|diverse top-k| B[MMR retriever]
    V -->|paraphrase queries| C[MultiQueryRetriever<br/>LLM generates variants]
    V -->|retrieve then compress| D[ContextualCompressionRetriever<br/>LLM filters or reranker]
    V -->|extract filters from NL| E[SelfQueryRetriever<br/>LLM → filter expression]
    V -->|combine signals| F[EnsembleRetriever<br/>vector + BM25]
    V -->|external API| G[WikipediaRetriever / ArxivRetriever]
    A --> R[Documents]
    B --> R
    C --> R
    D --> R
    E --> R
    F --> R
    G --> R

4. Core concepts

  • BaseRetriever: abstract base class. Every retriever exposes .invoke(query) -> list[Document]; custom subclasses implement _get_relevant_documents.
  • vectorstore.as_retriever(...): the bridge from a vector store to a Runnable retriever.
  • search_type: "similarity" (default), "mmr", or "similarity_score_threshold".
  • search_kwargs: passed through to the backend, e.g. {"k": 4, "filter": {...}}. "fetch_k" and "lambda_mult" apply only to MMR; "score_threshold" applies only to "similarity_score_threshold".
  • Hybrid retrieval: combining dense (embedding) and sparse (keyword/BM25) signals. Often the single highest-impact upgrade.
  • Reranking: after retrieving, re-score the top-N with a cross-encoder (e.g., Cohere Rerank) and keep the top-K. Big quality boost for ambiguous queries.
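
The sparse half of hybrid retrieval can be sketched in a few lines. The function below is a simplified BM25 scorer written for illustration, assuming whitespace tokenization and the commonly used defaults k1=1.5 and b=0.75; it is not the implementation behind BM25Retriever, and the documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every doc against the query with a simplified BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            total += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(total)
    return scores

docs = [
    "error code E42 on startup",
    "how to fix startup problems",
    "general troubleshooting guide",
]
scores = bm25_scores("E42 startup", docs)
print(scores)
```

The exact token "E42" pulls the first document to the top, which is the kind of match a dense embedding can miss.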

5. Code — minimal working example

from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

vs = Chroma(persist_directory="./chroma_db",
            embedding_function=OpenAIEmbeddings())

retriever = vs.as_retriever(search_kwargs={"k": 4})

docs = retriever.invoke("What is our refund policy?")
for d in docs:
    print(d.metadata.get("source"), "→", d.page_content[:80])

6. Code — real-world pattern

MMR for diverse top-k:

retriever = vs.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
    # fetch_k = pool size to choose from; lambda_mult = relevance vs diversity (0..1)
)
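
To see what fetch_k and lambda_mult do, here is a pure-Python sketch of the MMR selection loop. It is a simplified illustration under the assumption of cosine similarity, not LangChain's own code; mmr_select and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, chunk_vecs, k=4, fetch_k=20, lambda_mult=0.5):
    """Pick k chunk ids, trading relevance against similarity to already-picked chunks."""
    # Step 1: candidate pool = the fetch_k chunks most similar to the query.
    pool = sorted(chunk_vecs, key=lambda c: cosine(query_vec, chunk_vecs[c]), reverse=True)[:fetch_k]
    selected = []
    # Step 2: greedily add the candidate with the best relevance-minus-redundancy score.
    while pool and len(selected) < k:
        def mmr_score(c):
            relevance = cosine(query_vec, chunk_vecs[c])
            redundancy = max((cosine(chunk_vecs[c], chunk_vecs[s]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy vectors (hypothetical): two near-duplicate refund chunks and one different chunk.
chunk_vecs = {
    "refund_v1": [1.0, 0.0],
    "refund_v2": [0.99, 0.05],
    "exchange": [0.6, 0.8],
}
query = [1.0, 0.1]
diverse = mmr_select(query, chunk_vecs, k=2, lambda_mult=0.5)
plain = mmr_select(query, chunk_vecs, k=2, lambda_mult=1.0)
print(diverse, plain)
```

With lambda_mult=1.0 the redundancy term vanishes and the result matches plain similarity (both refund chunks); at 0.5 the second slot goes to the less similar but non-duplicate chunk.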

Multi-Query — the LLM rewrites the query into 3–5 variants, then unions the results:

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=vs.as_retriever(search_kwargs={"k": 4}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0),
)
docs = mq_retriever.invoke("how can I get my money back?")
# under the hood: also tries "what is the refund procedure", "how to request a return", ...
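
The "union the results" step can be sketched without any LLM. In this illustration the paraphrases are hard-coded stand-ins for what the LLM would generate, and toy_index / toy_retrieve are a hypothetical keyword lookup standing in for a real retriever.

```python
def unique_union(result_lists):
    """Merge several ranked lists, keeping the first occurrence of each doc id."""
    seen, merged = set(), []
    for results in result_lists:
        for doc_id in results:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged

# Hypothetical keyword index standing in for a real retriever.
toy_index = {
    "refund": ["policy_refunds", "faq_billing"],
    "return": ["policy_returns", "policy_refunds"],
}

def toy_retrieve(query):
    """Return doc ids for every indexed keyword that appears in the query."""
    hits = []
    for keyword, doc_ids in toy_index.items():
        if keyword in query.lower():
            hits.extend(doc_ids)
    return hits

variants = [
    "how can I get my money back?",    # original query: matches nothing in the toy index
    "what is the refund procedure?",   # hard-coded stand-ins for LLM paraphrases
    "how to request a return?",
]
merged = unique_union(toy_retrieve(q) for q in variants)
print(merged)
```

The original wording retrieves nothing here, while the paraphrases surface three distinct documents; that recall gain is the point of the technique.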

Contextual Compression — retrieve broad, then have an LLM keep only relevant sentences:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

base = vs.as_retriever(search_kwargs={"k": 8})
compressor = LLMChainExtractor.from_llm(ChatOpenAI(model="gpt-4o-mini"))
retriever = ContextualCompressionRetriever(
    base_retriever=base,
    base_compressor=compressor,
)
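
Conceptually, compression trims each retrieved chunk down to the sentences that matter for the query. The sketch below shows that shape with a keyword check standing in for the LLM judgment; compress and mentions_query_term are hypothetical helpers, and the chunk text is invented.

```python
import re

def mentions_query_term(query, sentence):
    """Toy relevance test (stand-in for an LLM): does the sentence share a word with the query?"""
    q_words = set(re.findall(r"\w+", query.lower()))
    s_words = set(re.findall(r"\w+", sentence.lower()))
    return bool(q_words & s_words)

def compress(query, chunk, keep=mentions_query_term):
    """Split a chunk into sentences and keep only those the `keep` test accepts."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return " ".join(s for s in sentences if keep(query, s))

chunk = (
    "Our store opened in 2010. "
    "The refund window is 30 days. "
    "Shipping is free on large orders."
)
compressed = compress("refund window", chunk)
print(compressed)
```

An LLM-based extractor makes the same keep-or-drop decision per passage, but with semantic understanding instead of word overlap.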

Ensemble — combine vector + BM25:

from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

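# chunks: the list[Document] produced by your text splitter (not defined in this snippet)
bm25 = BM25Retriever.from_documents(chunks)  # needs the rank_bm25 package installed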
bm25.k = 4
vec  = vs.as_retriever(search_kwargs={"k": 4})

retriever = EnsembleRetriever(
    retrievers=[bm25, vec],
    weights=[0.4, 0.6],   # tune
)
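
The fusion step behind an ensemble can be sketched as weighted reciprocal-rank fusion. This is an illustration under assumptions: the constant c=60 is the value commonly used for RRF, and weighted_rrf and the ranked lists are hypothetical rather than taken from EnsembleRetriever's source.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists: each doc earns weight / (rank + c) from every list it appears in."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs of the two retrievers for the same query.
bm25_hits = ["sku_123_manual", "refund_policy", "shipping_faq"]
vector_hits = ["refund_policy", "returns_howto", "sku_123_manual"]

fused = weighted_rrf([bm25_hits, vector_hits], weights=[0.4, 0.6])
print(fused)
```

Documents that both retrievers return rise to the top, and the weights decide which retriever wins when they disagree.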

External — Wikipedia:

from langchain_community.retrievers import WikipediaRetriever
wiki = WikipediaRetriever(top_k_results=3, lang="en")
docs = wiki.invoke("Theory of relativity")

Use in an LCEL chain (retriever feeds the prompt):

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer using only the context.\n\nContext: {context}\nQ: {question}"
)
def format_docs(ds):
    # Join the retrieved chunks into one context string for the prompt.
    return "\n\n".join(d.page_content for d in ds)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | ChatOpenAI(model="gpt-4o-mini", temperature=0)
    | StrOutputParser()
)
print(chain.invoke("What is our refund window?"))

7. Common pitfalls

  • Trusting top-k=1. Retrieve several chunks (k=4 is a reasonable start). Recall improves substantially, and the LLM can usually ignore a few irrelevant chunks.
  • Forgetting MMR or rerankers when the corpus has duplicates. Pure cosine similarity can return 4 near-copies of the same chunk, so the LLM effectively sees one piece of information.
  • Calling the retriever inside the LLM prompt step by string concat. Use the LCEL pattern — it's traceable, batchable, and tested.
  • Ignoring metadata filters. Filtering before retrieval (filter={"year": 2024}) is faster and more accurate than retrieving broadly and post-filtering.
  • Putting too many chunks in context. More ≠ better. Beyond roughly 6–8 chunks, irrelevant text tends to dilute the answer. Use rerank + compression instead.
  • MultiQueryRetriever in production with a slow LLM. Each query adds one extra LLM call (to generate the N variants) plus N retrieval calls. Use a fast/cheap model for the rewrite step.
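
The metadata-filter pitfall is easy to reproduce with a toy ranking. In this sketch the scoring function is a simple word count and the documents are invented; it only illustrates why filtering after top-k can leave you with nothing.

```python
def relevance(query, text):
    """Toy relevance: how many words in the text also appear in the query."""
    query_words = set(query.lower().split())
    return sum(1 for word in text.lower().split() if word in query_words)

def top_k(query, candidates, k):
    """Return the k candidates with the highest toy relevance."""
    return sorted(candidates, key=lambda d: relevance(query, d["text"]), reverse=True)[:k]

# Hypothetical corpus: the older documents happen to match the query wording best.
docs = [
    {"id": "a", "year": 2022, "text": "refund policy refund policy details"},
    {"id": "b", "year": 2022, "text": "refund policy overview"},
    {"id": "c", "year": 2024, "text": "updated refund rules"},
    {"id": "d", "year": 2024, "text": "policy changes for shipping"},
]
query = "refund policy"

# Post-filter: take the top 2 overall, then drop anything not from 2024.
post_filtered = [d["id"] for d in top_k(query, docs, k=2) if d["year"] == 2024]

# Pre-filter: restrict to 2024 first, then take the top 2.
pre_filtered = [d["id"] for d in top_k(query, [d for d in docs if d["year"] == 2024], k=2)]

print(post_filtered, pre_filtered)
```

Post-filtering returns an empty list because the 2022 documents occupy both top-k slots, while pre-filtering returns the two 2024 documents.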

8. When to use vs not use

| Use this retriever | When |
| --- | --- |
| vs.as_retriever() (similarity) | Default; start here |
| MMR | Corpus has many near-duplicates |
| MultiQuery | User queries are vague / paraphrased |
| ContextualCompression | Long chunks; want only the relevant sentences |
| SelfQuery | Queries naturally contain filterable structure (dates, categories) |
| Ensemble | You can run both vector + BM25 affordably |
| Wikipedia / Arxiv | Need world knowledge, not private docs |
| Re-ranker (Cohere / cross-encoder) | Quality matters more than latency |

9. Cheatsheet

# From a vector store
retriever = vs.as_retriever(
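    search_type="similarity",   # or "mmr" / "similarity_score_threshold"
    search_kwargs={             # pass only the keys that match the chosen search_type
        "k": 4,
        "fetch_k": 20,          # MMR only
        "lambda_mult": 0.5,     # MMR only
        "filter": {"year": 2024, "doc_type": "policy"},  # filter syntax is backend-specific
        "score_threshold": 0.75,  # similarity_score_threshold only
    },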
)

# Pre-built strategies
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import (
    ContextualCompressionRetriever,
    EnsembleRetriever,
    SelfQueryRetriever,
    ParentDocumentRetriever,
)
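from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_community.document_transformers import LongContextReorder
from langchain_cohere import CohereRerank  # separate langchain-cohere package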
from langchain_community.retrievers import (
    BM25Retriever,
    WikipediaRetriever,
    ArxivRetriever,
    TFIDFRetriever,
)

# Use anywhere a Runnable is expected
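retriever.invoke("query")
retriever.batch(["q1", "q2"])                    # one list of Documents per query
results = await retriever.abatch(["q1", "q2"])   # async variant; call inside an async function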

# As part of an LCEL chain
chain = ({"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | model)

10. Q&A — recall test

  • Q: What's the difference between a vector store and a retriever? A: A vector store is a storage + ANN-search engine; a retriever is the higher-level Runnable interface (query → docs). A vector store can become a retriever via .as_retriever(), but retrievers can also wrap APIs (Wikipedia), keyword search (BM25), or compose multiple sources.

  • Q: When does MMR beat plain similarity? A: When the corpus has redundancy. Plain similarity returns the 4 chunks most similar to the query — often near-duplicates. MMR adds a diversity term so the 4 returned chunks cover different aspects.

  • Q: What does MultiQueryRetriever solve? A: User queries are often vague or use different vocabulary than the docs. MultiQuery has the LLM paraphrase the query 3–5 ways, retrieves for each, and unions the results — improving recall.

  • Q: Why use a re-ranker? A: Embedding similarity is a fast first-pass but inexact. A cross-encoder (looks at query + doc together) is slower but much more accurate. Pattern: retrieve top-20 cheap, rerank to top-4 expensive.

  • Q: Difference between contextual compression and reranking? A: Reranking reorders the existing chunks by quality. Contextual compression modifies chunks — removes irrelevant sentences inside each one, keeping only the parts useful for the query.

Practice

What does this print?

Expected: 4

# k is the number of chunks the retriever returns
k = 4
print(k)

Use a retriever's invoke method (not the deprecated get_relevant_documents)

Expected: True

# Modern API: retriever.invoke(query); old: retriever.get_relevant_documents(query)
use_modern = False                # bug: should be True for invoke
print(use_modern)

Quiz — Quick check

What you remember

Q1. What does a retriever do in a RAG pipeline?

  • Takes a query string, returns the most relevant documents from a knowledge base
  • Generates answers
  • Trains the model
  • Stores embeddings

Why: The retriever is the lookup step. Different retrievers use different strategies (vector similarity, BM25, hybrid). All return a list of Documents for the LLM to use.

Q2. What's MMR (Maximal Marginal Relevance) retrieval?

  • Returns the most relevant chunk only
  • Returns chunks that are relevant BUT diverse — avoids returning 5 near-duplicates
  • Excludes the most relevant
  • A reranking technique

Why: Naive vector search may return 5 highly similar chunks. MMR balances "relevant to query" and "different from already-selected chunks" — better coverage with the same k.

Q3. Why use hybrid (vector + BM25) retrieval?

  • Always faster
  • Combines semantic search (vectors) with exact keyword matching (BM25) — catches both "what they meant" and "exact terms they used"
  • Required for production
  • Cheaper than vector search

Why: Pure vector search can miss exact-keyword matches (e.g., a specific product code). Pure keyword can miss semantic equivalents. Hybrid (e.g., EnsembleRetriever) gets the best of both.

Common doubts

How do I know my retrieval is working well?

Build a small test set: questions + the docs that should answer them. Run retrieval; measure recall@k (was the correct doc in the top-k?). Aim for recall@5 ≥ 0.9. Below that, your generation will struggle no matter how good the LLM is.
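
A recall@k check along those lines can be very small. The sketch below assumes each question has exactly one expected document id; recall_at_k, the canned results, and the test set are hypothetical, and toy_retrieve stands in for a call to your real retriever.

```python
def recall_at_k(test_set, retrieve, k=5):
    """Fraction of questions whose expected doc id appears in the top-k results."""
    hits = sum(
        1 for question, expected_id in test_set
        if expected_id in retrieve(question)[:k]
    )
    return hits / len(test_set)

# Canned results standing in for a real retriever (hypothetical doc ids).
canned_results = {
    "What is the refund window?": ["policy_refunds", "faq_billing"],
    "How do I reset my password?": ["faq_account", "guide_security"],
    "Which carriers do you ship with?": ["faq_billing", "policy_refunds"],
    "Can I change my order?": ["faq_orders"],
}

def toy_retrieve(question):
    """Return the canned ranked doc ids for a question."""
    return canned_results.get(question, [])

test_set = [
    ("What is the refund window?", "policy_refunds"),
    ("How do I reset my password?", "faq_account"),
    ("Which carriers do you ship with?", "policy_shipping"),
    ("Can I change my order?", "faq_orders"),
]

score = recall_at_k(test_set, toy_retrieve, k=5)
print(f"recall@5 = {score:.2f}")
```

With a real retriever, replace toy_retrieve with a function that calls retriever.invoke(question) and returns an identifier from each Document's metadata.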

Should I use a reranker?

For top quality, yes. Pattern: vector search retrieves top-20 candidates → cross-encoder reranker (Cohere Rerank, BAAI/bge-reranker) scores each (query, doc) pair more accurately → keep top-5 for the LLM. This adds some latency and cost, but it typically improves the precision of the final top-5.
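
The two-stage shape can be sketched with any pairwise scorer. Here a word-overlap function stands in for the cross-encoder, so the numbers mean nothing; rerank, toy_pair_score, and the candidate texts are hypothetical.

```python
def toy_pair_score(query, doc):
    """Stand-in for a cross-encoder: fraction of query words present in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)

def rerank(query, candidates, score_pair, top_n):
    """Re-score every (query, doc) pair and keep the top_n best docs."""
    ranked = sorted(candidates, key=lambda doc: score_pair(query, doc), reverse=True)
    return ranked[:top_n]

# Pretend this is the first-stage order from a vector search (invented texts).
candidates = [
    "shipping times for international orders",
    "refund window is 30 days from delivery",
    "how to contact support",
    "refund requests need the order number",
]
top2 = rerank("refund window days", candidates, toy_pair_score, top_n=2)
print(top2)
```

Swapping toy_pair_score for a real cross-encoder keeps the same structure: a cheap wide first pass, then an expensive narrow second pass.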

What's the right value of k?

Start with k=4. Too few = miss relevant info; too many = irrelevant noise dilutes the prompt. Tune by retrieval quality and LLM context budget. For multi-hop questions or long contexts, k=8-10 may be better.