Perform Self-Querying Retrieval with MongoDB and LangChain

You can integrate MongoDB Vector Search with LangChain to perform self-querying retrieval. This tutorial demonstrates how to use the self-querying retriever to run natural language MongoDB Vector Search queries with metadata filtering.

Self-querying retrieval uses an LLM to process your search query to identify possible metadata filters, forms a structured vector search query with the filters, and then runs the query to retrieve the most relevant documents.

Example

With a query like "What are thriller movies from after 2010 with ratings above 8?", the retriever can identify filters on the genre, year, and rating fields, and use those filters to retrieve documents that match the query.
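
To make this example concrete, the following sketch shows roughly what the retriever derives from that question: a semantic search string plus a metadata filter written with MongoDB comparison operators. The build_filter helper and the tuple format are hypothetical and exist only for illustration. In the actual retriever, the LLM produces the structured query and the retriever translates it.

```python
# Hypothetical illustration: decompose a question into a semantic query
# and a metadata filter that uses MongoDB comparison operators.

def build_filter(conditions):
    """Combine (field, operator, value) tuples into one filter document."""
    clauses = [{field: {op: value}} for field, op, value in conditions]
    if not clauses:
        return {}  # no metadata constraints were detected
    # A single condition does not need to be wrapped in $and
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

question = "What are thriller movies from after 2010 with ratings above 8?"

structured_query = {
    "query": "thriller movies",  # the part used for semantic search
    "filter": build_filter([
        ("genre", "$eq", "thriller"),
        ("year", "$gt", 2010),
        ("rating", "$gt", 8),
    ]),
}
print(structured_query["filter"])
```

The three clauses correspond to the genre, year, and rating fields named in the example.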

Work with a runnable version of this tutorial as a Python notebook.

Prerequisites

To complete this tutorial, you must have the following:

One of the following MongoDB cluster types:
- An Atlas cluster running MongoDB version 6.0.11, 7.0.2, or later. Ensure that your IP address is included in your Atlas project's access list.
- A local Atlas deployment created using the Atlas CLI. To learn more, see Create a Local Atlas Deployment.
- A MongoDB Community or Enterprise cluster with Search and Vector Search installed.
A Voyage AI API key. To learn more, see Voyage AI Documentation.
An OpenAI API key. You must have an OpenAI account with credits available for API requests. To learn more about registering an OpenAI account, see the OpenAI API website.

Use MongoDB as a Vector Store

In this section, you create a vector store instance using your MongoDB cluster as a vector database.

Set up the environment.

Set up the environment for this tutorial. Create an interactive Python notebook by saving a file with the .ipynb extension. This notebook allows you to run Python code snippets individually, and you'll use it to run the code in this tutorial.

To set up your notebook environment:

Run the following command in your notebook:
```
pip install --quiet --upgrade langchain-mongodb langchain-voyageai langchain-openai langchain langchain-core lark
```
Set environment variables.
Run the following code to set the environment variables for this tutorial. Provide your Voyage AI API key, your OpenAI API key, and the connection string for your MongoDB cluster.
```
import os
os.environ["OPENAI_API_KEY"] = "<openai-key>"
os.environ["VOYAGE_API_KEY"] = "<voyage-key>"
MONGODB_URI = "<connection-string>"
```
Note
Replace <connection-string> with the connection string for your Atlas cluster or local Atlas deployment.
For an Atlas cluster, your connection string should use the following format:
mongodb+srv://<db_username>:<db_password>@<clusterName>.<hostname>.mongodb.net
To learn more, see Connect to a Cluster via Drivers.
For a local Atlas deployment, your connection string should use the following format:
mongodb://localhost:<port-number>/?directConnection=true
To learn more, see Connection Strings.
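
If your database username or password contains reserved characters such as @ or /, they must be percent-encoded before you place them in the connection string. The following is a minimal sketch that uses only the Python standard library. The build_srv_uri helper and the sample values are hypothetical placeholders, not part of this tutorial's required code.

```python
from urllib.parse import quote_plus

def build_srv_uri(username, password, host):
    """Return an SRV connection string with percent-encoded credentials."""
    return f"mongodb+srv://{quote_plus(username)}:{quote_plus(password)}@{host}"

# Placeholder values for illustration only; substitute your own credentials and host
uri = build_srv_uri("app_user", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)
```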

Instantiate the vector store.

Run the following code in your notebook to create a vector store instance named vector_store using the langchain_db.self_query namespace in MongoDB:

from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_voyageai import VoyageAIEmbeddings
# Use the voyage-3-large embedding model
embedding_model = VoyageAIEmbeddings(model="voyage-3-large")
# Create the vector store
vector_store = MongoDBAtlasVectorSearch.from_connection_string(
   connection_string = MONGODB_URI,
   embedding = embedding_model,
   namespace = "langchain_db.self_query",
   text_key = "page_content"
)

Add data to the vector store.

Paste and run the following code in your notebook to ingest some sample documents with metadata into your collection in MongoDB.

from langchain_core.documents import Document
docs = [
    Document(
        page_content="A bunch of scientists bring back dinosaurs and mayhem breaks loose",
        metadata={"year": 1993, "rating": 7.7, "genre": "action"},
    ),
    Document(
        page_content="A fight club that is not a fight club, but is a fight club",
        metadata={"year": 1994, "rating": 8.7, "genre": "action"},
    ),
    Document(
        page_content="Leo DiCaprio gets lost in a dream within a dream within a dream within a ...",
        metadata={"year": 2010, "genre": "thriller", "rating": 8.2},
    ),
    Document(
        page_content="A bunch of normal-sized women are supremely wholesome and some men pine after them",
        metadata={"year": 2019, "rating": 8.3, "genre": "drama"},
    ),
    Document(
        page_content="Three men walk into the Zone, three men walk out of the Zone",
        metadata={"year": 1979, "rating": 9.9, "genre": "science fiction"},
    ),
    Document(
        page_content="A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea",
        metadata={"year": 2006, "genre": "thriller", "rating": 9.0},
    ),
    Document(
        page_content="Toys come alive and have a blast doing so",
        metadata={"year": 1995, "genre": "animated", "rating": 9.3},
    ),
    Document(
        page_content="The toys come together to save their friend from a kid who doesn't know how to play with them",
        metadata={"year": 1997, "genre": "animated", "rating": 9.1},
    ),
]
# Add data to the vector store, which automatically embeds the documents
vector_store.add_documents(docs)

If you're using Atlas, you can verify your vector embeddings by navigating to the langchain_db.self_query namespace in the Atlas UI.
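
If you can't use the Atlas UI, you can run a similar check on a document that you fetch yourself. The sketch below assumes that the vector store writes the vector to a field named embedding (believed to be the library default, but not stated in this tutorial), that metadata fields are stored at the top level of each document, and that voyage-3-large produces 1024-dimensional vectors, as the index definition later in this tutorial implies. The check_stored_document helper is hypothetical, and the sample document is a made-up stand-in.

```python
def check_stored_document(doc, embedding_key="embedding", expected_dims=1024):
    """Return a list of problems found in one stored document (empty if none)."""
    problems = []
    vector = doc.get(embedding_key)  # assumed field name for the vector
    if not isinstance(vector, list):
        problems.append(f"missing or non-list '{embedding_key}' field")
    elif len(vector) != expected_dims:
        problems.append(f"expected {expected_dims} dimensions, found {len(vector)}")
    # Metadata fields that this tutorial later indexes for filtering
    for field in ("year", "rating", "genre"):
        if field not in doc:
            problems.append(f"missing metadata field '{field}'")
    return problems

# Stand-in for a document fetched from the collection (vector values are made up)
sample = {
    "page_content": "Toys come alive and have a blast doing so",
    "embedding": [0.0] * 1024,
    "year": 1995,
    "rating": 9.3,
    "genre": "animated",
}
print(check_stored_document(sample))
```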

Create the MongoDB Vector Search index with filters.

Run the following code to create the MongoDB Vector Search index with filters for the vector store to enable vector search and metadata filtering over your data:

# Use LangChain helper method to create the vector search index
vector_store.create_vector_search_index(
   dimensions = 1024, # The dimensions of the vector embeddings to be indexed
   filters = [ "genre", "rating", "year" ], # The metadata fields to be indexed for filtering
   wait_until_complete = 60 # Number of seconds to wait for the index to build (can take around a minute)
)

Tip

To learn more, see the create_vector_search_index API reference.

The index should take about one minute to build. While it builds, the index is in an initial sync state. When it finishes building, you can start querying the data in your collection.
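
For reference, the helper call above corresponds roughly to a MongoDB Vector Search index definition of the following shape: one vector field plus one filter field per metadata path. This is a sketch only. The vector_index_definition function is hypothetical, and the embedding path and cosine similarity are assumed defaults rather than values stated in this tutorial.

```python
def vector_index_definition(dimensions, filters, path="embedding", similarity="cosine"):
    """Build a vector search index definition with filter fields."""
    # The vector field that holds the embeddings (path and similarity are assumptions)
    fields = [{
        "type": "vector",
        "path": path,
        "numDimensions": dimensions,
        "similarity": similarity,
    }]
    # One filter entry per metadata field that queries can pre-filter on
    fields += [{"type": "filter", "path": name} for name in filters]
    return {"fields": fields}

definition = vector_index_definition(1024, ["genre", "rating", "year"])
print(definition)
```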

Create the Self-Querying Retriever

In this section, you initialize the self-querying retriever to query data from your vector store.

Describe the documents and metadata fields.

To use the self-querying retriever, you must describe the documents in your collection and the metadata fields that you want to filter on. This information helps the LLM understand the structure of your data and how to filter results based on user queries.

from langchain.chains.query_constructor.schema import AttributeInfo
# Define the document content description 
document_content_description = "Brief summary of a movie"
# Define the metadata fields to filter on
metadata_field_info = [
    AttributeInfo(
        name="genre",
        description="The genre of the movie",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the movie was released",
        type="integer",
    ),
    AttributeInfo(
        name="rating",
        description="A 1-10 rating for the movie",
        type="float",
    ),
]
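
Every field that the retriever can filter on must also be indexed as a filter field in the vector search index. The following sketch is one way to check that the two lists agree. The unindexed_fields helper is hypothetical, and the lists are copied by hand from this tutorial rather than read from the live index.

```python
# Field names from metadata_field_info and from the index "filters" list
declared_fields = ["genre", "year", "rating"]
indexed_filters = ["genre", "rating", "year"]

def unindexed_fields(declared, indexed):
    """Return declared field names that are missing from the index filters, sorted."""
    return sorted(set(declared) - set(indexed))

missing = unindexed_fields(declared_fields, indexed_filters)
print(missing)
```

An empty list means every described field can be used in a filter.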

Initialize the self-querying retriever.

Run the following code to create a self-querying retriever using the MongoDBAtlasSelfQueryRetriever.from_llm method.

from langchain_mongodb.retrievers import MongoDBAtlasSelfQueryRetriever
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
retriever = MongoDBAtlasSelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    metadata_field_info=metadata_field_info,
    document_contents=document_content_description
)

Tip

To learn more, see the MongoDBAtlasSelfQueryRetriever API Reference.

Run Queries with the Self-Querying Retriever

Run the following queries to see how the self-querying retriever executes different types of queries:

# This example specifies a filter (rating > 9)
retriever.invoke("What are some highly rated movies (above 9)?")

[Document(id='686e84de13668e4048bf9ff3', metadata={'_id': '686e84de13668e4048bf9ff3', 'year': 1979, 'rating': 9.9, 'genre': 'science fiction'}, page_content='Three men walk into the Zone, three men walk out of the Zone'),
 Document(id='686e84de13668e4048bf9ff5', metadata={'_id': '686e84de13668e4048bf9ff5', 'year': 1995, 'genre': 'animated', 'rating': 9.3}, page_content='Toys come alive and have a blast doing so'),
 Document(id='686e84de13668e4048bf9ff6', metadata={'_id': '686e84de13668e4048bf9ff6', 'year': 1997, 'genre': 'animated', 'rating': 9.1}, page_content="The toys come together to save their friend from a kid who doesn't know how to play with them")]

# This example specifies a semantic search and a filter (rating > 9)
retriever.invoke("I want to watch a movie about toys rated higher than 9")

[Document(id='686e84de13668e4048bf9ff5', metadata={'_id': '686e84de13668e4048bf9ff5', 'year': 1995, 'genre': 'animated', 'rating': 9.3}, page_content='Toys come alive and have a blast doing so'),
 Document(id='686e84de13668e4048bf9ff6', metadata={'_id': '686e84de13668e4048bf9ff6', 'year': 1997, 'genre': 'animated', 'rating': 9.1}, page_content="The toys come together to save their friend from a kid who doesn't know how to play with them"),
 Document(id='686e84de13668e4048bf9ff3', metadata={'_id': '686e84de13668e4048bf9ff3', 'year': 1979, 'rating': 9.9, 'genre': 'science fiction'}, page_content='Three men walk into the Zone, three men walk out of the Zone')]

# This example specifies a composite filter (rating >= 9 and genre = thriller)
retriever.invoke("What's a highly rated (above or equal 9) thriller film?")

[Document(id='686e84de13668e4048bf9ff4', metadata={'_id': '686e84de13668e4048bf9ff4', 'year': 2006, 'genre': 'thriller', 'rating': 9.0}, page_content='A psychologist / detective gets lost in a series of dreams within dreams within dreams and Inception reused the idea')]

# This example specifies a query and composite filter (year > 1990 and year < 2005 and genre = action)
retriever.invoke(
    "What's a movie after 1990 but before 2005 that's all about dinosaurs, " +
    "and preferably has the action genre"
)

[Document(id='686e84de13668e4048bf9fef', metadata={'_id': '686e84de13668e4048bf9fef', 'year': 1993, 'rating': 7.7, 'genre': 'action'}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(id='686e84de13668e4048bf9ff0', metadata={'_id': '686e84de13668e4048bf9ff0', 'year': 1994, 'rating': 8.7, 'genre': 'action'}, page_content='A fight club that is not a fight club, but is a fight club')]

# This example only specifies a semantic search query
retriever.invoke("What are some movies about dinosaurs")

[Document(id='686e84de13668e4048bf9fef', metadata={'_id': '686e84de13668e4048bf9fef', 'year': 1993, 'rating': 7.7, 'genre': 'action'}, page_content='A bunch of scientists bring back dinosaurs and mayhem breaks loose'),
 Document(id='686e84de13668e4048bf9ff5', metadata={'_id': '686e84de13668e4048bf9ff5', 'year': 1995, 'genre': 'animated', 'rating': 9.3}, page_content='Toys come alive and have a blast doing so'),
 Document(id='686e84de13668e4048bf9ff1', metadata={'_id': '686e84de13668e4048bf9ff1', 'year': 2010, 'genre': 'thriller', 'rating': 8.2}, page_content='Leo DiCaprio gets lost in a dream within a dream within a dream within a ...'),
 Document(id='686e84de13668e4048bf9ff6', metadata={'_id': '686e84de13668e4048bf9ff6', 'year': 1997, 'genre': 'animated', 'rating': 9.1}, page_content="The toys come together to save their friend from a kid who doesn't know how to play with them")]
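
To see why each filtered query returns the documents it does, you can apply the same comparisons to the sample metadata locally. The sketch below is only an approximation in plain Python: the matches function and OPS table are hypothetical, and the actual filtering happens on the server as part of the vector search query.

```python
import operator

# Comparison operators used by the filters in this tutorial
OPS = {"$eq": operator.eq, "$gt": operator.gt, "$gte": operator.ge,
       "$lt": operator.lt, "$lte": operator.le}

def matches(metadata, flt):
    """Return True if the metadata dict satisfies the filter."""
    if "$and" in flt:
        return all(matches(metadata, clause) for clause in flt["$and"])
    # Otherwise expect clauses of the form {field: {operator: value}}
    return all(
        field in metadata and OPS[op](metadata[field], value)
        for field, cond in flt.items()
        for op, value in cond.items()
    )

# Metadata of the eight sample documents added earlier
movies = [
    {"year": 1993, "rating": 7.7, "genre": "action"},
    {"year": 1994, "rating": 8.7, "genre": "action"},
    {"year": 2010, "rating": 8.2, "genre": "thriller"},
    {"year": 2019, "rating": 8.3, "genre": "drama"},
    {"year": 1979, "rating": 9.9, "genre": "science fiction"},
    {"year": 2006, "rating": 9.0, "genre": "thriller"},
    {"year": 1995, "rating": 9.3, "genre": "animated"},
    {"year": 1997, "rating": 9.1, "genre": "animated"},
]

rating_above_9 = {"rating": {"$gt": 9}}
top_thriller = {"$and": [{"rating": {"$gte": 9}}, {"genre": {"$eq": "thriller"}}]}

print([m["year"] for m in movies if matches(m, rating_above_9)])
print([m["year"] for m in movies if matches(m, top_thriller)])
```

The 2006 thriller has a rating of exactly 9.0, so it passes the "above or equal 9" filter but not the strict "above 9" filter.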

Use the Retriever in Your RAG Pipeline

You can use the self-querying retriever in your RAG pipeline. Paste and run the following code in your notebook to implement a sample RAG pipeline that performs self-querying retrieval.

This code also sets the enable_limit parameter to True, which lets the LLM limit the number of documents that the retriever returns when the question asks for a specific number, such as "two movies". The generated response might vary.

import pprint
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4o")
# Configure self-query retriever with a document limit
retriever = MongoDBAtlasSelfQueryRetriever.from_llm(
    llm=llm,
    vectorstore=vector_store,
    document_contents=document_content_description,
    metadata_field_info=metadata_field_info,
    enable_limit=True
)
# Define a prompt template
template = """
   Use the following pieces of context to answer the question at the end.
   {context}
   Question: {question}
"""
prompt = PromptTemplate.from_template(template)
# Construct a chain to answer questions on your data
chain = (
   { "context": retriever, "question": RunnablePassthrough()}
   | prompt
   | llm
   | StrOutputParser()
)
# Prompt the chain
question = "What are two movies about toys after 1990?" # year > 1990 and document limit of 2
answer = chain.invoke(question)
print("Question: " + question)
print("Answer: " + answer)
# Return source documents
documents = retriever.invoke(question)
print("\nSource documents:")
pprint.pprint(documents)

Question: What are two movies about toys after 1990?
Answer: The two movies about toys after 1990 are:
1. The 1995 animated movie (rated 9.3) where toys come alive and have fun.
2. The 1997 animated movie (rated 9.1) where toys work together to save their friend from a kid who doesn’t know how to play with them.
Source documents:
[Document(id='686e84de13668e4048bf9ff5', metadata={'_id': '686e84de13668e4048bf9ff5', 'year': 1995, 'genre': 'animated', 'rating': 9.3}, page_content='Toys come alive and have a blast doing so'),
 Document(id='686e84de13668e4048bf9ff6', metadata={'_id': '686e84de13668e4048bf9ff6', 'year': 1997, 'genre': 'animated', 'rating': 9.1}, page_content="The toys come together to save their friend from a kid who doesn't know how to play with them")]
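
In the pipeline above, the retrieved documents are passed to the prompt as a list of Document objects. A common refinement is to format them as plain text first. The following sketch shows one way to do that. The format_docs helper is hypothetical, and the Doc class is only a minimal stand-in so that the example runs without LangChain installed.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Render each document as one line of text followed by its year and rating."""
    lines = []
    for d in docs:
        year = d.metadata.get("year", "unknown year")
        rating = d.metadata.get("rating", "unrated")
        lines.append(f"{d.page_content} (year: {year}, rating: {rating})")
    return "\n".join(lines)

docs = [
    Doc("Toys come alive and have a blast doing so", {"year": 1995, "rating": 9.3}),
    Doc("Three men walk into the Zone, three men walk out of the Zone"),
]
print(format_docs(docs))
```

With real Document objects, a helper like this could be composed into the chain as retriever | format_docs for the context entry.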
