I built a RAG chatbot using LangChain + ChromaDB + OpenAI embeddings. The pipeline works, but the chatbot sometimes fails to return the most relevant PDF content, even though that content exists in the vector DB.
Code snippet (simplified):
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Reopen the persisted Chroma index with the same embedding model used at ingestion
embeddings = OpenAIEmbeddings()  # defaults to text-embedding-ada-002
db = Chroma(persist_directory="db", embedding_function=embeddings)

query = "What is the interest rate policy?"
docs = db.similarity_search(query, k=3)  # top-3 nearest chunks by vector distance
Sometimes, it retrieves totally irrelevant documents.
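One thing I did to sanity-check the similarity-metric angle: Chroma's default distance is squared L2, and since ada-002 embeddings are unit-normalized, L2 and cosine should produce the same ranking. A minimal sketch with toy unit vectors (stand-ins for real embeddings, not actual API output) confirming that equivalence:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def l2_sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    # For unit vectors, cosine similarity is just the dot product
    return sum(x * y for x, y in zip(a, b))

# Toy "embeddings" standing in for real ada-002 vectors (assumption)
query = normalize([1.0, 0.2, 0.1])
chunks = [normalize(v) for v in ([0.9, 0.3, 0.0], [0.1, 1.0, 0.5], [1.0, 0.1, 0.2])]

# Rank chunks by ascending L2 distance and by descending cosine similarity
rank_l2 = sorted(range(3), key=lambda i: l2_sq(query, chunks[i]))
rank_cos = sorted(range(3), key=lambda i: -cosine(query, chunks[i]))
print(rank_l2 == rank_cos)  # prints True: for unit vectors, ||a-b||^2 = 2 - 2*cos
```

So if my vectors really are normalized, the metric itself shouldn't be the cause, which makes me suspect chunking or query phrasing instead.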
What I’ve checked:
- Chunks were split at 500 tokens.
- Embeddings were created with text-embedding-ada-002.
- The database persists correctly (documents are present after reload).
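To rule out chunking, I'm considering re-splitting with overlap so that a sentence cut at a chunk boundary still appears whole in at least one chunk. A rough sketch of fixed-size chunking with overlap (word-based here as a stand-in for tokens, which is an assumption; in LangChain this would be a text splitter configured with a chunk_overlap):

```python
def chunk_with_overlap(words, size=500, overlap=50):
    """Split a word list into fixed-size chunks where each chunk
    repeats the last `overlap` words of the previous one."""
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # last chunk reached the end of the document
    return chunks

# Hypothetical 1200-"token" document
words = [f"w{i}" for i in range(1200)]
chunks = chunk_with_overlap(words, size=500, overlap=50)
print(len(chunks))   # 3 chunks
print(chunks[1][0])  # "w450": chunk 2 re-covers the last 50 words of chunk 1
```

The idea is that the 50-word overlap keeps boundary sentences retrievable from either side, at the cost of some index redundancy.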
Question:
Could this be due to the chunk size, the choice of embedding model, or the similarity metric? How can I improve retrieval accuracy?