September 24, 2024

Building a RAG system? There’s no one embedding model to rule them all

The Pointable engineering team: Mike Wu, Scott Wey, Brian Wu, Jerry Wang, Justin Sung, Tyler Duong, and Andrew Maas

Introduction

When building a Retrieval-Augmented Generation (RAG) system, choosing an embedding model is a critical decision to achieve high-quality results. Embedding models transform text into dense vector representations in a high-dimensional space, aiming to represent similar concepts as vectors that are close to each other. A RAG system stores many “chunks” in a large database and relies on embedding similarity to choose which chunks are most relevant to responding to a particular query / LLM context.

However, the best notion of similarity can vary dramatically depending on the domain and task of a RAG system. For instance, in a general context, "weather forecast" and "snowflake inbound" might be considered related concepts. But in a tech startup environment, "snowflake inbound" could refer to data architecture for Snowflake’s data cloud – quite a different semantic relationship!
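To make this concrete, here is a minimal sketch (using the sentence-transformers library and an arbitrary small general-purpose model as an assumption, not one from our experiments) of how you might probe what a given embedding model considers similar; the exact scores depend entirely on the model you load:

```python
# Minimal sketch: comparing phrase similarity under a general-purpose embedding model.
# The model name is illustrative; swap in whichever embedding you are evaluating.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small general-purpose model

phrases = ["weather forecast", "snowflake inbound", "Snowflake data warehouse ingestion"]
embeddings = model.encode(phrases, normalize_embeddings=True)

# Cosine similarity between "snowflake inbound" and the other two phrases.
print(util.cos_sim(embeddings[1], embeddings[0]))  # weather sense
print(util.cos_sim(embeddings[1], embeddings[2]))  # data-infrastructure sense
```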

This variability has major implications for choosing an embedding model: it’s difficult to know in advance which embedding will best encode similarity for any particular RAG system. In fact, even when we carefully choose an embedding for a RAG task, we can’t know how well it actually performs until we evaluate the effect of embedding similarity choices in a full end-to-end RAG system.

In this blog post, we'll explore a case study that demonstrates why popular heuristics like "bigger is better", "always choose the top-ranked model", or even “use the domain-specific embedding” can be misleading when selecting an embedding similarity model for RAG. By the end, you'll understand why an empirical, systematic approach to embedding selection is essential for getting the most out of your RAG system, along with some surprising results about which embedding works best on this task!

Experiment

We compared the performance of various embedding models, including several top-ranked embeddings, as part of a RAG system built on top of the FiQA financial question-answering dataset [2], one of the datasets in the BEIR [1] and MTEB [4] benchmarks. This simulates an engineer developing a RAG solution for a particular task in finance or finance-related data processing. We want the retrieval system to return chunks from the RAG database based on finance-specific semantic similarity, which may involve nuanced concepts or specialized notions of similarity that are important for achieving reliable RAG responses. Remember, if the retrieval query returns irrelevant information, the RAG system’s response is often confusing or wrong, whether it’s a chatbot or another LLM application.

For each embedding, we optimized the other RAG hyperparameters using Pointable’s configuration engine to ensure a fair comparison. We then measured performance using nDCG@10 (normalized Discounted Cumulative Gain at 10). This metric evaluates how well the system retrieves relevant documents across a test set, with a focus on the top 10 results. nDCG@10 not only considers whether relevant documents are retrieved but also their ranking, giving higher weight to relevant documents appearing earlier in the results. A score of 1.0 indicates perfect retrieval and ranking. Note that evaluating retrieval quality directly means we can ensure good contextual information is provided to any LLM we choose for downstream processing.
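If you want to compute this metric yourself, here is a minimal sketch of nDCG@k from the relevance labels of a ranked result list; this is just the standard formula, not Pointable’s evaluation code:

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the top-k results, discounting by log2(rank + 1)."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # ranks 1..k -> log2(2)..log2(k+1)
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k=10):
    """Normalized DCG: the system's DCG divided by the DCG of an ideal reordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels (1 = relevant, 0 = not) of the top 10 chunks retrieved for one query.
print(ndcg_at_k([1, 0, 1, 0, 0, 1, 0, 0, 0, 0], k=10))  # ~0.87
```

In a full evaluation the ideal DCG would be computed against all known relevant documents for the query, not just the ones that were retrieved.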

The experiment results were eye-opening:

Embedding | MTEB rank | Model size (parameters) | Dimensions | FiQA nDCG@10 | FiQA hit rate@10
mukaj/fin-mpnet-base | 257 | 109M | 768 | 0.799 | 0.913
Alibaba-NLP/gte-Qwen2-7B-instruct | 4 | 7613M | 3584 | 0.557 | 0.632
Salesforce/SFR-Embedding-2_R | 3 | 7111M | 4096 | 0.541 | 0.599
nvidia/NV-Embed-v1-2 | 10 | 7851M | 4096 | 0.559 | 0.634
avsolatorio/GIST-Embedding-v0 | 43 | 109M | 768 | 0.406 | 0.471
thenlper/gte-base | 56 | 109M | 768 | 0.407 | 0.480

The standout performer was mukaj/fin-mpnet-base, a relatively unknown embedding ranked 257th on the HuggingFace MTEB leaderboard. Despite its low general ranking, it achieved a 43% gain in nDCG@10 over much larger and more popular embedding models when applied to financial data. Intuitively, it makes sense that this embedding did well, since it’s an embedding specialized for financial data. Now let’s dig deeper into the results to understand what is happening.

Qualitative results

Looking at a few examples helps us understand how superior performance on nCDG@10 might translate into user experience for a production RAG system. We compared the responses generated using mukaj/fin-mpnet-base (our top-performing domain-specific embedding) and Salesforce/SFR-Embedding-2_R (a highly-ranked general embedding) for three financial queries. We retrieved chunks using embedding similarity and then produced a RAG response using OpenAI’s GPT-4o model.
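As a rough illustration of that setup (a minimal sketch, not Pointable’s production pipeline; it assumes the sentence-transformers and openai packages and an OPENAI_API_KEY in the environment), retrieval by embedding similarity followed by generation looks roughly like this:

```python
# Minimal retrieve-then-generate sketch: embed the corpus, retrieve the top-k
# chunks by cosine similarity, and pass them to GPT-4o as context.
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("mukaj/fin-mpnet-base")
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

corpus = ["...chunk 1...", "...chunk 2..."]  # placeholder: your RAG database chunks
corpus_emb = embedder.encode(corpus, normalize_embeddings=True)

def answer(query: str, k: int = 10) -> str:
    query_emb = embedder.encode(query, normalize_embeddings=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    context = "\n\n".join(corpus[h["corpus_id"]] for h in hits)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content
```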

The main takeaway from these examples is clear: by retrieving more relevant documents, the domain-specific embedding enables the generation of more detailed, accurate, and helpful responses. Let's look at each query in turn (examples are from the FiQA test set):

query: How to find a business consultant that would ensure that all your business activities are legal and compliant with all regulations?

mukaj/fin-mpnet-base: To find a business consultant who can ensure that all your business activities are legal and compliant with all regulations, you might start by establishing relationships with professionals such as a business law attorney and a CPA or tax attorney. These professionals can provide essential legal and financial guidance. It's advisable to have a contract or retainer agreement with them or at least have met them in advance so you're prepared in case of an emergency. Next, when looking for a business consultant, seek someone who offers a comprehensive package that includes consulting and business planning in addition to their accounting skills. To evaluate the ethical and professional reliability of potential consultants, you can: 1. Make a small transaction with them first and observe how they treat you. If they handle small dealings with integrity, it's a good sign. 2. Ask family and friends for recommendations based on their experiences. 3. Utilize public sources like business review websites, consumer advocacy organizations, and groups such as the Better Business Bureau to gather more information. While relying on the advice of strangers has its limitations, these steps can help ensure you choose a trustworthy and competent professional.
Salesforce/SFR-Embedding-2_R: To find a business consultant that would ensure all your business activities are legal and compliant with regulations, you should seek out an experienced consultant who specializes in this area. Make sure they have a track record of providing effective solutions and understand the regulatory environment relevant to your industry. It's also beneficial to approach reputable organizations known for offering knowledgeable consultants who can analyze your business and develop a compliance-focused action plan. 

query: Do I need a Like-Kind Exchange when selling a personal vehicle for a company car? 

mukaj/fin-mpnet-base: No, you do not need to perform a Like-Kind Exchange when selling a personal vehicle for a company car. Like-Kind Exchanges (Sec. 1031) are only applicable to business or investment properties, not personal property like your personal car. Additionally, since the company car does not belong to you and you didn't purchase it yourself, you didn't actually perform any exchange. It seems that if Turbo-Tax is prompting you to fill out an exchange form, you may have entered incorrect information. Double-check your entries to ensure you correctly indicated that you sold your personal car and did not purchase a new one in the transaction.
Salesforce/SFR-Embedding-2_R: No, you do not need a Like-Kind Exchange when selling a personal vehicle for a company car. Additionally, since you did not purchase the new car but are instead receiving a company car, no actual exchange occurred. Make sure your tax entries reflect that you sold your personal car and did not buy a new one.

query: Does revenue equal gross profit for info product business? 

mukaj/fin-mpnet-base: No, revenue does not equal gross profit for an info product business. Revenue is the total income generated from selling the products or services, while gross profit is the revenue minus the cost of goods sold (COGS). In an info product business, there are multiple costs that need to be considered, such as web-hosting fees, internet service costs, computer equipment expenses, development costs, support calls or emails, phone service for sales, and advertising or marketing expenses. Gross profit takes into account these costs to provide a clearer picture of the actual profit made from business operations.
Salesforce/SFR-Embedding-2_R: No, revenue does not equal gross profit for an info product business. Revenue is the total amount of money generated from sales before any expenses are deducted. Gross profit, on the other hand, is the revenue minus the cost of goods sold.

You can judge for yourself which response you prefer in each example above. We found that the domain-specific fin-mpnet embedding consistently yielded more specific, detailed responses, which makes sense: its similarity queries provide the response-generating LLM with detailed, relevant context. So in this case we can clearly see a difference in response quality driven by the difference in embedding similarity.

Visualizing embedding differences

We can use visualization and debugging tools, beyond simply reviewing responses, to get a better qualitative understanding of embedding model similarity. To understand what an embedding model deems similar, it's helpful to visualize how different embeddings group and organize text chunks. We'll use UMAP [3] (Uniform Manifold Approximation and Projection) to reduce high-dimensional embedding vectors to a 2D space. In the visualizations below, each point corresponds to a chunk of text that has been “embedded” by the embedding model. The closer two points are in the two-dimensional visualization, the closer they are in the higher-dimensional embedding space.
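Producing this kind of map is straightforward; here is a minimal sketch using the umap-learn, sentence-transformers, and matplotlib packages (the chunk list is a placeholder you would replace with your own RAG database):

```python
# Sketch: embed text chunks, project the vectors to 2D with UMAP, and plot them.
import matplotlib.pyplot as plt
import umap
from sentence_transformers import SentenceTransformer

chunks = ["chunk text 1", "chunk text 2"]  # placeholder: the chunks from your RAG database
model = SentenceTransformer("mukaj/fin-mpnet-base")  # or whichever embedding you are inspecting

embeddings = model.encode(chunks, normalize_embeddings=True)
coords = umap.UMAP(n_components=2, metric="cosine", random_state=42).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=4, alpha=0.5)
plt.title("UMAP projection of chunk embeddings")
plt.show()
```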

Let's start by examining the NVIDIA NV-Embed-v1-2, a top-performing general embedding model, applied to our FiQA dataset.

At first glance, NV-Embed-v1-2 appears to handle financial topics quite well. When we zoom in on a cluster in the upper left of the visualization (the box in the image above), we find chunks discussing life insurance as an investment, expense, and asset. These chunks are grouped together in a sensible way for our notion of “finance domain semantic similarity”.

Going deeper into financial subtopics, we explore cryptocurrency-related terms, since we know that’s one area where NV-Embed-v1-2 underperformed in the quantitative retrieval evaluation.

Superficially, it seems to do well, but as we continue to explore the visualization, we notice a distinct cluster in the top right corner. Upon closer inspection, we discover something unexpected: these chunks all begin with the same boilerplate summary template. This reveals a potential flaw in the embedding model – it's grouping chunks based on text structure rather than content. That’s not good! This could be a place where valuable information is not retrieved as “similar” because the embedding model is more sensitive to text format than to finance topic.

To confirm this observation, we highlighted any point corresponding to text that starts with the phrase "This is the best tl;dr I could make, [original]". The result clearly shows that the NVIDIA embedding model tightly groups these summarized articles together, regardless of their actual content. 
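This check is easy to reproduce: building on the projection sketch above (reusing its chunks and coords variables), you can color the points whose text starts with the boilerplate prefix:

```python
# Highlight chunks beginning with the tl;dr boilerplate prefix, reusing `chunks`
# and `coords` from the UMAP sketch above.
import matplotlib.pyplot as plt
import numpy as np

is_boilerplate = np.array([c.startswith("This is the best tl;dr I could make") for c in chunks])

plt.scatter(coords[~is_boilerplate, 0], coords[~is_boilerplate, 1], s=4, alpha=0.3, label="other chunks")
plt.scatter(coords[is_boilerplate, 0], coords[is_boilerplate, 1], s=8, color="red", label="tl;dr boilerplate")
plt.legend()
plt.show()
```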

Now, let's compare this to the ultimate winner of the embedding model experiment, the domain-specific embedding, mukaj/fin-mpnet-base.

The difference is striking. With mukaj/fin-mpnet-base, article summaries are spread across various topics, grouped by relevant financial topics such as oil industry and investment discussions, rather than clumped together based on text format as we saw above.

This content-based similarity is far more likely to retrieve relevant information regardless of the text structure. We can see texts of varying formats discussing the same oil-price topic grouped together – the domain-specific embedding would clearly return more relevant information for this topic (oil prices, the oil industry, oil companies).

Now, it’s not practical to visually explore every topic or query/chunk cluster in a retrieval database, but visualizing in this way helped us confirm that the aggregate quantitative metrics correspond to real differences in how different embedding models encode similarity for this task.

The case for systematic RAG pipeline configuration

Our experiment with the FiQA dataset shows that there's no reliable heuristic for determining the best embedding model for your specific RAG use case – quantitative evaluation is the only dependable way to decide which embedding similarity works best for a new task.

At Pointable we’ve seen dozens of datasets for RAG systems, and it’s quite common to find that the optimal embedding model for a task or specific domain is not a top-25 model on the global MTEB rankings. Relatedly, people often assume that larger embedding models will always yield better results, but the training data of an embedding model often matters more than its size. In the case study above, the top MTEB-ranked models, roughly 70x bigger than fin-mpnet-base, still significantly underperformed it. Of course, smaller isn't universally superior either: other embeddings of similar size to fin-mpnet-base scored about half as well in terms of nDCG.

Embedding similarity is one of several choices that impact overall RAG quality, and it’s critical to configure and optimize these parameters for your task. This covers the full pipeline from raw source data to a live RAG system, including steps like text/data normalization, chunking, retrieval strategy, hybrid retrieval weight tuning, and response generation / context prompting choices.
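As a purely hypothetical sketch of what searching this configuration space can look like (this is not Pointable’s configuration engine; the parameter names and the evaluate_config stub are illustrative placeholders you would fill in with your own pipeline-building and evaluation code):

```python
# Hypothetical grid search over a few RAG pipeline parameters, scored by mean nDCG@10.
from itertools import product

search_space = {
    "embedding_model": ["mukaj/fin-mpnet-base", "thenlper/gte-base"],
    "chunk_size": [256, 512],
    "chunk_overlap": [0, 64],
    "hybrid_weight": [0.3, 0.5, 0.7],  # blend between dense and keyword retrieval scores
}

def evaluate_config(config: dict) -> float:
    """Build the pipeline with these parameters and return mean nDCG@10 on held-out queries."""
    raise NotImplementedError("plug in your own pipeline construction and evaluation here")

scored = []
for values in product(*search_space.values()):
    config = dict(zip(search_space.keys(), values))
    scored.append((evaluate_config(config), config))

best_score, best_config = max(scored, key=lambda item: item[0])
print(best_score, best_config)
```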

At Pointable, we take an empirical, MLOps-powered approach to this problem. We build our technology around two key principles:

  1. Develop an empirical evaluation of the end-to-end RAG system on real data, and use optimization techniques to choose the many system parameters a RAG system requires. Don’t waste time configuring the system through hand-engineered trial and error.
  2. Focus on improving the data. Once a RAG system is configured, it nearly always becomes limited by the scope, quality, and specificity of the source data. If you don’t want LLM hallucinations, the RAG source data must be comprehensive and accurate. By automating the configuration and evaluation of RAG systems, we free up time to iterate on dataset improvements. This is a close analog to the idea of data-centric deep learning.

Conclusion

Picking the right embedding model is only one piece of the RAG optimization puzzle. Other important parameters include data/text normalization, chunking strategy, and the retrieval weighting for hybrid queries. 

Each of these factors interacts with the others in complex ways, creating a vast configuration space that's impossible to navigate through intuition or manual tuning alone. The only way through it is end-to-end, systematic optimization.

With Pointable, you don’t need to make rules-based choices like, “it’s a finance dataset, so we should use the finance embedding.” If the finance embedding really is the best choice, that conclusion falls naturally out of the optimization process.

If you're building a RAG system and want to ensure you’re choosing the right embedding model, we'd love to help! Our team of experts can guide you through the optimization process, helping you unlock the full potential of your RAG application.

If you’re a product leader building an LLM application, feel free to reach out! We have helped many teams with projects stuck in the prototype phase. Book time with one of our founders here.

References

[1] Thakur, Nandan, et al. "BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models." arXiv preprint arXiv:2104.08663 (2021).

[2] Maia, Macedo, et al. "WWW'18 open challenge: Financial opinion mining and question answering." Companion Proceedings of The Web Conference 2018, pp. 1941–1942.

[3] McInnes, Leland, John Healy, and James Melville. "UMAP: Uniform manifold approximation and projection for dimension reduction." arXiv preprint arXiv:1802.03426 (2018).

[4] Muennighoff, Niklas, et al. "MTEB: Massive text embedding benchmark." arXiv preprint arXiv:2210.07316 (2022).

© Pointable 2024. All rights reserved.