Why you need RAG optimization

Introduction

Imagine you're building a chatbot to answer questions about your company's knowledge base. You hear about retrieval-augmented generation (RAG) and decide to give it a try. You chunk up your documents, embed them with a popular model, store the vectors in a database, and set up a basic RAG pipeline.

At first glance, everything seems to be working. Your chatbot is responding to queries and pulling in relevant information. But as you start testing more thoroughly, you notice some issues. Sometimes the responses are vague or miss key details. Other times they contain irrelevant or contradictory information. What's going on?

The reality is that while RAG systems are relatively easy to set up, crossing the gap from "interesting prototype" to "production-ready application" is much more challenging. Let's look at an example from a recent project at Pointable -- building a conversational agent powered by the New York Federal Reserve Bank’s blog. Given a domain-relevant test question, let’s compare the response from a poorly-tuned RAG system versus a well-tuned system:

Input: What is the reason for the recent shocks in crypto stablecoins?

Poorly-Tuned Response: The recent shocks in crypto stablecoins, such as Terra, occurred due to an algorithmic mechanism that the system relied on. This mechanism assumed that investors would always be willing to buy the stablecoin whenever its price dropped below the $1 peg. This eventually led to a situation where investors were unwilling to buy the stablecoin, even if its price had dropped below $1, because it was backed by an asset whose price was declining.

Well-Tuned Response: The recent shocks in crypto stablecoins can be attributed to a run on Terra, an algorithmic stablecoin, in May 2022. Prior to this event, this stablecoin's price was pegged at $1. However, it broke its peg and plummeted to zero, sparking a domino effect throughout the stablecoin industry. Stablecoins backed by riskier assets were particularly hard hit, prompting investors to shift their investments to less risky stablecoins regulated by U.S. authorities. This incident also impacted broader cryptocurrency markets, resulting in an estimated $200 billion decline in market value over eight days.

While both responses have correct information, the well-tuned system provides more specific, comprehensive, and actionable information – providing a precise date (May 2022) and quantifies the market impact ($200 billion decline). This level of detail and accuracy is what users expect from a production-grade system, and a big reason why we often experience chatbot responses as “bland” or “not quite right.”

The challenge: RAG systems are compound systems

So why is it so challenging to get RAG systems to create accurate, rich responses on large datasets?

The key is understanding that RAG systems are compound AI systems. In a compound system multiple components interact to produce a final response and in RAG systems several of those components are AI/ML systems! This makes the overall system behavior difficult to understand and modify in targeted ways.

RAG (retrieval augmented generation) is not a single, specific technology or model. RAG is a design pattern for LLM applications where we use dynamic context inserted into an LLM context/prompt rather than a single static prompt. There are many ways to query an external system and provide information from that system to an LLM context window. Whether interfacing with a graph database, collection of documents, or internet search API, so long as we are dynamically populating the context inputs for an LLM call, it’s RAG.
Short version: RAG == Dynamic context for LLM prompts

The canonical RAG pipeline that you might see on Twitter looks like this:

Chunk documents into smaller pieces
Vectorize the chunks using an embedding model
Store the vectors in a database
At query time, retrieve the "closest" vectors to the querysome text
- Try string matching queries and combine with vector similarity if good
Insert the retrieved context into the prompt for an LLM inference call in your broader agent/application

While this process may seem straightforward, each step involves numerous configuration choices and parameters that can dramatically impact overall system performance. There are several questions you’ll need to address in building and improving this RAG approach, for example:

What chunking strategy should you use? Sentence-level, paragraph-level, or something else?
Which embedding model is best for your specific data and use case?
What distance metric should you use for similarity search?
How many results should you retrieve?
How should you balance vector similarity vs exact string matching?
What prompt structure will work best with your retrieved context?

In total, a production RAG system can have hundreds of system parameters to tune. And here's the kicker - you should adjust these parameters for your specific dataset/task, and you can't optimize these parameters in isolation! You might already have heard of ways these parameters interact – the best chunking strategy depends on your embedding model and the ideal number of results to retrieve interacts with your prompt engineering. Every change ripples through the full system and can impact final output quality (for better or worse) in ways we can’t easily anticipate.

The impact of RAG configuration choices

With so many interacting parameters, there are millions of potential configurations for any given RAG system. Which one is best for your specific datasets and tasks? And how much does it matter really? Is proper configuration a “nice to have” or “must have” step when developing a RAG system?

Pointable builds lots of RAG systems, so let's look at some real data to illustrate the impact. For the Federal Reserve Bank blog, Pointable’s RAG configuration engine evaluated over 136,808 configurations of the pipelines to populate and serve RAG data. In this optimization run, we measure retrieval quality via top-10 document hit rate on a large set of (natural language query, relevant document chunk) ground truth pairs. A high hit rate means the system is fetching the right documents to answer questions, while a low hit rate indicates irrelevant information is being surfaced.

The results were eye-opening:

The best configurations achieved a 99% hit rate
The worst configurations had hit rates below 20%
The average hit rate was 90%
There was a bimodal distribution, with clusters around 98-99% and 70-80%

This is a histogram of RAG query retrieval performance for 136,808 different (hyperparameter) configurations that were tested while Pointable’s Configuration Engine searched for an optimal configuration. The hit rate spans the range of 0.16 to 0.99 with an average of 0.90 (y-axis is log scale).

‍

This massive variance in performance highlights just how critical proper optimization is. The difference between an 80% hit rate and a 99% hit rate is the difference between a system that's incorrect enough to be essentially useless and one that consistently provides highly relevant information.

Moreover, the bimodal distribution reveals the difficulty of optimization. You can easily get stuck in a local maximum around 70-80% and miss out on the configurations that push performance to the next level.

Optimization: Choosing optimal RAG configurations automatically

Okay, so RAG configuration seems important. If we agree we want a RAG configuration that is adjusted for our task/datasets, how can we ensure that happens? In particular, what does it take to find the optimal configuration for a particular task? Here’s a few considerations:

We need a metric! To quantify whether a configuration is optimal, we need an evaluation task, dataset, and metric. One approach, use a dataset of (query, document) pairs and use information retrieval metrics to quantify performance. Whatever you choose, you might need some tooling or production data to use as an evaluation set
Manually choosing parameters is painful and hard to track. It’s not a good use of engineering time to manually explore configurations, you will want a more efficient way to try several approaches and keep track of what works
Choosing the best individual components from various leaderboards will not necessarily yield the best combined system for your task
This is a combinatorial optimization problem and we are optimizing over several steps of populating and serving the RAG system, so setting up automatic approaches for choosing configurations can require substantial engineering effort

Let’s look more closely at how we might choose optimal RAG configurations.

The challenges of manual optimization

Given the importance of optimization, why don't more teams invest heavily in it? The simple answer is that manual optimization is incredibly time-consuming and complex. Most teams don’t set up proper metrics and instead rely on ad hoc interactive testing of their system to determine if it’s good enough. This works for prototyping, but does not scale as we build RAG systems that cover more data/topics, or need to ensure our RAG system will respond properly in a range of production settings. Even with a metric, it can be time consuming to re-run RAG setup if you have not already built adequate infrastructure and workflow management for the RAG query layer.

As a result, most teams resort to educated guesswork, using the configuration settings they have heard of, or at best, testing a handful of configurations based on intuition. As in life, “not choosing is also making a decision” so teams using software packages and leaving embedding, chunking, querying choices unchanged from default settings, nearly always end up with sub-optimal configurations.

Global optimization benchmarking isn’t helpful

Most teams do not have a test suite or any sort of quantitative scoring of their RAG system to help guide configuration decisions. They might probe the system with ad-hoc queries, and get a sense of whether it’s “good” or “bad”, but that hardly constitutes a rigorous evaluation. Another approach is to choose the best individual components according to general best practices or broad benchmarks. We could choose embedding models, chunking strategies, and retrieval strategies by looking at HuggingFace leaderboards for those categories. Unfortunately, the top embedding model on HuggingFace is likely the “top” according to some global metric of embedding that doesn’t translate to top performance for our dataset, and the best individual components might not work well in concert for a particular task.

Metric-driven configuration optimization requires excellent infrastructure

We’d like to have a metric and use some automated routine to help search through the space of possible configurations to find something optimal for a given task. Sounds nice, but achieving this means you will likely need to re-run your RAG setup and evaluation at least dozens of times. If your RAG setup requires manual execution, deployment, etc it’s not practical to re-run things many times. Teams who invest in scalable RAG infrastructure and infrastructure-as-code approaches to RAG pipelines can automate configuration search, but it’s a substantial engineering effort to set up such infrastructure and workflow tools.

Pointable’s Approach: automated, bespoke RAG optimization

At Pointable, we've developed a solution to this challenge: an automated RAG optimization engine. Our system simulates thousands of potential conversations with different RAG configurations, allowing us to rapidly explore the configuration space and identify optimal parameters for a specific task and dataset.

Here's how it works:

We start by ingesting the source document corpus and generating a diverse set of test queries based on the content. (evaluation set and metric ✅)
Our configuration engine then generates thousands of candidate RAG configurations, varying parameters across all components of the system.
- We use Starpoint, a high-performance vector database, to enable rapid testing of these configurations against the test queries. (scalable infrastructure and optimization automation ✅)
Advanced optimization algorithms guide the exploration of the configuration space, focusing on the most promising areas.
Finally, you pick from the handful of top-performing configurations which to use in production, and re-run configuration optimization as-needed when data changes. (Pointable provides hosted RAG backends, or you can run the configuration in your own LLM tool ecosystem)

Continuous improvement and monitoring

One of the key advantages of automating configuration selection is enabling continuous improvement and monitoring of the RAG system. As the document corpus grows or changes, as user query patterns evolve, or as new models become available, you can easily re-run the optimization process to ensure the full system remains at peak quality.

Moreover, the evaluation framework used for optimization also provides a clear set of metrics for ongoing monitoring. This allows rapid detection of performance regressions or unexpected behavior changes as the system and datasets evolve over time.

Conclusion

RAG systems are powerful tools for enhancing LLM applications with domain-specific knowledge, but building a high quality RAG system can be challenging. The complexity of these systems, with their numerous interacting components and parameters, means teams often want optimization but might not have enough engineering resources to do it well.

At Pointable, we believe that automated RAG optimization is not just a nice-to-have, but a necessity for any team serious about deploying production-grade RAG systems. Our Configuration Engine and testing framework provide a data-driven, scalable approach to this challenge, enabling teams to achieve and maintain peak performance without the need for endless manual tuning.

If you're building a RAG system and want to ensure you're getting the most out of your data and models, we'd love to help! Our team of experts can guide you through the optimization process, helping you unlock the full potential of your RAG application.

Ready to take your RAG system to the next level? If you’re a Product leader with an LLM application project stuck in the prototype phase and want a second opinion, book time with one of our founders here.

‍