November 20, 2024

Exploring semantic chunking for RAG-LLM systems

The Pointable engineering team: Mike Wu, Scott Wey, Brian Wu, Jerry Wang, Justin Sung, Tyler Duong, and Andrew Maas

Introduction

When building LLM agents and systems, retrieval-augmented generation (RAG) feeds custom source data into LLM responses: an information retrieval system is queried at answer time to supply dynamic context to the model. Converting unstructured source documents into a useful retrieval system requires many choices about preprocessing, populating, and querying that system. In this post we explore a critical step in building a RAG system: converting a document into “chunks.” In particular, we examine a recently popular chunking library and ask whether complex, semantically motivated chunking approaches outperform standard, simple chunking techniques for RAG retrieval. Let’s go!

What is chunking?

A fundamental building block for RAG is to take unstructured documents and embed them as high-dimensional vectors that can be stored in a vector database. Given a user query, we embed it as well and find the “most related” documents by smallest Euclidean distance (equivalently, highest similarity) in the vector space. But most embedding models have a maximum token length. Which makes sense! We can’t hope to shove an unbounded amount of text into a finite number of floating point numbers. When these embedding models are trained, the designers choose a finite number of tokens as the maximum allowed for training documents. A common choice for transformer-based embedding models is 512 tokens.
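To make this concrete, here is a toy sketch of the retrieval step (our illustration, with hypothetical variable names): rank stored chunk embeddings by Euclidean distance to the query embedding. For unit-norm vectors, this ordering matches ranking by cosine similarity.

```python
import numpy as np

def nearest_chunks(query_emb: np.ndarray, chunk_embs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return the indices of the k chunks closest to the query embedding."""
    dists = np.linalg.norm(chunk_embs - query_emb, axis=1)  # Euclidean distances
    return np.argsort(dists)[:k]
```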

What this means in practice is that if your document is longer than 512 tokens, you need to break it up. Even beyond the technical constraint, breaking up a long document is a good idea in theory: it is much harder to compress a 1000-page book into a single vector embedding than a single paragraph. When documents get too long, we want to break them up to properly capture the details in the text.

This is where chunking comes into play. “Chunking” is an algorithm that breaks a large document into smaller “chunks”, each of which is short enough to embed (and retrieve) faithfully.

Simple chunking approaches

We briefly cover three “traditional” chunking approaches. It’s actually easiest to explain them by illustrating what they do. 

Here’s a paragraph of text from the start of The Hitchhiker’s Guide to the Galaxy [2]:

Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun. Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea. This planet has - or rather had - a problem, which was this: most of the people on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because on the whole it wasn't the small green pieces of paper that were unhappy. And so the problem remained; lots of the people were mean, and most of them were miserable, even the ones with digital watches.

Token chunking

Token chunking is the simplest form. It splits the entire document into byte-pair-encoded (BPE) tokens, popularized by GPT-2 and the tiktoken Python library [3]. BPE breaks words into smaller subwords and merges some together based on frequency of co-occurrence. Token chunking then packs these tokens into chunks up to a maximum chunk size.
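Here is a minimal sketch of token chunking using tiktoken; this is our illustration of the idea, not any particular library’s implementation.

```python
import tiktoken

def token_chunk(text: str, max_tokens: int = 100, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_tokens BPE tokens."""
    enc = tiktoken.get_encoding("gpt2")  # GPT-2 byte-pair encoding
    tokens = enc.encode(text)
    step = max_tokens - overlap  # slide forward, re-including `overlap` tokens
    return [enc.decode(tokens[i:i + max_tokens]) for i in range(0, len(tokens), step)]
```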

Token chunking with max size 100 and overlap 0

CHUNK 1

Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun. Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little

CHUNK 2
blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea. This planet has - or rather had - a problem, which was this: most of the people on it were unhappy

CHUNK 3
 for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because on the whole it wasn't the small green pieces of paper that

CHUNK 4
 were unhappy. And so the problem remained; lots of the people were mean, and most of them were miserable, even the ones with digital watches.

Word and sentence chunking

Word chunking is quite similar to token chunking and often produces similar results. The difference is subtle: byte pair encoding sometimes breaks complex words into multiple tokens, e.g. "unbelievably" → ["un", "believ", "ably"]. If you are unlucky, a single word can even end up split across two chunks. Word chunking avoids cases like this by keeping whole words together.

Sentence chunking takes that idea a step further. Intuitively, a good chunking algorithm keeps related ideas in the same chunk, since related ideas embed together more faithfully. For example, two sentences describing the same planet shouldn’t land in separate chunks, since that makes retrieval more difficult. The challenge is knowing which words belong to the “same idea”. A simple, effective heuristic is to keep sentences intact, since a single sentence most often expresses one idea.
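Here is a minimal sketch of sentence chunking under a token budget. Note that real implementations use smarter sentence segmentation than this regex, which naively splits on terminal punctuation.

```python
import re
import tiktoken

def sentence_chunk(text: str, max_tokens: int = 100) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_tokens tokens."""
    enc = tiktoken.get_encoding("gpt2")
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())  # naive sentence split
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(enc.encode(sent))
        if current and current_len + n > max_tokens:  # budget exceeded: flush
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```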

Sentence chunking with max size 100 and overlap 0.

CHUNK 1

Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun. Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.

CHUNK 2

This planet has - or rather had - a problem, which was this: most of the people on it were unhappy for pretty much of the time. Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because on the whole it wasn't the small green pieces of paper that were unhappy.

CHUNK 3

And so the problem remained; lots of the people were mean, and most of them were miserable, even the ones with digital watches.

This is starting to look better! But it’s not perfect: notice that the last sentence of chunk 1 talks about a planet that is also referred to in the second sentence of chunk 2. Ideally, these would be put together in the same chunk. That’s likely too tall an order for a simple heuristic like chunking by punctuation; you would need to understand what each sentence is actually saying and the topics it covers.

Chonkie: The recently-buzzy chunking tool

The no-nonsense chunking library Chonkie [1] recently got attention within LLM developer communities. If you haven’t seen it yet, Chonkie is an easy-to-use Python library for splitting large text documents into smaller, more manageable chunks for RAG. Chonkie is easy to try and integrates with popular tokenization and embedding libraries that many people already use. Beyond the simpler chunkers, Chonkie also offers advanced approaches that, in theory, split data at semantically meaningful boundaries in a document.
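Usage looks roughly like the following; this is our recollection of the README-style interface, and exact class names and signatures may differ between versions, so check the repo in citation [1].

```python
from chonkie import TokenChunker

# Assumed constructor parameters; consult Chonkie's docs for the current API.
chunker = TokenChunker(chunk_size=512, chunk_overlap=0)
chunks = chunker("Far out in the uncharted backwaters of the Galaxy...")
for chunk in chunks:
    print(chunk.text)  # assumed: each chunk carries its text and token count
```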

Advanced approaches

Thanks to powerful embedding models [4], AI systems are increasingly good at capturing the topics and concepts a text describes. Chonkie includes two advanced chunking algorithms, both of which use a pretrained embedding model to gauge how topically similar sentences are to one another. The resulting chunks group text not just by punctuation but by the similarity of its content.

“Semantic” chunking

Chonkie’s “semantic chunking” algorithm requires the user to choose an embedding model (defaulting to MiniLM [5]) and a similarity threshold (a cosine-style similarity between sentence embeddings) that decides when to stop and start a new chunk. To keep things tractable, semantic chunking requires the sentences in a chunk to be contiguous and in their original order. In effect, semantic chunking walks through the text sentence by sentence looking for “break points”: places where the next sentence deviates too much in semantic meaning from the previous ones.
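To make the break-point idea concrete, here is a minimal sketch that compares consecutive sentence embeddings. This is our simplification, not Chonkie’s exact algorithm.

```python
import re
from sentence_transformers import SentenceTransformer

def semantic_chunk(text: str, threshold: float = 0.7) -> list[str]:
    """Start a new chunk whenever consecutive sentences are too dissimilar."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embs = model.encode(sentences, normalize_embeddings=True)  # unit vectors
    chunks, current = [], [sentences[0]]
    for sent, prev_emb, emb in zip(sentences[1:], embs[:-1], embs[1:]):
        if float(prev_emb @ emb) >= threshold:  # cosine similarity
            current.append(sent)
        else:  # semantic "break point": flush and start a new chunk
            chunks.append(" ".join(current))
            current = [sent]
    chunks.append(" ".join(current))
    return chunks
```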

Here is an example of semantic chunking output on the same example text from above:

Semantic chunking with all-minilm-l6-v2.

CHUNK 1
Far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun.

CHUNK 2
Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea. This planet has - or rather had - a problem, which was this: most of the people on it were unhappy for pretty much of the time.

CHUNK 3
Many solutions were suggested for this problem, but most of these were largely concerned with the movements of small green pieces of paper, which is odd because on the whole it wasn't the small green pieces of paper that were unhappy.

CHUNK 4
And so the problem remained; lots of the people were mean, and most of them were miserable, even the ones with digital watches.

Nice! The semantic approach puts the two sentences about planets together. The only downside is that to achieve this result, we had to try many different similarity thresholds; the default of 0.7 ended up putting each sentence in its own chunk. But with some tuning, we obtained a good result.

Semantic double-pass merging (SDPM)

The final approach in Chonkie, SDPM (semantic double-pass merging), is an extension of semantic chunking. It relaxes the contiguity requirement by allowing the algorithm to “skip” over a bounded number of sentences when deciding whether to merge later material into the current chunk. For example:

My favorite color is blue. I like playing tennis. My second favorite color is red. 

Semantic Chunking
Chunk #1: My favorite color is blue.
Chunk #2: I like playing tennis.
Chunk #3: My second favorite color is red.

SDPM
Chunk #1: My favorite color is blue. My second favorite color is red. 
Chunk #2: I like playing tennis.

The important hyperparameter of this algorithm is the skip window, which bounds how many sentences the algorithm may skip over when looking ahead for semantically similar material to merge into the current chunk. A larger skip window is computationally more expensive but potentially more beneficial.
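Here is a minimal sketch of the second, merging pass; this is our reading of the idea, not Chonkie’s exact implementation. It assumes unit-norm embeddings for an ordered list of first-pass chunks.

```python
import numpy as np

def double_pass_merge(chunks: list[str], embs: np.ndarray,
                      threshold: float = 0.7, skip_window: int = 1) -> list[str]:
    """Merge a chunk with later chunks that are similar and within the skip window."""
    merged, used = [], set()
    for i in range(len(chunks)):
        if i in used:
            continue
        group = [chunks[i]]
        # Look ahead, skipping over at most `skip_window` intervening chunks.
        for j in range(i + 1, min(i + 2 + skip_window, len(chunks))):
            if j not in used and float(embs[i] @ embs[j]) >= threshold:
                group.append(chunks[j])
                used.add(j)
        merged.append(" ".join(group))
    return merged
```

On the toy example above, with skip_window=1, this sketch merges the two color sentences into one chunk and leaves the tennis sentence in its own, matching the SDPM output shown.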

To Chonkie or not to Chonkie? (that is the question)

After learning about these chunking algorithms, a big question looms: Which chunking algorithm is best for your RAG application?

Maybe, like us, your gut instinct is that more sophisticated algorithms like semantic chunking and SDPM are better, since they capture notions of sentence content and structure. Let’s test that hypothesis with some experiments!

We assembled a dataset of 1,000 English Wikipedia articles with at least 10,000 words each, randomly sampled from the March 1st, 2022 data dump on Hugging Face [6]. We chose Wikipedia as a challenging playground for comparing chunking algorithms because Wikipedia articles have a lot of interesting structure: headers, lists, image captions, tables, and chart descriptions all make proper chunking difficult. We deliberately used this dump because it is less cleaned than other Wikipedia versions. We believe this is more reflective of real-world chunking applications: most webpages, enterprise documents, and other data sources mix many different types of text. Rarely do we find uniformly structured text akin to a book chapter in the real world.

When we tried semantic chunking and SDPM on Wikipedia articles (with multiple similarity thresholds), we saw many chunks that look like the following examples.

Example 1: Text from the Blenheim, Ontario Wikipedia page and the corresponding chunks using semantic chunking.
From https://en.wikipedia.org/wiki/Blenheim,_Ontario

Semantic chunking
Chunk #1:  Elementary schools
Chunk #2:  Harwich-Raleigh Public School is the “rural” public school.
Chunk #3:  It offers Junior Kindergarten to Grade 8.
Chunk #4:  H.
Chunk #5:  R.
Chunk #6:  P.
Chunk #7:  S offers both English and French Immersion programs.
Chunk #8:  H.
Chunk #9:  R.
Chunk #10: P.
Chunk #11: S is home of the Wildcats and its school colours are red and white.
Chunk #12: The motto of the school is: “live to learn, learn to live”.
Chunk #13: St. Anne Catholic School serves the rural community and the town.
Chunk #14: It offers Junior Kindergarten to Grade 8.
Chunk #15: St. Anne’s offers both English and French Immersion programs.
Chunk #16: St. Anne’s is home to the Stars.
Chunk #17: W. J. Baird is the in-town public school and offers Junior Kindergarten to Grade 8.
Chunk #18: “Baird” as it is known is home to the Griffins.
Chunk #19: Its school colours are green and white.

Example 2: Text from The King of Fighters 2000 Wikipedia page and the corresponding chunks using SDPM.
From https://en.wikipedia.org/wiki/The_King_of_Fighters_2000

SDPM chunking

Chunk #1:  The PlayStation 2 version was re-released on May 3, 2016, for the PlayStation 4 through the PlayStation Network.
Chunk #2:  The game was later released on the Nintendo Switch through the Nintendo eShop service on August 10, 2017.
Chunk #3:  Reception
Chunk #4:  {{Video game reviews
Chunk #5:  | GR =
Chunk #6:  | IGN = 7.4/10
Chunk #7:  | GSpot = 7.8
Chunk #8:  | Fam =
Chunk #9:  | NLife = 7/10
Chunk #10: | rev1 = Bonus Stage
Chunk #11: | rev1Score = 9/10
Chunk #12: | rev2 = Pure Nintendo | rev2Score = 7.5/10
Chunk #13: }}

What these two examples, among many others, illustrate are the edge cases and biases of the more complex algorithms. Some specific things to notice:

The sentence splitter treats abbreviations like “H.R.P.S” as sentence boundaries, producing single-letter chunks (“H.”, “R.”, “P.”).
Section headers such as “Elementary schools” and “Reception” end up as tiny chunks of their own.
Structured markup, like the review-score template in Example 2, is split into one chunk per field.

For comparison, here is a subset of the output from the simpler sentence chunking approach:

From https://en.wikipedia.org/wiki/Blenheim,_Ontario
Sentence chunking with max token size 256

CHUNK 9 of 12
No religious affiliation: 11.2% Language English: 93.4% French: 1.3% French and English: 0.2% Other: 5.1% Immigration Canada-born population: 91.4% Foreign-born population: 8.1% Non-permanent residents: 0.5% Education Blenheim's elementary and secondary schools are under the control of two school boards, the Lambton Kent District School Board and the St. Clair Catholic District School Board. Elementary schools Harwich-Raleigh Public School is the "rural" public school. It offers Junior Kindergarten to Grade 8. H.R.P.S offers both English and French Immersion programs. H.R.P.S is home of the Wildcats and its school colours are red and white. The motto of the school is: "live to learn, learn to live". St. Anne Catholic School serves the rural community and the town. It offers Junior Kindergarten to Grade 8. St.Anne's offers both English and French Immersion programs. St.Anne's is home to the Stars.

In this particular application, a simpler strategy appears more reliable. Revisiting our original question, there doesn’t seem to be an easy, one-size-fits-all answer as to which chunking approach is better.

Quantitative comparison for better perspective

For some applications, semantic chunking may be the right choice. For others, simpler algorithms may actually outperform more complicated ones. So maybe we were not asking the right question. Perhaps the right question is: given a task and a dataset, what is the right chunking algorithm?

Chunking as a search problem

There is no way to know a priori which chunking algorithm is best for a given dataset. So the best we can do is try a bunch of them and see which one works best. In other words, we can view finding the best chunking algorithm as a search problem!

We’ve actually studied a similar problem in a different blog post [7] on choosing the best embedding model for a RAG vector database. The takeaway was that the choice of embedding model has a large impact on final RAG quality, but knowing which one to use ahead of time is nigh impossible. Instead, using Pointable’s configuration engine, we search over RAG designs, including the choice of embedding model, for the configuration that performs best on a synthetically generated set of questions and answers. This is the same idea as hyperparameter optimization for machine learning model training.

We treat the choice of chunking strategy as a hyperparameter of the overall RAG system. We can search not only across chunking algorithm types (token vs. semantic) but also across the parameters of the semantic chunkers (e.g. skip window size, similarity threshold, and the underlying embedding model used for vector calculations), which removes the need for manual tuning and guesswork.
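As a sketch of what chunking-as-hyperparameter-search looks like (illustrative config fields and helper names, not Pointable’s actual engine):

```python
import itertools
import random

# Illustrative search space; a real search would include more options.
search_space = {
    "chunker": ["token", "sentence", "semantic", "sdpm"],
    "embedding": ["all-MiniLM-L6-v2", "bge-small-en-v1.5"],
    "threshold": [0.1, 0.5, 0.9],
    "skip_window": [1, 2],
}

keys = list(search_space)
configs = [dict(zip(keys, values))
           for values in itertools.product(*search_space.values())]
sampled = random.sample(configs, k=20)  # random search over 20 configurations

# evaluate_hit_rate is sketched in the Results section below; qa_pairs is the
# synthetic question set described there.
best = max(sampled, key=lambda cfg: evaluate_hit_rate(cfg, qa_pairs))
```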

Results

We conducted an experiment to explore the impact of different chunking hyperparameters on the resulting RAG system. To measure the quality of a RAG system, we generate synthetic questions (and answers to them) from chunks of the 1,000 Wikipedia articles. Because we know the right chunk for each question (we generated the question from it), we can compute the hit rate: the fraction of questions for which the RAG system retrieves the chunk the question came from. A better RAG system should have a higher hit rate.
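In code, the metric is simple; retrieve() below is a hypothetical helper that builds the RAG index for a given configuration and returns the ids of the top-k retrieved chunks for a query.

```python
def evaluate_hit_rate(config: dict,
                      qa_pairs: list[tuple[str, int]],
                      k: int = 5) -> float:
    """Fraction of questions whose source chunk appears in the top-k results."""
    hits = sum(
        gold_chunk_id in retrieve(question, config, top_k=k)  # hypothetical helper
        for question, gold_chunk_id in qa_pairs
    )
    return hits / len(qa_pairs)
```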

We picked 20 random configurations of a semantic / SDPM chunker, varying the embedding model (drawn from top embedding models on Hugging Face), skip window (1 or 2), and similarity threshold (0.1, 0.5, or 0.9). We also included a traditional token chunker as a baseline. Figure 1 shows the distribution of hit rate across the configurations; semantic chunkers are in blue, and the token chunker is in green.

Figure 1: Results of 20 different RAG configurations with varying embedding and chunking hyperparameters. 

The results show a large variance in hit rate (from near 0% to 85+%), which confirms that even among semantic chunkers alone, different hyperparameter choices have a dramatic impact on RAG quality. The results also support our intuition from the qualitative analysis above: for this dataset, a simpler token chunker outperforms the more complex algorithms. The best configuration, which achieves ~86% hit rate, uses tiktoken (a simpler, non-semantic approach).

A note on computational cost

Another dimension for comparing these chunkers is computational cost. We include a few measurements below, averaged over 1,000 Wikipedia documents on an M3 MacBook Pro (a sketch of the measurement follows the list):

Token Chunker: 0.002 seconds per doc
Word Chunker: 0.045 seconds per doc
Sentence Chunker: 0.015 seconds per doc
Semantic Chunker: 0.477 seconds per doc
SDMP Chunker: 0.727 seconds per doc
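A sketch of how such per-document timings can be measured, assuming a callable chunker like the ones sketched earlier:

```python
import time

def avg_seconds_per_doc(chunker, docs: list[str]) -> float:
    """Average wall-clock chunking time per document."""
    start = time.perf_counter()
    for doc in docs:
        chunker(doc)
    return (time.perf_counter() - start) / len(docs)
```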

Since semantic and SDPM chunking require computing embeddings, they are significantly more expensive at runtime. In practice, these costs add up when conducting a large search experiment like the one above: searching across RAG designs with a traditional chunker takes ~3 hours, while searching with a semantic chunker can take 6+ hours, a 2x increase.

This is by no means an exhaustive experiment around chunking; in theory, there are hundreds to thousands of configurations to search over. In fact, Pointable’s configuration engine explores a range of chunking, embedding, preprocessing, and query-time parameters for each new task and dataset, because although semantic chunking offers little benefit here, there could be documents or tasks where maintaining coherent semantic chunks is vital to system quality.

Conclusions

For chunking specifically, we offer two main takeaways:

Semantic chunking has pros and cons; no one size fits all. Blanket semantic chunking is not good at handling structured components within documents, like lists and headers. Chonkie’s semantic and SDPM chunkers tend to over-split and require careful tuning of the similarity threshold. In cases like these, simpler chunking is better.

Semantic chunking is still just embeddings under the hood. We’ve found that the performance of RAG-based retrieval systems relies heavily on the embedding model we choose, and for semantic chunking, picking the right chunker is still largely about picking the right embedding. From our experiments, RAG performance is very sensitive to that choice.

On a broader note, we think the right way to think about chunking is together with the other components of RAG. Designing a highly performant and reliable RAG system is difficult. Choosing the right chunker by itself is nice, but choosing the right chunker alongside the right embedding, retriever, normalization, and prompting choices is what makes a real difference.

These are the kind of RAG-LLM problems we love to think about at Pointable. Reach out to team@pointable.ai to learn more!

Special thanks to Stanford researchers Omar Khattab, Chris Potts, and Herumb Shandilya for their thoughts on creating the dataset for this evaluation.

Citations

1. https://github.com/bhavnicksm/chonkie

2. https://en.wikipedia.org/wiki/The_Hitchhiker%27s_Guide_to_the_Galaxy 

3. https://github.com/openai/tiktoken 

4. https://platform.openai.com/docs/guides/embeddings 

5. https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 

6. https://huggingface.co/datasets/legacy-datasets/wikipedia 

7. https://www.pointable.ai/blog/no-embedding-to-rule-them-all

© Pointable 2024. All rights reserved.