
Chunking Strategies for RAG: Size, Overlap, and Metadata

In the rapidly evolving landscape of Generative AI, Retrieval-Augmented Generation (RAG) has emerged as the standard architecture for grounding Large Language Models (LLMs) in private, domain-specific data. However, while much attention is lavished on choosing the right Vector Database (like Milvus, Pinecone, or Weaviate) or the most capable LLM (GPT-4, Claude 3, Gemini), the “middle layer” of data processing often gets neglected.

This middle layer is Chunking.

If you embed a 50-page PDF as a single document, retrieval becomes too coarse to be useful, and the text may not even fit the embedding model’s input limit. If you split it into single sentences, each chunk will likely lack context. Mastering RAG chunking strategies is the difference between a generic chatbot and a high-precision enterprise assistant.

In this article, we will dissect the anatomy of chunking, focusing on the “Holy Trinity” of data preparation: Size, Overlap, and Metadata, and demonstrate how to implement these strategies using Spring AI.

The Physics of RAG: Why Chunking Matters

Before diving into code, we must understand the “physics” limiting our systems. LLMs have a Context Window (the maximum number of tokens they can process at once). While these windows are growing (128k to 1M+ tokens), three constraints remain:

  1. Cost: Sending 50 pages of text to OpenAI for every simple query is financially ruinous.
  2. Latency: Processing massive contexts takes time. Users expect sub-second responses.
  3. The “Lost in the Middle” Phenomenon: Research shows that LLMs use information most reliably when it sits at the beginning or end of a prompt; accuracy drops noticeably for details buried in the middle of a long context.

Chunking is the process of breaking down large documents into smaller, semantically meaningful units that can be embedded into vectors. When a user asks a question, we retrieve only the specific chunks relevant to the query, creating a concise, high-value prompt for the LLM.

Strategy 1: Determining the Optimal Chunk Size

The most common question in RAG architecture is: “How big should my chunks be?”

There is no single magic number, but there is a logic to the decision. Chunk size is usually measured in tokens (roughly 0.75 words per token) or characters.
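To make the words-to-tokens rule of thumb concrete, here is a minimal sketch of a rough estimator. The class name TokenEstimator and the whitespace-based word count are assumptions for illustration only; a real tokenizer will give different numbers, especially for code or non-English text.

```java
/** Rough token estimator based on the ~0.75 words-per-token rule of thumb. */
final class TokenEstimator {

    // Assumption: English prose averages about 0.75 words per token.
    private static final double WORDS_PER_TOKEN = 0.75;

    /** Estimates the token count of a text from its whitespace-separated word count. */
    static int estimateTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        int words = trimmed.split("\\s+").length;
        // Round up so the estimate errs on the side of a larger chunk.
        return (int) Math.ceil(words / WORDS_PER_TOKEN);
    }
}
```

An estimate like this is good enough for sizing decisions during design; for hard limits, count tokens with the tokenizer that matches your model.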

Small Chunks (50 - 250 Tokens)

Use Case: FAQ datasets, granular fact-checking, definition lookups.

  • Pros: High precision. The vector embedding represents a very specific concept. When retrieved, it brings very little noise.
  • Cons: Lack of context. If a chunk says “It costs $50,” but only the previous chunk identifies what “It” refers to, the LLM cannot answer the user correctly.

Medium Chunks (250 - 600 Tokens)

Use Case: General knowledge bases, technical documentation, standard SOPs.

  • Pros: The “Goldilocks” zone. Usually contains a full paragraph or a logical section of text. Enough context to stand alone, but specific enough to be retrievable.
  • Cons: Might still split complex arguments across boundaries.

Large Chunks (600 - 1500+ Tokens)

Use Case: Legal contracts, academic papers, literary analysis where narrative flow matters.

  • Pros: Captures broad context and relationships between disjointed ideas.
  • Cons: “Noise” increases. The embedding vector becomes a diluted average of many different topics contained in the text. Retrieval accuracy often drops (semantic dilution).

Spring AI Implementation: TokenTextSplitter

In Spring AI, we rely on DocumentTransformer implementations. The TokenTextSplitter is preferred over character splitting because LLMs think in tokens.

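import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

// Arguments: chunkSize (tokens), minChunkSizeChars, minChunkLengthToEmbed,
// maxNumChunks, keepSeparator.
// 800 and 350 match the library defaults; by the ranges above, 800 tokens
// falls in the large band, so lower chunkSize (e.g. 500) for medium chunks.
TokenTextSplitter splitter = new TokenTextSplitter(800, 350, 10, 10000, true);

// originalDocuments is assumed to be a List<Document> loaded earlier.
List<Document> splitDocuments = splitter.apply(originalDocuments);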

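Note: Align your chunking strategy with your embedding model. If you use text-embedding-3-small, for example, every chunk must fit within the model’s maximum input length (8,191 tokens), and in practice chunks far smaller than that limit tend to produce more focused embeddings.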

Strategy 2: The Art of Overlap

If you split a document strictly at the 500-token mark, you risk cutting a sentence—or worse, a logical thought—in half.

Chunk A: “…the primary cause of the server crash was the…”
Chunk B: “…memory leak in the legacy module.”

If the user asks “What caused the crash?”, the vector search might find Chunk A (which contains “server crash”) or Chunk B (which contains “memory leak”), but neither chunk alone provides the full answer.

Chunk Overlap solves this by creating a sliding window.

The Rolling Window Technique

A standard overlap is usually 10% to 20% of the chunk size.

If Chunk Size is 500 tokens, and Overlap is 50 tokens:

  1. Chunk 1: Tokens 0 to 500.
  2. Chunk 2: Tokens 450 to 950.
  3. Chunk 3: Tokens 900 to 1400.

This ensures that semantic connections near the boundaries are preserved in at least one chunk.
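The window arithmetic above can be sketched in a few lines of plain Java. This is a minimal illustration, not a Spring AI API: the class SlidingWindowChunker and its method chunkRanges are hypothetical names, and the method works on token positions only, leaving tokenization to whatever tokenizer you use.

```java
import java.util.ArrayList;
import java.util.List;

/** Computes sliding-window chunk boundaries over a sequence of tokens. */
final class SlidingWindowChunker {

    /** Returns [start, end) token ranges for chunks of chunkSize with the given overlap. */
    static List<int[]> chunkRanges(int totalTokens, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("need 0 <= overlap < chunkSize");
        }
        // Each chunk starts (chunkSize - overlap) tokens after the previous one.
        int stride = chunkSize - overlap;
        List<int[]> ranges = new ArrayList<>();
        for (int start = 0; start < totalTokens; start += stride) {
            int end = Math.min(start + chunkSize, totalTokens);
            ranges.add(new int[] {start, end});
            if (end == totalTokens) {
                break; // the last chunk reached the end of the document
            }
        }
        return ranges;
    }
}
```

Calling chunkRanges(1400, 500, 50) yields the three ranges listed above: 0 to 500, 450 to 950, and 900 to 1400. The guard on overlap matters: an overlap equal to or larger than the chunk size would make the stride zero or negative and the loop would never advance.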

Adjusting Overlap in Spring AI

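Note that the TokenTextSplitter constructor shown above takes no overlap parameter: its arguments are chunkSize, minChunkSizeChars, minChunkLengthToEmbed, maxNumChunks, and keepSeparator. To get overlapping chunks you therefore need a custom splitter (for example, a TextSplitter subclass) or a pre-processing step that applies the sliding window before documents reach Spring AI. A recursive character splitter in the style of LangChain’s RecursiveCharacterTextSplitter, which tries to split by paragraph first, then newline, then sentence, then word, is often superior to raw token splitting because it respects linguistic boundaries; check whether your Spring AI version ships one before relying on it.

// Conceptual sketch: the values a sliding-window splitter needs.
int chunkSize = 500;
int chunkOverlap = 50; // 10% overlap
int stride = chunkSize - chunkOverlap; // each new chunk starts 450 tokens after the previous one

// TokenTextSplitter takes (chunkSize, minChunkSizeChars, minChunkLengthToEmbed,
// maxNumChunks, keepSeparator); there is no overlap argument, so the window
// logic belongs in a custom splitter.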

Pro Tip: If you are dealing with code (Java, Python), standard text splitters are suboptimal. You need AST-based splitters (Abstract Syntax Tree) or splitters that respect Class and Method definitions. While Spring AI is evolving, you may need to implement a custom DocumentTransformer for code-heavy RAG.

Strategy 3: Metadata - The Secret Weapon

Most basic RAG tutorials stop at splitting text. Advanced RAG chunking strategies rely heavily on Metadata.

When you chunk a document, you destroy its global context. Metadata is how you inject that context back into the isolated chunk.

Why Raw Text isn’t Enough

Imagine a chunk that says: “The Q3 revenue increased by 15%.”

Without metadata, we don’t know:

  1. Which Year? (2022? 2023?)
  2. Which Company?
  3. Source Document? (Draft PDF or Final Report?)

The Metadata Injection Strategy

In Spring AI, the Document class is essentially a wrapper for text plus a Map<String, Object> metadata.

You should extract and attach metadata before the embedding step.

  1. Descriptive Metadata: Filename, Author, Page Number, Section Header.
  2. Temporal Metadata: Creation Date, Last Modified.
  3. Structural Metadata: Previous Chunk ID, Next Chunk ID (allowing the LLM to traverse the document if needed).
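Structural metadata is the least obvious of the three, so here is a minimal sketch of how previous/next links might be generated. The class ChunkLinker, the key names (chunk_id, prev_chunk_id, next_chunk_id), and the docId#index ID scheme are all assumptions chosen for illustration; the resulting maps could be merged into each chunk’s metadata before embedding.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Builds neighbour links so a chunk can point to the chunks before and after it. */
final class ChunkLinker {

    /** Returns one metadata map per chunk, in document order. */
    static List<Map<String, Object>> linkChunks(String docId, int chunkCount) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (int i = 0; i < chunkCount; i++) {
            Map<String, Object> metadata = new HashMap<>();
            metadata.put("chunk_id", docId + "#" + i);
            if (i > 0) {
                metadata.put("prev_chunk_id", docId + "#" + (i - 1));
            }
            if (i < chunkCount - 1) {
                metadata.put("next_chunk_id", docId + "#" + (i + 1));
            }
            result.add(metadata);
        }
        return result;
    }
}
```

The first chunk has no prev_chunk_id and the last has no next_chunk_id, which lets retrieval code detect document boundaries without a separate lookup.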

Pre-Retrieval Filtering (Hybrid Search)

Metadata enables filtered vector search (sometimes loosely called hybrid search, a term that more often means combining keyword and vector retrieval). Instead of scanning the entire vector database, you apply a metadata filter first.

Query: “What was the revenue in 2023?”

If you only use vector search, you might get revenue for 2022, because the two chunks are semantically almost identical. With metadata filtering, you restrict the candidates to year == 2023 and then run vector search on the remaining subset.

import org.springframework.ai.document.Document;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Reading a file (loadContent is a placeholder for your own loader)
String content = loadContent("annual_report_2023.txt");

Map<String, Object> metadata = new HashMap<>();
metadata.put("source", "finance_dept");
metadata.put("year", 2023);
metadata.put("doc_type", "report");
metadata.put("section", "executive_summary");

Document doc = new Document(content, metadata);

// When this document is chunked, Spring AI splitters 
// generally propagate the metadata to the child chunks.
List<Document> chunks = splitter.apply(List.of(doc));

// Now, every chunk knows it belongs to 2023.
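Once chunks carry a year, the pre-retrieval filtering described above can be sketched in plain Java. This is an in-memory illustration of the idea, not how a vector database executes it: the FilteredSearch class, the Chunk record, and the toy two-dimensional embeddings are all hypothetical, and a real store would apply the filter inside its own index.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Filters chunks by a metadata value first, then ranks the survivors by cosine similarity. */
final class FilteredSearch {

    /** A chunk with its embedding and metadata (hypothetical shape, not a Spring AI type). */
    record Chunk(String text, double[] embedding, Map<String, Object> metadata) {}

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static List<Chunk> search(List<Chunk> chunks, double[] query,
                              String key, Object value, int topK) {
        return chunks.stream()
                // Step 1: the metadata filter removes non-matching chunks entirely.
                .filter(c -> value.equals(c.metadata().get(key)))
                // Step 2: vector ranking runs only on what is left.
                .sorted(Comparator.comparingDouble(
                        (Chunk c) -> cosine(c.embedding(), query)).reversed())
                .limit(topK)
                .toList();
    }
}
```

Even if a 2022 chunk is the closest vector to the query, a call such as search(chunks, queryVector, "year", 2023, 5) can never return it, which is exactly the guarantee pure similarity search lacks.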

Advanced Pattern: Parent-Child Indexing

One of the most powerful RAG chunking strategies emerging in 2024 is the Parent-Child (or Small-to-Big) strategy.

The Conflict

  • Small chunks are better for search (matching specific keywords/concepts).
  • Large chunks are better for generation (giving the LLM enough context to write a coherent answer).

The Solution

  1. Split the document into large Parent Chunks (e.g., 2000 tokens). Store these in a standard key-value store or the vector DB (without embedding).
  2. Split the Parent Chunks into small Child Chunks (e.g., 200 tokens).
  3. Embed the Child Chunks and store them in the Vector DB.
  4. Link the Child to the Parent via ID in metadata.

The Retrieval Flow

  1. User Query -> Vector Search against Child Chunks.
  2. Identify top 5 matching Child Chunks.
  3. Retrieve the corresponding Parent Chunks using the metadata ID.
  4. Feed the Parent Chunks to the LLM.

This gives you the precision of granular search with the comprehensive context of large documents. While Spring AI does not have a one-line “ParentDocumentRetriever” (like LangChain Python) built-in yet, it is easily implementable using Spring Data and the VectorStore interfaces.
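The lookup half of the retrieval flow (steps 3 and 4) might look like the sketch below. It assumes the vector search has already returned ranked child hits and that parents live in a simple key-value map; ParentChildRetriever, ChildHit, and resolveParents are hypothetical names, not Spring AI types.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Resolves child-chunk search hits to the parent chunks that should be sent to the LLM. */
final class ParentChildRetriever {

    /** A child hit from vector search; parentId is read from the child's metadata. */
    record ChildHit(String childText, String parentId) {}

    /** Maps ranked child hits to distinct parent texts, keeping the best-ranked order. */
    static List<String> resolveParents(List<ChildHit> rankedHits,
                                       Map<String, String> parentStore) {
        // LinkedHashSet removes duplicates while preserving first-seen (best-rank) order.
        Set<String> parentIds = new LinkedHashSet<>();
        for (ChildHit hit : rankedHits) {
            if (parentStore.containsKey(hit.parentId())) {
                parentIds.add(hit.parentId());
            }
        }
        return parentIds.stream().map(parentStore::get).toList();
    }
}
```

Deduplication is the detail that is easy to miss: several of the top child chunks often come from the same parent, and sending that parent to the LLM more than once wastes context without adding information.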

Semantic Chunking

Fixed-size chunking (e.g., every 500 tokens) is arbitrary. It doesn’t respect how humans write. A paragraph might end at token 502, but we cut it at 500.

Semantic Chunking is an AI-driven approach.

  1. Calculate embeddings for every sentence.
  2. Compare the cosine similarity of each pair of adjacent sentences (Sentence A and the Sentence B that follows it).
  3. If the similarity is high, they belong in the same chunk.
  4. If there is a sudden drop in similarity, it indicates a topic change. Start a new chunk there.

This creates chunks of variable lengths that represent coherent semantic ideas. This is computationally more expensive (requiring one embedding per sentence during ingestion) but often results in higher quality retrieval.
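A minimal sketch of the four steps, assuming sentence embeddings have already been computed elsewhere: the class SemanticChunker and the fixed similarity threshold are illustrative assumptions. Production implementations often derive the threshold from the distribution of similarities in the document rather than hard-coding it.

```java
import java.util.ArrayList;
import java.util.List;

/** Groups consecutive sentences into chunks, breaking where adjacent similarity drops. */
final class SemanticChunker {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** embeddings.get(i) must be the embedding of sentences.get(i). */
    static List<List<String>> chunk(List<String> sentences,
                                    List<double[]> embeddings,
                                    double threshold) {
        if (sentences.size() != embeddings.size()) {
            throw new IllegalArgumentException("one embedding per sentence required");
        }
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            // A similarity below the threshold signals a topic change: close the chunk.
            if (i > 0 && cosine(embeddings.get(i - 1), embeddings.get(i)) < threshold) {
                chunks.add(current);
                current = new ArrayList<>();
            }
            current.add(sentences.get(i));
        }
        if (!current.isEmpty()) {
            chunks.add(current);
        }
        return chunks;
    }
}
```

Because this sketch has no upper bound on chunk length, a long run of similar sentences produces one very large chunk; in practice you would combine it with a maximum size.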

Best Practices and Decision Matrix

When building your Spring AI application, use this decision matrix to select your strategy:

| Data Type | Strategy | Size (Tokens) | Overlap | Notes |
| --- | --- | --- | --- | --- |
| Q&A / FAQ | Granular | 100-200 | 0-10% | Each entry is usually self-contained. |
| Tech Docs / Markdown | Recursive Split | 500-800 | 15% | Respect headers and code blocks. |
| Legal / Medical | Parent-Child | Child: 200; Parent: 1500 | N/A | Precision lookup, broad context required. |
| Conversational History | Sliding Window | 1000 | 20% | Maintain flow of conversation. |
| Code Repositories | AST / Logic Based | Function/Class | N/A | Don’t split inside a function definition. |

Conclusion

RAG is not a “set it and forget it” architecture. The effectiveness of your Spring AI application correlates directly with the quality of your data engineering pipeline.

By moving beyond basic fixed-size splitting and adopting advanced RAG chunking strategies (carefully tuned overlap, rich metadata injection, and hierarchical indexing), you can reduce hallucinations and improve user satisfaction.

In the next article of this Spring AI series, we will explore Vector Databases in Spring Boot: Comparing PGVector, Neo4j, and Milvus for production workloads.

