
What is RAG? A Complete Guide for Spring Developers

Jeff Taakey, 21+ Year CTO & Multi-Cloud Architect

The landscape of software development is undergoing a seismic shift. For years, Spring developers have focused on deterministic logic, REST APIs, and relational databases. Today, the demand is shifting toward probabilistic outcomes, Large Language Models (LLMs), and semantic search.

If you are a Java developer looking to integrate Generative AI into your enterprise applications, you have likely encountered a significant hurdle: LLMs do not know your business. They are trained on public internet data with a knowledge cutoff date. They do not know your latest internal PDFs, your SQL database schema, or your proprietary documentation.

This is where RAG (Retrieval-Augmented Generation) enters the picture. And with the advent of Spring AI, Java developers finally have a first-class framework for implementing RAG without switching to Python.

In this comprehensive guide, we will dissect Spring AI RAG, covering the theory, the architecture, and a production-grade implementation strategy.

The Problem: LLM Hallucinations and Knowledge Cutoffs

Before writing code, we must understand the architectural gap RAG solves. When you ask an LLM (like GPT-4 or Claude 3) a question, it relies solely on its pre-trained parametric memory.

If you ask ChatGPT: “What is the specific refund policy for our ‘Enterprise Plus’ tier configured in the 2024 Q2 update?”

The LLM will do one of two things:

  1. Refuse: “I don’t have access to your internal documents.”
  2. Hallucinate: It will confidently invent a plausible-sounding but entirely fake refund policy.

Fine-tuning the model (retraining it on your data) is often prohibitively expensive, slow, and hard to update. We need a way to inject “fresh” facts into the LLM’s context window just before it generates an answer.

What is RAG?

Retrieval-Augmented Generation (RAG) is an architectural pattern that combines the capabilities of a pre-trained LLM with an external data retrieval system.

Think of it as an “Open Book Exam.”

  • Without RAG: The student (LLM) must answer from memory.
  • With RAG: The student is allowed to look up the relevant chapter in a textbook (your database) before writing the answer.

The RAG Workflow

  1. User Query: The user asks a question.
  2. Retrieval: The system searches a database (usually a Vector Store) for chunks of text relevant to that question.
  3. Augmentation: The system constructs a prompt that looks like this:

    “Here is some context: [Insert Retrieved Data]. Based on this context, answer the user’s question: [User Query].”

  4. Generation: The LLM generates the answer using the provided facts.
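
In plain Java, the augmentation step is nothing more than string assembly. A minimal, framework-free sketch (the class, method, and parameter names here are purely illustrative) might look like this:

import java.util.List;

class PromptAugmentation {

    // Illustrative helper: glue the retrieved chunks and the user question into one prompt
    static String buildAugmentedPrompt(String userQuery, List<String> retrievedChunks) {
        String context = String.join("\n", retrievedChunks);
        return "Here is some context: " + context
                + "\nBased on this context, answer the user's question: " + userQuery;
    }
}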

Why Spring AI for RAG?

For a long time, the GenAI ecosystem was dominated by Python (LangChain, LlamaIndex). Java developers had to build microservices in Python just to handle the AI logic, introducing polyglot complexity and latency.

Spring AI changes this. It provides:

  1. Portable API: Switch between OpenAI, Azure, Bedrock, Mistral, and Ollama with configuration changes, not code rewrites.
  2. Vector Store Abstraction: A unified interface for PGVector, Redis, Neo4j, Milvus, Qdrant, and Pinecone.
  3. ETL Pipeline: Built-in readers for PDFs, JSON, and Text, along with token splitters.

It brings the “Spring Way”—dependency injection, auto-configuration, and declarative coding—to the chaotic world of AI.


Architectural Deep Dive: The ETL Pipeline

Implementing Spring AI RAG isn’t just about calling an API; it is about data engineering. You must build an ETL (Extract, Transform, Load) pipeline for your unstructured data.

1. Extract (Document Readers)

You cannot feed a 500-page PDF into an LLM. First, because of context window limits (token limits), and second, because of cost and the “lost in the middle” phenomenon.

Spring AI provides DocumentReader implementations. These convert raw resources (Files, S3 buckets, URLs) into a Document object—a simple wrapper around text and metadata.

2. Transform (Token Splitters)

Once text is extracted, it must be chunked. This is the most critical tuning parameter in RAG.

  • Too small chunks: You lose context (e.g., a pronoun “he” is separated from the name it refers to).
  • Too large chunks: You retrieve irrelevant noise that confuses the LLM.

Spring AI uses TokenTextSplitter to divide documents into semantically meaningful pieces.

3. Load (Embedding & Vector Stores)

This is the “Magic” step. Computers don’t understand text; they understand numbers. We pass our text chunks through an Embedding Model (like text-embedding-3-small or llama2). The model turns the text into a Vector—a long array of floating-point numbers (e.g., [0.012, -0.934, 0.55...]).

These vectors represent the semantic meaning of the text.

  • “The cat sits on the mat”
  • “The feline rests on the rug”

These two sentences will have very similar vector representations, even though they share almost no words. We store these vectors in a Vector Database.
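
To make this concrete, here is a minimal sketch using Spring AI’s auto-configured EmbeddingClient (the 0.8.x embedding interface); the cosineSimilarity helper is written out by hand purely for illustration:

import java.util.List;

import org.springframework.ai.embedding.EmbeddingClient;
import org.springframework.stereotype.Component;

@Component
public class SimilarityDemo {

    private final EmbeddingClient embeddingClient;

    public SimilarityDemo(EmbeddingClient embeddingClient) {
        this.embeddingClient = embeddingClient;
    }

    public double compare() {
        List<Double> a = embeddingClient.embed("The cat sits on the mat");
        List<Double> b = embeddingClient.embed("The feline rests on the rug");
        return cosineSimilarity(a, b); // close to 1.0 for semantically similar sentences
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|)
    private static double cosineSimilarity(List<Double> a, List<Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.size(); i++) {
            dot += a.get(i) * b.get(i);
            normA += a.get(i) * a.get(i);
            normB += b.get(i) * b.get(i);
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}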


Hands-On: Building a RAG Application with Spring AI

Let’s build a Spring Boot application that reads a corporate policy PDF and allows users to chat with it.

Prerequisites

  • Java 17+
  • Spring Boot 3.2+
  • OpenAI API Key (or a local Ollama instance)
  • Docker (for running a Vector DB, we will use PGVector)

Step 1: Dependencies

Add the Spring AI BOM and starters to your pom.xml. Note: Spring AI is currently in the milestone/snapshot phase, so ensure you have the correct repositories configured.

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>0.8.1</version> <!-- Check for latest version -->
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <!-- OpenAI Auto-Configuration -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>

    <!-- Postgres Vector Store -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
    </dependency>
    
    <!-- PDF Reader -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>
</dependencies>

Step 2: Configuration

Configure your application.yml. We need to set up the OpenAI key and the Postgres connection details.

spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-3.5-turbo
      embedding:
        options:
          model: text-embedding-3-small

    # Spring AI Vector Store init (note: nested under spring.ai)
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE

  # Postgres Configuration
  datasource:
    url: jdbc:postgresql://localhost:5432/vectordb
    username: postgres
    password: password

Step 3: The Ingestion Service (ETL)

We need a service that loads a PDF, splits it, creates embeddings, and saves them to Postgres.

package com.springdevpro.rag.service;

import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import org.springframework.ai.document.Document;

import java.util.List;

@Service
public class IngestionService {

    private final VectorStore vectorStore;

    public IngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void ingest(Resource pdfResource) {
        // 1. Extract
        TikaDocumentReader reader = new TikaDocumentReader(pdfResource);
        List<Document> documents = reader.get();

        // 2. Transform (Split)
        // TokenTextSplitter(defaultChunkSize, minChunkSizeChars, minChunkLengthToEmbed, maxNumChunks, keepSeparator):
        // target chunks of ~800 tokens, keep chunks of at least 400 characters,
        // skip fragments shorter than 5 characters, cap at 10000 chunks, keep separators
        TokenTextSplitter splitter = new TokenTextSplitter(800, 400, 5, 10000, true);
        List<Document> splitDocuments = splitter.apply(documents);

        // 3. Load (Embed & Store)
        // The add method automatically calls the EmbeddingModel and saves to DB
        vectorStore.add(splitDocuments);
        
        System.out.println("Ingested " + splitDocuments.size() + " chunks into Vector Store.");
    }
}

Key Code Analysis:

  • TikaDocumentReader: Apache Tika is robust and can read PDF, Docx, PPT, etc.
  • TokenTextSplitter(800, 400, ...): This configuration is crucial. The first argument is the target chunk size in tokens; the second is the minimum chunk size in characters, not an overlap. We want chunks large enough to contain a complete thought, but small enough to fit several into the context window. Note that TokenTextSplitter does not overlap chunks, so if sentences being cut at chunk boundaries is a concern, you need a custom splitter that repeats the boundary text in both chunks.
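
To trigger ingestion, one simple option is a CommandLineRunner that runs at startup; the classpath location data/policy.pdf below is an assumption for illustration:

import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.CommandLineRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.Resource;

@Configuration
public class IngestionRunner {

    // Hypothetical location of the corporate policy PDF on the classpath
    @Value("classpath:data/policy.pdf")
    private Resource policyPdf;

    @Bean
    CommandLineRunner ingestOnStartup(IngestionService ingestionService) {
        return args -> ingestionService.ingest(policyPdf);
    }
}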

Step 4: The Retrieval & Chat Service

Now for the RAG logic. We need to query the vector store and augment the prompt.

Newer Spring AI releases add a fluent ChatClient builder with an Advisor API that can wire retrieval in for you, but let’s look at the explicit flow to understand the mechanics.

package com.springdevpro.rag.service;

import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

@Service
public class RagChatService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    @Value("classpath:/prompts/rag-prompt.st")
    private Resource ragPromptTemplate;

    public RagChatService(ChatClient chatClient, VectorStore vectorStore) {
        this.chatClient = chatClient;
        this.vectorStore = vectorStore;
    }

    public String generateAnswer(String userQuery) {
        // 1. Retrieval
        // Search for the top 4 most similar chunks
        List<Document> similarDocuments = vectorStore.similaritySearch(
                SearchRequest.query(userQuery).withTopK(4)
        );

        // Convert documents to a single string
        String context = similarDocuments.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n"));

        // 2. Augmentation
        PromptTemplate template = new PromptTemplate(ragPromptTemplate);
        Map<String, Object> promptParameters = Map.of(
                "input", userQuery,
                "documents", context
        );
        Prompt prompt = template.create(promptParameters);

        // 3. Generation
        return chatClient.call(prompt).getResult().getOutput().getContent();
    }
}

The Prompt Template (rag-prompt.st)

Do not hardcode prompts in Java strings. Externalize them. Here is a standard RAG system prompt:

You are a helpful assistant for the Spring DevPro company.
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know; don't try to make up an answer.

CONTEXT:
{documents}

QUESTION:
{input}

ANSWER:
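
With the service and prompt template in place, exposing the flow over HTTP is ordinary Spring MVC. Here is a minimal sketch (the /api/chat path is an arbitrary choice):

package com.springdevpro.rag.web;

import com.springdevpro.rag.service.RagChatService;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ChatController {

    private final RagChatService ragChatService;

    public ChatController(RagChatService ragChatService) {
        this.ragChatService = ragChatService;
    }

    // Example: GET /api/chat?question=What is the Enterprise Plus refund policy?
    @GetMapping("/api/chat")
    public String chat(@RequestParam String question) {
        return ragChatService.generateAnswer(question);
    }
}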

Optimizing Spring AI RAG for Production

The code above gets you a working prototype. However, putting Spring AI RAG into production requires handling edge cases and optimizing performance.

1. Metadata Filtering

In a real system, you might ingest documents from different departments (HR, IT, Sales). If a user asks “What is the vacation policy?”, you don’t want the AI to mix up the “contractor policy” with the “full-time employee policy”.

Spring AI supports Metadata Filtering. When you ingest documents, add metadata:

import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder;

// Ingestion: tag each Document with metadata before adding it to the VectorStore
doc.getMetadata().put("department", "HR");
doc.getMetadata().put("year", 2024);

// Retrieval: restrict the similarity search to HR documents
FilterExpressionBuilder b = new FilterExpressionBuilder();
SearchRequest request = SearchRequest.query("vacation policy")
    .withFilterExpression(b.eq("department", "HR").build());

This restricts the candidate vectors before or during the similarity search, ensuring higher accuracy.

2. The “Lost in the Middle” Problem

When you retrieve 10 documents and stuff them into the context, LLMs tend to focus on the beginning and the end of the text block, ignoring the middle.

  • Solution: Re-ranking.
  • Concept: Retrieve 20 documents from the Vector Store (fast), then use a “Re-ranking Model” (like Cohere Rerank) to score them strictly by relevance, and pass only the top 5 to the LLM.
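
Spring AI (as of 0.8.x) does not ship a re-ranker, so the sketch below assumes a hypothetical Reranker interface that you would back with Cohere Rerank or a cross-encoder model; the point is the over-fetch-then-trim flow:

import java.util.Comparator;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;

public class RerankingRetriever {

    // Hypothetical abstraction over an external re-ranking model (e.g. Cohere Rerank)
    public interface Reranker {
        double score(String query, Document document);
    }

    private final VectorStore vectorStore;
    private final Reranker reranker;

    public RerankingRetriever(VectorStore vectorStore, Reranker reranker) {
        this.vectorStore = vectorStore;
        this.reranker = reranker;
    }

    public List<Document> retrieve(String query) {
        // 1. Over-fetch: cheap vector search for 20 candidates
        List<Document> candidates = vectorStore.similaritySearch(
                SearchRequest.query(query).withTopK(20));

        // 2. Re-rank: score each candidate against the query and keep only the best 5
        return candidates.stream()
                .sorted(Comparator.comparingDouble(
                        (Document d) -> reranker.score(query, d)).reversed())
                .limit(5)
                .toList();
    }
}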

3. Vector Database Selection

Spring AI supports many stores. Which one should you choose?

  • PGVector: Best for existing Postgres users. ACID compliant, sits right next to your relational data. Great for starting out.
  • Redis: Best for low latency. If you need speed, Redis as a Vector Store is incredibly fast.
  • Pinecone/Milvus: Specialized, managed vector databases. Good for massive scale (millions/billions of vectors).
  • Neo4j: Best if your data is highly relational (Graph RAG). You can combine vector search with knowledge graph traversals.

4. Cost Management (The FinOps Angle)

Every time you run vectorStore.add(), you pay for embeddings. Every time you chat, you pay for input tokens (the context).

  • Caching: Cache common queries.
  • Embedding Storage: Don’t re-embed documents unless they change. Calculate a hash of the content; if the hash matches, skip the embedding step.
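
A minimal sketch of the hash check (where you store the hashes, e.g. in document metadata or a separate table, is up to you; the helper below is purely illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.Set;

import org.springframework.ai.document.Document;

public class EmbeddingDeduplicator {

    // SHA-256 of the chunk content; identical content always yields the same hash
    static String contentHash(Document doc) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(doc.getContent().getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Illustrative skip logic: embed only chunks whose hash has not been seen before
    static boolean needsEmbedding(Document doc, Set<String> knownHashes) {
        return knownHashes.add(contentHash(doc));
    }
}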

Troubleshooting Common RAG Issues

Even with the best architecture, RAG can fail. Here is how to debug using Spring AI.

Issue: “The AI answers generic knowledge, not my document.”

  • Cause: The retrieval step failed to find relevant chunks, or the distance threshold was too loose.
  • Fix: Debug your vectorStore.similaritySearch. Print out the retrieved documents before sending them to the LLM. If the chunks look irrelevant, adjust your TokenTextSplitter size or check your embedding model quality.
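
A throwaway debug snippet (assuming the VectorStore injected earlier) to inspect what retrieval actually returns:

// Print what the vector store returns for a query before it ever reaches the LLM
List<Document> hits = vectorStore.similaritySearch(
        SearchRequest.query("What is the Enterprise Plus refund policy?")
                .withTopK(4)
                .withSimilarityThreshold(0.70)); // tighten or loosen and observe the effect

hits.forEach(doc -> System.out.println(
        doc.getMetadata() + " -> "
        + doc.getContent().substring(0, Math.min(120, doc.getContent().length()))));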

Issue: “The AI says ‘I don’t know’ even though the info is there.”

  • Cause: The context window is overloaded, or the specific fact was cut in half by the splitter.
  • Fix: Increase the chunk size, or add overlap between chunks (with a custom splitter, since TokenTextSplitter does not overlap). Ensure the prompt explicitly instructs the model to trust the provided context over its internal knowledge.

Issue: High Latency.

  • Cause: OpenAI API calls are slow.
  • Fix: Stream the response to the user token by token over Server-Sent Events instead of waiting for the full completion. This improves perceived performance significantly. Spring AI fully supports Flux-based streaming; in the 0.8.x API, stream() lives on StreamingChatClient (OpenAiChatClient implements both ChatClient and StreamingChatClient).

public Flux<String> generateStream(String query) {
    // ... construct the augmented prompt exactly as in generateAnswer() ...
    return streamingChatClient.stream(prompt)
        .map(response -> response.getResult().getOutput().getContent());
}
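
To push that stream to the browser, a minimal Server-Sent Events endpoint could look like the following (assuming generateStream lives on RagChatService; the /api/chat/stream path is an arbitrary choice):

package com.springdevpro.rag.web;

import com.springdevpro.rag.service.RagChatService;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class StreamingChatController {

    private final RagChatService ragChatService;

    public StreamingChatController(RagChatService ragChatService) {
        this.ragChatService = ragChatService;
    }

    // Server-Sent Events: each response fragment is pushed to the client as it arrives
    @GetMapping(value = "/api/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> stream(@RequestParam String question) {
        return ragChatService.generateStream(question);
    }
}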

Conclusion: The Future of Java and AI

RAG is not just a trend; it is the standard architecture for enterprise AI. By leveraging Spring AI, we can build these systems without leaving the robust, type-safe, and observable ecosystem of the JVM.

For Spring developers, the learning curve is surprisingly gentle. VectorStore feels like Repository. ChatClient feels like WebClient. The abstractions map perfectly to what we already know.

As you build your Spring AI RAG applications, remember that the quality of your output is determined by the quality of your data pipeline. Garbage in (bad chunks, poor embeddings), garbage out (hallucinations). Focus on your ETL process, experiment with chunk sizes, and monitor your token usage.

Ready to build? Start by cloning the Spring AI examples repository and setting up a local Postgres container. The era of the intelligent Spring application is here.


Recommended Reading on Spring DevPro

  • Spring Boot 3.2 vs 3.1: Performance Benchmarks
  • Understanding Vector Embeddings: A Primer for Java Developers
  • Deploying Spring AI on Kubernetes: A DevOps Guide
