
Building RAG with Spring AI and Pinecone (Cloud Solution)

Jeff Taakey, 21+ Year CTO & Multi-Cloud Architect.

The landscape of enterprise application development is undergoing a seismic shift. For years, Spring Boot has been the de facto standard for building robust microservices. Today, as Generative AI becomes a non-negotiable requirement for modern software, Java developers face a new challenge: integrating non-deterministic AI models into deterministic business logic.

While Python has historically dominated the AI space, the Spring AI project has leveled the playing field, providing a portable, modular, and Spring-native interface for interacting with LLMs and Vector Databases.

In this deep dive, we will focus on a specific, high-value architecture: Retrieval-Augmented Generation (RAG) using Spring AI and Pinecone. We choose Pinecone for this tutorial because it represents a true “Cloud Solution”—a managed, serverless vector database that frees developers from infrastructure management, allowing them to focus purely on business value.

The Architectural Blueprint: Why RAG?

Before writing code, we must understand the “Why.” Large Language Models (like GPT-4) suffer from two specific limitations in an enterprise context:

  1. Knowledge Cutoffs: They only know what they were trained on.
  2. Hallucinations: When they don’t know an answer, they often make one up confidently.

RAG solves this by injecting your proprietary data into the prompt context dynamically. The flow we will build looks like this:

  1. Ingestion: Load documents -> Split into chunks -> Generate Embeddings -> Store in Pinecone.
  2. Retrieval: User Query -> Generate Query Embedding -> Semantic Search in Pinecone -> Retrieve top matching chunks.
  3. Generation: System Prompt + Retrieved Context + User Query -> LLM -> Final Answer.

Prerequisites

To follow this tutorial, ensure you have the following:

  • Java 17 or higher (Spring AI requires modern Java).
  • Spring Boot 3.2.x.
  • An OpenAI API Key (for generating embeddings and chat completion).
  • A Pinecone Account (The free tier is sufficient for this tutorial).
  • Maven or Gradle.

Step 1: Project Setup and Dependencies

Spring AI is evolving rapidly. As of this writing, it is best to use the Milestone or Snapshot repositories to get the latest features.

Maven Configuration

First, configure the repositories in your pom.xml to access Spring Milestones:

<repositories>
    <repository>
        <id>spring-milestones</id>
        <name>Spring Milestones</name>
        <url>https://repo.spring.io/milestone</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>

Next, add the Spring AI BOM (Bill of Materials) to ensure version compatibility across modules:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>0.8.1</version> <!-- Check for the latest version -->
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

Finally, include the starters for OpenAI (for the LLM and Embedding model) and Pinecone (for the Vector Store):

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pinecone-store-spring-boot-starter</artifactId>
    </dependency>
</dependencies>

Step 2: Configuring the Cloud Environment

One of the primary benefits of using Spring AI with Pinecone is the externalization of configuration. We avoid hardcoding credentials.

Navigate to the Pinecone Console:

  1. Create an Index. Name it spring-ai-demo.
  2. Set Dimensions to 1536 (This matches OpenAI’s text-embedding-ada-002 model).
  3. Metric: cosine (Standard for text similarity).
  4. Copy your API Key and Environment/Region.

Update your application.yml:

spring:
  application:
    name: spring-ai-rag-demo
  ai:
    openai:
      api-key: ${OPENAI_API_KEY} # Export this in your OS environment
      embedding:
        options:
          model: text-embedding-ada-002
    vectorstore:
      pinecone:
        api-key: ${PINECONE_API_KEY}
        environment: ${PINECONE_ENV} # e.g., gcp-starter
        index-name: spring-ai-demo
        project-id: ${PINECONE_PROJECT_ID}
        # namespace: optional-namespace 

Security Note: Never commit API keys to Git. Use environment variables (export OPENAI_API_KEY=sk-...) or a secrets manager.

Step 3: The ETL Pipeline (Ingestion)

Before we can query data, we must load it. In a production scenario, this might be an event-driven listener that watches an S3 bucket. For this guide, we will create a CommandLineRunner to load data on startup.

We need to perform three distinct actions:

  1. Read: Load raw text from a resource.
  2. Transform: Split the text into smaller chunks. This is crucial because LLMs have limited context windows; sending a 100-page PDF at once will fail or cost a fortune.
  3. Write: Embed the text and upsert it to Pinecone.

Create a class VectorStoreLoader.java:

package com.springdevpro.rag.loader;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.ai.document.Document;
import org.springframework.ai.reader.TextReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.CommandLineRunner;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class VectorStoreLoader implements CommandLineRunner {

    private static final Logger log = LoggerFactory.getLogger(VectorStoreLoader.class);

    private final VectorStore vectorStore;

    @Value("classpath:docs/policy-document.txt")
    private Resource policyDocument;

    public VectorStoreLoader(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @Override
    public void run(String... args) throws Exception {
        // 1. Check whether the data actually needs loading (optimization).
        // In a real app, you might compare a content hash or document version;
        // see the idempotency sketch after this class. Here we reload on startup.
        
        log.info("Loading documents into Pinecone...");

        // 2. Read
        TextReader textReader = new TextReader(policyDocument);
        textReader.getCustomMetadata().put("filename", "policy-document.txt");
        List<Document> documents = textReader.get();

        // 3. Transform (Split)
        // TokenTextSplitter defaults to chunks suitable for OpenAI
        TokenTextSplitter tokenTextSplitter = new TokenTextSplitter();
        List<Document> splitDocuments = tokenTextSplitter.apply(documents);

        log.info("Split {} documents into {} chunks.", documents.size(), splitDocuments.size());

        // 4. Write (Embed & Store)
        vectorStore.add(splitDocuments);

        log.info("Data loaded successfully into Pinecone.");
    }
}
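
The startup comment above glosses over idempotency. One lightweight guard is to probe the index before reloading. The following is a sketch, assuming the filename metadata set during ingestion and Spring AI 0.8.x's text filter syntax (verify against your pinned version):

// Hypothetical guard at the top of run(): skip ingestion if the index
// already returns a hit carrying this file's metadata.
// Requires org.springframework.ai.vectorstore.SearchRequest.
List<Document> existing = vectorStore.similaritySearch(
        SearchRequest.query("policy")
                .withTopK(1)
                .withFilterExpression("filename == 'policy-document.txt'"));
if (!existing.isEmpty()) {
    log.info("Index already populated; skipping ingestion.");
    return;
}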

Why Token Splitting Matters

The TokenTextSplitter is a critical component often overlooked by beginners. If you upload a document as a single vector, the semantic meaning gets “diluted” over the length of the text. Furthermore, when retrieving, you want to retrieve specific paragraphs that contain the answer, not the whole book.

Pinecone excels at storing these thousands of small chunks and retrieving the top 3-5 most relevant ones in milliseconds.
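
If the defaults do not fit your documents, the splitter can be tuned. Below is a sketch, assuming the five-argument TokenTextSplitter constructor (chunk size in tokens, minimum chunk size in characters, minimum chunk length to embed, maximum number of chunks, keep-separator flag) available in recent Spring AI releases; check the API of your pinned version:

// Smaller chunks than the ~800-token default: better precision for
// short policy clauses, at the cost of more vectors in Pinecone.
TokenTextSplitter splitter = new TokenTextSplitter(300, 100, 10, 5000, true);
List<Document> chunks = splitter.apply(documents);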

Step 4: The Retrieval Service

Now that our data is in the cloud, we need a service to query it. This is the “R” in RAG.

Spring AI provides the ChatClient interface, which abstracts the interaction with OpenAI (or other providers).

Create RagService.java:

package com.springdevpro.rag.service;

import org.springframework.ai.chat.ChatClient;
import org.springframework.ai.chat.ChatResponse;
import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.messages.SystemMessage;
import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.chat.prompt.SystemPromptTemplate;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

@Service
public class RagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    // Define the prompt template
    private static final String RAG_PROMPT_TEMPLATE = """
            You are a helpful assistant for an insurance company.
            Use the following pieces of context to answer the question at the end.
            If you don't know the answer, just say that you don't know, don't try to make up an answer.
            
            Context:
            {context}
            
            Question: {question}
            """;

    public RagService(ChatClient chatClient, VectorStore vectorStore) {
        this.chatClient = chatClient;
        this.vectorStore = vectorStore;
    }

    public String generateAnswer(String userQuery) {
        // 1. Retrieve Context from Pinecone
        // We search for the top 3 most similar documents to the user query
        List<Document> similarDocuments = vectorStore.similaritySearch(
                SearchRequest.query(userQuery).withTopK(3)
        );

        // 2. Extract content from documents
        String context = similarDocuments.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n"));

        // 3. Construct the System Prompt
        SystemPromptTemplate systemPromptTemplate = new SystemPromptTemplate(RAG_PROMPT_TEMPLATE);
        Message systemMessage = systemPromptTemplate.createMessage(Map.of(
                "context", context,
                "question", userQuery
        ));

        // 4. Call the LLM
        // We send the "engineered" prompt, not just the raw user query
        Prompt prompt = new Prompt(List.of(systemMessage));
        ChatResponse response = chatClient.call(prompt);

        return response.getResult().getOutput().getContent();
    }
}

Analyzing the Search Logic
#

The line vectorStore.similaritySearch(SearchRequest.query(userQuery).withTopK(3)) is where the magic happens.

  1. Spring AI takes the string userQuery.
  2. It calls the OpenAI Embedding API to turn that string into a vector (array of floats).
  3. It sends that vector to Pinecone.
  4. Pinecone performs an approximate nearest-neighbor search using the cosine similarity metric.
  5. It returns the text chunks associated with those vectors.

All of this happens transparently to the developer: you simply inject VectorStore and call similaritySearch.
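
Retrieval can be tightened further. Here is a sketch, assuming SearchRequest's withSimilarityThreshold option (a 0.0 to 1.0 cutoff), which drops weak matches instead of always returning exactly K chunks:

// Ask for up to 5 chunks, but discard anything scoring below 0.75;
// on a sparse index this may return fewer (or zero) documents.
List<Document> similarDocuments = vectorStore.similaritySearch(
        SearchRequest.query(userQuery)
                .withTopK(5)
                .withSimilarityThreshold(0.75));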

Step 5: Exposing the API

Let’s create a simple REST Controller to test our RAG implementation.

package com.springdevpro.rag.controller;

import com.springdevpro.rag.service.RagService;
import org.springframework.web.bind.annotation.*;

import java.util.Map;

@RestController
@RequestMapping("/api/ai")
public class RAGController {

    private final RagService ragService;

    public RAGController(RagService ragService) {
        this.ragService = ragService;
    }

    @PostMapping("/chat")
    public Map<String, String> chat(@RequestBody Map<String, String> payload) {
        String query = payload.get("query");
        String answer = ragService.generateAnswer(query);
        return Map.of("answer", answer);
    }
}

Advanced Topics: Optimizing for Production
#

While the code above works, a “Cloud Solution” for enterprise requires robustness. Let’s discuss three advanced optimizations essential for production environments.

1. Metadata Filtering

In a real-world scenario, you rarely search all documents. You might want to search documents only related to a specific user, department, or year. Pinecone excels at this via Metadata Filtering.

Spring AI supports this through the FilterExpressionBuilder.

import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.filter.FilterExpressionBuilder;

// ... inside your service

FilterExpressionBuilder b = new FilterExpressionBuilder();

List<Document> results = vectorStore.similaritySearch(
    SearchRequest.query(userQuery)
        .withTopK(3)
        .withFilterExpression(b.eq("year", 2024).build()) // Only search docs from 2024
);

By pushing this filter to Pinecone, you reduce the search space, improve latency, and ensure data isolation.
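
Filtering only works if the metadata was attached at ingestion time. Here is a sketch, assuming the two-argument Document constructor; the field names and values are illustrative:

// Attach filterable metadata when the chunk is created;
// Pinecone stores it alongside the vector.
Document doc = new Document(
        "Refunds are processed within 30 days.",
        Map.of("year", 2024, "department", "claims")); // hypothetical fields
vectorStore.add(List.of(doc));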

2. Error Handling and Resilience

External calls to OpenAI and Pinecone can fail. The Spring ecosystem provides Spring Retry to handle this gracefully.

Add spring-retry to your dependencies and annotate your service method:

@Retryable(retryFor = { RestClientException.class }, maxAttempts = 3, backoff = @Backoff(delay = 1000))
public String generateAnswer(String userQuery) {
    // ... implementation
}

This ensures that a momentary network blip from the cloud providers doesn’t crash your user’s request.
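
Note that @Retryable is inert until retry processing is switched on. A minimal sketch, assuming spring-retry plus an AOP implementation (for example, spring-boot-starter-aop) on the classpath:

import org.springframework.context.annotation.Configuration;
import org.springframework.retry.annotation.EnableRetry;

@Configuration
@EnableRetry
public class RetryConfig {
    // Activates processing of @Retryable annotations application-wide.
}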

3. Cost Management and Token Counting

RAG can get expensive. Every time you send context to OpenAI, you pay per token.

  • Vector Database Costs: Pinecone’s serverless model charges based on read/write units and storage. It is generally very cost-effective for RAG.
  • LLM Costs: This is the bulk of the cost.

To optimize, consider using a cheaper model for the generation step (such as gpt-3.5-turbo instead of gpt-4o) unless complex reasoning is required, as shown in the sketch below. Additionally, tune your TopK parameter: do you really need 5 chunks? Maybe 2 are enough.
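
Model choice also does not have to be global; it can be overridden per request. Here is a sketch, assuming Spring AI 0.8.x's OpenAiChatOptions and the Prompt constructor that accepts options:

import org.springframework.ai.openai.OpenAiChatOptions;

// Route this RAG call to a cheaper model; paths that need complex
// reasoning can keep the globally configured model.
OpenAiChatOptions options = OpenAiChatOptions.builder()
        .withModel("gpt-3.5-turbo")
        .build();
Prompt prompt = new Prompt(List.of(systemMessage), options);
ChatResponse response = chatClient.call(prompt);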

The Cloud Advantage: Why Pinecone?
#

Throughout this tutorial, we utilized Pinecone. You might ask, “Why not use a local PostgreSQL with pgvector?”

For a local dev environment, pgvector via Testcontainers is fantastic. However, for the Spring DevPro audience targeting production scale, Pinecone offers specific advantages:

  1. Serverless Architecture: You don’t provision pods or manage disk space. You pay for what you use. This aligns perfectly with modern Spring Boot microservices deployed on AWS Lambda or Kubernetes.
  2. Latency: Pinecone is built specifically for vector math. It often outperforms general-purpose databases that have vector plugins added on top.
  3. SDK Integration: As seen in our configuration, the integration with Spring AI is seamless.

Testing the Application

To verify your application:

  1. Create a text file src/main/resources/docs/policy-document.txt with some fake content (e.g., “The Spring DevPro return policy states that refunds are processed within 30 days.”).
  2. Run the Spring Boot application. Watch the logs to ensure the VectorStoreLoader ingests the data.
  3. Use curl or Postman:
curl -X POST http://localhost:8080/api/ai/chat \
     -H "Content-Type: application/json" \
     -d '{"query": "How long do refunds take?"}'

You should receive a JSON response containing the answer “Refunds are processed within 30 days,” derived directly from your text file, not from the LLM’s general training data.

Conclusion

We have successfully built a cloud-native RAG application using Java, Spring AI, and Pinecone. This architecture bridges the gap between the stateless nature of LLMs and the stateful, proprietary knowledge of your enterprise.

The combination of Spring AI’s clean abstractions and Pinecone’s managed infrastructure allows Java developers to deploy AI solutions rapidly without needing to become Python data scientists or infrastructure engineers.

As the Spring AI project matures toward its 1.0.0 release, we expect even tighter integrations, more robust evaluation metrics, and enhanced support for agentic workflows. For now, this stack represents the cutting edge of Java development.

Stay tuned to Spring DevPro for our upcoming series on deploying this architecture to Kubernetes and implementing semantic caching.

