
OpenAI Integration with Spring AI: GPT-4, Embeddings & Streaming

Jeff Taakey
21+ Year CTO & Multi-Cloud Architect.

The landscape of Java development has shifted seismically with the introduction of Generative AI. For years, Spring developers managed database connections and REST APIs. Today, we are tasked with managing prompts, tokens, and embeddings.

While the OpenAI REST API is well-documented, integrating it directly into a robust enterprise Java application involves significant boilerplate: handling HTTP clients, mapping complex JSON schemas, managing retries, and parsing streaming responses.

Enter Spring AI.

Just as Spring Data abstracted away the complexity of JDBC and JPA, Spring AI provides a portable, consistent interface for interacting with Large Language Models (LLMs).

In this deep dive, we will explore the Spring AI OpenAI implementation. We will move beyond “Hello World” to build a production-grade integration featuring GPT-4, reactive streaming, structured output mapping (Java Records), and the foundations of Retrieval Augmented Generation (RAG) via embeddings.

The Architecture: Why Spring AI?

Before writing code, it is crucial for an architect or senior developer to understand the why. You could use the raw com.theokanning.openai-gpt3-java library, but Spring AI offers distinct advantages for the “Spring AI OpenAI” stack:

  1. Portability: The ChatModel interface allows you to switch between OpenAI, Azure OpenAI, Vertex AI, or Bedrock with minimal code changes (see the sketch after this list).
  2. POJO Mapping: It handles the tedious work of coercing JSON output from the LLM into strongly typed Java objects.
  3. RAG Support: First-class support for Vector Stores (Pinecone, Milvus, pgvector) and Document loaders.
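
To make the portability point concrete, here is a minimal sketch: the service below depends only on the ChatModel interface, so swapping the OpenAI starter for, say, the Azure OpenAI starter requires no change to this class (the class and method names are our own):

import org.springframework.ai.chat.model.ChatModel;
import org.springframework.stereotype.Service;

@Service
public class SummaryService {

    private final ChatModel chatModel; // provider-agnostic interface

    public SummaryService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    public String summarize(String text) {
        // Identical code whether the bean is backed by OpenAI, Azure, Vertex AI, or Bedrock
        return chatModel.call("Summarize in one sentence: " + text);
    }
}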

Prerequisites and Project Setup

To follow this tutorial, ensure your environment meets these standards:

  • Java: JDK 17 or 21 (Spring Boot 3 requires Java 17+).
  • Spring Boot: 3.2.x or 3.3.x.
  • OpenAI API Key: A valid key from platform.openai.com with an available credit balance.

Dependency Management

Spring AI is currently in the milestone phase, so you must configure your pom.xml to include the Spring Milestones repository.

<repositories>
    <repository>
        <id>spring-milestones</id>
        <name>Spring Milestones</name>
        <url>https://repo.spring.io/milestone</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>

Next, add the specific starter for OpenAI. This “starter” acts as the bridge between the Spring AI core and the OpenAI API.

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    <version>1.0.0-M1</version> <!-- Check for latest milestone -->
</dependency>

Configuration: The application.yml

Security is paramount: never hardcode your API keys. Use environment variables in production; for local development, we configure application.yml. We will also set the default model to gpt-4-turbo-preview for its stronger reasoning capabilities, though you can use gpt-3.5-turbo for cost efficiency.

spring:
  application:
    name: spring-ai-openai-demo
  ai:
    openai:
      api-key: ${OPENAI_API_KEY} # Read from env variable
      chat:
        options:
          model: gpt-4-turbo-preview
          temperature: 0.7

  • Temperature: Controls randomness. 0.0 is deterministic (good for code generation); 1.0 is creative.
  • Model: The identifier for the OpenAI model version.
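
These YAML values are only defaults; you can also override them per request. A minimal sketch using the M1-era options builder (the prompt text here is our own):

import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.ai.openai.OpenAiChatOptions;

Prompt prompt = new Prompt(
    "Generate a Java stream pipeline that sums even numbers",
    OpenAiChatOptions.builder()
        .withModel("gpt-3.5-turbo")  // cheaper model for this one call
        .withTemperature(0.0f)       // deterministic output, good for code generation
        .build());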

Part 1: The Core ChatModel Integration

The heart of the “Spring AI OpenAI” interaction is the ChatModel (formerly ChatClient in earlier versions). Spring Boot auto-configures this bean for you.

Basic Synchronous Chat

Let’s create a simple REST controller to verify connectivity.

package com.springdevpro.ai.controller;

import org.springframework.ai.chat.model.ChatModel;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SimpleChatController {

    private final ChatModel chatModel;

    // Constructor Injection
    public SimpleChatController(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @GetMapping("/ai/simple")
    public String generate(@RequestParam(defaultValue = "Tell me a joke about Java developers") String message) {
        return chatModel.call(message);
    }
}

When you hit this endpoint, Spring AI serializes your string into a JSON payload, sends it to OpenAI’s chat completions endpoint, awaits the response, extracts the text content from the choices array, and returns the raw string.

Prompt Templates: Moving Beyond Hardcoded Strings

In a real application, you rarely send raw user input to an LLM; you wrap it in context or instructions. This is prompt engineering. Spring AI provides PromptTemplate to handle parameter substitution and keep your prompts consistent (note that templating alone does not prevent prompt injection, so still treat user input as untrusted).

@GetMapping("/ai/template")
public String generateWithContext(@RequestParam String topic) {
    String template = """
        You are a senior software architect with 20 years of experience.
        Explain the concept of {topic} to a junior developer.
        Use analogies involving construction or cooking.
        """;

    PromptTemplate promptTemplate = new PromptTemplate(template);
    promptTemplate.add("topic", topic);
    
    // The call method accepts a Prompt object, not just strings
    return chatModel.call(promptTemplate.create()).getResult().getOutput().getContent();
}

Using PromptTemplate separates your prompt logic from your Java code, making it easier to version control your prompts or move them to external configuration files.
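
You can, for example, load the template from the classpath instead of a string literal. A sketch, assuming a hypothetical file at src/main/resources/prompts/architect.st containing the template above:

import org.springframework.ai.chat.prompt.PromptTemplate;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.Resource;

@Value("classpath:/prompts/architect.st")
private Resource architectTemplate;

@GetMapping("/ai/template-file")
public String generateFromFile(@RequestParam String topic) {
    PromptTemplate promptTemplate = new PromptTemplate(architectTemplate);
    promptTemplate.add("topic", topic);
    return chatModel.call(promptTemplate.create())
            .getResult().getOutput().getContent();
}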


Part 2: Structured Outputs (Parsing JSON to POJOs)

One of the biggest challenges in LLM integration is getting the AI to return data that your code can process programmatically. If you ask for a list of books, you don’t want a paragraph of text; you want a JSON array.

Spring AI solves this with BeanOutputConverter.

Let’s define a Java Record for a technical book review:

public record BookReview(String title, String author, int rating, List<String> keyTakeaways) {}

Now, let’s instruct GPT-4 to return this specific format.

@GetMapping("/ai/review")
public BookReview getBookReview(@RequestParam String bookTitle) {
    // 1. Create the converter for the specific type
    var converter = new BeanOutputConverter<>(BookReview.class);

    // 2. The converter generates specific instructions for the LLM
    // regarding the JSON schema it expects.
    String format = converter.getFormat();

    String template = """
        Analyze the book {bookTitle}.
        {format}
        """;

    PromptTemplate promptTemplate = new PromptTemplate(template);
    promptTemplate.add("bookTitle", bookTitle);
    promptTemplate.add("format", format);

    Prompt prompt = promptTemplate.create();

    // 3. Call the model via ChatModel
    ChatResponse response = chatModel.call(prompt);

    // 4. Convert the string output back to the Record
    return converter.convert(response.getResult().getOutput().getContent());
}

How it works: Spring AI appends a rigorous system prompt to your request describing the JSON schema of BookReview. GPT-4 (being excellent at instruction following) generates compliant JSON, and the converter then deserializes that JSON into your Java object.
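
Curious what those instructions look like? You can print them; this is just a quick sketch, and the exact wording varies across Spring AI versions:

// Prints the JSON-schema-style instructions the converter appends to the prompt.
var converter = new BeanOutputConverter<>(BookReview.class);
System.out.println(converter.getFormat());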


Part 3: Reactive Streaming with Flux

Latency is the killer of GenAI user experience. GPT-4 is powerful, but it is slow, and waiting 10 seconds for a full paragraph is long enough for users to get distracted.

Streaming allows you to push tokens to the frontend as they are generated (Typewriter effect). Spring AI leverages Project Reactor (Flux) for this.

import reactor.core.publisher.Flux;

@GetMapping("/ai/stream")
public Flux<String> streamResponse(@RequestParam String message) {
    String template = "Write a short poem about {message}.";
    PromptTemplate promptTemplate = new PromptTemplate(template);
    promptTemplate.add("message", message);

    return chatModel.stream(promptTemplate.create())
            .map(chatResponse -> {
                // Extract only the newly generated token content
                String content = chatResponse.getResult().getOutput().getContent();
                return content != null ? content : "";
            });
}

Technical Nuance: The chatModel.stream() method returns a Flux<ChatResponse>. Each ChatResponse contains a “chunk” of the answer. By mapping this directly to the response body, Spring WebFlux handles the Server-Sent Events (SSE) or chunked transfer encoding, allowing the browser to render text progressively.
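
If you want the endpoint to advertise the stream explicitly as Server-Sent Events rather than relying on content negotiation, set the produces attribute. A minimal sketch, reusing the injected ChatModel (the endpoint path is our own):

import java.util.Objects;

import org.springframework.ai.chat.prompt.Prompt;
import org.springframework.http.MediaType;

@GetMapping(value = "/ai/stream-sse", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> streamSse(@RequestParam String message) {
    return chatModel.stream(new Prompt(message))
            .map(chunk -> chunk.getResult().getOutput().getContent())
            .filter(Objects::nonNull); // some chunks carry no content
}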


Part 4: Embeddings – The Foundation of RAG

While Chat is the most visible feature, Embeddings are the most powerful for enterprise data. An embedding model turns text into a vector (a list of floating-point numbers). Text with similar meanings will have vectors that are mathematically close to each other (Cosine Similarity).

This is essential for Retrieval Augmented Generation (RAG), where you search your own database for relevant info before sending a prompt to OpenAI.
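
To make “mathematically close” concrete, here is a plain-Java cosine similarity over two embedding vectors (a utility sketch; real vector stores compute this for you, at scale):

// Cosine similarity of two equal-length vectors: 1.0 = same direction, 0 = unrelated.
static double cosineSimilarity(List<Double> a, List<Double> b) {
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.size(); i++) {
        dot += a.get(i) * b.get(i);
        normA += a.get(i) * a.get(i);
        normB += b.get(i) * b.get(i);
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}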

Using the EmbeddingModel

Spring AI provides the EmbeddingModel interface.

import java.util.List;
import java.util.Map;

import org.springframework.ai.embedding.EmbeddingModel;

@RestController
public class EmbeddingController {

    private final EmbeddingModel embeddingModel;

    public EmbeddingController(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    @GetMapping("/ai/embed")
    public Map<String, Object> embedText(@RequestParam String text) {
        // Defaults to text-embedding-3-small (usually)
        List<Double> vector = embeddingModel.embed(text);
        
        return Map.of(
            "text", text,
            "vector_dimension", vector.size(), // e.g., 1536 for OpenAI
            "vector_sample", vector.subList(0, 5) // Show first 5 dimensions
        );
    }
}

From Embeddings to Vector Stores

While creating embeddings is the first step, storing them is the second. Spring AI supports various Vector Stores. A typical flow looks like this:

  1. Reader: Read a PDF or other document (e.g., using TikaDocumentReader).
  2. Transformer: Split the text into chunks (TokenTextSplitter).
  3. Embedding: Use EmbeddingModel to vectorize chunks.
  4. Writer: Save vectors to VectorStore (e.g., PGVector).
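
A hedged sketch of those four steps, assuming the Tika document reader module is on the classpath and a VectorStore bean is configured (the method name is our own):

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.core.io.Resource;

public void ingest(Resource file) {
    List<Document> documents = new TikaDocumentReader(file).get();    // 1. Read
    List<Document> chunks = new TokenTextSplitter().apply(documents); // 2. Split
    vectorStore.add(chunks); // 3 + 4. Embed via the configured EmbeddingModel and persist
}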

Here is a snippet showing how you might query a VectorStore to find context for a GPT-4 prompt:

// Conceptual example of RAG
public String askWithDocs(String query) {
    // 1. Search for similar documents in your database
    List<Document> similarDocs = vectorStore.similaritySearch(query);

    // 2. Concatenate the content of these docs
    String context = similarDocs.stream()
        .map(Document::getContent)
        .collect(Collectors.joining("\n"));

    // 3. Stuff the context into the prompt
    String template = """
        Answer the question based ONLY on the context below:
        Context: {context}

        Question: {query}
        """;

    // 4. Create the prompt and call the ChatModel
    PromptTemplate promptTemplate = new PromptTemplate(template);
    promptTemplate.add("context", context);
    promptTemplate.add("query", query);

    return chatModel.call(promptTemplate.create())
        .getResult().getOutput().getContent();
}

Part 5: DALL-E Image Generation

Spring AI isn’t limited to text. The ImageModel interface interacts with DALL-E 2 or DALL-E 3.

import org.springframework.ai.image.ImageModel;
import org.springframework.ai.image.ImageOptionsBuilder;
import org.springframework.ai.image.ImagePrompt;
import org.springframework.ai.image.ImageResponse;

@RestController
public class ImageGenController {

    private final ImageModel imageModel;

    public ImageGenController(ImageModel imageModel) {
        this.imageModel = imageModel;
    }

    @GetMapping("/ai/image")
    public String generateImage(@RequestParam String prompt) {
        ImageResponse response = imageModel.call(
            new ImagePrompt(
                prompt,
                ImageOptionsBuilder.builder()
                    .withModel("dall-e-3")
                    .withHeight(1024)
                    .withWidth(1024)
                    .build()
            )
        );
        
        // Returns the URL of the generated image
        return response.getResult().getOutput().getUrl();
    }
}

Advanced Configuration & Best Practices

To make your application production-ready, consider the following aspects.

1. Token Usage and Cost Monitoring

OpenAI charges by the token. You can inspect token usage in the ChatResponse metadata.

ChatResponse response = chatModel.call(prompt);
Usage usage = response.getMetadata().getUsage();
logger.info("Prompt Tokens: {}, Gen Tokens: {}", 
    usage.getPromptTokens(), 
    usage.getGenerationTokens());

2. Error Handling and Retries

OpenAI APIs can experience transient failures or rate limiting (HTTP 429). The Spring AI starter already auto-configures a RetryTemplate for this (tunable via the spring.ai.retry.* properties), and you can layer Spring Retry on top for service-level control.

Wrap your AI calls in a service method annotated with @Retryable (this requires the spring-retry dependency and @EnableRetry on a configuration class). Spring AI surfaces recoverable errors as org.springframework.ai.retry.TransientAiException:

@Retryable(retryFor = TransientAiException.class, maxAttempts = 3, backoff = @Backoff(delay = 1000))
public String robustGeneration(String text) {
    return chatModel.call(text);
}

3. Function Calling (Tools)

GPT-4 has the ability to decide to call a function (tool) instead of generating text. Spring AI maps Java Functions to OpenAI Tools.

You can register a java.util.function.Function as a bean, give it a description, and pass it to the ChatModel. If the user asks “What is the weather in London?”, the model will pause generation, ask Spring to run the currentWeatherFunction, and then use that result to formulate the final answer. This bridges the gap between the static knowledge of the LLM and real-time data.
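
A minimal sketch of that flow on the M1-era API. The weather records, the stub return value, and the function name are all hypothetical; a real implementation would call an actual weather API:

import java.util.function.Function;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Description;

@Configuration
public class WeatherToolConfig {

    public record WeatherRequest(String city) {}
    public record WeatherResponse(double temperatureCelsius) {}

    @Bean
    @Description("Get the current weather for a city") // exposed to the model as the tool description
    public Function<WeatherRequest, WeatherResponse> currentWeatherFunction() {
        return request -> new WeatherResponse(18.0); // stub; call a real weather service here
    }
}

You would then enable the tool per request, for example by passing OpenAiChatOptions.builder().withFunction("currentWeatherFunction").build() into the Prompt, and call the ChatModel as usual.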

Conclusion

The integration of Spring AI with OpenAI represents a significant leap forward for Java developers. We no longer need to write manual HTTP requests or wrestle with raw JSON.

By using the abstractions provided by Spring AI—ChatModel, StreamingChatModel, and EmbeddingModel—we can build applications that are:

  1. Clean: Adhering to standard Spring patterns.
  2. Flexible: Capable of switching models or providers.
  3. Powerful: Leveraging Streaming, RAG, and Structured Outputs.

As we continue to build out the Spring AI column here at Spring DevPro, our next articles will focus on local LLMs using Ollama and advanced RAG patterns with Neo4j.

Ready to build? Ensure you have your OpenAI API key ready, pull the latest Spring AI milestone, and start streaming intelligence into your applications today.


Disclaimer: The Spring AI project is evolving rapidly. Code snippets are based on the 1.0.0-M1 milestone versions available at the time of writing. Always refer to the official Spring documentation for breaking changes.

