Spring AI Testing Guide: Unit Tests, Integration Tests & Mocking

Jeff Taakey
21+ Year CTO & Multi-Cloud Architect.

Building applications with Large Language Models (LLMs) introduces a new paradigm of uncertainty in software engineering. Unlike traditional deterministic logic, LLMs are probabilistic. They hallucinate, they vary their phrasing, and they are often expensive and slow to call.

For the enterprise Java developer, this poses a critical question: How do we apply robust engineering practices to this chaotic ecosystem?

This article serves as the definitive guide to Spring AI testing. We will move beyond “vibe checking” (manually running the app to see if it looks okay) and establish a rigorous testing pyramid involving Unit Tests, Integration Tests with Testcontainers, and Semantic Evaluation.

The Challenges of Spring AI Testing

Before diving into code, we must acknowledge why Spring AI testing differs from testing a standard Spring Boot REST API:

  1. Non-Determinism: The same prompt can yield different results. Assertions.assertEquals("Hello", response) is virtually useless.
  2. Latency: A real call to GPT-4 can take 5-10 seconds. Running a suite of 100 tests against a live API is unacceptable for CI/CD feedback loops.
  3. Cost: Hitting OpenAI or Anthropic APIs during every CI build burns money.
  4. Rate Limits: External APIs will throttle your test suite if you parallelize execution.

To solve this, we adopt a three-tiered strategy:

  1. Unit Tests: Mock the ChatModel to test your prompt engineering and output parsers.
  2. Local Integration Tests: Use Testcontainers with Ollama to run a real (but small) LLM locally.
  3. RAG Integration Tests: Spin up vector databases (like PgVector) in containers to verify retrieval accuracy.

Setting Up the Test Environment

First, ensure your pom.xml includes the necessary testing dependencies. We rely on JUnit 5, Mockito, and Spring Boot Test.

<dependencies>
    <!-- Spring AI Starter -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>
    
    <!-- Standard Testing -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>

    <!-- Testcontainers for Integration Testing -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-testcontainers</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.testcontainers</groupId>
        <artifactId>junit-jupiter</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.testcontainers</groupId>
        <artifactId>ollama</artifactId>
        <scope>test</scope>
    </dependency>

    <!-- Ollama starter + Spring AI Testcontainers support, used by the
         @ServiceConnection-based integration tests below (artifact names
         follow the pre-1.0 Spring AI naming used above) -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-ollama-spring-boot-starter</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-spring-boot-testcontainers</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>

Level 1: Unit Testing with Mocking

The fastest feedback loop comes from Unit Tests. In Spring AI, you typically interact with the ChatClient (a fluent API) or the ChatModel (the low-level driver).

Testing the fluent ChatClient can be verbose because you have to mock the entire chain (prompt(), user(), call(), content()). A better approach for unit testing is often to test the service that uses the client and mock the underlying ChatModel the client is built on.

Scenario: Testing a Sentiment Analysis Service

Imagine a service that classifies text as POSITIVE, NEGATIVE, or NEUTRAL.

@Service
public class SentimentService {

    private final ChatClient chatClient;

    public SentimentService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String analyze(String text) {
        return chatClient.prompt()
                .user(u -> u.text("Classify sentiment: {text}").param("text", text))
                .call()
                .content();
    }
}

The Mocking Strategy

Because the ChatClient’s fluent chain is hard to mock directly, Mockito deep stubs are the usual workaround, but they produce brittle tests. Mocking the ChatModel underneath is more reliable: the real ChatClient still runs your prompt-building code, while the model’s response stays canned.

A cleaner architectural pattern for testability is isolating your prompt logic entirely, as the brief sketch below shows. After that, let’s look at how to mock the model using standard Spring Boot testing features.
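A minimal sketch of that isolation, using a hypothetical SentimentPrompts helper: prompt construction becomes a pure function you can assert on without mocking anything.

// Hypothetical helper: keeps the prompt template out of the service,
// so it can be unit tested as plain string assembly.
final class SentimentPrompts {

    static String classify(String text) {
        return "Classify sentiment: " + text;
    }
}

@Test
void promptTemplateIncludesUserText() {
    String prompt = SentimentPrompts.classify("I love using Spring Boot!");

    assertThat(prompt)
        .startsWith("Classify sentiment:")
        .contains("I love using Spring Boot!");
}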

Here, we will mock the ChatModel bean which is injected into the ChatClient.Builder.

@SpringBootTest
class SentimentServiceUnitTests {

    @MockBean
    private ChatModel chatModel;

    @Autowired
    private SentimentService sentimentService;

    @Test
    void testSentimentAnalysisReturnsPositive() {
        // 1. Prepare the Mock Response
        String expectedResponse = "POSITIVE";
        
        ChatResponse mockResponse = new ChatResponse(
            List.of(new Generation(new AssistantMessage(expectedResponse)))
        );

        // 2. Define Behavior
        // Note: ArgumentMatchers can be specific to verify prompt engineering
        given(chatModel.call(any(Prompt.class))).willReturn(mockResponse);

        // 3. Execute
        String result = sentimentService.analyze("I love using Spring Boot!");

        // 4. Assert
        assertThat(result).isEqualTo("POSITIVE");
        
        // 5. Verify the Prompt was constructed correctly
        ArgumentCaptor<Prompt> promptCaptor = ArgumentCaptor.forClass(Prompt.class);
        verify(chatModel).call(promptCaptor.capture());
        
        String actualPrompt = promptCaptor.getValue().getContents();
        assertThat(actualPrompt).contains("Classify sentiment");
        assertThat(actualPrompt).contains("I love using Spring Boot!");
    }
}

Key Takeaway: This test runs in milliseconds. It does not hit OpenAI. It verifies that your service correctly constructs the prompt and correctly returns the raw content from the model. This is the bedrock of Spring AI testing.


Level 2: Output Parser Testing

One of the most fragile parts of AI apps is Structured Output (turning JSON text from an LLM into a Java Record). You should unit test your BeanOutputConverter configurations thoroughly.

@Test
void testOutputParsing() {
    // The raw string usually returned by an LLM
    String rawJson = """
        {
            "status": "active",
            "confidence": 0.95
        }
    """;

    // Assumes a matching record: record AnalysisResult(String status, double confidence) {}
    BeanOutputConverter<AnalysisResult> converter = new BeanOutputConverter<>(AnalysisResult.class);
    
    // Act
    AnalysisResult result = converter.convert(rawJson);
    
    // Assert
    assertThat(result.status()).isEqualTo("active");
    assertThat(result.confidence()).isEqualTo(0.95);
}

Do not waste expensive API calls to test if Jackson can deserialize JSON. Test this locally.


Level 3: Integration Testing with Testcontainers and Ollama

Mocking is great, but it doesn’t prove the LLM actually understands your prompt. For that, we need a real model.

However, relying on external clouds (OpenAI) for integration tests is flaky. Enter Testcontainers and Ollama. Ollama allows you to run models like llama3, mistral, or gemma locally.

By wrapping Ollama in a Docker container, we can spin up an ephemeral LLM environment for our test suite.

Step 1: The Test Configuration

We use @ServiceConnection to automatically wire the container configuration to Spring properties.

@TestConfiguration(proxyBeanMethods = false)
public class TestContainersConfig {

    @Bean
    @ServiceConnection
    public OllamaContainer ollama() throws IOException, InterruptedException {
        OllamaContainer ollama = new OllamaContainer("ollama/ollama:latest");
        ollama.start();
        // OllamaContainer has no built-in pull option, so we pull the model
        // inside the running container. Use a small model for speed, and keep
        // the name in sync with spring.ai.ollama.chat.options.model in your
        // test profile.
        ollama.execInContainer("ollama", "pull", "gemma:2b");
        return ollama;
    }
}

Note: Using a small model like gemma:2b or tinyllama is recommended for CI/CD pipelines to reduce memory usage and startup time.

Step 2: The Integration Test

Now we run a test that actually performs inference.

@SpringBootTest
@Import(TestContainersConfig.class)
@ActiveProfiles("test")
class SentimentIntegrationTest {

    @Autowired
    private SentimentService sentimentService;

    @Test
    void testActualInference() {
        // This actually calls the local LLM running in Docker
        String result = sentimentService.analyze("This pizza is absolutely terrible and cold.");

        // Semantic Assertion
        // We can't expect an exact string match, but we expect keywords.
        assertThat(result.toUpperCase())
            .satisfiesAnyOf(
                s -> assertThat(s).contains("NEGATIVE"),
                s -> assertThat(s).contains("BAD"),
                s -> assertThat(s).contains("POOR")
            );
    }
}

Why this matters for Spring AI Testing

This setup creates a hermetic environment. Your tests run offline. No API keys are leaked, no credit cards are charged, and the environment is identical on your laptop and the Jenkins/GitHub Actions runner.


Level 4: Testing RAG (Retrieval Augmented Generation)

Retrieval Augmented Generation is the most common enterprise pattern. Testing it requires two components:

  1. Vector Database (e.g., PostgreSQL with pgvector).
  2. Embedding Model.

We can use Testcontainers for the database. For embeddings, we can use the TransformersEmbeddingModel (running locally in Java/ONNX) or Ollama.
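A test-scoped configuration can expose the local model as a bean. A minimal sketch, assuming the spring-ai-transformers artifact is on the test classpath (the default all-MiniLM-L6-v2 model is downloaded on first use):

@TestConfiguration(proxyBeanMethods = false)
class EmbeddingTestConfig {

    // In-process embeddings via ONNX Runtime: no API keys, no per-call cost.
    @Bean
    EmbeddingModel embeddingModel() {
        return new TransformersEmbeddingModel();
    }
}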

Setting up the Vector Store Test

@Testcontainers
@SpringBootTest
class VectorStoreIT {

    @Container
    @ServiceConnection
    static PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>(
            // Testcontainers requires non-official images to declare
            // compatibility with the official "postgres" image explicitly.
            DockerImageName.parse("pgvector/pgvector:pg16")
                    .asCompatibleSubstituteFor("postgres"));

    @Autowired
    VectorStore vectorStore;

    @Test
    void testDocumentIngestionAndRetrieval() {
        // 1. Create Document
        Document doc = new Document(
            "Spring AI simplifies building intelligent apps.",
            Map.of("version", "1.0")
        );

        // 2. Ingest
        vectorStore.add(List.of(doc));

        // 3. Retrieve
        // We query for something semantically similar, not exact wording
        List<Document> results = vectorStore.similaritySearch(
            SearchRequest.query("How to build AI apps in Java?").withTopK(1)
        );

        // 4. Assert
        assertThat(results).isNotEmpty();
        assertThat(results.get(0).getContent()).contains("Spring AI");
    }
}

This ensures your embedding logic and database connectivity are functioning correctly before you ever introduce the Chat Model.


Level 5: The “Semantic Assertion” Problem

In traditional unit testing, we verify expected.equals(actual). In Spring AI testing, the output is probabilistic.

How do we assert that a summary of a 50-page PDF is “correct”?

1. Keyword Presence (Weak)

As seen above, you can check for specific keywords (contains("NEGATIVE")). This is brittle but fast.

2. Similarity Scoring (Better)

If you have an Embedding Model loaded, you can compute the cosine similarity between the expected output (the “Ground Truth”) and the actual output.

// embed() returns raw float[] vectors, so compute cosine similarity by hand
float[] a = embeddingModel.embed(actual);
float[] b = embeddingModel.embed(expectedGroundTruth);
double dot = 0, na = 0, nb = 0;
for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
double similarity = dot / (Math.sqrt(na) * Math.sqrt(nb));

assertThat(similarity).isGreaterThan(0.85);

3. LLM-as-a-Judge (Advanced)

The most robust way to test complex reasoning is to use an LLM to grade the output of another LLM.

You can create an EvaluationService in your test code:

String prompt = """
    You are a grader. 
    Question: %s
    Reference Answer: %s
    Actual Answer: %s
    
    Does the Actual Answer convey the same meaning as the Reference Answer? 
    Reply with only 'YES' or 'NO'.
""".formatted(question, expected, actual);

String verdict = chatClient.prompt(prompt).call().content();
assertThat(verdict).contains("YES");

While this adds latency, it is currently the industry standard for validating complex RAG applications.
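Depending on your Spring AI version, the framework also ships this pattern behind an Evaluator abstraction. A sketch using RelevancyEvaluator follows; verify the exact EvaluationRequest signature against your version, and note that question, retrievedDocuments, and actualAnswer are placeholder variables:

// Spring AI’s built-in LLM-as-a-judge for RAG relevancy.
var evaluator = new RelevancyEvaluator(ChatClient.builder(chatModel));

EvaluationRequest request = new EvaluationRequest(
        question,            // what the user asked
        retrievedDocuments,  // the List<Document> given to the model as context
        actualAnswer);       // what the model produced

EvaluationResponse response = evaluator.evaluate(request);
assertThat(response.isPass()).isTrue();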


Best Practices for Spring AI Testing in CI/CD

When integrating these tests into your pipeline (Jenkins, GitLab CI, GitHub Actions), follow these rules:

1. Tagging and Filtering

Separate your fast unit tests from slow integration tests. Use JUnit 5 tags:

@Tag("unit")
class FastTests {}

@Tag("ai-integration")
class OllamaTests {}

Configure Maven/Gradle to run @Tag("unit") on every commit, and @Tag("ai-integration") only on Pull Requests or nightly builds.
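With Maven, for example, the Surefire plugin can filter on these tags (a minimal sketch; Gradle’s useJUnitPlatform offers equivalent includeTags/excludeTags options):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
        <!-- Default build: fast unit tests only -->
        <groups>unit</groups>
        <excludedGroups>ai-integration</excludedGroups>
    </configuration>
</plugin>

A separate CI profile (or -Dgroups=ai-integration on nightly runs) can then flip the filter for the slow suite.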

2. Managing Resource Limits

Running gemma:2b or llama3 inside a container requires RAM. Ensure your CI runner has at least 8GB (preferably 16GB) of RAM. If using GitHub Actions, you may need larger runners.

3. Record and Replay (VCR)

To get the speed of mocks with the realism of integration tests, consider “VCR” tools (like Spring Cloud Contract WireMock).

  1. Run the test once against a live LLM.
  2. Record the HTTP response.
  3. Save it as a JSON stub.
  4. Future tests replay the stub.

This is excellent for regression testing, ensuring that prompt changes don’t break existing functionality, without incurring costs.
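A minimal replay sketch using Spring Cloud Contract WireMock. It assumes your test profile points spring.ai.openai.base-url at the stub server and sets a dummy spring.ai.openai.api-key; the stub body is a trimmed, illustrative completion payload, not a real recording:

@SpringBootTest(properties =
        "spring.ai.openai.base-url=http://localhost:${wiremock.server.port}")
@AutoConfigureWireMock(port = 0)
class RecordedResponseTests {

    @Autowired
    SentimentService sentimentService;

    @Test
    void replaysRecordedCompletion() {
        // In a real setup, load this stub from a JSON file recorded
        // against the live API.
        stubFor(post(urlPathEqualTo("/v1/chat/completions"))
            .willReturn(okJson("""
                {"choices": [{"index": 0,
                              "message": {"role": "assistant", "content": "POSITIVE"},
                              "finish_reason": "stop"}]}
                """)));

        assertThat(sentimentService.analyze("Great product!")).contains("POSITIVE");
    }
}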


Conclusion

Testing AI applications in Spring is no longer the “Wild West.” By leveraging the Spring ecosystem, we can apply disciplined engineering to non-deterministic models.

To summarize your Spring AI Testing strategy:

  1. Mock ChatModel interactions for pure logic tests.
  2. Verify JSON parsing logic in isolation.
  3. Use Testcontainers with Ollama for offline, zero-cost integration tests.
  4. Validate vector retrieval using Postgres containers.
  5. Apply Semantic Assertions (Similarity or LLM-as-a-Judge) for complex outputs.

By implementing these layers, you ensure that your Spring AI application is robust, cost-effective, and ready for production, transforming your AI features from a cool demo into enterprise-grade software.


