The Fragile Nature of Deterministic Code and Non-Deterministic AI #
In the traditional world of Spring Boot microservices, error handling is a well-trodden path. We deal with database timeouts, 404s, and NullPointerExceptions using predictable try-catch blocks and @ControllerAdvice.
However, Spring AI introduces a paradigm shift. When integrating Large Language Models (LLMs) like OpenAI, Claude, or Gemini, you are introducing a non-deterministic, highly volatile dependency into your deterministic Java application. The challenges are distinct:
- Rate Limiting (HTTP 429): LLM providers enforce strict token and request limits.
- Context Window Overflows: Inputs often exceed the model’s capacity.
- Hallucinated Output: The model returns a 200 OK, but the JSON body is malformed or hallucinates fields.
- High Latency: A request might take 30 seconds, leading to client-side timeouts.
Building a production-ready application requires more than just calling chatModel.call(). This article dives deep into Spring AI error handling, exploring strategies to make your AI-infused applications resilient, robust, and cost-efficient.
Understanding the Exception Hierarchy #
Before implementing a cure, we must diagnose the disease. Spring AI abstracts various provider errors (OpenAI, Azure, Bedrock) into a unified exception hierarchy. Understanding this is crucial for deciding when to retry.
Spring AI generally wraps underlying driver exceptions into:
- TransientAiException: Errors that are temporary.
  - Examples: Network timeouts, HTTP 500/503 from the provider, and specifically HTTP 429 (Too Many Requests).
  - Action: Safe to Retry.
- NonTransientAiException: Errors that will not resolve with a simple retry.
  - Examples: Invalid API Key (HTTP 401), Context Window Exceeded (Bad Request), Content Policy Violation.
  - Action: Do Not Retry. Log and fail fast.
When configuring your strategies, distinguishing between these two is the difference between a self-healing system and one that burns through your API credits in an infinite loop.
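To make the classification concrete, here is a minimal, self-contained sketch of a retry gate. The two exception classes are local stand-ins mirroring Spring AI's hierarchy (in a real project they come from the Spring AI retry package), and `isRetryable` is a hypothetical helper:

```java
// Minimal stand-ins for Spring AI's exception hierarchy, so this sketch is
// self-contained. In a real project, use the Spring AI classes instead.
class TransientAiException extends RuntimeException {
    TransientAiException(String msg) { super(msg); }
}

class NonTransientAiException extends RuntimeException {
    NonTransientAiException(String msg) { super(msg); }
}

public class RetryClassification {

    // Retry only when the failure is classified as transient.
    static boolean isRetryable(Throwable t) {
        return t instanceof TransientAiException;
    }

    public static void main(String[] args) {
        System.out.println(isRetryable(new TransientAiException("HTTP 429: rate limited")));   // true
        System.out.println(isRetryable(new NonTransientAiException("HTTP 401: bad API key"))); // false
    }
}
```

Every retry decision in the rest of this article reduces to this single boolean: transient means try again, non-transient means fail fast.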
Native Retry Strategies with ChatClient #
As of the latest Spring AI milestones, the fluent ChatClient API is the preferred way to interact with models. It comes with built-in support for “Advisors,” which are essentially interceptors that can modify requests, responses, and handle errors.
The Default Retry Advisor #
Spring AI leverages the RetryTemplate concept familiar to Spring developers. You can configure a standard retry mechanism directly when building your client.
```java
@Configuration
public class AiConfig {

    @Bean
    public ChatClient chatClient(ChatClient.Builder builder) {
        return builder
                .defaultAdvisors(
                        // Enable retry specifically for 429 and 5xx errors
                        new RetryAdvisor(retryTemplate())
                )
                .build();
    }

    @Bean
    public RetryTemplate retryTemplate() {
        RetryTemplate template = new RetryTemplate();

        // 1. Define Backoff Policy (Exponential is best for Rate Limits)
        ExponentialBackOffPolicy backOffPolicy = new ExponentialBackOffPolicy();
        backOffPolicy.setInitialInterval(1000); // 1 second
        backOffPolicy.setMultiplier(2.0);       // Double the wait time each retry
        backOffPolicy.setMaxInterval(10000);    // Max wait 10 seconds
        template.setBackOffPolicy(backOffPolicy);

        // 2. Define Exception Classification: retry only transient errors,
        //    traversing causes so wrapped exceptions are classified correctly
        SimpleRetryPolicy retryPolicy = new SimpleRetryPolicy(
                3, Map.of(TransientAiException.class, true), true);
        template.setRetryPolicy(retryPolicy);

        return template;
    }
}
```
Why Exponential Backoff? #
In standard microservices, a fixed delay (e.g., retry every 2 seconds) is often sufficient. However, for Spring AI error handling, exponential backoff is mandatory.
When an LLM provider returns an HTTP 429 (Rate Limit Exceeded), immediate retries usually exacerbate the problem, causing the provider to block you longer. Exponential backoff (1s, 2s, 4s, 8s) allows the token bucket to replenish, significantly increasing the success rate of the subsequent attempt.
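The schedule this produces can be computed by hand. A small self-contained sketch, using the same parameters as the ExponentialBackOffPolicy configured earlier (initial interval 1000 ms, multiplier 2.0, cap 10000 ms):

```java
public class BackoffSchedule {

    // Computes the waits an exponential backoff policy would produce:
    // start at `initial` ms, multiply by `multiplier` each retry, cap at `max` ms.
    static long[] delays(int attempts, long initial, double multiplier, long max) {
        long[] result = new long[attempts];
        double delay = initial;
        for (int i = 0; i < attempts; i++) {
            result[i] = (long) Math.min(delay, max);
            delay *= multiplier;
        }
        return result;
    }

    public static void main(String[] args) {
        // Waits before retries 1..5: 1s, 2s, 4s, 8s, then capped at 10s
        System.out.println(java.util.Arrays.toString(delays(5, 1000, 2.0, 10000)));
    }
}
```

Note how the cap matters: without setMaxInterval, a long outage would push individual waits into minutes.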
Advanced Resilience with Resilience4j #
While Spring’s RetryTemplate is useful, enterprise applications often require the robust features of Resilience4j. This library allows for Circuit Breakers, Bulkheads (concurrency limiting), and Rate Limiters.
This is particularly useful if you are aggregating multiple AI calls.
Dependencies #
Ensure you have the Resilience4j circuit breaker and AOP starters in your pom.xml:

```xml
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-circuitbreaker-resilience4j</artifactId>
</dependency>
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-aop</artifactId>
</dependency>
```
The Circuit Breaker Pattern #
Imagine your AI Provider (e.g., OpenAI) is suffering a major outage. If you have 1000 concurrent users, your application threads will block waiting for timeouts, eventually crashing your own server.
A Circuit Breaker detects the failure rate. If 50% of calls fail, it “opens” the circuit, failing subsequent calls immediately without hitting the provider, saving your system resources.
```yaml
resilience4j:
  circuitbreaker:
    instances:
      aiService:
        registerHealthIndicator: true
        slidingWindowSize: 10
        minimumNumberOfCalls: 5
        permittedNumberOfCallsInHalfOpenState: 3
        automaticTransitionFromOpenToHalfOpenEnabled: true
        waitDurationInOpenState: 5s
        failureRateThreshold: 50
        eventConsumerBufferSize: 10
```
Implementing the Circuit Breaker #
Wrap your AI service interaction:
```java
@Service
public class RobustAiService {

    private static final Logger log = LoggerFactory.getLogger(RobustAiService.class);

    private final ChatModel chatModel;

    public RobustAiService(ChatModel chatModel) {
        this.chatModel = chatModel;
    }

    @CircuitBreaker(name = "aiService", fallbackMethod = "fallbackResponse")
    @Retry(name = "aiService")
    public String generateContent(String prompt) {
        return chatModel.call(prompt);
    }

    // Fallback method must match the original signature, plus a Throwable parameter
    public String fallbackResponse(String prompt, Throwable t) {
        log.error("AI Service failed: {}", t.getMessage());
        return "Our AI agents are currently overwhelmed. Please try again later.";
    }
}
```
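The @Retry(name = "aiService") annotation also expects a matching retry instance in your configuration. A minimal sketch, using Resilience4j's Spring Boot property names (the values and the exception class path are illustrative; verify the fully qualified name against your Spring AI version):

```yaml
resilience4j:
  retry:
    instances:
      aiService:
        maxAttempts: 3
        waitDuration: 1s
        retryExceptions:
          - org.springframework.ai.retry.TransientAiException
```

Listing only the transient exception keeps the annotation-driven retry aligned with the classification rule from earlier: non-transient failures skip straight to the fallback.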
Handling Structured Output Failures #
One of the most frustrating errors in Spring AI isn’t a network crash—it’s the LLM failing to follow instructions.
You ask for JSON. The LLM returns Markdown-wrapped JSON, or a sentence about the JSON. The BeanOutputConverter will throw a parsing exception. This is a functional error, not a network one.
The “Repair” Pattern #
We can implement a recursive retry strategy that feeds the error back to the LLM.
```java
public MyPojo getStructuredData(String userInput) {
    BeanOutputConverter<MyPojo> converter = new BeanOutputConverter<>(MyPojo.class);
    String fullPrompt = userInput + "\n" + converter.getFormat();

    String response = null; // declared outside try so the catch block can reference it
    try {
        response = chatModel.call(fullPrompt);
        return converter.convert(response);
    } catch (Exception e) {
        // First attempt failed. Let's ask the LLM to fix it.
        return repairOutput(userInput, response, e.getMessage(), converter);
    }
}

private MyPojo repairOutput(String originalInput, String badOutput, String errorMsg,
                            BeanOutputConverter<MyPojo> converter) {
    String repairPrompt = """
            The previous response failed to parse as valid JSON.
            Error: %s
            Original Input: %s
            Bad Output: %s
            Please correct the output to match the required JSON schema exactly. Do not add markdown.
            """.formatted(errorMsg, originalInput, badOutput);

    String fixedResponse = chatModel.call(repairPrompt);
    return converter.convert(fixedResponse); // If this fails, we throw to the client
}
```
This pattern significantly improves reliability for structured data extraction tasks, effectively acting as a “semantic retry.”
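The loop can also be generalized with a bounded attempt count so a stubborn model cannot recurse forever. A self-contained sketch of that control flow, with plain functions standing in for the real ChatModel call and converter (the fake model and parser below are purely illustrative):

```java
import java.util.function.UnaryOperator;

public class SemanticRetry {

    // Bounded "semantic retry": call the model, and on a parse failure feed the
    // error back as a repair prompt. `model` and `parse` stand in for the real
    // ChatModel call and BeanOutputConverter conversion.
    static String callWithRepair(UnaryOperator<String> model, UnaryOperator<String> parse,
                                 String prompt, int maxAttempts) {
        String response = model.apply(prompt);
        for (int attempt = 1; ; attempt++) {
            try {
                return parse.apply(response);
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts) throw e; // give up after maxAttempts
                response = model.apply("Fix this output: " + response + " Error: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        // Fake model: returns markdown-wrapped JSON first, "fixes" it when asked.
        UnaryOperator<String> model =
                p -> p.startsWith("Fix") ? "{\"ok\":true}" : "```json {\"ok\":true} ```";
        // Fake parser: rejects anything that is not bare JSON.
        UnaryOperator<String> parse = s -> {
            if (!s.startsWith("{")) throw new IllegalArgumentException("not valid JSON");
            return s;
        };
        System.out.println(callWithRepair(model, parse, "Give me JSON", 3)); // {"ok":true}
    }
}
```

Capping attempts matters here: each repair round is a full, billed model call, so two or three attempts is usually the sensible limit.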
The Multi-Model Fallback Strategy #
In a high-stakes environment, relying on a single provider (e.g., OpenAI) is a single point of failure. Spring AI’s abstraction makes it incredibly easy to swap models.
A robust error handling strategy involves Model Fallback: If GPT-4 is down, degrade gracefully to another provider such as Gemini or Claude, or a local Llama model.
```java
@Service
public class ResilienceService {

    private static final Logger log = LoggerFactory.getLogger(ResilienceService.class);

    private final OpenAiChatModel openAi;
    private final VertexAiGeminiChatModel gemini;

    public ResilienceService(OpenAiChatModel openAi, VertexAiGeminiChatModel gemini) {
        this.openAi = openAi;
        this.gemini = gemini;
    }

    public String robustChat(String message) {
        try {
            // Primary Attempt
            return openAi.call(message);
        } catch (TransientAiException | ResourceAccessException e) {
            log.warn("Primary model failed, switching to fallback: {}", e.getMessage());
            try {
                // Secondary Attempt
                return gemini.call(message);
            } catch (Exception ex) {
                log.error("All models exhausted.");
                throw new ResponseStatusException(HttpStatus.SERVICE_UNAVAILABLE, "AI Services Offline");
            }
        }
    }
}
```
This approach ensures business continuity. Note that prompt engineering differs slightly between models, so keep your prompts generic or use templates specific to each provider.
Cost Implications of Retries #
When implementing Spring AI error handling, developers often overlook the financial aspect.
- Token Consumption: If you send a 4,000-token prompt and it times out after the server processed it (read timeout), you might still be billed.
- Retry Storms: If you retry a 4,000-token prompt 3 times, you have effectively tripled the cost of that transaction.
Best Practice:
- Set a shorter ConnectTimeout but a generous ReadTimeout to allow the LLM to think.
- Do not retry on ContextWindowExceededException or ContentFilterException. These are non-transient and will just burn money.
- Use Circuit Breakers to stop retries entirely during confirmed outages.
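The retry-storm arithmetic is worth making explicit. A tiny sketch, using a purely illustrative price of $0.01 per 1,000 input tokens (real pricing varies by provider and model):

```java
public class RetryCost {

    // Rough cost of a (possibly retried) prompt:
    // tokens are billed on every attempt, so cost scales linearly with attempts.
    static double costUsd(int promptTokens, int attempts, double pricePer1kTokens) {
        return promptTokens / 1000.0 * attempts * pricePer1kTokens;
    }

    public static void main(String[] args) {
        double single  = costUsd(4000, 1, 0.01);
        double retried = costUsd(4000, 3, 0.01);
        System.out.printf("single=%.2f retried=%.2f%n", single, retried); // retried is 3x single
    }
}
```

This is why the classification step matters: retrying a context-window error three times triples the bill while guaranteeing the same failure.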
Monitoring and Observability #
You cannot fix what you cannot see. Spring AI integrates with Micrometer to provide observability.
Ensure you track these metrics in Grafana/Prometheus:
- spring.ai.chat.client.requests (Count)
- spring.ai.chat.client.errors (Tagged by Exception Type)
- resilience4j.circuitbreaker.state
By monitoring the ratio of 429 Too Many Requests to 200 OK, you can dynamically adjust your token usage or request quotas.
Summary #
Building production-grade applications with Spring AI requires a defensive mindset. The non-deterministic nature of LLMs means that failure is not an anomaly—it is an expectation.
To recap the strategy:
- Classify Errors: Distinguish between Transient (Retry) and Non-Transient (Abort).
- Use Exponential Backoff: Give providers time to recover from rate limits.
- Leverage Resilience4j: Protect your system resources with Circuit Breakers.
- Implement Semantic Retries: Fix broken JSON output by asking the model to correct itself.
- Plan for Fallbacks: Have a backup model ready for provider outages.
By implementing these Spring AI error handling patterns, you transform a fragile demo into a resilient, enterprise-ready platform capable of serving users reliably, regardless of the fluctuations in the AI ecosystem.