Advanced RAG: Hybrid Search, Re-ranking & Query Rewriting with Spring AI

The Plateau of "Naive" RAG

In the rapidly evolving landscape of Generative AI, Retrieval-Augmented Generation (RAG) has become the de facto standard for grounding Large Language Models (LLMs) on private, proprietary data. For Java developers, Spring AI has made the barrier to entry incredibly low. With just a few lines of configuration, you can ingest documents into a vector database (like PgVector or Weaviate) and chat with your data.

However, moving from a proof-of-concept to a production environment reveals the limitations of what the industry calls "Naive RAG."

Naive RAG relies solely on semantic similarity search (dense vector retrieval). While powerful, it suffers from critical blind spots:

The Keyword Gap: Vector embeddings capture meaning but often struggle with exact matches (e.g., specific part numbers like "XJ-900" or acronyms).
Ambiguous Queries: Users rarely prompt perfectly. A query like "pricing" is too vague for a vector store to find specific price sheets without context.
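The "Lost in the Middle" Phenomenon: LLMs tend to use information at the start and end of a long context more reliably than information buried in the middle. Stuffing the context window with marginally relevant documents therefore degrades answer quality while increasing latency and cost.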

To build a robust, enterprise-grade AI application, we must graduate to advanced RAG techniques.

In this article, we will re-architect a standard Spring AI pipeline to include three specific layers of sophistication:

Query Rewriting: optimizing the user's intent before it touches the database.
Hybrid Search: combining vector search with keyword (BM25) search.
Re-ranking: using a cross-encoder to strictly filter results before generation.

1. The Input Layer: Query Rewriting and Transformation

The first point of failure in RAG is the user. Users type short, ambiguous, or typo-laden queries. If you feed garbage into your retriever, you will get garbage out (GIGO).

Query rewriting involves using an LLM to "hallucinate" a better query or break a complex query into sub-queries before we attempt retrieval.

Technique A: HyDE (Hypothetical Document Embeddings)

HyDE is a clever technique where, instead of searching for the user's question, we search for a hypothetical answer.

The Theory: If a user asks "How do I reset the transaction manager?", a vector search might look for documents containing those words. However, the relevant document might not repeat the question; it simply says "To rollback the platform transaction..."

With HyDE, we ask the LLM: "Write a theoretical paragraph answering the question: 'How do I reset the transaction manager?'" The LLM generates a fake document. We embed that fake document and search for real documents that look like it.

Spring AI Implementation

We can implement this either as a ChatClient advisor or as a service-layer transformation. The service-layer version is shown here.

@Service
public class QueryTransformationService {

    private final ChatClient chatClient;

    public QueryTransformationService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public String generateHypotheticalDocument(String userQuery) {
        String prompt = """
            Please write a short, hypothetical passage that answers the question below. 
            Do not include the question itself, only the answer. 
            Include technical keywords relevant to the domain.
            
            Question: %s
            """.formatted(userQuery);

        return chatClient.prompt()
                .user(prompt)
                .call()
                .content();
    }
}

When integrating this, your RAG flow changes:

Receive userQuery.
Call generateHypotheticalDocument(userQuery), which returns the hypothetical passage as text.
Embed that passage with the embedding model to obtain hydeVector.
Perform vector search using hydeVector.
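The flow above can be sketched without committing to a specific Spring AI version by passing the LLM call, the embedding call, and the vector lookup in as functions. HydeRetriever and its three function parameters are illustrative names, not Spring AI types; in a real application they would delegate to generateHypotheticalDocument, your EmbeddingModel, and your vector store.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

/** Minimal HyDE flow: generate a hypothetical answer, embed it, search with that vector. */
final class HydeRetriever {

    // Hypothetical collaborators, injected as plain functions for illustration
    private final Function<String, String> hypotheticalDocGenerator;       // question -> fake answer text
    private final Function<String, float[]> embedder;                      // text -> embedding
    private final BiFunction<float[], Integer, List<String>> vectorSearch; // (vector, topK) -> document ids

    HydeRetriever(Function<String, String> hypotheticalDocGenerator,
                  Function<String, float[]> embedder,
                  BiFunction<float[], Integer, List<String>> vectorSearch) {
        this.hypotheticalDocGenerator = hypotheticalDocGenerator;
        this.embedder = embedder;
        this.vectorSearch = vectorSearch;
    }

    List<String> retrieve(String userQuery, int topK) {
        // 1. Ask the LLM for a passage that would answer the question
        String hypotheticalDoc = hypotheticalDocGenerator.apply(userQuery);
        // 2. Embed the passage, not the original question
        float[] hydeVector = embedder.apply(hypotheticalDoc);
        // 3. Look for real documents that resemble the hypothetical one
        return vectorSearch.apply(hydeVector, topK);
    }
}
```

Note that the original question is still what you pass to the LLM at generation time; the hypothetical passage is only used to steer retrieval.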

Technique B: Multi-Query Expansion

Sometimes a user's question spans several distinct topics, for example: "Compare the pricing of Enterprise Plan vs. Pro Plan."

A single vector search might land somewhere in the middle of the vector space, retrieving neither plan's documents effectively. Multi-query expansion breaks this into:

"What is the pricing of the Enterprise Plan?"
"What is the pricing of the Pro Plan?"

We execute both searches and deduplicate the results.

public List<String> expandQuery(String originalQuery) {
    String prompt = """
        You are an AI language model assistant. Your task is to generate 3 different versions 
        of the given user question to retrieve relevant documents from a vector database. 
        By generating multiple perspectives on the user question, your goal is to help 
        the user overcome some of the limitations of distance-based similarity search. 
        Provide these alternative questions separated by newlines.
        
        Original question: %s
        """.formatted(originalQuery);

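    String response = chatClient.prompt().user(prompt).call().content();
    // Drop blank lines and stray whitespace so empty strings are never sent to the retriever
    return response.lines()
            .map(String::strip)
            .filter(line -> !line.isEmpty())
            .toList();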
}
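Running the expanded queries and deduplicating the results can be sketched as follows. Hit and the search function are hypothetical stand-ins for your document type and retriever; the sketch assumes each document carries a stable id that is the same across searches.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class MultiQueryRetriever {

    /** Minimal stand-in for a retrieved document. */
    record Hit(String id, String content) {}

    /** Runs every query variant and keeps the first occurrence of each document id, in retrieval order. */
    static List<Hit> retrieveAll(List<String> queries, Function<String, List<Hit>> search) {
        Map<String, Hit> unique = new LinkedHashMap<>();
        for (String query : queries) {
            for (Hit hit : search.apply(query)) {
                unique.putIfAbsent(hit.id(), hit); // later duplicates are ignored
            }
        }
        return new ArrayList<>(unique.values());
    }
}
```

Deduplicating by id rather than by content avoids comparing large strings and still works when the same chunk is returned by several query variants.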

Trade-off Analysis: Query rewriting adds latency. You are making an extra LLM call before retrieval. However, for complex domains (legal, medical, extensive documentation), the accuracy boost is usually worth the 1-2 second delay.

2. The Retrieval Layer: Hybrid Search

This is the cornerstone of advanced RAG techniques.

The Problem with Pure Vectors

Vector databases use Dense Retrieval (Cosine Similarity). They are amazing at understanding that "canine" and "dog" are related. However, they are often terrible at:

Exact Keyword Matching: If a user searches for Error code 503, a vector search might return documents about "Server errors" in general, but miss the specific document explicitly mentioning "503" because "503" doesn't have a strong semantic embedding on its own.
Domain-Specific Jargon: acronyms or project code names (e.g., "Project Hades") often get lost in vector space.

The Solution: Sparse + Dense (Hybrid)

Hybrid search combines:

Vector Search (Dense): Captures semantic meaning.
Keyword Search (Sparse/BM25): Captures exact token matches.

Merging Results: Reciprocal Rank Fusion (RRF)

How do you combine a list of results from a Vector search (ranked by similarity score 0.0-1.0) and a Keyword search (ranked by BM25 score, which is unbounded)? You cannot simply add the scores.

We use Reciprocal Rank Fusion (RRF). RRF ignores the absolute scores and relies on the rank.

Formula: $$ RRF\_score(d) = \sum_{r \in R} \frac{1}{k + rank(d, r)} $$

Where:

$d$ is the document.
$R$ is the set of rankers (Vector and Keyword).
$k$ is a constant (usually 60).
$rank(d, r)$ is the position of the document in that specific result list.

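Because only ranks matter, a document that appears near the top of both lists outranks one that tops a single list. With k = 60, a document ranked first in both lists scores 2/61 ≈ 0.033, twice the 1/61 ≈ 0.016 of a document ranked first in only one.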
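To make the formula concrete, here is a minimal in-memory sketch of RRF over lists of document ids. The class and method names are illustrative; each input list is assumed to be ordered best-first, and ranks are 1-based.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ReciprocalRankFusion {

    /** Sums 1 / (k + rank) for every list in which a document id appears. */
    static Map<String, Double> scores(List<List<String>> rankings, int k) {
        Map<String, Double> fused = new HashMap<>();
        for (List<String> ranking : rankings) {
            for (int i = 0; i < ranking.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                fused.merge(ranking.get(i), 1.0 / (k + rank), Double::sum);
            }
        }
        return fused;
    }

    /** Returns document ids sorted by descending fused score; ties are broken by id. */
    static List<String> fuse(List<List<String>> rankings, int k) {
        Map<String, Double> fused = scores(rankings, k);
        List<String> ids = new ArrayList<>(fused.keySet());
        ids.sort(Comparator.comparingDouble((String id) -> fused.get(id))
                .reversed()
                .thenComparing(Comparator.naturalOrder()));
        return ids;
    }
}
```

For a vector ranking of A, B, C and a keyword ranking of B, D, A with k = 60, B scores 1/62 + 1/61 and A scores 1/61 + 1/63, so the fused order is B, A, D, C. B wins because it ranks well in both lists even though it tops only one of them.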

Spring AI Implementation Strategy

As of early 2025, Spring AI's VectorStore abstraction is primarily dense-focused, but underlying implementations (like Elasticsearch, Azure AI Search, and Weaviate) support Hybrid search natively.

If you are using PostgreSQL (pgvector), you can implement Hybrid search manually. You need pgvector for embeddings and tsvector for full-text search.

Here is a conceptual implementation of a Hybrid Retriever service in Spring Boot:

@Service
public class HybridRetrievalService {

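    private final JdbcTemplate jdbcTemplate;
    private final EmbeddingModel embeddingModel;

    // Constructor injection; without it the final fields are never initialized
    public HybridRetrievalService(JdbcTemplate jdbcTemplate, EmbeddingModel embeddingModel) {
        this.jdbcTemplate = jdbcTemplate;
        this.embeddingModel = embeddingModel;
    }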

    // SQL to perform Hybrid Search using CTEs (Common Table Expressions) and RRF
    private static final String HYBRID_SQL = """
        WITH semantic_search AS (
            SELECT id, content, 
                   RANK() OVER (ORDER BY embedding <=> ?::vector) as rank_ix
            FROM documents
            ORDER BY embedding <=> ?::vector
            LIMIT 50
        ),
        keyword_search AS (
            SELECT id, content,
                   RANK() OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), plainto_tsquery('english', ?)) DESC) as rank_ix
            FROM documents
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', ?)
            ORDER BY rank_ix
            LIMIT 50
        )
        SELECT 
            COALESCE(s.id, k.id) as id,
            COALESCE(s.content, k.content) as content,
            (COALESCE(1.0 / (60 + s.rank_ix), 0.0) + COALESCE(1.0 / (60 + k.rank_ix), 0.0)) as rrf_score
        FROM semantic_search s
        FULL OUTER JOIN keyword_search k ON s.id = k.id
        ORDER BY rrf_score DESC
        LIMIT 20;
    """;

    public List<Document> search(String query) {
        // 1. Generate Embedding
        float[] embedding = embeddingModel.embed(query);
        
        // 2. Execute Hybrid SQL
        // Note: We pass the embedding twice (for order and select) and query twice (for rank and where)
        return jdbcTemplate.query(HYBRID_SQL, (rs, rowNum) -> {
            return new Document(rs.getString("id"), rs.getString("content"), Map.of("score", rs.getDouble("rrf_score")));
        }, 
        embedding, embedding, query, query);
    }
}

Note: This raw SQL approach is necessary until high-level abstractions for Hybrid Search are fully standardized across all Spring AI Vector Store implementations.

Why this matters: This SQL query performs the fusion at the database level. It retrieves the top 50 semantic matches and the top 50 keyword matches, then calculates the RRF score to return the "Best of Both Worlds" top 20.

3. The Precision Layer: Re-ranking

We have rewritten the query and performed a hybrid search. We now have up to 20 candidate documents (the final LIMIT in the hybrid query above). Should we feed all of them into the LLM?

No.

Cost: More tokens = higher cost.
Distraction: LLMs can get confused by irrelevant context ("Lost in the Middle").
Context Limits: Even with 128k context windows, precision drops as noise increases.

We need a Re-ranker.

Bi-Encoders vs. Cross-Encoders

Bi-Encoder (Standard Vector Search): Calculates the embedding for the document and the query separately. Fast, but loses nuance because the query and document never "interact" until the math comparison.
Cross-Encoder (Re-ranking): Takes the query and the document together as a single input pair and outputs a relevance score for that pair.

Cross-encoders are much more accurate because they can "pay attention" to how specific words in the query relate to specific words in the document. They are computationally expensive, so we cannot run them on the whole database.

The Pipeline:

Retrieval: Use Bi-Encoder (Vector) + BM25 to get Top 50 candidates. (Fast)
Re-ranking: Use Cross-Encoder to score those 50 and pick the Top 5. (Slow but precise)
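The second stage of this pipeline can be sketched independently of any particular re-ranking service. In the sketch below, TwoStageRanker, Scored, and pairScorer are illustrative names; pairScorer stands in for a cross-encoder call that sees the query and the document together.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class TwoStageRanker {

    /** A candidate paired with the score assigned by the second-stage scorer. */
    record Scored(String text, double score) {}

    /** Scores each (query, candidate) pair and keeps the topN highest-scoring candidates. */
    static List<Scored> rerank(String query,
                               List<String> candidates,
                               ToDoubleBiFunction<String, String> pairScorer,
                               int topN) {
        return candidates.stream()
                .map(doc -> new Scored(doc, pairScorer.applyAsDouble(query, doc)))
                .sorted(Comparator.comparingDouble(Scored::score).reversed())
                .limit(topN)
                .toList();
    }
}
```

Because the scorer runs once per candidate, its cost grows linearly with the size of the first-stage result set, which is why the broad retrieval step has to stay cheap and bounded.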

Integration with Spring AI

Spring AI does not yet bundle a local cross-encoder (which is usually Python/Torch based), but we can interact with re-ranking APIs (like Cohere Rerank or Hugging Face Inference Endpoints).

Here is how to integrate Cohere's Re-rank API using Spring's RestClient alongside Spring AI.

@Service
public class RerankingService {

    @Value("${cohere.api.key}")
    private String cohereApiKey;
    
    private final RestClient restClient;

    public RerankingService() {
        this.restClient = RestClient.builder()
                .baseUrl("https://api.cohere.ai/v1/rerank")
                .build();
    }

    public List<Document> rerank(String query, List<Document> initialDocuments) {
        if (initialDocuments.isEmpty()) return List.of();

        List<String> docsText = initialDocuments.stream()
                .map(Document::getContent)
                .toList();

        var requestBody = Map.of(
            "model", "rerank-english-v3.0",
            "query", query,
            "documents", docsText,
            "top_n", 5
        );

        RerankResponse response = restClient.post()
                .header("Authorization", "Bearer " + cohereApiKey)
                .contentType(MediaType.APPLICATION_JSON)
                .body(requestBody)
                .retrieve()
                .body(RerankResponse.class);

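        // Guard against an empty or malformed response body
        if (response == null || response.results() == null) return List.of();

        // Map back to Document objects based on the index into the submitted list
        return response.results().stream()
                .map(res -> initialDocuments.get(res.index()))
                .toList();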
    }
}

// Java Records for JSON mapping
record RerankResponse(List<RerankResult> results) {}
record RerankResult(int index, double relevance_score) {}

This step is the "magic" that makes RAG feel intelligent. The re-ranker filters out the "technically similar but actually irrelevant" results.

4. Architectural Blueprint: Putting It All Together

We have the components. Now, let's look at the AdvancedRagService that orchestrates this flow. This follows the Retriever-Reader pattern.

@Service
@RequiredArgsConstructor
public class AdvancedRagService {

    private final QueryTransformationService queryRewriter;
    private final HybridRetrievalService hybridRetriever; // Custom service from Part 2
    private final RerankingService reranker;             // Custom service from Part 3
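    // Assumes a ChatClient bean is defined (for example, built from ChatClient.Builder in a @Configuration class)
    private final ChatClient chatClient;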

    public String answer(String rawUserQuery) {
        
        // Step 1: Query Transformation (Optional but recommended)
        // We might simply clean the query or generate a HyDE vector here.
        // For this example, let's assume we use the raw query for hybrid search 
        // to keep latency manageable, or use a simple keyword extractor.
        
        // Step 2: Hybrid Retrieval (The "Broad Net")
        // Fetches up to 20 fused candidates using Vector + BM25 RRF (see the final LIMIT in HYBRID_SQL)
        List<Document> broadCandidates = hybridRetriever.search(rawUserQuery);

        // Step 3: Re-ranking (The "Filter")
        // Narrows down to top 5 highly relevant documents
        List<Document> topContext = reranker.rerank(rawUserQuery, broadCandidates);

        // Step 4: Context Construction
        String contextString = topContext.stream()
                .map(Document::getContent)
                .collect(Collectors.joining("\n\n"));

        // Step 5: Generation
        PromptTemplate promptTemplate = new PromptTemplate("""
                You are an expert technical assistant.
                Answer the question based ONLY on the context provided below.
                If the answer is not in the context, say "I don't know."
                
                Context:
                {context}
                
                Question:
                {query}
                """);

        return chatClient.prompt(promptTemplate.create(Map.of(
                "context", contextString,
                "query", rawUserQuery
        ))).call().content();
    }
}

The System Advisor Pattern

Spring AI provides an Advisor API, with built-in advisors such as QuestionAnswerAdvisor and RetrievalAugmentationAdvisor, which allows you to chain these components in a more declarative way. However, for fully custom logic like Hybrid RRF + External Re-ranking, the imperative service approach above often provides better control and debuggability in the early stages of adoption.

5. Performance, Caching, and Trade-offs

Implementing advanced RAG techniques introduces complexity. As an architect, you must weigh the benefits against the costs.

Latency Analysis

Naive RAG: Query -> Vector DB (50ms) -> LLM (2s). Total: ~2.1s
Advanced RAG:
- Query Rewrite (LLM Call): +1.5s
- Hybrid Search (Vector + BM25 + DB sort): +100ms
- Re-ranking (External API call): +400ms
- Final Generation: 2s
- Total: ~4.0s

The latency roughly doubles. Is it worth it? For a chatbot answering questions about company HR policies, yes. Getting the wrong answer about "Maternity Leave" is worse than waiting 4 seconds. For a real-time code completion tool, no.

Cost Analysis

Embedding Costs: Relatively low.
Re-ranking Costs: Services like Cohere charge per request. Re-ranking 50 documents per query scales linearly with traffic.
LLM Costs: Query rewriting requires an extra LLM input/output token charge.

Optimization Strategy: Semantic Caching

To mitigate the latency and cost, implement Semantic Caching. Do not just cache based on String.equals(query). Cache based on Vector.similarity(query).

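If a user asks "How do I reset the password?" and 5 minutes later another asks "Password reset procedure," the two query vectors will be very close. If their similarity clears your cache threshold, you can serve the cached answer from the previous RAG run instantly. Tune the threshold against real traffic: too low and users get answers to questions they did not ask, too high and the cache rarely hits.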

Spring AI works well with Redis or Hazelcast for this layer.

// Sketch of a semantic cache lookup; vectorCache and CachedResponse are hypothetical application types
float[] queryVector = embeddingModel.embed(query);
CachedResponse hit = vectorCache.findNearest(queryVector, 0.98); // 0.98 = minimum similarity for a hit
if (hit != null) return hit.answer;
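To show what such a lookup does internally, here is a minimal in-memory sketch using cosine similarity and a linear scan. InMemorySemanticCache is an illustrative class, not a Spring AI or Redis API, and it assumes all vectors have the same dimension. A production cache would typically delegate the nearest-neighbour lookup to a vector-capable store and add expiry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class InMemorySemanticCache {

    private record Entry(float[] vector, String answer) {}

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold; // minimum cosine similarity that counts as a hit

    InMemorySemanticCache(double threshold) {
        this.threshold = threshold;
    }

    void put(float[] queryVector, String answer) {
        entries.add(new Entry(queryVector.clone(), answer));
    }

    /** Returns the answer of the most similar cached query, if it clears the threshold. */
    Optional<String> findNearest(float[] queryVector) {
        Entry best = null;
        double bestSimilarity = threshold;
        for (Entry entry : entries) {
            double similarity = cosine(queryVector, entry.vector());
            if (similarity >= bestSimilarity) {
                best = entry;
                bestSimilarity = similarity;
            }
        }
        return Optional.ofNullable(best).map(Entry::answer);
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

On a miss, run the full RAG pipeline and call put with the query vector and the generated answer so that the next similar question is served from the cache.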

6. Advanced Metadata Filtering

Another layer of optimization is Metadata Filtering. Before you even perform Hybrid Search, apply strict filters.

If your application supports multi-tenancy, or if documents have timestamps, you must filter first. With Spring AI's FilterExpressionBuilder:

FilterExpressionBuilder b = new FilterExpressionBuilder();
Expression filter = b.and(
    b.eq("tenantId", "tenant-123"),
    b.gte("creationDate", "2024-01-01")
).build();

vectorStore.similaritySearch(
    SearchRequest.query(userQuery)
        .withFilterExpression(filter)
);

A Caveat for Hybrid Search: When doing the custom SQL Hybrid search (shown in Part 2), ensure you inject these WHERE clauses into both the Vector sub-query and the Keyword sub-query. Filtering only one of the two lets documents from other tenants enter the candidate set through the unfiltered side, which is a data leak and also distorts the fused ranking.
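As a sketch of that advice, the hybrid query from Part 2 can carry the tenant predicate in both CTEs. The tenant_id column name and the parameter order are assumptions for illustration; adapt them to your schema. The keyword CTE is also ordered before its LIMIT so that the cut keeps the best-ranked rows.

```java
final class TenantScopedHybridSql {

    // Assumption: the documents table has a tenant_id column (illustrative name).
    // Parameter order: embedding, tenantId, embedding, query, query, tenantId.
    static final String SQL = """
        WITH semantic_search AS (
            SELECT id, content,
                   RANK() OVER (ORDER BY embedding <=> ?::vector) AS rank_ix
            FROM documents
            WHERE tenant_id = ?
            ORDER BY embedding <=> ?::vector
            LIMIT 50
        ),
        keyword_search AS (
            SELECT id, content,
                   RANK() OVER (ORDER BY ts_rank_cd(to_tsvector('english', content),
                                                    plainto_tsquery('english', ?)) DESC) AS rank_ix
            FROM documents
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', ?)
              AND tenant_id = ?
            ORDER BY rank_ix
            LIMIT 50
        )
        SELECT COALESCE(s.id, k.id) AS id,
               COALESCE(s.content, k.content) AS content,
               COALESCE(1.0 / (60 + s.rank_ix), 0.0)
                 + COALESCE(1.0 / (60 + k.rank_ix), 0.0) AS rrf_score
        FROM semantic_search s
        FULL OUTER JOIN keyword_search k ON s.id = k.id
        ORDER BY rrf_score DESC
        LIMIT 20
        """;
}
```

Always bind the tenant id as a parameter, as here, rather than concatenating it into the SQL string.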

7. Conclusion: The Future of RAG in Spring

Naive RAG was the exciting "Hello World" of 2023. In 2025, Advanced RAG is the baseline for production.

By implementing the Rewrite-Retrieve-Rerank pattern, we solve the most common complaints about AI assistants: hallucinations and missed context.

Rewriting bridges the gap between how users talk and how data is written.
Hybrid Search ensures we catch both specific keywords and general concepts.
Re-ranking acts as the quality gatekeeper, ensuring the LLM only sees the most pertinent information.

As Spring AI continues to mature, we expect to see HybridVectorStore and RerankingAdvisor interfaces become standard, reducing the boilerplate code needed. For now, the combination of Spring Boot's robustness with these architectural patterns provides a solid foundation for building world-class Generative AI applications.

Next Steps for Developers:

Set up PostgreSQL locally with the pgvector extension and a tsvector full-text index.
Get an API key for a Re-ranking service (Cohere or similar).
Refactor your ChatClient calls to include an abstraction layer for retrieval.

The gap between a demo and a product is reliability. Advanced RAG techniques are how you bridge that gap.

About the Author: The Spring DevPro Team specializes in cloud-native Java architectures. We explore the intersection of Traditional Enterprise Java and the new wave of AI engineering.
