Document Processing Pipeline: PDF, Word, Excel to Embeddings

The Unstructured Data Challenge in RAG

In the rapidly evolving landscape of Generative AI, Retrieval-Augmented Generation (RAG) has become the de facto standard for bringing domain-specific knowledge to Large Language Models (LLMs). While tutorials often start with a simple text file, real-world enterprise data is messy. It lives in PDF contracts, Word policy documents, and financial Excel sheets.

For Java developers, Spring AI offers a unified abstraction layer to tackle this challenge. However, simply "reading" a file is not enough. You need a pipeline—a systematic process to load, clean, chunk, embed, and store data.

In this deep dive, we will focus on the Spring AI Document Loader capabilities (specifically leveraging Apache Tika) to build a production-ready ingestion engine. We will cover the journey from raw binary files to high-dimensional vector embeddings ready for semantic search.

1. Architecture: The Ingestion Pipeline

Before writing code, we must visualize the flow. A naive approach reads a whole PDF and sends it to OpenAI. This fails because of context window limits and poor retrieval accuracy.

A production pipeline consists of four distinct stages:

1. Ingestion (The Loader): Detecting file types and extracting raw text and metadata.
2. Transformation (The Splitter): Breaking text into semantically coherent chunks whose size is measured in tokens.
3. Embedding: Converting text chunks into vector representations (e.g., float arrays).
4. Persistence: Storing vectors in a Vector Database (PGVector, Milvus, Weaviate, etc.).

This article focuses heavily on the first two stages, ingestion and transformation, using the Spring AI document reader abstractions.
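
To make the flow concrete before we bring in Spring AI, here is a minimal JDK-only sketch of the four stages chained together. Everything in it is an illustrative assumption: the class name IngestionSketch, the word-count chunking, and the toy two-number "embedding" are stand-ins, not Spring AI APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-ins for the four stages; none of these are Spring AI classes.
final class IngestionSketch {

    // Stage 1 (Ingestion): a real loader parses binary files; here we only normalize whitespace.
    static String load(String rawText) {
        return rawText.trim().replaceAll("\\s+", " ");
    }

    // Stage 2 (Transformation): fixed-size chunks of at most maxWords words.
    static List<String> split(String text, int maxWords) {
        if (maxWords < 1) {
            throw new IllegalArgumentException("maxWords must be at least 1");
        }
        List<String> chunks = new ArrayList<>();
        if (text.isEmpty()) {
            return chunks;
        }
        List<String> words = Arrays.asList(text.split(" "));
        for (int i = 0; i < words.size(); i += maxWords) {
            chunks.add(String.join(" ", words.subList(i, Math.min(i + maxWords, words.size()))));
        }
        return chunks;
    }

    // Stage 3 (Embedding): toy two-dimensional vector {character count, word count}.
    static float[] embed(String chunk) {
        return new float[] {chunk.length(), chunk.split(" ").length};
    }

    // Stage 4 (Persistence): an ordered map stands in for the vector database.
    static Map<String, float[]> ingest(String rawText, int maxWords) {
        Map<String, float[]> store = new LinkedHashMap<>();
        for (String chunk : split(load(rawText), maxWords)) {
            store.put(chunk, embed(chunk));
        }
        return store;
    }

    public static void main(String[] args) {
        ingest("Tenant pays rent monthly. Late fees apply after five days.", 4)
                .forEach((chunk, vector) -> System.out.println(chunk + " -> " + Arrays.toString(vector)));
    }
}
```

The point of the sketch is the shape, not the implementations: each stage consumes the previous stage's output, which is why a weak loader or splitter degrades everything downstream.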

2. Project Setup and Dependencies

Spring AI abstracts the underlying complexity of file parsing. While there are specific readers for PDFs, the most robust approach for a multi-format pipeline is integrating Apache Tika. Tika detects and parses over a thousand different file formats.

Add the following dependencies to your pom.xml (assuming Spring Boot 3.2+ and Spring AI milestone releases):

<dependencies>
    <!-- OpenAI chat and embedding model support -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>

    <!-- The Universal Document Reader -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>
    
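    <!-- Vector Store (Example: PGVector). The in-memory SimpleVectorStore ships with Spring AI itself
         and is registered as a bean rather than through a starter. Artifact names vary between
         releases, so check the reference docs for your version. -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
    </dependency>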
</dependencies>

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0-SNAPSHOT</version> <!-- Check for latest version -->
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

Note: The spring-ai-tika-document-reader dependency is heavy because it pulls in parsers for everything from PDFs to obscure image formats. In a containerized production environment, you can reduce the container image size by excluding transitive dependencies you do not need, such as OCR-related parsers.

3. The Core Abstraction: `DocumentReader`

In Spring AI, the primary interface for ingestion is DocumentReader.

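// Simplified: a DocumentReader is a java.util.function.Supplier of documents.
public interface DocumentReader extends Supplier<List<Document>> {
    // List<Document> get() is inherited from Supplier
}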

The Document object is the "currency" of the framework. It contains:

Content: The extracted string text.
Metadata: A Map<String, Object> containing file names, page numbers, authors, etc.

The Unified Tika Loader Service

Let's build a service that accepts any Spring Resource (file, URL, classpath item) and converts it into documents.

package com.springdevpro.ingestion;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
public class UniversalDocumentLoader {

    public List<Document> loadResource(Resource resource) {
        // TikaDocumentReader automatically detects the file type
        // based on the resource stream signature.
        TikaDocumentReader reader = new TikaDocumentReader(resource);
        
        List<Document> documents = reader.get();
        
        // Post-processing: Add source tracking metadata
        documents.forEach(doc -> {
            doc.getMetadata().put("source_filename", resource.getFilename());
            doc.getMetadata().put("ingestion_timestamp", System.currentTimeMillis());
        });
        
        return documents;
    }
}

This single method handles PDF, Word, and Excel. However, "handling" and "handling well" are different things. Let's look at format-specific nuances.
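
To illustrate what "detecting the file type from the stream signature" means, the following JDK-only sketch checks a file's leading bytes against three well-known signatures. It is a deliberately tiny approximation of what a detector such as Tika does; the class name SignatureSniffer and the returned labels are assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a tiny subset of the signature checks a real detector performs.
final class SignatureSniffer {

    static String sniff(byte[] head) {
        // PDF files start with the ASCII marker "%PDF-".
        if (startsWith(head, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "pdf";
        }
        // .docx and .xlsx are ZIP containers, so both start with the bytes 'P' 'K' 03 04.
        if (startsWith(head, new byte[] {'P', 'K', 3, 4})) {
            return "zip-based (docx, xlsx, ...)";
        }
        // Legacy .doc and .xls share the OLE2 compound-file signature.
        if (startsWith(head, new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0})) {
            return "ole2 (doc, xls, ...)";
        }
        return "unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sniff("%PDF-1.7".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

Note that the signature alone cannot tell a .docx from an .xlsx, since both are ZIP archives. A full detector has to look inside the container, which is one reason to delegate this work to a library.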

4. Handling PDFs: Text vs. Layout

PDF is a layout-based format, not a text-based one. This means extracting text often results in headers and footers interrupting sentences.

Strategy 1: Tika (The Generalist)

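Using the UniversalDocumentLoader above works well for most text-heavy PDFs. Tika can also extract metadata such as Author or Creation-Date, but verify what your Spring AI version actually copies into the Document metadata map. TikaDocumentReader may record only the source name, in which case you need to add such fields yourself.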

Strategy 2: `PagePdfDocumentReader` (The Specialist)

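If you need strict page-level separation (e.g., ensuring one chunk never spans two pages), Spring AI offers the PagePdfDocumentReader (based on PDFBox), available in the spring-ai-pdf-document-reader module.

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.core.io.Resource;

import java.util.List;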

public List<Document> loadPdfWithPageAwareness(Resource pdfResource) {
    PdfDocumentReaderConfig config = PdfDocumentReaderConfig.builder()
            .withPageTopMargin(0)
            .withPageBottomMargin(0)
            .withPageExtractedTextFormatter(
                // Optimize text for LLM reading (removing excessive newlines)
                ExtractedTextFormatter.builder()
                    .withNumberOfBottomTextLinesToDelete(1) // Remove footer
                    .withNumberOfTopTextLinesToDelete(1)    // Remove header
                    .build()
            )
            .build();

    PagePdfDocumentReader reader = new PagePdfDocumentReader(pdfResource, config);
    return reader.get();
}

Why this matters for RAG: If your PDF has a footer saying "Confidential - Page 1" on every page, and you embed that, a search for "Confidential" will retrieve every page in your database, ruining your search relevance. The PdfDocumentReaderConfig is crucial for cleaning noise.
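
To see the effect of dropping header and footer lines, here is a JDK-only sketch that removes a fixed number of lines from the top and bottom of a page's text. It is a conceptual stand-in for the line-deletion options shown above, not Spring AI's implementation; the class and method names are assumptions for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Conceptual stand-in for header/footer removal on a single page of extracted text.
final class PageNoiseFilter {

    static String dropEdgeLines(String pageText, int top, int bottom) {
        // Ignore blank lines so that padding around the header or footer does not count.
        List<String> lines = Arrays.stream(pageText.split("\\R"))
                .filter(line -> !line.trim().isEmpty())
                .collect(Collectors.toList());
        if (top + bottom >= lines.size()) {
            return ""; // nothing left once the edges are removed
        }
        return String.join("\n", lines.subList(top, lines.size() - bottom));
    }

    public static void main(String[] args) {
        String page = "ACME Corp\nThe tenant shall pay rent monthly.\nLate fees apply.\nConfidential - Page 1";
        System.out.println(dropEdgeLines(page, 1, 1));
    }
}
```

A fixed line count only works when every page has the same header and footer layout. For documents with varying layouts, inspect a few extracted pages before settling on the numbers.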

5. Processing Microsoft Word (.docx)

Word documents are generally easier than PDFs because they preserve document structure (paragraphs, headings) in the underlying XML.

When using the TikaDocumentReader for .docx files, the extracted text usually flows naturally. The common trouble spots are images and tables embedded in the document.

Handling Tables in Word

By default, Tika flattens tables into text, often tab-separated or newline-separated. This destroys the semantic relationship between a column header and a cell value.

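To handle this advanced scenario, you often need to pre-process the DOCX file before passing it to Spring AI, or have Tika emit HTML instead of plain text. The output format is determined by the SAX ContentHandler used during parsing (for example Tika's ToXMLContentHandler) rather than by tika-config.xml. TikaDocumentReader offers a constructor overload that accepts a custom ContentHandler, so check the signature in your version.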

Pro Tip: If your LLM supports it, keeping HTML tags (like <table>, <tr>) in the embedded text can actually help the model understand the structure, although it consumes more tokens.

6. The Excel (.xlsx) Conundrum

Excel is the most difficult format for RAG. An Excel file is often a database, not a document. Reading an Excel file row-by-row into a single text blob results in a mess of numbers without context.

The Problem with Default Loading

If you feed a 5,000-row spreadsheet to TikaDocumentReader, you get one massive Document object containing a stream of values. Splitting this text at arbitrary points will cut rows in half and separate values from their column headers.

The Solution: CSV Conversion & Row-based Loading

For Excel, it is highly recommended to convert the target sheet to CSV first, or use a custom parser that treats one row = one document.

Here is a custom implementation pattern for Spring AI to handle structured data like Excel/CSV effectively:

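import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public List<Document> loadStructuredExcel(Resource excelResource) throws IOException {
    // Use Apache POI (the library Tika relies on for Office formats) to iterate rows
    // and convert each row to a semantically rich string,
    // e.g., "Product: Widget. Description: Steel bracket. Price: 9.99."
    List<Document> docs = new ArrayList<>();
    DataFormatter formatter = new DataFormatter(); // renders cells as displayed text; null cells become ""

    try (InputStream in = excelResource.getInputStream();
         Workbook workbook = WorkbookFactory.create(in)) {
        Sheet sheet = workbook.getSheetAt(0); // assumption: the data lives in the first sheet

        for (Row row : sheet) {
            if (row.getRowNum() == 0) {
                continue; // assumption: row 0 holds the column headers
            }
            StringBuilder content = new StringBuilder();
            content.append("Product: ").append(formatter.formatCellValue(row.getCell(0))).append(". ");
            content.append("Description: ").append(formatter.formatCellValue(row.getCell(1))).append(". ");
            content.append("Price: ").append(formatter.formatCellValue(row.getCell(2))).append(".");

            // Create a document per row (or group of rows)
            Document doc = new Document(content.toString());
            doc.getMetadata().put("row_index", row.getRowNum());
            docs.add(doc);
        }
    }
    return docs;
}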

By serializing the row into a key-value sentence, you ensure the Embedding Model understands the relationship between the data points.
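
If you take the CSV route instead, the same idea can be driven by the header row rather than hard-coded labels. The sketch below is JDK-only and assumes simple CSV without quoted commas (use a CSV library for real data); the class name RowSerializer is made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Turns each CSV data row into a "Header: value." sentence using the first line as headers.
final class RowSerializer {

    // Assumption: no quoted fields or embedded commas.
    static List<String> toSentences(List<String> csvLines) {
        List<String> sentences = new ArrayList<>();
        if (csvLines.isEmpty()) {
            return sentences;
        }
        String[] headers = csvLines.get(0).split(",", -1);
        for (String line : csvLines.subList(1, csvLines.size())) {
            String[] cells = line.split(",", -1);
            List<String> parts = new ArrayList<>();
            for (int i = 0; i < headers.length && i < cells.length; i++) {
                String value = cells[i].trim();
                if (!value.isEmpty()) {
                    parts.add(headers[i].trim() + ": " + value); // skip empty cells entirely
                }
            }
            if (!parts.isEmpty()) {
                sentences.add(String.join(". ", parts) + ".");
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        toSentences(Arrays.asList("Product,Description,Price", "Widget,Steel bracket,9.99"))
                .forEach(System.out::println);
    }
}
```

Each resulting sentence can then become the content of one Document, with the row number stored as metadata, exactly as in the POI version.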

7. The Transformation Stage: Token Splitting

Once DocumentReader returns a list of Document objects, they are often too large to embed and retrieve well. Embedding models impose a token limit (OpenAI's text-embedding-ada-002 accepts up to 8,191 tokens), and in practice you want much smaller chunks for retrieval accuracy.

We use TokenTextSplitter.

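import org.springframework.ai.document.Document;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.stereotype.Service;

import java.util.List;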

@Service
public class DocumentSplitterService {

    public List<Document> splitDocuments(List<Document> sourceDocs) {
        // Configure the splitter
        // defaultChunkSize: 800 tokens (approx 600 words)
        // minChunkSizeChars: 350
        // minChunkLengthToEmbed: 5
        // maxNumChunks: 10000
        // keepSeparator: true
        
        TokenTextSplitter splitter = new TokenTextSplitter(); 
        
        // This splits the sourceDocs into smaller list of Documents
        // inheriting metadata from the parent.
        return splitter.apply(sourceDocs);
    }
}

The "Sliding Window" Context

Standard splitting cuts text abruptly, so keeping some overlap between neighboring chunks helps preserve context. At the time of writing, TokenTextSplitter does not expose an explicit overlap setting (such as overlap: 50); instead it tries to end each chunk at sentence-ending punctuation. Explicit overlap configuration is a feature to watch for in future TextSplitter API updates.

If your splitter cuts a sentence in half:

Chunk A: "The total revenue for Q3 was..."
Chunk B: "...5 million dollars."

Neither chunk answers the question "What was Q3 revenue?". Ensure your splitter implementation respects sentence boundaries.
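
If you need overlap today, one option is a small custom splitter. The JDK-only sketch below packs whole sentences into chunks and repeats the last few sentences of each chunk at the start of the next. It counts sentences rather than tokens and uses a naive punctuation-based sentence split, both simplifying assumptions; OverlapChunker is a hypothetical name, not part of Spring AI.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sentence-aware sliding window: chunks never cut a sentence, and neighbors share `overlap` sentences.
final class OverlapChunker {

    static List<String> chunk(String text, int sentencesPerChunk, int overlap) {
        if (sentencesPerChunk < 1 || overlap < 0 || overlap >= sentencesPerChunk) {
            throw new IllegalArgumentException("need 0 <= overlap < sentencesPerChunk");
        }
        List<String> chunks = new ArrayList<>();
        if (text.trim().isEmpty()) {
            return chunks;
        }
        // Naive split: break after '.', '!' or '?' when followed by whitespace.
        List<String> sentences = Arrays.asList(text.trim().split("(?<=[.!?])\\s+"));
        int step = sentencesPerChunk - overlap;
        for (int start = 0; start < sentences.size(); start += step) {
            int end = Math.min(start + sentencesPerChunk, sentences.size());
            chunks.add(String.join(" ", sentences.subList(start, end)));
            if (end == sentences.size()) {
                break; // the final window already reached the last sentence
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        chunk("Revenue grew in Q3. The total was 5 million dollars. Costs stayed flat. Margins improved.", 2, 1)
                .forEach(System.out::println);
    }
}
```

With overlap, the sentence carrying the answer tends to appear together with its neighbors in at least one chunk, at the cost of embedding some text twice.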

8. Putting It Together: The Full Service

Here is the full code for a service that takes a generic file, detects the type, processes it, and loads it into a Vector Store.

package com.springdevpro.rag;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;

@Service
public class IngestionPipeline {

    private static final Logger log = LoggerFactory.getLogger(IngestionPipeline.class);
    
    private final VectorStore vectorStore;

    public IngestionPipeline(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void processFile(Resource file) {
        log.info("Starting ingestion for: {}", file.getFilename());

        // 1. Load (Extract)
        TikaDocumentReader reader = new TikaDocumentReader(file);
        List<Document> rawDocs = reader.get();
        log.info("Loaded {} raw document(s)", rawDocs.size());

        // 2. Transform (Split)
        TokenTextSplitter splitter = new TokenTextSplitter();
        List<Document> chunkedDocs = splitter.apply(rawDocs);
        log.info("Split into {} chunks", chunkedDocs.size());

        // 3. Metadata Enrichment (Optional but Recommended)
        // contentLength() throws a checked IOException, so resolve it once, outside the lambda.
        long fileSize;
        try {
            fileSize = file.contentLength();
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read size of " + file.getFilename(), e);
        }
        chunkedDocs.forEach(doc -> {
            doc.getMetadata().put("file_size", fileSize);
            doc.getMetadata().put("pipeline_version", "v1.0");
        });

        // 4. Embed and Persist
        // The vectorStore.add() method automatically calls the EmbeddingModel
        // to convert the text to vectors before saving.
        vectorStore.add(chunkedDocs);
        
        log.info("Ingestion complete.");
    }
}

9. Best Practices and Performance Optimization

1. Handling Large Files & Memory

The TikaDocumentReader holds the extracted text of the whole file in memory. For massive PDFs (e.g., 500MB), this can cause an OutOfMemoryError.

Solution: Do not load the entire file into a single List<Document>. Create a custom streaming reader that yields pages one by one, splits them, embeds them, and clears them from memory before reading the next page.
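
A minimal JDK-only sketch of that idea follows: pages are pulled lazily from an iterator and handed to a sink in fixed-size batches, so only one batch is buffered at a time. The names BatchedIngestion and ingestInBatches are hypothetical. In a real pipeline the iterator would wrap a page-by-page parser and the sink would split, embed, and store each batch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Streams pages to a sink in batches instead of materializing the whole document.
final class BatchedIngestion {

    // Returns the number of batches delivered to the sink.
    static int ingestInBatches(Iterator<String> pages, int batchSize, Consumer<List<String>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        int batches = 0;
        List<String> batch = new ArrayList<>(batchSize);
        while (pages.hasNext()) {
            batch.add(pages.next());
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // hand over a copy, then release our buffer
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // flush the final partial batch
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        Iterator<String> pages = Arrays.asList("page 1", "page 2", "page 3").iterator();
        int count = ingestInBatches(pages, 2, batch -> System.out.println("storing " + batch));
        System.out.println(count + " batches");
    }
}
```

The batch size trades memory for throughput: larger batches mean fewer embedding calls but more text held at once.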

2. Idempotency (Preventing Duplicates)

If you run the pipeline twice on the same file, you will duplicate embeddings in your vector store.

Solution: Compute a hash (MD5/SHA256) of the file content before processing. Query the Vector Store (or a side SQL table) to see if this hash has already been processed. Add the hash to the Document metadata.
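
The hashing and lookup steps might look like the following JDK-only sketch. The in-memory set stands in for the side SQL table, and the names IngestionLedger and markIfNew are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Tracks which file contents have already been ingested, keyed by SHA-256 hash.
final class IngestionLedger {

    private final Set<String> seenHashes = new HashSet<>(); // stand-in for a persistent table

    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }

    // Returns true the first time this content is seen, false for repeats.
    boolean markIfNew(byte[] content) {
        return seenHashes.add(sha256Hex(content));
    }

    public static void main(String[] args) {
        IngestionLedger ledger = new IngestionLedger();
        byte[] file = "contract v1".getBytes(StandardCharsets.UTF_8);
        System.out.println(ledger.markIfNew(file)); // first run: ingest
        System.out.println(ledger.markIfNew(file)); // second run: skip
    }
}
```

Prefer SHA-256 over MD5 here. The cost difference is negligible next to an embedding call, and it avoids MD5's known collision weaknesses.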

3. Asynchronous Processing

Document processing is I/O bound (reading disk) and Network bound (calling Embedding API).

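Solution: Use Spring's @Async support (enabled with @EnableAsync on a configuration class) or Java 21 virtual threads. Note that @Async only takes effect when the method is invoked from another bean, because Spring applies it through a proxy: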

@Async
public CompletableFuture<Void> processFileAsync(Resource file) {
    processFile(file); // reuse logic above
    return CompletableFuture.completedFuture(null);
}

4. Retry Mechanisms

Calls to OpenAI or local embedding models can fail.

Solution: Wrap the vectorStore.add() call with Spring Retry (add the spring-retry dependency and enable it with @EnableRetry).

@Retryable(maxAttempts = 3, backoff = @Backoff(delay = 2000))
public void robustSave(List<Document> docs) {
    vectorStore.add(docs);
}
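
For readers who want to see what such a retry policy does under the hood, here is a JDK-only sketch of a fixed-delay retry loop. It is not how Spring Retry is implemented; RetrySketch and withRetry are hypothetical names used only for illustration.

```java
import java.util.concurrent.Callable;

// Fixed-delay retry: re-runs the action until it succeeds or maxAttempts is exhausted.
final class RetrySketch {

    static <T> T withRetry(Callable<T> action, int maxAttempts, long delayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // wait before the next attempt
                }
            }
        }
        throw last; // all attempts failed: surface the final error
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("transient failure");
            }
            return "stored";
        }, 3, 100);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In production you would usually retry only on errors known to be transient (timeouts, rate limits) and let validation errors fail immediately.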

10. Conclusion

Building a document processing pipeline with the Spring AI document reader ecosystem transforms your application from a simple text wrapper into a knowledge-aware enterprise solution.

While Apache Tika provides the heavy lifting for file compatibility (PDF, Word, Excel), the true engineering effort lies in the Splitting Strategy. How you chunk your data defines how well your AI retrieves it.

As Spring AI evolves towards version 1.0, we expect even more specialized readers. For now, mastering the pipeline of Reader -> Splitter -> VectorStore is the critical skill for Java developers entering the AI domain.

In the next article, we will connect this pipeline to a PostgreSQL (PGVector) database and implement the Retrieval component of RAG.

Key Takeaways:

Use TikaDocumentReader for broad format support.
Use PagePdfDocumentReader when page boundaries matter.
Flatten Excel data into descriptive strings before embedding.
Always split text into token-limited chunks to satisfy Embedding Model constraints.
Metadata is as important as content—use it to filter your context later.

Happy Coding! If you found this guide helpful on your Spring AI journey, subscribe to the feed for upcoming deep dives into Vector Stores and Function Calling.
